Managing costs is a challenge for organizations using public cloud services but also an opportunity to drive efficient consumption of IT. This research provides I&O technical professionals with a framework to manage costs of cloud-integrated IaaS and PaaS providers such as AWS and Microsoft Azure.
Focusing on efficient use of cloud services brings immediate and tangible financial benefits. Unfortunately, most organizations are unprepared to profit from this savings opportunity and they’re likely to overspend.
Cloud cost management is not just an operational concern. To be successful, it requires a tight collaboration among the disciplines of governance, architecture, operations, product management, finance and application development.
Correlating cloud costs to business KPIs allows organizations to manage spending with respect to its return on investment (ROI). It also enables organizations to assess the business impact of cost growth and optimization. Cost reduction must never come at the expense of fully supporting the business goals.
Major cloud providers such as Amazon Web Services (AWS) and Microsoft Azure offer native tools to manage costs, albeit with basic functionality. Third-party tools continue to advance their cost management functionality with a multicloud approach.
Technical professionals in charge of IT operations and cloud management should:
Make cloud consumers accountable for what they spend. Provide them with the resources to create forecasts, monitor costs and pursue optimization opportunities.
Build cost-tracking foundations using both provider-native hierarchies and tags. Use them to organize resources around principles such as applications, departments or cost centers.
Drive cost optimization by monitoring utilization and capacity metrics. Schedule and rightsize allocation-based services. Leverage programmatic discounts. Modernize applications to make use of provider managed services when these are more cost-effective.
Deploy the cloud provider’s native tools to manage spending. Augment them with third-party tools if multicloud consistency, provider independence or a more integrated approach is needed.
The adoption of cloud computing introduces a number of new challenges, and managing cloud spending is proving to be one of the most difficult. When using public cloud IaaS and PaaS, organizations are billed continuously as consumption occurs, rather than once upfront, as happens when they procure data center capacity. In cloud computing, organizations are confronted with the difficulty of creating accurate cost estimates. They are often hit by bills that they can’t explain and struggle to identify the items that are responsible for spending. As a result, financial management is often overlooked until spend is out of control.
However, cloud computing also allows unprecedented visibility into IT costs that organizations can use to drive more efficient consumption of IT. Traditional enterprise data centers are equipped with finite, preprocured, capital expenditure (capex)-oriented capacity but a more efficient use of that capacity does not automatically translate into cost savings. The cloud computing model reverses this paradigm.
Furthermore, the following factors increase the complexity of managing costs of public cloud:
Complex and multifaceted pricing: Cloud providers use billing models and pricing structures with thousands of options and combinations. It can take significant time to understand all these combinations and learn how to select the best pricing option for each use case.
Extreme granularity of cloud bills: Issued bills can easily reach thousands of line items, even when consumption is low. This granularity complicates cost attribution, which is necessary to enable chargeback.
Ease of cloud service provisioning: Easy access to point-and-click web consoles and APIs in the absence of capacity constraints can lead to “resource sprawl” and, consequently, unexpected charges.
Constant change in cloud offerings: Every year, cloud vendors announce dozens of new services, features, instance types and pricing reductions and sometimes even new pricing models. Organizations struggle to keep up with this pace and understand how each announcement affects their financials.
Excess of alternative architectures: The same application can be built using many different architectures, services and components that can result in very different costs. Organizations struggle to calculate and identify the most cost-effective alternative to deliver their requirements.
Lack of standardization between cloud platforms: Major cloud platforms such as AWS, Google Cloud Platform (GCP) and Microsoft Azure do not provide any standardization of billing models, billing formats, APIs or services.
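The cost attribution problem raised by bill granularity can be illustrated with a minimal sketch. It assumes a simplified, hypothetical line-item structure (the `service`, `cost` and `tags` fields and the `cost_center` tag key are illustrative, not any provider's actual billing format) and simply rolls line items up by a tag value so that untagged spend becomes visible.

```python
from collections import defaultdict

# Hypothetical bill line items; real provider exports contain many more columns.
line_items = [
    {"service": "compute", "cost": 120.50, "tags": {"cost_center": "marketing"}},
    {"service": "storage", "cost": 30.25, "tags": {"cost_center": "marketing"}},
    {"service": "database", "cost": 210.00, "tags": {"cost_center": "sales"}},
    {"service": "compute", "cost": 75.00, "tags": {}},  # untagged resource
]


def attribute_costs(items, tag_key="cost_center"):
    """Sum line-item costs per tag value; untagged items go to 'unallocated'."""
    totals = defaultdict(float)
    for item in items:
        owner = item["tags"].get(tag_key, "unallocated")
        totals[owner] += item["cost"]
    return dict(totals)


totals = attribute_costs(line_items)
for owner, cost in sorted(totals.items()):
    print(f"{owner}: {cost:.2f}")
```

The "unallocated" bucket is a design choice for this sketch: it keeps untagged spend in the report rather than silently dropping it.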
To control the costs of public cloud IaaS and PaaS, prevent overspending and drive more efficient consumption of cloud services, organizations must develop financial management processes. These processes affect multiple roles and departments, including I&O, the cloud center of excellence (CCOE), finance and the firsthand consumers of cloud services. Ultimately, these processes translate into new management requirements and demand the adoption of new tools.
This research guides organizations through how to manage costs of public cloud IaaS and PaaS, to address the listed challenges and unlock new savings opportunities. Only via the thorough application of this research can organizations make their use of public cloud services cost-effective.
The Gartner Approach
Common guidance on public cloud cost management is often limited to providing a list of pragmatic tasks, such as turning off unused instances or deleting unused storage. While these practices are certainly recommended — and mentioned in this research — common guidance usually fails at establishing a strategic and comprehensive view on cost management.
Focusing just on operational tasks such as turning off unused instances can cause disruption and frustration among your cloud users, leading to shadow IT. The simple execution of cost reduction tasks won’t guarantee that spend will remain within expectations if nobody has previously set an expectation. Furthermore, without the ability to track and organize costs around cost centers and applications, you will not be able to provide the right visibility to make your cloud consumers care about how much they spend.
Gartner’s methodology provides a structured framework for public cloud cost management. It provides guidance not only on the operational aspects, but also on architecture, governance, application development and DevOps. Using this structured approach, you will be able to set your priorities, involve key stakeholders and determine the organizational changes required to develop and maintain these new capabilities. By applying this methodology, you will initiate the cultural shift that will make your cloud consumers more accountable for their IT spend. Ultimately, you will learn how to manage your costs in relation to the business value that cloud services generate.
Public cloud cost management is part of the broader cloud economics discipline. Cloud economics also includes aspects of total cost of ownership (TCO) and ROI calculation as organizations evaluate the adoption of cloud services. These aspects are out of the scope of this document but are covered in
The Guidance Framework
Managing cloud costs is a multifaceted, complex problem. First, organizations must learn how to forecast consumption and how to set budget expectations. Then, they must gain continuous visibility into what users are spending for each initiative, project or application. Once tracking is established, organizations must seek methods to reduce their monthly bill. Costs can be reduced by leveraging the achieved visibility to detect anomalies and drive corrective actions. As organizations mature through such capabilities, they can achieve scale by automating decision making where possible.
Gartner has created a framework to manage cloud spending on an ongoing basis. This guidance framework provides a series of capabilities that organizations must develop to budget, track and optimize cloud spend. The framework applies regardless of where your organization is in its cloud adoption journey and identifies best practices in each of the included components.
Managing cloud costs requires the development of capabilities in five distinct areas:
The guidance framework is depicted in Figure 1.
The logical flow between these five areas should not be interpreted as a mandate to implement them sequentially. Instead, organizations should apply an iterative approach and develop each area as independently as possible. Although there are obvious dependencies between areas, these shouldn’t block the development of subsequent capabilities. For example, you can start reducing your costs even if you don’t possess full visibility of your spending.
This framework is provider-neutral and can be applied to all major cloud providers, including AWS, GCP and Microsoft Azure. The described approach helps organizations develop a consistent multicloud governance model to achieve operational excellence in managing public cloud spending.
Before using the framework, technical professionals must complete some fundamental prework. The items described in this section must be completed prior to using any component in the framework. Specifically, you must:
Public cloud cost management is not only a concern for the I&O organization. Similarly, it’s not only the finance team that should care about budgets and cost containment. This guidance framework impacts several departments that should also cooperate in its implementation.
Organizations are transforming their culture and processes to adopt cloud computing. They are adopting product-oriented delivery and DevOps. This transformation forces organizations to become more decentralized and requires granular cost controls. The individuals and teams responsible for cloud costs also change as organizations move through different stages of this transformation.
You must determine the stakeholders to involve in your cost management practice. At a minimum, the collaboration should embrace the following disciplines:
Governance: The governance discipline is in charge of defining the policies that govern the cost management practice. Policies apply throughout the guidance framework. Examples of policies include which resource tags must be applied to enable cost tracking or how long unused services will be left running before being shut down.
Architecture: The architecture discipline helps design architectures based on cost optimization principles. The efficient use of cloud IaaS and PaaS requires a precise architectural design that delivers on the requirements but not beyond them (see Architect With Cost in Mind in the Plan component).
IT Operations: IT operations is in charge of enforcing policies to reduce costs on an ongoing basis. It is also in charge of notifying cloud consumers about optimization opportunities they can pursue on their own. Lastly, it establishes and maintains reports and dashboards to build cost awareness throughout the organization.
Product Owners: Product management defines the requirements that must be delivered by the application in terms of performance, availability, frequency of updates or expected utilization. Product owners are responsible for defining the business KPIs that represent the value of applications. Ultimately, you cannot make cost management decisions without considering the impact on these business KPIs, as described in Correlate Costs to Business Value in the Evolve component.
Application Development and DevOps: Teams responsible for application development and DevOps directly design, build, run and operate aspects of their applications. Instead of relying on a ticketing system to gain access to resources, these teams are directly responsible for provisioning and managing cloud resources. Their architectural decisions have a direct impact on the cloud bill. They are the most empowered to make the changes required to keep costs under control (and often the least prepared to do so).
Finance: Finance provides input to the governance discipline for the definition of the policies that relate, for example, to budget approvals and cost allocation. Furthermore, finance consumes cost reports to produce forecasts and implement chargeback and showback models.
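The two policy examples named under the governance discipline (mandatory resource tags and a time limit for unused services) can be expressed as data and checked mechanically. The following is a minimal sketch under stated assumptions: the `CostPolicy` and `Resource` classes, the tag names and the 14-day threshold are all hypothetical and are not part of the framework itself.

```python
from dataclasses import dataclass


@dataclass
class CostPolicy:
    # Tag keys every resource must carry (hypothetical names).
    required_tags: frozenset = frozenset({"application", "cost_center"})
    # Days a resource may sit unused before it is flagged (assumed value).
    max_idle_days: int = 14


@dataclass
class Resource:
    name: str
    tags: dict
    idle_days: int


def find_violations(resources, policy):
    """Return (resource name, reason) pairs for every policy breach."""
    violations = []
    for res in resources:
        missing = policy.required_tags - set(res.tags)
        if missing:
            violations.append((res.name, "missing tags: " + ", ".join(sorted(missing))))
        if res.idle_days > policy.max_idle_days:
            violations.append((res.name, f"unused for {res.idle_days} days"))
    return violations


resources = [
    Resource("vm-web-1", {"application": "shop", "cost_center": "sales"}, idle_days=0),
    Resource("vm-test-7", {"application": "shop"}, idle_days=30),
]
violations = find_violations(resources, CostPolicy())
for name, reason in violations:
    print(f"{name}: {reason}")
```

In this sketch the governance team would own the `CostPolicy` values, while IT operations would run the check and notify the owners of flagged resources.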
In many cases, there is a team in charge of both governance and architecture. This team is often the CCOE. Other common names for such a team include “cloud architecture” or “cloud custodians.” A CCOE is a centralized cloud computing governance function for the organization as a whole. It serves as an internal cloud service broker. It acts in a consultative role to the consumers of cloud services within the organization. It is a key ingredient for cloud-enabled transformation and is typically tasked with helping drive that transformation. For more information on the role and responsibilities of a CCOE and how to establish one, see
Figure 2 depicts the relationships of the CCOE with the identified stakeholders.
Appoint Ownership of the Cost Management Practice
The finance department does not own the cloud cost management practice. Finance just doesn’t possess the technical knowledge required to make decisions that are heavily tied to technical configurations and operational metrics. For example, resizing a compute instance requires a reboot, and finance isn’t equipped to make this decision while fully considering its technical impact.
Cost management is primarily a governance matter and its ownership naturally resides with the IT governance team or the CCOE — when the organization has one. The implementation of this framework starts from its translation into specific policies, which the CCOE is in charge of defining. In this context, IT operations acts as the enforcement arm of the CCOE for certain aspects of the framework, such as cost reduction techniques.
In other cases, when there is no CCOE or when I&O is accountable for all IT spending, the operations team can take full ownership of the cost management practice. I&O is already well-versed in monitoring and reporting metrics such as availability and performance. Cost would be simply another metric to track and act on. Although this is an acceptable approach to begin with, at some point the governance team must take over and take the practice beyond the simple operational perspective.
Owning the cost management process doesn’t mean having to execute firsthand each of the listed practices. This execution will actually be distributed among stakeholders. Ultimately, many of the cost management practices such as tagging and rightsizing will be executed directly by cloud consumers, as they take on more responsibility for their spending. Owning this process means making sure that the cost policies are successfully defined and enforced. While the CCOE makes the “final call” on some of these policies, their definition must be accomplished with a great degree of collaboration and transparency with identified stakeholders.
Define Cost Accountability
The traditional consumers of IT services within an organization have never been concerned about costs. In an enterprise data center, I&O owns and is accountable for the entire IT budget. The lack of chargeback models for the calculation of unit costs made IT consumers consider the data center literally as a “free” resource. However, in such a model, I&O is also in charge of provisioning resources and procuring capacity. Consequently, I&O is in full control of the costs of IT and can simply refuse requests when they are not within an established budget.
Conversely, the self-service nature of cloud services and the widely available self-service interfaces have shifted some of this control away from I&O. When using cloud services, I&O does not make all resource procurement decisions and should not be held accountable for the entirety of the spending. Organizations must work on shifting this accountability toward the consumers of cloud services.
This shift requires a cultural change and does not happen overnight. This Gartner framework (see the Evolve component) helps lay out the foundations for this shift to happen over time. However, when starting to implement cost management, organizations can choose to keep spending ownership in the I&O team for an initial — and limited — time frame. This approach will help accelerate the implementation of potentially disruptive processes such as resource decommissioning or rightsizing. In the future, by applying this guidance framework in its entirety, the ownership of cloud spending becomes decentralized and distributed to all teams in charge of deploying cloud applications and projects.
Select Governance Model
Because cost management is primarily a governance concern, you must define the model you want to use to enforce cost policies. Gartner identifies two main approaches to cloud governance:
“In the way” governance: In this approach, central IT stands in the way between cloud consumers and cloud environments. It acts as a proxy by collecting requests for cloud services and executing firsthand provisioning tasks. This approach hides native cloud interfaces from users, and minimizes their autonomy but maximizes centralized control. It enforces policies by simply dismissing noncompliant requests.
“On the side” governance: In this approach, central IT allows cloud consumers to have direct access to the native cloud interfaces. Central IT configures these interfaces with policies to implement guardrails and guidelines. Cloud platforms enforce the configured policies at every provisioning request submitted through their native interfaces. This approach maximizes user autonomy, but lowers control for central IT.
The two models are depicted in Figure 3.
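To make the “on the side” model more concrete, the sketch below shows the kind of guardrail that central IT might configure and that is then evaluated on every provisioning request. It is only an illustration: the request fields, the allowed regions, the hourly cost ceiling and the required tag are assumptions, and real cloud platforms express such policies through their own native policy services rather than through code like this.

```python
# Illustrative guardrail values; a real policy would be defined by the CCOE.
ALLOWED_REGIONS = {"us-east-2", "eu-west-1"}
MAX_HOURLY_COST = 2.00
REQUIRED_TAGS = {"cost_center"}


def evaluate_request(request):
    """Return (allowed, reasons) for a hypothetical provisioning request."""
    reasons = []
    if request["region"] not in ALLOWED_REGIONS:
        reasons.append(f"region {request['region']} is not allowed")
    if request["estimated_hourly_cost"] > MAX_HOURLY_COST:
        reasons.append("estimated hourly cost exceeds the guardrail")
    missing = REQUIRED_TAGS - set(request.get("tags", {}))
    if missing:
        reasons.append("missing tags: " + ", ".join(sorted(missing)))
    return (not reasons, reasons)


compliant = {
    "region": "us-east-2",
    "estimated_hourly_cost": 0.40,
    "tags": {"cost_center": "sales"},
}
noncompliant = {"region": "sa-east-1", "estimated_hourly_cost": 3.10, "tags": {}}

print(evaluate_request(compliant))
print(evaluate_request(noncompliant))
```

The consumer keeps direct access to the provisioning interface; only requests that breach a guardrail are rejected, and the returned reasons tell the consumer what to change.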
Gartner believes that cloud computing can be adopted at scale only by adopting an “on the side” governance model. Only this model can unlock benefits such as agility and speed that organizations seek from cloud technologies. If you are adopting cloud computing to accelerate business innovation, you must implement the “on the side” governance model as described in This framework was developed for organizations that apply such a model.
The “in the way” model has proven not to be effective with cloud computing. The centralized IT organization is typically not staffed to provide timely support to the growing number of provisioning and change requests from lines of business (LOBs). Cloud consumers increase pressure and demand on the IT organization, especially because they have the possibility to experience cloud directly. As a consequence, organizations that implement “in the way” governance typically experience a higher degree of shadow IT. See for the strengths and weaknesses of different self-service approaches.
Once this prework is complete, organizations can move to implementing the cost management framework for integrated IaaS and PaaS. The rest of the document describes the framework in detail.
Organizations need to develop capabilities to produce an application budget and consumption forecast as accurately as possible. Setting an expectation upfront creates a baseline against which the organization can measure actual consumption. Develop this capability and run this process prior to deploying applications, projects and workloads in the public cloud. Create forecasts for each new application you deploy in public cloud environments and for each application you migrate from on-premises into a public cloud environment.
The Plan component of this guidance framework is depicted in Figure 4.
This capability is a consultative function provided by the CCOE to uncover the application’s nonfunctional requirements that have an impact on costs. The goal is to identify the precise outcomes that drive the subsequent cloud services design and avoid overarchitecting applications.
Collaborate with product owners and business stakeholders to understand the purpose of each application and the value it delivers to the organization. Determine the key metrics that demand the use of specific architectural principles. Even when your cloud consumers believe they have predetermined the required set of cloud services, challenge their assumptions and clarify what they’re trying to accomplish.
The more cloud consumers are accountable for their costs, the more they value the support of the CCOE in this planning activity. Consumers shouldn’t consider this activity as an administrative burden that slows down their cloud projects. Instead, you must position the CCOE as a subject matter expert that helps cloud consumers accomplish their goals and determine the most cost-effective architecture to deliver their requirements. To make users more accountable for their cloud costs, see Shift Budget Accountability in the Evolve component.
It is easier to define requirements for existing applications because you have historical usage data to observe. New applications will require more assumptions about their expected utilization, and you’ll possibly need to employ an iterative approach. Regardless of the difficulty, defining as many requirements as possible upfront is instrumental to a precise cloud services design. Such services are available with many configuration options, each of them bearing its own cost. Certain options should be enabled only if truly needed. For example, you will choose to enable (and pay for) cross-region data replication only if the application is critical for business continuity. Overarchitecting cloud services can only result in overspending.
Ask your cloud consumers the following questions to determine their workloads’ requirements:
Confidentiality, integrity and availability requirements: How confidential is the data handled by the workload? How critical is the integrity of the data? What are the consequences if data remains unavailable for a period? What are the consequences of data loss?
Business continuity requirements: Does the application require any SLAs or service-level objectives (SLOs)? What is the target recovery point objective (RPO) and recovery time objective (RTO)? Will the application need to run in more than one geography? What is the desired HA/DR architecture? Are there specific requirements related to MTTR?
Performance requirements: What is the desired performance target? How much compute and memory? How much storage space? How many storage IOPS and how much network bandwidth? What maximum latency can be tolerated?
Compliance requirements: Does data used or created by the application require compliance with any industry-specific regulatory standard, such as Payment Card Industry Data Security Standard (PCI DSS) or Health Insurance Portability and Accountability Act (HIPAA)? Does it have any residency requirement such as never traveling outside of a country border?
Technology requirements: Which technology is the application based on? Does it require a commercial operating system (OS) like Red Hat Enterprise Linux (RHEL)? Or can it run on a free OS like Ubuntu? Does the application require a specific runtime such as .NET, Java, Node.js or Go? Does the application have external dependencies such as external APIs or cloud services?
Utilization requirements: Are utilization peaks expected? Does it need to be on 24 hours a day or only during business hours? How frequently will the data be accessed? Do you have active, predictable traffic? Do you have active traffic with large spikes? How much consumption are you expecting for your application and over which time frame? How many transactions? What’s the expected growth?
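The answers to the questions above can be captured in a structured record so that gaps are explicit before design starts. The sketch below is one possible shape for such a record; the class name, field names and value conventions (for example "low" and "high") are assumptions made for illustration and cover only a subset of the questions.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class WorkloadRequirements:
    """Answers collected from a cloud consumer; None means not yet answered."""
    name: str
    confidentiality: Optional[str] = None   # "low" or "high"
    integrity: Optional[str] = None         # "low" or "high"
    availability: Optional[str] = None      # "low" or "high"
    rpo_minutes: Optional[int] = None       # recovery point objective
    rto_minutes: Optional[int] = None       # recovery time objective
    compliance: Optional[tuple] = None      # e.g. ("PCI DSS",); () if none apply
    operating_system: Optional[str] = None
    always_on: Optional[bool] = None
    expects_spikes: Optional[bool] = None

    def open_questions(self):
        """List the requirement fields that still need an answer."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


req = WorkloadRequirements(
    name="order-service",
    confidentiality="high",
    availability="high",
    compliance=("PCI DSS",),
    always_on=True,
)
print(req.open_questions())
```

Listing the open questions supports the iterative approach described earlier: new applications will typically start with several unanswered fields that are filled in as assumptions are validated.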
The definition of requirements is part of the extended function of IT as a broker of cloud services. The role of a service broker is to enable developers, consumers and lines of business to quickly access technology services while safeguarding the interests of the business through the application of centralized policies and procedures. Taking on this role involves a delicate balance between enabling agility and maintaining governance and control. For more information, see
Architect With Cost in Mind
You must use the defined requirements to help your cloud consumer design the most cost-effective architecture for their application. For example:
If the workload has a high availability target, the architecture must consider the deployment of services that may fail (such as compute or database instances) across multiple availability zones (AZs). If the availability target is low, the use of a single AZ may suffice.
If the workload has a high integrity target and you can’t afford to lose its data, the architecture should implement cross-region data replication for storage and database services. If the integrity target is low, same-region replication may suffice.
If the workload has a high confidentiality target, the architecture should implement encryption and other hardening security services or adopt specific components such as a storage gateway.
If the workload is always on and requires minimal tuning of the infrastructure, then the architecture should prioritize application PaaS over IaaS.
If the application has a transient or volatile load, can tolerate latency and time constraints, and can operate statelessly, then the architecture should prioritize serverless services and a function PaaS (fPaaS).
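The examples above follow a simple "if requirement, then design choice" pattern, which can be sketched as a small rule function. This is only an illustration of the pattern: the dictionary keys and the wording of the recommendations are hypothetical, and real architectural decisions involve far more inputs than these five rules.

```python
def recommend_architecture(req):
    """Map a requirements dictionary (hypothetical keys) to design suggestions."""
    recs = []
    # Availability target drives the number of availability zones.
    recs.append("deploy across multiple AZs" if req["availability"] == "high"
                else "single AZ")
    # Integrity target drives the replication scope.
    recs.append("cross-region data replication" if req["integrity"] == "high"
                else "same-region replication")
    # Confidentiality target drives security services.
    if req["confidentiality"] == "high":
        recs.append("encryption and hardening security services")
    # Load profile drives the choice of service model.
    if req["volatile_load"] and req["tolerates_latency"] and req["stateless"]:
        recs.append("serverless services and fPaaS")
    elif req["always_on"] and req["minimal_tuning"]:
        recs.append("application PaaS over IaaS")
    return recs


batch_job = {
    "availability": "low", "integrity": "low", "confidentiality": "low",
    "volatile_load": True, "tolerates_latency": True, "stateless": True,
    "always_on": False, "minimal_tuning": False,
}
print(recommend_architecture(batch_job))
```

Each recommendation carries a cost, so a rule that does not fire (for example, low integrity) is also a saving: the sketch returns the cheaper option by default.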
All of the architectural components carry a cost tag. Therefore, you must design your cloud architectures with cost in mind. In the past, organizations used to design for availability, performance and security to be delivered from a finite set of resources. The cost of servers, storage, network and data center staff was already in the books and the efficiency goal was to maximize utilization and return on investment. Therefore, there was a natural tendency to overarchitect. The cloud reverses this paradigm and allows for a more precise design that is perfectly aligned to workload requirements.
Cloud services are provided with two different charging models: allocation-based and consumption-based. Allocation-based services require users to preprovision capacity, and cloud providers charge for that provisioned capacity as long as it exists and regardless of whether it is used. Conversely, consumption-based services don’t require preprovisioning and are billed based on units of consumption. Figure 5 illustrates the differences between these two charging models.
Organizations must design architectures that leverage cloud services based on the expected usage. An application with expected spikes in usage will possibly be more cost-effective if powered by consumption-based services. A stable application will be best suited to allocation-based services.
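The trade-off between the two charging models comes down to a break-even volume, which the following sketch computes. All prices are hypothetical and chosen only to make the arithmetic easy to follow; the function names are likewise illustrative.

```python
HOURS_PER_MONTH = 730  # common approximation for one month


def allocation_cost(hourly_rate, hours=HOURS_PER_MONTH):
    """Provisioned capacity is billed for every hour it exists, used or not."""
    return hourly_rate * hours


def consumption_cost(price_per_million, millions_of_requests):
    """Consumption-based services are billed per unit actually consumed."""
    return price_per_million * millions_of_requests


def break_even_millions(hourly_rate, price_per_million, hours=HOURS_PER_MONTH):
    """Monthly request volume at which both models cost the same."""
    return allocation_cost(hourly_rate, hours) / price_per_million


# Hypothetical prices: an instance at 0.10 per hour versus 0.20 per million requests.
threshold = break_even_millions(hourly_rate=0.10, price_per_million=0.20)
print(f"Break-even at about {threshold:.0f} million requests per month")

for volume in (50, 500):
    alloc = allocation_cost(0.10)
    cons = consumption_cost(0.20, volume)
    cheaper = "consumption-based" if cons < alloc else "allocation-based"
    print(f"{volume} million requests: {cheaper} is cheaper")
```

Below the break-even volume the consumption-based service costs less; above it, the always-on allocation does, provided a single allocated unit can actually serve that volume (an assumption this sketch does not verify).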
Architecting with cost in mind means picking the right services to deliver the exact set of known requirements and not more than that. Because not all requirements may be known at this stage and you may be making assumptions, you won’t be able to produce the definitive architecture yet. The final architecture will result from multiple iterations and optimizations that you’ll conduct in the first few weeks or months after deployment. However, designing an architecture as part of the planning capability is the fundamental starting point to produce your consumption forecast.
Choose Pricing Models
Cloud providers offer services through multiple pricing models. For example, you can buy the same instance with pay-as-you-go (PAYG) or by committing to consume it for a given term, such as one year or three years. The instance will be priced differently based on the chosen model and it will be delivered with a different service level.
Figure 6 is a screenshot from the Gartner Cloud Decisions tool that indicates the price range for an m5.xlarge instance on AWS. Note that there is an order-of-magnitude difference between the cheapest option (spot, Linux OS, us-east-2 region) and the most expensive one (on demand, Dedicated Host, Linux, sa-east-1 region).
The difference in pricing shown in Figure 6 means that planning your cloud consumption using the wrong pricing model can have a huge impact on your cloud forecast. Your forecast can be up to 10 times higher, or 10 times lower, than the actual future spending. Therefore, it’s important that you learn how to make pricing model decisions upfront. Switching pricing models after deployment is possible and also recommended by this framework. However, choosing a model upfront allows you to produce a more accurate estimate and reduce the need for future budget adjustments.
In general, cloud provider pricing models vary based on the following attributes:
Embedded license-based software: Some cloud services include software components that are supplied by third parties with a traditional licensing model. Examples of such services are RHEL or Windows-based compute instances and database services with commercial engines such as Oracle or Microsoft SQL Server.
Service availability: Some providers offer cheaper pricing options in exchange for lower service availability. For example, preemptible instances can be purchased at a much lower price if customers can accept that instances may be terminated or stopped by the provider on short notice. Preemptible instances are great for batch workloads or other fault-tolerant applications, but they’re not suitable for all use cases. See Use Preemptible Instances in the Optimize component for a description of this type of offering.
Data integrity: Sometimes, price depends on whether the provider copies the stored data in either single or multiple locations. For example, georeplicated database services carry a higher price tag due to the additional infrastructure that hosts the replicas.
Performance target: Some resources such as block storage volumes or CPUs may be priced differently based on the performance target they can deliver (IOPS and/or throughput). For example, a volume that can deliver up to 2,000 IOPS may be priced differently from a volume that can deliver 200. Some cheaper CPU options may deliver lower performance, which can be temporarily improved using a credit-based mechanism.
Tenancy: Some providers offer dedicated infrastructure, such as a compute or storage node, to a single client organization. Certain organizations may have regulatory requirements that do not allow them to share infrastructure with other tenants. Cloud services on single tenant infrastructure are typically more expensive than the equivalent services on multitenant infrastructure.
Scrutinize your cloud provider’s price lists and learn the combination of attributes that influence pricing. You will likely end up with multiple pricing models applied to different groups of resources. Typically, organizations commit for longer terms for their baseline demand and use PAYG for temporary spikes or bursts.
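The practice of committing for baseline demand and using PAYG for spikes can be illustrated with a small calculation. The sketch below assumes a hypothetical daily demand profile and hypothetical hourly rates (0.06 committed, 0.10 PAYG); it evaluates several commitment levels and picks the cheapest one.

```python
def blended_cost(hourly_demand, committed_units, committed_rate, payg_rate):
    """Cost of a commitment billed every hour plus PAYG for demand above it."""
    committed = committed_units * committed_rate * len(hourly_demand)
    overflow_units = sum(max(0, d - committed_units) for d in hourly_demand)
    return committed + overflow_units * payg_rate


# Hypothetical day: 4 instances needed for 20 hours, 10 instances for a 4-hour peak.
demand = [4] * 20 + [10] * 4

costs = {
    units: blended_cost(demand, units, committed_rate=0.06, payg_rate=0.10)
    for units in range(0, 11)
}
best_units = min(costs, key=costs.get)

print(f"All PAYG:            {costs[0]:.2f}")
print(f"Commit to peak (10): {costs[10]:.2f}")
print(f"Best commitment: {best_units} units at {costs[best_units]:.2f}")
```

With these assumed numbers, committing to the baseline of 4 units is cheapest: committing to nothing pays the PAYG premium all day, while committing to the peak pays for capacity that sits idle for 20 hours.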
If you’re not sure about which model to choose at this planning stage, prioritize models that allow you to retain maximum flexibility. Prioritizing flexible, low-commitment models will protect you from overbuying. It will also give you freedom to change configurations once you have a clearer picture of your needs.
However, the most flexible pricing model (PAYG) also comes at the highest price. If you’re expecting a general increase in the use of cloud services within your organization, you may opt for the more flexible commitment-based models such as AWS Savings Plans, Google committed use discounts (CUDs) or negotiated discounts based on an Enterprise Agreement (EA). Such models can pay off as your cloud usage ramps up because they apply to a broader set of services and resources.
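Whether a commitment pays off depends on how much of the committed capacity you expect to use. As a simplified sketch, assume a commitment is billed for every hour at a discount relative to PAYG, while PAYG is billed only for the hours actually used. Under that assumption the commitment is cheaper whenever expected utilization exceeds one minus the discount. The discount and utilization values below are hypothetical.

```python
def break_even_utilization(discount):
    """Utilization above which a commitment beats PAYG.

    PAYG cost       = payg_rate * utilization * hours
    Commitment cost = payg_rate * (1 - discount) * hours
    The two are equal when utilization == 1 - discount.
    """
    return 1.0 - discount


def commitment_pays_off(expected_utilization, discount):
    return expected_utilization > break_even_utilization(discount)


# Hypothetical 30% discount for a one-year commitment.
print(f"Break-even utilization: {break_even_utilization(0.30):.0%}")
print(commitment_pays_off(expected_utilization=0.80, discount=0.30))
print(commitment_pays_off(expected_utilization=0.50, discount=0.30))
```

This is why commitments become more attractive as usage ramps up: rising utilization moves more of the committed capacity above the break-even point.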
Once you’ve designed your architecture and chosen your pricing models, you can create your consumption forecast. Create a workload model with the data you’ve gathered and input this information into purpose-built tools that can calculate an estimate of your monthly charges. Such tools include cloud providers’ pricing calculators or third-party tools such as Gartner CloudMatch (part of the Gartner Cloud Decisions tool).
At this stage, your forecast is based on assumptions and, as a consequence, it probably won’t match the actual bill exactly. It will contain a margin of error. This margin will likely shrink over time as you learn how to better collect requirements and model your workloads. It is important to create forecasts even if you expect a wide margin of error because they’re fundamental to setting macrolevel expectations of your cloud spending. Without such forecasts, you won’t be able to understand whether you’re spending less or more than you expected, and you won’t be able to improve your forecasting ability.
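As an illustration of the workload model and the margin of error described above, the following is a minimal sketch rather than a replacement for a provider's pricing calculator. The item names, quantities and unit prices are hypothetical placeholders, not real list prices.

```python
from dataclasses import dataclass

@dataclass
class WorkloadItem:
    name: str
    quantity: float         # e.g., number of instances or GB stored
    hours_per_month: float  # use 1 for items that are billed per month
    unit_price: float       # hypothetical price per unit per hour (or per month)

def monthly_forecast(items):
    """Sum quantity x hours x unit price across all workload items."""
    return sum(i.quantity * i.hours_per_month * i.unit_price for i in items)

def margin_of_error(forecast, actual):
    """Relative forecast error; positive means the bill exceeded the forecast."""
    return (actual - forecast) / forecast

# Hypothetical workload model: four instances plus 500 GB of block storage.
workload = [
    WorkloadItem("web-instances", quantity=4, hours_per_month=730, unit_price=0.10),
    WorkloadItem("block-storage-gb", quantity=500, hours_per_month=1, unit_price=0.08),
]
forecast = monthly_forecast(workload)
```

Recording `margin_of_error(forecast, actual_bill)` for each billing period gives a simple way to see whether forecasting accuracy is improving over time.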
As you progress through this framework, you will learn how to adjust your cloud consumption forecasts when piloting your workload (see Deploy Pilot Application in the Plan component) or even once you start looking at your actual spending (see the Track component).
Automate Forecasting in the Continuous Integration/Continuous Delivery (CI/CD) Pipeline
Many organizations adopt a “shift left” mentality to place the onus for quality, reliability and uptime with application delivery teams that practice DevOps. Such organizations place increased expectations on such teams to forecast costs, optimize resources and implement continuous capacity management. This requires pulling traditionally manual processes for capacity management and forecasting into the automated CI/CD process.
In such a process, your CI build generates a release candidate of your application. The release candidate includes an application manifest with metadata such as versioning, system requirements, application configuration and potentially a run book. You can use this application manifest to make the CI/CD toolchain aware of the application components and the resources the application consumes. Application teams can use this additional metadata to produce a cost forecast of the application in a production environment.
Once produced, the forecast could be fed into an issue tracking system to manage the request for additional resources or to collect approvals from finance. In addition, this data can be utilized to baseline and measure the forecast of several releases and produce a historical view. This historical view will allow you to improve your forecasting ability, much in line with the agile principles used to measure the difficulty level of implementing user stories.
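The flow described above can be sketched as a small pipeline step. This is an illustrative sketch only: the manifest layout, the `PRICE_BOOK` values and the function names are assumptions made for the example, not part of any existing CI/CD tool.

```python
# Hypothetical price book: cost per unit per month (illustrative numbers only).
PRICE_BOOK = {"vcpu": 25.0, "ram_gb": 3.5, "storage_gb": 0.10}

def forecast_from_manifest(manifest):
    """Estimate monthly cost from the resources declared in a release manifest."""
    resources = manifest["resources"]
    return sum(PRICE_BOOK[kind] * amount for kind, amount in resources.items())

def forecast_delta(history, new_forecast):
    """Relative change versus the previous release's forecast (None if no history)."""
    if not history:
        return None
    return (new_forecast - history[-1]) / history[-1]

# Assumed manifest shape produced by the CI build for a release candidate.
manifest = {
    "version": "1.4.0",
    "resources": {"vcpu": 8, "ram_gb": 32, "storage_gb": 200},
}
history = [300.0]  # forecasts recorded for earlier releases
new_forecast = forecast_from_manifest(manifest)
change = forecast_delta(history, new_forecast)
```

A pipeline could attach `new_forecast` and `change` to an issue or approval request, and append each forecast to the history to build the historical view.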
Although Gartner sees early adopter organizations experimenting with it, the practice of automating forecasts in the CI/CD pipeline is still emerging. The tools available to calculate forecasts are still maturing and they were not originally intended for automation pipelines.
Deploy Pilot Application
To improve the accuracy of your forecast and refine your estimate, you must deploy a pilot of your application before deploying it in production. The pilot stage allows you to detect early the configurations that are not suited to your actual demand.
Cloud cost calculators provide an initial baseline to create an estimate for the designed architecture. However, their forecasting accuracy is only as good as the accuracy of the assumptions made when modeling the workload. Some people use the analogy “garbage in, garbage out” to describe such tools and highlight the importance of the quality of the inputs.
Deploying pilot applications before production is a best practice that serves many different purposes, such as finding bugs, discovering architectural issues or functional testing. Once completed, pilot deployments are often promoted to production, without the need to redeploy new infrastructure. However, few organizations include cost monitoring during the pilot stage.
To monitor consumption of your pilots comprehensively, you must first develop some of the capabilities described in the Track component of this framework. Once you have visibility into key metrics of your pilots, you can monitor utilization and cost and make the architectural adjustments that improve the accuracy of your consumption forecast. Look for utilization patterns and study the consumption trends. Use performance management tools to understand the behavior in terms of CPU, memory, IOPS and data transfer. Look for specific time frames when the application may not be utilized or when it would possibly require additional resources to deliver the required performance target.
The length of this phase will vary. Gartner typically observes organizations running pilot applications to monitor consumption from one to three months. The actual duration will heavily depend on your degree of familiarity with the involved cloud provider and the maturity of your cloud adoption. The more workload you deploy in the public cloud, the more accurate your planning ability will become, allowing you to shorten the length of your pilots.
Complete the Plan component by establishing a budget figure for each application, project or workload you’re deploying in a public cloud environment. This figure helps set expectations around cloud costs and lower the general anxiety of uncontrolled spending growth.
The cloud spending owners (whether I&O or the individual cloud consumer teams) should ask the finance organization for formal approval of this budget. Learning how to build budgets upfront and allowing the organization to approve spending before it occurs is fundamental for enabling cost governance. Having cloud consumers ask for budget approvals makes them more accountable for their spend, as described in “Shift Budget Accountability” in the Evolve component.
Once established, the budget figure should be configured in a budget tracking system, which can compare actual spending to the established expectations and alert owners when needed. Examples on how to configure budget alerts for GCP and Microsoft Azure can be found in and
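The comparison that a budget tracking system performs can be sketched as follows. This is a simplified illustration, not a provider API: the function name and the default alert thresholds (50%, 80% and 100% of budget) are assumptions chosen for the example.

```python
def budget_alerts(budget, actual, thresholds=(0.5, 0.8, 1.0)):
    """Return the thresholds (as fractions of budget) that actual spend has crossed."""
    if budget <= 0:
        raise ValueError("budget must be positive")
    used = actual / budget
    return [t for t in thresholds if used >= t]

# Example: a $10,000 monthly budget with $8,500 spent so far.
crossed = budget_alerts(10_000, 8_500)
```

Each crossed threshold would typically map to a notification to the budget owner.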
Developing cost planning capabilities is key to setting agreed-upon expectations on cloud spending. Skipping this component of the framework and not establishing application budgets would raise concerns about a lack of cost discipline. Furthermore, without planning, organizations will struggle to make their cloud consumers accountable for their spending.
Once your budget is established and your application is deployed, you must maintain visibility into cloud spending. Many companies save money by simply gaining visibility into who is spending money and for which projects. With the right insights, your organization may begin to question whether deployed resources are adding value, and whether or not they are necessary.
Tracking spending requires the creation of an organized view of costs. Organizing cloud costs is not an activity to conduct on the bill itself. Trying to manually attribute each line item to your cost centers is likely to be unsuccessful due to the large amount of data to process. Furthermore, processing issued bills won’t give you a daily view into your spending, which is essential for containing waste and optimizing resources.
Native providers’ hierarchy and tags are the foundational mechanisms to organize resources on all major cloud providers. With well-organized resources and a cost allocation strategy in place, organizations can monitor cost and utilization metrics to detect anomalies and implement chargeback and showback.
The Track section of this guidance framework is depicted in Figure 7.
Design Native Hierarchy
All major cloud providers offer native mechanisms to classify resources in a hierarchical structure. For example, AWS offers “accounts” and “organizations.” Microsoft Azure offers “management groups,” “subscriptions” and “resource groups.” GCP offers “folders” and “projects.” The resource placement within these native constructs is mandatory at the time of provisioning.
The element of the hierarchy where a resource is placed appears in the provider’s bill, next to each line item, such as an hour of consumption of a compute instance. Therefore, technical professionals must also think about cost allocation when designing their resource placement strategy in a provider’s native hierarchy.
However, this hierarchy isn’t designed primarily to implement a cost structure. Rather, it is intended to provide for resource isolation and management at scale. An AWS account or a Microsoft Azure subscription bears many constraints that organizations must be aware of and that should take priority over cost allocation. For example, organizations shouldn’t use one account per application only to be able to track how much each application costs. Using multiple accounts complicates resource management because each account acts as an anchor for quotas, permissions and other policies. This management overhead largely outweighs the benefits when multiple accounts are used solely for cost attribution.
For more information about designing a provider’s governance structure and related constraints, see and
Although a provider’s native hierarchy is fundamental to enable basic cost allocation, organizations must complement it with other mechanisms such as tags or labels to implement cross-cutting resource metadata (see the following section, Implement Tagging Strategy).
Implement Tagging Strategy
Tags (or labels) are a fundamental governance construct offered by all major cloud providers, including AWS, GCP and Microsoft Azure. Tags implement metadata that apply across the elements of a native provider’s hierarchy. Tags appear in the provider’s bill next to each line item and can be used to break down cost reports. For example, by tagging development resources with the “environment = development” tag, organizations can group the spending of all development environments across all their accounts, subscriptions or projects. Tags provide maximum flexibility and minimal constraints to implement a multifaceted cloud resource classification strategy.
For cost tracking, Gartner recommends the use of tags in addition to and independently from other native governance constructs. Tags provide several advantages compared to solely relying on native hierarchical constructs, specifically:
Customizable naming convention: Tags allow for maximum customizability of naming conventions through user-defined values.
Multilateral structure: Tags provide a one-to-many relationship model. You can apply multiple tags to a single resource.
Cross-cutting applicability: You can apply the same tags to resources that belong to different accounts, subscriptions or projects.
Multicloud applicability: Organizations can implement the same multicloud tagging dictionary across providers without having to adopt a different naming convention per provider.
Absence of resource constraints: There are no dependencies or implications when choosing to tag or not to tag a certain resource. Tags do not inherit attributes such as resource quotas or other policies.
Easier maintainability: Changing a tag value or adding/removing a tag from a resource does not have implications such as affecting resource availability.
Tags will appear in your bill from the moment they’re implemented. They will not apply retroactively to bills that were issued prior to the application of tags. To enable cost tracking, implement your tagging strategy as soon as possible. Figure 8 shows an example of how the combination of native hierarchy constructs (“Account” in Figure 8) and tags (“Application” and “Environment” in Figure 8) leads to cost breakdown reports that allow organizations to gain insight into their spending.
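To make the combination of native hierarchy and tags concrete, the sketch below groups billing line items by any mix of account and tag keys. The line-item structure, account names and tag values are hypothetical; real billing exports have provider-specific schemas.

```python
from collections import defaultdict

def cost_breakdown(line_items, keys):
    """Sum cost per combination of the given dimensions (account or tag keys)."""
    totals = defaultdict(float)
    for item in line_items:
        # Merge the native hierarchy element and the tags into one set of dimensions.
        dims = {"account": item["account"], **item.get("tags", {})}
        group = tuple(dims.get(k, "untagged") for k in keys)
        totals[group] += item["cost"]
    return dict(totals)

# Hypothetical line items spread across two accounts.
line_items = [
    {"account": "acct-a", "tags": {"application": "shop", "environment": "development"}, "cost": 120.0},
    {"account": "acct-b", "tags": {"application": "shop", "environment": "development"}, "cost": 80.0},
    {"account": "acct-b", "tags": {"application": "crm", "environment": "production"}, "cost": 300.0},
    {"account": "acct-a", "tags": {}, "cost": 15.0},
]
by_environment = cost_breakdown(line_items, ["environment"])
by_account_and_app = cost_breakdown(line_items, ["account", "application"])
```

Here the development spending of both accounts is grouped together, and resources without the requested tag surface under "untagged", which is a useful signal for the tagging audit.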
Although most organizations understand the importance of tags, even beyond the cost allocation use case, Gartner inquiries show that many tagging initiatives fail. This is due to high management complexity, low maturity of providers’ tooling and a general low perceived value from cloud consumers.
To address these issues, Gartner has developed a guidance framework for This framework provides a sample tagging dictionary, and helps mitigate risks and avoid common pitfalls, setting organizations up for success in their tagging initiatives.
Gartner recommends defining a tagging dictionary and promoting it internally through workshops and other dissemination activities. Organizations must establish an audit process that allows them to detect and remediate mistagged resources. Furthermore, organizations must use automation to mitigate the administrative burden of implementing tags. Lastly, enforcement measures must be put in place to prevent resource provisioning when tags are not implemented in accordance with the guidelines.
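An audit of this kind can be sketched as a check of each resource's tags against the tagging dictionary. The dictionary below (tag keys and allowed values) is a made-up example, not the sample dictionary referenced in this research.

```python
# Hypothetical tagging dictionary: required keys and, where restricted, allowed values.
TAG_DICTIONARY = {
    "environment": {"development", "test", "production"},
    "application": None,   # None means any non-empty value is accepted
    "cost-center": None,
}

def audit_tags(resource_tags):
    """Return a list of human-readable violations for one resource's tags."""
    violations = []
    for key, allowed in TAG_DICTIONARY.items():
        value = resource_tags.get(key)
        if not value:
            violations.append(f"missing tag: {key}")
        elif allowed is not None and value not in allowed:
            violations.append(f"invalid value for {key}: {value}")
    return violations

findings = audit_tags({"environment": "dev", "application": "shop"})
```

The same check could run at provisioning time as an enforcement gate, rejecting requests for which `audit_tags` returns any violation.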
Just like cost management, the CCOE leads the development of a tagging strategy. The CCOE should involve all stakeholders from the start, to allow them to grasp the value of tags and their use cases. For additional details on this Gartner approach to tagging, see
Allocate Costs of Shared Resources
In some situations, even the combination of tags with the provider’s native hierarchy is not enough to properly allocate spending across cost centers. This happens when resources are shared between multiple projects, departments or by the entire organization. For example, a single e-learning application may be used by multiple departments to train their teams. As another example, the network connection (e.g., AWS Direct Connect or Microsoft Azure ExpressRoute) between the organization’s data center and a public cloud provider would be used by everyone accessing cloud services. In these and other similar situations, organizations must determine how to split the costs of shared resources.
This cost allocation activity is typically handled manually and, therefore, does not scale. For this reason, Gartner recommends minimizing the use of shared resources by:
Tagging higher up the stack: Some shared resources such as an AWS Direct Connect link support the creation of nested virtual resources, such as a “connection” or a “virtual interface,” which can be individually tagged. By tagging these virtual resources instead of the main service, you can achieve per-cost-center cost breakdown. However, this may create new management requirements for allowing the nested virtual resources to communicate.
Using dedicated infrastructure: Deploy a dedicated set of resources per project or application. For example, instead of using a large infrastructure footprint to host a single e-learning platform, deploy multiple copies of the platform on smaller sets of infrastructure resources. Tag each set with the department that’s using it. You can use infrastructure as code (IaC) to ease the deployment of duplicate infrastructure. IaC allows you to standardize the technology stack’s deployment and configuration and mitigate the administrative overhead. See for more information on IaC.
Sometimes, the effort required to allocate costs for shared resources may outweigh its benefits. Therefore, you may want to develop this strategy only for the most expensive shared resources that experience heavy unbalanced usage from your cloud consumers.
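Where a shared resource cannot be avoided, one common way to split its cost is in proportion to a usage metric. The sketch below assumes that such a metric (for example, data transferred per department) is available; the department names and figures are hypothetical, and proportional allocation is only one of several possible splitting rules.

```python
def allocate_shared_cost(total_cost, usage_by_consumer):
    """Split a shared cost proportionally to each consumer's measured usage."""
    total_usage = sum(usage_by_consumer.values())
    if total_usage == 0:
        # No usage signal available: fall back to an even split.
        share = total_cost / len(usage_by_consumer)
        return {name: share for name in usage_by_consumer}
    return {
        name: total_cost * usage / total_usage
        for name, usage in usage_by_consumer.items()
    }

# Hypothetical monthly cost of a shared network link, split by GB transferred.
shares = allocate_shared_cost(1000.0, {"sales": 300, "hr": 100, "engineering": 600})
```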
Define Metrics to Track
Once resources are classified with the desired set of metadata, organizations must establish visibility into cost metrics. Organizations must define the metrics they want to track to enable cost governance. Such metrics are used to:
Build dashboards and reports: Visibility into cost metrics can raise internal awareness of cloud spending. The right dashboards and reports can reduce spend by influencing behavior and choices. These reports can help identify spending trends and detect the most impactful changes, anomalies and waste.
Feed the automation workflows that optimize spending: Many cost reduction and optimization practices are based on the continuous observation of metrics. Policies and rules govern the decision to, for example, eliminate a resource or change a service allocation size based on a metric value. For this use case, metrics must be made available programmatically.
Besides the cost of services, organizations must track other related metrics, such as utilization, capacity, availability and performance. For example, to identify spending waste, it is fundamental to look at how much we’re using a resource and compare utilization with its provisioned capacity. Furthermore, if we take actions to reduce spend by reducing the infrastructure footprint, we want to make sure we’re not impacting availability and performance.
Gartner recommends building the following dashboards and reports and updating them at least daily. These reports should be available for each project, application, department and any other resource metadata:
Trending patterns daily, monthly, quarterly and annually
Actual versus planned spending
Percentage of the overall spending
Top spenders and least spenders
Estimated spending waste
You will be able to calculate the estimated spending waste once you have developed some of the capabilities described in the Reduce and Optimize components of this framework. You can estimate the savings that would derive from the identified cost optimization opportunities and build reports that showcase the most and least disciplined teams and individuals. Ultimately, these reports will allow you to increase your consumers’ spending accountability. This practice is described in more detail in the Incentivize Financial Responsibility section in the Evolve component of this framework.
Alert on Anomalies
Monitoring cloud spending can be overwhelming, especially when cost must be continuously correlated with metrics such as utilization or performance. Technical professionals should not spend time monitoring metrics when the metrics are simply portraying a “normal” situation. In light of this, organizations must introduce automation to detect and alert when there is a deviation from a normal trend, i.e., anomalies.
To do so, define the conditions that represent an anomaly by using a policy. For example, a department’s daily spending that is 10 times bigger than the day before can be the symptom of a problem and should be flagged as an anomaly. However, a rule-based policy might also trigger false positives, for example when anomalies are occurring on a regular basis. As a more sophisticated practice, you can build machine learning models that learn the normal trends by consuming historical data points and predict what the normal value range might look like over time (see the gray band in Figure 9). Once the model has predicted a normal value range with confidence, any metric value outside of that range would be flagged as an anomaly (the red line in Figure 9).
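Both approaches can be sketched in a few lines. The rule-based check follows the "10 times" example above. The second function is a deliberately simple statistical stand-in (mean plus or minus a multiple of the standard deviation) for the machine learning model described, not an implementation of one; the multiplier `k` and the sample data are assumptions.

```python
from statistics import mean, stdev

def rule_based_anomaly(previous_day, today, factor=10):
    """Flag spend that is at least `factor` times the previous day's spend."""
    return previous_day > 0 and today >= factor * previous_day

def normal_range(history, k=3.0):
    """Expected range as mean +/- k standard deviations of past daily spend."""
    mu, sigma = mean(history), stdev(history)
    return mu - k * sigma, mu + k * sigma

def band_anomaly(history, today, k=3.0):
    """Flag a value that falls outside the range learned from historical data."""
    low, high = normal_range(history, k)
    return today < low or today > high

# Hypothetical daily spend for one department over the past week.
history = [100, 104, 98, 102, 96, 101, 99]
```

With this history, a daily spend of 130 falls outside the band even though it is far below the 10x rule, which illustrates why a learned range can catch deviations that a fixed rule misses.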
Anomalies on metric values should draw your attention and, therefore, you must trigger alerts when anomalies are detected. Notify resource owners, product teams, finance, the CCOE or any other individual or team that must be aware of the potential issue. For example, if the cloud estimate for a given project is $10,000 per month, and the consumption is already at $8,000 after the first week, the organization should be made aware of it. Alerting on anomalies enables organizations to promptly undertake corrective actions, instead of realizing the issue once the bill arrives.
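The example above can be turned into a simple run-rate check. The sketch assumes a linear projection of spend over a 30-day month, which is a simplification; real consumption is rarely perfectly linear.

```python
def projected_month_end(spend_to_date, days_elapsed, days_in_month=30):
    """Linear run-rate projection of month-end spend."""
    return spend_to_date / days_elapsed * days_in_month

def over_budget_alert(budget, spend_to_date, days_elapsed, days_in_month=30):
    """True when the projected month-end spend exceeds the budget."""
    return projected_month_end(spend_to_date, days_elapsed, days_in_month) > budget

# $8,000 consumed after the first week against a $10,000 monthly estimate.
projection = projected_month_end(8_000, 7)
should_alert = over_budget_alert(10_000, 8_000, 7)
```

In this case the projection is more than three times the monthly estimate, so the owners would be notified well before the bill arrives.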
In nonproduction environments, alerts can also be used to trigger other and more disruptive actions. Such actions may include the shutdown of all cost-accruing services until the cause of the anomaly is found and resolved.
Implement Chargeback and Showback
The unprecedented spending transparency provided by cloud services enables organizations to quickly implement chargeback and showback strategies. Resource classification with tags and other metadata allows you to precisely attribute costs to your internal departments and cost centers.
With chargeback, each department gets an internal bill with the cost of services they generate. Organizations use chargeback to enable spending accountability, visibility into the costs of IT from senior management and the ability to respond to unexpected demand. Showback is a form of chargeback that provides cost breakdown and visibility, without the need to issue an internal bill. In traditional data centers, both chargeback and showback have been very complex, due to the difficulty of calculating the costs of each infrastructure item. Cloud services solve this problem by providing granular cost metrics programmatically, making chargeback and showback much easier to implement.
In a chargeback scenario, IT acts as an internal service provider to the organization as a whole. Sometimes, IT chooses to apply service charges to each internal client to cover the cost of shared or centralized services. When implementing a chargeback strategy, you must choose whether you are charging services “at cost” or at a different price. To price differently you can:
Apply a markup: For each cost item that you are getting from the cloud provider, you apply a markup such as 4% or 20%. This markup is intended to cover centralized services such as support, brokerage, solution architecture, governance, security or other managed services. This model is the most predictable and transparent.
Rebill at list price: For line items, you reapply the list price of a cloud service and you retain possible discounts to cover the costs of centralized services. For example, IT may run a centralized practice to buy RIs or other programmatic discounts. The benefits of such a practice may be retained within IT, and the department using the actual instances would continue to pay for them at on-demand pricing. This model is less predictable and harder to implement, but it rewards the IT organization for the cost savings it generates through centralized spending management.
Ultimately, IT can choose a chargeback strategy that implements both models, possibly using a lower markup.
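The two pricing models can be compared with a short calculation. The rates, hours and discounted cost below are hypothetical values chosen for illustration.

```python
def charge_with_markup(provider_cost, markup=0.04):
    """Internal charge = provider cost plus a percentage markup."""
    return provider_cost * (1 + markup)

def charge_at_list_price(usage_hours, list_price_per_hour, discounted_cost):
    """Rebill at list price; IT retains the difference from what it actually paid."""
    internal_charge = usage_hours * list_price_per_hour
    retained_by_it = internal_charge - discounted_cost
    return internal_charge, retained_by_it

# Markup model: a department generated $1,000 of provider cost, 4% markup.
markup_charge = charge_with_markup(1000.0, 0.04)

# List-price model: 730 instance hours at a hypothetical $0.10 on-demand rate,
# while IT actually paid $50 thanks to a centrally purchased discount.
list_charge, retained = charge_at_list_price(730, 0.10, 50.0)
```

The amount retained in the list-price model varies with the discounts IT manages to obtain, which is why that model is less predictable than a fixed markup.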
Developing cost-tracking capabilities is foundational to enable cost governance. Having visibility into cloud spending is fundamental to verify the correctness of expectations, detect anomalies, increase accountability and provide observability into the metrics that can drive costs down. Skipping this component of the framework would make the entire cost management initiative fail. Without visibility into each metric and its trends, organizations wouldn’t know whether spending is under control and would likely overspend.
Using the gained visibility into spending metrics, you must seek opportunities to reduce your monthly bill. The importance of this and the following Optimize components are further reinforced by Gartner’s prediction that:
This framework component highlights common methods that organizations use to reduce their spending, as shown in Figure 10. These methods can be applied without the need to change the application architecture or code. Therefore, they are easier to implement and have an easy-to-calculate ROI. For example, you can estimate the savings from the rightsizing of a compute instance that has been overprovisioned for several weeks.
Dispose Unused Resources
You must look for resources that have been deployed but are not being used at all. These would be allocation-based resources that, once provisioned, will accrue cost irrespective of their usage. Such resources require the specification of a certain capacity at provisioning time, and that capacity determines their cost. To detect such resources, look for extremely low utilization metrics over a period of time. Once found, initiate a workflow that ultimately disposes of the unused resources to gain cost savings.
Although it sounds obvious, disposing of unused resources is not a common practice within traditional data centers. On-premises, organizations operate resources in a finite, preprocured capacity. Money is spent upfront for procuring the overall capacity and not for its actual utilization. Furthermore, once capacity gets allocated to projects, people are reluctant to give it back, in case they cannot obtain it when they need it again. Consequently, organizations are not prepared to manage the disposal of unused resources.
In cloud computing, capacity allocations (such as the number of CPUs and gigabytes of RAM of a compute instance) are extremely granular, can be changed frequently and are billed down to one-second increments. Such characteristics of cloud computing make the disposal of unused resources highly impactful in reducing your monthly bill.
The definition of what “unused” means is described using policies that define rules based on metric values. For example, a compute instance whose CPU has been used, on average, below 1% for at least 24 hours should be considered unused and should be disposed of. To increase result accuracy, organizations should refine this policy by using multiple metrics. For example, compute instance utilization can be determined by inspecting RAM, network bandwidth and Secure Shell (SSH) or Remote Desktop Protocol (RDP) login sessions when these are relevant (such as in the case of development instances).
There are several resource types that organizations must monitor and dispose of, if unused. At a minimum, Gartner recommends looking for the following resource types:
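Such a policy can be expressed as a small function over hourly metric samples. This is a sketch under stated assumptions: the metric field names and the network threshold are hypothetical, and only the 1% CPU and 24-hour values come from the example above.

```python
from statistics import mean

def is_unused(hourly_metrics, cpu_threshold=1.0, min_hours=24, net_threshold_kbps=10.0):
    """Return True when an instance looks unused over the last `min_hours` samples.

    Combines average CPU with network traffic and login sessions so that a
    single low metric is not enough to mark the instance as unused.
    """
    window = hourly_metrics[-min_hours:]
    if len(window) < min_hours:
        return False  # not enough data to decide
    avg_cpu = mean(m["cpu_pct"] for m in window)
    avg_net = mean(m["network_kbps"] for m in window)
    any_logins = any(m["login_sessions"] > 0 for m in window)
    return avg_cpu < cpu_threshold and avg_net < net_threshold_kbps and not any_logins

# Hypothetical samples: 24 idle hours, then the same with one SSH/RDP login.
idle = [{"cpu_pct": 0.3, "network_kbps": 1.0, "login_sessions": 0} for _ in range(24)]
with_login = idle[:-1] + [{"cpu_pct": 0.3, "network_kbps": 1.0, "login_sessions": 1}]
```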
Idle compute instances: Idle or abandoned instances are known to be one of the largest contributors to waste in cloud spending. Find and decommission them by looking for instances that have been extremely low in usage in a given period of time. This is very common in nonproduction environments like development/test, where developers are experimenting and trying out new services.
Unused storage volumes: Block storage volumes can be detached from an instance and will continue to accrue cost even if their data is inaccessible. Sometimes volumes are preserved “just in case” even after the corresponding instance is destroyed. Data is simply parked, often because users aren’t sure if they may need it again. Look for volumes that have not been attached to an instance for a period of time (for example, two weeks). When found, delete them or turn them into a cheaper snapshot that can be restored in case of necessity.
Old snapshots: Organizations must develop a snapshot retention strategy. Old snapshots may no longer be suitable for restoring instances because they contain old data. Identify and delete snapshots that are older than a defined period of time. Alternatively, you can move them to a “colder” storage class that is designed for infrequent access, such as Amazon S3 Glacier.
Unassigned IP addresses: Public IP addresses deliver value only when attached to a service and are able to route traffic. Sometimes, IP addresses are unassigned and preserved for future reuse. Look for IP addresses that have not been assigned to any resource for a period of time and release them back to the provider. Mitigate the need for future reuse of IP addresses by abstracting them using DNS.
Unused application PaaS resources: Organizations commonly use application PaaS as a platform for developers to quickly deliver applications and code. Application PaaS environments need to be preprovisioned and allocated prior to use. You cannot “set it and forget it” as each of these environments is consuming (and mapped to) underlying compute resources. These should be monitored for usage and decommissioned where appropriate.
To minimize disruption, your disposition policy should encompass multiple stages. First, it should mark identified resources as unused. Then, it should notify the owners and solicit an action from them. In the absence of a change in the resource utilization pattern for a grace period (for example, 48 hours), the disposition policy should eventually execute an administrative deletion.
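The staged policy above can be sketched as a function that decides the next step for a resource each time the policy runs. The resource dictionary, the `marked_unused_at` field and the action names are assumptions made for this example; the 48-hour grace period comes from the text.

```python
from datetime import datetime, timedelta

GRACE_PERIOD = timedelta(hours=48)

def next_action(resource, now, still_unused):
    """Return the next step of a staged disposal policy for one resource.

    `resource` is a dict that may carry a 'marked_unused_at' datetime.
    """
    if not still_unused:
        resource.pop("marked_unused_at", None)  # usage resumed: clear the mark
        return "keep"
    marked_at = resource.get("marked_unused_at")
    if marked_at is None:
        resource["marked_unused_at"] = now      # stage 1: mark, then notify owner
        return "mark_and_notify_owner"
    if now - marked_at >= GRACE_PERIOD:
        return "delete"                         # stage 3: administrative deletion
    return "wait"                               # stage 2: grace period running

resource = {"id": "vol-example"}
t0 = datetime(2024, 1, 1, 0, 0)
first = next_action(resource, t0, still_unused=True)
second = next_action(resource, t0 + timedelta(hours=24), still_unused=True)
third = next_action(resource, t0 + timedelta(hours=48), still_unused=True)
```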
You may have resources that remain idle only in certain hours of the day or certain days of the week. This is the typical behavior of dev/test workloads that depend on the presence of developers at work. In this case, a retroactive detection mechanism that looks for idle resources is not efficient. With such a mechanism, you’d spend time and money just to deem resources “idle” before you can actually turn them off.
In such cases, you must schedule cloud services to be on and off based on expected utilization patterns. If you know what the expected utilization is, you can describe it using a “duty schedule” tag. Then you can use any cron-like scheduler to read that tag value and turn services on and off accordingly.
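A scheduler of this kind only needs to parse the tag and compare it with the current time. The tag format used below ("mon-fri 08:00-18:00") is a hypothetical convention invented for this sketch; it supports a single day range and a single time window within one day.

```python
from datetime import datetime

DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]

def should_be_running(schedule_tag, now):
    """Decide whether a service should be on, given a tag such as 'mon-fri 08:00-18:00'."""
    day_part, time_part = schedule_tag.split()
    first_day, last_day = (DAYS.index(d) for d in day_part.split("-"))
    start, end = time_part.split("-")
    in_days = first_day <= now.weekday() <= last_day   # Monday is 0
    in_hours = start <= now.strftime("%H:%M") < end    # zero-padded strings compare correctly
    return in_days and in_hours

tag = "mon-fri 08:00-18:00"
wednesday_morning = datetime(2024, 1, 3, 9, 30)
saturday_morning = datetime(2024, 1, 6, 9, 30)
```

A cron-like job could call `should_be_running` for each tagged resource and start or stop it when the result differs from its current state.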
If you don’t know what to expect in terms of utilization, you can make assumptions on future behavior based on historical data. You can observe your cyclic workloads over a period of time and draw utilization patterns over a defined working cycle, such as a week or a month. Then, develop a scheduling policy that matches the identified patterns and that will proactively turn services off when the expected usage is low. As an example, Figure 11 shows a CPU-based utilization pattern of a compute instance over a week, with one-hour granularity.
When building utilization patterns, organizations should refine the policy that defines the boundary between the used/unused conditions using multiple metrics. For example, compute instances metrics should include CPU, RAM and network bandwidth, but also SSH/RDP login sessions, especially for development instances.
Scheduling services can be highly impactful. You can save up to 70% on development instances if you schedule them to be on only for eight hours a day and five days a week. If developers need to work outside business hours and they find their instances offline, you can allow them to turn them on manually while specifying how long they need this exception for. You should also cap the amount of time a developer can ask for this exception to a maximum number of hours.
Not all cloud services can be turned on and off while persisting data. Compute instances do persist data when the data is stored on a decoupled block storage volume. Conversely, other services such as Amazon Redshift do not persist data when their nodes are turned off. In such cases, you must build into your start/stop operations the required tasks to back up and restore the data from an external storage service.
Scheduling services is not recommended for production workloads. To address similar variable usage patterns of production workloads, see the Optimize component.
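The arithmetic behind this estimate can be checked with a short calculation. The sketch counts compute hours only and assumes purely hourly billing. Charges that continue while an instance is off, such as attached block storage, are not included, which is one reason realized savings are lower than the raw figure.

```python
HOURS_PER_WEEK = 24 * 7  # 168

def schedule_savings(hours_per_day, days_per_week):
    """Fraction of always-on compute hours avoided by a weekly on/off schedule."""
    on_hours = hours_per_day * days_per_week
    return 1 - on_hours / HOURS_PER_WEEK

# Eight hours a day, five days a week: 40 of 168 hours, about 0.76 avoided.
savings = schedule_savings(8, 5)
```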
Rightsize Allocation-Based Services
Allocation-based services require that you request a specific allocation at provisioning time. This allocation could be the number of CPUs, the amount of RAM or the maximum number of IOPS of the underlying infrastructure. You pay for this allocation irrespective of whether you’re using it or not. Often, you end up using resources at a much smaller percentage than what they can deliver.
When this happens, you must rightsize your resources to reduce costs. Rightsizing is the practice of adjusting a cloud service allocation size to the actual workload demand.
Examples of allocation-based services that are good candidates for rightsizing are:
Compute services such as Amazon Elastic Compute Cloud (Amazon EC2), Microsoft Azure Virtual Machines (VMs) or Google Compute Engine (GCE). Client organizations specify the allocation size by choosing a provider-supplied instance flavor (such as “m5.large”), which is delivered with specific sizing parameters.
Storage services such as Amazon Elastic Block Store (Amazon EBS), Microsoft Azure managed disks or Google Persistent Disk. Client organizations specify an allocation of storage space and, sometimes, performance (IOPS) and durability targets.
Database services such as Amazon Relational Database Service (Amazon RDS), Microsoft Azure SQL Database or Google Cloud SQL.
Container services such as Amazon Elastic Kubernetes Service (Amazon EKS), Microsoft Azure Kubernetes Service (AKS) or Google Kubernetes Engine (GKE). Such services deploy nodes on top of the provider’s compute instances with a customer-chosen allocation size.
Application PaaS services such as AWS Elastic Beanstalk, Microsoft’s Azure App Service or Google’s App Engine. These environments need to be preprovisioned and are paid for whether they are being used or not.
Traditional data centers are commonly underutilized. Deployed resources are often overprovisioned because consumers compete for the same finite IT capacity. Consumers tend to provision bigger resources than they need just to secure that capacity for their projects in light of an expected (or hoped for) future growth. This practice is well known to I&O departments. But because a data center’s capacity is procured upfront, driving more efficient usage of data centers does not have an immediate impact on cost. On the contrary, more efficient usage would translate into larger wasted capacity and questionable investments.
Cloud computing reverses this paradigm. Client organizations can count on infinite capacity and can focus on managing virtual resources on demand. Cloud providers bill organizations based on the provisioned virtual resources and allow their clients to adjust service allocations with an immediate impact on billing. As a consequence, rightsizing cloud resources can have a huge impact on reducing your monthly bill.
To implement rightsizing, you must monitor resource utilization over a defined period (for example, one week), compare it with provisioned capacity and change the allocation size if a resource is found to be larger than necessary. Because your demand may vary over time, you must be ready to rightsize in both directions (down and up), increasing resource size when performance is suffering. Your ultimate goal is to develop a continuous rightsizing process that can enforce a defined target utilization threshold. Continuous rightsizing is no different from what is also known in the industry as “vertical autoscaling.” For example, if you set your utilization threshold for compute instances at 70% on the CPU metric, your rightsizing process would change size accordingly to keep the utilization line as flat as possible, as depicted in Figure 12.
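The comparison described above can be sketched in a few lines of Python. This is a minimal illustration and not a provider API: the size ladder, the `Size` type and the `recommend_size` function are hypothetical names, and the sketch assumes that CPU demand scales linearly with the number of vCPUs. It uses the observed peak, not the average, and it can recommend a move in either direction.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Size:
    name: str
    vcpus: int


# Hypothetical size ladder, ordered from smallest to largest.
SIZES = [Size("small", 2), Size("medium", 4), Size("large", 8), Size("xlarge", 16)]


def recommend_size(current: Size, cpu_samples: list[float], target: float = 0.70) -> Size:
    """Return the smallest size whose projected peak CPU utilization is at or below target.

    cpu_samples are utilization fractions (0.0 to 1.0) of the current size,
    collected over the observation period.
    """
    if not cpu_samples:
        return current  # No data: leave the allocation unchanged.
    peak = max(cpu_samples)              # Size for the peak, not the average.
    demand_vcpus = peak * current.vcpus  # Assumption: demand scales linearly with vCPUs.
    for size in SIZES:
        if demand_vcpus / size.vcpus <= target:
            return size
    return SIZES[-1]  # Demand exceeds the ladder: use the largest size available.
```

In practice, the same check would also need to cover memory, network and storage bandwidth before a change is applied.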
Rightsizing is one of the most effective cost optimization best practices for public cloud IaaS and PaaS. Together with unused resources, overprovisioned service allocations are among the top contributors to public cloud spending waste. If you need to quickly reduce your costs, Gartner recommends that you prioritize rightsizing among the capabilities to develop.
When developing rightsizing capabilities, Gartner recommends:
Rightsizing for peaks: Determine your workload demand by considering the peaks that occur during your observation period and not the average utilization. You don’t want to end up in a situation where your resized instances can no longer handle peak workload. If you experience peaks that are much higher than the average, consider serving such peaks by scaling out and distributing workload across multiple smaller resources. This practice is described in the Optimize component of this framework.
Assessing constraints: As you change your allocation size, the new size may be subject to constraints that you need to be aware of. Select only allocation sizes that are compatible with your workload requirements. For example, compute instance flavors are provided with a number of CPUs and an amount of RAM, but also network and storage bandwidth. If you are using only half the CPU but the entire storage bandwidth, sizing down an instance may negatively impact its storage performance. Similarly, if you’re running a 64-bit operating system, you can’t select a 32-bit instance, even if this is cheaper and can still deliver the performance you need.
Mitigating availability risk: Some services, such as compute instances, require a disruptive operation (a reboot) to change size. Conversely, application PaaS services have options for zero-downtime upgrades so that incoming requests aren’t dropped. For example, Amazon Elastic Container Service (Amazon ECS) can be configured to do zero-downtime updates by subscribing to AMI updates. For those services requiring disruptive operations, factor in availability risks and mitigate them by executing rightsizing only during maintenance windows and limiting rightsizing activities to once a week or once a month.
Mitigating performance risk: As you change your allocation size, the new size may not be able to deliver enough performance to serve your workload demand. Mitigate performance risk by inspecting application metrics from application performance management (APM) tools. Alternatively, rightsize in multiple steps and measure the performance impact at each step. Implement continuous rightsizing and be ready to size up as you detect performance issues.
Starting with the top wasters: If you find a large number of rightsizing opportunities, start with resources that have the highest costs and lowest utilization. Calculate the ratio of cost to utilization for each resource and sort the identified overprovisioned resources by that ratio in descending order. Tackle the list from the top down.
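The "top wasters" ordering can be illustrated with a short sketch. The field names (`monthly_cost`, `peak_utilization`) and the utilization floor used to avoid dividing by zero are assumptions made for this example only.

```python
def rank_top_wasters(resources):
    """Sort resources by cost-to-utilization ratio, highest ratio first.

    Each resource is a dict with 'name', 'monthly_cost' and
    'peak_utilization' (a fraction between 0.0 and 1.0).
    """
    def ratio(resource):
        # Floor utilization at 1% so that fully idle resources do not divide by zero.
        return resource["monthly_cost"] / max(resource["peak_utilization"], 0.01)

    return sorted(resources, key=ratio, reverse=True)
```

A resource that costs $400 per month at 10% utilization (ratio 4,000) would therefore rank ahead of one that costs $1,000 per month at 50% utilization (ratio 2,000).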
Rightsizing is an efficient capacity management practice for any allocation-based cloud service. This practice is necessary to achieve savings because cloud providers ask their client organizations to choose an allocation size for their provisioned services. However, as providers increase their serverless capabilities, the concern for dynamic capacity management will be shifted to the cloud providers themselves.
Providers that implement continuous rightsizing will be providing serverless capabilities that dynamically scale services based on observed demand. Microsoft Azure SQL Database Serverless is an example of a cloud service for which the cloud provider implements continuous rightsizing behind the scenes, relieving clients of this concern and unlocking cost benefits for dynamic workloads. More information on using serverless technologies for cost optimization can be found in the Use Serverless Technologies section in the Optimize pillar.
Leverage Discount Models
Not all workloads benefit from the flexibility of the low-commitment PAYG pricing model. Some workloads are stable and their future utilization is predictable. To address such situations, cloud providers offer discounted prices in exchange for the client’s commitment to use their services for a period of time. There are two types of discount models for cloud services:
Negotiated discounts are part of an enterprise agreement (EA) that your organization may sign with cloud providers. The primary purpose of signing an EA is to receive better terms and conditions than those offered by the provider’s standard click-through agreement. One such condition can be a discount applied to the billed cloud services.
If your organization doesn’t have an EA in place with your cloud provider, ask your procurement and vendor management department to negotiate one. Contact your sales representative to initiate the discussion. Although EAs are negotiated, cloud providers have a fairly standardized framework for their discount models. Discounts are applied as a percentage reduction (such as 5% or 20%) and can cover your entire bill or a specific set of services that have a higher volume of utilization. In exchange for a negotiated discount, you will need to commit to a certain minimum spend over the term of the EA.
Programmatic discounts are discounts that cloud providers make available for purchase programmatically. Such discounts do not require a negotiation with the provider’s sales team. Client organizations can purchase such discounts in the form of “vouchers” using a management operation, which can be automated. The purchased discount is normally billed with a one-time charge and has a specific time validity, after which it expires.
Examples of programmatic discounts are:
Amazon EC2 Reserved Instances (RIs), available as “standard” or “convertible”
AWS RIs for Amazon RDS, Amazon Redshift, Amazon ElastiCache, Amazon DynamoDB and Amazon EMR
AWS Savings Plans
GCE committed use discounts (CUDs)
Microsoft Azure Reserved VM Instances
Microsoft Azure reserved capacity for SQL Database and Azure Cosmos DB
During its validity, any existing cloud resource that matches the discount conditions can “consume” it in exchange for receiving a zero-dollar charge for a specific billing period (normally one hour). Discounts are not purchased for a specific resource, but they can match multiple ones during their period of validity. Assuming that a matching resource is always found throughout its validity, a programmatic discount can make the actual resource costs up to 70% lower than the PAYG model. When a purchased discount exists but no matching resource is found, the unused discount constitutes spending waste. When more resources exist than those covered by purchased discounts, cloud providers bill them using the standard PAYG pricing.
Cloud providers offer several types of programmatic discounts, which differ based on their applicability, such as a specific service, a provider’s region or a resource type. All discounts involve a trade-off between flexibility and benefits. The more stringent the conditions organizations are willing to commit to, the greater the benefits they will realize, such as higher discount levels. Conversely, more flexible discounts increase the likelihood of matching your actual usage.
Some discounts require you to manually change their flexible attributes to match your utilization. For example, AWS Convertible RIs require that you convert them to leverage their flexibility. Other discounts, such as AWS Savings Plans or Google CUDs, automatically apply across a wider spectrum of resources. Because AWS RIs and Savings Plans offer similar discount levels, Gartner recommends prioritizing Savings Plans over RIs due to their wider applicability.
Programmatic discounts can significantly reduce your cloud bill. Determine your baseline and purchase enough discounts that allow you to cover your stable, predictable workloads. If you are unsure about how much to commit, you can observe your past utilization and use it to make assumptions about the future. Then, you can decide your level of “aggressiveness,” bearing in mind that more aggressive commitments bear higher risks of spending waste, as shown in Figure 13.
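One way to reason about the level of "aggressiveness" is to derive a commitment from past hourly usage and then measure how much of that commitment would have gone unused. The following sketch is illustrative only: the function names and the 0-to-1 aggressiveness scale are assumptions made for this example, and the sketch presumes that past usage is representative of the future.

```python
def commitment_level(hourly_usage, aggressiveness):
    """Pick a commitment from historical hourly usage.

    aggressiveness is a value from 0.0 to 1.0:
    0.0 commits to the lowest observed usage (never wasted in the sample),
    1.0 commits to the highest observed usage (maximum coverage, most waste).
    """
    if not 0.0 <= aggressiveness <= 1.0:
        raise ValueError("aggressiveness must be between 0.0 and 1.0")
    ordered = sorted(hourly_usage)
    index = int(aggressiveness * (len(ordered) - 1))
    return ordered[index]


def wasted_units(hourly_usage, commitment):
    """Sum the committed units that no resource consumed, hour by hour."""
    return sum(max(0, commitment - used) for used in hourly_usage)
```

For a usage history of `[10, 12, 8, 15, 9, 11, 20, 10]` instances per hour, an aggressiveness of 0.0 commits to 8 instances with no waste, whereas an aggressiveness of 1.0 commits to 20 instances and leaves 65 instance-hours unused over the same period.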
Managing your programmatic discounts centrally — and not by workload or department — will improve the accuracy of your utilization estimates. It will also increase the likelihood of consuming purchased discounts and will reduce the risks of spending waste.
Deciding to sign up for programmatic discounts and managing your discount portfolio to ensure maximal coverage is a very complex matter. Although cloud providers are introducing more simplification, Gartner recommends relying on tools that help determine your baseline and suggest discount purchases and modifications.
When managing programmatic discounts, Gartner recommends:
Managing expirations: Purchased discounts have a term. After that term, billing switches back to PAYG and becomes more expensive. Track expirations and set up alerts ahead of time. Allow yourself enough time to decide on the potential renewal.
Analyzing past coverage: As you approach expiration, analyze how well your discounts have performed. Track how much they were consumed or wasted. Build an ROI analysis, and calculate the break-even point and the hourly cost of covered resources. This analysis serves to improve your future purchase decisions.
Developing a discount allocation strategy: Develop a strategy that defines how you’re allocating the costs of purchased discounts, especially if you’re managing them centrally. Once you purchase a discount, the matching resource will have a zero-dollar charge. However, the bill’s line items will also indicate the discount ID that was consumed in that billing cycle. A common allocation practice is to split the discount cost by the hour for the length of its term and to reapply the hourly cost to the resources that consumed it. The potential wasted capacity would be attributed to the entity that has made the purchase decision.
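The common allocation practice described in the last item can be sketched as follows. The function and parameter names are hypothetical, and the sketch assumes a discount billed with a single upfront charge that covers one resource for each hour of its term.

```python
def allocate_discount(upfront_cost, term_hours, consumed_hours_by_owner, purchaser):
    """Split a discount's cost by the hour and charge it back to its consumers.

    consumed_hours_by_owner maps an owner (team, application, cost center)
    to the number of hours in which its resources consumed the discount.
    Unused hours are attributed to the entity that made the purchase decision.
    """
    hourly_cost = upfront_cost / term_hours
    consumed_total = sum(consumed_hours_by_owner.values())
    if consumed_total > term_hours:
        raise ValueError("consumed hours exceed the discount term")

    charges = {
        owner: hours * hourly_cost
        for owner, hours in consumed_hours_by_owner.items()
    }
    # Wasted capacity goes to the purchaser, on top of anything it consumed itself.
    unused_hours = term_hours - consumed_total
    charges[purchaser] = charges.get(purchaser, 0.0) + unused_hours * hourly_cost
    return charges
```

For example, a one-year discount purchased for $8,760 by a central team and consumed for 5,000 hours by one team and 3,000 hours by another would be charged back as $5,000 and $3,000, leaving $760 of waste with the central team.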
Programmatic discounts are a cost reduction practice that can quickly drive your cost down. Together with deleting unused resources and rightsizing, this practice should be on your priority list if you urgently need to reduce your monthly bill. However, you shouldn’t rush your commitment decisions because they bear consequences over the medium-to-long term.
Although they’re both very effective in reducing your bill, negotiated and programmatic discounts also bear the risk of causing the wrong consumption behavior. It’s easy to fall into the trap of buying larger commitments and then artificially driving utilization up just to match those prepurchased commitments. That’s exactly the same logic that we used to apply in traditional data centers and that caused many inefficiencies.
For example, Gartner does not recommend changing a compute instance size to a bigger one just to match an unused RI that’s sitting in your portfolio. If the RI is unused, it’s probably because you have overcommitted, and you should take this fact into account when deciding on the RI renewal. If you size an instance to match the RI, this RI would be considered as consumed and you’ll end up overcommitting again in the future.
Upgrade Instance Generation
Over the course of the years, cloud providers such as AWS and Microsoft Azure have refreshed their compute platform a few times. The newer platform is based on new hardware, and new processor and memory technologies, and usually comes with storage and networking updates. At every refresh, cloud providers also launched new instance types from the new platform, grouped under a new “generation.” The new generation instances are supposed to address the same use cases as their previous generation, but with renewed power.
Often, these new instance types are less expensive because they are more efficient. Figure 14 shows some metrics tracked by Gartner Cloud Decisions over the years. It indicates the relative progression of price, CPU, memory and network performance for three generations of AWS’s “M” general-purpose instance. The chart shows that, while the price has been decreasing slightly, the CPU and memory performance have been increasing over time, especially with the introduction of the Nitro technology in the fifth generation.
Whenever a new instance generation is available, consider the performance increase as your ability to achieve more by spending the same amount of money. As a consequence, develop your compute instance rightsizing practice to also work across instance families. You may be able to save money by choosing a smaller size for a new instance generation and deliver the same performance. For more information on rightsizing, see Rightsize Allocation-Based Services in the Reduce component of this framework.
Establish a DevOps Feedback Loop
CI/CD platforms are configured with the metadata about software releases describing the infrastructure components that an application needs to run. For example, a Kubernetes deployment manifest contains the number of pods, RAM and CPU. Such deployment manifests are managed through a version control system and maintained as part of the CI/CD process.
Organizations must establish a DevOps feedback loop between cost reduction methodologies and the CI/CD pipeline. Such a feedback loop allows CI/CD platforms to be aware of the changes made by the cost management practice. Otherwise, it would be counterproductive to rightsize resources and then have a CI/CD platform overprovision them again at the next release.
A robust CI/CD process and application platform includes the capability to track metrics such as utilization and capacity of resources as they move from development into production. Organizations must make these metrics available to the application development teams or to whoever is responsible for the deployment manifests. Furthermore, the cost management practice can publish the recommended sizes in public repositories that can be automatically read by the CI/CD platform at the time of the software release.
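As an illustration of how a release step might consume published recommendations, the sketch below overrides the resource requests in a deployment manifest that has already been parsed into a Python dictionary. The function name and the shape of the recommendations (keyed by container name) are assumptions made for this example. The nested path follows the usual layout of a Kubernetes Deployment, but verify it against your own manifests.

```python
import copy


def apply_recommendations(manifest, recommendations):
    """Return a copy of the manifest with recommended resource requests applied.

    recommendations maps a container name to the requests to set,
    for example {"web": {"cpu": "500m", "memory": "1Gi"}}.
    Containers without a recommendation are left untouched.
    """
    updated = copy.deepcopy(manifest)  # Never mutate the version-controlled original.
    for container in updated["spec"]["template"]["spec"]["containers"]:
        recommended = recommendations.get(container["name"])
        if recommended:
            requests = container.setdefault("resources", {}).setdefault("requests", {})
            requests.update(recommended)
    return updated
```

Running such a step at release time keeps the manifest aligned with the sizes chosen by the cost management practice, so that the next deployment does not overprovision the resources again.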
Developing cost reduction capabilities is the quickest way to access cost savings. Because these practices do not require architectural changes to your applications, they are more easily applicable to a large set of use cases. Skipping this component of the framework will make your organization overspend for cloud services and won’t allow you to profit from the elasticity of cloud computing.
Optimizing cloud spending goes beyond the tactical cost reduction techniques mentioned in the previous Reduce component. Conversely, strategic optimization techniques often require application architectural changes to reduce the need for resources. Cloud computing has inspired modern application architectures that are also referred to as “cloud-native.” Such architectures are designed around the native features of cloud services and can often deliver more favorable ROIs compared to traditional ones. The Optimize component of the framework (depicted in Figure 15) illustrates optimization best practices that you can adopt to optimize your monthly bill.
Use Preemptible Instances
Sometimes, the choice of desired service availability determines the price of a resource. Some cloud providers offer compute instances at a much lower price compared to the standard PAYG model. However, their availability is also lower. Preemptible instances are based on a provider’s spare capacity and can be terminated by the provider at any time when standard demand rises.
Examples of preemptible instances are:
Assess your application’s architecture and find those components and use cases that may be suitable for infrastructure that might become suddenly unavailable. For example, batch workloads may simply pause when infrastructure goes down and restart once the provider’s spare capacity becomes available again. Also, stateless workloads can take advantage of preemptible instances, leaving it up to load balancers to handle the sudden unavailability of nodes.
You can further mitigate the risk of unavailability by:
Reacting to the provider’s notice: A couple of minutes before terminating or stopping a preemptible instance, the provider normally sends a notice that can be programmatically read. For example, AWS publishes this notice as part of the instance metadata. By intercepting this notice in time, you can initiate actions to ensure a smoother transition of your workloads.
Leveraging purpose-built tools: Some tools, such as Spotinst Elastigroup, provide AI-based prediction on the availability of a provider’s spare capacity. Thanks to such insights, these tools can proactively migrate instances from a preemptible to a standard offering when they predict that infrastructure is about to become unavailable.
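Reacting to a provider’s notice usually comes down to reading a small document and working out how much time is left. The sketch below parses a notice in the JSON shape that AWS is understood to publish for Spot Instances under the `spot/instance-action` path of the instance metadata service (an `action` and a UTC `time`). Treat that shape as an assumption and confirm it in the provider documentation. The function name is hypothetical, and retrieving the document, which may require a metadata session token, is deliberately left out.

```python
import json
from datetime import datetime, timezone


def seconds_until_interruption(notice_body, now):
    """Parse an interruption notice and return (action, seconds remaining).

    notice_body is the raw JSON text of the notice, for example
    '{"action": "terminate", "time": "2024-01-01T12:02:00Z"}'.
    now must be a timezone-aware datetime.
    """
    notice = json.loads(notice_body)
    deadline = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    deadline = deadline.replace(tzinfo=timezone.utc)  # The notice time is in UTC.
    return notice["action"], (deadline - now).total_seconds()
```

A polling loop could call this function every few seconds and, once a notice appears, use the remaining time to drain connections, checkpoint batch progress or deregister the node from its load balancer.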
Leverage preemptible instances to gain significant cost benefits if your workload can adapt to their limitations and if you can mitigate the risk of unavailability.
Set Up Data Storage Life Cycle Policies
Organizations can use multiple cloud services to store data. You can use not only traditional block or file storage, but also object storage or database services designed for different use cases. The line between all of these data storage services is blurring, but besides their functional differences, they also come at different costs.
Each storage service may also be provided with different tiers at a different price. Storage tiers provide equivalent functionality, but can differ based on their degree of availability, redundancy and retrieval latency. Selecting a low-latency, georedundant tier with 99.99% availability for data that is not critical for your organization may be a waste of money. Selecting the right storage service and tier is key to making cloud services cost-effective.
However, some data may be characterized by usage patterns that differ at each phase of its life cycle. For example, some data may become less frequently accessed as time goes by (for example, a social network timeline). In this case, you can optimize your costs by moving older data to less expensive tiers or services. Other times, you may need real-time query capability only for data that’s older than a number of months. In such cases, you may want to use a mix of database and object storage services at different phases of the data life cycle. But while changing a service tier is a fairly simple management operation, changing service type is much more complex as it may require data transformation.
Table 1 shows the main pricing and functional differences between the tiers of Amazon S3 object storage service.
Develop a strategy to select the right service and tier at each phase of your data life cycle. Optimize storage tiers by automatically moving objects across tiers based on detected usage patterns.
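A tiering strategy based on object age can be expressed as a declarative rule. The sketch below builds such a rule as a plain Python dictionary. It is intended to mirror the structure used by Amazon S3 lifecycle configurations, but the helper name, the example prefix and the day counts are assumptions, and the result should be validated against the provider documentation before it is applied to a bucket.

```python
def build_lifecycle_rule(rule_id, prefix, transitions, expire_after_days=None):
    """Build a lifecycle configuration with a single age-based rule.

    transitions is a list of (days, storage_class) pairs, for example
    [(30, "STANDARD_IA"), (90, "GLACIER")]. They are emitted in order of age.
    """
    rule = {
        "ID": rule_id,
        "Filter": {"Prefix": prefix},  # Only objects under this prefix are affected.
        "Status": "Enabled",
        "Transitions": [
            {"Days": days, "StorageClass": storage_class}
            for days, storage_class in sorted(transitions)
        ],
    }
    if expire_after_days is not None:
        rule["Expiration"] = {"Days": expire_after_days}  # Delete after this age.
    return {"Rules": [rule]}
```

A call such as `build_lifecycle_rule("archive-logs", "logs/", [(30, "STANDARD_IA"), (90, "GLACIER")], 365)` describes a policy that moves log objects to cheaper tiers after 30 and 90 days and deletes them after one year.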
Because this practice is highly cost-effective but also adds operational overhead, some cloud providers have started to offer it as a managed service. For example, AWS launched the S3 Intelligent-Tiering storage class, which automatically optimizes object placement in storage tiers based on observed access frequency.
Implement Horizontal Autoscaling
Cloud platforms provide elasticity that enables applications to grow and shrink their resource footprint in response to both internal and external events. Such behavior is called “autoscaling” and is governed by metric-based policies. Leveraging autoscaling can optimize your costs because it dynamically aligns your resource footprint to workload demand.
Autoscaling is either “vertical” (making a single instance bigger) or “horizontal” (adding more instances of the same type and distributing the workload across them). This section provides best practices for horizontal autoscaling. Vertical autoscaling is covered in the Rightsize Allocation-Based Services section in the Reduce component.
Horizontal autoscaling requires specific design principles in the application architecture. Specifically, it requires the application to allow multiple instances to run in parallel. The application must also be able to start and shut down gracefully, and it must not rely on local dependencies. explains how to design an application that allows for horizontal autoscaling.
Autoscaling is triggered by policy-based thresholds that instruct a cloud platform to automatically scale applications by adding instances, typically in reaction to an increase in load. The policy includes a limit for the scaled resources (such as the maximum number of instances to automatically provision) and thresholds to remove instances once the load goes down.
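A threshold-based policy of this kind can be sketched as a small decision function. The names below (`ScalingPolicy`, `desired_instances`) are hypothetical, and the sketch assumes a simple step policy that adds or removes one instance per evaluation. Cloud platforms offer richer policy types, such as target tracking and cooldown periods.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    scale_out_above: float  # Add an instance when the metric exceeds this value.
    scale_in_below: float   # Remove an instance when the metric falls below this value.
    min_instances: int      # Never scale in below this floor.
    max_instances: int      # Never scale out beyond this limit.


def desired_instances(current, metric, policy):
    """Return the instance count the policy asks for, given the latest metric value."""
    if metric > policy.scale_out_above:
        target = current + 1
    elif metric < policy.scale_in_below:
        target = current - 1
    else:
        target = current  # Metric is within the band: no change.
    # Clamp to the configured limits.
    return max(policy.min_instances, min(policy.max_instances, target))
```

The gap between the two thresholds matters: if they are set too close together, the instance count may oscillate as each scaling action pushes the metric across the opposite threshold.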
Horizontal autoscaling can be grouped into four categories that differ based on the metrics used to trigger events. These categories are:
Infrastructure-metrics-based autoscaling: This technique uses autoscaling policies that rely on infrastructure performance metrics. Such metrics include the percentage of CPU load, the amount of memory in use and network latency.
Middleware-metrics-based autoscaling: This technique uses autoscaling policies that rely on metrics gathered by middleware components, such as queues and databases. Such metrics may include a measurement of how far behind the journaling system is, or how large the queue length is, which is used to delegate work to background processing.
Time-based (scheduled) scaling: If you know exactly when to expect your increase in load, you can use this technique to autoscale applications using a time-based schedule. The policy automatically scales the application in advance of a predicted peak in usage and scales it back once the peak is over.
Business-metrics-based autoscaling: This technique uses autoscaling policies based on application-specific or business metrics. Examples of such metrics include the number of nonempty shopping carts or the number of open user chat sessions. This type of metric requires custom support within the application code.
Horizontal autoscaling in cloud platforms can function both at the IaaS level (more coarse-grained) and at the application PaaS and container as a service (CaaS) levels (more fine-grained). For certain services, autoscaling capabilities are natively built into a cloud platform and are handled automatically by the cloud provider. In such cases, this “zero-touch” autoscaling becomes an inherent characteristic of what is called “serverless computing,” described in the next section.
Horizontal autoscaling is an effective cost optimization practice that leverages the elasticity of cloud computing. It is in line with cloud-native architectural principles and also makes applications more resilient and scalable. Horizontal autoscaling should be used in conjunction with rightsizing, because these two techniques normally apply to different sets of applications.
Balance Usage of Consumption-Based Services
Many cloud services are billed with a consumption-based model, whereby you don’t pay for the provisioned capacity. Instead, you pay for each handled request and the amount of data transferred. While this is the ideal PAYG model, it also adds new challenges in predicting and controlling how much you will spend.
Examples of consumption-based services are:
Network data transfer: You are charged a certain dollar amount for each chunk of data you transfer between your services. This amount can differ depending on the direction, destination and volume of transferred data.
API gateway services: You are charged per request handled by an API gateway service. Such services receive inbound requests, authenticate them and pass them to other services such as a message queue or a function platform as a service (fPaaS) such as AWS Lambda.
Load-balancing services: You are charged per request handled by a load-balancing service. Such service receives inbound requests and distributes them over a pool of nodes based on configured load distribution algorithms. Load balancers also provide failover mechanisms for unresponsive nodes through native health checks.
Serverless computing or fPaaS: Services such as AWS Lambda, Google Cloud Functions or Microsoft Azure Functions charge an amount each time your functions are invoked and do not charge you merely for publishing code on the platform if that code never runs. See the next section, Use Serverless Technologies, for more information.
Database platform as a service (DBPaaS): Some services charge you based on the amount of data you read and write on your tables. Other services may charge you based on the time it takes to complete a query.
Optimizing consumption-based services for cost is more complex because you do not control the capacity provisioning. Because charges are directly tied to usage, you can optimize your costs by reducing the use of such services. To achieve this, you must transform your application behavior and architecture. For example, design your application to make use of compute as close as possible to where data resides.
In traditional data centers, certain resources (such as network bandwidth) were effectively considered free of charge, so applications were rarely designed to minimize their use. As a consequence, changing your application architecture to reduce the use of consumption-based services can be especially effective for applications migrated from on-premises data centers.
Use Serverless Technologies
In serverless computing, you are relinquishing control, flexibility and ownership of the application infrastructure to the cloud provider. In return, you get a more dynamic deployment experience, zero-touch autoscaling, increased efficiencies in resource utilization and no more need for capacity management. The unit of compute is more fine-grained than a virtual machine or a container as it is scoped to a single unit of custom application logic.
Figure 16 illustrates the key characteristics of serverless technologies.
Serverless technologies introduce a microbilling model by which you pay only for the number of transactions and the memory and CPU of the compute instance handling the transaction for the time it takes to execute it. Furthermore, you’re billed for additional services that may be required to execute the function (e.g., API Gateway that collects inbound requests).
Serverless computing services such as AWS Lambda, Azure Functions or Google Cloud Functions may seem like the more cost-effective solution because you pay only for what you use. However, there is a tipping point where it becomes cost-prohibitive and you reach a position of diminishing returns. As a consequence, organizations should adapt their application architectures to leverage serverless technologies only for appropriate use cases.
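The tipping point can be estimated with simple arithmetic. The sketch below models an fPaaS bill as a per-request charge plus a charge for memory multiplied by execution time, and then finds the request volume at which that bill equals a fixed monthly cost for a self-managed alternative. The function names are hypothetical and all prices are parameters; take the actual figures from your provider’s price list. The model ignores free tiers and additional services such as an API gateway, which can dominate the bill.

```python
def fpaas_monthly_cost(requests, avg_duration_s, memory_gb,
                       price_per_million_requests, price_per_gb_second):
    """Estimate a monthly fPaaS bill from request volume, duration and memory."""
    request_charge = requests / 1_000_000 * price_per_million_requests
    compute_charge = requests * avg_duration_s * memory_gb * price_per_gb_second
    return request_charge + compute_charge


def break_even_requests(fixed_monthly_cost, avg_duration_s, memory_gb,
                        price_per_million_requests, price_per_gb_second):
    """Return the monthly request volume at which fPaaS costs as much as a fixed alternative."""
    cost_per_request = (price_per_million_requests / 1_000_000
                        + avg_duration_s * memory_gb * price_per_gb_second)
    return fixed_monthly_cost / cost_per_request
```

Below the break-even volume, the consumption-based model is the cheaper option. Above it, the fixed-cost alternative wins on infrastructure spend, although its operational costs still need to be added to the comparison.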
Two case studies illustrate opposite returns for the use of fPaaS serverless computing:
Cost-effective use for fPaaS. This case study is documented by Troy Hunt, who built some utility code to prevent DDoS attacks using Microsoft Azure Functions. The transactions were low in memory usage and had short execution time. Troy’s solution didn’t require HTTP or the use of an API Gateway to trigger the functions. The total cost of the service amounted to zero dollars because its usage was so low that it fell under the Microsoft Azure free tier. It should be noted that Google and Amazon also provide a free tier for their serverless computing services.
Cost-prohibitive use for fPaaS. Cory O’Daniel documented this case study. He wrote a service running on AWS Lambda using Amazon API Gateway and Amazon Kinesis. The service was designed to collect an event stream from web browsers with different metrics and ingest them into a larger ETL system. The cost of the service amounted to about $16,000 per month. O’Daniel rewrote the service in Elixir and ran it on three self-managed EC2 instances, making the cost drop to about $150 per month. In both cases, the service processed 12 million requests per hour with subsecond latency and managed roughly 300GB of throughput per day.
In the cost-prohibitive case study, the self-managed option turned out to be far more affordable than using fPaaS. However, the author also had the luxury of having in-house expertise and people to manage the software components required to run the service in a self-managed environment. For modern organizations that don’t have IT operations in place, the premium of serverless computing may still be a better choice than hiring a full team.
If you’re unsure whether your application will be more or less cost-effective when using serverless technologies, you can build estimates using purpose-built tools. Aside from the cloud provider’s cost calculators, Serverless Cost Calculator and Servers.LOL are two community projects that help build a forecast for serverless. Use these calculators to mimic your application usage and assess whether the adoption of serverless computing may serve to optimize your cloud costs. Factor in operational costs as you make your comparison with self-managed alternatives.
Modernize Your Application for PaaS
Many organizations start consuming cloud computing by rehosting (aka lift and shift) applications from their on-premises data centers to a public cloud provider.
A rehost migration strategy does not require changes in the application architecture. Despite being easier to migrate, rehosted resources are typically unable to leverage key characteristics of cloud computing, such as elasticity and on-demand consumption. As a consequence, rehost strategies tend to have a low-to-negative ROI. provides a framework for selecting a migration strategy that aligns to your goals in terms of speed of migration, ROI and other desired benefits.
Rehosted applications primarily make use of IaaS services such as virtual machines and storage volumes. These services provide dedicated allocation-based resources that organizations pay for, regardless of their usage. Furthermore, IaaS services have an operational overhead. Organizations must pay for the team in charge of managing the software running on top of operating systems.
Conversely, platform services, such as application PaaS, databases, load balancing, caching and message queuing services, include a management layer that cloud providers offer in an as-a-service model. Modernizing your applications for PaaS allows you to optimize costs due to:
Reduced operational overhead: You don’t need skills and people to manage the technologies underneath PaaS services. Cloud providers are in charge of a larger part of your application stack.
Consumption-based billing: PaaS services benefit from the provider’s economies of scale. Some services provide consumption-based billing that corresponds to their actual usage. Having an application that accrues cost only when used allows you to better align its cost to the value it generates.
To modernize your application for PaaS you can, for example, replace the instances that host load balancer virtual appliances with Amazon Elastic Load Balancing (ELB). You can use Microsoft Azure SQL Database to replace the instances that host Microsoft SQL Server and reconfigure your connection strings without changing much of the application code. Amazon Kinesis Data Streams or Microsoft Azure Event Hubs are typically more cost-effective than provisioning, maintaining and exposing APIs from self-maintained Kafka clusters. Kafka is complex open-source software that must be run with high degrees of availability and reliability, which most IT departments find difficult to set up and maintain.
However, just like for serverless technologies, using PaaS does not imply a cost reduction compared to an equivalent self-managed option. Use cost calculators and mimic your application usage to assess whether the adoption of a PaaS may serve to optimize your cloud costs. Include an estimate of the reduction of your operational costs as that is key to making PaaS more attractive.
Take advantage of PaaS by modernizing your application to better operate in the context of cloud computing. Analyze your application dependencies and seek opportunities to replace them with PaaS where requirements and constraints allow.
Developing cost optimization through changes in your application architecture allows you to modernize your applications and better align them to cloud-native principles. Although such optimizations may take longer to materialize compared to the techniques in the Reduce component, they come with other side benefits such as increased resiliency and scalability. By skipping this framework component, you will not fully maximize your savings opportunities and you may leave behind the cost benefits that derive from the adoption of cloud-native principles.
The Evolve component of this framework (see Figure 17) illustrates the strategic capabilities to apply the cost management practice throughout the organization. You must adopt the right set of tools for financial management. You must drive cost optimization through optimal workload placement between multiple cloud providers. You will continue to shift budgeting accountability to your cloud consumers and incentivize them to take more financial responsibility. Ultimately, you will identify which business KPIs you can correlate with your cloud costs to measure the return of your investments in cloud services. This component of the framework brings to fruition the rest of this cost management framework and evolves the practice to achieve scale.
To implement financial management processes, you must use purpose-built tools. The high dynamism and the scale of cloud deployments do not make cost management suitable for spreadsheet-based management. You must employ real-time tools that can read metrics from APIs and provide the automation required for this practice to scale. You must adopt the management tools that cloud providers provide natively. But you also must augment them with third-party tools and possibly develop your own extensions when necessary. See for the Gartner methodology on developing your management tooling strategy.
Adopt Native Tooling
Major public cloud platforms are equipped with a broad set of native management tools. Such tools are highly integrated with the cloud platform and provide a high depth of functionality. For example, Amazon CloudWatch and Microsoft Azure Monitor can gather unique metrics about their respective cloud platforms that no other tools can aspire to collect. Native tools are available to all client organizations with no additional deployment effort required. Some of these tools come free of charge, while others may be charged with a consumption-based model. Cloud providers continue to invest in their native management toolset with frequent additions of new features and services. For providers, management tools are also a vehicle to make their cloud platform stickier by improving their customer experience.
Because of the depth of functionality, integration and readiness of these tools, Gartner recommends that organizations start developing their cloud management strategy by adopting cloud providers’ native tools. Such tools include cost management functionality. Figure 18 provides an example list of the native cost management tools of AWS, GCP and Microsoft Azure with reference to three components of this framework.
AWS provides cost management through a series of tightly scoped and loosely coupled tools. Microsoft Azure strengthened its native functionality by acquiring the multicloud cost management tool Cloudyn in June 2017. Microsoft intends to continue migrating Cloudyn functionality into the Azure native portal and rebrand it as Azure Cost Management. However, at the time of writing, the migration hasn’t been completed and Cloudyn continues to be available as a stand-alone tool. Google provides minimal tooling for cost management, and client organizations need to rely primarily on BigQuery and Data Studio to get a handle on their costs.
Cloud providers’ native tools come with some limitations, for instance:
Some tools may lack functionality. With native tools, organizations may not be able to fully contain spending waste or maximize their savings.
Native tools deliver minimal functionality outside of their own cloud platform. With some exceptions (Cloudyn supports both AWS and GCP to some extent), cloud providers continue to prioritize the development of functionality for their own platform. This reduces the value of native tools for organizations that want a consistent, multicloud cost management strategy.
Cloud providers may have a conflict of interest in providing tools that make their clients spend less money using their platform. Cloud providers claim that helping their clients run applications as cost-effectively as possible improves retention. However, this potential conflict of interest may not allow organizations to maximize their savings.
Despite these limitations, native tools continue to be the fastest route to start controlling your costs. You must prioritize the adoption of these native tools before considering the addition of third-party or in-house developed solutions. However, once you have mastered these native capabilities, you must conduct a functionality gap analysis and identify the cost management requirements that remain unaddressed. To help with this identification, Gartner has assessed and compared the native cost optimization capabilities of major cloud providers in
Adopt Third-Party Tooling
To address functionality gaps in native tools, for multicloud management or simply to gain an independent point of view, you may want to adopt a third-party cost management tool. Managing costs and reducing the cloud bill is a compelling functionality that third-party vendors have built within their product set. Such functionality helps build a tool’s ROI because it provides tangible financial results to the investment in the tool’s adoption. Due to such quantifiable returns, third-party cost management tools have achieved good market traction and have driven several M&A events (see Note 1).
However, organizations must be wary of vendor promises and should thoroughly assess a tool’s added value. Many third-party tools in the market provide functionality that’s barely equivalent to what AWS or Microsoft Azure already natively provide. Sometimes, their multicloud capabilities suffer from poor feature parity between supported providers. Therefore, organizations must thoroughly assess the capabilities for each provider they intend to use. provides several cost management and resource optimization criteria that organizations can use to develop their evaluation framework.
Organizations can find aspects of cost management functionality in the following types of tools:
Cost management: Purpose-built point tools that provide end-to-end cost management. Their capabilities map closely to this framework and span budget management, cost tracking, allocation, reporting and optimization. Examples of such tools include Apptio Cloudability, CloudCheckr, Flexera Optima and VMware-CloudHealth Technologies.
Cost optimization: Purpose-built point management tools that focus on reducing and optimizing the cloud bill. Such tools do not provide cost tracking and reporting but often excel at their optimization capabilities. Examples of such tools include Densify and Turbonomic.
Cloud governance: Tools with broader management functionality that aim to holistically address the cloud governance space. Such tools provide policy-based management across the domains of IAM, security, configuration and cost management. Examples of such tools include cloudtamer.io, Turbot and TrendMicro-Cloud Conformity.
Monitoring: Tools that provide availability and performance monitoring, but have extended capabilities to also manage cost metrics. Some of these tools simply provide cost reporting; others have gone further and provide optimization recommendations. Examples of such tools include Datadog and New Relic.
Cloud management platforms (CMPs): Tools with broader coverage of the cloud management space and that incorporate aspects of cost management. Because their functionality is broader, their cost management coverage tends to be not as featureful as other cost management tools. Examples of such tools include Embotics (now Snow Software), HyperGrid, Morpheus Data and Scalr.
To help organizations assess the depth of functionality of cloud cost optimization tools, Gartner has published a report that compares five vendors, selected based on Gartner client interest. provides an assessment based on the same set of criteria as those used for assessing the cloud provider’s native functionality in Gartner clients can use the two reports jointly to determine the best combination of tools.
Third-party cost management tools provide functionality that can exceed what cloud providers natively implement. Furthermore, their support for multiple cloud platforms allows organizations to implement a multicloud management strategy. The compelling story and provider-independence of such tools will allow them to continue to receive investments in the near future. Assess the addition of a third-party tool as part of your management strategy to extend the cloud provider’s native functionality and to gain independence.
Although in rapid movement and expansion, the cloud cost management market is far from mature. Cloud providers’ native tools are just beginning to build functionality. Sometimes, cloud providers take a “building blocks” approach, leaving it up to client organizations to develop what’s necessary to glue blocks together. Even if more advanced, third-party tools still focus primarily on IaaS and are just starting to address the PaaS space.
As a consequence, you may have to develop your own extensions when functionality is not available or when it requires integration. Gartner doesn’t recommend developing an entire cost management system in-house. When developing your own extensions, you must:
Leverage the available building blocks. Building blocks include the cloud provider’s APIs, CLIs, fPaaS services and open-source projects. Keep your coding efforts to the minimum.
Keep your scope small. Don’t try to achieve too much with your code. Keep the scope of your extension tightly focused on a single function with clearly scoped boundaries.
Treat it like software. Document your code, version it and add it to your repositories. It is software that you’re building. Your extensions should not be scripts that only the author knows about.
For example, you may want to develop code that terminates all the cost-accruing services within a development environment when it violates a budget policy. Other times, you may want to develop a policy that deletes unused capacity of a cloud service that is not supported out of the box by the tools you’re using.
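A tightly scoped extension of this kind can be sketched in a few lines of Python. The sketch below only illustrates the policy logic: the Resource record, the environment label and the stop_resource callback are hypothetical names, and the callback stands in for whatever provider API or CLI call you would actually use. It defaults to a dry run so that the selection can be reviewed before anything is stopped.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    resource_id: str
    environment: str     # assumed to come from a tag such as "dev" or "prod"
    monthly_cost: float  # month-to-date cost attributed to the resource


def resources_to_stop(resources, environment, budget):
    """Return IDs of cost-accruing resources in an environment that is over budget.

    If the environment's month-to-date spend is within budget, nothing is selected.
    """
    in_scope = [r for r in resources if r.environment == environment]
    spend = sum(r.monthly_cost for r in in_scope)
    if spend <= budget:
        return []
    return [r.resource_id for r in in_scope if r.monthly_cost > 0]


def enforce(resources, environment, budget, stop_resource, dry_run=True):
    """Apply the policy. `stop_resource` is a caller-supplied wrapper around a provider API."""
    targets = resources_to_stop(resources, environment, budget)
    if not dry_run:
        for resource_id in targets:
            stop_resource(resource_id)
    return targets


# Example with made-up data; the lambda stands in for a real API call.
inventory = [
    Resource("vm-1", "dev", 300.0),
    Resource("db-1", "dev", 450.0),
    Resource("vm-2", "prod", 900.0),
]
print(enforce(inventory, "dev", budget=500.0, stop_resource=lambda rid: None))
```

Keeping the selection logic separate from the action makes the extension easy to test, version and document, in line with the guidance above.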
Although I&O technical professionals have traditionally operated with point-and-click interfaces, cloud computing makes code increasingly important for cloud management. Certain cloud providers’ functionality may be accessible only through coding, such as policies written in JSON. You can also drive automation by developing code through fPaaS as described in Although third-party tools may abstract the need for coding, they also may introduce constraints on the available functionality. As a consequence, you must learn how to code to develop the extensions required to implement this framework in its entirety.
Onboard New Providers
Cloud providers offer similar services but with different capabilities and prices. Although their services are designed to address similar use cases, the differences in implementation may result in cost savings when running an application in one provider versus another. Organizations that want the most cost-effective provider for each workload must develop multicloud strategies, which start with the onboarding of new providers.
Cloud technologies are faster to adopt than data center technologies. The adoption of cloud services does not require lengthy vendor selection, procurement processes, capacity allocation and contract negotiation. However, the adoption of a new cloud vendor that is suitable to run enterprise-grade workloads still requires a number of onboarding tasks, including:
Familiarization with the vendor offering, services, pricing models and SLAs
Review of the provider’s terms and conditions
Establishment of vendor-specific KPIs
Building knowledge of the provider’s native management tools
Establishment of a provider-specific competence center with certified engineers
Definition of provider-specific governance policies
Establishment of a direct link between the on-premises data center and the cloud provider’s network
Most organizations are already using or planning to use multiple providers and Gartner expects that most enterprises will end up in a multicloud scenario for both IaaS and PaaS. Developing the competency to operate alternative providers allows you to mitigate concentration and lock-in risks. Furthermore, multicloud strategies allow you to define workload placement policies based on cost drivers as described in the next section, Broker Cloud Services. For more information on the benefits of multicloud adoption, see
Broker Cloud Services
Multicloud strategies require you to develop a workload placement policy. This policy governs the decision of the target cloud provider for your applications. As part of this framework, you must develop the cost-based policies that allow you to place workloads in the most cost-effective platform.
Comparing costs between cloud providers is no easy task. Often, an application requires a different architecture in each provider to deliver the same set of requirements, in terms of performance, availability, integrity and confidentiality. This is due to different technology platforms, available services, design principles, HA strategies, security and SLAs. Before producing a comparative forecast, you may have to adapt your application architecture using the principles described in the Architect With Cost in Mind section of this framework.
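Once a comparative forecast exists for each candidate provider, the cost-based part of a placement policy can be expressed very simply. The following Python sketch assumes that each option already carries a forecast monthly cost for the provider-specific architecture and a flag stating whether that architecture meets the workload's requirements. The PlacementOption structure and the provider names are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class PlacementOption:
    provider: str
    forecast_monthly_cost: float  # taken from your comparative forecast
    meets_requirements: bool      # performance, availability, integrity, confidentiality


def choose_placement(options):
    """Pick the lowest-cost provider among those that satisfy the requirements."""
    eligible = [o for o in options if o.meets_requirements]
    if not eligible:
        raise ValueError("no provider satisfies the workload requirements")
    return min(eligible, key=lambda o: o.forecast_monthly_cost)


# Example with made-up forecasts.
options = [
    PlacementOption("provider-a", 1200.0, True),
    PlacementOption("provider-b", 900.0, False),
    PlacementOption("provider-c", 1000.0, True),
]
print(choose_placement(options).provider)
```

Filtering on requirements before comparing prices keeps the policy from selecting a cheaper provider that cannot deliver the required performance, availability, integrity or confidentiality.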
Sometimes, organizations also use cost drivers to govern placement decisions between public cloud providers and on-premises data centers. In such cases, to build an “apples-to-apples” comparison, Gartner recommends developing an on-premises cost model as described in Furthermore, you should not focus simply on the pure infrastructure cost comparison. Gartner recommends building a multiyear TCO and ROI that takes into account future cost savings from cloud-induced efficiencies. This process is described in
More information on developing workload placement framework for multicloud and hybrid cloud can be found in
Shift Budget Accountability
The self-service nature of cloud services is fostering a decentralized approach in IT resource procurement. Many organizations experience scenarios in which LOBs, departments and government agencies independently start IT projects using cloud services without involving central IT. As IT service consumers become more autonomous, they also must take partial responsibility for disciplines that were once within the remit of central IT only. These disciplines include monitoring and security and should also include cost. This shift in responsibility does not mean that central IT will eventually no longer be relevant. On the contrary, it will continue to act as an enabler and as a “second line of defense” to protect the business from risk.
In this scenario, having more autonomous users means they can make procurement decisions that you don’t control but that have an impact on the cloud bill. Once the bill arrives, it is central IT that gets called out to manage the economics of cloud usage. You can certainly apply a centralized cost reduction practice to remove detected waste. While effective, this practice can also be disruptive and a potential source of frustration. A centralized-only practice does not scale when more users gain power to provision resources.
As a consequence, you must shift budget accountability to cloud consumers to influence their provisioning decisions. An accountable user would be motivated to more precisely size resources or remove those that aren’t necessary.
Shifting accountability requires a cultural change and does not happen overnight. To initiate this process and lay out the foundations for this shift to happen, you must:
Formalize the budget approval as you enable user access to cloud services: Application or project owners who request cloud services for their workloads must commit to an amount of monthly spend and be held accountable for it. As you create cloud accounts for your consumers, help them build a spending forecast as described in the Plan component of this research. Then, formalize the budget approval based on the produced forecast. Having users ask for the authorization to spend money helps shift accountability more than having IT force a budget upon them. You can automate the budget request and approval using a service management platform as described in
Provide consumers with visibility into their spending: Provide them with dashboards and reports that help track their actual spending on a daily basis and compare it against their commitments. Set up alerts when their spend is on track to exceed the approved budget. See Alert on Anomalies in the Track component for more information on alerts that help proactively address spending issues.
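An alert for spend that is "on track to exceed" a budget needs some form of projection. The Python sketch below uses the simplest possible one, a linear run rate based on the average daily spend so far in the month. This is an assumption made for illustration. Native and third-party tools may use more sophisticated forecasts, and the function names here are hypothetical.

```python
import calendar
from datetime import date


def projected_month_end_spend(spend_to_date, today):
    """Linear run-rate projection: average daily spend so far times days in the month."""
    days_in_month = calendar.monthrange(today.year, today.month)[1]
    return spend_to_date / today.day * days_in_month


def budget_alert(spend_to_date, approved_budget, today):
    """Return an alert message if the projection exceeds the budget, else None."""
    projection = projected_month_end_spend(spend_to_date, today)
    if projection <= approved_budget:
        return None
    overrun = projection - approved_budget
    return (
        f"Projected spend {projection:.2f} exceeds approved budget "
        f"{approved_budget:.2f} by {overrun:.2f}"
    )


# Example: 400 spent by April 10 against an approved budget of 1,000.
print(budget_alert(400.0, 1000.0, date(2024, 4, 10)))
```

A run-rate projection is noisy in the first days of a month, so you may prefer to suppress alerts until several days of spend have accumulated.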
Cloud consumers who feel accountable for their spend will consider cost optimization key to hitting their budget goals. In this context, central IT must not present itself as a law enforcement body.
With this mindset, providing visibility into spending and recommending actions that help drive costs down will be well received by your cloud consumers.
Shifting accountability does not mean you won’t have to centrally control costs and reduce spending waste. But having more accountable cloud consumers will lower the need for centralized corrective actions, allowing you to drive efficiency at scale.
Incentivize Financial Responsibility
Some cloud governance bodies don’t possess the authority to centrally remediate spending issues. This authority may sit exclusively with the resource owners. In other cases, your measures for shifting budget accountability may not be sufficient to make cloud consumers care about how much they spend. For such situations or simply to accelerate the shift in budget accountability, you must further incentivize your cloud consumers to take ownership of their spend.
As an example, you can “gamify” the cost management practice and create healthy competition between the teams in charge of cloud provisioning. You can maintain and share leaderboards that rank the teams based on their spending discipline. The position of a team within the leaderboard can trigger behavioral changes, making users more attentive about what they spend and how they reduce their waste.
The leaderboard should contain the following metrics for the current month and for each team:
Total amount of spending
Unaddressed spending waste (absolute value and as a percentage of the total spend)
Number of notified cost optimization opportunities and potential savings
Number of pursued cost optimization opportunities and realized savings
All metrics should also include an increase/decrease indication from the previous month’s value. You can also define scoring rules based on tracked values and establish a rank. These rules should consider that the absence of spending waste is preferable compared to a high number of pursued cost optimization opportunities. Lastly, you can award winners with team dinners, team-building activities and other incentives.
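The leaderboard metrics and a scoring rule can be sketched in Python as follows. The ranking used here is one possible interpretation of the guidance above, not a prescribed formula: teams are ordered first by their percentage of unaddressed waste, so that a team with no waste always ranks above a team with waste, and ties are broken by realized savings. The TeamMetrics structure and field names are assumptions.

```python
from dataclasses import dataclass


@dataclass
class TeamMetrics:
    team: str
    total_spend: float
    unaddressed_waste: float
    notified_opportunities: int
    potential_savings: float
    pursued_opportunities: int
    realized_savings: float

    @property
    def waste_pct(self):
        """Unaddressed waste as a percentage of total spend."""
        if self.total_spend == 0:
            return 0.0
        return 100 * self.unaddressed_waste / self.total_spend


def month_over_month(current, previous):
    """Percentage change from the previous month's value (None if there is no baseline)."""
    if previous == 0:
        return None
    return 100 * (current - previous) / previous


def leaderboard(teams):
    """Rank teams: lowest waste percentage first, then highest realized savings."""
    return sorted(teams, key=lambda m: (m.waste_pct, -m.realized_savings))


# Example with made-up figures.
teams = [
    TeamMetrics("payments", 1000.0, 100.0, 10, 400.0, 8, 300.0),
    TeamMetrics("search", 1000.0, 0.0, 2, 50.0, 0, 0.0),
    TeamMetrics("mobile", 1000.0, 100.0, 5, 200.0, 2, 100.0),
]
for position, m in enumerate(leaderboard(teams), start=1):
    print(position, m.team, f"waste {m.waste_pct:.1f}%", f"saved {m.realized_savings:.2f}")
```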
Correlate Costs to Business Value
The ultimate goal of a cost management practice is to correlate cloud costs to business value. Driving costs down as a principle must not be done at the expense of being unable to fully support the business goals. To avoid this, you should stop considering cloud costs as such and start considering them as investments. Then, you must correlate them to business KPIs and calculate the return of these investments.
For example, Netflix measures its business value by the total number of active streams; that is, how many people are currently watching content online. Correlating that KPI to its cloud costs allows Netflix to ensure that spending growth does not outpace the growth of its active streams. Figure 19 shows Netflix’s “normalized cost per active stream” over time and the goal to keep that line as flat as possible. Growth in this metric would signal cost inefficiencies. A drop in this metric would be a sign of better economies of scale.
Depending on your industry and organizational goals, you must identify which KPIs you can correlate to cloud costs. For example, a KPI could be the number of billable air miles per seat for an airline, the number of monetary transactions for a bank or the number of issued passports for a government agency. Even if you’re adopting cloud to increase your internal efficiency, this efficiency must eventually translate into the growth of business KPIs.
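A unit-cost metric of this kind is straightforward to compute once cloud costs and the KPI are available for the same periods. The following Python sketch divides cost by the KPI value for each period and normalizes the series to the first period, so that a flat line at 1.0 means costs are growing in step with the business. The function names and the sample figures are illustrative assumptions and do not represent any organization's data.

```python
def cost_per_unit(cloud_cost, kpi_value):
    """Cloud cost divided by the business KPI for the same period."""
    if kpi_value <= 0:
        raise ValueError("KPI value must be positive")
    return cloud_cost / kpi_value


def normalized_unit_costs(costs, kpi_values):
    """Unit cost per period, normalized so that the first period equals 1.0."""
    if not costs or len(costs) != len(kpi_values):
        raise ValueError("costs and kpi_values must be non-empty and of equal length")
    unit_costs = [cost_per_unit(c, k) for c, k in zip(costs, kpi_values)]
    baseline = unit_costs[0]
    return [u / baseline for u in unit_costs]


# Example with made-up monthly figures: cost grows, but the KPI grows faster.
monthly_costs = [100.0, 150.0, 180.0]
monthly_kpi = [1000, 1500, 2000]
print([round(v, 3) for v in normalized_unit_costs(monthly_costs, monthly_kpi)])
```

Values above 1.0 indicate that unit cost has grown since the baseline period, while values below 1.0 indicate improving economies of scale.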
Developing capabilities to operationalize and evolve your cost management practice is the last fundamental component to control your cloud costs. Defining your tooling strategy and evolving the process to embrace multicloud, spending decentralization and correlation to business metrics will allow you to develop a more strategic approach to cost governance.
Risks and Pitfalls
I&O technical professionals in charge of managing cloud costs must be wary of the following risks and pitfalls:
Lack of visibility: Even if you apply this framework in its entirety, you may still experience a certain degree of spending waste and uncontrolled growth. This may be due to the lack of visibility of certain portions of your cloud deployments. Without complete visibility, you may be applying this framework only to part of your cloud spending. Mitigate this risk by working with your cloud providers to uncover all of your accounts, subscriptions and projects that may have been independently created by different individuals in your organization.
Lack of authority: You may not be able to pursue all the identified reduction or optimization opportunities due to the lack of authority to change resource attributes. For example, you may not be able to rightsize a resource because your role doesn’t have the required permissions. In such cases, you can provide resource owners with recommended cost optimization instructions, but you can’t be sure that they will act upon them. Having accountable resource owners who independently optimize their costs is certainly desirable. However, it may take months before they adopt the right mindset. If you want to quickly reduce your spending, you must claim authority to make changes throughout your organization’s resources.
Inability to scale: You may struggle to scale your cost management practice. This may be due to large and complex bills. Also, a heavily centralized practice may become a bottleneck. Mitigate this risk by using automation as much as possible. When automation isn’t desirable, decentralize your cost management practice and push more responsibility to your cloud consumers.
Frequent disruptions: The actions required to reduce spend may disrupt your resource availability. For example, rightsizing compute instances requires a reboot. Or, because of a loosely defined policy, you may accidentally delete resources that are actually in use, hurting your team’s productivity. Identify disruptive remediations and coordinate their execution with cloud consumers. Align remediations to maintenance windows. Set up communication channels that notify cloud consumers of the upcoming remediations and allow them to schedule actions in a different time frame.
The following documents constitute an integral part of this research:
provides a framework for the successful implementation of a tagging strategy to manage resource metadata for cost allocation.
provides comprehensive guidance on implementing cloud governance. This research discusses policy definition and enforcement with preventative and retrospective approaches.
assesses and compares the cost optimization capabilities of Apptio, CloudCheckr, Flexera, Turbonomic and VMware against a common set of criteria.
assesses and compares the capabilities of the native cost optimization tools of the leading public cloud providers against a common set of criteria.
The following M&A events have occurred in the public cloud cost management space in recent months (in chronological order of the announcement):
“HPE Acquires Cloud Cruiser In Stepped Up Flexible Capacity Hybrid Cloud Pay-As-You-Go Offensive,” CRN.
“Microsoft Confirms Cloudyn Acquisition, Sources Say Price Is Between $50M and $70M,” TechCrunch.
“VMware Acquires CloudHealth Technologies for Multicloud Management,” TechCrunch.
“Flexera Acquires RightScale to Combine Software Asset, Cloud Management,” ZDNet.
“Apptio Buys Portland’s Cloudability, Adding More Cloud Financial Analysis Tools to Its Belt,” GeekWire.
“ParkMyCloud Has Been Bought by a Well-Funded Boston Firm,” Washington Business Journal.
“On-Prem IT Management Firm Snow Software Acquires Embotics to Expand Into Multicloud,” SiliconANGLE.