How to Address The Cloud Cost Challenge in AWS
In 2020, Gartner predicted that the public cloud market would grow to $257.9B for the year, while companies were projected to collectively waste $90.8B in public cloud spend. Gartner also suggested in a recent press release that the cloud market will reach 14% of total global enterprise IT expenditure by 2024, up from 9% in 2020. Approximately 35% of all cloud spending each year is unnecessary. This tells us that many businesses appear to be failing quite badly at the challenge of cloud cost optimization. Some waste money by overprovisioning, others don’t pay attention to their data lifecycles, and some neglect to turn off their VMs when they are not in use.
This article will not preach to you about the cost benefits of being in the cloud. You already know that, otherwise you wouldn’t be reading this article; you have already swapped the fixed expense of your own data centre for the variable, pay-as-you-consume model. But… how do you deal with the inevitable cloud sprawl and associated costs that come with allowing development teams to self-serve in the cloud? Scale that challenge up to the enterprise level and the wastage has the potential to become truly eye-watering very quickly.
I love developers. They fuel everything we do in the world of software. However, they themselves may be the first to coquettishly admit that peeerhaaaaps they may not always have the best housekeeping habits. Developers are consumers of technology. Virtual servers are left unused or underutilized. Old, long-forgotten backups and images sit gathering dust. Unattached virtual hard drives drift through the ether. AWS accounts need to be monitored at all times to identify when assets are being under-utilized (or not utilized at all). When opportunities exist to reduce costs by deleting, terminating or releasing zombie assets, you must leap into action.
Such cost control measures might be the responsibility of the FinOps team, the Cloud Management team, the SRE team or the Ops team. But regardless of the team name, the mechanisms and processes for managing costs in the cloud remain the same. This article will seek to outline reliable and proven methods of controlling costs in AWS.
Due to the scope of the subject at hand, this article will be broken down into five parts:
1. Set Cloud Cost Control Standards for your Organisation.
2. Using native AWS systems for cost control.
3. Cost control strategies for common AWS services.
4. Examples of AWS cost control automation.
5. How to avoid cost surprises in AWS with alerting.
We’ll start in this article by defining some cloud cost control standards.
Part 1 — Set Cloud Cost Control Standards for your Organisation.
Defining cloud cost control standards for your organisation helps you focus on where to identify cost optimization opportunities, thereby maximizing the business value of cloud use. You can use the cloud cost standard examples outlined below to ensure that you are getting the best value for money from commonly used services in AWS. I will talk more about using native AWS systems to achieve your cost control standards in the next article.
- Compute right-sizing (instance size and type)
There is nothing so sad and disturbing in this world as a public cloud virtual server which is too large and expensive for its given task. This is pure cost wastage. AWS furnish you with recommendations for EC2 downsizing candidates in their Trusted Advisor service, so there is really no excuse for oversized virtual servers. What’s that? You need more computing power available for spikes in demand? Then scale horizontally, using autoscaling. The process of selecting the appropriate instance size and, if necessary, redeploying becomes even simpler if you use IaC.
Downsizing underutilized EC2 instances to fit the actual capacity need at the lowest cost represents an efficient strategy to reduce your monthly AWS costs. Consider updating the instance type to a more modern instance family at the same time (for example, from m4 up to m5) to realise further performance versus cost improvements. Choosing an instance type based on AWS’s own Graviton processors will provide even better cost value. Analyse performance data to right-size your EC2 instances, or simply follow AWS Trusted Advisor’s sage advice.
Fun fact: Do not try to right-size burstable instance families, because these families are designed to run at low average CPU for significant periods of time (and burst above it when needed).
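If you want to generate your own list of right-sizing candidates rather than rely on Trusted Advisor, a rough sketch using CloudWatch metrics might look like the following. The 14-day window and the 10% CPU threshold are my own illustrative assumptions, not AWS guidance.

```python
# Sketch: flag running EC2 instances whose average CPU over the last 14 days
# is below a threshold, making them candidates for right-sizing.
from datetime import datetime, timedelta, timezone
import boto3

ec2 = boto3.client("ec2")
cloudwatch = boto3.client("cloudwatch")

end = datetime.now(timezone.utc)
start = end - timedelta(days=14)          # illustrative look-back window
CPU_THRESHOLD = 10                        # illustrative threshold (%)

paginator = ec2.get_paginator("describe_instances")
for page in paginator.paginate(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
):
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            instance_id = instance["InstanceId"]
            stats = cloudwatch.get_metric_statistics(
                Namespace="AWS/EC2",
                MetricName="CPUUtilization",
                Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
                StartTime=start,
                EndTime=end,
                Period=3600,              # hourly datapoints
                Statistics=["Average"],
            )
            datapoints = stats["Datapoints"]
            if not datapoints:
                continue
            avg_cpu = sum(dp["Average"] for dp in datapoints) / len(datapoints)
            if avg_cpu < CPU_THRESHOLD:
                print(f"{instance_id} ({instance['InstanceType']}): "
                      f"avg CPU {avg_cpu:.1f}%, right-sizing candidate")
```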
- Effective management of Reserved Instances and Savings Plans
If you have a level of predictability in your EC2 usage, you will certainly benefit from purchasing reserved instances. In return for your commitment to pay for all the hours in a one-year or three-year term, the hourly rate is lowered significantly. When I say significantly I MEAN significantly. You can pay for your reserved instances all upfront, partially upfront or with no upfront payment at all. As you might expect, the percentage saving reduces if you make a smaller up-front payment commitment. So you’ll receive the best savings (perhaps as much as 70% compared to on-demand) when you pay all upfront. Note it is also possible to purchase convertible reserved instances, but the saving is quite a bit lower.
Remember though, once you have purchased reserved instances you cannot cancel them, so plan carefully. The only option for ditching unwanted reserved instances is to trade them on the Reserved Instance Marketplace. I had to do this once and it made me feel strangely seedy and unclean. If in doubt, purchase less than you think you may need — you can always top up later, or fill the gap with savings plans. AWS does provide tools to aid you in understanding your reserved instance situation, be it current utilization and coverage, or potential purchases.
Savings Plans are similar to reserved instances, except that the commitment is to an amount of spend per hour rather than to specific instance usage.
Fun facts: You can also purchase RIs for RDS, ElastiCache and OpenSearch. Savings Plans can also be applied to the amount spent on AWS Fargate and Lambda.
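The utilization and coverage tools mentioned above are also exposed programmatically through the Cost Explorer API, so you can pull the figures into your own reporting. A minimal sketch, with an illustrative date range:

```python
# Sketch: pull a month of Reserved Instance utilisation and coverage figures
# from the Cost Explorer API. The date range and granularity are illustrative.
import boto3

ce = boto3.client("ce")  # Cost Explorer
time_period = {"Start": "2024-01-01", "End": "2024-02-01"}  # illustrative

utilization = ce.get_reservation_utilization(
    TimePeriod=time_period, Granularity="MONTHLY"
)
for group in utilization["UtilizationsByTime"]:
    total = group["Total"]
    print(f"{group['TimePeriod']['Start']}: "
          f"{total['UtilizationPercentage']}% of purchased RI hours were used")

coverage = ce.get_reservation_coverage(
    TimePeriod=time_period, Granularity="MONTHLY"
)
for group in coverage["CoveragesByTime"]:
    hours = group["Total"]["CoverageHours"]
    print(f"{group['TimePeriod']['Start']}: "
          f"{hours['CoverageHoursPercentage']}% of instance hours covered by RIs")
```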
- Implement a pause / resume strategy
You wouldn’t leave your lights on at home if you were out (unless you were worried about burglars). So why leave virtual servers running when they are not in use? The easiest way to reduce AWS EC2 compute costs is to turn off instances that are not being used. This may be just for “out of office hours” plus the weekends. Or you might find instances that have been idle for some time, say, more than two weeks, in which case you should consider stopping or even terminating them. Though before you get too slap-happy, you should of course seek confirmation from the owner of the virtual server.
Managing downtime for EC2 instances is really very straightforward. Multiple tools are available to help with this process, but the AWS Instance Scheduler solution is simplicity itself. Try it!
Fun fact: Be mindful that terminating an instance will delete the attached root EBS volume by default.
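If you prefer to roll your own rather than deploy the scheduler solution, a minimal sketch of a scheduled Lambda handler might look like this. The Schedule=office-hours tag and the idea of triggering it from an EventBridge cron rule are my own assumptions for illustration.

```python
# Sketch: a Lambda handler that stops any running EC2 instance tagged
# Schedule=office-hours, intended to be triggered by a cron schedule each evening.
import boto3

ec2 = boto3.client("ec2")

def lambda_handler(event, context):
    response = ec2.describe_instances(
        Filters=[
            {"Name": "tag:Schedule", "Values": ["office-hours"]},  # assumed tag
            {"Name": "instance-state-name", "Values": ["running"]},
        ]
    )
    instance_ids = [
        instance["InstanceId"]
        for reservation in response["Reservations"]
        for instance in reservation["Instances"]
    ]
    if instance_ids:
        ec2.stop_instances(InstanceIds=instance_ids)
        print(f"Stopped: {instance_ids}")
    return {"stopped": instance_ids}
```

A matching handler calling start_instances in the morning completes the pause / resume pair.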
- Take advantage of spot instances
AWS spot instances can reduce your EC2 on-demand instance cost by up to a staggering 90%. To allow for spikes in customer demand, AWS have a certain amount of idle infrastructure in their data centres. AWS provides this excess kit to customers as spot instances. Your workloads must be stateless and fault-tolerant though, as spot instances can be taken from you at short notice. Short-lived workloads such as those created by a CI/CD process are also ideal candidates for spot.
As long as you can manage the potential interruptions that spot will bring, there is no better way of saving on your EC2 costs.
Fun fact: Being flexible with instance types and which availability zone they are hosted in will give Spot a better chance to find and allocate your required amount of compute capacity. Don’t let OCD win!
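As a minimal sketch of how spot capacity is requested, the standard RunInstances call accepts an InstanceMarketOptions block. The AMI ID and instance type below are placeholders.

```python
# Sketch: launch a one-off Spot instance via the normal RunInstances API call.
import boto3

ec2 = boto3.client("ec2")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",   # placeholder AMI ID
    InstanceType="m5.large",           # placeholder instance type
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {
            "SpotInstanceType": "one-time",
            "InstanceInterruptionBehavior": "terminate",
        },
    },
)
print(response["Instances"][0]["InstanceId"])
```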
- Delete unattached EBS volumes
The number of volumes in use, especially in large or enterprise AWS deployments, can quickly spiral out of control, leading to increased cloud storage costs. Removing unattached volumes keeps unnecessary storage costs down and reduces the risk of exposing old or forgotten sensitive data. Tracking down all these unused resources can be a time-consuming task, so you should consider automating the removal of unattached EBS volumes using Lambda functions.
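Such a function doesn’t need much code. Here is a minimal boto3 sketch that lists volumes in the “available” state and, once you are happy with the output, deletes them; the DRY_RUN switch is just a local safety valve I have assumed.

```python
# Sketch: find unattached ("available") EBS volumes and optionally delete them.
import boto3

ec2 = boto3.client("ec2")
DRY_RUN = True  # flip to False once you trust the output

paginator = ec2.get_paginator("describe_volumes")
for page in paginator.paginate(
    Filters=[{"Name": "status", "Values": ["available"]}]
):
    for volume in page["Volumes"]:
        print(f"Unattached volume {volume['VolumeId']} ({volume['Size']} GiB)")
        if not DRY_RUN:
            ec2.delete_volume(VolumeId=volume["VolumeId"])
```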
- Delete obsolete snapshots
Automatically delete old snapshots that are not managed by any specific backup policy and are more than 30 days old. AWS provide simple tools to help you manage your snapshot lifecycles and retention periods. The AWS Backup console allows you to create backup policies that automate backup schedules and retention management. Alternatively, you could use Amazon Data Lifecycle Manager to automate the creation, retention and deletion of snapshots for EBS volumes. Or you could just write your own clean-up script of course (a sketch of one follows below)…
Fun fact: Periodic snapshots of the same volume are incremental, so lots of snapshots don’t necessarily mean lots of storage consumed.
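For the roll-your-own route, a clean-up script might look like this. The 30-day cut-off and the “managed-by” tag convention for policy-managed snapshots are assumptions you would adapt to your own environment.

```python
# Sketch: delete self-owned EBS snapshots older than 30 days that do not carry
# a tag marking them as managed by a backup policy.
from datetime import datetime, timedelta, timezone
import boto3

ec2 = boto3.client("ec2")
cutoff = datetime.now(timezone.utc) - timedelta(days=30)  # illustrative cut-off

paginator = ec2.get_paginator("describe_snapshots")
for page in paginator.paginate(OwnerIds=["self"]):
    for snapshot in page["Snapshots"]:
        tags = {t["Key"]: t["Value"] for t in snapshot.get("Tags", [])}
        if "managed-by" in tags:       # assumed tag convention; skip policy-managed snapshots
            continue
        if snapshot["StartTime"] < cutoff:
            print(f"Deleting {snapshot['SnapshotId']} from {snapshot['StartTime'].date()}")
            ec2.delete_snapshot(SnapshotId=snapshot["SnapshotId"])
```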
- Move infrequently-accessed data to lower cost tiers
It can be very easy to use the noble S3 bucket as a hoarding area. A dumping ground, if you will, for all types of data under the sun. Keeping all of your S3 data in the default standard storage tier is just a pure waste of money if you do not intend to retrieve it regularly. There are several different storage tiers in S3, with each cheaper tier taking longer to retrieve from than the previous one. It would be a full-time job to try to place data in the appropriate tier yourself, so AWS have thankfully provided S3 Intelligent-Tiering to do this for us. It is an option you can simply switch on for any given S3 bucket. This utility will automatically move data to the most cost-effective access tier when access patterns change. If objects remain un-accessed, they will continue to plunge downwards through the storage layers… down through the infrequently accessed storage tiers all the way into the murky depths of the Glacier Deep Archive tier. Where, presumably, your data is buried in the Arctic under permafrost.
I am not entirely sure what mysterious algorithms are used to calculate Intelligent-Tiering cost savings. But I can say that I have personally seen a 32% annual saving on S3 costs where Intelligent-Tiering is enabled, and that was at an enterprise-sized organisation.
Fun fact: The maximum amount of data you can store in an S3 bucket is… unlimited.
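One way to “switch it on” for a bucket is a lifecycle rule that transitions objects into the Intelligent-Tiering storage class. A minimal sketch follows, assuming a placeholder bucket name; note that this call replaces any existing lifecycle configuration on the bucket, so merge with existing rules in real use.

```python
# Sketch: add a lifecycle rule that moves all objects in a bucket into the
# S3 Intelligent-Tiering storage class.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-example-bucket",  # placeholder bucket name
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "move-to-intelligent-tiering",
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # apply to the whole bucket
                "Transitions": [
                    {"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}
                ],
            }
        ]
    },
)
```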
- Minimize cross-region data flows
We don’t always pay much attention to the movement of data in the cloud. For some AWS services, the cost of moving data in or out is accounted for within the cost of the service itself (rather than being billed as a separate data transfer fee). Sometimes this means that there won’t be a distinct data transfer cost in either direction, as with our old friend, AWS Kinesis. One “data gotcha” is that transferring data between AWS services across regions has the same cost structure (although the rates are a lot lower) as transferring data between AWS and the internet.
So, if possible, keep all traffic within the same region. If traffic needs to exit a region, check and choose the region with the lowest transfer rates that makes the most sense for your business needs. Remember, all traffic within the same AZ and the same VPC, using AWS private IPs, is free. So try to keep your resources within the same AZ and the same VPC, using private IPs, as much as possible.
Fun fact: Data transfer fees are mostly unidirectional, i.e. only data that is going out of an AWS service is subject to data transfer fees.
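If you want to see where your data transfer money is actually going, Cost Explorer can break costs down by usage type so cross-region and internet egress stand out. A rough sketch, with an illustrative date range and a simple “DataTransfer” string match as the filter:

```python
# Sketch: report last month's spend on data-transfer-related usage types.
import boto3

ce = boto3.client("ce")

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2024-01-01", "End": "2024-02-01"},  # illustrative
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "USAGE_TYPE"}],
)
for result in response["ResultsByTime"]:
    for group in result["Groups"]:
        usage_type = group["Keys"][0]
        cost = float(group["Metrics"]["UnblendedCost"]["Amount"])
        # crude heuristic: usage types containing "DataTransfer" are transfer charges
        if "DataTransfer" in usage_type and cost > 0:
            print(f"{usage_type}: ${cost:.2f}")
```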
- Release unattached Elastic IP addresses
A simple cost reduction exercise that should be added to any cost control policy; EIPs are free of charge while they are attached to a running instance, but incur an hourly charge once they are left unattached.
Fun fact: EIPs will also incur a charge if the instance they are attached to is not running.
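This one is only a few lines of boto3. The sketch below prints and then releases any address without an association, so run it with care (or comment out the release call first).

```python
# Sketch: release Elastic IPs that are not associated with any resource.
import boto3

ec2 = boto3.client("ec2")

for address in ec2.describe_addresses()["Addresses"]:
    if "AssociationId" not in address:       # no association means an unattached EIP
        print(f"Releasing unattached EIP {address['PublicIp']}")
        ec2.release_address(AllocationId=address["AllocationId"])
```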
- Establish data lifecycle policies
In addition to protecting valuable data by enforcing a regular backup schedule, using data lifecycle policies will reduce storage costs by deleting outdated backups. After all, if you don’t manage your data properly it will just keep growing and you will keep paying in perpetuity for data which is of no value to you.
Two obvious data candidates which are just begging for lifecycle management are EBS snapshots and our old friend, the S3 bucket. You can use Amazon Data Lifecycle Manager to automate EBS snapshot lifecycles. Subsequent snapshots are incremental, so they generally take up very little storage. When it comes to S3 data cost control, using Intelligent-Tiering goes a long way towards ensuring you pay less for infrequently accessed data. But you can also set S3 Lifecycle rules to limit the number of versions of an object which are retained, to achieve further storage savings. If you are feeling particularly destructive, you can even set up a policy to remove data after a certain period has elapsed. Do you really need to keep those dev environment logs for more than 7 days? Come on, give them up. You can do it!
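As a sketch of what a Data Lifecycle Manager policy looks like in code: the policy below takes a daily snapshot of any volume tagged Backup=daily and retains the last seven. The tag key/value, schedule details and IAM role ARN are placeholders you would swap for your own.

```python
# Sketch: create a Data Lifecycle Manager policy for daily EBS snapshots with
# a seven-snapshot retention.
import boto3

dlm = boto3.client("dlm")

dlm.create_lifecycle_policy(
    ExecutionRoleArn="arn:aws:iam::123456789012:role/MyDlmRole",  # placeholder role ARN
    Description="Daily snapshots of tagged volumes, keep the last 7",
    State="ENABLED",
    PolicyDetails={
        "PolicyType": "EBS_SNAPSHOT_MANAGEMENT",
        "ResourceTypes": ["VOLUME"],
        "TargetTags": [{"Key": "Backup", "Value": "daily"}],      # assumed tag convention
        "Schedules": [
            {
                "Name": "DailySnapshots",
                "CreateRule": {"Interval": 24, "IntervalUnit": "HOURS", "Times": ["03:00"]},
                "RetainRule": {"Count": 7},
            }
        ],
    },
)
```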
- Remove zombie resources
One of the biggest drains on your AWS bill is continuing to run unused resources that are billed continuously. This may seem staggeringly obvious, but it is not always clear in a large cloud estate which resources are not in use. It would be highly beneficial to devise a scripted solution for removing out-of-use or dormant components in AWS. For example: EC2 instances which are providing no benefit, idle or unused load balancers, idle RDS instances, unattached EBS volumes, un-associated Elastic IP addresses and so on. Defining a policy which sets out a clear definition of what constitutes a zombie resource is key to keeping your estate cost efficient. Not only do these unused resources cost money, but they will impact your ability to scale within an AWS account. If left unchecked, zombie resources may cause you to hit hard quota limits for some AWS services. You may think there is no chance of that happening now, but what about in 12 months? Or 5 years? Don’t wait until you hit hard limits to solve the problem.
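As one illustrative heuristic (unattached volumes and EIPs were covered earlier), here is a sketch that flags load balancers with no registered targets at all. What actually counts as a “zombie” should come from your own policy definition, not from this snippet.

```python
# Sketch: flag Application/Network Load Balancers that have no registered
# targets in any of their target groups.
import boto3

elbv2 = boto3.client("elbv2")

for lb in elbv2.describe_load_balancers()["LoadBalancers"]:
    arn = lb["LoadBalancerArn"]
    target_groups = elbv2.describe_target_groups(LoadBalancerArn=arn)["TargetGroups"]
    registered = 0
    for tg in target_groups:
        health = elbv2.describe_target_health(TargetGroupArn=tg["TargetGroupArn"])
        registered += len(health["TargetHealthDescriptions"])
    if registered == 0:
        print(f"Possible zombie load balancer: {lb['LoadBalancerName']}")
```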
For each of these cost control standards, it will also be useful to define additional metadata which supports the management of each standard (a worked example follows the list below). For example:
Description: Provide a detailed description of the cost standard so it is clear exactly what is in scope and what it is supposed to achieve.
Discovery method: What tool or system will you use as a reliable and repeatable method to report on which components fall in scope of your cost standard? This might be an AWS native system such as Trusted Advisor, or your own bespoke automated solution.
Resolution method: What mechanism is used to purge the items which you have discovered? Outline the solution which resolves findings. We’ll hear more on this subject in subsequent articles in this series.
Report location and method: Ensure everyone knows how costs savings are reported on and where the data can be found. This is critical to ensure that everyone can see what a great job you are doing at saving your organisation money!
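To make those fields concrete, here is one illustrative way to capture a standard and its metadata as structured data so it can be tracked (or even driven) from code. The values are made-up examples, not a prescribed schema.

```python
# Illustrative only: one cost control standard recorded as structured metadata.
unattached_ebs_standard = {
    "name": "Delete unattached EBS volumes",
    "description": "EBS volumes in the 'available' state for more than 14 days "
                   "are considered waste and should be removed.",
    "discovery_method": "Weekly Lambda function listing volumes with status=available",
    "resolution_method": "Automated deletion after owner notification and a grace period",
    "report_location": "Monthly cost savings dashboard maintained by the FinOps team",
}
```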
So there you have it. This is not an exhaustive list of cost control standards by any means — you’ll find more depending on the services you are using, of course. It is much easier to identify and measure your cost-saving efforts if you have clear cloud cost control standards to hold yourself and your fellow cloud users to. Thanks for reading!
Part 2 will follow soon. We will examine practical and automated methods of using native AWS systems for cost control.