* Price calculations using AWS Price List API
First, what is Amazon SageMaker?
Amazon SageMaker is a group of AWS services covering the main areas of Machine Learning and AI: ML model and Gen AI application development, data analytics/processing, and governance. Amazon SageMaker AI is the managed service focused on automating the full cycle of implementing and deploying Machine Learning models. It provides features to launch compute and data storage infrastructure for selecting, building, training, deploying and using ML models to execute predictions, as well as a number of tools to access custom and public ML models. Developers can bring their own ML algorithms or select existing ones from the extensive library made available by SageMaker. In addition to the expected SDK interfaces, the service also provides very useful Graphical User Interface tools for building, training, deploying and accessing ML models, regardless of team members’ technical skillset.
I think it’s fair to say that SageMaker offers the whole range of tools both engineering and non-technical teams need in order to develop, evaluate, use and maintain Machine Learning technology for live production systems.
How AWS pricing works in SageMaker AI
As with any AWS service, SageMaker cost is pay-as-you-go and driven by the cloud resources consumed by the executed tasks. In this case, these are compute capacity, data processing and data storage. Below are the main areas that result in SageMaker cost.
So, let’s start with what’s typically the top usage type by cost…
Instance Types managed by SageMaker
SageMaker provides EC2 compute capacity for a wide range of ML tasks. This means developers can launch a large number and variety of EC2 instances managed by SageMaker, depending on the task at hand. For example: Notebooks (development and evaluation tasks), Processing (pre/post and model evaluation tasks), Data Wrangler (data aggregation and processing tasks), Training (run ML training jobs), Real-Time Inference (hosting real-time predictions), Asynchronous Inference (asynchronous predictions, as opposed to real-time), Batch Transform (data processing in batches) and Jumpstart (launches Training and Inference instances to evaluate models available from a public library managed by SageMaker).
When it comes to EC2-based compute capacity in SageMaker, application owners pay per second, depending on the instance family and size and how long the instance is left running. Machine Learning processes are usually compute intensive and long-running, which means compute cost is the main expense for most applications that rely on ML algorithms.
It’s important to note that SageMaker introduces a cost premium compared to regular EC2 instances. For most instance types, SageMaker cost is approximately 20% higher than the equivalent instance launched directly in EC2. In the chart below, select any instance and component type and you will see the comparison between the two services:
Even though cost is approximately 20% higher in SageMaker compared to EC2, in most cases this cost difference is compensated by the savings in engineering time that SageMaker delivers. That being said, it’s important to evaluate situations where applicable tasks can be deployed in instances 100% managed by EC2, especially for compute-intensive processes that result in high compute usage.
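To put that premium in perspective, here is a minimal sketch of the monthly difference for an instance running 24/7. The hourly price is a hypothetical placeholder, not a published AWS rate; only the ~20% premium comes from the comparison above.

```python
# Rough sketch of the SageMaker vs. plain EC2 cost premium described above.
# The hourly price is a hypothetical placeholder, not an actual AWS rate.
HOURS_PER_MONTH = 730

def monthly_cost(hourly_price: float, instance_count: int = 1) -> float:
    """Monthly on-demand cost for instances running 24/7."""
    return hourly_price * HOURS_PER_MONTH * instance_count

ec2_hourly = 3.06                     # hypothetical EC2 on-demand $/hour
sagemaker_hourly = ec2_hourly * 1.20  # ~20% SageMaker premium, per the text

premium = monthly_cost(sagemaker_hourly) - monthly_cost(ec2_hourly)
print(f"Monthly premium per instance: ${premium:,.2f}")
```

For long-running, compute-heavy jobs this per-instance difference scales linearly with fleet size, which is why evaluating plain EC2 for applicable tasks can pay off.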
Instance Families are a very important factor to consider when it comes to performance and price. Some examples of Instance Families are: Standard (e.g. t2, t3, m5, m5d, m6g, m6gd, m7i), Compute Optimized (e.g. c6g, c6gd, c6gn, c7g, c7i), Memory Optimized (e.g. r5, r5d, r6g, r7i), Training (e.g. trn1, trn1n), Inference (e.g. inf1, inf2), and Accelerated Computing / GPU (e.g. g4, g4dn, g5, p2, p3, p4d, p5). Each Instance Family is optimized for specific compute parameters, such as vCPUs, memory, GPU (Graphics Processing Unit), AWS Accelerators (Trainium, Inferentia), network bandwidth and EBS bandwidth.
It’s important to determine the right instance size and family based on the ML job to be performed. In cases where ML tasks have intensive and specific compute requirements it might make sense to choose a more expensive instance family, while other tasks could be completed efficiently with a lower cost instance family. Given that some instance families (e.g. p2, p3 instances) can be 5x-10x more expensive compared to the lowest cost ones, it is important to deploy the right balance of performance, availability and cost for the task at hand. In order to achieve this goal, monitor CloudWatch metrics on a regular basis, in areas such as CPU/memory utilization, latency, network, invocations and errors.
The chart below displays a comparison regarding similar instance sizes, but different instance families.
Stopping or terminating instances that are not in use is a basic cost reduction strategy that is particularly important when working with SageMaker infrastructure; this is critical when using very large and powerful instances, such as 24xlarge, 32xlarge or 48xlarge, which can quickly reach many thousands of dollars in compute cost.
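A quick back-of-the-envelope calculation shows why this matters. The hourly price and idle fraction below are hypothetical assumptions for illustration, roughly in the range of a very large GPU instance:

```python
# Back-of-the-envelope cost of leaving a large instance running unused.
# Hourly price and idle fraction are hypothetical placeholders.
HOURS_PER_MONTH = 730

hourly_price = 37.0   # hypothetical on-demand $/hour for a 24xlarge-class instance
idle_fraction = 0.6   # fraction of the month the instance sits unused

wasted = hourly_price * HOURS_PER_MONTH * idle_fraction
print(f"Estimated monthly waste: ${wasted:,.2f}")
```

Even a single forgotten instance of this class can dominate a monthly bill, which is why automatic shutdown policies are worth the setup effort.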
Component Types are an important configuration in SageMaker, from a functional perspective. However, in most cases there’s no difference in compute cost across different Component Types. This means Instance family and size are the main factors to consider when comparing cost for SageMaker instances. Below you can see a comparison among different SageMaker component types:
Instances managed by SageMaker Studio can be configured to automatically shut down after a pre-configured idle period of time. This is particularly useful for large development teams and for applications that require significant compute capacity during the development stages.
SageMaker Auto Scaling is also a feature that can save significant money when hosting an inference endpoint. It can be configured to add or remove instances based on available CloudWatch metrics, such as the ones related to invocations per instance or CPU/Memory utilization. It can also be configured to add or decrease compute capacity based on a schedule, which is a useful feature for workloads that have a predictable schedule pattern.
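As a sketch of what such a configuration looks like, below are the parameters you would pass to Application Auto Scaling’s put_scaling_policy API (via boto3, not imported here) for a target-tracking policy on an inference endpoint. The field names are standard Application Auto Scaling fields; the endpoint name, target value and cooldowns are illustrative assumptions, not a tested production configuration:

```python
# Sketch of a target-tracking scaling policy for a SageMaker inference endpoint,
# expressed as the parameters for Application Auto Scaling's put_scaling_policy.
# Endpoint name, target value and cooldowns are illustrative assumptions.
endpoint_name = "my-endpoint"   # hypothetical endpoint
variant_name = "AllTraffic"

scaling_policy = {
    "PolicyName": "invocations-per-instance-target",
    "ServiceNamespace": "sagemaker",
    "ResourceId": f"endpoint/{endpoint_name}/variant/{variant_name}",
    "ScalableDimension": "sagemaker:variant:DesiredInstanceCount",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        # Predefined metric: average invocations per instance per minute.
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "TargetValue": 700.0,    # illustrative target load per instance
        "ScaleInCooldown": 600,  # seconds to wait before removing capacity
        "ScaleOutCooldown": 300, # seconds to wait before adding capacity
    },
}
print(scaling_policy["ResourceId"])
```

A longer scale-in cooldown than scale-out cooldown is a common choice: it removes capacity cautiously while still reacting quickly to traffic spikes.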
When building and deploying ML models, using a feature such as SageMaker Neo can deliver significant performance optimizations for inference tasks. AWS claims performance optimizations of up to 25x when using Neo, which can result in considerable cost savings for real-time and asynchronous inference tasks. When using Neo, it is important to deploy models to applicable instance families and in some cases to execute compilation tasks according to the selected instance family.
Serverless Inference Compute
In addition to EC2 instances, SageMaker offers Serverless Inference, which deploys ML models and makes them available to client applications without launching EC2 instances. This feature is a good alternative for applications with a non-consistent workload, low volume, steep spikes or long idle periods, since it offers application owners an alternative to paying for EC2 compute capacity that would not scale fast enough, or that could potentially sit idle or under-utilized for long periods of time. If configured properly, it can also be a good solution for high volume workloads.
Serverless Inference uses AWS Lambda components and is very similar to Lambda functions in the sense that developers configure memory allocation for a serverless endpoint and pay according to the number of execution seconds spent by the endpoint as a result of task processing. In addition, Serverless Inference also incurs cost based on the amount of data ingestion and retrieval.
Similar to AWS Lambda, developers can also configure Provisioned Concurrency, which consistently allocates a baseline of compute capacity to a serverless endpoint and prevents latencies as a result of cold starts, which are a common situation in Lambda compute infrastructure. Application owners pay for the number of units and minutes of Provisioned Concurrency and for the amount of processing time (as a result of requests and their duration) when Provisioned Concurrency is enabled.
In the chart below, you can select different scenarios for Serverless Inference and select different percentages of allocated Provisioned Concurrency and request rates. The chart displays a comparison between Provisioned and On Demand for the selected request rate and only for the amount of time covered by the selected percentage in a given month (i.e. if 10% is selected, it only calculates Provisioned usage and cost for that amount of time in one month). The chart calculates the number of Provisioned Concurrency Units required to cover the selected usage, based on the request rate and the average duration.
It’s important to note that in many cases Provisioned Concurrency can actually lower cost if allocated in the right amount during periods of high utilization. The optimal amount of Provisioned Concurrency Units depends on the request rate and average duration of each request, in order to ensure all requests are covered. For example, if an application has 5 requests per second and each request takes 1 second, it will require at least 5 Provisioned Concurrency Units to cover all requests during that period. If only 1 Unit is provisioned, 4 requests per second will be assigned on-demand capacity and only 1 will be assigned Provisioned Concurrency. Therefore, it’s important to calculate the right amount of Provisioned Concurrency Units based on usage volume.
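The sizing rule in the example above is simply request rate times average duration, rounded up. A minimal sketch:

```python
import math

# Minimum Provisioned Concurrency units needed to absorb a steady request rate,
# following the example above: rate x average duration, rounded up.
def required_pc_units(requests_per_second: float, avg_duration_seconds: float) -> int:
    return math.ceil(requests_per_second * avg_duration_seconds)

print(required_pc_units(5, 1.0))    # the 5 req/s x 1 s example from the text -> 5
print(required_pc_units(10, 0.25))  # 10 req/s x 250 ms -> 3
```

Real traffic is rarely perfectly steady, so sizing against the sustained peak rate (rather than the average) is the safer assumption when cold-start latency matters.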
Memory allocation in Serverless Inference is a very important parameter to consider, given that it can increase cost significantly if it’s over-allocated. However, the right amount of memory allocation can reduce execution times and compensate for the additional memory cost and even result in cheaper Serverless Inference cost. Therefore, it’s very important to execute load tests to determine the right amount of memory allocation, based on the expected usage volume.
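To illustrate the trade-off, serverless compute is billed roughly in proportion to memory size times execution time. In the sketch below the unit price, durations and request volume are hypothetical assumptions; the point is that a larger memory allocation can end up cheaper if it shortens execution enough:

```python
# Illustration of the memory/duration trade-off for Serverless Inference.
# Unit price, durations and request volume are hypothetical placeholders.
PRICE_PER_GB_SECOND = 0.00002  # hypothetical unit price

def monthly_compute_cost(memory_gb, avg_duration_s, requests_per_month):
    return memory_gb * avg_duration_s * requests_per_month * PRICE_PER_GB_SECOND

requests = 10_000_000
small = monthly_compute_cost(2, 1.2, requests)  # 2 GB, slower per request
large = monthly_compute_cost(4, 0.5, requests)  # 4 GB, faster per request
print(f"2 GB: ${small:,.2f}  4 GB: ${large:,.2f}")
```

In this hypothetical, doubling memory more than halves the duration, so the larger allocation wins; only load tests can tell whether a given model behaves this way.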
The chart below displays a monthly cost comparison approximation based on memory allocation.
It is highly recommended to use the Amazon SageMaker Serverless Inference Benchmarking Toolkit, which helps developers find an optimal serverless configuration and compare it against a relevant hosting instance. Also, regularly monitoring CloudWatch metrics related to concurrency, invocations and memory utilization is a best practice in order to ensure the right capacity is allocated to serverless endpoints.
Instance Storage
Machine Learning instances require local storage in order to process data. The allocated amount is configurable and varies according to application needs. Storage cost varies according to the instance’s Component Type.
Depending on the number of instances and allocated storage, this price dimension can result in significant cost, especially considering that ML processes typically access large amounts of data that often need to be available in local storage for performance reasons. Also, it’s important to mention that even if an ML instance is stopped, storage cost is still incurred until the instance is terminated. It is common to encounter situations where important data is stored locally, which makes it impractical to terminate certain ML instances. Local SSD storage is an important aspect to measure when managing fleets of ML instances.
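A quick sketch of how stopped-but-not-terminated instances keep accruing storage cost. The per-GB price is a hypothetical placeholder; the fleet size and allocation are illustrative:

```python
# Stopped (not terminated) ML instances keep paying for their attached storage.
# The per-GB price, fleet size and allocation are hypothetical placeholders.
PRICE_PER_GB_MONTH = 0.14  # hypothetical $/GB-month for ML instance storage

def fleet_storage_cost(instances: int, gb_per_instance: int) -> float:
    return instances * gb_per_instance * PRICE_PER_GB_MONTH

# e.g. 20 stopped instances, each with 500 GB allocated
print(f"${fleet_storage_cost(20, 500):,.2f} per month")
```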
The chart below displays different cost scenarios based on the configured Component/Volume Type.
Data Transfer and Data Processed
When hosting Machine Learning models, there is a cost associated with the volume of data processed IN and OUT, which is billed per GB. This is typically not one of the top cost items incurred in SageMaker. Data transferred within the same AWS region does not incur AWS cost. However, it’s always highly recommended to launch processing infrastructure in the same AWS region where data is stored, for cost and performance reasons.
Feature Store
Features are a key component in many Machine Learning applications. They store relevant information that is used by models as data inputs during training and inference tasks. SageMaker Feature Store provides a managed repository to store and retrieve this data and to make it available to ML models. Data is available in a feature catalog managed by AWS, which can be used by multiple models and tasks.
SageMaker Feature Store pricing is based on data storage plus the number of reads and writes as well as the amount of data in each read or write request. There is also a Provisioned Capacity mode for read and write requests. There are two types of data storage: standard and in-memory. Standard is charged per GB-month, while in-memory is charged per GB-hour.
Reads and writes are measured in Read/Write Capacity Units. For on-demand standard storage, reads are charged per 4KB of data and writes per 1KB. In-memory is charged per 1KB of data for both reads and writes. The in-memory online store also has a minimum charge equivalent to 5GB of storage per hour. Data Transfer between Feature Store and other AWS services within the same region does not incur cost. However, data transferred across different AWS regions does incur cost (not included in the calculations below).
The charts below show the monthly cost for two on-demand configurations (standard and in-memory), assuming storage is constant for the whole month.
It is important to note how in-memory storage can quickly result in thousands of dollars per month, since it’s charged by the hour and the minimum storage charge is 5GB per hour. Therefore, it is critical to implement processes that constantly adjust the amount of data stored in-memory.
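To see how the hourly minimum compounds, here is a minimal sketch of the monthly floor cost of the in-memory online store. The unit price is a hypothetical placeholder; the 5GB minimum comes from the text above:

```python
# Monthly floor cost of the in-memory online store: billed per GB-hour with a
# 5 GB per-hour minimum, so cost accrues even when little data is stored.
# The unit price is a hypothetical placeholder, not an actual AWS rate.
PRICE_PER_GB_HOUR = 0.30  # hypothetical in-memory $/GB-hour
MIN_GB = 5
HOURS_PER_MONTH = 730

def in_memory_storage_cost(stored_gb: float) -> float:
    billable_gb = max(stored_gb, MIN_GB)
    return billable_gb * PRICE_PER_GB_HOUR * HOURS_PER_MONTH

print(f"Storing 1 GB still costs ${in_memory_storage_cost(1):,.2f} per month")
```

Because the minimum applies every hour, trimming stale features out of the in-memory store only saves money once the stored volume actually exceeds the floor.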
Price Differences Across Regions
The chart below displays a cost comparison across a subset of relevant AWS regions. There are regions where cost is >60% more expensive compared to the lower cost ones. Regions such as N. Virginia, Ohio and Oregon are the best options, from a cost perspective. Therefore, it is essential to select the right region for an ML workload and data storage, given potential higher cost in some regions and the fact that there can be substantial charges related to inter-region data transfer.
This article contains more details about relevant differences across AWS regions.
Differences between SageMaker Savings Plans and On Demand
SageMaker Savings Plans are similar to Compute or EC2 Savings Plans. They offer a discount based on an hourly spending commitment. They deliver savings that can range from approximately 25% to as much as 65%, depending on the component and instance type as well as the payment option (no upfront, partial upfront, all upfront) and commitment period (1 year or 3 years). In SageMaker, Savings Plans only apply to compute capacity deployed on EC2 infrastructure; they don’t yet support other compute options, such as Serverless Inference.
When choosing Savings Plans, it is important to note that because the commitment is applied on an hourly basis, they’re only a good fit for deployments that are constantly active with a minimum capacity at least equal to the hourly dollar commitment. For applications that launch compute components without consistent capacity, it is important to calculate the right amount of Savings Plans hourly commitment based on the minimum amount of deployed compute capacity. If an application has several hours without any compute capacity deployed, then Savings Plans are not necessarily a good fit.
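The commitment-sizing logic above can be sketched as a simple simulation. The hourly spend figures and the blended discount are hypothetical assumptions; the mechanics (each dollar of commitment covers more than a dollar of on-demand usage, but is paid every hour even when idle) follow the description above:

```python
# Sizing a Savings Plans hourly commitment against variable load.
# Hourly spend figures and the blended discount are hypothetical assumptions.
SP_DISCOUNT = 0.40  # hypothetical blended Savings Plans discount

def sp_vs_on_demand(hourly_spend, commitment):
    """Return (cost with Savings Plans, pure on-demand cost) for sample hours.

    Each $1/hour of commitment covers $1/(1 - discount) of on-demand usage;
    the commitment is paid every hour, even when usage is zero.
    """
    covered_capacity = commitment / (1 - SP_DISCOUNT)
    sp_total = 0.0
    for spend in hourly_spend:
        overflow = max(spend - covered_capacity, 0.0)
        sp_total += commitment + overflow  # committed spend + on-demand overflow
    return sp_total, sum(hourly_spend)

# Variable load: busy hours plus one idle hour
usage = [5.0, 4.0, 3.0, 0.0]
sp_total, od_total = sp_vs_on_demand(usage, commitment=2.0)
print(f"SP: ${sp_total:.2f}  On-demand: ${od_total:.2f}")
```

With the commitment sized near the minimum sustained spend, the plan still comes out ahead despite the idle hour; a larger commitment would shift more of that idle hour into wasted committed spend.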
Even though Savings Plans cover most instance types, there are some rare combinations of instance and component types that don’t support Savings Plans discounts. In the chart below, you can select instance and component types in relevant regions and see cost savings for 1 and 3 year commitment periods:
Spot Instances are also an option to consider for workloads that can handle instance termination at any point in time. In some cases, Spot Instances can offer up to 90% savings compared to On Demand pricing; however, jobs running on those instances can be interrupted at any time, with only a short notice, whenever EC2 reclaims the spare capacity. Therefore, they must be assigned to tasks that can handle this situation without affecting the end user experience or critical processes.
SageMaker does not support the equivalent of EC2 Reserved Instances, only Savings Plans. If you want to know more details about how to save money with Savings Plans in general, here is an article I wrote about this topic.
To Summarize - SageMaker Cost Savings Strategies
- Choose the right Instance Family for the type of processing requirements, given there can be substantial price differences for instances of similar size, but different families (in some cases, close to 10x).
- Ensure data storage capacity is configured optimally in EC2 instances (avoid over-provisioning SSD storage).
- Identify long-running simple tasks that could potentially be implemented with relatively low engineering effort in EC2 instances (20% compute cost savings). Make sure the added engineering cost doesn’t outweigh the potential EC2 savings.
- Evaluate Serverless Inference vs. EC2-based compute available in SageMaker. Take into consideration the usage patterns and required compute capacity. For example, heavy, long running tasks might be a better fit for SageMaker-managed EC2 instances vs. applications with steep spikes in usage for short periods of time (likely a better fit for Serverless Inference). Consider using Amazon SageMaker Serverless Inference Benchmarking Toolkit, which helps find optimal serverless configurations that can result in cost savings.
- If using Serverless Inference, evaluate usage patterns and consider configuring Provisioned Concurrency. If done properly for a usage pattern, this feature can improve performance and lower cost for Serverless Inference tasks.
- For real-time and asynchronous inference, consider using SageMaker Neo to optimize models according to the hosting instance family. This can significantly increase performance and lower cost for inference tasks.
- Configure automatic shutdown for SageMaker Studio instances after a pre-determined idle period of time. This is particularly important for organizations with a significant number of team members using SageMaker Studio.
- For inference endpoints, configuring Auto Scaling based on a schedule or usage metrics can optimize compute infrastructure cost.
- As a general rule, configuring CloudWatch Billing Alarms is a best practice. Depending on the organizational budget and expected cost, there should be multiple alarms configured for thresholds ranging from moderate to far above the expected AWS cost.
- Choose the right Savings Plans hourly spend commitment. Don’t commit to SP if there are long periods of time without deployed compute capacity. If load varies significantly, determine the minimum amount of required compute capacity, in order to determine the right Savings Plans commitment.
- Constant monitoring of CloudWatch Metrics is essential, in order to ensure the right capacity is allocated to either serverless or EC2-based processes. CloudWatch Dashboards and Alarms must be configured in order to identify over-provisioned deployments and to ensure application stability in general. Load tests can help find the right compute capacity allocation and identify high cost situations early in the release cycle.
- Iterate. Cost optimization best practices will likely require constant adjustments as the application features and usage evolve, which is an ongoing process through the application lifecycle.
Do you need help implementing cost-effective Machine Learning solutions in AWS? Or lowering your AWS cost?
Don’t overspend when implementing AWS Machine Learning solutions. If you need help lowering your AWS bill and keeping costs under control, I can save you thousands of dollars per month (I do it on a regular basis). Click on the button below to schedule a free consultation or use the contact form.
MiserBot, developed by Concurrency Labs, can help you monitor cost on a regular basis and easily identify the top usage types and resources by cost, including SageMaker components.