Part III: EMR - The Ultimate Guide to Saving Money with AWS Reserved "Anything"

* Price calculations using AWS Price List API

Recently I’ve written posts about saving money by purchasing AWS Reserved capacity. In two previous articles, I wrote about EC2 and RDS, but there are other services that can save you a lot of money if you go the Reserved route. In the third article in this series, I take a look at a potentially very expensive service: EMR.

Services like EMR make it possible to analyze large amounts of data, which is the core of many applications that run on AWS. One problem is that analyzing a large volume of data intrinsically requires significant compute resources. That’s why launching an EMR cluster can easily result in a very high AWS cost - I’m talking several thousands of dollars worth of compute infrastructure per month.

This makes it crucial to understand the cost savings opportunities, particularly Reserved instances. But first let’s take a look at how EMR pricing works. I’ll focus on compute cost, since that’s the portion that can be optimized with Reserved purchases.

EMR Pricing

AWS Elastic MapReduce is a managed service that supports a number of tools used for Big Data analysis, such as Hadoop, Spark, Hive, Presto, Pig and others. EMR basically automates the launch and management of EC2 instances that come pre-loaded with software for data analysis. With EMR, you can access data stored in compute nodes (e.g. HDFS) or in external data storage systems, such as S3.

There are two main pricing components in EMR:

EC2 compute. Since EMR launches EC2 instances, you pay the same compute price as any other EC2 deployment: per-second compute time based on instance type. You don’t pay operating system fees, since EMR instances run on Amazon Linux. You don’t pay license fees either, since the software that runs on EMR is open source; the only exceptions are some MapR distributions.
EMR fee. This is a management fee that EMR charges on top of EC2, based on the EC2 instance type and the compute time consumed in the cluster. It is also billed per second and amounts to approximately 20% of the EC2 compute cost.
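
To make the two components concrete, here is a minimal Python sketch that adds the EC2 price and the EMR fee for a cluster billed per second. The hourly prices and the function name are hypothetical placeholders, not actual AWS prices; real values depend on region and instance type.

```python
SECONDS_PER_HOUR = 3600


def emr_compute_cost(ec2_hourly, emr_fee_hourly, node_count, seconds_running):
    """Return (ec2_cost, emr_fee, total) for a cluster billed per second."""
    hours = seconds_running / SECONDS_PER_HOUR
    ec2_cost = ec2_hourly * node_count * hours
    emr_fee = emr_fee_hourly * node_count * hours
    return ec2_cost, emr_fee, ec2_cost + emr_fee


# Hypothetical prices: $0.20/hour for EC2 and $0.04/hour for the EMR fee
# (20% of the EC2 price), 10 nodes running for a 730-hour month.
ec2_cost, emr_fee, total = emr_compute_cost(0.20, 0.04, 10, 730 * SECONDS_PER_HOUR)
print(f"EC2: ${ec2_cost:,.2f}  EMR fee: ${emr_fee:,.2f}  total: ${total:,.2f}")
```

Only the EC2 part of that total can be reduced with a Reserved purchase; the EMR fee is charged at the same rate either way.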

The chart below shows the On Demand EMR and EC2 monthly cost for some common EC2 instances supported in EMR (this is just a subset of instances as there are over 350 supported instance types in EMR):

In addition to compute fees, you also pay for the usual EC2 price dimensions, such as EBS storage and data transfer (out to the internet, inter-region and intra-region). Even though price dimensions such as data storage and data transfer can be significant, only compute fees are relevant when it comes to optimizing cost with Reserved purchases.

EMR Reserved Options

Terms and Payment Options

EMR doesn’t offer Reserved purchases, per se. If you need to make a Reserved purchase for your EMR cluster, you actually have to make an EC2 Reserved purchase. This means you have access to the same options available to any EC2 Reserved decision. The main difference compared to EC2 is that you have to pay the EMR fee associated with the selected EC2 instance type, as mentioned above.

You can purchase reserved compute capacity for 1-year and 3-year terms, and you save more money by committing to a 3-year term. Below you can see the savings for different instance types and terms (don’t forget to hover on the charts to see more details).

The following payment options are available for EC2 instances in EMR:

All Upfront. A one-time payment covering the full term (1 year or 3 years). This is the option where you save the most money.
Partial Upfront. You pay a portion of the full amount upfront and the rest monthly.
No Upfront. You commit to the full term and pay a fee every month, with nothing due upfront.

You save more money over the whole term if you pay All Upfront. Just like other services that support Reserved pricing, sometimes the difference between No Upfront vs. All Upfront isn’t substantial (3%-5%). In those cases, it might be worth it to choose No Upfront and pay monthly instead of spending a potentially large amount of money on day one.
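
A quick way to check whether the gap between payment options matters for your budget is to total each option over the term. The sketch below does that in Python; the quotes are hypothetical numbers chosen for illustration, not real AWS prices, and the helper names are my own.

```python
def total_term_cost(upfront, monthly, term_months):
    """Total paid over the whole term for one Reserved instance."""
    return upfront + monthly * term_months


def extra_cost_pct(option_total, baseline_total):
    """How much more an option costs than the baseline, as a percentage."""
    return (option_total - baseline_total) / baseline_total * 100


# Hypothetical 1-year quotes for a single instance (not real AWS prices).
TERM_MONTHS = 12
all_upfront = total_term_cost(upfront=1000.0, monthly=0.0, term_months=TERM_MONTHS)
partial_upfront = total_term_cost(upfront=510.0, monthly=42.5, term_months=TERM_MONTHS)
no_upfront = total_term_cost(upfront=0.0, monthly=87.0, term_months=TERM_MONTHS)

for name, total in [("Partial Upfront", partial_upfront), ("No Upfront", no_upfront)]:
    print(f"{name}: ${total:,.2f} ({extra_cost_pct(total, all_upfront):.1f}% more than All Upfront)")
```

With these made-up quotes, No Upfront costs a few percent more over the year, which is the kind of gap where paying monthly may be the more comfortable choice.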

Given that EMR clusters are typically used to execute heavy computational tasks, they tend to need powerful EC2 instances and multiple compute nodes. This means going All Upfront can easily turn into a big investment worth several thousands of dollars.

Below is a comparison of the different purchase options, including the EMR fee. You can choose any of the supported EMR instances as well as AWS region in the drop-down:

Standard vs. Convertible, Instance Size Flexibility and Scope

Since EMR uses the same parameters as EC2 for its instances, you can select either the Standard or the Convertible Offering Class.

Standard means you have to keep a particular instance type for the duration of the Reserved purchase period (1 year or 3 years). This is the default option.
Convertible gives you the flexibility to switch to a different instance family and/or size. You save less money if you choose this option.

Instance Size Flexibility means your Reserved discount is applied to instances of the same family, proportional to their size. For example, a Reserved discount for one m5.4xlarge can be applied to 4 m5.xlarge EC2 instances. This feature is available to both Standard and Convertible offering classes.
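
The proportionality is based on the normalization factors that AWS publishes for each instance size. Here is a small Python sketch of the arithmetic; the table is only a subset of the published sizes and the function name is hypothetical.

```python
# Subset of the normalization factors AWS publishes for instance size flexibility.
NORMALIZATION_FACTOR = {
    "nano": 0.25, "micro": 0.5, "small": 1, "medium": 2, "large": 4,
    "xlarge": 8, "2xlarge": 16, "4xlarge": 32, "8xlarge": 64,
}


def covered_instances(reserved_type, reserved_count, running_type):
    """How many running instances a reservation covers (may be fractional)."""
    reserved_family, reserved_size = reserved_type.split(".")
    running_family, running_size = running_type.split(".")
    if reserved_family != running_family:
        return 0.0  # size flexibility only applies within one instance family
    reserved_units = NORMALIZATION_FACTOR[reserved_size] * reserved_count
    return reserved_units / NORMALIZATION_FACTOR[running_size]


print(covered_instances("m5.4xlarge", 1, "m5.xlarge"))  # 4.0
```

A fractional result means partial coverage: one reserved m5.xlarge covers half of a running m5.2xlarge, and the other half is billed On Demand.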

Scope is another variable to consider. It doesn’t affect Reserved pricing, but it affects the priority you’re given when launching new EC2 instances and how your Reserved discount is applied:

Regional. The Reserved discount gets assigned to any applicable EC2 instance within an AWS region, but there is no guaranteed compute capacity. If there’s a capacity shortage for a particular EC2 instance type in a given region, you might have to wait until compute capacity is available. This restriction is typically not an issue for applications that execute asynchronous data processing tasks, and it’s rare for EC2 to run out of compute resources. This is the default option and, in most cases, the one that I recommend.
Availability Zone. You get guaranteed EC2 capacity within an AZ that you specify, but you can’t update your EC2 instance size without losing your Reserved discount. You gain certainty that compute resources will be provisioned when you need them, in exchange for flexibility. I’d recommend this option only for very specific use cases that include strict availability and compute capacity requirements.

AWS Region Comparison

EMR Reserved cost, as well as the percentage of savings compared to On Demand, vary by AWS Region. The following chart compares EMR Reserved vs. On Demand across a number of frequently used regions for Standard All Upfront.

Feel free to select different instance types from the drop-down. You can also hover on the chart to see more details.

Visualizing All Options

The chart below compares total EMR compute cost when purchasing Reserved instances and presents different Purchase Options (All Upfront, Partial Upfront, No Upfront) and terms (1 year vs. 3 years). You can select a different instance type, region and term.

The chart displays how cost accumulates throughout your Reserved purchase period and it shows Months to Recover (MtR - how much time you have to wait before you start to see savings compared to On Demand). You can hover over the chart to see more details.
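
Months to Recover can also be estimated by hand. The Python sketch below assumes MtR is the first whole month in which cumulative On Demand cost reaches cumulative Reserved cost; that reading of the metric, the function name and the prices are assumptions for illustration. The EMR fee is left out because it is identical on both sides and cancels.

```python
import math


def months_to_recover(upfront, reserved_monthly, on_demand_monthly):
    """First whole month in which cumulative On Demand cost reaches or
    exceeds cumulative Reserved cost. Returns None if that never happens."""
    monthly_saving = on_demand_monthly - reserved_monthly
    if monthly_saving <= 0:
        return None  # the reservation is never cheaper than On Demand
    return math.ceil(upfront / monthly_saving)


# Hypothetical EC2 prices for one instance: On Demand costs $146 per month.
print(months_to_recover(upfront=1000.0, reserved_monthly=0.0, on_demand_monthly=146.0))  # 7
print(months_to_recover(upfront=510.0, reserved_monthly=42.5, on_demand_monthly=146.0))  # 5
print(months_to_recover(upfront=0.0, reserved_monthly=87.0, on_demand_monthly=146.0))    # 0
```

Under this definition a No Upfront purchase recovers immediately, while larger upfront payments take longer to break even but save more by the end of the term.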

One important thing to notice is that due to EMR’s compute fee (which doesn’t support Reserved), purchasing Reserved EC2 instances for an EMR cluster results in a lower savings percentage compared to a pure EC2 calculation. For example, while an EC2-only Reserved purchase can result in about 40% yearly savings, the same purchase in EMR can result in about 30% savings for the total compute cost. These numbers will of course vary depending on the region and instance type, but it’s something to consider when estimating Reserved savings in EMR.
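
The dilution follows directly from the arithmetic: the discount applies only to the EC2 share of the bill. Here is a short Python sketch, assuming the EMR fee is expressed as a fraction of the On Demand EC2 price; the function name and the fee ratios are illustrative assumptions.

```python
def blended_savings_pct(ec2_savings_pct, emr_fee_ratio):
    """Savings on total compute cost (EC2 + EMR fee) given EC2-only savings.

    emr_fee_ratio is the EMR fee as a fraction of the On Demand EC2 price,
    e.g. 0.20 for a fee equal to 20% of the EC2 price.
    """
    # On Demand total = ec2 * (1 + ratio); the saved amount is ec2 * savings.
    return ec2_savings_pct / (1 + emr_fee_ratio)


print(round(blended_savings_pct(40, 0.20), 1))  # 33.3
print(round(blended_savings_pct(40, 0.25), 1))  # 32.0
```

The higher the EMR fee relative to the EC2 price, the further the overall savings percentage drops below the EC2-only figure.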

How to find an optimal Reserved instance type for your EMR deployments

The steps for EMR are very similar to those for EC2, with a few differences. Below are some special considerations that apply to EMR:

Estimate resource consumption patterns based on deployed EMR applications

EMR supports a number of widely used open source data analysis projects, such as Hadoop, Spark, Hive, HBase, MXNet, Pig, Presto, Tez and others. These tools have their own resource consumption patterns. For example, some frameworks are memory-intensive, while others are compute-intensive. Some require fast access to data stored in local disk storage and others will require high network throughput.

I won’t get into details regarding each application’s resource consumption in this article, but it’s important to know that different instance types will be optimal for different resource consumption patterns. For example, you might find that an R5 instance is a better fit for a memory-intensive application, that an I3 is optimal for applications that need high disk I/O performance, or that an M5 is a better fit for a workload balanced between compute and memory.

Whether you’re deploying a new cluster or working with an existing one, CloudWatch metrics will provide visibility into the different resource consumption patterns.
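
As one possible starting point, the Python sketch below builds a request for an EMR cluster metric and, if boto3 and AWS credentials are available, fetches it from CloudWatch. The cluster ID is a hypothetical placeholder, and the 14-day window and hourly period are arbitrary choices, not recommendations.

```python
from datetime import datetime, timedelta, timezone


def emr_metric_request(cluster_id, metric_name, days=14, period_seconds=3600):
    """Build keyword arguments for CloudWatch get_metric_statistics."""
    end = datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/ElasticMapReduce",
        "MetricName": metric_name,
        "Dimensions": [{"Name": "JobFlowId", "Value": cluster_id}],
        "StartTime": end - timedelta(days=days),
        "EndTime": end,
        "Period": period_seconds,
        "Statistics": ["Average", "Maximum"],
    }


def fetch_emr_metric(cluster_id, metric_name):
    """Return datapoints sorted by time. Needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the request builder works without it

    cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(
        **emr_metric_request(cluster_id, metric_name)
    )
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])


# "j-EXAMPLE" is a placeholder; use your own cluster ID.
request = emr_metric_request("j-EXAMPLE", "YARNMemoryAvailablePercentage")
print(request["Namespace"], request["MetricName"])
```

Consistently low available memory would point toward memory-optimized instances, while plenty of headroom suggests the cluster could run on smaller or fewer nodes.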

Understand data sources and how they affect EMR cluster elasticity

EMR can analyze data stored in multiple data sources, such as S3, DynamoDB, RDS, Redshift, and Hadoop Distributed File System (inside the EMR cluster or hosted elsewhere) as well as any data sources supported by EMR applications. Where data under analysis is stored will impact system requirements such as memory consumption, disk I/O or network throughput.

Data storage will also impact the elasticity of your EMR cluster. For example, it’s harder to adjust the size of an EMR cluster that also stores data in its own HDFS. This is because adjusting the size of an HDFS often requires re-loading or re-distributing data across the whole cluster. On the other hand, an EMR cluster that accesses data stored in S3 is easier to expand or reduce in size according to usage. This will significantly affect the number of compute hours your EMR cluster will incur in your monthly AWS bill.

The main trade-off is performance. Typically, storing data in the cluster HDFS delivers better performance, but it makes cluster elasticity (and therefore cost savings) more difficult to achieve.

Configure Auto Scaling and calculate baseline monthly usage of the EMR cluster

Once you determine the elasticity requirements and capabilities in your EMR cluster, it’s time to configure EMR Automatic Scaling and Application Auto Scaling Scheduled Actions. Keep in mind these two AWS features are not exactly EC2 Auto Scaling, but they achieve the same result.

Based on the ability to expand or shrink an EMR cluster and the number of nodes that will be deployed in the cluster, you will have to calculate the baseline number of monthly compute hours in your EMR cluster.
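
One way to approach this calculation is to start from the number of nodes running in each hour and see how well a given Reserved count would be used. The Python sketch below uses a hypothetical usage pattern; the function names and numbers are illustrative only.

```python
def baseline_nodes(hourly_node_counts):
    """Number of nodes that are running in every sampled hour."""
    return min(hourly_node_counts, default=0)


def reserved_utilization(hourly_node_counts, reserved_nodes):
    """Fraction of reserved node-hours that would actually be used."""
    if reserved_nodes == 0 or not hourly_node_counts:
        return 0.0
    used = sum(min(count, reserved_nodes) for count in hourly_node_counts)
    return used / (reserved_nodes * len(hourly_node_counts))


# Hypothetical pattern: 4 nodes for 16 hours a day, scaled out to 10 nodes
# during an 8-hour batch window, repeated for a 30-day month.
month = ([4] * 16 + [10] * 8) * 30

baseline = baseline_nodes(month)
print("baseline nodes:", baseline)                                    # 4
print("baseline compute hours:", baseline * len(month))               # 2880
print("utilization at 10 reserved:", reserved_utilization(month, 10)) # 0.6
```

In this example, reserving the 4 baseline nodes keeps the reservation fully used, while reserving all 10 would leave 40% of the paid capacity idle.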

Since EMR and EC2 Reserved steps are very similar, I recommend taking a look at the steps described in the first article of this series, regarding EC2 Reserved.

Conclusions

When buying EMR Reserved, you’re actually purchasing EC2 Reserved instances, therefore the rules for EC2 Reserved are applicable (e.g. Instance Size Flexibility, Standard vs. Convertible, Scope).
EMR has a per-second compute fee in addition to EC2 fees. It amounts to approximately 20% of the total compute cost and it doesn’t support Reserved pricing. As a result, when committing to a 1-year or 3-year Reserved purchase for an EMR cluster, the percentage of savings over the whole period is lower than for a traditional EC2 Reserved purchase. For example, where a pure EC2 Reserved purchase would save you 40%, once you include the EMR fee you’d be looking at around 30%. The final numbers vary according to instance types and regions.
EMR clusters generally don’t incur license fees, since they run on Amazon Linux OS and open source software. The only exception is some MapR distributions, which are supported in EMR for an extra fee.
It’s easier to save money in EMR if your data is stored externally to the EMR cluster (e.g. S3). This allows you to shrink/expand your cluster based on usage or pre-configured schedules. Storing data in the HDFS makes it more difficult to update the cluster size since data typically has to be reloaded or redistributed across the cluster, which can be cumbersome. A trade-off is that performance is typically better when data is accessed in the local cluster HDFS.
Choosing the right Reserved purchase for your EMR cluster includes a number of steps, such as identifying the optimal data storage, resource consumption, instance type, usage patterns, Reserved options and Auto Scaling configurations.
