Part III: EMR - The Ultimate Guide to Saving Money with AWS Reserved "Anything"
* Price calculations using AWS Price List API
Recently I’ve written posts about saving money by purchasing AWS Reserved capacity. In two
previous articles, I wrote about EC2 and RDS, but there are other services that can save
you a lot of money if you go the Reserved route. In the third article in this series, I take
a look at a potentially very expensive service: EMR.
Services like EMR make it possible to analyze large amounts of data, which is the core of
many applications that run on AWS. One problem is that analyzing a large volume of data
intrinsically requires significant compute resources. As a result, launching an EMR cluster
can easily lead to a very high AWS bill: several thousand dollars’ worth
of compute infrastructure per month.
This makes it crucial to understand the cost savings opportunities, particularly Reserved
instances. But first let’s take a look at how EMR pricing works. I’ll focus on compute cost,
since that’s the portion that can be optimized with Reserved purchases.
EMR Pricing
Amazon EMR (Elastic MapReduce) is a managed service that supports a number of tools used for Big Data
analysis, such as Hadoop, Spark, Hive, Presto, Pig and others. EMR essentially automates the launch
and management of EC2 instances that come pre-loaded with software for data analysis. With
EMR, you can access data stored on the cluster’s own nodes (e.g. in HDFS) or in external data storage
systems, such as S3.
There are two main pricing components in EMR:
EC2 compute. Since EMR launches EC2 instances, you pay along the same compute price dimension
as in any other EC2 deployment: per-second compute time based on instance type. You don’t pay
operating system fees, since EMR instances run on Amazon Linux. You don’t pay license fees either, since
the software that runs on EMR is open source; the only exceptions are some MapR distributions.
EMR fee. This is a management fee that EMR charges based on the EC2 instance type and compute time
consumed in the cluster. It is also billed per second and it’s a fraction of the EC2 compute cost: it typically
accounts for approximately 20% of the total compute cost (EC2 plus EMR fee).
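To make the two components concrete, here is a minimal sketch of the On Demand monthly compute cost for a cluster. The hourly rates and node count are hypothetical placeholders (not real AWS prices), and 730 hours per month is just the usual approximation.

```python
HOURS_PER_MONTH = 730  # approximation: 8,760 hours per year / 12


def monthly_on_demand_cost(ec2_hourly, emr_hourly, nodes, hours=HOURS_PER_MONTH):
    """Return (ec2_cost, emr_fee, total) for `nodes` instances running `hours` per month."""
    ec2_cost = ec2_hourly * hours * nodes
    emr_fee = emr_hourly * hours * nodes
    return ec2_cost, emr_fee, ec2_cost + emr_fee


# Hypothetical rates, chosen only to keep the arithmetic easy to follow.
ec2_cost, emr_fee, total = monthly_on_demand_cost(ec2_hourly=1.00, emr_hourly=0.25, nodes=10)
print(f"EC2: ${ec2_cost:,.2f}  EMR fee: ${emr_fee:,.2f}  Total: ${total:,.2f}")
```

In practice you would replace the two hourly rates with the values published for your instance type and region.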
The chart below shows the On Demand EMR and EC2 monthly cost for some common EC2 instances
supported in EMR (this is just a subset of instances as there are over 350 supported instance
types in EMR):
In addition to compute fees, you also pay for the usual EC2 price dimensions, such as EBS storage
and data transfer (out to the internet, inter-region and intra-region). Even though price dimensions such as data storage
and data transfer can be significant, only compute fees are relevant when it comes to optimizing cost with Reserved purchases.
EMR Reserved Options
Terms and Payment Options
EMR doesn’t offer Reserved purchases, per se. If you need to make a Reserved purchase for your
EMR cluster, you actually have to make an EC2 Reserved purchase. This means you have access
to the same options available for any EC2 Reserved purchase. The main difference compared to
plain EC2 is that you still pay the EMR fee associated with the selected EC2 instance type, as
mentioned above.
You can purchase reserved compute capacity for 1-year and 3-year terms, and you save more
money by committing to a 3-year term. Below you can see the savings for different instance
types and terms (don’t forget to hover on the charts to see more details).
The following payment options are available for EC2 instances in EMR:
All Upfront. A one-time payment covering the full term (1 year or 3 years). This is the
option where you save the most money.
Partial Upfront. You pay a portion of the full amount upfront and the rest monthly.
No Upfront. You commit to the full term and pay a fee every month.
You save more money over the whole term if you pay All Upfront. Just like other services
that support Reserved pricing, sometimes the difference between No Upfront vs. All Upfront
isn’t substantial (3%-5%). In those cases, it might be worth it to choose No Upfront and
pay monthly instead of spending a potentially large amount of money on day one.
Given that EMR clusters are typically used to execute heavy computational tasks, they tend
to need powerful EC2 instances and multiple compute nodes. This means going All Upfront
can easily turn into a big investment worth several thousand dollars.
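A quick way to weigh the three payment options is to add up what each one costs over the full term, including the EMR fee that is charged regardless. The sketch below does that for a single instance; the quotes and the EMR hourly rate are hypothetical numbers I made up for illustration, not real AWS prices.

```python
HOURS_PER_MONTH = 730  # approximation: 8,760 hours per year / 12


def total_term_cost(upfront, monthly, term_months, emr_hourly):
    """Reserved EC2 payments over the term plus the (undiscounted) EMR fee."""
    emr_fee = emr_hourly * HOURS_PER_MONTH * term_months
    return upfront + monthly * term_months + emr_fee


# Hypothetical 1-year quotes for one instance (not real AWS prices).
quotes = {
    "All Upfront": {"upfront": 5000.0, "monthly": 0.0},
    "Partial Upfront": {"upfront": 2550.0, "monthly": 215.0},
    "No Upfront": {"upfront": 0.0, "monthly": 435.0},
}

costs = {
    name: total_term_cost(q["upfront"], q["monthly"], term_months=12, emr_hourly=0.25)
    for name, q in quotes.items()
}
cheapest = min(costs.values())
for name, cost in costs.items():
    print(f"{name}: ${cost:,.2f} ({cost / cheapest - 1:+.1%} vs. cheapest)")
```

If the gap between No Upfront and All Upfront turns out to be only a few percent, paying monthly may be the more comfortable choice.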
Below is a comparison of the different purchase options, including the EMR fee. You can
choose any of the supported EMR instances as well as AWS region in the drop-down:
Standard vs. Convertible, Instance Size Flexibility and Scope
Since EMR Reserved purchases use the same parameters as EC2 Reserved instances, you can select
either the Standard or Convertible Offering Class.
Standard means you have to keep a particular instance type for the duration of the Reserved
purchase period (1 year or 3 years). This is the default option.
Convertible gives you the flexibility to switch to a different instance family and/or
size. You save less money if you choose this option.
Instance Size Flexibility means your Reserved discount is applied to instances of the same
family, proportional to their size. For example, a Reserved discount for one m5.4xlarge can be applied to 4
m5.xlarge EC2 instances. This feature is available to both Standard and Convertible offering classes.
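The proportionality behind Instance Size Flexibility comes from the normalization factors AWS assigns to instance sizes. Here is a small sketch using a subset of those factors; the function name and the family check are my own illustration, not an AWS API.

```python
# Subset of the normalization factors AWS publishes per instance size.
NORMALIZATION_FACTOR = {
    "large": 4,
    "xlarge": 8,
    "2xlarge": 16,
    "4xlarge": 32,
    "8xlarge": 64,
}


def instances_covered(reserved_type, running_type):
    """How many `running_type` instances one `reserved_type` reservation covers."""
    reserved_family, reserved_size = reserved_type.split(".")
    running_family, running_size = running_type.split(".")
    if reserved_family != running_family:
        return 0.0  # size flexibility only applies within one instance family
    return NORMALIZATION_FACTOR[reserved_size] / NORMALIZATION_FACTOR[running_size]


print(instances_covered("m5.4xlarge", "m5.xlarge"))  # 4.0, the example in the text
print(instances_covered("m5.xlarge", "m5.4xlarge"))  # 0.25, i.e. a quarter of the larger instance
```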
Scope is another variable to consider. It doesn’t affect Reserved pricing, but it affects the
priority you’re given when launching new EC2 instances and how your Reserved discount is applied:
Regional. Reserved discount gets assigned to any applicable EC2 instance within an
AWS region, but there is no guaranteed compute capacity. If there’s a capacity shortage for
a particular EC2 instance type in a given region, you might have to wait until compute
capacity is available. This restriction is typically not an issue for applications that
execute asynchronous data processing tasks. It’s also rare for EC2 to run
out of compute capacity. This is the default option and, in most cases, the one I recommend.
Availability Zone. You get guaranteed EC2 capacity within an AZ that you specify,
but you can’t update your EC2 instance size without losing your Reserved discount. You gain
certainty that compute resources will be provisioned when you need them, in exchange for
flexibility. I’d recommend this option only for very specific use cases that include strict
availability and compute capacity requirements.
AWS Region Comparison
EMR Reserved cost and the percentage of savings compared to On Demand both vary by AWS
Region. The following chart compares EMR Reserved vs. On Demand across a number of frequently used regions
for Standard All Upfront.
Feel free to select different instance types from the drop-down. You can also hover on the chart
to see more details.
Visualizing All Options
The chart below compares total EMR compute cost when purchasing Reserved instances and
presents different Purchase Options (All Upfront, Partial Upfront, No Upfront) and terms
(1 year vs. 3 years). You can select a different instance type, region and term.
The chart displays how cost accumulates throughout your Reserved purchase period and it shows Months
to Recover (MtR - how much time you have to wait before you start to see savings compared to On Demand). You can hover
over the chart to see more details.
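Months to Recover can be approximated as the first month in which cumulative On Demand spend catches up with cumulative Reserved spend. This is one plausible way to compute it, not necessarily the exact method behind the chart, and the amounts below are hypothetical. The EMR fee is identical on both sides, so it cancels out and is left out of the calculation.

```python
import math


def months_to_recover(upfront, reserved_monthly, on_demand_monthly):
    """First whole month where cumulative Reserved cost <= cumulative On Demand cost.

    Returns None if the Reserved option never catches up.
    """
    monthly_saving = on_demand_monthly - reserved_monthly
    if monthly_saving <= 0:
        return None
    return math.ceil(upfront / monthly_saving)


# Hypothetical EC2-only amounts for one instance (not real AWS prices).
print(months_to_recover(upfront=5000.0, reserved_monthly=0.0, on_demand_monthly=730.0))
print(months_to_recover(upfront=2550.0, reserved_monthly=215.0, on_demand_monthly=730.0))
print(months_to_recover(upfront=0.0, reserved_monthly=435.0, on_demand_monthly=730.0))
```

With no upfront payment the result is 0, since you save from the first month; the larger the upfront share, the longer you wait.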
One important thing to notice is that, because the EMR fee doesn’t support Reserved pricing,
purchasing Reserved EC2 instances for an EMR cluster results in a lower savings percentage than
a pure EC2 calculation. For example, while an EC2-only Reserved purchase can result in about 40%
yearly savings, the same purchase in EMR can result in about 30% savings for the total compute cost. These
numbers will of course vary depending on the region and instance type, but it’s something
to consider when estimating Reserved savings in EMR.
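The dilution is simple arithmetic: the discount applies to the EC2 portion only, while the EMR fee stays at its On Demand rate. A minimal sketch, with hypothetical hourly rates picked to keep the numbers round:

```python
def effective_savings(ec2_hourly, emr_hourly, ec2_reserved_savings):
    """Savings on total compute cost when only the EC2 portion is discounted."""
    on_demand_total = ec2_hourly + emr_hourly
    reserved_total = ec2_hourly * (1 - ec2_reserved_savings) + emr_hourly
    return (on_demand_total - reserved_total) / on_demand_total


# Hypothetical rates (not real AWS prices) and a 40% EC2 Reserved discount.
savings = effective_savings(ec2_hourly=1.00, emr_hourly=0.25, ec2_reserved_savings=0.40)
print(f"Savings on total compute cost: {savings:.0%}")
```

The bigger the EMR fee relative to the EC2 rate, the more the headline EC2 discount shrinks once you look at the whole bill.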
How to find an optimal Reserved instance type for your EMR deployments
The steps for EMR are very similar to those for EC2, with a few differences. Below are some
special considerations that apply to EMR:
Estimate resource consumption patterns based on deployed EMR applications
EMR supports a number of widely used open source data analysis projects, such as Hadoop,
Spark, Hive, HBase, MXNet, Pig, Presto, Tez and others. These tools have their own resource
consumption patterns. For example, some frameworks are memory-intensive, while others are
compute-intensive. Some require fast access to data stored in local disk storage and others
will require high network throughput.
I won’t get into details regarding each application’s resource consumption in this article,
but it’s important to know there are different instance types that will be optimal based on
different resource consumption use cases. For example, you might find that an R5 instance
is a better fit for a memory-intensive application, or an I3 is optimal for applications
that need high disk I/O performance. Perhaps an M5 would be a better fit for a more balanced
workload between compute and memory, etc.
Whether you’re deploying a new cluster or working with an existing one, CloudWatch metrics
will provide visibility into the different resource consumption patterns.
Understand data sources and how they affect EMR cluster elasticity
EMR can analyze data stored in multiple data sources, such as S3, DynamoDB, RDS, Redshift,
and Hadoop Distributed File System (inside the EMR cluster or hosted elsewhere) as well as
any data sources supported by EMR applications. Where data under analysis is stored will
impact system requirements such as memory consumption, disk I/O or network throughput.
Data storage will also impact the elasticity of your EMR cluster. For example, it’s harder
to adjust the size of an EMR cluster that also stores data in its own HDFS. This is because
adjusting the size of an HDFS often requires re-loading or re-distributing data across the whole
cluster. On the other hand, an EMR cluster that accesses data stored in S3 is easier to expand
or reduce in size according to usage. This will significantly affect the number of compute
hours your EMR cluster will incur in your monthly AWS bill.
The main trade-off is performance. Typically, storing data in the cluster HDFS delivers
better performance, but it makes cluster elasticity (and therefore cost savings) more difficult to achieve.
Configure Auto Scaling and calculate baseline monthly usage of the EMR cluster
Once you determine the elasticity requirements and capabilities in your EMR cluster, it’s time
to configure EMR Automatic Scaling and Application Auto Scaling Scheduled Actions.
Keep in mind these two AWS features are not exactly EC2 Auto Scaling, but they achieve the same result.
Based on the ability to expand or shrink an EMR cluster and the number of nodes that will
be deployed in the cluster, you will have to calculate the baseline number of monthly compute
hours in your EMR cluster.
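As a rough illustration of the baseline calculation, the sketch below takes a daily scaling schedule and works out total monthly instance-hours, plus the number of nodes that are always running. Treating that always-on minimum as the Reserved candidate is a simplifying assumption on my part, and the schedule is hypothetical.

```python
def monthly_instance_hours(schedule, days_per_month=30):
    """Total instance-hours for a daily schedule of (hours_per_day, node_count) windows."""
    return sum(hours * nodes for hours, nodes in schedule) * days_per_month


def baseline_nodes(schedule):
    """Nodes running in every window of the schedule (candidates for Reserved)."""
    return min(nodes for _, nodes in schedule)


# Hypothetical schedule: 8 busy hours with 20 nodes, 16 quiet hours with 6 nodes.
schedule = [(8, 20), (16, 6)]
total_hours = monthly_instance_hours(schedule)
baseline = baseline_nodes(schedule)
reserved_hours = baseline * 24 * 30  # hours the always-on nodes run in a 30-day month
print(f"Total instance-hours: {total_hours}")
print(f"Baseline nodes: {baseline} ({reserved_hours / total_hours:.0%} of instance-hours)")
```

The hours above the baseline are the ones that scaling policies handle, and they would stay On Demand in this simplified view.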
Since EMR and EC2 Reserved steps are very similar, I recommend taking a look at the steps described
in the first article of this series, regarding EC2 Reserved.
Conclusions
When buying EMR Reserved, you’re actually purchasing EC2 Reserved instances, so the
rules for EC2 Reserved apply (e.g. Instance Size Flexibility, Standard vs. Convertible
and Scope).
EMR has a per-second compute fee in addition to EC2 fees. It amounts to approximately 20%
of the total compute cost and it doesn’t support Reserved pricing. As a result, when committing
to a 1-year or 3-year Reserved purchase for an EMR cluster, the percentage of savings over the
whole period is lower than with a traditional EC2 Reserved purchase. For example, if
a pure EC2 Reserved purchase would save you 40%, once you include the EMR fee you’d be looking
at around 30%. The final numbers vary according to instance types and regions.
EMR clusters generally don’t incur license fees, since they run on Amazon Linux OS and open
source software. The only exception is some MapR distributions, which are supported in EMR for an extra fee.
It’s easier to save money in EMR if your data is stored externally to the EMR cluster (e.g. S3). This
allows you to shrink/expand your cluster based on usage or pre-configured schedules. Storing
data in the HDFS makes it more difficult to update the cluster size since data typically has to be reloaded
or redistributed across the cluster, which can be cumbersome. A trade-off is that
performance is typically better when data is accessed in the local cluster HDFS.
Choosing the right Reserved purchase for your EMR cluster includes a number of steps, such
as identifying the optimal data storage, resource consumption, instance type, usage patterns,
Reserved options and Auto Scaling configurations.
Ernesto Marquez
I am the Project Director at Concurrency Labs Ltd, ex-Amazon (AWS), Certified AWS Solutions
Architect and I want to help you run AWS optimally, so your applications reliably
generate revenue for your business.