As an application owner, it is increasingly important to be able to analyze large amounts of data. Cloud-based Big Data analytics tools are a widely adopted way to achieve this, and AWS offers multiple solutions in this area, notably the Elastic MapReduce (EMR) service. EMR is a managed solution for deploying tools such as Hadoop, Spark, Hive, and Presto, among others, on compute infrastructure such as EC2 or EKS, or even on-premises servers.
In this article I will focus on Apache Spark, managed by the EMR service, running on EC2 instances.
Infrastructure and Data Setup
Below are the infrastructure and data specifications I used for these tests:
- Test data compliant with the TPC-H benchmark specifications. Tests were executed against 10TB and 30TB datasets in Parquet format. The 10TB TPC-H dataset has close to 87 billion records, while the 30TB dataset has close to 260 billion records (for a total of 347 billion records queried in these tests).
- Data was stored in S3 in the N. Virginia region.
- EMR version 6.11.0 (Spark 3.3.2).
- For EMR Spark to access data, a metadata catalog is required. This was achieved by launching a Hive Metastore on AWS EMR 6.11.0 (Hive 3.1.3), running on an m5.large EC2 instance and backed by a db.t3.medium AWS RDS MySQL instance.
- Tests were executed with clusters consisting of 1 primary node and a different number of executor nodes, depending on the test scenario, all running on r5.8xlarge EC2 instances. Below are the cluster sizes used for these tests:
- 1 Coordinator (r5.8xlarge) + 40 executor nodes (r5.8xlarge)
- 1 Coordinator (r5.8xlarge) + 100 executor nodes (r5.8xlarge)
- For the 10TB tests, the default EBS configuration was applied: four 128 GB gp2 volumes per node (512 GB per node).
- For the 30TB tests, each node had a single EBS volume with the following settings: VolumeType=gp3, SizeInGB=1000, Iops=3000, Throughput=512 MiB/s.
- Default Spark settings were used, and the EMR configuration property maximizeResourceAllocation was set to true.
EMR delivers an automated way to launch and manage the compute infrastructure of data analytics clusters. It’s important to highlight that cluster launch and management automation can be implemented using the AWS CLI, SDK, or even tools such as CloudFormation.
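As an example of this kind of automation, below is a minimal sketch of how a cluster comparable to the ones used in these tests could be launched with boto3 (the AWS SDK for Python). The cluster name, log bucket, subnet ID and metastore endpoint are placeholders, and the EBS settings shown are the ones used for the 30TB tests; adjust everything to your environment.

```python
# Minimal boto3 (AWS SDK for Python) sketch of launching a cluster comparable to
# the ones used in these tests. The cluster name, log bucket, subnet ID and
# metastore endpoint are placeholders; the EBS settings shown are the ones used
# for the 30TB tests.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="tpch-spark-benchmark",                  # hypothetical cluster name
    ReleaseLabel="emr-6.11.0",                    # Spark 3.3.2, as used in these tests
    Applications=[{"Name": "Spark"}],
    ServiceRole="EMR_DefaultRole",
    JobFlowRole="EMR_EC2_DefaultRole",
    LogUri="s3://my-bucket/emr-logs/",            # placeholder bucket
    Configurations=[
        # Let EMR size Spark executors and memory to the instance type.
        {"Classification": "spark",
         "Properties": {"maximizeResourceAllocation": "true"}},
        # Point Spark SQL at the external Hive Metastore backed by RDS MySQL.
        {"Classification": "spark-hive-site",
         "Properties": {"hive.metastore.uris": "thrift://METASTORE_HOST:9083"}},
    ],
    Instances={
        "Ec2SubnetId": "subnet-0123456789abcdef0",   # placeholder subnet
        "KeepJobFlowAliveWhenNoSteps": True,
        "InstanceGroups": [
            {"Name": "primary", "InstanceRole": "MASTER",
             "InstanceType": "r5.8xlarge", "InstanceCount": 1},
            {"Name": "executors", "InstanceRole": "CORE",
             "InstanceType": "r5.8xlarge", "InstanceCount": 40,
             "EbsConfiguration": {"EbsBlockDeviceConfigs": [{
                 "VolumeSpecification": {"VolumeType": "gp3", "SizeInGB": 1000,
                                         "Iops": 3000, "Throughput": 512},
                 "VolumesPerInstance": 1}]}},
        ],
    },
)
print("Launched cluster:", response["JobFlowId"])
```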
Test Scenarios and Execution
Each test ran the 22 queries defined in the TPC-H benchmark. The full set of 22 queries was executed 3 times sequentially (queries 1 through 22, repeated 3 times). For each query, the result displayed in the tables and charts in the Test Results section below is the average of the 3 executions.
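As an illustration, below is a simplified PySpark sketch of this execution loop; the query file paths and database name are illustrative assumptions, not the exact harness used for these tests.

```python
# Simplified PySpark sketch of the execution loop described above: the 22 TPC-H
# queries are executed 3 times, sequentially, and the average time per query is
# reported. The query file paths and database name are illustrative assumptions.
import time
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("tpch-benchmark")
         .enableHiveSupport()            # use the external Hive Metastore catalog
         .getOrCreate())
spark.sql("USE tpch_10tb")               # hypothetical TPC-H database name

timings = {}
for run in range(3):                     # 3 sequential passes over the full set
    for qid in range(1, 23):             # queries 1 through 22
        query = open(f"queries/q{qid:02d}.sql").read()   # placeholder path
        start = time.time()
        spark.sql(query).collect()       # force full execution of the query
        timings.setdefault(qid, []).append(time.time() - start)

for qid, runs in sorted(timings.items()):
    print(f"q{qid:02d}: average {sum(runs) / len(runs):.0f}s over {len(runs)} runs")
```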
The following test scenarios were executed:
- 10TB - 1 primary node + 40 executor nodes
- 10TB - 1 primary node + 100 executor nodes
- 30TB - 1 primary node + 40 executor nodes
- 30TB - 1 primary node + 100 executor nodes
Test Results
All queries completed successfully for the scenarios in scope. Below are the results for the 10TB scenarios:
For the 10TB, 40-executor scenario, the average query execution time was 158 seconds.
For the 10TB, 100-executor scenario, the average query execution time was 93 seconds (on average, each query ran in 62% of the time it took with 40 executors).
Below are the results for the 30TB scenarios:
For the 30TB, 40-executor scenario, the average query execution time was 501 seconds (3.17x the average execution time of the 10TB scenario with the same number of nodes).
For the 30TB, 100-executor scenario, the average query execution time was 217 seconds (on average, each query ran in 49% of the time it took with 40 executors). With 100 executor nodes, the 30TB dataset took 2.33x as long to execute as the 10TB dataset.
The table below displays the individual query execution times, in seconds, per scenario:
| QID | 10TB - 40 executors | 10TB - 100 executors | 30TB - 40 executors | 30TB - 100 executors |
|---|---|---|---|---|
q01 | 108 | 62 | 193 | 91 |
q02 | 62 | 51 | 118 | 81 |
q03 | 140 | 70 | 480 | 231 |
q04 | 208 | 120 | 381 | 217 |
q05 | 246 | 139 | 888 | 362 |
q06 | 83 | 45 | 140 | 66 |
q07 | 137 | 84 | 341 | 169 |
q08 | 184 | 97 | 457 | 242 |
q09 | 275 | 139 | 857 | 362 |
q10 | 143 | 76 | 393 | 171 |
q11 | 116 | 109 | 475 | 319 |
q12 | 94 | 64 | 216 | 103 |
q13 | 125 | 78 | 299 | 169 |
q14 | 91 | 56 | 219 | 98 |
q15 | 163 | 82 | 325 | 146 |
q16 | 58 | 50 | 143 | 95 |
q17 | 254 | 149 | 1167 | 440 |
q18 | 245 | 148 | 1444 | 429 |
q19 | 114 | 62 | 206 | 95 |
q20 | 118 | 68 | 258 | 140 |
q21 | 439 | 233 | 1841 | 620 |
q22 | 75 | 57 | 189 | 118 |
Average | 158 | 93 | 501 | 217 |
Total (mins) | 58 | 34 | 184 | 79 |
Below is a comparison of the average execution time of each scenario relative to the other scenarios:
| | relative to 10TB - 40 executors | relative to 10TB - 100 executors | relative to 30TB - 40 executors | relative to 30TB - 100 executors |
|---|---|---|---|---|
| 10TB - 40 executors | N/A | 171% | 32% | 73% |
| 10TB - 100 executors | 59% | N/A | 18% | 43% |
| 30TB - 40 executors | 317% | 541% | N/A | 232% |
| 30TB - 100 executors | 137% | 234% | 43% | N/A |
AWS Cost Analysis
The following tables compare the cost of running a 40 executors + 1 coordinator cluster against a 100 executors + 1 coordinator cluster, both with the default EMR EBS configuration of 512 GB of gp2 storage per node, in the N. Virginia region.
40 executors + 1 coordinator:
| Component | Price Dimension | Usage | Hourly Cost | Monthly Cost |
|---|---|---|---|---|
| Data store | S3 Standard Storage | 2.7TB (Parquet 10TB) + 7.8TB (Parquet 30TB) = 10.5TB | $0.34 | $242 |
| Coordinator Node | EC2 Compute | 1 r5.8xlarge instance | $2.02 | $1,452 |
| Executor Nodes | EC2 Compute | 40 r5.8xlarge instances | $80.64 | $58,061 |
| EMR Fee - Coordinator | $0.27/hr (r5.8xlarge) | 1 r5.8xlarge instance | $0.27 | $194 |
| EMR Fee - Executors | $0.27/hr (r5.8xlarge) | 40 r5.8xlarge instances | $10.80 | $7,776 |
| Hive Metastore | EC2 Compute + EMR fee | 1 m5.large | $0.12 | $86 |
| RDS MySQL | Hive Metastore storage | 1 db.t3.medium | $0.07 | $49 |
| EBS Storage | EBS Volume Usage - gp2 | 512GB x 41 = 20,992GB | $2.92 | $2,099 |
| Data Transfer | S3 to EC2 - Intra-regional Data Transfer | N/A | $0.00 | $0.00 |
| Total | | | $97.16 | $69,959 |
100 executors + 1 coordinator:
| Component | Price Dimension | Usage | Hourly Cost | Monthly Cost |
|---|---|---|---|---|
| Data store | S3 Standard Storage | 2.7TB (Parquet 10TB) + 7.8TB (Parquet 30TB) = 10.5TB | $0.34 | $242 |
| Coordinator Node | EC2 Compute | 1 r5.8xlarge instance | $2.02 | $1,452 |
| Executor Nodes | EC2 Compute | 100 r5.8xlarge instances | $201.60 | $145,152 |
| EMR Fee - Coordinator | $0.27/hr (r5.8xlarge) | 1 r5.8xlarge instance | $0.27 | $194 |
| EMR Fee - Executors | $0.27/hr (r5.8xlarge) | 100 r5.8xlarge instances | $27.00 | $19,440 |
| Hive Metastore | EC2 Compute + EMR fee | 1 m5.large | $0.12 | $86 |
| RDS MySQL | Hive Metastore storage | 1 db.t3.medium | $0.07 | $49 |
| EBS Storage | EBS Volume Usage - gp2 | 512GB x 101 = 51,712GB | $7.18 | $5,171 |
| Data Transfer | S3 to EC2 - Intra-regional Data Transfer | N/A | $0.00 | $0 |
| Total | | | $238.59 | $171,786 |
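As a sanity check, the hourly totals above can be approximated from per-component list prices. Below is a rough sketch, assuming N. Virginia on-demand rates of $2.016/hr for r5.8xlarge, the $0.27/hr EMR fee, $0.10 per GB-month for gp2 EBS, and a 720-hour month.

```python
# Rough reproduction of the hourly totals above from per-component list prices.
# Assumptions: N. Virginia on-demand rate of $2.016/hr for r5.8xlarge, EMR fee of
# $0.27/hr per instance, gp2 EBS at $0.10 per GB-month, and a 720-hour month.
def hourly_cluster_cost(executor_nodes: int) -> float:
    nodes = executor_nodes + 1                 # executors + 1 coordinator
    ec2 = nodes * 2.016                        # EC2 compute
    emr_fee = nodes * 0.27                     # EMR service fee
    ebs = nodes * 512 * 0.10 / 720             # default 512 GB gp2 per node
    s3 = 0.34                                  # 10.5 TB of Parquet data in S3
    metastore = 0.12 + 0.07                    # Hive Metastore EC2/EMR + RDS MySQL
    return ec2 + emr_fee + ebs + s3 + metastore

print(f"40 executors:  ${hourly_cluster_cost(40):.2f}/hr")    # ~$97/hr
print(f"100 executors: ${hourly_cluster_cost(100):.2f}/hr")   # ~$239/hr
```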
The 30TB tests used a different EBS configuration than the 10TB tests: instead of 512 GB of gp2 storage per node, each node had a 1,000 GB gp3 volume with 3,000 IOPS and 512 MB/s throughput.
| Cluster Size | Storage (1,000 GB) | IOPS (3,000) | Throughput (512 MB/s) | Total / month | Delta vs. default EBS configuration |
|---|---|---|---|---|---|
| 40 executors + 1 coordinator | $3,280 | $615 | $840 | $4,735 | $2,635 |
| 100 executors + 1 coordinator | $8,080 | $1,515 | $2,068 | $11,663 | $6,492 |
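These figures can be approximated from EBS list prices. Below is a rough sketch that mirrors the table above, assuming N. Virginia rates of $0.08 per GB-month for gp3 storage, $0.005 per provisioned IOPS-month, $0.04 per MB/s-month of throughput, and $0.10 per GB-month for gp2; note that gp3 includes a free baseline of 3,000 IOPS and 125 MB/s, so actual bills may be somewhat lower.

```python
# Approximate monthly EBS cost for the 30TB configuration vs. the default,
# mirroring the table above. Assumed N. Virginia list prices: gp3 storage
# $0.08/GB-month, $0.005 per provisioned IOPS-month, $0.04 per MB/s-month of
# throughput, and gp2 at $0.10/GB-month. Note that gp3 includes a free baseline
# of 3,000 IOPS and 125 MB/s, so actual bills may be somewhat lower.
def gp3_monthly(nodes: int) -> float:
    storage = nodes * 1000 * 0.08        # 1,000 GB gp3 per node
    iops = nodes * 3000 * 0.005          # 3,000 provisioned IOPS per volume
    throughput = nodes * 512 * 0.04      # 512 MB/s provisioned throughput per volume
    return storage + iops + throughput

def gp2_default_monthly(nodes: int) -> float:
    return nodes * 512 * 0.10            # default: 512 GB of gp2 per node

for nodes in (41, 101):                  # executors + 1 coordinator
    delta = gp3_monthly(nodes) - gp2_default_monthly(nodes)
    print(f"{nodes} nodes: gp3 ~${gp3_monthly(nodes):,.0f}/month, "
          f"delta vs. default ~${delta:,.0f}/month")
```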
It is important to mention that intra-regional data transfer cost was zero, since all components were launched in the same AWS region and VPC subnet; otherwise, significant data transfer charges would have been incurred, likely amounting to hundreds of dollars per TPC-H test set.
The always-on pricing is for illustration purposes only, since the recommendation is to dynamically scale in and out based on usage.
Below you can visualize how the hourly cost for a particular cluster size (i.e. 40 vs. 100 executors) affected the total execution time for each dataset size.
For these tests, the larger cluster had a greater impact on total execution time as the dataset size increased (i.e. the higher hourly cost delivered a better return on the 30TB dataset than on the 10TB dataset).
AWS Cost Management Recommendations
- Use the EMR Managed Scaling feature to automatically resize clusters based on usage. Leaving a large cluster always on will result in thousands of dollars of EC2 and EMR cost every month, while dynamic scaling can significantly lower AWS cost by allocating capacity according to usage patterns (a configuration sketch follows this list).
- Using automation tools, such as the AWS CLI or SDK, significantly simplifies the launch and management of EMR clusters, which can be used to implement custom, automated processes to reduce cost.
- Consider using EMR automatic cluster termination after a configurable idle period, to avoid situations where large clusters are accidentally left running longer than necessary.
- EC2 Spot Instances are a good option to consider in order to reduce cost, given they can save up to 90% compared to EC2 On Demand pricing. Be aware that Spot instances can be terminated at any point in time; therefore it’s important to ensure they’re used to execute fault-tolerant, asynchronous data analytics processes.
- Allocate storage (S3) and compute resources in the same AWS region to avoid inter-regional data transfer fees between S3 and EC2. Most inter-regional data transfer within the US costs $20 per TB, but some regions outside the US can charge as much as $147 per TB (e.g. Cape Town), which would add several thousand dollars of data transfer cost for the datasets used in these tests. Beyond the added cost, keeping data and compute in different AWS regions can also significantly hurt performance, especially with large datasets.
- If there are no strict availability requirements, deploying all EC2 instances in the same VPC subnet can also reduce cost, since intra-regional data transfer (in and out) costs $10 per TB transferred. For a large dataset, such as the one used for these tests, intra-regional data transfer can amount to hundreds of dollars, depending on the number and type of queries executed. For always-on applications with strict availability requirements, deployments across multiple AZs are worth considering, but it is highly recommended to calculate the cost in advance to avoid unexpected data transfer charges.
- Once the usage patterns are predictable, it is highly recommended to purchase Reserved Instances. You can refer to this article I wrote, which details how to approach Reserved Instances in the context of EMR workloads.
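As an example of the first and third recommendations above, below is a minimal boto3 sketch that applies a managed scaling policy and an automatic idle-termination policy to an existing cluster; the cluster ID and capacity limits are placeholders.

```python
# Minimal boto3 sketch of applying EMR Managed Scaling and automatic idle
# termination to an existing cluster. The cluster ID and capacity limits are
# placeholders; tune them to your workload.
import boto3

emr = boto3.client("emr", region_name="us-east-1")
cluster_id = "j-EXAMPLE12345"            # placeholder cluster ID

# Let EMR resize the cluster between 2 and 100 instances based on utilization.
emr.put_managed_scaling_policy(
    ClusterId=cluster_id,
    ManagedScalingPolicy={
        "ComputeLimits": {
            "UnitType": "Instances",
            "MinimumCapacityUnits": 2,
            "MaximumCapacityUnits": 100,
            "MaximumOnDemandCapacityUnits": 40,   # capacity beyond this can be Spot
            "MaximumCoreCapacityUnits": 40,
        }
    },
)

# Terminate the cluster automatically after one hour of inactivity.
emr.put_auto_termination_policy(
    ClusterId=cluster_id,
    AutoTerminationPolicy={"IdleTimeout": 3600},  # seconds
)
```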
Conclusions
- EMR Spark successfully executed the full set of 22 TPC-H queries in both datasets (10TB and 30TB).
- Queries showed consistent execution timings across all 3 sequential executions.
- It is always recommended to use automation tools, such as the AWS CLI, SDK or CloudFormation, to launch and manage EMR clusters.
- The average execution time does not necessarily increase proportionally to the size of the dataset. For example, 30TB with 40 executors took 3.17x as long to execute as 10TB with the same number of executor nodes, while 30TB with 100 executors took 2.33x as long as 10TB with 100 executor nodes.
- Adding more nodes, in this case, resulted in better scalability when increasing the size of the dataset.
- It is important to measure performance and calculate cost on larger datasets as the number of nodes increases. There are situations where adding more nodes for larger datasets has a positive return on overall query performance (potentially delivering cost savings), but there is a point beyond which higher cost stops delivering proportionally better performance. It is therefore important to run tests with multiple cluster sizes and measure performance accordingly, in order to find the best cost/performance ratio for a particular application (a rough cost-per-run comparison follows below).
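To illustrate the last point, below is a rough back-of-the-envelope cost-per-run comparison based on the hourly costs and total execution times measured above (it ignores the gp2 vs. gp3 EBS difference between the 10TB and 30TB configurations).

```python
# Rough cost per full TPC-H run (22 queries), combining the hourly cluster costs
# and total execution times from the tables above. This ignores the small EBS
# price difference between the 10TB (gp2) and 30TB (gp3) configurations.
hourly_cost = {"40 executors": 97.16, "100 executors": 238.59}   # USD per hour
total_minutes = {
    ("10TB", "40 executors"): 58, ("10TB", "100 executors"): 34,
    ("30TB", "40 executors"): 184, ("30TB", "100 executors"): 79,
}

for (dataset, cluster), minutes in total_minutes.items():
    cost = hourly_cost[cluster] * minutes / 60
    print(f"{dataset} on {cluster}: ~${cost:,.0f} per run ({minutes} min)")

# 10TB: ~$94 with 40 executors vs. ~$135 with 100 executors.
# 30TB: ~$298 with 40 executors vs. ~$314 with 100 executors -- on the larger
# dataset the bigger cluster is nearly cost-neutral while finishing ~2.3x faster.
```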
Do you need help optimizing your Big Data workloads in the cloud?
I can help you optimize your Big Data workloads in the cloud and make sure they deliver the right balance between performance and cost for your business. Click on the button below to schedule a free consultation or use the contact form.