This post discusses how to determine the right node size and cluster topology for your Amazon ElastiCache workloads, and the important factors to consider. It assumes you have a good working knowledge of Redis and its commands, and an understanding of Amazon ElastiCache for Redis and its features, such as online cluster resizing, scaling, online migration from Amazon EC2 to ElastiCache, general-purpose and memory-optimized nodes, and enhanced I/O.
For entry-level, small (2,000 TPS or less and 10 GB or less of data) and medium (2,000–20,000 TPS and 10–100 GB of data) cache workloads, including those that may experience temporary spikes in use, choose a cache node from the T3 family of next-generation, general-purpose, burstable T3-Standard cache nodes. If you’re just starting to use ElastiCache for your workloads, start on a T3.micro cache node because it offers a free tier. You can move up to T3.medium cache nodes as you increase the load.
For moderate to high (more than 20,000 TPS and 100 GB of data) workloads, choose a cache node from the M5 or R5 family, because these newest node types support the latest-generation CPUs and networking capabilities. These cache node families can deliver up to 25 Gbps of aggregate network bandwidth with enhanced networking based on the Elastic Network Adapter (ENA), and over 600 GiB of memory. The R5 node types provide 5% more memory per vCPU and a 10% price-per-GiB improvement over R4 node types. In addition, R5 node types deliver approximately 20% higher CPU performance than R4 node types.
If T3.medium is no longer sufficient, you can move to one of the following:
- M5 cache nodes if you need more throughput with some increased memory
- R5 cache nodes if you need more throughput and 35%–51% more memory per cache node
To further narrow down the node size and cluster topology suitable for your workloads, you need to do the following:
- Determine your five workload characteristics
- Run your benchmark testing
Determining your five workload characteristics
You can determine most workload characteristics from your application metrics, Redis’s INFO command, or Amazon CloudWatch metrics. For more information about maximum node memory, see Redis Node-Type Specific Parameters.
When determining your ElastiCache node requirements, consider the following:
- Spare or reserved memory
The following considerations can help you start identifying potential node sizes:
- Identify your full datastore size and your key and value data sizes. You can get an approximate estimate of the cache memory you need by multiplying the size of the items you want to cache by the number of items you want to keep cached at once.
- Determine whether you intend to retain data indefinitely or use a TTL to expire keys in the cache. A TTL gives you explicit control over how long items occupy memory on the node.
- Identify the existing and preferred cache hit rate if this is an important metric for you, such as in cache-only use cases. You want to make sure that your cluster has the desired hit rate, or that keys aren’t evicted too often. You can achieve this with more memory capacity.
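The sizing arithmetic described above can be sketched in a few lines. The item counts and sizes below are hypothetical placeholders; substitute your own measurements:

```python
# Back-of-the-envelope cache sizing: multiply the average item size by the
# number of items you expect to keep cached at once. All numbers here are
# hypothetical examples, not recommendations.
AVG_KEY_BYTES = 50        # average key size
AVG_VALUE_BYTES = 1_000   # average value size
CACHED_ITEMS = 5_000_000  # items resident in the cache at once

dataset_bytes = CACHED_ITEMS * (AVG_KEY_BYTES + AVG_VALUE_BYTES)
dataset_gb = dataset_bytes / 10**9
print(f"Approximate dataset size: {dataset_gb:.1f} GB")
```

Treat this figure as a lower bound: Redis adds per-item overhead (pointers, dictionary entries, expiry metadata), so the actual memory footprint will be somewhat larger.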
Spare or reserved memory
Keep at least 25% of the node’s memory free beyond your dataset size. Replication uses some memory on the primary node. In addition, cache nodes should have spare memory of approximately 10%–15% for unexpected load peaks and for early detection, via CloudWatch alarms, of a growing memory footprint. You can use this early detection to decide whether to scale up or scale out, depending on your specific requirements.
Write-heavy applications can require significantly more memory that is not used by data. You need this spare memory when taking snapshots or failing over to one of the replicas.
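A minimal sketch of the reservation guidance above, applying the rules of thumb from this section (25% reserved, 10–15% spare) to a hypothetical 2.9 GB dataset:

```python
# Estimate the minimum node memory for a given dataset, following the rules of
# thumb above: keep ~25% of the node reserved (replication, snapshots), and
# leave 10-15% spare headroom for unexpected load peaks. The fractions and the
# 2.9 GB dataset are illustrative assumptions.
def min_node_memory_gb(dataset_gb: float,
                       reserved_fraction: float = 0.25,
                       spare_fraction: float = 0.15) -> float:
    # The dataset plus its spare headroom must fit in the non-reserved portion.
    usable_fraction = 1.0 - reserved_fraction
    return dataset_gb * (1.0 + spare_fraction) / usable_fraction

print(f"{min_node_memory_gb(2.9):.1f} GB")  # for a hypothetical 2.9 GB dataset
```

You would then pick the smallest node type whose memory (see Redis Node-Type Specific Parameters) meets or exceeds this estimate.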
If you need your cluster to remain available to serve customer requests, consider setting up replication groups with one primary, at least two replicas, and Multi-AZ enabled. This helps protect your data, and the cluster continues to serve traffic if the primary fails for any reason; when that happens, one of the replicas becomes the new primary. Replicas can also help you increase your read throughput.
Watch out for a write-heavy primary whose write:read ratio is more than 50% while running close to 80% of the request-rate limit for that node type. Write-heavy primary nodes may perform a full sync with their replicas more often, which impacts cluster performance: frequent full syncs consume time the primary could otherwise spend processing incoming requests.
Also, resist the urge to spin up many replicas just for availability; doing so creates unnecessary stress on the primary, which must sync with every replica. There is a limit of five replicas per primary node, and one or two replicas in a different Availability Zone are sufficient for availability.
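The write-heavy check described above can be expressed as a small predicate. This is an illustrative sketch using the thresholds from this section (50% write ratio, 80% of the node’s request-rate limit); the traffic numbers are hypothetical, and in practice you would derive them from application metrics or Redis INFO counters:

```python
# Flag a "heavy write" primary: write:read ratio over 50% combined with
# running close to 80% of the node type's request-rate limit.
def heavy_write_primary(writes_per_sec: float, reads_per_sec: float,
                        node_rps_limit: float) -> bool:
    total = writes_per_sec + reads_per_sec
    if total == 0:
        return False
    write_ratio = writes_per_sec / total   # fraction of requests that are writes
    utilization = total / node_rps_limit   # fraction of the node's rate limit in use
    return write_ratio > 0.50 and utilization >= 0.80

print(heavy_write_primary(60_000, 30_000, 100_000))  # write-heavy and 90% utilized
print(heavy_write_primary(10_000, 40_000, 100_000))  # read-heavy, only 50% utilized
```

A primary that trips this check is a candidate for scaling up, or for sharding writes across more primaries by scaling out.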
ElastiCache offers a cluster-mode enabled configuration that supports online vertical (up and down) and horizontal (in and out) scaling while the cluster continues to serve requests. It’s better to scale out if you have many simultaneous clients (for example, 10,000 or more, such as 1 TPS from 10,000 clients or 5 TPS from 2,000 clients) on the primary node, to make sure you have the compute capacity to service them all. The optimal number of simultaneous clients per primary node depends on your specific use case and overall application architecture. Besides serving a very high number of simultaneous clients, scaling out spreads data across multiple shards, which further increases the availability of your data in the cluster. However, if your business requires higher performance from the existing cluster configuration, you should scale up: scaling up increases the performance of an individual node.
Migration between the two cluster configurations (cluster-mode disabled and cluster-mode enabled) is supported through backup and restore, an offline operation that uses the .rdb file from your source cluster. Therefore, you should use the cluster-mode enabled configuration by default, because it permits both vertical and horizontal scaling to meet future needs.
If you’re reducing the size and memory capacity of the cluster, by either scaling in or scaling down, make sure that the new configuration has sufficient memory for your data and Redis overhead.
Determine whether your workloads have any hot keys, such as one or more data objects that are requested at very high rates or have suddenly become very large. Hot keys can impair your cache engine’s ability to maintain high performance and serve all requests. In that case, assuming you’re using the recommended cluster-mode enabled configuration, you could spread the load across shards, keeping the hot key in one shard and the rest of the keys in other shards so that other incoming requests aren’t blocked. If there are multiple hot keys, consider spreading them across shards.
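One simple way to find hot keys is to count requests per key over a sampling window and flag outliers. The sketch below uses a synthetic access log and an arbitrary "twice the mean" threshold; in practice you would feed it counts from application-side metrics or by sampling the traffic:

```python
# Illustrative hot-key detection: count requests per key over a window and
# flag keys well above the mean request rate. The access log and threshold
# here are synthetic assumptions, not a prescribed method.
from collections import Counter

access_log = ["user:42"] * 900 + ["user:7"] * 40 + ["user:9"] * 60
counts = Counter(access_log)
mean = sum(counts.values()) / len(counts)

hot_keys = [k for k, n in counts.items() if n > 2 * mean]
print(hot_keys)  # ['user:42']
```

Once identified, a hot key can be isolated on its own shard, or very large hot values can be split into smaller items so a single node doesn’t absorb all the traffic.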
In addition, consider separating your read and write workloads. This separation lets you scale reads by adding replicas as your application grows. Replicas provide eventually consistent reads. For the cluster-mode disabled configuration, ElastiCache provides a reader endpoint that load balances read traffic across replicas, enabling the separation of reads and writes. For the cluster-mode enabled configuration, some Redis clients allow traffic to be routed to replicas; review your specific Redis client’s documentation for this mechanism.
Running your benchmark testing
After determining the parameters applicable to your case, you should identify a few best-fitting cache nodes and cluster topologies. Choosing two large cache nodes may (or may not) be better than one xlarge cache node. Configure your client application and run benchmark tests on each candidate, in line with your workload characteristics, in a production-like environment. Run your benchmark tests with production data and traffic patterns for at least 14 days to generate a good baseline of your regular production workload pattern. Once you have the baseline, include seasonality, such as holidays or Black Friday sales, in your workload to get benchmark results that more closely reflect your actual workload patterns. Based on the outcome of the benchmark testing, you can select the right node size and cluster configuration for your Redis workloads.
ElastiCache has published benchmarking results using the open-source benchmarking tool redis-benchmark. The first benchmarking exercise compares R4 and optimized R5 cache nodes, which the following section explains in more detail. For more information, see Amazon ElastiCache performance boost with Amazon EC2 M5 and R5 instances. The second benchmarking exercise, on the R5 family, compares Redis 5.0.3 with enhanced I/O against Redis 5.0.0, which doesn’t offer enhanced I/O. For more information, see Boosting application performance and reducing costs with Amazon ElastiCache for Redis.
Comparing R4 and optimized R5 cache nodes
This benchmark used 14.7 million unique keys, 200-byte string values, 80% gets, 20% sets, and no command pipelining. The benchmark ran on 20 client instances connecting to an ElastiCache R5 instance in the same Availability Zone.
The following table summarizes the benchmark test setup.
|Parameter|Value|
|---|---|
|Datastore size|14.7 million keys with 200-byte string values = 2.9 GB|
|Key|4-byte random string with values in the range [a-zA-Z0-9] (62**4 = 14.7 million keys)|
|Value|200-byte non-random (not regenerated) string|
|Spare memory|5 GB; accounting for 25% for snapshotting, the cache node should be at least 2.9 + 5 + 2.7 = 10.6 GB|
|Availability|One primary with no replicas|
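As a quick check, the datastore size in the table follows directly from the key-space arithmetic (this sketch is for verification only and was not part of the benchmark):

```python
# Verify the benchmark's dataset arithmetic: a 4-character key drawn from
# [a-zA-Z0-9] gives 62**4 possible keys, each holding a 200-byte value.
keys = 62 ** 4
value_bytes = 200
total_bytes = keys * value_bytes

print(f"{keys:,} keys")                        # 14,776,336 (~14.7 million)
print(f"{total_bytes / 10**9:.2f} GB of values")
```

The keys themselves add only about 59 MB (14.7 million × 4 bytes), which is consistent with the ~2.9 GB figure in the table.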
Each test had 20 application nodes. Each application node opened a variable number of connections based on the node type: more connections for larger nodes (to increase throughput), fewer for smaller nodes. The number of connections was based on how many could be opened without significantly increasing the p99.9 request latency.
- No hot keys
- 20 client connections
- Keys were generated randomly
The benchmarking results showed that the latest R5 cache nodes supported 59%–144% more transactions per second than similarly sized R4 instances. R5 cache nodes also had up to 23% reduced average (p50) and tail (p99) latencies, which resulted in average latencies as low as 350 microseconds. The following table summarizes the data from this exercise:
|Cache Node Size|ElastiCache R4 Node|ElastiCache Optimized R5 Node|ElastiCache R4 to Optimized R5 Improvement|
|---|---|---|---|
|large|88,000 RPS|215,000 RPS|144%|
|xlarge|93,000 RPS|207,000 RPS|122%|
|2xlarge|107,000 RPS|217,000 RPS|102%|
|4xlarge|131,000 RPS|225,000 RPS|71%|
|8xlarge/12xlarge|128,000 RPS|247,000 RPS|92%|
|16xlarge/24xlarge|149,000 RPS|237,000 RPS|59%|
Selecting the right node size and cluster configuration for your workloads is an important activity, and not a one-time one: perform it before migrating to ElastiCache and revisit it throughout the year, especially well in advance of any major upcoming business event. This prepares your teams to handle the scale and expected growth in traffic, which enables you to continue serving your customers seamlessly.
If you have any questions or feedback, reach out on the AWS ElastiCache Discussion Forum or in the comments.
About the Author
Anumeha is a Product Manager with Amazon Web Services.