How to scale your On-Prem Elastic Observability cluster - part 1
Questions about scaling Elasticsearch are usually answered with, "It depends." And the truth is, it really does depend on the situation. There are multiple options for scaling up and scaling out with Elasticsearch, and knowing when to use which option is no easy task.
In two blog posts, I will try to shed some light on the complexity of scaling an Elasticsearch cluster that is primarily used for observability, simply because this is the use case I have spent most of my career working on.
What Elasticsearch hosting options do you have?
First, let's take a closer look at the different Elasticsearch hosting options available: host-based, ECE, ECK, and Elastic Cloud.
Self-hosted
Running a self-hosted cluster sounds complicated, but it's less difficult than you might think. There are several options you can choose from when setting up your infrastructure.
Host-based
This is the classic setup. You build your Elastic Stack on (virtualized) hardware. I advise against doing this manually, as it is prone to errors. I recommend using an automation language, such as Ansible, to set up your nodes for Elasticsearch, Kibana, Fleet Server, and all other components.
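As a minimal illustration of that approach, an Ansible play along these lines installs Elasticsearch from Elastic's official apt repository. This is a sketch, not a production playbook: the host group name and the 8.x version branch are assumptions, and a real setup would also template elasticsearch.yml, certificates, and heap settings.

```yaml
# Minimal sketch: provision Elasticsearch nodes from the Elastic apt repo.
- name: Provision Elasticsearch nodes
  hosts: elasticsearch        # assumed inventory group
  become: true
  tasks:
    - name: Add the Elastic signing key
      ansible.builtin.apt_key:
        url: https://artifacts.elastic.co/GPG-KEY-elasticsearch
        state: present

    - name: Add the Elastic 8.x apt repository
      ansible.builtin.apt_repository:
        repo: "deb https://artifacts.elastic.co/packages/8.x/apt stable main"
        state: present

    - name: Install Elasticsearch
      ansible.builtin.apt:
        name: elasticsearch
        update_cache: true

    - name: Enable and start the service
      ansible.builtin.service:
        name: elasticsearch
        state: started
        enabled: true
```

The same pattern extends to Kibana and Fleet Server, which keeps all nodes reproducible and avoids the manual-setup errors mentioned above.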
You can use a virtualization platform to run your Elasticsearch nodes, or opt for bare metal. The main disadvantage of bare metal is that you cannot scale up without replacing the hardware itself. When using a virtualization platform, you must ensure that the platform is location-aware so that you can apply anti-affinity rules. These prevent nodes holding primary and replica shards of the same index from running on the same hypervisor.
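Anti-affinity rules keep the VMs apart at the hypervisor level; on the Elasticsearch side, you can additionally use shard allocation awareness so the cluster itself never places a primary and its replica in the same failure domain. A sketch, assuming a custom node attribute named `zone` (the attribute name and zone values are illustrative):

```yaml
# elasticsearch.yml — set a different value per failure domain,
# e.g. zone-a on hypervisor group A, zone-b on group B.
node.attr.zone: zone-a

# Tell the allocator to spread primaries and replicas across zones.
cluster.routing.allocation.awareness.attributes: zone

# Optional: force awareness, so replicas stay unassigned rather than
# piling up in a single surviving zone.
cluster.routing.allocation.awareness.force.zone.values: zone-a,zone-b
```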
ECE
ECE stands for Elastic Cloud Enterprise and is the Elastic Stack built on Docker or, in newer versions of ECE, Podman. ECE forms the basis of Elastic's hosted cloud offering. The ECE engine automates the deployment of your Elastic Stack instances and can manage multiple Elastic deployments. In addition, the engine manages the networking between the Docker/Podman nodes, provided that the correct ports are available.
ECK
ECK stands for Elastic Cloud on Kubernetes and, as the name suggests, is used to manage Elastic deployments on Kubernetes. ECK is newer than ECE and does not offer a comprehensive UI for managing your deployments. ECK focuses primarily on automation and configuration as code.
Elastic Cloud
Elastic also offers a managed solution with hosted Elasticsearch, Kibana, and other Elastic Stack components on major cloud providers such as AWS, Google Cloud, and Microsoft Azure. This allows organizations to monitor systems, detect threats, and build fast search experiences. Elastic Cloud provides automatic updates, security patches, and backups, ensuring reliability and security. Although this option eliminates the complexity and costs of managing your own infrastructure, you are still not completely free of management tasks.
Considerations for building and configuring your self-hosted Elastic platform
The scaling options apply to all hosting options, with a minor exception for ECE, unless you disable certain protections.
In general
JVM heap size
Do not set the JVM heap size between 32GB and 48GB. Above roughly 32GB, Java disables Compressed OOPs (compressed ordinary object pointers), which doubles the size of every object reference from 4 to 8 bytes. This reduces the number of objects that fit in the heap: a heap anywhere between roughly 32GB and 48GB actually gives you less effective memory than a heap just under 32GB, which can lead to OutOfMemoryErrors.
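A common way to stay safely under that cutoff is to pin both the minimum and maximum heap just below 32GB via a drop-in file; the 31g value and the file name below are illustrative:

```
# config/jvm.options.d/heap.options
-Xms31g
-Xmx31g
```

Setting -Xms and -Xmx to the same value avoids heap resizing pauses at runtime.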
Elasticsearch is much better suited to scaling out than scaling up. So you're better off with more, smaller nodes than with just a few very large nodes.
Central Processing Unit
Because Elasticsearch has to perform multiple operations simultaneously, give each node at least 2.5 vCPUs.
Storage
Elastic recommends 45GB of storage per GB of RAM for hot data nodes. In my experience, you can stretch this a little and safely go up to 60GB per GB of RAM, or even 90GB/GB of RAM. As always, it depends.
These ratios increase as you move to warm, cold, and frozen data tiers. Cold and frozen data tiers require searchable snapshots, which are available with Enterprise subscriptions. For the warm storage tier, the recommended ratio is 160GB/GB RAM, but this can be even higher. I have seen configurations of up to 250GB/GB RAM.
For the cold data tier, you use a similar configuration to the warm tier, but because you only need one shard and no replicas, you can accommodate twice the number of shards. The frozen data tier consists solely of caching nodes; all data is retrieved from snapshots on demand. These nodes are typically large, with a lot of storage capacity. Usually, you will see a limited number of nodes with 64GB of RAM and a lot of available storage for caching snapshot data.
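The ratios above translate directly into capacity math. Here is a small sketch; the cold-tier ratio of 320GB/GB RAM is simply my doubling of the warm ratio, following the one-shard-no-replica argument above, and all numbers are starting points rather than hard rules:

```python
# Rough capacity sketch: disk implied by a tier's RAM-to-storage ratio,
# and the node count a given data volume needs.

RATIOS_GB_PER_GB_RAM = {"hot": 45, "warm": 160, "cold": 320}

def disk_per_node_gb(ram_gb: int, tier: str) -> int:
    """Disk capacity one node of the given tier should carry."""
    return ram_gb * RATIOS_GB_PER_GB_RAM[tier]

def nodes_needed(total_data_gb: int, ram_gb: int, tier: str,
                 replicas: int = 1) -> int:
    """Node count for total_data_gb of primary data plus replicas."""
    per_node = disk_per_node_gb(ram_gb, tier)
    raw = total_data_gb * (1 + replicas)
    return -(-raw // per_node)  # ceiling division

# Example: 20 TB of primary hot data on 64GB-RAM nodes with 1 replica.
print(disk_per_node_gb(64, "hot"))      # 2880 GB per node
print(nodes_needed(20_000, 64, "hot"))  # 14 nodes
```

Swapping the tier in the same call shows why warm and cold tiers need far fewer nodes for the same data volume.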
Sharding
Elasticsearch comes with mostly sensible defaults. For most observability workloads, 1 primary shard and 1 replica work fine. Keep the maximum shard size below 50GB and below 200 million documents per shard. Use data streams with an ILM policy to roll over indices automatically.
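Those limits can be encoded directly in the rollover action of an ILM policy. A sketch in Kibana Dev Tools syntax follows; the policy name, the 30-day rollover age, and the 90-day retention are illustrative values you should replace with your own requirements:

```
PUT _ilm/policy/logs-observability
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_primary_shard_docs": 200000000,
            "max_age": "30d"
          }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}
```

Attach the policy to your data streams via their index template, and rollover then happens on whichever threshold is hit first.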
Data center considerations
If possible, choose three data centers, provided that latency and available bandwidth allow this. Traffic and latency between the data centers are important factors for Elasticsearch. If three data centers are not possible, try to use two, with a third location (e.g., a colocation) for a voting-only master node.
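The tiebreaker at the third location can be a small, dedicated voting-only master: it participates in master elections but is never elected itself and holds no data, so modest hardware suffices. In its configuration:

```yaml
# elasticsearch.yml on the tiebreaker node at the third location
node.roles: [ master, voting_only ]
```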
If your organization has only one data center, ensure as much separation as possible: two different rooms, two different rows, two different racks, two different power feeds, and two different switches. Try to create two stacks so that primary and replica shards do not run on the same critical path.
Monitoring
Set up a separate cluster for monitoring your Elasticsearch cluster(s). If your primary cluster fails while your monitoring is running on the same instance, you will be virtually blind, and finding the cause of the failure will become an order of magnitude more difficult.
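One way to wire this up is to ship Stack Monitoring metrics from each production node to the dedicated monitoring cluster, for example with Metricbeat; Elastic Agent is an alternative collector. The monitoring cluster hostname below is a placeholder:

```yaml
# metricbeat.yml sketch on each production node: collect Elasticsearch
# monitoring metrics locally, ship them to the separate monitoring cluster.
metricbeat.modules:
  - module: elasticsearch
    xpack.enabled: true
    period: 10s
    hosts: ["http://localhost:9200"]

output.elasticsearch:
  hosts: ["https://monitoring-cluster.example:9200"]
```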
Follow-up: part two of this blog series
Now you have a good idea of the different hosting options and what to consider when scaling your self-hosted Elastic platform. In part two of this blog series, I will take a closer look at building a cluster, setting up the infrastructure, and how to scale for optimal performance.