How to scale your On-Prem Elastic Observability cluster - part 1
Questions about scaling Elasticsearch are usually answered with, "It depends." And the truth is, it really does depend on the situation. There are multiple options for scaling up and scaling out with Elasticsearch, and knowing when to use which option is no easy task.
In two blog posts, I will try to shed some light on the complexity of scaling an Elasticsearch cluster that is primarily used for observability, simply because this is the use case I have spent most of my career working on.
What Elasticsearch hosting options do you have?
First, let's take a closer look at the different Elasticsearch hosting options available: host-based, ECE, ECK, and Elastic Cloud.
Self-hosted
Running a self-hosted cluster sounds complicated, but it's less difficult than you might think. There are several options you can choose from when setting up your infrastructure.
Host-based
This is the classic setup. You build your Elastic Stack on (virtualized) hardware. I advise against doing this manually, as it is prone to errors. I recommend using an automation language, such as Ansible, to set up your nodes for Elasticsearch, Kibana, Fleet Server, and all other components.
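As a minimal illustration of that approach, an Ansible play along these lines installs Elasticsearch from Elastic's official apt repository. This is a sketch, not a production playbook: the host group name and the 8.x version branch are assumptions, and a real setup would also template elasticsearch.yml, certificates, and heap settings.

```yaml
# Minimal sketch: provision Elasticsearch nodes from the Elastic apt repo.
- name: Provision Elasticsearch nodes
  hosts: elasticsearch        # assumed inventory group
  become: true
  tasks:
    - name: Add the Elastic signing key
      ansible.builtin.apt_key:
        url: https://artifacts.elastic.co/GPG-KEY-elasticsearch
        state: present

    - name: Add the Elastic 8.x apt repository
      ansible.builtin.apt_repository:
        repo: "deb https://artifacts.elastic.co/packages/8.x/apt stable main"
        state: present

    - name: Install Elasticsearch
      ansible.builtin.apt:
        name: elasticsearch
        update_cache: true

    - name: Enable and start the service
      ansible.builtin.service:
        name: elasticsearch
        state: started
        enabled: true
```

The same pattern extends to Kibana and Fleet Server, which keeps all nodes reproducible and avoids the manual-setup errors mentioned above.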
You can use a virtualization platform to run your Elasticsearch nodes, or opt for bare metal. The main disadvantage of bare metal is that you cannot scale up without replacing the hardware itself. When using a virtualization platform, you must ensure that the platform is location-aware so that you can apply anti-affinity rules. These prevent nodes holding primary and replica shards of the same index from running on the same hypervisor.
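Anti-affinity rules keep the VMs apart at the hypervisor level; on the Elasticsearch side, you can additionally use shard allocation awareness so the cluster itself never places a primary and its replica in the same failure domain. A sketch, assuming a custom node attribute named `zone` (the attribute name and zone values are illustrative):

```yaml
# elasticsearch.yml — set a different value per failure domain,
# e.g. zone-a on hypervisor group A, zone-b on group B.
node.attr.zone: zone-a

# Tell the allocator to spread primaries and replicas across zones.
cluster.routing.allocation.awareness.attributes: zone

# Optional: force awareness, so replicas stay unassigned rather than
# piling up in a single surviving zone.
cluster.routing.allocation.awareness.force.zone.values: zone-a,zone-b
```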
ECE
ECE stands for Elastic Cloud Enterprise and is the Elastic Stack built on Docker or, in newer versions of ECE, Podman. ECE forms the basis of Elastic's hosted cloud offering. The ECE engine automates the deployment of your Elastic Stack instances and can manage multiple Elastic deployments. In addition, the engine manages the networking between the Docker/Podman nodes, provided that the correct ports are available.
ECK
ECK stands for Elastic Cloud on Kubernetes and, as the name suggests, is used to manage Elastic deployments on Kubernetes. ECK is newer than ECE and does not offer a comprehensive UI for managing your deployments. ECK focuses primarily on automation and configuration as code.
Elastic Cloud
Elastic also offers a managed solution with hosted Elasticsearch, Kibana, and other Elastic Stack components on major cloud providers such as AWS, Google Cloud, and Microsoft Azure. This allows organizations to monitor systems, detect threats, and build fast search experiences. Elastic Cloud provides automatic updates, security patches, and backups, ensuring reliability and security. Although this option eliminates the complexity and costs of managing your own infrastructure, you are still not completely free of management tasks.
Considerations for building and configuring your self-hosted Elastic platform
The scaling options apply to all hosting options, with a minor exception for ECE, unless you disable certain protections.
In general
JVM heap size
Do not set the JVM heap size between 32GB and 48GB. Above roughly 32GB, Java disables Compressed OOPs (compressed ordinary object pointers), which doubles the size of every object reference from 4 to 8 bytes. This reduces the number of objects that fit in the heap: a heap anywhere between roughly 32GB and 48GB actually gives you less effective memory than a heap just under 32GB, which can lead to OutOfMemoryErrors.
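A common way to stay safely under that cutoff is to pin both the minimum and maximum heap just below 32GB via a drop-in file; the 31g value and the file name below are illustrative:

```
# config/jvm.options.d/heap.options
-Xms31g
-Xmx31g
```

Setting -Xms and -Xmx to the same value avoids heap resizing pauses at runtime.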
Elasticsearch is much better suited to scaling out than scaling up. So you're better off with more, smaller nodes than with just a few very large nodes.
Central Processing Unit
Because Elasticsearch has to perform multiple operations simultaneously, give each node at least 2.5 vCPUs.
Storage
Elastic recommends 45GB of storage per GB of RAM for hot data nodes. In my experience, you can stretch this a little and safely go up to 60GB per GB of RAM, or even 90GB/GB of RAM. As always, it depends.
These ratios increase as you move to warm, cold, and frozen data tiers. Cold and frozen data tiers require searchable snapshots, which are available with Enterprise subscriptions. For the warm storage tier, the recommended ratio is 160GB/GB RAM, but this can be even higher. I have seen configurations of up to 250GB/GB RAM.
For the cold data tier, you use a similar configuration to the warm tier, but because you only need one shard and no replicas, you can accommodate twice the number of shards. The frozen data tier consists solely of caching nodes; all data is retrieved from snapshots on demand. These nodes are typically large, with a lot of storage capacity. Usually, you will see a limited number of nodes with 64GB of RAM and a lot of available storage for caching snapshot data.
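The ratios above translate directly into capacity math. Here is a small sketch; the cold-tier ratio of 320GB/GB RAM is simply my doubling of the warm ratio, following the one-shard-no-replica argument above, and all numbers are starting points rather than hard rules:

```python
# Rough capacity sketch: disk implied by a tier's RAM-to-storage ratio,
# and the node count a given data volume needs.

RATIOS_GB_PER_GB_RAM = {"hot": 45, "warm": 160, "cold": 320}

def disk_per_node_gb(ram_gb: int, tier: str) -> int:
    """Disk capacity one node of the given tier should carry."""
    return ram_gb * RATIOS_GB_PER_GB_RAM[tier]

def nodes_needed(total_data_gb: int, ram_gb: int, tier: str,
                 replicas: int = 1) -> int:
    """Node count for total_data_gb of primary data plus replicas."""
    per_node = disk_per_node_gb(ram_gb, tier)
    raw = total_data_gb * (1 + replicas)
    return -(-raw // per_node)  # ceiling division

# Example: 20 TB of primary hot data on 64GB-RAM nodes with 1 replica.
print(disk_per_node_gb(64, "hot"))      # 2880 GB per node
print(nodes_needed(20_000, 64, "hot"))  # 14 nodes
```

Swapping the tier in the same call shows why warm and cold tiers need far fewer nodes for the same data volume.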
Sharding
Elasticsearch comes with mostly sensible defaults. For most observability workloads, 1 primary shard and 1 replica work fine. Keep the maximum shard size below 50GB and below 200 million documents per shard. Use data streams with an ILM policy to roll over indices automatically.
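Those limits can be encoded directly in the rollover action of an ILM policy. A sketch in Kibana Dev Tools syntax follows; the policy name, the 30-day rollover age, and the 90-day retention are illustrative values you should replace with your own requirements:

```
PUT _ilm/policy/logs-observability
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": {
            "max_primary_shard_size": "50gb",
            "max_primary_shard_docs": 200000000,
            "max_age": "30d"
          }
        }
      },
      "delete": {
        "min_age": "90d",
        "actions": { "delete": {} }
      }
    }
  }
}
```

Attach the policy to your data streams via their index template, and rollover then happens on whichever threshold is hit first.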
Data center considerations
If possible, choose three data centers, provided that latency and available bandwidth allow this. Traffic and latency between the data centers are important factors for Elasticsearch. If three data centers are not possible, try to use two, with a third location (e.g., a colocation) for a voting-only master node.
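The tiebreaker at the third location can be a small, dedicated voting-only master: it participates in master elections but is never elected itself and holds no data, so modest hardware suffices. In its configuration:

```yaml
# elasticsearch.yml on the tiebreaker node at the third location
node.roles: [ master, voting_only ]
```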
If your organization has only one data center, ensure as much separation as possible: two different rooms, two different rows, two different racks, two different power feeds, and two different switches. Try to create two stacks so that primary and replica shards do not run on the same critical path.
Monitoring
Set up a separate cluster for monitoring your Elasticsearch cluster(s). If your primary cluster fails while your monitoring is running on the same instance, you will be virtually blind, and finding the cause of the failure will become an order of magnitude more difficult.
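One way to wire this up is to ship Stack Monitoring metrics from each production node to the dedicated monitoring cluster, for example with Metricbeat; Elastic Agent is an alternative collector. The monitoring cluster hostname below is a placeholder:

```yaml
# metricbeat.yml sketch on each production node: collect Elasticsearch
# monitoring metrics locally, ship them to the separate monitoring cluster.
metricbeat.modules:
  - module: elasticsearch
    xpack.enabled: true
    period: 10s
    hosts: ["http://localhost:9200"]

output.elasticsearch:
  hosts: ["https://monitoring-cluster.example:9200"]
```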
Follow-up: part two of this blog series
Now you have a good idea of the different hosting options and what to consider when scaling your self-hosted Elastic platform. In part two of this blog series, I will take a closer look at building a cluster, setting up the infrastructure, and how to scale for optimal performance.