How to scale your on-prem Elastic Observability cluster – part 2

So, in my first post, we discussed the various Elasticsearch hosting options available and the points to consider when building and configuring your self-hosted Elastic platform. In this follow-up article, we'll dive deeper into setting up a cluster, expanding the infrastructure, and scaling to optimize for sustainability and stability.

Tip: Read my first blog about scaling an on-premises Elastic Observability Cluster here.

Start small

The smallest cluster you can set up with redundancy and quorum consists of three Elasticsearch nodes, each assigned all roles. This generally works well with an ingest rate of less than 1,000 documents per second. The limit mainly depends on the available CPU capacity, but 3× 4 CPUs with a 16 GB heap should be able to handle this, provided you do not ingest Windows logs, as these require significantly more processing power.
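As a sketch, an elasticsearch.yml for one of these three all-in-one nodes could look roughly like this. The cluster name, node names, and hostnames are placeholders, not a prescribed setup:

```yaml
# elasticsearch.yml for the first of three all-in-one nodes (names are placeholders)
cluster.name: observability
node.name: es-node-1
# node.roles is deliberately NOT set, so the node takes on all roles
network.host: 0.0.0.0
discovery.seed_hosts: ["es-node-1", "es-node-2", "es-node-3"]
# Only relevant on the very first startup of a brand-new cluster:
cluster.initial_master_nodes: ["es-node-1", "es-node-2", "es-node-3"]
```

With three identical nodes like this, the cluster can elect a master and keep quorum after losing any single node.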

Expanding your infrastructure

ECE and Elastic Cloud automatically add three dedicated master-eligible nodes to each cluster with six or more Elasticsearch nodes. Personally, I would add the masters as soon as three all-in-one nodes start showing performance issues.
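Splitting off dedicated masters comes down to restricting the roles per node. A hedged sketch of the relevant elasticsearch.yml lines (role lists shown are illustrative):

```yaml
# On the three dedicated master-eligible nodes:
node.roles: [ master ]

# On the remaining nodes, drop the master role, e.g.:
node.roles: [ data, ingest ]
```

Dedicated masters then only coordinate cluster state and elections, so they can stay small and are not disturbed by indexing or search load.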

The next layer consists of ingest nodes. Only add ingest nodes if you are using Elasticsearch ingest pipelines. If all incoming data is already in the correct format upon arrival, these nodes offer no advantage. As your storage needs grow, you will need to add additional storage tiers, because storing only hot data quickly becomes unaffordable.

If the ingest rate continues to increase, it is wise to consider decoupling agents from Elasticsearch and placing a Kafka buffer layer between agents and Elastic. At the time of writing (December 2024), Elastic Agent output with Fleet is limited to Elasticsearch, Logstash, and Kafka.

When to scale, and in which direction

Now you have a cluster, but its performance is not what you expected. You think you need to scale up or scale out, but you're not sure which option is right for you. This is where cluster monitoring comes in: you first need to see the signals in order to determine what you need to do.

Cluster state issues

Cluster state is a data structure that keeps track of a lot of internal information that is necessary for every Elasticsearch node. The elected master recalculates the cluster state with every state change and sends an update to every active node. Once enough master-eligible nodes have confirmed the update, the elected master commits the change and broadcasts a message that the new state must be applied. If the new state cannot be published completely within the timeout set via cluster.publish.timeout, the master steps down and a re-election is forced.
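The "enough master-eligible nodes" part is the standard majority calculation. As a small illustration (this is the textbook quorum formula, not Elasticsearch source code):

```python
def quorum(master_eligible: int) -> int:
    """Majority of master-eligible nodes needed to commit a cluster state update."""
    return master_eligible // 2 + 1

# With 3 master-eligible nodes, 2 confirmations commit the update,
# so the cluster survives the loss of one master-eligible node.
for n in (1, 3, 5):
    print(f"{n} master-eligible -> quorum of {quorum(n)}")
```

This is also why master-eligible nodes come in odd numbers: going from 3 to 4 raises the quorum from 2 to 3 without letting you tolerate any extra failures.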

As a general rule, adding more masters is not the solution if you are already using dedicated master nodes.

If you encounter frequent cluster state update warnings or frequent master re-elections, this could mean a few things:
  • Cluster state size, caused by a field explosion. In this case, the cluster state grows because there are too many field definitions. Although this is not the focus of this document, you should check which fields are defined, which are dynamic, and which are static. Also look at the number of shards in your cluster. Fewer shards also means a smaller cluster state. More memory can help, but you don't want a huge cluster state, because syncing the cluster state takes longer as your state grows.
  • Masters are too small. If master-eligible nodes are configured with too little memory, the JVM may spend significant time on garbage collection. Check the GC rate and the average duration of a GC run. Ideally, the GC rate on master nodes should stay below one run per 30 seconds.
    Solution: increase heap or RAM, or split off dedicated masters if you are running multiple roles on the masters.
  • Masters are too busy. If the average CPU usage becomes too high, you need to add CPUs or split them if you are running multiple roles. ECE is a special case. By default, ECE calculates the CPU quota based on the amount of memory allocated to a container in order to limit the noisy neighbor syndrome. For ECE, you can increase the memory allocated to the container.
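The GC rule of thumb above is easy to turn into a quick check against the collection counts you can read from the nodes stats API. A minimal sketch, where the counts and the measurement window are assumed inputs you collected yourself:

```python
GC_RATE_THRESHOLD = 1 / 30  # rule of thumb: under one GC run per 30 seconds

def gc_runs_per_second(collection_count: int, window_seconds: float) -> float:
    """Average GC rate over a measurement window (e.g. from two nodes-stats samples)."""
    return collection_count / window_seconds

def master_gc_healthy(collection_count: int, window_seconds: float) -> bool:
    """True if the master node's GC rate stays under the rule-of-thumb threshold."""
    return gc_runs_per_second(collection_count, window_seconds) < GC_RATE_THRESHOLD

# Example: 50 GC runs over a 30-minute window -> ~0.028 runs/sec, still healthy
print(master_gc_healthy(50, 30 * 60))
```

In practice you would sample `jvm.gc.collectors.*.collection_count` twice and use the delta, so short startup spikes do not skew the average.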

Ingest

Ingest nodes are primarily compute-intensive, but with 4 CPUs and 8 GB RAM, a single ingest node should be able to handle 3,000 to 5,000 events per second, depending on the complexity of the pipeline.

If you use Fleet or integrations, you benefit greatly from separate ingest nodes, because all integrations use ingest pipelines to populate and rename fields to match ECS. This allows you to separate document processing from indexing actions.

You determine when to scale out ingest nodes based on the CPU usage of the ingest node, linked to the index rate and CPU usage of the hot data nodes. As long as the hot data nodes are not approaching their limits, you can scale out ingest nodes.
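A rough capacity sketch can help with the initial sizing. The per-node throughput below takes the midpoint of the 3,000–5,000 events/sec range mentioned earlier, and the headroom factor is my own assumption so nodes do not run at 100% during peaks:

```python
import math

def ingest_nodes_needed(peak_events_per_sec: float,
                        per_node_capacity: float = 4000,  # midpoint of 3,000-5,000 ev/s
                        headroom: float = 0.7) -> int:
    """Ingest nodes needed so each runs at ~70% of assumed capacity at peak."""
    effective = per_node_capacity * headroom
    return max(1, math.ceil(peak_events_per_sec / effective))

# Example: a 10,000 events/sec peak needs 4 nodes at these assumptions
print(ingest_nodes_needed(10_000))
```

Treat the result as a starting point only; measured CPU usage under your real pipelines, not a formula, should drive the final scale-out decision.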

Ingest buffering

There will come a time when your hot data nodes simply cannot handle your ingress data during peak times. That's when it's time to implement a buffering layer, unless you've already designed your solution with a buffering layer. The buffering layer ensures that log events are sent from your log producers as quickly as possible and can smooth out ingest peaks to Elasticsearch. Such a buffer layer adds extra latency, and you will need to agree with the users of your cluster on what is acceptable latency.
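For agents shipping via a Beats-style output, a Kafka buffer configuration looks roughly like this; hosts, topic, and tuning values are placeholders, and with Fleet-managed agents you would configure the Kafka output centrally in Fleet settings instead:

```yaml
output.kafka:
  hosts: ["kafka-1:9092", "kafka-2:9092", "kafka-3:9092"]
  topic: "observability-logs"
  compression: gzip
  required_acks: 1
```

A separate consumer (for example Logstash or Elastic Agent reading from Kafka) then drains the topic into Elasticsearch at a rate the hot nodes can sustain.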

Hot data nodes

The hot data nodes are the workhorses of an Elasticsearch observability cluster. Data is indexed on these nodes, and most queries in an observability cluster only need to access recent data. If you don't have dedicated ingest nodes, they also handle part of the ingest pipeline processing. All processing required for ILM rollover also takes place on these nodes.

Your scaling decisions should be based on heap usage, available storage, and CPU usage.

  • If your heap usage is still low but your CPU is maxed out, add more CPU. If you cannot add CPU, scale out.
  • If your heap usage is high and you see frequent GC runs: add heap/memory, but keep in mind the ~32 GB heap limit (above which compressed object pointers are disabled). If you cannot add heap to your running instances, scale out.
  • If you are approaching your storage limits (ideally noticed before you hit the high watermark): you can adjust the low and high disk watermarks as a stopgap, but the structural fix is to add storage (keeping the storage-to-memory ratios in mind) or to scale out. Another option is to adjust your ILM policies so data stays hot for a shorter period, or to add a warm data tier.
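The three rules above can be condensed into a toy decision helper. All thresholds here are my own illustrative assumptions, not Elasticsearch defaults; the point is the shape of the decision, not the exact numbers:

```python
def hot_node_advice(heap_used_pct: float, cpu_pct: float, disk_used_pct: float,
                    high_watermark_pct: float = 90.0) -> list[str]:
    """Toy helper mirroring the scaling rules above; thresholds are assumptions."""
    advice = []
    if cpu_pct > 85 and heap_used_pct < 60:
        advice.append("add CPU, or scale out if you cannot")
    if heap_used_pct > 75:
        advice.append("add heap/memory (stay under ~32 GB heap), or scale out")
    if disk_used_pct > high_watermark_pct - 10:
        advice.append("add storage, adjust ILM, or add a warm tier")
    return advice

# CPU-bound but heap-healthy node:
print(hot_node_advice(heap_used_pct=40, cpu_pct=95, disk_used_pct=50))
```

In a real setup you would feed this from the nodes stats and cat allocation APIs, and review the advice rather than act on it automatically.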

A brief note on Data Tiers

Elasticsearch uses the concept of data tiers, where hot nodes contain actively updated data. Warm nodes contain data that is actively searched but no longer updated. The frozen tier is used for indices that need to be preserved but are rarely consulted.

The frozen tier depends on snapshots and uses partially mounted indices, where only certain metadata is kept in memory. Only when a search request needs the data are the required parts of the snapshot read back from the repository so they can be searched. Because this depends on snapshots, you can expect a search query to take minutes rather than seconds. The advantage is that you can retain large amounts of frozen data without needing dedicated servers to keep this data available.
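Tiers are usually tied together with an ILM policy. A hedged sketch of such a policy, where the ages, rollover size, and repository name are placeholders you would tune for your own retention requirements:

```json
PUT _ilm/policy/observability-logs
{
  "policy": {
    "phases": {
      "hot": {
        "actions": {
          "rollover": { "max_primary_shard_size": "50gb", "max_age": "7d" }
        }
      },
      "warm": {
        "min_age": "7d",
        "actions": { "set_priority": { "priority": 50 } }
      },
      "frozen": {
        "min_age": "30d",
        "actions": {
          "searchable_snapshot": { "snapshot_repository": "my-snapshot-repo" }
        }
      },
      "delete": {
        "min_age": "365d",
        "actions": { "delete": {} }
      }
    }
  }
}
```

Data then rolls over on the hot tier, moves to warm after a week, is mounted from snapshots in the frozen tier after a month, and is deleted after a year.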

Conclusion

By now, you should have a good idea of my view on how to scale your on-prem Elastic Observability cluster. With the growing number of cyberattacks, observability is becoming ever more important. At SUE, we help our customers create more visibility by implementing solutions such as Elastic.

Do you need help or would you simply like to have a no-obligation conversation about Observability for your organization? Please contact us!

Nick Methorst
