Scaling your on-prem Elastic Observability Cluster - Part 2
So in my first post, we discussed the different Elasticsearch hosting options that are available and the considerations for building and configuring your self-hosted Elastic platform. In this follow-up post, we'll dive into building a cluster, growing the infrastructure, and how to scale to optimize for performance.
Tip: read my first post on scaling an on-prem Elastic Observability Cluster here.
Starting out small
The smallest cluster you can have with redundancy and quorum is 3 Elasticsearch nodes, each node having all roles assigned. This will generally work well at ingest rates below 1,000 docs/second. The limit primarily depends on the CPU available, but three nodes with 4 CPUs and a 16GB heap each should be able to handle this, as long as you're not ingesting Windows logs, since those require a lot more processing.
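If you want to double-check that such a starter cluster really does have three master-eligible nodes (and therefore a voting quorum of two), a quick look at the cat nodes API is enough. Here is a minimal sketch using the official Python client; the endpoint and API key are placeholders you'd replace with your own:

```python
from elasticsearch import Elasticsearch

# Placeholder endpoint and credentials -- adjust to your own cluster.
es = Elasticsearch("https://es01.example.internal:9200", api_key="<api-key>")

# One line per node: its name, its roles, its configured heap,
# and whether it is the current elected master ("*").
for node in es.cat.nodes(format="json", h="name,node.role,heap.max,master"):
    print(node["name"], node["node.role"], node["heap.max"], node["master"])
```

With three all-in-one nodes you should see the role string contain "m" on every line, meaning any two of them can form a quorum.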
Building out your infrastructure
ECE and Elastic Cloud will automatically add 3 dedicated master-eligible nodes to any cluster with 6 or more Elasticsearch nodes. Personally, I would add the dedicated masters as soon as the 3 all-in-one nodes start to show performance problems.
The next layer would be ingest nodes. Add ingest nodes only if you are using Elasticsearch ingest pipelines; if all your incoming data is already in the correct format when it arrives, you won't get any benefit from them. As your storage needs grow, you will also need to add more storage tiers, because keeping everything on hot storage quickly becomes prohibitively expensive.
As the ingestion rate increases, you will want to think about decoupling the agents from Elasticsearch and employing a Kafka buffer layer between the agents and Elastic. At the time of this writing (December 2024), Elastic Agent output with Fleet is limited to Elasticsearch, Logstash, and Kafka.
When to scale what and in which direction
Now you have a cluster, but performance in general is not what you expect. You think you need to scale up or out, but you're not sure which scaling option is appropriate. This is where cluster monitoring comes into play: you first need to see the signals in order to determine what you need to do.
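If you don't have stack monitoring set up yet, even a small script against the cluster APIs will surface those signals. Below is a minimal sketch with the Python client (endpoint and credentials are placeholders) that prints the overall health plus per-node heap, CPU and disk pressure:

```python
from elasticsearch import Elasticsearch

# Placeholder endpoint and credentials -- adjust to your own cluster.
es = Elasticsearch("https://es01.example.internal:9200", api_key="<api-key>")

health = es.cluster.health()
print(f"status={health['status']} nodes={health['number_of_nodes']} "
      f"pending_tasks={health['number_of_pending_tasks']}")

# Per-node signals: heap pressure, recent CPU usage and disk usage.
for n in es.cat.nodes(format="json",
                      h="name,node.role,heap.percent,cpu,disk.used_percent"):
    print(n["name"], n["node.role"],
          f"heap={n['heap.percent']}%",
          f"cpu={n['cpu']}%",
          f"disk={n['disk.used_percent']}%")
```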
Cluster state issues
Cluster state is a data structure that keeps track of a variety of internal information needed by every Elasticsearch node. The elected master recalculates the cluster state on every state change and sends an update to every active node. Once enough master-eligible nodes have acknowledged the update, the elected master commits the change and broadcasts a message that the new state needs to be applied. Failure to completely publish the new state within the timeout set by cluster.publish.timeout will cause the master to stand down and force a re-election.
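A quick way to see whether state updates are backing up is to look at the master's pending task queue, and the publish timeout mentioned above can be read straight from the settings API. A small sketch, again with placeholder connection details:

```python
from elasticsearch import Elasticsearch

# Placeholder endpoint and credentials -- adjust to your own cluster.
es = Elasticsearch("https://es01.example.internal:9200", api_key="<api-key>")

# Cluster state updates the elected master has not processed yet;
# a persistently long queue means state changes are not keeping up.
pending = es.cluster.pending_tasks()["tasks"]
print(f"pending cluster state tasks: {len(pending)}")
for task in pending[:5]:
    print(task["priority"], task["source"], f"{task['time_in_queue_millis']}ms in queue")

# The publish timeout the text refers to (30s by default).
settings = es.cluster.get_settings(include_defaults=True, flat_settings=True)
timeout = (settings["persistent"].get("cluster.publish.timeout")
           or settings["defaults"].get("cluster.publish.timeout"))
print("cluster.publish.timeout =", timeout)
```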
As a general rule, adding more masters will not be the solution if you are already using dedicated master nodes.
If you encounter frequent cluster state update warnings or frequent master re-elections, it might mean a couple of things (the diagnostic sketch a little further down can help you tell them apart):
- The cluster state is too big, caused by a field explosion. In this case the cluster state grows because there are too many field definitions. While not the focus of this document, you need to look at which fields are defined, which are dynamic, and which are static. Also take a look at the number of shards in your cluster; fewer shards also means a smaller cluster state. Increasing memory might help, but you don't want a huge cluster state, since syncing the cluster state takes longer the bigger it is.
- Masters are too small. When master-eligible nodes are configured with too little memory, or nearly too little, the JVM will need to run garbage collection frequently. Check the GC rate and the average time a GC run takes; ideally a master node should need no more than one GC run every 30 seconds. Solution: either increase heap or RAM, or split off your masters if you are running multiple roles on them.
- Masters are too busy. If the average CPU usage becomes too high, you will need to add CPUs, or split off the masters when running multiple roles.
ECE is a special case. By default, ECE calculates the CPU quota based on the amount of memory assigned to a container, in order to mitigate the noisy neighbour syndrome, so on ECE you increase the memory assigned to the container to get more CPU.
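To tell those cases apart, it helps to count how many distinct fields the cluster actually knows about and to look at the garbage-collection statistics of the master-eligible nodes. A hedged sketch; the connection details are placeholders and the thresholds are yours to pick:

```python
from elasticsearch import Elasticsearch

# Placeholder endpoint and credentials -- adjust to your own cluster.
es = Elasticsearch("https://es01.example.internal:9200", api_key="<api-key>")

# Rough field-explosion check: how many distinct field names exist across all indices?
fields = es.field_caps(index="*", fields="*")["fields"]
print(f"distinct fields across all indices: {len(fields)}")

# GC behaviour of the master-eligible nodes: count and total time of old-gen collections.
stats = es.nodes.stats(node_id="master:true", metric="jvm")
for node in stats["nodes"].values():
    old_gc = node["jvm"]["gc"]["collectors"]["old"]
    print(node["name"],
          f"old GC runs={old_gc['collection_count']}",
          f"total GC time={old_gc['collection_time_in_millis']}ms")
```

If the field count keeps climbing, look at your dynamic mappings; if GC runs dominate, the masters need more memory or nodes of their own.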
Ingest
Ingest nodes are primarily compute-intensive, but with 4 CPUs and 8GB RAM a single ingest node should be able to handle 3,000-5,000 events/second, depending on pipeline complexity.
If you are using Fleet or integrations, you will benefit greatly from separate ingest nodes, since all integrations use ingest pipelines to populate or rename the correct ECS fields. This way you separate the processing of documents from the indexing work.
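To give you an idea of what such a pipeline looks like, here is a hedged sketch of a tiny custom one that renames a legacy field to its ECS equivalent and then dry-runs it with the simulate API. The pipeline name and field names are made up, and the real integration pipelines are of course far more elaborate:

```python
from elasticsearch import Elasticsearch

# Placeholder endpoint and credentials -- adjust to your own cluster.
es = Elasticsearch("https://es01.example.internal:9200", api_key="<api-key>")

# A made-up example pipeline: move a legacy field to ECS and tag the event.
es.ingest.put_pipeline(
    id="my-ecs-rename-example",
    description="Example: rename a legacy field to its ECS equivalent",
    processors=[
        {"rename": {"field": "src_ip", "target_field": "source.ip", "ignore_missing": True}},
        {"set": {"field": "event.module", "value": "custom"}},
    ],
)

# Dry-run the pipeline against a sample document without indexing anything.
result = es.ingest.simulate(
    id="my-ecs-rename-example",
    docs=[{"_source": {"src_ip": "10.0.0.1", "message": "hello"}}],
)
print(result["docs"][0]["doc"]["_source"])
```

Every document that passes through a pipeline like this costs CPU on whichever node runs it, which is exactly the work you want to keep away from your hot data nodes.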
Deciding when to scale out ingest nodes is done on the basis of the CPU usage of the ingest nodes, coupled with the index rate and CPU usage of the hot data nodes. As long as the hot data nodes aren't approaching their limits, you can keep scaling out ingest nodes.
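In practice that means comparing CPU and indexing numbers across the tiers. A small sketch that prints both per node (sample it twice over an interval to turn the cumulative index_total into a rate); connection details are placeholders:

```python
from elasticsearch import Elasticsearch

# Placeholder endpoint and credentials -- adjust to your own cluster.
es = Elasticsearch("https://es01.example.internal:9200", api_key="<api-key>")

# CPU load and indexing totals per node; compare the ingest nodes with the hot data nodes.
stats = es.nodes.stats(metric=["os", "indices"])
for node in stats["nodes"].values():
    print(node["name"], node["roles"],
          f"cpu={node['os']['cpu']['percent']}%",
          f"docs_indexed={node['indices']['indexing']['index_total']}")
```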
Ingest buffering
There will come a moment when your hot data nodes simply aren't able to handle your ingress data at peak times. Then it's time to implement a buffering layer, unless you've designed your solution with that layer already in place. The buffering layer ensures that your log events are moved off of your log producers as quickly as possible, and it can smooth out your ingest peaks towards Elasticsearch. Such a buffer layer will incur extra latency, so you will need to agree on an acceptable latency with the users of your cluster.
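To make the buffer idea concrete, here is a deliberately simplified sketch of the consuming side of such a layer, using confluent-kafka and the Elasticsearch bulk helper. In a real deployment Logstash usually fills this consuming role (with Elastic Agent writing to Kafka on the producing side); the broker, topic and index names below are placeholders:

```python
import json

from confluent_kafka import Consumer
from elasticsearch import Elasticsearch, helpers

# Placeholder endpoint and credentials -- adjust to your own cluster.
es = Elasticsearch("https://es01.example.internal:9200", api_key="<api-key>")

consumer = Consumer({
    "bootstrap.servers": "kafka01.example.internal:9092",  # placeholder broker
    "group.id": "es-indexer",
    "auto.offset.reset": "earliest",
})
consumer.subscribe(["logs"])  # placeholder topic

def next_batch(max_docs=500, timeout=1.0):
    """Pull up to max_docs buffered events and turn them into bulk actions."""
    for _ in range(max_docs):
        msg = consumer.poll(timeout)
        if msg is None or msg.error():
            break
        yield {"_index": "logs-buffered-example", "_source": json.loads(msg.value())}

while True:
    actions = list(next_batch())
    if actions:
        # The buffer absorbs the peaks; we index at whatever rate the hot nodes can sustain.
        helpers.bulk(es, actions)
```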
Hot Data nodes
The hot data nodes are the workhorses of an Elasticsearch observability cluster. Data is indexed on these nodes, and since most queries in an observability cluster only need to access recent data, most searches land here too. If you don't have dedicated ingest nodes, the hot nodes will also handle part of the ingest pipeline processing, and any processing needed on ILM rollover takes place on these nodes as well.
Your scaling decisions should be based on heap usage, available storage, and CPU usage.
- If your heap usage is still low, but you are maxing out on CPU, add more CPU. If you can’t add CPU, scale out.
- If your heap usage is high, and you are seeing frequent GC runs, add heap/memory, but keep the ~32GB compressed-oops heap limit in mind. If you can't add heap to your running instances, scale out.
- If you're reaching your storage limits (but before you hit the high watermark, I hope), you can buy some time by raising the low and high watermark targets, but you will still need to either add storage, keeping the storage-to-memory ratios in mind, or scale out. Another option is to change your ILM policies so data stays hot for a shorter amount of time, or to add a warm data tier. A sketch of inspecting and adjusting the watermarks follows below.
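The watermark targets I mentioned are regular cluster settings, so you can inspect and adjust them through the settings API. A hedged sketch; the connection details are placeholders and the percentages are examples, not recommendations:

```python
from elasticsearch import Elasticsearch

# Placeholder endpoint and credentials -- adjust to your own cluster.
es = Elasticsearch("https://es01.example.internal:9200", api_key="<api-key>")

# Current (explicit or default) disk watermark settings.
settings = es.cluster.get_settings(include_defaults=True, flat_settings=True)
for key in ("cluster.routing.allocation.disk.watermark.low",
            "cluster.routing.allocation.disk.watermark.high",
            "cluster.routing.allocation.disk.watermark.flood_stage"):
    value = settings["persistent"].get(key) or settings["defaults"].get(key)
    print(key, "=", value)

# Example only: buy a little time by raising the low and high watermarks.
es.cluster.put_settings(persistent={
    "cluster.routing.allocation.disk.watermark.low": "88%",
    "cluster.routing.allocation.disk.watermark.high": "93%",
})
```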
A word on Data Tiers
Elasticsearch uses the concept of data tiers, in which the hot nodes hold actively updated data, the warm nodes hold data that is still actively searched but no longer updated, and the frozen tier is used for indices that need to be kept but rarely have to be accessed.
The frozen tier depends on snapshots and uses partially mounted indices, where only some metadata is held in memory until a search request needs the data; the relevant parts of the snapshot are then temporarily read back into a local cache so they can be searched. Because it depends on snapshots, a search here is expected to take minutes instead of seconds. The upside is that you can have large amounts of frozen data without needing dedicated servers to keep that data available.
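Tying the tiers back to ILM: below is a hedged sketch of a policy that rolls data over on the hot tier, force-merges it on warm, and finally mounts it from a snapshot repository into the frozen tier. The policy name, the ages, and the repository name are placeholders; pick values that match the retention you agreed on with your users.

```python
from elasticsearch import Elasticsearch

# Placeholder endpoint and credentials -- adjust to your own cluster.
es = Elasticsearch("https://es01.example.internal:9200", api_key="<api-key>")

es.ilm.put_lifecycle(
    name="observability-logs-example",  # placeholder policy name
    policy={
        "phases": {
            "hot": {
                "actions": {
                    "rollover": {"max_primary_shard_size": "50gb", "max_age": "7d"},
                },
            },
            "warm": {
                "min_age": "2d",
                "actions": {"forcemerge": {"max_num_segments": 1}},
            },
            "frozen": {
                "min_age": "30d",
                "actions": {
                    # Requires a registered snapshot repository; the name is a placeholder.
                    "searchable_snapshot": {"snapshot_repository": "my-snapshot-repo"},
                },
            },
            "delete": {
                "min_age": "365d",
                "actions": {"delete": {}},
            },
        },
    },
)
```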
Conclusion
So by now, you should have a pretty good idea of my opinion on how to scale your on-prem Elastic Observability Cluster. With an increasing number of cyber attacks, Observability is becoming more and more important. At SUE, we help our customers increase visibility by implementing solutions like Elastic. Do you need help, or would you like to have a no-pressure chat about Observability for your company? Let's get in touch!