Elasticsearch slow log: the cheapest performance tool you're misconfiguring
Elasticsearch’s slow log is the cheapest performance tool you have, and most teams configure it once at install and never touch it again. Out of the box it is disabled entirely (every threshold defaults to -1), and the example thresholds most teams paste in from the docs (10s warn, 5s info) never catch the queries actually hurting your cluster.
What the slow log captures
Two separate logs live on each data node, written per shard:
- Search slow log — query-phase and fetch-phase timing. Logs the source (the query JSON), the took time, plus the index and shard.
- Indexing slow log — per-document indexing time. Logs index name, document id, source (truncated), and took time.
Both are JSON-formatted in modern Elasticsearch and live at logs/<cluster>_index_search_slowlog.json and logs/<cluster>_index_indexing_slowlog.json. Ship them with Filebeat; don’t parse text logs.
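For orientation, a single search slow log entry looks roughly like this (abridged, and the exact field names vary by major version; 8.x nests most of them under elasticsearch.slowlog.*):
{
  "type": "index_search_slowlog",
  "timestamp": "2026-04-12T08:14:02,312Z",
  "level": "WARN",
  "message": "[events-2026-04][2]",
  "took": "1.4s",
  "took_millis": "1400",
  "total_hits": "87 hits",
  "search_type": "QUERY_THEN_FETCH",
  "total_shards": "30",
  "source": "{\"query\":{\"bool\":{...}}}"
}
took_millis is the field to graph and alert on; source is the offending query itself.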
Thresholds that matter
The defaults do nothing for you. Set thresholds explicitly, per index, tuned to the query patterns that matter there, not one blanket value for the whole cluster:
PUT /events-2026-04/_settings
{
"index.search.slowlog.threshold.query.warn": "1s",
"index.search.slowlog.threshold.query.info": "500ms",
"index.search.slowlog.threshold.query.debug": "200ms",
"index.search.slowlog.threshold.fetch.warn": "500ms",
"index.search.slowlog.threshold.fetch.info": "100ms",
"index.indexing.slowlog.threshold.index.warn": "1s",
"index.indexing.slowlog.threshold.index.info": "500ms",
"index.indexing.slowlog.source": "1000"
}
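Settings applied with _settings stop at the current index: the next events-* index created by rollover starts back at the defaults. Bake the thresholds into an index template instead. A sketch using composable templates (7.8+) and a hypothetical template name; since only the single highest-priority matching index template applies, in practice these lines probably belong inside your existing events template:
PUT _index_template/events-slowlog-thresholds
{
  "index_patterns": ["events-*"],
  "priority": 500,
  "template": {
    "settings": {
      "index.search.slowlog.threshold.query.warn": "1s",
      "index.search.slowlog.threshold.fetch.warn": "500ms",
      "index.indexing.slowlog.threshold.index.warn": "1s"
    }
  }
}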
Search slow log vs indexing slow log
Most teams configure search but ignore indexing. The indexing slow log is what surfaces:
- Heavy mappings — a document with 800 fields takes longer to index than one with 12.
- Refresh-interval pressure — high indexing rates against the default 1s refresh produce a stream of tiny segments, and the resulting merge load shows up as slow individual index operations.
- Pipeline overhead — ingest pipelines (especially ones with grok, geoip, or script processors) can dominate per-document cost on hot-tier nodes; the stats call below confirms it.
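For the pipeline case specifically, node ingest stats report cumulative time per pipeline and per processor, which is the quickest way to confirm the suspicion. filter_path just trims the response:
GET _nodes/stats/ingest?filter_path=nodes.*.ingest.pipelines
A pipeline whose time_in_millis grows much faster than its count is burning real per-document time; grok and script processors are the usual offenders.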
3 patterns the slow log will show you
Three incident classes that only the slow log catches cleanly:
1. The deep-pagination scan
A query like from: 9000, size: 100 tells Elasticsearch to materialize and sort 9100 hits on every shard, then throw 9000 of them away. With 30 shards in the index, the coordinating node just handled 273k candidates to return 100 documents. The slow log shows query times jumping past 1s while result counts stay tiny, the tell-tale sign you should be using search_after with a point-in-time (PIT), sketched below.
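The replacement is a point-in-time plus search_after. A sketch, assuming results are sorted on an @timestamp field; the PIT id is abbreviated, and the search_after value is the epoch-millis sort value of the previous page's last hit:
POST /events-2026-04/_pit?keep_alive=1m

POST /_search
{
  "size": 100,
  "pit": { "id": "<id returned by the _pit call>", "keep_alive": "1m" },
  "sort": [ { "@timestamp": "desc" } ],
  "search_after": [ 1776240842312 ]
}
The first page simply omits search_after; every later page passes the sort values of the previous page's last hit. With a PIT, Elasticsearch adds an implicit _shard_doc tiebreaker, so pagination stays stable even while the index takes writes.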
2. The wildcard-prefix attacker
A user types *foo* in a search box and the application passes it straight into a wildcard query. Elasticsearch can’t use the inverted index for a leading wildcard, so it falls back to scanning every term in the field. Slow log times spike to many seconds. Block it at the application layer; no amount of tuning makes a leading wildcard fast on the Elasticsearch side.
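If the application fix will take a while, there is a blunt cluster-wide guardrail: search.allow_expensive_queries (7.7+) makes Elasticsearch reject wildcard queries outright, along with other expensive types such as regexp, fuzzy, and script queries:
PUT _cluster/settings
{
  "persistent": {
    "search.allow_expensive_queries": false
  }
}
Blunt because it applies to every index and every caller; treat it as an incident tourniquet rather than permanent policy.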
3. The scripted-field hotspot
A dashboard adds a script field — say, a tax calculation — to every hit. Now every search runs Painless once per matched document. The slow log shows steadily increasing query times as result sets grow. Move the calculation to ingest time (sketched below) or to a runtime field with a narrower scope.
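The ingest-time version is a small pipeline. A sketch with hypothetical field and pipeline names (price, tax_rate, add-tax):
PUT _ingest/pipeline/add-tax
{
  "description": "compute tax once per document at write time",
  "processors": [
    {
      "script": {
        "lang": "painless",
        "source": "ctx.price_with_tax = ctx.price * (1 + ctx.tax_rate)"
      }
    }
  ]
}

PUT /events-2026-04/_settings
{
  "index.default_pipeline": "add-tax"
}
One script run per document at write time replaces one run per matched document on every search, the right trade whenever a field is read far more often than it is written.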
FAQ
Should I run Elasticsearch slow log in production at the lowest threshold?
Why are some slow log entries truncated?
Can I get aggregated query stats without the slow log?
What about OpenSearch?