liminfo

Elasticsearch Reference

About Elasticsearch Reference

The Elasticsearch Reference is a searchable cheat sheet covering commonly used parts of the Elasticsearch REST API across eight categories: Index, Query DSL, Aggregation, Mapping, Analyzer, Cluster, Monitoring, and API. The Index category covers creating and deleting indices with shard/replica settings, listing indices with _cat/indices, updating index settings, and reindexing with _reindex. The Query DSL category covers the most commonly used query types: match for full-text search, term for exact keyword matching, bool for composing must/should/must_not/filter clauses, range for numeric and date ranges, multi_match for searching across multiple fields at once, and wildcard for pattern-based matching. Every query is shown as a complete JSON body suitable for the Kibana Dev Tools console or a curl request.

The Aggregation category covers five aggregation types: terms for grouping documents into buckets by field value, metric aggregations (avg, sum, min, max), date_histogram for time-series bucketing with calendar intervals, nested sub-aggregations for multi-level analytics, and cardinality for approximate unique value counting. The Mapping category explains PUT /_mapping with the text, keyword, float, date, nested, and geo_point field types, including the difference between text (analyzed, for full-text search) and keyword (not analyzed, for exact matching, sorting, and aggregations), nested objects, geo_point for geographic proximity queries, and dynamic mapping control with strict mode. The Analyzer category covers custom analyzer composition, the nori tokenizer for Korean morphological analysis, the _analyze API for testing, synonym filters, and ngram tokenizers for partial-word matching.

The Cluster and Monitoring categories provide operational visibility: _cluster/health for overall status, _cluster/stats for node counts and document totals, _nodes/stats for per-node JVM and OS metrics, _cluster/settings for allocation control, and _cat/shards for shard state inspection. Monitoring entries cover _cat/nodes for per-node resource usage, _tasks for running operations, slow log configuration, _cat/thread_pool for queue depth, and hot-warm architecture with routing allocation. The API category covers document indexing with POST /<index>/_doc, single-document retrieval, bulk operations with _bulk, multi-get with _mget, multi-search with _msearch, and the scroll API for paging through large result sets.
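The _bulk endpoint mentioned above expects newline-delimited JSON rather than a single JSON document: one action line followed by one source line per document, with a trailing newline at the end of the body. The following is a minimal sketch of building such a body; the index name "products" and the sample documents are hypothetical, and sending the request (with Content-Type: application/x-ndjson) is left to whatever HTTP client you use.

```python
import json


def build_bulk_body(index, docs):
    """Build an NDJSON body for POST /_bulk from (doc_id, source) pairs."""
    lines = []
    for doc_id, source in docs:
        # Action line: tells Elasticsearch which index and _id to write to.
        lines.append(json.dumps({"index": {"_index": index, "_id": doc_id}}))
        # Source line: the document itself.
        lines.append(json.dumps(source))
    # The bulk body must end with a newline character.
    return "\n".join(lines) + "\n"


# Hypothetical sample data for illustration only.
bulk_body = build_bulk_body(
    "products",
    [("1", {"name": "keyboard", "price": 49.0}),
     ("2", {"name": "mouse", "price": 19.5})],
)
print(bulk_body)
```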

Key Features

  • Index management: PUT/DELETE index, _cat/indices listing, settings update, and _reindex
  • Query DSL: match, term, bool (must/should/must_not/filter), range, multi_match, and wildcard queries with JSON examples
  • Aggregations: terms buckets, avg/sum/min/max metrics, date_histogram, nested sub-aggregations, cardinality
  • Mappings: text vs keyword distinction, nested type, geo_point, dynamic strict mode, and float/date types
  • Analyzers: custom analyzer definition, nori Korean morphological analyzer, synonym filter, ngram tokenizer
  • Cluster ops: health status, stats, per-node JVM/OS metrics, shard allocation, and hot-warm architecture
  • Monitoring: _cat/nodes, _tasks, slow log threshold tuning, thread pool status, and shard inspection
  • API: _doc indexing, _bulk for batch operations, _mget, _msearch multi-query, and scroll for deep pagination

Frequently Asked Questions

What is the difference between a term query and a match query?

A term query performs exact, case-sensitive matching against the indexed terms without analyzing the search input. It is meant for keyword fields and matches exact values such as status codes, IDs, or enum values. A match query runs the input through the field's analyzer (tokenization, lowercasing, and, depending on the analyzer, stemming) before matching, which makes it suitable for full-text search on text fields. A term query on a text field often returns no results, because the indexed text has been tokenized and lowercased while the search input is compared as-is: searching for "Quick" will not match a document whose indexed token is "quick".
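The difference can be sketched with two request bodies and a rough local imitation of analysis. This is an illustration under stated assumptions: the field names "status" and "title" are hypothetical, and naive_standard_analyze is only a crude stand-in (lowercase plus whitespace split) for what a real analyzer does.

```python
import json

# Hypothetical fields: "status" mapped as keyword, "title" mapped as text.
term_query = {"query": {"term": {"status": {"value": "PUBLISHED"}}}}
match_query = {"query": {"match": {"title": {"query": "Quick Brown Foxes"}}}}


def naive_standard_analyze(text):
    """Crude stand-in for an analyzer: lowercase, then split on whitespace."""
    return text.lower().split()


# Tokens that would be stored for a text field holding "Quick Brown Foxes".
indexed_tokens = naive_standard_analyze("Quick Brown Foxes")

# term: the input is compared to the indexed tokens without any analysis.
term_hits = "Quick" in indexed_tokens

# match: the input is analyzed first, then each token is looked up.
match_hits = any(tok in indexed_tokens for tok in naive_standard_analyze("Quick"))

print(json.dumps(term_query))
print(json.dumps(match_query))
print("term on text field matches:", term_hits)
print("match on text field matches:", match_hits)
```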

When should I map a field as text versus keyword?

Use text for fields you want to search with full-text analysis (e.g., article titles, descriptions): the value is broken into tokens that are stored in the inverted index. Use keyword for fields that need exact matching, sorting, or aggregations (e.g., status, category, email address): the whole value is indexed as a single term. A common pattern is a multi-field mapping that provides both, so the same value can be searched as title and sorted or aggregated as title.keyword: "title": { "type": "text", "fields": { "keyword": { "type": "keyword" } } }.
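As a sketch of how the multi-field pattern is used in practice, the bodies below create the mapping and then search the analyzed field while sorting and aggregating on keyword fields. The "status" field and the search text are assumptions added for illustration; the bodies would be sent with PUT /<index> and POST /<index>/_search respectively.

```python
import json

# Body for index creation: "title" is analyzed text with a keyword sub-field.
mapping_body = {
    "mappings": {
        "properties": {
            "title": {
                "type": "text",
                "fields": {"keyword": {"type": "keyword"}},
            },
            # Hypothetical extra field used only for the aggregation below.
            "status": {"type": "keyword"},
        }
    }
}

# Full-text search on the text field, exact sort and grouping on keyword fields.
search_body = {
    "query": {"match": {"title": "elasticsearch guide"}},
    "sort": [{"title.keyword": "asc"}],
    "aggs": {"by_status": {"terms": {"field": "status"}}},
}

print(json.dumps(mapping_body, indent=2))
print(json.dumps(search_body, indent=2))
```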

How does the bool query work?

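The bool query combines multiple query clauses with logical operators: must (AND, contributes to the score), should (OR, raises the score of documents that match), must_not (NOT, matching documents are excluded), and filter (AND, no scoring). Clauses in must and should affect the relevance score; filter and must_not run in filter context and do not. When a bool query has no must or filter clause, at least one should clause has to match by default. Because filter clauses skip scoring, Elasticsearch can cache frequently used ones, which usually makes them cheaper for repeated use than equivalent must clauses.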
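A minimal sketch of a bool query that uses all four clause types is shown below. The field names (title, tags, status, published_at) are hypothetical, and the small helper only restates the scoring rule from the paragraph above; it is not part of any Elasticsearch client.

```python
import json

bool_query = {
    "query": {
        "bool": {
            # Must match, and contributes to the relevance score.
            "must": [{"match": {"title": "elasticsearch"}}],
            # Optional here (a must clause exists); raises the score when it matches.
            "should": [{"match": {"tags": "tutorial"}}],
            # Excludes matching documents; no scoring.
            "must_not": [{"term": {"status": "draft"}}],
            # Must match; no scoring, eligible for caching.
            "filter": [{"range": {"published_at": {"gte": "now-1y/d"}}}],
        }
    }
}

SCORING_CLAUSES = ("must", "should")


def scoring_clauses(body):
    """Return the names of the clause types in the bool query that affect _score."""
    bool_part = body["query"]["bool"]
    return [name for name in bool_part if name in SCORING_CLAUSES]


print(json.dumps(bool_query, indent=2))
print("clauses affecting score:", scoring_clauses(bool_query))
```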

What is the difference between terms aggregation and cardinality aggregation?

The terms aggregation groups documents into buckets by field value and returns the top N values with their document counts, similar to SQL GROUP BY with COUNT. The cardinality aggregation returns an approximate count of unique values for a field using the HyperLogLog++ algorithm. It is memory-efficient but approximate: accuracy is controlled by precision_threshold (default 3000), counts below the threshold are expected to be close to exact, and above it the error is typically a few percent. Use terms for top-value analysis and cardinality for distinct-count metrics across large datasets.
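The sketch below shows one request body with both aggregations, followed by a plain-Python imitation of what each one computes on a tiny in-memory sample. The field names (category, user_id) and the sample documents are hypothetical, and the local version counts exactly, whereas the real cardinality aggregation is approximate.

```python
import json
from collections import Counter

aggs_body = {
    "size": 0,  # return only aggregation results, no hits
    "aggs": {
        "top_categories": {"terms": {"field": "category", "size": 2}},
        "unique_users": {
            "cardinality": {"field": "user_id", "precision_threshold": 3000}
        },
    },
}

# Hypothetical documents used to imitate the two aggregations locally.
docs = [
    {"category": "books", "user_id": "u1"},
    {"category": "books", "user_id": "u2"},
    {"category": "books", "user_id": "u1"},
    {"category": "games", "user_id": "u3"},
    {"category": "games", "user_id": "u1"},
    {"category": "music", "user_id": "u2"},
]

# terms: top N values with their document counts.
top_categories = Counter(d["category"] for d in docs).most_common(2)

# cardinality: number of distinct values (exact here, approximate in Elasticsearch).
unique_users = len({d["user_id"] for d in docs})

print(json.dumps(aggs_body, indent=2))
print("top categories:", top_categories)
print("unique users:", unique_users)
```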

What is the nori analyzer and when do I need it?

The nori analyzer uses the Nori tokenizer, which relies on the mecab-ko-dic dictionary, to perform Korean morphological analysis, splitting text into morphemes (meaning units) rather than only at whitespace. Without nori, the standard analyzer splits Korean text only at spaces and punctuation, so particles and verb endings stay attached to their stems and a search for the bare word often fails to match. Install the analysis-nori plugin and define a custom analyzer using "tokenizer": "nori_tokenizer" in your index settings for Korean-language content.
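A minimal sketch of the index settings described above, together with a body for testing the analyzer through the _analyze API. The analyzer name "korean_text", the field name "body", and the sample sentence are assumptions chosen for illustration, and the analysis-nori plugin must already be installed for these bodies to be accepted.

```python
import json

# Body for PUT /<index>: a custom analyzer built on nori_tokenizer.
index_body = {
    "settings": {
        "analysis": {
            "analyzer": {
                "korean_text": {  # hypothetical analyzer name
                    "type": "custom",
                    "tokenizer": "nori_tokenizer",
                    "filter": ["lowercase"],
                }
            }
        }
    },
    "mappings": {
        "properties": {
            # Hypothetical field that uses the analyzer at index and search time.
            "body": {"type": "text", "analyzer": "korean_text"}
        }
    },
}

# Body for GET /<index>/_analyze: shows the tokens the analyzer produces.
analyze_body = {"analyzer": "korean_text", "text": "동해물과 백두산이"}

# ensure_ascii=False keeps the Korean text readable in the printed JSON.
print(json.dumps(index_body, indent=2, ensure_ascii=False))
print(json.dumps(analyze_body, ensure_ascii=False))
```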

What does a date_histogram aggregation do?

date_histogram groups documents into time buckets based on a date field and an interval: either a calendar_interval (e.g., month, week, day, hour), which follows calendar boundaries, or a fixed_interval (e.g., 30m, 12h), which is a constant duration. It returns a bucket for each interval with a document count, making it the primary tool for time-series analytics such as daily active users, hourly event rates, or monthly revenue trends. You can nest metric aggregations (avg, sum) inside date_histogram to compute per-period statistics.
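The request body below sketches a monthly date_histogram with a nested avg, and the loop after it imitates the bucketing locally so the shape of the result is easier to picture. The field names (timestamp, amount) and the sample events are hypothetical, and the local output format is a simplification of the real response.

```python
import json
from collections import defaultdict
from datetime import datetime

date_histogram_body = {
    "size": 0,
    "aggs": {
        "per_month": {
            "date_histogram": {"field": "timestamp", "calendar_interval": "month"},
            # Metric sub-aggregation computed inside every monthly bucket.
            "aggs": {"avg_amount": {"avg": {"field": "amount"}}},
        }
    },
}

# Hypothetical (date, amount) events used to imitate the aggregation.
events = [("2024-01-03", 10.0), ("2024-01-20", 30.0), ("2024-02-05", 50.0)]

amounts_by_month = defaultdict(list)
for day, amount in events:
    # Bucket key: the first day of the month the event falls in.
    month_start = datetime.strptime(day, "%Y-%m-%d").strftime("%Y-%m-01")
    amounts_by_month[month_start].append(amount)

summary = {
    month: {"doc_count": len(values), "avg_amount": sum(values) / len(values)}
    for month, values in sorted(amounts_by_month.items())
}

print(json.dumps(date_histogram_body, indent=2))
print(json.dumps(summary, indent=2))
```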

What is the scroll API and when should I use it instead of pagination?

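The scroll API is designed for retrieving large numbers of documents (thousands to millions) efficiently, for example when exporting data or reindexing. Unlike from/size pagination, scroll preserves a view of the index as it was at the first request and returns batches via a scroll_id. It should not be used for real-time user-facing pagination (use search_after for that), because each open scroll keeps a search context alive on the cluster and holds resources until it expires or is cleared. Use scroll for batch processing and data export pipelines, and clear the scroll when you are done. Recent Elasticsearch versions recommend search_after with a point in time (PIT) for deep pagination as well.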
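The scroll loop can be sketched as follows. To keep the example runnable without a cluster, the HTTP layer is passed in as a hypothetical send(method, path, body) callable that returns parsed JSON; the fake_send function and the "logs" index exist only for the demonstration. The request paths and bodies follow the scroll endpoints, but treat this as a sketch rather than a production client.

```python
def scroll_all(send, index, page_size=1000, keep_alive="1m"):
    """Yield every hit from `index` via the scroll API.

    `send(method, path, body)` is a hypothetical transport callable that
    performs the HTTP request and returns the parsed JSON response.
    """
    response = send(
        "POST",
        f"/{index}/_search?scroll={keep_alive}",
        {"size": page_size, "query": {"match_all": {}}, "sort": ["_doc"]},
    )
    scroll_id = response.get("_scroll_id")
    try:
        # An empty page signals that the scroll is exhausted.
        while response["hits"]["hits"]:
            yield from response["hits"]["hits"]
            response = send(
                "POST",
                "/_search/scroll",
                {"scroll": keep_alive, "scroll_id": scroll_id},
            )
            scroll_id = response.get("_scroll_id", scroll_id)
    finally:
        # Release the search context instead of waiting for it to expire.
        if scroll_id:
            send("DELETE", "/_search/scroll", {"scroll_id": scroll_id})


# Demonstration with canned responses instead of a real cluster.
pages = [
    {"_scroll_id": "s1", "hits": {"hits": [{"_id": "1"}, {"_id": "2"}]}},
    {"_scroll_id": "s1", "hits": {"hits": [{"_id": "3"}]}},
    {"_scroll_id": "s1", "hits": {"hits": []}},
]
calls = []


def fake_send(method, path, body):
    """Record the call and return the next canned page."""
    calls.append((method, path))
    if method == "DELETE":
        return {"succeeded": True}
    return pages.pop(0)


ids = [hit["_id"] for hit in scroll_all(fake_send, "logs", page_size=2)]
print(ids)
print(calls)
```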

How can I diagnose a yellow or red cluster health status?

A yellow status means all primary shards are allocated but some replica shards are not (common on single-node clusters, where a replica cannot be placed on the same node as its primary). A red status means one or more primary shards are unallocated, so the data in those shards is unavailable until they are assigned. Start with GET /_cluster/health for overall status, then GET /_cat/shards?v to find unassigned shards. Use GET /_cluster/allocation/explain to get a detailed explanation of why a specific shard cannot be allocated.
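The diagnostic steps can be sketched as a small script that works on the JSON form of the shard listing (GET /_cat/shards?format=json) and prepares a body for GET /_cluster/allocation/explain. The sample rows are hypothetical, and expected_status is a simplification of the yellow/red rule described above, not the logic Elasticsearch itself uses.

```python
import json


def unassigned_shards(cat_shards_rows):
    """Filter _cat/shards rows (format=json) down to unassigned shards."""
    return [row for row in cat_shards_rows if row.get("state") == "UNASSIGNED"]


def expected_status(cat_shards_rows):
    """Simplified rule: missing primary -> red, missing replica -> yellow."""
    missing = unassigned_shards(cat_shards_rows)
    if any(row["prirep"] == "p" for row in missing):
        return "red"
    return "yellow" if missing else "green"


def explain_body(row):
    """Build the request body for _cluster/allocation/explain for one shard."""
    return {
        "index": row["index"],
        "shard": int(row["shard"]),  # cat APIs return numbers as strings
        "primary": row["prirep"] == "p",
    }


# Hypothetical rows, shaped like a single-node cluster with one replica configured.
sample_rows = [
    {"index": "logs", "shard": "0", "prirep": "p", "state": "STARTED"},
    {"index": "logs", "shard": "0", "prirep": "r", "state": "UNASSIGNED"},
]

status = expected_status(sample_rows)
explain_requests = [explain_body(row) for row in unassigned_shards(sample_rows)]

print("expected status:", status)
print(json.dumps(explain_requests, indent=2))
```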