
Building an Elasticsearch Search Engine

Build a high-performance full-text search engine for large-scale text data using Elasticsearch, with text analysis, search result highlighting, and autocomplete functionality

Elasticsearch search, full-text search engine, Elasticsearch index design, text analysis, Elasticsearch Query DSL, search autocomplete, Elasticsearch mapping, search relevance tuning

Problem

An e-commerce platform needs unified search across product names, descriptions, and categories. With RDBMS LIKE queries, response times exceed 2 seconds on 2M+ products, and without text analysis a search for "wireless headphone" cannot match products that contain only "wireless" or only "headphone". Natural-language search, typo correction, synonym handling, result highlighting, and autocomplete are all required. Adopt Elasticsearch to deliver accurate, rich search experiences with response times under 50ms.

Required Tools

Elasticsearch

A distributed search/analytics engine based on Apache Lucene. Supports fast text search on large datasets using inverted index structures.

Kibana

A UI tool for visualizing Elasticsearch data and testing queries in Dev Tools.

Text Analyzers

Elasticsearch built-in and custom analyzers that tokenize text for indexing. Edge N-gram enables autocomplete functionality.
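To make the autocomplete mechanics concrete, here is an illustrative re-implementation of what an edge_ngram tokenizer emits for a single token (this is a toy sketch, not the actual Lucene code):

```typescript
// Illustrative: emit the prefixes an edge_ngram tokenizer would produce
// for one token, with min_gram=1 and max_gram=20 as in step 2's settings.
function edgeNgrams(token: string, minGram = 1, maxGram = 20): string[] {
  const grams: string[] = [];
  for (let len = minGram; len <= Math.min(maxGram, token.length); len++) {
    grams.push(token.slice(0, len));
  }
  return grams;
}

// "galaxy" is indexed as all of its prefixes, so a partial query like
// "gal" finds it with a plain term lookup:
console.log(edgeNgrams("galaxy"));
// ["g", "ga", "gal", "gala", "galax", "galaxy"]
```

Because every prefix is stored at index time, the search side needs no special handling; that is why the mapping in step 2 pairs the edge n-gram index analyzer with a plain standard search_analyzer.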

@elastic/elasticsearch

The official Elasticsearch client library for Node.js.

Solution Steps

1

Install Elasticsearch and Configure Text Analyzers

Run Elasticsearch with Docker and configure text analyzers for your language. Custom analyzers tokenize text into searchable terms; for example, "wireless bluetooth headphone" becomes ["wireless", "bluetooth", "headphone"]. Edge N-gram tokenizers enable prefix-based autocomplete as users type.

# Run Elasticsearch with Docker
docker run -d --name elasticsearch \
  -p 9200:9200 -p 9300:9300 \
  -e "discovery.type=single-node" \
  -e "xpack.security.enabled=false" \
  -e "ES_JAVA_OPTS=-Xms1g -Xmx1g" \
  docker.elastic.co/elasticsearch/elasticsearch:8.12.0

# Test the default analyzer
curl -X POST "localhost:9200/_analyze" -H 'Content-Type: application/json' -d'
{
  "analyzer": "standard",
  "text": "Samsung Galaxy S24 Ultra Smartphone"
}'
# Result: ["samsung", "galaxy", "s24", "ultra", "smartphone"]

# Run Kibana (for Dev Tools)
docker run -d --name kibana \
  -p 5601:5601 \
  -e "ELASTICSEARCH_HOSTS=http://elasticsearch:9200" \
  --link elasticsearch \
  docker.elastic.co/kibana/kibana:8.12.0

# For other languages, install appropriate plugins:
# Korean: bin/elasticsearch-plugin install analysis-nori
# Chinese: bin/elasticsearch-plugin install analysis-smartcn
# Japanese: bin/elasticsearch-plugin install analysis-kuromoji
2

Design Index and Define Mappings

Define field types, analyzers, and search methods through index mappings. The text type is analyzed and stored in an inverted index, while keyword type is stored as exact values. Multi-field mappings allow both full-text search and sorting/filtering on the same field.

# Create product search index (mapping + analyzer settings)
PUT /products
{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1,
    "analysis": {
      "analyzer": {
        "product_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "asciifolding", "word_delimiter_graph"]
        },
        "autocomplete_analyzer": {
          "type": "custom",
          "tokenizer": "edge_ngram_tokenizer",
          "filter": ["lowercase"]
        }
      },
      "tokenizer": {
        "edge_ngram_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20,
          "token_chars": ["letter", "digit"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "name": {
        "type": "text",
        "analyzer": "product_analyzer",
        "fields": {
          "keyword": { "type": "keyword" },
          "autocomplete": {
            "type": "text",
            "analyzer": "autocomplete_analyzer",
            "search_analyzer": "standard"
          }
        }
      },
      "description": {
        "type": "text",
        "analyzer": "product_analyzer"
      },
      "category": {
        "type": "keyword"
      },
      "brand": {
        "type": "keyword",
        "fields": {
          "text": { "type": "text", "analyzer": "product_analyzer" }
        }
      },
      "price": { "type": "integer" },
      "rating": { "type": "float" },
      "salesCount": { "type": "integer" },
      "createdAt": { "type": "date" },
      "tags": { "type": "keyword" }
    }
  }
}
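The same request can be issued from the Node.js client used in step 5. Below is a trimmed sketch of the body as a plain object: only the name multi-field and category are reproduced (the full analysis block, including autocomplete_analyzer, is in the request above), and the client call is left as a comment because it needs a running cluster.

```typescript
// Trimmed sketch of the PUT /products body as a plain object.
// autocomplete_analyzer is defined in the full settings shown above.
function buildProductIndexBody() {
  return {
    settings: {
      analysis: {
        analyzer: {
          product_analyzer: {
            type: 'custom',
            tokenizer: 'standard',
            filter: ['lowercase', 'asciifolding', 'word_delimiter_graph'],
          },
        },
      },
    },
    mappings: {
      properties: {
        name: {
          type: 'text',
          analyzer: 'product_analyzer',
          fields: {
            keyword: { type: 'keyword' }, // exact filtering and sorting
            autocomplete: {
              type: 'text',
              analyzer: 'autocomplete_analyzer',
              search_analyzer: 'standard',
            },
          },
        },
        category: { type: 'keyword' },
      },
    },
  };
}

// await client.indices.create({ index: 'products', ...buildProductIndexBody() });
```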
3

Build Complex Search Queries with Query DSL

Combine must (required, contributes to the score), should (optional, boosts the score), and filter (required, not scored, cacheable) clauses in a bool query to build complex searches. Use multi_match to search several fields at once, with per-field boosts (^) to weight them. Use function_score to fold popularity, recency, and other signals into the relevance score and optimize result ordering.

# Complex search query (keyword + filter + sort)
POST /products/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "multi_match": {
            "query": "wireless bluetooth earbuds",
            "fields": ["name^3", "description", "brand.text^2", "tags^1.5"],
            "type": "best_fields",
            "fuzziness": "AUTO",
            "minimum_should_match": "75%"
          }
        }
      ],
      "filter": [
        { "range": { "price": { "gte": 20, "lte": 200 } } },
        { "term": { "category": "Electronics" } },
        { "range": { "rating": { "gte": 4.0 } } }
      ],
      "should": [
        { "term": { "brand": { "value": "Sony", "boost": 2 } } },
        { "range": { "salesCount": { "gte": 1000, "boost": 1.5 } } }
      ]
    }
  },
  "highlight": {
    "pre_tags": ["<mark>"],
    "post_tags": ["</mark>"],
    "fields": {
      "name": { "number_of_fragments": 0 },
      "description": { "fragment_size": 150, "number_of_fragments": 3 }
    }
  },
  "sort": [
    { "_score": "desc" },
    { "salesCount": "desc" }
  ],
  "from": 0,
  "size": 20,
  "aggs": {
    "categories": {
      "terms": { "field": "category", "size": 20 }
    },
    "price_ranges": {
      "range": {
        "field": "price",
        "ranges": [
          { "to": 50 },
          { "from": 50, "to": 100 },
          { "from": 100, "to": 300 },
          { "from": 300 }
        ]
      }
    },
    "avg_rating": {
      "avg": { "field": "rating" }
    }
  }
}
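The function_score query mentioned above does not appear in the example, so here is a sketch as a plain request-body object. Field names come from the step 2 mapping; the weights and decay parameters are illustrative, not tuned values.

```typescript
// Sketch: blend text relevance with popularity (salesCount) and
// recency (createdAt) using function_score.
const popularityAwareQuery = {
  query: {
    function_score: {
      query: {
        multi_match: {
          query: 'wireless earbuds',
          fields: ['name^3', 'description'],
        },
      },
      functions: [
        // log1p damps the effect of very large sales counts
        { field_value_factor: { field: 'salesCount', modifier: 'log1p', factor: 1.2 } },
        // gaussian decay: documents older than ~30 days score lower
        { gauss: { createdAt: { origin: 'now', scale: '30d', decay: 0.5 } } },
      ],
      score_mode: 'sum',      // how the function scores combine with each other
      boost_mode: 'multiply', // how the combined result joins the query score
    },
  },
};
```

With boost_mode multiply, a product that matches the text well but sells poorly still ranks, while two equally good text matches are separated by popularity and freshness.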
4

Implement Search Autocomplete

Use edge N-gram tokenizers to provide real-time autocomplete while the user is still typing. As a user types "wire", "wirel", "wirele", "wireless", suggestions appear at each keystroke. For even lower latency, the Completion Suggester (backed by an in-memory FST) can be used instead.

# Autocomplete query (Edge N-gram based)
POST /products/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "name.autocomplete": {
              "query": "galaxy",
              "operator": "and"
            }
          }
        }
      ],
      "should": [
        {
          "match": {
            "name": {
              "query": "galaxy",
              "boost": 5
            }
          }
        }
      ]
    }
  },
  "size": 10,
  "_source": ["name", "category", "brand", "price"],
  "highlight": {
    "pre_tags": ["<b>"],
    "post_tags": ["</b>"],
    "fields": { "name": {} }
  }
}

# Completion Suggester (faster autocomplete)
# Add suggest field to mapping:
# "suggest": { "type": "completion", "analyzer": "standard" }

POST /products/_search
{
  "suggest": {
    "product_suggest": {
      "prefix": "galax",
      "completion": {
        "field": "suggest",
        "size": 5,
        "fuzzy": {
          "fuzziness": 1
        }
      }
    }
  }
}

# Popular search term autocomplete (search_as_you_type)
PUT /search_terms
{
  "mappings": {
    "properties": {
      "query": { "type": "search_as_you_type" },
      "count": { "type": "integer" }
    }
  }
}

POST /search_terms/_search
{
  "query": {
    "multi_match": {
      "query": "blueto",
      "type": "bool_prefix",
      "fields": ["query", "query._2gram", "query._3gram"]
    }
  },
  "sort": [{ "count": "desc" }],
  "size": 5
}
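On the application side, the suggestions still have to be pulled out of the suggest response. A minimal helper sketch follows; the interface models only the response fields the helper reads, and product_suggest matches the suggester name used above.

```typescript
// Minimal sketch: extract suggestion texts from a Completion Suggester
// response. Only the fields this helper reads are typed.
interface SuggestResponse {
  suggest?: {
    product_suggest?: Array<{ options: Array<{ text: string; _score: number }> }>;
  };
}

function extractSuggestions(res: SuggestResponse): string[] {
  const entries = res.suggest?.product_suggest ?? [];
  return entries.flatMap(entry => entry.options.map(option => option.text));
}
```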
5

Implement Search API with Node.js Client

Execute Elasticsearch queries from a Node.js backend using the @elastic/elasticsearch client. Build the query dynamically from the search request parameters and format hits for the API response. Use the bulk API to index large datasets efficiently.

// search-service.ts - Node.js search service
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

interface SearchParams {
  query: string;
  category?: string;
  minPrice?: number;
  maxPrice?: number;
  minRating?: number;
  page?: number;
  size?: number;
  sort?: 'relevance' | 'price_asc' | 'price_desc' | 'newest' | 'popular';
}

async function searchProducts(params: SearchParams) {
  const { query, category, minPrice, maxPrice, minRating, page = 1, size = 20, sort = 'relevance' } = params;
  const from = (page - 1) * size;

  const filters: any[] = [];
  if (category) filters.push({ term: { category } });
  if (minPrice !== undefined || maxPrice !== undefined) {
    filters.push({ range: { price: {
      ...(minPrice !== undefined && { gte: minPrice }),
      ...(maxPrice !== undefined && { lte: maxPrice }),
    }}});
  }
  if (minRating) filters.push({ range: { rating: { gte: minRating } } });

  const sortOptions: Record<string, any[]> = {
    relevance: [{ _score: 'desc' }, { salesCount: 'desc' }],
    price_asc: [{ price: 'asc' }],
    price_desc: [{ price: 'desc' }],
    newest: [{ createdAt: 'desc' }],
    popular: [{ salesCount: 'desc' }],
  };

  const result = await client.search({
    index: 'products',
    body: {
      query: {
        bool: {
          must: [{
            multi_match: {
              query,
              fields: ['name^3', 'description', 'brand.text^2', 'tags^1.5'],
              type: 'best_fields',
              fuzziness: 'AUTO',
            },
          }],
          filter: filters,
        },
      },
      highlight: {
        pre_tags: ['<mark>'],
        post_tags: ['</mark>'],
        fields: {
          name: { number_of_fragments: 0 },
          description: { fragment_size: 150, number_of_fragments: 2 },
        },
      },
      sort: sortOptions[sort],
      from,
      size,
      aggs: {
        categories: { terms: { field: 'category', size: 20 } },
        price_stats: { stats: { field: 'price' } },
      },
    },
  });

  return {
    total: (result.hits.total as any).value,
    products: result.hits.hits.map((hit: any) => ({
      ...hit._source,
      _score: hit._score,
      highlight: hit.highlight,
    })),
    aggregations: result.aggregations,
  };
}

// Bulk indexing (large-scale data indexing)
async function bulkIndexProducts(products: any[]) {
  const body = products.flatMap(product => [
    { index: { _index: 'products', _id: product.id } },
    product,
  ]);
  const result = await client.bulk({ body, refresh: true });
  console.log(`Indexed ${products.length} products, errors: ${result.errors}`);
}
6

Synonym Dictionary and Typo Correction Setup

A synonym dictionary lets "laptop" and "notebook computer" return the same results. The fuzziness option tolerates typos automatically, and a phrase suggester provides did-you-mean suggestions. Changing analyzer settings requires closing and reopening the index, and the new synonym analyzer must also be attached to a field (for example as its search_analyzer) before it has any effect, so take care when rolling this out in production.

# Analyzer settings with synonym dictionary
PUT /products/_close

PUT /products/_settings
{
  "analysis": {
    "filter": {
      "synonym_filter": {
        "type": "synonym",
        "synonyms": [
          "laptop, notebook, notebook computer",
          "phone, smartphone, mobile phone, cellphone",
          "earphone, earbuds, headphones",
          "TV, television, telly",
          "fridge, refrigerator, icebox"
        ]
      }
    },
    "analyzer": {
      "synonym_analyzer": {
        "type": "custom",
        "tokenizer": "standard",
        "filter": ["lowercase", "synonym_filter"]
      }
    }
  }
}

PUT /products/_open

# Typo correction (did-you-mean) - Phrase Suggester
POST /products/_search
{
  "suggest": {
    "text": "samsnug galaxy smarthpone",
    "spell_check": {
      "phrase": {
        "field": "name",
        "size": 1,
        "gram_size": 3,
        "direct_generator": [{
          "field": "name",
          "suggest_mode": "always"
        }],
        "highlight": {
          "pre_tag": "<em>",
          "post_tag": "</em>"
        }
      }
    }
  }
}
# Result: "samsung galaxy smartphone" (typo-corrected suggestion)
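The expansion the synonym filter performs at analysis time can be illustrated in a few lines. This is a toy parser for the Solr-style synonym lines above, not the Lucene filter itself:

```typescript
// Illustrative: parse comma-separated synonym lines into groups, so any
// term in a group expands to the whole group (as the `synonym` filter
// does at analysis time). Terms are lowercased, mirroring the
// `lowercase` filter that runs before `synonym_filter` above.
function parseSynonyms(lines: string[]): Map<string, string[]> {
  const map = new Map<string, string[]>();
  for (const line of lines) {
    const group = line.split(',').map(term => term.trim().toLowerCase());
    for (const term of group) map.set(term, group);
  }
  return map;
}

const synonyms = parseSynonyms(['laptop, notebook, notebook computer']);
console.log(synonyms.get('laptop'));
// ['laptop', 'notebook', 'notebook computer']
```

Because every member of a group maps to the full group, a query for any one term matches documents that were indexed with any of the others.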

Core Code

Core structure of Elasticsearch search: bool query (must + filter) + highlighting + aggregations + sort/pagination. Use multi_match to search multiple fields and fuzziness for typo correction.

# Elasticsearch Search Engine Core Query Structure
POST /products/_search
{
  "query": {
    "bool": {
      "must": [{
        "multi_match": {
          "query": "wireless earbuds",
          "fields": ["name^3", "description", "brand.text^2"],
          "fuzziness": "AUTO"
        }
      }],
      "filter": [
        { "term": { "category": "Electronics" } },
        { "range": { "price": { "gte": 20, "lte": 200 } } }
      ]
    }
  },
  "highlight": {
    "fields": { "name": {}, "description": { "fragment_size": 150 } }
  },
  "aggs": {
    "categories": { "terms": { "field": "category" } },
    "price_stats": { "stats": { "field": "price" } }
  },
  "sort": [{ "_score": "desc" }],
  "from": 0, "size": 20
}

Common Mistakes

Using a keyword query (term) on a text type field, resulting in no search results

Text fields are tokenized by analyzers before storage, so term queries will not match exactly. Use match/multi_match for full-text search, and use term queries on the keyword subfield (name.keyword) for exact value filtering.
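The difference can be shown as request bodies (plain objects; field names from the step 2 mapping):

```typescript
// Wrong: term on an analyzed text field. The index stores tokens like
// ["galaxy", "s24"], never the full original string, so this matches nothing.
const broken = {
  query: { term: { name: 'Galaxy S24' } },
};

// Right: match for full-text search (the query string is analyzed the
// same way as the field before matching).
const fullText = {
  query: { match: { name: 'Galaxy S24' } },
};

// Right: term on the keyword subfield for exact-value filtering.
const exactFilter = {
  query: { term: { 'name.keyword': 'Galaxy S24' } },
};
```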

Relying on dynamic mapping without explicit mappings, resulting in poor search quality and bloated index size

Always define explicit mappings. Dynamic mapping stores every string as both text and a keyword subfield, roughly doubling disk usage. Assign analyzers only to fields that are actually searched, and disable indexing ("index": false) on fields that are stored but never queried.

Indexing large datasets one document at a time, resulting in extremely slow indexing speed

Calling the index API one document at a time incurs network overhead and is very slow. Use the bulk API to batch 500-5000 documents at a time, set refresh_interval to -1 during indexing, and re-enable it afterward to improve initial indexing speed by 10x or more.
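The batching advice above can be sketched as follows. chunk() is a hypothetical helper, not part of the client, and the client calls are left as comments since they require a live cluster (parameter shapes as in client v8).

```typescript
// Split documents into batches for the bulk API.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// Sketch of a bulk load with refresh disabled during indexing:
// await client.indices.putSettings({ index: 'products', settings: { refresh_interval: '-1' } });
// for (const batch of chunk(products, 1000)) {
//   await client.bulk({
//     operations: batch.flatMap(p => [{ index: { _index: 'products', _id: p.id } }, p]),
//   });
// }
// await client.indices.putSettings({ index: 'products', settings: { refresh_interval: '1s' } });
```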
