全文検索の基礎——インデックス、検索アルゴリズム、Elasticsearch の仕組み

現代のアプリケーションにおいて、「検索」は不可欠な機能だ。EC サイトでの商品検索、ドキュメント管理システムでの記事検索、ログ分析でのエラー検索——膨大なデータから必要な情報を瞬時に引き出す技術が全文検索（Full-Text Search）である。

本記事では、全文検索の基本概念から、インデックス作成の仕組み、検索アルゴリズム、そして Elasticsearch を使った実践的な実装までを体系的に解説する。

全文検索の基本概念——なぜ「検索」は難しいのか

LIKE 検索の限界

データベースの LIKE 句を使った検索は、小規模なデータでは機能する。

-- 単純な LIKE 検索
SELECT * FROM articles WHERE content LIKE '%検索%';

しかし、この方法には深刻な問題がある:

def benchmark_like_search(num_records, search_term):
    """LIKE 検索のパフォーマンス試算"""
    # LIKE '%term%' は全レコードをスキャン（インデックスが効かない）
    # 計算量: O(N * M) - N: レコード数、M: 1 レコードの平均サイズ

    scan_time_per_record_ms = 0.001  # 1 レコードあたり 0.001ms
    total_time_ms = num_records * scan_time_per_record_ms

    return total_time_ms

# 例：100 万レコードで検索
print(f"100 万レコード：{benchmark_like_search(1_000_000, '検索'):.0f}ms")  # 約 1,000ms
print(f"1,000 万レコード：{benchmark_like_search(10_000_000, '検索'):.0f}ms")  # 約 10,000ms = 10 秒

LIKE 検索の問題点:

インデックスが効かない: %term% の前方一致でない検索はインデックス unusable
全レコードスキャン: データが 2 倍になると検索時間も 2 倍
関連度順ソート不可: 「どれだけ関連するか」のスコアリングができない
曖昧検索不可: タイポや複数名詞への対応が困難

全文検索エンジンのアプローチ

全文検索エンジンは、事前準備（インデックス作成） によって検索を高速化する。

【LIKE 検索 vs 全文検索】

LIKE 検索:
検索実行時 → 全レコードスキャン → 一致レコード返却
         （遅い：データ量に比例）

全文検索:
1. 事前準備：文書からインデックス作成（オフライン）
2. 検索実行：インデックス参照 → 一致レコード返却
         （速い：データ量に依存しない）

全文検索の核心: 「検索クエリが来てから探す」のではなく、「どこに何があるか事前に整理しておく」

インデックスの仕組み——逆引き索引（Inverted Index）

逆引き索引の基本構造

全文検索の心臓部は**逆引き索引（Inverted Index）**である。これは、教科書の巻末にある「索引」と同じ考え方だ。

【逆引き索引の例】

原文書:
Doc 1: "猫はかわいい動物です"
Doc 2: "犬は忠実な動物です"
Doc 3: "猫と犬は仲良し"

インデックス作成後:
単語     →  含まれる文書 ID
----------------------------
猫       →  [1, 3]
犬       →  [2, 3]
動物     →  [1, 2]
かわいい →  [1]
忠実     →  [2]
仲良し   →  [3]

検索クエリ「猫」:

インデックスから [1, 3] を即座に取得
全 3 文書中 2 文書が一致（O(1) の検索）

逆引き索引の実装

import re
from collections import defaultdict
from typing import Dict, List

class InvertedIndex:
    """簡易的な逆引き索引の実装"""

    def __init__(self):
        # 単語 → 文書 ID のリスト
        self.index: Dict[str, List[int]] = defaultdict(list)
        # 文書 ID → 原文書
        self.documents: Dict[int, str] = {}

    def tokenize(self, text: str) -> List[str]:
        """テキストをトークン（単語）に分割"""
        # 簡易実装：アルファベットは小文字に、日本語は 1 文字単位
        # 実際には形態素解析器（Kuromoji など）を使用
        text = text.lower()
        # アルファベット単語を抽出
        words = re.findall(r'[a-z]+', text)
        # 日本語文字も抽出（簡易実装）
        japanese_chars = re.findall(r'[\u4e00-\u9fff\u3040-\u309f\u30a0-\u30ff]', text)
        return words + japanese_chars

    def add_document(self, doc_id: int, text: str):
        """文書をインデックスに追加"""
        self.documents[doc_id] = text
        tokens = self.tokenize(text)

        for token in tokens:
            # 単語が含まれる文書 ID リストに追加
            if doc_id not in self.index[token]:
                self.index[token].append(doc_id)

    def search(self, query: str) -> List[int]:
        """検索クエリに一致する文書 ID を返す"""
        tokens = self.tokenize(query)
        if not tokens:
            return []

        # 最初のトークンの結果を取得
        result = set(self.index.get(tokens[0], []))

        # 以降のトークンで共通部分を取得（AND 検索）
        for token in tokens[1:]:
            result &= set(self.index.get(token, []))

        return sorted(result)

# 使用例
index = InvertedIndex()
index.add_document(1, "猫はかわいい Animal です")
index.add_document(2, "犬は忠実な Animal です")
index.add_document(3, "猫と犬は Nakayoshi")

# 検索「猫」
results = index.search("cat 猫")
print(f"「猫」を含む文書: {results}")  # [1, 3]

# 検索「動物」
results = index.search("animal 動物")
print(f"「動物」を含む文書：{results}")  # [1, 2]

用語の位置情報（Position Information）

実際の検索エンジンでは、単語の位置も保存する。

【位置情報付きインデックス】

単語     →  (文書 ID, 位置 [リスト])
-----------------------------------------
猫       →  (1, [0]), (3, [0])
犬       →  (2, [0]), (3, [2])
動物     →  (1, [4]), (2, [4])

位置情報の用途:

フレーズ検索: 「猫かわいい」が隣接しているか確認
** proximity 検索**: 2 つの単語が何単語以内にあるか
ハイライト表示: どの位置でマッチしたか

検索アルゴリズム——関連度スコアリング

ブール検索（Boolean Search）

最も単純な検索は「一致/不一致」のブール判定だ。

def boolean_search(index, query):
    """ブール検索 - 一致する文書を返すだけ"""
    return index.search(query)

# 結果：[Doc1, Doc3] - 順序なし

欠点: 100 単語含む文書も 1 単語含む文書も同じ扱い

TF-IDF（Term Frequency-Inverse Document Frequency）

TF-IDFは、単語の重要度を計算する古典的な手法だ。

TF-IDF の式:

TF（Term Frequency）: 文書内での単語の出現頻度
  TF(t, d) = (文書 d での単語 t の出現回数) / (文書 d の全単語数)

IDF（Inverse Document Frequency）: 単語の希少性
  IDF(t) = log( (総文書数) / (単語 t を含む文書数) )

TF-IDF(t, d) = TF(t, d) × IDF(t)

直感的解釈:

TF 高い: その文書で何度も出てくる = 重要そう
IDF 高い: 他の文書ではあまり出てこない = 特徴的

import math
from collections import Counter

def calculate_tf(term, document_tokens):
    """Term Frequency の計算"""
    total_terms = len(document_tokens)
    term_count = document_tokens.count(term)
    return term_count / total_terms if total_terms > 0 else 0

def calculate_idf(term, all_documents_tokens):
    """Inverse Document Frequency の計算"""
    num_docs_with_term = sum(1 for doc in all_documents_tokens if term in doc)
    total_docs = len(all_documents_tokens)
    return math.log(total_docs / num_docs_with_term) if num_docs_with_term > 0 else 0

def calculate_tfidf(term, document_tokens, all_documents_tokens):
    """TF-IDF の計算"""
    tf = calculate_tf(term, document_tokens)
    idf = calculate_idf(term, all_documents_tokens)
    return tf * idf

# 例：3 文書で TF-IDF 計算
docs = [
    ["猫", "は", "かわいい", "動物"],
    ["犬", "は", "忠実", "な", "動物"],
    ["猫", "と", "犬", "は", "仲良し"]
]

# 「猫」の TF-IDF を各文書で計算
for i, doc in enumerate(docs):
    tfidf = calculate_tfidf("猫", doc, docs)
    print(f"Doc{i+1} の「猫」TF-IDF: {tfidf:.4f}")

出力:

Doc1 の「猫」TF-IDF: 0.2877  （出現：1 回/4 単語、文書数：2/3）
Doc2 の「猫」TF-IDF: 0.0000  （出現：0 回）
Doc3 の「猫」TF-IDF: 0.2310  （出現：1 回/5 単語、文書数：2/3）

BM25（Best Matching 25）

BM25は、TF-IDF を改良した現代の検索エンジン標準アルゴリズムだ。

def bm25_score(term, document_tokens, all_documents_tokens, k1=1.5, b=0.75):
    """BM25 スコアの計算（簡易版）"""
    # 文書長の平均を計算
    avg_doc_length = sum(len(doc) for doc in all_documents_tokens) / len(all_documents_tokens)
    doc_length = len(document_tokens)

    # TF 成分（飽和あり）
    term_count = document_tokens.count(term)
    tf_numerator = term_count * (k1 + 1)
    tf_denominator = term_count + k1 * (1 - b + b * doc_length / avg_doc_length)
    tf = tf_numerator / tf_denominator if tf_denominator > 0 else 0

    # IDF 成分
    num_docs_with_term = sum(1 for doc in all_documents_tokens if term in doc)
    total_docs = len(all_documents_tokens)
    idf = math.log((total_docs - num_docs_with_term + 0.5) / (num_docs_with_term + 0.5) + 1)

    return tf * idf

# BM25 の特徴:
# 1. TF の飽和：何回も出てきてもスコア上昇は頭打ち
# 2. 文書長補正：長い文書ほど不利にならないよう調整

BM25 の改善点:

課題	TF-IDF	BM25
TF の過大評価	線形に増加	飽和する（k1 パラメータ）
文書長の影響	長い文書が不利	補正（b パラメータ）
IDF の負の値	稀な単語で負になりうる	常に正の値

形態素解析——日本語検索の鍵

なぜ日本語は難しいか

英語はスペースで単語が区切られているが、日本語は異なる。

英語： "The cat sleeps" → ["The", "cat", "sleeps"] （自動分割）
日本語："猫が寝ている" → ??? （スペースがない）

形態素解析は、日本語テキストを単語単位に分割する技術だ。

【形態素解析の例】

入力：「猫が寝ている」

解析結果：
猫 / 名詞
が / 助詞
寝 / 動詞
て / 助動詞
いる / 補助動詞

トークナイズ結果：["猫", "寝", "いる"]

Kuromoji を使った形態素解析

# kuromoji.py を使用（日本語形態素解析器）
import kuromoji

def tokenize_japanese(text):
    """日本語テキストをトークン化"""
    tokenizer = kuromoji.Tokenizer()
    tokens = tokenizer.tokenize(text)
    return [token.surface for token in tokens]

# 例
text = "猫が寝ている間に犬が走った"
tokens = tokenize_japanese(text)
print(f"トークン：{tokens}")
# 出力：['猫', 'が', '寝', 'て', 'いる', '間', 'に', '犬', 'が', '走っ', 'た']

# 名詞のみ抽出
pos_tags = [token.pos for token in tokenizer.tokenize(text)]
nouns = [token.surface for token in tokenizer.tokenize(text) if token.pos == '名詞']
print(f"名詞：{nouns}")  # ['猫', '間', '犬']

実際の検索エンジンでの活用:

索引作成時: 文書を形態素解析して単語単位でインデックス
検索時: クエリも形態素解析して一致する単語を検索
類義語展開: 「猫」→「キャット」「neko」も検索に含める

Elasticsearch——実用的な全文検索エンジン

Elasticsearch の基本アーキテクチャ

【Elasticsearch アーキテクチャ】

┌─────────────────────────────────────────────────────┐
│                    Client                           │
│              (REST API で接続)                      │
└─────────────────────┬───────────────────────────────┘
                      │
                      ▼
┌─────────────────────────────────────────────────────┐
│                 Elasticsearch Cluster               │
│  ┌───────────┐  ┌───────────┐  ┌───────────┐       │
│  │  Node 1   │  │  Node 2   │  │  Node 3   │       │
│  │ (Master)  │  │ (Data)    │  │ (Data)    │       │
│  │           │  │           │  │           │       │
│  │ ┌───────┐ │  │ ┌───────┐ │  │ ┌───────┐ │       │
│  │ │Shard 1│ │  │ │Shard 2│ │  │ │Shard 3│ │       │
│  │ │(Prim) │ │  │ │(Replica)│ │ │(Replica)│ │       │
│  │ └───────┘ │  │ └───────┘ │  │ └───────┘ │       │
│  └───────────┘  └───────────┘  └───────────┘       │
└─────────────────────────────────────────────────────┘

主要概念:

用語	説明
Cluster	1 つのシステムとして動作するノードの集合
Node	単一の Elasticsearch サーバーインスタンス
Index	文書のコレクション（RDB の「テーブル」に相当）
Document	検索対象の JSON データ（RDB の「行」に相当）
Shard	インデックスを分割した単位（分散配置）
Replica	シャードのレプリカ（可用性向上）

インデックス作成（Indexing）

from elasticsearch import Elasticsearch

# Elasticsearch クライアントの初期化
es = Elasticsearch(['http://localhost:9200'])

# インデックスの作成設定
index_settings = {
    "settings": {
        "number_of_shards": 3,
        "number_of_replicas": 1,
        "analysis": {
            "analyzer": {
                "japanese_analyzer": {
                    "tokenizer": "kuromoji_tokenizer"
                }
            }
        }
    },
    "mappings": {
        "properties": {
            "title": {"type": "text", "analyzer": "japanese_analyzer"},
            "content": {"type": "text", "analyzer": "japanese_analyzer"},
            "category": {"type": "keyword"},  # 完全一致用
            "published_at": {"type": "date"},
            "author": {"type": "keyword"}
        }
    }
}

# インデックス作成
es.indices.create(index='articles', body=index_settings)

# 文書の登録（インデキシング）
document = {
    "title": "猫の飼い方ガイド",
    "content": "猫はかわいい動物です。初心者でも飼いやすいペットとして人気があります。",
    "category": "ペット",
    "published_at": "2025-08-01",
    "author": "山田太郎"
}

es.index(index='articles', id=1, body=document)

検索クエリ（Search Queries）

# 1. 単純な全文検索（match query）
search_query = {
    "query": {
        "match": {
            "content": "猫 かわいい"
        }
    }
}

response = es.search(index='articles', body=search_query)
for hit in response['hits']['hits']:
    print(f"Score: {hit['_score']}, Title: {hit['_source']['title']}")

# 2. フレーズ検索（match_phrase query）
phrase_query = {
    "query": {
        "match_phrase": {
            "content": "猫を飼う"
        }
    }
}

# 3. 複合検索（bool query）
bool_query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"content": "猫"}}
            ],
            "filter": [
                {"term": {"category": "ペット"}},
                {"range": {"published_at": {"gte": "2025-01-01"}}}
            ]
        }
    }
}

# 4. 複数フィールド検索（multi_match query）
multi_query = {
    "query": {
        "multi_match": {
            "query": "猫 飼い方",
            "fields": ["title^3", "content"],  # title は 3 倍の重み
            "type": "best_fields"
        }
    }
}

# 5. 曖昧検索（fuzzy query）
fuzzy_query = {
    "query": {
        "fuzzy": {
            "title": {
                "value": "猫",
                "fuzziness": "AUTO"  # 自動で編集距離を決定
            }
        }
    }
}

集約（Aggregations）

# 集約：カテゴリ別の文書数
aggregation_query = {
    "size": 0,  # 検索結果は不要、集約のみ
    "aggs": {
        "categories": {
            "terms": {
                "field": "category"
            }
        },
        "avg_views": {
            "avg": {
                "field": "view_count"
            }
        },
        "date_histogram": {
            "date_histogram": {
                "field": "published_at",
                "calendar_interval": "month"
            }
        }
    }
}

response = es.search(index='articles', body=aggregation_query)

# カテゴリ別集計結果
for bucket in response['aggregations']['categories']['buckets']:
    print(f"{bucket['key']}: {bucket['doc_count']}件")

高度な検索機能

オートコンプリート（Suggester）

# 補完候補の取得
suggest_query = {
    "suggest": {
        "article-suggest": {
            "prefix": "猫の",
            "completion": {
                "field": "title_suggest",
                "size": 5
            }
        }
    }
}

# インデックス側で completion field を定義
mapping_with_suggest = {
    "mappings": {
        "properties": {
            "title_suggest": {
                "type": "completion"
            }
        }
    }
}

ハイライト表示（Highlighting）

highlight_query = {
    "query": {
        "match": {"content": "猫 飼い方"}
    },
    "highlight": {
        "fields": {
            "content": {
                "pre_tags": ["<em class='highlight'>"],
                "post_tags": ["</em>"],
                "fragment_size": 150,
                "number_of_fragments": 3
            }
        }
    }
}

# 結果：
# "...<em class='highlight'>猫</em>は<em class='highlight'>飼い</em>やすい..."

類義語展開（Synonyms）

# 類義語辞書の設定
index_settings_with_synonyms = {
    "settings": {
        "analysis": {
            "analyzer": {
                "japanese_with_synonyms": {
                    "tokenizer": "kuromoji_tokenizer",
                    "filter": ["kuromoji_baseform", "synonym_filter"]
                }
            },
            "filter": {
                "synonym_filter": {
                    "type": "synonym",
                    "synonyms": [
                        "猫,ネコ,ねこ,cap",
                        "犬,イヌ,いぬ,dog",
                        "かわいい,可愛い,愛らしい"
                    ]
                }
            }
        }
    }
}

# 「猫」で検索 → 「ネコ」「ねこ」も自動で検索に含まれる

パフォーマンスチューニング

インデックス設計のベストプラクティス

【インデックス設計の指針】

1. シャード数の決定
   - 1 シャードあたり 30-50GB を目安
   - 多すぎ：管理オーバーヘッド増
   - 少なすぎ：分散効果なし

2. レプリカ戦略
   - 本番：最低 1 レプリカ（可用性）
   - 開発：レプリカなし（リソース節約）

3. マッピング設計
   - 検索対象：text 型（分析あり）
   - 完全一致：keyword 型（分析なし）
   - 数値：integer/long/float
   - 日付：date 型

4. バルクインデキシング
   - 単一登録：遅い
   - バルク登録：1,000 件/バッチが目安

検索パフォーマンスの最適化

# 1. source フィルター（不要フィールドの除外）
source_filter_query = {
    "_source": ["title", "published_at"],  # content は除外
    "query": {"match": {"content": "猫"}}
}

# 2. 検索結果の制限
size_limit_query = {
    "size": 10,  # デフォルト 10、最大 10,000
    "from": 0,   # オフセット
    "query": {"match": {"content": "猫"}}
}

# 3. 深層ページネーションの回避
# NG: from: 10000, size: 10（遅い）
# OK: search_after（カーソルベース）

search_after_query = {
    "size": 10,
    "query": {"match": {"content": "猫"}},
    "sort": [{"_score": "desc"}, {"_id": "asc"}],
    "search_after": [0.5, "doc_123"]  # 前ページの最後
}

# 4. キャッシュの活用
# - Filter コンテキストは自動キャッシュ
# - Query コンテキストはキャッシュされない
cached_query = {
    "query": {
        "bool": {
            "must": {"match": {"content": "猫"}},  # スコア計算
            "filter": {"term": {"category": "ペット"}}  # キャッシュ対象
        }
    }
}

まとめ

全文検索の核心:

基本概念: 事前インデックス作成で高速検索を実現
インデックス: 逆引き索引（Inverted Index）が中核
検索アルゴリズム: TF-IDF → BM25 へ進化
日本語処理: 形態素解析が必須
Elasticsearch: 実用的な分散検索エンジン
パフォーマンス: インデックス設計とクエリ最適化

全文検索は、現代のアプリケーションにおいてユーザー体験を左右する重要機能だ。適切な検索エンジンの選択と設計が、大量データからの「必要な情報」へのアクセスを可能にする。

重要なのは、**「LIKE 検索で始めて、限界が来たら全文検索エンジン」**という段階的なアプローチだ。小规模データでは LIKE で十分だが、データが増えたり複雑な検索が必要になったりしたら、Elasticsearch や OpenSearch などの専門エンジンへの移行を検討しよう。

参考資料

“Introduction to Information Retrieval” Christopher D. Manning 他（無料公開）: https://nlp.stanford.edu/IR-book/
Elasticsearch 公式ドキュメント: https://www.elastic.co/guide/en/elasticsearch/reference/current/index.html
Apache Lucene ドキュメント: https://lucene.apache.org/core/documentation.html
Kuromoji プロジェクト: https://www.atilika.org/
“Elasticsearch: The Definitive Guide” Clinton Gormley, Zachary Tong 著
BM25 オリジナル論文： “Okapi at TREC-3” Stephen E. Robertson 他

#検索 #インデックス #Elasticsearch #データベース #情報検索

免責事項 — 掲載情報は執筆時点のものです。料金・機能は変更される場合があります。最新情報は各公式サイトをご確認ください。