Parent-child chunking — v0.11 design
The chunk-size sweep in bench-experiments-2026-04-18 showed a trade-off that's fundamental, not solvable with tuning:
- Small chunks (500-1000 chars) rank precisely — BM25 finds the exact sentence that matches — but lose context. An LLM reading just "the retry interval is 30 s" doesn't know which service, which endpoint, which environment.
- Large chunks (2000-4000 chars) give the LLM context — one coherent section at a time — but dilute term density. Relevant passages compete with neighboring sentences for ranking weight.
Parent-child chunking resolves this by decoupling the retrieval unit from the context unit. You retrieve at one granularity and return at another.
Concept
ingest:
  raw doc → split into child chunks (500 chars) → BM25 + embeddings
                ↓ each child carries a pointer
            parent chunk (2000 chars, a superset of the child)

search_docs:
  query → rank child chunks → top-k → return parent content
                                      (not the child's 500 chars)
The LLM sees 2000-char parents — enough context to understand the retrieved material. BM25 ranks at the 500-char granularity — each chunk's relevance signal is concentrated, not diluted.
Expected lift on our real-corpus golden set (extrapolating from the chunk-size sweep): +2 to +4 nDCG@10 over the 2000-char default. Parent-child typically captures what the separate small-chunk and large-chunk runs each win at individually.
Schema
Current documentation_chunks table (simplified):
chunk_id | file_path | chunk_index | content | embedding | …
New columns (nullable — backward compatible):
parent_chunk_index INTEGER -- points to the coarse-grained chunk
is_child BOOLEAN -- "is this a retrieval-only row?"
On ingest:
- Split the document at 2000 chars → call each result a parent, give it
  parent_chunk_index = chunk_index, is_child = false.
- Within each parent, split at 500 chars → children, each carrying the
  parent's index and is_child = true.
- Write both parents and children. Children get embeddings; parents
  don't (saves ~75 % of embedding cost over a naive "embed everything"
  approach).
doc.md
├── chunk 0 (is_child=false, 2000 chars) ── parent, returned to LLM
│ ├── chunk 1 (is_child=true, ~500 chars, parent=0) ── retrieval-only
│ ├── chunk 2 (is_child=true, ~500 chars, parent=0)
│ ├── chunk 3 (is_child=true, ~500 chars, parent=0)
│ └── chunk 4 (is_child=true, ~500 chars, parent=0)
├── chunk 5 (is_child=false, 2000 chars) ── next parent
│ ├── chunk 6 (is_child=true, ~500 chars, parent=5)
│ └── …
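The ingest steps above can be sketched roughly as follows. This is a hedged sketch, not the real chunker: `split_parent_child` and the row dicts are hypothetical names, and fixed-size character splitting is a simplification (the actual chunker splits parents at H2/H3 boundaries, per the open questions below).

```python
# Rough sketch of the parent-child split on ingest. Fixed-size
# splitting is a simplification of the real heading-aware chunker.

PARENT_SIZE = 2000  # GNOSIS_MCP_CHUNK_SIZE
CHILD_SIZE = 500    # GNOSIS_MCP_CHILD_CHUNK_SIZE

def split_parent_child(text: str) -> list[dict]:
    """Return rows for both parents and their children."""
    rows = []
    index = 0  # one chunk_index sequence shared by parents and children
    for start in range(0, len(text), PARENT_SIZE):
        parent_text = text[start:start + PARENT_SIZE]
        parent_index = index
        rows.append({
            "chunk_index": parent_index,
            "parent_chunk_index": parent_index,  # parents point at themselves
            "is_child": False,
            "content": parent_text,
            # parents are returned to the LLM but never embedded
        })
        index += 1
        for cstart in range(0, len(parent_text), CHILD_SIZE):
            rows.append({
                "chunk_index": index,
                "parent_chunk_index": parent_index,  # pointer back to parent
                "is_child": True,
                "content": parent_text[cstart:cstart + CHILD_SIZE],
                # children get embeddings + BM25 rows; retrieval-only
            })
            index += 1
    return rows
```

On a 4000-char document this yields the numbering shown in the tree above: chunk 0 is a parent, chunks 1-4 are its children, chunk 5 is the next parent.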
Search path
search_docs picks up a new optional knob:
search_docs(query, limit=5, expand_to_parents=True)
Ranking runs on is_child = true rows (or all rows if feature is off).
After ranking, dedupe by parent_chunk_index: if two children of the
same parent score in the top-k, keep the higher-ranked one. Return
limit parents (not children).
Pseudo-SQL:
WITH child_hits AS (
SELECT chunk_id, parent_chunk_index, file_path,
       rank() AS rank  -- placeholder for the backend's relevance score
FROM documentation_chunks
WHERE is_child = true
AND matches(search_vector, :query)
ORDER BY rank ASC
LIMIT :limit * 3 -- overscan to allow dedup
),
dedup AS (
SELECT DISTINCT ON (file_path, parent_chunk_index)
chunk_id, parent_chunk_index, file_path, rank
FROM child_hits
ORDER BY file_path, parent_chunk_index, rank
)
SELECT p.file_path, p.title, p.content, d.rank
FROM dedup d
JOIN documentation_chunks p
ON p.file_path = d.file_path
AND p.chunk_index = d.parent_chunk_index
AND p.is_child = false
ORDER BY d.rank
LIMIT :limit;
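DISTINCT ON is Postgres-specific, so on other backends the dedupe step may need an application-side equivalent. A minimal sketch (the hit dicts and `dedupe_by_parent` name are assumptions, not the real API):

```python
def dedupe_by_parent(child_hits, limit):
    """Keep the best-ranked child per (file_path, parent_chunk_index).

    child_hits: dicts sorted by ascending rank (best first), as produced
    by the overscanned child query; lower rank = better, as in the SQL.
    """
    best = {}
    for hit in child_hits:
        key = (hit["file_path"], hit["parent_chunk_index"])
        if key not in best:  # first occurrence is the best-ranked child
            best[key] = hit
    winners = sorted(best.values(), key=lambda h: h["rank"])
    return winners[:limit]  # caller then fetches these parents' content
```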
Config
New env var:
- GNOSIS_MCP_PARENT_CHILD=true — turn on parent-child chunking.
  Default off in v0.11.0 (opt-in); promote to default in a later minor
  when we've benched more corpora.
- GNOSIS_MCP_CHILD_CHUNK_SIZE=500 — retrieval granularity.
- GNOSIS_MCP_CHUNK_SIZE=2000 — parent granularity (already exists,
  default updated in v0.11.0).
The two knobs are independent but should keep a parent/child ratio of
roughly 4. Below 2 the parent barely adds context; above 8 the child
chunks become too fragmented to rank well.
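A startup check for the ratio guidance could look like this (env var names and defaults from above; warning on violation is an assumption about desired behavior):

```python
import os
import warnings

def check_chunk_ratio() -> float:
    """Warn if the parent/child chunk ratio falls outside 2-8."""
    parent = int(os.environ.get("GNOSIS_MCP_CHUNK_SIZE", "2000"))
    child = int(os.environ.get("GNOSIS_MCP_CHILD_CHUNK_SIZE", "500"))
    ratio = parent / child
    if not 2 <= ratio <= 8:
        warnings.warn(
            f"parent/child chunk ratio {ratio:.1f} is outside the "
            "recommended 2-8 range; retrieval quality may suffer"
        )
    return ratio
```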
Migration
Zero DB migration: the two new columns default NULL. Old indexes stay
valid (children have is_child = true, parents is_child = false,
pre-v0.11 rows all NULL and treated as parents by default). The
search_docs query gets an AND (is_child = true OR is_child IS NULL)
clause so existing corpora keep working while new ingests use the
parent-child split.
Why not overlap instead
Overlapping chunks (10-20 % overlap between adjacent chunks) is the classic mitigation for the "lose context at boundaries" problem. A Jan 2026 systematic study on SPLADE + Mistral-8B / Natural Questions found overlap delivers zero measurable lift and only increases indexing cost. Parent-child explicitly separates the concerns — no overlap needed.
Why not larger chunks
We tested 2200, 3000, 4000 chars — all below 2000's 0.8702 nDCG@10 on the real-corpus sweep (see bench-experiments doc). Bigger chunks dilute BM25 term density; parent-child dodges this by ranking on small chunks while displaying bigger ones.
Why not Anthropic's Contextual Retrieval instead
Anthropic's technique (Sept 2024 paper) prepends an LLM-generated sentence of context to each chunk. Claimed -35 % retrieval failures on average across their test corpora, but a Snowflake finance-RAG study found -5.8 nDCG in one domain. Parent-child needs no LLM call at ingest, works with any embedder, is dataset-agnostic, and composes with contextual retrieval later if we want both.
Implementation cost
- 1 day of work (schema fields, chunker, dedupe query)
- ~150 LOC
- 20-40 new tests (parent-child lifecycle, dedup boundary cases, backward compat)
- Docs update in config.md, tools.md, bench-experiments
Acceptance criteria
- On the real-corpus golden set, expand_to_parents=true produces
  nDCG@10 ≥ 0.88 (up from 0.8702 at chunk_size=2000).
- On BEIR SciFact, zero regression vs current 0.6712 nDCG@10 (this one
  is already saturated by BM25; any uplift is gravy).
- Ingest time grows by at most 2× (more chunks, but the embedder only
  runs on children, so the dominant cost is unchanged).
- The GNOSIS_MCP_PARENT_CHILD=false (default) path is byte-identical
  to v0.10.x output.
Open questions
- Should children also carry embeddings, or only rank on BM25? (Probably BM25-only, to avoid 4× vector blow-up — but benchmark both.)
- What happens when a single paragraph crosses the parent boundary? Current proposal: parent splits only at H2/H3 boundaries (already how the existing chunker works), so this rarely happens.
- Multi-backend parity: the same logic has to work on Postgres
  (documentation_chunks on PG has the same schema + a separate
  documentation_links table). Confirmed: the two new columns add the
  same way on both backends.