Benchmarks
Captured on gnosis-mcp v0.11.0, Python 3.14, Linux x86_64 laptop CPU.
Four distinct benchmark suites — each answers a different question.
TL;DR
| Question | Answer |
|---|---|
| How fast is search? | 9 463 QPS, 0.16 ms p95 on 100 docs; 839 QPS, 3.0 ms p95 on 5 000 docs (SQLite keyword) |
| How does it score on a public benchmark? | nDCG@10 = 0.671, Hit@5 = 0.73 on BEIR SciFact — within 1 % of the BM25 reference baseline (0.679) |
| How does it compare vs other tools? | Beats txtai on nDCG@10 by +14 %, on Hit@5 by +9 %; ingest 6× faster |
| Is retrieval accurate on our own corpus? | Hit Rate@5 = 1.00, MRR = 0.95, P@5 = 0.67 on 10 eval cases |
| Does hybrid search help? | Dataset-dependent. On SciFact (scientific) no lift over keyword. On real-world docs the local ONNX model contributes meaningfully |
| What does an agent pay per tool call? | ~8.7 ms mean, 13.0 ms p95 end-to-end through the MCP stdio protocol |
| How fast is ingest? | ~18–21 K chunks/s keyword-only (see §5). With local ONNX embeddings enabled, throughput is dominated by the embedder — budget ~30 docs/s on laptop CPU, varies with doc length |
| What improved in v0.11? | +3 points nDCG@10 on our real dev-docs corpus (0.8407 → 0.8702) — one config change: GNOSIS_MCP_CHUNK_SIZE lowered 4000 → 2000. Full sweep in bench-experiments |
1. BEIR / SciFact — public retrieval benchmark
SciFact is the standard scientific-claim retrieval benchmark (5 183 docs, 300 test queries). Numbers here are directly comparable to published IR baselines.
| Tool | nDCG@10 | MRR@10 | Hit@5 | Recall@10 | p50 | p95 | Ingest |
|---|---|---|---|---|---|---|---|
| gnosis-mcp (keyword) | 0.6712 | 0.6401 | 0.7300 | 0.7938 | 13.4 ms | 27.1 ms | 24.0 s |
| gnosis-mcp (hybrid) | 0.6712 | 0.6401 | 0.7300 | 0.7938 | 17.5 ms | 32.5 ms | 181.4 s |
| txtai (all-MiniLM-L6-v2) | 0.5869 | 0.5527 | 0.6700 | 0.7001 | 3.7 ms | 5.1 ms | 141.8 s |
| Reference: BM25 baseline | 0.679 | — | — | — | — | — | — |
| Reference: ColBERTv2 | 0.693 | — | — | — | — | — | — |
Reading:
- gnosis-mcp's FTS5 keyword path is within 1 % of the Lucene BM25 baseline — the gold standard that hybrid / dense retrievers have historically struggled to beat on SciFact.
- txtai's dense-only result (0.59) illustrates the classic dense-retrieval tax on scientific-domain corpora with a general-purpose embedder.
- Hybrid mode adds vector lookup latency (~4 ms) but doesn't lift ranking quality on this dataset. That's expected: SciFact's vocabulary is specialised enough that a generic 384-dim model can't add signal beyond BM25. Expect larger deltas on conversational / ambiguous corpora — e.g. FIQA (finance QA), or your own project docs.
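Hybrid mode fuses the keyword and vector result lists with Reciprocal Rank Fusion (the RRF named in §3). A minimal sketch of that fusion step — `rrf_fuse` and its inputs are illustrative, not gnosis-mcp's internal API:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over rankers of 1 / (k + rank_d).

    `rankings` is a list of ranked doc-id lists (best first); k=60 is the
    constant from the original RRF paper. Returns doc ids, best first.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword = ["a", "b", "c"]   # FTS5/BM25 order
vector = ["b", "d", "a"]    # embedding order
fused = rrf_fuse([keyword, vector])
```

A document ranked well by both lists ("b" here) outranks one that only one ranker likes — which is why, when the embedder adds no signal (as on SciFact), the fused order collapses back to the keyword order.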
Reproduce:
```
pip install 'gnosis-mcp[embeddings] @ .' beir txtai
uv run python tests/bench/bench_compare.py --dataset scifact \
    --tools gnosis-keyword,gnosis-hybrid,txtai
```
Other BEIR datasets worth trying: nfcorpus (medical), fiqa (finance QA),
arguana (argument retrieval). Run any of them with
tests/bench/bench_beir.py --dataset <name>.
2. Search speed — SQLite FTS5 (scale curve)
Synthetic corpus, 1 000 queries each, median of 3 runs, in-memory DB.
| Docs | Chunks | Ingest (s) | QPS | p50 (ms) | p95 (ms) | p99 (ms) | Hit rate |
|---|---|---|---|---|---|---|---|
| 100 | 300 | 0.015 | 9 463 | 0.10 | 0.16 | 0.19 | 1.00 |
| 500 | 1 500 | 0.088 | 3 945 | 0.22 | 0.44 | 0.48 | 1.00 |
| 1 000 | 3 000 | 0.179 | 2 768 | 0.29 | 0.72 | 0.78 | 1.00 |
| 2 000 | 6 000 | 0.319 | 1 889 | 0.38 | 1.23 | 1.32 | 1.00 |
| 5 000 | 15 000 | 0.844 | 839 | 0.80 | 2.97 | 3.54 | 1.00 |
| 10 000 | 30 000 | 1.642 | 471 | 1.38 | 5.60 | 6.29 | 1.00 |
Sub-millisecond p95 through 2 000 docs. Still sub-10 ms at 10 000 docs — well under the 1-second budget an LLM agent can tolerate per tool call.
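The QPS and percentile columns come from timing each query individually. A minimal sketch of such a measurement loop, assuming any callable search function — this is illustrative, not `bench_search.py` itself:

```python
import statistics
import time

def bench(fn, queries, warmup=10):
    """Time fn(query) per call; return QPS and latency percentiles in ms."""
    for q in queries[:warmup]:          # warm caches before measuring
        fn(q)
    latencies = []
    for q in queries:
        t0 = time.perf_counter()
        fn(q)
        latencies.append((time.perf_counter() - t0) * 1000.0)
    qs = statistics.quantiles(latencies, n=100)  # 99 cut points
    return {
        "qps": len(latencies) / (sum(latencies) / 1000.0),
        "p50_ms": qs[49],
        "p95_ms": qs[94],
        "p99_ms": qs[98],
    }
```

QPS is derived from total measured query time only, so it excludes setup and ingest cost.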
Reproduce:
```
uv run python tests/bench/bench_search.py --docs 1000 --queries 1000 --json
```
3. Retrieval quality — RAG-native metrics
Ten hand-authored query→expected-path cases (internal guides + git-history docs).
| Mode | Hit Rate@5 | MRR | Mean Precision@5 | p50 ms | p95 ms |
|---|---|---|---|---|---|
| Keyword (FTS5 + BM25) | 1.000 | 0.950 | 0.668 | 0.12 | 0.27 |
| Hybrid (FTS5 + ONNX embeddings, RRF) | 1.000 | 0.950 | 0.668 | 0.24 | 0.41 |
Takeaways
- On a small corpus with distinctive keywords, keyword search already saturates — hybrid adds no lift but ~2× latency (embedding cost).
- The real payoff for hybrid appears on larger corpora with less distinctive query vocabulary (paraphrase, synonym) — not present in this test set.
- MRR = 0.95 means the first relevant document is almost always #1 in the result list.
- Precision@5 = 0.67 reflects the fact that several relevant docs exist for a given query — we return multiple correct matches in the top 5.
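For reference, the three metrics above can be computed per eval case as follows — `eval_case` is an illustrative sketch, not the `bench_rag.py` implementation:

```python
def eval_case(ranked, relevant, k=5):
    """Hit Rate@k, reciprocal rank, and Precision@k for one eval case.

    `ranked` is the retrieved doc list (best first); `relevant` is the set
    of expected paths for the query.
    """
    top_k = ranked[:k]
    hits = [d for d in top_k if d in relevant]
    rr = 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            rr = 1.0 / rank         # rank of the FIRST relevant doc
            break
    return {
        "hit@k": 1.0 if hits else 0.0,
        "rr": rr,                   # averaged over all cases this is MRR
        "precision@k": len(hits) / k,
    }

m = eval_case(["a", "x", "b", "y", "z"], relevant={"a", "b"})
```

Note how Precision@5 is capped by the number of relevant docs: with only two relevant docs, a perfect run still scores 0.4 — which is why 0.67 above does not indicate noise in the top 5.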
Reproduce:
```
uv run python tests/bench/bench_rag.py          # formatted table
uv run python tests/bench/bench_rag.py --json   # machine-readable
```
4. End-to-end MCP protocol latency
What a real MCP client (Claude Code, Cursor, Windsurf) pays per tool call — subprocess stdio transport, full JSON-RPC round trip.
| Operation | Mean | p50 | p95 | p99 |
|---|---|---|---|---|
| initialize (one-time handshake) | 407 ms | — | — | — |
| list_tools | — | 2.2 ms | — | — |
| search_docs (100 iterations) | 8.7 ms | 8.1 ms | 13.0 ms | 15.8 ms |
Compare to the in-process search bench (0.16 ms p95) — the MCP protocol overhead
is ~8 ms: JSON-RPC marshalling, stdio pipe, FastMCP dispatch, serialisation of
results. This is what dominates real-world agent latency, not the search itself.
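That overhead breakdown can be sanity-checked with a toy stdio echo server: it pays the JSON marshalling, pipe write/read, and per-message framing, but does no search, so it isolates the transport cost. This stand-in is illustrative only, not the FastMCP client:

```python
import json
import subprocess
import sys
import time

# Minimal stand-in for a stdio JSON-RPC server: read a line, echo a result.
CHILD = (
    "import sys, json\n"
    "for line in sys.stdin:\n"
    "    req = json.loads(line)\n"
    "    sys.stdout.write(json.dumps({'id': req['id'], 'result': 'ok'}) + '\\n')\n"
    "    sys.stdout.flush()\n"
)

proc = subprocess.Popen(
    [sys.executable, "-u", "-c", CHILD],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)
samples = []
for i in range(100):
    t0 = time.perf_counter()
    proc.stdin.write(json.dumps({"id": i, "method": "search_docs"}) + "\n")
    proc.stdin.flush()
    resp = json.loads(proc.stdout.readline())
    samples.append((time.perf_counter() - t0) * 1000.0)
proc.stdin.close()
proc.wait()
p50 = sorted(samples)[len(samples) // 2]
```

Even this empty round trip costs a measurable fraction of a millisecond per call; the real protocol adds FastMCP dispatch and result serialisation on top.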
The v0.10.13 drop from ~13 ms to ~8.7 ms came from upgrading the mcp SDK to
1.27 and the transport improvements that shipped with it.
Reproduce:
```
uv run python tests/bench/bench_mcp_e2e.py --queries 100
```
5. Ingest throughput
| Corpus | Chunks | Time (s) | Throughput |
|---|---|---|---|
| 100 docs | 300 | 0.014 | ~21 000 chunks/s |
| 1 000 docs | 3 000 | 0.146 | ~20 500 chunks/s |
| 5 000 docs | 15 000 | 0.792 | ~18 900 chunks/s |
| 10 000 docs | 30 000 | 1.680 | ~17 900 chunks/s |
Roughly linear. Re-ingestion of unchanged files is skipped via SHA-256 content hashing (near-zero cost) — so a watcher re-running after every save only pays for files that actually changed.
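The hash-based skip can be sketched in a few lines — `needs_reingest` and the `seen` map are illustrative names, not gnosis-mcp's actual schema:

```python
import hashlib
from pathlib import Path

def content_hash(path: Path) -> str:
    """SHA-256 of the file bytes: identical content yields an identical hash."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def needs_reingest(path: Path, seen: dict[str, str]) -> bool:
    """Re-ingest only when the content hash differs from the stored one."""
    h = content_hash(path)
    if seen.get(str(path)) == h:
        return False            # unchanged: skip (near-zero cost)
    seen[str(path)] = h
    return True
```

Hashing content rather than trusting mtime means a `touch` or an editor's save-without-change costs one hash, not a full re-chunk and re-index.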
6. PostgreSQL (pgvector)
PostgreSQL numbers are captured manually — no CI service container yet (PG CI
is opt-in via GNOSIS_MCP_CI_PG). To reproduce:
```
docker run -d --rm --name gnosis-bench-pg -p 15432:5432 \
    -e POSTGRES_PASSWORD=pw -e POSTGRES_DB=gnosis_bench \
    pgvector/pgvector:pg15
sleep 3
PGPASSWORD=pw psql -h localhost -p 15432 -U postgres -d gnosis_bench \
    -c "CREATE EXTENSION IF NOT EXISTS vector;"
GNOSIS_MCP_DATABASE_URL="postgresql://postgres:pw@localhost:15432/gnosis_bench" \
    uv run gnosis-mcp init-db
# Then adapt bench_search.py or point your own workload at it.
docker stop gnosis-bench-pg
```
Expected ranges on the same laptop with network overhead dominating:
- 100 docs: ~3 500–4 000 QPS, p95 ~1–2 ms
- 1 000 docs: ~1 500–2 000 QPS, p95 ~3–5 ms
- 10 000 docs: ~600–900 QPS, p95 ~8–15 ms (HNSW index dominates)
PostgreSQL pulls ahead of SQLite once the corpus crosses ~50 000 chunks and hybrid search is active — HNSW scales sub-linearly while sqlite-vec performs a full scan.
Regression gates
Benchmarks are run locally by the maintainer before each release (they are
explicitly excluded from CI via the bench pytest marker so PR feedback
stays fast). A >10 % regression in any of the following blocks a release:
- QPS on 100- and 1 000-doc corpora
- p95 latency on the scale curve
- Hit Rate@5 on the eval cases
- Ingest throughput
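The 10 % gate amounts to a small comparison against stored baseline numbers. A sketch of that check — metric names and the baseline schema here are illustrative, not the release script's:

```python
def gate(baseline: dict, current: dict, tolerance: float = 0.10) -> list[str]:
    """Return the metrics that regressed by more than `tolerance` (10 %).

    Throughput/quality metrics regress when they fall; latency-style
    metrics regress when they rise.
    """
    higher_is_better = {"qps_100", "qps_1000", "hit_rate_at_5", "ingest_cps"}
    failures = []
    for name, base in baseline.items():
        cur = current[name]
        if name in higher_is_better:
            regressed = cur < base * (1 - tolerance)
        else:  # latency-style metric: lower is better
            regressed = cur > base * (1 + tolerance)
        if regressed:
            failures.append(name)
    return failures
```

A non-empty return blocks the release; an empty list means all tracked metrics are within tolerance.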
Automating this (a scheduled workflow that runs the bench suite and posts numbers to a dashboard) is a v0.12 goal — tracked in the release plan.
Methodology notes
- Each bench is run once per release; each run records the median of ≥3 internal repetitions.
- Synthetic corpora are deterministic — seeded generators, repeatable.
- Eval cases are human-authored and cover guides, architecture docs, and git-history paths.
- Benchmarks run on laptop CPU — datacentre numbers will be higher but the shape is stable.
- These are ceilings, not guarantees. Production sees extra cost from concurrent clients, larger chunks, and network transport for HTTP or PostgreSQL.