Benchmarks
Captured on gnosis-mcp v0.11.0, Python 3.14, Linux x86_64 laptop CPU.
Four distinct benchmark suites — each answers a different question.
TL;DR
| Question | Answer |
|---|---|
| How fast is search? | 9 463 QPS, 0.16 ms p95 on 100 docs; 839 QPS, 3.0 ms p95 on 5 000 docs (SQLite keyword) |
| How does it score on a public benchmark? | nDCG@10 = 0.671, Hit@5 = 0.73 on BEIR SciFact — within 1 % of the BM25 reference baseline (0.679) |
| How does it compare vs other tools? | Beats txtai on nDCG@10 by +14 %, on Hit@5 by +9 %; ingest 6× faster |
| Is retrieval accurate on our own corpus? | Hit Rate@5 = 1.00, MRR = 0.95, P@5 = 0.67 on 10 eval cases |
| Does hybrid search help? | Dataset-dependent. On SciFact (scientific) no lift over keyword. On real-world docs the local ONNX model contributes meaningfully |
| What does an agent pay per tool call? | ~8.7 ms mean, 13.0 ms p95 end-to-end through the MCP stdio protocol |
| How fast is ingest? | ~18–21 K chunks/s keyword-only (see §5). With local ONNX embeddings enabled, throughput is dominated by the embedder — budget ~30 docs/s on laptop CPU, varies with doc length |
| What improved in v0.11? | +3 points nDCG@10 on our real dev-docs corpus (0.8407 → 0.8702) — one config change: GNOSIS_MCP_CHUNK_SIZE lowered 4000 → 2000. Full sweep in bench-experiments |
1. BEIR / SciFact — public retrieval benchmark
SciFact is the standard scientific-claim retrieval benchmark (5 183 docs, 300 test queries). Numbers here are directly comparable to published IR baselines.
| Tool | nDCG@10 | MRR@10 | Hit@5 | Recall@10 | p50 | p95 | Ingest |
|---|---|---|---|---|---|---|---|
| gnosis-mcp (keyword) | 0.6712 | 0.6401 | 0.7300 | 0.7938 | 13.4 ms | 27.1 ms | 24.0 s |
| gnosis-mcp (hybrid) | 0.6712 | 0.6401 | 0.7300 | 0.7938 | 17.5 ms | 32.5 ms | 181.4 s |
| txtai (all-MiniLM-L6-v2) | 0.5869 | 0.5527 | 0.6700 | 0.7001 | 3.7 ms | 5.1 ms | 141.8 s |
| Reference: BM25 baseline | 0.679 | — | — | — | — | — | — |
| Reference: ColBERTv2 | 0.693 | — | — | — | — | — | — |
Reading:
- gnosis-mcp's FTS5 keyword path is within 1 % of the Lucene BM25 baseline — the gold standard that hybrid / dense retrievers have historically struggled to beat on SciFact.
- txtai's dense-only result (0.59) illustrates the classic dense-retrieval tax on scientific-domain corpora with a general-purpose embedder.
- Hybrid mode adds vector lookup latency (~4 ms) but doesn't lift ranking quality on this dataset. That's expected: SciFact's vocabulary is specialised enough that a generic 384-dim model can't add signal beyond BM25. Expect larger deltas on conversational / ambiguous corpora — e.g. FIQA (finance QA), or your own project docs.
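Hybrid mode fuses the keyword and vector result lists with Reciprocal Rank Fusion (the RRF named in §3). A minimal sketch of that fusion step — `rrf_fuse` and its inputs are illustrative, not gnosis-mcp's internal API:

```python
def rrf_fuse(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over rankers of 1 / (k + rank_d).

    `rankings` is a list of ranked doc-id lists (best first); k=60 is the
    constant from the original RRF paper. Returns doc ids, best first.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword = ["a", "b", "c"]   # FTS5/BM25 order
vector = ["b", "d", "a"]    # embedding order
fused = rrf_fuse([keyword, vector])
```

A document ranked well by both lists ("b" here) outranks one that only one ranker likes — which is why, when the embedder adds no signal (as on SciFact), the fused order collapses back to the keyword order.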
Reproduce:
```
pip install 'gnosis-mcp[embeddings] @ .' beir txtai
uv run python tests/bench/bench_compare.py --dataset scifact \
    --tools gnosis-keyword,gnosis-hybrid,txtai
```
Other BEIR datasets worth trying: nfcorpus (medical), fiqa (finance QA),
arguana (argument retrieval). Run any of them with
tests/bench/bench_beir.py --dataset <name>.
2. Search speed — SQLite FTS5 (scale curve)
Synthetic corpus, 1 000 queries each, median of 3 runs, in-memory DB.
| Docs | Chunks | Ingest (s) | QPS | p50 (ms) | p95 (ms) | p99 (ms) | Hit rate |
|---|---|---|---|---|---|---|---|
| 100 | 300 | 0.015 | 9 463 | 0.10 | 0.16 | 0.19 | 1.00 |
| 500 | 1 500 | 0.088 | 3 945 | 0.22 | 0.44 | 0.48 | 1.00 |
| 1 000 | 3 000 | 0.179 | 2 768 | 0.29 | 0.72 | 0.78 | 1.00 |
| 2 000 | 6 000 | 0.319 | 1 889 | 0.38 | 1.23 | 1.32 | 1.00 |
| 5 000 | 15 000 | 0.844 | 839 | 0.80 | 2.97 | 3.54 | 1.00 |
| 10 000 | 30 000 | 1.642 | 471 | 1.38 | 5.60 | 6.29 | 1.00 |
Sub-millisecond p95 through 2 000 docs. Still sub-10 ms at 10 000 docs — well under the 1-second budget an LLM agent can tolerate per tool call.
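The QPS and percentile columns come from timing each query individually. A minimal sketch of such a measurement loop, assuming any callable search function — this is illustrative, not `bench_search.py` itself:

```python
import statistics
import time

def bench(fn, queries, warmup=10):
    """Time fn(query) per call; return QPS and latency percentiles in ms."""
    for q in queries[:warmup]:          # warm caches before measuring
        fn(q)
    latencies = []
    for q in queries:
        t0 = time.perf_counter()
        fn(q)
        latencies.append((time.perf_counter() - t0) * 1000.0)
    qs = statistics.quantiles(latencies, n=100)  # 99 cut points
    return {
        "qps": len(latencies) / (sum(latencies) / 1000.0),
        "p50_ms": qs[49],
        "p95_ms": qs[94],
        "p99_ms": qs[98],
    }
```

QPS is derived from total measured query time only, so it excludes setup and ingest cost.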
Reproduce:
```
uv run python tests/bench/bench_search.py --docs 1000 --queries 1000 --json
```
3. Retrieval quality — RAG-native metrics
Ten hand-authored query→expected-path cases (internal guides + git-history docs).
| Mode | Hit Rate@5 | MRR | Mean Precision@5 | p50 ms | p95 ms |
|---|---|---|---|---|---|
| Keyword (FTS5 + BM25) | 1.000 | 0.950 | 0.668 | 0.12 | 0.27 |
| Hybrid (FTS5 + ONNX embeddings, RRF) | 1.000 | 0.950 | 0.668 | 0.24 | 0.41 |
Takeaways
- On a small corpus with distinctive keywords, keyword search already saturates — hybrid adds no lift but ~2× latency (embedding cost).
- The real payoff for hybrid appears on larger corpora with less distinctive query vocabulary (paraphrase, synonym) — not present in this test set.
- MRR = 0.95 means the first relevant document is almost always #1 in the result list.
- Precision@5 = 0.67 reflects the fact that several relevant docs exist for a given query — we return multiple correct matches in the top 5.
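For reference, the three metrics above can be computed per eval case as follows — `eval_case` is an illustrative sketch, not the `bench_rag.py` implementation:

```python
def eval_case(ranked, relevant, k=5):
    """Hit Rate@k, reciprocal rank, and Precision@k for one eval case.

    `ranked` is the retrieved doc list (best first); `relevant` is the set
    of expected paths for the query.
    """
    top_k = ranked[:k]
    hits = [d for d in top_k if d in relevant]
    rr = 0.0
    for rank, d in enumerate(ranked, start=1):
        if d in relevant:
            rr = 1.0 / rank         # rank of the FIRST relevant doc
            break
    return {
        "hit@k": 1.0 if hits else 0.0,
        "rr": rr,                   # averaged over all cases this is MRR
        "precision@k": len(hits) / k,
    }

m = eval_case(["a", "x", "b", "y", "z"], relevant={"a", "b"})
```

Note how Precision@5 is capped by the number of relevant docs: with only two relevant docs, a perfect run still scores 0.4 — which is why 0.67 above does not indicate noise in the top 5.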
Reproduce:
```
uv run python tests/bench/bench_rag.py          # formatted table
uv run python tests/bench/bench_rag.py --json   # machine-readable
```
4. End-to-end MCP protocol latency
What a real MCP client (Claude Code, Cursor, Windsurf) pays per tool call — subprocess stdio transport, full JSON-RPC round trip.
| Operation | Mean | p50 | p95 | p99 |
|---|---|---|---|---|
| initialize (one-time handshake) | 407 ms | — | — | — |
| list_tools | — | 2.2 ms | — | — |
| search_docs (100 iterations) | 8.7 ms | 8.1 ms | 13.0 ms | 15.8 ms |
Compare to the in-process search bench (0.16 ms p95) — the MCP protocol overhead
is ~8 ms: JSON-RPC marshalling, stdio pipe, FastMCP dispatch, serialisation of
results. This is what dominates real-world agent latency, not the search itself.
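That overhead breakdown can be sanity-checked with a toy stdio echo server: it pays the JSON marshalling, pipe write/read, and per-message framing, but does no search, so it isolates the transport cost. This stand-in is illustrative only, not the FastMCP client:

```python
import json
import subprocess
import sys
import time

# Minimal stand-in for a stdio JSON-RPC server: read a line, echo a result.
CHILD = (
    "import sys, json\n"
    "for line in sys.stdin:\n"
    "    req = json.loads(line)\n"
    "    sys.stdout.write(json.dumps({'id': req['id'], 'result': 'ok'}) + '\\n')\n"
    "    sys.stdout.flush()\n"
)

proc = subprocess.Popen(
    [sys.executable, "-u", "-c", CHILD],
    stdin=subprocess.PIPE, stdout=subprocess.PIPE, text=True,
)
samples = []
for i in range(100):
    t0 = time.perf_counter()
    proc.stdin.write(json.dumps({"id": i, "method": "search_docs"}) + "\n")
    proc.stdin.flush()
    resp = json.loads(proc.stdout.readline())
    samples.append((time.perf_counter() - t0) * 1000.0)
proc.stdin.close()
proc.wait()
p50 = sorted(samples)[len(samples) // 2]
```

Even this empty round trip costs a measurable fraction of a millisecond per call; the real protocol adds FastMCP dispatch and result serialisation on top.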
The v0.10.13 drop from ~13 ms to ~8.7 ms came from upgrading the mcp SDK to
1.27 and the transport improvements that shipped with it.
Reproduce:
```
uv run python tests/bench/bench_mcp_e2e.py --queries 100
```
5. Ingest throughput
| Corpus | Chunks | Time (s) | Throughput |
|---|---|---|---|
| 100 docs | 300 | 0.014 | ~21 000 chunks/s |
| 1 000 docs | 3 000 | 0.146 | ~20 500 chunks/s |
| 5 000 docs | 15 000 | 0.792 | ~18 900 chunks/s |
| 10 000 docs | 30 000 | 1.680 | ~17 900 chunks/s |
Roughly linear. Re-ingestion of unchanged files is skipped via SHA-256 content hashing (near-zero cost) — so a watcher re-running after every save only pays for files that actually changed.
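The hash-based skip can be sketched in a few lines — `needs_reingest` and the `seen` map are illustrative names, not gnosis-mcp's actual schema:

```python
import hashlib
from pathlib import Path

def content_hash(path: Path) -> str:
    """SHA-256 of the file bytes: identical content yields an identical hash."""
    return hashlib.sha256(path.read_bytes()).hexdigest()

def needs_reingest(path: Path, seen: dict[str, str]) -> bool:
    """Re-ingest only when the content hash differs from the stored one."""
    h = content_hash(path)
    if seen.get(str(path)) == h:
        return False            # unchanged: skip (near-zero cost)
    seen[str(path)] = h
    return True
```

Hashing content rather than trusting mtime means a `touch` or an editor's save-without-change costs one hash, not a full re-chunk and re-index.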
6. PostgreSQL (pgvector)
PostgreSQL numbers are captured manually — no CI service container yet (PG CI
is opt-in via GNOSIS_MCP_CI_PG). To reproduce:
```
docker run -d --rm --name gnosis-bench-pg -p 15432:5432 \
    -e POSTGRES_PASSWORD=pw -e POSTGRES_DB=gnosis_bench \
    pgvector/pgvector:pg15
sleep 3
PGPASSWORD=pw psql -h localhost -p 15432 -U postgres -d gnosis_bench \
    -c "CREATE EXTENSION IF NOT EXISTS vector;"
GNOSIS_MCP_DATABASE_URL="postgresql://postgres:pw@localhost:15432/gnosis_bench" \
    uv run gnosis-mcp init-db
# Then adapt bench_search.py or point your own workload at it.
docker stop gnosis-bench-pg
```
Expected ranges on the same laptop with network overhead dominating:
- 100 docs: ~3 500–4 000 QPS, p95 ~1–2 ms
- 1 000 docs: ~1 500–2 000 QPS, p95 ~3–5 ms
- 10 000 docs: ~600–900 QPS, p95 ~8–15 ms (HNSW index dominates)
PostgreSQL pulls ahead of SQLite once the corpus crosses ~50 000 chunks and hybrid search is active — HNSW scales sub-linearly while sqlite-vec performs a full scan.
Regression gates
Benchmarks are run locally by the maintainer before each release (they are
explicitly excluded from CI via the bench pytest marker so PR feedback
stays fast). A >10 % regression in any of the following blocks a release:
- QPS on 100- and 1 000-doc corpora
- p95 latency on the scale curve
- Hit Rate@5 on the eval cases
- Ingest throughput
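The 10 % gate amounts to a small comparison against stored baseline numbers. A sketch of that check — metric names and the baseline schema here are illustrative, not the release script's:

```python
def gate(baseline: dict, current: dict, tolerance: float = 0.10) -> list[str]:
    """Return the metrics that regressed by more than `tolerance` (10 %).

    Throughput/quality metrics regress when they fall; latency-style
    metrics regress when they rise.
    """
    higher_is_better = {"qps_100", "qps_1000", "hit_rate_at_5", "ingest_cps"}
    failures = []
    for name, base in baseline.items():
        cur = current[name]
        if name in higher_is_better:
            regressed = cur < base * (1 - tolerance)
        else:  # latency-style metric: lower is better
            regressed = cur > base * (1 + tolerance)
        if regressed:
            failures.append(name)
    return failures
```

A non-empty return blocks the release; an empty list means all tracked metrics are within tolerance.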
Automating this (a scheduled workflow that runs the bench suite and posts numbers to a dashboard) is a v0.12 goal — tracked in the release plan.
Methodology notes
- Each bench is run once per release; each run records the median of ≥3 internal repetitions.
- Synthetic corpora are deterministic — seeded generators, repeatable.
- Eval cases are human-authored and cover guides, architecture docs, and git-history paths.
- Benchmarks run on laptop CPU — datacentre numbers will be higher but the shape is stable.
- These are ceilings, not guarantees. Production sees extra cost from concurrent clients, larger chunks, and network transport for HTTP or PostgreSQL.