Eval
gnosis-mcp eval in ~10 lines of wrapper logic. Reports retrieval
quality against your golden queries, compares to the last known numbers,
and points at /gnosis:tune if anything has regressed.
Different from /gnosis:tune: tune sweeps configurations looking
for the best chunk size / embedder / rerank combo. Eval just reports
current numbers. Run eval often (after each ingest, whenever the
corpus changes). Run tune occasionally (after a corpus shape change, a
new embedder, or a weekend of experimentation).
Usage
/gnosis:eval # full run — print numbers + interpret + compare
/gnosis:eval quick # numbers only, no interpretation
/gnosis:eval save # save current result as new baseline
/gnosis:eval diff # compare current to last saved baseline
Mode: $ARGUMENTS
Step 1 — run the harness
gnosis-mcp eval --json
Parses to:
{
"queries": 10,
"hit_at_5": 1.000,
"mrr": 0.950,
"mean_precision_at_5": 0.668,
"ndcg_at_10": 0.871
}
If the command errors (no [embeddings] extra, no golden file, empty
DB), explain the exact cause and the fix. Don't proceed with empty
numbers — fail loudly.
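A minimal sketch of the fail-loudly parsing step, assuming the field names shown in the sample output above (the function name and error messages are illustrative, not part of gnosis-mcp's API):

```python
import json

REQUIRED = ("queries", "hit_at_5", "mrr", "mean_precision_at_5", "ndcg_at_10")

def parse_eval(raw: str) -> dict:
    """Parse `gnosis-mcp eval --json` stdout; raise rather than return empty numbers."""
    metrics = json.loads(raw)
    missing = [k for k in REQUIRED if k not in metrics]
    if missing:
        raise ValueError(f"eval output missing fields: {missing}")
    if metrics["queries"] == 0:
        raise ValueError("eval ran against an empty golden set -- nothing to report")
    return metrics
```

Raising on a malformed or empty result is the point: a silent zero would flow into the baseline diff and look like a catastrophic regression.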
Step 2 — interpret
Report as a compact table:
| Metric | Value | Meaning |
|---|---|---|
| Hit@5 | 0.92 | 9 of 10 queries find the right doc in the top 5 results |
| MRR | 0.79 | On average the right doc ranks ~1.3 in the list (1 / 0.79) |
| nDCG@10 | 0.87 | Ranking quality — 1.0 is perfect, random baseline is ~0.15 |
| Precision@5 | 0.67 | Of the top 5 results, 67% are relevant |
Interpretation thresholds (rough heuristics, corpus-dependent):
- Hit@5 ≥ 0.85 → healthy. Users rarely need to read past position 5.
- Hit@5 0.70-0.85 → acceptable. Keyword-saturated corpus likely; little room to grow without changing the eval set.
- Hit@5 < 0.70 → probably a corpus or chunking problem. Run /gnosis:tune to sweep chunk sizes.
- MRR / Hit@5 ratio < 0.80 → relevant docs exist in the top 5 but are ranked too low. Try title-prepending or reranking (/gnosis:tune covers both experiments).
- nDCG@10 ≥ 0.85 and Hit@5 = 0.92 → matches our published dev-docs baseline — you're in the same regime as gnosis-mcp's own corpus.
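The thresholds above can be sketched as a small classifier; the function name is illustrative, and the 1/MRR rank estimate follows from MRR being the mean of reciprocal ranks:

```python
def interpret(hit_at_5: float, mrr: float) -> list[str]:
    """Turn raw metrics into the health notes described above."""
    notes = []
    # MRR averages 1/rank, so 1/MRR approximates the typical rank of the right doc
    notes.append(f"typical rank of the right doc: ~{1 / mrr:.1f}")
    if hit_at_5 >= 0.85:
        notes.append("Hit@5 healthy")
    elif hit_at_5 >= 0.70:
        notes.append("Hit@5 acceptable -- likely keyword-saturated corpus")
    else:
        notes.append("Hit@5 low -- run /gnosis:tune to sweep chunk sizes")
    if mrr / hit_at_5 < 0.80:
        notes.append("ranking weak relative to recall -- try title-prepending or reranking")
    return notes
```

For the sample numbers in Step 2 (Hit@5 0.92, MRR 0.79) the ratio is ~0.86, so recall is healthy and ranking is not flagged.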
Step 3 — compare to baseline (modes: default, diff, save)
Baseline lives at ~/.local/share/gnosis-mcp/eval-baseline.json.
If baseline exists (default + diff modes):
| Metric | Now | Last (2026-04-18) | Δ |
|---|---|---|---|
| Hit@5 | 0.9200 | 0.9200 | — |
| MRR | 0.7933 | 0.7933 | — |
| nDCG@10 | 0.8702 | 0.8702 | — |
| P@5 | 0.6680 | 0.6680 | — |
Flag any Δ worse than -2 points in red and explicitly ask the user if the regression is expected. "Expected" reasons: new corpus shape, different chunk size, swapped embedder. "Not expected": accidental regression → investigate before committing the drop.
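A sketch of the comparison logic; the 2-point threshold comes from this doc, while the function name and return shape are illustrative:

```python
def diff_metrics(now: dict, baseline: dict, threshold: float = 0.02) -> dict:
    """Per-metric deltas against the baseline, flagging any that worsened past the threshold."""
    deltas = {
        k: round(now[k] - baseline[k], 4)
        for k in baseline
        if k in now and k != "queries"
    }
    regressions = [k for k, d in deltas.items() if d < -threshold]
    return {"deltas": deltas, "regressions": regressions}
```

Anything in `regressions` is what the command shows in red and asks the user about.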
If no baseline exists (default mode only — first-time run):
No baseline found. Save current numbers as baseline? (y/N)
On y, write the JSON to ~/.local/share/gnosis-mcp/eval-baseline.json
and confirm. On N, skip — user runs /gnosis:eval save later when
they're happy with a specific state.
save mode: unconditionally overwrite the baseline with current
numbers. Useful after an intentional improvement (e.g., /gnosis:tune
found a better chunk size, user re-ingested, wants to lock in the new
baseline).
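The save step can be sketched like this, assuming the baseline path stated above; the write-then-rename pattern is a suggested safeguard (not required by gnosis-mcp) so a crash mid-write can't leave a corrupt baseline:

```python
import json
import os
import tempfile
from pathlib import Path

BASELINE = Path.home() / ".local/share/gnosis-mcp/eval-baseline.json"

def save_baseline(metrics: dict, path: Path = BASELINE) -> None:
    """Unconditionally overwrite the baseline with the current eval numbers."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(metrics, f, indent=2)
    os.replace(tmp, path)  # atomic on POSIX: readers see old or new file, never a partial one
```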
Step 4 — recommendation (default mode)
Based on the numbers, point at the next action:
| State | Next action |
|---|---|
| Everything green, no regression | "Numbers look healthy. Run /gnosis:ingest <path> --prune after any doc change to keep them that way." |
| Hit@5 dropped > 2 points | "Regression. Run /gnosis:status diag to rule out corruption, then git diff any recent ingest or config changes." |
| Hit@5 healthy but MRR weak | "Ranking is the weak link. /gnosis:tune can test title-prepending — sometimes worth +3 MRR points on keyword-saturated corpora." |
| Hit@5 < 0.70 | "Chunk size or corpus shape issue. /gnosis:tune will sweep chunk sizes 1000-4000 and report the optimum — ~10 min runtime." |
| No regression but first time | "Run /gnosis:eval save to lock these as your baseline. Future runs will diff against this." |
When to skip /gnosis:eval
- BEIR-style public benchmark → use tests/bench/bench_beir.py instead. That's external validity; eval is in-distribution validity.
- End-to-end MCP round-trip timing → tests/bench/bench_mcp_e2e.py.
- Comparison across embedders → scripts/bench-embedders.sh for the full matrix.
What the numbers mean for non-experts
See docs/how-we-measure-search.md for a plain-English explainer of
Hit@K / MRR / nDCG (if that doc exists in this corpus; otherwise point
at the gnosismcp.com version:
https://gnosismcp.com/docs/how-we-measure-search).