# Ingest
One skill that covers every way to get content into gnosis-mcp. Routes
based on $ARGUMENTS:
- Path to a directory / file → local file ingest
- `git <repo>` → git-history ingest
- `crawl <url>` → web-crawl ingest
- `reingest` → full reset + re-ingest from the default path
- `prune <path>` → delete DB chunks whose source is gone
Action: $ARGUMENTS
## `ingest <path>` — local files
Default entry point. Handles `.md`, `.txt`, `.ipynb`, `.toml`, `.csv`,
`.json` (+ `.rst` / `.pdf` if those extras are installed).
### First-time ingest
```
gnosis-mcp ingest ./docs --embed
```
- `--embed` runs the bundled ONNX embedder (requires the `[embeddings]` extra). Without it you get keyword-only search — usually enough for dev-doc corpora (see bench-experiments), but `--embed` costs nothing on a first ingest and enables hybrid later if you want it.
- Incremental: every file's content hash is stored, so re-running only processes changed files. Use `--force` to re-ingest regardless.
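The incremental behaviour is easy to picture: hash each file, compare against the stored hash, and reprocess only the mismatches. A minimal sketch of that idea — a hypothetical helper, not the gnosis-mcp internals (the real tool may hash and store differently):

```python
import hashlib
from pathlib import Path

def changed_files(root: Path, seen: dict[str, str]) -> list[Path]:
    """Return markdown files under root whose content hash differs
    from the one recorded in `seen` (path -> sha256 hex digest)."""
    out = []
    for p in sorted(root.rglob("*.md")):
        digest = hashlib.sha256(p.read_bytes()).hexdigest()
        if seen.get(str(p)) != digest:
            out.append(p)
            seen[str(p)] = digest  # record it so the next run skips this file
    return out
```

Running this twice over an unchanged tree returns an empty list the second time, which is exactly why re-running `ingest` on a large corpus is cheap.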
### Chunk size
Default is 2000 characters (~600 tokens) — peak of the Feb 2026 sweep on a real dev-docs corpus. Override per-ingest or globally:
```
# This invocation only
GNOSIS_MCP_CHUNK_SIZE=1500 gnosis-mcp ingest ./docs --embed

# Persistent (put in shell profile)
export GNOSIS_MCP_CHUNK_SIZE=3000   # long-form blogs / ADRs
```
If you're unsure, run `/gnosis:tune` to sweep sizes against your own golden queries.
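To see what the chunk-size knob actually controls, here is a deliberately naive character chunker — hypothetical, for intuition only; the real splitter likely respects word or section boundaries:

```python
def chunk(text: str, size: int = 2000) -> list[str]:
    """Split text into fixed-size character chunks.
    The 2000-char default corresponds to roughly 600 tokens."""
    return [text[i:i + size] for i in range(0, len(text), size)]
```

A 4,500-character doc yields three chunks (2000 + 2000 + 500), so smaller sizes mean more, finer-grained chunks per document and larger sizes mean fewer, broader ones — which is what the sweep in `/gnosis:tune` is trading off.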
### Reorganized your knowledge folder
Files moved, deleted, or renamed? Pick one:
```
# Safest: re-ingest + drop chunks for files that no longer exist
gnosis-mcp ingest ./docs --embed --prune

# Nuclear: drop everything first, then re-ingest
gnosis-mcp ingest ./docs --embed --wipe

# Preview what prune would delete
gnosis-mcp prune ./docs --dry-run
```
By default `--prune` leaves crawled URLs alone (since those don't
correspond to local files). Add `--include-crawled` if you want those
gone too.
## `ingest git <repo>` — git commit history
Indexes each file's commit history as a searchable markdown document. Lets your agent answer "why does this code exist" queries.
```
gnosis-mcp ingest-git /path/to/repo --since 6m --embed
```
Common flags:
| Flag | Effect |
|---|---|
| `--since 6m` / `--since 2025-01-01` | Window of commits to include |
| `--until 2026-03-01` | Upper bound |
| `--author "alice@"` | Filter by author name or email substring |
| `--max-commits-per-file 20` | Default 10, most-recent wins |
| `--include "src/**"` | Glob filter on touched files |
| `--exclude "*.lock,package.json"` | Skip noisy files |
| `--include-merges` | Default excludes merge commits |
Each indexed doc's `file_path` is `git-history/<original-path>.md`.
Cross-file co-edits generate `git_co_change` edges; source-file
references get `git_ref`. Query them via
`mcp__gnosis__search_git_history` (or filter `mcp__gnosis__get_related`
by `relation_type=git_co_change`).
Re-run ingest-git whenever your history grows past the window you
already indexed.
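The `git_co_change` relation is simpler than it sounds: two files get an edge whenever they were touched in the same commit. A hypothetical sketch of that derivation (not the actual indexer, which may also weight or threshold edges):

```python
from itertools import combinations

def co_change_edges(commits: list[list[str]]) -> set[tuple[str, str]]:
    """Given each commit's list of touched file paths, derive
    undirected co-change edges as sorted (a, b) path pairs."""
    edges = set()
    for files in commits:
        # every pair of files edited together becomes one edge
        for a, b in combinations(sorted(set(files)), 2):
            edges.add((a, b))
    return edges
```

This is why co-change edges surface "why does this code exist" answers: files that always change together are almost certainly coupled, even when nothing in the source says so.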
## `ingest crawl <url>` — web crawl
Indexes a documentation website. Requires the `[web]` extra
(`pip install 'gnosis-mcp[web]'`).
```
# Preferred — discover URLs from sitemap.xml
gnosis-mcp crawl https://docs.stripe.com --sitemap --embed

# No sitemap? BFS link crawl, one hop deep
gnosis-mcp crawl https://docs.example.com --max-depth 1 --embed

# Subset only
gnosis-mcp crawl https://docs.example.com --sitemap \
  --include "/docs/api/**" --exclude "*.pdf"

# Preview, don't fetch
gnosis-mcp crawl https://docs.example.com --dry-run
```
Other flags:
| Flag | Effect |
|---|---|
| `--max-pages 5000` | Safety cap |
| `--force` | Ignore the ETag / Last-Modified / hash cache |
Behaviour:
- Respects `robots.txt`. A same-host redirect on `/robots.txt` is treated as disallow (prevents spoofing).
- Caches ETag + Last-Modified + content hash at `~/.local/share/gnosis-mcp/crawl-cache.json` — subsequent crawls issue conditional GETs and skip unchanged pages.
- Extracts markdown via trafilatura with a 30 s per-page timeout (`GNOSIS_MCP_CRAWL_EXTRACT_TIMEOUT_S`).
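The conditional-GET dance boils down to two small decisions: which validator headers to send, and whether a response means "re-index". A hypothetical sketch of that logic, decoupled from any HTTP client (the real crawler's cache schema may differ):

```python
def request_headers(cached: dict) -> dict:
    """Build conditional-GET headers from one crawl-cache entry."""
    h = {}
    if cached.get("etag"):
        h["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        h["If-Modified-Since"] = cached["last_modified"]
    return h

def should_reindex(status: int, body_hash: str, cached: dict) -> bool:
    """304 means the server vouches the page is unchanged; otherwise
    fall back to comparing the body's content hash with the cached one."""
    if status == 304:
        return False
    return body_hash != cached.get("hash")
```

The content-hash fallback matters because some servers return 200 with identical bytes; hashing catches those and keeps repeat crawls cheap either way.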
Vendor docs strategy: crawl them once, commit the indexed SQLite to version control, and you have offline, searchable vendor docs alongside your private docs. No Context7 subscription required.
## `ingest reingest` — full reset
Drop everything, reinitialise, reindex. Use when:
- You changed embedder (`GNOSIS_MCP_EMBED_MODEL` or `_EMBED_DIM`) — old vectors are now incompatible
- You want to force a clean baseline before a benchmark
- Schema drifted (rare, but `init-db` is idempotent so rerunning is safe)
```
gnosis-mcp init-db                        # ensure schema is current
gnosis-mcp ingest ./docs --embed --wipe   # delete everything + reingest
gnosis-mcp stats                          # confirm
```
## `ingest prune <path>` — dead-chunk cleanup
Standalone prune, independent of re-ingest.
```
# What would go
gnosis-mcp prune ./docs --dry-run

# Delete chunks for files no longer on disk under ./docs
gnosis-mcp prune ./docs

# Also drop crawled URLs (normally spared)
gnosis-mcp prune ./docs --include-crawled
```
Safer than `--wipe` because it only deletes rows whose original
`file_path` resolved as a local file under the given root AND is now
missing. Crawled URLs, git-history docs (`git-history/*`), and any
path outside the root are untouched unless you explicitly opt in.
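Those safety rules can be read as a short decision function. A hypothetical sketch, simplified from what the real tool does, that returns True only for rows prune would delete:

```python
from pathlib import Path

def prunable(file_path: str, root: Path, include_crawled: bool = False) -> bool:
    """True if this DB row's source was a local file under `root`
    that no longer exists on disk."""
    if file_path.startswith(("http://", "https://")):
        return include_crawled        # crawled URLs spared unless opted in
    if file_path.startswith("git-history/"):
        return False                  # git-history docs are never pruned here
    p = Path(file_path).resolve()
    try:
        p.relative_to(root.resolve())
    except ValueError:
        return False                  # outside the root: untouched
    return not p.exists()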
## Watch mode (server-side auto-reingest)
Skip the manual re-run loop — the server can watch a folder and re-ingest on file changes.
```
gnosis-mcp serve --watch ./docs --transport streamable-http --rest
```
Mtime polling + debounce. Works on every OS, no fsnotify dependency. Ideal for docs-as-code repos where you push a doc and want it searchable by your editor within a few seconds.
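Mtime polling plus debounce decomposes into two tiny pieces: snapshot the tree's mtimes on each tick, and hold off re-ingesting until changes have gone quiet. A hypothetical sketch of both (not the server's actual watcher, whose intervals and state live elsewhere):

```python
from pathlib import Path

def snapshot(root: Path) -> dict[str, float]:
    """Map every file under root to its modification time.
    A tick's snapshot differing from the last one means 'something changed'."""
    return {str(p): p.stat().st_mtime for p in root.rglob("*") if p.is_file()}

class Debouncer:
    """Fire only after `quiet` seconds with no new change events,
    so a burst of saves triggers a single re-ingest."""
    def __init__(self, quiet: float = 2.0):
        self.quiet = quiet
        self.last_event: float | None = None

    def event(self, now: float) -> None:
        self.last_event = now

    def ready(self, now: float) -> bool:
        if self.last_event is None:
            return False
        if now - self.last_event >= self.quiet:
            self.last_event = None    # fired: wait for the next burst
            return True
        return False
```

Polling is slower to notice changes than fsnotify-style watchers, but it behaves identically on Linux, macOS, Windows, and network mounts, which is the trade the server makes.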
## Verify afterwards
Always run `gnosis-mcp stats` (or `/gnosis:status stats`) after a big
ingest to confirm:
- Doc count matches expectations
- Chunk count is sensible (~1-5 chunks per doc for ~2000-char chunk size)
- Embeddings coverage is 100 % if you used `--embed`
- No vec0 table errors
```
$ gnosis-mcp stats
Documents: 558
Chunks: 1,742
Embeddings: 1,742 / 1,742 (100.0 %)
Last access log entry: 2026-04-18 07:12 UTC
Backend: sqlite
```
## See also
- `/gnosis:tune` — chunk-size sweep on your own corpus
- `/gnosis:status` — connectivity + DB health
- `/gnosis:search` — query the index you just populated
- config reference for every `GNOSIS_MCP_*` env var
- CLI reference for every flag