SEMBLE_RS(1)

NAME

semble_rs - Fast, AI-agent-native code search in Rust: hybrid BM25 + semantic, Tree-sitter AST chunking, dependency & impact…

SYNOPSIS

$ git clone https://github.com/johunsang/semble_rs.git && cd semble_rs && cargo install --path .

DESCRIPTION

Fast, AI-agent-native code search in Rust — hybrid BM25 + semantic, Tree-sitter AST chunking, dependency & impact analysis. Drop-in replacement for grep/cat/read/ls in Claude Code, Codex, Cursor, Aider, OpenHands.

README

semble_rs
Fast and Accurate Code Search for Agents — in Rust
Replaces grep / cat / read / ls and compresses build & CI output. Up to 99% fewer tokens.

semble_rs is a Rust port and superset of MinishLab/semble built for AI coding agents. It returns the exact code chunks an agent needs, prints a token-cheap codebase tree instead of ls -R, and compresses a 3.3 MB CI log into 35 KB. It ships as a single binary with no daemon, no API keys, and no GPU. Retrieval is hybrid BM25 + Model2Vec static embeddings with code-aware reranking, complemented by a dependency graph, AST chunking, and a digest pipeline for build / test / CI output.

Quickstart

# Install Rust if needed, then:
git clone https://github.com/johunsang/semble_rs.git && cd semble_rs
cargo install --path .

The binary lands at ~/.cargo/bin/semble_rs. On first run, the default embedding model minishlab/potion-code-16M (~60 MB) is downloaded from HuggingFace.

# Map the codebase (replaces ls -R)
semble_rs tree ./my-project --symbols

# Find code by what it does (replaces grep + cat)
semble_rs search "how is auth handled" ./my-project --outline

# Compress build / CI output before reading it
cargo build 2>&1 | semble_rs digest
gh run view <id> --log-failed | semble_rs digest

For agent integration (Claude Code, Codex, Cursor), see Agent integration.

Main Features

  • Fast: indexes the local repo (22 files) in ~150 ms, ~10 s on 1,600 files. Static embedder — no transformer forward pass at query time.
  • Token-efficient: tree collapses ls -R by 4×–747×; --outline is -47% vs full output; digest reaches -98.9% on real GitHub Actions logs.
  • Hybrid retrieval: BM25 + Model2Vec embeddings fused with RRF, then reranked with definition / identifier-stem / file-coherence boosts and noise penalties.
  • Dependency graph: deps / impact show what a file imports, defines, and what changes if you touch it. Optional Graphviz --dot output.
  • Build / CI compression: digest auto-detects cargo, pnpm/npm/yarn/bun, tsc, pytest, go test, gradle, ruff, mypy, clang/gcc/cmake/make/swiftc, GitHub Actions.
  • Single binary: no Python, no daemon, no API keys. Runs on CPU.

Search

semble_rs search "auth flow" ./my-project --outline    # pass 1: structural overview
semble_rs search "loginWithEmail" ./my-project --compact   # pass 2: matching lines
semble_rs search "save model" https://github.com/MinishLab/model2vec   # git URL

The path argument defaults to the current directory; git URLs are accepted and cloned shallowly.

Output modes

Mode            Output                                                       Token cost vs --compact   When to use
--outline       One signature line per chunk                                 -47%                      First-pass structural scan
--group         Directory grouping + match lines capped at 3 (+N overflow)   -47%                      Many match lines per chunk
--compact       Score + path + every matching line                           baseline                  Precision scan
--json --strip  Chunk bodies (comments stripped)                             +800%                     Tooling / pipeline integration
--json          Chunk bodies (raw)                                           +900%                     Tooling / pipeline integration

Recommended: --outline to overview → --compact to narrow → --json --strip only if the chunk body itself is needed.

find-related

Given a file:line from a previous search result, returns chunks semantically similar to that location.

semble_rs find-related src/auth.rs 42 ./my-project

plan

When the agent doesn't know where to start, plan runs a small search and prints a recommended sequence of --outline / --group / --compact / deps / impact commands.

semble_rs plan "fix auth flow bug" ./my-project -k 5

plan is a guardrail, not an oracle: low-confidence candidates are leads, not facts. Skip it when the symbol or feature name is already known.

--model

All search-side commands accept --model <hf-repo-or-local-path> to override the default embedder. The SEMBLE_MODEL_PATH environment variable is also honoured.

Tree

semble_rs tree prints the codebase file tree using the same gitignore-aware index as search. It exists because ls -R on a real project explodes into tens or hundreds of thousands of tokens (.git/, target/, node_modules/ all included). Measured on real repos:

Project                       semble_rs tree   ls -R        Reduction
this repo (Rust + target/)    533 B            398,101 B    747×
6,693-file Python backend     3,950 B          254,066 B    64×
325-file ML training repo     838 B            7,522 B      ~9×

semble_rs tree                              # current directory
semble_rs tree -d                           # directories only
semble_rs tree --max-depth 2                # cap depth
semble_rs tree --symbols                    # append top-level symbols per file
semble_rs tree --lang rust,python           # filter by language

Digest

semble_rs digest collapses build / test / install / CI output. Errors, file:line:col, tracebacks, panic stacks, and failed-step bodies are always preserved — only progress lines collapse to counts.

cargo build 2>&1            | semble_rs digest
pnpm install 2>&1           | semble_rs digest
pytest 2>&1                 | semble_rs digest
gh run view <id> --log-failed | semble_rs digest

Measured on 15 real-world fixtures:

Fixture                                               Raw → digest        Savings
cargo build (clean, 218 crates)                       7,611 B → 59 B      -99.2%
cargo test (45 passing)                               3,368 B → 369 B     -89.0%
pnpm install                                          1,323 B → 349 B     -73.6%
tsc (13 errors, 5 codes)                              1,085 B → 648 B     -40.3%
pytest (4 failures)                                   2,762 B → 2,330 B   -15.6%
GitHub Actions log (rust-lang/rust failed CI, real)   3.3 MB → 35 KB      -98.9%
go test (with panic + stack)                          1,034 B → 475 B     -54.1%
gradle test (2 failures)                              1,232 B → 522 B     -57.6%
ruff / mypy / clang / cmake / swift                   varies              -3% to -30%
TOTAL (15 fixtures)                                   3.33 MB → 43 KB     -98.7%

Auto-detection covers cargo, pnpm/npm/yarn/bun, tsc, pytest, go test, gradle, ruff, mypy, clang/gcc/cmake/make/swiftc, GitHub Actions. Force a handler with --format <name>; inspect with --show-format.
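The collapse-progress, keep-errors idea behind digest can be illustrated with a minimal Python sketch. This is not the semble_rs implementation: the PROGRESS pattern, the summary format, and the sample log are all assumptions made for illustration only.

```python
import re

# Assumed progress verbs for a cargo-like log; a real handler would be
# format-specific and far more thorough.
PROGRESS = re.compile(r"^\s*(Compiling|Downloading|Downloaded|Checking)\s")

def digest(log_text):
    """Collapse progress lines to counts and keep every other line verbatim."""
    kept, counts = [], {}
    for line in log_text.splitlines():
        match = PROGRESS.match(line)
        if match:
            verb = match.group(1)
            counts[verb] = counts.get(verb, 0) + 1
        else:
            kept.append(line)  # errors, file:line:col, stack traces, ...
    summary = [f"[{verb} x{n}]" for verb, n in counts.items()]
    return "\n".join(summary + kept)

# Illustrative input, not a captured log.
sample = (
    "   Compiling serde v1.0.0\n"
    "   Compiling libc v0.2.0\n"
    "error[E0308]: mismatched types\n"
    " --> src/main.rs:4:5"
)
print(digest(sample))
```

The two progress lines shrink to a single count while the error and its location pass through unchanged.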

Dependency graph

semble_rs deps   src/auth.rs ./my-project                  # what this file imports / defines (flat)
semble_rs deps   src/auth.rs ./my-project --tree           # transitive imports as ASCII tree
semble_rs deps   src/auth.rs ./my-project --tree --max-depth 3
semble_rs deps   src/auth.rs ./my-project --dot | dot -Tpng > deps.png
semble_rs impact src/auth.rs ./my-project                  # who depends on this file (flat list)
semble_rs impact src/auth.rs ./my-project --tree           # reverse-dependency tree
semble_rs impact src/auth.rs ./my-project --dot | dot -Tpng > impact.png

--tree (v0.9.1+) renders forward (deps) or reverse (impact) dependencies as an ASCII tree with cycle detection (repeated nodes marked (cycle)) and --max-depth N truncation. It requires no external tool and is agent-readable.
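A tree renderer with cycle detection and depth truncation could look roughly like the Python sketch below. It is only an illustration of the technique: the graph shape (file to list of imported files), the branch glyphs, and the "..." truncation marker are assumptions, not semble_rs output.

```python
def render_tree(graph, root, max_depth=None):
    """Render `graph` (file -> list of imported files) as an ASCII tree."""
    lines = [root]

    def walk(node, prefix, depth, ancestors):
        children = graph.get(node, [])
        if max_depth is not None and depth >= max_depth:
            if children:
                lines.append(prefix + "+-- ...")  # assumed truncation marker
            return
        for index, child in enumerate(children):
            is_last = index == len(children) - 1
            branch = "+-- " if is_last else "|-- "
            if child in ancestors:
                # The child is already on the current path: mark it, do not recurse.
                lines.append(prefix + branch + child + " (cycle)")
                continue
            lines.append(prefix + branch + child)
            child_prefix = prefix + ("    " if is_last else "|   ")
            walk(child, child_prefix, depth + 1, ancestors | {child})

    walk(root, "", 0, {root})
    return "\n".join(lines)

# Hypothetical import graph.
graph = {
    "src/auth.rs": ["src/session.rs", "src/token.rs"],
    "src/session.rs": ["src/auth.rs"],
    "src/token.rs": ["src/crypto.rs"],
}
print(render_tree(graph, "src/auth.rs"))
print(render_tree(graph, "src/auth.rs", max_depth=1))
```

Tracking only the ancestors of the current path (rather than every visited node) means a file imported from two places is still printed twice, while a true import loop is cut off at the first repeat.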

Run impact before editing a shared module to avoid surprises.

find-pattern

Thin wrapper around ast-grep for structural queries that semantic search can't express:

semble_rs find-pattern 'fn $name($$$)' . --lang rust --compact

Requires ast-grep to be installed (brew install ast-grep or cargo install ast-grep).

Encode

semble_rs encode exposes the embedding model as a CLI for scripting and debugging:

semble_rs encode "search result scoring"            # one vector → JSON array
echo -e "auth\nlogin\ntoken" | semble_rs encode     # stdin, one sentence per line
semble_rs encode "x" --model minishlab/potion-multilingual-128M
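Because encode prints each vector as a JSON array, its output can be compared in a few lines of Python. The sketch below assumes two such arrays have already been captured as strings; the values in out_a and out_b are made-up stand-ins, not real model output.

```python
import json
import math

def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Stand-ins for the JSON arrays printed by two `semble_rs encode` calls.
out_a = "[0.6, 0.8, 0.0]"
out_b = "[0.8, 0.6, 0.0]"

similarity = cosine(json.loads(out_a), json.loads(out_b))
print(round(similarity, 2))
```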

Agent integration

Append a snippet like the following to your project-root CLAUDE.md or AGENTS.md. It works for Claude Code, Codex, Cursor (.cursorrules), Aider, and OpenHands.

## Code search and exploration

Use semble_rs instead of ls -R, grep, cat:

    semble_rs tree . --symbols                          # codebase map (cheap)
    semble_rs search "<feature or symbol>" . --outline  # pass 1
    semble_rs search "<feature or symbol>" . --compact  # pass 2
    semble_rs deps <file> .                             # what file imports / defines
    semble_rs impact <file> .                           # files affected by changes

Compress noisy command output before reading it:

    cargo build 2>&1 | semble_rs digest
    pnpm install 2>&1 | semble_rs digest
    gh run view <id> --log-failed | semble_rs digest

semble_rs savings shows estimated tokens saved across past searches.

How it works

semble_rs chunks every file with tree-sitter at function / class / module boundaries (line-based fallback for unsupported languages), then scores the chunks against each query with two complementary retrievers: static Model2Vec embeddings (default minishlab/potion-code-16M) for semantic similarity, and BM25 for lexical matches on identifiers and API names. The two ranked lists are fused with Reciprocal Rank Fusion (RRF).
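For readers unfamiliar with Reciprocal Rank Fusion, here is a minimal Python sketch of the general technique. The constant k=60 is the value commonly used for RRF and is an assumption here, as are the chunk ids; the constant and tie-breaking used inside semble_rs may differ.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of chunk ids into one ranked list.

    Each chunk earns 1 / (k + rank) from every list it appears in
    (rank is 1-based); chunks are sorted by their summed score.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Descending fused score; ties broken by id so the order is deterministic.
    return sorted(scores, key=lambda chunk_id: (-scores[chunk_id], chunk_id))

# Hypothetical chunk ids as ranked by each retriever.
bm25_ranking = ["auth.rs:login", "session.rs:new", "util.rs:hash"]
semantic_ranking = ["session.rs:new", "token.rs:verify", "auth.rs:login"]

fused = reciprocal_rank_fusion([bm25_ranking, semantic_ranking])
print(fused)
```

Only ranks enter the formula, so the BM25 and cosine scores never need to be put on a common scale.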

After fusion, results are reranked with code-aware signals:

Ranking signals
  • Adaptive weighting. Symbol-like queries (Foo::bar, _private, getUserById) get more lexical weight; natural-language queries stay balanced.
  • Definition boosts. Chunks that define the queried symbol (a class, def, func, etc.) outrank chunks that merely reference it.
  • Identifier stems. Query tokens are stemmed and matched against identifier stems. Querying parse config boosts chunks containing parseConfig, ConfigParser, or config_parser.
  • File coherence. When multiple chunks of a file match, the file is boosted so the top result reflects file-level relevance.
  • Sibling-chunk boost. Chunks adjacent to a top hit get a small boost — definitions and their helpers usually cluster.
  • Dependency boost. Chunks in files imported by a top hit get boosted so call-chain context surfaces.
  • Noise penalties. Test files, compat/ / legacy/ shims, example code, and .d.ts declaration stubs are down-ranked so canonical implementations surface first.
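The identifier-stem signal above can be approximated as follows. This Python sketch is an assumption-laden illustration: split_identifier, crude_stem, and stem_overlap are hypothetical helpers, and the suffix list is a toy stand-in for whatever stemmer semble_rs actually uses.

```python
import re

def split_identifier(identifier):
    """Split camelCase, PascalCase, and snake_case into lowercase parts."""
    parts = []
    for piece in re.split(r"[^A-Za-z0-9]+", identifier):
        # Acronym runs, capitalised or lowercase words, and digit runs.
        parts += re.findall(
            r"[A-Z]+(?=[A-Z][a-z])|[A-Z]?[a-z]+|[A-Z]+|[0-9]+", piece
        )
    return [part.lower() for part in parts]

def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer."""
    for suffix in ("ing", "ers", "er", "ed", "es", "s", "e"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stem_overlap(query, identifier):
    """Fraction of query stems that also occur among the identifier's stems."""
    query_stems = {crude_stem(token) for token in query.lower().split()}
    ident_stems = {crude_stem(part) for part in split_identifier(identifier)}
    if not query_stems:
        return 0.0
    return len(query_stems & ident_stems) / len(query_stems)

for name in ("parseConfig", "ConfigParser", "config_parser", "renderTemplate"):
    print(name, stem_overlap("parse config", name))
```

With this toy stemmer, "parse" and "parser" both reduce to "pars", so the first three identifiers fully match the query and renderTemplate does not match at all.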

The embedder is fully static (vocab embedding lookup → mean pool → SIF weighting → L2 normalize). All of this runs in milliseconds on CPU.
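The static pipeline can be sketched in a few lines of Python. Everything concrete here is an assumption: the three-dimensional toy VOCAB, the whitespace tokenizer, and the SIF-style weight a / (a + p(w)), which this sketch folds into the pooling step. The real model uses its own tokenizer and may apply the weighting differently.

```python
import math

# Hypothetical toy vocabulary: token -> (embedding, unigram probability).
VOCAB = {
    "parse": ([1.0, 0.0, 0.0], 0.001),
    "config": ([0.0, 1.0, 0.0], 0.002),
    "the": ([0.0, 0.0, 1.0], 0.050),
}
DIM = 3

def encode(text, a=1e-3):
    """Lookup -> SIF-weighted mean pool -> L2 normalize."""
    tokens = [t for t in text.lower().split() if t in VOCAB]
    if not tokens:
        return [0.0] * DIM
    pooled = [0.0] * DIM
    for token in tokens:
        vector, probability = VOCAB[token]
        weight = a / (a + probability)  # frequent tokens contribute less
        for i in range(DIM):
            pooled[i] += weight * vector[i]
    pooled = [x / len(tokens) for x in pooled]  # mean pool
    norm = math.sqrt(sum(x * x for x in pooled))
    return [x / norm for x in pooled] if norm else pooled

print(encode("parse the config"))
```

No neural network runs at query time: each step is a table lookup or a handful of arithmetic operations, which is why this style of embedder is cheap on CPU.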

Benchmarks

Retrieval quality — 100-query benchmark (this repo)

100 hand-labelled queries across 5 categories: exact symbol names, natural-language feature descriptions, scenarios, acronyms, and Korean queries. Default model minishlab/potion-code-16M.

Metric           Score
Recall@1         70%
Recall@5         90%
Recall@10        95%
MRR              0.78
Median latency   150 ms / query (cold)

Category       n    R@1    R@5    R@10   MRR
exact_symbol   30   93%    100%   100%   0.96
nl_feature     40   75%    98%    100%   0.83
scenario       10   70%    100%   100%   0.77
acronym        10   50%    70%    70%    0.56
korean         10   10%    60%    80%    0.27

Query set: docs/eval_set_100.json · per-miss analysis: docs/benchmark_100.md.
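For reference, Recall@k and MRR are conventionally computed from the rank at which the first correct chunk appears for each query. The Python sketch below assumes one relevant target per query and uses made-up ranks; it is not the project's evaluation script.

```python
def recall_at_k(ranks, k):
    """Share of queries whose correct chunk appears within the top k results."""
    return sum(1 for r in ranks if r is not None and r <= k) / len(ranks)

def mean_reciprocal_rank(ranks):
    """Mean of 1/rank over all queries; a miss (None) contributes 0."""
    return sum(1.0 / r for r in ranks if r is not None) / len(ranks)

# Hypothetical 1-based rank of the correct chunk per query; None means not found.
ranks = [1, 1, 3, None, 2]

print(recall_at_k(ranks, 1))
print(recall_at_k(ranks, 5))
print(round(mean_reciprocal_rank(ranks), 3))
```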

Indexing and query latency by repo size

The index is rebuilt every run (no persistent cache).

Repo size (code files)   Indexing + first query
22 (this repo)           ~0.15 s
57–120                   ~0.3–0.7 s
1,600                    ~10 s

digest is independent of repo size: 3.3 MB CI log → 35 KB in ~20 ms.

Token efficiency vs native shell tools

Measured on real projects:

Operation                                              semble_rs       Native            Reduction
Codebase map (this repo)                               tree 533 B      ls -R 398 KB      747×
Codebase map (6,693-file Python backend)               tree 3,950 B    ls -R 254 KB      64×
Codebase map (325-file Python repo)                    tree 838 B      ls -R 7,522 B     ~9×
Code chunk lookup (--outline vs --compact)             -47%            baseline          -47%
Build log (cargo build clean)                          digest 59 B     raw 7,611 B       -99.2%
CI failure log (real GitHub Actions, rust-lang/rust)   digest 35 KB    raw 3.3 MB        -98.9%
15-fixture aggregate                                   digest 43 KB    raw 3.33 MB       -98.7%

Agents using grep + cat + ls -R spend most of their context window on irrelevant code and noise. semble_rs returns only what matters and compresses the rest.
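The reduction factors and savings percentages in the tables follow directly from the byte counts, as this small Python check shows. The helper names are hypothetical, and the figures are byte ratios, which only approximate token ratios.

```python
def reduction_factor(native_bytes, tool_bytes):
    """How many times smaller the tool output is than the native output."""
    return native_bytes / tool_bytes

def savings_percent(raw_bytes, digest_bytes):
    """Percentage of bytes removed by the digest."""
    return (1 - digest_bytes / raw_bytes) * 100

# Byte counts taken from the tables above.
print(round(reduction_factor(398_101, 533)))     # ls -R vs tree, this repo
print(round(reduction_factor(254_066, 3_950)))   # 6,693-file Python backend
print(round(reduction_factor(7_522, 838)))       # 325-file repo
print(round(savings_percent(7_611, 59), 1))      # cargo build (clean)
```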

Supported languages

Language                      Search   AST chunking   Dependency graph
Rust
Python
JavaScript / TypeScript
Go
Java
C / C++
Kotlin
Ruby
PHP
Swift
HTML / CSS / Vue / Svelte              line-based     partial
Other                                  line-based

License

MIT

Acknowledgements

  • MinishLab/semble — original Python implementation by Stéphan Tulkens and Thomas van Dongen. semble_rs is a Rust port + superset of their work.
  • Model2Vec and model2vec-rs — static distillation framework powering the embedder.
  • Embedding model: minishlab/potion-code-16M.

SEE ALSO

clihub    5/20/2026    SEMBLE_RS(1)