
How dev_search helps AI understand code

8 min read
AI & Tools · Technical Deep Dive

This is a technical deep dive on dev-agent. For the story of building it, see 10 days of vibe coding.

I built dev_search without having built a search system before. The moment I knew it worked wasn't a benchmark — it was watching Claude debug.

The moment it clicked

I was on a Google Hangout for a demo party, the Lytics team sharing what we'd been building with AI all week. It's a tight-knit group; some of us have been working together for almost a decade. My palms were a little sweaty. I'd been listening to what everyone else had built and was excited to share mine.

I wanted to show the difference dev-agent made. So I ran the same debugging task twice: "find why search returns duplicates."

Without dev-agent: Claude started running bash commands. Guessing paths. cd packages/cli && node ./dist/index.js. Wrong path. Try again. Read entire files looking for the right function. Thirteen minutes of flailing. It proposed writing a debug script to figure out what was wrong. Cost: $0.99.

I was shocked it started executing code just to navigate the codebase.

With dev-agent: Same question. Claude used dev_search to find the relevant code semantically. Five minutes later, it had identified the root cause (no deduplication in the search pipeline) and proposed four solutions. Cost: $0.57.

That's 62% faster, 42% cheaper. Claude’s model weights didn't change. The difference was the four new tools I gave it that let it understand code instead of just searching for it.

This comparison is representative of working with Claude Code in bigger repos. And it was my "oh shit, it actually works" moment too.

Here’s what was actually different under the hood, and why it mattered.


What dev_search does (semantic code search for AI)

dev_search "authentication flow"

Returns ranked code snippets that semantically match your query. Not file paths. Actual code. Claude and Cursor use it via MCP to understand codebases without reading entire files.

The token savings come from one insight: return snippets, not paths.

Example (simplified from a real run):

Input:

dev_search "where do we generate auth tokens"

Output:

[
  {
    "path": "src/auth/service.ts",
    "name": "generateToken",
    "startLine": 88,
    "endLine": 112,
    "snippet": "export function generateToken(user: User) { ... }",
    "score": 0.93
  },
  {
    "path": "src/auth/middleware.ts",
    "name": "requireAuth",
    "startLine": 14,
    "endLine": 39,
    "snippet": "export async function requireAuth(req, res, next) { ... }",
    "score": 0.87
  }
]

Claude sees these snippets directly instead of reading both files end to end.

Principle: Design for AI consumption first. Tools like Claude and Cursor don't need whole files; they need the smallest unit that still preserves meaning.
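
Since Claude and Cursor consume dev_search through MCP, here is a rough sketch of what exposing a tool like this looks like with the TypeScript MCP SDK. It is not dev-agent's actual adapter code: runSearch is a placeholder, and exact SDK signatures vary between versions.

import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

// runSearch is a placeholder for the real search pipeline described below.
declare function runSearch(query: string, limit: number): Promise<unknown[]>;

const server = new McpServer({ name: "dev-agent-sketch", version: "0.0.1" });

// Register a dev_search-style tool: the client (Claude, Cursor) sends a query,
// the server answers with ranked snippets serialized as JSON text.
server.tool(
  "dev_search",
  { query: z.string(), limit: z.number().optional() },
  async ({ query, limit }) => {
    const results = await runSearch(query, limit ?? 5);
    return {
      content: [{ type: "text" as const, text: JSON.stringify(results, null, 2) }],
    };
  }
);

// The assistant talks to the server over stdio.
await server.connect(new StdioServerTransport());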


The architecture: local vector search for code

I built the core scanner, vector store, indexer, and CLI in one day. Not because I'm fast, but because I made decisions quickly and moved on. Claude helped me explore options; I picked reasonable ones and shipped. The architecture is there so Future Me can change those decisions cheaply.

Let me walk through each piece, including where I got it wrong.


1. The scanner: Semantic chunking

Most RAG systems chunk by token count. You get arbitrary boundaries:

// Token-based chunking (what most RAG does)
chunk_1 = "function foo() { ... } function bar() { ..."  // Split mid-function
chunk_2 = "return x; } class User { constructor() ..."   // Meaningless boundary

I chunk by meaning instead:

// Semantic chunking (what dev_search does)
doc_1 = { type: "function", name: "foo", text: "<complete function>" }
doc_2 = { type: "function", name: "bar", text: "<complete function>" }
doc_3 = { type: "class", name: "User", text: "<complete class>" }

The scanner uses the TypeScript Compiler API to walk the AST. Each function, class, and interface becomes one document.
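
A rough illustration of that idea, not the actual scanner (which also handles classes, interfaces, methods, and exports):

import ts from "typescript";
import { readFileSync } from "fs";

// Minimal sketch: walk a file's AST and emit one document per top-level function.
function scanFile(filePath: string) {
  const sourceText = readFileSync(filePath, "utf8");
  const source = ts.createSourceFile(filePath, sourceText, ts.ScriptTarget.Latest, true);
  const docs: { type: string; name: string; startLine: number; endLine: number; text: string }[] = [];

  ts.forEachChild(source, (node) => {
    if (ts.isFunctionDeclaration(node) && node.name) {
      const start = source.getLineAndCharacterOfPosition(node.getStart()).line + 1;
      const end = source.getLineAndCharacterOfPosition(node.getEnd()).line + 1;
      docs.push({
        type: "function",
        name: node.name.text,
        startLine: start,
        endLine: end,
        text: node.getText(source), // the complete function, not a token-window fragment
      });
    }
  });

  return docs;
}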

Decision context: I explored chunking strategies with Claude. Token-based was simpler to implement, but the examples of split functions looked wrong. Semantic chunking meant more work upfront, but the search results would be complete, usable code blocks. I chose semantic chunking and haven't regretted it.

Trade-off: Large functions get truncated. Very small functions might benefit from grouping. Still, semantic boundaries beat arbitrary boundaries for search quality, because the model sees a complete thought, not a fragment.


2. Document preparation: What gets embedded

Before embedding, each document is formatted:

function formatDocumentText(doc: Document): string {
  const parts: string[] = [];
  
  if (doc.metadata.name) {
    parts.push(`${doc.type}: ${doc.metadata.name}`);
  }
  
  if (doc.text) {
    parts.push(doc.text);
  }
  
  return parts.join('\n\n');
}

Example output:

function: authenticateUser

async function authenticateUser(user: User): Promise<AuthResult> {
  const valid = await bcrypt.compare(user.password, stored.hash);
  if (!valid) throw new AuthError('Invalid credentials');
  return generateToken(user);
}

Why prefix with type and name? The embedding model sees "function: authenticateUser" as context. Queries like "authentication" have a stronger signal to match against.


3. The embedding model

  • Model: all-MiniLM-L6-v2
  • Dimensions: 384
  • Pooling: Mean
  • Normalization: L2
  • Runtime: Local (ONNX via Transformers.js)

Decision context: I hadn't worked with embedding models before. I asked Claude to compare options: OpenAI's ada-002, Cohere, various open-source models. My constraints were: runs locally (no API calls, code stays on my machine), well-documented, community validation.

MiniLM fit all three. I didn't benchmark alternatives. I picked something reasonable and moved on. The embedding model turned out to matter less than I expected. The architecture around it mattered more.

If a better local model emerges, the TransformersEmbedder class is isolated. I can swap it without touching search logic.

Why mean pooling? Averages all token embeddings. Works better than CLS token for sentence similarity.

Why L2 normalize? Makes vectors unit length. Simplifies distance calculations.
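
A minimal sketch of that embedding step using Transformers.js and the Xenova ONNX build of the model; this is illustrative, not the actual TransformersEmbedder class:

import { pipeline } from "@xenova/transformers";

// Load the local ONNX build of all-MiniLM-L6-v2 once and reuse it.
const extractor = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");

// Mean pooling + L2 normalization, matching the settings above.
async function embed(text: string): Promise<number[]> {
  const output = await extractor(text, { pooling: "mean", normalize: true });
  return Array.from(output.data as Float32Array); // 384-dimensional unit vector
}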

I wasn't optimizing for generic semantic search. I was optimizing for one thing: can an AI assistant answer "where does this happen?" in one shot without poking around the repo. That lens killed a lot of premature complexity. I didn't need perfect embeddings; I needed consistent-enough ones plus good chunking and metadata.


4. Vector storage: LanceDB

I store vectors in LanceDB, an embedded, columnar database.

// Schema
{
  id: string,           // Document hash
  vector: float[384],   // Embedding
  metadata: string      // JSON blob
}

Decision context: Options I explored with Claude:

  • Chroma: requires a server process
  • Pinecone: cloud-only, costs money
  • Qdrant: heavier, more features than needed
  • FAISS: no persistence, manual serialization
  • LanceDB: embedded, zero config, good enough

I wanted "just works locally." LanceDB was the simplest path. The LanceDBVectorStore class implements a VectorStore interface, so if I need to switch to something else, I implement the interface and swap.
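
The seam is small. Here is an illustrative shape for it; the field and method names are my assumptions, not the repo's exact interface:

// Illustrative sketch of the seam between search logic and storage.
interface SearchHit {
  id: string;
  distance: number;   // raw L2 distance from the store
  metadata: string;   // JSON blob with path, lines, snippet, etc.
}

interface VectorStore {
  upsert(docs: { id: string; vector: number[]; metadata: string }[]): Promise<void>;
  search(queryVector: number[], limit: number): Promise<SearchHit[]>;
  delete(ids: string[]): Promise<void>;
}

// LanceDBVectorStore implements this today; a different backend would
// just implement the same three methods.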


5. Similarity scoring

LanceDB returns L2 distance. I convert to a 0-1 score:

const distance = result._distance;
const score = Math.exp(-(distance * distance));

  • distance ≈ 0 → score ≈ 1.0 (very similar)
  • distance = 1 → score ≈ 0.37
  • distance = 2 → score ≈ 0.02 (not similar)

What I got wrong: My first scoring formula was off. The results looked weird: relevant code showing low scores, irrelevant code ranking higher than expected. I found a better formula in a blog post, tested it against results I could eyeball, and it made sense. Shipped it. There might be better formulas. This one worked.


6. Result formatting: The key insight

This is where the debugging session difference comes from.

Without dev-agent, Claude gets file paths:

Search result: "src/auth/service.ts" (score: 0.89)

Then it reads the entire file. 441 lines. 18,000 input tokens. And often it's the wrong file, so it reads another one. And another.

With dev-agent, Claude gets snippets:

{
  "path": "src/auth/service.ts",
  "name": "authenticateUser",
  "startLine": 42,
  "endLine": 67,
  "snippet": "async function authenticateUser(user: User)...",
  "score": 0.89
}

65 input tokens. The actual code. No file read needed.

This is where the token savings come from. It’s not compression; it simply avoids reading full files to find the relevant logic.

I didn't design for token savings. I designed to make Claude faster. The savings were a side effect.


7. The metadata design

Beyond similarity, what makes results actionable:

  • path: Claude can read the file if needed
  • startLine, endLine: precise location for edits
  • signature: quick understanding without reading the body
  • snippet: the actual code, no file read needed
  • callees: what this function calls (for dev_refs)
  • exported: is this public API?
  • docstring: the author's intent

This metadata is why dev_search chains well with other tools:

dev_search "auth" → finds authenticateUser
dev_refs "authenticateUser" → uses callees from metadata
dev_history file="src/auth/service.ts" → uses path from metadata

8. Token budgets

Results respect a token budget:

const formatter = new CompactFormatter({
  maxResults: limit,
  tokenBudget: tokenBudget ?? 2000,
  includeSnippets: true,
});

If results exceed the budget, the formatter progressively reduces:

  1. Truncate snippets
  2. Reduce result count
  3. Remove optional fields

This keeps results within a predictable budget, so Claude gets the most useful context without flooding its window.
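
Conceptually the reduction is just a loop: apply the cheapest cut, re-measure, stop once under budget. A sketch of the idea, where estimateTokens and the Result shape are placeholders rather than the real CompactFormatter internals:

// Rough sketch of progressive reduction under a token budget.
// estimateTokens() stands in for whatever tokenizer heuristic is used.
declare function estimateTokens(results: Result[]): number;

interface Result { snippet: string; docstring?: string; signature?: string }

function fitToBudget(results: Result[], budget: number): Result[] {
  let fitted = results.map((r) => ({ ...r }));

  // 1. Truncate snippets first: they are the biggest cost.
  if (estimateTokens(fitted) > budget) {
    fitted = fitted.map((r) => ({ ...r, snippet: r.snippet.slice(0, 400) }));
  }

  // 2. Then drop low-ranked results until we fit.
  while (estimateTokens(fitted) > budget && fitted.length > 1) {
    fitted.pop();
  }

  // 3. Finally strip optional fields.
  if (estimateTokens(fitted) > budget) {
    fitted = fitted.map(({ snippet }) => ({ snippet }));
  }

  return fitted;
}

The real formatter may order or implement these steps differently; the point is that degradation is explicit and bounded rather than an arbitrary cutoff.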


What broke (and how I fixed it)

The commit history tells the real story. Some things I had to fix:

Searching filenames instead of content. An early version of the explore feature matched against filenames. A query like "find authentication" would return files named auth.ts instead of files containing authentication logic. Obvious in hindsight.

The similarity formula. Already mentioned: first version produced unintuitive scores.

Cursor integration. The MCP server worked in tests. Then I connected it to Cursor. Zombie processes. Stdin closing unexpectedly. process.exit firing during graceful shutdown. Real-world usage broke things that tests didn't catch.

Memory leaks. Event listeners not cleaned up. Had to implement circular buffers and proper shutdown handling.

These aren't failures; they're the normal shape of building something. Ship, use it, fix what breaks.


Performance

  • Index time (first run): ~5-10 minutes for 10k files
  • Embedding time per doc: ~100ms
  • Search latency: under 100ms
  • Storage overhead: ~10-50MB per 1000 documents

What I'd improve

I built this in a week. There are known gaps:

  • No hybrid search: pure vector, no keyword matching. Would help: BM25 combined with vector search.
  • No re-ranking: single-stage retrieval. Would help: a cross-encoder over the top-k results.
  • No query expansion: the query is embedded as-is. Would help: synonyms and related terms.
  • Large functions truncated: the scanner limits text length. Would help: smarter chunking.
  • Full re-embed on change: simpler implementation. Would help: incremental indexing.

These are all swappable. The architecture is modular:

  • TransformersEmbedder implements EmbeddingProvider
  • LanceDBVectorStore implements VectorStore
  • CompactFormatter / VerboseFormatter are pluggable
  • Scanner is separate from indexer

When I learn better approaches, or when someone contributes them, they plug in.
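
For example, the embedder seam can be as small as this (an illustrative shape, not the repo's exact interface):

// Any model that turns text into a fixed-size vector can replace MiniLM
// by implementing this. Names here are assumptions for illustration.
interface EmbeddingProvider {
  readonly dimensions: number;                 // 384 for all-MiniLM-L6-v2
  embed(texts: string[]): Promise<number[][]>; // one vector per input text
}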


When to use it

  • "Where is X implemented?" (conceptual): dev_search
  • "Find the exact string calculateTax": grep
  • "What calls this function?": dev_refs
  • "Show me the project structure": dev_map

dev_search excels at conceptual queries in unfamiliar codebases. It's weak at exact string matching; use grep for that.


The core insight

I came into this thinking the embedding model would be the hard part. It wasn't.

The embedding model is a commodity. Off-the-shelf MiniLM worked fine.

What mattered:

  • What you chunk: semantic units, not arbitrary tokens
  • What metadata you keep: callees, signatures, line numbers
  • What you return: snippets, not file paths

The model is swappable. The design around it is the work.


Using AI to make decisions

I built this without prior experience in search systems. Claude helped me explore the option space quickly:

  • "Compare embedding models for code search"
  • "What are the trade-offs between Chroma and LanceDB?"
  • "How should I convert L2 distance to a similarity score?"

I didn't always take the first suggestion. But AI compressed the exploration phase. Instead of reading five blog posts and three papers, I had a conversation and got oriented in minutes.

Then I made a choice. Not necessarily the optimal choice, just a reasonable one that let me move forward.

The architecture is modular because I knew I'd learn more. The embedding model might not be the best. The vector store might need replacing. The scoring formula might be naive.

That's fine. I can iterate. The goal was to build something that works, learn from using it, and improve.

Principle: Make a decision, ship it, and leave room to be wrong. Architecture should make it cheap to change your mind later.


This piece has focused on dev_search, which is the semantic search part of dev-agent. In a follow-up, I'll dig into dev_refs, dev_map, and dev_history, and how they use structure and relationships in the codebase to give AI the rest of the context it was missing.

dev-agent is open source. The search implementation is in packages/core/src/vector/ and packages/mcp-server/src/adapters/built-in/search-adapter.ts.
