
How dev_search helps AI understand code

8 min read
AI & Tools · Technical Deep Dive

This is a technical deep dive on dev-agent. For the story of building it, see 10 days of vibe coding.

I built dev_search without having built a search system before. The moment I knew it worked wasn't a benchmark — it was watching Claude debug.

The moment it clicked

I was on a Google Hangout for a demo party, the Lytics team sharing what we'd been building with AI all week. It's a tight-knit group; some of us have been working together for almost a decade. My palms were a little sweaty. I'd been listening to what everyone else had built and was excited to share mine.

I wanted to show the difference dev-agent made. So I ran the same debugging task twice: "find why search returns duplicates."

Without dev-agent: Claude started running bash commands. Guessing paths. cd packages/cli && node ./dist/index.js. Wrong path. Try again. Read entire files looking for the right function. Thirteen minutes of flailing. It proposed writing a debug script to figure out what was wrong. Cost: $0.99.

I was shocked it started executing code just to navigate the codebase.

With dev-agent: Same question. Claude used dev_search to find the relevant code semantically. Five minutes later, it had identified the root cause (no deduplication in the search pipeline) and proposed four solutions. Cost: $0.57.

That's 62% faster, 42% cheaper. Claude’s model weights didn't change. The difference was the four new tools I gave it that let it understand code instead of just searching for it.

This comparison is representative of working with Claude Code in bigger repos. And it was my "oh shit, it actually works" moment too.

Here’s what was actually different under the hood, and why it mattered.


What dev_search does (semantic code search for AI)

dev_search "authentication flow"

Returns ranked code snippets that semantically match your query. Not file paths. Actual code. Claude and Cursor use it via MCP to understand codebases without reading entire files.

The token savings come from one insight: return snippets, not paths.

Example (simplified from a real run):

Input:

dev_search "where do we generate auth tokens"

Output:

[
  {
    "path": "src/auth/service.ts",
    "name": "generateToken",
    "startLine": 88,
    "endLine": 112,
    "snippet": "export function generateToken(user: User) { ... }",
    "score": 0.93
  },
  {
    "path": "src/auth/middleware.ts",
    "name": "requireAuth",
    "startLine": 14,
    "endLine": 39,
    "snippet": "export async function requireAuth(req, res, next) { ... }",
    "score": 0.87
  }
]

Claude sees these snippets directly instead of reading both files end to end.

Principle: Design for AI consumption first. Tools like Claude and Cursor don't need whole files; they need the smallest unit that still preserves meaning.
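
Since Claude and Cursor consume dev_search through MCP, here is a rough sketch of what exposing a tool like this looks like with the TypeScript MCP SDK. It is not dev-agent's actual adapter code: runSearch is a placeholder, and exact SDK signatures vary between versions.

import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

// runSearch is a placeholder for the real search pipeline described below.
declare function runSearch(query: string, limit: number): Promise<unknown[]>;

const server = new McpServer({ name: "dev-agent-sketch", version: "0.0.1" });

// Register a dev_search-style tool: the client (Claude, Cursor) sends a query,
// the server answers with ranked snippets serialized as JSON text.
server.tool(
  "dev_search",
  { query: z.string(), limit: z.number().optional() },
  async ({ query, limit }) => {
    const results = await runSearch(query, limit ?? 5);
    return {
      content: [{ type: "text" as const, text: JSON.stringify(results, null, 2) }],
    };
  }
);

// The assistant talks to the server over stdio.
await server.connect(new StdioServerTransport());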


The architecture: local vector search for code

I built the core scanner, vector store, indexer, and CLI in one day. Not because I'm fast, but because I made decisions quickly and moved on. Claude helped me explore options; I picked reasonable ones and shipped. The architecture is there so Future Me can change those decisions cheaply.

Let me walk through each piece, including where I got it wrong.


1. The scanner: Semantic chunking

Most RAG systems chunk by token count. You get arbitrary boundaries:

// Token-based chunking (what most RAG does)
chunk_1 = "function foo() { ... } function bar() { ..."  // Split mid-function
chunk_2 = "return x; } class User { constructor() ..."   // Meaningless boundary

I chunk by meaning instead:

// Semantic chunking (what dev_search does)
doc_1 = { type: "function", name: "foo", text: "<complete function>" }
doc_2 = { type: "function", name: "bar", text: "<complete function>" }
doc_3 = { type: "class", name: "User", text: "<complete class>" }

The scanner uses the TypeScript Compiler API to walk the AST. Each function, class, and interface becomes one document.
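
A rough illustration of that idea, not the actual scanner (which also handles classes, interfaces, methods, and exports):

import ts from "typescript";
import { readFileSync } from "fs";

// Minimal sketch: walk a file's AST and emit one document per top-level function.
function scanFile(filePath: string) {
  const sourceText = readFileSync(filePath, "utf8");
  const source = ts.createSourceFile(filePath, sourceText, ts.ScriptTarget.Latest, true);
  const docs: { type: string; name: string; startLine: number; endLine: number; text: string }[] = [];

  ts.forEachChild(source, (node) => {
    if (ts.isFunctionDeclaration(node) && node.name) {
      const start = source.getLineAndCharacterOfPosition(node.getStart()).line + 1;
      const end = source.getLineAndCharacterOfPosition(node.getEnd()).line + 1;
      docs.push({
        type: "function",
        name: node.name.text,
        startLine: start,
        endLine: end,
        text: node.getText(source), // the complete function, not a token-window fragment
      });
    }
  });

  return docs;
}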

Decision context: I explored chunking strategies with Claude. Token-based was simpler to implement, but the examples of split functions looked wrong. Semantic chunking meant more work upfront, but the search results would be complete, usable code blocks. I chose semantic chunking and haven't regretted it.

Trade-off: Large functions get truncated. Very small functions might benefit from grouping. Still, semantic boundaries beat arbitrary boundaries for search quality, because the model sees a complete thought, not a fragment.


2. Document preparation: What gets embedded

Before embedding, each document is formatted:

function formatDocumentText(doc: Document): string {
  const parts: string[] = [];
  
  if (doc.metadata.name) {
    parts.push(`${doc.type}: ${doc.metadata.name}`);
  }
  
  if (doc.text) {
    parts.push(doc.text);
  }
  
  return parts.join('\n\n');
}

Example output:

function: authenticateUser

async function authenticateUser(user: User): Promise<AuthResult> {
  const valid = await bcrypt.compare(user.password, stored.hash);
  if (!valid) throw new AuthError('Invalid credentials');
  return generateToken(user);
}

Why prefix with type and name? The embedding model sees "function: authenticateUser" as context. Queries like "authentication" have a stronger signal to match against.


3. The embedding model

  • Model: all-MiniLM-L6-v2
  • Dimensions: 384
  • Pooling: Mean
  • Normalization: L2
  • Runtime: Local (ONNX via Transformers.js)

Decision context: I hadn't worked with embedding models before. I asked Claude to compare options: OpenAI's ada-002, Cohere, various open-source models. My constraints were: runs locally (no API calls, code stays on my machine), well-documented, community validation.

MiniLM fit all three. I didn't benchmark alternatives. I picked something reasonable and moved on. The embedding model turned out to matter less than I expected. The architecture around it mattered more.

If a better local model emerges, the TransformersEmbedder class is isolated. I can swap it without touching search logic.

Why mean pooling? Averages all token embeddings. Works better than CLS token for sentence similarity.

Why L2 normalize? Makes vectors unit length. Simplifies distance calculations.
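
A minimal sketch of that embedding step using Transformers.js and the Xenova ONNX build of the model; this is illustrative, not the actual TransformersEmbedder class:

import { pipeline } from "@xenova/transformers";

// Load the local ONNX build of all-MiniLM-L6-v2 once and reuse it.
const extractor = await pipeline("feature-extraction", "Xenova/all-MiniLM-L6-v2");

// Mean pooling + L2 normalization, matching the settings above.
async function embed(text: string): Promise<number[]> {
  const output = await extractor(text, { pooling: "mean", normalize: true });
  return Array.from(output.data as Float32Array); // 384-dimensional unit vector
}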

I wasn't optimizing for generic semantic search. I was optimizing for one thing: can an AI assistant answer "where does this happen?" in one shot without poking around the repo. That lens killed a lot of premature complexity. I didn't need perfect embeddings; I needed consistent-enough ones plus good chunking and metadata.


4. Vector storage: LanceDB

I store vectors in LanceDB, an embedded, columnar database.

// Schema
{
  id: string,           // Document hash
  vector: float[384],   // Embedding
  metadata: string      // JSON blob
}

Decision context: Options I explored with Claude:

  • Chroma: requires a server process
  • Pinecone: cloud-only, costs money
  • Qdrant: heavier, more features than needed
  • FAISS: no persistence, manual serialization
  • LanceDB: embedded, zero config, good enough

I wanted "just works locally." LanceDB was the simplest path. The LanceDBVectorStore class implements a VectorStore interface, so if I need to switch to something else, I implement the interface and swap.
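
The seam is small. Here is an illustrative shape for it; the field and method names are my assumptions, not the repo's exact interface:

// Illustrative sketch of the seam between search logic and storage.
interface SearchHit {
  id: string;
  distance: number;   // raw L2 distance from the store
  metadata: string;   // JSON blob with path, lines, snippet, etc.
}

interface VectorStore {
  upsert(docs: { id: string; vector: number[]; metadata: string }[]): Promise<void>;
  search(queryVector: number[], limit: number): Promise<SearchHit[]>;
  delete(ids: string[]): Promise<void>;
}

// LanceDBVectorStore implements this today; a different backend would
// just implement the same three methods.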


5. Similarity scoring

LanceDB returns L2 distance. I convert to a 0-1 score:

const distance = result._distance;
const score = Math.exp(-(distance * distance));

  • distance ≈ 0 → score ≈ 1.0 (very similar)
  • distance = 1 → score ≈ 0.37
  • distance = 2 → score ≈ 0.02 (not similar)

What I got wrong: My first scoring formula was off. The results looked weird: relevant code showing low scores, irrelevant code ranking higher than expected. I found a better formula in a blog post, tested it against results I could eyeball, and it made sense. Shipped it. There might be better formulas. This one worked.


6. Result formatting: The key insight

This is where the debugging session difference comes from.

Without dev-agent, Claude gets file paths:

Search result: "src/auth/service.ts" (score: 0.89)

Then it reads the entire file. 441 lines. 18,000 input tokens. And often it's the wrong file, so it reads another one. And another.

With dev-agent, Claude gets snippets:

{
  "path": "src/auth/service.ts",
  "name": "authenticateUser",
  "startLine": 42,
  "endLine": 67,
  "snippet": "async function authenticateUser(user: User)...",
  "score": 0.89
}

65 input tokens. The actual code. No file read needed.

This is where the token savings come from. It’s not compression; it simply avoids reading full files to find the relevant logic.

I didn't design for token savings. I designed to make Claude faster. The savings were a side effect.


7. The metadata design

Beyond similarity, what makes results actionable:

  • path: Claude can read the file if needed
  • startLine, endLine: precise location for edits
  • signature: quick understanding without reading the body
  • snippet: the actual code, no file read needed
  • callees: what this function calls (for dev_refs)
  • exported: is this public API?
  • docstring: the author's intent

This metadata is why dev_search chains well with other tools:

dev_search "auth" → finds authenticateUser
dev_refs "authenticateUser" → uses callees from metadata
dev_history file="src/auth/service.ts" → uses path from metadata

8. Token budgets

Results respect a token budget:

const formatter = new CompactFormatter({
  maxResults: limit,
  tokenBudget: tokenBudget ?? 2000,
  includeSnippets: true,
});

If results exceed the budget, the formatter progressively reduces:

  1. Truncate snippets
  2. Reduce result count
  3. Remove optional fields

This keeps results within a predictable budget, so Claude gets the most useful context without flooding its window.
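
Conceptually the reduction is just a loop: apply the cheapest cut, re-measure, stop once under budget. A sketch of the idea, where estimateTokens and the Result shape are placeholders rather than the real CompactFormatter internals:

// Rough sketch of progressive reduction under a token budget.
// estimateTokens() stands in for whatever tokenizer heuristic is used.
declare function estimateTokens(results: Result[]): number;

interface Result { snippet: string; docstring?: string; signature?: string }

function fitToBudget(results: Result[], budget: number): Result[] {
  let fitted = results.map((r) => ({ ...r }));

  // 1. Truncate snippets first: they are the biggest cost.
  if (estimateTokens(fitted) > budget) {
    fitted = fitted.map((r) => ({ ...r, snippet: r.snippet.slice(0, 400) }));
  }

  // 2. Then drop low-ranked results until we fit.
  while (estimateTokens(fitted) > budget && fitted.length > 1) {
    fitted.pop();
  }

  // 3. Finally strip optional fields.
  if (estimateTokens(fitted) > budget) {
    fitted = fitted.map(({ snippet }) => ({ snippet }));
  }

  return fitted;
}

The real formatter may order or implement these steps differently; the point is that degradation is explicit and bounded rather than an arbitrary cutoff.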


What broke (and how I fixed it)

The commit history tells the real story. Some things I had to fix:

Searching filenames instead of content. An early version of the explore feature matched against filenames. A query like "find authentication" would return files named auth.ts instead of files containing authentication logic. Obvious in hindsight.

The similarity formula. Already mentioned: first version produced unintuitive scores.

Cursor integration. The MCP server worked in tests. Then I connected it to Cursor. Zombie processes. Stdin closing unexpectedly. process.exit firing during graceful shutdown. Real-world usage broke things that tests didn't catch.

Memory leaks. Event listeners not cleaned up. Had to implement circular buffers and proper shutdown handling.

These aren't failures; they're the normal shape of building something. Ship, use it, fix what breaks.


Performance

  • Index time (first run): ~5-10 minutes for 10k files
  • Embedding time per doc: ~100ms
  • Search latency: under 100ms
  • Storage overhead: ~10-50MB per 1000 documents

What I'd improve

I built this in a week. There are known gaps:

  • No hybrid search: pure vector, no keyword matching. Would help: BM25 combined with vector search.
  • No re-ranking: single-stage retrieval. Would help: a cross-encoder over the top-k results.
  • No query expansion: the query is embedded as-is. Would help: synonyms and related terms.
  • Large functions truncated: the scanner limits text length. Would help: smarter chunking.
  • Full re-embed on change: simpler implementation. Would help: incremental indexing.

These are all swappable. The architecture is modular:

  • TransformersEmbedder implements EmbeddingProvider
  • LanceDBVectorStore implements VectorStore
  • CompactFormatter / VerboseFormatter are pluggable
  • Scanner is separate from indexer

When I learn better approaches, or when someone contributes them, they plug in.
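
For example, the embedder seam can be as small as this (an illustrative shape, not the repo's exact interface):

// Any model that turns text into a fixed-size vector can replace MiniLM
// by implementing this. Names here are assumptions for illustration.
interface EmbeddingProvider {
  readonly dimensions: number;                 // 384 for all-MiniLM-L6-v2
  embed(texts: string[]): Promise<number[][]>; // one vector per input text
}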


When to use it

  • "Where is X implemented?" (conceptual): dev_search
  • "Find the exact string calculateTax": grep
  • "What calls this function?": dev_refs
  • "Show me the project structure": dev_map

dev_search excels at conceptual queries in unfamiliar codebases. It's weak at exact string matching; use grep for that.


The core insight

I came into this thinking the embedding model would be the hard part. It wasn't.

The embedding model is a commodity. Off-the-shelf MiniLM worked fine.

What mattered:

  • What you chunk: semantic units, not arbitrary tokens
  • What metadata you keep: callees, signatures, line numbers
  • What you return: snippets, not file paths

The model is swappable. The design around it is the work.


Using AI to make decisions

I built this without prior experience in search systems. Claude helped me explore the option space quickly:

  • "Compare embedding models for code search"
  • "What are the trade-offs between Chroma and LanceDB?"
  • "How should I convert L2 distance to a similarity score?"

I didn't always take the first suggestion. But AI compressed the exploration phase. Instead of reading five blog posts and three papers, I had a conversation and got oriented in minutes.

Then I made a choice. Not necessarily the optimal choice, just a reasonable one that let me move forward.

The architecture is modular because I knew I'd learn more. The embedding model might not be the best. The vector store might need replacing. The scoring formula might be naive.

That's fine. I can iterate. The goal was to build something that works, learn from using it, and improve.

Principle: Make a decision, ship it, and leave room to be wrong. Architecture should make it cheap to change your mind later.


This piece has focused on dev_search, which is the semantic search part of dev-agent. In a follow-up, I'll dig into dev_refs, dev_map, and dev_history, and how they use structure and relationships in the codebase to give AI the rest of the context it was missing.

dev-agent is open source. The search implementation is in packages/core/src/vector/ and packages/mcp-server/src/adapters/built-in/search-adapter.ts.
