# ClearCode Part 2: AST-Aware Indexing, Vector Stores, and Hybrid Retrieval

## Where Part 1 left us

Part 1 covered the architecture before writing any code: the folder structure, the reasoning behind each layer, and the open questions I did not have answers to yet. The context layer was the piece I expected to spend the most time on.

That turned out to be correct. This post covers what I built there and what I learned.

The end state: a working RAG-powered code assistant that runs as a local REPL. Point it at any codebase and ask questions in plain English. It indexes the source files, embeds them into a vector store, and answers using retrieved context.

* * *

## The chunking decision

The first and most consequential decision in any RAG system for code is how to chunk the source files.

The most common approach in tutorials is LangChain's `RecursiveCharacterTextSplitter`: split the text down to a target character count, with some overlap. This is the right tool for prose. It is the wrong tool for source code.

Character splits break functions at arbitrary boundaries. They separate docstrings from the function bodies they describe. They mix unrelated code into the same chunk because two functions happened to be close together in the file. The embedding model then has to make sense of a fragment that has no semantic coherence, and retrieval degrades accordingly.

The alternative is structure-aware chunking: use the AST to identify the natural boundaries in the code, and make each named block its own chunk.

I used tree-sitter for this. The `code_parser.py` module walks the AST of each source file and extracts top-level named blocks - functions, classes, methods - as atomic chunks. For a Python file with 10 functions, you get 10 chunks, each containing the signature, the docstring, and the full body of one function. For a JavaScript file with classes and methods, each method is one chunk.
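To make the idea concrete, here is a minimal sketch of structure-aware chunking. It swaps in Python's built-in `ast` module for tree-sitter so it runs with no dependencies, which also means it only handles Python. The `Chunk` dataclass and `chunk_python_source` are illustrative names, not the actual `code_parser.py` API, and treating a class as a single chunk is a simplification made for this sketch.

```python
import ast
from dataclasses import dataclass


@dataclass
class Chunk:
    name: str
    start_line: int  # 1-based, inclusive
    end_line: int    # 1-based, inclusive
    content: str


# Node types treated as "named blocks" in this sketch (an assumption).
BLOCK_TYPES = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)


def chunk_python_source(source: str) -> list[Chunk]:
    """Return one chunk per outermost named block in a Python source string."""
    lines = source.splitlines()
    chunks: list[Chunk] = []

    def walk(node: ast.AST) -> None:
        for child in ast.iter_child_nodes(node):
            if isinstance(child, BLOCK_TYPES):
                body = "\n".join(lines[child.lineno - 1 : child.end_lineno])
                chunks.append(Chunk(child.name, child.lineno, child.end_lineno, body))
                continue  # stop here: anything nested stays inside this chunk
            walk(child)

    walk(ast.parse(source))
    return chunks


example = '''
def outer():
    """Docstring travels with the body."""
    def inner():
        return 1
    return inner()

def helper():
    return 2
'''
chunks = chunk_python_source(example)
print([c.name for c in chunks])
```

Each chunk carries its signature, docstring, and body together, and `inner` shows up only as part of `outer`'s content.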

Two decisions within the chunking strategy are worth explaining:

**No nested functions indexed separately.** The `_walk` function stops recursing the moment it hits a named block node. A nested function is included in its parent's chunk, not indexed as a separate unit. Indexing nested functions separately would duplicate context - the outer function already contains the inner one - and produce misleading chunk boundaries where the inner function appears without its surrounding context.

**Fallback for non-code files.** Text files, markdown, config files, and anything without meaningful AST structure fall back to a sliding window chunker. The two strategies are composable: the AST chunker handles source files, the sliding window handles everything else.
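The sliding window fallback can be sketched roughly as follows. This assumes line-based windows that advance by `CHUNK_SIZE - CHUNK_OVERLAP`; the two constant values are made up for illustration, and `sliding_window` is a stand-in rather than the real implementation.

```python
CHUNK_SIZE = 40      # lines per window (illustrative value)
CHUNK_OVERLAP = 10   # lines shared between neighbouring windows (illustrative value)


def sliding_window(text: str, size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP):
    """Return (start_line, end_line, content) tuples, 1-based and inclusive."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    lines = text.splitlines()
    step = size - overlap
    windows = []
    for start in range(0, len(lines), step):
        window = lines[start : start + size]
        windows.append((start + 1, start + len(window), "\n".join(window)))
        if start + size >= len(lines):
            break  # this window already reaches the end of the file
    return windows


doc = "\n".join(f"line {i}" for i in range(1, 101))  # a 100-line file
print([(s, e) for s, e, _ in sliding_window(doc)])
```

With these values a 100-line file yields three windows, each sharing 10 lines with its neighbour.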

* * *

## The byte offset bug

This is the detail that cost me the most time.

tree-sitter returns byte offsets when you ask it where a node starts and ends in the source file. Most Python source files are ASCII, so byte offsets and character offsets are identical, and you never notice the difference. The moment a source file contains a multi-byte character - a Unicode string literal, a non-ASCII variable name, a comment with an emoji - byte and character offsets diverge.

If you use a byte offset as a character index into the decoded string, you get the wrong content. The slice starts too late, shifted by however many extra bytes the multi-byte characters contributed before that point in the file. There is no error. The chunk is silently wrong.

The fix is simple once you understand it: encode the source to bytes first, slice on byte indices, then decode the result. All source slicing in `code_parser.py` now happens this way.

```python
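# tree-sitter offsets index into the UTF-8 bytes, not the decoded str.
source_bytes = source.encode("utf-8")
# Slice the bytes first, then decode the slice back to text.
chunk_content = source_bytes[start_byte:end_byte].decode("utf-8")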
```

This is the kind of correctness detail that only surfaces when you run against real-world codebases rather than toy examples. The test suite now includes a file with multi-byte characters to catch regressions.
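A regression check along these lines might look like the sketch below. It does not call tree-sitter; the byte offsets are computed by hand to stand in for what a byte-oriented parser would report, and both helper names are hypothetical.

```python
def slice_by_chars(source: str, start_byte: int, end_byte: int) -> str:
    """The buggy version: treats byte offsets as character indices."""
    return source[start_byte:end_byte]


def slice_by_bytes(source: str, start_byte: int, end_byte: int) -> str:
    """The fixed version: slice the UTF-8 bytes, then decode."""
    return source.encode("utf-8")[start_byte:end_byte].decode("utf-8")


# A file whose first line contains multi-byte characters.
source = 'label = "naïve café ☕"\n\ndef greet():\n    return label\n'
target = "def greet():\n    return label\n"

# Byte offsets of `target`, as a byte-oriented parser would report them.
encoded = source.encode("utf-8")
start_byte = encoded.index(target.encode("utf-8"))
end_byte = start_byte + len(target.encode("utf-8"))

assert slice_by_bytes(source, start_byte, end_byte) == target
assert slice_by_chars(source, start_byte, end_byte) != target  # silently shifted
```

The buggy slice raises nothing. It just returns text that starts a few characters into the function.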

* * *

## Three retrieval backends

The retrieval layer sits behind a factory interface. Switching backends is a single field in `config.yaml`:

```yaml
vector_store:
  retrieval_mode: hybrid  # dense | sparse | hybrid
```
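A factory of this kind can be sketched in a few lines. Everything named here (`Retriever`, the three placeholder classes, `make_retriever`) is hypothetical and only illustrates the pattern; the config is passed as an already-parsed dict so the example has no YAML dependency.

```python
from typing import Protocol


class Retriever(Protocol):
    def retrieve(self, query: str, k: int = 5) -> list[str]: ...


class DenseRetriever:
    def retrieve(self, query: str, k: int = 5) -> list[str]:
        return []  # placeholder: a real backend would query the vector store


class SparseRetriever:
    def retrieve(self, query: str, k: int = 5) -> list[str]:
        return []  # placeholder


class HybridRetriever:
    def retrieve(self, query: str, k: int = 5) -> list[str]:
        return []  # placeholder


_BACKENDS = {
    "dense": DenseRetriever,
    "sparse": SparseRetriever,
    "hybrid": HybridRetriever,
}


def make_retriever(config: dict) -> Retriever:
    """Pick a backend from the `vector_store.retrieval_mode` config field."""
    mode = config["vector_store"]["retrieval_mode"]
    try:
        backend = _BACKENDS[mode]
    except KeyError:
        raise ValueError(
            f"unknown retrieval_mode {mode!r}; expected one of {sorted(_BACKENDS)}"
        ) from None
    return backend()


retriever = make_retriever({"vector_store": {"retrieval_mode": "hybrid"}})
print(type(retriever).__name__)
```

Callers only see `retrieve`, so adding a fourth backend means adding one class and one dictionary entry.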

Three backends are implemented:

**Dense (ChromaDB or Qdrant)** - embeds each chunk with `text-embedding-3-small` and retrieves by cosine similarity at query time. Best for conceptual and design questions where the question and the answer use different vocabulary.

**Sparse (BM25 via fastembed)** - indexes each chunk as a bag of terms and retrieves by BM25-weighted keyword overlap. No embedding model required. Best for exact identifier lookups, error string searches, and constant name lookups.

**Hybrid (Qdrant)** - stores both a dense embedding and a sparse BM25 vector per chunk, and fuses the two scores at query time. Best when a question mixes exact identifiers with conceptual reasoning.
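To give a feel for what fusing two result lists means, here is a sketch of reciprocal rank fusion (RRF), a widely used rank-based method. This is an assumption for illustration: the post does not depend on which fusion method is used, and RRF combines ranks rather than raw scores. The function name and the constant `k=60` are choices made for this example.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of chunk IDs into one ranked list.

    Each appearance of a chunk contributes 1 / (k + rank), with rank starting at 1.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


dense_hits = ["a", "b", "c"]   # ranked by cosine similarity
sparse_hits = ["c", "a", "d"]  # ranked by BM25
print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
```

Chunks that appear in both lists (`a` and `c` here) rise above chunks that only one retriever found.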

* * *

## What the difference looks like in practice

These are real outputs from all three retrieval modes, run against the ClearCode codebase itself.

**Question 1: Conceptual design question**

```plaintext
> /ask what is the philosophy behind not indexing nested functions?

The design philosophy is outlined across code_parser.py and CLAUDE.md. The _walk
function (lines 113–135) stops recursing the moment it hits a named block node,
ensuring that each top-level function or class is indexed as a single coherent
unit. Indexing nested functions separately would duplicate context — the outer
function already contains the inner one — and produce misleading chunk boundaries.

This also simplifies the stable chunk ID scheme (source::name::start_line) used
for idempotent upserts in ChromaDB, where collisions on nested definitions with
the same name would silently overwrite chunks.
```

Dense retrieval returned the design rationale from `code_parser.py` and `CLAUDE.md` with full context - the `_walk` function, the reasoning about semantic coherence, and the note about stable chunk IDs.

Sparse retrieval on the same question retrieved auth-related chunks with no connection to the question. "Philosophy", "nested", and "functions" do not co-occur as exact terms in the relevant source locations - the design rationale is written in prose that uses different vocabulary from the question.
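The chunk ID scheme quoted in that answer, `source::name::start_line`, is what makes re-indexing idempotent. A rough sketch, using a plain dict as a stand-in for the vector store collection; `make_chunk_id` and `upsert` are hypothetical names, and the chunk contents are placeholders.

```python
def make_chunk_id(source: str, name: str, start_line: int) -> str:
    """Build a stable ID from where a chunk lives, not from its content."""
    return f"{source}::{name}::{start_line}"


def upsert(store: dict, chunk_id: str, content: str) -> None:
    """Insert or overwrite, keyed by ID (stand-in for a vector store upsert)."""
    store[chunk_id] = content


store: dict[str, str] = {}
path = "clearcode/context/indexers/code_parser.py"

chunk_id = make_chunk_id(path, "_walk", 113)
upsert(store, chunk_id, "first indexing run")
upsert(store, chunk_id, "second indexing run")  # same ID, so no duplicate

print(chunk_id, len(store))
```

Indexing the same file twice leaves one entry per chunk, with the latest content winning.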

**Question 2: Exact identifier lookup**

```plaintext
> /ask what value does CHUNK_OVERLAP have and where is it defined?

CHUNK_OVERLAP is used in the _sliding_window function in
clearcode/context/indexers/code_parser.py at line 154, where it controls the
step size between successive window positions as CHUNK_SIZE - CHUNK_OVERLAP.
```

Sparse retrieval found the exact usage site in `code_parser.py` at line 154, where `CHUNK_OVERLAP` appears as a literal identifier.

Dense retrieval on the same question found "10-line overlap" in `CLAUDE.md` where the concept is described in prose - right answer, wrong location. No line reference, no code context. The embedding captured the concept but not the identifier.

**Question 3: Mixed identifier and conceptual**

```plaintext
> /ask how does _extract_name fall back when no identifier node is found,
  and why does that cause a problem with BLOCK_NODE_TYPES like arrow_function?

_extract_name (code_parser.py, lines 138–143) iterates over the children of a
block node looking for a child with type "identifier", "name", or
"property_identifier". If none is found, it returns node.type as a fallback —
so for an arrow_function node with no direct identifier child, the chunk name
becomes "arrow_function".

This is a problem because the identifier for an arrow function (const x = () => {})
lives on the parent variable_declarator, not on the arrow_function node itself.
The fallback produces a generic, non-unique name that collides across all
anonymous arrow functions in the same file, and makes retrieval results
meaningless for JS codebases heavy in functional patterns.
```

Hybrid retrieval returned the full picture: the fallback implementation from `code_parser.py` lines 138-143, and the parent/child AST relationship that makes the `arrow_function` case a problem.

Dense alone retrieved the conceptual relationship but not the exact implementation. Sparse alone pinned the identifiers but missed the reasoning. Hybrid composed both.
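The fallback described in that answer is easy to reproduce in isolation. The sketch below uses a tiny `Node` dataclass as a stand-in for tree-sitter nodes, so the field names are assumptions; only the lookup-then-fallback logic mirrors what the answer describes.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Minimal stand-in for a parser node (not the tree-sitter API)."""
    type: str
    text: str = ""
    children: list["Node"] = field(default_factory=list)


NAME_TYPES = ("identifier", "name", "property_identifier")


def extract_name(node: Node) -> str:
    """Return the first name-like child's text, else fall back to the node type."""
    for child in node.children:
        if child.type in NAME_TYPES:
            return child.text
    return node.type


# function foo() {} : the identifier is a direct child of the block node.
declared = Node("function_declaration", children=[
    Node("identifier", "foo"),
    Node("statement_block"),
])

# const x = () => {} : the identifier sits on the parent variable_declarator.
arrow = Node("arrow_function", children=[
    Node("formal_parameters"),
    Node("statement_block"),
])
declarator = Node("variable_declarator", children=[Node("identifier", "x"), arrow])

print(extract_name(declared), extract_name(arrow))
```

Every arrow function in a file ends up named `arrow_function`, while the name a reader would expect (`x`) is only reachable from the parent node.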

* * *

## The honest limitation

Module-level constants defined between functions do not fall inside any named AST node. They are not indexed by the AST chunker.

Ask for the value of `CHUNK_OVERLAP`, which is defined at module level (lines 44-47 in `code_parser.py`), and the definition never comes back. Dense found the value in documentation prose. Sparse found the usage site. Neither found the definition itself.

This is not a bug I missed. It is a tradeoff with a known fix: index module-level statements (constants, imports, top-level assignments) as an additional chunk type per file. That is the next improvement to the context layer.

It is documented explicitly rather than papered over because a system that knows what it does not know is more useful than one that guesses.
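One possible shape for that fix, sketched with Python's built-in `ast` module instead of tree-sitter: gather every top-level statement that is not a function or class into one extra chunk per file. The function name and the constant values in the example are hypothetical.

```python
import ast

# Blocks already covered by the named-block chunker in this sketch.
NAMED_BLOCKS = (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)


def module_level_chunk(source: str) -> str:
    """Collect imports, constants, and other top-level statements as one chunk."""
    lines = source.splitlines()
    keep: set[int] = set()  # 0-based line indices, de-duplicated
    for node in ast.parse(source).body:  # top-level statements only
        if isinstance(node, NAMED_BLOCKS):
            continue
        keep.update(range(node.lineno - 1, node.end_lineno))
    return "\n".join(lines[i] for i in sorted(keep))


example = '''import os

CHUNK_SIZE = 40
CHUNK_OVERLAP = 10

def _sliding_window(text):
    return text
'''
print(module_level_chunk(example))
```

A chunk like this gives both dense and sparse retrieval something to match when the question is about a constant's definition.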

* * *

## Multi-project use

ChromaDB stores its index in a `.chromadb/` folder inside whichever directory you run `clearcode` from. Each project automatically gets its own isolated index.

Qdrant uses a single named collection by default. Running `clearcode` in a new project reuses the existing collection rather than re-indexing. For multi-project use, ChromaDB is the recommended starting point.
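If you do want per-project isolation with Qdrant, one option is to derive the collection name from the project directory. This is not something the current code does; the sketch below only shows the naming step, with a hypothetical helper and prefix, and leaves out any Qdrant calls.

```python
import hashlib
from pathlib import Path


def collection_name_for(project_dir: str, prefix: str = "clearcode") -> str:
    """Derive a stable, per-project collection name from a directory path."""
    resolved = str(Path(project_dir).resolve())
    digest = hashlib.sha1(resolved.encode("utf-8")).hexdigest()[:12]
    return f"{prefix}_{digest}"


name_a = collection_name_for("/tmp/project-a")
name_b = collection_name_for("/tmp/project-b")
print(name_a != name_b)
```

The same directory always maps to the same name, so re-running in a project finds its own index and nothing else.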

* * *

## What comes next

Part 3 is taking shape. The current plan is to focus on memory and tools - giving the agent the ability to remember context across sessions and take real actions on the codebase rather than just answering questions about it.

But I am genuinely curious what you would prioritize. If you were building this, what would you tackle next? Drop a comment - I read every one, and the best suggestions go directly into the roadmap.

Follow along at https://blog.divyampatro.dev/series/clearcode

Full source: https://github.com/f2015537/clearcode

Part 1: https://blog.divyampatro.dev/clearcode-part-1-reverse-engineering-a-coding-agent-before-writing-a-single-line-of-code
