ClearCode Part 1: Reverse Engineering a Coding Agent Before Writing a Single Line of Code

Why I am building this

I use Claude Code every day. For the longest time it felt like a black box.

I type a prompt. Code appears. Files change. Tests run. Pull requests get written. I have no real idea what is happening between my input and those outputs.

That is not a complaint. It is a problem I want to solve, for myself, by doing the only thing that has ever actually worked for me when I want to understand something: building it from scratch.

ClearCode is my attempt to reverse engineer a production-grade autonomous coding agent. Not to beat Anthropic. Not to ship a competing product. To understand how these systems actually work under the hood, and to document every decision and dead end publicly so anyone else curious about the same question has a ground-up reference to learn from.

This is part 1: the architecture before the code.

What a coding agent actually needs to do

Before designing anything, it helps to think clearly about what the problem actually is. A coding agent is not just an LLM with a text editor. It needs to:

Understand the codebase it is working in, not just the file you are currently looking at
Know which parts of that codebase are relevant to the current task
Keep that understanding current as files change
Take actions: read files, write files, run commands, search, navigate
Reason about multi-step problems where the right next action depends on what previous actions returned
Know when it is done, and when it is stuck
Not do something irreversible without checking first

That list is not exhaustive. But it is enough to start making architectural decisions.

The folder structure

The first design decision was how to divide the problem into layers. Here is where I landed:

clearcode/
│
├── context/                     # Context layer
│   ├── indexers/                # Build indexes over the codebase
│   ├── retrievers/              # Query those indexes
│   └── memory/                  # Short-term and long-term memory
│
├── agent/                       # Agent reasoning layer
│
├── llm/                         # LLM provider abstraction
│
├── tools/                       # Individual tool functions
│
├── mcp/                         # MCP server integrations
│
├── skills/                      # Higher-level composed capabilities
│
├── safety/                      # Safety layer
│
├── freshness/                   # Freshness layer
│
├── observability/               # Observability layer
│
└── eval/                        # Evaluation layer
    ├── datasets/                # Shared golden dataset
    ├── retrieval/               # Recall@k, MRR, NDCG, Hit Rate
    ├── context/                 # Context precision, context recall
    ├── generation/              # Faithfulness, answer relevancy (RAGAS)
    └── agent/                   # Task success rate, step accuracy

Let me explain the reasoning behind each layer.

The context layer

This is where I expect to spend the most time and learn the most. A coding agent's understanding of the codebase it is working in is the ceiling for everything else. If the agent retrieves the wrong files, or retrieves the right files but in the wrong form, no amount of reasoning quality will save the output.

The context layer has three parts:

Indexers build representations of the codebase that can be searched. This is where the chunking strategy decisions from my earlier RAG work become directly relevant. For code, AST-based chunking (preserving functions and classes as atomic units) is almost certainly better than character-based chunking. I expect to spend significant time here.

Retrievers query those indexes. From my RAG case study, I know that hybrid retrieval (semantic plus BM25) handles the full range of query types better than either alone. An exact function name lookup and a conceptual query about permission logic need different retrieval strategies.

Memory is the layer I am most excited and most uncertain about. Short-term memory needs to track what has happened in the current session. Long-term memory needs to persist things the agent should remember across sessions. From my LLM memory patterns work, I know the scoping distinction that matters: STM is per-thread, LTM is per-user.

The agent layer

The agent is the reasoning loop. It decides what to do next, calls a tool, observes the result, and decides what to do after that.

I do not have a firm opinion on the agent architecture yet. The main question is how much structure to impose on the loop: a free-form ReAct-style loop where the LLM decides everything, a more structured plan-then-execute pattern, or something in between. I expect this to be one of the most consequential decisions in the whole project.

The LLM provider abstraction

This layer exists so that the rest of the system does not know or care which model it is using. From my earlier work with LiteLLM, I know that decoupling the model from the agent logic means you can benchmark different models against the same tasks and switch based on cost, latency, or capability without touching any other layer.

Tools

Tools are the atomic capabilities the agent can call: read a file, write a file, run a terminal command, search the codebase, look something up. Each tool is a function with a well-defined input and output.

The principle here is the same one I applied in the MCP server project: write the tool logic once, expose it cleanly, and let the agent layer decide when and how to use it.

MCP server integrations

MCP is the protocol that makes tools portable across agents and frameworks. Rather than binding tools directly to a single agent, wrapping them in an MCP server means they can be used by any MCP-compatible client.

I will wire ClearCode's tools into an MCP server so they are reusable outside the agent loop itself.

Skills

Skills are composed capabilities built on top of individual tools. "Refactor this function" is a skill that might call read-file, analyse-code, write-file, and run-tests in sequence. "Add a feature" is a skill that might involve planning, searching, writing, and testing.

The distinction between a tool and a skill is the level of composition. Tools are atomic. Skills are workflows.

Safety

A coding agent that can write and execute code needs guardrails. The safety layer is where I will enforce things like: no running destructive commands without confirmation, no writing outside the project directory, no making network calls without explicit permission.

I do not have a concrete plan for this layer yet. It is one of the genuinely hard problems in autonomous agent design.

Freshness

The codebase changes as the agent works. A file the agent indexed at the start of a session may be different by the time the agent tries to use that index later in the same session. The freshness layer is responsible for detecting staleness and triggering re-indexing.

This is a problem I have not seen addressed clearly in most agent tutorials. I suspect it is more important in practice than the literature suggests.

Evaluation

This is the layer I am most deliberate about upfront, because it is the one that gets skipped most often in agent projects. The eval layer has four sub-layers, each measuring a different dimension:

Retrieval: Recall@k, MRR, NDCG, Hit Rate. Does the context layer return the right files?

Context: Context precision and context recall. Is the retrieved context relevant, and is it complete?

Generation: Faithfulness and answer relevancy via RAGAS. Does the agent's output actually reflect what it retrieved, and does it answer the question?

Agent: Task success rate and step accuracy. Does the agent complete the task correctly, and does it take reasonable steps to get there?

Without this layer, I cannot tell whether the changes I make to the context pipeline or the agent loop are actually improving anything.

What I do not know yet

Quite a lot. The folder structure above represents the shape of the problem as I understand it today. Some of those layers will turn out to be more complex than I expect. Some will be simpler. Some of the things I think I need will turn out to be unnecessary. Some things I have not thought of yet will turn out to be critical.

That is the point of building this in public. The unknown unknowns are more interesting than the known unknowns, and the only way to find them is to start.

What comes next

Part 2 will cover the context layer: how to build an index over a codebase, what chunking strategy to use for code specifically, and how to query it. That is where the building actually starts.

Source code (work in progress): https://github.com/f2015537/clearcode

If you have ever wondered how tools like Claude Code or Cursor work under the hood, follow along. I am going to find out.

ClearCode Part 1: Reverse Engineering a Coding Agent Before Writing a Single Line of Code

Why I am building this

What a coding agent actually needs to do

The folder structure

The context layer

The agent layer

The LLM provider abstraction

Tools

MCP server integrations

Skills

Safety

Freshness

Evaluation

What I do not know yet

What comes next

Comments

ClearCode

More from this blog

Memory in LLM Agents, Explained: From Stateless Calls to Long-Term Memory

PageIndex: Vectorless, Reasoning-Based RAG Explained

Chunking vs Retrieval: A RAG Case Study on a Real Codebase

Building a RAG Chatbot: Every Design Decision Explained

Command Palette

Why I am building this

What a coding agent actually needs to do

The folder structure

The context layer

The agent layer

The LLM provider abstraction

Tools

MCP server integrations

Skills

Safety

Freshness

Evaluation

What I do not know yet

What comes next

Comments

ClearCode

More from this blog