Building a RAG Chatbot: Every Design Decision Explained

Introduction

RAG (Retrieval-Augmented Generation) is one of those patterns that looks simple on a diagram and gets complicated fast in practice. The chunking strategy, embedding model choice, vector store selection, and the split between ingestion and retrieval all have real consequences for answer quality, cost, and maintainability.

This post walks through every design decision I made building a RAG chatbot that answers natural language questions about Washington state hiking trails. The domain is specific. The decisions generalise.

Architecture overview

The system has two distinct stages that run independently:

DATA PIPELINE (data_ingestion.ipynb)

Web Sources          LangChain              ChromaDB
───────────      ──────────────────      ──────────────
NPS.gov      →   WebBaseLoader       →   text-embedding
Wikipedia        BeautifulSoup           -3-small
Recreation.gov   RecursiveCharacter      ./chroma_db/
                 TextSplitter            washington_hikes

AGENT RUNTIME (hiking_agent/)

User Query
    │
    ▼
Google ADK Agent  →  retrieve_info()  →  Chroma similarity search (k=5)
(Gemini 2.5 Flash)   FunctionTool
    │
    ▼
Grounded Response

The ingestion pipeline runs once and produces a persisted Chroma collection. The agent is stateless and reconnects to that collection on every query.

Decision 1: RAG instead of fine-tuning

Trail conditions, permit availability, and access windows change seasonally. A fine-tuned model would need a full retraining cycle every time the data changes.

RAG sidesteps this entirely. When the data needs refreshing, re-run the ingestion notebook. The agent does not need to be redeployed. The model does not need to be retrained. The knowledge base updates independently of the running system.

This is the core argument for RAG over fine-tuning whenever the underlying data has a meaningful refresh rate: the separation between knowledge and inference is a feature, not a limitation.

Decision 2: Chunk size - 1000 chars / 150 char overlap

Chunk size is one of the most consequential RAG decisions and one of the least discussed.

Too small: individual chunks lose context. A chunk that says "permits are required" without the surrounding sentence explaining which trail and which season is useless at retrieval time.

Too large: chunks eat into the token budget for retrieval. With k=5 results passed to Gemini, the combined retrieved context has to fit in the prompt alongside the system prompt and still leave room for the response. Larger chunks also mix several topics, which makes each embedding a blurrier match for a specific question.

1000 chars strikes the right balance for NPS and Wikipedia content, which tends to be dense and information-rich. The 150-char overlap is not arbitrary either. Because the tail of each chunk is repeated at the head of the next, a key fact (a distance, an elevation, a permit requirement) that falls at the end of a chunk still appears with its surrounding context in the neighbouring chunk instead of being cut off at the boundary.
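To make the window arithmetic concrete, here is a minimal sketch of fixed-size chunking with overlap, plus a rough estimate of how much prompt space k=5 chunks take. It is a simplification, not the pipeline's code: RecursiveCharacterTextSplitter prefers to split on separators such as paragraph breaks, so its chunks are at most 1000 chars, not exactly 1000. The function names and the 4-characters-per-token figure are my own assumptions for illustration.

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 150) -> list[str]:
    """Split text into fixed-size character windows that overlap."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # with the defaults, a new window starts every 850 chars
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


def estimate_context_tokens(k: int = 5, chunk_size: int = 1000,
                            chars_per_token: float = 4.0) -> int:
    """Rough prompt budget for k retrieved chunks.

    chars_per_token=4.0 is an assumed rule of thumb for English text,
    not a measured value for this corpus.
    """
    return round(k * chunk_size / chars_per_token)


if __name__ == "__main__":
    sample = "".join(chr(97 + i % 26) for i in range(2000))
    pieces = chunk_text(sample)
    print(len(pieces), [len(p) for p in pieces])
    print(estimate_context_tokens())
```

The last 150 characters of each chunk are the first 150 of the next, which is what keeps a boundary fact attached to its context.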

Decision 3: text-embedding-3-small over text-embedding-3-large

OpenAI's large embedding model improves retrieval recall on English tasks. It also costs several times more per token than the small model: roughly 6.5x at list prices at the time of writing ($0.13 versus $0.02 per million tokens).

On a 640-chunk corpus, the recall improvement is marginal. The retrieval interface is identical between the two models, so swapping later requires changing a single constructor argument. Starting with the cheaper model and benchmarking before upgrading is the correct order of operations.
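A back-of-envelope estimate of what one full ingestion run costs with each model can be sketched as follows. The corpus size and chunk length come from this post. The characters-per-token ratio and the per-million-token prices are assumptions on my part and should be replaced with current figures before relying on the result.

```python
CORPUS_CHUNKS = 640      # from the post
CHUNK_CHARS = 1000       # from the post (upper bound per chunk)
CHARS_PER_TOKEN = 4.0    # assumption: rough average for English text

# Assumption: USD per 1M input tokens. Check the provider's current pricing.
PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}


def ingestion_cost_usd(model: str, chunks: int = CORPUS_CHUNKS,
                       chunk_chars: int = CHUNK_CHARS) -> float:
    """Estimated cost of embedding the whole corpus once with the given model."""
    tokens = chunks * chunk_chars / CHARS_PER_TOKEN
    return tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[model]


if __name__ == "__main__":
    for name in PRICE_PER_MILLION_TOKENS:
        print(f"{name}: ${ingestion_cost_usd(name):.4f} per full ingestion run")
```

Query-time embeddings and every scheduled re-ingestion add to this total, so the same function can be called with larger chunk counts to see how the gap grows with the corpus.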

Decision 4: Chroma local persistent store over a hosted vector DB

Pinecone, Weaviate, and Qdrant are all excellent options for production RAG systems. For a single-developer showcase project, they introduce infrastructure cost and setup friction with no meaningful benefit.

Chroma's local persistent mode stores the vector collection on disk and reconnects to it on every session. Through LangChain's vector store interface, the retrieval call (as_retriever()) is the same as it would be for a hosted alternative. If the project needed to scale, swapping to a hosted backend would mean changing a single constructor call and re-running ingestion against the new store.

Decision 5: Decoupled ingestion and agent

This is the architectural decision with the most operational impact.

The ingestion pipeline is scraping-dependent. It hits 29 URLs, filters and cleans the content, chunks and embeds it, and writes to disk. It is a one-shot operation that can be triggered manually or scheduled via cron or a workflow orchestrator.

The agent has no scraping dependencies at all. It connects to the persisted Chroma store, runs similarity search, and passes the retrieved chunks to Gemini. It can be deployed to any environment that has access to the chroma_db directory without any of the ingestion tooling.

This separation means:

Refresh the knowledge base without touching the agent
Deploy the agent without any scraping setup
Run the ingestion pipeline on a schedule without coordinating with the live agent
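The shape of that boundary can be sketched with nothing but the standard library. This is an illustration under stated assumptions, not the project's code: a JSON file stands in for the chroma_db directory, a keyword match stands in for similarity search, and the write-then-rename step is one way I would keep a reader from seeing a half-written store. run_ingestion and the file layout are hypothetical names. retrieve_info borrows the tool's name from the diagram, but the signature here is invented for the sketch.

```python
import json
import os
import tempfile
from pathlib import Path


def run_ingestion(store_path: Path, documents: list[str]) -> None:
    """Ingestion side: build the knowledge base and persist it to disk."""
    records = [{"id": i, "text": doc} for i, doc in enumerate(documents)]
    # Write to a temporary file in the same directory, then swap it into place,
    # so a concurrent reader sees either the old store or the new one.
    fd, tmp_name = tempfile.mkstemp(dir=str(store_path.parent), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump(records, handle)
    os.replace(tmp_name, store_path)


def retrieve_info(store_path: Path, query: str) -> list[str]:
    """Agent side: open the persisted store and search it. No scraping code involved."""
    records = json.loads(store_path.read_text(encoding="utf-8"))
    terms = query.lower().split()
    return [r["text"] for r in records
            if any(term in r["text"].lower() for term in terms)]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        store = Path(tmp) / "store.json"
        run_ingestion(store, ["Permits are required in summer.",
                              "The trailhead has parking."])
        print(retrieve_info(store, "permits"))
```

The two functions share only the path to the store, which is the whole point: either side can be rewritten, rescheduled, or redeployed without the other noticing.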

Data sources and the honest limitation

The ingestion pipeline scrapes 29 URLs across three authoritative sources: NPS.gov for park-level information and seasonal guidance, Wikipedia for rich articles on individual trails and wilderness areas, and Recreation.gov for permit and quota information.

The honest limitation: Washington Trails Association (WTA) is the most comprehensive per-trail database in the state, with difficulty ratings, distances, and elevation profiles for over 10,000 hikes. It is protected by Cloudflare's JS challenge and cannot be reliably scraped with a standard HTTP client.

This is the single highest-impact data improvement available. A Playwright-based scraper that executes the JS challenge and extracts WTA's per-trail data would let the chatbot answer questions about individual hikes (distance, difficulty, elevation) instead of only region-level ones.

I documented this limitation explicitly in the README rather than papering over it. A system that knows what it does not know is more useful than one that guesses.

What I would change at larger scale

Hybrid search. Dense vector search alone can miss exact-match queries. Combining it with BM25 sparse retrieval improves recall on specific trail names like "Rattlesnake Ledge" or "The Enchantments" where keyword matching is more reliable than semantic similarity.
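One common way to combine the two result lists is reciprocal rank fusion (RRF). The post does not commit to a fusion method, so treat this as a sketch of one option, not the planned design. It assumes the dense and BM25 retrievers each return a ranked list of chunk ids. The function name and the rrf_k=60 constant (a conventional default) are my choices.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], rrf_k: int = 60,
                           top_n: int = 5) -> list[str]:
    """Merge several ranked lists of chunk ids into one ranking.

    Each list contributes 1 / (rrf_k + rank) to a chunk's score, so a chunk
    that ranks well in both dense and sparse results rises to the top.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (rrf_k + rank)
    # Highest score first; ties broken by id so the output is deterministic.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in ordered[:top_n]]


if __name__ == "__main__":
    dense_hits = ["a", "b", "c"]    # hypothetical ids from vector search
    sparse_hits = ["c", "a", "d"]   # hypothetical ids from BM25
    print(reciprocal_rank_fusion([dense_hits, sparse_hits]))
```

RRF only needs ranks, not scores, which avoids having to normalise cosine similarities against BM25 scores before merging.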

Metadata filtering. Storing region, difficulty, and distance as Chroma metadata fields would enable structured pre-filtering before semantic search. "Easy hikes in the North Cascades" becomes a metadata filter + semantic search rather than relying entirely on the embedding to do both.
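The filter-then-rank order can be sketched in plain Python. This is a brute-force stand-in to show the idea, not how Chroma executes it. The Chunk type, the metadata keys, and the toy two-dimensional embeddings are all hypothetical, and a real system would hand the filter to the vector store instead of looping in application code.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Chunk:
    text: str
    embedding: list[float]
    metadata: dict[str, str] = field(default_factory=dict)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, returning 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def filtered_search(chunks: list[Chunk], query_embedding: list[float],
                    where: dict[str, str], k: int = 5) -> list[Chunk]:
    """Keep only chunks whose metadata matches every key in `where`, then rank by similarity."""
    candidates = [c for c in chunks
                  if all(c.metadata.get(key) == value for key, value in where.items())]
    candidates.sort(key=lambda c: cosine(c.embedding, query_embedding), reverse=True)
    return candidates[:k]


if __name__ == "__main__":
    corpus = [
        Chunk("trail-a", [1.0, 0.0], {"region": "North Cascades", "difficulty": "easy"}),
        Chunk("trail-b", [0.9, 0.1], {"region": "North Cascades", "difficulty": "hard"}),
        Chunk("trail-c", [1.0, 0.0], {"region": "Olympics", "difficulty": "easy"}),
        Chunk("trail-d", [0.6, 0.8], {"region": "North Cascades", "difficulty": "easy"}),
    ]
    hits = filtered_search(corpus, [1.0, 0.0],
                           {"region": "North Cascades", "difficulty": "easy"})
    print([c.text for c in hits])
```

Here trail-c is a perfect semantic match but is excluded before ranking because its region does not match, which is exactly the behaviour the embedding alone cannot guarantee.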

Evaluation harness. A golden dataset of Q&A pairs with known correct answers would let me measure retrieval precision and answer quality across data refreshes. Without it, I am eyeballing the demo outputs.
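A first version of that harness could be as small as the sketch below. It measures only one thing, the hit rate at k: the fraction of golden questions for which at least one known-relevant chunk shows up in the top k results. The golden-set format, the function name, and the retriever signature are assumptions for illustration. Answer quality would need a separate check.

```python
from typing import Callable

# (question, ids of chunks known to contain the answer)
GoldenExample = tuple[str, set[str]]


def hit_rate_at_k(retrieve: Callable[[str, int], list[str]],
                  golden: list[GoldenExample], k: int = 5) -> float:
    """Fraction of golden questions with at least one relevant chunk in the top k."""
    if not golden:
        raise ValueError("golden dataset is empty")
    hits = 0
    for question, relevant_ids in golden:
        retrieved_ids = set(retrieve(question, k))
        if retrieved_ids & relevant_ids:
            hits += 1
    return hits / len(golden)


if __name__ == "__main__":
    # Hypothetical retriever backed by a fixed lookup table, standing in for
    # a call into the real vector store.
    canned = {"q1": ["c1", "c2"], "q2": ["c9"]}

    def fake_retrieve(question: str, k: int) -> list[str]:
        return canned.get(question, [])[:k]

    golden_set = [("q1", {"c2"}), ("q2", {"c3"})]
    print(hit_rate_at_k(fake_retrieve, golden_set))
```

Running the same golden set before and after each data refresh, or against two embedding models, turns "eyeballing" into a number that can regress visibly.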

Scheduled re-ingestion. Trail conditions and permit availability are seasonal. A scheduled pipeline that re-ingests the data sources on a weekly or monthly cadence would keep the knowledge base current without manual intervention.

Conclusion

The RAG pattern is simple. Getting it right requires deliberate decisions on chunking, embeddings, retrieval depth, and the boundary between ingestion and inference. Every one of those decisions has a tradeoff, and the right choice depends on the data characteristics, the query patterns, and the operational constraints of the system.

The honest limitation section in the README is not an apology. It is a specification of what would make the system better, written for the next person who works on it.

Full source: https://github.com/f2015537/RAG-Chatbot
