PageIndex: Vectorless, Reasoning-Based RAG Explained

Introduction

PageIndex, an open-source project by VectifyAI, is currently the #1 trending repository on GitHub with over 30,000 stars. Its central claim is direct and worth taking seriously: similarity is not relevance, and most RAG systems are optimizing for the wrong thing.

This post is my breakdown of how PageIndex works, why the architecture makes sense for a specific class of documents, and where I think the tradeoffs are understated. All credit for the underlying system goes to VectifyAI. Links to the original repository are at the end.

The problem PageIndex is responding to

Traditional RAG chunks a document into fixed-size windows, embeds each chunk into a vector space, and retrieves whichever chunks are closest to the query's embedding at inference time. This works reasonably well for short, homogeneous documents.
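To make the baseline concrete, here is a minimal sketch of that chunk, embed, retrieve loop. The embed function is a toy word-count stand-in I made up so the example runs without a model; a real pipeline would call an embedding model and a vector index instead, and all function names here are my own.

```python
import math
from collections import Counter


def chunk_fixed(text: str, size: int) -> list[str]:
    """Split text into consecutive windows of `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 1) -> list[str]:
    """Return the k chunks whose toy embedding is closest to the query's."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]
```

Nothing in this loop knows which section a chunk came from. The ranking depends only on how close the chunk's vector is to the query's, which is the property PageIndex argues against.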

It breaks down for long, structured professional documents: financial reports, regulatory filings, legal contracts, technical manuals. A chunk can be lexically and even semantically similar to a query while still being the wrong section to answer it, because professional documents encode relevance through structure (which section, which subsection, which exception clause) that flat vector similarity cannot see.

This is the same failure mode I found in my own RAG case study on code search: a docstring containing the actual answer got split away from its function body by character-based chunking, and the embedding retriever never found it. PageIndex generalizes that observation into a different default: do not chunk at all.
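A small reproduction of that failure, using a made-up function rather than code from my case study: with a 40-character window, the signature and the sentence that explains it end up in different chunks.

```python
def chunk_fixed(text: str, size: int) -> list[str]:
    """Split text into consecutive windows of `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Hypothetical source file used only for illustration.
SOURCE = (
    'def retry(fn, attempts=3):\n'
    '    """Retry fn with exponential backoff."""\n'
    '    for i in range(attempts):\n'
    '        ...\n'
)

chunks = chunk_fixed(SOURCE, 40)

# Chunks that hold both the function name and the key word of its docstring.
both = [c for c in chunks if "def retry" in c and "backoff" in c]
```

Here both is empty: the window boundary falls inside the docstring, so a retriever that matches on "backoff" gets back a chunk that no longer names the function it documents.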

How PageIndex works

The system has two stages.

Stage 1: Build a hierarchical tree index.

Instead of slicing a document into arbitrary chunks, PageIndex parses it into a tree structure resembling a table of contents. Each node has a title, a node ID, a start and end page range, and an LLM-generated summary of that section's content. Nested sections become nested tree nodes.
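As a rough picture of what such a node could look like, here is a sketch of the data structure. The class name, field names, and the sample report are my own illustration based on the description above, not the project's actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class TreeNode:
    """One section of a document (field names are illustrative assumptions)."""
    node_id: str
    title: str
    start_page: int
    end_page: int
    summary: str  # would be LLM-generated in the real system
    children: list["TreeNode"] = field(default_factory=list)


def outline(node: TreeNode, depth: int = 0) -> list[str]:
    """Render the tree as indented table-of-contents lines."""
    indent = "  " * depth
    lines = [f"{indent}[{node.node_id}] {node.title} (pp. {node.start_page}-{node.end_page})"]
    for child in node.children:
        lines.extend(outline(child, depth + 1))
    return lines


# Hypothetical annual report, hand-written for the example.
report = TreeNode("0000", "Annual Report", 1, 120, "full annual report", [
    TreeNode("0001", "Business Overview", 1, 30, "products markets and strategy"),
    TreeNode("0002", "Financial Statements", 31, 120, "income balance sheet cash flow and notes", [
        TreeNode("0003", "Income Statement", 31, 40, "revenue expenses and net income"),
        TreeNode("0004", "Notes", 41, 120, "accounting policies and debt covenants"),
    ]),
])
```

Calling outline(report) yields one line per section, indented by nesting depth, which is effectively the table of contents the index is modeled on.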

This tree is generated once per document, using an LLM (GPT-4o by default in the open-source version) to identify section boundaries and write summaries.

Stage 2: Reasoning-based retrieval through tree search.

At query time, instead of embedding the query and running a similarity search, an LLM reasons over the tree. It reads node summaries, decides which branches are worth exploring further, and navigates down toward the most relevant leaf nodes. This is explicitly compared to AlphaGo in the project's README: tree search guided by a learned evaluator, applied to documents instead of a game board.

The result is retrieval that is traceable. You can see exactly which path through the document tree the LLM took to arrive at an answer, and which page ranges it drew from, rather than an opaque list of cosine similarity scores.
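The sketch below shows the shape of that traversal and why the result is traceable. It is a greedy descent under loud assumptions: the keyword_score function is a crude stand-in for the LLM's relevance judgment, the node fields and sample tree are hypothetical, and PageIndex's real search may explore more than one branch.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class TreeNode:
    """One section of a document (field names are illustrative assumptions)."""
    node_id: str
    title: str
    start_page: int
    end_page: int
    summary: str
    children: list["TreeNode"] = field(default_factory=list)


def keyword_score(query: str, node: TreeNode) -> float:
    """Stand-in for an LLM judgment: query words found in the title or summary."""
    query_words = set(query.lower().split())
    node_words = set(f"{node.title} {node.summary}".lower().split())
    return float(len(query_words & node_words))


def tree_search(
    root: TreeNode,
    query: str,
    score: Callable[[str, TreeNode], float] = keyword_score,
) -> list[TreeNode]:
    """Greedily follow the best-scoring child at each level; return the path taken."""
    path = [root]
    node = root
    while node.children:
        node = max(node.children, key=lambda child: score(query, child))
        path.append(node)
    return path


# Hypothetical annual report, hand-written for the example.
report = TreeNode("0000", "Annual Report", 1, 120, "full annual report", [
    TreeNode("0001", "Business Overview", 1, 30, "products markets and strategy"),
    TreeNode("0002", "Financial Statements", 31, 120, "income balance sheet cash flow and notes", [
        TreeNode("0003", "Income Statement", 31, 40, "revenue expenses and net income"),
        TreeNode("0004", "Notes", 41, 120, "accounting policies and debt covenants"),
    ]),
])

path = tree_search(report, "net income and revenue")
trace = [n.node_id for n in path]
pages = (path[-1].start_page, path[-1].end_page)
```

The returned path is the audit trail: trace lists every node visited from the root down, and pages gives the range the answer would be drawn from. Swapping keyword_score for a function that prompts an LLM changes the quality of each decision but not the structure of the trace.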

The evidence

VectifyAI cites a system called Mafin 2.5, built on PageIndex, that scored 98.7% on FinanceBench, a benchmark for financial document question answering. They claim this outperforms vector-based RAG approaches on the same benchmark.

This is a strong, specific, falsifiable claim, which is more than most RAG tooling marketing offers. It is worth treating as a meaningful signal rather than a definitive conclusion. Benchmark performance on FinanceBench specifically does not automatically generalize to every document type, especially ones with less clean hierarchical structure than financial filings.

Where the architecture makes sense

PageIndex's design assumptions hold up well for documents that already have strong hierarchical structure: SEC filings, regulatory text, academic textbooks, legal contracts, technical manuals. These documents are written with a table of contents in mind. A tree index captures structure that genuinely exists in the source material rather than imposing artificial structure through chunk boundaries.

For this category of document, removing chunking entirely is not a workaround; it is the more accurate representation of how the information is actually organized.

Where I would want more detail before adopting it in production

Latency and cost at query time. Vector similarity search is a single embedding call plus a fast nearest-neighbor lookup. Tree search requires the LLM to reason over multiple nodes, potentially across multiple reasoning steps per query. This trades a cheap, fast retrieval step for a more expensive, slower one. Whether that tradeoff is worth it depends entirely on your latency budget and query volume.
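A back-of-envelope model helps frame that budget question. This is my own simplification, not something from the PageIndex documentation: it assumes one LLM call per expanded node, with each call reading the summaries of that node's children, and it treats the per-call latency as a parameter you measure yourself.

```python
def tree_search_llm_calls(levels: int, beam_width: int = 1) -> int:
    """Upper bound on LLM calls for a search that descends `levels` levels.

    The root costs one call; every later level expands at most
    `beam_width` nodes, each costing one call.
    """
    if levels < 1:
        return 0
    return 1 + (levels - 1) * beam_width


def retrieval_latency_s(llm_calls: int, seconds_per_llm_call: float) -> float:
    """Sequential retrieval latency if calls cannot be parallelized."""
    return llm_calls * seconds_per_llm_call


def daily_llm_calls(queries_per_day: int, levels: int, beam_width: int = 1) -> int:
    """Total retrieval-time LLM calls per day under the same assumptions."""
    return queries_per_day * tree_search_llm_calls(levels, beam_width)
```

Under these assumptions a three-level tree costs three sequential LLM calls per query with greedy descent and up to five with a beam of two, before the answer-generation call. Plug in your own measured per-call latency and query volume to see whether that fits.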

Documents without clean hierarchical structure. Financial filings and legal documents are well-suited to this approach because their structure is the point. Unstructured prose (internal Slack threads exported to PDF, meeting transcripts, casual documentation) does not have a meaningful tree to build. The self-hosted version also relies on standard PDF parsing rather than the enhanced OCR pipeline in the paid tier, so it may struggle to extract a useful structure from documents like these.

Scale beyond a single document. The open-source repo and its core tree search operate on a single document. VectifyAI's newer "PageIndex File System" extension addresses multi-document reasoning, but that capability sits in the commercial layer, not the open-source core covered in this repository.

The broader pattern worth noticing

This project is evidence of a shift in how the RAG community is thinking about chunking. My own case study on code RAG found that chunking strategy mattered more than retrieval method when the chunking destroyed semantically important context. PageIndex is a more radical version of the same insight: rather than finding the right chunk size, remove the chunking step and index the structure that already exists in the document.

Whether vectorless reasoning-based retrieval becomes a mainstream pattern or remains a specialized fit for structured professional documents is an open question. The architecture is a serious, well-evidenced answer to a real limitation in standard RAG, and it is worth understanding even if you do not adopt it directly.

Credit and links

All credit for PageIndex goes to VectifyAI and the project's contributors.

Original repository: https://github.com/VectifyAI/PageIndex

Project homepage: https://vectify.ai/pageindex

If this approach is useful to you, consider starring the original repository.
