What I Learned Building RAG Before It Had a Name

Drafted by Lam Hoang
rag · unacog · klyde · semantic-retrieval · ai

In early 2023, my dev partner Sam and I were building a collaborative LLM chat platform called Unacog. We wanted users to be able to chat with their own documents — research papers, notes, PDFs, whatever they had. The concept of "retrieval-augmented generation" existed in academic papers, but it wasn't in the zeitgeist yet. Nobody was calling it RAG on Twitter. There were no RAG-in-a-box products. We were just trying to solve a problem: how do you give an LLM access to information it wasn't trained on?

We figured it out from first principles. Some of what we learned is still useful.

The Chunking Problem

The first thing you learn building a RAG pipeline is that chunking matters more than you think. You take a document, split it into pieces, embed each piece into a vector, and store it. When a user asks a question, you embed the question, find the most similar vectors, and feed those chunks to the LLM as context.

Simple in theory. In practice, how you split the document changes everything.
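The whole loop fits in a few lines. A sketch of it is below; `embed()` is a stand-in for a real embedding API (here it's a toy bag-of-words vector so the example runs without network access), not the one Unacog used:

```javascript
// Minimal RAG loop sketch. embed() is a hypothetical stand-in for a
// real embedding API; it builds a toy bag-of-words vector.
function embed(text) {
  const vec = new Map();
  for (const word of text.toLowerCase().match(/[a-z]+/g) ?? []) {
    vec.set(word, (vec.get(word) ?? 0) + 1);
  }
  return vec;
}

// Cosine similarity between two sparse vectors.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (const [w, x] of a) { na += x * x; if (b.has(w)) dot += x * b.get(w); }
  for (const x of b.values()) nb += x * x;
  return na && nb ? dot / Math.sqrt(na * nb) : 0;
}

// Index step: embed each chunk and store it.
const store = [];
function indexChunks(chunks) {
  for (const text of chunks) store.push({ text, vector: embed(text) });
}

// Query step: embed the question, return the most similar chunks.
function retrieve(question, topK = 2) {
  return store
    .map(c => ({ text: c.text, score: cosine(embed(question), c.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map(c => c.text);
}
```

The retrieved chunk texts then get pasted into the LLM prompt as context. Everything interesting happens in how the chunks were made in the first place.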

We built four chunking strategies and tested them against real data:

Size-based splitting — the simplest approach. Walk through the text splitting by newlines, then by words within lines, accumulating tokens until you hit a threshold. We used gpt-tokenizer for accurate token counting rather than character estimation. Fast and predictable, but it breaks mid-thought whenever the threshold is reached.
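The accumulation loop looks roughly like this. `countTokens()` stands in for gpt-tokenizer's encoder; this sketch just counts whitespace-delimited words:

```javascript
// Size-based splitting sketch: newlines first, then words within
// lines, accumulating until the token budget is hit. countTokens()
// is a stand-in for gpt-tokenizer's real token counter.
function countTokens(text) {
  return text.split(/\s+/).filter(Boolean).length;
}

function sizeBasedChunks(doc, maxTokens) {
  const chunks = [];
  let current = [];
  let count = 0;
  for (const line of doc.split("\n")) {
    for (const word of line.split(/\s+/).filter(Boolean)) {
      // Flush the current chunk when the next word would exceed the budget.
      if (count + countTokens(word) > maxTokens && current.length) {
        chunks.push(current.join(" "));
        current = [];
        count = 0;
      }
      current.push(word);
      count += countTokens(word);
    }
  }
  if (current.length) chunks.push(current.join(" "));
  return chunks;
}
```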

Sentence-based chunking — group N sentences together with configurable separators. The key insight: bidirectional overlap. After filling a chunk to its sentence count, we'd append the following sentences to it and prepend the preceding ones, so each chunk carried context from both neighbors. This continuity at boundaries made a real difference in retrieval quality. Chunk boundaries are where context gets lost, and overlap is cheap insurance.
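Bidirectional overlap is easy to see in code. A sketch, using a simple sentence-boundary regex rather than Unacog's configurable separators:

```javascript
// Sentence-based chunking with bidirectional overlap: each chunk of n
// sentences also carries `overlap` sentences before and after it, so
// content near a boundary appears in both neighboring chunks.
function sentenceChunks(text, n, overlap) {
  const sentences = text.split(/(?<=[.!?])\s+/).filter(Boolean);
  const chunks = [];
  for (let i = 0; i < sentences.length; i += n) {
    const start = Math.max(0, i - overlap);
    const end = Math.min(sentences.length, i + n + overlap);
    chunks.push(sentences.slice(start, end).join(" "));
  }
  return chunks;
}
```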

Recursive character splitting — LangChain's RecursiveCharacterTextSplitter with our token-counting length function plugged in. Configurable chunk size, overlap, and separator hierarchy. The most flexible option, but also the one where bad defaults produce bad results. We found that the separator hierarchy matters more than the chunk size.
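To show why the hierarchy matters: a toy splitter in the spirit of RecursiveCharacterTextSplitter (not LangChain's actual implementation, and using character length instead of our token-counting function). It tries the coarsest separator first and only recurses to finer ones for pieces that are still too big, so paragraph breaks win over line breaks, which win over spaces:

```javascript
// Toy recursive splitter: coarsest separator first, recurse with the
// rest of the hierarchy only for oversized pieces.
function recursiveSplit(text, maxLen, separators = ["\n\n", "\n", " "]) {
  if (text.length <= maxLen || separators.length === 0) return [text];
  const [sep, ...rest] = separators;
  const out = [];
  for (const piece of text.split(sep).filter(Boolean)) {
    if (piece.length <= maxLen) out.push(piece);
    else out.push(...recursiveSplit(piece, maxLen, rest));
  }
  return out;
}
```

Reorder the hierarchy (say, put `" "` first) and the same chunk size produces chunks that shear through paragraphs — which is exactly the bad-defaults failure mode.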

No chunking — for short documents, just embed the whole thing. Sometimes the simplest approach is right.

We didn't settle on one strategy. We built all four because different documents need different approaches. A research paper with clear section breaks works differently than a transcript. A FAQ works differently than a narrative.

A Million Vectors

To prove the system worked, we built four demos and indexed real data at scale:

arXiv AI papers — 2,800+ research papers from Cornell's arXiv, scraped via HuggingFace. Over a million vectors indexed across four Pinecone indexes, each using a different chunk size (100, 200, 300, and 400 tokens). Same corpus, four ways to slice it. This let us directly compare retrieval quality across chunk sizes for the same queries.

COVID research — 389 documents, 51,000 vectors from science.org. Two indexes: sentence-based and recursive. Same data, different splitting strategies.

Bible — Full text indexed two ways: individual verses and full chapters. This was our stress test for granularity. When someone asks "what does the Bible say about forgiveness," do you want one verse or an entire chapter? The answer depends on the question, which is why we built the retrieval to be configurable.

Song lyrics — Billboard Top 100, four granularities: full song, stanza, verse, and double-stanza. This is where things got interesting.

Small-to-Big Retrieval

One of the patterns we developed was what people now call "small-to-big" retrieval — match on a small chunk for precision, then expand to the surrounding context for completeness.

The implementation: chunk IDs are padded with their position ({docId}_00001_00005 — chunk 1 of 5). When you get a match, you can look up adjacent chunks from the same document and merge them. The trick is handling overlap — when chunks share content at their boundaries, you need to deduplicate. Our annexChunkWithoutOverlap function detects overlapping text sequences and merges intelligently rather than just concatenating.
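The ID scheme and the merge are both simple. A sketch (the padding format is from our implementation; the merge mimics the idea behind annexChunkWithoutOverlap but is not the production code, and it assumes docId contains no underscores):

```javascript
// Padded chunk IDs: {docId}_00001_00005 = chunk 1 of 5.
function chunkId(docId, index, total) {
  const pad = n => String(n).padStart(5, "0");
  return `${docId}_${pad(index)}_${pad(total)}`;
}

// Adjacency is a string operation: compute neighbor IDs, clamped to
// the document's chunk range. Assumes docId has no underscores.
function adjacentIds(id, contextK) {
  const [docId, idx, total] = id.split("_");
  const ids = [];
  for (let d = -contextK; d <= contextK; d++) {
    const i = Number(idx) + d;
    if (i >= 1 && i <= Number(total) && d !== 0) {
      ids.push(chunkId(docId, i, Number(total)));
    }
  }
  return ids;
}

// Merge b onto a, dropping the longest suffix of a that prefixes b —
// the deduplication step for chunks that overlap at their boundary.
function mergeWithoutOverlap(a, b) {
  for (let len = Math.min(a.length, b.length); len > 0; len--) {
    if (a.endsWith(b.slice(0, len))) return a + b.slice(len);
  }
  return a + b;
}
```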

We exposed three knobs: topK (how many chunks to retrieve), includeK (how many to actually use in the prompt), and contextK (how many adjacent chunks to expand each match with). Users could tune precision vs context width per query.
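The three knobs compose in a fixed order. A sketch of that flow, where `retrieveFn` and `expandFn` are hypothetical stand-ins for the vector query and the adjacent-chunk lookup:

```javascript
// topK: cast a wide net; includeK: how many matches the prompt gets;
// contextK: how many neighbors to annex around each kept match.
function buildContext(query, { topK, includeK, contextK }, retrieveFn, expandFn) {
  const candidates = retrieveFn(query, topK);
  const kept = candidates.slice(0, includeK);
  return kept.map(match => expandFn(match, contextK));
}
```

Raising topK while keeping includeK small is the cheap way to trade recall for precision; contextK then controls how much surrounding document each kept match drags in.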

This pattern shows up in most RAG frameworks now. We were building it by hand because the frameworks didn't exist yet.

Dynamic Prompt Templates

We used Handlebars for prompt construction. Two layers of templates:

A document template that formats each retrieved chunk:

Chapter ({{title}}):
 {{text}}

A main template that assembles everything:

Please respond using the following chapters as guidance:
{{documents}}
Respond to this prompt:
{{prompt}}

Both were user-configurable per session. Metadata fields from the vector database — title, text, URL, any custom meta_* fields — got merged as template variables. This meant users could customize how retrieved context was presented to the LLM without touching code.
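The two-layer assembly looks like this. `render()` is a minimal stand-in for Handlebars.compile that only handles flat `{{variable}}` substitution (real Handlebars adds helpers, escaping, and nesting):

```javascript
// Minimal {{variable}} substitution, standing in for Handlebars.
function render(template, vars) {
  return template.replace(/\{\{(\w+)\}\}/g, (_, key) => vars[key] ?? "");
}

// Layer 1: format each retrieved chunk with the document template,
// merging in its vector-store metadata (title, text, etc.).
// Layer 2: drop the joined documents and the user prompt into the
// main template.
function buildPrompt(mainTemplate, docTemplate, matches, userPrompt) {
  const documents = matches.map(m => render(docTemplate, m)).join("\n");
  return render(mainTemplate, { documents, prompt: userPrompt });
}
```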

The Bible demo had three preset templates including violence ratings and age-appropriateness analysis. The flexibility of template-driven prompting meant the same retrieval pipeline could power completely different use cases.

The Discovery: Subjective Metrics

The song lyrics demo is where something unexpected happened.

We were doing standard semantic retrieval — embed a query, find similar songs. It worked, but the results were limited. "Find songs similar to this one" only goes so far when similarity is a single axis: semantic cosine distance.

So we built a metrics system. Ten categories — romantic, comedic, violent, political, religious, sad, motivational, mature, seasonal, inappropriate language. For each song, we ran prompts that asked the LLM to rate the content 0-10 on each dimension. We stored these scores as numeric metadata on the embeddings.

Now retrieval had multiple axes. Instead of just "find similar songs," you could say "find songs similar to X where romantic is above 7 and violent is below 3." That query is impossible with pure embedding similarity. You need the metadata dimensions.
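In code, that query is a metadata filter composed with a similarity sort (Pinecone supports this natively via metadata filters; a sketch of the idea in plain JavaScript, with made-up scores):

```javascript
// Multi-axis retrieval sketch: rank by similarity, but only among
// candidates whose LLM-scored metadata (0-10 per dimension, stored on
// the embedding) passes every filter.
function filteredSearch(candidates, scoreFn, filters) {
  return candidates
    .filter(song =>
      Object.entries(filters).every(([metric, [min, max]]) => {
        const v = song.metrics[metric];
        return v >= min && v <= max;
      }))
    .map(song => ({ ...song, score: scoreFn(song) }))
    .sort((a, b) => b.score - a.score);
}
```

Here `scoreFn` would be cosine similarity against the query embedding; the filters carve the candidate set down before similarity ever enters the picture.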

This changed how I think about retrieval entirely. Cosine similarity gives you one axis: how semantically close are these pieces of text? Subjective metrics give you as many axes as you can define. Emotional intensity. Bias level. Technical complexity. Persuasion tactics. Whatever matters for your domain.

The practical impact: retrieval that actually matches what the user wants, not just what's textually similar.

This Became Klyde

The metrics system was too useful to leave inside a demo. We extracted it into a Chrome extension called Klyde — a tool that lets you perform subjective analysis on any content in the browser.

You define prompt sets (collections of scoring prompts for a domain), point them at a web page or text selection, and get numeric scores back. International news analysis with 7 dimensions. Persuasion detection with 8 rhetorical markers. Email tone analysis. Whatever you need.

The scores can feed back into vector retrieval as metadata filters. The Chrome extension solved the data collection problem — how do you get content scored and indexed without a custom pipeline for every use case? You browse the web and analyze as you go.

What I Think About RAG Now

RAG isn't dead. It's just not the answer to everything, and the hype made people think it was.

Vector embeddings let LLMs understand similarity without reading the full text. That's useful — especially for large datasets where you can't stuff everything into the context window. It's the only practical approach when you have thousands of documents.

But people tried to use RAG for everything. Memory systems. Conversation history. User preferences. And what we're seeing now — with systems like OpenClaw and modern agent architectures — is that for a lot of those use cases, the better approach is to let the agent navigate the file system and read documents when it needs them. That's more natural for how these models work.

The thing that actually makes LLMs better isn't better retrieval. It's better models. Every time the underlying model improves, everything built on top of it improves. RAG is a tool in the toolbox, not the solution.

But subjective metrics — multi-dimensional scoring that lets you filter by qualities embeddings can't capture — I think that's still underexplored. Most RAG systems are still doing single-axis cosine similarity with maybe some keyword filtering. Storing LLM-generated scores as metadata and filtering on them at query time is a different approach, and it works.

What Happened to Unacog

Sam and I eventually went our separate ways. The site is still up but the platform isn't actively maintained. The demos still work. The code is all there.

I don't regret any of it. The RAG pipeline, the chunking strategies, the subjective metrics discovery — all of that carried directly into Klyde, into how I built Eureka's entity system, into how I think about context in every project since. The platform died but the ideas didn't.

The Numbers

Unacog: ~32,100 lines of code. 7 Cloud Functions, 20 REST endpoints, 5 Firebase services.

Unacog Demos: ~5,200 lines. 1M+ vectors indexed across Pinecone. Four demos, each showcasing different retrieval strategies.

Klyde: ~6,600 lines of hand-written code. Chrome extension (Manifest V3) with 15 built-in prompt templates, 6 pre-built vector indexes, and a bulk analysis system.

I built all of this before most people had heard the term RAG. Not because I was trying to be early — because I had a problem to solve and these were the tools that worked.