Subjective Metrics: The RAG Feature Nobody's Using

Drafted by Lam Hoang
rag · klyde · semantic-retrieval · subjective-metrics · ai

Here's a standard RAG query: "find documents similar to this one." You embed the query, run a cosine similarity search, get back the closest vectors. It works. It's also limited to a single dimension — how textually similar are these two pieces of content?

What if you could also ask: "find documents similar to this one, but only the ones that are highly persuasive and low on emotional language"? Or: "find research papers related to this topic where the methodology is rigorous and the conclusions are cautious"?

You can't do that with embeddings alone. Embeddings compress meaning into a single vector. They can tell you what something is about. They can't tell you what it is like.

That's the problem subjective metrics solve.

How It Works

The idea is simple. Before you need to retrieve a document, you score it.

Take a piece of content — an article, a song, a research paper, an email. Run it through a set of prompts that each ask the LLM to rate it 0-10 on a specific quality. Store those scores as numeric metadata alongside the vector embedding.

For song lyrics, we used ten categories: romantic, comedic, violent, political, religious, sad, motivational, mature, seasonal, inappropriate language. Each prompt gives the LLM specific criteria for what a 0 looks like versus a 10, and asks it to return a JSON score.

Now each song has an embedding (what it's about) AND ten numeric scores (what it's like). At query time, you can filter on any combination: romantic >= 7 AND violent <= 2 AND sad >= 5. The vector search handles semantic similarity. The metadata filters handle everything else.
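A query like that maps naturally onto a metadata filter object. As a sketch, here is one way to build such a filter programmatically, assuming a Pinecone-style `$gte`/`$lte` operator syntax (the `build_filter` helper and its `_min`/`_max` keyword convention are illustrative, not part of any real API):

```python
# Sketch: turn per-category bounds into a Pinecone-style metadata filter.
# The $gte/$lte operator shape mirrors Pinecone's filter syntax; the
# category names are the ones from the song-lyrics demo.

def build_filter(**bounds) -> dict:
    """Turn keyword bounds like romantic_min=7 into a metadata filter dict."""
    ops = {"min": "$gte", "max": "$lte"}
    filt: dict = {}
    for key, value in bounds.items():
        field, _, kind = key.rpartition("_")  # "romantic_min" -> ("romantic", "min")
        filt.setdefault(field, {})[ops[kind]] = value
    return filt

# "romantic >= 7 AND violent <= 2 AND sad >= 5"
f = build_filter(romantic_min=7, violent_max=2, sad_min=5)
# f == {"romantic": {"$gte": 7}, "violent": {"$lte": 2}, "sad": {"$gte": 5}}
```

The resulting dict is what you would pass alongside the query vector so the database applies the score constraints during the similarity search.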

Why This Matters

Most retrieval systems treat every document as a point in semantic space. Similar documents cluster together. That's useful, but it's also lossy: two documents can be about the same topic yet be completely different in tone, quality, bias, complexity, or intent.

A Fox News article and an AP article about the same event will have similar embeddings. They're about the same thing. But if you're building a system that needs balanced perspectives, cosine similarity won't help you distinguish them. Bias scores will.

A beginner Python tutorial and an advanced systems design post might both match a query about "Python development." Complexity scores let you route the right content to the right user.

A customer support ticket that's frustrated and one that's confused might embed similarly — both are about the same product issue. Emotional intensity scores let you prioritize the one that needs urgent attention.

Embeddings tell you what. Subjective metrics tell you how.

What We Built

I stumbled into this while building demos for Unacog, our LLM chat platform. The song lyrics demo started as a simple semantic search — find songs similar to a query. It worked but felt flat. The results were topically relevant but not usefully relevant.

Adding the ten-category scoring system changed the results immediately. Instead of "here are songs about love" you get "here are songs about love that are melancholic but not violent, with high lyrical maturity." That's a playlist someone would actually listen to.

We built the scoring into the embedding pipeline. When a song gets indexed, it gets chunked, embedded, AND scored across all ten dimensions. The scores are stored as Pinecone metadata. At query time, the vector similarity search runs first, then metadata filters narrow the results.

The latency cost is nearly zero at query time — metadata filtering happens in the vector database, not in application code. The scoring cost is a one-time investment per document during indexing.
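The whole flow fits in a small sketch. Here the embedding and LLM-scoring calls are stubbed out as toy functions (the real pipeline calls an embedding model and the scoring prompts), and a tiny in-memory index stands in for Pinecone; only the `$gte`/`$lte` filter shape mirrors the real metadata filter syntax:

```python
# Sketch of "embed + score at index time, filter + rank at query time".
# embed() and score() are stubs for the real embedding-model and LLM calls.

def embed(text: str) -> list[float]:
    # Toy embedding: normalized letter-frequency vector (stand-in for a model).
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - 97] += 1.0
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]

def score(text: str) -> dict[str, int]:
    # Stub for the per-category 0-10 scoring prompts; fixed values for the demo.
    return {"romantic": 8, "violent": 1, "sad": 6}

class Index:
    """In-memory stand-in for a vector DB with metadata filtering."""
    def __init__(self):
        self.rows: list[tuple[str, list[float], dict]] = []

    def upsert(self, doc_id: str, text: str):
        # Index time: one embedding plus the full set of subjective scores.
        self.rows.append((doc_id, embed(text), score(text)))

    def query(self, text: str, flt: dict, top_k: int = 5) -> list[str]:
        qv = embed(text)
        hits = []
        for doc_id, vec, meta in self.rows:
            # Metadata constraints are applied alongside the similarity search.
            if all(self._match(meta.get(f, 0), ops) for f, ops in flt.items()):
                sim = sum(a * b for a, b in zip(qv, vec))
                hits.append((sim, doc_id))
        return [d for _, d in sorted(hits, reverse=True)[:top_k]]

    @staticmethod
    def _match(value, ops) -> bool:
        return all(value >= v if op == "$gte" else value <= v
                   for op, v in ops.items())
```

With that in place, `Index().query("love", {"romantic": {"$gte": 7}})` returns only docs whose stored romantic score clears the bar, ranked by similarity.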

Klyde: Making It General-Purpose

The song demo proved the concept. But it was hardcoded to ten music-specific categories. We wanted the same capability for any domain.

That became Klyde — a Chrome extension where users define their own prompt sets. A prompt set is a collection of scoring prompts for a specific domain:

International news — 7 dimensions: threat level, clarity, tone, perspective diversity, bias, historical context, international perspective.

Persuasion detection — 8 dimensions: cherry picking, ad hominem, false dichotomy, over-generalization, scapegoating, bandwagon appeal, repetition, emotional language.

Email analysis — tone, urgency, clarity, actionability.

You can create your own sets for whatever you need. Describe what you want to measure and the system generates a scoring prompt from an example. Each prompt produces either a 0-10 score, a free-text analysis, or structured JSON.
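For the 0-10 case, the moving parts are a prompt template and a defensive parser for the model's reply, since models sometimes wrap JSON in extra text or drift outside the range. A sketch (the field names and prompt wording here are illustrative, not Klyde's actual schema):

```python
# Sketch: a scoring-prompt template plus a defensive parser for the
# model's JSON reply. Names and wording are illustrative.
import json

def make_scoring_prompt(dimension: str, criteria: str) -> str:
    return (
        f"Rate the following content 0-10 for '{dimension}'. {criteria} "
        'Reply with JSON only: {"score": <int>}'
    )

def parse_score(reply: str) -> int:
    """Extract the score from a reply, tolerating surrounding chatter,
    and clamp it to the 0-10 range."""
    start, end = reply.find("{"), reply.rfind("}") + 1
    raw = int(json.loads(reply[start:end])["score"])
    return max(0, min(10, raw))

prompt = make_scoring_prompt("bias", "0 = neutral wire copy, 10 = overt advocacy.")
parse_score('Sure! {"score": 12}')  # clamped to 10
```

Clamping matters: a single out-of-range score stored as metadata would silently break every range filter that touches that field.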

The Chrome extension solves the data collection problem. Instead of building a custom scoring pipeline for every use case, you browse the web and score content as you encounter it. Point it at an article, select your prompt sets, get scores back. The scores are stored and can feed into vector retrieval as metadata filters.

Bulk Scoring

Single-page analysis is useful for exploration. But the real value comes from scoring at scale.

Klyde has a bulk analysis mode. Import a CSV of URLs, or scrape a page for all its links and select which to analyze. The system runs your prompt sets against each URL — server-side scraping or browser-based — and exports results as CSV or JSON.

Score 200 articles for bias and you have a dataset. Score a competitor's entire blog for persuasion tactics and you have competitive intelligence. Score your own content library for tone consistency and you have a quality audit.

The export format is designed to be useful downstream. Upload the scores as metadata when you embed the documents and your retrieval system gains all those filtering dimensions.
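That hand-off is mechanical. As a sketch, assuming a CSV export with one URL column and one column per scored dimension (the column names here are made up), the per-document metadata dicts fall out of a few lines:

```python
# Sketch: turn a bulk-scoring CSV export into per-document metadata
# dicts ready to attach at embedding time. Column names are illustrative.
import csv
import io

csv_text = """url,bias,emotional_language,clarity
https://example.com/a,8,7,4
https://example.com/b,2,1,9
"""

def load_score_metadata(text: str) -> dict[str, dict[str, int]]:
    reader = csv.DictReader(io.StringIO(text))
    return {
        row["url"]: {k: int(v) for k, v in row.items() if k != "url"}
        for row in reader
    }

meta = load_score_metadata(csv_text)
# meta["https://example.com/a"] -> {"bias": 8, "emotional_language": 7, "clarity": 4}
```

Each dict then rides along as the document's metadata in the upsert call, and every column becomes a filterable retrieval dimension.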

Where I Think This Goes

The RAG ecosystem is converging on a standard architecture: chunk, embed, retrieve, generate. The tooling gets better every month. But almost everyone is still working with a single retrieval axis — semantic similarity.

Adding scored metadata dimensions is not hard. The LLM calls for scoring are cheap compared to the value they add. And the queries they enable — "find me X that is also Y but not Z" — are the queries real users actually want to ask.

Some specific use cases I think are underexplored:

Content moderation at retrieval time. Instead of moderating after generation, score your knowledge base for safety dimensions during indexing. Filter out high-risk content before it ever reaches the prompt.

Personalized retrieval. Score documents on complexity, formality, technical depth. Match against user preference profiles. A junior developer and a senior architect asking the same question get different source material.

Research synthesis. Score papers on methodology rigor, sample size, recency, citation count. When the LLM synthesizes findings, it draws from sources filtered by quality rather than just relevance.

Customer intelligence. Score support tickets on frustration, urgency, churn risk. Route to different response strategies based on the scores, not just the topic.
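The routing in that last case is just threshold logic over the stored scores. A sketch, with made-up thresholds and queue names:

```python
# Sketch of score-based routing for the support-ticket example.
# Thresholds and queue names are illustrative, not a recommendation.

def route_ticket(scores: dict[str, int]) -> str:
    if scores.get("churn_risk", 0) >= 7:
        return "retention-team"
    if scores.get("frustration", 0) >= 7 or scores.get("urgency", 0) >= 8:
        return "priority-queue"
    return "standard-queue"

route_ticket({"frustration": 9, "urgency": 5, "churn_risk": 3})  # "priority-queue"
```

The point is that the routing key is the subjective scores, not the topic embedding: two tickets about the identical product issue can land in different queues.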

None of this requires new models or new infrastructure. It requires using the models we already have to generate structured metadata, and using the vector databases we already have to filter on it.

The Code

Klyde is ~6,600 lines of TypeScript. Chrome extension, Manifest V3. It uses Unacog's Firebase backend for LLM calls and vector queries. The prompt management UI is built on Tabulator grids with full CRUD — import, export, AI-assisted generation.

The subjective metrics idea started in a song lyrics demo because I wanted better playlist recommendations. It turned into a general-purpose tool because the same pattern — score content on arbitrary dimensions, filter at retrieval time — works everywhere.

The models will keep getting smarter. Context windows will keep growing. But the insight that retrieval should be multi-dimensional, not just semantically similar, is independent of any specific model or framework. It's a design pattern, not a product feature.