Subjective Metrics: The RAG Feature Nobody's Using

Drafted by Lam Hoang
rag · klyde · semantic-retrieval · subjective-metrics · ai

Here's a standard RAG query: "find documents similar to this one." You embed the query, run a cosine similarity search, get back the closest vectors. It works. It's also limited to a single dimension — how textually similar are these two pieces of content?

What if you could also ask: "find documents similar to this one, but only the ones that are highly persuasive and low on emotional language"? Or: "find research papers related to this topic where the methodology is rigorous and the conclusions are cautious"?

You can't do that with embeddings alone. Embeddings compress meaning into a single vector. They can tell you what something is about. They can't tell you what it is like.

That's the problem subjective metrics solve.

How It Works

The idea is simple. Before you need to retrieve a document, you score it.

Take a piece of content — an article, a song, a research paper, an email. Run it through a set of prompts that each ask the LLM to rate it 0-10 on a specific quality. Store those scores as numeric metadata alongside the vector embedding.

For song lyrics, we used ten categories: romantic, comedic, violent, political, religious, sad, motivational, mature, seasonal, inappropriate language. Each prompt gives the LLM specific criteria for what a 0 looks like versus a 10, and asks it to return a JSON score.

Now each song has an embedding (what it's about) AND ten numeric scores (what it's like). At query time, you can filter on any combination: romantic >= 7 AND violent <= 2 AND sad >= 5. The vector search handles semantic similarity. The metadata filters handle everything else.
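A query like that maps naturally onto a metadata filter object. As a sketch, here is one way to build such a filter programmatically, assuming a Pinecone-style `$gte`/`$lte` operator syntax (the `build_filter` helper and its `_min`/`_max` keyword convention are illustrative, not part of any real API):

```python
# Sketch: turn per-category bounds into a Pinecone-style metadata filter.
# The $gte/$lte operator shape mirrors Pinecone's filter syntax; the
# category names are the ones from the song-lyrics demo.

def build_filter(**bounds) -> dict:
    """Turn keyword bounds like romantic_min=7 into a metadata filter dict."""
    ops = {"min": "$gte", "max": "$lte"}
    filt: dict = {}
    for key, value in bounds.items():
        field, _, kind = key.rpartition("_")  # "romantic_min" -> ("romantic", "min")
        filt.setdefault(field, {})[ops[kind]] = value
    return filt

# "romantic >= 7 AND violent <= 2 AND sad >= 5"
f = build_filter(romantic_min=7, violent_max=2, sad_min=5)
# f == {"romantic": {"$gte": 7}, "violent": {"$lte": 2}, "sad": {"$gte": 5}}
```

The resulting dict is what you would pass alongside the query vector so the database applies the score constraints during the similarity search.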

Why This Matters

Most retrieval systems treat every document as a point in semantic space. Similar documents cluster together. That's useful, but it's also lossy: two documents can be about the same topic yet be completely different in tone, quality, bias, complexity, or intent.

A Fox News article and an AP article about the same event will have similar embeddings. They're about the same thing. But if you're building a system that needs balanced perspectives, cosine similarity won't help you distinguish them. Bias scores will.

A beginner Python tutorial and an advanced systems design post might both match a query about "Python development." Complexity scores let you route the right content to the right user.

A customer support ticket that's frustrated and one that's confused might embed similarly — both are about the same product issue. Emotional intensity scores let you prioritize the one that needs urgent attention.

Embeddings tell you what. Subjective metrics tell you how.

What We Built

I stumbled into this while building demos for Unacog, our LLM chat platform. The song lyrics demo started as a simple semantic search — find songs similar to a query. It worked but felt flat. The results were topically relevant but not usefully relevant.

Adding the ten-category scoring system changed the results immediately. Instead of "here are songs about love" you get "here are songs about love that are melancholic but not violent, with high lyrical maturity." That's a playlist someone would actually listen to.

We built the scoring into the embedding pipeline. When a song gets indexed, it gets chunked, embedded, AND scored across all ten dimensions. The scores are stored as Pinecone metadata. At query time, the vector similarity search runs first, then metadata filters narrow the results.

The latency cost is nearly zero at query time — metadata filtering happens in the vector database, not in application code. The scoring cost is a one-time investment per document during indexing.
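The whole flow fits in a small sketch. Here the embedding and LLM-scoring calls are stubbed out as toy functions (the real pipeline calls an embedding model and the scoring prompts), and a tiny in-memory index stands in for Pinecone; only the `$gte`/`$lte` filter shape mirrors the real metadata filter syntax:

```python
# Sketch of "embed + score at index time, filter + rank at query time".
# embed() and score() are stubs for the real embedding-model and LLM calls.

def embed(text: str) -> list[float]:
    # Toy embedding: normalized letter-frequency vector (stand-in for a model).
    vec = [0.0] * 26
    for ch in text.lower():
        if ch.isalpha():
            vec[ord(ch) - 97] += 1.0
    norm = sum(v * v for v in vec) ** 0.5 or 1.0
    return [v / norm for v in vec]

def score(text: str) -> dict[str, int]:
    # Stub for the per-category 0-10 scoring prompts; fixed values for the demo.
    return {"romantic": 8, "violent": 1, "sad": 6}

class Index:
    """In-memory stand-in for a vector DB with metadata filtering."""
    def __init__(self):
        self.rows: list[tuple[str, list[float], dict]] = []

    def upsert(self, doc_id: str, text: str):
        # Index time: one embedding plus the full set of subjective scores.
        self.rows.append((doc_id, embed(text), score(text)))

    def query(self, text: str, flt: dict, top_k: int = 5) -> list[str]:
        qv = embed(text)
        hits = []
        for doc_id, vec, meta in self.rows:
            # Metadata constraints are applied alongside the similarity search.
            if all(self._match(meta.get(f, 0), ops) for f, ops in flt.items()):
                sim = sum(a * b for a, b in zip(qv, vec))
                hits.append((sim, doc_id))
        return [d for _, d in sorted(hits, reverse=True)[:top_k]]

    @staticmethod
    def _match(value, ops) -> bool:
        return all(value >= v if op == "$gte" else value <= v
                   for op, v in ops.items())
```

With that in place, `Index().query("love", {"romantic": {"$gte": 7}})` returns only docs whose stored romantic score clears the bar, ranked by similarity.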

Klyde: Making It General-Purpose

The song demo proved the concept. But it was hardcoded to ten music-specific categories. We wanted the same capability for any domain.

That became Klyde — a Chrome extension where users define their own prompt sets. A prompt set is a collection of scoring prompts for a specific domain:

International news — 7 dimensions: threat level, clarity, tone, perspective diversity, bias, historical context, international perspective.

Persuasion detection — 8 dimensions: cherry picking, ad hominem, false dichotomy, over-generalization, scapegoating, bandwagon appeal, repetition, emotional language.

Email analysis — tone, urgency, clarity, actionability.

You can create your own sets for whatever you need. Describe what you want to measure and the system generates a scoring prompt from an example. Each prompt produces either a 0-10 score, a free-text analysis, or structured JSON.
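For the 0-10 case, the moving parts are a prompt template and a defensive parser for the model's reply, since models sometimes wrap JSON in extra text or drift outside the range. A sketch (the field names and prompt wording here are illustrative, not Klyde's actual schema):

```python
# Sketch: a scoring-prompt template plus a defensive parser for the
# model's JSON reply. Names and wording are illustrative.
import json

def make_scoring_prompt(dimension: str, criteria: str) -> str:
    return (
        f"Rate the following content 0-10 for '{dimension}'. {criteria} "
        'Reply with JSON only: {"score": <int>}'
    )

def parse_score(reply: str) -> int:
    """Extract the score from a reply, tolerating surrounding chatter,
    and clamp it to the 0-10 range."""
    start, end = reply.find("{"), reply.rfind("}") + 1
    raw = int(json.loads(reply[start:end])["score"])
    return max(0, min(10, raw))

prompt = make_scoring_prompt("bias", "0 = neutral wire copy, 10 = overt advocacy.")
parse_score('Sure! {"score": 12}')  # clamped to 10
```

Clamping matters: a single out-of-range score stored as metadata would silently break every range filter that touches that field.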

The Chrome extension solves the data collection problem. Instead of building a custom scoring pipeline for every use case, you browse the web and score content as you encounter it. Point it at an article, select your prompt sets, get scores back. The scores are stored and can feed into vector retrieval as metadata filters.

Bulk Scoring

Single-page analysis is useful for exploration. But the real value comes from scoring at scale.

Klyde has a bulk analysis mode. Import a CSV of URLs, or scrape a page for all its links and select which to analyze. The system runs your prompt sets against each URL — server-side scraping or browser-based — and exports results as CSV or JSON.

Score 200 articles for bias and you have a dataset. Score a competitor's entire blog for persuasion tactics and you have competitive intelligence. Score your own content library for tone consistency and you have a quality audit.

The export format is designed to be useful downstream. Upload the scores as metadata when you embed the documents and your retrieval system gains all those filtering dimensions.
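That hand-off is mechanical. As a sketch, assuming a CSV export with one URL column and one column per scored dimension (the column names here are made up), the per-document metadata dicts fall out of a few lines:

```python
# Sketch: turn a bulk-scoring CSV export into per-document metadata
# dicts ready to attach at embedding time. Column names are illustrative.
import csv
import io

csv_text = """url,bias,emotional_language,clarity
https://example.com/a,8,7,4
https://example.com/b,2,1,9
"""

def load_score_metadata(text: str) -> dict[str, dict[str, int]]:
    reader = csv.DictReader(io.StringIO(text))
    return {
        row["url"]: {k: int(v) for k, v in row.items() if k != "url"}
        for row in reader
    }

meta = load_score_metadata(csv_text)
# meta["https://example.com/a"] -> {"bias": 8, "emotional_language": 7, "clarity": 4}
```

Each dict then rides along as the document's metadata in the upsert call, and every column becomes a filterable retrieval dimension.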

Where I Think This Goes

The RAG ecosystem is converging on a standard architecture: chunk, embed, retrieve, generate. The tooling gets better every month. But almost everyone is still working with a single retrieval axis — semantic similarity.

Adding scored metadata dimensions is not hard. The LLM calls for scoring are cheap compared to the value they add. And the queries they enable — "find me X that is also Y but not Z" — are the queries real users actually want to ask.

Some specific use cases I think are underexplored:

Content moderation at retrieval time. Instead of moderating after generation, score your knowledge base for safety dimensions during indexing. Filter out high-risk content before it ever reaches the prompt.

Personalized retrieval. Score documents on complexity, formality, technical depth. Match against user preference profiles. A junior developer and a senior architect asking the same question get different source material.

Research synthesis. Score papers on methodology rigor, sample size, recency, citation count. When the LLM synthesizes findings, it draws from sources filtered by quality rather than just relevance.

Customer intelligence. Score support tickets on frustration, urgency, churn risk. Route to different response strategies based on the scores, not just the topic.
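The routing in that last case is just threshold logic over the stored scores. A sketch, with made-up thresholds and queue names:

```python
# Sketch of score-based routing for the support-ticket example.
# Thresholds and queue names are illustrative, not a recommendation.

def route_ticket(scores: dict[str, int]) -> str:
    if scores.get("churn_risk", 0) >= 7:
        return "retention-team"
    if scores.get("frustration", 0) >= 7 or scores.get("urgency", 0) >= 8:
        return "priority-queue"
    return "standard-queue"

route_ticket({"frustration": 9, "urgency": 5, "churn_risk": 3})  # "priority-queue"
```

The point is that the routing key is the subjective scores, not the topic embedding: two tickets about the identical product issue can land in different queues.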

None of this requires new models or new infrastructure. It requires using the models we already have to generate structured metadata, and using the vector databases we already have to filter on it.

The Code

Klyde is ~6,600 lines of TypeScript. Chrome extension, Manifest V3. It uses Unacog's Firebase backend for LLM calls and vector queries. The prompt management UI is built on Tabulator grids with full CRUD — import, export, AI-assisted generation.

The subjective metrics idea started in a song lyrics demo because I wanted better playlist recommendations. It turned into a general-purpose tool because the same pattern — score content on arbitrary dimensions, filter at retrieval time — works everywhere.

The models will keep getting smarter. Context windows will keep growing. But the insight that retrieval should be multi-dimensional, not just semantically similar, is independent of any specific model or framework. It's a design pattern, not a product feature.