The Complete Guide to RAG and Vector Databases in 2026
Last updated: January 2026
Most developers spend their first few months with RAG systems making every mistake in the book: wrong embeddings, chunking that destroys context, vector databases that buckle under production load. The documentation is scattered, the benchmarks contradict each other, and nobody seems to agree on best practices.
This guide is what I wish existed when I started.
After building RAG systems for healthcare document processing, legal contract analysis, and enterprise knowledge bases, I've learned that the difference between a demo and production isn't just "more vectors." It's understanding why you're choosing each component and how they interact under real-world conditions.
Let's get into it.
Why RAG Still Matters in 2026 (Despite What You've Heard)
Every few months, someone proclaims RAG is dead. Usually right after a new model ships with a bigger context window.
The argument goes: Claude 4 handles 200k tokens, Gemini 2.5 processes up to a million tokens. Why bother with retrieval when you can just dump everything into the context?
Here's what those takes miss.
Cost and latency compound fast. Processing a million tokens per query gets expensive. At current rates, a single query against a full knowledge base could cost $10-20 in API fees. Multiply that by thousands of daily queries and you're looking at infrastructure costs that dwarf whatever you'd spend on a vector database.
Position bias is real. Research on long-context models (the "lost in the middle" effect) shows that LLMs with massive contexts still struggle when relevant information is buried in the middle of a long prompt. The model's accuracy varies depending on where the answer lives in the context. RAG lets you surface exactly what's relevant, right at the top where the model pays attention.
Freshness matters. Your context window can't include documents that were uploaded five minutes ago. RAG systems with proper ingestion pipelines can.
Privacy and compliance. In regulated industries, you can't just ship patient records or financial data to an external API. RAG with on-premise vector databases keeps sensitive information within your security perimeter.
The nuanced take for 2026: start simple. If your knowledge base fits comfortably in 200k tokens and doesn't change often, skip the retrieval stack. Add RAG when scale, freshness, latency, or privacy truly demand it.
What's actually happening is that RAG is evolving from a specific pattern into what some are calling a "Context Engine" - intelligent retrieval that adapts to the query, understands document relationships, and provides governed, explainable context to AI systems.
Top 10 Vector Databases Ranked for 2026
After testing these databases in production scenarios ranging from 100k to 50 million vectors, here's my honest assessment.
The Ranking
| Rank | Database | Type | Best For | Starting Price |
|---|---|---|---|---|
| 1 | Pinecone | Managed | Enterprise scale, minimal ops | Free tier, then usage-based |
| 2 | Qdrant | OSS + Cloud | Complex filtering, cost-sensitive | Free 1GB forever, $25/mo |
| 3 | Weaviate | OSS + Cloud | Hybrid search, modularity | $25/mo after trial |
| 4 | Milvus | OSS | Billion-scale, GPU acceleration | Free (self-hosted) |
| 5 | pgvector | OSS Extension | PostgreSQL shops, simplicity | Free (with PostgreSQL) |
| 6 | MongoDB Atlas Vector | Managed | Existing MongoDB users | Pay-as-you-go |
| 7 | Supabase Vector | Managed | Full-stack developers, speed | Free tier, then $25/mo |
| 8 | Chroma | OSS | Prototyping, local dev | Free |
| 9 | FAISS | OSS Library | Research, custom pipelines | Free |
| 10 | Elasticsearch | OSS + Cloud | Existing ELK stack | Free tier available |
In-Depth Reviews
1. Pinecone - The Enterprise Standard
Pinecone leads because it just works. The managed infrastructure handles scaling, replication, and failover automatically. For teams that want to focus on their RAG logic rather than database operations, it's the default choice.
Query latency sits consistently under 50ms even at scale. The serverless offering launched in 2024 eliminated cold start concerns. Multi-region deployment is straightforward.
The downside is cost at scale. Once you're pushing millions of vectors with high query volume, the bill climbs. But for most teams, the operational simplicity justifies the premium.
Best for: Production RAG systems where reliability matters more than cost optimization.
2. Qdrant - The Performance Champion
Qdrant surprised me. Written in Rust, it delivers exceptional performance with a smaller footprint than competitors. The filtering capabilities are genuinely powerful - you can run complex metadata queries alongside vector similarity without sacrificing speed.
The API is clean. The documentation is excellent. The free tier (1GB forever, no credit card) makes it easy to evaluate properly.
What sets Qdrant apart is quantization support that maintains accuracy while reducing memory requirements. For cost-sensitive deployments where you're paying for every GB of RAM, this matters.
Best for: Applications requiring complex filtering, cost-conscious teams wanting performance.
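To make the memory point concrete, here's a minimal sketch of creating a collection with scalar (int8) quantization via the qdrant-client Python package. The collection name and vector size are illustrative, and quantization settings should be validated against your own recall targets:
# Sketch: Qdrant collection with scalar (int8) quantization to reduce RAM usage
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")

client.create_collection(
    collection_name="documents",
    vectors_config=models.VectorParams(size=1536, distance=models.Distance.COSINE),
    quantization_config=models.ScalarQuantization(
        scalar=models.ScalarQuantizationConfig(
            type=models.ScalarType.INT8,  # 1 byte per dimension instead of 4
            always_ram=True,              # keep quantized vectors in RAM for fast search
        )
    ),
)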
3. Weaviate - The Flexible Hybrid
Weaviate's strength is hybrid search - combining vector similarity with keyword matching in a single query. For document retrieval where exact terms matter alongside semantic meaning, this dual approach improves relevance.
The modular architecture lets you swap embedding models, add custom modules, and extend functionality. It's the Swiss Army knife approach.
Teams report higher resource requirements at scale. Below 50 million vectors, it runs efficiently. Beyond that, plan capacity carefully.
Best for: Hybrid search requirements, teams wanting flexibility and modularity.
4. Milvus - The Billion-Scale Workhorse
Milvus handles scale that makes other databases sweat. GPU acceleration, distributed querying, and more indexing strategies than any competitor. If you're building a billion-vector system, Milvus should be on your shortlist.
The tradeoff is operational complexity. This isn't a "deploy and forget" solution. You need engineering muscle to run Milvus well.
GitHub stars (~25k) and the LF AI Foundation backing signal strong community support.
Best for: Massive scale deployments with dedicated infrastructure teams.
5. pgvector - The PostgreSQL Native
If you're already running PostgreSQL, pgvector lets you add vector search without introducing another database. Same backup procedures, same monitoring, same security model.
Performance is solid for moderate scale. The HNSW indexing improvements in recent versions handle millions of vectors competently.
For teams that prize operational simplicity and don't need the specialized features of dedicated vector databases, pgvector is often enough.
Best for: PostgreSQL-native teams, moderate scale, unified data architecture.
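For a feel of how little ceremony pgvector needs, here's a sketch using psycopg2. Table and column names are illustrative, and embed() is a hypothetical helper standing in for whatever embedding call your stack uses:
# pgvector sketch: vector column, HNSW index, cosine-distance query via psycopg2
import psycopg2

conn = psycopg2.connect("dbname=app user=app")
cur = conn.cursor()

cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id bigserial PRIMARY KEY,
        text text,
        embedding vector(1536)
    );
""")
cur.execute(
    "CREATE INDEX IF NOT EXISTS documents_embedding_idx "
    "ON documents USING hnsw (embedding vector_cosine_ops);"
)
conn.commit()

# Nearest neighbours by cosine distance (the <=> operator); the query vector is
# serialized to pgvector's text format and cast.
query_embedding = embed("data retention policy")  # hypothetical: returns 1536 floats
vec_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
cur.execute(
    "SELECT id, text FROM documents ORDER BY embedding <=> %s::vector LIMIT 5;",
    (vec_literal,),
)
rows = cur.fetchall()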
6. MongoDB Atlas Vector Search
The February 2025 acquisition of Voyage AI signals MongoDB's serious commitment to the vector space. For existing MongoDB users, adding vector search without architectural changes is compelling.
In September 2025, MongoDB extended these capabilities to self-managed offerings, not just Atlas. This opens doors for on-premise deployments with strict data residency requirements.
The unified platform - operational data and vectors in one place - simplifies RAG architectures considerably.
Best for: MongoDB shops, unified operational + vector storage needs.
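If you want a feel for what that looks like in code, here's a minimal sketch of an Atlas Vector Search aggregation with pymongo. The connection string, index name, and field names are placeholders, and embed() is a hypothetical embedding helper:
# Sketch: Atlas Vector Search via an aggregation pipeline (pymongo)
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<your-cluster-uri>")
collection = client["kb"]["documents"]

query_embedding = embed("claim adjudication rules")  # hypothetical embedding helper

results = collection.aggregate([
    {
        "$vectorSearch": {
            "index": "vector_index",         # Atlas Vector Search index on this collection
            "path": "embedding",             # document field holding the vector
            "queryVector": query_embedding,  # e.g. 1536 floats from text-embedding-3-small
            "numCandidates": 100,            # breadth of the approximate search
            "limit": 5,
        }
    },
    {"$project": {"text": 1, "score": {"$meta": "vectorSearchScore"}}},
])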
7. Supabase Vector
Supabase wraps pgvector with their excellent developer experience. If you're building full-stack applications with Supabase for auth, storage, and database, vector search integrates seamlessly.
The documentation is clear. The client libraries are polished. For getting a RAG prototype running in an afternoon, it's hard to beat.
Best for: Full-stack developers, rapid prototyping, Supabase ecosystem users.
8. Chroma - The Prototyping King
Chroma owns the local development experience. Two commands to install and run. Python-native API. No cloud accounts required.
For testing embedding models, experimenting with chunking strategies, or building proof-of-concepts, Chroma removes all friction.
Don't mistake this for a production recommendation. Chroma is for learning and prototyping. Scale needs will push you elsewhere.
Best for: Local development, learning RAG, rapid prototyping.
9. FAISS - The Research Foundation
Meta's FAISS is a library, not a database. It provides the building blocks - indexing algorithms, similarity search primitives - that you assemble into your own solution.
Researchers and teams building custom retrieval systems use FAISS directly. Everyone else uses databases that build on similar techniques.
Best for: Research, custom implementations, maximum control.
10. Elasticsearch - The Legacy Upgrade Path
If you're already running Elasticsearch for logging or search, adding vector capabilities to your existing cluster might make sense. The vector search features have matured considerably.
For greenfield RAG projects, purpose-built vector databases will likely outperform. But for organizations with significant ELK investments, the upgrade path is legitimate.
Best for: Existing Elasticsearch deployments, unified observability and search.
Editor's Picks by Use Case
For Startups: Start with Qdrant's free tier or Supabase Vector. Graduate to Pinecone when you need enterprise features.
For Enterprise: Pinecone for managed simplicity, MongoDB Atlas Vector if you're already MongoDB, Milvus if you have the team for self-hosting at scale.
For Healthcare/Finance: Qdrant or Milvus self-hosted for data sovereignty. Pinecone with BAA for managed with compliance.
For Learning: Chroma locally, then deploy to Supabase Vector to understand production concerns.
Embeddings Deep Dive: Choosing the Right Model
Embeddings are where retrieval quality is made or lost. Pick wrong, and no amount of database optimization will save you.
Dimension Sizes: What They Actually Mean
Embedding dimensions represent how much semantic information the model captures. More dimensions generally means more nuance, but also more storage and compute.
| Dimensions | Storage/Vector | Use Case | Models |
|---|---|---|---|
| 384-512 | ~1.5-2 KB | Lightweight, mobile, high volume | MiniLM, E5-small |
| 768 | ~3 KB | Balanced general purpose | BGE-base (BAAI) |
| 1024 | ~4 KB | Domain-specific, high quality | Voyage-finance-2, Voyage-law-2 |
| 1536 | ~6 KB | High accuracy general | OpenAI text-embedding-3-small |
| 3072 | ~12 KB | Maximum accuracy | OpenAI text-embedding-3-large, Voyage-3-large |
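As a rough sanity check on those storage figures: raw float32 vectors cost dimensions x 4 bytes each, and that's before index overhead (HNSW graphs and metadata typically add more on top). A quick back-of-the-envelope helper:
# Back-of-the-envelope raw vector storage (float32, excluding index overhead)
def raw_storage_gb(num_vectors: int, dims: int, bytes_per_value: int = 4) -> float:
    return num_vectors * dims * bytes_per_value / 1024**3

print(raw_storage_gb(1_000_000, 1536))   # ~5.7 GB for 1M text-embedding-3-small vectors
print(raw_storage_gb(10_000_000, 3072))  # ~114 GB for 10M 3072-dimensional vectors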
Top Embedding Models for 2026
General Purpose - Commercial
- Google Gemini Embedding (gemini-embedding-001) - Currently #1 on the MTEB leaderboard. Now GA in the Gemini API and Vertex AI.
- Voyage AI voyage-3-large - Outperforms OpenAI and Cohere across 100+ datasets by 9-20%. After MongoDB's acquisition, expect deeper integration.
- OpenAI text-embedding-3-large - Reliable workhorse. The "ancient" option by AI standards (released January 2024) but still solid for general use.
General Purpose - Open Source
- Alibaba Qwen3-Embedding - Ranks just behind Gemini on MTEB. The Apache 2.0 license means you can run it yourself.
- NVIDIA NV-Embed - Fine-tuned from Llama-3.1-8B. Excellent multilingual support, 69.32 MTEB score.
- BGE (BAAI) - Strong performance, active development, MIT license.
Domain-Specific Winners
For Finance: Voyage-finance-2 demolished OpenAI on SEC filings (54% vs 38.5% accuracy). The gap widens on pure financial queries (63.75% vs 40%).
For Legal: Harvey AI partnered with Voyage to create voyage-law-2-harvey. It reduces irrelevant results by 25% compared to off-the-shelf alternatives.
For Code: Qodo-Embed-1-1.5B was designed specifically for code retrieval. If you're building code search or documentation systems, evaluate it.
My Recommendations by Industry
| Industry | Primary Model | Backup Model | Why |
|---|---|---|---|
| Healthcare | Voyage-3-large | BGE-large | Need accuracy, domain terms matter |
| Finance | Voyage-finance-2 | Gemini Embedding | Specialized beats general by 15%+ |
| Legal | Voyage-law-2 | Voyage-3-large | Contract language is specific |
| E-commerce | Gemini Embedding | text-embedding-3-small | Balance of cost and quality |
| Customer Support | text-embedding-3-small | BGE-base | Volume matters, good enough is fine |
| Code Search | Qodo-Embed | NV-Embed | Code is different from prose |
Practical Advice
Run your own evaluation. The MTEB leaderboard is a starting point, not a destination. What matters is how the model performs on your data.
Create a test set of 100-500 query-document pairs from your actual corpus. Measure MRR (Mean Reciprocal Rank) and NDCG. The model that wins on benchmarks might not win on your specific use case.
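A minimal evaluation harness only takes a few dozen lines. The sketch below assumes each query has a single known-relevant document; search() and the example queries and document IDs are placeholders to adapt to your own retriever and corpus:
# Minimal retrieval evaluation: MRR and binary-relevance NDCG@k over a test set
import math

def reciprocal_rank(ranked_ids, relevant_id):
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == relevant_id:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, relevant_id, k=5):
    # With one relevant document, ideal DCG is 1/log2(2) = 1, so DCG equals NDCG
    return sum(
        1.0 / math.log2(i + 2)
        for i, doc_id in enumerate(ranked_ids[:k])
        if doc_id == relevant_id
    )

test_set = [
    ("what is the refund window?", "doc_refund_policy"),    # (query, relevant doc id)
    ("who approves vendor contracts?", "doc_procurement"),  # examples are hypothetical
]

mrr_scores, ndcg_scores = [], []
for query, relevant_id in test_set:
    ranked_ids = [hit.id for hit in search(query, limit=10)]  # your retriever here
    mrr_scores.append(reciprocal_rank(ranked_ids, relevant_id))
    ndcg_scores.append(ndcg_at_k(ranked_ids, relevant_id))

print(f"MRR: {sum(mrr_scores) / len(mrr_scores):.3f}")
print(f"NDCG@5: {sum(ndcg_scores) / len(ndcg_scores):.3f}")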
Chunking Strategies That Actually Work
Chunking is the silent killer of RAG accuracy. Get it wrong and you'll spend weeks debugging retrieval issues that trace back to how you split documents.
The Core Strategies
1. Fixed-Size Chunking
Split every N characters or tokens, regardless of content.
- Pros: Fast, simple, predictable
- Cons: Breaks mid-sentence, destroys context
- Use for: Prototyping only
2. Recursive Character Splitting
Split on paragraphs first, then sentences, then characters as needed.
- Pros: Preserves structure, LangChain default for good reason
- Cons: Still content-agnostic
- Use for: Most RAG applications (start here)
3. Semantic Chunking
Use embeddings to detect topic shifts and split on meaning boundaries (see the sketch after this list).
- Pros: Up to 70% accuracy improvement, preserves concepts
- Cons: Higher compute cost, slower ingestion
- Use for: High-value document corpuses where quality matters
4. Document-Aware Chunking
Respect document structure - headers, sections, code blocks.
- Pros: Maintains document logic, enables hierarchical retrieval
- Cons: Requires format-specific parsing
- Use for: Technical docs, legal contracts, structured content
5. Agentic/LLM-Based Chunking
Let an LLM decide chunk boundaries based on content understanding.
- Pros: Best quality for complex documents
- Cons: Expensive, slow
- Use for: Legal contracts, research papers, compliance docs
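Here's what semantic chunking can look like in practice, using LlamaIndex's SemanticSplitterNodeParser (one of several implementations; LangChain's experimental SemanticChunker is similar). Parameter values are illustrative, long_document_text is a placeholder for your raw text, and the exact API may vary slightly across versions:
# Semantic chunking sketch: split where adjacent-sentence similarity drops sharply
from llama_index.core import Document
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

embed_model = OpenAIEmbedding(model="text-embedding-3-small")

splitter = SemanticSplitterNodeParser(
    buffer_size=1,                       # sentences grouped per embedding
    breakpoint_percentile_threshold=95,  # higher = fewer, larger chunks
    embed_model=embed_model,
)

nodes = splitter.get_nodes_from_documents([Document(text=long_document_text)])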
Optimal Settings
After extensive testing, here's what works:
Chunk size: 256-512 tokens
Overlap: 10-20% (50-100 tokens for 500-token chunks)
NVIDIA's benchmarks found page-level chunking achieved 0.648 accuracy with lowest variance. But for most applications, recursive splitting at 400-512 tokens delivers 85-90% recall without the overhead.
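Translated into code, those settings look something like the following (a sketch; document_text is a placeholder, and the exact sizes are worth tuning on your own corpus):
# Recursive splitting measured in tokens rather than characters
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # tokenizer used by OpenAI embedding models
    chunk_size=450,               # inside the 400-512 token sweet spot
    chunk_overlap=60,             # roughly 10-15% overlap
)
chunks = splitter.split_text(document_text)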
Code-Specific Chunking
Code needs different treatment. Don't split functions in half.
For Python:
- Split on top-level functions and classes
- Include docstrings with their functions
- Preserve import context
For JavaScript/TypeScript:
- Respect module boundaries
- Keep JSDoc comments attached
- Consider AST-aware splitting
# Code-aware chunking example
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter
code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,  # Python-aware separators: classes, defs, etc.
    chunk_size=1000,
    chunk_overlap=100
)
Building RAG Systems: The Framework Decision
The framework wars are tiresome. Here's the practical reality.
LangChain
LangChain is the Swiss Army knife - chains, agents, memory, tools, and integrations with everything. Version 1.0+ brought stable APIs and a modular structure (langchain-core, langchain-community, provider-specific packages).
Use LangChain when:
- Building complex multi-step workflows
- Need broad tool integration
- Want the largest ecosystem of examples and community support
Framework overhead: ~10ms per call, ~2.4k token usage
LlamaIndex
LlamaIndex treats your data as a first-class citizen. Ingestion, chunking, index construction, and query engines are deeply thought out.
Use LlamaIndex when:
- Your primary bottleneck is retrieval quality
- Working with messy, unstructured data (PDFs, varied formats)
- Need hierarchical indexing out of the box
Framework overhead: ~6ms per call, ~1.6k token usage
The Power Move: Use Both
In production for 2026, the winning pattern often combines them:
- LlamaIndex as the Data Layer - Ingest PDFs, clean data, build the vector index. LlamaIndex's retriever is genuinely superior for complex documents.
- LangChain as the Control Layer - Wrap the LlamaIndex query engine as a LangChain tool. Let a LangGraph agent decide when to call retrieval.
This hybrid approach gives you the best of both worlds.
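A minimal sketch of the wiring, assuming you already have a LlamaIndex index like the one built in the pipeline example later in this guide (the tool name and description are placeholders):
# Wrap a LlamaIndex query engine as a LangChain tool for an agent to call
from langchain.tools import Tool

query_engine = index.as_query_engine(similarity_top_k=5)

retrieval_tool = Tool(
    name="knowledge_base",
    description="Search the internal document index and return relevant context.",
    func=lambda q: str(query_engine.query(q)),
)

# Hand retrieval_tool to a LangChain or LangGraph agent; the agent decides
# when (and whether) retrieval is worth calling.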
Direct API Implementation
Sometimes frameworks add more complexity than value. For simple RAG pipelines:
# Direct OpenAI + Qdrant implementation
from openai import OpenAI
from qdrant_client import QdrantClient
client = OpenAI()
qdrant = QdrantClient(url="http://localhost:6333")
def simple_rag(query: str, collection: str = "documents"):
# 1. Embed the query
embedding = client.embeddings.create(
model="text-embedding-3-small",
input=query
).data[0].embedding
# 2. Search vectors
results = qdrant.search(
collection_name=collection,
query_vector=embedding,
limit=5
)
# 3. Build context
context = "\n\n".join([r.payload["text"] for r in results])
# 4. Generate response
response = client.chat.completions.create(
model="gpt-4-turbo",
messages=[
{"role": "system", "content": f"Answer based on this context:\n{context}"},
{"role": "user", "content": query}
]
)
return response.choices[0].message.content
No framework overhead. Full control. Sometimes this is exactly what you need.
Agentic RAG: The 2026 Frontier
Traditional RAG is a single retrieval step. Agentic RAG embeds decision-making into the process.
Key Patterns
Self-RAG (ICLR 2024)
The model decides whether to retrieve, when to retrieve, and critiques its own outputs. Instead of always retrieving, it learns to retrieve on demand.
Key innovation: Reflection tokens as part of generation. The model predicts when it's uncertain and needs external knowledge.
Corrective RAG
After retrieval, evaluate whether the retrieved documents actually help. If not, try a different search strategy.
# Corrective RAG pseudocode
def corrective_rag(query):
docs = retrieve(query)
relevance = evaluate_relevance(query, docs)
if relevance < threshold:
# Try different strategies
docs = web_search(query) # or decompose query
return generate_with_context(query, docs)
Multi-Step Retrieval
Complex questions often need multiple retrieval passes. First retrieve background context, then retrieve specific details.
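A simple two-pass version, assuming a LangChain vector store (qdrant) and chat model (llm) configured like the ones in the agentic example later in this guide:
# Multi-step retrieval sketch: broad background first, then a sharpened second pass
def multi_step_retrieve(question: str, k: int = 3):
    # Pass 1: broad background on the topic
    background = qdrant.similarity_search(question, k=k)
    background_text = "\n".join(doc.page_content for doc in background)

    # Use the background to rewrite the question into a more specific query
    followup = llm.invoke(
        "Given this background:\n"
        f"{background_text}\n\n"
        f"Rewrite the question '{question}' as a more specific search query. "
        "Return only the query."
    ).content

    # Pass 2: targeted retrieval with the sharpened query
    details = qdrant.similarity_search(followup, k=k)
    return background + details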
Hybrid Search
Combine dense vectors (semantic similarity) with sparse vectors (keyword matching). When a user searches for "HIPAA compliance requirements," you want both semantic understanding of compliance concepts AND exact keyword matching on "HIPAA."
# Hybrid search with Weaviate
results = client.query.get(
"Document",
["text", "title"]
).with_hybrid(
query="HIPAA compliance requirements",
alpha=0.5 # Balance between vector and keyword
).with_limit(10).do()
Reranking
Initial retrieval is fast but imprecise. A reranker (like Cohere Rerank or cross-encoder models) scores the top-N results more carefully.
# Reranking pipeline
initial_results = vector_search(query, limit=50) # Fast, recall-oriented
reranked = reranker.rerank(query, initial_results, top_n=5) # Slow, precision-oriented
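Concretely, an open-source cross-encoder makes a reasonable local reranker (Cohere Rerank is the managed equivalent). A sketch, where candidates is assumed to be a list of dicts with a "text" field coming back from the first-stage vector search:
# Reranking sketch with a sentence-transformers cross-encoder
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, candidates, top_n=5):
    pairs = [(query, c["text"]) for c in candidates]
    scores = reranker.predict(pairs)  # one relevance score per (query, passage) pair
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_n]]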
Industry-Specific RAG Implementations
Healthcare
The Challenge: HIPAA compliance, sensitive patient data, life-critical accuracy.
Recommended Stack:
- Vector DB: Qdrant or Milvus (self-hosted for data sovereignty)
- Embeddings: Voyage-3-large (accuracy critical)
- Framework: LlamaIndex for medical document parsing
Critical Requirements:
- All data must be encrypted at rest and in transit
- Business Associate Agreements (BAAs) required for any cloud components
- Audit trails for every query and response
- De-identify documents before embedding where possible
Architecture Pattern:
Patient Query -> On-Premise RAG -> Local LLM
(no PHI leaves network)
Insurance
The Challenge: Complex policy documents, claims processing speed, fraud detection.
Recommended Stack:
- Vector DB: MongoDB Atlas Vector (unified with operational data)
- Embeddings: Voyage-finance-2 (handles financial/insurance terminology)
- Framework: LangGraph for multi-step claims workflows
Use Case Example: Auto insurance claims RAG reviews accident photos, police reports, and repair estimates while checking policy coverage and precedent cases. Output: coverage determination, estimated payout, fraud indicators.
McKinsey reports 30% processing time reduction and 20% cost savings with smart document systems.
Legal
The Challenge: Precision requirements, case law complexity, contract nuance.
Recommended Stack:
- Vector DB: Pinecone or Qdrant (filtering by jurisdiction, date, court)
- Embeddings: Voyage-law-2 or voyage-law-2-harvey
- Framework: LlamaIndex (PDF parsing excellence)
Research Impact: Legal teams report research time dropping from 3 hours to 20 minutes per matter. The caveat: attorneys remain responsible for verification. RAG assists but doesn't replace professional judgment.
Contract analysis use case: Identify key clauses, potential risks, and inconsistencies across multiple documents. Compare terms against standard clauses, flag deviations for review.
Financial Services
The Challenge: Real-time requirements, regulatory compliance, market data integration.
Recommended Stack:
- Vector DB: Pinecone (latency) or MongoDB Atlas (operational integration)
- Embeddings: Voyage-finance-2 (15%+ accuracy improvement on financial text)
- Framework: Direct implementation for latency-critical paths
Key Consideration: Many financial queries need current market data alongside historical knowledge. RAG architecture should support dynamic context injection for real-time feeds.
Complete Code Examples
Python: Production RAG Pipeline
"""
Production RAG Pipeline with LlamaIndex + Qdrant
Includes: chunking, embedding, indexing, querying
"""
from llama_index.core import (
VectorStoreIndex,
SimpleDirectoryReader,
Settings,
StorageContext
)
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from qdrant_client import QdrantClient
# Configuration
QDRANT_URL = "http://localhost:6333"
COLLECTION_NAME = "documents"
EMBEDDING_MODEL = "text-embedding-3-small"
LLM_MODEL = "gpt-4-turbo"
def setup_qdrant():
"""Initialize Qdrant client and collection"""
client = QdrantClient(url=QDRANT_URL)
# Create collection if it doesn't exist
collections = client.get_collections().collections
if not any(c.name == COLLECTION_NAME for c in collections):
client.create_collection(
collection_name=COLLECTION_NAME,
vectors_config={
"size": 1536, # text-embedding-3-small dimensions
"distance": "Cosine"
}
)
return client
def create_index(documents_path: str):
"""Create vector index from documents"""
# Setup Qdrant
qdrant_client = setup_qdrant()
vector_store = QdrantVectorStore(
client=qdrant_client,
collection_name=COLLECTION_NAME
)
# Configure LlamaIndex settings
Settings.embed_model = OpenAIEmbedding(model=EMBEDDING_MODEL)
Settings.llm = OpenAI(model=LLM_MODEL)
Settings.chunk_size = 512
Settings.chunk_overlap = 50
# Load and index documents
documents = SimpleDirectoryReader(documents_path).load_data()
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
documents,
storage_context=storage_context,
show_progress=True
)
return index
def query_rag(query: str, index: VectorStoreIndex):
"""Query the RAG system"""
query_engine = index.as_query_engine(
similarity_top_k=5,
response_mode="tree_summarize"
)
response = query_engine.query(query)
return response
# Usage
if __name__ == "__main__":
# First time: create index
index = create_index("./documents")
# Query
response = query_rag("What are the key compliance requirements?", index)
print(response)
TypeScript: Next.js RAG API Route
/**
* RAG API Route for Next.js
* Uses Vercel AI SDK with Pinecone
*/
import { NextRequest, NextResponse } from 'next/server';
import { Pinecone } from '@pinecone-database/pinecone';
import OpenAI from 'openai';
const openai = new OpenAI();
const pinecone = new Pinecone();
interface RAGRequest {
query: string;
namespace?: string;
topK?: number;
}
export async function POST(request: NextRequest) {
try {
const { query, namespace = 'default', topK = 5 }: RAGRequest =
await request.json();
// 1. Generate embedding for query
const embeddingResponse = await openai.embeddings.create({
model: 'text-embedding-3-small',
input: query,
});
const queryEmbedding = embeddingResponse.data[0].embedding;
// 2. Search Pinecone
const index = pinecone.Index(process.env.PINECONE_INDEX!);
const searchResults = await index.namespace(namespace).query({
vector: queryEmbedding,
topK,
includeMetadata: true,
});
// 3. Build context from results
const context = searchResults.matches
.map((match) => match.metadata?.text || '')
.join('\n\n---\n\n');
// 4. Generate response with context
const completion = await openai.chat.completions.create({
model: 'gpt-4-turbo',
messages: [
{
role: 'system',
content: `You are a helpful assistant. Answer questions based on the provided context. If the context doesn't contain relevant information, say so.
Context:
${context}`,
},
{
role: 'user',
content: query,
},
],
temperature: 0.7,
max_tokens: 1000,
});
// 5. Return response with sources
return NextResponse.json({
answer: completion.choices[0].message.content,
sources: searchResults.matches.map((match) => ({
id: match.id,
score: match.score,
title: match.metadata?.title,
url: match.metadata?.url,
})),
});
} catch (error) {
console.error('RAG Error:', error);
return NextResponse.json(
{ error: 'Failed to process query' },
{ status: 500 }
);
}
}
Python: Agentic RAG with LangGraph
"""
Agentic RAG with LangGraph
Implements: query decomposition, corrective retrieval, self-reflection
"""
from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Qdrant
from langchain.prompts import ChatPromptTemplate
from typing import TypedDict, List, Optional
from qdrant_client import QdrantClient
# State definition
class RAGState(TypedDict):
query: str
sub_queries: Optional[List[str]]
retrieved_docs: List[str]
relevance_scores: List[float]
needs_correction: bool
response: Optional[str]
# Initialize components
llm = ChatOpenAI(model="gpt-4-turbo", temperature=0)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
qdrant = Qdrant(
client=QdrantClient(url="http://localhost:6333"),
collection_name="documents",
embeddings=embeddings
)
def decompose_query(state: RAGState) -> RAGState:
"""Break complex queries into sub-queries"""
prompt = ChatPromptTemplate.from_template(
"""Break this query into 2-3 simpler sub-queries that together answer the original.
If the query is already simple, return just the original query.
Query: {query}
Return as a JSON list of strings."""
)
response = llm.invoke(prompt.format(query=state["query"]))
    # Parse the LLM response into sub-queries; fall back to the original query
    import json
    try:
        sub_queries = json.loads(response.content)
        if not isinstance(sub_queries, list) or not sub_queries:
            sub_queries = [state["query"]]
    except json.JSONDecodeError:
        sub_queries = [state["query"]]
state["sub_queries"] = sub_queries
return state
def retrieve_documents(state: RAGState) -> RAGState:
"""Retrieve documents for each sub-query"""
all_docs = []
for sub_query in state["sub_queries"]:
docs = qdrant.similarity_search_with_score(sub_query, k=3)
all_docs.extend(docs)
# Deduplicate and keep top results
seen = set()
unique_docs = []
scores = []
for doc, score in sorted(all_docs, key=lambda x: x[1], reverse=True):
content = doc.page_content
if content not in seen:
seen.add(content)
unique_docs.append(content)
scores.append(score)
state["retrieved_docs"] = unique_docs[:5]
state["relevance_scores"] = scores[:5]
return state
def check_relevance(state: RAGState) -> RAGState:
    """Determine if retrieved documents are relevant enough"""
    scores = state["relevance_scores"]
    # Guard against an empty result set before averaging
    avg_score = sum(scores) / len(scores) if scores else 0.0
    state["needs_correction"] = avg_score < 0.7
    return state
def correct_retrieval(state: RAGState) -> RAGState:
"""Try alternative retrieval strategies"""
# Rewrite query for better retrieval
prompt = ChatPromptTemplate.from_template(
"""The search for "{query}" didn't find relevant results.
Rewrite this query to find better matches. Be more specific or use synonyms.
Return only the rewritten query."""
)
rewritten = llm.invoke(prompt.format(query=state["query"]))
# Try again with rewritten query
docs = qdrant.similarity_search_with_score(rewritten.content, k=5)
state["retrieved_docs"] = [doc.page_content for doc, _ in docs]
state["relevance_scores"] = [score for _, score in docs]
return state
def generate_response(state: RAGState) -> RAGState:
"""Generate final response from context"""
context = "\n\n".join(state["retrieved_docs"])
prompt = ChatPromptTemplate.from_template(
"""Answer the question based on the provided context.
If the context doesn't contain enough information, acknowledge the limitation.
Context:
{context}
Question: {query}
Answer:"""
)
response = llm.invoke(prompt.format(context=context, query=state["query"]))
state["response"] = response.content
return state
def should_correct(state: RAGState) -> str:
"""Routing function for correction path"""
return "correct" if state["needs_correction"] else "generate"
# Build the graph
workflow = StateGraph(RAGState)
# Add nodes
workflow.add_node("decompose", decompose_query)
workflow.add_node("retrieve", retrieve_documents)
workflow.add_node("check", check_relevance)
workflow.add_node("correct", correct_retrieval)
workflow.add_node("generate", generate_response)
# Add edges
workflow.set_entry_point("decompose")
workflow.add_edge("decompose", "retrieve")
workflow.add_edge("retrieve", "check")
workflow.add_conditional_edges(
"check",
should_correct,
{
"correct": "correct",
"generate": "generate"
}
)
workflow.add_edge("correct", "generate")
workflow.add_edge("generate", END)
# Compile
app = workflow.compile()
# Usage
def agentic_rag(query: str) -> str:
result = app.invoke({"query": query})
return result["response"]
Open Source vs Closed Source: The Real Tradeoffs
Cost Analysis
Self-hosting costs more than most teams expect:
| Scenario | Pinecone (Managed) | Qdrant (Self-Hosted) |
|---|---|---|
| 1M vectors, low QPS | ~$70/mo | ~$50/mo (t3.medium) + ops time |
| 10M vectors, medium QPS | ~$450/mo | ~$200/mo (r5.large) + ops time |
| 100M vectors, high QPS | Custom pricing | ~$1000/mo (cluster) + dedicated ops |
The hidden cost is engineering time. A production vector database needs monitoring, backup procedures, scaling decisions, and incident response. If you don't have dedicated ops capacity, the "free" open source option gets expensive fast.
Data Sovereignty
For healthcare, finance, and government, data location matters. Open source wins here - you can run Qdrant or Milvus in your own VPC, on-premise, or in specific geographic regions.
Pinecone now offers dedicated deployments for enterprise customers. MongoDB Atlas and Weaviate Cloud provide region selection. But for maximum control, self-hosting remains the only option.
Customization
Open source lets you modify indexing strategies, add custom distance metrics, or integrate directly with your infrastructure. When the default behavior doesn't fit, you can change it.
Managed services optimize for the common case. If you're in that common case, great. If you're not, you'll hit walls.
Scaling Reality
Milvus handles billion-scale vectors that would crush simpler solutions. But it requires serious infrastructure knowledge. Pinecone scales effortlessly - for a price.
My recommendation: start managed, move to self-hosted when you hit either cost or capability limits that justify the operational overhead.
Where RAG Goes From Here
RAG in 2026 isn't the same pattern we knew in 2023. It's evolving into a "Context Engine" - an intelligent system that understands not just what to retrieve, but when, how, and whether retrieval even helps.
What's coming:
- Reasoning-integrated retrieval - Models like DeepSeek-R1 that think before they retrieve, using reasoning to determine what information they actually need.
- Multi-modal RAG - Retrieving and reasoning over images, documents, and structured data together.
- Governed context - Enterprise systems where retrieval is auditable, explainable, and constrained by access controls.
- Agentic workflows - RAG as one tool among many that an agent orchestrates to accomplish complex tasks.
The teams winning in 2026 aren't just implementing RAG. They're thinking about information architecture - how knowledge flows through their systems, how it stays current, and how AI accesses it intelligently.
Start simple. Add complexity when the data demands it. And remember that the best RAG system is the one that solves your actual problem, not the one with the most sophisticated architecture diagram.
Quick Reference: My 2026 Recommendations
If you're just starting: Chroma locally, Supabase Vector for first deployment, text-embedding-3-small.
If you're scaling: Qdrant or Pinecone, Voyage-3-large or domain-specific Voyage models, LlamaIndex for data layer.
If you're enterprise: Pinecone or self-hosted Milvus, comprehensive evaluation of embedding models on your data, hybrid LlamaIndex + LangGraph architecture.
If you're regulated industry: Self-hosted Qdrant/Milvus, domain-specific embeddings, strict data governance, comprehensive audit trails.
The landscape will keep changing. The principles - understand your data, measure what matters, start simple, scale when needed - won't.
The SolvedByCode team builds RAG systems at SolvedByCode.ai while documenting the journey from traditional development to AI-native coding. This guide represents hundreds of hours of production experience, failures, and hard-won insights.
Sources and Further Reading
- MTEB Leaderboard - Massive Text Embedding Benchmark
- Qdrant Benchmarks - Vector database performance comparisons
- Self-RAG Paper (ICLR 2024) - Learning to Retrieve, Generate, and Critique
- Agentic RAG Survey - Comprehensive survey on agentic retrieval systems
- Voyage AI Blog - State-of-the-art embedding model details
- Weaviate Chunking Guide - Practical chunking strategies
- LangChain vs LlamaIndex 2026 - Framework comparison
- RAG Review 2025 - Year-end evolution analysis
- MongoDB Vector Search Docs - RAG implementation guide
- Supabase Vector Docs - PostgreSQL vector search setup