The Complete Guide to RAG and Vector Databases in 2026

Marco Nahmias
January 27, 2026 · 25 min read

Last updated: January 2026

Most developers spend their first few months with RAG systems making every mistake in the book: wrong embeddings, chunking that destroys context, vector databases that buckle under production load. The documentation is scattered, the benchmarks contradict each other, and nobody seems to agree on best practices.

This guide is what I wish existed when I started.

After building RAG systems for healthcare document processing, legal contract analysis, and enterprise knowledge bases, I've learned that the difference between a demo and production isn't just "more vectors." It's understanding why you're choosing each component and how they interact under real-world conditions.

Let's get into it.


Why RAG Still Matters in 2026 (Despite What You've Heard)

Every few months, someone proclaims RAG is dead. Usually right after a new model ships with a bigger context window.

The argument goes: Claude 4 handles 200k tokens, Gemini 2.5 processes up to a million tokens. Why bother with retrieval when you can just dump everything into the context?

Here's what those takes miss.

Cost and latency compound fast. Processing a million tokens per query gets expensive. At current rates, a single query against a full knowledge base could cost $10-20 in API fees. Multiply that by thousands of daily queries and you're looking at infrastructure costs that dwarf whatever you'd spend on a vector database.

Position bias is real. Research from Google confirms that LLMs equipped with massive contexts still struggle when relevant information is buried in the middle of a document. The model's accuracy varies depending on where the answer lives in the context. RAG lets you surface exactly what's relevant, right at the top where the model pays attention.

Freshness matters. Your context window can't include documents that were uploaded five minutes ago. RAG systems with proper ingestion pipelines can.

Privacy and compliance. In regulated industries, you can't just ship patient records or financial data to an external API. RAG with on-premise vector databases keeps sensitive information within your security perimeter.

The nuanced take for 2026: start simple. If your knowledge base fits comfortably in 200k tokens and doesn't change often, skip the retrieval stack. Add RAG when scale, freshness, latency, or privacy truly demand it.

What's actually happening is that RAG is evolving from a specific pattern into what some are calling a "Context Engine" - intelligent retrieval that adapts to the query, understands document relationships, and provides governed, explainable context to AI systems.


Top 10 Vector Databases Ranked for 2026

After testing these databases in production scenarios ranging from 100k to 50 million vectors, here's my honest assessment.

The Ranking

Rank | Database | Type | Best For | Starting Price
1 | Pinecone | Managed | Enterprise scale, minimal ops | Free tier, then usage-based
2 | Qdrant | OSS + Cloud | Complex filtering, cost-sensitive | Free 1GB forever, $25/mo
3 | Weaviate | OSS + Cloud | Hybrid search, modularity | $25/mo after trial
4 | Milvus | OSS | Billion-scale, GPU acceleration | Free (self-hosted)
5 | pgvector | OSS Extension | PostgreSQL shops, simplicity | Free (with PostgreSQL)
6 | MongoDB Atlas Vector | Managed | Existing MongoDB users | Pay-as-you-go
7 | Supabase Vector | Managed | Full-stack developers, speed | Free tier, then $25/mo
8 | Chroma | OSS | Prototyping, local dev | Free
9 | FAISS | OSS Library | Research, custom pipelines | Free
10 | Elasticsearch | OSS + Cloud | Existing ELK stack | Free tier available

In-Depth Reviews

1. Pinecone - The Enterprise Standard

Pinecone leads because it just works. The managed infrastructure handles scaling, replication, and failover automatically. For teams that want to focus on their RAG logic rather than database operations, it's the default choice.

Query latency sits consistently under 50ms even at scale. The serverless offering launched in 2024 eliminated cold start concerns. Multi-region deployment is straightforward.

The downside is cost at scale. Once you're pushing millions of vectors with high query volume, the bill climbs. But for most teams, the operational simplicity justifies the premium.

Best for: Production RAG systems where reliability matters more than cost optimization.

2. Qdrant - The Performance Champion

Qdrant surprised me. Written in Rust, it delivers exceptional performance with a smaller footprint than competitors. The filtering capabilities are genuinely powerful - you can run complex metadata queries alongside vector similarity without sacrificing speed.

The API is clean. The documentation is excellent. The free tier (1GB forever, no credit card) makes it easy to evaluate properly.

What sets Qdrant apart is quantization support that maintains accuracy while reducing memory requirements. For cost-sensitive deployments where you're paying for every GB of RAM, this matters.

Best for: Applications requiring complex filtering, cost-conscious teams wanting performance.
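As a concrete illustration of the quantization support mentioned above, here's a minimal sketch of enabling scalar (int8) quantization at collection creation with Qdrant's Python client. The quantile and memory settings are illustrative assumptions, not tuned values - verify against Qdrant's current docs.

# Sketch: create a collection with int8 scalar quantization to cut vector RAM roughly 4x
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, ScalarQuantization, ScalarQuantizationConfig, ScalarType
)

client = QdrantClient(url="http://localhost:6333")
client.create_collection(
    collection_name="documents_quantized",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(
            type=ScalarType.INT8,
            quantile=0.99,    # clip extreme values before quantizing
            always_ram=True,  # keep quantized vectors in RAM, originals on disk
        )
    ),
)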

3. Weaviate - The Flexible Hybrid

Weaviate's strength is hybrid search - combining vector similarity with keyword matching in a single query. For document retrieval where exact terms matter alongside semantic meaning, this dual approach improves relevance.

The modular architecture lets you swap embedding models, add custom modules, and extend functionality. It's the Swiss Army knife approach.

Teams report higher resource requirements at scale. Below 50 million vectors, it runs efficiently. Beyond that, plan capacity carefully.

Best for: Hybrid search requirements, teams wanting flexibility and modularity.

4. Milvus - The Billion-Scale Workhorse

Milvus handles scale that makes other databases sweat. GPU acceleration, distributed querying, and more indexing strategies than any competitor. If you're building a billion-vector system, Milvus should be on your shortlist.

The tradeoff is operational complexity. This isn't a "deploy and forget" solution. You need engineering muscle to run Milvus well.

GitHub stars (~25k) and the LF AI Foundation backing signal strong community support.

Best for: Massive scale deployments with dedicated infrastructure teams.

5. pgvector - The PostgreSQL Native

If you're already running PostgreSQL, pgvector lets you add vector search without introducing another database. Same backup procedures, same monitoring, same security model.

Performance is solid for moderate scale. The HNSW indexing improvements in recent versions handle millions of vectors competently.

For teams that prize operational simplicity and don't need the specialized features of dedicated vector databases, pgvector is often enough.

Best for: PostgreSQL-native teams, moderate scale, unified data architecture.

6. MongoDB Atlas Vector Search

The February 2025 acquisition of Voyage AI signals MongoDB's serious commitment to the vector space. For existing MongoDB users, adding vector search without architectural changes is compelling.

In September 2025, MongoDB extended these capabilities to self-managed offerings, not just Atlas. This opens doors for on-premise deployments with strict data residency requirements.

The unified platform - operational data and vectors in one place - simplifies RAG architectures considerably.

Best for: MongoDB shops, unified operational + vector storage needs.

7. Supabase Vector

Supabase wraps pgvector with their excellent developer experience. If you're building full-stack applications with Supabase for auth, storage, and database, vector search integrates seamlessly.

The documentation is clear. The client libraries are polished. For getting a RAG prototype running in an afternoon, it's hard to beat.

Best for: Full-stack developers, rapid prototyping, Supabase ecosystem users.

8. Chroma - The Prototyping King

Chroma owns the local development experience. Two commands to install and run. Python-native API. No cloud accounts required.

For testing embedding models, experimenting with chunking strategies, or building proof-of-concepts, Chroma removes all friction.

Don't mistake this for a production recommendation. Chroma is for learning and prototyping. Scale needs will push you elsewhere.

Best for: Local development, learning RAG, rapid prototyping.

9. FAISS - The Research Foundation

Meta's FAISS is a library, not a database. It provides the building blocks - indexing algorithms, similarity search primitives - that you assemble into your own solution.

Researchers and teams building custom retrieval systems use FAISS directly. Everyone else uses databases that build on similar techniques.

Best for: Research, custom implementations, maximum control.

10. Elasticsearch - The Legacy Upgrade Path

If you're already running Elasticsearch for logging or search, adding vector capabilities to your existing cluster might make sense. The vector search features have matured considerably.

For greenfield RAG projects, purpose-built vector databases will likely outperform. But for organizations with significant ELK investments, the upgrade path is legitimate.

Best for: Existing Elasticsearch deployments, unified observability and search.


For Startups: Start with Qdrant's free tier or Supabase Vector. Graduate to Pinecone when you need enterprise features.

For Enterprise: Pinecone for managed simplicity, MongoDB Atlas Vector if you're already MongoDB, Milvus if you have the team for self-hosting at scale.

For Healthcare/Finance: Qdrant or Milvus self-hosted for data sovereignty. Pinecone with BAA for managed with compliance.

For Learning: Chroma locally, then deploy to Supabase Vector to understand production concerns.


Embeddings Deep Dive: Choosing the Right Model

Embeddings are where retrieval quality is made or lost. Pick wrong, and no amount of database optimization will save you.

Dimension Sizes: What They Actually Mean

Embedding dimensions represent how much semantic information the model captures. More dimensions generally means more nuance, but also more storage and compute.

Dimensions | Storage/Vector | Use Case | Models
384-512 | ~1.5-2 KB | Lightweight, mobile, high volume | MiniLM, E5-small
768 | ~3 KB | Balanced general purpose | BGE-base, BAAI
1024 | ~4 KB | Domain-specific, high quality | Voyage-finance-2, Voyage-law-2
1536 | ~6 KB | High accuracy general | OpenAI text-embedding-3-small
3072 | ~12 KB | Maximum accuracy | OpenAI text-embedding-3-large, Voyage-3-large
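To translate those per-vector numbers into cluster sizing, a rough back-of-envelope calculation helps (raw float32 storage only; index structures and metadata add meaningful overhead on top):

# Rough storage estimate for float32 embeddings, excluding index and metadata overhead
def raw_vector_storage_gb(num_vectors: int, dimensions: int, bytes_per_value: int = 4) -> float:
    return num_vectors * dimensions * bytes_per_value / 1024**3

print(raw_vector_storage_gb(10_000_000, 1536))  # ~57 GB for 10M vectors at 1536 dims
print(raw_vector_storage_gb(10_000_000, 768))   # ~29 GB at 768 dims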

Top Embedding Models for 2026

General Purpose - Commercial

  1. Google Gemini Embedding (gemini-embedding-001) - Currently #1 on MTEB leaderboard. Now GA in Gemini API and Vertex AI.

  2. Voyage AI voyage-3-large - Outperforms OpenAI and Cohere across 100+ datasets by 9-20%. After MongoDB's acquisition, expect deeper integration.

  3. OpenAI text-embedding-3-large - Reliable workhorse. The "ancient" option by AI standards (January 2024) but still solid for general use.

General Purpose - Open Source

  1. Alibaba Qwen3-Embedding - Ranks just behind Gemini on MTEB. Apache 2.0 license means you can run it yourself.

  2. NVIDIA NV-Embed - Fine-tuned from Llama-3.1-8B. Excellent multilingual support, 69.32 MTEB score.

  3. BGE (BAAI) - Strong performance, active development, MIT license.

Domain-Specific Winners

For Finance: Voyage-finance-2 demolished OpenAI on SEC filings (54% vs 38.5% accuracy). The gap widens on pure financial queries (63.75% vs 40%).

For Legal: Harvey AI partnered with Voyage to create voyage-law-2-harvey. It reduces irrelevant results by 25% compared to off-the-shelf alternatives.

For Code: Qodo-Embed-1-1.5B was designed specifically for code retrieval. If you're building code search or documentation systems, evaluate it.

My Recommendations by Industry

Industry | Primary Model | Backup Model | Why
Healthcare | Voyage-3-large | BGE-large | Need accuracy, domain terms matter
Finance | Voyage-finance-2 | Gemini Embedding | Specialized beats general by 15%+
Legal | Voyage-law-2 | Voyage-3-large | Contract language is specific
E-commerce | Gemini Embedding | text-embedding-3-small | Balance of cost and quality
Customer Support | text-embedding-3-small | BGE-base | Volume matters, good enough is fine
Code Search | Qodo-Embed | NV-Embed | Code is different from prose

Practical Advice

Run your own evaluation. The MTEB leaderboard is a starting point, not a destination. What matters is how the model performs on your data.

Create a test set of 100-500 query-document pairs from your actual corpus. Measure MRR (Mean Reciprocal Rank) and NDCG. The model that wins on benchmarks might not win on your specific use case.
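A minimal evaluation harness is only a few dozen lines. The sketch below assumes you have a function that returns ranked document IDs for a query (retrieve_ids is a placeholder for your own pipeline) and one labeled relevant document per query:

import math

def mrr(rankings: list[list[str]], relevant_ids: list[str]) -> float:
    # Mean Reciprocal Rank: average of 1/rank of the first relevant hit per query
    total = 0.0
    for ranking, relevant in zip(rankings, relevant_ids):
        for rank, doc_id in enumerate(ranking, start=1):
            if doc_id == relevant:
                total += 1.0 / rank
                break
    return total / len(rankings)

def ndcg_at_k(ranking: list[str], relevant: str, k: int = 10) -> float:
    # Binary-relevance NDCG@k with a single relevant document (ideal DCG = 1.0)
    for rank, doc_id in enumerate(ranking[:k], start=1):
        if doc_id == relevant:
            return 1.0 / math.log2(rank + 1)
    return 0.0

# Usage (placeholders): rankings = [retrieve_ids(q, k=10) for q in test_queries]
# print(mrr(rankings, gold_ids))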


Chunking Strategies That Actually Work

Chunking is the silent killer of RAG accuracy. Get it wrong and you'll spend weeks debugging retrieval issues that trace back to how you split documents.

The Core Strategies

1. Fixed-Size Chunking - Split every N characters or tokens, regardless of content.

  • Pros: Fast, simple, predictable
  • Cons: Breaks mid-sentence, destroys context
  • Use for: Prototyping only

2. Recursive Character Splitting - Split on paragraphs first, then sentences, then characters as needed.

  • Pros: Preserves structure, LangChain default for good reason
  • Cons: Still content-agnostic
  • Use for: Most RAG applications (start here)

3. Semantic Chunking - Use embeddings to detect topic shifts and split on meaning boundaries.

  • Pros: Up to 70% accuracy improvement, preserves concepts
  • Cons: Higher compute cost, slower ingestion
  • Use for: High-value document corpuses where quality matters

4. Document-Aware Chunking - Respect document structure: headers, sections, code blocks.

  • Pros: Maintains document logic, enables hierarchical retrieval
  • Cons: Requires format-specific parsing
  • Use for: Technical docs, legal contracts, structured content

5. Agentic/LLM-Based Chunking - Let an LLM decide chunk boundaries based on content understanding.

  • Pros: Best quality for complex documents
  • Cons: Expensive, slow
  • Use for: Legal contracts, research papers, compliance docs

Optimal Settings

After extensive testing, here's what works:

Chunk size: 256-512 tokens
Overlap: 10-20% (50-100 tokens for 500-token chunks)

NVIDIA's benchmarks found page-level chunking achieved 0.648 accuracy with lowest variance. But for most applications, recursive splitting at 400-512 tokens delivers 85-90% recall without the overhead.
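As a concrete starting point, here's a token-based recursive splitter at roughly those settings, using LangChain's splitter (a sketch; the file name is a placeholder and the tokenizer matches OpenAI's embedding models):

# Recursive splitting measured in tokens, at the settings recommended above
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",  # tokenizer used by OpenAI embedding models
    chunk_size=512,
    chunk_overlap=50,  # ~10% overlap
)

with open("policy_manual.txt") as f:  # placeholder document
    chunks = splitter.split_text(f.read())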

Code-Specific Chunking

Code needs different treatment. Don't split functions in half.

For Python:

  • Split on top-level functions and classes
  • Include docstrings with their functions
  • Preserve import context

For JavaScript/TypeScript:

  • Respect module boundaries
  • Keep JSDoc comments attached
  • Consider AST-aware splitting

# Code-aware chunking example: split Python source on language-aware separators
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter

code_splitter = RecursiveCharacterTextSplitter.from_language(
    language=Language.PYTHON,
    chunk_size=1000,
    chunk_overlap=100
)

Building RAG Systems: The Framework Decision

The framework wars are tiresome. Here's the practical reality.

LangChain

LangChain is the Swiss Army knife - chains, agents, memory, tools, and integrations with everything. Version 1.0+ brought stable APIs and a modular structure (langchain-core, langchain-community, provider-specific packages).

Use LangChain when:

  • Building complex multi-step workflows
  • Need broad tool integration
  • Want the largest ecosystem of examples and community support

Framework overhead: ~10ms per call, ~2.4k token usage

LlamaIndex

LlamaIndex treats your data as a first-class citizen. Ingestion, chunking, index construction, and query engines are deeply thought out.

Use LlamaIndex when:

  • Your primary bottleneck is retrieval quality
  • Working with messy, unstructured data (PDFs, varied formats)
  • Need hierarchical indexing out of the box

Framework overhead: ~6ms per call, ~1.6k token usage

The Power Move: Use Both

In production for 2026, the winning pattern often combines them:

  1. LlamaIndex as the Data Layer - Ingest PDFs, clean data, build the vector index. LlamaIndex's retriever is genuinely superior for complex documents.

  2. LangChain as the Control Layer - Wrap the LlamaIndex query engine as a LangChain tool. Let a LangGraph agent decide when to call retrieval.

This hybrid approach gives you the best of both worlds.
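A minimal sketch of that wiring, assuming OpenAI models, documents in a local ./documents folder, and LangChain's tool decorator as the bridge:

# LlamaIndex builds and queries the index (data layer);
# the tool wrapper lets a LangChain/LangGraph agent decide when to call it (control layer).
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader
from langchain_core.tools import tool

index = VectorStoreIndex.from_documents(
    SimpleDirectoryReader("./documents").load_data()
)
query_engine = index.as_query_engine(similarity_top_k=5)

@tool
def search_knowledge_base(question: str) -> str:
    """Search the indexed documents and return a synthesized answer."""
    return str(query_engine.query(question))

# Hand search_knowledge_base to your agent, e.g. in the tools list
# of langgraph.prebuilt.create_react_agent(llm, tools=[search_knowledge_base]).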

Direct API Implementation

Sometimes frameworks add more complexity than value. For simple RAG pipelines:

# Direct OpenAI + Qdrant implementation
from openai import OpenAI
from qdrant_client import QdrantClient

client = OpenAI()
qdrant = QdrantClient(url="http://localhost:6333")

def simple_rag(query: str, collection: str = "documents"):
    # 1. Embed the query
    embedding = client.embeddings.create(
        model="text-embedding-3-small",
        input=query
    ).data[0].embedding

    # 2. Search vectors
    results = qdrant.search(
        collection_name=collection,
        query_vector=embedding,
        limit=5
    )

    # 3. Build context
    context = "\n\n".join([r.payload["text"] for r in results])

    # 4. Generate response
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": f"Answer based on this context:\n{context}"},
            {"role": "user", "content": query}
        ]
    )

    return response.choices[0].message.content

No framework overhead. Full control. Sometimes this is exactly what you need.


Agentic RAG: The 2026 Frontier

Traditional RAG is a single retrieval step. Agentic RAG embeds decision-making into the process.

Key Patterns

Self-RAG (ICLR 2024)

The model decides whether to retrieve, when to retrieve, and critiques its own outputs. Instead of always retrieving, it learns to retrieve on demand.

Key innovation: Reflection tokens as part of generation. The model predicts when it's uncertain and needs external knowledge.

Corrective RAG

After retrieval, evaluate whether the retrieved documents actually help. If not, try a different search strategy.

# Corrective RAG pseudocode
def corrective_rag(query):
    docs = retrieve(query)
    relevance = evaluate_relevance(query, docs)

    if relevance < threshold:
        # Try different strategies
        docs = web_search(query)  # or decompose query

    return generate_with_context(query, docs)

Multi-Step Retrieval

Complex questions often need multiple retrieval passes. First retrieve background context, then retrieve specific details.
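One way to sketch this: use the first pass to ground a follow-up query, then retrieve again. The snippet below assumes LangChain-style vector store and chat model interfaces and is illustrative rather than production-ready.

# Two-pass retrieval: broad background first, then targeted details
def multi_step_retrieve(query: str, vector_store, llm, k_background: int = 3, k_details: int = 5):
    background = vector_store.similarity_search(query, k=k_background)
    followup = llm.invoke(
        "Given this background, write one specific search query that would "
        f"find the details needed to answer: {query}\n\n"
        + "\n\n".join(doc.page_content for doc in background)
    )
    details = vector_store.similarity_search(followup.content, k=k_details)
    return background + details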

Hybrid Search

Combine dense vectors (semantic similarity) with sparse vectors (keyword matching). When a user searches for "HIPAA compliance requirements," you want both semantic understanding of compliance concepts AND exact keyword matching on "HIPAA."

# Hybrid search with Weaviate
results = client.query.get(
    "Document",
    ["text", "title"]
).with_hybrid(
    query="HIPAA compliance requirements",
    alpha=0.5  # Balance between vector and keyword
).with_limit(10).do()

Reranking

Initial retrieval is fast but imprecise. A reranker (like Cohere Rerank or cross-encoder models) scores the top-N results more carefully.

# Reranking pipeline
initial_results = vector_search(query, limit=50)  # Fast, recall-oriented
reranked = reranker.rerank(query, initial_results, top_n=5)  # Slow, precision-oriented
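Continuing the two-stage pipeline above with a hosted reranker like Cohere Rerank, the precision stage might look like this sketch (model name and response fields follow Cohere's Python SDK at the time of writing; verify against current docs):

# Rerank the top-50 vector hits down to the 5 most relevant passages
import cohere

co = cohere.Client()  # reads the API key from the environment
docs = [r.payload["text"] for r in initial_results]  # assumes Qdrant-style results

reranked = co.rerank(
    model="rerank-english-v3.0",
    query=query,
    documents=docs,
    top_n=5,
)
top_passages = [docs[item.index] for item in reranked.results]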

Industry-Specific RAG Implementations

Healthcare

The Challenge: HIPAA compliance, sensitive patient data, life-critical accuracy.

Recommended Stack:

  • Vector DB: Qdrant or Milvus (self-hosted for data sovereignty)
  • Embeddings: Voyage-3-large (accuracy critical)
  • Framework: LlamaIndex for medical document parsing

Critical Requirements:

  1. All data must be encrypted at rest and in transit
  2. Business Associate Agreements (BAAs) required for any cloud components
  3. Audit trails for every query and response
  4. De-identify documents before embedding where possible

Architecture Pattern:

Patient Query -> On-Premise RAG -> Local LLM
                                   (no PHI leaves network)

Insurance

The Challenge: Complex policy documents, claims processing speed, fraud detection.

Recommended Stack:

  • Vector DB: MongoDB Atlas Vector (unified with operational data)
  • Embeddings: Voyage-finance-2 (handles financial/insurance terminology)
  • Framework: LangGraph for multi-step claims workflows

Use Case Example: Auto insurance claims RAG reviews accident photos, police reports, and repair estimates while checking policy coverage and precedent cases. Output: coverage determination, estimated payout, fraud indicators.

McKinsey reports 30% processing time reduction and 20% cost savings with smart document systems.

Legal

The Challenge: Precision requirements, case law complexity, contract nuance.

Recommended Stack:

  • Vector DB: Pinecone or Qdrant (filtering by jurisdiction, date, court)
  • Embeddings: Voyage-law-2 or voyage-law-2-harvey
  • Framework: LlamaIndex (PDF parsing excellence)

Research Impact: Legal teams report research time dropping from 3 hours to 20 minutes per matter. The caveat: attorneys remain responsible for verification. RAG assists but doesn't replace professional judgment.

Contract analysis use case: Identify key clauses, potential risks, and inconsistencies across multiple documents. Compare terms against standard clauses, flag deviations for review.

Financial Services

The Challenge: Real-time requirements, regulatory compliance, market data integration.

Recommended Stack:

  • Vector DB: Pinecone (latency) or MongoDB Atlas (operational integration)
  • Embeddings: Voyage-finance-2 (15%+ accuracy improvement on financial text)
  • Framework: Direct implementation for latency-critical paths

Key Consideration: Many financial queries need current market data alongside historical knowledge. RAG architecture should support dynamic context injection for real-time feeds.
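As an illustration of dynamic context injection, the sketch below merges retrieved documents with a live market snapshot at query time. get_market_snapshot and vector_search are placeholder stubs standing in for your market data feed and retriever.

# Hypothetical sketch: inject a real-time feed alongside retrieved knowledge
from openai import OpenAI

client = OpenAI()

def get_market_snapshot(tickers):
    # Placeholder: replace with your real-time market data provider
    return {t: "<live quote>" for t in tickers}

def vector_search(query: str, limit: int = 5) -> list[str]:
    # Placeholder: replace with your vector database query
    return ["<retrieved policy or research text>"]

def answer_with_live_context(query: str) -> str:
    docs = vector_search(query)
    snapshot = get_market_snapshot(["SPY", "TLT"])
    context = "\n\n".join(docs) + f"\n\nLive market data: {snapshot}"
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        messages=[
            {"role": "system", "content": f"Answer using this context:\n{context}"},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content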


Complete Code Examples

Python: Production RAG Pipeline

"""
Production RAG Pipeline with LlamaIndex + Qdrant
Includes: chunking, embedding, indexing, querying
"""

from llama_index.core import (
    VectorStoreIndex,
    SimpleDirectoryReader,
    Settings,
    StorageContext
)
from llama_index.vector_stores.qdrant import QdrantVectorStore
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.llms.openai import OpenAI
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams

# Configuration
QDRANT_URL = "http://localhost:6333"
COLLECTION_NAME = "documents"
EMBEDDING_MODEL = "text-embedding-3-small"
LLM_MODEL = "gpt-4-turbo"

def setup_qdrant():
    """Initialize Qdrant client and collection"""
    client = QdrantClient(url=QDRANT_URL)

    # Create collection if it doesn't exist
    collections = client.get_collections().collections
    if not any(c.name == COLLECTION_NAME for c in collections):
        client.create_collection(
            collection_name=COLLECTION_NAME,
            vectors_config=VectorParams(
                size=1536,  # text-embedding-3-small dimensions
                distance=Distance.COSINE
            )
        )

    return client

def create_index(documents_path: str):
    """Create vector index from documents"""

    # Setup Qdrant
    qdrant_client = setup_qdrant()
    vector_store = QdrantVectorStore(
        client=qdrant_client,
        collection_name=COLLECTION_NAME
    )

    # Configure LlamaIndex settings
    Settings.embed_model = OpenAIEmbedding(model=EMBEDDING_MODEL)
    Settings.llm = OpenAI(model=LLM_MODEL)
    Settings.chunk_size = 512
    Settings.chunk_overlap = 50

    # Load and index documents
    documents = SimpleDirectoryReader(documents_path).load_data()
    storage_context = StorageContext.from_defaults(vector_store=vector_store)

    index = VectorStoreIndex.from_documents(
        documents,
        storage_context=storage_context,
        show_progress=True
    )

    return index

def query_rag(query: str, index: VectorStoreIndex):
    """Query the RAG system"""
    query_engine = index.as_query_engine(
        similarity_top_k=5,
        response_mode="tree_summarize"
    )

    response = query_engine.query(query)
    return response

# Usage
if __name__ == "__main__":
    # First time: create index
    index = create_index("./documents")

    # Query
    response = query_rag("What are the key compliance requirements?", index)
    print(response)

TypeScript: Next.js RAG API Route

/**
 * RAG API Route for Next.js
 * Uses the OpenAI and Pinecone SDKs
 */

import { NextRequest, NextResponse } from 'next/server';
import { Pinecone } from '@pinecone-database/pinecone';
import OpenAI from 'openai';

const openai = new OpenAI();
const pinecone = new Pinecone();

interface RAGRequest {
  query: string;
  namespace?: string;
  topK?: number;
}

export async function POST(request: NextRequest) {
  try {
    const { query, namespace = 'default', topK = 5 }: RAGRequest =
      await request.json();

    // 1. Generate embedding for query
    const embeddingResponse = await openai.embeddings.create({
      model: 'text-embedding-3-small',
      input: query,
    });
    const queryEmbedding = embeddingResponse.data[0].embedding;

    // 2. Search Pinecone
    const index = pinecone.Index(process.env.PINECONE_INDEX!);
    const searchResults = await index.namespace(namespace).query({
      vector: queryEmbedding,
      topK,
      includeMetadata: true,
    });

    // 3. Build context from results
    const context = searchResults.matches
      .map((match) => match.metadata?.text || '')
      .join('\n\n---\n\n');

    // 4. Generate response with context
    const completion = await openai.chat.completions.create({
      model: 'gpt-4-turbo',
      messages: [
        {
          role: 'system',
          content: `You are a helpful assistant. Answer questions based on the provided context. If the context doesn't contain relevant information, say so.

Context:
${context}`,
        },
        {
          role: 'user',
          content: query,
        },
      ],
      temperature: 0.7,
      max_tokens: 1000,
    });

    // 5. Return response with sources
    return NextResponse.json({
      answer: completion.choices[0].message.content,
      sources: searchResults.matches.map((match) => ({
        id: match.id,
        score: match.score,
        title: match.metadata?.title,
        url: match.metadata?.url,
      })),
    });
  } catch (error) {
    console.error('RAG Error:', error);
    return NextResponse.json(
      { error: 'Failed to process query' },
      { status: 500 }
    );
  }
}

Python: Agentic RAG with LangGraph

"""
Agentic RAG with LangGraph
Implements: query decomposition, corrective retrieval, self-reflection
"""

from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Qdrant
from langchain.prompts import ChatPromptTemplate
from typing import TypedDict, List, Optional
from qdrant_client import QdrantClient

# State definition
class RAGState(TypedDict):
    query: str
    sub_queries: Optional[List[str]]
    retrieved_docs: List[str]
    relevance_scores: List[float]
    needs_correction: bool
    response: Optional[str]

# Initialize components
llm = ChatOpenAI(model="gpt-4-turbo", temperature=0)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
qdrant = Qdrant(
    client=QdrantClient(url="http://localhost:6333"),
    collection_name="documents",
    embeddings=embeddings
)

def decompose_query(state: RAGState) -> RAGState:
    """Break complex queries into sub-queries"""
    prompt = ChatPromptTemplate.from_template(
        """Break this query into 2-3 simpler sub-queries that together answer the original.
        If the query is already simple, return just the original query.

        Query: {query}

        Return as a JSON list of strings."""
    )

    response = llm.invoke(prompt.format(query=state["query"]))
    # Parse the model output into a list of sub-queries,
    # falling back to the original query if parsing fails
    import json
    try:
        sub_queries = json.loads(response.content)
    except json.JSONDecodeError:
        sub_queries = [state["query"]]

    state["sub_queries"] = sub_queries
    return state

def retrieve_documents(state: RAGState) -> RAGState:
    """Retrieve documents for each sub-query"""
    all_docs = []

    for sub_query in state["sub_queries"]:
        docs = qdrant.similarity_search_with_score(sub_query, k=3)
        all_docs.extend(docs)

    # Deduplicate and keep top results
    seen = set()
    unique_docs = []
    scores = []

    for doc, score in sorted(all_docs, key=lambda x: x[1], reverse=True):
        content = doc.page_content
        if content not in seen:
            seen.add(content)
            unique_docs.append(content)
            scores.append(score)

    state["retrieved_docs"] = unique_docs[:5]
    state["relevance_scores"] = scores[:5]
    return state

def check_relevance(state: RAGState) -> RAGState:
    """Determine if retrieved documents are relevant enough"""
    scores = state["relevance_scores"]
    avg_score = sum(scores) / len(scores) if scores else 0.0
    state["needs_correction"] = avg_score < 0.7
    return state

def correct_retrieval(state: RAGState) -> RAGState:
    """Try alternative retrieval strategies"""
    # Rewrite query for better retrieval
    prompt = ChatPromptTemplate.from_template(
        """The search for "{query}" didn't find relevant results.
        Rewrite this query to find better matches. Be more specific or use synonyms.

        Return only the rewritten query."""
    )

    rewritten = llm.invoke(prompt.format(query=state["query"]))

    # Try again with rewritten query
    docs = qdrant.similarity_search_with_score(rewritten.content, k=5)
    state["retrieved_docs"] = [doc.page_content for doc, _ in docs]
    state["relevance_scores"] = [score for _, score in docs]

    return state

def generate_response(state: RAGState) -> RAGState:
    """Generate final response from context"""
    context = "\n\n".join(state["retrieved_docs"])

    prompt = ChatPromptTemplate.from_template(
        """Answer the question based on the provided context.
        If the context doesn't contain enough information, acknowledge the limitation.

        Context:
        {context}

        Question: {query}

        Answer:"""
    )

    response = llm.invoke(prompt.format(context=context, query=state["query"]))
    state["response"] = response.content
    return state

def should_correct(state: RAGState) -> str:
    """Routing function for correction path"""
    return "correct" if state["needs_correction"] else "generate"

# Build the graph
workflow = StateGraph(RAGState)

# Add nodes
workflow.add_node("decompose", decompose_query)
workflow.add_node("retrieve", retrieve_documents)
workflow.add_node("check", check_relevance)
workflow.add_node("correct", correct_retrieval)
workflow.add_node("generate", generate_response)

# Add edges
workflow.set_entry_point("decompose")
workflow.add_edge("decompose", "retrieve")
workflow.add_edge("retrieve", "check")
workflow.add_conditional_edges(
    "check",
    should_correct,
    {
        "correct": "correct",
        "generate": "generate"
    }
)
workflow.add_edge("correct", "generate")
workflow.add_edge("generate", END)

# Compile
app = workflow.compile()

# Usage
def agentic_rag(query: str) -> str:
    result = app.invoke({"query": query})
    return result["response"]

Open Source vs Closed Source: The Real Tradeoffs

Cost Analysis

Self-hosting costs more than most teams expect:

Scenario | Pinecone (Managed) | Qdrant (Self-Hosted)
1M vectors, low QPS | ~$70/mo | ~$50/mo (t3.medium) + ops time
10M vectors, medium QPS | ~$450/mo | ~$200/mo (r5.large) + ops time
100M vectors, high QPS | Custom pricing | ~$1000/mo (cluster) + dedicated ops

The hidden cost is engineering time. A production vector database needs monitoring, backup procedures, scaling decisions, and incident response. If you don't have dedicated ops capacity, the "free" open source option gets expensive fast.

Data Sovereignty

For healthcare, finance, and government, data location matters. Open source wins here - you can run Qdrant or Milvus in your own VPC, on-premise, or in specific geographic regions.

Pinecone now offers dedicated deployments for enterprise customers. MongoDB Atlas and Weaviate Cloud provide region selection. But for maximum control, self-hosting remains the only option.

Customization

Open source lets you modify indexing strategies, add custom distance metrics, or integrate directly with your infrastructure. When the default behavior doesn't fit, you can change it.

Managed services optimize for the common case. If you're in that common case, great. If you're not, you'll hit walls.

Scaling Reality

Milvus handles billion-scale vectors that would crush simpler solutions. But it requires serious infrastructure knowledge. Pinecone scales effortlessly - for a price.

My recommendation: start managed, move to self-hosted when you hit either cost or capability limits that justify the operational overhead.


Where RAG Goes From Here

RAG in 2026 isn't the same pattern we knew in 2023. It's evolving into a "Context Engine" - an intelligent system that understands not just what to retrieve, but when, how, and whether retrieval even helps.

What's coming:

  1. Reasoning-integrated retrieval - Models like DeepSeek-R1 that think before they retrieve, using reasoning to determine what information they actually need.

  2. Multi-modal RAG - Retrieving and reasoning over images, documents, and structured data together.

  3. Governed context - Enterprise systems where retrieval is auditable, explainable, and constrained by access controls.

  4. Agentic workflows - RAG as one tool among many that an agent orchestrates to accomplish complex tasks.

The teams winning in 2026 aren't just implementing RAG. They're thinking about information architecture - how knowledge flows through their systems, how it stays current, and how AI accesses it intelligently.

Start simple. Add complexity when the data demands it. And remember that the best RAG system is the one that solves your actual problem, not the one with the most sophisticated architecture diagram.


Quick Reference: My 2026 Recommendations

If you're just starting: Chroma locally, Supabase Vector for first deployment, text-embedding-3-small.

If you're scaling: Qdrant or Pinecone, Voyage-3-large or domain-specific Voyage models, LlamaIndex for data layer.

If you're enterprise: Pinecone or self-hosted Milvus, comprehensive evaluation of embedding models on your data, hybrid LlamaIndex + LangGraph architecture.

If you're regulated industry: Self-hosted Qdrant/Milvus, domain-specific embeddings, strict data governance, comprehensive audit trails.

The landscape will keep changing. The principles - understand your data, measure what matters, start simple, scale when needed - won't.


The SolvedByCode team builds RAG systems at SolvedByCode.ai while documenting the journey from traditional development to AI-native coding. This guide represents hundreds of hours of production experience, failures, and hard-won insights.

