For the past three years, building an AI application that could answer questions about your data required assembling a multi-layered infrastructure: document processors, chunking pipelines, embedding models, vector databases, retrieval systems, rerankers, and orchestration logic to tie it all together. This was the RAG stack — Retrieval Augmented Generation — and it became the default architecture for any LLM application that needed to work with private or current data. By 2025, enterprise RAG deployments had grown 280% year-over-year, and roughly 60% of production LLM applications were using some form of retrieval-augmented generation.
Then context windows expanded. In 2023, most models topped out at 4,000–8,000 tokens. By early 2025, Gemini 1.5 Pro offered 2 million tokens. GPT-4.1 reached 1 million. Claude extended to 200,000 tokens standard, with 1 million in beta. Meta’s Llama 4 pushed to 10 million tokens. And Magic’s experimental LTM-2-Mini demonstrated a 100-million-token context window — enough for 10 million lines of code or roughly 750 novels.
A radical architectural option emerged from this expansion: skip the retrieval stack entirely. Load your documents directly into the prompt. Let the model reason over the complete text.
This is the “no-stack stack” — an AI architecture defined not by what it includes, but by what it eliminates. And for a surprising range of use cases, it is not just simpler — it performs better.
What the RAG Stack Actually Looks Like
Before appreciating what the no-stack stack eliminates, it is worth inventorying what a production RAG system requires. Each layer introduces decisions, failure modes, latency, and maintenance overhead.
The Ingestion Layer
- Document parsers — PDF extractors, HTML scrapers, code file readers, each with their own edge cases and failure modes
- Chunking logic — Splitting documents into retrievable units, with decisions about chunk size, overlap, and boundary detection
- Preprocessing — Text cleaning, metadata extraction, deduplication, language detection
The Embedding Layer
- Embedding model — Selected, hosted, and maintained (or accessed via API)
- Batch processing — Embedding thousands or millions of chunks, managing throughput and rate limits
- Model versioning — When you update embedding models, all existing vectors become incompatible and must be regenerated
The Storage Layer
- Vector database — Deployed, configured, scaled, backed up, and monitored (Pinecone, Weaviate, Chroma, pgvector, or others)
- Index management — Creating and maintaining search indices for efficient retrieval
- Metadata storage — Storing source information, timestamps, and access controls alongside vectors
The Retrieval Layer
- Query embedding — Converting user questions into vectors at query time
- Similarity search — Finding relevant chunks, with tuning for top-k, similarity threshold, and distance metrics
- Reranking — Optional but often necessary second pass to improve relevance
- Result assembly — Combining retrieved chunks into a coherent context for the LLM
The Synchronization Layer
- Change detection — Monitoring source documents for updates
- Re-indexing — Chunking, embedding, and storing updated content
- Stale data management — Removing vectors for deleted documents
- Consistency guarantees — Ensuring the index reflects the current state of source data
The Orchestration Layer
- Pipeline management — Coordinating all components in the correct sequence
- Error handling — Managing failures at any point in the pipeline
- Monitoring — Tracking latency, accuracy, and system health across components
- Configuration — Managing parameters across all layers
This is substantial infrastructure. The vector database category alone attracted over $800 million in venture investment across 2025 — Pinecone reported 340% year-over-year revenue growth, and Weaviate closed a $163 million Series C round. The tooling exists because the problem is real. But for many use cases, this machinery is solving a problem that longer context windows have already dissolved.
What the No-Stack Stack Looks Like
Remove all of the above. Replace it with:
- Load documents into the prompt
- Ask the model your question
- Get the answer
That is it. No embeddings. No vectors. No retrieval logic. No synchronization. No reranking. No chunk boundary problems. The model receives the complete documents and reasons over them directly.
The engineering effort shifts from building and maintaining infrastructure to crafting effective prompts and managing context efficiently — skills that are more accessible to a broader range of developers. There is no vector database to provision, no embedding model to version, and no chunking strategy to debug when answers come back wrong.
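The three steps above can be sketched in a few lines. This is a hypothetical example: `build_prompt` and the commented-out `call_llm` are illustrative names, not a specific provider's API, and the delimiter format is just one reasonable convention.

```python
from pathlib import Path

def build_prompt(doc_paths, question):
    """Concatenate complete documents into one prompt: no chunking, no retrieval."""
    sections = []
    for path in doc_paths:
        text = Path(path).read_text(encoding="utf-8")
        sections.append(f"=== DOCUMENT: {path} ===\n{text}\n=== END DOCUMENT ===")
    docs = "\n\n".join(sections)
    return f"{docs}\n\nAnswer using only the documents above.\n\nQuestion: {question}"

# The assembled prompt then goes to whichever long-context model you use, e.g.:
# answer = call_llm(model="your-long-context-model", prompt=prompt)
```

The entire "pipeline" is string concatenation; everything else is the model's job.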
The Evidence: When Long Context Outperforms Retrieval
This is not just a theoretical simplification. A January 2025 study published on arXiv (“Long Context vs. RAG for LLMs: An Evaluation and Revisits”) found that long context generally outperforms RAG in question-answering benchmarks, especially for knowledge-intensive queries. Chunk-based retrieval — the most common RAG implementation — consistently lagged behind full-context approaches.
Google’s own evaluations of Gemini 1.5 Pro demonstrated greater than 99.7% recall on needle-in-a-haystack tests across up to 1 million tokens, and maintained above 99% recall extending to 10 million tokens for text modality. These tests are simple probes, but they directly measure the model’s ability to locate and use specific information within massive contexts.
However, the same research found that RAG maintains advantages for dialogue-based queries and general question-answering, and that summarization-based retrieval performs comparably to long context. The choice is not binary. It depends on the model, the task type, and the implementation details.
Five Use Cases Where the No-Stack Stack Wins
1. Document Analysis and Summarization
Analyzing a legal contract, summarizing a research paper, extracting key findings from a report — these tasks require the model to understand the complete document structure. Chunking a 30-page contract and retrieving pieces defeats the purpose. Loading the entire contract (roughly 30,000 tokens) into the context window lets the model reason across the full text, capturing cross-references, conditional clauses, and structural dependencies that chunking destroys.
2. Code Review and Analysis
Reviewing a pull request, analyzing a codebase for security vulnerabilities, or understanding how a feature is implemented across multiple files: these tasks benefit from seeing the complete code in context rather than retrieving isolated functions. A code review of 15 files totaling 50,000 tokens fits comfortably in any modern context window and produces better analysis than retrieving snippets, because the model can trace dependencies across files.
3. Comparative Analysis
Comparing two versions of a document, evaluating competing proposals, or analyzing differences between product specifications. These tasks fundamentally require the model to hold multiple complete documents simultaneously and reason across them. RAG is not designed for comparison — it is designed for retrieval. Loading both documents into context lets the model perform genuine comparative analysis.
4. Meeting and Communication Processing
Analyzing a meeting transcript, summarizing an email thread, or extracting action items from a conversation. These are bounded, sequential documents where the order and context of every statement matters. Chunking a meeting transcript destroys the conversational flow and temporal context. Long context preserves it, producing summaries that accurately capture the discussion arc, speaker dynamics, and the evolution of decisions.
5. Personal Knowledge Management
An individual developer’s notes, a small team’s project documentation, a researcher’s collection of papers — data sets that are important but bounded. A developer with 200 pages of personal notes has roughly 100,000 tokens of text. That fits comfortably within current context windows and does not justify the overhead of deploying and maintaining a retrieval pipeline.
When the No-Stack Stack Breaks Down
The Scale Wall
The no-stack stack has a hard limit: the context window. When your data exceeds what the window can hold, you are forced to either truncate (losing information) or add retrieval (adding the stack back). For a company with 10,000 pages of documentation, even a 2-million-token context window is insufficient. You need a way to filter to the relevant pages before loading them into context.
The Cost Ceiling
At high query volumes, processing large contexts per query becomes expensive. The economics are stark: RAG query costs average roughly $0.00008 per query, while long-context queries average around $0.10 — making RAG approximately 1,250 times cheaper per query. A customer support system handling thousands of queries per day against a product manual cannot afford to load the entire manual for each query, even with prompt caching reducing costs by 50-90%.
For context on raw pricing: Gemini 2.5 Pro charges $1.25 per million input tokens, GPT-5.2 charges $1.75 per million, and Claude Opus 4.5 charges $5.00 per million. These costs multiply quickly when you are processing hundreds of thousands of tokens per request at scale.
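Using the per-token prices quoted above, a back-of-envelope calculation shows how quickly costs accumulate. The function below is illustrative arithmetic, not a billing tool; the prices and cache-savings figure are taken from this article and should be checked against current provider pricing.

```python
def daily_input_cost(context_tokens, price_per_million, queries_per_day,
                     cached_discount=0.0):
    """Rough daily input-token cost for a long-context workload."""
    per_query = context_tokens / 1_000_000 * price_per_million * (1 - cached_discount)
    return per_query * queries_per_day

# A 300k-token manual at $1.25 per million input tokens, 1,000 queries/day:
uncached = daily_input_cost(300_000, 1.25, 1_000)        # ≈ $375/day
cached = daily_input_cost(300_000, 1.25, 1_000, 0.90)    # ≈ $37.50/day at 90% savings
```

Even with aggressive caching, loading the whole manual per query costs orders of magnitude more than a retrieval-based approach at this volume.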
The Latency Floor
Processing 500,000 tokens takes time. For interactive applications where sub-second response times matter, the latency of ingesting a large context window is a dealbreaker. RAG pipelines can average around 1 second for end-to-end queries, while equivalent long-context configurations may take 30 to 60 seconds on the same workload.
The Lost-in-the-Middle Problem
Research by Liu et al., published in Transactions of the Association for Computational Linguistics in 2024, documented a persistent challenge: LLMs perform best when relevant information appears at the beginning or end of the context, but accuracy drops 10 to 20 percentage points when critical information sits in the middle of long contexts. While models are improving — Gemini 1.5 Pro achieves near-perfect recall on single-needle tests — the problem persists for multi-needle retrieval and complex reasoning tasks. Real-world effective context capacity typically runs at 60-70% of advertised limits.
For applications where finding one specific fact in a large corpus matters — compliance checking, medical information retrieval, legal research across thousands of documents — RAG’s targeted retrieval still produces more reliable results than hoping the model will attend to the right section of a massive context.
The Precision Requirement
Some applications demand not just the right answer, but provenance — which document, which page, which paragraph the answer came from. RAG’s retrieval step naturally provides source attribution. With long context, extracting precise citations from a 500,000-token input requires additional prompting and is less reliable.
Building the No-Stack Stack Effectively
For teams adopting this approach, several practices improve reliability and output quality.
Context Organization
Structure matters even without retrieval. Organize documents in the context with clear headers, section markers, and metadata. The model navigates structured context more effectively than a raw text dump.
```
=== DOCUMENT: Q4 Financial Report ===
Date: January 2026
Type: Quarterly Report
[document content]
=== END DOCUMENT ===
```
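A small helper can apply this delimiter convention consistently. The function is a sketch of one way to do it, assuming the marker format shown above; the metadata fields are arbitrary key-value pairs.

```python
def wrap_document(title, content, **metadata):
    """Wrap a document in begin/end markers with a metadata header so the
    model can tell where each source starts and stops."""
    meta_lines = "\n".join(
        f"{key.replace('_', ' ').title()}: {value}" for key, value in metadata.items()
    )
    body = f"{meta_lines}\n\n{content}" if meta_lines else content
    return f"=== DOCUMENT: {title} ===\n{body}\n=== END DOCUMENT ==="

context = wrap_document(
    "Q4 Financial Report", "[document content]",
    date="January 2026", type="Quarterly Report",
)
```

Joining the wrapped documents with blank lines between them produces a context the model can navigate by title.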
Selective Loading
Not every document needs to be in every query. Build simple logic to select which documents are relevant to the current question — not a full RAG pipeline, but lightweight filtering based on keywords, document type, or recency. This is the minimal retrieval that keeps you below the context ceiling without requiring vector infrastructure.
Context Budgeting
Monitor how much of the context window you are using. Leave room for the model’s response and for any chain-of-thought reasoning. A context window filled to 95% capacity leaves no room for the model to think. Given that effective capacity runs at 60-70% of advertised limits, plan accordingly.
Prompt Caching
For repeated queries against the same documents, prompt caching is essential. Anthropic’s implementation reduces costs by up to 90% and latency by up to 85% for long prompts. OpenAI offers automatic caching with 50% cost savings, enabled by default for prompts of 1,024 tokens or longer. Google provides manual cache setup with a default one-hour lifespan. Across all providers, cached input tokens cost roughly a tenth as much as regular input tokens. For stable document sets queried repeatedly, caching transforms the economics of long-context approaches.
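With Anthropic's Messages API, for example, caching is opted into by marking the stable document block as cacheable. The payload below is a sketch based on Anthropic's prompt-caching documentation; the model name is a placeholder, and field names should be verified against the current API reference before use.

```python
def cached_request(document_text, question, model="your-model-id"):
    """Build a Messages API payload where the large, stable document block
    is marked cacheable, so repeat queries reuse the cached prefix."""
    return {
        "model": model,
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": document_text,
                # Stable prefix: requests sharing this block hit the cache.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # Only the question varies between requests.
        "messages": [{"role": "user", "content": question}],
    }
```

The key design point is ordering: the cached portion must be a stable prefix, so documents go first and the per-query question goes last.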
Graceful Degradation
Design systems that can fall back to retrieval when the data exceeds the context window. Starting with the no-stack stack does not mean you can never add retrieval — but building with the simpler approach first means you add complexity only when the use case demands it. This progressive architecture lets teams ship faster and add infrastructure only when they hit a concrete wall.
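The fallback logic can live in one function. This is a self-contained sketch: `estimate_tokens` and `keyword_filter` are simplified stand-ins for whatever estimation and filtering you actually use, and the 0.65 effective-capacity ratio is the assumption discussed above.

```python
def estimate_tokens(docs):
    """Rough chars-per-token heuristic; swap in a real tokenizer in practice."""
    return sum(len(d) for d in docs) // 4

def keyword_filter(docs, question):
    """Minimal fallback filter: keep documents sharing any word with the question."""
    terms = set(question.lower().split())
    return [d for d in docs if terms & set(d.lower().split())]

def choose_context(docs, question, window_tokens=200_000):
    """No-stack path while the corpus fits the effective window; fall back
    to lightweight filtering only when it overflows."""
    if estimate_tokens(docs) <= int(window_tokens * 0.65):
        return docs                      # load everything
    return keyword_filter(docs, question)  # degrade gracefully
```

Because the fallback is behind a single branch, a growing corpus changes behavior without changing the calling code.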
The Hybrid Future
The industry is converging on a pragmatic middle ground. The most effective production architectures in 2026 use retrieval to identify relevant context, then feed that retrieved context into long context windows for reasoning. This hybrid approach captures the precision of RAG with the reasoning depth of long context.
Think of it as a spectrum rather than a binary choice:
- Pure no-stack — Documents fit in context, bounded use case, moderate query volume
- Lightweight filtering + long context — Simple keyword or metadata filtering narrows documents before loading
- Hybrid RAG + long context — Vector retrieval identifies relevant chunks, long context window reasons across them
- Full RAG stack — Enterprise scale, high volume, precision-critical applications
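The spectrum above can be expressed as a rough decision rule. The thresholds here are illustrative assumptions, not benchmarks; the 65% effective-capacity ratio echoes the figure cited earlier, and the other cutoffs are placeholders to tune for your workload.

```python
def pick_architecture(corpus_tokens, queries_per_day, needs_citations,
                      window_tokens=1_000_000):
    """Map corpus size, query volume, and provenance needs onto the
    four points of the spectrum. Thresholds are illustrative."""
    effective = int(window_tokens * 0.65)
    if needs_citations or corpus_tokens > 50 * effective:
        return "full RAG stack"
    if corpus_tokens > effective:
        return "hybrid RAG + long context"
    if queries_per_day > 10_000:
        return "lightweight filtering + long context"
    return "pure no-stack"
```

The ordering encodes the article's argument: precision and scale force the stack back in, while everything else defaults to simplicity.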
The no-stack stack is not anti-engineering. It is a recognition that unnecessary infrastructure has costs — maintenance overhead, debugging complexity, failure modes, and cognitive load. Every component should earn its place by solving a problem that simpler approaches cannot handle.
Conclusion
The no-stack stack is the right starting point for most AI applications that work with bounded document sets. Load the documents. Ask the question. Get the answer. Add infrastructure only when you hit a wall — scale, cost, latency, or precision — that the simple approach cannot clear.
Enterprise-scale knowledge bases, high-volume production systems, and precision-critical applications still need retrieval infrastructure. The $800 million flowing into vector database companies in 2025 reflects real enterprise demand. But for the vast number of AI applications being built by small teams, startups, and individual developers — the default should be simplicity, not complexity. Start with the fewest moving parts. Add machinery only when the data or the workload forces your hand.
FAQ
Is RAG dead now that context windows have reached millions of tokens?
No. RAG deployments grew 280% in 2025 and remain essential for enterprise-scale applications. Long context windows handle bounded document sets well, but when your data exceeds the context window, when you need sub-second latency, or when query volumes make per-token costs prohibitive, retrieval infrastructure is still necessary. The two approaches are complementary, not competitive.
How much does it cost to use long context windows compared to RAG?
The cost difference is significant at scale. Long-context queries average around $0.10 per query, while RAG queries average roughly $0.00008 — making RAG approximately 1,250 times cheaper per query. However, prompt caching can reduce long-context costs by 50-90%, and for low-volume use cases the simplicity savings in engineering time often outweigh the per-query cost difference.
What is the “lost in the middle” problem and does it affect the no-stack stack?
Research published in 2024 by Liu et al. found that LLMs perform best when relevant information appears at the beginning or end of long contexts, with accuracy dropping 10-20 percentage points for information positioned in the middle. This affects any long-context approach. Mitigation strategies include structuring documents with clear section markers, placing the most important content at the beginning or end, and using context budgeting to avoid filling the window to capacity.
Sources & Further Reading
- Long Context vs. RAG for LLMs: An Evaluation and Revisits — arXiv (January 2025)
- Lost in the Middle: How Language Models Use Long Contexts — Liu et al., TACL 2024
- 100M Token Context Windows — Magic
- Prompt Caching with OpenAI, Anthropic, and Google Models — PromptHub
- RAG vs Long Context: Do Vector Databases Still Matter in 2026? — MarkAICode
- The Needle in the Haystack Test and How Gemini Pro Solves It — Google Cloud Blog
- Context Length Comparison: Leading AI Models in 2026 — Elvex
- RAG and Long-Context Windows: Why You Need Both — Google Cloud Community