LLM Context Limits: What Engineers Actually Found

Developers share what LLM context window limits mean in production, including surprises, workarounds, and which tasks break first.

Sam McKay 20 June 2026

What practitioners expected vs what shipped

The pitch was simple. Bigger context, fewer hacks. By mid-2024 the vendor slides were stacking up: 100K, 200K, then 1M and 2M token windows across the major providers. The early reaction on r/LocalLLaMA and Hacker News was mostly “finally,” followed quickly by a long thread of “but actually.”

Practitioners who wired these windows into production pipelines kept reporting a familiar gap between nominal capacity and usable capacity. A 200K window was billed as “fits an entire codebase.” What teams found in practice was that the model could fit the tokens, but the quality of attention across them was uneven. The lost-in-the-middle effect, first documented in academic work in 2023 and replicated in countless community benchmarks since, kept showing up in anecdotal reports. Developers on r/LocalLLaMA repeatedly described placing critical instructions at the top and bottom of a long prompt, then watching the model miss details buried in the middle of the same input.

The 1M token window from Gemini 1.5 Pro triggered similar posts. Several HN commenters ran their own needle-in-a-haystack style tests and found that raw retrieval stayed high on shorter prompts, but cross-document reasoning dropped noticeably once inputs crossed roughly 100-200K tokens. The community consensus, distilled from dozens of threads I read while preparing this piece, was that vendor benchmarks and real workloads were measuring different things.

Where large contexts actually deliver

The wins are real for specific tasks, and they are worth naming.

Summarization of long documents, where the goal is to capture overall themes rather than pinpoint individual facts, is a clear fit. Practitioners writing on blogs and in YouTube walkthroughs reported processing 60-90K-token contract bundles or research paper collections and getting useful thematic summaries on the first pass. Latency for Sonnet-class models on inputs around 80K tokens landed in the 6-12 second range, with cost around $0.30 to $0.80 per call depending on caching and region.

Code review at the file or small-repo level also works well. Engineers walking through repo-level coding assistants consistently said that with 5-15 files of source, the model held the structure and produced useful comments. Below roughly 30K tokens of input, the behavior felt close to a normal short-prompt interaction, and the longer context mostly disappeared as a constraint.

Multi-document Q&A against a fixed corpus of 30-50 pages was another pattern that worked. The model cited the right document, paraphrased correctly, and did not get confused about which source said what. Onboarding flows that walk a new hire through 40 pages of internal documentation also fit this shape, and several r/LocalLLaMA posters running local 70B-120B models reported similar results at 32-64K context windows.

For these tasks, the bump from 32K to 200K mattered. Teams stopped having to chunk and stitch. That chunking infrastructure (custom embeddings, retrieval pipelines) was often the single biggest source of bugs in their RAG stack, and dropping it felt like a real productivity gain. The HN thread titled something like “I removed our RAG pipeline and just used a long context model” stayed on the front page for two days.

Where the window breaks down

The failure modes cluster around three patterns that come up over and over in community reports.

First, reasoning quality degrades well before the window fills. Multiple HN commenters described asking a 200K model to compare two sections of a 150K-token document and getting answers that were confidently wrong. The model mixed up details across sections, or invented a position that no document actually held. This is the “context rot” pattern that became a recurring phrase in late 2024 and 2025 threads, and it tracks with the published evaluations showing performance curves that bend well before the nominal limit.

Second, specific factual retrieval past the first 20-30K tokens becomes unreliable. Even when clean needle-in-a-haystack tests showed 95% or better accuracy, real questions phrased naturally (“what was the termination clause in the third agreement”) often returned summaries of the wrong section. The information was in the window. The attention was not, or at least not consistently.

Third, structured extraction breaks down. Practitioners trying to pull JSON or table-formatted output from a long input reported the model silently truncating fields, mixing schemas, or hallucinating keys. The longer the input, the more this happened, regardless of the nominal window size. One team wrote on its blog that it had to cap extraction calls at 40K tokens even with a 200K window available, because the error rate above that point made the output unusable.

The community reaction to these reports was often a weary “we knew this in 2023, vendors are finally admitting it.” Several teams responded by capping their effective window at 50-60% of the advertised maximum, even when they were paying for the full capacity.
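One way a team might arrive at a cap like that is to measure it. The sketch below is a minimal illustration, not a method described in the threads: it assumes you already have eval results as (input_tokens, correct) pairs, and the function names, the 20K bucket size, and the 0.9 accuracy threshold are all hypothetical choices.

```python
from collections import defaultdict


def accuracy_by_bucket(results, bucket_size=20_000):
    """Group (input_tokens, correct) pairs into token-length buckets.

    Returns {bucket_start: accuracy}, ordered by bucket_start.
    """
    totals = defaultdict(lambda: [0, 0])  # bucket_start -> [hits, count]
    for input_tokens, correct in results:
        bucket = (input_tokens // bucket_size) * bucket_size
        totals[bucket][0] += int(correct)
        totals[bucket][1] += 1
    return {b: hits / n for b, (hits, n) in sorted(totals.items())}


def effective_context(results, min_accuracy=0.9, bucket_size=20_000):
    """Upper edge of the last bucket before accuracy first falls below the threshold.

    The threshold is an assumption; pick one that matches your own quality bar.
    """
    limit = 0
    for bucket, acc in accuracy_by_bucket(results, bucket_size).items():
        if acc < min_accuracy:
            break
        limit = bucket + bucket_size
    return limit


# Toy data: accuracy holds up to 40K tokens, then drops.
sample = [
    (5_000, True), (15_000, True),
    (25_000, True), (35_000, True),
    (45_000, True), (50_000, False),
]
cap = effective_context(sample)
```

With real data you would want many more examples per bucket than this toy set, since a bucket with two samples says very little.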

The cost and latency math nobody puts on the slide

The other gap between promise and reality is the bill, and it is the one that hits hardest in production.

For Claude Sonnet-class models, a 200K input runs roughly 4x the per-token cost of an 8K input on the same model, and prefill latency scales sub-linearly but still meaningfully. Practitioners in r/LocalLLaMA benchmarks reported prefill times of 3-8 seconds for 100K tokens and 15-30 seconds for 500K-1M tokens, depending on provider and region. Time-to-first-token matters because it gates the whole interaction. A 25-second prefill before a single token appears is a UX problem on its own, separate from whether the answer is correct.

A single agent loop that re-feeds 80K tokens of context on every step can burn $2-6 per task even at the cheaper model pricing tiers. Teams that built agent systems without a cost dashboard found their daily API spend jumping 5-10x the week they turned on long-context features. The HN thread on “context is the new compute bill” stayed active for weeks, with several founders posting screenshots of their invoices as a warning to others.
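The arithmetic behind that range is easy to reproduce. The following is a rough sketch under stated assumptions: the $3 per million input tokens price is a hypothetical figure chosen for illustration, output tokens and caching are ignored, and the function name is made up for this example.

```python
def agent_loop_cost(steps, context_tokens, usd_per_million_input):
    """Input-side cost of an agent loop that re-sends the full context every step.

    Ignores output tokens and any prompt-caching discount.
    """
    return steps * context_tokens * usd_per_million_input / 1_000_000


ASSUMED_PRICE = 3.0  # hypothetical USD per million input tokens

# 80K tokens re-fed on each step of the loop
short_task = agent_loop_cost(steps=10, context_tokens=80_000, usd_per_million_input=ASSUMED_PRICE)
long_task = agent_loop_cost(steps=25, context_tokens=80_000, usd_per_million_input=ASSUMED_PRICE)

print(f"10 steps: ${short_task:.2f}, 25 steps: ${long_task:.2f}")
```

Under those assumptions each step costs about $0.24, so a task of 10 to 25 steps lands in the same $2-6 band. The cost grows with the step count, which is why a loop that looks cheap per call gets expensive per task.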

KV cache memory on the inference side is the related pain for self-hosted setups. A 200K context at typical precisions needs 40-80GB of GPU memory just for the cache, on top of the model weights. That math pushed several r/LocalLLaMA posters toward smaller models with tighter contexts, or quantized long-context runs, rather than chasing headline numbers. One detailed breakdown compared a 200K Qwen-class model against a 32K Mistral-class model and found that the Mistral-class model won on both cost and latency for the team’s actual workload, even though the Qwen-class model had the longer advertised window.
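For a sense of where a number in that range comes from, here is a back-of-the-envelope sketch using the standard KV cache sizing formula. The model shape (80 layers, 8 key/value heads with grouped-query attention, head dimension 128) is a hypothetical 70B-class configuration picked for illustration, not a figure from the threads.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_value=2):
    """Approximate KV cache size for one sequence.

    The leading 2 accounts for storing both keys and values at every layer.
    bytes_per_value=2 corresponds to a 16-bit cache; use 1 for an 8-bit cache.
    """
    return 2 * layers * kv_heads * head_dim * bytes_per_value * tokens


# Hypothetical 70B-class shape with grouped-query attention
fp16_cache = kv_cache_bytes(tokens=200_000, layers=80, kv_heads=8, head_dim=128)
int8_cache = kv_cache_bytes(tokens=200_000, layers=80, kv_heads=8, head_dim=128, bytes_per_value=1)

print(f"16-bit cache: {fp16_cache / 1e9:.1f} GB")
print(f" 8-bit cache: {int8_cache / 1e9:.1f} GB")
```

Under these assumptions the 16-bit cache comes to about 65.5 GB for a single 200K-token sequence, and halving the cache precision halves the footprint, which matches the pull toward quantized runs. Models without grouped-query attention would need considerably more.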

The pairing stack the community converged on

The teams that got the most out of large windows were the ones that paired them with retrieval, not the ones that replaced retrieval with them.

The most common pattern in practitioner write-ups and Discord transcripts: use a 100-200K context model as the final reader over a small, curated set of retrieved chunks. The retrieval step handles finding the right 20-50 pages, and the long context handles the synthesis, the cross-reference, the multi-document reasoning. This avoids both the lost-in-the-middle problem and the cost blowup, because the input stays in a range where the model is paying attention to the whole thing.
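A minimal sketch of the hand-off between the two stages might look like the following. Everything here is an illustrative assumption: the Chunk type, the function names, and especially the four-characters-per-token estimate, which is a crude heuristic that a real pipeline would replace with the model's own tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    doc_id: str
    text: str
    score: float  # retrieval relevance, higher is better


def estimate_tokens(text):
    # Crude heuristic (about 4 characters per token); swap in a real tokenizer.
    return max(1, len(text) // 4)


def pack_context(chunks, budget_tokens):
    """Greedily keep the highest-scoring chunks that fit in the token budget."""
    selected, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        cost = estimate_tokens(chunk.text)
        if used + cost > budget_tokens:
            continue  # skip chunks that would overflow; smaller ones may still fit
        selected.append(chunk)
        used += cost
    return selected, used


def build_prompt(question, chunks):
    """Label each chunk with its source so the reader model can cite it."""
    parts = [f"[{c.doc_id}]\n{c.text}" for c in chunks]
    return "\n\n".join(parts) + f"\n\nQuestion: {question}"


retrieved = [
    Chunk("agreement-1", "a" * 40, score=0.9),
    Chunk("agreement-2", "b" * 80, score=0.8),
    Chunk("agreement-3", "c" * 40, score=0.5),
]
kept, used_tokens = pack_context(retrieved, budget_tokens=20)
prompt = build_prompt("What was the termination clause?", kept)
```

The budget is the knob that matters: setting it well below the advertised window is what keeps the reader in the range where it attends to the whole input.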

Other pairings that showed up repeatedly across the threads I tracked:

A short-context router model that decides whether a query needs the long-context reader at all. The cheaper model handles 80% of traffic. The expensive one handles the rest. Several teams reported cost reductions of 60-75% with no measurable quality drop on their evaluation set.

Caching layers. The same 200K prompt run twice in an hour should not pay full price twice. Practitioners leaned on prompt caching features where available, sometimes cutting repeat-call costs by 70-90%. Teams on Anthropic and OpenAI APIs both reported this, and the pricing math made caching the highest-leverage optimization in the whole stack for them.

Structured output validators downstream. Because long-context extraction is unreliable, a small validation step that re-checks the JSON shape caught most failures before they reached the user. This is a one-line addition that pays for itself many times over.
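As a sketch of what such a validation step could look like, the example below checks a model's JSON output against a required shape using only the standard library. The field names are a hypothetical contract-extraction schema invented for this illustration, and teams using a schema library would get the same effect with less code.

```python
import json

# Hypothetical schema: field name -> expected Python type
REQUIRED_FIELDS = {
    "party": str,
    "termination_clause": str,
    "effective_date": str,
}


def validate_extraction(raw, required=REQUIRED_FIELDS):
    """Return (data, errors). data is None whenever any check fails."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return None, ["top-level value is not an object"]

    errors = []
    for key, expected_type in required.items():
        if key not in data:
            errors.append(f"missing field: {key}")  # catches silently dropped fields
        elif not isinstance(data[key], expected_type):
            errors.append(f"wrong type for field: {key}")
    for key in data:
        if key not in required:
            errors.append(f"unexpected field: {key}")  # catches hallucinated keys
    return (data if not errors else None), errors


good = '{"party": "Acme", "termination_clause": "30 days notice", "effective_date": "2025-01-01"}'
bad = '{"party": "Acme", "notice_period": 30}'

ok_data, ok_errors = validate_extraction(good)
bad_data, bad_errors = validate_extraction(bad)
```

A failed validation can trigger a retry with a shorter input or route the item to a human, rather than passing a malformed record downstream.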

Who should care and who should ignore the hype

Teams getting the most value: 5-30 person companies doing document-heavy work. Legal review, due diligence, research synthesis, code migration across multiple repos, anywhere the alternative is a human reading 200 pages. For them, a 200K context model replaces hours of work, the cost is justified, and the synthesis quality is good enough to act on. Several practitioner case studies in this segment reported 4-8x speedups on their core workflow.

Teams getting the least value: high-volume customer support, single-page summarization, anything that fits comfortably in 16K tokens. Paying for a 200K window and using 4K of it is money spent on capacity that never gets used. Smaller models with 32K windows, often self-hosted, handled these tasks at a fraction of the cost with comparable quality. The HN comment I keep coming back to was a senior engineer writing “we upgraded to the 200K model and our bill went up 12x. The quality went up zero. We went back.”

The honest summary from the community threads I read: context windows are a real capability, the upper end of 200K-1M works for synthesis tasks and breaks for precise retrieval, the cost curve is steeper than vendor pricing pages suggest, and pairing long context with a smart retrieval layer beats going all-in on either approach. Treat the advertised window as a ceiling, not a target. Measure your effective context the way you would measure any other production system, with a held-out eval set and a cost dashboard. The teams doing that consistently got the most out of the technology and spent the least chasing the hype.
