Blog AI

Claude 4: What Production Teams Actually Found

Real latency numbers, cost breakdowns, and edge cases from teams running Claude Opus 4-8 and Sonnet 4-6 in production workloads.

Sam McKay 13 June 2026

Claude 4 landed in production environments about three months ago, and the gap between what Anthropic promised and what engineering teams actually experienced tells you more than any benchmark chart.

Teams expected faster responses than Claude 3.5. They got that, but with caveats that don’t show up in the marketing materials. The real story sits in Slack channels where developers compare token costs at 2am, in GitHub issues tracking mysterious refusal patterns, and in the spreadsheets where CTOs calculate whether the accuracy gains justify the 40% price increase over Sonnet 3.5.

What Changed From Claude 3.5

The jump from Claude 3.5 Sonnet to Claude Sonnet 4-6 brought measurable improvements in code generation. Developers on r/LocalLLaMA consistently reported that Sonnet 4-6 handles multi-file refactoring better than its predecessor. One thread tracked a team migrating a React codebase — Sonnet 4-6 maintained context across 8 files where 3.5 would lose the thread around file 5.

Claude Opus 4-8 sits at the top of the lineup. Anthropic positions it as their reasoning-heavy model, and the context window expanded to 200k tokens. That number matters less than you’d think. Most production use cases don’t need 200k tokens. What matters more: Opus 4-8 handles complex instructions with fewer clarification loops.

A technical lead at a fintech company mentioned their team tested Opus 4-8 against GPT-4o for financial document analysis. Opus caught edge cases in contract language that GPT-4o missed, but took 3.2 seconds average response time versus GPT-4o’s 1.8 seconds. The accuracy was worth the latency for their use case. For customer-facing chat, it wasn’t.

Haiku 4-5 replaced Haiku 3 as the speed option. Response times dropped to 400-600ms for typical prompts under 2k tokens. Teams using it for classification tasks or simple extraction reported the speed gain made the upgrade worthwhile even though per-token costs went up 15%.

The Latency Reality

Benchmarks show one thing. Production traffic shows another.

Claude Opus 4-8 averages 2.8-4.5 seconds for prompts in the 5k-15k token range based on reports from teams running it in US East regions. That’s input plus output generation. Sonnet 4-6 comes in around 1.2-2.1 seconds for similar loads. Haiku 4-5 consistently hits 400-700ms.

These numbers shift based on time of day. A developer in a HN thread noted Opus response times spiked to 7+ seconds during US business hours in early May. Anthropic’s status page showed green, but the community consensus was that demand surges caused throttling that didn’t trigger official incidents.

Compare that to GPT-4o, which holds steadier at 1.5-2.3 seconds regardless of time of day for equivalent prompt sizes. The tradeoff: Claude Opus 4-8 produces more thorough analysis when you need it to reason through ambiguous requirements. GPT-4o gives you faster responses but sometimes needs a second prompt to catch nuance.

For real-time applications, Haiku 4-5 became the default. One team building a coding assistant switched from Sonnet 4-6 to Haiku for autocomplete suggestions. The latency drop from 1.8s to 500ms made the feature feel responsive instead of laggy. They kept Sonnet for the more complex “explain this function” requests where users expect to wait a beat.

Cost Structure That Surprised Teams

Anthropic’s pricing puts Claude Opus 4-8 at $15 per million input tokens and $75 per million output tokens. Sonnet 4-6 runs $3 input, $15 output. Haiku 4-5 costs $0.25 input, $1.25 output.

The sticker shock comes from output tokens. A team processing legal documents found that Opus 4-8 generates verbose explanations even when you prompt for concision. Their average output per document analysis: 3,200 tokens. At $75 per million output tokens, that’s $0.24 per document. Multiply by 50,000 documents per month and you’re at $12,000 just in output costs.

They tried prompt engineering to reduce verbosity. Adding “respond in under 500 tokens” cut output to 1,800 tokens average, but accuracy dropped enough that they went back to the longer responses. Their solution: route simpler documents to Sonnet 4-6, reserve Opus for the 20% that need deep analysis. Monthly costs dropped to $4,800.

A developer building a code review tool shared numbers in a YouTube comment section. Their tool analyzes pull requests and suggests improvements. Sonnet 4-6 costs them $0.08 per PR review on average (input + output combined). They tested Opus 4-8 and found it caught 12% more issues but cost $0.31 per review. The math didn’t work for their freemium model.

Haiku 4-5 changed the economics for high-volume tasks. A customer support automation team replaced a fine-tuned GPT-3.5 model with Haiku for ticket classification. Haiku costs $0.0003 per classification versus their previous $0.0008. Accuracy stayed within 2 percentage points. At 2 million tickets per month, the savings covered the migration work in six weeks.

Where Claude 4 Actually Excels

Code generation with context became Claude Sonnet 4-6’s standout use case. Developers consistently mention it maintains architectural patterns better than GPT-4o across multi-file changes. A backend engineer described asking Sonnet to add a new API endpoint following their existing patterns. The model generated the route handler, database migration, test file, and updated documentation — all matching the team’s conventions without additional prompting.

That same consistency shows up in technical writing. Claude Opus 4-8 produces documentation that feels like a human wrote it, not an AI. The tone stays even, examples make sense, and it doesn’t hallucinate API methods. A DevRel team tested it against GPT-4o for generating SDK documentation. Opus required 30% fewer edits before publishing.

Long-context analysis works when you actually need it. A research team fed Opus 4-8 a 180k token corpus of interview transcripts and asked it to identify recurring themes. The output included specific quote references with accurate token positions. GPT-4o with a similar context window produced more generic themes without the precise citations.

Claude’s refusal behavior became more nuanced in version 4. Earlier versions would refuse prompts that mentioned anything medical or legal, even in obviously non-harmful contexts. Opus 4-8 and Sonnet 4-6 distinguish between “write a phishing email” (refuses) and “analyze this email for phishing indicators” (complies). That distinction matters for security tools and compliance software.

Structured output improved noticeably. When you ask Claude Sonnet 4-6 to return JSON, it rarely breaks the schema. A team building a data extraction pipeline reported their error rate dropped from 8% with Claude 3.5 to under 1% with Sonnet 4-6. They still validate the output, but the reduction in malformed responses cut their error-handling code in half.

The Edge Cases That Bite

Math remains inconsistent. Claude Opus 4-8 handles arithmetic better than previous versions, but developers on r/LocalLLaMA documented cases where it fails at multi-step calculations that GPT-4o solves correctly. One example: calculating compound interest over 30 years with variable rates. Opus got the formula right but made errors in the execution. The thread consensus: use a calculator API for anything beyond basic math.

Prompt sensitivity creates frustration. Small wording changes produce different results more than they should. A developer shared a test where “summarize this in 3 bullet points” versus “create a 3-point summary” gave notably different outputs — not just in phrasing but in which information got included. That inconsistency makes it harder to build reliable prompts.

The context window doesn’t always help. Teams putting 150k+ tokens into Claude Opus 4-8 found it sometimes loses track of details from the early part of the context. A data analyst described feeding in a large dataset with instructions at the top. Opus referenced the instructions correctly for the first 50k tokens of analysis, then started ignoring specific constraints around token 80k. Anthropic’s documentation doesn’t mention this degradation.

Rate limits hit harder than expected. The API returns 429 errors more aggressively than OpenAI’s. A team doing batch processing found they needed to implement more conservative retry logic and request spacing. Their GPT-4 integration could burst to 100 requests per minute during off-peak hours. Claude’s limits forced them to cap at 40 requests per minute even with a paid tier.

Streaming responses occasionally break. The HN thread about Claude 4 included reports of streams cutting off mid-sentence with no error message. The workaround: implement timeout logic that retries the full request if streaming stalls for more than 10 seconds. This happens rarely (maybe 1 in 500 requests), but it happens often enough that production code needs to handle it.

Who Should Actually Use It

Claude Sonnet 4-6 fits teams doing code generation, technical writing, or document analysis where accuracy matters more than speed. If you’re building developer tools, documentation systems, or internal automation that runs async, Sonnet delivers better results than GPT-4o at comparable cost.

Small teams (3-8 developers) benefit most from Sonnet’s code generation. It reduces the back-and-forth that happens with less capable models. One startup founder mentioned their team of 5 engineers ships features 20% faster using Sonnet for boilerplate generation and test writing. The cost runs them $400-600 per month, which they consider cheap for the velocity gain.

Claude Opus 4-8 makes sense for specialized analysis: legal document review, research synthesis, complex technical explanations. If your use case involves reasoning through ambiguity and you can tolerate 3-4 second response times, Opus outperforms alternatives. The cost only works if the accuracy improvement has measurable value.

Haiku 4-5 belongs in high-volume, latency-sensitive applications. Customer support classification, content moderation, simple extraction tasks — anywhere you need thousands of inferences per hour and can’t wait 2 seconds per response. The cost per inference drops low enough that it competes with fine-tuned smaller models.

Claude doesn’t fit real-time conversational AI well. The latency variance makes it hard to deliver consistent user experience. A team building a voice assistant tried Claude Sonnet 4-6 and found the 1-2 second response times felt sluggish compared to GPT-4o’s sub-second responses. They kept Claude for the backend analysis that happens after the conversation ends.

Teams with strict data residency requirements hit limitations. Anthropic doesn’t offer the same regional deployment options as OpenAI or Google. If you need models running in EU-only infrastructure, Claude isn’t an option yet.

What Teams Pair It With

The most common pattern: Claude Sonnet 4-6 for generation, GPT-4o for speed-critical paths. A SaaS company routes user-facing chat to GPT-4o (faster responses), background report generation to Sonnet (better quality), and high-volume classification to Haiku (lower cost). Their orchestration layer picks the model based on task type and latency requirements.

Cursor IDE users often combine Claude with Cursor’s Bugbot feature. Bugbot completes code reviews in 90 seconds and finds 10% more bugs at 22% lower cost than manual review. The integration between Cursor and Claude Sonnet 4-6 handles the context management automatically, which matters when reviewing changes across multiple files.

Some teams run Claude alongside Perplexity’s Computer for research tasks. Perplexity routes work across 20+ models including Claude variants. A product team uses this combo for market research — Perplexity gathers information, Claude Opus synthesizes it into actionable insights. The two-step process costs more but produces better output than either tool alone.

Local model fallbacks became more common. Teams running Mistral Large 2 locally use it for internal documents and sensitive data, route public-facing tasks to Claude. The hybrid approach keeps costs down (local inference is free after hardware costs) while maintaining quality for external use cases.

If you’re working through which tools belong in your stack, book a 60-min Omni Audit — https://calendly.com/sam-mckay/discovery-call

The Verdict From Three Months In Production

Claude 4 delivers on code generation and technical analysis. The quality improvement over Claude 3.5 is real, not marketing. Teams that need accurate, context-aware responses for complex tasks find the upgrade worth the cost increase.

The latency variance remains the biggest operational challenge. You can’t rely on consistent response times during peak hours, which limits where Claude fits in your architecture. Budget for 4+ second responses even when averages look better.

Cost optimization requires active management. The output token pricing punishes verbose responses, so you’ll spend time tuning prompts or implementing routing logic to use cheaper models where possible. This isn’t set-it-and-forget-it pricing.

The model choice matters more than the version number. Opus, Sonnet, and Haiku serve genuinely different use cases. Teams that match the model to the task see better ROI than those who default to Opus for everything.

Claude 4 earned its place in production stacks, but it’s not a universal replacement for other models. The teams getting the most value treat it as one tool in a multi-model strategy, not the only tool they need.

Enterprise DNA Resources