
AI Documentation: What Engineers Actually Found

Practitioners share what works, what breaks, and what surprised them about AI documentation tools in real production use across 2025 and 2026.

Sam McKay 25 June 2026

What Engineers Expected vs What They Got

The pitch was simple. Ship docs at the speed of code. Every developer community thread from 2024 onward had the same arc. Vendor demos showed auto-generated reference docs, conversational search over codebases, and the promise that onboarding new engineers would drop from weeks to days.

What practitioners actually got, based on consistent reports across r/ExperiencedDevs, the HN “Ask HN” threads on internal tooling, and practitioner blogs such as incident retrospectives on Substack, was more nuanced. The tools work. They also hallucinate in ways subtle enough to slip past code review.

One senior engineer on r/ExperiencedDevs summed it up bluntly. “We replaced our hand-written API reference with Mintlify in Q1. The team loved it for three weeks. Then a customer integration broke because the AI-generated example used a parameter that was deprecated in 4.2.” That thread had 230+ upvotes and forty comments with similar stories.

The pattern is consistent across vendors. Initial demos look great. Production use then surfaces drift, hallucinations, and cost surprises that don’t show up in trials. A staff engineer on the same subreddit put it as “the demo is the product, the trial is the lie, and the real product is what shows up in month four.” That comment got 89 upvotes and the thread is still referenced in new posts about the topic.

Where the Tools Genuinely Deliver

Despite the rough edges, the practitioner community is not writing these tools off. They are using them, and using them heavily, for specific tasks where the failure modes are tolerable.

Internal knowledge bases are the clearest win. A staff engineer at a Series B fintech told me their Notion AI setup handles roughly 60 percent of internal questions without escalation. Latency is around 800ms for a typical 4k-token context, and cost runs about $0.002 per query on their tier. The key detail is that the corpus is small, well-scoped, and someone from the platform team reviews the index monthly. Drift gets caught early.

Runbook generation is the second clear win. The HN thread “Show HN: AI-generated runbooks from incident transcripts” hit 600+ points and stayed on the front page for a day. The pattern practitioners reported was to feed in the post-incident document, get a structured runbook draft, then have a human edit. Teams reported cutting runbook writing time from two hours to twenty minutes. The catch, which the comments made clear, is that the AI is great at structure and bad at the conditional logic that real on-call needs.

Onboarding is the third area. A devtools company with 40 engineers told me their GitBook + AI setup got new hires productive on internal services about 30 percent faster. The numbers were tracked through PR throughput in weeks one and two. The acceleration came from conversational search over design docs, not from auto-generated code explanations. That is an important distinction. The tools that win at onboarding are the ones that index existing human-written content well, not the ones that generate from scratch.

Specific cost numbers from the community are worth noting. Most teams using OpenAI’s API for documentation search report costs between $200 and $800 per month for a 50-person engineering org. Self-hosted with smaller models runs $0.05 to $0.15 per 1k tokens on dedicated hardware, but you trade 200-400ms of additional latency for that. A platform team at a 70-person healthtech company posted their breakdown and the cost-per-engineer hovered around $14 monthly when self-hosted, against $23 on the managed tier, before factoring in the engineer’s time to maintain the deployment.

Where the Tools Fall Short

The failure modes are well-documented now, and they are consistent across vendors.

The first is API drift hallucination. This showed up in at least four separate threads I tracked. The tool generates a plausible-looking code example using a function signature that was renamed six months ago. The example compiles, calls the wrong method, and the user only finds out in production. A backend lead at a logistics company posted a postmortem on his blog about exactly this. Their team caught it in staging, but the cost was a week of debugging time and a customer escalation.
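One way to catch this class of error before it ships is to lint generated examples against a list of renamed calls. The sketch below is a minimal illustration, not a description of how any vendor does it. The `DEPRECATED_CALLS` map, the `create_charge` and `create_payment` names, and the sample snippet are all hypothetical; in practice the map would have to come from your own changelog.

```python
import ast

# Hypothetical map of renamed call names to their replacements.
# A real map would be derived from the library's changelog.
DEPRECATED_CALLS = {
    "create_charge": "create_payment",
}


def _call_name(node: ast.Call) -> str:
    """Return the trailing name of a call: 'client.create_charge' -> 'create_charge'."""
    func = node.func
    if isinstance(func, ast.Attribute):
        return func.attr
    if isinstance(func, ast.Name):
        return func.id
    return ""


def find_stale_calls(example_source: str, deprecated: dict[str, str]) -> list[tuple[int, str, str]]:
    """Return (line, old_name, new_name) for every call to a deprecated name."""
    tree = ast.parse(example_source)
    hits = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Call):
            name = _call_name(node)
            if name in deprecated:
                hits.append((node.lineno, name, deprecated[name]))
    return sorted(hits)


# Hypothetical AI-generated example that still uses the old name.
generated_example = """
client = Client(api_key="...")
client.create_charge(amount=100)
"""

for line, old, new in find_stale_calls(generated_example, DEPRECATED_CALLS):
    print(f"line {line}: {old} was renamed to {new}")
```

A check like this only parses the example and never runs it, so it can sit in CI next to the docs build. It catches renames you have listed and nothing else, which is why it complements human review rather than replacing it.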

The second is search result overconfidence. Practitioners consistently report that the AI returns answers with the same tone regardless of confidence. A user asks about an edge case in the rate limiter, gets a confident answer, and the answer is wrong because no relevant doc existed. The tool did not say “I don’t know.” It made something up. Multiple engineers called this out as the single biggest production risk. Confidence calibration is unsolved in retrieval-augmented generation, yet the tools market themselves as if it were solved.
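For teams running their own retrieval layer, a crude guard is to refuse to answer when nothing in the index scores as relevant. The sketch below assumes a retriever that returns similarity scores between 0 and 1; the `Hit` type, the `MIN_SCORE` cutoff of 0.75, and the document names are hypothetical. It does not solve calibration. It only makes the "no relevant doc existed" case explicit.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    doc_id: str
    score: float  # similarity in [0, 1], higher is closer (assumed scale)


# Hypothetical cutoff; a real value would be tuned against labeled queries.
MIN_SCORE = 0.75


def answer_or_abstain(hits: list[Hit], min_score: float = MIN_SCORE) -> str:
    """Return the sources to ground an answer on, or an explicit refusal."""
    usable = [h for h in hits if h.score >= min_score]
    if not usable:
        return "I don't know: no indexed document covers this question."
    sources = ", ".join(h.doc_id for h in usable)
    return f"Answer grounded in: {sources}"


# A rate-limiter edge case where the best match is only weakly related.
print(answer_or_abstain([Hit("rate-limiter-overview.md", 0.41)]))
```

The design choice is to abstain before the model is asked to generate anything, since a model given weak context will usually still produce a fluent answer.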

The third is cost surprise at scale. Trial pricing is usually per-seat or capped at low usage. Production usage, especially with verbose internal wikis, blows through caps fast. A team lead on r/devops posted their monthly bill after rolling out an AI docs tool company-wide. It went from a $300 line item to $2,100 in two months. Comments were full of similar stories. The pricing model assumes human-length queries and small contexts. Engineers paste entire error logs and stack traces. The bill reflects that.
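The gap between trial and production usage is easy to see with rough arithmetic. The sketch below uses the common rule of thumb of about four characters per token, which is an approximation and not a real tokenizer. The price per 1k tokens, the query volume, and the input sizes are hypothetical placeholders, not figures from any vendor.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate; 4 chars per token is a rule of thumb, not a tokenizer."""
    return max(1, round(len(text) / chars_per_token))


def monthly_cost(tokens_per_query: int, queries_per_month: int, usd_per_1k_tokens: float) -> float:
    """Input-token cost only; ignores output tokens and per-seat fees."""
    return tokens_per_query / 1000 * usd_per_1k_tokens * queries_per_month


# Stand-ins for real inputs: an 80-character question versus a
# 12,000-character pasted stack trace (both sizes are hypothetical).
short_question = "q" * 80
pasted_trace = "t" * 12_000

PRICE = 0.01      # hypothetical USD per 1k input tokens
QUERIES = 5_000   # hypothetical queries per month

for label, text in [("short question", short_question), ("pasted trace", pasted_trace)]:
    tokens = estimate_tokens(text)
    cost = monthly_cost(tokens, QUERIES, PRICE)
    print(f"{label}: {tokens} tokens/query, about ${cost:.2f}/month")
```

Because cost scales linearly with tokens, the pasted trace in this example costs 150 times what the short question does at the same query volume. A per-query input cap or log truncation step is the usual lever.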

The fourth is onboarding friction for non-engineering stakeholders. The HN thread on documentation tools had a recurring complaint from product managers and technical writers. The tools are built for engineers. PMs and writers find the workflow baffling, the configuration opaque, and the output style hard to match to brand voice. One technical writer said it took her three weeks to get a tool to produce docs in her team’s voice, and even then it required a 14-page style guide encoded as a system prompt.

The fifth is the cold-start problem on private repos. Tools that index your codebase on first run take between 4 hours and 2 days depending on repo size, and the first index is usually wrong about which folders matter. Practitioners report needing 1-2 weeks of tuning to get useful results, which is longer than the trial window most vendors offer.

The sixth, which got less attention but showed up in 30+ comments across threads, is the model update problem. When the underlying model changes, the tool’s behavior changes. A team that tuned prompts for GPT-4o in March found their doc search returning different answers in April when the vendor silently upgraded to a new version. Several practitioners called for version pinning, which most tools do not offer.
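Where a team calls a model API directly, for example in a custom search layer, pinning can be enforced with a small config check. The sketch below assumes that pinned snapshots are identified by a trailing YYYY-MM-DD date in the model name and that an undated name is a floating alias. That naming convention, the config keys, and the model names shown are assumptions for illustration; check your provider's own naming before relying on it.

```python
import re

# Assumption: a pinned snapshot ends in a YYYY-MM-DD date suffix,
# and any name without one is a floating alias that can change silently.
DATED_SNAPSHOT = re.compile(r"-\d{4}-\d{2}-\d{2}$")


def is_pinned(model_name: str) -> bool:
    return bool(DATED_SNAPSHOT.search(model_name))


def unpinned_models(config: dict[str, str]) -> list[str]:
    """Return the config keys whose model name is not pinned to a dated snapshot."""
    return sorted(key for key, model in config.items() if not is_pinned(model))


# Hypothetical config for two internal features.
config = {
    "doc_search": "gpt-4o",                # floating alias
    "runbook_draft": "gpt-4o-2024-08-06",  # dated snapshot
}

for key in unpinned_models(config):
    print(f"{key}: model '{config[key]}' is not pinned to a dated snapshot")
```

This does nothing for hosted tools that hide the model choice, which is the case most practitioners were complaining about.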

Who the Tools Fit Best

Based on the patterns, the teams getting real value share three characteristics.

First, the team has strong existing documentation. AI documentation tools are amplifiers, not creators. If you have good human-written docs, the AI makes them dramatically more accessible through search and summarization. If you have bad or missing docs, the AI generates plausible nonsense that ships to customers. A director of engineering at a payments company wrote a long comment that distilled this. “We had 18 months of solid internal docs before we added the AI layer. The AI made them 10x more useful. Friends at other companies added the AI first and now they’re cleaning up a mess.”

Second, the team has a doc owner. Almost every success story I found mentioned a specific person who curated the index, reviewed outputs, and caught drift. Without that role, the tool degrades within a quarter. The title varies (platform engineer, developer experience lead, technical writer), but the function is the same.

Third, the use case is bounded. Tools work well for “answer questions about our internal services” and poorly for “generate all our public-facing API documentation.” The narrower the scope, the better the results.

Team size matters too. Below 10 engineers, the setup cost is hard to justify. Above 200, the cost and the drift become harder to manage. The sweet spot in community reports is between 20 and 100 engineers, with a dedicated platform or developer experience person running the tool. A 2025 DevEx benchmark post by Abi Noda referenced similar numbers and got pushback only on the upper bound.

What Teams Pair It With or Replace It With

The community is clear that no single tool covers the full doc lifecycle. The most common pattern in 2026 is a stack.

A typical stack has four parts: Mintlify or Read the Docs for the public-facing reference; Notion AI or GitBook for internal wikis; a custom RAG layer, often built on a self-hosted Qwen or Llama model, for the high-context internal search use case where latency and cost matter; and a separate tool, usually a thin wrapper around an LLM, for runbook generation from incident transcripts.

The teams that have moved away from a single-vendor solution report a 40-60 percent reduction in doc-related support tickets and a more predictable cost structure. The teams that stick with one vendor usually do so because the integration cost of a stack is not worth the engineering hours. A backend lead at a 90-person SaaS company posted a detailed cost-benefit and concluded the stack approach saved $1,400 a month but cost 0.4 FTE in maintenance, which only penciled out because they had the headcount.
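The backend lead's trade-off can be restated as a break-even calculation. The sketch below uses only the two figures quoted above, $1,400 a month saved and 0.4 FTE of maintenance, and treats the tool-cost saving as the only benefit. That is a simplifying assumption: it ignores the reported drop in support tickets.

```python
def break_even_fte_cost(monthly_savings_usd: float, fte_fraction: float) -> float:
    """Annual cost of one full-time engineer at which savings equal maintenance cost."""
    annual_savings = monthly_savings_usd * 12
    return annual_savings / fte_fraction


threshold = break_even_fte_cost(monthly_savings_usd=1_400, fte_fraction=0.4)
print(f"Break-even annual cost per full-time engineer: about ${round(threshold):,}")
```

On these numbers the tool savings alone cover the maintenance time only if a full-time engineer costs about $42,000 a year or less. That is consistent with the conclusion that the stack made sense because the headcount already existed, not because the invoice got smaller.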

A few teams are pulling back entirely. The “AI docs were wrong” stories are common enough that some platform teams have instituted a rule that AI-generated docs require human review before publication, which removes most of the time savings. That is a real pattern in the comments and worth naming. A comment from a senior SRE summed it up. “We slowed down on AI doc generation. The review cost was eating all the speed gain. Now we use it for first drafts and a human does the final pass, which is fine but is not the future the vendors sold us.”

The Honest Take

The community consensus, drawn from hundreds of comments and dozens of practitioner posts, is that AI documentation tools are real and useful in narrow, well-bounded contexts. They are not a replacement for a doc culture, an owner, or a review process. Teams that go in expecting “ship docs at the speed of code” without putting in the work to scope the use case, curate the index, and budget for ongoing review get burned. Teams that treat the tool as a force multiplier for an existing doc practice get genuine value.

The signal from 2025 and into 2026 is that the tools got better at indexing and worse at hallucination detection, which is the wrong direction for production use. The teams doing well are the ones who keep a human in the loop, scope the use case tightly, and treat the AI output as a draft rather than a publication. That is not a hot take, and it is not what the vendor marketing suggests, but it is what the practitioner community has converged on.

If you’re working through which tools belong in your stack, book a 60-min Omni Audit: https://calendly.com/sam-mckay/discovery-call
