
Computer Use: What Practitioners Actually Found

Anthropic's Computer Use promised autonomous agents. Six months of real testing show where it works, where it breaks, and what it costs.

Sam McKay 18 June 2026

When Anthropic dropped Computer Use into public beta back in October 2024, the developer community split into two camps almost overnight. Half the threads on r/LocalLLaMA were breathless demos of Claude booking flights and editing spreadsheets. The other half were engineers pointing out that a single failed click loop could burn through dollars in minutes. Six months of production testing later, the picture is a lot more nuanced than either camp wanted to admit.

This is what practitioners actually found when they wired Computer Use into real workflows.

The Setup vs The Reality

The marketing pitch was simple. Claude sees a screen, decides what to do, moves the mouse, types keys, and repeats until the task is done. Developers who watched the launch demos expected something close to AGI in a box. The HN thread that week had comments like “we’re back to 1995 screen scraping, but with a 200ms inference bill attached.”

The reality, based on practitioner reports across Reddit, Discord servers, and a dozen YouTube walkthroughs, sits in an awkward middle. Computer Use is genuinely capable of completing tasks that would have required custom automation code six months ago. It is also slow, expensive in ways that surprise teams, and brittle in edge cases that don’t show up in curated demos.

One common pattern from the r/Anthropic subreddit: developers spent the first week amazed, then spent the second week adding retry logic, screenshot validation, and cost guards. The tool works. It just doesn’t work the way the demos suggested.

Where Computer Use Genuinely Works

The strongest signal from the community is that Computer Use handles well-defined browser tasks with predictable UI surprisingly well. Practitioners reported the highest success rates on three kinds of work: form filling on legacy web apps, navigating multi-step checkout flows, and extracting data from sites without APIs.

A few specifics came up repeatedly. Latency per action typically lands between 2 and 5 seconds, which includes the screenshot capture, the model reasoning step, and the coordinate output. For a 15-step workflow, that puts you at 30 to 75 seconds of pure model time, before any retry overhead. Practitioners testing on Claude 3.5 Sonnet reported roughly 85 to 90 percent task completion on first attempt for simple browser flows. That number drops sharply once the UI has dynamic elements, modals, or animations.

Cost is where things get interesting. Each Computer Use call includes a screenshot, which means token counts are high. Practitioners on the API Discord reported costs ranging from $0.05 to $0.40 per single action depending on screenshot resolution and context length. A typical 20-step task lands somewhere between $1 and $8 in API spend. That is not nothing, especially when you compare it to a deterministic Playwright script that costs fractions of a cent.
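The arithmetic behind those ranges is easy to reproduce. Here is a minimal sketch that scales the per-action figures quoted above (2 to 5 seconds, $0.05 to $0.40) by step count. The names `Range` and `estimate_task` are made up for illustration, and the inputs are practitioner-reported numbers, not measurements or official pricing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    """A best-case / worst-case pair."""
    low: float
    high: float

    def scale(self, factor: float) -> "Range":
        return Range(self.low * factor, self.high * factor)


# Figures as reported by practitioners in this article: assumptions, not measurements.
SECONDS_PER_ACTION = Range(2.0, 5.0)
DOLLARS_PER_ACTION = Range(0.05, 0.40)


def estimate_task(steps: int) -> dict:
    """Model time and API spend for one task, assuming no retries."""
    if steps < 0:
        raise ValueError("steps must be non-negative")
    return {
        "seconds": SECONDS_PER_ACTION.scale(steps),
        "dollars": DOLLARS_PER_ACTION.scale(steps),
    }


if __name__ == "__main__":
    for steps in (15, 20):
        est = estimate_task(steps)
        print(
            f"{steps} steps: {est['seconds'].low:.0f}-{est['seconds'].high:.0f} s, "
            f"${est['dollars'].low:.2f}-${est['dollars'].high:.2f}"
        )
```

Because retries are excluded, treat the output as a floor. Real runs with retry loops will land above it.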

The places where Computer Use genuinely delivered, based on community testing, were tasks where the alternative was either expensive human time or no automation at all. Pulling data from government portals with weird JavaScript. Filing expense reports in legacy ERP systems. Testing UI flows that don’t have stable selectors. These are the wins people actually posted about.

The Failure Modes Nobody Warned Us About

The failure modes are where Computer Use gets honest criticism. The most reported issue, by a wide margin, is the coordinate precision problem. Claude clicks near the right spot, but not exactly on it. Buttons get missed by a few pixels. Dropdown menus close before the selection registers. Practitioners building production systems reported needing to add visual confirmation steps after every action, which doubles the cost and triples the latency.
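One way to soften that overhead is a cheap local check before paying for a model-based confirmation. The sketch below is a substitute technique, not what practitioners described: it only asks whether the screen changed at all after an action. `perform` and `take_screenshot` are hypothetical callables you would wire to your own environment.

```python
import hashlib
from typing import Callable


def screen_fingerprint(png_bytes: bytes) -> str:
    """Hash the raw screenshot bytes so two frames can be compared cheaply."""
    return hashlib.sha256(png_bytes).hexdigest()


def act_and_confirm(
    perform: Callable[[], None],
    take_screenshot: Callable[[], bytes],
    max_attempts: int = 2,
) -> bool:
    """Run an action and report whether the screen changed afterwards.

    Returns True as soon as one attempt changes the screen, False if every
    attempt leaves it byte-for-byte identical (a likely missed click).
    """
    for _ in range(max_attempts):
        before = screen_fingerprint(take_screenshot())
        perform()
        after = screen_fingerprint(take_screenshot())
        if after != before:
            return True
    return False
```

An exact byte comparison is easily fooled by clocks, spinners, and animations, and a changed screen does not prove the right thing happened. It is a filter for obvious misses, so the expensive visual confirmation can be reserved for steps that matter.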

The second most common complaint is loop failures. Computer Use gets stuck on cookie banners, captchas, and unexpected popups. A practitioner on the Latent Space Discord described a workflow that worked perfectly in testing, then failed in production because a single A/B test variant added a newsletter modal. The agent clicked the close button, but the modal had moved. The agent then tried to click again on the cached coordinates, which were now off-screen.
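A stuck loop like that one tends to show up as the same action repeated at nearly the same coordinates. Here is a minimal sketch of a detector for it. The `(kind, x, y)` action shape, the class name, and the default thresholds are all assumptions for illustration, not part of any Computer Use API.

```python
from collections import deque
from typing import Deque, Tuple

Action = Tuple[str, int, int]  # (kind, x, y), a hypothetical shape


class LoopDetector:
    """Flags runs where the agent keeps repeating near-identical actions."""

    def __init__(self, window: int = 6, max_repeats: int = 3, tolerance_px: int = 4) -> None:
        self.recent: Deque[Action] = deque(maxlen=window)
        self.max_repeats = max_repeats
        self.tolerance_px = tolerance_px

    def _same(self, a: Action, b: Action) -> bool:
        return (
            a[0] == b[0]
            and abs(a[1] - b[1]) <= self.tolerance_px
            and abs(a[2] - b[2]) <= self.tolerance_px
        )

    def record(self, action: Action) -> bool:
        """Store the action; return True if it looks like a stuck loop."""
        similar = sum(1 for past in self.recent if self._same(past, action))
        self.recent.append(action)
        return similar + 1 >= self.max_repeats


if __name__ == "__main__":
    detector = LoopDetector()
    for attempt in [("click", 100, 100), ("click", 101, 99), ("click", 100, 102)]:
        if detector.record(attempt):
            print("stuck loop suspected, abort or take a fresh screenshot")
```

What to do when it fires is a separate decision. Killing the run is the cheapest response; forcing a fresh screenshot before the next action targets the stale-coordinates case specifically.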

Safety concerns came up constantly in HN threads. Because Computer Use reads screenshots, anything visible on screen becomes potential prompt injection territory. A malicious webpage could display text instructing Claude to perform certain actions. Anthropic added some guardrails, but the practitioner consensus is that this is not solved. Teams running Computer Use against untrusted web content reported needing to wrap it in additional validation layers.
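What a validation layer looks like varies by team. As one hedged illustration, the sketch below checks each proposed action against an allowlist before it is executed. The action dictionary shape, the `Policy` fields, and the example hostnames are hypothetical and are not Anthropic's schema.

```python
from dataclasses import dataclass
from typing import Tuple
from urllib.parse import urlparse


@dataclass(frozen=True)
class Policy:
    allowed_hosts: frozenset
    allowed_actions: frozenset = frozenset({"click", "type", "scroll", "screenshot"})


def is_allowed(action: dict, current_url: str, policy: Policy) -> Tuple[bool, str]:
    """Return (allowed, reason) for one proposed action on the current page."""
    host = urlparse(current_url).hostname or ""
    if host not in policy.allowed_hosts:
        return False, f"host not allowed: {host!r}"
    kind = action.get("type")
    if kind not in policy.allowed_actions:
        return False, f"action not allowed: {kind!r}"
    return True, "ok"


if __name__ == "__main__":
    policy = Policy(allowed_hosts=frozenset({"portal.example.gov"}))
    print(is_allowed({"type": "click", "x": 10, "y": 20}, "https://portal.example.gov/login", policy))
    print(is_allowed({"type": "click", "x": 10, "y": 20}, "https://evil.example.com/", policy))
```

A check like this does not detect prompt injection. It only limits what an injected instruction can make the agent do, which is why teams describe it as one layer among several.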

There is also the onboarding friction. Setting up Computer Use requires a virtual display, screenshot capture, and an action execution environment. Developers used to clean API integrations found themselves debugging X11 forwarding issues and Selenium compatibility problems. The first deployment took most teams 2 to 4 days, according to practitioner reports on r/MachineLearning.

The Cost Picture

Cost deserves its own section because it surprised almost everyone. Computer Use requests are billed per token like any other Claude API call, but a Computer Use request looks nothing like a typical text request. Screenshots eat tokens. A typical 1920x1080 screenshot, even compressed, can add 1,000 to 4,000 tokens to a request. Multiply that by the number of steps in a task, and the bill grows fast.
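How fast depends on how much history each request carries. The sketch below assumes every request resends the screenshots still held in the conversation, which depends entirely on how your client manages history. The function name and the `keep_last` parameter are hypothetical, and the token figure comes from the range quoted above.

```python
from typing import Optional


def cumulative_image_tokens(
    steps: int,
    tokens_per_screenshot: int,
    keep_last: Optional[int] = None,
) -> int:
    """Total screenshot tokens sent as input across a whole task.

    keep_last=None models a history that keeps every earlier screenshot;
    keep_last=K models a client that trims history to the K most recent.
    """
    if keep_last is not None and keep_last < 1:
        raise ValueError("keep_last must be at least 1")
    total = 0
    for step in range(1, steps + 1):
        in_context = step if keep_last is None else min(step, keep_last)
        total += in_context * tokens_per_screenshot
    return total


if __name__ == "__main__":
    print(cumulative_image_tokens(20, 1000))               # full history: 210000
    print(cumulative_image_tokens(20, 1000, keep_last=3))  # trimmed history: 57000
```

With full history the total grows with the square of the step count, so long tasks are disproportionately expensive. Trimming old screenshots brings growth back to roughly linear, at the price of the model seeing less of what happened earlier.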

A team I read about on a practitioner blog ran a benchmark of 100 Computer Use tasks. The median cost was $3.20 per task. The 95th percentile was $14 per task, driven by retry loops and context accumulation. For comparison, the same tasks done with a scripted Playwright approach cost $0.02 each.

The community response has been to add cost monitoring as a first-class concern. Practitioners are setting hard limits per task, killing runs that exceed thresholds, and routing only high-value workflows through Computer Use while keeping deterministic tasks on traditional automation.
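A hard per-task limit can be as small as a counter that raises once a threshold is crossed. Here is a minimal sketch, assuming you can attribute a dollar cost to each action. `TaskBudget` and `BudgetExceeded` are illustrative names, not part of any SDK.

```python
class BudgetExceeded(RuntimeError):
    """Raised when a task goes over its spend or step limit."""


class TaskBudget:
    def __init__(self, max_dollars: float, max_steps: int) -> None:
        self.max_dollars = max_dollars
        self.max_steps = max_steps
        self.spent = 0.0
        self.steps = 0

    def charge(self, dollars: float) -> None:
        """Record one action's cost; raise if the task is now over either limit."""
        self.spent += dollars
        self.steps += 1
        if self.spent > self.max_dollars:
            raise BudgetExceeded(f"spent ${self.spent:.2f} of ${self.max_dollars:.2f}")
        if self.steps > self.max_steps:
            raise BudgetExceeded(f"took {self.steps} steps, limit is {self.max_steps}")


if __name__ == "__main__":
    budget = TaskBudget(max_dollars=5.00, max_steps=40)
    try:
        for _ in range(100):
            budget.charge(0.40)  # hypothetical per-action cost
    except BudgetExceeded as exc:
        print(f"run killed: {exc}")
```

The check runs after the cost is recorded, so a task can overshoot its limit by one action. Set the threshold a little under the number you actually care about.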

One pattern that emerged: teams are using Computer Use as a fallback, not a primary tool. The main flow runs on Playwright or Selenium. If that fails or if the UI changes, Computer Use takes over. This hybrid approach reportedly cuts costs by 70 to 80 percent while maintaining reliability.
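The fallback pattern itself is only a few lines of control flow. In this sketch, `scripted` and `agent` are hypothetical callables standing in for a Playwright or Selenium flow and a Computer Use run. Nothing here calls either library.

```python
import logging
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")
log = logging.getLogger("hybrid")


def run_with_fallback(
    scripted: Callable[[], T],
    agent: Callable[[], T],
    recoverable: Tuple[Type[BaseException], ...] = (Exception,),
) -> Tuple[T, str]:
    """Try the deterministic path first; use the agent only if it fails.

    Returns the result together with the name of the path that produced it,
    so fallback frequency can be tracked over time.
    """
    try:
        return scripted(), "scripted"
    except recoverable as exc:
        log.warning("scripted path failed (%s), falling back to agent", exc)
        return agent(), "agent"


if __name__ == "__main__":
    def broken_script() -> str:
        raise TimeoutError("selector not found")

    result, path = run_with_fallback(broken_script, lambda: "done via agent")
    print(result, path)
```

Returning the path name is the useful part. A rising share of "agent" results is an early sign that the UI has drifted and the script needs repair.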

Who Should Actually Deploy This

The fit question is where the community has gotten clearest. Computer Use is not a fit for high-volume, low-latency production workflows. If you need to process 10,000 records per hour, this is the wrong tool. The latency and cost make it economically unviable.
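A rough capacity calculation shows why. The sketch below assumes each record is one 20-step task, which is an assumption for illustration, and reuses the 2 to 5 seconds per action reported earlier in this article. `sessions_needed` is a made-up helper.

```python
import math


def sessions_needed(records_per_hour: int, steps_per_record: int, seconds_per_action: float) -> int:
    """Concurrent agent sessions required to keep up, ignoring retries and queueing."""
    seconds_per_record = steps_per_record * seconds_per_action
    return math.ceil(records_per_hour * seconds_per_record / 3600)


if __name__ == "__main__":
    best = sessions_needed(10_000, 20, 2.0)
    worst = sessions_needed(10_000, 20, 5.0)
    print(f"parallel sessions required: {best} to {worst}")  # 112 to 278
```

Each of those sessions needs its own virtual display. At the $1 to $8 per 20-step task quoted earlier, 10,000 records works out to $10,000 to $80,000 for every hour of throughput.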

It is a fit for small teams dealing with legacy systems that have no API. A two-person operations team automating expense reports across three different web portals will get value here. The alternative is hiring another human, which costs more than the API bill.

It is a fit for prototyping and exploration. Developers building agent systems need to understand what is possible. Computer Use is the fastest way to prototype a workflow that would take weeks to build with traditional automation. The prototype is not the production system, but it answers the question of whether the workflow is automatable at all.

It is less of a fit for regulated industries, at least right now. The safety concerns around prompt injection via screenshots make compliance teams nervous. Practitioners in finance and healthcare reported getting blocked on deployment for this reason.

Team size matters too. Solo developers and small teams (under 10 people) seem to get the most value because they can absorb the setup friction. Larger teams often find the maintenance burden exceeds what they expected, especially when UI changes break workflows that were working fine last week.

What Teams Pair It With

The community has converged on a few common pairings. The most popular is Computer Use alongside Browser Use, the open source alternative that has gotten significantly better over the past six months. Practitioners report using Browser Use for the happy path and falling back to Computer Use when Browser Use hits a wall it cannot navigate.

Skyvern is another frequent pairing. It uses a different approach, combining computer vision with structured reasoning, and practitioners report it handles some workflows more reliably than Computer Use, particularly around form filling. The cost is lower too, though the setup is more complex.

Stagehand, from Browserbase, has emerged as a third option in this space. It focuses on browser-based automation with a more deterministic layer underneath. Teams report using Stagehand for stable workflows and Computer Use for the long tail of edge cases.

Traditional RPA tools like UiPath and Automation Anywhere are not really competing here. They serve different needs. But practitioners migrating from RPA report that Computer Use handles unstructured data better, while RPA handles high-volume structured workflows better. The two coexist in many enterprise stacks.

For teams building internal tools, the pattern is to use Computer Use for the messy 20 percent of tasks that resist automation, and to invest engineering effort in the clean 80 percent. This is the opposite of how most teams approached automation before, but it is the pattern that keeps showing up in practitioner reports.

The Honest Take

Six months in, Computer Use is a real tool that does real things. It is also slower, more expensive, and more brittle than the demos suggested. The teams getting value are the ones treating it as a specialized component in a larger automation stack, not as a magic autonomous agent.

The biggest shift in community sentiment over the past few months has been from “wow, look at this” to “okay, how do we deploy this responsibly.” Practitioners are building guardrails, monitoring costs, and setting clear boundaries on what gets routed through Computer Use. The teams that did this early are seeing ROI. The teams that tried to replace their entire automation stack with it are the ones posting cautionary tales.

If you are evaluating Computer Use for your stack, the question is not whether it works. It does, within limits. The question is whether your workflow has the right shape for it. Unstructured, low-volume, high-value tasks with no API access. That is the sweet spot. Everything else is better served by something more deterministic.

