Google’s Gemini 3.1 Flash-Lite is now generally available, and the numbers it’s putting up in production environments are hard to ignore. Announced on May 8, 2026, the model is now live on Vertex AI and the Gemini API — and it’s positioned squarely at enterprises that need to run AI agents at scale without burning through budget.
The headline stat: Gladly, which runs millions of customer interactions weekly across SMS, WhatsApp, and Instagram, is seeing 60% lower costs compared to thinking-tier models, with a p95 latency of around 1.8 seconds for full reply generation and sub-second response times for classifiers and tool calls. Their success rate under heavy concurrent load sits at 99.6%.
That’s the kind of production reliability story that moves enterprise procurement conversations.
What Flash-Lite Actually Does
The model sits in a specific sweet spot: tasks that need to be fast and cheap but can’t afford to be dumb. Tool calling, pipeline orchestration, data classification, and high-volume customer interactions — these are the workloads it’s been optimised for.
Compared with its predecessor, Gemini 2.5 Flash, the 3.1 version delivers a time to first token that is 2.5 times faster and output throughput that is 45% higher. It’s priced at $0.25 per million input tokens and $1.50 per million output tokens.
At those economics, the calculus for deploying AI agents across repetitive, high-volume tasks changes significantly.
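To make that calculus concrete, here is a back-of-envelope cost model using the published per-token rates. The interaction volume and average token counts are illustrative assumptions, not figures from the announcement:

```python
# Rough monthly cost estimate at Flash-Lite's published rates:
# $0.25 per million input tokens, $1.50 per million output tokens.
INPUT_RATE_PER_M = 0.25
OUTPUT_RATE_PER_M = 1.50

def monthly_cost(interactions: int, avg_in_tokens: int, avg_out_tokens: int) -> float:
    """Estimated monthly spend; token counts per interaction are assumptions."""
    input_millions = interactions * avg_in_tokens / 1_000_000
    output_millions = interactions * avg_out_tokens / 1_000_000
    return input_millions * INPUT_RATE_PER_M + output_millions * OUTPUT_RATE_PER_M

# e.g. 10 million interactions/month, ~600 input and ~150 output tokens each
print(monthly_cost(10_000_000, 600, 150))  # 3750.0
```

At those assumed volumes, ten million AI-handled interactions a month come in under $4,000 — the kind of number that reframes "too expensive to automate" conversations.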
Why This Matters for Enterprise AI Builders
For years, the practical constraint on enterprise AI deployment hasn’t been capability — it’s been the cost of running things at scale. A single AI-assisted customer conversation might be impressive in a demo. Ten million of them per month is where most organisations hit a wall.
Flash-Lite directly addresses that constraint. It’s designed to sit in the non-reasoning slots of a multi-model workflow: handling the fast, structured, high-volume calls while more powerful models tackle complex reasoning tasks.
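A minimal sketch of that two-tier pattern might look like the following. The model names, task categories, and routing heuristic are all assumptions for illustration; in production the dispatch would be an actual API call rather than a string:

```python
# Two-tier model router: a fast, cheap model handles routine structured
# calls; a heavier reasoning model handles everything else.
# Task categories and model labels here are illustrative assumptions.

ROUTINE_TASKS = {"classify", "extract", "tool_call", "triage"}

def pick_model(task_type: str) -> str:
    """Route high-volume structured work to the lightweight tier."""
    if task_type in ROUTINE_TASKS:
        return "flash-lite"       # fast, low-cost tier
    return "reasoning-model"      # thinking tier for complex work

def handle(task_type: str, payload: str) -> str:
    # Stand-in for the real model invocation.
    model = pick_model(task_type)
    return f"[{model}] {payload}"

print(pick_model("triage"))        # flash-lite
print(pick_model("contract_review"))  # reasoning-model
```

The design point is that routing happens before any model is invoked, so the expensive tier only ever sees the minority of requests that genuinely need it.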
JetBrains is using it to power real-time developer support inside IDEs. OffDeal uses it for live data lookups during investment banking calls on Zoom. Ramp, the financial operations platform, runs latency-sensitive features on it. AlphaSense processes market intelligence data through it across their entire data stack.
These aren’t pilot projects. These are companies that have moved well past experimentation.
The Bigger Pattern: Cost Commoditisation Is Happening
Flash-Lite’s GA is part of a broader pattern worth paying attention to. Over the past 18 months, the cost of running capable AI models has dropped dramatically across every major provider. Google, Anthropic, and OpenAI have all shipped faster, cheaper models on aggressive timelines.
For businesses still on the sidelines, the economic argument for waiting is shrinking. The AI productivity gains that used to require significant infrastructure investment are now achievable at a fraction of the cost. The question is no longer whether AI agents can do the work — it’s whether your organisation has the data and process foundations in place to deploy them properly.
That second part is where most companies still get stuck.
What This Means for Business
The immediate opportunity for enterprise teams is reassessment. If you’ve been running cost-of-AI projections based on older model pricing, those numbers are probably wrong now.
The more interesting opportunity is looking at processes that were previously dismissed as “too high-volume to automate intelligently.” Customer support triage, internal knowledge queries, data extraction from documents, classification and routing across business workflows — these are tasks where Flash-Lite class models now make economic sense at serious scale.
The companies pulling ahead in AI adoption aren’t just the ones deploying the most powerful models. They’re the ones matching the right model to the right task — and building the infrastructure to run those models reliably at production scale.
If your organisation hasn’t done a proper audit of which AI workloads could run on faster, cheaper models versus which genuinely need heavy reasoning, that’s a useful exercise to run right now. The cost savings can fund the more ambitious deployments.
Enterprise DNA helps organisations build the data foundations and AI strategy needed to move from experiments to production. Talk to us about your AI readiness.
Source
Google Cloud Blog