What Whisper Actually Is
Whisper is an open-source automatic speech recognition model released by OpenAI in 2022. The current version you interact with through the API is a hosted variant trained on roughly 680,000 hours of multilingual audio. It takes an audio file as input and returns text, either as plain transcription or as translated text in English.
Strip the marketing language and Whisper is a single neural network doing sequence-to-sequence mapping from audio spectrograms to text tokens. It handles punctuation, capitalization, and basic sentence boundaries without you having to assemble a pipeline of separate models. For most practical purposes that means one API call replaces what used to be a stitched-together stack of voice activity detection, a speech-to-text engine, a punctuation model, and often a language identification step.
There are two ways to use it. The first is self-hosting the open-source weights, which gives you full control and no per-minute cost. The second is the hosted Whisper API, which charges $0.006 per minute of audio and handles all the infrastructure. This guide focuses on the hosted API because most readers want something working in an afternoon, not a GPU provisioning project.
Setup and Authentication
You need three things before any code runs: an OpenAI account, an API key with billing enabled, and a recent version of the official Python or Node client library. Audio files also need to be uploaded through multipart form data, which the client libraries handle for you.
Create or sign in to your OpenAI account at platform.openai.com. Navigate to the API keys section and generate a new secret key. Copy it once because OpenAI shows it only at creation time. Store it somewhere safe. A .env file with OPENAI_API_KEY=sk-... works for local development. For production, use a secrets manager or environment variables injected by your deployment platform.
Set usage limits in the billing dashboard. Whisper is cheap but a runaway script processing hours of audio can produce a noticeable bill. A hard cap at the account level is cheap insurance.
Install the Python client:
pip install openai --upgrade
The current version of the library exposes the audio transcription endpoint through client.audio.transcriptions.create. Code written against the pre-1.0 library (openai.Audio.transcribe) does not run on version 1.0 or later without migration, so write new code against the client-based interface.
Verify your setup with a one-liner before going further. If this returns a list of available models, your auth is correct:
python -c "from openai import OpenAI; c = OpenAI(); print(c.models.list())"
First Working Example
The minimal transcription call takes an audio file and returns text. Save a short recording as sample.mp3 and run this:
from openai import OpenAI

client = OpenAI()

with open("sample.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
    )

print(transcript.text)
The model identifier whisper-1 is the only Whisper model exposed through the API and refers to the hosted Whisper large-v2 weights. The default output is a Transcription object whose .text attribute contains the full transcript as a single string.
You can also request structured output with timestamps. Switch to the verbose response format and you get a JSON object with segment-level timing:
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=f,
    response_format="verbose_json",
    timestamp_granularities=["segment"],
)
Each segment includes start, end, and text fields, which is what you need when you want to align transcripts with video, build chapter markers, or feed chunks downstream into a language model. You can request word-level timestamps by passing ["word", "segment"] to the granularities parameter, which is useful for caption alignment and clickable transcripts.
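As an illustration of what word-level timestamps make possible, here is a minimal sketch that groups words into caption cues. The function name group_words and the limits of 42 characters and 4 seconds per cue are hypothetical choices, not part of the API. The input is assumed to be a list of plain dicts with word, start, and end keys, mirroring the word entries in a verbose_json response; SDK response objects may need converting to dicts first.

```python
def _close_cue(words):
    """Collapse a run of word dicts into a single caption cue."""
    return {
        "start": words[0]["start"],
        "end": words[-1]["end"],
        "text": " ".join(w["word"].strip() for w in words),
    }


def group_words(words, max_chars=42, max_seconds=4.0):
    """Group word-level timestamps into caption cues.

    A new cue starts whenever adding the next word would push the
    current cue past max_chars characters or max_seconds of duration.
    Both limits are illustrative defaults, not API values.
    """
    cues = []
    current = []
    for w in words:
        if current:
            text_len = len(_close_cue(current)["text"]) + 1 + len(w["word"].strip())
            duration = w["end"] - current[0]["start"]
            if text_len > max_chars or duration > max_seconds:
                cues.append(_close_cue(current))
                current = []
        current.append(w)
    if current:
        cues.append(_close_cue(current))
    return cues
```

Each returned cue carries its own start and end time, so it can be written out as a caption line or used as a click target in an interactive transcript.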
Supported input formats include mp3, mp4, mpeg, mpga, m4a, wav, and webm. The current file size limit is 25 MB per request. Anything larger needs to be split or compressed first.
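A quick preflight check catches both problems before any upload happens. This is a minimal sketch: the function name preflight is hypothetical, and the cap is treated conservatively as 25,000,000 bytes, which is an assumption about how the limit is measured.

```python
from pathlib import Path

# Extensions listed in this guide as accepted by the endpoint.
SUPPORTED_EXTENSIONS = {".mp3", ".mp4", ".mpeg", ".mpga", ".m4a", ".wav", ".webm"}

# Assumption: treat the 25 MB limit as 25,000,000 bytes to stay on the safe side.
MAX_UPLOAD_BYTES = 25 * 1000 * 1000


def preflight(path):
    """Return a list of problems that would block an upload; empty if none."""
    p = Path(path)
    problems = []
    if p.suffix.lower() not in SUPPORTED_EXTENSIONS:
        problems.append(f"unsupported format: {p.suffix or '(none)'}")
    if not p.is_file():
        problems.append("file not found")
    elif p.stat().st_size > MAX_UPLOAD_BYTES:
        problems.append(f"file too large: {p.stat().st_size} bytes")
    return problems
```

An empty list means the file can be submitted as-is; anything else means convert, compress, or split first.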
Key Settings That Matter
The endpoint exposes a handful of parameters that quietly change the output quality. Most tutorials skip these, which is why their results look mediocre on real audio.
language is the most important one people ignore. Whisper can auto-detect the language of the audio, but auto-detection is a separate inference step with its own error modes. On short clips under a few seconds it often guesses wrong, and on bilingual audio it tends to commit to one language early. If you know the language in advance, pass the ISO-639-1 code such as en, es, or zh. The transcription is faster and noticeably more accurate.
prompt is a string of up to 224 tokens that primes the model. This is not a system prompt in the chat sense. It is fed directly into the decoder as prior context. Use it for three things: spelling out names that Whisper would otherwise get wrong, providing domain vocabulary it might miss such as medical terms, product names, and internal jargon, and setting the expected style. If your meeting transcripts keep rendering a colleague’s name incorrectly, a prompt containing the correct spelling usually fixes it.
temperature controls sampling. Default is 0, which is greedy decoding and what you want for accuracy. Raising it produces more varied output but introduces hallucinations on silent or noisy sections. Leave it at 0 unless you have a specific reason to want creative variation in a transcript.
response_format accepts json, text, srt, vtt, and verbose_json. If you need captions for video, srt and vtt are already formatted for you. Skip post-processing entirely and just write the response straight to disk.
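Putting the settings above together, a caption export might look like the sketch below. The helper name write_captions is hypothetical, and client is assumed to be an OpenAI() instance passed in by the caller. It also assumes the SDK returns the srt or vtt body as a plain string, which is the behavior of recent versions of the Python library; check this against the version you have installed.

```python
from pathlib import Path


def write_captions(client, audio_path, language="en", prompt=None, fmt="srt"):
    """Transcribe one file and write the captions next to it.

    `client` is expected to be an openai.OpenAI instance.
    Returns the path of the caption file that was written.
    """
    if fmt not in ("srt", "vtt"):
        raise ValueError("fmt must be 'srt' or 'vtt'")

    kwargs = {
        "model": "whisper-1",
        "response_format": fmt,
        "language": language,  # skip auto-detection when the language is known
        "temperature": 0,      # keep the default explicit
    }
    if prompt:
        kwargs["prompt"] = prompt  # names and jargon to bias spelling

    audio_path = Path(audio_path)
    with audio_path.open("rb") as f:
        captions = client.audio.transcriptions.create(file=f, **kwargs)

    # Assumption: for srt and vtt the SDK hands back the caption text as a str.
    out_path = audio_path.with_suffix("." + fmt)
    out_path.write_text(captions, encoding="utf-8")
    return out_path
```

Calling write_captions(OpenAI(), "episode.mp3", prompt="Omni Audit, Sam McKay") would write episode.srt alongside the audio; the file name and prompt text here are placeholders.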
For the translation endpoint, client.audio.translations.create works the same way but always outputs English regardless of the source language. Useful for cheap English summaries of multilingual content, less useful for accurate translation since you lose the nuance of a dedicated translation model.
Where It Shines
Whisper handles clean, single-speaker audio better than almost anything else at this price point. Podcast episodes recorded in a studio, dictated notes captured on a decent microphone, and customer support call recordings with reasonable audio quality all transcribe with very low word error rates in the range you’d expect from a production-grade system.
It is also strong on multilingual content without you having to switch models. The same whisper-1 endpoint transcribes English, Spanish, Mandarin, Japanese, and several dozen other languages. For teams working across regions this is a meaningful operational simplification.
For batch processing of archives, the cost is hard to beat. At $0.006 per minute, transcribing 100 hours of audio runs around $36. That is significantly cheaper than most commercial transcription services and the quality is competitive for clean audio.
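The arithmetic is simple enough to keep in a helper when budgeting a batch job. This sketch uses the per-minute rate quoted in this guide; the function name is hypothetical, and the rate should be checked against current pricing before relying on the result.

```python
PRICE_PER_MINUTE_USD = 0.006  # rate quoted in this guide; verify before budgeting


def estimate_cost_usd(hours):
    """Estimated transcription cost in US dollars for `hours` of audio."""
    return round(hours * 60 * PRICE_PER_MINUTE_USD, 2)


print(estimate_cost_usd(100))  # 100 hours = 6,000 minutes
```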
Long-form content works even though the model itself processes audio in 30-second windows, because the windowing is handled for you. The practical constraint is the 25 MB upload limit, which translates to roughly 30 to 40 minutes of compressed audio, depending on bitrate. Longer files need to be split, a topic covered in the workflow section below.
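How many minutes fit under the size limit depends on the encoding bitrate. The back-of-envelope sketch below assumes constant-bitrate audio, ignores container overhead, and treats the cap as 25,000,000 bytes; the function name is hypothetical.

```python
def minutes_under_cap(bitrate_kbps, cap_mb=25):
    """Approximate minutes of constant-bitrate audio that fit under the cap.

    Assumes 1 MB = 1,000,000 bytes and ignores container overhead.
    """
    cap_bits = cap_mb * 1_000_000 * 8
    seconds = cap_bits / (bitrate_kbps * 1000)
    return seconds / 60


for kbps in (64, 96, 128):
    print(kbps, "kbps:", round(minutes_under_cap(kbps), 1), "minutes")
```

Re-encoding speech to a lower bitrate before upload is therefore one way to fit longer recordings into a single request, at some cost in audio quality.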
The output quality of punctuation and capitalization is also notably good. Most older speech-to-text systems returned a wall of lowercase text and required a separate model to restore formatting. Whisper produces publication-ready transcripts in a single pass.
Where It Fails
Whisper’s training data covers a wide range of recording conditions, but performance still degrades sharply on the hardest things real-world recordings throw at it: heavy background noise, multiple overlapping speakers, phone calls with compression artifacts, and far-field microphone pickup in conference rooms.
On noisy audio Whisper does something worse than failing. It hallucinates. If a chunk contains more noise than speech, the model often produces fluent, plausible sentences that were never said. This is the single biggest footgun in production use. Always review samples from your actual audio conditions before trusting the output, and never feed Whisper transcripts directly into automated workflows without a human check somewhere between the transcript and the action it triggers.
Accented speech is a mixed bag. Native accents in well-represented languages work well. Heavy regional accents, code-switching between two languages mid-sentence, and technical jargon outside the training distribution all produce elevated error rates. The prompt parameter helps but only up to a point.
There is no built-in way to identify speakers, no confidence score per segment in the standard JSON output, and no streaming endpoint for real-time transcription. If you need any of these, plan to layer additional tooling on top.
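One partial workaround for the missing confidence score: segments in the verbose_json format carry avg_logprob, no_speech_prob, and compression_ratio fields, which can serve as rough proxies. The sketch below flags segments worth a human look. The function name is hypothetical, and the thresholds mirror the decoding defaults of the open-source Whisper package rather than anything calibrated for the hosted API, so treat them as starting points. Segments are assumed to be plain dicts; SDK response objects may need converting first.

```python
def flag_suspect_segments(
    segments,
    logprob_floor=-1.0,       # open-source Whisper default; an assumption here
    no_speech_ceiling=0.6,    # open-source Whisper default; an assumption here
    compression_ceiling=2.4,  # open-source Whisper default; an assumption here
):
    """Return segments whose decoding stats suggest they need review."""
    flagged = []
    for seg in segments:
        reasons = []
        if seg["avg_logprob"] < logprob_floor:
            reasons.append("low avg_logprob")
        if seg["no_speech_prob"] > no_speech_ceiling:
            reasons.append("likely non-speech")
        if seg["compression_ratio"] > compression_ceiling:
            reasons.append("repetitive text")
        if reasons:
            flagged.append(
                {
                    "start": seg["start"],
                    "end": seg["end"],
                    "text": seg["text"],
                    "reasons": reasons,
                }
            )
    return flagged
```

These statistics are heuristics, not calibrated probabilities. A clean pass does not guarantee a segment is correct, so this narrows the review queue rather than replacing it.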
PII handling is also your responsibility. Audio gets sent to OpenAI’s servers and is retained for abuse monitoring per their data usage policy. If you are transcribing customer calls or health information, that has compliance implications. Self-hosting the open-source weights is the workaround when data residency is non-negotiable.
Latency is another constraint. A typical 10-minute file takes around 30 to 60 seconds to transcribe over the API. That is fine for batch jobs but unsuitable for live captioning.
Practical Workflow Pattern
The pattern that works in production is split, transcribe, reassemble. Audio files longer than the size limit need to be cut into chunks before submission. Most teams use ffmpeg for this because it handles every format Whisper accepts and lets you control chunk boundaries by silence detection.
A typical pipeline looks like this:
- Run ffmpeg with the silencedetect filter to find natural pause points in the audio
- Split the file at those pauses into segments under 24 MB each to leave headroom
- Submit each segment to the transcription endpoint with the same language and prompt parameters
- Concatenate the returned text segments
- If you requested verbose_json, merge the segment timestamps with a small offset adjustment based on where each chunk started in the original file, typically the cumulative duration of all preceding chunks
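The first two steps can be sketched as follows. The code parses the log that ffmpeg's silencedetect filter writes to stderr and picks cut points at the midpoints of pauses. The function names, the noise threshold, and the minimum pause length in the example command are illustrative assumptions. The chunk budget is expressed in seconds, so convert your size budget using the file's bitrate first.

```python
import re

# One way to produce the log parsed below (threshold values are illustrative):
#   ffmpeg -i input.mp3 -af silencedetect=noise=-30dB:d=0.5 -f null - 2> silence.log

SILENCE_RE = re.compile(r"silence_(start|end):\s*(-?\d+(?:\.\d+)?)")


def parse_silences(log_text):
    """Return (start, end) pairs, in seconds, for each reported silence."""
    silences = []
    start = None
    for kind, value in SILENCE_RE.findall(log_text):
        if kind == "start":
            start = float(value)
        elif start is not None:
            silences.append((start, float(value)))
            start = None
    return silences


def choose_cut_points(silences, total_seconds, max_chunk_seconds):
    """Greedily pick pause midpoints so chunks stay under max_chunk_seconds.

    If a stretch has no pause at all, that chunk will exceed the budget;
    the function cannot cut where no silence was detected.
    """
    midpoints = sorted((s + e) / 2 for s, e in silences)
    cuts = []
    chunk_start = 0.0
    candidate = None  # latest pause seen since the current chunk began
    for point in midpoints + [total_seconds]:
        while point - chunk_start > max_chunk_seconds and candidate is not None:
            cuts.append(candidate)
            chunk_start = candidate
            candidate = None
        if point < total_seconds:
            candidate = point
    return cuts
```

The resulting cut points can then be handed to ffmpeg's segment muxer, for example with -f segment -segment_times 20,40 -c copy, to do the actual split without re-encoding.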
Run the chunks in parallel using a thread pool or async client. The API is rate-limited per account but a modest concurrency of 4 to 8 workers processes an hour of audio in roughly the time of the longest chunk rather than the total duration.
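A thread pool is enough for this because the work is network-bound. In the sketch below, transcribe_one is a hypothetical callable you supply that wraps the transcription request for a single chunk and returns its text; the default of four workers is an assumption to tune against your account's rate limits.

```python
from concurrent.futures import ThreadPoolExecutor


def transcribe_all(chunk_paths, transcribe_one, max_workers=4):
    """Transcribe chunks concurrently and return results in input order.

    Executor.map yields results in the order of its inputs, so the
    transcripts line up with chunk_paths even if later chunks finish first.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(transcribe_one, chunk_paths))


def join_transcripts(texts):
    """Concatenate per-chunk transcripts into one string."""
    return " ".join(t.strip() for t in texts if t.strip())
```

Because results come back in input order, reassembly is a plain join. Any exception raised by transcribe_one surfaces when the results are collected, so wrap the call in your own retry logic if transient failures are a concern.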
Store the raw transcripts alongside the original audio with a consistent naming convention. You will want to re-run transcription when you improve your prompt strings or when OpenAI ships a new model version. Having the inputs intact makes that a one-line job.
For analysis downstream, the verbose_json output with segment timestamps is the right format to feed into a language model. Most retrieval-augmented generation setups treat transcript segments as the chunking unit. A 30-minute interview becomes 80 to 120 segments that index cleanly into a vector store.
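A minimal sketch of that conversion, including the timestamp offset for files that were split into chunks, might look like this. The function name to_records and the record layout are hypothetical; segments are assumed to be plain dicts with start, end, and text keys, with times relative to the start of their chunk.

```python
def to_records(chunks, source_id):
    """Flatten per-chunk segments into records with absolute timestamps.

    `chunks` is a list of (chunk_start_seconds, segments) pairs in the
    order the chunks appear in the original file.
    """
    records = []
    for chunk_start, segments in chunks:
        for seg in segments:
            records.append(
                {
                    "id": f"{source_id}-{len(records):05d}",
                    "source": source_id,
                    # Shift chunk-relative times back onto the original timeline.
                    "start": round(chunk_start + seg["start"], 3),
                    "end": round(chunk_start + seg["end"], 3),
                    "text": seg["text"].strip(),
                }
            )
    return records
```

Each record is then ready to embed and index, and the absolute start time lets a retrieval hit link straight back to the right moment in the recording.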
Build a small review loop into the workflow. Whisper hallucinations on noisy audio mean at least a sample of every batch should be read by a human before the output is treated as authoritative. The cost of the review is small compared to the cost of acting on fabricated content.
The model has been stable for long enough that you can build real systems on top of it without expecting the API to disappear. That stability, combined with the low per-minute pricing, is what makes Whisper a practical default rather than an interesting experiment. Pair it with good prompt engineering, a splitting script, and a human review step, and you have a transcription pipeline that handles 90 percent of what most teams actually need.