Replicate API Tutorial: Run AI Models Without Managing GPUs

A working Replicate API tutorial covering auth, your first model call, version pinning, webhooks, and when to use Replicate versus building your own inference stack.

Sam McKay

What Replicate Actually Is

Strip away the marketing and Replicate is a hosted runtime for AI models. You send it an input, it runs the model on rented GPU hardware, and it sends you the output. The company does not train its own frontier models. It hosts other people’s models, mostly community and open source ones, behind a uniform HTTP API.

Three things are worth understanding about the platform before you start using it.

First, the model catalog. Replicate hosts thousands of models across image generation, image editing, video, audio, speech, text, embeddings, and a long tail of specialised tasks. Anything from Stable Diffusion variants to Whisper to Meta’s Llama derivatives to niche fine-tunes. Each model has a model page with a playground UI, input schema, output schema, and a version history.

Second, the packaging system. Most models on Replicate run inside Cog, an open source container format the company maintains. Cog packages a model, its weights, its dependencies, and a predict.py script into a Docker image. When you push a Cog model, Replicate can deploy it as an API endpoint. This is the technical foundation and it explains why the catalog moves fast. Anyone can package a model and push it.

Third, the billing model. Replicate charges by the second of GPU time. Different hardware tiers cost different rates. A typical A100 inference run lands somewhere in the range you’d expect for on-demand cloud GPUs, while smaller T4 jobs run cheaper. You top up your account, the meter runs while your prediction is active, and you get billed for what you actually used.
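To make per-second billing concrete, here is a minimal arithmetic sketch. The rate in it is a made-up placeholder, not a Replicate price, and the function name is hypothetical; real rates are on the pricing page and differ by hardware tier.

```python
def prediction_cost(run_seconds: float, rate_per_second: float) -> float:
    """Estimate the cost of one prediction billed per second of GPU time."""
    if run_seconds < 0 or rate_per_second < 0:
        raise ValueError("run_seconds and rate_per_second must be non-negative")
    return run_seconds * rate_per_second


# ASSUMPTION: placeholder rate for illustration only, in dollars per GPU-second.
HYPOTHETICAL_RATE = 0.001

cost = prediction_cost(run_seconds=4.0, rate_per_second=HYPOTHETICAL_RATE)
print(f"Estimated cost per prediction: ${cost:.4f}")
```

Plugging in the measured run time of your own predictions and the published rate for the hardware they ran on gives a first estimate of per-request cost.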

Compared to alternatives, Replicate sits in a specific niche. Hugging Face Spaces is similar in spirit but oriented toward demos rather than production APIs. Modal and RunPod give you raw GPU compute where you bring your own model. AWS Bedrock and Azure AI Foundry offer managed frontier models with enterprise SLAs but a narrower catalog. Replicate is the middle ground: a managed model catalog with a uniform API, pay-per-use pricing, and no infrastructure to babysit.

Setup and Authentication

The setup is straightforward but the details matter because the authentication pattern shows up in every integration.

  1. Create an account at replicate.com. You can sign up with GitHub or email. New accounts typically receive a small amount of free credit so you can run a handful of test predictions without adding a card.

  2. Generate an API token. Go to your account page, then API tokens, then create one. Treat this token like any other secret. It grants full access to your account and billing.

  3. Store the token as an environment variable. The official clients read REPLICATE_API_TOKEN by default.

export REPLICATE_API_TOKEN=r8_xxxxxxxxxxxxxxxxxxxxxxxx

  4. Install the Python client.

pip install replicate

The Node client is also available (npm install replicate) and the HTTP API works with any language since it is plain JSON over HTTPS.

  5. Add a payment method. Replicate bills against your account balance. You top up manually or set up auto-recharge above a threshold. There is no separate API plan to buy.

A few practical notes. Do not commit the token to source control. Do not pass it as a query string parameter on URLs you log. If you are deploying to a server, store it in your secret manager. If you are running locally, a .env file with python-dotenv is the conventional pattern.
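The token-handling advice above can be sketched as a small helper that fails fast when the variable is missing and never logs the full secret. The function names here are hypothetical; only the REPLICATE_API_TOKEN variable name comes from the official clients.

```python
import os
from typing import Mapping, Optional


def get_replicate_token(env: Optional[Mapping[str, str]] = None) -> str:
    """Read the API token from the environment and fail early if it is absent."""
    env = os.environ if env is None else env
    token = env.get("REPLICATE_API_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "REPLICATE_API_TOKEN is not set; export it or load it from your secret manager"
        )
    return token


def redact(token: str) -> str:
    """Return a form of the token that is safe to write to logs."""
    return token[:3] + "***"
```

Calling get_replicate_token() once at startup surfaces a missing secret immediately rather than on the first prediction. If you keep the token in a .env file, load that file before calling the helper.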

First Working Example

Here is the shortest path from zero to a working model call. We will use the Python client and a small text-to-image model because image outputs are visually obvious.

import replicate

output = replicate.run(
    "black-forest-labs/flux-schnell",
    input={
        "prompt": "a studio photo of a vintage typewriter on a wooden desk",
        "aspect_ratio": "16:9",
        "output_format": "jpg"
    }
)

print(output)

What happens under the hood. The client posts a prediction request to Replicate’s API. The platform queues it, spins up the model container if one is not warm, runs the prediction, then returns the result. For short-running models the call blocks until the prediction finishes. Flux Schnell typically returns in a few seconds.

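The output is a list with one item per generated image. Older versions of the Python client return plain URL strings, while version 1.0 and later return file-like FileOutput objects that expose the URL as a .url attribute. The loop below handles both. Save the images or pass them downstream.

import urllib.request

for idx, item in enumerate(output):
    # Plain strings are already URLs; FileOutput objects carry the URL in .url
    url = getattr(item, "url", item)
    urllib.request.urlretrieve(url, f"typewriter_{idx}.jpg")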

If you would rather use raw HTTP without the client, the equivalent call is a POST to /v1/predictions.

curl -s -X POST \
  -H "Authorization: Bearer $REPLICATE_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"version": "<VERSION_ID>", "input": {"prompt": "a cat"}}' \
  https://api.replicate.com/v1/predictions

That returns a prediction object with an ID and a status of starting or processing. You poll GET /v1/predictions/{id} until the status reaches a terminal value: succeeded, failed, or canceled.
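A polling loop for the raw HTTP path might look like the sketch below. It uses only the standard library. The function names, the one-second interval, and the five-minute timeout are assumptions chosen for illustration, and the fetch function is injectable so the loop can be exercised without network access.

```python
import json
import os
import time
import urllib.request
from typing import Callable

API_BASE = "https://api.replicate.com/v1"
TERMINAL_STATUSES = {"succeeded", "failed", "canceled"}


def fetch_prediction(prediction_id: str) -> dict:
    """GET the current state of one prediction."""
    request = urllib.request.Request(
        f"{API_BASE}/predictions/{prediction_id}",
        headers={"Authorization": f"Bearer {os.environ['REPLICATE_API_TOKEN']}"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def wait_for_prediction(
    prediction_id: str,
    fetch: Callable[[str], dict] = fetch_prediction,
    interval: float = 1.0,   # assumed polling interval, in seconds
    timeout: float = 300.0,  # assumed upper bound on total wait, in seconds
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll until the prediction reaches a terminal status or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        prediction = fetch(prediction_id)
        if prediction["status"] in TERMINAL_STATUSES:
            return prediction
        if time.monotonic() >= deadline:
            raise TimeoutError(f"prediction {prediction_id} did not finish in {timeout}s")
        sleep(interval)
```

The caller still has to check whether the returned status is succeeded before reading the output field.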

Try this with the playground first. Open any model page on replicate.com, fill in the inputs in the web UI, and confirm it produces the output you expect. Then mirror those inputs in code. This saves you from debugging whether your code is wrong or whether the model just behaves differently than you assumed.

Key Settings That Matter

Several knobs affect cost, latency, and reliability, and most demos skip past them.

Version pinning. Each model has a version hash, a long string of hex that uniquely identifies a snapshot of the model code and weights. When you reference a model by its slug alone, like black-forest-labs/flux-schnell, you are not pinning anything: you get whatever the latest version is, and that changes when the model owner pushes an update. For production, always pin to a specific version ID. This is the difference between a system that breaks silently six months from now and one that stays reproducible.

You can find the version hash on the model page under the API tab, or by reading the response of a recent prediction.
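As a sketch of what pinning looks like in code: the Python client accepts a reference of the form owner/name:version. The helper below builds one and rejects anything that does not look like a version hash. The helper name is hypothetical, and the check assumes version IDs are 64 lowercase hex characters, which you should confirm against the hash shown on your model page.

```python
import re

# ASSUMPTION: version IDs are 64 lowercase hexadecimal characters.
_VERSION_PATTERN = re.compile(r"[0-9a-f]{64}")


def pinned_ref(owner: str, name: str, version: str) -> str:
    """Build an 'owner/name:version' reference, refusing unpinned values."""
    if not _VERSION_PATTERN.fullmatch(version):
        raise ValueError(f"not a version hash: {version!r}")
    return f"{owner}/{name}:{version}"
```

Passing the resulting string to replicate.run in place of the bare slug keeps every call on the same snapshot. Storing the hash in config rather than inline makes the later upgrade a one-line change.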

Webhooks versus polling. By default, the Python client’s run function blocks until completion. For long-running jobs such as video generation or batch upscaling, this means tying up a worker or request thread for minutes while the client waits. Webhooks are better: pass a webhook URL when creating the prediction, Replicate POSTs the result to your endpoint when it finishes, and your code can do other things in the meantime.
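On the raw HTTP API, the webhook is part of the prediction request body. The sketch below builds that body. The webhook and webhook_events_filter fields are the ones the predictions endpoint accepts; the function name, the HTTPS check, and the example URL are assumptions made for illustration.

```python
import json
from typing import Any, Mapping


def build_prediction_body(version: str, inputs: Mapping[str, Any], webhook_url: str) -> bytes:
    """Serialize a create-prediction request that reports completion via webhook."""
    # Our own guard: only hand out endpoints that are served over TLS.
    if not webhook_url.startswith("https://"):
        raise ValueError("webhook_url should be an HTTPS endpoint")
    payload = {
        "version": version,
        "input": dict(inputs),
        "webhook": webhook_url,
        # Only notify once, when the prediction reaches a terminal state.
        "webhook_events_filter": ["completed"],
    }
    return json.dumps(payload).encode("utf-8")
```

The bytes returned here are what you would send as the body of the POST to /v1/predictions, with the same Authorization and Content-Type headers as any other prediction request.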

Streaming. Some models support streamed output, particularly LLMs and any model that produces tokens or frames progressively. Use streaming when you want to start rendering or processing before the full prediction completes. The exact mechanism varies by model and you will find it documented on each model page.

GPU selection. A handful of models let you choose the hardware tier. Faster GPUs cost more per second. For a quick prototype the default is fine. For production where you are processing volume, picking a smaller GPU for a model that does not need an H100 can cut your bill meaningfully.

Cold starts. If a model has not been called recently, the container has to spin up, which adds latency on the first request. Subsequent requests reuse a warm container and return faster. For latency-sensitive paths, send a periodic warm-up ping or accept the cold start tax.

Cancellation. Predictions can be cancelled mid-run. If a user abandons an upload or you detect bad inputs, POST to /v1/predictions/{id}/cancel and you stop being billed for that prediction. This is a real cost lever when you have user-driven generation flows.
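A cancel call is a bare POST with the usual bearer token. The sketch below builds the request with the standard library and leaves sending it to the caller. The function name is hypothetical.

```python
import urllib.request


def build_cancel_request(prediction_id: str, token: str) -> urllib.request.Request:
    """Build the POST request that cancels a running prediction."""
    return urllib.request.Request(
        f"https://api.replicate.com/v1/predictions/{prediction_id}/cancel",
        method="POST",
        headers={"Authorization": f"Bearer {token}"},
    )


# Sending it is one more line once you have a real prediction ID and token:
# urllib.request.urlopen(build_cancel_request(prediction_id, token))
```

Wiring this into the code path that handles an abandoned session or a failed input check is what turns cancellation into an actual saving.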

Where Replicate Shines

Replicate is at its best when you need a model that does not exist on the major hosted APIs, and you need it this week.

Image generation and editing. The image generation ecosystem moves fast and many of the strongest models are open source. Flux, SDXL variants, Stable Diffusion 3, ControlNet, IP-Adapter, LoRA stacks, inpainting, outpainting, all available on Replicate within days of release. If your product needs an image feature and you do not want to host the GPU yourself, this is the path of least resistance.

Voice and audio. Whisper for transcription, multiple TTS voices, voice cloning models, music generation. The audio side of the catalog is deep and the quality is good enough for many production uses.

Video. The video model space is younger but Replicate has been quick to host new entrants. Costs are higher because video inference is expensive, but for prototyping a video feature without committing to infrastructure, it works.

Specialised fine-tunes. Community members publish fine-tunes for very specific tasks, like removing backgrounds from product photos, restoring old images, generating pixel art, or running particular style transfers. You will find models here that you will not find anywhere else with a clean API.

Async and batch workflows. The webhook and version-pinning primitives make it straightforward to fire off many predictions, collect results out of band, and build batch processing pipelines.

Where It Fails

Honest list of limitations.

Latency. Cold starts can push first-request latency into the multi-second range, which is too slow for real-time interactive features. If your UX needs sub-second response, Replicate is the wrong layer.

Cost at scale. Pay-per-second GPU pricing is great for prototypes and bad for high-volume production. Once you are running thousands of predictions per day, the math usually favours self-hosting on your own GPU pool or moving to a dedicated inference provider with committed pricing.
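A rough way to find the crossover point is to compare the fixed monthly cost of a self-hosted GPU with the per-prediction cost on Replicate. The sketch below does that arithmetic. The function name is hypothetical and every number you feed it is your own estimate. It ignores engineering time, idle capacity, and utilisation limits, all of which push the real break-even higher.

```python
import math


def break_even_predictions(fixed_monthly_cost: float, cost_per_prediction: float) -> int:
    """Monthly prediction count at which pay-per-use spend matches a fixed monthly cost."""
    if fixed_monthly_cost < 0 or cost_per_prediction <= 0:
        raise ValueError("costs must be positive")
    return math.ceil(fixed_monthly_cost / cost_per_prediction)


# ASSUMPTION: placeholder figures, not real prices.
monthly_gpu_cost = 1000.0     # dollars per month for a hypothetical self-hosted GPU
pay_per_use_cost = 0.5        # dollars per prediction, hypothetical
print(break_even_predictions(monthly_gpu_cost, pay_per_use_cost), "predictions per month")
```

Below that volume pay-per-use is cheaper on paper; above it, the fixed-cost option starts to win, provided a single GPU can actually serve the load.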

Fine-tuning on your own data. Replicate is primarily a place to run models other people have trained. It does offer a training API for a subset of models, but if the model you need falls outside that set, you have to do the training elsewhere and then package the resulting weights as a Cog image for Replicate to host.

Compliance posture. If you operate in regulated industries with strict data residency, audit logging, or BAA requirements, Replicate is a consumer cloud service and may not match your control expectations. Check with your security team.

Model churn. Open source models get deprecated, owners delete them, or behaviour changes between versions. The version-pinning discipline I mentioned earlier is the mitigation, but it adds operational overhead.

No guaranteed SLAs. You get a working API but no contractual uptime guarantee in the way enterprise providers offer. For non-critical features this is fine. For revenue-critical paths it is a risk you have to own.

Practical Workflow Pattern

Here is the pattern I have seen work for teams adopting Replicate as part of a real product stack.

Start in the playground. Pick three or four candidate models for your task. Generate outputs, compare quality, pick a winner. This takes an afternoon and saves you from committing code to the wrong model.

Wrap the call in a thin internal client. Do not sprinkle replicate.run calls across your codebase. Create a single module, for example services/image_gen.py, that owns the model version, the input defaults, and the retry logic. When the model is deprecated, you change one file.
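A minimal sketch of such a module follows, assuming a file like services/image_gen.py. The constant names, retry count, and backoff schedule are illustrative choices rather than anything Replicate prescribes, and MODEL_REF should be replaced with a pinned owner/name:version string in real use. The runner is injectable so the retry logic can be tested without calling the API.

```python
import time
from typing import Any, Callable, Mapping

# ASSUMPTION: placeholder reference; pin this to "owner/name:version" in production.
MODEL_REF = "black-forest-labs/flux-schnell"
DEFAULT_INPUT = {"aspect_ratio": "16:9", "output_format": "jpg"}


def _default_runner(ref: str, payload: Mapping[str, Any]) -> Any:
    import replicate  # imported lazily so the module loads without the package installed

    return replicate.run(ref, input=dict(payload))


def generate_image(
    prompt: str,
    *,
    runner: Callable[[str, Mapping[str, Any]], Any] = _default_runner,
    retries: int = 3,
    backoff: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
    **overrides: Any,
) -> Any:
    """Run the image model with shared defaults and simple exponential backoff."""
    payload = {**DEFAULT_INPUT, **overrides, "prompt": prompt}
    last_error = None
    for attempt in range(retries):
        try:
            return runner(MODEL_REF, payload)
        except Exception as exc:  # broad on purpose in this sketch; narrow it in real code
            last_error = exc
            if attempt < retries - 1:
                sleep(backoff * 2 ** attempt)
    raise RuntimeError(f"image generation failed after {retries} attempts") from last_error
```

The rest of the codebase calls generate_image("...") and never imports replicate directly, so a model swap or version bump touches only this file.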

Pin versions explicitly. Read the version hash from the model page and store it in your config. Add a quarterly review process where someone checks for new versions and tests them in a staging environment.

Use webhooks for anything over a few seconds. Build a small endpoint that receives the webhook, validates the signature, and writes the result somewhere durable. Treat the prediction as fire-and-forget from the client side.
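The signature check is the part most often skipped, so here is a sketch of it. It assumes the scheme described in Replicate's webhook documentation at the time of writing: webhook-id, webhook-timestamp, and webhook-signature headers, with an HMAC-SHA256 over "id.timestamp.body" keyed by the base64-decoded part of a whsec_ signing secret. Verify those details against the current docs before relying on this. The function names and the five-minute tolerance are assumptions.

```python
import base64
import hashlib
import hmac
import time
from typing import Mapping, Optional


def expected_signature(secret: str, webhook_id: str, timestamp: str, body: bytes) -> str:
    """Compute the base64 HMAC-SHA256 signature for one webhook delivery."""
    key = base64.b64decode(secret.removeprefix("whsec_"))
    signed_content = f"{webhook_id}.{timestamp}.".encode("utf-8") + body
    digest = hmac.new(key, signed_content, hashlib.sha256).digest()
    return base64.b64encode(digest).decode("ascii")


def verify_webhook(
    body: bytes,
    headers: Mapping[str, str],  # assumed to have lowercase header names
    secret: str,
    tolerance_seconds: int = 300,  # assumed replay window
    now: Optional[float] = None,
) -> bool:
    """Return True only if the delivery is recent and carries a matching v1 signature."""
    webhook_id = headers.get("webhook-id", "")
    timestamp = headers.get("webhook-timestamp", "")
    signatures = headers.get("webhook-signature", "")
    if not (webhook_id and timestamp.isdigit() and signatures):
        return False
    current = time.time() if now is None else now
    if abs(current - int(timestamp)) > tolerance_seconds:
        return False
    expected = expected_signature(secret, webhook_id, timestamp, body)
    # The header may carry several space-separated "version,signature" pairs.
    for candidate in signatures.split():
        version, _, value = candidate.partition(",")
        if version == "v1" and hmac.compare_digest(value, expected):
            return True
    return False
```

The check has to run against the raw request bytes, before any JSON parsing, because re-serialising the body changes the signed content.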

Cache outputs. Image generation is a great candidate for content-addressed caching. If two users send the same prompt with the