What Ollama actually is

Ollama is a runtime that bundles open-weight large language models with the infrastructure needed to run them locally on consumer hardware. Under the hood it wraps llama.cpp, a C++ inference engine originally written to run the LLaMA models, and adds a model registry, a REST API, and a simple CLI. When you run ollama pull llama3.2, you are downloading a quantized GGUF file together with its configuration: a chat template, a set of sampling defaults, and an optional system prompt.

What this means in practice is that Ollama turns a multi-step process (download weights, configure context, set up a server, manage memory mapping) into a single command. The model itself is just a file on disk. Ollama is the wrapper that knows how to load it, serve it, and talk to it.

The project is open source and runs on macOS, Linux, and Windows. It exposes a local HTTP API on port 11434 by default. Alongside its native endpoints under /api, it serves OpenAI-compatible endpoints under /v1, so anything that can talk to OpenAI can usually talk to Ollama with a base URL swap. This compatibility is the single most important design decision in the project, because it means the local model can drop into existing toolchains without rewriting client code.

The model registry is a curated list rather than a free-for-all. Each entry in the registry is a Modelfile, which is a small text recipe that defines the base weights, the chat template, the default parameters, and an optional system prompt. When you pull a model, Ollama resolves the Modelfile, downloads the underlying GGUF blob, and caches both locally. This is why two models with similar names can behave very differently: the Modelfile changes the runtime behavior even when the weights are identical.

Setup and authentication

Installation is the easiest part. On macOS you download the signed app from the Ollama website and drag it to Applications. On Linux the recommended path is the one-line curl install script, which typically places the binary at /usr/local/bin/ollama and registers a systemd service so the daemon restarts on reboot. On Windows you run the OllamaSetup.exe installer, which installs Ollama for the current user and starts it in the background at login.

Once installed, the ollama CLI is your main interface. There is no account to create, no API key to manage, no billing dashboard. The only thing you might want to set is the location where models are stored. The default is a .ollama/models directory under the home directory of the user that runs the daemon (on a Linux service install that is the ollama service user, so /usr/share/ollama/.ollama/models), and some of these files are large.

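On macOS and Linux you can override this by setting OLLAMA_MODELS in the environment the daemon starts in, for example pointing it at a directory on an external SSD if your laptop drive is small. Exporting the variable in a shell only affects a daemon you launch from that shell with ollama serve. If the daemon runs under systemd, add the variable to the unit with systemctl edit ollama instead. On Windows you set the equivalent environment variable through the system settings panel. The daemon reads this on startup, so changing it requires a restart of the service.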

There is no authentication on the local API by default because the assumption is that nothing on the network should be able to reach it. The default bind address is 127.0.0.1, which means only processes on the same machine can connect. If you bind the server to a non-loopback address with OLLAMA_HOST=0.0.0.0, you should put it behind a reverse proxy with auth, or treat the port like any other exposed service and firewall it appropriately.
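The loopback rule above is easy to check mechanically before a deployment script starts the daemon. The following is a minimal sketch using only the Python standard library. The helper name is_loopback_bind is hypothetical, and it assumes you pass the bare host part of OLLAMA_HOST without a port.

```python
import ipaddress


def is_loopback_bind(host: str) -> bool:
    """Return True if `host` only accepts connections from the same machine."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (for example a hostname): treat it as exposed.
        return False


def needs_auth_layer(host: str) -> bool:
    """A non-loopback bind should sit behind a proxy with auth or a firewall."""
    return not is_loopback_bind(host)
```

Calling needs_auth_layer("0.0.0.0") returns True, which is the case where the reverse proxy or firewall becomes mandatory rather than optional.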

On Linux the systemd unit is at /etc/systemd/system/ollama.service and you can inspect it with systemctl status ollama. Logs go to journald by default, so journalctl -u ollama shows them. On macOS the app runs in the background from the menu bar and writes its server log under ~/.ollama/logs. On Windows it runs as a background application in the system tray rather than as a registered Windows service.

First working example

With the daemon running, open a terminal and run ollama run llama3.2. The first invocation downloads the model, which for a small variant like the 3B parameter Llama 3.2 is roughly 2 GB. Once the download finishes, you are dropped into a chat REPL. Type a question, hit enter, get a response. That is the entire first example.

To use the API instead of the REPL, hit the local endpoint with curl:

curl http://localhost:11434/api/generate -d '{"model": "llama3.2", "prompt": "Explain what a database index does in one paragraph."}'

The response streams JSON lines back, one chunk per token. Each line has a response field with the next piece of text and a done field that flips to true at the end. If you want a non-streaming response, add "stream": false to the request body.
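The same streaming protocol can be consumed from Python without any third-party package. This is a sketch under the assumptions already stated in the text: each line is a JSON object with a response field and a done flag. The function names iter_chunks and generate are illustrative, not part of any Ollama client library.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_chunks(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield the text fragment from each JSON line, stopping when done is true."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank keep-alive lines
        chunk = json.loads(raw)
        yield chunk.get("response", "")
        if chunk.get("done"):
            break


def generate(prompt: str, model: str = "llama3.2",
             url: str = "http://localhost:11434/api/generate") -> str:
    """Stream a completion from a local Ollama daemon and return the full text."""
    body = json.dumps({"model": model, "prompt": prompt}).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req) as resp:
        # The HTTP response object can be iterated line by line.
        return "".join(iter_chunks(resp))
```

With the daemon running, generate("Explain what a database index does in one paragraph.") returns the concatenated text. Keeping the parsing in iter_chunks separate from the network call makes it possible to test the parser against canned lines.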

For an OpenAI-compatible call, point any client at http://localhost:11434/v1 as the base URL and use llama3.2 as the model name. This is the path most third-party tools take, and it is the reason Ollama slots into existing stacks without glue code. The Python openai library, the JS openai package, LangChain, LlamaIndex, and most agent frameworks all work against this endpoint with only a base URL change.

From Python the simplest call looks like this:

from openai import OpenAI

client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")
response = client.chat.completions.create(
    model="llama3.2",
    messages=[{"role": "user", "content": "Hello"}],
)
print(response.choices[0].message.content)

The api_key argument is required by the client library but ignored by Ollama. Any string works.

Key settings that matter

The defaults are sensible for casual use, but a few dials change the experience meaningfully.

Context length is controlled by the /set parameter num_ctx command inside the REPL, or by passing num_ctx inside the options object of an API request. The default has long been 2048 tokens, and recent releases may use a larger value, so check your version. A small window is fine for short Q&A but truncates longer documents. Bumping it to 8192 or 16384 lets you feed in real documents, at the cost of more RAM and slower inference. The KV cache scales roughly linearly with context length, so doubling the context roughly doubles the cache's memory, on top of the fixed cost of the weights.
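The linear scaling can be made concrete with a back-of-the-envelope estimate. The sketch below uses the standard approximation of one key vector and one value vector per layer per token. The layer count, KV head count, and head dimension are assumed illustrative values for a small model with grouped-query attention, not figures reported by Ollama, and the estimate assumes 16-bit cache entries.

```python
def kv_cache_bytes(num_ctx: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Approximate KV cache size: a key and a value vector per layer per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * num_ctx


# Assumed, illustrative model shape (not read from any real model file).
SHAPE = dict(n_layers=28, n_kv_heads=8, head_dim=128)

for ctx in (2048, 8192, 16384):
    mib = kv_cache_bytes(ctx, **SHAPE) / 2**20
    print(f"num_ctx={ctx:>6}: ~{mib:.0f} MiB")
```

Under these assumptions the cache grows from about 224 MiB at 2048 tokens to about 1792 MiB at 16384 tokens, which is why a larger num_ctx is best set per request rather than globally.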

Quantization is the second dial. Ollama ships most models with a 4-bit quantization by default (Q4_0 on older entries, Q4_K_M on many newer ones), which is a good balance between quality and memory. If you have headroom on VRAM or RAM, pulling a Q6 or Q8 variant gives noticeably better output for tasks like code generation. If you are running on a machine with limited memory, the Q3 variants of a 7B-class model stay under 4 GB but lose some coherence on complex reasoning. The tag on the model name tells you the quantization, for example llama3.2:3b-instruct-q4_K_M.
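A rough size estimate helps when deciding which quantization fits in memory. The sketch below multiplies parameter count by approximate bits per weight. The figures assume the block layouts used by llama.cpp for Q4_0 and Q8_0 (32 weights per block plus a 16-bit scale). Real files will differ somewhat because some tensors are stored at higher precision, so treat the result as a lower-bound guide.

```python
# Approximate bits per weight, including the per-block scale factor.
BITS_PER_WEIGHT = {"q4_0": 4.5, "q8_0": 8.5, "f16": 16.0}


def weight_size_gb(n_params: float, quant: str) -> float:
    """Estimate weight size in GB (10**9 bytes) for a given quantization."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9


for quant in BITS_PER_WEIGHT:
    print(f"7B at {quant}: ~{weight_size_gb(7e9, quant):.1f} GB")
```

The gap between the 4-bit and 8-bit rows is the memory you trade for the quality improvement described above. Add the KV cache and some runtime overhead on top before deciding what fits.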

Temperature and top_p control output randomness. The defaults of temperature 0.8 and top_p 0.9 are tuned for chat. For deterministic extraction tasks you want temperature 0 and top_p 1, or you will get different answers on every run. For creative work, higher temperature produces more varied output.
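For the deterministic case, the sampling settings travel in the options object of the request body. The following is a sketch of a request builder for the native generate endpoint. The function name extraction_request is hypothetical, and the fixed seed is an extra assumption added here to make repeated runs easier to compare.

```python
import json


def extraction_request(prompt: str, model: str = "llama3.2",
                       num_ctx: int = 8192, seed: int = 0) -> dict:
    """Build a /api/generate body tuned for repeatable extraction."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # one JSON object instead of a line stream
        "options": {
            "temperature": 0,  # always pick the most likely token
            "top_p": 1,        # do not truncate the candidate distribution
            "seed": seed,      # assumption: pin the seed as well
            "num_ctx": num_ctx,
        },
    }


body = json.dumps(extraction_request("List the dates mentioned in: ..."))
```

The resulting body can be posted to http://localhost:11434/api/generate with curl -d or any HTTP client.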

GPU acceleration is automatic on Apple Silicon, NVIDIA, and AMD GPUs with ROCm support. On Apple Silicon the Metal backend is enabled by default and uses unified memory, which is why a Mac with 32 GB of RAM can run a 13B model comfortably. On Linux with an NVIDIA card you need a working NVIDIA driver. Ollama bundles the CUDA libraries it uses, so a separate CUDA toolkit install is not normally required. You can verify the GPU is being used by running ollama ps while a model is loaded. The output shows each loaded model's size and how it is split between CPU and GPU.

Keep alive is a setting most people miss. By default Ollama unloads a model from memory after 5 minutes of inactivity, which means the next request pays the full load cost. If you are building a tool that makes frequent calls, set keep_alive to a longer duration in your API requests, or to -1 to keep the model resident indefinitely. The load cost for a 7B model is typically in the range of a few seconds, but for a 70B model it can be tens of seconds.
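One way to avoid the cold start is to load the model before the first real request arrives. The sketch below relies on the behavior that a generate request without a prompt loads the model and applies the keep_alive value. The helper names are illustrative, and the URL assumes the default local port.

```python
import json
import urllib.request

API = "http://localhost:11434/api/generate"


def keep_alive_request(model: str, keep_alive) -> dict:
    """A body with no prompt only controls how long the model stays loaded."""
    return {"model": model, "keep_alive": keep_alive, "stream": False}


def send(payload: dict, url: str = API) -> dict:
    """POST a JSON body to the local daemon and decode the JSON reply."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


# Usage, with the daemon running:
#   send(keep_alive_request("llama3.2", -1))     # load and keep resident
#   send(keep_alive_request("llama3.2", "30m"))  # keep for 30 minutes
#   send(keep_alive_request("llama3.2", 0))      # unload immediately
```

Sending the preload call at application startup moves the load cost out of the first user-facing request.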

Modelfiles let you create custom model variants with pinned parameters, a fixed system prompt, and even custom templates. ollama create mymodel -f Modelfile builds a new tag from a recipe, and ollama run mymodel uses it. This is the right way to standardize a model configuration across a team.
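A team that keeps its configuration in code can generate the recipe rather than hand-edit it. This is a small sketch that renders a Modelfile using the FROM, PARAMETER, and SYSTEM instructions. The function name render_modelfile and the example values are made up for illustration.

```python
def render_modelfile(base: str, system: str, params: dict) -> str:
    """Render a minimal Modelfile: base model, pinned parameters, system prompt."""
    lines = [f"FROM {base}"]
    for name, value in params.items():
        lines.append(f"PARAMETER {name} {value}")
    lines.append(f'SYSTEM """{system}"""')
    return "\n".join(lines) + "\n"


recipe = render_modelfile(
    base="llama3.2",
    system="You extract structured data and answer only in JSON.",
    params={"temperature": 0, "num_ctx": 8192},
)
print(recipe)
```

Write the output to a file named Modelfile, then run ollama create mymodel -f Modelfile as described above. Checking the generated file into version control gives the team a reviewable history of configuration changes.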

Where it shines

Local models through Ollama are the right answer when privacy or cost dominates the decision. Anything that touches sensitive data, such as medical records, legal documents, internal company financials, or customer support transcripts, can be processed without that data ever leaving the machine. For regulated industries this is often the only viable path.

They also shine for high-volume, low-complexity work. Summarization, classification, extraction, and rewriting tasks at scale can run for free once the hardware is in place, and the marginal cost per request is essentially zero. A single workstation with a mid-range GPU can handle throughput that would cost real money per hour against a hosted API at the volumes typical for content pipelines.

Development and iteration are another strong fit. When you are prototyping prompts or building evaluation harnesses, a local model involves no network round trip and no rate limit to design around. You can hammer a local model with thousands of test cases without worrying about billing or throttling.

Ollama is also the fastest way to evaluate open-weight models. Pulling a new model takes a minute, running it through your test set takes an afternoon, and you can compare it directly against the hosted models you already use. This makes it practical to stay current with the open-weight ecosystem, which moves faster than any single vendor.

Finally, offline operation is a genuine advantage. On an air-gapped network, on a plane, or in any environment without reliable connectivity, a local model just works. For field work, embedded systems, and edge deployments, this is the only option.

Where it fails

The honest limitations matter as much as the strengths.

The largest open-weight models still cannot match the frontier hosted models on reasoning, coding, and multimodal tasks. A 70B local model is competitive with a mid-tier hosted model from a year ago, not with the current top tier. If you need the absolute best output quality, local is not there yet, and the gap is wider on tasks that require long-horizon planning or tool use.

Hardware requirements are real. Running anything beyond a 7B model comfortably requires a machine with at least 16 GB of RAM, and serious work needs 32 GB or more. A 70B model at Q4 quantization wants around 40 GB of unified memory or VRAM, which puts it out of reach of most laptops. The cost of the hardware is a real upfront investment that needs to be amortized against the inference savings.

Tool use and function calling work in Ollama but the ecosystem is younger than the hosted providers. The model support is uneven, the schemas vary, and you will hit edge cases that the documentation does not cover. For production agent systems, hosted APIs are still more reliable and better documented.

Latency on the first token after a model has been unloaded can be tens of seconds, which breaks interactive UX patterns. Even with keep_alive set, cold starts are expensive on large models. If your application makes infrequent requests, the cold start tax dominates the total response time.

Fine-tuning is not something Ollama does itself. It can run fine-tuned weights, and a Modelfile can apply a LoRA adapter with the ADAPTER instruction, but the training has to happen elsewhere, and the import path is less polished than the hosted fine-tuning APIs. For serious custom model work, dedicated training infrastructure is still the better choice.

Finally, there is no built-in observability. If you want to log requests, track token usage, or debug a prompt, you need to build that layer yourself or wrap Ollama with a proxy that does it for you. For production deployments this is a real gap.
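The raw material for such a layer is already in the API response: the final object from the native generate endpoint carries token counts and durations in nanoseconds. The sketch below turns that object into a log record. The function name and the logger name are arbitrary choices for illustration.

```python
import logging

log = logging.getLogger("ollama.usage")


def summarize_usage(final: dict) -> dict:
    """Extract token counts and throughput from a finished generate response."""
    eval_ns = final.get("eval_duration", 0)
    eval_count = final.get("eval_count", 0)
    return {
        "model": final.get("model"),
        "prompt_tokens": final.get("prompt_eval_count", 0),
        "output_tokens": eval_count,
        "load_seconds": final.get("load_duration", 0) / 1e9,
        "tokens_per_second": eval_count / (eval_ns / 1e9) if eval_ns else 0.0,
    }


def log_usage(final: dict) -> None:
    """Emit one structured log line per completed request."""
    log.info("ollama_usage %s", summarize_usage(final))
```

A load_seconds value well above zero marks a request that paid the cold start cost, which makes the keep_alive setting easy to tune from the logs.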

Practical workflow pattern

A setup that works well for a solo developer or small team looks like this.

Run Ollama on a dedicated machine, either a workstation with a discrete GPU or a Mac Studio with unified memory. Expose the API only on localhost, and use a reverse proxy if you need to reach it from other machines on the network. Set OLLAMA_MODELS to a directory on fast storage with room to grow, since the model library can easily reach several hundred GB if you pull widely.

Pull a small set of models that cover your use cases rather than everything available. A typical kit might be a 7B chat model for general work, a code-specialized model for development, and a larger reasoning model for tasks that need it. Pin specific versions with fully qualified tags like llama3.2:3b-instruct-q4_K_M so updates do not silently change behavior. Treat the local model library like any other dependency, with version pins and a deliberate update cadence.

Build your application against the OpenAI-compatible endpoint so you can swap between local and hosted models without changing code. Use environment variables for the base URL and model name, and keep a hosted fallback for the cases where the local model is not good enough. This routing logic is usually a few lines in a config layer, and it pays off the first time you need to compare outputs or handle a capability gap.
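The config layer can be as small as the following sketch. The environment variable names LLM_BASE_URL, LLM_MODEL, and LLM_API_KEY are invented for this example. The defaults point at a local Ollama daemon.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class LLMConfig:
    base_url: str
    model: str
    api_key: str


def load_config(env: Mapping[str, str] = os.environ) -> LLMConfig:
    """Read endpoint settings from the environment, defaulting to local Ollama."""
    return LLMConfig(
        base_url=env.get("LLM_BASE_URL", "http://localhost:11434/v1"),
        model=env.get("LLM_MODEL", "llama3.2"),
        api_key=env.get("LLM_API_KEY", "ollama"),  # ignored by Ollama
    )
```

Client construction then becomes OpenAI(base_url=cfg.base_url, api_key=cfg.api_key) with cfg.model passed as the model name, and switching to the hosted fallback is a matter of changing three environment variables.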

For evaluation, run a fixed prompt set against each candidate model and score the outputs. Treat the local model as one option in a menu, not a replacement for everything. The evaluation harness is the same code regardless of which model is behind it, which is the whole point of the OpenAI-compatible interface.
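A harness of this kind needs very little code. The sketch below treats each candidate as a plain function from prompt to text, so local and hosted models plug in the same way. The substring check is a deliberately crude stand-in for whatever scoring your task needs.

```python
from typing import Callable, Dict, List, Tuple


def evaluate(candidates: Dict[str, Callable[[str], str]],
             cases: List[Tuple[str, str]]) -> Dict[str, float]:
    """Return, per candidate, the fraction of cases whose output contains the expected text."""
    scores = {}
    for name, generate in candidates.items():
        passed = sum(
            expected.lower() in generate(prompt).lower()
            for prompt, expected in cases
        )
        scores[name] = passed / len(cases)
    return scores
```

In practice each candidate would be a small wrapper around an OpenAI-compatible client bound to a base URL and model name. In tests, a lambda that returns canned text is enough to exercise the harness itself.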

For data-sensitive workloads, route those requests to Ollama explicitly and everything else to your hosted provider. This gives you the privacy benefit where it matters and the quality benefit where it does not. Most teams end up with a hybrid pattern, and the routing rules are usually driven by data classification rather than performance.
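A routing rule driven by data classification can be sketched in a few lines. The classification labels here are placeholders for whatever taxonomy your organization uses. The sketch fails closed: only explicitly cleared classes go to the hosted provider, and anything unknown stays local.

```python
LOCAL = "local"
HOSTED = "hosted"

# Placeholder labels: only these classes may leave the machine.
CLEARED_FOR_HOSTED = {"public", "marketing"}


def route(classification: str) -> str:
    """Pick a backend from the data classification, defaulting to local."""
    if classification.strip().lower() in CLEARED_FOR_HOSTED:
        return HOSTED
    return LOCAL
```

Failing closed means a mislabeled or unlabeled record costs some output quality rather than causing a data leak, which is usually the right trade for sensitive workloads.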

For team usage, wrap Ollama with a thin proxy that adds logging, rate limiting, and basic auth. This is a few hours of work and turns Ollama from a personal tool into a shared service. The proxy can also expose a single endpoint that fans out
