---
title: "OpenAI Assistants API Tutorial 2026: A Practical Walkthrough"
description: "A hands-on guide to building, configuring, and deploying OpenAI Assistants with real code examples and production workflow patterns."
publishDate: "2026-06-25"
author: "Sam McKay"
difficulty: "intermediate"
service: "general"
tags:
  - ai-tools
  - tutorial
draft: false
---
What OpenAI Assistants actually is
OpenAI Assistants is a managed runtime that lets you build stateful AI agents without standing up your own orchestration layer. Strip the product framing and what you have is a server-side construct that bundles three things together: a system prompt, a set of tools the model can call, and persistent conversation state.
The key word is stateful. A normal chat completion call is a single request and response with no memory of the previous call unless you pass prior messages in manually. An Assistant is created once and configured with instructions and tools. The conversation history lives in a separate object, the Thread. You append messages to the thread, kick off a Run against an assistant, and poll until the run finishes. The thread keeps accumulating turns across calls, so the model sees the earlier conversation on every run, up to whatever fits in its context window.
Under the hood the Assistants API exposes a few resource types. An Assistant defines the model, the system instructions, and which tools are enabled. A Thread holds messages. A Run is an execution of a thread against an assistant. A Run Step is one of the actions the assistant took during that run, which becomes useful when the model is calling tools and you want to inspect what happened.
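Run Steps are easiest to understand by listing them. The sketch below assumes the 1.x Python SDK, where steps are listed with client.beta.threads.runs.steps.list and each step carries a type of either message_creation or tool_calls. The summarize_steps helper is a name made up for this example, not part of the SDK.

```python
def summarize_steps(client, thread_id, run_id):
    """Return one human-readable line per step of a run, oldest first."""
    steps = client.beta.threads.runs.steps.list(
        thread_id=thread_id, run_id=run_id, order="asc"
    )
    lines = []
    for step in steps.data:
        if step.type == "tool_calls":
            # One tool_calls step can hold several calls; report each tool type.
            kinds = [call.type for call in step.step_details.tool_calls]
            lines.append(f"{step.id}: tool calls ({', '.join(kinds)}), status={step.status}")
        else:
            # The other step type is message_creation.
            lines.append(f"{step.id}: {step.type}, status={step.status}")
    return lines
```

Calling this after a run finishes gives a quick trace of whether the assistant searched files, ran code, or answered directly.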
The tools available to an assistant are the same primitives that drive most of the agent ecosystem right now. Code Interpreter runs Python in a sandboxed environment with a working filesystem and the ability to produce downloadable artifacts. File Search indexes uploaded documents and retrieves relevant chunks when a user asks a question. Function Calling lets you register your own tool schemas, and the assistant emits structured calls that your application executes. Web search is not one of the built-in Assistants tool types as far as the API reference shows, so an assistant that needs live data from the public web usually gets it through a custom function.
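Function Calling is the tool that needs code on your side, so here is a rough sketch of the round trip. It assumes the 1.x Python SDK, where a run waiting on your code has the status requires_action and results go back through client.beta.threads.runs.submit_tool_outputs. The get_order_status function, its schema, and the answer_tool_calls helper are all hypothetical and exist only for illustration.

```python
import json


def get_order_status(order_id: str) -> dict:
    """Hypothetical local tool; a real one would query your own system."""
    return {"order_id": order_id, "status": "shipped"}


# Tool list for assistants.create(tools=...). The function schema is illustrative.
TOOLS = [
    {"type": "code_interpreter"},
    {"type": "file_search"},
    {
        "type": "function",
        "function": {
            "name": "get_order_status",
            "description": "Look up the shipping status of an order.",
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        },
    },
]

LOCAL_FUNCTIONS = {"get_order_status": get_order_status}


def answer_tool_calls(client, thread_id, run):
    """Execute the function calls a run is waiting on and send the results back."""
    outputs = []
    for call in run.required_action.submit_tool_outputs.tool_calls:
        handler = LOCAL_FUNCTIONS[call.function.name]
        arguments = json.loads(call.function.arguments)  # arguments arrive as a JSON string
        outputs.append({"tool_call_id": call.id, "output": json.dumps(handler(**arguments))})
    return client.beta.threads.runs.submit_tool_outputs(
        thread_id=thread_id, run_id=run.id, tool_outputs=outputs
    )
```

After submitting the outputs the run goes back to being processed, so you resume polling until it reaches a final status.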
The practical effect is that you can ship a working document Q&A system, a data analysis agent, or a custom-tool agent in a few hundred lines of code, with the model reasoning and tool orchestration handled by OpenAI’s infrastructure. The cost of that convenience is the limitations covered later in this guide.
Setup and authentication
The API is a REST endpoint at api.openai.com, with first-class SDKs for Python and Node.js. The Python SDK is the best-supported and what this walkthrough uses.
Install the package and set your API key as an environment variable. Create a project directory, initialize a virtual environment, and install the openai package version 1.x or later. Export OPENAI_API_KEY in your shell, or load it from a .env file using python-dotenv. The SDK picks it up automatically.
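A small guard makes a missing key fail loudly before any request goes out. This is a minimal sketch: require_api_key and make_client are helper names invented here, and the SDK import is deferred so the module still loads on a machine where the openai package is not installed yet.

```python
import os


def require_api_key(env=os.environ):
    """Return the API key, or raise a clear error before any request is made."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set; export it or add it to .env")
    return key


def make_client():
    """Build a client; the SDK is imported lazily (pip install openai)."""
    from openai import OpenAI

    return OpenAI(api_key=require_api_key())
```

Passing the key explicitly is optional, since the SDK reads the same environment variable on its own; the guard only exists to give a friendlier error message.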
For local development that is enough. For anything beyond a hobby project you should be using a project-scoped key rather than the default user key. In the OpenAI dashboard create a new project, generate an API key tied to that project, and set usage limits at the project level. Keys created this way can be revoked independently and they keep your main account billing clean.
One thing people miss: Assistants resources (the assistants themselves, their threads, and their files) live in your account until you delete them. If you build a proof of concept, make a habit of cleaning up assistant IDs and thread IDs you no longer need, or you will end up with a long list of orphan objects in your dashboard.
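The cleanup habit can be as small as one function you call at the end of an experiment. The sketch assumes the 1.x SDK's client.beta.assistants.delete and client.beta.threads.delete calls; delete_resources is an illustrative name, and swallowing every exception is a simplification you would probably tighten in real code.

```python
def delete_resources(client, assistant_ids=(), thread_ids=()):
    """Delete the given assistants and threads; return the IDs that failed."""
    failed = []
    for assistant_id in assistant_ids:
        try:
            client.beta.assistants.delete(assistant_id)
        except Exception:  # for example, the object was already deleted
            failed.append(assistant_id)
    for thread_id in thread_ids:
        try:
            client.beta.threads.delete(thread_id)
        except Exception:
            failed.append(thread_id)
    return failed
```

This only works if you recorded the IDs when you created the objects, which is a good reason to log them from the start.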
First working example
Here is a minimal end-to-end example that creates an assistant, opens a thread, posts a user message, runs the thread, and prints the final response.
In Python, import the OpenAI client. The client object exposes the high-level resources you need: client.beta.assistants, client.beta.threads, and client.beta.threads.runs.
Create the assistant by calling client.beta.assistants.create with a name, instructions, model, and a list of tools. For the first run, leave tools empty and pick a fast, inexpensive model so you can iterate cheaply. The instructions are the system prompt, so write them the same way you would write any other system prompt. Be specific about tone, constraints, and output format.
Create a thread with client.beta.threads.create. A thread starts empty. Add a user message by calling client.beta.threads.messages.create with the thread_id and a role of user.
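Put together, the setup calls described so far look roughly like this. The assistant name, the instructions text, and the gpt-4o-mini default are placeholder assumptions, and start_conversation is a helper name invented for the example.

```python
def start_conversation(client, question, model="gpt-4o-mini"):
    """Create an assistant and a thread, post one user message, return both IDs."""
    assistant = client.beta.assistants.create(
        name="Docs helper",  # placeholder name
        instructions="Answer briefly in plain English. Say so when you do not know.",
        model=model,  # assumed model; swap in whatever you have access to
        tools=[],  # no tools for the first iteration
    )
    thread = client.beta.threads.create()  # a thread starts empty
    client.beta.threads.messages.create(
        thread_id=thread.id, role="user", content=question
    )
    return assistant.id, thread.id
```

In a real application you would create the assistant once and reuse its ID, and only create a new thread per conversation.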
Kick off a run with client.beta.threads.runs.create, passing the thread_id and assistant_id. The response is a Run object with a status, typically queued, in_progress, requires_action, completed, or failed. You poll by calling client.beta.threads.runs.retrieve with the run_id, sleeping for a short interval between checks, until the status is no longer queued or in_progress.
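The polling loop is worth writing once with a timeout so a stuck run cannot hang your process. This sketch assumes the 1.x SDK's runs.retrieve call; wait_for_run, the interval, and the timeout are illustrative choices. It also treats cancelling as a status to keep waiting on, which goes slightly beyond the two statuses named above.

```python
import time

ACTIVE_STATUSES = {"queued", "in_progress", "cancelling"}


def wait_for_run(client, thread_id, run_id, interval=1.0, timeout=120.0, sleep=time.sleep):
    """Poll a run until it leaves the active statuses or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        run = client.beta.threads.runs.retrieve(thread_id=thread_id, run_id=run_id)
        if run.status not in ACTIVE_STATUSES:
            # completed, requires_action, failed, cancelled, expired, or incomplete
            return run
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run {run_id} still {run.status} after {timeout}s")
        sleep(interval)
```

The caller still has to branch on the returned status, because only completed means there is an answer to read. Recent SDK releases also ship a create_and_poll convenience method on runs that wraps a similar loop.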
When the run status is completed, list the messages on the thread with client.beta.threads.messages.list. Results are paginated and come back most recent first by default, so the answer is the newest message with the assistant role, which is normally the first item in the list.
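Reading the answer takes a little unpacking, because a message's content is a list of blocks rather than a plain string. The sketch assumes the 1.x SDK's message shape, where text blocks have a type of text and hold the string at text.value; latest_assistant_text is a name chosen for the example.

```python
def latest_assistant_text(client, thread_id):
    """Return the text of the newest assistant message, or None if there is none."""
    page = client.beta.threads.messages.list(thread_id=thread_id, order="desc", limit=20)
    for message in page.data:  # newest first because of order="desc"
        if message.role == "assistant":
            # Skip non-text blocks such as images produced by Code Interpreter.
            parts = [block.text.value for block in message.content if block.type == "text"]
            return "\n".join(parts)
    return None
```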
That is the basic loop. Every real application you build on top of the Assistants API will be a variation of create assistant, open thread, post message, run, poll, read response, repeat.
Key settings that matter
Most of the documentation focuses on the obvious knobs, which are model selection, instructions, and tool enablement. Beyond those, several dials materially change behavior, and people tend to leave them at their defaults.
The first is the temperature, top_p, and reasoning_effort trio. Temperature controls randomness, top_p controls nucleus sampling, and reasoning_effort (on supported models) controls how much internal deliberation the model does before answering. For a Q&A assistant over internal documents, a lower temperature and low reasoning effort give you faster, more repeatable answers. For a research assistant that needs to weigh tradeoffs, higher reasoning effort is worth the latency and cost.
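One way to keep these choices consistent is to name them as profiles. The numbers below are illustrative starting points, not recommendations from OpenAI, and sampling_settings is a made-up helper. The sketch keeps reasoning_effort apart from the sampling values on the assumption that reasoning models often reject custom temperature settings, so check what your chosen model accepts.

```python
def sampling_settings(profile):
    """Return keyword arguments for assistants.create for a named profile."""
    if profile == "qa":
        # Low randomness for repeatable answers over internal documents.
        return {"temperature": 0.2, "top_p": 1.0}
    if profile == "research":
        # Only meaningful on models that support reasoning_effort.
        return {"reasoning_effort": "high"}
    raise ValueError(f"unknown profile: {profile!r}")
```

The result is meant to be unpacked into the create call, for example client.beta.assistants.create(model=..., instructions=..., **sampling_settings("qa")).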
The second is the tool_choice parameter on a run. By default the model decides freely whether to call a tool. Setting tool_choice to required forces at least one tool call, which is useful for assistants that always need to ground their answer in a file search query. Setting it to an object that names a tool, such as a function entry with a specific function name, forces that exact tool. For most workflows the default is correct, and over-constraining the model tends to produce worse results.
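The accepted shapes for tool_choice are easy to get wrong, so a tiny builder helps. This sketch reflects the shapes as I understand them from the API reference: a plain string for auto, required, or none, and an object for a specific tool. build_tool_choice is a hypothetical helper name.

```python
def build_tool_choice(mode="auto", function_name=None):
    """Return a value suitable for the tool_choice argument of runs.create."""
    if mode in ("auto", "required", "none"):
        return mode  # plain string modes
    if mode in ("file_search", "code_interpreter"):
        return {"type": mode}  # force one built-in tool
    if mode == "function":
        if not function_name:
            raise ValueError("function_name is required when mode is 'function'")
        return {"type": "function", "function": {"name": function_name}}
    raise ValueError(f"unknown mode: {mode!r}")
```

For example, build_tool_choice("file_search") can be passed as tool_choice when every answer must be grounded in a document lookup.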
The third is the truncation_strategy, which is set on a run rather than on the assistant. Threads grow indefinitely, and once a thread exceeds the model’s context window you have a problem. The auto strategy tells the API to drop messages from the thread so that the prompt fits the context window and any max_prompt_tokens limit. This is almost always what you want for long-lived conversations. The alternative, last_messages, keeps only the N most recent messages, which is predictable but discards older context outright.
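Here is a sketch of passing a truncation strategy when a run is created. It assumes the run-level truncation_strategy parameter with the two types I know from the API reference, auto and last_messages; the helper names are invented for the example.

```python
def truncation_strategy(last_messages=None):
    """Build the truncation_strategy value: auto, or keep the last N messages."""
    if last_messages is None:
        return {"type": "auto"}
    if last_messages < 1:
        raise ValueError("last_messages must be at least 1")
    return {"type": "last_messages", "last_messages": last_messages}


def run_with_truncation(client, thread_id, assistant_id, last_messages=None):
    """Start a run that trims the thread according to the chosen strategy."""
    return client.beta.threads.runs.create(
        thread_id=thread_id,
        assistant_id=assistant_id,
        truncation_strategy=truncation_strategy(last_messages),
    )
```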
The fourth is response_format. If your assistant is expected to return structured data, set response_format to a JSON schema. The model will then constrain its output to match the schema, which is far more reliable than asking for JSON in the system prompt and hoping for the best.
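As a rough illustration, a schema-based response_format and the matching parse step might look like this. The ticket_summary schema is invented for the example, and the json_schema shape with a name, a strict flag, and a schema follows the structured outputs format as I understand it. Even with a schema in place, parsing defensively costs almost nothing.

```python
import json

# Hypothetical schema for a support-ticket triage assistant.
TICKET_FORMAT = {
    "type": "json_schema",
    "json_schema": {
        "name": "ticket_summary",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {
                "category": {"type": "string", "enum": ["billing", "bug", "other"]},
                "summary": {"type": "string"},
            },
            "required": ["category", "summary"],
            "additionalProperties": False,
        },
    },
}


def parse_ticket(raw_text):
    """Parse the assistant's reply and check that the required keys are present."""
    data = json.loads(raw_text)
    required = TICKET_FORMAT["json_schema"]["schema"]["required"]
    missing = [key for key in required if key not in data]
    if missing:
        raise ValueError(f"reply is missing keys: {missing}")
    return data
```

TICKET_FORMAT would be passed as response_format when creating the assistant or the run, and parse_ticket would be applied to the text of the assistant's reply.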
The fifth is the max_completion_tokens and max_prompt_tokens limits, which are set on a run. These cap how many tokens the run can spend on output and input across all of its turns. Without them, a single run can balloon in cost when a tool returns a large payload or the assistant decides to write a long response. A run that hits either limit ends with the status incomplete, so handle that case in your code. Set these explicitly in production.
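A budgeted run and the matching check could be sketched as follows. The token numbers are arbitrary examples, the helper names are invented, and the code assumes that a run stopped by a limit reports the status incomplete with a reason under incomplete_details, which is how I read the API reference.

```python
def run_with_budget(client, thread_id, assistant_id,
                    max_prompt_tokens=20_000, max_completion_tokens=1_000):
    """Start a run with explicit caps on input and output tokens."""
    return client.beta.threads.runs.create(
        thread_id=thread_id,
        assistant_id=assistant_id,
        max_prompt_tokens=max_prompt_tokens,
        max_completion_tokens=max_completion_tokens,
    )


def budget_exceeded(run):
    """Return the reason a finished run was cut short, or None if it was not."""
    if run.status != "incomplete":
        return None
    details = getattr(run, "incomplete_details", None)
    return getattr(details, "reason", None)
```

Checking budget_exceeded after polling lets you decide whether to retry with a larger budget or to surface a partial answer.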
The sixth is metadata. Assistants, threads, and runs all accept a metadata bag of key-value pairs, up to 16 per object with string values. Use it. Store your internal user ID, the workspace the assistant is acting in, or the source of the request. When you have thousands of runs and need to debug, metadata is often the fastest way to tie an object back to the user or request that produced it.
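Because the API rejects oversized metadata, it can help to validate it before the call. The limits in this sketch (16 pairs, 64-character keys, 512-character values) are the ones I know from the API reference, and build_metadata is a hypothetical helper.

```python
MAX_PAIRS, MAX_KEY_LEN, MAX_VALUE_LEN = 16, 64, 512


def build_metadata(**fields):
    """Convert values to strings and enforce the documented size limits."""
    if len(fields) > MAX_PAIRS:
        raise ValueError(f"metadata allows at most {MAX_PAIRS} pairs")
    metadata = {}
    for key, value in fields.items():
        text = str(value)
        if len(key) > MAX_KEY_LEN:
            raise ValueError(f"metadata key too long: {key!r}")
        if len(text) > MAX_VALUE_LEN:
            raise ValueError(f"metadata value too long for key {key!r}")
        metadata[key] = text
    return metadata
```

A typical call would be build_metadata(user_id=42, workspace="acme", source="web"), passed as the metadata argument when creating a thread or run.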
Where it shines
The Assistants API is a strong fit for three patterns.
The first is internal document Q&A. Upload a knowledge base, enable File Search, and you have a working retrieval system in a day. The model is good at answering questions that require synthesizing across multiple chunks, and the chunking, embedding, and retrieval are all handled for you. For a single-tenant internal tool with a few thousand pages of policy docs, product specs, or support transcripts, this is the cheapest path to a usable answer engine.
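For the document Q&A pattern, the wiring might look roughly like this. It is a sketch under assumptions: recent 1.x SDK releases expose vector stores at client.vector_stores while older ones use client.beta.vector_stores, so the code tries both. The store name, assistant name, instructions, default model, and the build_docs_assistant helper are all placeholders.

```python
from contextlib import ExitStack


def build_docs_assistant(client, paths, model="gpt-4o-mini"):
    """Upload files into a vector store and attach it to a File Search assistant."""
    # Newer SDKs: client.vector_stores; older SDKs: client.beta.vector_stores.
    stores = getattr(client, "vector_stores", None) or client.beta.vector_stores
    store = stores.create(name="internal-docs")  # placeholder store name

    with ExitStack() as stack:
        # Keep every file open for the duration of the upload, then close them all.
        files = [stack.enter_context(open(path, "rb")) for path in paths]
        stores.file_batches.upload_and_poll(vector_store_id=store.id, files=files)

    return client.beta.assistants.create(
        name="Docs Q&A",  # placeholder
        instructions="Answer only from the attached documents and name the source file.",
        model=model,  # assumed model
        tools=[{"type": "file_search"}],
        tool_resources={"file_search": {"vector_store_ids": [store.id]}},
    )
```

From here the thread, run, and polling steps are the same as in the basic loop; the only difference is that the assistant can now retrieve chunks from the uploaded files.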
The second is structured data