What RAG Actually Is
RAG stands for Retrieval Augmented Generation. Strip away the marketing and it is a three-stage pipeline that gives a language model access to external information at query time.
The first stage is the retriever. When a user asks a question, the retriever searches a database of pre-indexed documents and returns the most relevant chunks. The database is usually a vector store, which means each chunk has been converted into a numerical embedding that captures its semantic meaning.
The second stage is the augmenter. The retrieved chunks are stuffed into the prompt that gets sent to the language model, usually with instructions like “answer the question using only the context below.”
The third stage is the generator. The language model produces an answer grounded in the retrieved context rather than relying solely on its training data.
Why bother with this at all. Three reasons. First, language models have knowledge cutoffs and cannot answer questions about events after training. Second, they hallucinate confidently when they do not know something, which is a problem for any business use case where accuracy matters. Third, they have no access to private data such as internal wikis, customer contracts, or proprietary research.
RAG solves all three by injecting fresh, verified, private context into the prompt. The model is no longer guessing. It is reading.
Setup And Authentication
You will need a Python environment, a few packages, and an API key. The current versions of the libraries mentioned here are typical for a stack that works as of mid-2026.
Create a virtual environment and install the dependencies. The core stack is langchain for orchestration, openai for the language model, chromadb for the local vector store, sentence-transformers for embeddings, and pypdf for loading PDF documents. You can swap any of these for alternatives, but this combination gets you a working pipeline in under ten minutes.
Set your API key as an environment variable. Never hardcode it in your script. On macOS or Linux the command is export OPENAI_API_KEY=your_key_here. On Windows use setx OPENAI_API_KEY "your_key_here" and restart your terminal.
If you prefer not to use OpenAI, the same architecture works with Anthropic, Google Gemini, or any open-source model served through Ollama. The retrieval layer stays identical. Only the generator changes.
For local-only setups where you cannot send data to external APIs, run an embedding model locally with sentence-transformers and a small language model through Ollama. The retrieval quality drops a bit but your data never leaves your machine.
First Working Example
Here is a complete, runnable pipeline that loads a PDF, chunks it, embeds it, stores the chunks in a local vector database, and answers a question about the content.
Save a PDF to a folder called docs in your working directory. The script below assumes the file is docs/handbook.pdf.
The pipeline has five steps. Load the document. Split it into chunks. Embed each chunk. Store the embeddings in Chroma. Query the database and send the top results to the language model along with the original question.
The chunk size matters and we will come back to it in the next section. For now, 1000 characters with 200 characters of overlap is a reasonable starting point.
When you run the script, you should see an answer that references specific details from the PDF rather than a generic response. If the answer looks like the model is making things up, your retrieval is missing the right chunks and you need to adjust the chunk size or the embedding model.
A common first-time mistake is using a chunk size that is too large. The embedding model compresses too much text into a single vector and the meaning gets diluted. Another common mistake is forgetting to persist the vector store, which means you re-embed the entire document every time you run the script.
Key Settings That Matter
Most tutorials gloss over the dials that actually determine whether your RAG system is useful or useless. Here are the ones worth obsessing over.
Chunk size controls how much text goes into each embedding. Too small and you lose context. Too large and you lose precision. The typical range is 500 to 1500 characters for prose, and 100 to 300 tokens for code. Start at 1000 and adjust based on the answers you get.
Chunk overlap prevents information from being split across a boundary where it would lose meaning. A 10 to 20 percent overlap is standard. Without overlap, a sentence that spans two chunks gets cut in half and neither chunk is useful on its own.
The embedding model determines what “similar” means. The current default in most stacks is a sentence-transformer model like all-MiniLM-L6-v2 for local work or OpenAI’s text-embedding-3-small for hosted work. Larger embedding models are more accurate but slower and more expensive. For most business use cases the small models are good enough.
Top-k is the number of chunks you retrieve for each query. Three to five is a common starting point. More chunks give the model more context but also more noise and a longer prompt, which costs more and can confuse the model.
Similarity threshold filters out chunks that are not relevant enough. If your top result has a similarity score below 0.7 in cosine distance, it is probably not useful and you should consider the retrieval a miss.
Reranking is a second pass that reorders the retrieved chunks using a more expensive model. It usually improves answer quality by 10 to 20 percent but adds latency. Use it when accuracy matters more than speed.
Temperature should be set low for RAG, typically 0 to 0.2. You want the model to stick to the retrieved context rather than improvise. Higher temperatures invite hallucination, which defeats the purpose of retrieval.
System prompt wording matters more than people think. A prompt that says “answer using only the context below and say you do not know if the answer is not in the context” performs measurably better than one that says “answer the question.”
Where It Shines
RAG genuinely excels at a handful of use cases. If your problem matches one of these, the technology is mature and the results are reliable.
Internal knowledge bases are the canonical use case. Company wikis, HR policies, product documentation, and onboarding guides all benefit. The model can answer questions in natural language without forcing employees to search through folders.
Customer support is another strong fit. Pair RAG with your help center articles and the model can draft responses that cite the relevant article. A human agent reviews and sends. This cuts response time substantially without sacrificing accuracy.
Legal and compliance document review works well when the documents are static and the questions are bounded. Contract analysis, regulatory lookup, and policy comparison are all within reach of a well-tuned pipeline.
Code documentation is a surprisingly good fit. Embed your codebase or its documentation and the model can answer questions about APIs, configuration options, and common error messages. Tools like Cursor and Continue use this pattern under the hood.
Personal note-taking and research are the use case most people overlook. If you have a few hundred PDFs of papers, reports, or book excerpts, RAG turns them into a searchable conversation partner.
Where It Fails
RAG is not magic. There are classes of problems where it struggles and you should know about them before committing to a build.
Multi-hop reasoning is the biggest weakness. If answering a question requires combining information from three different documents in a specific order, retrieval often returns the wrong chunks and the model cannot reason across them. You can mitigate this with agentic patterns that do multiple retrieval passes, but it adds complexity.
Very long documents are a problem. If your corpus is millions of pages, retrieval quality degrades and latency increases. For very large corpora you need a more sophisticated indexing strategy such as hierarchical retrieval or document summarization.
Real-time data does not work. RAG only knows what is in the index. If your information changes every minute, you need to re-embed constantly, which is expensive. For real-time use cases, a tool with native web search is a better fit.
Ambiguous queries fail silently. If the user asks a vague question, retrieval returns mediocre chunks and the model produces a mediocre answer. You need query rewriting or clarification steps to handle this, which most basic tutorials skip.
Tables, charts, and structured data are poorly served by standard text embeddings. If your documents are mostly spreadsheets, RAG is the wrong tool. Use a SQL-augmented agent instead.
Evaluation is hard. Unlike traditional software, there is no obvious pass or fail. You need a test set of questions with known good answers, and you need to measure retrieval recall and answer faithfulness separately. Most teams skip this and ship a system that looks good in demos but fails in production.
Practical Workflow Pattern
Here is how a working team typically slots RAG into a real workflow.
The first phase is ingestion. Documents arrive in a watched folder or are pushed through an API. A script loads them, chunks them, embeds them, and writes them to the vector store. This runs on a schedule, typically nightly or whenever a new document is added.
The second phase is query handling. A user asks a question through a chat interface, a Slack bot, or an API endpoint. The system retrieves the top chunks, builds a prompt, calls the language model, and returns the answer with citations to the source documents.
The third phase is evaluation. You maintain a test set of 50 to 100 questions with verified answers. You run your pipeline against this set weekly and track retrieval recall, answer faithfulness, and latency. When quality drops, you know to re-embed with a better model or adjust chunk sizes.
The fourth phase is monitoring. In production you log every query, the retrieved chunks, and the generated answer. You sample these weekly to spot failure patterns. Common patterns include retrieval misses, hallucinations, and user questions the system cannot answer.
The fifth phase is iteration. Based on what you find in monitoring, you update your chunking strategy, swap embedding models, add reranking, or expand your test set. RAG is not a one-time build. It is a system that improves with feedback.
A practical tip. Start with a single document type and a narrow question domain. Get that working well before expanding. Teams that try to build a universal knowledge assistant on day one usually ship something that works poorly across the board instead of something that works well in one area.
Another tip. Always show the user the source chunks. This builds trust and lets you debug retrieval failures quickly. A RAG system without citations is a black box and users will not trust it.
To see how tools like this fit into a complete AI operating layer for your business, book a 60-min Omni Audit — https://calendly.com/sam-mckay/discovery-call