What Embeddings Actually Are
An embedding is a list of numbers that represents a piece of content, usually text, in a way a computer can compare mathematically. The list might be 384 numbers long, or 1,536, or 3,072 depending on the model. Each number is a coordinate in a high-dimensional space. Semantically similar content lands near each other in that space. Dissimilar content lands far apart.
That’s the whole concept. Everything else is engineering around that core idea.
A model takes your input, runs it through a neural network, and spits out a vector. The vector isn’t random. It has been trained so that two sentences about banking policy end up with vectors that point in roughly the same direction, while a sentence about baking bread points elsewhere. The training objective is what makes this work, and different models have been trained with different objectives, which is why some embeddings are better for code, some for legal text, and some for general English.
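To make “point in roughly the same direction” concrete, here’s a small sketch using made-up three-dimensional vectors. The numbers are invented for illustration and don’t come from any real model; real embeddings have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical toy vectors, not output from a real embedding model.
banking_policy = [0.9, 0.1, 0.2]
loan_rules = [0.8, 0.2, 0.1]
baking_bread = [0.1, 0.9, 0.3]

print(cosine_similarity(banking_policy, loan_rules))    # close to 1
print(cosine_similarity(banking_policy, baking_bread))  # noticeably lower
```

The two banking-style vectors score near 1, while the bread vector scores much lower against either of them.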
You don’t need to train your own model. Several providers expose embedding endpoints you can call. You send text, you get back a vector. You can then do things with that vector that would be hard to do with raw text, like cluster it, search it, classify it, or feed it into a retrieval pipeline.
This is the foundation of modern semantic search, RAG systems, recommendation engines, and most of what gets called “AI memory.” If you’ve used a tool that finds documents by meaning rather than keyword, you’ve used embeddings. If you’ve chatted with a bot that seemed to remember a 200-page PDF, embeddings were involved.
The output looks like a JSON array of floats. You store it, index it, and query it. That’s the entire pipeline at its most basic.
Setup and Authentication
For this walkthrough, I’ll use OpenAI’s text-embedding-3-small as the concrete example because it’s the most widely used and the pricing is straightforward, around $0.02 per million tokens at the time of writing. The pattern is nearly identical for Voyage, Cohere, or open-source models running on your own hardware.
First, install the SDK. If you’re using Python, the OpenAI client is one line: pip install openai.
Then set your API key. The simplest way is an environment variable called OPENAI_API_KEY. On macOS or Linux, you add it to your shell profile or export it for the session. On Windows, you set it through the system environment variables panel or use a .env file with a library like python-dotenv.
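Here’s a small sketch of reading the key in Python instead of pasting it into the script. The helper name load_api_key is hypothetical; the only real convention here is the OPENAI_API_KEY variable name.

```python
import os

def load_api_key(var_name="OPENAI_API_KEY"):
    """Return the API key from the environment, or fail with a clear message."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(
            f"{var_name} is not set. Export it in your shell or load it from a .env file."
        )
    return key
```

The OpenAI Python client reads OPENAI_API_KEY on its own when you construct it without arguments, so a check like this mainly buys you an earlier and clearer error.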
Never hardcode the key in a script you plan to commit. Use environment variables, a .env file in your .gitignore, or a secrets manager like Doppler or AWS Secrets Manager. This is the kind of mistake junior developers make once, and only once, because their key gets scraped and they get a $40,000 bill in two days.
If you’d rather run everything locally and skip the API call entirely, sentence-transformers from Hugging Face works well. You install it with pip, pick a model like all-MiniLM-L6-v2, and embed on your own GPU or CPU. Quality is generally lower than the hosted frontier models, which is the trade-off you’d expect from a small free model, but nothing leaves your machine and there’s no per-token cost.
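A minimal sketch of the local route, assuming you’ve run pip install sentence-transformers. The function name embed_locally and the optional model argument are illustrative choices, not part of the library.

```python
def embed_locally(texts, model=None, model_name="all-MiniLM-L6-v2"):
    """Embed a list of strings with a local sentence-transformers model."""
    if model is None:
        # Imported lazily so this file loads even without the package installed.
        from sentence_transformers import SentenceTransformer
        model = SentenceTransformer(model_name)  # downloads weights on first use
    # normalize_embeddings=True returns unit-length vectors,
    # so a plain dot product equals cosine similarity.
    return model.encode(texts, normalize_embeddings=True)

# Example usage (downloads the model the first time it runs):
# vectors = embed_locally(["How do planes fly?", "Bread needs yeast to rise."])
# print(vectors.shape)  # one 384-dimensional row per input for all-MiniLM-L6-v2
```

If you embed more than once, load the model a single time and pass it in through the model argument rather than reloading it on every call.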
For storage, you need a vector database once you have more than a few thousand embeddings. Options range from simple to industrial. For learning, sqlite with the sqlite-vss extension is fine for tens of thousands of vectors. For production, Pinecone, Weaviate, Qdrant, or pgvector inside Postgres are the common picks. pgvector is appealing if you already have a Postgres instance because there’s nothing new to operate.
The setup loop looks like this: get an API key, install one client library, pick a vector store, write a small function that takes text and returns a vector, and another function that takes a query and returns the nearest stored vectors. Everything else builds on those two functions.
Your First Working Example
Here’s a runnable Python script that embeds a handful of sentences, stores them, and runs a semantic search.
You start with a list of documents. Each one is just a string. You’ll embed each one and save the vector alongside the original text in memory or in your database.
Then you take a query, embed it the same way, and compute cosine similarity between the query vector and every document vector. The documents with the highest similarity score are your results.
Cosine similarity is a value between negative one and positive one. In practice, well-embedded text clusters tend to give scores in the 0.3 to 0.95 range for relevant matches. Don’t read too much into absolute values across different models, since they aren’t calibrated to each other. What matters is the relative ranking.
The script does this in under thirty lines. Embed every document once, store the vectors, and the search step is just a sorted list. No machine learning knowledge required to read the code, just basic Python and an understanding of lists.
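A minimal sketch of such a script, assuming the openai Python SDK (version 1 or later) and an OPENAI_API_KEY in your environment. The function names and the embed parameter are illustrative; passing the embedding function in makes it easy to swap in a local model or a stub.

```python
import math

def embed_with_openai(texts, model="text-embedding-3-small"):
    """Embed a list of strings via the OpenAI API (needs OPENAI_API_KEY set)."""
    from openai import OpenAI  # lazy import: only needed for real API calls
    client = OpenAI()
    response = client.embeddings.create(model=model, input=texts)
    return [item.embedding for item in response.data]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def build_index(documents, embed=embed_with_openai):
    """Embed every document once and keep (text, vector) pairs in memory."""
    return list(zip(documents, embed(documents)))

def search(query, index, embed=embed_with_openai, top_k=3):
    """Return the top_k (score, text) pairs, best match first."""
    query_vector = embed([query])[0]
    scored = [(cosine_similarity(query_vector, vec), text) for text, vec in index]
    return sorted(scored, reverse=True)[:top_k]

# Example usage (makes real API calls):
# docs = [
#     "Wings generate lift because of pressure differences described by Bernoulli's principle.",
#     "Sourdough bread needs a long, slow fermentation.",
#     "Savings accounts pay interest on deposited funds.",
# ]
# index = build_index(docs)
# for score, text in search("how do planes fly", index):
#     print(f"{score:.3f}  {text}")
```

The search step is a brute-force comparison against every stored vector, which is fine for a handful of documents and is the part a vector database replaces later.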
Once that works, the next step is persistence. Replace the in-memory list with a database call. If you’re using pgvector, the query becomes a SELECT with an ORDER BY embedding <=> $1 LIMIT 5, where <=> is the cosine distance operator. Same logic, different syntax.
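Here’s a sketch of that query from Python, assuming psycopg (version 3) and a hypothetical documents table with content and embedding columns. The table, column, and function names are placeholders to adapt to your own schema.

```python
def to_pgvector_literal(vector):
    """Format a list of numbers in pgvector's text form, e.g. '[0.1,0.2,0.3]'."""
    return "[" + ",".join(str(float(x)) for x in vector) + "]"

# Hypothetical schema: documents(content text, embedding vector(1536)).
# <=> is cosine distance, so 1 - distance gives cosine similarity.
NEAREST_SQL = """
    SELECT content, 1 - (embedding <=> %s::vector) AS cosine_similarity
    FROM documents
    ORDER BY embedding <=> %s::vector
    LIMIT %s
"""

def nearest_documents(conn, query_vector, limit=5):
    """Run the nearest-neighbor query on an open psycopg connection."""
    literal = to_pgvector_literal(query_vector)
    return conn.execute(NEAREST_SQL, (literal, literal, limit)).fetchall()
```

Passing the vector as a text literal with a ::vector cast keeps the sketch free of extra adapters; the pgvector Python package can register a native type if you prefer that route.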
If you want to test whether your pipeline is actually doing semantic search rather than keyword matching, try searching for “how do planes fly” against a corpus that doesn’t contain that phrase but does contain a document about lift and Bernoulli’s principle. If your retrieval returns the Bernoulli document near the top, you have a working embedding search. If it ranks that document low or returns unrelated results, you have a problem with how the vectors are being compared, stored, or indexed.
Key Settings That Matter
The first dial is the model itself. Larger embedding models generally produce better-quality vectors, but they are slower and more expensive to run. text-embedding-3-small is a sensible default. text-embedding-3-large is better for harder retrieval tasks where the difference in quality is worth the cost. text-embedding-ada-002 is the older model and rarely worth choosing now.
The second dial is dimensionality. Some models let you specify a smaller output dimension through a parameter, which trades a small amount of quality for big savings in storage and lookup speed. The text-embedding-3 models support this through the “dimensions” parameter in the API call (“encoding_format” is a separate setting that only controls whether vectors come back as floats or base64). If you’re indexing a million documents, halving the dimension halves your vector storage and, for brute-force search, roughly halves the comparison work per query. With an approximate index, the latency gain depends on the index.
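The storage side of that trade-off is simple arithmetic. This sketch assumes float32 storage (4 bytes per value) and ignores index overhead; the helper names are illustrative, and the second function shows where the dimensions parameter goes in an OpenAI call.

```python
def vector_storage_gb(num_vectors, dimensions, bytes_per_value=4):
    """Raw size of the vectors alone, assuming float32 and no index overhead."""
    return num_vectors * dimensions * bytes_per_value / 1e9

print(vector_storage_gb(1_000_000, 1536))  # 6.144
print(vector_storage_gb(1_000_000, 768))   # 3.072

def embed_reduced(texts, dimensions=768, model="text-embedding-3-small"):
    """Request shorter vectors from the API via the dimensions parameter."""
    from openai import OpenAI  # lazy import: only needed for real API calls
    client = OpenAI()
    response = client.embeddings.create(model=model, input=texts, dimensions=dimensions)
    return [item.embedding for item in response.data]
```

Whatever dimension you pick, use the same value for indexing and querying, since vectors of different lengths can’t be compared.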
The third dial is chunking. Embeddings work on a chunk of text, and the right chunk size depends on what you’re searching. For general documents, 200 to 500 tokens per chunk is a common starting point. For code, smaller chunks around function boundaries work better. For long-form Q&A, chunking by paragraph or section often beats arbitrary token windows. The chunking strategy quietly determines whether your retrieval feels magical or broken.
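A minimal sketch of a fixed-size chunker with overlap. It counts whitespace-separated words as a rough stand-in for tokens, which is an assumption; for exact token counts you’d use the model’s tokenizer instead.

```python
def chunk_words(text, chunk_size=300, overlap=50):
    """Split text into overlapping windows of whitespace-separated words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks
```

The overlap means a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of embedding some text twice.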
The fourth dial is preprocessing. Lowercasing, stripping punctuation, removing stopwords. For classic keyword search these matter a lot. For modern embeddings they matter much less, because the model has already learned that “The cat sat” and “the cat sat” mean the same thing. Don’t over-engineer your text cleaning for an embedding pipeline.
The fifth dial is the distance metric. Cosine similarity is the most common and what most models are trained for. On unit-length vectors, which is what OpenAI’s embedding models return, dot product and Euclidean distance produce the same ranking as cosine, and dot product is the cheapest to compute. pgvector exposes all three as separate operators: <-> for Euclidean distance, <#> for negative inner product, and <=> for cosine distance. The differences are small in practice for normalized vectors, so pick whatever your database supports natively and move on.
The sixth dial, often ignored, is the prompt. Some embedding models accept an optional hint that biases the embedding toward a particular task, like “Represent this query for searching relevant documents” versus “Represent this passage for retrieval”. OpenAI’s models don’t expose this, but Cohere and Voyage do through an input_type parameter that distinguishes queries from documents, and it can improve retrieval quality by a noticeable margin when used correctly.
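A quick way to convince yourself the metric choice rarely matters for normalized vectors: for unit-length vectors, squared Euclidean distance equals 2 minus twice the dot product, so both metrics sort documents the same way. The vectors below are made up for illustration.

```python
import math

def normalize(v):
    length = math.sqrt(sum(x * x for x in v))
    return [x / length for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Made-up vectors, scaled to unit length.
query = normalize([1.0, 2.0, 0.5])
docs = {
    "a": normalize([1.0, 1.8, 0.4]),
    "b": normalize([0.2, 0.1, 3.0]),
    "c": normalize([2.0, 0.5, 0.5]),
}

by_dot = sorted(docs, key=lambda k: dot(query, docs[k]), reverse=True)
by_euclidean = sorted(docs, key=lambda k: euclidean(query, docs[k]))
print(by_dot, by_euclidean)  # same order under both metrics
```

This only holds when every vector has length 1; with unnormalized vectors the three metrics can rank results differently.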
Where Embeddings Shine
Semantic search is the obvious one. A user types a vague question and the system finds documents that don’t share any keywords with the question. This is what makes RAG possible. If you’ve built a chatbot that pulls from your company’s internal wiki, embeddings are the part doing the actual finding.
Duplicate detection is another strong fit. Two support tickets that say the same thing in different words will have similar embeddings. You can cluster your entire ticket history to find recurring themes, or flag duplicates before they reach an agent.
Recommendation systems work well on embeddings. If a user reads document A, and documents B and C have embeddings close to A, the user probably wants B and C. This is similar in spirit to what Netflix does, just over text instead of viewing behavior.
Classification is a less obvious use case but works surprisingly well for small label sets. You embed a few examples per class, average them, and classify new items by nearest centroid. For sentiment analysis, topic tagging, or intent detection across five to twenty categories, this is often good enough, and it takes an afternoon to build rather than the month a custom-trained classifier can take.
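Nearest-centroid classification fits in a few lines. This sketch uses made-up two-dimensional vectors in place of real embeddings, and the label names are placeholders.

```python
import math

def centroid(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def classify(vector, centroids):
    """Return the label whose centroid is most similar to the vector."""
    return max(centroids, key=lambda label: cosine_similarity(vector, centroids[label]))

# Made-up 2-D vectors standing in for embedded examples of each class.
examples = {
    "billing": [[0.9, 0.1], [0.8, 0.3]],
    "shipping": [[0.1, 0.9], [0.2, 0.7]],
}
centroids = {label: centroid(vecs) for label, vecs in examples.items()}
print(classify([0.7, 0.2], centroids))  # billing
```

Adding a class means embedding a few more examples and computing one more centroid, with no retraining step.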
Anomaly detection is also worth mentioning. Anything that lands far from your cluster centroids is unusual. Useful for fraud signals, content moderation queues, or just finding weird data in a large corpus.
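A rough sketch of that idea with a single centroid: rank items by how dissimilar they are to the average vector. The toy vectors are invented, and a real corpus would usually need several cluster centroids rather than one.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def most_unusual(vectors, top_n=1):
    """Return indices of the vectors least similar to the overall centroid."""
    center = [sum(column) / len(vectors) for column in zip(*vectors)]
    order = sorted(range(len(vectors)), key=lambda i: cosine_similarity(vectors[i], center))
    return order[:top_n]

# Three similar made-up vectors and one outlier at index 3.
vectors = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0], [0.1, 1.0]]
print(most_unusual(vectors))  # [3]
```

In practice you’d send the flagged items to a human review queue rather than act on them automatically.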
Where Embeddings Fail
Embeddings are not knowledge. A vector doesn’t tell you a fact, it tells you where text lives in a meaning space. If you ask your retrieval system “what was the GDP of France in 2019” and the answer is in a document you’ve embedded, great. If the answer isn’t anywhere in your corpus, the system will still return the closest document it has, and confidently give you the wrong answer. This is the failure mode of every RAG demo you’ve seen go viral and then quietly disappear.
Embeddings are not great at precise keyword matching. If a user searches for a part number like “ABC-1234-XYZ”, an embedding search may not find it reliably because the tokenizer breaks that string into fragments that carry little meaning, so its vector isn’t distinctive. Hybrid search, where you combine vector search with a traditional keyword search and re-rank the combined results, handles this much better.
Embeddings are not cheap at scale if you don’t plan. Cost scales with total tokens, not document count: at the roughly $0.02 per million tokens quoted earlier, a million documents averaging 1,000 tokens each is about $20, but long documents or a pricier model can push a one-time indexing job into the hundreds of dollars, and re-indexing when you change models is the same cost again. Plan for it, or run open-source models on your own hardware.
Embeddings from different models are not interchangeable. If you index with model A and query with model B, the vectors live in different spaces and your retrieval breaks down, even when the dimensions happen to match. Always re-index when you switch models, and don’t mix vectors from different models in the same index.
Embeddings can also encode bias. The training data had biases, and the model learned them. Search for “CEO” and you may get results dominated by men because the training corpus had more male CEOs in the surrounding context. This isn’t a flaw specific to embeddings. It’s inherited from the broader model ecosystem, but it’s worth knowing about for any customer-facing application.
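One common way to merge the two result lists is reciprocal rank fusion, shown here as a sketch. It is one option among several, and the document ids below are hypothetical; k=60 is the conventional default for this technique.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Merge several ranked lists of document ids into one ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes more for documents it ranks higher.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results for the query "ABC-1234-XYZ".
vector_hits = ["doc_manual", "doc_faq", "doc_part_sheet"]
keyword_hits = ["doc_part_sheet", "doc_invoice"]
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
```

The part sheet comes out first because it appears in both lists, even though the vector search alone ranked it last.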
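It’s worth running the numbers before you index anything. This sketch uses the $0.02 per million tokens figure from earlier; the document counts and average lengths are made-up inputs to replace with your own.

```python
def embedding_cost_usd(num_documents, avg_tokens_per_document, price_per_million_tokens):
    """Estimate the one-time cost of embedding a corpus through a hosted API."""
    total_tokens = num_documents * avg_tokens_per_document
    return total_tokens / 1_000_000 * price_per_million_tokens

print(embedding_cost_usd(1_000_000, 1_000, 0.02))   # 20.0
print(embedding_cost_usd(1_000_000, 10_000, 0.02))  # 200.0
```

Document length and model price move the total far more than document count alone, so measure your average token count on a sample first.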
A Practical Workflow Pattern
The pattern that actually works in production is a small, boring pipeline. You have a source of documents, a chunker, an embedding function, a vector store, and a query handler. Each step is replaceable. The chunker can be swapped without touching anything else. The embedding model can be swapped. The vector store can be swapped. That’s why people talk about RAG architecture more than they talk about RAG itself.
Start by indexing a small representative slice of your data, maybe 100 to 1,000 documents. Build the query handler. Run a handful of real user queries through it by hand and look at the results. Inspect the failures. Most retrieval problems are chunking problems in disguise, not model problems. Tweak the chunk size, the chunk overlap, and the metadata you attach to each chunk. Iterate.
Once the quality is acceptable, scale up the indexing. Add monitoring, because embedding pipelines fail silently. A document store changes format, the chunker breaks, and suddenly your retrieval is returning nothing useful. Log every query, the top results, the similarity scores, and occasionally sample the outputs by hand.
The last step, often skipped, is evaluation. Build a small set of test queries with known correct answers. Run them through your system on every change. Track the score over time. This is the difference between a system that feels good because you built it and a system that is actually good.
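A minimal sketch of such an evaluation, measuring how often the expected document shows up in the top k results. The retrieve argument stands in for your own query handler and is assumed to return a ranked list of document ids.

```python
def hit_rate_at_k(test_cases, retrieve, k=5):
    """Fraction of queries whose expected document id appears in the top k results."""
    hits = 0
    for query, expected_id in test_cases:
        if expected_id in retrieve(query)[:k]:
            hits += 1
    return hits / len(test_cases)

# Example usage with a hypothetical retrieve(query) function:
# test_cases = [("how do planes fly", "doc_bernoulli"), ("reset my password", "doc_account_help")]
# print(hit_rate_at_k(test_cases, retrieve, k=5))
```

Even twenty or thirty test cases are enough to catch a chunking change that quietly makes retrieval worse.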
Embeddings are a foundational building block, not a finished product. The value comes from what you build around them, the data you feed them, and the discipline you apply to evaluating the results. Treat them as one component in a larger system and you’ll be in a strong position to build something that actually works in production.