AI Field Guide

Search & Retrieval

Embeddings

Vectors that represent text meaning for similarity.


A developer searches their company’s docs for “how do we lock accounts after failed logins” and gets nothing. The right article exists, but it’s titled “Brute force protection for the auth service.” It’s the same idea, worded differently, but a keyword-based search engine can’t tell.

This is the gap embeddings close. They turn text into a list of numbers (a vector) that captures something about its meaning, so two pieces of text can be compared by similarity instead of exact wording. The technical name for this comparison is semantic similarity.

Embeddings quietly power many of the search, recommendation, and retrieval features we use every day. They are also the workhorse behind retrieval-augmented generation (RAG), where an app pulls in relevant context before asking a language model to answer.

In this article, we’ll cover what embeddings are, how they capture meaning, and where they show up in real systems.


What is an embedding in plain terms?

An embedding is a vector, which is just a list of floating-point numbers. We can think of it as a set of coordinates for language. Just as a GPS coordinate points to a place on a map, an embedding points to a location in a high-dimensional space of meaning.

Take three sentences:

  • “The user needs to authenticate before accessing the system.”
  • “Login security requires multi-factor verification.”
  • “The weather forecast predicts rain tomorrow.”

The first two land near each other in that space because the model has learned that authentication, login, security, and verification belong to the same conceptual neighborhood. The third sentence sits far away, somewhere closer to weather and forecasts. Similar pieces of text produce vectors that cluster together, while unrelated pieces of text drift apart.

Figure: embeddings place pieces of text as points in a semantic space.

If we send a sentence to an embedding model, for example through OpenAI’s embeddings API endpoint, the embedding in the response looks something like this:

{
  "embedding": [
    -0.006929283495992422,
    -0.005336422007530928,
    -0.00004547132266452536,
    ...,
    -0.024047505110502243
  ]
}

The real vector is much longer. OpenAI’s text-embedding-3-small returns 1536 numbers by default, and text-embedding-3-large returns 3072. The individual numbers do not mean much on their own. What matters is the vector as a whole, and in particular how it compares with the vectors of other texts.

Where embeddings help

The wording-doesn’t-match problem comes up constantly in product support, internal tools, code search, recommendations, and analytics. Embeddings give these systems a way to surface the right content even when the query and the content use different vocabulary.

Scenario | Query | Useful match
Support search | “Customer can’t log in after reset” | “Expired password token troubleshooting”
Internal assistant | “How do we rotate production keys?” | “Incident runbook for credential rotation”
Code search | “Where do we validate payment webhooks?” | verifyStripeSignature()
Feedback analysis | “Users are confused by pricing” | Similar survey comments and support tickets

Generating an embedding

Most teams don’t train embedding models from scratch. Instead, they call an embedding API and store the returned vector alongside the original text and any metadata they’ll want later.

Here’s a small OpenAI example using the text-embedding-3-small model:

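import OpenAI from "openai";

// The client reads the API key from the OPENAI_API_KEY environment variable.
// Top-level await (below) requires running this file as an ES module.
const openai = new OpenAI();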

const response = await openai.embeddings.create({
  model: "text-embedding-3-small",
  input: "Our service uses JWT tokens for user authentication.",
  encoding_format: "float",
});

const vector = response.data[0].embedding;

console.log(vector.slice(0, 4));

As mentioned earlier, the API returns a long array of numbers, so the example only prints the first four values as a preview.


Sample output (indices 0 to 3):

[-0.040619, -0.034546, -0.011627, -0.019684]

model: text-embedding-3-small
dimensions: 1536

Comparing meaning with similarity

Once we have embeddings, we need a way to compare them. Most retrieval systems use cosine similarity, which measures how closely two vectors point in the same direction.
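Written out, cosine similarity divides the dot product of the two vectors by the product of their lengths (magnitudes), where n is the number of dimensions. The numerator corresponds to dotProduct in the cosineSimilarity helper that follows, and the two square roots correspond to magnitudeA and magnitudeB.

```latex
\operatorname{cos\_sim}(\mathbf{a}, \mathbf{b})
  = \frac{\mathbf{a} \cdot \mathbf{b}}{\lVert \mathbf{a} \rVert \, \lVert \mathbf{b} \rVert}
  = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2} \, \sqrt{\sum_{i=1}^{n} b_i^2}}
```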

function cosineSimilarity(a, b) {
  const dotProduct = a.reduce((sum, value, index) => {
    return sum + value * b[index];
  }, 0);

  const magnitudeA = Math.sqrt(a.reduce((sum, value) => sum + value ** 2, 0));
  const magnitudeB = Math.sqrt(b.reduce((sum, value) => sum + value ** 2, 0));

  return dotProduct / (magnitudeA * magnitudeB);
}

The above helper takes two vectors and returns a similarity score. Larger scores mean the two pieces of text are closer in meaning.

Imagine three short embeddings, kept tiny so the math is easy to follow:

const auth = [0.81, 0.12, 0.05, 0.04];
const login = [0.78, 0.18, 0.09, 0.06];
const weather = [0.04, 0.07, 0.92, 0.11];

cosineSimilarity(auth, login);   // ~0.99
cosineSimilarity(auth, weather); // ~0.12

Two vectors that point in nearly the same direction score close to 1. Two that point in very different directions score much lower, and perpendicular vectors score exactly 0. The score can go as low as -1 for vectors pointing in opposite directions. This is the core intuition behind embedding-powered search.
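Hand-picked two-dimensional vectors make the extremes of the score easy to check. The minimal sketch below (the names dot, normalize, and cosineViaUnitVectors are hypothetical) also shows a related shortcut: once both vectors are scaled to length 1, their plain dot product equals their cosine similarity. OpenAI’s documentation states that its embeddings are already normalized to length 1, so for those vectors the dot product alone gives the same ranking; for other providers it is safer to check than to assume.

```javascript
// Dot product of two equal-length vectors.
function dot(a, b) {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Scale a vector to length 1. The zero vector has no direction, so it is rejected.
function normalize(v) {
  const length = Math.sqrt(dot(v, v));
  if (length === 0) throw new Error("Cannot normalize a zero vector");
  return v.map((x) => x / length);
}

// Cosine similarity computed as the dot product of the two unit vectors.
function cosineViaUnitVectors(a, b) {
  return dot(normalize(a), normalize(b));
}

console.log(cosineViaUnitVectors([1, 0], [0, 1]));  // 0 (perpendicular)
console.log(cosineViaUnitVectors([1, 0], [-1, 0])); // -1 (opposite directions)
```

Normalizing once at storage time, rather than on every comparison, is what makes the dot-product shortcut cheaper in practice.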


A small internal docs corpus

Imagine a small internal docs corpus with three short articles: an authentication guide, a billing policy, and a rate-limit reference. We want a search experience that finds the right article even when the question phrases things differently.

The first step is to embed every document we want to be searchable:

async function embed(text) {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: text,
    encoding_format: "float",
  });

  return response.data[0].embedding;
}

const documents = [
  {
    id: "auth",
    text: "Our service uses JWT tokens for user authentication and session renewal.",
  },
  {
    id: "billing",
    text: "We retry failed card payments three times before pausing billing.",
  },
  {
    id: "limits",
    text: "Standard accounts can send 1000 API requests per minute.",
  },
];

const embeddedDocuments = await Promise.all(
  documents.map(async (document) => ({
    ...document,
    embedding: await embed(document.text),
  }))
);

Each document is now paired with a vector that represents its meaning. We’ll need the original text again later, so we keep both.
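The code above sends one request per document. OpenAI’s embeddings endpoint also accepts an array of strings as input and returns one embedding per string, each tagged with the index of its input, so a corpus can be embedded in fewer requests. Here is a minimal sketch, assuming an OpenAI client like the openai instance above; the embedBatch name is hypothetical, and a large corpus would still need to be split into batches that respect the API’s request limits.

```javascript
// Hypothetical helper: embed many texts with a single API request.
// `client` is assumed to be an OpenAI client, such as `new OpenAI()` from the "openai" package.
async function embedBatch(client, texts, model = "text-embedding-3-small") {
  const response = await client.embeddings.create({
    model,
    input: texts, // an array of strings instead of a single string
    encoding_format: "float",
  });

  // Each item carries the index of the input it belongs to; sort on it so
  // vectors[i] always corresponds to texts[i].
  return [...response.data]
    .sort((a, b) => a.index - b.index)
    .map((item) => item.embedding);
}
```

With the documents from the example, `await embedBatch(openai, documents.map((document) => document.text))` would return one vector per document, in the same order as the documents.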

When a question comes in, we embed the question and rank the documents by similarity:

const question = "How does our service handle user authentication?";
const questionEmbedding = await embed(question);

const [match] = embeddedDocuments
  .map((document) => ({
    ...document,
    score: cosineSimilarity(questionEmbedding, document.embedding),
  }))
  .sort((a, b) => b.score - a.score);

console.log(match.id, match.score);

The authentication document should rank highest because its meaning is closest to the question, even though the question doesn’t repeat every word from the document.
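Keeping only the single best match works for a demo, but retrieval systems usually keep the top few results so the prompt has more than one candidate to draw on. A minimal sketch of a hypothetical topK helper, assuming each document has an embedding field as in embeddedDocuments; the three-dimensional vectors are made up to stand in for real embeddings.

```javascript
// Cosine similarity between two equal-length vectors (same formula as the helper above).
function cosineSimilarity(a, b) {
  let dotProduct = 0;
  let sumSqA = 0;
  let sumSqB = 0;
  for (let i = 0; i < a.length; i++) {
    dotProduct += a[i] * b[i];
    sumSqA += a[i] ** 2;
    sumSqB += b[i] ** 2;
  }
  return dotProduct / (Math.sqrt(sumSqA) * Math.sqrt(sumSqB));
}

// Hypothetical helper: score every document against the query embedding
// and return the k highest-scoring ones, best first.
function topK(queryEmbedding, documents, k = 2) {
  return documents
    .map((doc) => ({
      id: doc.id,
      text: doc.text,
      score: cosineSimilarity(queryEmbedding, doc.embedding),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, k);
}

// Made-up vectors for illustration only.
const docs = [
  { id: "auth", text: "JWT-based authentication", embedding: [0.9, 0.1, 0.0] },
  { id: "billing", text: "Card payment retries", embedding: [0.1, 0.9, 0.1] },
  { id: "limits", text: "API rate limits", embedding: [0.2, 0.1, 0.9] },
];

const results = topK([0.8, 0.3, 0.1], docs, 2);
console.log(results.map((r) => r.id).join(", ")); // auth, billing
```

Sorting every score is fine for a handful of documents; the cost of scoring every vector is what vector databases are designed to avoid.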

We can then drop the retrieved text into the final prompt:

const finalPrompt = `
Context:
${match.text}

Question:
${question}

Answer:
`;


That last step, pulling retrieved text into the prompt, is the core of retrieval-augmented generation (RAG). Embeddings are what make the retrieval step work.
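With more than one retrieved document, the prompt usually lists each snippet. A minimal sketch, assuming an array of ranked results like the one the search example sorts, where each item has id, text, and score. The buildPrompt name, the 0.3 cutoff, and the scores in the usage are hypothetical and made up for illustration; a useful cutoff depends on the model and the data.

```javascript
// Hypothetical helper: build a RAG prompt from ranked retrieval results.
// Results scoring below `minScore` are dropped so weak matches stay out of the context.
function buildPrompt(question, rankedResults, minScore = 0.3) {
  const context = rankedResults
    .filter((result) => result.score >= minScore)
    .map((result) => `[${result.id}] ${result.text}`)
    .join("\n");

  return [
    "Context:",
    context || "(no relevant documents found)",
    "",
    "Question:",
    question,
    "",
    "Answer:",
  ].join("\n");
}

// Made-up scores, for illustration only.
const prompt = buildPrompt("How does our service handle user authentication?", [
  { id: "auth", text: "Our service uses JWT tokens for user authentication and session renewal.", score: 0.62 },
  { id: "limits", text: "Standard accounts can send 1000 API requests per minute.", score: 0.11 },
]);

console.log(prompt);
```

Only the auth snippet clears the cutoff here. Labeling each snippet with its id can also make it easier to ask the model to say which source it used.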


Scaling up and going multimodal

Storing embeddings at scale

The in-memory example above works for a handful of documents. It keeps every vector in memory and scores the query against all of them on every search, which stops being practical as collections grow into the millions of vectors.

A vector database is built for this. It stores embeddings alongside the original text and metadata, and it can find the nearest neighbors of a query vector quickly. Popular options include dedicated vector databases like Pinecone, Weaviate, and Chroma, as well as vector search added to existing databases, such as the pgvector extension for Postgres.

To stay fast at that scale, most of these use approximate nearest neighbor (ANN) algorithms, which occasionally miss a true nearest neighbor in exchange for much quicker retrieval.
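Production systems typically use more elaborate index structures (graph-based indexes such as HNSW are common), but the trade-off can be seen in a much simpler approximate technique: random-hyperplane locality-sensitive hashing (LSH). This sketch illustrates the idea and is not how any of the products above work. Each vector gets a short bit signature recording which side of a few random hyperplanes it falls on, and a query is scored exactly only against vectors that share its signature. Similar vectors tend to share signatures, so most of the collection is skipped, at the cost of sometimes missing a neighbor that landed in another bucket. The class name, the seeded generator, and the number of hyperplanes are hypothetical choices.

```javascript
// Seeded pseudo-random number generator (mulberry32) so the hyperplanes are repeatable.
function mulberry32(seed) {
  let a = seed | 0;
  return function () {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), a | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function dot(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

function cosineSimilarity(a, b) {
  return dot(a, b) / (Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b)));
}

// Hypothetical index: random-hyperplane LSH with a single hash table.
class RandomHyperplaneIndex {
  constructor(dimensions, numPlanes = 8, seed = 42) {
    const random = mulberry32(seed);
    // Standard normal sample via the Box-Muller transform.
    const gaussian = () => {
      const u1 = 1 - random(); // in (0, 1], so Math.log never sees 0
      const u2 = random();
      return Math.sqrt(-2 * Math.log(u1)) * Math.cos(2 * Math.PI * u2);
    };
    // Each hyperplane through the origin is defined by a random normal vector.
    this.planes = Array.from({ length: numPlanes }, () =>
      Array.from({ length: dimensions }, () => gaussian())
    );
    this.buckets = new Map(); // signature string -> [{ id, vector }]
  }

  // One bit per hyperplane: "1" if the vector is on the positive side.
  signature(vector) {
    return this.planes.map((plane) => (dot(plane, vector) >= 0 ? "1" : "0")).join("");
  }

  add(id, vector) {
    const key = this.signature(vector);
    if (!this.buckets.has(key)) this.buckets.set(key, []);
    this.buckets.get(key).push({ id, vector });
  }

  // Only vectors sharing the query's signature are scored exactly,
  // so a true nearest neighbor in another bucket can be missed.
  search(query, k = 1) {
    const candidates = this.buckets.get(this.signature(query)) ?? [];
    return candidates
      .map(({ id, vector }) => ({ id, score: cosineSimilarity(query, vector) }))
      .sort((a, b) => b.score - a.score)
      .slice(0, k);
  }
}

const index = new RandomHyperplaneIndex(4, 8, 42);
index.add("auth", [0.81, 0.12, 0.05, 0.04]);
index.add("login", [0.78, 0.18, 0.09, 0.06]);
index.add("weather", [0.04, 0.07, 0.92, 0.11]);
console.log(index.search([0.8, 0.15, 0.07, 0.05], 2));
```

With three vectors this saves nothing; the benefit appears with large collections. A single hash table misses neighbors fairly often, which is why LSH schemes typically use several tables and merge their candidates.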

Multimodal embeddings

Embeddings don’t have to stop at text. Some models map images, video, audio, and documents into the same kind of vector space. Gemini’s gemini-embedding-2, for example, supports multimodal embeddings and places text, images, video, audio, and PDFs into one unified space.

This opens up new patterns, like searching product screenshots with a text query, finding video clips that match a description, or organizing PDFs, images, and notes by topic.

One thing to remember is that embedding spaces are usually model-specific. Vectors generated by one model generally shouldn’t be compared directly with vectors from another. If we switch embedding models, we typically need to re-embed our content so everything lives in the same vector space.
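One inexpensive safeguard is to store the model name next to every vector and refuse to compare vectors from different models. A dimension check alone is not enough, because different models can produce vectors of the same length: OpenAI’s older text-embedding-ada-002 and text-embedding-3-small, for example, both return 1536 numbers. A minimal sketch with hypothetical record and function names and made-up two-dimensional vectors:

```javascript
// Hypothetical record shape: keep the model name next to every vector.
function makeRecord(id, model, embedding) {
  return { id, model, embedding };
}

// Cosine similarity that refuses to compare vectors from different models.
function safeSimilarity(recordA, recordB) {
  if (recordA.model !== recordB.model) {
    throw new Error(
      `Model mismatch (${recordA.model} vs ${recordB.model}); re-embed before comparing`
    );
  }
  if (recordA.embedding.length !== recordB.embedding.length) {
    throw new Error("Dimension mismatch");
  }
  let dotProduct = 0;
  let sumSqA = 0;
  let sumSqB = 0;
  recordA.embedding.forEach((value, i) => {
    dotProduct += value * recordB.embedding[i];
    sumSqA += value ** 2;
    sumSqB += recordB.embedding[i] ** 2;
  });
  return dotProduct / (Math.sqrt(sumSqA) * Math.sqrt(sumSqB));
}

// Made-up vectors, for illustration only.
const query = makeRecord("query", "text-embedding-3-small", [0.2, 0.9]);
const oldDoc = makeRecord("doc-1", "text-embedding-ada-002", [0.3, 0.8]);

try {
  safeSimilarity(query, oldDoc);
} catch (error) {
  // Model mismatch (text-embedding-3-small vs text-embedding-ada-002); re-embed before comparing
  console.log(error.message);
}
```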


Pitfalls to avoid

Embeddings are powerful, but a few specific failure modes show up over and over in real systems.

  1. Mixing vectors from different models breaks comparisons. Embedding spaces are not interchangeable, so we re-embed our content whenever we change models.
  2. Chunks that are too broad or too narrow hurt retrieval. Broad chunks add noise, and narrow chunks lose the surrounding context the snippet needs.
  3. Stale embeddings drift away from the source. When the underlying content changes, the vectors do not change with it unless we re-embed.
  4. General embedding models can struggle with specialized text. Domain language, code, and acronyms often need a model that has seen similar material during training.
  5. High similarity is not the same as a correct answer. Embeddings surface candidates, not truth, and the rest of the system still has to verify or rerank the results.
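Chunk size is usually tuned per corpus. A minimal sketch of one common approach: fixed-size word windows with some overlap, so text cut at a boundary also appears in the neighboring chunk. The chunkWords name and its default sizes (200 words with 40 words of overlap) are hypothetical placeholders, not recommendations, and splitting on headings, paragraphs, or sentences can fit structured docs better.

```javascript
// Hypothetical helper: split text into windows of `size` words, where each
// window starts with the last `overlap` words of the previous window.
function chunkWords(text, size = 200, overlap = 40) {
  if (size <= 0 || overlap < 0 || overlap >= size) {
    throw new Error("Require size > 0 and 0 <= overlap < size");
  }
  const words = text.split(/\s+/).filter(Boolean);
  const step = size - overlap;
  const chunks = [];
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + size).join(" "));
    if (start + size >= words.length) break; // this window reached the end
  }
  return chunks;
}

// Tiny sizes so the overlap is visible.
const chunks = chunkWords("one two three four five six seven eight", 4, 1);
// chunks: ["one two three four", "four five six seven", "seven eight"]
console.log(chunks);
```

Each chunk is then embedded and stored on its own, usually with a reference back to the document it came from.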

When embeddings are the right tool

Embeddings are not a universal solution. They earn their place when the question is fuzzy or semantic, and they get in the way when the task has a clean exact-match answer.

Use embeddings when:

  • We need semantic search over content where users phrase questions differently than the source text.
  • We want fuzzy matching across phrasing, synonyms, or natural-language descriptions.
  • We’re building recommendations or “more like this” features.
  • We need to cluster or deduplicate similar pieces of text.
  • We want to ground a language model’s answers in our own content (the retrieval step in RAG).

Reach for something else when:

  • The task is an exact identifier lookup (an order ID, a SKU, a username).
  • We need structured filtering on known fields like date ranges, status, or tags.
  • We’re ranking by hard rules such as most recent, highest priced, or alphabetical.
  • A regular expression or a SQL query already solves the problem cleanly.

Key takeaways

  1. An embedding is a vector that captures the meaning of a piece of text. It lives in a high-dimensional space where similar pieces of text sit close together and unrelated pieces drift apart.
  2. Cosine similarity is the comparison primitive once we have the vectors. A score near 1 means the two pieces of text are close in meaning, and a score near 0 means they are not.
  3. Retrieval is the most common application, including the retrieval step inside RAG. We embed our content, embed the user’s question, and pull back the closest matches.
  4. Embedding spaces are model-specific and cannot be mixed. Switching embedding models means re-embedding everything so the vectors live in the same space.
  5. Embeddings help us find the right source, not the right answer. They surface candidates that the rest of the system still has to read, verify, and use carefully.