Embeddings and Similarity

Embeddings are how we convert text into numbers that computers can compare and reason about. This page covers the practical concepts you need for building retrieval systems.

What Are Embeddings?

An embedding is a vector (list of numbers, typically 384 to 3072 dimensions) that represents the meaning of text. Similar texts have similar vectors.

Why Embeddings Matter

Traditional keyword search only finds exact matches. Vector search finds semantically similar content.
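As a toy illustration (the document pair is ours, chosen for the example), a relevant paraphrase can share no keywords with the query at all, so keyword search misses it entirely while an embedding model would still rank it highly:

```python
# Toy illustration: a relevant paraphrase with zero keyword overlap.
# Keyword search scores it zero; a vector search would still surface it.
query = "how do I reset my password"
doc = "use the 'forgot credentials' link to regain account access"

shared_terms = set(query.split()) & set(doc.split())
print(shared_terms)  # set() -- no shared terms, so keyword search misses it
```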


How Embeddings Work

Common Embedding Models

| Model | Dimensions | Cost | Best For |
|---|---|---|---|
| OpenAI text-embedding-3-small | 1536 | Low | General purpose, cost-sensitive |
| OpenAI text-embedding-3-large | 3072 | Medium | Higher accuracy needs |
| OpenAI text-embedding-ada-002 | 1536 | Lowest | Legacy compatibility |
| Cohere embed-english-v3.0 | 1024 | Medium | English text |
| Cohere embed-multilingual-v3.0 | 1024 | Medium | Multi-language |
| sentence-transformers all-MiniLM-L6-v2 | 384 | Free (open-source) | Local/privacy-sensitive |
| Google text-embedding-004 | 768 | Medium | Vertex AI users |

Embedding Generation in Code

from agentflow.core.embedding import EmbeddingModel

# Initialize embedding model
embedding_model = EmbeddingModel("text-embedding-3-small")

# Generate embeddings
query = "How do I reset my password?"
doc = "Click 'Forgot Password' to reset your credentials"

query_vector = embedding_model.embed(query)
doc_vector = embedding_model.embed(doc)

# Compute similarity
similarity = embedding_model.cosine_similarity(query_vector, doc_vector)
print(f"Similarity: {similarity:.2f}") # e.g., 0.89

Embedding Output

# What an embedding looks like
embedding = embedding_model.embed("Hello, world!")

print(f"Type: {type(embedding)}") # numpy array
print(f"Shape: {embedding.shape}") # (1536,)
print(f"Sample values: {embedding[:5]}") # First 5 values
# Output: [-0.002 0.004 -0.001 0.023 -0.008]

Semantic Similarity

Semantic similarity measures how related two pieces of text are in meaning, not just word overlap.

Similarity Examples

| Text A | Text B | Similarity | Why |
|---|---|---|---|
| "How to reset password" | "Click forgot password link" | 0.89 | Same concept |
| "How to reset password" | "Weather forecast" | 0.12 | Different topics |
| "The cat sat on mat" | "A feline rested on carpet" | 0.85 | Paraphrase |
| "I love pizza" | "I enjoy Italian food" | 0.78 | Related concept |
| "Call the doctor" | "Phone medical professional" | 0.82 | Synonyms |

Cosine Similarity Explained

Cosine similarity measures the angle between two vectors. Values range from -1 to 1: a value near 1 means the vectors point in the same direction (similar meaning), near 0 means they are unrelated, and near -1 means they point in opposite directions.

Formula

similarity = (A · B) / (||A|| × ||B||)

Where:

  • A · B = dot product of vectors
  • ||A|| = magnitude of A (length)
  • ||B|| = magnitude of B (length)
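The formula translates directly into a few lines of NumPy. This is a sketch for intuition; in practice you would typically use your embedding library's built-in similarity helper:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # (A · B) / (||A|| × ||B||)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])     # same direction, different magnitude
c = np.array([-1.0, -2.0, -3.0])  # opposite direction

print(cosine_similarity(a, b))  # 1.0  (angle between them is zero)
print(cosine_similarity(a, c))  # -1.0 (vectors point opposite ways)
```

Note that magnitude does not matter: `b` is just `a` scaled by 2, so their similarity is exactly 1.0.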


Vector Storage and Databases

Vector Databases

Vector databases are purpose-built for storing embeddings and searching them efficiently at scale.

When to Use Each

| Option | Best For | Scalability | Cost |
|---|---|---|---|
| Pinecone | Managed cloud service | High | Pay-per-use |
| Qdrant | Self-hosted or cloud | High | Free (self-hosted) |
| pgvector | Already using PostgreSQL | Medium | Included with PG |
| FAISS | Offline/batch processing | High | Free |
| In-memory | Small datasets, testing | Low | Free |
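The in-memory option is simple enough to sketch yourself. The class below (our own toy code, not an AgentFlow API) does brute-force cosine search and is fine for tests or a few thousand vectors:

```python
import numpy as np

class InMemoryVectorStore:
    """Brute-force cosine search; fine for small datasets and tests."""

    def __init__(self):
        self.ids: list[str] = []
        self.vectors: list[np.ndarray] = []

    def add(self, id: str, vector: np.ndarray) -> None:
        # Normalize once at insert time so search is a plain dot product
        self.ids.append(id)
        self.vectors.append(vector / np.linalg.norm(vector))

    def search(self, query: np.ndarray, top_k: int = 3) -> list[tuple[str, float]]:
        q = query / np.linalg.norm(query)
        sims = np.stack(self.vectors) @ q       # cosine similarity per document
        order = np.argsort(sims)[::-1][:top_k]  # highest similarity first
        return [(self.ids[i], float(sims[i])) for i in order]

store = InMemoryVectorStore()
store.add("greeting", np.array([1.0, 0.0]))
store.add("farewell", np.array([0.0, 1.0]))
print(store.search(np.array([0.9, 0.1]), top_k=1))  # "greeting" ranks first
```

Dedicated vector databases replace the brute-force scan with approximate nearest-neighbor indexes, which is what makes them viable at millions of vectors.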

AgentFlow Integration

from agentflow.storage.store import QdrantStore
from agentflow.core.embedding import OpenAIEmbedding

# Initialize
embedding_model = OpenAIEmbedding("text-embedding-3-small")
vector_store = QdrantStore(collection_name="knowledge_base")

# Add documents
documents = [
    "How to reset your password",
    "Click 'Forgot Password' on the login page",
    "Contact support for account issues",
]

for i, doc in enumerate(documents):
    vector = embedding_model.embed(doc)
    vector_store.add(
        id=f"doc_{i}",
        vector=vector,
        payload={"text": doc},
    )

# Search
query_vector = embedding_model.embed("I forgot my login")
results = vector_store.search(vector=query_vector, top_k=2)

Nearest-Neighbor Retrieval

In vector search, we find the k nearest neighbors to a query vector:

Implementation

def retrieve_top_k(
    query: str,
    documents: list[str],
    embedding_model,
    top_k: int = 5,
) -> list[dict]:
    """Retrieve the top-k most similar documents."""

    # 1. Embed the query
    query_vector = embedding_model.embed(query)

    # 2. Score all documents
    scored = []
    for doc in documents:
        doc_vector = embedding_model.embed(doc)
        score = embedding_model.cosine_similarity(query_vector, doc_vector)
        scored.append({"text": doc, "score": score})

    # 3. Sort and return top-k
    scored.sort(key=lambda x: x["score"], reverse=True)
    return scored[:top_k]

# Usage
results = retrieve_top_k(
    query="password reset help",
    documents=[
        "How to reset your password",
        "Weather forecast for today",
        "Forgot password flow",
        "Configure email notifications",
    ],
    embedding_model=embedding_model,
    top_k=2,
)

# Returns (scores illustrative):
# [
#     {"text": "Forgot password flow", "score": 0.92},
#     {"text": "How to reset your password", "score": 0.89}
# ]

Limits of Embeddings

Embeddings are powerful but have limitations:

⚠️ Critical Warning: High Similarity ≠ Factual Correctness

A confidently worded but wrong document can score just as high as a correct one when both match the query's phrasing. Retrieval finds related content; verification is still required.

Mitigation Strategies

| Limitation | Mitigation |
|---|---|
| Ambiguity | Use context in query, rerank with cross-encoder |
| Domain mismatch | Fine-tune embeddings or use domain-specific models |
| Stale representations | Refresh embeddings periodically |
| Accuracy ≠ relevance | Always cite sources, validate facts |

Cosine Similarity vs Cosine Distance

| Term | Definition | Range | Interpretation |
|---|---|---|---|
| Cosine similarity | How alike two vectors are | -1 to 1 | Higher = more similar |
| Cosine distance | How different two vectors are | 0 to 2 | Lower = more similar |

The two are directly related:

cosine_distance = 1 - cosine_similarity

similarity = 0.9
distance = 1 - similarity  # 0.1

# Many vector DBs expose similarity scores to users
# but use distance internally for ranking

Hybrid Search: Combining Keyword + Vector

Vector search finds semantic matches but can miss exact keyword matches. Hybrid search combines both:

Implementation

def hybrid_search(
    query: str,
    documents: list[str],
    vector_store,
    embedding_model,
    top_k: int = 10,
) -> list[tuple[str, float]]:
    # 1. Vector search
    query_vector = embedding_model.embed(query)
    vector_results = vector_store.search(query_vector, top_k * 2)

    # 2. Keyword search (bm25_search is assumed to return doc ids
    #    ordered from best to worst match)
    keyword_scores = bm25_search(query, documents)

    # 3. Reciprocal Rank Fusion
    combined_scores = {}

    for rank, result in enumerate(vector_results):
        doc_id = result["id"]
        combined_scores[doc_id] = combined_scores.get(doc_id, 0) + 1 / (60 + rank)

    for rank, doc_id in enumerate(keyword_scores):
        combined_scores[doc_id] = combined_scores.get(doc_id, 0) + 1 / (60 + rank)

    # 4. Sort by combined score; returns (doc_id, score) pairs
    ranked = sorted(combined_scores.items(), key=lambda x: x[1], reverse=True)

    return ranked[:top_k]
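Reciprocal Rank Fusion itself needs no external services, so it can be shown in isolation. Here it is on two toy rankings (60 is the customary smoothing constant):

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of doc ids into one combined ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking):
            # Each list contributes 1/(k + rank) for every doc it contains
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

vector_ranking = ["d1", "d2", "d3"]   # e.g., from vector search
keyword_ranking = ["d2", "d4", "d1"]  # e.g., from BM25

print(rrf([vector_ranking, keyword_ranking]))  # ['d2', 'd1', 'd4', 'd3']
```

Note that d2 wins overall despite never ranking first in the vector list: appearing near the top of both lists beats appearing first in only one.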

Key Takeaways

  1. Embeddings are semantic vectors — Similar meaning produces similar vectors in high-dimensional space.

  2. Cosine similarity measures closeness — Values near 1.0 mean related content; near 0 means unrelated.

  3. Nearest-neighbor retrieval finds relevant documents — Sort by similarity, return top-k.

  4. Vector databases enable efficient search — Specialized for storing and searching millions of vectors.

  5. High similarity ≠ factual correctness — Retrieval finds related content; you must still verify accuracy.

  6. Hybrid search improves recall — Combining keyword and vector search catches more relevant results.


What You Learned

  • Embeddings convert text to semantic vectors in high-dimensional space
  • Cosine similarity measures the angle between vectors (higher = more similar)
  • Nearest-neighbor retrieval finds semantically similar content
  • Vector databases enable efficient storage and search
  • Retrieval quality depends on both embeddings and source data quality
  • High similarity doesn't guarantee factual correctness

Prerequisites Map

This page supports these lessons:

| Course | Lesson | Dependency |
|---|---|---|
| Beginner | Lesson 4: Retrieval, grounding, and citations | Full page |
| Advanced | Lesson 4: Advanced RAG | Full page |
| Advanced | Lesson 3: Context engineering | Embedding costs |

Next Step

Continue to Chunking and retrieval primitives to learn how to prepare documents for retrieval.
