
Match Embedding Dimensions to Your Domain, Not Defaults

Vector database costs scale with embedding dimensions. Most systems use 768-3072 dimensions. You may only need 256-768.

When you choose an embedding model, you probably skip the most important decision. You glance at the MTEB leaderboard, check the costs, then move straight on to chunking strategies and RAG architecture. Most practitioners treat the embedding model as just another configuration variable. But when your embedding model cannot match relevant chunks at the vector search step, everything downstream suffers.

I was upgrading embedding models after Google announced the sunsetting of text-embedding-004 (a decision it has since reversed). The newer gemini-embedding-001 uses Matryoshka Representation Learning (MRL), which lets you truncate embeddings to smaller dimensions by taking the first N elements. The model is trained so that the most important information lives in the earlier dimensions.

While reading that documentation for a RAG-based system I was building around UK farming subsidies, I found inconsistent guidance about dimensionality. Then I spotted a Google engineer's notebook that simply truncated the output with output[:768] and no explanation. That one line of unexplained, arbitrary truncation made me wonder what else we take for granted about embedding dimensions. Many practitioners do not realise you can truncate MRL embeddings at all.

Matryoshka Representation Learning embeddings let you truncate to smaller dimensions without retraining. Vector databases and embedding providers recommend 768 to 3072 dimensions as standard: OpenAI's text-embedding-3-large uses 3072 dimensions, and Google recommends at least 768 dimensions for production systems. I decided to test this systematically, running evaluations at Google's suggested 768, 1536 and 3072 dimensions on my RAG system. The 768-dimension embeddings consistently performed as well as the 3072-dimension ones across all tests. In theory you could use only the first 512 dimensions, or 256, 128, even 64.
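Here is a minimal sketch of what MRL truncation looks like in practice, assuming the google-genai SDK and a placeholder API key. Re-normalising after truncation keeps cosine similarities well behaved:

from google import genai
import numpy as np

client = genai.Client(api_key='YOUR_API_KEY')
result = client.models.embed_content(model='models/gemini-embedding-001',
                                     contents='UK farming subsidy eligibility rules')
full = np.array(result.embeddings[0].values)       # 3072 dimensions

truncated = full[:256]                             # keep the first 256 MRL dimensions
truncated = truncated / np.linalg.norm(truncated)  # re-normalise before cosine similarity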

I tested this across five different domains using BEIR benchmark datasets with proper ground truth labels. BEIR documents are passage-sized (typically 200-400 words), similar in length to many RAG chunk sizes. I measured Recall@1, Recall@5, Recall@10 and MRR across over 1,800 queries spanning simple factual content to complex medical literature.

What I found

For RAG systems, Recall@5 and Recall@10 are most relevant since you typically retrieve multiple candidates for the LLM. Here's how different domains plateau at different dimensions:

| Domain | Optimal Dimensions | Recall@1 | Recall@5 | Recall@10 |
|---|---|---|---|---|
| Simple factual (FEVER subset) | 256-512 | 99% | 100% | 100% |
| Scientific abstracts (SciFact) | 768 | 82% | 96% | 98% |
| Financial discussions (FiQA) | 768 | 55% | 76% | 84% |
| Climate science (Climate-FEVER subset) | 768-1536 | 34% | 66-71% | 82-84% |
| Medical literature (NFCorpus) | 768+ | 56% | 75-77% | 81% |

When you use 3072 dimensions, you get no benefit over 768 dimensions for four out of five domains I tested. Most use cases let you reduce dimensions by 4x from model and service provider recommendations with no quality loss.

Simple factual content like Wikipedia claims plateaus at 256-512 dimensions. Complex content like scientific papers and financial discussions plateaus at 768 dimensions. Medical literature was the only domain where dimensions beyond 768 provided meaningful improvement, and only for Recall@10.

[Chart: Recall@10 - performance plateaus at different dimensions by domain complexity]

The chart shows the plateau effect clearly. FEVER shoots up and plateaus early at 256d. SciFact and FiQA plateau at 768d. Only NFCorpus and Climate-FEVER continue improving modestly beyond 768d.

Model and service providers make their recommendations assuming worst-case scenarios. Your data is probably not worst-case.

Why do leaderboard models use full dimensions?

Almost every RAG tutorial starts with the MTEB leaderboard. You pick a highly ranked model, use its default dimensions and deploy. This approach works backwards.

MTEB tests across 56 diverse tasks spanning 8 categories: classification, clustering, retrieval, semantic similarity and more. The benchmark scores models on their ability to generalise across all these different domains simultaneously. High-dimensional embeddings win because they need to encode information useful for many different tasks at once.

Your production system works differently. You have a specific corpus and specific queries. You are not building a general-purpose model that must handle scientific papers, news articles, tweets and financial documents equally well. You likely only care about one domain.

When you use MTEB to choose dimensions for your RAG system, you are using a Swiss Army knife benchmark to decide which specialised tool to buy. The rankings tell you which model is most versatile, not which configuration works best for your specific task.

Weaviate's research on OpenAI embeddings found that at 512 dimensions "the structure of the vector space is already well defined". However, their testing only examined 512 dimensions and above, on general content. My testing extends this by examining dimensions from 64 upwards across diverse domains, revealing that optimal dimensionality varies significantly with domain complexity rather than being a universal threshold.

Domain complexity determines optimal dimensions for RAG systems

Different domains require different embedding dimensions based on their semantic complexity:

| Domain | Content Type | Corpus Size | Queries | Optimal Dims | Recall@1 | Recall@10 |
|---|---|---|---|---|---|---|
| FEVER | Wikipedia fact verification | 3,281 docs (subset) | 300 | 256-512d | 99% | 100% |
| SciFact | Scientific abstracts | 5,183 docs | 300 | 768d | 82% | 98% |
| FiQA | Financial Q&A | 57,638 docs | 648 | 768d | 55% | 84% |
| Climate-FEVER | Climate science claims | 3,435 docs (subset) | 300 | 768-1536d | 34% | 82-84% |
| NFCorpus | Medical literature | 3,633 docs | 323 | 768-1024d | 56% | 81% |

Simple factual claims (FEVER) plateau at 256-512 dimensions. Technical content with domain-specific terminology (SciFact, FiQA) plateaus at 768 dimensions. Medical literature (NFCorpus) was the only domain where dimensions beyond 768 provided meaningful improvement, and only for Recall@10. This was also the only domain where retrieval depth (top-1 vs top-10) significantly affected optimal dimensions.

Climate-FEVER shows the lowest absolute recall across all metrics, reflecting the difficulty of climate science claim verification. Performance plateaus at 768-1536 dimensions depending on the metric.

Corpus size shifted optimal dimensions higher

I tested the same domains at different corpus sizes to see how scale affects optimal dimensionality:

| Domain | Small Corpus | Optimal Dims | Recall@1 | Large Corpus | Optimal Dims | Recall@1 |
|---|---|---|---|---|---|---|
| SciFact | 348 docs | 256d | 98% | 5,183 docs | 768d | 82% |
| Climate-FEVER | 708 docs | 256d | 39% | 3,435 docs | 768-1536d | 34% |

Corpus size and domain matter. An increase in scientific corpus size by 15x shifted optimal dimensions from 256d to 768d. Larger corpora in more complex domains need more dimensions to discriminate between similar documents. The exact threshold depends on both corpus size and domain complexity.

Having worked with production RAG implementations, I find that lower document and chunk counts are not unusual. If you have fewer than 1,000 documents or chunks, 256 dimensions may suffice. Between 1,000 and 5,000, test 256-512 dimensions. Above 5,000, expect to need 512-768 dimensions depending on domain complexity.
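As a rough starting point, and nothing more than a sketch of the thresholds above, you can derive which truncation points to evaluate first from your corpus size (the candidate_dimensions helper is hypothetical):

def candidate_dimensions(num_chunks):
    """Suggest truncation points to evaluate first, based on corpus size.
    Mirrors the guidance above; always validate against your own ground truth."""
    if num_chunks < 1_000:
        return [256, 512]
    if num_chunks < 5_000:
        return [256, 512, 768]
    return [512, 768, 1536]

print(candidate_dimensions(3_500))  # [256, 512, 768]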

Implications for your system

Your storage costs scale with dimensions. When you reduce from 768 to 256 dimensions, you cut storage by 3x. For 1 million documents, that difference shows up directly in your monthly bill for managed vector database services, whose pricing often scales with total vector size. As your corpus grows, the savings compound.

Memory requirements matter when you self-host systems. AWS OpenSearch loads vectors into RAM for fast search. When you reduce from 768 dimensions to 256 dimensions, you fit 3x more vectors in the same instance. This can allow you to consider smaller instance types.

Similarity computation scales linearly with dimensions. When you reduce dimensions from 768 to 256, you get 3x faster search. For high-throughput systems processing thousands of queries per second, that means fewer compute resources or lower latency.
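A back-of-the-envelope calculation makes the scale of these savings concrete. This sketch assumes float32 vectors and ignores index overhead:

def storage_gb(num_vectors, dimensions, bytes_per_value=4):
    """Raw vector storage in GB, assuming float32 values and no index overhead."""
    return num_vectors * dimensions * bytes_per_value / 1e9

for dims in (3072, 768, 256):
    print(f"{dims}d: {storage_gb(1_000_000, dims):.2f} GB for 1M vectors")
# 3072d: 12.29 GB, 768d: 3.07 GB, 256d: 1.02 GB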

How to test this

Start small and validate. Here is a concrete example using SciFact-style content:

Step 1: Establish your baseline

Embed 100-200 sample documents from your corpus at full dimensions (768 or 3072, depending on your model). Create 20-30 test queries with known relevant documents. Measure Recall@5 with your current setup.

from google import genai
import numpy as np

def evaluate_recall_at_k(query_embs, corpus_embs, ground_truth, k=5):
    """Calculate recall@k: fraction of queries where any relevant doc appears in top-k"""
    hits = 0
    for i, q_emb in enumerate(query_embs):
        scores = [np.dot(q_emb, c_emb) / (np.linalg.norm(q_emb) * np.linalg.norm(c_emb)) 
                  for c_emb in corpus_embs]
        top_k_indices = np.argsort(scores)[-k:]
        relevant_docs = ground_truth[i]  # List of relevant doc indices for this query
        if any(idx in relevant_docs for idx in top_k_indices):
            hits += 1
    return hits / len(query_embs)

genai_client = genai.Client(api_key='YOUR_API_KEY')
batch_size = 100

def embed_batch(texts, dimensions=768):
    embeddings = []
    for i in range(0, len(texts), batch_size):
        batch = texts[i:i+batch_size]
        result = genai_client.models.embed_content(model='models/gemini-embedding-001', contents=batch)
        # Client-side MRL truncation: keep only the first `dimensions` values
        embeddings.extend([emb.values[:dimensions] for emb in result.embeddings])
    return embeddings

# Placeholder loaders: replace with your own corpus, test queries and relevance labels
corpus_sample = load_documents(limit=200)
test_queries = load_test_queries()
ground_truth = load_relevance_labels()

corpus_texts = [doc['text'] for doc in corpus_sample]
full_embeddings = embed_batch(corpus_texts, dimensions=768)
query_embeddings = embed_batch(test_queries, dimensions=768)
baseline_recall = evaluate_recall_at_k(query_embeddings, full_embeddings, ground_truth, k=5)
print(f"Baseline recall@5 at 768d: {baseline_recall:.3f}")

Step 2: Test at 256 dimensions

Truncate your embeddings to 256 dimensions and measure again:

corpus_256 = [emb[:256] for emb in full_embeddings]
query_256 = [emb[:256] for emb in query_embeddings]
recall_256 = evaluate_recall_at_k(query_256, corpus_256, ground_truth, k=5)
print(f"Recall@5 at 256d: {recall_256:.3f}")
print(f"Performance retention: {recall_256/baseline_recall*100:.1f}%")

Step 3: Find your optimal point

If 256d performance meets your requirements (typically >95% of baseline), stop there. If not, test 512d:

corpus_512 = [emb[:512] for emb in full_embeddings]
query_512 = [emb[:512] for emb in query_embeddings]
recall_512 = evaluate_recall_at_k(query_512, corpus_512, ground_truth, k=5)
print(f"Recall@5 at 512d: {recall_512:.3f}")

For a SciFact-style corpus with 5,000 scientific documents, you would likely find that 768 dimensions provides optimal performance. For a FEVER-style corpus with 3,000 factual claims, you would find 256-512 dimensions sufficient.
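If you prefer to run the whole sweep in one pass rather than stepping through dimensions by hand, a simple loop over candidate truncation points reuses evaluate_recall_at_k, the full embeddings, ground truth and baseline from Step 1:

for dims in (64, 128, 256, 512, 768):
    corpus_d = [emb[:dims] for emb in full_embeddings]
    query_d = [emb[:dims] for emb in query_embeddings]
    recall = evaluate_recall_at_k(query_d, corpus_d, ground_truth, k=5)
    print(f"{dims}d: recall@5 = {recall:.3f} ({recall/baseline_recall*100:.1f}% of baseline)")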

Your domain complexity and corpus size determine optimal dimensions. Run evaluations with ground truth labels before you accept model and service provider defaults. Most domains plateau by 768 dimensions.

Methodology

I used Google's gemini-embedding-001 model, which produces 3072-dimension embeddings with Matryoshka support. All tests used the same embeddings truncated to different dimension counts, which isolates the effect of dimensionality from model quality differences.

I tested five BEIR benchmark datasets. BEIR provides standardised retrieval benchmarks with ground truth labels. Each dataset includes documents, queries and relevance judgements.

The FEVER subset contains 3,281 Wikipedia passages with 300 queries testing fact verification. The SciFact subset contains 5,183 scientific abstracts with 300 queries. The FiQA set contains 57,638 financial discussion posts with 648 queries. The NFCorpus set contains 3,633 medical literature documents with 323 queries. The Climate-FEVER subset contains 3,435 documents with 300 queries testing climate science claims.

I measured recall@1, recall@5, recall@10 and mean reciprocal rank (MRR). Recall@k measures whether any relevant document appears in the top-k results. MRR measures the rank of the first relevant document. These metrics directly measure retrieval quality for RAG systems.
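Recall@k is implemented in Step 1 above; for completeness, here is a minimal MRR sketch in the same style (it assumes the numpy import from Step 1, and the evaluate_mrr name and inputs are mine, not part of BEIR):

def evaluate_mrr(query_embs, corpus_embs, ground_truth):
    """Mean reciprocal rank of the first relevant document for each query."""
    reciprocal_ranks = []
    for i, q_emb in enumerate(query_embs):
        scores = [np.dot(q_emb, c_emb) / (np.linalg.norm(q_emb) * np.linalg.norm(c_emb))
                  for c_emb in corpus_embs]
        ranked = np.argsort(scores)[::-1]  # best match first
        rank = next((pos + 1 for pos, idx in enumerate(ranked) if idx in ground_truth[i]), None)
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    return float(np.mean(reciprocal_ranks))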

You can reproduce the tests using the public BEIR datasets and Google's embedding API.
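As a sketch, the beir package handles downloading and loading the datasets; the snippet below follows its standard quickstart pattern (pip install beir assumed):

from beir import util
from beir.datasets.data_loader import GenericDataLoader

dataset = "scifact"
url = f"https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/{dataset}.zip"
data_path = util.download_and_unzip(url, "datasets")

# corpus: {doc_id: {"title", "text"}}, queries: {query_id: text}, qrels: {query_id: {doc_id: relevance}}
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")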

Conclusion

The industry recommendation of 768-3072 dimensions assumes worst-case scenarios. Simple factual content plateaus at 512 dimensions. Complex technical content plateaus at 768 dimensions. Medical literature is the only domain where dimensions beyond 768 provided meaningful improvement.

Corpus size matters. A 15x increase in corpus size shifted optimal dimensions from 256 to 768. If you have fewer than 5,000 documents or chunks, you probably need fewer dimensions than model and service provider recommendations suggest.

If you are using defaults, the embeddings you are using today are probably 4-6x larger than they need to be. Test with your own data before you accept defaults.

In the follow-up post, Selection and Ensemble Strategies for Embedding Retrieval, I extend these experiments with techniques for operating below optimal dimensions and achieving 2x speed improvements through cascaded retrieval.


I carried out this research using SolveIt. SolveIt is my AI collaborator and is like having a shared brain. The SolveIt methodology and approach I described in The Human is the Agent made these experiments feasible.


Chris Thomas is an AI consultant and solutions developer with over 25 years of programming experience. He has been learning about and using AI since 1997. You can find more of his work at christhomas.co.uk.


Connect with me on LinkedIn

Follow me on X (Twitter)