Selection and Ensemble Strategies for Embedding Retrieval¶
In my previous post, I argued that many RAG systems use embeddings 4-6x larger than necessary. Simple factual content plateaus at 256-512 dimensions. Complex technical content plateaus at 768 dimensions. Test with your own data and you will probably find you need far fewer dimensions than model and service providers' recommendations suggest.
What if you cannot afford even those optimal dimensions? This post covers three techniques I tested: domain-adaptive dimension selection, ensemble approaches and cascaded retrieval. These techniques help when you operate below optimal dimensions or need better performance within resource constraints.
Selecting dimensions based on domain variance¶
This idea came from a conversation with Nathan Cooper, an R&D Engineer at Answer AI, on the SolveIt Discord. Hearing that the most important information lives in the earlier dimensions made Nathan wonder whether you could do a radix-style sort on Matryoshka embeddings that is not possible with other embedding types. That thought turned out to be well worth investigating.
Matryoshka embeddings are trained to pack the most important information into early dimensions. This works well across diverse datasets, but for domain-specific corpora you can do better. You can select dimensions based on which ones vary most across your documents. I computed the variance of each dimension across 100 sample documents from each domain. Dimensions with high variance discriminate better between documents.
For the SciFact full corpus at 512 dimensions, domain-adaptive selection chose 337 of the first 512 dimensions and 175 dimensions from beyond position 512 (such as dimensions 602, 715, and 733).
For example, if dimensions 50, 120, and 240 from the first 512 have low variance, they might be replaced by dimensions 602, 715, and 733 which have higher variance. You end up with 512 dimensions total, but they are the most discriminative 512 dimensions for your specific corpus, not necessarily the first 512 from the MRL training.
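To make the mechanics concrete, here is a toy sketch with made-up variance values showing how high-variance dimensions beyond position 512 displace low-variance dimensions inside it; the full selection code appears in the implementation guide below.

```python
import numpy as np

variances = np.full(768, 0.5)           # made-up per-dimension variances
variances[[50, 120, 240]] = 0.01        # low-variance dims inside the first 512
variances[[602, 715, 733]] = 0.9        # high-variance dims beyond position 512

selected = np.argsort(variances)[::-1][:512]    # keep the 512 most discriminative
assert {602, 715, 733} <= set(selected)         # late high-variance dims get in
assert not {50, 120, 240} & set(selected)       # early low-variance dims drop out
```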
How much performance improves varies by domain and dimension count. The table below reports improvements in percentage points (pp).
| Dataset | Corpus Size | Dimensions | Original | Selected | Improvement (pp) |
|---|---|---|---|---|---|
| SciFact | 5,183 docs | 64d | 24% | 49% | +25 |
| Climate-FEVER | 3,435 docs | 128d | 13% | 23% | +10 |
| FiQA (subset) | 408 docs | 128d | 74% | 90% | +16 |
| FiQA (full) | 57,638 docs | 128d | 26% | 23% | -3 |
| NFCorpus | 3,633 docs | 128d | 25-38% | 25-38% | ~0 |
When I tested SciFact full corpus (5,183 docs) at 64 dimensions, performance improved from 24% recall@1 to 49% recall@1 with domain-adaptive selection. This nearly matches the original 128 dimension performance with half the dimensions.
Testing a Climate-FEVER subset (3,435 docs) at 128 dimensions, performance improved from 13% recall@1 to 23% recall@1. The improvement was 10 percentage points.
For the FiQA full corpus (57,638 docs) results at 128 dimensions, selection hurt performance. The original achieved 26% recall@1 while selected dropped to 23% recall@1. However, on a smaller 408 document subset, selected dimensions improved from 74% to 90% recall@1.
I tested domain-adaptive selection on FiQA with 100, 5,000, and all 57,638 documents for variance calculation. Performance declined with every sample size tested. At 128 dimensions, even full-corpus variance (26.2% → 21.5%) underperformed the standard MRL ordering.
This demonstrates a fundamental limitation: for large diverse corpora, domain-adaptive selection sacrifices general semantic structure for corpus-specific discrimination. At sub-optimal dimensions, you need the general structure more. Use this technique only for moderate-sized corpora (1,000-10,000 documents) with focused domains.
With the NFCorpus full corpus (3,633 docs) at 128 dimensions, I found minimal difference between original and selected dimensions, both achieving around 25-38% recall@1 depending on the specific test configuration.
The tests indicate that domain-adaptive selection helps most when you operate well below optimal dimensions on moderate-sized corpora (1,000-10,000 documents) with focused domains. For simple domains like FEVER, selection provides no benefit. For very large diverse corpora like FiQA (57,638 docs), selection hurts performance because it sacrifices general semantic structure. Corpus size and diversity both affect whether selection helps.
At optimal dimensions, domain-adaptive selection converges with the original ordering. Both approaches achieve the same performance because you are using all the high-variance dimensions anyway. At 768 dimensions, all dimensions from the full set are included in both approaches, just in different order, which makes no difference to cosine similarity.
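This is easy to verify: cosine similarity depends only on which dimensions are included, not on the order they appear in, so permuting the same set of dimensions leaves every score unchanged. A minimal numpy sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = rng.normal(size=768), rng.normal(size=768)
perm = rng.permutation(768)  # any reordering of the same 768 dimensions

cos = lambda x, y: x @ y / (np.linalg.norm(x) * np.linalg.norm(y))
assert np.isclose(cos(a, b), cos(a[perm], b[perm]))  # similarity is identical
```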
Ensemble approaches for resource-constrained scenarios¶
You can store both the original Matryoshka ordering and the domain-adaptive ordering at the same dimension count, search with both and average the similarity scores. This ensemble approach can match the performance of much higher-dimensional embeddings.
| Dataset | Approach | Dimensions | Recall@1 | Storage | vs Double Dimensions |
|---|---|---|---|---|---|
| FEVER | Ensemble | 128d | 98.7% | 2x | Beats 256d (97.3%) |
| FEVER | Single | 512d | 99% | 1x | Optimal |
| NFCorpus | Ensemble | 64d | 29.1% | 2x | Beats 128d (25.4%) |
| NFCorpus | Single | 128d | 25.4% | 1x | Worse than ensemble |
| SciFact | Ensemble | 64d | 57.7% | 2x | Loses to 128d (60.7%) |
| SciFact | Single | 128d | 60.7% | 1x | Better |
The FEVER ensemble at 128 dimensions achieves 98.7% recall@1, essentially matching single 512-dimension performance (99%). You store two 128-dimension vectors (original + domain-adaptive), totalling 256 dimensions of storage: half the storage of a single 512-dimension vector while achieving similar performance.
This approach doubles storage and compute per query. However, two smaller individual embeddings (e.g. 2x 64 dimensions instead of 1x 256 dimensions) may fit better into specific architectures, though total RAM usage remains similar.
Ensemble benefits disappear at 256 dimensions and above. This technique only helps at very low dimensions (64-128 dimensions) when optimal dimensions are much higher (512+ dimensions).
For most production systems with adequate resources, use optimal dimensions with single ordering. For edge deployments, mobile applications or systems with strict memory limits where you cannot afford optimal dimensions, ensemble approaches may provide better performance within resource constraints on some datasets.
Cascaded retrieval achieves optimal performance with 2x speed improvement¶
The Matryoshka paper mentions using low dimensions for initial retrieval and higher dimensions for re-ranking. I tested this systematically across all five datasets. Cascaded retrieval searches all documents at low dimensions, then re-ranks only top candidates at higher dimensions.
A Climate-FEVER subset (708 documents) demonstrates the technique:
| Configuration | Stage 1 | Stage 2 | Docs Reranked | Recall@1 | Speed vs Full |
|---|---|---|---|---|---|
| Full 256d | - | 256d | 708 (100%) | 38.7% | 1x (baseline) |
| Cascade | 128d | 256d (top-20) | 20 (3%) | 38.7% | ~2x faster |
| Cascade | 64d | 256d (top-50) | 50 (7%) | 38.7% | ~1.7x faster |
| Single 128d | 128d | - | 0 | 29.3% | 2x faster |
The 128-256 dimension cascade achieves optimal 38.7% recall@1 whilst computing 256 dimension similarities for just 3% of the corpus. This is approximately 2x faster compared to full 256 dimension search.
I tested this pattern across all five datasets. Using lower dimensions for first-stage retrieval, then re-ranking top-50-100 candidates at optimal dimensions achieves zero quality loss with 2-3x compute reduction. The exact speedup depends on corpus size, dimension ratios and candidate set size.
The optimal configuration depends on your domain and corpus size. For Climate-FEVER's 708 documents, 128-256 dimensions with top-20 worked well. Larger corpora need 256-768 dimensions with top-50-100 candidates.
Beyond speed, cascaded retrieval lets you adapt compute cost to query complexity. Store both dimension sizes and use only the first stage for simple queries, both stages for complex queries. Vector databases charge per dimension, so a 128 dimension primary index costs 6x less than 768d. Re-ranking top-k candidates at higher dimensions adds minimal cost.
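As an illustration, here is one way such routing could look. The is_complex flag is entirely hypothetical (it could come from query length, a classifier or a user setting), and cascaded_search refers to the helper defined in the implementation guide below.

```python
import numpy as np

def adaptive_search(query_emb, corpus_embs, is_complex):
    if is_complex:
        # Complex query: cheap recall pass plus higher-dimension re-ranking.
        return cascaded_search(query_emb, corpus_embs,
                               stage1_dims=128, stage2_dims=256)
    # Simple query: the low-dimension first stage alone is often good enough.
    q = query_emb[:128]
    corpus = corpus_embs[:, :128]
    scores = corpus @ q / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(q))
    return np.argsort(scores)[-5:][::-1]
```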
Implementation guide¶
Here is how to implement each technique with your own embeddings.
Domain-adaptive dimension selection:
Compute variance across your corpus and select the most discriminative dimensions for your domain.
```python
import numpy as np

def select_dimensions_by_variance(embeddings, n_dims=256, sample_size=100):
    # Rank every dimension by its variance across a sample of corpus embeddings
    # and keep the n_dims most discriminative ones, regardless of position.
    sample = embeddings[:sample_size]
    variances = [(i, np.var(sample[:, i])) for i in range(embeddings.shape[1])]
    sorted_dims = sorted(variances, key=lambda x: x[1], reverse=True)
    return [d[0] for d in sorted_dims[:n_dims]]

# embed() is whatever function returns the full-size embedding for a document.
corpus_embs = np.array([embed(doc) for doc in corpus[:100]])
selected_dims = select_dimensions_by_variance(corpus_embs, n_dims=256)

def search_with_selected_dims(query_emb, corpus_embs, selected_dims, k=5):
    # Slice both query and corpus down to the selected dimensions,
    # then rank by cosine similarity.
    q = query_emb[selected_dims]
    corpus_selected = corpus_embs[:, selected_dims]
    scores = corpus_selected @ q / (np.linalg.norm(corpus_selected, axis=1) * np.linalg.norm(q))
    return np.argsort(scores)[-k:][::-1]
```
Ensemble search:
Combine original and domain-adaptive orderings by averaging their similarity scores.
```python
def ensemble_search(query_emb, corpus_embs, selected_dims, n_dims=256, k=5):
    # Score once with the original Matryoshka ordering (first n_dims dimensions).
    original_dims = list(range(n_dims))
    q_orig = query_emb[original_dims]
    corpus_orig = corpus_embs[:, original_dims]
    scores_orig = corpus_orig @ q_orig / (np.linalg.norm(corpus_orig, axis=1) * np.linalg.norm(q_orig))
    # Score again with the domain-adaptive ordering.
    q_sel = query_emb[selected_dims]
    corpus_sel = corpus_embs[:, selected_dims]
    scores_sel = corpus_sel @ q_sel / (np.linalg.norm(corpus_sel, axis=1) * np.linalg.norm(q_sel))
    # Average the two cosine similarity scores before ranking.
    combined = (scores_orig + scores_sel) / 2
    return np.argsort(combined)[-k:][::-1]
```
Cascaded retrieval:
Search at low dimensions first, then re-rank top candidates at higher dimensions.
```python
def cascaded_search(query_emb, corpus_embs, stage1_dims=128, stage2_dims=256, stage1_k=20, final_k=5):
    stage1_k = min(stage1_k, len(corpus_embs))
    # Stage 1: cheap search over the whole corpus at low dimensions.
    q1 = query_emb[:stage1_dims]
    corpus1 = corpus_embs[:, :stage1_dims]
    scores1 = corpus1 @ q1 / (np.linalg.norm(corpus1, axis=1) * np.linalg.norm(q1))
    candidates = np.argsort(scores1)[-stage1_k:]
    # Stage 2: re-rank only the top candidates at higher dimensions.
    q2 = query_emb[:stage2_dims]
    corpus2 = corpus_embs[candidates, :stage2_dims]
    scores2 = corpus2 @ q2 / (np.linalg.norm(corpus2, axis=1) * np.linalg.norm(q2))
    final = candidates[np.argsort(scores2)[-final_k:][::-1]]
    return final
```
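A minimal usage sketch, assuming the same hypothetical embed() function and corpus_embs array from the selection example above:

```python
query_emb = embed("What reduces the risk of heart disease?")  # hypothetical query
top_docs = cascaded_search(query_emb, corpus_embs,
                           stage1_dims=128, stage2_dims=256, stage1_k=20, final_k=5)
print(top_docs)  # indices of the top 5 documents after re-ranking
```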
Which technique should I use?¶
| Situation | Best Technique | Expected Benefit |
|---|---|---|
| Need faster search | Cascaded retrieval | 2x speedup, no quality loss |
| Cannot afford optimal dims (within 2x, e.g. 256d vs 512d) | Domain-adaptive selection | 5-25pp improvement |
| Cannot afford optimal dims (>2x below, e.g. 128d vs 512d) | Ensemble | Match higher dims, 2x cost |
| Have optimal dimensions, want more speed | Cascaded retrieval | 2x speedup |
A decision tree for choosing techniques based on my results:

Methodology¶
I used Google's gemini-embedding-001 model with 3072 dimension embeddings truncated to different sizes. I tested on five BEIR datasets spanning different domains and corpus sizes:
- FEVER (3,281 docs, 300 queries, factual claims)
- Climate-FEVER (3,435 docs, 300 queries, climate science)
- SciFact (5,183 docs, 300 queries, scientific abstracts)
- FiQA (57,638 docs, 648 queries, financial Q&A)
- NFCorpus (3,633 docs, 323 queries, medical literature)
BEIR provides standardised retrieval benchmarks with ground truth labels. I measured recall@1, recall@5, recall@10 and MRR across all datasets.
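For reference, these metrics are simple to compute once you have a ranked result list and the relevant document ids for each query. A minimal per-query sketch (averaged over all queries in practice), where ranking is the ordered list of retrieved doc ids and relevant is the set of relevant doc ids from the qrels:

```python
def recall_at_k(ranking, relevant, k):
    # 1.0 if any relevant document appears in the top k, else 0.0.
    return float(any(doc_id in relevant for doc_id in ranking[:k]))

def mrr(ranking, relevant):
    # Reciprocal rank of the first relevant document, 0.0 if none is retrieved.
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0
```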
For domain-adaptive dimension selection, I computed variance for each dimension across 100 sample documents from each corpus. I selected the top N dimensions by variance, regardless of their position in the original embedding. This creates a domain-specific dimension ordering optimised for discrimination within that corpus.
For ensemble approaches, I stored both original and domain-adaptive orderings at the same dimension count. At query time I computed cosine similarity with both orderings and averaged the similarity scores before ranking. This doubles storage and compute but uses the same RAM when you load vectors for search.
For cascaded retrieval, I tested systematically across all five datasets. I searched the full corpus at low dimensions, selected the top-k candidates and re-ranked them at higher dimensions. I tested various dimension combinations (64d-256d, 128d-256d, 256d-768d) and top-k thresholds (top-10, top-20, top-50, top-100) to find optimal configurations for each dataset. The pattern held consistently: using half the optimal dimensions for first-stage retrieval with appropriate candidate set sizes achieved zero quality loss across all corpora tested.
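A sketch of how such a sweep might look with the cascaded_search helper above; the evaluate() call stands in for whatever recall computation you use and is hypothetical:

```python
from itertools import product

dim_pairs = [(64, 256), (128, 256), (256, 768)]
candidate_ks = [10, 20, 50, 100]

for (d1, d2), k in product(dim_pairs, candidate_ks):
    # evaluate() is a placeholder that runs all queries and returns recall@1.
    recall = evaluate(lambda q: cascaded_search(q, corpus_embs,
                                                stage1_dims=d1, stage2_dims=d2,
                                                stage1_k=k))
    print(f"{d1}d -> {d2}d, top-{k}: recall@1 = {recall:.1%}")
```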
Across all five datasets, I tested close to 2,000 queries with ground truth relevance labels.
You can reproduce the tests using the public BEIR datasets and Google's embedding API.
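If you want to start from the same data, the beir Python package downloads each dataset and returns the corpus, queries and relevance labels. A sketch using SciFact as an example, with embed() left as a placeholder for your Gemini (or other) embedding call:

```python
from beir import util
from beir.datasets.data_loader import GenericDataLoader

url = "https://public.ukp.informatik.tu-darmstadt.de/thakur/BEIR/datasets/scifact.zip"
data_path = util.download_and_unzip(url, "datasets")
corpus, queries, qrels = GenericDataLoader(data_folder=data_path).load(split="test")

# corpus: {doc_id: {"title": ..., "text": ...}}, queries: {query_id: text},
# qrels: {query_id: {doc_id: relevance}} -- the ground truth for recall/MRR.
doc_texts = [d["title"] + " " + d["text"] for d in corpus.values()]
```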
Conclusion¶
When you cannot afford optimal dimensions, these advanced techniques can recover most of the quality. Domain-adaptive selection provides 5-25 percentage point gains at sub-optimal dimensions for moderate complexity domains. Ensemble matches single embeddings at 2x dimensions, or beats single embeddings at equal storage for very low dimensions. Cascaded retrieval provides 2x compute reduction with zero quality loss.
The key is understanding when each technique helps and when it hurts. Test on your specific domain and corpus size to find the right combination.
Combined with the domain-appropriate optimal dimensions from my previous article, you now have a method for deciding your embedding dimensions. Find your optimal dimensions, then use these techniques when constraints force you below them.
This research was carried out by me using SolveIt. SolveIt is my AI collaborator and is like having a shared brain. The SolveIt methodology and approach I described in The Human is the Agent made these experiments and research feasible.
Chris Thomas is an AI consultant and solutions developer with over 25 years of programming experience. He has been learning about and using AI since 1997. You can find more of his work at christhomas.co.uk.