Codebase Vector Search Whitepaper Notes

This page captures the current state of MemorySmith's codebase vector-search system, the refinements already made, the benchmark evidence gathered so far, and the remaining research questions. Treat it as a working whitepaper outline grounded in current code and measured behavior rather than a marketing summary.

Executive Summary

MemorySmith now has a real codebase vector-search pipeline with:

The main current conclusion is uncomfortable but useful: on the active MemorySmith repo and CUDA host, the provider-only batch benchmark is faster than scalar inference, yet the full end-to-end code-search rebuild still gets slower as batch size increases. The correct default today is scalar document embedding, not because batching is theoretically bad, but because the current workload and pipeline shape do not convert the micro-benchmark win into a rebuild win.

System Overview

The current code-search stack is built around these responsibilities:

| Layer | Responsibility |
| --- | --- |
| CodeSearchService | Scans the configured repo roots, chunks documents, reuses unchanged documents, embeds prepared chunks, writes SQLite rows, and serves code-search results plus build status. |
| SemanticEmbeddingSearchService | Hosts ONNX embedding infrastructure, tokenizer/pooling behavior, provider selection, and provider-health reporting. |
| OnnxTextEmbeddingProvider | Performs ONNX session setup, inference, pooling, and optional batch embedding. |
| SQLite index | Persists chunk rows and reusable metadata under Data/Graph/code-search/code-search.db. |
| MCP/chat tools | Expose memorysmith_code_search and memorysmith_code_search_status so indexing/search is visible to agents and operators. |

The design target is local-first reliability, not cloud-scale novelty. The system is supposed to be inspectable, recoverable, and practical on a single workstation or a small self-hosted service.

Refinements Added So Far

1. Minimal End-To-End Code Search Slice

The original TSK-0195 implementation established a durable baseline:

That change made codebase search real, but not yet user-friendly under longer rebuilds or hardware-provider experiments.

2. Import Reliability And Progress Visibility

The next hardening sweep focused on trust and diagnosability:

This was the first point where the import stopped feeling like a black box.

3. Hardware Acceleration And CPU Fallback

The semantic provider now supports runtime selection of:

The runtime selection is intentionally decoupled from the published ONNX package flavor. That lets MemorySmith expose optional hardware acceleration without making the entire semantic stack brittle. When a requested hardware provider fails and CpuFallbackEnabled=true, the app keeps semantic embeddings available through CPU ONNX instead of silently disabling the feature.
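The fallback rule described above can be sketched in a few lines. This is a minimal illustration, not the MemorySmith implementation: `select_provider`, `ProviderSelection`, and the `factories` mapping are hypothetical names, and the factory callables stand in for whatever ONNX session setup the real provider performs. Only the `CpuFallbackEnabled` behavior (keep embeddings available on CPU and keep the failure visible) comes from the text.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class ProviderSelection:
    requested: str            # provider the operator asked for
    active: Optional[str]     # provider actually serving embeddings, or None
    fell_back: bool           # True when CPU replaced a failed hardware provider
    error: Optional[str]      # failure message kept for health reporting


def select_provider(
    requested: str,
    factories: Dict[str, Callable[[], object]],
    cpu_fallback_enabled: bool,
) -> ProviderSelection:
    """Try the requested provider; fall back to CPU only when allowed."""
    try:
        factories[requested]()  # hypothetical session setup; may raise
        return ProviderSelection(requested, requested, False, None)
    except Exception as exc:  # any setup failure counts as a provider failure
        error = f"{requested} failed: {exc}"

    if cpu_fallback_enabled and requested != "cpu":
        try:
            factories["cpu"]()
            # The original error is preserved so the fallback is never silent.
            return ProviderSelection(requested, "cpu", True, error)
        except Exception as exc:
            error = f"{error}; cpu fallback failed: {exc}"
    return ProviderSelection(requested, None, False, error)
```

The design point is that the result always records both the requested and the active provider, so a health report can show that a fallback happened instead of hiding it.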

4. Startup Prewarm

PrewarmOnStartupEnabled now defaults to true. The goal is simple: do not make the first real semantic or code-search request pay the full lazy ONNX initialization cost when the app can warm the provider in the background after startup.
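A background prewarm of this kind might look like the sketch below. It assumes a lazily built provider guarded by a lock; `LazyProvider` and its members are hypothetical names used only for illustration, and the factory stands in for the expensive ONNX initialization.

```python
import threading
from typing import Callable, Optional


class LazyProvider:
    """Builds an expensive resource once, either on first use or via prewarm."""

    def __init__(self, factory: Callable[[], object]):
        self._factory = factory
        self._lock = threading.Lock()
        self._instance: Optional[object] = None
        self.init_count = 0  # exposed so tests and status pages can verify it

    def get(self) -> object:
        # Initialization is serialized under the lock, so a prewarm thread and
        # a real request can never build the resource twice.
        with self._lock:
            if self._instance is None:
                self._instance = self._factory()
                self.init_count += 1
            return self._instance

    def prewarm_in_background(self) -> threading.Thread:
        # Daemon thread: startup is not blocked and shutdown is not delayed.
        thread = threading.Thread(target=self.get, name="prewarm", daemon=True)
        thread.start()
        return thread
```

A request that arrives while the prewarm is still running simply waits on the lock and then reuses the warmed instance, so it pays at most the remaining initialization time.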

This improves operator experience but does not solve steady-state rebuild throughput by itself.

5. Batch Document Embedding Path

An explicit batch document-embedding path was added so code-search indexing could reuse a batched provider when available. That work was valid and necessary to test, but the later live benchmark sweep showed that keeping the path available does not imply it should be the default on the current workload.
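As a rough illustration of what a batched path changes, the sketch below groups prepared chunks into fixed-size batches and counts provider calls. The names `embed_chunks` and `embed_batch` are hypothetical, and the real indexing path may group chunks differently (for example per document); a batch size of 1 reduces to the scalar path.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def embed_chunks(
    chunks: Sequence[str],
    batch_size: int,
    embed_batch: Callable[[Sequence[str]], List[Vector]],
) -> Tuple[List[Vector], int]:
    """Embed chunks in fixed-size batches; return vectors and the call count."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    vectors: List[Vector] = []
    calls = 0
    for start in range(0, len(chunks), batch_size):
        batch = chunks[start:start + batch_size]
        result = embed_batch(batch)  # one provider call per batch
        if len(result) != len(batch):
            raise RuntimeError("provider returned a mismatched vector count")
        vectors.extend(result)
        calls += 1
    return vectors, calls
```

Returning the call count alongside the vectors makes it cheap to record embedding calls per rebuild, which is the figure a batch-size sweep needs next to elapsed time.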

6. Fail Closed On Embedding Errors

The indexing path now treats document embedding failures as actual per-file indexing failures.

This matters because the earlier behavior could silently accept a failed TryEmbed(...) call, keep going with an empty vector, and write a document into the SQLite index without a usable embedding. That kind of partial success is worse than a visible failure because it poisons the corpus quietly and makes relevance problems look like ranking issues instead of ingestion defects.
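The difference between the two behaviors can be shown with a small sketch. It assumes a `try_embed` callable that returns `None` or an empty list on failure, loosely mirroring the TryEmbed(...) call mentioned above; `index_document`, `index_all`, `EmbeddingFailed`, and the dictionary used as a store are hypothetical stand-ins, not the actual indexing code.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Vector = List[float]


class EmbeddingFailed(Exception):
    """Raised when any chunk of a document cannot be embedded."""


def index_document(
    path: str,
    chunks: Sequence[str],
    try_embed: Callable[[str], Optional[Vector]],
    store: Dict[str, List[Tuple[str, Vector]]],
) -> None:
    """Embed every chunk first; write rows only if all embeddings are usable."""
    rows: List[Tuple[str, Vector]] = []
    for i, chunk in enumerate(chunks):
        vector = try_embed(chunk)
        if not vector:  # None or empty vector: fail closed, write nothing
            raise EmbeddingFailed(f"{path}: chunk {i} produced no embedding")
        rows.append((chunk, vector))
    store[path] = rows  # reached only when every chunk succeeded


def index_all(
    documents: Dict[str, Sequence[str]],
    try_embed: Callable[[str], Optional[Vector]],
    store: Dict[str, List[Tuple[str, Vector]]],
) -> Dict[str, str]:
    """Index each document; return per-file failures instead of hiding them."""
    failures: Dict[str, str] = {}
    for path, chunks in documents.items():
        try:
            index_document(path, chunks, try_embed, store)
        except EmbeddingFailed as exc:
            failures[path] = str(exc)
    return failures
```

Because rows are collected before anything is written, a failing document leaves no partial entry behind, and the returned failure map gives status reporting something concrete to show.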

The current behavior is stricter and more trustworthy:

Current Measured Evidence

Live CUDA Rebuild Sweep

Cold forced rebuilds were measured against the live MemorySmith repo and the same host/corpus across batch sizes:

| Batch Size | Elapsed Rebuild Time | Embedding Calls |
| --- | --- | --- |
| 1 | 116,992 ms | 2,417 |
| 2 | 125,565 ms | 1,273 |
| 4 | 140,630 ms | 705 |
| 8 | 147,988 ms | 433 |
| 16 | 159,699 ms | 300 |

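The curve is monotonic in the wrong direction: every increase in batch size made the rebuild slower. At batch size 16 the rebuild took about 36.5% longer than at batch size 1, even though it issued roughly one eighth as many embedding calls. Fewer embedding calls did not produce faster rebuilds.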
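The arithmetic behind that reading is easy to reproduce. The sketch below uses only the figures from the sweep table; `summarize` is a hypothetical helper. Note that "elapsed per call" divides the whole rebuild time by the call count, so it includes all pipeline work and is not a pure inference latency.

```python
# Figures copied from the sweep table: batch size -> (elapsed ms, embedding calls).
SWEEP = {
    1: (116_992, 2_417),
    2: (125_565, 1_273),
    4: (140_630, 705),
    8: (147_988, 433),
    16: (159_699, 300),
}


def summarize(sweep):
    """Return (batch, slowdown % vs batch 1, rebuild ms per embedding call)."""
    base_ms, _ = sweep[1]
    rows = []
    for batch, (elapsed_ms, calls) in sorted(sweep.items()):
        slowdown_pct = (elapsed_ms / base_ms - 1.0) * 100.0
        ms_per_call = elapsed_ms / calls
        rows.append((batch, round(slowdown_pct, 1), round(ms_per_call, 1)))
    return rows


for batch, slowdown, per_call in summarize(SWEEP):
    print(f"batch={batch:>2}  slowdown={slowdown:>5.1f}%  elapsed/call={per_call:>6.1f} ms")
```

Elapsed time per call grows faster than the call count shrinks, which is another way of stating that the total rebuild time rises with batch size.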

Artifacts for that sweep were captured under artifacts/browser-validation/ as:

Provider-Only CUDA Micro-Benchmark

The provider itself still showed a batching win:

| Mode | Median Latency |
| --- | --- |
| Scalar | 447 ms |
| Batch | 307 ms |

The batch median is about 31% lower than the scalar median, which is exactly why end-to-end measurement matters. The provider benchmark alone would have supported the wrong default.
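For reference, a median-latency probe of this general kind can be sketched as follows. This is not the harness that produced the numbers above; `median_latency_ms` and its parameters are hypothetical, and the injectable `clock` exists only so the function can be tested deterministically.

```python
import statistics
import time
from typing import Callable, List


def median_latency_ms(
    run: Callable[[], object],
    repeats: int = 15,
    warmup: int = 3,
    clock: Callable[[], float] = time.perf_counter,
) -> float:
    """Median wall-clock latency of run() in milliseconds, after warmup runs."""
    for _ in range(warmup):
        run()  # untimed: keeps one-off initialization out of the samples
    samples: List[float] = []
    for _ in range(repeats):
        start = clock()
        run()
        samples.append((clock() - start) * 1000.0)
    return statistics.median(samples)
```

For a fair scalar-versus-batch comparison, both `run` callables should embed the same set of texts, once in a loop of single calls and once as a single batch call, so the two medians describe the same amount of work.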

Query Probe Result

Repeated and distinct memorysmith_code_search query probes both completed in about 18-20 ms. Because distinct (uncached) queries were no slower than repeated ones, the current full chunk reload path on uncached queries is not the dominant user-visible bottleneck on the active workload.

Interpretation

The most likely current explanation is that the pipeline overhead outside raw ONNX inference is dominating the apparent batching win. Plausible contributors include:

The system is already telling us something valuable: a faster inference primitive does not guarantee a faster indexing product.

What Is Probably Not The Main Bottleneck Right Now

Current evidence argues against spending immediate engineering time on these paths first:

Those may matter later, but they are not the highest-confidence next moves on the current evidence.

High-Confidence Decisions Already Made

Keep Scalar As The Default Document Embedding Mode

The default CodeSearchOptions.EmbeddingBatchSize was reset to 1 after the live CUDA sweep. This is benchmark-backed and should remain the conservative default until a new end-to-end benchmark proves otherwise.

Keep The Batch Path Available

The batch path should stay in the codebase even though it is not the default. It remains useful for:

Keep CPU Fallback And Startup Prewarm

These are reliability features, not optional polish. Hardware acceleration support is only acceptable when failure modes remain diagnosable and the system still works on CPU.

Research Agenda

The next meaningful research questions are not generic vector-database questions. They are specific to the current MemorySmith stack.

1. Explain The Batch Regression

The highest-value technical question is why the provider-only batch benchmark wins while the live rebuild loses. The answer likely requires deeper instrumentation around:

2. Test Better Batch Shapes, Not Just Bigger Fixed Batches

Fixed 2/4/8/16 batching is a blunt instrument. The more defensible next experiments are:

3. Measure Cold, Warm, And Incremental Rebuilds Separately

MemorySmith now has enough observability to stop mixing these together. Future benchmarks should always separate:

4. Benchmark OpenVINO On An Intel Host

OpenVINO support is wired in, but the current evidence is CUDA-heavy because that is the available host. OpenVINO needs real measurements before any provider-specific recommendation is credible.

5. Decide Whether Finer-Grained Reuse Is Worth It

Document-level reuse already exists. Chunk-level reuse could reduce rebuild cost further, but only if the extra bookkeeping and hash management do not complicate the system more than they help.
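To make the trade-off concrete, here is a minimal sketch of chunk-level reuse keyed by a content hash. It is an assumption about how such reuse could work, not a description of existing MemorySmith code: `chunk_key`, `embed_with_reuse`, and the in-memory `cache` dictionary are hypothetical, and a real index would persist the keys alongside the chunk rows.

```python
import hashlib
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]


def chunk_key(text: str) -> str:
    """Stable content hash used as the reuse key for a chunk."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def embed_with_reuse(
    chunks: Sequence[str],
    cache: Dict[str, Vector],
    embed: Callable[[str], Vector],
) -> Tuple[List[Vector], int, int]:
    """Return vectors plus counts of reused and newly embedded chunks."""
    vectors: List[Vector] = []
    reused = embedded = 0
    for text in chunks:
        key = chunk_key(text)
        if key in cache:
            reused += 1
        else:
            cache[key] = embed(text)  # only unseen content reaches the provider
            embedded += 1
        vectors.append(cache[key])
    return vectors, reused, embedded
```

The bookkeeping cost mentioned above shows up even in this sketch: the key would also need to cover anything else that changes the vector (for example the model or chunking settings), and stale entries need a cleanup policy.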

Suggested Whitepaper Sections For Future Expansion

If this page grows into a fuller paper, expand it in this order:

  1. Measurement methodology and reproducibility.
  2. Corpus-shape analysis for the MemorySmith repo.
  3. Provider initialization versus steady-state throughput.
  4. Batch-shape and padding-efficiency analysis.
  5. Retrieval quality measurements, not just rebuild speed.
  6. Operator-trust design: status, errors, warnings, and fallback visibility.