Code Search Benchmark Breakdown: CPU Fallback vs CUDA (2026-05-28)
This page captures the current measured code-search indexing benchmarks from local MemorySmith runs, with direct CPU-fallback versus CUDA comparisons and a reliability-oriented interpretation.
Scope And Caveats
- These numbers are from the local host and local repo snapshot used in each run.
- The runs include two nearby corpus sizes:
  - `225` files / `2,388` chunks (earlier CPU/CUDA comparison set)
  - `228` files / `2,417` chunks (later CUDA batch-sweep set)
- Because corpus size changed, compare within each measurement set first, then compare across sets with caution.
- Warm runs (`ForceRebuild=false`) measure reuse behavior and request overhead, not full re-embedding throughput.
Source Artifacts
Primary comparison set:
- `artifacts/browser-validation/code-search-index-summary-cpu-fallback-cold.json`
- `artifacts/browser-validation/code-search-index-summary-cpu-fallback-warm.json`
- `artifacts/browser-validation/code-search-index-summary-cuda-cold.json`
- `artifacts/browser-validation/code-search-index-summary-cuda-warm.json`
- `artifacts/browser-validation/code-search-index-summary-cuda-warm-postcold.json`
CUDA batch sweep set:
- `artifacts/browser-validation/code-search-index-summary-cuda-batch1-cold.json`
- `artifacts/browser-validation/code-search-index-summary-cuda-batch2-cold.json`
- `artifacts/browser-validation/code-search-index-summary-cuda-batch4-cold.json`
- `artifacts/browser-validation/code-search-index-summary-cuda-batch8-cold.json`
- `artifacts/browser-validation/code-search-index-summary-cuda-batch16-cold.json`
CPU Fallback vs CUDA: Main Comparison (225 Files / 2,388 Chunks)
| Scenario | Elapsed (ms) | Build Duration (ms) | Files/s | Chunks/s | Provider Status |
|---|---|---|---|---|---|
| CPU fallback cold (`ForceRebuild=true`) | 418,824 | 417,000 | 0.54 | 5.73 | Requested CUDA unavailable; fell back to CPU |
| CPU fallback warm (`ForceRebuild=false`) | 771 | 0 | n/a | n/a | Requested CUDA unavailable; fell back to CPU |
| CUDA cold (`ForceRebuild=true`) | 525,224 | 524,000 | 0.43 | 4.56 | ONNX provider available via Cuda (0) |
| CUDA warm first run (`ForceRebuild=false`) | 53,849 | 0 | n/a | n/a | ONNX provider available via Cuda (0) |
| CUDA warm post-cold (`ForceRebuild=false`) | 865 | 0 | n/a | n/a | ONNX provider available via Cuda (0) |
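The Files/s and Chunks/s columns appear consistent with dividing the corpus size by the build duration in seconds, rather than by total elapsed time. The following is a minimal Python sketch under that assumption; the `throughput` helper is illustrative and not part of MemorySmith.

```python
def throughput(items: int, build_duration_ms: int) -> float:
    """Items processed per second, rounded to two decimals as in the table."""
    return round(items / (build_duration_ms / 1000), 2)


# Corpus size for the main comparison set.
FILES, CHUNKS = 225, 2_388

# CPU fallback cold: build duration 417,000 ms.
cpu_files_per_s = throughput(FILES, 417_000)
cpu_chunks_per_s = throughput(CHUNKS, 417_000)

# CUDA cold: build duration 524,000 ms.
cuda_files_per_s = throughput(FILES, 524_000)
cuda_chunks_per_s = throughput(CHUNKS, 524_000)

print(cpu_files_per_s, cpu_chunks_per_s)    # 0.54 5.73
print(cuda_files_per_s, cuda_chunks_per_s)  # 0.43 4.56
```

Warm runs report a build duration of 0 ms, which is why their throughput cells are `n/a`.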
Derived Comparison
- Cold rebuild: CUDA is slower than CPU fallback by `106,400 ms` (about `25.4%` slower).
- Warm steady-state reuse: CUDA post-cold warm (`865 ms`) is close to CPU fallback warm (`771 ms`) but still about `12.2%` slower.
- First CUDA warm request (`53,849 ms`) is a one-time warm-up outlier and should not be compared to steady-state warm numbers.
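The derived percentages can be reproduced from the elapsed times in the main comparison table. This short Python sketch uses CPU fallback as the baseline in both cases; the `percent_slower` helper is an illustrative name, not an existing tool.

```python
def percent_slower(candidate_ms: int, baseline_ms: int) -> float:
    """How much slower the candidate is, as a percentage of the baseline."""
    return (candidate_ms - baseline_ms) / baseline_ms * 100


# Cold rebuild: CUDA elapsed vs CPU fallback elapsed.
cold_delta_ms = 525_224 - 418_824
cold_pct = round(percent_slower(525_224, 418_824), 1)

# Warm steady state: CUDA post-cold warm vs CPU fallback warm.
warm_pct = round(percent_slower(865, 771), 1)

print(cold_delta_ms, cold_pct, warm_pct)  # 106400 25.4 12.2
```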
CUDA Batch Sweep (228 Files / 2,417 Chunks)
| CUDA Batch Size | Elapsed (ms) | Build Duration (ms) | Embedding Calls | Embedded Chunks | Avg Embedding ms/call |
|---|---|---|---|---|---|
| 1 | 116,992 | 116,000 | 2,417 | 2,417 | 46.834 |
| 2 | 125,565 | 125,000 | 1,273 | 2,417 | 96.159 |
| 4 | 140,630 | 139,000 | 705 | 2,417 | 195.363 |
| 8 | 147,988 | 146,000 | 433 | 2,417 | 335.215 |
| 16 | 159,699 | 159,000 | 300 | 2,417 | 523.653 |
Batch Sweep Interpretation
- End-to-end rebuild time worsens monotonically as fixed batch size increases, from 116,992 ms at batch size 1 to 159,699 ms at batch size 16 (about 36.5% slower).
- Lower embedding call count did not produce a faster full rebuild on this workload.
- Most runtime in these runs is embedding-bound (`embeddingMilliseconds` dominates the timing breakdown), not DB write-bound.
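The embedding-bound reading can be cross-checked from the sweep table alone. The sketch below assumes that embedding calls multiplied by average ms per call approximates total embedding time, and that the build durations are rounded to whole seconds. Both are assumptions about how the columns were produced.

```python
# (batch size, build duration ms, embedding calls, avg embedding ms per call)
SWEEP = [
    (1, 116_000, 2_417, 46.834),
    (2, 125_000, 1_273, 96.159),
    (4, 139_000, 705, 195.363),
    (8, 146_000, 433, 335.215),
    (16, 159_000, 300, 523.653),
]
CHUNKS = 2_417  # embedded chunks, identical in every row

shares = []     # estimated embedding time as a fraction of build duration
per_chunk = []  # estimated embedding cost per chunk, in ms

for batch_size, build_ms, calls, avg_ms in SWEEP:
    embedding_ms = calls * avg_ms  # estimated total embedding time
    shares.append(embedding_ms / build_ms)
    per_chunk.append(embedding_ms / CHUNKS)
    print(f"batch={batch_size:>2}  share={shares[-1]:.1%}  "
          f"ms/chunk={per_chunk[-1]:.1f}")
```

Under these assumptions the estimate accounts for roughly 97% to 99% of build duration in every row, and the per-chunk embedding cost rises from about 47 ms at batch size 1 to about 65 ms at batch size 16. Fewer calls therefore did not offset the higher cost of each call.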
Timing-Breakdown Highlights
From the batch sweep timing fields:
- `embeddingMilliseconds` dominates total duration.
- `databaseWriteMilliseconds` remains relatively small compared to embedding time.
- `providerInitializationMilliseconds` is negligible in rebuild runs once the provider is active.
This supports the working hypothesis that pipeline-level embedding overhead (tokenization, padding, tensor assembly, and per-call orchestration) is currently the decisive factor, not SQLite writes.
Operational Conclusions
- Keep `EmbeddingBatchSize=1` as the current default for code-search rebuilds on this host profile.
- Keep CUDA optional and preserve CPU fallback as a reliability baseline.
- Treat first CUDA warm runs separately from steady-state warm measurements.
- Prioritize better batching strategy research (token-budget, length-aware grouping, adaptive batching) before changing defaults again.
- Continue benchmarking with strict run labeling (`cold`, `warm-first`, `warm-steady`, corpus size) to avoid mixed conclusions.
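
As a starting point for the batching research mentioned above, token-budget batching with length-aware grouping could be sketched as follows. This is an illustrative Python sketch, not MemorySmith's implementation: the function name, the `max_padded_tokens` parameter, and the sample token counts are all hypothetical. The idea is to sort chunks by token length and close a batch once its padded size (longest sequence times sequence count) would exceed a budget, which bounds the padding waste that fixed-size batches can incur.

```python
from typing import List, Sequence


def token_budget_batches(
    token_counts: Sequence[int], max_padded_tokens: int
) -> List[List[int]]:
    """Group chunk indices into batches whose padded size fits a budget.

    Padded size = (longest sequence in the batch) * (number of sequences).
    Chunks are visited in ascending length order, so neighbours in a batch
    have similar lengths. A chunk longer than the budget is still emitted,
    as a batch of its own.
    """
    order = sorted(range(len(token_counts)), key=lambda i: token_counts[i])
    batches: List[List[int]] = []
    current: List[int] = []
    for idx in order:
        # Ascending order means the incoming chunk is the longest so far.
        padded = token_counts[idx] * (len(current) + 1)
        if current and padded > max_padded_tokens:
            batches.append(current)
            current = []
        current.append(idx)
    if current:
        batches.append(current)
    return batches


# Hypothetical token counts for six chunks (not measured data).
counts = [12, 480, 35, 30, 500, 14]
batches = token_budget_batches(counts, max_padded_tokens=512)
print(batches)  # [[0, 5, 3, 2], [1], [4]]
```

In this example the four short chunks share one batch while the two long chunks run alone, so short inputs are never padded to the length of long ones. Whether this improves rebuild time on this workload would need to be measured with the same run labeling used above.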