Semantic Acceleration Setup Guide
This guide covers the current ONNX hardware-acceleration options in MemorySmith, how to deploy them safely, how CPU fallback behaves, and how to warm the repo code vector index into Data/Graph/code-search/code-search.db.
1. What MemorySmith Supports
MemorySmith supports three ONNX Runtime deployment flavors for semantic embeddings:
- Cpu: portable default and the current repo-safe baseline.
- Cuda: NVIDIA-backed acceleration on supported Windows/Linux hosts.
- OpenVino: Intel OpenVINO acceleration on supported Windows hosts.
Runtime provider selection is controlled separately from the published binary:
- MemorySmithOnnxRuntimeFlavor controls which native ONNX package is published with the app.
- MemorySmith:SemanticSearch:ExecutionProvider controls which provider the app requests at runtime.
The recommended rule is simple: keep the runtime provider aligned with the published flavor unless you are intentionally testing failure-and-fallback behavior.
1.1 CUDA Host Prerequisites
For the current Microsoft.ML.OnnxRuntime.Gpu 1.24.1 package in this repo, the safest Windows expectation is:
- NVIDIA driver installed
- CUDA runtime/toolkit available on PATH
- cuDNN 9 runtime available on PATH
On the current maintainer machine, the CUDA toolkit added these paths automatically:
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3\bin\x64
C:\Program Files\NVIDIA GPU Computing Toolkit\CUDA\v13.3\bin
The cuDNN local installer did not add its runtime folder automatically. The installed DLLs landed under:
C:\Program Files\NVIDIA\CUDNN\v9.22\bin\12.9\x64
C:\Program Files\NVIDIA\CUDNN\v9.22\bin\13.2\x64
For MemorySmith's current ONNX package, prefer adding the 12.9\x64 cuDNN folder to the machine PATH first. That is the most conservative match for ONNX Runtime's documented CUDA 12 + cuDNN 9 baseline.
Verify that Windows can resolve the DLL before you expect CUDA to activate:
where.exe cudnn64_9.dll
If that command returns nothing, ONNX Runtime will usually fail with a message similar to:
onnxruntime_providers_cuda.dll ... depends on cudnn64_9.dll which is missing
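The PATH check above can also be scripted, which helps when the service account's PATH differs from your interactive shell. The following is a minimal sketch, not part of MemorySmith: find_on_path is a hypothetical helper that approximates where.exe by testing each PATH entry for the file, and it does not model the full Windows DLL search order.

```python
import os
from pathlib import Path


def find_on_path(filename, path_value=None):
    """Return every PATH directory that contains `filename`, in search order.

    Hypothetical helper: it only inspects PATH entries and ignores the
    other locations Windows consults when loading a DLL.
    """
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    hits = []
    for entry in path_value.split(os.pathsep):
        entry = entry.strip().strip('"')  # tolerate quoted PATH entries
        if entry and (Path(entry) / filename).is_file():
            hits.append(entry)
    return hits


if __name__ == "__main__":
    matches = find_on_path("cudnn64_9.dll")
    if matches:
        print("cudnn64_9.dll resolves from:", matches[0])
    else:
        print("cudnn64_9.dll was not found on PATH")
```

If more than one directory is returned, the first entry is the one that wins, which is why the guide recommends putting the 12.9\x64 folder ahead of the others.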
2. CPU Fallback Contract
MemorySmith now falls back to CPU embeddings when all of these are true:
- the requested provider is Cuda or OpenVino
- MemorySmith:SemanticSearch:CpuFallbackEnabled=true
- the requested hardware provider fails to initialize
When fallback happens:
- the app still starts
- semantic embeddings remain available through CPU ONNX
- /health reports the requested provider, the active provider, and the fallback reason
- persisted document embeddings are invalidated when the provider settings or ONNX runtime version change
That means a CUDA-flavored server can be configured to prefer Cuda without turning semantic search off on machines that are missing the right runtime pieces.
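The contract in this section can be read as a small decision function. The sketch below is illustrative only and is not MemorySmith's implementation: ProviderDecision and resolve_provider are hypothetical names, and the initialization outcome is passed in as a plain error string.

```python
from dataclasses import dataclass
from typing import Optional

# Providers that can fall back to CPU, per the contract above.
HARDWARE_PROVIDERS = {"Cuda", "OpenVino"}


@dataclass
class ProviderDecision:
    requested: str
    active: str
    fallback_reason: Optional[str] = None


def resolve_provider(requested, cpu_fallback_enabled, init_error=None):
    """Model the fallback contract.

    init_error is None when the requested provider initialized cleanly,
    otherwise a short description of the failure.
    """
    if init_error is None:
        return ProviderDecision(requested, requested)
    if requested in HARDWARE_PROVIDERS and cpu_fallback_enabled:
        # All three conditions hold, so CPU embeddings take over.
        return ProviderDecision(requested, "Cpu", init_error)
    # Fallback disabled (or not applicable): surface the failure.
    raise RuntimeError(f"{requested} failed to initialize: {init_error}")
```

For example, resolve_provider("Cuda", True, "cudnn64_9.dll missing") yields an active provider of Cpu with the reason preserved, while the same call with cpu_fallback_enabled set to False raises instead.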
3. Runtime Settings
These keys live under MemorySmith:SemanticSearch:
{
  "EmbeddingsEnabled": true,
  "PrewarmOnStartupEnabled": true,
  "ExecutionProvider": "Cpu",
  "CpuFallbackEnabled": true,
  "CudaDeviceId": 0,
  "OpenVinoDeviceId": ""
}
Notes:
- ExecutionProvider accepts Cpu, Cuda, or OpenVino.
- PrewarmOnStartupEnabled defaults to true and kicks off a background query/document probe after app startup so the first real semantic/code-search request does not pay the full lazy ONNX initialization cost.
- CudaDeviceId selects the GPU when using CUDA.
- OpenVinoDeviceId is optional; blank lets ONNX Runtime pick the default OpenVINO target.
- Leave CpuFallbackEnabled=true unless you explicitly want startup failure when the requested hardware provider is unavailable.
4. Source-Run Setup
For a local source run, keep using the normal app profiles and add the ONNX flavor as an MSBuild property.
CPU default:
dotnet run --project MemorySmith.App --launch-profile https
CUDA build:
dotnet run --project MemorySmith.App --launch-profile https -p:MemorySmithOnnxRuntimeFlavor=Cuda
If CUDA still falls back after installing cuDNN, prepend the cuDNN runtime directory before starting the app:
$env:PATH = 'C:\Program Files\NVIDIA\CUDNN\v9.22\bin\12.9\x64;' + $env:PATH
dotnet run --project MemorySmith.App --launch-profile https -p:MemorySmithOnnxRuntimeFlavor=Cuda
For a persistent machine-level fix on Windows, add the same directory to the machine PATH, then restart the terminal, the app, and any installed MemorySmith service.
OpenVINO build:
dotnet run --project MemorySmith.App --launch-profile https -p:MemorySmithOnnxRuntimeFlavor=OpenVino
For source-run overrides, place appsettings.LocalOverrides.json beside the running app output or set MemorySmith:SettingsOverridePath to an explicit file. A typical local override looks like this:
{
  "MemorySmith": {
    "SemanticSearch": {
      "EmbeddingsEnabled": true,
      "PrewarmOnStartupEnabled": true,
      "ExecutionProvider": "Cuda",
      "CpuFallbackEnabled": true,
      "CudaDeviceId": 0,
      "OpenVinoDeviceId": ""
    }
  }
}
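If you switch providers often, generating the override file is less error-prone than hand-editing it. This is a minimal sketch under stated assumptions: build_semantic_override is a hypothetical helper, not a MemorySmith tool, and it only emits the keys documented in section 3.

```python
import json

VALID_PROVIDERS = ("Cpu", "Cuda", "OpenVino")


def build_semantic_override(provider="Cpu", cpu_fallback=True,
                            cuda_device_id=0, openvino_device_id=""):
    """Build the override document shown above as a plain dict."""
    if provider not in VALID_PROVIDERS:
        raise ValueError(f"ExecutionProvider must be one of {VALID_PROVIDERS}")
    return {
        "MemorySmith": {
            "SemanticSearch": {
                "EmbeddingsEnabled": True,
                "PrewarmOnStartupEnabled": True,
                "ExecutionProvider": provider,
                "CpuFallbackEnabled": cpu_fallback,
                "CudaDeviceId": cuda_device_id,
                "OpenVinoDeviceId": openvino_device_id,
            }
        }
    }


if __name__ == "__main__":
    # Print the JSON; redirect it to appsettings.LocalOverrides.json as needed.
    print(json.dumps(build_semantic_override("Cuda"), indent=2))
```

Rejecting unknown provider names up front catches typos such as "CUDA" or "Gpu" before the app ever reads the file.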
5. Windows Service Setup
Use Scripts/Redeploy-MemorySmithService.ps1. It now owns all of these in one place:
- publish flavor selection through -OnnxRuntimeFlavor
- semantic runtime settings through -SemanticExecutionProvider, -CpuFallbackEnabled, -CudaDeviceId, and -OpenVinoDeviceId
- writing the effective appsettings.LocalOverrides.json file for the published service
- passing MemorySmith:SettingsOverridePath to the installed service runtime
Example CUDA-preferred deploy with HTTPS and CPU fallback:
./Scripts/Redeploy-MemorySmithService.ps1 `
  -UseHttps `
  -OnnxRuntimeFlavor Cuda `
  -SemanticExecutionProvider Cuda `
  -CpuFallbackEnabled $true `
  -CudaDeviceId 0
Before redeploying a Windows service, make sure the service host can also resolve cudnn64_9.dll. On this repo's current package set, the practical fix is to add:
C:\Program Files\NVIDIA\CUDNN\v9.22\bin\12.9\x64
to the machine PATH, then restart the service so the new environment is inherited.
Example OpenVINO-preferred deploy with CPU fallback:
./Scripts/Redeploy-MemorySmithService.ps1 `
  -UseHttps `
  -OnnxRuntimeFlavor OpenVino `
  -SemanticExecutionProvider OpenVino `
  -CpuFallbackEnabled $true
If you want to force a hard failure instead of fallback when the provider is unavailable, set -CpuFallbackEnabled $false.
6. Verifying The Active Provider
After startup, open /health.
The semantic-search card now tells you:
- whether embeddings are available
- the active execution provider
- the requested provider when fallback occurred
- the failure reason if the requested provider could not initialize
Typical healthy fallback example:
Embeddings available / active Cpu / Requested Cuda (0) was unavailable; fell back to CPU ...
Typical healthy CUDA example:
Embeddings available / active Cuda (0) / ONNX embedding provider is available via Cuda (0).
7. Build The Repo Code Vector Database
The code-search index is stored under:
Data/Graph/code-search/code-search.db
To warm it through the running server and capture timing, use:
./Scripts/Warm-CodeSearchIndex.ps1 `
  -BaseUrl https://localhost:7090 `
  -SkipCertificateCheck `
  -ForceRebuild `
  -SummaryPath artifacts/browser-validation/code-search-index-summary.json
The script automatically reads MemorySmith.ApiKey from artifacts/MemorySmith.App/appsettings.LocalOverrides.json when that file exists. For a different protected deployment, pass -ApiKey directly or point -SettingsOverridePath at the matching override file.
What the script does:
- calls memorysmith_code_search through the MCP endpoint to force indexing
- waits for the synchronous indexing/search call to finish
- reads memorysmith_code_search_status
- reports elapsed time, indexed file count, indexed chunk count, provider mode, provider status, SQLite path, SQLite size, rough throughput, and the raw build timing breakdown
Operational note:
- Startup prewarm is now enabled by default. On a normal service or app boot, MemorySmith starts a background query/document probe immediately so the first operator-driven semantic request usually sees the warmed session instead of the full lazy initialization hit.
- If you disable PrewarmOnStartupEnabled, the first CUDA-backed request becomes a true warm-up again. That first Warm-CodeSearchIndex.ps1 run after a fresh process start can spend tens of seconds loading the model and preparing the GPU session even when the index itself is fully reused.
- After deploying or restarting a CUDA-preferred app or service with prewarm disabled, run one no-rebuild warm pass first if you want to hide that one-time startup cost from later operators or benchmark runs.
If you are using the HTTP source-run profile, swap https://localhost:7090 for http://localhost:5089.
8. Profiling Guidance
The warm-index script reports two useful timing views:
- ElapsedMilliseconds: total wall-clock time for the rebuild-triggering MCP search call
- BuildDurationMilliseconds: build duration taken from the progress timestamps reported by CodeSearchService
The script summary also now includes TimingBreakdown, which mirrors memorysmith_code_search_status.build.timings:
- providerInitializationMilliseconds
- fileReadMilliseconds
- contentHashMilliseconds
- chunkingMilliseconds
- embeddingMilliseconds
- databaseWriteMilliseconds
- removedDocumentCleanupMilliseconds
- embeddingCallCount
- embeddedChunkCount
- averageEmbeddingMilliseconds
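To see at a glance which phase dominates a build, the per-phase totals can be ranked by their share of the summed time. The sketch below assumes TimingBreakdown is a flat JSON object keyed by the names listed above; that shape is an assumption, and phase_shares is a hypothetical helper. Counters and the per-call average are left out because they are not phase totals.

```python
# Phase totals only; embeddingCallCount, embeddedChunkCount, and
# averageEmbeddingMilliseconds are excluded because they are not durations
# that add up to the build time.
PHASE_KEYS = [
    "providerInitializationMilliseconds",
    "fileReadMilliseconds",
    "contentHashMilliseconds",
    "chunkingMilliseconds",
    "embeddingMilliseconds",
    "databaseWriteMilliseconds",
    "removedDocumentCleanupMilliseconds",
]


def phase_shares(timings):
    """Return (phase, milliseconds, fraction_of_total) sorted by cost."""
    phases = {key: float(timings.get(key, 0)) for key in PHASE_KEYS}
    total = sum(phases.values())
    ranked = sorted(phases.items(), key=lambda kv: kv[1], reverse=True)
    return [(name, ms, ms / total if total else 0.0) for name, ms in ranked]


if __name__ == "__main__":
    # Made-up numbers purely for illustration.
    sample = {"embeddingMilliseconds": 800, "fileReadMilliseconds": 100,
              "databaseWriteMilliseconds": 100, "embeddingCallCount": 4000}
    for name, ms, share in phase_shares(sample):
        print(f"{name:40s} {ms:10.0f} ms {share:6.1%}")
```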
Use -ForceRebuild when you want cold-build numbers. Omit it when you want warm incremental behavior.
When comparing CUDA against CPU fallback, treat the very first CUDA no-rebuild pass as a combined warm-up plus measurement. It includes one-time ONNX provider initialization that does not recur until the process restarts.
Current bottleneck guidance:
- databaseWriteMilliseconds is usually not the dominant cost on this repo because code-search already batches SQLite document rewrites.
- The main cold-build bottleneck is the synchronous per-chunk embedding loop in CodeSearchService, which makes thousands of small embedding calls instead of handing the GPU larger batches.
- A healthy CUDA provider can therefore be slower than CPU fallback for the current repo-sized workload even when /health and memorysmith_code_search_status both report active Cuda (0).
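The difference between per-chunk calls and batched calls is easiest to see in call counts. The following sketch is not CodeSearchService code: embed_batch stands in for whatever function embeds a list of texts in one call, and the batch size of 32 is an arbitrary assumption.

```python
def batched(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


def embed_all(chunks, embed_batch, batch_size=32):
    """Embed every chunk using one embed_batch call per batch.

    embed_batch is a hypothetical callable that takes a list of texts and
    returns one vector per text, in the same order.
    """
    vectors = []
    for batch in batched(chunks, batch_size):
        vectors.extend(embed_batch(batch))
    return vectors
```

With 4,000 chunks, a per-chunk loop issues 4,000 provider calls, while a batch size of 32 issues 125. Fewer, larger calls are what gives a GPU provider room to amortize its per-call overhead.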
For reproducible comparisons:
- Run once with -ForceRebuild.
- Run again without -ForceRebuild.
- Compare provider mode/status plus elapsed/build duration.
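The comparison step can be automated once both runs have written a summary file. This is a minimal sketch that assumes each summary is a JSON object with top-level ElapsedMilliseconds, BuildDurationMilliseconds, and ProviderMode fields; the exact summary layout is an assumption, and compare_runs is a hypothetical helper.

```python
import json


def load_summary(path):
    """Read one summary file written via -SummaryPath."""
    with open(path, "r", encoding="utf-8") as handle:
        return json.load(handle)


def compare_runs(cold, warm):
    """Compare a -ForceRebuild run (cold) against a no-rebuild run (warm)."""
    result = {}
    for key in ("ElapsedMilliseconds", "BuildDurationMilliseconds"):
        c, w = cold.get(key), warm.get(key)
        # Ratio is only meaningful when both values are present and non-zero.
        result[key] = {"cold": c, "warm": w,
                       "cold_over_warm": (c / w) if c and w else None}
    # Timings are only comparable if both runs used the same provider.
    result["same_provider"] = cold.get("ProviderMode") == warm.get("ProviderMode")
    return result
```

If same_provider is False, one of the runs most likely fell back to CPU, and the timing ratios should not be read as a like-for-like comparison.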
9. Expected Data Outputs
After a successful rebuild you should see:
- Data/Graph/code-search/code-search.db
- Data/Graph/embeddings/*.json for memory-record semantic caches
The code-search SQLite database holds indexed code chunks and their embedding payloads. The memory embedding cache remains file-backed JSON because it is keyed by memory id and invalidation hash rather than code chunk rows.
10. Troubleshooting
If CUDA or OpenVINO does not activate:
- Confirm you published the matching MemorySmithOnnxRuntimeFlavor.
- Confirm ExecutionProvider matches the intended hardware provider.
- Run where.exe cudnn64_9.dll and confirm Windows can resolve the cuDNN runtime.
- If cuDNN is installed under C:\Program Files\NVIDIA\CUDNN\..., add the matching bin\...\x64 folder to the machine PATH and restart the app/service.
- Keep CpuFallbackEnabled=true so the app stays available while you diagnose the hardware path.
- Check /health or memorysmith_code_search_status for the provider-specific initialization error.
- Re-run ./Scripts/Warm-CodeSearchIndex.ps1 and inspect ProviderMode, ProviderStatus, and TimingBreakdown in the summary.
If the repo code index does not appear:
- Confirm the app can reach the repository root from MemorySmith:CodeSearch:RepositoryRootPath.
- Confirm MemorySmith:CodeSearch:Enabled=true.
- Run the warm-index script with -ForceRebuild.
- Check the returned IndexPath and Build.LastError values.