Search Quality Deep Dive — External Agent Benchmark (2026-05-29)

Author: Claude Sonnet 4.6 via npx mcp-remote bridge
Scope: Memory search (hybrid, semantic, lexical) and code search (vector + hybrid rerank)
Method: Non-destructive read-only tool calls only; no writes made during testing
Related tasks: TSK-0183, TSK-0185, TSK-0186
Intended slug: audits/search-quality-deepdive-20260529


1. Scope and Methodology

This report covers a focused quality evaluation of MemorySmith's search tooling from the perspective of an external agent. The primary concern driving this evaluation is that the ONNX-based code search is new and unproven in production, and that memory search quality under real agent queries — paraphrased, negative, adversarial — has not been externally characterised.

All findings are based on live tool call results. Scores are raw values from the MCP response envelope. No indirect inferences were made: every claim has a corresponding tool result recorded here.

1.1 Assumptions

1.2 Test Categories

Category | Count | Purpose
Memory positive | 5 | Known record → should rank #1
Memory negative | 3 | Off-domain → should return low-relevance results
Code search positive | 3 | Known symbol/concept → correct file
Code search negative | 1 | Off-domain → obviously low scores
Fuzz / adversarial | 5 | Empty query, XSS + SQL injection, invalid status, Unicode (shared with N3), slug path traversal (listed but not executed)

2. Memory Search — Positive Tests

P1 — Single-Host Architecture

Query: single host file backed architecture constraint refactor Expected #1: project-wiki-active-architecture

Rank | ID | Semantic | Lex rank | RRF | Verdict
1 | project-wiki-active-architecture | 0.827 | 1 | 0.032787 | ✅ Correct
2 | project-wiki-scope-boundaries | 0.826 | 4 | 0.031754 | ✅ Correct
3 | project-wiki-chat-agent-provider | 0.788 | 2 | 0.031281 | ⚠️ Weak FP

Assessment (confidence: 90%): Top two are both valid. #3 is a false positive — project-wiki-chat-agent-provider carries an architecture tag and the word "architecture" in its title, producing lexical rank 2. The semantic component correctly demotes it (0.788 vs 0.827) but the lexical tag match keeps it in the top 3. The architecture tag on a provider-pattern record is questionable — see OQ-P1.


P2 — ONNX Embedding Configuration

Query: ONNX e5 mean pooling WordPiece tokenizer embedding model Expected #1: project-wiki-onnx-semantic-embeddings

Rank | ID | Semantic | Lex rank | RRF | Verdict
1 | project-wiki-onnx-semantic-embeddings | 0.881 | 1 | 0.032787 | ✅ Correct
2 | project-wiki-semantic-search-gap | 0.817 | 3 | 0.032002 | ✅ Relevant
3 | project-wiki-semantic-tool-quality-suite | 0.810 | 6 | 0.031025 | ✅ Relevant

Assessment (confidence: 95%): Perfect result. 0.881 is the highest semantic cosine score observed across all memory search tests. Lexical rank 1 + semantic rank 1 = maximum RRF fusion. The query exactly mirrors the record's vocabulary, confirming the ONNX model handles technical terminology well when vocabulary overlap is high. All top-3 are genuinely related.


P3 — UI / Blazor Architecture

Query: blazor server mudblazor interactive render mode navigation drawer Expected #1: project-wiki-ui-architecture

Rank | ID | Semantic | Lex rank | RRF | Verdict
1 | project-wiki-ui-architecture | 0.847 | 1 | 0.032787 | ✅ Correct
2 | project-wiki-ui-layout-source-link-polish | 0.821 | 3 | 0.032002 | ✅ Correct (#2 describes the 164px drawer)
3 | project-wiki-semantic-ui-current | 0.800 | 2 | 0.031754 | ✅ Adjacent

Assessment (confidence: 92%): All three are genuinely relevant UI records. The semantic scores are well separated (0.847 / 0.821 / 0.800, a spread of 0.047 from first to third). No false positives. The model correctly differentiated three distinct UI records that all share Blazor/MudBlazor vocabulary.


P4 — Test Architecture (Semantic Rescue Case)

Query: NUnit 4 isolated temp directory integration test file backed Expected #1: project-wiki-test-architecture

Rank | ID | Semantic | Lex rank | RRF | Verdict
1 | project-wiki-test-architecture | 0.840 | 4 | 0.032018 | ✅ Correct
2 | project-wiki-test-fixture-overview | 0.789 | 1 | 0.031319 | ⚠️ FP
3 | project-wiki-test-fixture-context-root | 0.789 | 3 | 0.030579 | ⚠️ FP

Assessment (confidence: 87%): This is the clearest demonstration of hybrid search value in the test set. The semantic component ranked project-wiki-test-architecture #1 even though it was only lexical rank 4. The test fixtures outscored it lexically because they contain "test", "integration", "temp", and "directory" as content terms, but they describe fixture data, not the testing architecture. The semantic ranker matched the intent, and RRF lifted the right record from lexical position 4 to fused position 1. Pure lexical search would have returned the wrong top result here.
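
The fused scores in these tables are consistent with standard reciprocal rank fusion using a constant of k = 60. That constant is an inference from the reported numbers, not something confirmed against the MemorySmith source, so the sketch below is only an illustration of how the P4 score could arise:

```python
def rrf_score(ranks, k=60):
    """Sum 1 / (k + rank) over the rankers that returned the record.

    `ranks` holds 1-based ranks; None means that ranker did not return it.
    k=60 is an assumption inferred from the scores reported in this audit.
    """
    return sum(1.0 / (k + rank) for rank in ranks if rank is not None)


p4_fused = rrf_score([1, 4])           # semantic rank 1, lexical rank 4 (P4)
dual_top = rrf_score([1, 1])           # rank 1 in both modes (P1, P2, P3, P5)
semantic_only = rrf_score([1, None])   # "lexical rank none" (N1, N3)

print(round(p4_fused, 6), round(dual_top, 6), round(semantic_only, 6))
# prints: 0.032018 0.032787 0.016393
```

Under this assumption, a record that is first in only one ranker can never exceed 1/61 ≈ 0.0164, which matches the ceiling seen in the off-domain tests.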


P5 — Maintenance Proposals (Natural Language)

Query: what triggers a maintenance proposal and how are they reviewed Expected #1: project-wiki-maintenance-proposals-current

Rank | ID | Semantic | Lex rank | RRF | Verdict
1 | project-wiki-maintenance-proposals-current | 0.852 | 1 | 0.032787 | ✅ Correct
2 | project-wiki-maintenance-observability-refinements | 0.805 | 2 | 0.032002 | ✅ Relevant
3 | project-wiki-configuration-settings-current | 0.783 | 4 | 0.031010 | ⚠️ Weak adjacent

Assessment (confidence: 94%): Strong natural language query result. The phrasing ("what triggers... how are they reviewed") correctly mapped to the proposals record despite not using exact terms like "FileMaintenanceProposalStore" or "proposal JSON". Dual-mode agreement (lexical rank 1 and semantic rank 1) produces maximum RRF. #3 is weakly adjacent — it appears because "maintenance-agent scheduling" appears in the configuration snippet, but it is not about proposals.


3. Memory Search — Negative Tests

N1 — Kubernetes / Docker (Fully Off-Domain)

Query: kubernetes pod autoscaling docker container network overlay

Rank | ID | Semantic | Lex rank | RRF | Verdict
1 | project-wiki-chat-agent-provider | 0.770 | none | 0.016393 | ⚠️ FP
2 | project-wiki-mcp-integration | 0.764 | none | 0.016129 | ⚠️ FP
3 | project-wiki-test-architecture | 0.762 | none | 0.015873 | ⚠️ FP

Assessment (confidence: 88%): All five results returned (the top three are shown above) have "lexical rank none": zero lexical evidence, semantic-only. RRF scores (0.015–0.016) are approximately half those of true positive queries (0.031–0.033). Semantic scores (0.762–0.770) sit at the noise floor, below the ~0.78 level seen for meaningful queries.

The system returns results with no signal that they are low-quality. An agent seeing RRF 0.016 needs to independently know this is half a good result's score. "lexical rank none" in matchReason is the only machine-readable quality signal. There is no minScore filter and no explicit "no results above threshold" response.

The double gap: RRF scores are ~2× lower and semantic scores ~7% lower than for true positives. Both signals degrade for off-domain queries, though by very different margins, so agents monitoring both get a clearer combined signal than either provides alone.
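
Until a minScore filter exists, an agent can apply the one machine-readable signal itself. The sketch below assumes each result is a dict carrying a `matchReason` string that contains the literal text "lexical rank none" for semantic-only hits; the dict shape and the sample reason strings are hypothetical, only the marker text comes from the responses observed here.

```python
def is_semantic_only(match_reason: str) -> bool:
    """True when the reason string reports no lexical rank for the record."""
    return "lexical rank none" in match_reason.lower()


def split_by_lexical_evidence(results):
    """Partition results into (lexically supported, semantic-only) lists."""
    supported, semantic_only = [], []
    for result in results:
        bucket = semantic_only if is_semantic_only(result["matchReason"]) else supported
        bucket.append(result)
    return supported, semantic_only


# Hypothetical envelopes: ids come from N1 and P1, reason strings are invented.
results = [
    {"id": "project-wiki-chat-agent-provider", "matchReason": "semantic 0.770; lexical rank none"},
    {"id": "project-wiki-active-architecture", "matchReason": "semantic 0.827; lexical rank 1"},
]
supported, semantic_only = split_by_lexical_evidence(results)
print([r["id"] for r in semantic_only])
```

Keeping the semantic-only bucket instead of discarding it lets the caller decide whether a weak hit is still better than nothing.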


N2 — PostgreSQL / Entity Framework (Constraint-Content Contamination)

Query: postgresql database schema migration entity framework

Rank | ID | Semantic | Lex rank | RRF | Verdict
1 | project-wiki-scope-boundaries | 0.772 | 9 | 0.030622 | ⚠️ Ambiguous
2 | project-wiki-admin-configuration-surface | 0.771 | 6 | 0.030536 | ⚠️ Ambiguous
3 | project-wiki-semantic-search-gap | 0.760 | 2 | 0.030018 | ⚠️ Ambiguous

Assessment (confidence: 75%): This is the most important negative test. These records rank highly because they explicitly mention "PostgreSQL" — but in a constraint context ("Do not add PostgreSQL", "migration path that does not force PostgreSQL"). The lexical match is technically accurate; the records ARE about PostgreSQL in the sense that they discuss it as out-of-scope.

The RRF scores (0.030) are close to true positive scores (0.032–0.033) — only a 6% gap. An agent receiving these results could reasonably conclude PostgreSQL is relevant to this codebase, when the actual answer is the opposite. This is the constraint-content contamination problem: records that document exclusions rank for queries about the excluded technology.

A structured excludes[] field or stance:excluded:postgresql tag would allow agents to filter these correctly without reading full content.
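
A rough sketch of how an agent might use such a tag, if it existed. The `stance:excluded:<subject>` convention is the proposal from this section, not an existing MemorySmith feature, and the `tags` field and sample tag values are hypothetical.

```python
STANCE_PREFIX = "stance:excluded:"  # proposed convention, not an existing feature


def excluded_subjects(tags):
    """Collect the subjects a record declares as explicitly out of scope."""
    return {
        tag[len(STANCE_PREFIX):].lower()
        for tag in tags
        if tag.lower().startswith(STANCE_PREFIX)
    }


def drop_exclusion_records(results, query_terms):
    """Remove records whose excluded subjects overlap the query terms."""
    wanted = {term.lower() for term in query_terms}
    return [r for r in results if not (excluded_subjects(r.get("tags", [])) & wanted)]


# Hypothetical tag sets; the real records' tags were not inspected.
results = [
    {"id": "project-wiki-scope-boundaries", "tags": ["scope", "stance:excluded:postgresql"]},
    {"id": "project-wiki-semantic-search-gap", "tags": ["search"]},
]
kept = drop_exclusion_records(results, ["postgresql", "database", "schema"])
print([r["id"] for r in kept])
```

An agent asking "is PostgreSQL excluded?" would invert the filter and keep only the tagged records, so the same tag serves both directions.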


N3 — Japanese Unicode

Query: 日本語テスト 検索クエリ

Rank | ID | Semantic | Lex rank | RRF | Verdict
1 | memory-system-rfc-council-review-20260520 | 0.741 | none | 0.016393 | ✅ Graceful
2 | project-wiki-agent-instructions-source-of-truth | 0.740 | none | 0.016129 | ✅ Graceful

Assessment (confidence: 95%): No crash, no exception, clean JSON response. Semantic scores of 0.740–0.741 are the lowest observed in any memory search test. "lexical rank none" on both results confirms zero lexical term matches, which is expected for a Japanese query against records that appear to be written in English. On the semantic side, e5-base-v2 is an English-trained model; it still embeds the CJK input, but the output is near noise relative to English content, presumably because its WordPiece vocabulary covers these characters poorly. The 0.740 values are therefore a reasonable estimate of the practical cosine floor on this corpus, not a proven minimum for the model. ✅ Completely safe, correct graceful degradation.


4. Adversarial / Fuzz Tests

F1 — Empty Query

Input: query="" (empty string passed explicitly)

Result: 3 results, all with explicit annotations: "Lexical score 0: No query supplied; returned by recency." and "Semantic score 0: No query supplied; returned by recency."

Assessment (confidence: 99%): ✅ Perfect. Explicit null-score annotation, recency ordering, no exception. The matchReason is machine-parseable. This is the best-handled edge case in the test set.


F2 — XSS + SQL Injection Combined

Input: <script>alert('xss')</script> SELECT * FROM memories WHERE 1=1; --

Result: 2 results returned. The markup punctuation around <script> was discarded by the Lucene tokenizer. SELECT, WHERE, and FROM contributed nothing to any match reason, which is consistent with stop-word handling (the analyzer's stop-word list was not inspected). memories matched lexically (score 3: "lexical content: memories"); the word appears in records describing Data/Memories/. alert matched nothing. No injection behavior.

Assessment (confidence: 99%): ✅ Safe. The StandardAnalyzer tokenises the input as plain text: HTML markup is discarded and the SQL keywords have no effect on ranking. The word "memories" in SELECT * FROM memories is a genuine content term that correctly returned project-wiki-data-folder-policy (which describes the Data/Memories/ path). No security risk observed.


F3 — Invalid Status Value

Input: status="Nonexistent" (not a valid memory status name)

Result: Results returned as if no status filter was applied. Records with status 1 and 2 both appeared. No error, no warning, no empty response, no validation message.

Assessment (confidence: 90%): ⚠️ Silent failure. An invalid status value is silently ignored rather than returning a validation error or empty result. An agent passing status="Active" (incorrect name, common from outdated docs or assumptions) would receive all records without any indication the filter had no effect. This is a correctness risk for agent workflows that assume the filter was applied.

Expected behaviour: Either return {"error": "unknown status 'Nonexistent'. Valid values: Unconsolidated, Working, Core, Deprecated"} or return zero results. Currently returns everything.
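
The first option could look roughly like the following. This is a sketch of the expected contract only, not MemorySmith's implementation; the function name is hypothetical and exact, case-sensitive matching is an assumption.

```python
VALID_STATUSES = ("Unconsolidated", "Working", "Core", "Deprecated")


def validate_status(status):
    """Return None when the filter is usable, otherwise an error payload.

    Hypothetical helper: None means "no filter requested" and is accepted.
    Matching is case-sensitive here, which is an assumption.
    """
    if status is None or status in VALID_STATUSES:
        return None
    valid = ", ".join(VALID_STATUSES)
    return {"error": f"unknown status '{status}'. Valid values: {valid}"}


print(validate_status("Nonexistent"))
print(validate_status("Core"))
```

Returning an explicit error is preferable to an empty result, because an empty result is indistinguishable from a valid filter that simply matched nothing.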


F4 — Japanese Unicode

Covered in N3 above. ✅ Safe and graceful.


F5 — Slug Path Traversal

Input: slug="../../../etc/passwd" — not tested to avoid potential risk on a production instance.

Open question OQ-F1: Does page_get sanitize slug input against path traversal? The slug is used to construct a file path under Data/Pages/. Given that MemorySmith is a local-first app with AllowRemoteApi=false by default, the immediate risk is low. But an explicit test confirming sanitization is worth adding to the security test suite, particularly since AllowRemoteApi=true is supported.
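
For reference, the kind of containment check such a test would be looking for can be sketched as below. Everything here is illustrative: the function name, the `.md` extension, and the use of Python are assumptions, and the actual page_get implementation was not inspected.

```python
from pathlib import Path


def resolve_page_path(base_dir, slug):
    """Map a slug to a file under base_dir, rejecting anything that escapes it.

    Hypothetical helper; the ".md" extension is an assumption.
    """
    base = Path(base_dir).resolve()
    candidate = (base / f"{slug}.md").resolve()
    # After resolving ".." segments, base must still be an ancestor of the target.
    if base not in candidate.parents:
        raise ValueError(f"slug escapes the pages directory: {slug!r}")
    return candidate


for slug in ("audits/search-quality-deepdive-20260529", "../../../etc/passwd"):
    try:
        print("ok:", resolve_page_path("Data/Pages", slug).name)
    except ValueError as exc:
        print("rejected:", exc)
```

The check runs on the resolved path, not on the raw slug text, so it also rejects absolute slugs and traversal sequences hidden in the middle of an otherwise valid slug.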


5. Code Search — System State

At test time:
- Files indexed: 171
- Chunks: 1912 (240 new on last build, 1672 reused from prior builds)
- Provider: ONNX CPU, e5-base-v2
- Last build: 2026-05-29T04:29:28Z (35.6 seconds total, 35.0s embedding, avg 991ms/call across 36 calls)
- Build state: idle (no rebuild triggered during testing per rebuildIfStale=false)

The matchReason for code search uses three components: "Code embedding cosine similarity X.XXX, lexical evidence Y.YYY, token coverage weight Z.ZZZ (hybrid rerank)."

The lexical evidence and token coverage weight implement the no-lexical-evidence penalty described in project-wiki-code-search-relevance-suite.
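
Because the three components are only exposed inside the matchReason string, an agent that wants to threshold on them has to parse the text. A minimal sketch, assuming the wording stays exactly as quoted above; the token coverage weight in the sample string is a placeholder, since that value is not reported in this audit.

```python
import re

NUMBER = r"\d+(?:\.\d+)?"
MATCH_REASON = re.compile(
    rf"Code embedding cosine similarity (?P<cosine>{NUMBER}), "
    rf"lexical evidence (?P<lexical>{NUMBER}), "
    rf"token coverage weight (?P<coverage>{NUMBER})"
)


def parse_code_match_reason(text):
    """Extract the three numeric components, or None if the format differs."""
    match = MATCH_REASON.search(text)
    if match is None:
        return None
    return {name: float(value) for name, value in match.groupdict().items()}


# Cosine and lexical values are from CP2 rank 1; the coverage weight is made up.
sample = (
    "Code embedding cosine similarity 0.876, lexical evidence 17.650, "
    "token coverage weight 1.000 (hybrid rerank)."
)
print(parse_code_match_reason(sample))
```

Returning None on a format mismatch lets the caller fall back to the top-level score instead of failing when the wording changes.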


6. Code Search — Positive Tests

CP1 — Hybrid Search RRF Implementation

Query: reciprocal rank fusion hybrid search RRF

Rank | File | Lines | Score | Embedding | Lex ev. | Verdict
1 | ChatToolCatalog.cs | 161–200 | 0.803817 | 0.810 | 5.649 | ✅ MCP tool registration for hybrid search
2 | ChatToolCatalog.cs | 129–168 | 0.792829 | 0.799 | 5.458 | ⚠️ Same file, lexical search chunk
3 | MemoryApplicationService.cs | 1153–1192 | 0.740287 | 0.825 | 4.650 | ✅ BuildHybridMatchReason method
4 | MemoryApplicationService.cs | 897–936 | 0.735951 | 0.819 | 4.608 | ✅ Hybrid search algorithm body
5 | SemanticToolQualityTests.cs | 33–72 | 0.695183 | 0.826 | 6.558 | ✅ Quality probes: HybridProbes[] array

Assessment (confidence: 85%): Results 1, 3, 4, 5 are all correct and useful. Result 2 is a false positive: ChatToolCatalog.cs L129–168 is the lexical search handler chunk in the same file. It ranked #2 because it shares the file with the correct result (same token coverage weight boost) and has similar lexical evidence (5.458 vs 5.649). MaxResultsPerDocument did not prevent consecutive same-file chunks.

Critically, MemoryApplicationService.cs L1153–1192 (BuildHybridMatchReason) has a higher embedding cosine (0.825) than the #2 result (0.799) but ranks lower because its lexical evidence is weaker (4.650 vs 5.458). The most precisely relevant result is out-ranked by a less relevant one due to 0.8 points of lexical evidence. See OQ-CP1.


CP2 — SemanticEmbeddingSearchService

Query: SemanticEmbeddingSearchService cosine similarity persisted embeddings

Rank | File | Lines | Score | Embedding | Lex ev. | Verdict
1 | SemanticEmbeddingSearchService.cs | 257–296 | 0.941927 | 0.876 | 17.650 | ✅ Result-building with cosine score
2 | SemanticEmbeddingSearchService.cs | 225–264 | 0.935577 | 0.867 | 17.878 | ✅ Query embedding + search loop
3 | MemoryApplicationService.cs | 33–72 | 0.833582 | 0.831 | 10.106 | ✅ Synonym expansion dictionary
4 | MemoryApplicationService.cs | 1–40 | 0.776285 | 0.826 | 8.549 | ✅ Service header / using directives
5 | SemanticEmbeddingPrewarmService.cs | 1–40 | 0.749011 | 0.834 | 10.696 | ✅ Background prewarm service

Assessment (confidence: 96%): Highest scores in any code search test. 0.941927 is consistent with an exact class name match with strong semantic alignment. All five results are genuinely relevant — the synonym expansion dictionary (result 3) is a useful bonus discovery.

Again, two consecutive chunks from SemanticEmbeddingSearchService.cs at ranks 1 and 2 — in this case both are correct, so it's not harmful. But the MaxResultsPerDocument diversification is confirmed not to spread results.

Hidden discovery from result 3: MemoryApplicationService.cs L33–72 contains the synonym expansion dictionary:

"semantic" → ["meaning", "concept", "conceptual", "embedding", "embeddings", "vector", "similarity"]
"embedding" → ["semantic", "vector", "similarity"]
"search"    → ["find", "query", "lookup", "retrieval"]

This undocumented synonym expansion silently boosts recall for natural-language queries. Depending on the direction in which the expansion is applied and on how word forms are normalised (neither was verified here), a query for "how does the system find similar meanings" could lexically match records containing "semantic", "embedding", or "retrieval". This is a significant undocumented capability worth surfacing in tool descriptions.


CP3 — Authorization Policy (Precision Ranking Issue)

Query: CanReadSourceBundle authorization permission check

Rank | File | Lines | Score | Embedding | Lex ev. | Verdict
1 | Program.cs | 289–328 | 0.840806 | 0.808 | 8.096 | ⚠️ Auth failure recording, adjacent
2 | SecurityServices.cs | 1–40 | 0.796868 | 0.827 | 6.750 | ⚠️ Service header, indirect
3 | McpController.cs | 161–200 | 0.794423 | 0.825 | 6.700 | ⚠️ Tool filtering, adjacent
4 | SourceLinksController.cs | 1–26 | 0.790391 | 0.807 | 7.881 | ✅ Exact: [Authorize(Policy = MemorySmithPolicies.CanReadSourceBundl...
5 | SecurityServices.cs | 193–232 | 0.742297 | 0.837 | 5.417 | ✅ Anonymous actor context

Assessment (confidence: 80%): The most precisely relevant file (SourceLinksController.cs) ranks #4. The snippet visible in the result shows [Authorize(Policy = MemorySmithPolicies.CanReadSourceBundl... — this is exactly where the policy is applied. It ranks below three files that discuss auth broadly.

Root cause: Program.cs (#1) has 8.096 lexical evidence vs 7.881 for SourceLinksController.cs (#4), a difference of about 0.2. Program.cs is a larger file with more auth-related content, generating higher lexical evidence even though the specific policy is not applied there. The embedding cosines are within 0.001 of each other (0.808 vs 0.807), too close to discriminate, so the 0.2 lexical advantage of the larger file decides the order. Ranks 2 and 3 have less lexical evidence than the controller (6.750 and 6.700) but higher embedding cosines (0.827 and 0.825), which appears to be enough to keep them ahead of it as well.

This is the precision gap for short-file / attribute-declaration queries: a file where the answer is a single [Authorize] attribute on a class can be outscored by larger files that discuss the same concept at length. A title-or-symbol-boost (giving extra weight to matches where the query term appears in a class/method declaration) might address this.


7. Code Search — Negative Test

CN1 — Kubernetes / Docker

Query: kubernetes pod autoscaling docker container

Rank | File | Lines | Score | Embedding | Lex ev. | Verdict
1 | memorysmith.js | 513–552 | 0.477389 | 0.765 | 1.346 | ⚠️ DOM container match
2 | memorysmith.js | 481–520 | 0.477069 | 0.770 | 1.232 | ⚠️ Same file, adjacent chunk
3 | PageService.cs | 705–744 | 0.460395 | 0.740 | 1.232 | ⚠️ ContainerInline Markdig class
4 | PageService.cs | 737–776 | 0.459997 | 0.751 | 1.000 | ⚠️ Same file, adjacent
5 | MainLayout.razor | 33–72 | 0.452564 | 0.733 | 1.100 | ⚠️ MudContainer / layout

Assessment (confidence: 93%): ✅ The no-lexical-evidence penalty is clearly working. Scores of 0.45–0.48 versus 0.80–0.94 for true positives is a near-2× gap — sufficient for an agent to recognise these as noise. Lexical evidence of 1.0–1.3 (vs 5–17 for true positives) is the key differentiator.

The residual false positives are caused by the DOM/HTML meaning of "container" — document.createElement("section") creates a container, ContainerInline is a Markdig class, MudContainer is a MudBlazor layout component. These are genuine but contextually wrong lexical matches. The embedding cosine (0.733–0.770) is at the lower end of observed values, reinforcing low relevance.

Important: Both memorysmith.js chunks appear consecutively (#1 and #2), and both PageService.cs chunks appear consecutively (#3 and #4). Same-file diversification is not spreading results for this query either — confirming the MaxResultsPerDocument pattern is consistent across both positive and negative tests.

Machine-readable threshold confirmation: Lexical evidence < 2.0 separated noise-tier code search results from true positives in every test conducted. The clean gap between true positives (≥4.6) and negatives (≤1.4) suggests the threshold is usable, though it rests on four queries and should be re-validated on a larger query set.


8. Cross-Cutting Analysis

8.1 Memory Search Score Reference

Query type | RRF range | Semantic range | "lexical rank none"?
Strong positive, dual-mode | 0.031–0.033 | 0.827–0.881 | No
Strong positive, semantic-led | 0.030–0.032 | 0.789–0.852 | No
Weak positive / adjacent | 0.031 | 0.783–0.800 | Partial
Off-domain negative | 0.015–0.016 | 0.762–0.770 | Yes
Unicode / garbage | 0.016 | 0.740–0.741 | Yes

Practical thresholds (confidence: 70%):
- RRF > 0.030 and semantic > 0.820: high confidence
- RRF 0.020–0.030: moderate; check semantic component
- RRF < 0.020 or "lexical rank none": low confidence; treat with suspicion

Confidence 70% because corpus is ~55 records; thresholds will compress as corpus grows.
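
As a rough illustration, the memory search thresholds can be turned into a small agent-side classifier. The function name is hypothetical, and one case the thresholds leave open (RRF above 0.030 with semantic at or below 0.820) is treated as "moderate" here, which is an assumption.

```python
def memory_confidence(rrf, semantic, lexical_rank_none):
    """Bucket a memory search hit using the practical thresholds above.

    Assumption: RRF > 0.030 with semantic <= 0.820 falls back to "moderate",
    a case the thresholds do not spell out.
    """
    if lexical_rank_none or rrf < 0.020:
        return "low"
    if rrf > 0.030 and semantic > 0.820:
        return "high"
    return "moderate"


# Top-ranked rows from P2, N1 and N2 respectively.
print(memory_confidence(0.032787, 0.881, False))  # high
print(memory_confidence(0.016393, 0.770, True))   # low
print(memory_confidence(0.030622, 0.772, False))  # moderate
```

The N2 row lands in "moderate", not "low", which shows that score thresholds alone do not catch constraint-content contamination.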

8.2 Code Search Score Reference

Query type | Score range | Lex evidence range | Interpretation
Exact class name | 0.93–0.94 | 15–18 | Definitive
Strong conceptual | 0.79–0.84 | 5–10 | Trust
Adjacent / same-file FP | 0.77–0.80 | 4–6 | Verify snippet
Off-domain | 0.45–0.48 | 1.0–1.4 | Noise

Practical thresholds (confidence: 75%):
- Score > 0.80: strong match
- Score 0.70–0.80: relevant; verify by reading snippet
- Score < 0.70: adjacent context only
- Lexical evidence < 2.0: noise regardless of embedding cosine
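
A minimal sketch of these code search thresholds as an agent-side check. The function name and bucket labels are hypothetical; the lexical evidence rule is applied first because it overrides the score.

```python
def code_confidence(score, lexical_evidence):
    """Bucket a code search hit using the practical thresholds above."""
    if lexical_evidence < 2.0:
        return "noise"            # applies regardless of embedding cosine
    if score > 0.80:
        return "strong"
    if score >= 0.70:
        return "relevant-verify"  # read the snippet before trusting it
    return "adjacent"


# Rows taken from CP2 rank 1, CP1 rank 3, CP1 rank 5 and CN1 rank 1.
print(code_confidence(0.941927, 17.650))  # strong
print(code_confidence(0.740287, 4.650))   # relevant-verify
print(code_confidence(0.695183, 6.558))   # adjacent
print(code_confidence(0.477389, 1.346))   # noise
```

CP1 rank 5 falls into "adjacent" even though it was judged correct in the test, so these buckets are best read as a prompt to inspect the snippet, not as a verdict.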

8.3 MaxResultsPerDocument Diversification Gap

Two consecutive chunks from the same file appeared in the top results of three of the four code search result sets (CP1, CP2, CN1). CP3 also returned two chunks from one file (SecurityServices.cs), but at ranks 2 and 5. The MaxResultsPerDocument guard caps total results per file but does not spread them through the ranking.

For a limit=5 query, showing 2/5 results (40%) from the same file is excessive for most discovery use cases. A rank-spreading rule — "no two results from the same file within N positions of each other" — would improve result diversity without reducing total file coverage.
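
One way such a rank-spreading rule could work is a greedy pass over the already-ranked list. The sketch below is an illustration of the idea, not a description of MemorySmith's reranker; the function name and the `min_gap` parameter are hypothetical. It never drops a result: when only same-file candidates remain, it falls back to the best remaining one.

```python
def spread_by_file(results, min_gap=2):
    """Reorder ranked results so same-file chunks sit at least min_gap apart.

    At each position, take the best remaining result whose file does not
    appear in the previous (min_gap - 1) positions; if none qualifies,
    take the best remaining result anyway.
    """
    remaining = list(results)
    spread = []
    while remaining:
        window = spread[-(min_gap - 1):] if min_gap > 1 else []
        recent_files = {r["file"] for r in window}
        pick = next((r for r in remaining if r["file"] not in recent_files), remaining[0])
        remaining.remove(pick)
        spread.append(pick)
    return spread


# File order as returned for CP1.
cp1 = [
    {"rank": 1, "file": "ChatToolCatalog.cs"},
    {"rank": 2, "file": "ChatToolCatalog.cs"},
    {"rank": 3, "file": "MemoryApplicationService.cs"},
    {"rank": 4, "file": "MemoryApplicationService.cs"},
    {"rank": 5, "file": "SemanticToolQualityTests.cs"},
]
print([r["rank"] for r in spread_by_file(cp1)])
# prints: [1, 3, 2, 4, 5]
```

With `min_gap=2` the top result is unchanged and the second position surfaces a different file, at the cost of demoting the original #2 by one place.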

8.4 Synonym Expansion — Undocumented Power Feature

The code search test exposed MemoryApplicationService.cs L33–72, which contains the synonym expansion dictionary used by both memory search and code search lexical scoring:

"semantic"   → ["meaning", "concept", "conceptual", "embedding", "embeddings", "vector", "similarity"]
"embedding"  → ["semantic", "vector", "similarity"]
"search"     → ["find", "query", "lookup", "retrieval"]
"hybrid"     → ["combined", "fusion", "mixed", "rrf"]
"memory"     → ["record", "knowledge", "wiki", "kb"]

This expansion means queries using natural language synonyms can receive lexical boosts for the technical terms. Depending on the direction in which the expansion is applied and on how word forms are normalised (neither was verified here), a query for "how does the system find similar meanings" could match records containing "semantic", "embedding", or "retrieval". This is a significant undocumented capability. Surfacing it in tool descriptions and the search guide would enable agents to write more effective queries.
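
To make the direction question concrete, the sketch below copies the five entries above into a dictionary and compares forward expansion (key term to synonyms) with a reverse lookup (synonym to key terms). How MemorySmith actually applies the dictionary was not inspected, so both functions are hypothetical illustrations.

```python
SYNONYMS = {
    "semantic": ["meaning", "concept", "conceptual", "embedding", "embeddings", "vector", "similarity"],
    "embedding": ["semantic", "vector", "similarity"],
    "search": ["find", "query", "lookup", "retrieval"],
    "hybrid": ["combined", "fusion", "mixed", "rrf"],
    "memory": ["record", "knowledge", "wiki", "kb"],
}


def expand_forward(terms):
    """Add the listed synonyms for every query term that is a dictionary key."""
    expanded = set(terms)
    for term in terms:
        expanded.update(SYNONYMS.get(term, []))
    return expanded


def build_reverse(synonyms):
    """Index each synonym back to the key terms that list it."""
    reverse = {}
    for key, values in synonyms.items():
        for value in values:
            reverse.setdefault(value, set()).add(key)
    return reverse


query = {"find", "similar", "meanings"}
reverse = build_reverse(SYNONYMS)
print(expand_forward(query) == query)   # True: none of the query words is a key
print(reverse.get("find"))              # {'search'}
print(reverse.get("meanings"))          # None: only the singular form is listed
```

In this toy version the natural-language query only benefits if lookups run in the reverse direction and word forms are normalised first, which is worth confirming before documenting the feature.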

8.5 Constraint-Content Contamination

The PostgreSQL negative test (N2) revealed that records explicitly excluding a technology can rank highly for queries about that technology. Three records (project-wiki-scope-boundaries, project-wiki-admin-configuration-surface, project-wiki-semantic-search-gap) mention PostgreSQL in constraint context, producing RRF scores of 0.030 — only 6% below strong true positives.

There is currently no structured way for a record to express its stance toward a concept. An excludes[] array or a stance:excluded:postgresql structured tag would allow agents to filter these correctly without needing to read the full content. Given that MemorySmith's architecture records frequently document what is NOT in scope, this pattern will recur for any technology in the "explicitly rejected" category.


9. Summary

What Is Working Well

Finding | Confidence
Memory positive recall: 5/5 correct records ranked #1 | 92%
Hybrid search semantic rescue (P4: semantic overrode lower lexical position) | 90%
Off-domain negative: RRF ~2× lower, semantic scores at measurable noise floor | 88%
Code search exact class/method lookup: scores 0.93–0.94 | 96%
No-lexical-evidence penalty in code search: off-domain scores near 0.45 vs 0.80–0.94 | 93%
Adversarial inputs (XSS, SQL injection, Unicode, empty) handled cleanly | 99%
ONNX provider active, healthy, CPU-bound, 768-dim e5-base-v2 | 99%
Japanese/Unicode: graceful noise-floor degradation, no crash | 95%

Issues Found

Issue | Severity | Confidence
Invalid status value silently ignored: no error, no empty result, no warning | Medium | 90%
MaxResultsPerDocument does not spread same-file results through ranking | Medium | 88%
Code search CP3: exact policy location ranks #4 behind broader auth files | Medium | 80%
No minScore threshold: off-domain queries return noise-floor results without signal | Medium | 85%
Constraint-content contamination: "Do not use X" records rank for queries about X | Low–Medium | 75%
"lexical rank none" is the only machine-readable quality signal for weak results | Low | 88%
Memory P1: project-wiki-chat-agent-provider false positive via architecture tag | Low | 90%
Synonym expansion dictionary is undocumented | Low | 95%
page_save now in DisabledTools and returns a proper error (improvement over prior silent hang) | Info | 99%

10. Open Questions

OQ-P1 (from P1): Is the architecture tag on project-wiki-chat-agent-provider warranted, given that it is a provider-pattern record?
OQ-F1 (from F5): Does page_get sanitize slug input against path traversal?
OQ-CP1 (from CP1): Should a 0.8-point lexical evidence advantage be allowed to outrank a chunk with a higher embedding cosine?


Written by Claude Sonnet 4.6 via npx mcp-remote bridge | 2026-05-29
All scores are raw values from live MCP tool responses. Non-destructive: no KB mutations during testing.
Note: page_save was disabled by MemorySmith:Mcp:DisabledTools at time of publishing, so this file requires manual upload to the wiki at slug audits/search-quality-deepdive-20260529.