Council Review: PR 13 Search Diagnostics Follow-Up
Decision
Proceed with a narrow PR 13 follow-up that preserves the no-ranking-change design constraint while centralizing repeated context-pack formatting and ensuring search diagnostics are enriched only for returned results, not every candidate.
Evidence Reviewed
- PR #13 comment from TheMasonX: search is critical; optimize speed, accuracy, signal-to-noise ratio, token bloat, tests, benchmarking, and centralization.
- PR #13 Copilot review comments on lexical diagnostics N+1, semantic linear scans, context-pack warning noise, maintenance event spam, duplicated deprecation thresholds, status-like blocked tags, and Markdown formatter/doc drift.
- Follow-up commit
52476a4 Centralize search diagnostics enrichment. - AI Memory Suite Implementation Plan, especially Phase 1 and Phase 3 retrieval output safety gates.
- Memory Governance Guide, especially diagnostics-as-metadata and no-ranking-change constraints.
- MemorySmith context pack records:
project-wiki-search-roadmap,project-wiki-mcp-context-pack,project-wiki-mcp-search-tools-current, andai-memory-suite-governance-foundation-20260520. - Source evidence: MemoryApplicationService.cs, MemoryDiagnosticFormatting.cs, ChatToolCatalog.cs, McpController.cs, SearchBenchmarkTests.cs, SearchBenchmarks.cs.
- Validation evidence before this follow-up:
dotnet build MemorySmith.slnx -v minimalpassed; PR #13 earlier validation reached 220/220 tests after the page-asset repair.
Findings
| Seat | Recommendation | Confidence | Blocking concern |
|---|---|---|---|
| Source-Grounded Archivist | Accept the follow-up only if it remains source-grounded in current PR comments and does not broaden into ranking changes. | 0.88 | Any claim that search quality improved must be backed by tests/benchmarks, not prose. |
| Data Model Architect | Keep diagnostics as DTO metadata and do not promote new schema fields in this PR. | 0.86 | Schema promotion before Phase 5 would violate the plan and create migration cost. |
| Retrieval Specialist | Patch the remaining hot path: hybrid search should sort candidates first, apply the requested limit, then attach diagnostics only to returned results. | 0.90 | Enriching diagnostics for every hybrid candidate wastes time and can touch source files unnecessarily. |
| Human Learning Advocate | Keep warnings visible but concise; warning/error-only summaries and capped Markdown diagnostics are the right token budget tradeoff. | 0.84 | Too many diagnostics in context packs will reduce usefulness even when technically correct. |
| Skeptical Reviewer | Do not mark the PR comment fully satisfied until duplicated context-pack formatters are centralized or deliberately documented as separate surfaces. | 0.82 | Repeated formatter drift already caused contract problems and contradicts the user's centralization request. |
| Synthesizer | Make two changes now: limited diagnostics enrichment and shared context-pack formatting. Defer ranking/page embeddings/vector-index changes to measured phases. | 0.87 | CI, search probes, and benchmark smoke must pass before pushing. |
Synthesis
Changes now:
- Preserve current lexical, semantic, hybrid, and context-pack ranking semantics.
- Attach diagnostics after search limits are applied where possible, especially hybrid search.
- Reuse one context-pack formatter across direct MCP and chat tool catalog paths, allowing direct MCP to resolve source-link URIs while sharing projection and Markdown structure.
- Keep warning summaries warning/error-only and capped.
Deferred:
- Ranking weight changes, RRF changes, temporal decay, page chunking, page embeddings, durable vector indexes, schema promotion, and broader MCP envelope changes.
- UI redesign for diagnostics beyond the current chips/panel.
Dissent
The Retrieval Specialist would prefer broader ranking quality probes immediately because search quality is the product's core value. The Synthesizer defers ranking changes because the active design explicitly says ranking remains unchanged until probes prove a change is helpful. The compromise is to improve search infrastructure and benchmark coverage now without changing relevance math.
Acceptance Criteria
- Direct
/mcpcontext-pack output and chat catalog context-pack output use a shared formatter implementation. - Direct
/mcpJSON context-pack output still resolves source-link URIs throughVarResolver; chat catalog output preserves raw source-link URIs. - Hybrid search applies the requested limit before diagnostics enrichment.
- Existing search relevance probes still pass.
- Full
MemorySmith.Testssuite passes. - Benchmark smoke command passes:
dotnet run -c Release --project MemorySmith.Benchmarks -- --smoke. - No ranking formula or persisted memory schema changes are introduced.
Post-Implementation Checkpoints
Stage 1, search and formatter plumbing: implemented in 591f2bb Tighten search diagnostics follow-up. The follow-up added a shared MemoryContextPackFormatter, routed direct MCP and chat context-pack output through it, and applied hybrid search limits before diagnostics enrichment. Validation passed with dotnet build MemorySmith.slnx -v minimal, affected tests at 80/80, full tests at 225/225, and benchmark smoke returning results for lexical metadata diagnostics, semantic, hybrid, chat-context, and context-pack paths. Council conclusion: acceptance criteria met without ranking or schema changes. Confidence: 0.91.
Stage 2, PR review closure: current source evidence shows lexical N+1 diagnostics, semantic linear-scan enrichment, warning bloat, deprecation event spam, duplicated deprecation thresholds, and Markdown formatter drift have been addressed. The only remaining source-grounded cleanup was removing blocklisted status-like working tags from memory records because Status already carries that state. This is a policy-conformance edit, not a new retrieval or schema decision, so a fresh full council was not required. Validation gate: no "working" tags remain under Data/Memories/**/*.json, and focused governance/search tests pass. Confidence: 0.89.
Stage 3, second Copilot review round: the latest Copilot PR responses raised six source-grounded hardening concerns: duplicate memory ids can throw in metadata/search/diagnostics maps, duplicate tag-policy namespaces can throw in user-editable policy parsing, tag policy reads happen on every diagnostics pass, and deprecation recommendation idempotence still scans events per record. Council classification: high impact for reliability and hot-path performance, but not a ranking/schema change. Acceptance gates before resolving: duplicate-id fixtures must not throw; duplicate namespace policy must produce a warning rather than fail diagnostics; tag policy reads must be cached with file-write invalidation; consolidation recommendation must scan existing recommendation events once; full tests, search benchmarks, and GitHub CI must pass. Confidence before implementation: 0.86.
Stage 3 Seat Review
| Seat | Recommendation | Confidence | Blocking concern |
|---|---|---|---|
| Source-Grounded Archivist | Treat all six comments as valid because they point at concrete current code paths, not speculative style concerns. | 0.90 | Do not resolve threads until tests cover duplicate ids, duplicate namespaces, cache invalidation, and event scan behavior. |
| Data Model Architect | Keep duplicate-id handling runtime-tolerant in this PR; do not promote a new persisted duplicate/conflict schema yet. | 0.84 | Silently choosing a duplicate without deterministic rules would make diagnostics hard to trust. |
| Retrieval Specialist | Harden search snapshots and metadata maps with one shared duplicate-tolerant map helper so search cannot be taken down by one bad record. | 0.89 | Any helper must preserve deterministic ordering and avoid changing ranking formulas. |
| Human Learning Advocate | Surface bad user-editable policy configuration as diagnostics instead of crashes; warn-first governance must apply to governance itself. | 0.87 | Hiding policy errors would make the tool feel flaky rather than teachable. |
| Skeptical Reviewer | Cache tag policy carefully: last-write invalidation is acceptable, but tests must prove edits are picked up. | 0.82 | Caching stale governance policy would be worse than the current slow-but-fresh behavior. |
| Synthesizer | Patch resilience and performance now, defer persisted duplicate-id records/events and richer policy validation dashboards. | 0.88 | Validation must include both focused unit tests and the full suite before pushing. |
Stage 3 implementation result: accepted and implemented. Current patch adds a shared duplicate-tolerant memory record lookup, applies it to memory metadata, diagnostics, and search snapshots, converts hybrid result lookup maps to non-throwing TryAdd behavior, caches TagPolicyService.GetPolicy() with last-write invalidation, emits tag.policy_duplicate_namespace instead of throwing on duplicate namespace policy entries, and scans existing DeprecationRecommended events once per consolidation run. Validation passed with focused tests at 62/62, dotnet build MemorySmith.slnx -v minimal, full tests at 231/231, and benchmark smoke returning results across lexical metadata diagnostics, semantic, hybrid, chat-context, and context-pack paths. Confidence after implementation: 0.93.
Open Questions
- Should future search quality gates compute MRR/NDCG over the project wiki rather than only top-hit and must-contain probes?
- Should context-pack warning caps become configurable after larger wiki measurements?
- Should
SearchMetadataAsyncreplace lexicalSearchAsyncfor API/UI callers in a later contract cleanup?