Council Review: PR 13 Search Diagnostics Follow-Up

Decision

Proceed with a narrow PR 13 follow-up that preserves the no-ranking-change design constraint while centralizing repeated context-pack formatting and ensuring search diagnostics are enriched only for returned results, not every candidate.

Evidence Reviewed

Findings

Seat Recommendation Confidence Blocking concern
Source-Grounded Archivist Accept the follow-up only if it remains source-grounded in current PR comments and does not broaden into ranking changes. 0.88 Any claim that search quality improved must be backed by tests/benchmarks, not prose.
Data Model Architect Keep diagnostics as DTO metadata and do not promote new schema fields in this PR. 0.86 Schema promotion before Phase 5 would violate the plan and create migration cost.
Retrieval Specialist Patch the remaining hot path: hybrid search should sort candidates first, apply the requested limit, then attach diagnostics only to returned results. 0.90 Enriching diagnostics for every hybrid candidate wastes time and can touch source files unnecessarily.
Human Learning Advocate Keep warnings visible but concise; warning/error-only summaries and capped Markdown diagnostics are the right token budget tradeoff. 0.84 Too many diagnostics in context packs will reduce usefulness even when technically correct.
Skeptical Reviewer Do not mark the PR comment fully satisfied until duplicated context-pack formatters are centralized or deliberately documented as separate surfaces. 0.82 Repeated formatter drift already caused contract problems and contradicts the user's centralization request.
Synthesizer Make two changes now: limited diagnostics enrichment and shared context-pack formatting. Defer ranking/page embeddings/vector-index changes to measured phases. 0.87 CI, search probes, and benchmark smoke must pass before pushing.

Synthesis

Changes now:

Deferred:

Dissent

The Retrieval Specialist would prefer broader ranking quality probes immediately because search quality is the product's core value. The Synthesizer defers ranking changes because the active design explicitly says ranking remains unchanged until probes prove a change is helpful. The compromise is to improve search infrastructure and benchmark coverage now without changing relevance math.

Acceptance Criteria

Post-Implementation Checkpoints

Stage 1, search and formatter plumbing: implemented in 591f2bb Tighten search diagnostics follow-up. The follow-up added a shared MemoryContextPackFormatter, routed direct MCP and chat context-pack output through it, and applied hybrid search limits before diagnostics enrichment. Validation passed with dotnet build MemorySmith.slnx -v minimal, affected tests at 80/80, full tests at 225/225, and benchmark smoke returning results for lexical metadata diagnostics, semantic, hybrid, chat-context, and context-pack paths. Council conclusion: acceptance criteria met without ranking or schema changes. Confidence: 0.91.

Stage 2, PR review closure: current source evidence shows lexical N+1 diagnostics, semantic linear-scan enrichment, warning bloat, deprecation event spam, duplicated deprecation thresholds, and Markdown formatter drift have been addressed. The only remaining source-grounded cleanup was removing blocklisted status-like working tags from memory records because Status already carries that state. This is a policy-conformance edit, not a new retrieval or schema decision, so a fresh full council was not required. Validation gate: no "working" tags remain under Data/Memories/**/*.json, and focused governance/search tests pass. Confidence: 0.89.

Stage 3, second Copilot review round: the latest Copilot PR responses raised six source-grounded hardening concerns: duplicate memory ids can throw in metadata/search/diagnostics maps, duplicate tag-policy namespaces can throw in user-editable policy parsing, tag policy reads happen on every diagnostics pass, and deprecation recommendation idempotence still scans events per record. Council classification: high impact for reliability and hot-path performance, but not a ranking/schema change. Acceptance gates before resolving: duplicate-id fixtures must not throw; duplicate namespace policy must produce a warning rather than fail diagnostics; tag policy reads must be cached with file-write invalidation; consolidation recommendation must scan existing recommendation events once; full tests, search benchmarks, and GitHub CI must pass. Confidence before implementation: 0.86.

Stage 3 Seat Review

Seat Recommendation Confidence Blocking concern
Source-Grounded Archivist Treat all six comments as valid because they point at concrete current code paths, not speculative style concerns. 0.90 Do not resolve threads until tests cover duplicate ids, duplicate namespaces, cache invalidation, and event scan behavior.
Data Model Architect Keep duplicate-id handling runtime-tolerant in this PR; do not promote a new persisted duplicate/conflict schema yet. 0.84 Silently choosing a duplicate without deterministic rules would make diagnostics hard to trust.
Retrieval Specialist Harden search snapshots and metadata maps with one shared duplicate-tolerant map helper so search cannot be taken down by one bad record. 0.89 Any helper must preserve deterministic ordering and avoid changing ranking formulas.
Human Learning Advocate Surface bad user-editable policy configuration as diagnostics instead of crashes; warn-first governance must apply to governance itself. 0.87 Hiding policy errors would make the tool feel flaky rather than teachable.
Skeptical Reviewer Cache tag policy carefully: last-write invalidation is acceptable, but tests must prove edits are picked up. 0.82 Caching stale governance policy would be worse than the current slow-but-fresh behavior.
Synthesizer Patch resilience and performance now, defer persisted duplicate-id records/events and richer policy validation dashboards. 0.88 Validation must include both focused unit tests and the full suite before pushing.

Stage 3 implementation result: accepted and implemented. Current patch adds a shared duplicate-tolerant memory record lookup, applies it to memory metadata, diagnostics, and search snapshots, converts hybrid result lookup maps to non-throwing TryAdd behavior, caches TagPolicyService.GetPolicy() with last-write invalidation, emits tag.policy_duplicate_namespace instead of throwing on duplicate namespace policy entries, and scans existing DeprecationRecommended events once per consolidation run. Validation passed with focused tests at 62/62, dotnet build MemorySmith.slnx -v minimal, full tests at 231/231, and benchmark smoke returning results across lexical metadata diagnostics, semantic, hybrid, chat-context, and context-pack paths. Confidence after implementation: 0.93.

Open Questions