Council Review: MemorySmith Application Next Steps

Status: Begun after PR #13 merge on 2026-05-21
Scope: Application roadmap, search/retrieval quality, governance UX, chat/agent behavior, schema gates
Decision level: high impact; implementation is gated by acceptance criteria below

Decision

After PR #13, MemorySmith should prioritize measurement-backed governance UX and retrieval safety before changing ranking formulas, persisted memory schema, page chunking, or Agent write approval semantics.

Evidence Reviewed

Round 1: Independent Seat Findings

| Seat | Recommendation | Confidence | Blocking concern |
| --- | --- | --- | --- |
| Source-Grounded Archivist | Treat the merged governance diagnostics as the new baseline, then reconcile older schema/vector/chat plans against the newer convention-first roadmap before implementing more architecture. | 0.88 | Several docs are intentionally historical or aspirational; implementing them literally would contradict newer gates. |
| Data Model Architect | Defer persisted schema fields until policy diagnostics, UI assistance, and usage metrics prove tags cannot carry the concept safely. | 0.84 | Schema promotion without migration tests and legacy-read compatibility would create avoidable churn in the live wiki fixture. |
| Retrieval Specialist | Build a measured search quality baseline next: curated memory/page queries, MRR or Recall@5, stale-warning rates, and ONNX-vs-fallback metadata. | 0.90 | "Best possible search" cannot be judged from local code shape alone; it needs repeatable probes over the project wiki. |
| Human Learning Advocate | Make governance visible in the workbench before adding more hidden rules: tag chips, policy editing, diagnostics before save, and clear warning surfaces in chat references. | 0.86 | Warning-first governance fails if users only discover warnings after a tool or agent has already used bad context. |
| Skeptical Reviewer | Resist adding vector indexes, page chunking, native provider tools, or strict Agent approval rules in one large branch. Spike and measure the riskier surfaces separately. | 0.82 | The product is now broad enough that a sweeping "AI suite" branch could break trust through token bloat, stale citations, or UX friction. |
| Synthesizer | Start with measurement plus governance UX, then complete retrieval-warning propagation, then harden chat/tool behavior, then advance Agent write governance and schema/page promotions only through council gates. | 0.87 | Each phase needs a narrow validation gate and rollback note before implementation starts. |

Round 2: Adversarial Review And Dissent

Retrieval Versus Data Model

The Retrieval Specialist wants faster progress on page chunking and richer vector search because page content and long-form docs are clearly part of the knowledge base. The Data Model Architect dissents: page frontmatter, heading chunks, and embeddings introduce durable data contracts. The synthesis is to measure page length, heading density, page-query misses, and chat truncation first. Chunking becomes justified only when the corpus or query probes prove whole-page search is insufficient.
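
As an illustration of the page-level measurements named above, the following is a minimal sketch in Python (the project itself appears to be .NET, so treat this as language-neutral pseudocode that happens to run). It assumes pages are Markdown with ATX headings and uses a rough four-characters-per-token estimate; both are assumptions, and `PageStats`, `estimate_tokens`, and `page_stats` are hypothetical names rather than existing MemorySmith APIs.

```python
import re
from dataclasses import dataclass

# Assumption: pages use Markdown ATX headings ("# Title", "## Section", ...).
HEADING_RE = re.compile(r"^#{1,6}\s+\S", re.MULTILINE)


@dataclass(frozen=True)
class PageStats:
    estimated_tokens: int
    heading_count: int
    headings_per_1k_tokens: float


def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token, rounded up.
    # A real tokenizer would give different numbers.
    return (len(text) + 3) // 4


def page_stats(text: str) -> PageStats:
    tokens = estimate_tokens(text)
    headings = len(HEADING_RE.findall(text))
    density = (headings * 1000.0 / tokens) if tokens else 0.0
    return PageStats(tokens, headings, density)
```

A page with many estimated tokens but a low heading density would be hard to chunk by heading, which is one reason to collect both numbers before committing to a chunking contract.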

Chat Native Tools Versus Simplicity

The Chat Capability plan makes a strong case that prompt-mediated JSON tool calls are brittle. The Skeptical Reviewer agrees, but rejects a full provider-tool rewrite as the immediate next branch. The synthesis is a provider-capability spike with tests: define the shared registry and capability metadata, prove one provider path or document why it cannot yet be made native, and keep the app-intercept fallback.
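
One possible shape for the capability metadata and fallback described above is sketched here in Python. Every name (`ProviderCapability`, `CapabilityRegistry`, `ToolCallMode`, and the provider labels in the usage note) is hypothetical and chosen for illustration; the sketch only shows the decision rule that an unknown or non-native provider falls back to app-intercept.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class ToolCallMode(Enum):
    NATIVE = "native"                # provider-native tool calling
    APP_INTERCEPT = "app_intercept"  # prompt-mediated JSON, intercepted by the app


@dataclass(frozen=True)
class ProviderCapability:
    provider: str
    supports_native_tools: bool
    note: str = ""  # e.g. why a provider cannot yet be made native


class CapabilityRegistry:
    def __init__(self) -> None:
        self._caps: Dict[str, ProviderCapability] = {}

    def register(self, cap: ProviderCapability) -> None:
        self._caps[cap.provider] = cap

    def mode_for(self, provider: str) -> ToolCallMode:
        # Unknown providers and providers without native support
        # both keep the app-intercept fallback.
        cap = self._caps.get(provider)
        if cap is not None and cap.supports_native_tools:
            return ToolCallMode.NATIVE
        return ToolCallMode.APP_INTERCEPT
```

For example, after registering `ProviderCapability("provider-a", True)` and `ProviderCapability("provider-b", False, "no tool API yet")`, `mode_for("provider-a")` selects the native path while `mode_for("provider-b")` and any unregistered provider stay on the fallback.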

Agent Governance Versus Human Workbench

Agent write governance is strategically important, but the Human Learning Advocate argues it should not outrun the UI that teaches humans the same policy. The synthesis is to implement policy visibility first: if tag/source/relation/staleness diagnostics are not understandable in /memories, they will not be trustworthy inside Agent approval flows.

Schema Ambition Versus Current Plan

MemorySystemSchemaImprovements_20260519.md remains valuable as a catalog of eventual schema candidates: typed relations, constraints, priority, validity. It is not the controlling plan. The controlling path is convention-first with diagnostics and promotion gates from temp-plan.md and ai-memory-suite-implementation-plan.md.

Phase 0: Close The PR #13 Baseline

Status: complete.

Phase 1: Measurement Baseline Branch

Goal: make future search, governance, and page decisions measurable.

Recommended branch: feature/search-governance-measurement-baseline.

Implementation path: add a read-only MeasurementBaselineService and expose it from the admin diagnostics surface at /api/diagnostics/measurement-baseline. This keeps Phase 1 out of ranking, schema, page chunking, and Agent write behavior while still giving future branches concrete data.

Falsifiable implementation hypothesis: the existing search, page, diagnostics, tag policy, source-link, and semantic-provider services already expose enough information to compute a useful baseline without mutating Data/Memories or Data/Pages.

Cheapest discriminating validation: focused NUnit tests should run the baseline against copied project wiki fixtures and prove the live wiki snapshots are unchanged after measurement.
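
The validation idea above (measure against a copy, then prove the live data is untouched) can be sketched as follows. This is a Python illustration rather than the NUnit tests the project would actually use; `snapshot` and `measure_without_mutation` are hypothetical helper names, and the `measure` callback stands in for whatever the baseline service computes.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path
from typing import Callable, Dict


def snapshot(root: Path) -> Dict[str, str]:
    """Map each file's relative path to the SHA-256 of its contents."""
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def measure_without_mutation(live_root: Path,
                             measure: Callable[[Path], dict]) -> dict:
    """Run `measure` on a copied fixture and verify the live tree is unchanged."""
    before = snapshot(live_root)
    with tempfile.TemporaryDirectory() as tmp:
        fixture = Path(tmp) / "wiki"
        shutil.copytree(live_root, fixture)  # measurement only sees the copy
        result = measure(fixture)
    after = snapshot(live_root)
    if before != after:
        raise AssertionError("live wiki changed during measurement")
    return result
```

Comparing content hashes rather than timestamps keeps the check independent of filesystem metadata, at the cost of reading every file twice.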

Tasks:

Acceptance gates:

Phase 1 baseline thresholds now carried by the measurement contract:

| Threshold | Value | Use |
| --- | --- | --- |
| Search promotion minimum MRR | 0.75 | Minimum mean reciprocal rank before search behavior is treated as healthy enough for promotion decisions. |
| Search promotion minimum Recall@5 | 0.80 | Minimum top-five recall before ranking/chunking changes are considered successful. |
| Maximum diagnostic warning rate | 0.20 | A search/context output warning rate above this value needs review before warning surfaces are expanded. |
| Maximum broken source-link rate | 0.05 | A broken source-link rate above this value blocks stronger source-backed Agent/write behavior. |
| Page chunking token threshold | 1500 estimated tokens | Pages above this size count toward the chunking pressure metric. |
| Page chunking long-page ratio threshold | 0.20 | If more than 20% of pages exceed the token threshold, page chunking gets a fresh council review. |
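
To make the search and chunking thresholds concrete, here is a minimal Python sketch of how they might be evaluated. The threshold values come from the table above; everything else is an assumption for illustration, including the `Probe` shape, the function names, and the choice to define Recall@5 as the fraction of relevant items found in the top five (some teams use a hit-rate definition instead).

```python
from dataclasses import dataclass
from typing import FrozenSet, Sequence

# Values taken from the threshold table above.
MIN_MRR = 0.75
MIN_RECALL_AT_5 = 0.80
CHUNK_TOKEN_THRESHOLD = 1500
LONG_PAGE_RATIO_THRESHOLD = 0.20


@dataclass(frozen=True)
class Probe:
    ranked_ids: Sequence[str]     # search results in rank order
    relevant_ids: FrozenSet[str]  # curated ground truth for the query


def reciprocal_rank(probe: Probe) -> float:
    # 1 / rank of the first relevant result, or 0 if none was returned.
    for rank, doc_id in enumerate(probe.ranked_ids, start=1):
        if doc_id in probe.relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(probe: Probe, k: int = 5) -> float:
    # Assumption: fraction of relevant items that appear in the top k.
    if not probe.relevant_ids:
        return 0.0
    hits = len(set(probe.ranked_ids[:k]) & probe.relevant_ids)
    return hits / len(probe.relevant_ids)


def mean(values: Sequence[float]) -> float:
    return sum(values) / len(values) if values else 0.0


def search_is_promotable(probes: Sequence[Probe]) -> bool:
    mrr = mean([reciprocal_rank(p) for p in probes])
    recall = mean([recall_at_k(p, 5) for p in probes])
    return mrr >= MIN_MRR and recall >= MIN_RECALL_AT_5


def chunking_needs_review(page_token_counts: Sequence[int]) -> bool:
    # "More than 20% of pages exceed the token threshold" is a strict comparison.
    if not page_token_counts:
        return False
    long_pages = sum(1 for t in page_token_counts if t > CHUNK_TOKEN_THRESHOLD)
    return long_pages / len(page_token_counts) > LONG_PAGE_RATIO_THRESHOLD
```

Note that a corpus where exactly 20% of pages are long does not trigger review under this reading, because the table says "more than 20%".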

Phase 2: Governance Workbench And Tag Manager

Goal: make warning-first governance usable by humans.

Recommended branch: feature/governance-workbench-tag-manager.

Tasks:

Acceptance gates:

Implementation result, 2026-05-22:

Phase 3: Retrieval Warning Propagation

Goal: carry diagnostics and provenance through every retrieval surface without changing ranking.

Tasks:

Acceptance gates:

Implementation result, 2026-05-22:

Phase 4: Chat Context Planner And Native Tool Spike

Goal: reduce context bloat and make tool use more reliable.

Tasks:

Acceptance gates:

Implementation result, 2026-05-22:

Phase 5: Agent Write Governance

Goal: make Agent proposals auditable before expanding write power.

Tasks:

Acceptance gates:

Phase 6: Page Metadata, Chunking, And Embeddings

Goal: make long-form pages first-class retrieval sources only when the corpus justifies it.

Tasks:

Acceptance gates:

Phase 7: Schema Promotion

Goal: promote only proven conventions.

Candidate order:

  1. ReviewAfter and ValidUntil date fields.
  2. Kind enum.
  3. Priority enum.
  4. Typed Relations.

Acceptance gates:

Round 4: Implementation Order Recommendation

The next concrete branch should be Phase 1, not Phase 2, because measurement gives the project a way to decide whether later UI, search, and schema work actually improves the system.

Recommended immediate work package:

  1. Add search/governance/page/source-link metric services or test-only probes.
  2. Add a small report writer under diagnostics or benchmark smoke output.
  3. Add wiki documentation explaining current baseline thresholds.
  4. Run the existing full suite and benchmark smoke.

Only after that should the Tag Manager UI branch begin. This keeps the next user-facing surface grounded in known tag drift and source-link health rather than imagined examples.

Phase 1 Implementation Result

Status: implemented on feature/search-governance-measurement-baseline.

Implemented now:

Still intentionally deferred:

Validation plan for PR readiness:

Validation result:

Risks And Mitigations

| Risk | Mitigation |
| --- | --- |
| Measurement becomes another dashboard with no decisions. | Define thresholds and use them as gates for chunking, schema, and ranking changes. |
| Tag Manager UI adds friction. | Start with observe/warn modes and fast keyboard editing; do not block ordinary edits until policy quality is proven. |
| Diagnostics create token bloat. | Keep warning/error caps and structured summaries; benchmark context-pack and chat prompts. |
| Native tool-call work becomes provider-specific sprawl. | Add capability metadata and keep fallback intercepts; isolate provider-specific adapters. |
| Agent governance becomes too strict for local use. | Use risk tiers: low-risk Working proposals stay fast; Core/rule/source-link changes get stronger review. |
| Schema promotion happens prematurely. | Require usage thresholds, tests, migration, rollback, and council review. |
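
The risk-tier mitigation for Agent governance could be expressed as a small routing rule. The Python sketch below is only one reading of that row: the `Proposal` fields, the `ReviewLevel` names, and the choice to send unrecognized tiers to stronger review are all assumptions, not decided behavior.

```python
from dataclasses import dataclass
from enum import Enum


class ReviewLevel(Enum):
    FAST = "fast"      # low-friction approval
    STRONG = "strong"  # stronger human review


@dataclass(frozen=True)
class Proposal:
    memory_tier: str                  # e.g. "Working" or "Core"
    changes_rule: bool = False        # proposal edits a rule
    changes_source_link: bool = False # proposal edits a source link


def review_level(proposal: Proposal) -> ReviewLevel:
    # Core, rule, and source-link changes always get stronger review.
    if (proposal.memory_tier == "Core"
            or proposal.changes_rule
            or proposal.changes_source_link):
        return ReviewLevel.STRONG
    # Low-risk Working proposals stay fast.
    if proposal.memory_tier == "Working":
        return ReviewLevel.FAST
    # Assumption: anything unrecognized defaults to the safer path.
    return ReviewLevel.STRONG
```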

Open Questions

Confidence

Overall confidence: 87%.

The roadmap is strongly supported by the latest project wiki and the just-merged PR #13 evidence. Confidence is not higher because several next phases need measurements that do not exist yet: curated search quality baselines, page miss rates, tag drift trends, and Agent proposal quality metrics.