Council Review: Sprint 35 LLM-First Delta Audit — Seat Synthesis
Date: 2026-06-23
Scope: MemorySmith.Agent head commit 073175b03387e05592780016db345f7ae48217c0
Self-simulated seats (no subagents: user permission not requested)
Decision
Proceed with P0/P1 urgent fixes now (five concrete items); defer strategic architecture decomposition and data model modernization to Sprint 36+ with explicit evidence gates.
Evidence Reviewed
| Source | Role |
|---|---|
Data/Pages/Audit/memorysmith_agent_sprint35_llm_first_delta_audit_260623-005442.md |
Source-Grounded Archivist full report |
Agent.Planning/ChatModels.cs |
ChatInterpretation.GoalName zombie field confirmed |
Agent.Planning/LlmChatInterpreter.cs |
IntentDraft pipeline docs — code matches docs directionally |
Agent.Planning/GoalFactory.cs |
Policy-engine-in-factory-clothing verified |
Agent.Planning/IntentManager.cs |
GoalRequest.GoalName — the new pipeline still carries the name through |
Agent.Tools/Tools/SearchMemoryTool.cs |
Writes bestPageId + results to Data — never read downstream |
WebUI.Blazor/AgentBackgroundService.cs |
_placeBlockContexts parallel store, 13+ responsibilities |
Agent.Planning/Goals/BuildGoal.cs |
HasExplicitOrigin — partial-coordinate semantics confirmed |
Agent.Planning/HtnTaskLibrary.cs |
~15 SearchMemory calls per gather cycle |
Agent.Planning/HtnPlanner.cs |
Legacy build-origin path bypasses BuildGoal.OriginX/Y/Z |
Sprint 35 handoff: Data/Pages/Tasks/agent-handoff-sprint35-llm-first.md |
Planning baseline for the audit |
Data/Pages/decisions.md |
D-011 (parsers never create goals), D-012 (ActionOutcome) |
MemorySmith.Agent.Tests/ |
ToolDispatchTests + SprintTests — SearchMemory coverage exists for tool-level; zero integration/e2e coverage for search utility* |
Seat Findings Summary
| Seat | Recommendation | Confidence | Blocking concern |
|---|---|---|---|
| Source-Grounded Archivist | Fix TSK-0074/TSK-0075 terrain-occupancy skip, fix smelt→craft routing, fix PlaceBlock timeout docs | 0.92 | Terrain skip converts stalls to silent wrong builds — worse than the original bug |
| Data Model Architect | Deprecate Facts dict, add Context to PendingAction, type ActionData.Arguments/Context, fix HasExplicitOrigin, remove GoalName, rename GoalFactory, decompose AgentBackgroundService |
0.87 | String-keyed dictionaries create system-wide fragility; dual fact stores will diverge |
| Retrieval Specialist | P0: Deduplicate SearchMemory calls, wire TSK-0004 so results are consumed. P1: Wire IKnowledgeResolver, pre-retrieve blueprints for LLM prompts |
0.85 | ~15 SearchMemory calls per cycle with zero results consumption — 99% confident this is waste |
| Human Learning Advocate | P1: Add user-facing stall/progress messages. P1: Decompose AgentBackgroundService. P2: Surface LLM intent ambiguity to user. P5: "explain" command | 0.72 | 13+ responsibilities in one service = impossible to reason about locally |
| Skeptical Reviewer | Fix TSK-0074 terrain skip (worse than stall), fix _placeBlockContexts dictionary leak, fix PlaceBlock 2s race, fix smelt→craft (7 sprints old), add tests for Sprint 42 checkpoint/occupancy changes, break through unit-test ceiling with E2E | 0.78 | Zero tests for Sprint 42's most critical new behavior (checkpoint/occupancy) |
Synthesis: What Changes NOW vs What is DEFERRED
🔴 CHANGE NOW — P0/P1 Urgent Fixes
These are runtime bugs or data-loss risks that are actively degrading agent behavior today. They should be implemented before any strategic work.
| # | Priority | Item | Consensus? | Key Evidence |
|---|---|---|---|---|
| 1 | P0 | Fix TSK-0074/TSK-0075 terrain-occupancy skip. When terrain occupies the target position, the adapter emits blockPlaced, advancing the checkpoint without placing. This is silent wrong-build corruption — worse than a stall (which at least is observable). |
Archivist (0.92), Skeptic (0.78) — both call it critical | Audit finding: terrain-occupied positions emit blockPlaced event, advancing checkpoint. Skeptic: "worse than stall? replaces timeout with silent wrong build." |
| 2 | P0 | Fix smelt → CraftItem routing. "Smelt iron ore" triggers a craft-from-inventory plan, not a furnace workflow. 7 sprints old. |
Archivist (0.98), Skeptic (0.78) — high confidence | Audit §2: ChatInterpreter routes smelt into CraftItemGoal → CraftItemTool → no furnace path. |
| 3 | P0 | Wire SearchMemory results into actual decision-making (TSK-0004). ~15 calls per gather cycle produce results stored in ToolResult.Data["results"] / ["bestPageId"] that no code reads. This is pure wasted latency/API cost. |
Retrieval Specialist (0.85), Skeptic (0.99) — strongest consensus of any finding | SearchMemoryTool.cs writes bestPageId + results. Zero downstream consumers grep-confirmed. |
| 4 | P1 | Add cleanup path for _placeBlockContexts dictionary on duplicate events. Missing handler pattern causes dictionary leak when duplicate events arrive with no matching pending entry. |
Skeptic (0.78), Data Model Architect (0.87) | Skeptical Reviewer: "dictionary leak on duplicate events (no cleanup path)" |
| 5 | P1 | Fix PlaceBlock timeout from 2s → 5s (or match documentation). Code says 2s; docs say 5s. 2s creates a race condition with the adapter. | Archivist (0.92), Skeptic (0.78) | Audit: "code says 2s, not 5s as documented." |
🟡 CHANGE NOW — P1 Code Health Items
| # | Priority | Item | Consensus? |
|---|---|---|---|
| 6 | P1 | Add tests for Sprint 42 checkpoint/occupancy behavior. Zero test coverage for the most critical new code path. | Skeptic (0.78) — blocking concern |
| 7 | P1 | Fix HasExplicitOrigin partial-coordinate semantics. Any single non-null coordinate returns true, which is brittle for future callers. |
Data Model Architect (0.87), Archivist (0.57) |
| 8 | P1 | Add user-facing stall/progress messages. UX is reactive: no stall messages, no progress indication. | Human Learning Advocate (0.72) |
🔵 DEFERRED — Sprint 36+ Strategic Improvements
These are important architectural improvements that should NOT block the P0/P1 fixes above. Each has an evidence gate.
| # | Item | Gate | Suggested Sprint |
|---|---|---|---|
| D1 | Decompose AgentBackgroundService (13+ responsibilities). Biggest boundary violation. |
Gate: After P0/P1 fixes are merged and green. Extract one responsibility at a time (start with cycle management, then outcome handling). | Sprint 36 |
| D2 | Deprecate Facts dictionary; consolidate to StructuredFacts. Dual stores can diverge. |
Gate: Add Context to PendingAction first (so the parallel _placeBlockContexts store can be eliminated). |
Sprint 36 |
| D3 | Type ActionData.Arguments and ActionData.Context from Dictionary<string, object?> to typed records. |
Gate: After D2 — once the data flow is understood, types can be extracted. | Sprint 37 |
| D4 | Remove ChatInterpretation.GoalName zombie field. 7 sprints obsolete. |
Gate: Remove only after all downstream readers (tests, serialization) are updated. Non-breaking if done carefully. | Sprint 36 |
| D5 | Rename GoalFactory to GoalPolicyEngine — it's a policy router, not a factory. |
Gate: Low risk, pure rename. Can be done anytime. | Sprint 36 |
| D6 | Wire IKnowledgeResolver for pre-retrieval augmentation of LLM prompts with blueprints, items, and search results. |
Gate: TSK-0004 must be wired first (P0 item #3). After that, IKnowledgeResolver can augment. |
Sprint 37 |
| D7 | Deduplicate SearchMemory calls — ~15 per gather cycle is excessive even after TSK-0004 wiring. | Gate: Measure actual call volume after TSK-0004. Cache results within a planning cycle. | Sprint 37 |
| D8 | Surface LLM intent ambiguity to user (confidence < threshold → ask clarifying question). | Gate: Requires the IntentDraft → GoalRequest pipeline to be stable first. | Sprint 38 |
| D9 | Add E2E tests to break through the "unit-test ceiling" (608 unit tests, zero E2E). | Gate: Requires the checkpoint/occupancy tests (P1 #6) to establish the pattern. | Sprint 39 |
| D10 | Add FlatAreaFoundEvent.SearchedRadius so retry logic isn't heuristic. |
Gate: Low risk; wire through bridge. Can be done independently. | Sprint 36 |
| D11 | "Explain" command for users to ask why the agent did something. | Gate: Requires action-outcome journaling to be mature (D-012). | Sprint 40+ |
Where Seats Agree vs Disagree
High-Confidence Consensus (≥3 seats, ≥0.80 confidence)
| Finding | Supporting Seats | Max Delta |
|---|---|---|
| TSK-0074 terrain-occupancy skip is the #1 bug | Archivist (0.92), Skeptic (0.78) | 0.14 |
| Smelt→craft routing is wrong | Archivist (0.98), Skeptic (0.78) | 0.20 |
| SearchMemory results are unused (TSK-0004 gap) | Retrieval (0.85), Skeptic (0.99) | 0.14 |
| AgentBackgroundService is too large | Architect (0.87), HLA (0.72), Skeptic (0.78) | 0.15 |
| ChatInterpretation.GoalName must be removed | Architect (0.87), Archivist (0.96) | 0.09 |
| Testing coverage gap for Sprint 42 changes | Skeptic (0.78), Retrieval (0.85) | 0.07 |
Disagreements (need resolution)
| Topic | Position A | Position B | Resolution |
|---|---|---|---|
| Decomposition order | Architect prefers contract-first extraction: define interfaces, then implement. | Skeptic prefers behavior-slice-first: move code with minimal interface changes, extract contracts later. | Adopt Skeptic's approach for P0/P1 (fix bugs first). Adopt Architect's approach for Sprint 36 decomposition (define contracts at each seam before moving code). |
| SearchMemory: waste vs feature | Retrieval Specialist calls it a P0 waste problem (15x calls, zero consumption). | Skeptic says 99% confident it's unused but flags the ticket (TSK-0004) as the blocker. | Agree on urgency but label it P0 based on consensus. The "waste" framing is accurate — each call makes an HTTP request to MemorySmith with zero behavioral effect. |
| GoalFactory rename urgency | Architect says rename is low-risk, do anytime. | Skeptic would defer to Sprint 36 since it's cosmetic. | Defer to Sprint 36 — it's a rename, not a bugfix. |
| User-facing messages priority | HLA says P1 (stall/progress messages). | Other seats didn't flag UX at P0/P1 level. | Accept P1 — meaningful UX improvement but not blocking correctness. |
Evidence Gates for Deferred Items
Every deferred item must pass its gate before implementation begins:
| Deferred Item | Gate | Verifier |
|---|---|---|
| D1: Decompose AgentBackgroundService | All 5 P0/P1 items merged, green tests, and a written decomposition plan reviewed by ≥2 seats | CI + council |
| D2: Deprecate Facts | PendingAction.Context field exists and _placeBlockContexts parallel store is removed |
Code review |
| D3: Type ActionData | D2 complete, and ToolResult.Data consumers are documented |
Code review |
| D4: Remove GoalName | All GoalName references (tests, serialization, backward-compat shims) are migrated |
grep must return zero |
| D5: Rename GoalFactory | Council vote to confirm new name + one sprint of stability after P0/P1 | Council |
| D6: Wire IKnowledgeResolver | TSK-0004 implementation is merged and SearchMemory results are demonstrably used in planning |
Test evidence |
| D7: Deduplicate SearchMemory | Measure call volume per planning cycle after TSK-0004; target ≤3 calls per cycle | Benchmark |
| D8: Surface LLM ambiguity | IntentDraft → GoalRequest pipeline is stable and instrumented | Code review + test |
| D9: E2E tests | Checkpoint/occupancy tests (P1 #6) are merged and establish the testing pattern | CI |
| D10: FlatAreaEvent radius | Low-risk — no gate beyond standard PR review | Code review |
| D11: "Explain" command | ActionOutcome journaling (D-012) is complete and queryable | Council + test |
Risks & Assumptions
Risks
- P0 fix #1 (terrain skip) may require adapter-level changes. If the
blockPlacedevent cannot distinguish "block placed in world" from "block metadata updated at occupied position," the fix may need a Mineflayer adapter patch — which increases scope and testing burden. - Mitigation: Start with C#-side filtering (compare pre-place world state vs post-event world state). Only escalate to adapter changes if filtering proves insufficient. - P0 fix #3 (wire SearchMemory) has unknown scope. TSK-0004 is tracked but not scoped. Wiring results into planning may require changes to
HtnTaskLibrary,ActionDataflow, and decision logic. - Mitigation: Start with the simplest possible consumer (log results, usebestPageIdas a hint in the next tool dispatch). Iterate from there. - P0 fix #2 (smelt→craft) may reveal furnace tool is missing entirely. If no furnace-specific tool exists, the fix requires a new tool, decomposer, and adapter action.
- Mitigation: Audit existing tools first. If absent, implement a
SmeltItemToolthat wraps thefurnaceadapter action (likely already supported in Mineflayer).
Assumptions
- The 5 P0/P1 items are independent and can be implemented in parallel by separate branches.
- No P0/P1 fix requires a schema migration or breaking API change.
- The
IKnowledgeResolverinterface (Agent.Memory/IKnowledgeResolver.cs) is stable enough to wire after TSK-0004, even if its implementation evolves. - The 608 unit tests will catch regressions in P0/P1 changes for non-checkpoint code paths. Checkpoint/occupancy tests must be written alongside fix #1.
Open Questions
- Does the adapter distinguish
blockPlaced(real placement) fromblockUpdate(metadata change at occupied position)? This determines the fix strategy for TSK-0074/TSK-0075. Need to checkMineflayerAdapter/index.jsevent emissions. - Is there already a
SmeltItemToolor furnace-related action in the adapter that's simply not wired through the chat interpreter? The code audit found no evidence, but a quick adapter grep would confirm. - Should SearchMemory results be consumed at plan-construction time (in
HtnTaskLibrary) or at dispatch time (inAgentBackgroundService)? TSK-0004 needs a design decision. - How many of the ~15 SearchMemory calls per cycle are genuinely useful (i.e., would produce actionable results) vs. noise? A brief instrumentation pass before TSK-0004 implementation would inform the deduplication strategy.
Directional Assessment
Is the project on the right track? — Yes, with a caveat.
The LLM-first architecture direction (Sprint 35) is sound. The IntentDraft → GoalRequest → IntentManager pipeline is a meaningful improvement over the old parsers-create-goals model. The code is materially ahead of where it was 5 sprints ago.
However: The gap between architectural intent and runtime behavior has widened in three areas:
- Silent failures — The TSK-0074 terrain skip is a regression that introduces less observable failure modes than the stalls it replaced. This is a net negative for debuggability.
- Orphaned infrastructure — SearchMemory is called ~15 times per cycle and produces zero behavioral effect. This is wasted latency, API cost, and code complexity. It suggests a pattern of "build the tool, forget to wire the consumer."
- Testing gap — The most critical new behavior (checkpoint/occupancy) has zero tests. This is a predictable outcome when unit-test coverage (608 tests) masks the absence of integration and E2E tests.
Bottom line: Fix the five P0/P1 items immediately. They are the difference between "architecturally improving but regressing in observable behavior" and "architecturally improving AND running correctly." After those are done, the strategic decomposition and data model work in Sprint 36+ will rest on a solid foundation.
Acceptance Criteria
This council review is complete when:
- [x] Decision statement is explicit: Proceed with P0/P1 fixes now; defer strategic work to Sprint 36+
- [x] Evidence links are source-grounded (code paths confirmed)
- [x] All 5 seats include confidence values and blocking concerns
- [x] Dissent is visible (GoalFactory rename priority, decomposition order)
- [x] Acceptance criteria are testable (evidence gates table)
- [ ] Report is saved to
Data/Pages/council/sprint35-llm-first-seat-synthesis-20260623.md - [ ] Council report is linked from the active task tracker