Two Deep Dive Research Reports: Review 1

Review 2:

Chat Harness Enhancements: Context, Compaction, Memory, and Prompting

We recommend treating context management as the agent’s working memory【24†L152-L160】. In practice, the harness should allocate tokens to distinct “regions”: the system prompt, recent chat history, retrieved memory items, tool outputs, and a reserve for the upcoming reply. Each turn should log a token budget plan (the slices and the reason for each) and every truncation event (for example, why older content was pruned)【52†L93-L95】, so that what the model “sees” stays traceable. One implementation uses a lightweight token counter (via sampling) to track usage without costly full tokenization【52†L93-L95】. Recording these budgets in the trace lets us diagnose when and why context overflow occurs. In short, context engineering is memory engineering【24†L152-L160】: we must explicitly manage what stays in each prompt and what is dropped.
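
The per-turn budget plan described above can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: Python is used only for brevity (the harness itself appears to be C#), and every name here (`Slice`, `estimate_tokens`, `plan_budget`, the region labels, the 4-characters-per-token ratio) is a hypothetical assumption. The character-based estimator is a stand-in for a cheap counter, not the sampling technique the cited implementation uses.

```python
from dataclasses import dataclass


@dataclass
class Slice:
    """One entry in the per-turn budget plan written to the trace."""
    region: str
    requested: int  # tokens the region would like to use
    granted: int    # tokens actually allocated this turn
    reason: str     # why the region got what it got


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Cheap heuristic estimate (assumed ratio); not a real tokenizer."""
    return max(1, round(len(text) / chars_per_token)) if text else 0


def plan_budget(window: int, reply_reserve: int,
                requests: list[tuple[str, int]]) -> list[Slice]:
    """Grant regions in priority order and record any truncation."""
    remaining = window - reply_reserve
    if remaining < 0:
        raise ValueError("reply reserve exceeds the context window")
    plan = [Slice("reply_reserve", reply_reserve, reply_reserve,
                  "reserved for the upcoming reply")]
    for region, requested in requests:
        granted = min(requested, remaining)
        if granted == requested:
            reason = "fits"
        else:
            reason = f"truncated: only {remaining} tokens left"
        plan.append(Slice(region, requested, granted, reason))
        remaining -= granted
    return plan


if __name__ == "__main__":
    # Regions are listed highest priority first (an assumed ordering).
    plan = plan_budget(
        window=8000,
        reply_reserve=1000,
        requests=[
            ("system_prompt", 500),
            ("recent_history", 4000),
            ("retrieved_memory", 1500),
            ("tool_outputs", 2000),
        ],
    )
    for s in plan:
        print(f"{s.region:18} requested={s.requested:5} "
              f"granted={s.granted:5} {s.reason}")
```

Granting in a fixed priority order is only one possible policy; the point of the sketch is that each slice carries its own reason, so a truncation shows up in the trace instead of happening silently.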

Context Compaction (Summarization) Strategy

When the conversation grows too large, we fold older parts into a summary. We suggest initially making compaction manual (an explicit /compact command) with strict lineage links to the original turns, then gradually adding guarded automation. Key points from related systems:

Design Safeguards

Memory and Retrieval (Per-Chat Memory Ledger)

Beyond the immediate context, the harness should support a separate memory store for durable facts or notes that persist across turns or sessions. Key ideas:

Prompting and Guardrails

To guide the model’s behavior around compaction and memory, we will update system prompts and instructions:

Agent Tools and Diagnostics

To support these features, we’ll augment the agent toolkit:

Phased Implementation Plan

Based on our synthesis of research and current architecture, we propose a phased rollout:

  1. Phase 1 (Convention-First):
    - Manual Tools & Tracing Only: Implement /compact, budget tracing, and memory ledger as local storage, all behind explicit user actions. Do not alter existing default behavior.
    - Prompt Updates: Update system prompt to include the compaction and uncertainty clauses.
    - Testing: Write unit tests (e.g. in PagesAndChatTests.cs and ChatToolCatalogAndInterceptTests.cs) covering manual compaction, disabling features, and verifying that /compact summarization actually creates a new message with correct lineage. Add retrieval tests ensuring answers from past chat still resolve correctly after compaction.
    - User Feedback/Early Use: Expose these features to power users or on dev environments to gather real-case feedback on wording and usability.

  2. Phase 2 (Guarded Auto-Compaction):
    - Threshold Checking: Enable the auto-trigger at configurable thresholds (start at ~85%). Ensure it’s disabled by default in production.
    - Instrumentation: Collect metrics on how often auto-compaction fires and how it affects information retrieval and answer quality.
    - Roll-Back Switch: Provide an admin flag to completely disable compaction features if needed.

  3. Phase 3 (Optional Schema Upgrade):
    - If usage stabilizes, consider adding database schema fields for “CompactionSummary” and “MemoryNote” entries, migrating the in-memory/session storage to a persistent model. Do this only after confirming usage patterns, so we don’t hard-code the wrong structure.
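
The compaction behavior from Phases 1 and 2 can be sketched as below: a manual compaction step that records lineage to the turns it replaces, plus a guarded auto-trigger that is off by default and sits behind an admin switch. This is a hedged illustration only. It is written in Python for brevity rather than in the harness's own language, and all names (`Message`, `CompactionConfig`, `should_auto_compact`, `compact`, the `summarizes` field) are hypothetical. The summarizer is passed in as a plain function, because the real summarization call is outside the scope of this sketch.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Message:
    id: int
    role: str
    text: str
    # Lineage: ids of the original turns this summary replaces.
    summarizes: tuple[int, ...] = ()


@dataclass
class CompactionConfig:
    features_enabled: bool = True  # admin roll-back switch
    auto_enabled: bool = False     # auto-compaction is off by default
    threshold: float = 0.85        # fraction of the window that triggers it


def should_auto_compact(used_tokens: int, window_tokens: int,
                        cfg: CompactionConfig) -> bool:
    """True only when both flags allow it and usage reaches the threshold."""
    if window_tokens <= 0:
        raise ValueError("window_tokens must be positive")
    if not (cfg.features_enabled and cfg.auto_enabled):
        return False
    return used_tokens / window_tokens >= cfg.threshold


def compact(history: list[Message], keep_recent: int,
            summarize: Callable[[list[Message]], str],
            cfg: CompactionConfig) -> tuple[list[Message], list[Message]]:
    """Return (new_history, archived); archived turns are kept for audit."""
    if keep_recent < 0:
        raise ValueError("keep_recent must be non-negative")
    if not cfg.features_enabled or len(history) <= keep_recent:
        return list(history), []
    cut = len(history) - keep_recent
    old, recent = history[:cut], history[cut:]
    summary = Message(
        id=max(m.id for m in history) + 1,
        role="system",
        text=summarize(old),
        summarizes=tuple(m.id for m in old),
    )
    return [summary] + recent, old


if __name__ == "__main__":
    turns = [Message(i, "user", f"turn {i}") for i in range(1, 5)]
    new_history, archived = compact(
        turns, keep_recent=2,
        summarize=lambda msgs: f"Summary of {len(msgs)} earlier turns",
        cfg=CompactionConfig(),
    )
    print([m.id for m in new_history], new_history[0].summarizes)
    print(should_auto_compact(9000, 10000, CompactionConfig()))
```

Returning the archived turns alongside the new history keeps compaction reversible and auditable, and a lineage test of the kind listed under Phase 1 only needs to check that the summary's `summarizes` ids match the archived ids.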

Tests, Metrics, and Rollout Gates

Before fully enabling auto-compaction, we will enforce clear acceptance criteria:

Challenges and Open Questions

Summary

In sum, we propose a staged approach: first implement explicit compaction and memory features with full traceability, then carefully enable automation once we verify quality. All compaction should be reversible/auditable and include uncertainty-handling clauses to avoid silent errors【16†L126-L133】【42†L103-L110】. With thorough testing and user controls (disable flags, notifications), we can significantly extend chat sessions’ effective scope while keeping the system reliable.

Sources: We draw on state-of-the-art practices in context management and compaction (e.g. Claude, Forge, OpenAI Codex)【19†L318-L324】【52†L105-L114】, memory system design【24†L152-L160】【24†L185-L193】, and prompt engineering to reduce hallucination【42†L111-L118】. These inform our design for MemorySmith’s chat harness upgrades.