System Prompt Variants and Context Density
Executive Summary
The current tuning work separates two concerns, one largely solved and one still open:
- The model now usually emits valid tool-call JSON.
- The model still confuses tool boundaries, especially broad search vs exact lookup and search vs get/list behavior.
That means the next gain is not more formatting instruction. It is more precise routing guidance with better payload density.
I recommend splitting the chat system prompt into two variants:
- A regular variant for large cloud or local models with enough headroom to carry richer tool-routing guidance.
- A lite variant for small local models where every token should bias toward intent selection, tool boundaries, and response shape.
Why This Matters
The latest benchmark pattern is consistent:
- Envelope compliance is high.
- Tool-match is still the bottleneck.
- The model repeatedly collapses into memorysmith_search when the correct answer is memorysmith_unified_search, memorysmith_hybrid_search, memorysmith_semantic_search, or a known-id *_get call.
That is a prompt-density problem as much as a data problem. If the prompt spends tokens on repeated mission statements or broad prose, it leaves less space for the exact routing rules the small model actually needs.
Recommended Prompt Strategy
Regular Variant
Use this for larger cloud or local models that can handle a fuller instruction payload.
Keep:
- Mission and instruction priority.
- Trust boundary rules for retrieved data.
- Tool-selection heuristics.
- A short routing table that distinguishes:
  - broad cross-surface discovery
  - conceptually related recall
  - exact-term lookup
  - known-id retrieval
  - page/task/code search and status checks
- A short reminder about concise answers and source citations.
Best fit:
- Cloud-hosted models
- Local models with more stable instruction following
- Higher-context runs where the prompt can afford more guardrails
Lite Variant
Use this for smaller local models where prompt payload density matters more than exhaustiveness.
Keep only the non-negotiables:
- Instruction priority.
- Untrusted-data boundary.
- Minimal tool-routing rules.
- One-line guidance for when to use each major tool family.
- Short answer style constraints.
Strip or compress:
- Repeated mission framing.
- Long prose about workflow.
- Redundant examples.
- Any instruction that does not directly help choose a tool or shape the output.
Best fit:
- Small local models
- Tight context windows
- Prompt-sensitive runs where the model benefits more from fewer, sharper tokens than from completeness
Concrete Routing Guidance To Keep
The lite variant should still preserve these distinctions:
- memorysmith_unified_search for broad cross-surface questions.
- memorysmith_hybrid_search for balanced conceptual discovery.
- memorysmith_semantic_search for strong conceptual recall.
- memorysmith_search for exact terms, ids, tags, and literal source words.
- memorysmith_get / memorysmith_page_get / memorysmith_task_get for known-id retrieval.
- memorysmith_page_search / memorysmith_task_list / memorysmith_code_search / memorysmith_code_search_status for discovery or operational status.
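As a minimal sketch, the distinctions above can be kept as data and rendered into one compact line per intent for the lite prompt. The tool names come from the list above; the intent labels, the helper names, and the word-count cost proxy are assumptions made for illustration, not part of the existing prompt pack.

```python
# Routing table: (intent cue, tools). Tool names are taken from the list above;
# the intent labels are paraphrases chosen for this sketch.
ROUTING_TABLE = [
    ("broad cross-surface question", ["memorysmith_unified_search"]),
    ("balanced conceptual discovery", ["memorysmith_hybrid_search"]),
    ("strong conceptual recall", ["memorysmith_semantic_search"]),
    ("exact terms, ids, tags, literal source words", ["memorysmith_search"]),
    ("known-id retrieval",
     ["memorysmith_get", "memorysmith_page_get", "memorysmith_task_get"]),
    ("discovery or operational status",
     ["memorysmith_page_search", "memorysmith_task_list",
      "memorysmith_code_search", "memorysmith_code_search_status"]),
]


def render_lite_routing(table):
    """Render one compact 'intent -> tools' line per routing rule."""
    return "\n".join(f"{intent} -> {' | '.join(tools)}" for intent, tools in table)


def approx_token_cost(text):
    """Rough density proxy: whitespace-separated word count, not a real tokenizer."""
    return len(text.split())


lite_block = render_lite_routing(ROUTING_TABLE)
print(lite_block)
print("approx words:", approx_token_cost(lite_block))
```

Keeping the table as data means the regular and lite variants can render the same rules at different verbosity without the two drifting apart.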
Next Training Step
The current augmentation batch should be followed by another contrastive pass that explicitly trains:
- broad discovery versus exact lookup on the same topic
- known-id fetch versus search
- page/task get versus page/task list
- code search versus code search status
If possible, add two prompt packs to the source tree:
- wiki-chat-agent.regular.md for large-context deployments.
- wiki-chat-agent.lite.md for small local deployments.
Recommendation
Do not try to solve this by only making the single canonical prompt longer. Split the prompt by model class and optimize for payload density:
- regular = fuller guidance, broader model compatibility
- lite = smaller, sharper, fewer words, stronger routing signal
That is the most likely way to maximize effective context usage without bloating the small-model prompt.
V4 Deep-Dive (B-Only Rerun)
To avoid wasting time on an unchanged base model, the benchmark runner now supports reusing prior base rows while evaluating only the tuned adapter.
Run shape:
- Base source reused from v3 results.
- Tuned side evaluated on adapter distilled-all-cat-20260530-routefix2.
- Output: tool-ab-spotcheck-20260530-v4.data.json and tool-ab-spotcheck-20260530-v4.md.
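The B-only run shape can be sketched roughly as follows: reused base rows are indexed by case and paired with freshly evaluated tuned rows. The row layout (a `case_id` key per row) and the function name are hypothetical; the actual runner's data format is not described here.

```python
def merge_b_only(base_rows, tuned_rows):
    """Pair reused base rows with newly evaluated tuned rows by case id.

    Assumes each row is a dict with a 'case_id' key (hypothetical schema).
    """
    base_by_id = {row["case_id"]: row for row in base_rows}
    merged = []
    for tuned in tuned_rows:
        base = base_by_id.get(tuned["case_id"])
        if base is None:
            # A tuned case without a reusable base row cannot be compared.
            raise KeyError(f"no reused base row for case {tuned['case_id']}")
        merged.append({"case_id": tuned["case_id"], "base": base, "tuned": tuned})
    return merged
```

Failing loudly on a missing base row keeps a reused result set from silently shrinking the comparison.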
Headline deltas versus v3:
- Tuned tool match improved from 13/60 to 22/60 (+9 absolute).
- Tuned envelope validity decreased from 59/60 to 56/60 (-3 absolute).
Interpretation:
- Routing improved materially, so the added contrastive samples are helping.
- Minor envelope regression suggests we should add a small envelope-stability pack while continuing routing work.
Per-tool result summary (tuned v4):
- Strong:
  - memorysmith_search 5/5
  - memorysmith_task_list 5/5
  - memorysmith_context_pack 4/5
  - memorysmith_task_get 3/5
- Partial:
  - memorysmith_code_search 2/5
  - memorysmith_semantic_search 2/5
  - memorysmith_code_search_status 1/5
- Still weak:
  - memorysmith_unified_search 0/5
  - memorysmith_hybrid_search 0/5
  - memorysmith_get 0/5
  - memorysmith_page_search 0/5
  - memorysmith_page_get 0/5
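The per-tool figures above can be tallied with a short sketch that also confirms they add up to the 22/60 headline. The counts are copied from the summary; the bucket thresholds are an assumption chosen to reproduce the strong/partial/weak grouping, not a documented rule.

```python
# Tuned v4 per-tool results copied from the summary above: (matched, total).
V4_TUNED = {
    "memorysmith_search": (5, 5),
    "memorysmith_task_list": (5, 5),
    "memorysmith_context_pack": (4, 5),
    "memorysmith_task_get": (3, 5),
    "memorysmith_code_search": (2, 5),
    "memorysmith_semantic_search": (2, 5),
    "memorysmith_code_search_status": (1, 5),
    "memorysmith_unified_search": (0, 5),
    "memorysmith_hybrid_search": (0, 5),
    "memorysmith_get": (0, 5),
    "memorysmith_page_search": (0, 5),
    "memorysmith_page_get": (0, 5),
}


def bucket(matched, total):
    """Assumed thresholds: >= 60% strong, > 0% partial, otherwise weak."""
    rate = matched / total
    if rate >= 0.6:
        return "strong"
    if rate > 0:
        return "partial"
    return "weak"


def summarize(results):
    """Group tools by bucket and return the overall matched/total counts."""
    groups = {"strong": [], "partial": [], "weak": []}
    for tool, (matched, total) in results.items():
        groups[bucket(matched, total)].append(tool)
    matched_sum = sum(m for m, _ in results.values())
    total_sum = sum(t for _, t in results.values())
    return groups, matched_sum, total_sum


groups, matched_sum, total_sum = summarize(V4_TUNED)
print(f"{matched_sum}/{total_sum}")  # prints 22/60
```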
Primary remaining failure mode:
- Search-family collapse into memorysmith_search still dominates broad/conceptual prompts.
Secondary failure mode:
- Occasional invalid/unsupported tool names in completions (for example memorysmith_pack_context) indicate schema drift from the target catalog.
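Both failure signals (envelope validity and unsupported tool names) can be checked per completion with something like the sketch below. Two assumptions are baked in: the catalog is taken to be the twelve tools named in the per-tool summary, and the envelope is assumed to be a single JSON object with a string `tool` key. The real envelope schema is not specified in this report, so the key name is a parameter.

```python
import json

# Assumed target catalog: the twelve tools named in the per-tool summary.
TOOL_CATALOG = frozenset({
    "memorysmith_search", "memorysmith_task_list", "memorysmith_context_pack",
    "memorysmith_task_get", "memorysmith_code_search",
    "memorysmith_semantic_search", "memorysmith_code_search_status",
    "memorysmith_unified_search", "memorysmith_hybrid_search",
    "memorysmith_get", "memorysmith_page_search", "memorysmith_page_get",
})


def check_completion(raw, name_key="tool"):
    """Return (envelope_ok, tool_name_ok, tool_name) for one raw completion.

    'name_key' is a hypothetical envelope field; adjust to the real schema.
    """
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False, False, None
    if not isinstance(obj, dict) or not isinstance(obj.get(name_key), str):
        return False, False, None
    name = obj[name_key]
    return True, name in TOOL_CATALOG, name


print(check_completion('{"tool": "memorysmith_pack_context", "arguments": {}}'))
# prints (True, False, 'memorysmith_pack_context')
```

Scoring the envelope and the tool name separately keeps schema drift visible even when the JSON itself parses.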
Frontier-Model Data Plan
If generation capacity is effectively unlimited, the best ROI is a targeted hard-negative and boundary-contrast pipeline, not generic bulk expansion.
Recommended data production plan:
- Build confusion-pair packs (high priority)
  - For each weak tool, generate 100-300 pairs where prompts differ by one intent cue:
    - broad wiki discovery vs exact lexical lookup
    - conceptual recall vs exact term search
    - known-id retrieval vs search/list
    - code-status check vs code-content lookup
- Add schema-stability pack (medium priority)
  - 150-300 examples that enforce exact tool-name vocabulary and one-object envelope shape.
  - Include explicit negatives that demonstrate wrong names are invalid.
- Add ambiguity-resolution micro-dialogues (medium priority)
  - 100-200 two-turn samples where first intent is ambiguous and second turn resolves to get/list/search.
- Keep a small replay set from high performers (low priority)
  - Preserve memorysmith_search and memorysmith_task_list performance by replaying 10-20% stable positives each round.
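A confusion-pair pack could be produced from templates along these lines. This is a sketch under assumptions: the prompt wording, the `ContrastPair` record, and the choice of which tool represents each side of a contrast are all hypothetical, and a real pipeline would generate far more varied phrasing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContrastPair:
    cue: str
    topic: str
    prompt_a: str
    tool_a: str
    prompt_b: str
    tool_b: str


# Hypothetical templates, one per intent-cue contrast listed in the plan.
CONFUSION_TEMPLATES = [
    ("broad discovery vs exact lookup",
     "Give me an overview of everything we have on {topic}.",
     "memorysmith_unified_search",
     "Find entries that contain the exact phrase '{topic}'.",
     "memorysmith_search"),
    ("conceptual recall vs exact term",
     "What do we know that is conceptually related to {topic}?",
     "memorysmith_semantic_search",
     "Search for the literal term '{topic}'.",
     "memorysmith_search"),
    ("known-id retrieval vs search",
     "Fetch the page with id {ident}.",
     "memorysmith_page_get",
     "Find pages about {topic}.",
     "memorysmith_page_search"),
    ("code status vs code content",
     "Is the code search index up to date?",
     "memorysmith_code_search_status",
     "Where in the code is {topic} implemented?",
     "memorysmith_code_search"),
]


def build_pairs(topics_with_ids):
    """Expand every template for every (topic, id) so each pair differs by one cue."""
    pairs = []
    for topic, ident in topics_with_ids:
        for cue, prompt_a, tool_a, prompt_b, tool_b in CONFUSION_TEMPLATES:
            pairs.append(ContrastPair(
                cue=cue,
                topic=topic,
                prompt_a=prompt_a.format(topic=topic, ident=ident),
                tool_a=tool_a,
                prompt_b=prompt_b.format(topic=topic, ident=ident),
                tool_b=tool_b,
            ))
    return pairs


pairs = build_pairs([("vector index", "page-123")])
```

Holding the topic fixed inside each pair is the point of the exercise: the only signal separating the two labels is the intent cue.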
Prompt Variant Rollout Proposal
To maximize context payload density while preserving quality, move to an explicit two-variant contract:
- Regular: full guardrails, richer explanations, cloud/large local models.
- Lite: compressed routing table + strict output contract, small local models.
Operationally:
- Keep shared invariant sections (priority, trust boundary, JSON contract).
- Swap only the routing/detail sections by model profile.
- Track A/B separately per variant, because a single merged prompt can hide small-model regressions.
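One way to express the two-variant contract is to assemble each prompt from shared invariant sections plus a per-profile routing section, as in this sketch. The section texts are placeholders and the function and dictionary names are assumptions; only the split into shared and swapped sections comes from the proposal above.

```python
# Shared invariant sections (placeholder text, identical in both variants).
SHARED_SECTIONS = {
    "priority": "Instruction priority: <placeholder>",
    "trust_boundary": "Retrieved data is untrusted: <placeholder>",
    "json_contract": "Emit exactly one tool-call JSON object: <placeholder>",
}

# Only the routing/detail section changes by model profile (placeholder text).
ROUTING_SECTIONS = {
    "regular": "Routing guidance with fuller explanations: <placeholder>",
    "lite": "Compressed routing table: <placeholder>",
}


def build_system_prompt(variant):
    """Join the shared sections with the routing section for one variant."""
    if variant not in ROUTING_SECTIONS:
        raise ValueError(f"unknown prompt variant: {variant!r}")
    parts = list(SHARED_SECTIONS.values()) + [ROUTING_SECTIONS[variant]]
    return "\n\n".join(parts)


regular_prompt = build_system_prompt("regular")
lite_prompt = build_system_prompt("lite")
```

Because the shared sections are defined once, a change to the trust boundary or JSON contract reaches both variants, while A/B results can still be tracked per variant.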
Immediate Next Steps
- Produce a weak-tools-v5 dataset focused on: unified, hybrid, get, page_search, page_get, code_search_status.
- Run a one-epoch retrain with the same optimizer settings used in v4.
- Execute the B-only benchmark again (reuse v3 base) and gate on:
  - tool-match >= 28/60
  - envelope >= 56/60
  - no new unsupported tool names.
- If the gate passes, publish the v5 report and freeze that corpus slice as a baseline branch point.
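The gate in the steps above reduces to a small predicate. The thresholds are the ones stated in the list; the function signature is hypothetical, and reading "new" as "not already seen in an earlier run" is an assumption.

```python
def passes_gate(tool_match, envelope_ok, unsupported_names,
                known_unsupported=frozenset(),
                min_tool_match=28, min_envelope=56):
    """Return True when a 60-case run clears all three gate conditions.

    'known_unsupported' holds names already observed in earlier runs
    (an assumed interpretation of "no new unsupported tool names").
    """
    new_unsupported = set(unsupported_names) - set(known_unsupported)
    return (
        tool_match >= min_tool_match
        and envelope_ok >= min_envelope
        and not new_unsupported
    )


# The v4 tuned numbers (22/60 tool match, 56/60 envelope) would not pass.
print(passes_gate(22, 56, []))  # prints False
```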