System Prompt Variants and Context Density

Executive Summary

The current tuning work shows a clear split between two problems:

  1. The model now usually emits valid tool-call JSON.
  2. The model still confuses tool boundaries, especially broad search vs exact lookup and search vs get/list behavior.

That means the next gain is not more formatting instruction. It is more precise routing guidance with better payload density.

I recommend splitting the chat system prompt into two variants:

Why This Matters

The latest benchmark pattern is consistent:

That is a prompt-density problem as much as a data problem. If the prompt spends tokens on repeated mission statements or broad prose, it leaves less space for the exact routing rules the small model actually needs.

Regular Variant

Use this for larger cloud or local models that can handle a fuller instruction payload.

Keep:

Best fit:

Lite Variant

Use this for smaller local models where prompt payload density matters more than exhaustiveness.

Keep only the non-negotiables:

Strip or compress:

Best fit:

Concrete Routing Guidance To Keep

The lite variant should still preserve these distinctions:

Next Training Step

The current augmentation batch should be followed by another contrastive pass that explicitly trains:

If possible, add two prompt packs to the source tree:

Recommendation

Do not try to solve this by only making the single canonical prompt longer. Split the prompt by model class and optimize for payload density:

That is the most likely way to maximize effective context usage without bloating the small-model prompt.

V4 Deep-Dive (B-Only Rerun)

To avoid wasting time on an unchanged base model, the benchmark runner now supports reusing prior base rows while evaluating only the tuned adapter.

Run shape:

Headline deltas versus v3:

Interpretation:

Per-tool result summary (tuned v4):

Primary remaining failure mode:

Secondary failure mode:

Frontier-Model Data Plan

If generation capacity is effectively unlimited, the best ROI is a targeted hard-negative and boundary-contrast pipeline, not generic bulk expansion.

Recommended data production plan:

  1. Build confusion-pair packs (high priority)
  1. Add schema-stability pack (medium priority)
  1. Add ambiguity-resolution micro-dialogues (medium priority)
  1. Keep a small replay set from high performers (low priority)

Prompt Variant Rollout Proposal

To maximize context payload density while preserving quality, move to an explicit two-variant contract:

Operationally:

Immediate Next Steps

  1. Produce a weak-tools-v5 dataset focused on:
  1. Run one epoch retrain with the same optimizer settings used in v4.

  2. Execute B-only benchmark again (reuse v3 base) and gate on:

  1. If the gate passes, publish v5 report and freeze that corpus slice as a baseline branch point.