Training Corpus Review - 2026-05-30

Summary

I reviewed the active distilled corpus and recent A/B evaluation results between base and tuned models.

Key conclusion: the corpus has improved tool-call envelope compliance, but it still under-trains tool routing precision. In the latest continuation benchmark, the tuned model reached 85% envelope validity, yet expected-tool match remained at 21.7%, with the model over-selecting memorysmith_search.
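The gap between the two figures is easier to reason about when both metrics are computed side by side. The following is a minimal sketch, not the benchmark's actual scoring code. It assumes a hypothetical record format with an "output" string and an "expected_tool" name, and a hypothetical envelope shape of {"name": ..., "arguments": {...}}.

```python
import json


def tool_call_name(raw):
    """Return the tool name if raw parses as a valid envelope, else None.

    Assumed (hypothetical) envelope: {"name": <str>, "arguments": <object>}.
    """
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    name = obj.get("name")
    args = obj.get("arguments")
    if isinstance(name, str) and isinstance(args, dict):
        return name
    return None


def score(records):
    """Compute envelope validity and expected-tool match as fractions."""
    if not records:
        raise ValueError("no records to score")
    names = [tool_call_name(r["output"]) for r in records]
    valid = sum(n is not None for n in names)
    # An invalid envelope (None) can never match the expected tool.
    matched = sum(n == r["expected_tool"] for n, r in zip(names, records))
    total = len(records)
    return {
        "envelope_validity": valid / total,
        "expected_tool_match": matched / total,
    }
```

Scored this way, a model can produce well-formed JSON on most prompts and still route poorly, which is the pattern described above.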

Executive Summary

The current corpus is good at getting the model to emit valid tool-call JSON, but it is still not good enough at choosing the right tool. The model keeps collapsing ambiguous routing prompts into memorysmith_search, especially when the correct answer should be memorysmith_unified_search, memorysmith_hybrid_search, memorysmith_semantic_search, or a known-id *_get call.

I added a focused tool-selection augmentation shard to address that gap, but the latest benchmark still shows routing as the primary remaining problem. The practical next step is more contrastive examples that separate broad discovery from exact lookups and from known-id retrieval.

Evidence

Observed pattern in v2:

Improvements Applied

1) Added targeted tool-selection augmentation shard

New file:

What it adds:

2) System-prompt specialization for routing

The augmentation shard uses a stronger tool-selection system prompt that explicitly defines routing boundaries between similar tools. This is intended to reduce:

3) Added disambiguation mini-set

The shard includes mini contrast examples for:

4) Continued refinement after benchmark review

The shard now also includes explicit contrast cases for:

Additional Recommendations (Next Batch)

  1. Add negative pair training for tool confusion cases:
     - same user intent phrased two ways, one mapping to a specialized tool and one to broad search.
  2. Increase multi-turn "repair" examples where the first tool call is wrong and is corrected on follow-up.
  3. Add schema-level lint before export:
     - enforce exactly one tool-call object when tool use is required.
     - reject outputs containing prefaces like <think> or prose before JSON.
  4. Add weighted sampling by underperforming tools (hybrid, semantic, *_get, code_search_status).
  5. Introduce a routing score gate in pre-train validation to block corpus updates that regress expected-tool match.
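
Recommendations 3, 4, and 5 can be sketched together as follows. This is an illustrative sketch under stated assumptions, not existing pipeline code: the function names, the envelope key "name", the floor value, and the tolerance parameter are all hypothetical.

```python
import json


def lint_tool_output(raw):
    """Return a list of lint errors; an empty list means the sample passes."""
    text = raw.strip()
    errors = []
    if "<think>" in text:
        errors.append("contains <think> preface")
    if not text.startswith("{"):
        errors.append("prose before JSON")
        return errors
    try:
        # json.loads rejects trailing data, so two objects or trailing
        # prose after the object both fail here.
        obj = json.loads(text)
    except json.JSONDecodeError:
        errors.append("not exactly one JSON object")
        return errors
    if not isinstance(obj.get("name"), str):
        errors.append("missing tool name")  # assumed envelope key
    return errors


def sampling_weights(match_rate_by_tool, floor=0.05):
    """Weight each tool by its error rate so weak tools are sampled more.

    The floor (an assumed value) keeps well-performing tools in the mix.
    """
    raw = {t: max(1.0 - r, floor) for t, r in match_rate_by_tool.items()}
    total = sum(raw.values())
    return {t: w / total for t, w in raw.items()}


def routing_gate(baseline_match, candidate_match, tolerance=0.0):
    """Return True if the candidate corpus does not regress routing."""
    return candidate_match >= baseline_match - tolerance
```

For example, routing_gate(0.217, 0.20) returns False and would block the update, while any candidate at or above the current 21.7% expected-tool match would pass. A small nonzero tolerance may be worth considering if benchmark noise turns out to be significant.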

Confidence