Tool A/B Executive Summary - Explicit Corpus v6 ChatML Launch (2026-06-01)

Run Status

Interim Baseline For Interpretation

Because the explicit-corpus run is still in flight, this executive interpretation uses the latest completed comparable run:

Headline Findings (Latest Comparable Completed Run)

Executive Interpretation

  1. Contract compliance is now split across two dimensions:
     - Syntax/format compliance improved.
     - Semantic tool selection remains failing.
  2. The largest quality blocker is canonical tool-name alignment, not JSON formatting.
  3. Historical regressions are consistent with corpus contamination and alias drift:
     - Training examples contained non-canonical aliases (search, open_page, fetch_task, etc.) that do not satisfy benchmark tool-name expectations.
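
The split between the two dimensions above can be sketched as a per-case scorer. This is a minimal illustration, not the benchmark's actual harness: it assumes each response is a JSON object with "name" and "arguments" keys, and the canonical name web_search is hypothetical (only the alias search comes from this report).

```python
import json


def score_case(raw_output: str, expected_tool: str) -> tuple[bool, bool]:
    """Return (envelope_valid, tool_match) for one model response.

    Assumed envelope shape (not confirmed by the report):
    {"name": <str>, "arguments": <object>}
    """
    try:
        call = json.loads(raw_output)
    except json.JSONDecodeError:
        return False, False

    # Dimension 1: syntax/format compliance of the envelope.
    envelope_valid = (
        isinstance(call, dict)
        and isinstance(call.get("name"), str)
        and isinstance(call.get("arguments"), dict)
    )
    # Dimension 2: semantic tool selection, exact canonical-name match only.
    tool_match = envelope_valid and call["name"] == expected_tool
    return envelope_valid, tool_match


def summarize(cases: list[tuple[str, str]]) -> dict[str, int]:
    """Aggregate both dimensions over (raw_output, expected_tool) pairs."""
    results = [score_case(raw, expected) for raw, expected in cases]
    return {
        "envelope_valid": sum(valid for valid, _ in results),
        "tool_match": sum(match for _, match in results),
        "total": len(results),
    }


# Hypothetical cases: a valid envelope with an alias, a valid canonical call,
# and a malformed response.
example_cases = [
    ('{"name": "search", "arguments": {"q": "x"}}', "web_search"),
    ('{"name": "web_search", "arguments": {"q": "x"}}', "web_search"),
    ("not json", "web_search"),
]
print(summarize(example_cases))
```

In this sketch an alias such as search counts as a valid envelope but a failed tool match, which is the pattern the interpretation describes: formatting can improve while tool selection still fails.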

Final Numbers (this run)

Note: The tuned model improved envelope validity over base but did not produce canonical tool-name matches required by the benchmark gate.

Decision Guidance

  1. Complete current A/B run and update this page with final numbers.
  2. Build a canonical-only corpus slice for tool-call responses:
     - Allowlist only benchmarked canonical tool names.
     - Remove or remap alias tool names before training.
  3. Re-run the same 60-case spotcheck and compare deltas on:
     - Envelope validity
     - Expected-tool match
  4. Keep gate unchanged for progression:
     - Envelope: 60/60
     - Tool-match: >= 42/60
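
The canonical-only slice and the progression gate from the steps above could look roughly like the following. It is a sketch under stated assumptions: the canonical names and the alias-to-canonical mapping are hypothetical placeholders (the report lists only the aliases), each training example is assumed to be a dict with a "tool" field, and fetch_task is deliberately left unmapped to show the removal path.

```python
# Hypothetical canonical names; substitute the benchmark's real allowlist.
CANONICAL_TOOLS = {"web_search", "browser_open", "task_get"}

# Hypothetical remap table. Aliases absent from this table are dropped.
ALIAS_REMAP = {
    "search": "web_search",
    "open_page": "browser_open",
}


def canonical_slice(examples: list[dict]) -> list[dict]:
    """Remap known aliases, then keep only allowlisted tool names."""
    kept = []
    for example in examples:
        name = ALIAS_REMAP.get(example["tool"], example["tool"])
        if name in CANONICAL_TOOLS:
            kept.append({**example, "tool": name})
    return kept


def passes_gate(envelope_valid: int, tool_match: int, total: int = 60) -> bool:
    """Gate from this report: envelope 60/60 and tool-match >= 42/60."""
    return envelope_valid == total and tool_match >= 42


corpus = [
    {"tool": "search", "args": {"q": "x"}},   # alias, remapped
    {"tool": "fetch_task", "args": {}},       # alias with no mapping, removed
    {"tool": "web_search", "args": {"q": "y"}},  # already canonical
]
print(canonical_slice(corpus))
print(passes_gate(envelope_valid=60, tool_match=41))
```

Whether to remap or remove a given alias is a per-alias decision; remapping preserves training volume but is only safe when the alias and the canonical tool take the same arguments.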

Notes