Tool A/B Executive Summary - Explicit Corpus v6 ChatML Launch (2026-06-01)

Run Status

A/B run launched:
command: Scripts/eval_tool_ab_spotcheck.py
output target: Data/Pages/research/training/tool-ab-spotcheck-20260601-explicit-corpus-v6-chatml.data.json
adapter: D:/temp/memorysmith-training/runs/20260601-142123/adapter
results: Data/Pages/research/training/tool-ab-spotcheck-20260601-explicit-corpus-v6-chatml.data.json
generatedAtUtc: 2026-06-01T21:08:58.345701+00:00
Current state: completed; final JSON artifact emitted and saved to repo working tree.

This executive interpretation was drafted while the explicit-corpus run was still in flight, so it uses the latest completed comparable run:

source: Data/Pages/research/training/tool-ab-spotcheck-20260601-step01-v6-chatml-batched-rerun-t256.data.json
report: Data/Pages/research/training/tool-ab-spotcheck-20260601-step01-v6-chatml-batched-rerun-t256.md

Envelope validity improved versus base:
base: 0/60
tuned: 10/60 (16.7%)
Expected-tool match did not improve:
base: 0/60
tuned: 0/60
Interpretation:
The model got better at producing parseable tool-call envelopes.
The model is still not selecting canonical MemorySmith tool names required by the benchmark contract.
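
The two measurements above can be scored independently per response. The following is a minimal sketch, not the logic of Scripts/eval_tool_ab_spotcheck.py: it assumes a hypothetical envelope shape (a JSON object with a string "tool" field and an object "arguments" field), and the function name score_response is illustrative only.

```python
import json


def score_response(raw_text, expected_tool):
    """Return (envelope_valid, tool_match) for one model response.

    Assumption: a valid envelope is a JSON object with a string "tool"
    and an object "arguments". The real benchmark contract may differ.
    """
    try:
        envelope = json.loads(raw_text)
    except json.JSONDecodeError:
        # Not parseable at all: fails both dimensions.
        return False, False
    if not isinstance(envelope, dict):
        return False, False
    tool = envelope.get("tool")
    arguments = envelope.get("arguments")
    if not isinstance(tool, str) or not isinstance(arguments, dict):
        return False, False
    # The envelope is well-formed; the tool name is judged separately.
    return True, tool == expected_tool
```

Under these assumptions, a response that calls an alias such as search counts toward envelope validity but not toward expected-tool match, which is the pattern reported above.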

Contract compliance is now split across two dimensions:
- Syntax/format compliance improved.
- Semantic tool selection remains failing.
The largest quality blocker is canonical tool-name alignment, not JSON formatting.
Historical regressions are consistent with corpus contamination and alias drift: training examples contained non-canonical aliases (search, open_page, fetch_task, etc.) that do not satisfy benchmark tool-name expectations.

Do not advance to Step 2 additive data experiments until tool-match passes gate.
Keep explicit corpus defaults (no implicit transcript or starter ingestion).
Prioritize corpus curation for canonical tool names before additional training volume.
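
One way to approach the curation recommendation is an allowlist with an alias remap. The sketch below is illustrative only: the canonical names (canonical_search and so on) are placeholders, since the real names come from the benchmark contract, and the example record shape (a dict with a "tool" key) is an assumption about the corpus format.

```python
# Placeholder names: the real canonical tool names come from the
# benchmark contract and are not listed in this summary.
CANONICAL_TOOLS = {"canonical_search", "canonical_open_page", "canonical_fetch_task"}

# Aliases observed in the training data, mapped to placeholder canonical names.
ALIAS_TO_CANONICAL = {
    "search": "canonical_search",
    "open_page": "canonical_open_page",
    "fetch_task": "canonical_fetch_task",
}


def canonical_slice(examples):
    """Split examples into (kept, dropped) by canonical tool name.

    Known aliases are remapped first; anything still outside the
    allowlist is dropped rather than trained on.
    """
    kept, dropped = [], []
    for example in examples:
        tool = example.get("tool")
        if isinstance(tool, str):
            tool = ALIAS_TO_CANONICAL.get(tool, tool)
        if isinstance(tool, str) and tool in CANONICAL_TOOLS:
            # Copy the record so the input corpus is not mutated.
            kept.append({**example, "tool": tool})
        else:
            dropped.append(example)
    return kept, dropped
```

Returning the dropped examples alongside the kept ones makes it possible to review how much of the corpus was contaminated before retraining.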

Note: The tuned model improved envelope validity over base but did not produce canonical tool-name matches required by the benchmark gate.

Complete current A/B run and update this page with final numbers.
Build a canonical-only corpus slice for tool-call responses:
- allowlist only benchmarked canonical tool names.
- remove or remap alias tool names before training.
Re-run the same 60-case spotcheck and compare deltas on:
- envelope validity
- expected-tool match
Keep gate unchanged for progression:
- envelope: 60/60
- tool-match: >= 42/60
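
The gate and the delta comparison can be checked mechanically. This is a minimal sketch using the thresholds stated above and the counts from the comparable run; the function and key names (passes_gate, deltas, envelope_valid, tool_match) are hypothetical and not taken from the evaluation script.

```python
TOTAL_CASES = 60
ENVELOPE_GATE = 60    # envelope: 60/60
TOOL_MATCH_GATE = 42  # tool-match: >= 42/60


def rate(count, total=TOTAL_CASES):
    """Percentage of cases, rounded to one decimal place."""
    return round(100.0 * count / total, 1)


def passes_gate(counts):
    """True only if both gate dimensions are met."""
    return (counts["envelope_valid"] >= ENVELOPE_GATE
            and counts["tool_match"] >= TOOL_MATCH_GATE)


def deltas(base, tuned):
    """Tuned-minus-base change in each count."""
    return {key: tuned[key] - base[key] for key in ("envelope_valid", "tool_match")}


# Counts from the completed comparable run reported on this page.
base = {"envelope_valid": 0, "tool_match": 0}
tuned = {"envelope_valid": 10, "tool_match": 0}

print(rate(tuned["envelope_valid"]))  # 16.7
print(deltas(base, tuned))            # {'envelope_valid': 10, 'tool_match': 0}
print(passes_gate(tuned))             # False
```

With the comparable-run numbers, the tuned model fails the gate on both dimensions, so Step 2 stays blocked.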

UI launch path now supports explicit corpus configuration and no longer assumes transcript/starter corpus by default.