Tool A/B Spot Check - 20260601-step01-v6-chatml-batched-rerun-t256 (Base vs Tuned)

Generated: 2026-06-01 16:26:26Z

Scope

Headline Metrics

| Metric              | Base        | Tuned         | Delta |
|---------------------|-------------|---------------|-------|
| Envelope valid      | 0/60 (0.0%) | 17/60 (28.3%) | +17   |
| Expected tool match | 0/60 (0.0%) | 0/60 (0.0%)   | +0    |

Per-Tool Results

| Tool                           | Cases | Base envelope | Base tool match | Tuned envelope | Tuned tool match | Delta envelope | Delta tool match |
|--------------------------------|-------|---------------|-----------------|----------------|------------------|----------------|------------------|
| memorysmith_code_search        | 5     | 0/5 (0.0%)    | 0/5 (0.0%)      | 3/5 (60.0%)    | 0/5 (0.0%)       | +3             | +0               |
| memorysmith_code_search_status | 5     | 0/5 (0.0%)    | 0/5 (0.0%)      | 0/5 (0.0%)     | 0/5 (0.0%)       | +0             | +0               |
| memorysmith_context_pack       | 5     | 0/5 (0.0%)    | 0/5 (0.0%)      | 0/5 (0.0%)     | 0/5 (0.0%)       | +0             | +0               |
| memorysmith_get                | 5     | 0/5 (0.0%)    | 0/5 (0.0%)      | 3/5 (60.0%)    | 0/5 (0.0%)       | +3             | +0               |
| memorysmith_hybrid_search      | 5     | 0/5 (0.0%)    | 0/5 (0.0%)      | 0/5 (0.0%)     | 0/5 (0.0%)       | +0             | +0               |
| memorysmith_page_get           | 5     | 0/5 (0.0%)    | 0/5 (0.0%)      | 3/5 (60.0%)    | 0/5 (0.0%)       | +3             | +0               |
| memorysmith_page_search        | 5     | 0/5 (0.0%)    | 0/5 (0.0%)      | 1/5 (20.0%)    | 0/5 (0.0%)       | +1             | +0               |
| memorysmith_search             | 5     | 0/5 (0.0%)    | 0/5 (0.0%)      | 1/5 (20.0%)    | 0/5 (0.0%)       | +1             | +0               |
| memorysmith_semantic_search    | 5     | 0/5 (0.0%)    | 0/5 (0.0%)      | 0/5 (0.0%)     | 0/5 (0.0%)       | +0             | +0               |
| memorysmith_task_get           | 5     | 0/5 (0.0%)    | 0/5 (0.0%)      | 1/5 (20.0%)    | 0/5 (0.0%)       | +1             | +0               |
| memorysmith_task_list          | 5     | 0/5 (0.0%)    | 0/5 (0.0%)      | 2/5 (40.0%)    | 0/5 (0.0%)       | +2             | +0               |
| memorysmith_unified_search     | 5     | 0/5 (0.0%)    | 0/5 (0.0%)      | 3/5 (60.0%)    | 0/5 (0.0%)       | +3             | +0               |
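
As a consistency check, the per-tool rows can be summed to reproduce the headline figures. The sketch below is illustrative only: the `tuned` dictionary and the helper names `summarize` and `fmt` are assumptions made for this example, not part of the tooling that generated the report. The counts are copied from the tuned-model columns of the per-tool results.

```python
# Tuned-model counts copied from the per-tool results:
# tool name -> (cases, envelope_valid, expected_tool_match).
# The dictionary layout and helper names are hypothetical, for illustration only.
tuned = {
    "memorysmith_code_search": (5, 3, 0),
    "memorysmith_code_search_status": (5, 0, 0),
    "memorysmith_context_pack": (5, 0, 0),
    "memorysmith_get": (5, 3, 0),
    "memorysmith_hybrid_search": (5, 0, 0),
    "memorysmith_page_get": (5, 3, 0),
    "memorysmith_page_search": (5, 1, 0),
    "memorysmith_search": (5, 1, 0),
    "memorysmith_semantic_search": (5, 0, 0),
    "memorysmith_task_get": (5, 1, 0),
    "memorysmith_task_list": (5, 2, 0),
    "memorysmith_unified_search": (5, 3, 0),
}


def summarize(per_tool):
    """Sum cases, envelope-valid counts, and tool-match counts across tools."""
    cases = sum(c for c, _, _ in per_tool.values())
    envelope = sum(e for _, e, _ in per_tool.values())
    match = sum(m for _, _, m in per_tool.values())
    return cases, envelope, match


def fmt(numerator, denominator):
    """Format a count in the report's 'n/d (p%)' style, one decimal place."""
    return f"{numerator}/{denominator} ({100 * numerator / denominator:.1f}%)"


cases, envelope, match = summarize(tuned)
print("Envelope valid:", fmt(envelope, cases))
print("Expected tool match:", fmt(match, cases))
```

Summing the rows gives 60 cases, 17 valid envelopes, and 0 expected tool matches for the tuned model, which agrees with the Headline Metrics. Every base-model column is 0/5, so the base totals are 0/60 for both metrics.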

Notable Improvements

Notable Regressions

Persistent Failures (Both Models)

Representative Output Snippets

memorysmith_unified_search-1

memorysmith_unified_search-2

memorysmith_unified_search-3

memorysmith_unified_search-4

memorysmith_unified_search-5

memorysmith_hybrid_search-1

memorysmith_hybrid_search-2

memorysmith_hybrid_search-3