Training Data Distillation Request
Type: Agent Task Request
Date: 2026-05-29
Status: Open
Audience: AI agent (Athena or equivalent) with access to MemorySmith MCP tools
Supplemental Delivery Note (2026-05-30)
Hyperagent delivered 55 validated JSONL examples across all 9 target categories with 29 unique real memory/page IDs from the live knowledge base. Tool call envelope verification passed for the required shape:
{"toolCalls":[{"name":"...","arguments":{...}}]}
Category coverage summary:
| Category | Count | Notes |
|---|---|---|
| Single-tool retrieval | 15 | unified(3), hybrid(2), semantic(2), lexical(2), page_search(2), code_search(4) |
| Context pack + multi-reference | 8 | context_pack with backlinks, grounded multi-source answers |
| Direct get by ID/slug | 5 | memorysmith_get(3), page_get(2) |
| Task browsing | 5 | task_list(4), task_get(1) with filters |
| Code search | 5 | realistic codebase questions |
| Multi-turn | 6 | 4+ messages each |
| Agent-mode writes | 4 | strict JSON with reply/memoryWrites/pageWrites |
| Graceful failure | 4 | acknowledges gaps, suggests resources |
| Citation-focused | 3 | explicit "- Source: memory:<id>" footers |
Validation claim from Hyperagent: 55/55 valid JSON, 0 confabulated IDs, correct tool-call envelope on all 53 tool-call examples, and all memory: / page: references came from verified inventory.
Important training-quality note: Hyperagent reports it initially produced an incorrect tool envelope ({"tool":"..."}) before correction, matching audit concern TRAIN-001. The delivered dataset now uses the exact envelope expected by the C# ReadToolCalls parser.
Context — Why This Exists
MemorySmith fine-tunes a local language model (currently Qwen/Qwen3.5-4B) using parameter-efficient LoRA adapters with rank 8, alpha 16, and target modules q_proj, k_proj, v_proj, and o_proj. The goal is to teach the model to behave like Athena: a grounded wiki assistant that queries the MemorySmith knowledge base before answering, cites sources, requests tool calls in a precise JSON protocol, and distinguishes evidence from inference.
Training is driven by MemorySmith.Training/harness.py via the Training Workbench at /training-workbench. The harness reads .sft.jsonl export files from Data/Training/exports/, converts them to causal language model training text using to_training_text(), and runs the PEFT LoRA training loop directly on the GPU (RTX 5060, CUDA).
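As a rough illustration of the conversion step, the sketch below renders one messages list as ChatML-style training text. The name to_training_text_sketch and the <|im_start|> / <|im_end|> delimiters are assumptions made for illustration; the real to_training_text() in harness.py may format turns differently, for example through the tokenizer's own chat template.

```python
import json

# ChatML-style turn delimiters (an assumption, not confirmed by the harness).
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def to_training_text_sketch(example: dict) -> str:
    """Render one SFT example as a single ChatML-style training string."""
    turns = [
        f"{IM_START}{message['role']}\n{message['content']}{IM_END}"
        for message in example["messages"]
    ]
    return "\n".join(turns) + "\n"


# The assistant turn is itself a JSON string, serialised compactly.
tool_call = json.dumps(
    {"toolCalls": [{"name": "memorysmith_unified_search",
                    "arguments": {"query": "semantic search ONNX embedding",
                                  "memoryLimit": 5, "pageLimit": 3}}]},
    separators=(",", ":"),
)
example = {"messages": [
    {"role": "system", "content": "You are Athena, MemorySmith's local wiki assistant."},
    {"role": "user", "content": "What does MemorySmith use for semantic search?"},
    {"role": "assistant", "content": tool_call},
]}

training_text = to_training_text_sketch(example)
print(training_text)
```

Whatever the real formatter looks like, the useful property is the same: every message becomes one clearly delimited turn, so a malformed assistant turn in the export is reproduced verbatim in the training text.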
Currently there are ~30 synthetic export files in Data/Training/exports/, most of which cover narrow tool-call scenarios produced during sprint development. The dataset is thin on:
- Multi-turn conversations with real retrieval depth
- Accurate source citation patterns (memory:<id>, page:<slug>)
- Agent-mode write examples (memoryWrites, pageWrites, task mutation tools)
- Graceful refusal/clarification when context is insufficient
- Code search integration (questions about the codebase resolved via memorysmith_code_search)
This request asks an agent to distill new, high-quality SFT examples by querying the live MemorySmith knowledge base and producing realistic Athena-style conversations that would help the model generalise those behaviours.
Training Data Format
Each output file must be a line-delimited JSON file with extension .sft.jsonl, placed under Data/Training/exports/.
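Each line is one independent training example in the FilteredSft / ChatML format. The layout below is wrapped for readability; in the file each example occupies exactly one line, and the // comment is annotation only, not part of the JSON: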
{"messages": [
{"role": "system", "content": "<system prompt>"},
{"role": "user", "content": "<user turn>"},
{"role": "assistant", "content": "<ideal response>"},
// optional additional turns for multi-turn examples
]}
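A minimal sketch of how a generated file could be checked against this format before it is placed in the exports folder. The helper names and the restriction to the three roles shown above are assumptions for illustration; the harness may accept more than this.

```python
import json

# Roles that appear in this request; treating others as errors is an assumption.
ALLOWED_ROLES = {"system", "user", "assistant"}


def parse_sft_line(line: str) -> dict:
    """Parse one JSONL line and check the basic messages structure."""
    example = json.loads(line)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(example, dict):
        raise ValueError("each line must be a JSON object")
    messages = example.get("messages")
    if not isinstance(messages, list) or not messages:
        raise ValueError("example needs a non-empty 'messages' array")
    for message in messages:
        if not isinstance(message, dict):
            raise ValueError("each message must be a JSON object")
        if message.get("role") not in ALLOWED_ROLES:
            raise ValueError(f"unexpected role: {message.get('role')!r}")
        if not isinstance(message.get("content"), str):
            raise ValueError("every message needs string 'content'")
    return example


def load_sft_text(text: str) -> list:
    """Parse JSONL text into a list of examples, skipping blank lines."""
    return [parse_sft_line(line) for line in text.splitlines() if line.strip()]


sample = "\n".join([
    '{"messages":[{"role":"system","content":"You are Athena."},'
    '{"role":"user","content":"List open tasks."},'
    '{"role":"assistant","content":"{\\"toolCalls\\":[{\\"name\\":\\"memorysmith_task_list\\",'
    '\\"arguments\\":{\\"status\\":\\"open\\"}}]}"}]}',
    "",
])
examples = load_sft_text(sample)
print(len(examples), "example(s) parsed")
```

Note that the wrapped layout and the // comment shown above would fail this check by design: json.loads accepts neither comments nor an example split across lines.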
System Prompt Variants
Use realistic system prompt variants. The canonical form is:
You are Athena, MemorySmith's local wiki assistant. Use the supplied memories, pages, and attachments as local context. Distinguish clearly between evidence from the knowledge base and your own inference.
For citation-emphasis examples add:
You are Athena. Cite sources as memory:<id> or page:<slug>.
For agent-mode write examples add:
You are Athena in Agent mode with auto_accept approval. Mutation tools are available.
Tool-Call Response Shape
When the ideal response is a tool call, the content must be exactly this JSON — no prose, no Markdown fence:
{"toolCalls":[{"name":"<tool>","arguments":{<args>}}]}
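The envelope rule can be checked mechanically. The following is a sketch only: it is a Python approximation written for this request, not the C# ReadToolCalls parser, and its strictness (exactly one top-level key, exactly name and arguments per call) is an assumption about what "exactly this JSON" should mean.

```python
import json


def is_tool_call_envelope(content: str) -> bool:
    """Return True only if content is a bare toolCalls JSON object."""
    try:
        payload = json.loads(content)
    except json.JSONDecodeError:
        return False  # prose wrapping or a Markdown fence ends up here
    if not isinstance(payload, dict) or set(payload) != {"toolCalls"}:
        return False
    calls = payload["toolCalls"]
    if not isinstance(calls, list) or not calls:
        return False
    return all(
        isinstance(call, dict)
        and set(call) == {"name", "arguments"}
        and isinstance(call["name"], str)
        and isinstance(call["arguments"], dict)
        for call in calls
    )


good = '{"toolCalls":[{"name":"memorysmith_search","arguments":{"query":"ONNX","limit":5}}]}'
wrong_shape = '{"tool":"memorysmith_search","arguments":{"query":"ONNX"}}'
wrapped = "Let me look that up: " + good

print(is_tool_call_envelope(good), is_tool_call_envelope(wrong_shape), is_tool_call_envelope(wrapped))
```

The wrong_shape case is the {"tool":"..."} envelope noted in the delivery note above; a check like this catches it before it reaches the training set.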
For memoryWrites (agent mode):
{"memoryWrites":[{"id":"<id>","title":"<t>","content":"<c>","tags":["..."]}]}
For pageWrites (agent mode):
{"pageWrites":[{"slug":"<slug>","title":"<t>","body":"<markdown>"}]}
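Because the assistant content is a JSON string nested inside a JSON line, it is easy to get the escaping wrong by hand. One way to avoid that, sketched below, is to build both layers with json.dumps. The helper name, the placeholder ID, and the sample user turn are hypothetical; only the memoryWrites field names come from the shape above.

```python
import json


def assistant_json_content(payload: dict) -> str:
    """Serialise a response object compactly for use as message content."""
    return json.dumps(payload, separators=(",", ":"), ensure_ascii=False)


memory_write = {
    "memoryWrites": [{
        "id": "example-memory-id",  # placeholder, not a real knowledge-base ID
        "title": "Example title",
        "content": "Example content",
        "tags": ["example"],
    }]
}

example = {"messages": [
    {"role": "system",
     "content": "You are Athena in Agent mode with auto_accept approval. "
                "Mutation tools are available."},
    {"role": "user", "content": "Save that as a memory."},
    {"role": "assistant", "content": assistant_json_content(memory_write)},
]}

# Outer serialisation: one line per example, inner quotes escaped automatically.
jsonl_line = json.dumps(example, ensure_ascii=False)
print(jsonl_line)
```

The same two-step approach works for toolCalls and pageWrites payloads.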
Available Tools
The following tools are accessible through the intercepted MCP-compatible protocol at runtime and can be called as toolCalls in assistant turns. Use them to discover real content for example answers.
Search & Retrieval
| Tool | Purpose | Key Arguments |
|---|---|---|
| memorysmith_unified_search | Broad search across memories, pages, and tasks | query, memoryLimit, pageLimit |
| memorysmith_hybrid_search | Balanced conceptual/lexical memory search | query, limit |
| memorysmith_semantic_search | Dense vector search for conceptual recall | query, limit |
| memorysmith_search | Exact-term, tag, ID, or literal-word search | query, limit |
| memorysmith_context_pack | Root record + references + backlinks | id or query, includeBacklinks, referenceDepth |
| memorysmith_get | Fetch a single memory by ID | id |
Pages
| Tool | Purpose | Key Arguments |
|---|---|---|
| memorysmith_page_search | Find wiki pages by natural-language query | query |
| memorysmith_page_get | Fetch a page by its slug | slug |
Tasks
| Tool | Purpose | Key Arguments |
|---|---|---|
| memorysmith_task_list | List/filter task records | status, tag, assignee, query |
| memorysmith_task_get | Fetch a task by ID or key | id |
Agent-Mode Write Tools (only valid in Agent-mode examples)
| Tool | Purpose |
|---|---|
| memorysmith_task_create | Create a new task record |
| memorysmith_task_update | Update task fields |
| memorysmith_task_set_status | Transition task status |
| memorysmith_task_add_comment | Append a comment |
| memorysmith_task_add_attachment | Attach a file reference |
Code Search (read-only)
| Tool | Purpose | Key Arguments |
|---|---|---|
| memorysmith_code_search | Full-text search over indexed source | query, limit |
| memorysmith_code_search_status | Current index build status | (none) |
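The tool inventory above lends itself to a simple allowlist check on generated examples. This is a sketch under one loud assumption: agent mode is detected by looking for the phrase "Agent mode" in the system prompt, which matches the prompt variant in this request but is only a heuristic.

```python
# Tool names copied from the tables above.
READ_TOOLS = frozenset({
    "memorysmith_unified_search", "memorysmith_hybrid_search",
    "memorysmith_semantic_search", "memorysmith_search",
    "memorysmith_context_pack", "memorysmith_get",
    "memorysmith_page_search", "memorysmith_page_get",
    "memorysmith_task_list", "memorysmith_task_get",
    "memorysmith_code_search", "memorysmith_code_search_status",
})
WRITE_TOOLS = frozenset({
    "memorysmith_task_create", "memorysmith_task_update",
    "memorysmith_task_set_status", "memorysmith_task_add_comment",
    "memorysmith_task_add_attachment",
})


def is_agent_mode(system_prompt: str) -> bool:
    """Heuristic (assumption): agent-mode prompts mention 'Agent mode'."""
    return "Agent mode" in system_prompt


def tool_allowed(tool_name: str, system_prompt: str) -> bool:
    """Read tools are always allowed; write tools only in agent mode."""
    if tool_name in READ_TOOLS:
        return True
    return tool_name in WRITE_TOOLS and is_agent_mode(system_prompt)


plain = "You are Athena, MemorySmith's local wiki assistant."
agent = "You are Athena in Agent mode with auto_accept approval. Mutation tools are available."
print(tool_allowed("memorysmith_task_create", plain))  # write tool outside agent mode
print(tool_allowed("memorysmith_task_create", agent))
```

An unknown tool name is rejected in both modes, which also guards against invented tools slipping into the dataset.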
What to Generate
Produce at least 50 high-quality examples across the following categories. Aim for variety in phrasing, turn count, and content. Ground answers in real IDs, slugs, and facts pulled from the live knowledge base using the tools above.
Category Breakdown
| Category | Count | Notes |
|---|---|---|
| Single-tool retrieval (search → grounded answer) | 15 | Covers unified_search, hybrid_search, semantic_search, search |
| Context pack + multi-reference answer | 8 | Use context_pack with includeBacklinks: true |
| Direct get by ID or slug | 5 | Covers memorysmith_get, memorysmith_page_get |
| Task browsing | 5 | task_list with filters + task_get |
| Code search | 5 | Realistic codebase questions answered with source snippets |
| Multi-turn: search then follow-up | 6 | Second user turn asks for more detail, related records, or a correction |
| Agent-mode write | 4 | Create or update task, write a new memory draft |
| Graceful failure / clarification | 4 | Model has insufficient context, asks for clarification or says "I don't know" |
| Citation-focused | 3 | Include explicit - Source: memory:<id> footers |
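For tracking progress during generation, the targets in the table can be held in a small mapping. The category keys below are hypothetical labels chosen for this sketch; only the counts come from the table.

```python
# Target counts from the category table; the key names are illustrative.
TARGET_COUNTS = {
    "single_tool_retrieval": 15,
    "context_pack": 8,
    "direct_get": 5,
    "task_browsing": 5,
    "code_search": 5,
    "multi_turn": 6,
    "agent_mode_write": 4,
    "graceful_failure": 4,
    "citation_focused": 3,
}


def remaining_quota(generated: dict) -> dict:
    """Return the categories still short of target and by how many examples."""
    return {
        category: target - generated.get(category, 0)
        for category, target in TARGET_COUNTS.items()
        if generated.get(category, 0) < target
    }


print(sum(TARGET_COUNTS.values()))  # total across all categories
print(remaining_quota({"multi_turn": 4, "code_search": 5}))
```

The per-category targets add up to slightly more than the 50-example minimum, so meeting every row also satisfies the overall count.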
Quality Criteria
Each example must:
- Have a plausible user question grounded in MemorySmith's actual domain (wiki, tasks, code, training, chat, admin, ONNX, semantic search, etc.)
- Have an assistant response that either (a) requests exactly one focused tool call, or (b) provides a grounded prose answer with at least one cited ID/slug, or (c) is a clean graceful refusal when evidence is absent
- Not confabulate memory IDs or page slugs — only use IDs retrieved through the tools during distillation
- Follow the tool-call JSON contract exactly (no prose wrapping)
- Keep system prompts realistic; do not add tool call instructions into user turns
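The no-confabulation criterion can be partly automated by comparing every memory: and page: reference in an answer against the inventory collected during distillation. The sketch below assumes IDs and slugs consist of letters, digits, underscores, and hyphens; that character set is a guess, and the inventory values in the usage lines are placeholders.

```python
import re

# Matches memory:<value> or page:<value>; the allowed characters are an assumption.
REFERENCE_PATTERN = re.compile(r"\b(memory|page):([\w-]+)")


def find_unverified_references(text, verified_memory_ids, verified_page_slugs):
    """Return references in text that are missing from the verified inventory."""
    unverified = []
    for kind, value in REFERENCE_PATTERN.findall(text):
        inventory = verified_memory_ids if kind == "memory" else verified_page_slugs
        if value not in inventory:
            unverified.append(f"{kind}:{value}")
    return unverified


answer = (
    "Embeddings are stored per memory.\n\n"
    "- Source: memory:mem_onnx_001 - ONNX Semantic Search Pipeline\n"
    "- Source: page:training-notes"
)
# Placeholder inventory: one verified memory ID, no verified page slugs.
print(find_unverified_references(answer, {"mem_onnx_001"}, set()))
```

Template placeholders such as memory:<id> in system prompts are not flagged, because the angle bracket falls outside the assumed character set.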
Example Skeleton
// Tool-call example
{"messages":[
{"role":"system","content":"You are Athena, MemorySmith's local wiki assistant."},
{"role":"user","content":"What does MemorySmith use for semantic search?"},
{"role":"assistant","content":"{\"toolCalls\":[{\"name\":\"memorysmith_unified_search\",\"arguments\":{\"query\":\"semantic search ONNX embedding\",\"memoryLimit\":5,\"pageLimit\":3}}]}"}
]}
// Grounded prose example (after tool results would have been returned)
{"messages":[
{"role":"system","content":"You are Athena. Cite sources as memory:<id> or page:<slug>."},
{"role":"user","content":"How does the ONNX semantic search pipeline work?"},
{"role":"assistant","content":"MemorySmith's semantic search uses a MiniLM-L6-v2 ONNX model to produce 384-dimensional embeddings stored per-memory. At query time the same model embeds the query and cosine similarity ranks results.\n\n- Source: memory:mem_onnx_001 - ONNX Semantic Search Pipeline"}
]}
Required Output
- One or more .sft.jsonl files in Data/Training/exports/, named with the prefix distilled- and a date suffix, e.g.:
  - distilled-retrieval-20260529.sft.jsonl
  - distilled-agent-writes-20260529.sft.jsonl
  - distilled-code-search-20260529.sft.jsonl
- A summary comment (can be a Task comment on this task, or a page update here) describing:
  - How many examples were generated per category
  - Which memory IDs and page slugs were verified live
  - Any gaps discovered (topics with no coverage in the wiki)
  - Confidence level on factual accuracy of grounded answers
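The naming convention for output files can be checked with a short pattern. This sketch assumes the date suffix is always eight digits in YYYYMMDD form, as in the sample names, and that the topic segment between the prefix and the date is optional and lower-case; neither detail is stated outright in this request.

```python
import re
from datetime import datetime

# distilled-[<topic>-]<YYYYMMDD>.sft.jsonl (topic format is an assumption)
FILENAME_PATTERN = re.compile(
    r"^distilled-(?:[a-z0-9-]+-)?(?P<date>\d{8})\.sft\.jsonl$"
)


def is_valid_export_name(filename: str) -> bool:
    """Check the distilled- prefix, a real calendar date, and the extension."""
    match = FILENAME_PATTERN.match(filename)
    if not match:
        return False
    try:
        datetime.strptime(match.group("date"), "%Y%m%d")
    except ValueError:
        return False  # eight digits that do not form a valid date
    return True


for name in ("distilled-retrieval-20260529.sft.jsonl",
             "distilled-agent-writes-20260529.sft.jsonl",
             "retrieval-20260529.sft.jsonl"):
    print(name, is_valid_export_name(name))
```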
Acceptance Criteria
- [ ] ≥ 50 valid JSONL lines across all output files
- [ ] Each line parses as valid JSON with a messages array
- [ ] No confabulated IDs: all memory:<id> and page:<slug> references were retrieved via tools
- [ ] Tool-call lines contain only the JSON object (no surrounding text)
- [ ] At least 4 multi-turn examples (3+ turns each)
- [ ] At least 4 agent-mode examples
- [ ] Files placed in Data/Training/exports/
- [ ] Summary comment or update provided
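Several of these criteria are countable, so a small report function can pre-check a batch before submission. The sketch below makes two interpretive assumptions that the criteria leave open: a "multi-turn" example is one with at least three non-system messages, and an agent-mode example is one whose system prompt contains "Agent mode". The synthetic lines at the bottom exist only to exercise the function.

```python
import json


def acceptance_report(lines) -> dict:
    """Count examples, multi-turn examples, and agent-mode examples."""
    total = multi_turn = agent_mode = 0
    for line in lines:
        if not line.strip():
            continue
        messages = json.loads(line)["messages"]
        total += 1
        non_system = [m for m in messages if m["role"] != "system"]
        if len(non_system) >= 3:  # assumed reading of "3+ turns"
            multi_turn += 1
        if any(m["role"] == "system" and "Agent mode" in m["content"]
               for m in messages):
            agent_mode += 1
    return {
        "total": total,
        "multi_turn": multi_turn,
        "agent_mode": agent_mode,
        "passes": total >= 50 and multi_turn >= 4 and agent_mode >= 4,
    }


def _line(system: str, turns: int) -> str:
    """Build a synthetic example with alternating user/assistant turns."""
    messages = [{"role": "system", "content": system}]
    for i in range(turns):
        role = "user" if i % 2 == 0 else "assistant"
        messages.append({"role": role, "content": f"turn {i}"})
    return json.dumps({"messages": messages})


batch = [_line("You are Athena.", 2) for _ in range(42)]
batch += [_line("You are Athena.", 4) for _ in range(4)]
batch += [_line("You are Athena in Agent mode with auto_accept approval.", 2) for _ in range(4)]
report = acceptance_report(batch)
print(report)
```

Criteria that depend on the live knowledge base, such as the absence of confabulated IDs, still need the verified inventory and cannot be settled by counting alone.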
Notes for the Agent
- Start with memorysmith_unified_search or memorysmith_context_pack to orient yourself to the current state of the knowledge base before generating any examples.
- The live data is in Data/Memories/Core/ and Data/Pages/. Use it as your ground truth.
- Do not generate examples about capabilities that don't exist (e.g., external web browsing, image generation, SQL queries); Athena is a local wiki assistant only.
- Prefer realistic terse user questions over verbose or lecture-style prompts.
- For the "graceful failure" category, the assistant should acknowledge the gap and suggest what the user could provide or check, not confabulate an answer.