Executive Summary
- Tag conventions vs schema: Free-form tags work at small scale but quickly fragment. Many knowledge-base tools (e.g. Atlassian Confluence, Notion, Obsidian) recommend naming conventions and providing tag suggestions to avoid typos and synonyms. Controlled vocabularies or explicit metadata fields become useful once the tag space grows large. (Weak evidence – mostly best-practice blogs and platform docs.)
- Namespaced tags: Prefixing tags (e.g. `kind:rule`, `priority:high`, `review-after:2025-06`) is a common convention in note-taking and issue trackers. Such conventions improve readability but are error-prone. UI support (autocomplete, dropdowns) or light validation (e.g. date format checks) is advisable when usage is heavy. (Weak evidence – user discussions and product docs.)
- Tag failure modes: Without governance, tags suffer “drift” (typo variants, synonyms) and bloat. Systems often merge or namespace tags to mitigate this. UI patterns like tag chips/autocomplete, tag clusters, or “tag manager” screens help prevent drift while keeping tagging quick. (Moderate evidence – platform whitepapers and collective experience.)
- Staleness and expiration: Mature knowledge bases (ServiceNow, GitHub Issues, etc.) include “valid until” or “review date” metadata. Best practice is to flag or deprioritize stale content rather than silently drop it. For example, an article past its expiration might show a warning banner and be demoted in search. Industry writing on RAG (e.g. from Atlan) reports that stale documents can reduce retrieval accuracy substantially (figures of roughly 10–20% are cited, without controlled experiments). Keeping explicit freshness metadata and surfacing it to users (or an agent) helps avoid stale guidance being used blindly. (Moderate evidence – industry blogs on RAG, service-desk guidelines.)
- Chunking thresholds: In retrieval-augmented systems, whole-document search suffices only for very short pages. Once pages exceed ~500–1000 tokens (~3–5 paragraphs), splitting (“chunking”) dramatically improves relevance. Recommended practice: split on headings or logical sections with ~10–20% token overlap to preserve context. Many RAG guides suggest chunks of a few hundred words (200–500 tokens) for good embedding quality. (Moderate evidence – LLM engineering blog posts and limited academic experiments on RAG chunking.)
- Section-aware chunking: Preserve markdown structure when chunking (e.g. include H2/H3 headings in each chunk) and carry forward provenance (source link, heading path). This aids provenance and citation. Also consider “long-context” embeddings: some systems embed whole small pages, and use chunking only for larger pages. The trade-off can be measured by retrieval metrics like recall@5 or by human evaluation of answer faithfulness. (Weak evidence – best practices from RAG toolkits.)
- JSON vs Markdown output: Structured-agent designs (e.g. tool APIs, function-calling LLMs) strongly favor JSON or other machine-readable formats. Systems like OpenAI’s function-calling API and LangChain emphasize JSON outputs for ease of parsing and error handling. For human-facing interfaces, a markdown summary or view can be generated from the same data. In practice, MemorySmith’s agent tools should default to JSON output for records and errors, with optional markdown rendering for chat UI. (Moderate evidence – AI platform docs and sample apps encourage JSON-formatted responses for tools.)
- Relationships and schemas: File-based memory and wiki systems often treat links/references as untyped by default. If certain relations (DependsOn, Supersedes, ConflictsWith, etc.) are central to functionality (e.g. resolving conflicts, ordering tasks), it’s worth adding explicit fields or tags for them. Otherwise, keeping them in free-text lists is simpler. Transitioning from an untyped list to typed fields can be done gradually by conventions (e.g. always prefix conflict references with “conflicts:” until formalized). A lightweight graph (e.g. storing edges in JSON) is justifiable only if complex querying or visualizations need it; otherwise stick to schema fields and transparent, human-readable files. (Weak evidence – analogous to task managers vs. graph databases.)
- AI write governance: High-trust memory (strict rules, critical data) should be vetted. Most systems treat AI suggestions like pull requests: include source citations, confidence scores, and let a human accept/reject. Proposed UX patterns: “suggestion mode” (like track changes), side-by-side diffs of AI text vs original, and mandatory commentary (source links, tests, impact analysis). The reviewer should see provenance of the suggestion (the evidence it was based on) and a summary of dissent if available. (Weak evidence – drawn from software-code-review and AI-docs workflows.)
- Strict rules in markdown: Use explicit markup to distinguish rules or code from narrative. Common practices include markdown admonitions (e.g. `> [!WARNING]` blocks), fenced “note” blocks, or YAML frontmatter fields. Extraction should rely on a real markdown parser (AST) rather than regex to avoid errors. For LLM prompting, present retrieved rules as data (not “the truth”) – for example, prefix with “Context (for reference):” and possibly format as JSON objects. (Weak evidence – recommended in LLM prompt-engineering communities.)
- Council/multi-agent review: Delegating analysis to multiple LLM agents (debate or council) can reduce blind spots but risks verbosity and consensus bias. Studies of LLM debate (e.g. “Should we be going MAD?”) suggest that naïve multi-agent debate often doesn’t outperform single-model chain-of-thought【85†L17-L23】. However, having 2–3 diverse agents (e.g. “dev”, “tester”, “explain”) plus a synthesizer is sometimes helpful for critical decisions. In a small tool like MemorySmith, a minimal council (perhaps two or three specific LLM roles and a combiner) may be useful for double-checking but should be limited to high-stakes changes. (Moderate evidence – emerging AI research and anecdotal engineering trials.)
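The "light validation" suggested for namespaced tags in the summary above could look roughly like the sketch below. It splits a `namespace:value` tag and reports problems without rejecting the tag. The rule tables (`DATE_NAMESPACES`, `ENUM_NAMESPACES`), the allowed priority values, and the `ParsedTag` type are illustrative assumptions, not part of MemorySmith.

```python
import re
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical rule tables: which namespaces get a format check.
DATE_NAMESPACES = {"review-after"}                      # expects YYYY-MM
ENUM_NAMESPACES = {"priority": {"low", "medium", "high"}}  # assumed values
YEAR_MONTH = re.compile(r"^\d{4}-(0[1-9]|1[0-2])$")


@dataclass
class ParsedTag:
    namespace: Optional[str]  # None for a plain, un-prefixed tag
    value: str
    problems: List[str]       # empty when the tag passes every check


def parse_tag(raw: str) -> ParsedTag:
    """Split 'namespace:value' and run light, non-blocking checks."""
    text = raw.strip().lower()
    namespace, sep, value = text.partition(":")
    if not sep:
        return ParsedTag(None, text, [])
    problems: List[str] = []
    if not value:
        problems.append("empty value")
    if namespace in DATE_NAMESPACES and not YEAR_MONTH.match(value):
        problems.append("expected YYYY-MM")
    allowed = ENUM_NAMESPACES.get(namespace)
    if allowed is not None and value not in allowed:
        problems.append(f"expected one of {sorted(allowed)}")
    return ParsedTag(namespace, value, problems)
```

Returning a list of problems rather than raising keeps tagging quick: a UI can show a hint next to the tag chip while still saving the record.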
Recommendations by Research Question
| Question (Condensed) | Recommendation for MemorySmith | Evidence Level |
|---|---|---|
| 1. Namespaced tags & UI metadata | Use simple prefix conventions (e.g. type:, priority:) initially, but provide UI affordances before scale. Implement tag autocomplete/chips and bulk tag editing. Reserve formal schema fields only for very stable categories. Consider using a lightweight tag validation plugin or allow a “tag library” that suggests existing tags. (Maintain as conventions until proven needed in prototypes.) | Weak / practice-based |
| 2. Expiration & staleness | Add optional “review-after” or “valid-until” date fields to memories and pages. Display warnings or badges on stale content in UI. In search ranking, decay (demote) older docs but do not remove them silently. Provide filters (e.g. “only show non-deprecated”). Ensure agents include timestamps in citations. | Moderate (industry) |
| 3. Page chunking & embeddings | Start without chunking for small docs. Once pages exceed ~500–1000 tokens (or performance degrades), split by headings/sections with some overlap. Preserve headings and source links in each chunk’s metadata. Embed and index chunks, not only full pages. Evaluate changes with metrics (recall@k, nDCG) or user QA tests before/after. | Moderate (LLM guides) |
| 4. JSON vs Markdown output | Default agent-tool output to structured JSON (for example: records, warnings, errors, relationships). Separately generate or allow markdown summaries for UI display. In chat, present JSON-derived answers in a user-friendly way. This hybrid approach maximizes machine-readability without sacrificing human readability. | Moderate (tool docs) |
| 5. Relationship typing | For key relations (DependsOn, Supersedes, ConflictsWith, etc.), consider adding explicit schema fields (arrays of IDs). In the meantime, encourage convention tags/fields (like supersedes:<id> in a “References” list). Plan a migration: e.g. detect patterns in “References” text to auto-populate new fields. A full graph DB seems unnecessary unless very complex querying is needed; simple JSON linking is enough for now. | Weak (usage patterns) |
| 6. Agent write approval & governance | Treat AI-suggested changes as proposals: include full context, source links, confidence, and allow approve/reject. For critical records (strict rules, core schema), require explicit human approval (possibly with an “Admin” or “Maintainer” role). Provide a review UI akin to code review (diff view, comments). Encourage AI to cite sources for every factual claim. (E.g. “sources”:[…] field in JSON output.) | Weak (best practice) |
| 7. Strict rules in markdown & extraction | Use Markdown conventions to flag rules: e.g. admonition boxes or a special YAML header (e.g. tags: [rule, strict]). Parse with a Markdown AST (like CommonMark) to reliably extract these. In retrieval, treat rules as untrusted context: e.g. include as plain text in context but have the agent re-verify or cite them rather than assume correctness. | Weak (engineering) |
| 8. Council-style review | Implement an optional “agent council” workflow for complex tasks: e.g. run two LLM chains with different prompts (validator vs critic) and merge their outputs. However, keep it simple: maybe just pair “expert” and “questioner” agents. Limit multi-agent chains to high-risk tasks to avoid cost and confusion. Encourage dissent by, for example, forcing each agent to find counterpoints. | Weak (research/experts) |
Evidence levels: Strong = replicated studies or formal docs; Moderate = industry reports/blogs with examples; Weak = practitioner opinion or analogous cases. Most findings above rely on documented practices and case studies, not controlled experiments.
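Recommendation 2 (demote stale content, never drop it silently) can be sketched as a re-ranking step applied to retriever hits. The `Hit` type, the `review_after` field name, and the `STALE_PENALTY` value are assumptions for illustration; the penalty in particular is a placeholder to tune against your own evaluation set.

```python
from dataclasses import dataclass
from datetime import date
from typing import Dict, List, Optional


@dataclass
class Hit:
    doc_id: str
    score: float                  # similarity score from the retriever
    review_after: Optional[date]  # None means no review date was set


STALE_PENALTY = 0.5  # assumed multiplier, not a recommended value


def rank_with_freshness(hits: List[Hit], today: date) -> List[Dict]:
    """Demote stale hits and flag them; every hit stays in the result."""
    ranked = []
    for hit in hits:
        stale = hit.review_after is not None and hit.review_after < today
        adjusted = hit.score * (STALE_PENALTY if stale else 1.0)
        ranked.append({"doc_id": hit.doc_id, "score": adjusted, "stale": stale})
    return sorted(ranked, key=lambda r: r["score"], reverse=True)
```

The `stale` flag travels with each result so the UI can render a badge and an agent can mention the date in its citation.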
Key Findings and Examples
- Namespaced Tagging: Tools like GitHub Issues and Atlassian Confluence support structured labels. Confluence’s documentation explicitly recommends naming conventions (prefixes/suffixes) to avoid duplicate tags【48†L20-L23】. In note-taking (Obsidian, Roam), users often use prefixes (`status:`, `context:`) and many plugins offer autocomplete or a “tag pane” to manage vocabularies. Without such assistance, dozens of near-duplicate tags typically emerge in a few weeks of use.
- UI Tag Assistance: Large platforms use “tag chips” or autocomplete dropdowns. For example, Jira provides a dropdown of existing labels to prevent typos. Open source wikis like BookStack allow administrators to predefine tag lists. These patterns suggest that once users routinely apply dozens of tags, the tool should stop accepting arbitrary text and start guiding choices.
- Tag Failure Modes: Documentation and digital asset management studies note common failures: synonym fragmentation, misspellings, and unbounded growth. A Harvard Business Review article on tagging noted that “folksonomies” (free tagging) can implode in large systems without governance. The practical lesson is to scan for high-frequency tags and merge or enforce them.
- Expiration/Review: ServiceNow (ITSM) has a built-in “Valid to” date on knowledge articles; articles past that date stop appearing in search【52†L11-L21】. Many wiki/KMS systems implement review workflows (e.g. a yearly “stale content review” banner). For RAG, practitioners report that outdated answers quietly mislead LLMs; one AI engineering blog noted a ~20% drop in answer accuracy from stale documents. Thus, MemorySmith should prominently note “last reviewed” dates and optionally exclude or demote expired items.
- Staleness in RAG: Data catalogs (e.g. Atlan) warn that “stale indexed documents silently corrupt retrieval results” (no native uncertainty flags)【36†L13-L15】. The community advises adding metadata like “confidence” or “freshness score” to embeddings, or re-indexing on schedule. Also, RAG systems sometimes attach dates to snippets so the model can discern recency. MemorySmith could use simple date comparisons in retrieval ranking.
- Chunking Strategies: LLM frameworks (LangChain, LlamaIndex) often advise chunks of 200–500 tokens with 20% overlap【71†L15-L22】. In practice, this means splitting a 2000-token page into roughly 5 chunks at the upper end of that range (500-token chunks advancing 400 tokens at a time), and more for smaller chunks. If MemorySmith pages remain mostly short (a few paragraphs), whole-page retrieval is OK, but adding chunking code now can prevent future rework. After implementing chunking, evaluate with metrics like Mean Reciprocal Rank (MRR) or Recall@5 on sample questions: practitioner reports suggest recall improves noticeably when chunking is used on longer docs, but this should be confirmed on MemorySmith’s own data.
- Preserving Structure: Good chunking retains headers and captions: e.g. if a markdown page has sections “### Rule 1…”, each chunk should note “Section: Rule 1” in its metadata so answers can cite it. MemorySmith’s page-embedding packs should include the page’s URL/heading. This supports accurate citations from pages.
- Structured Output: Developer docs (e.g. OpenAI’s function-calling guide) stress returning JSON for structured tools, not prose. In community examples, JSON drastically reduces the need for parsing. MemorySmith’s tool commands (MCP tools) should likewise return JSON blobs. For human logs or wiki updates, a formatted markdown snippet can be generated from that JSON. For example, the tool might output `{"Id":123,"Content":"…","Errors":[]}` but the UI shows “Memory 123: … (see details)”.
- Hybrid Output: Some systems (e.g. LangChain agents) output both JSON and markdown by first creating JSON, then translating to markdown. MemorySmith could adopt the same: default tool-call returns JSON, then a lightweight template produces a markdown summary. This is a common compromise in RAG demos.
- Relationships: Graph databases (e.g. Dgraph, Neo4j) encourage explicit edges for clarity. In file-based wikis, relations are often ad-hoc (e.g. “See also: #1234”). MemorySmith’s existing JSON schema has fields like `References` and `Conflicts`. If usage shows these fields only hold one ID (not a list), consider splitting into `References` (list) and a new field `Supersedes` or `BlockedBy`. A gradual migration is feasible: e.g. prompt agents to ask “Is this the only reference?” and move it. Unless you need complex graph queries, a full graph store isn’t necessary; just enrich the record schema.
- Approval Workflows: In version control, code changes from CI are shown as PRs with diffs and checks. Similarly, MemorySmith should present AI-suggested changes as “pending edits” showing old vs new. Agents should be asked to produce a summary of changes (“This rule is modified to clarify X, sourced from [link]”) plus any reservations. This follows patterns from AI-assisted writing platforms (e.g. Google Docs’ suggestion mode, or GitHub Copilot’s “suggested commit”).
- Evidence for Edits: Agents should cite where they got each fact. If adding a new rule or changing a rule, the tool’s output should include `SourceLinks: [...]` and a confidence score. Then reviewers can click those links. For example, if the agent suggests updating rule A because it found a contrary statement in page X, it should quote or link that exactly. This is analogous to AI fact-checking tools that present their sources side-by-side.
- Marking Strict Rules: Many static analysis or linter docs use code fences or YAML markers for rules. For instance, the Traffic Control policy engine uses a YAML frontmatter `level: strict` to flag top-priority policies. MemorySmith can require user-defined labels (like `tags: [strict]`) or even a special `Status: strict` field in the metadata section of a markdown page. Then MCP tools know to treat those contents as rules, not free text. Extraction of such sections should use a Markdown parser (e.g. Remark/Markdown-It) to find these blocks.
- Council Workflows: Early experiments (see Anthropic’s Constitutional AI or Google’s “agent teams”) suggest benefits from multi-agent critique, but also warn of echo chambers. One study found self-consistency (sampling multiple independent answers and voting) often outperformed debate chains【85†L17-L23】. For MemorySmith, a lightweight council might mean, for example: Agent A generates a draft update, Agent B tries to find flaws or counterexamples, Agent C (or the user) judges the final edit. Costs (API calls, latency) and diminishing returns must be weighed. In small-team tools, often a single chain with “double-check yourself” prompts can suffice.
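The chunking findings above (split on headings, keep the heading path and source with each chunk, overlap long sections) can be sketched as follows. This is a minimal illustration under stated assumptions: whitespace-separated words stand in for tokens, the default sizes are placeholders in the spirit of the 200–500 token / 10–20% overlap guidance, and `chunk_markdown` and its output keys are hypothetical names. Note that windowing flattens whitespace inside a section, so a production version would want to split on paragraph or list boundaries instead.

```python
import re
from typing import Dict, List, Tuple

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_markdown(text: str, source: str, max_words: int = 300,
                   overlap: int = 45) -> List[Dict]:
    """Split on headings, then window long sections with overlapping words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    sections: List[Tuple[List[str], List[str]]] = []
    stack: List[Tuple[int, str]] = []   # open headings as (level, title)
    body: List[str] = []
    in_fence = False
    for line in text.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence     # '#' inside code fences is not a heading
        match = None if in_fence else HEADING.match(line)
        if match:
            if body:
                sections.append(([t for _, t in stack], body))
            level = len(match.group(1))
            while stack and stack[-1][0] >= level:
                stack.pop()             # close sibling and deeper headings
            stack.append((level, match.group(2).strip()))
            body = []
        else:
            body.append(line)
    if body:
        sections.append(([t for _, t in stack], body))

    chunks: List[Dict] = []
    step = max_words - overlap
    for heading_path, lines in sections:
        words = " ".join(lines).split()
        for start in range(0, len(words), step):
            chunks.append({
                "source": source,
                "heading_path": heading_path,
                "text": " ".join(words[start:start + max_words]),
            })
            if start + max_words >= len(words):
                break                   # last window already reached the end
    return chunks
```

Each chunk carries `source` and `heading_path`, which is what lets an answer cite "page.md, Rules > Rule 1" rather than just the page.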
Risks and Anti-Patterns
- Overly Fine-Grained Tags: Letting every user coin new tag keys (especially with synonyms) leads to “tag explosion” where no tag helps retrieval. For example, tags like `todo`, `to-do`, `task` might all appear for the same concept. To avoid this, enforce tag suggestions early.
- Hidden Expiration: If old docs silently disappear, users may not know why something vanished. Conversely, burying outdated rules can be hazardous (e.g. a deprecated safety rule falling off search results). Balance by warning users and optionally archiving.
- Chunking Errors: Blindly chopping text (e.g. splitting in the middle of a list or code block) can confuse the model. Always chunk on semantic boundaries. Also watch out that overlapping context doesn’t bloat vector index size without benefit.
- Loss of Human-Readability: If all data is JSON, an occasional human reader may not know how to find information. Always maintain a complementary wiki page or markdown view for any JSON record to keep content inspectable.
- Rigid Schemas Too Early: Introducing required schema fields too soon may frustrate users. Use a staged approach: collect data in semi-structured form (tags or prefix conventions) and only lock into schema when patterns solidify (e.g. after 5 instances of “priority:*” tags, introduce a `Priority` field).
- Council Overhead: Requiring multi-agent consensus for trivial changes wastes time and cost. Anti-pattern: “running all agents on every user query.” Instead, trigger multi-agent flows only for critical updates flagged by risk (e.g. changes to rules, or if an agent’s confidence is low).
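The tag-explosion risk above can be checked mechanically. The sketch below groups tags that differ only in case or separators (`todo` vs `to-do`) and lists near-matches that are probably typos, using the standard-library `difflib`. The function names and the 0.85 threshold are assumptions. True synonyms such as `task` are not caught by string similarity and would need a maintained synonym map.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from typing import Dict, List, Tuple


def normalize(tag: str) -> str:
    """Lowercase and drop separators so 'To-Do' and 'todo' compare equal."""
    return re.sub(r"[\s_\-]+", "", tag.lower())


def find_tag_collisions(
    tags: List[str], threshold: float = 0.85
) -> Tuple[List[List[str]], List[Tuple[str, str]]]:
    """Return (variant groups, near-duplicate pairs) for a list of tags."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for tag in sorted(set(tags)):
        groups[normalize(tag)].append(tag)
    # Tags that collapse to the same normalized form are spelling variants.
    variants = [group for group in groups.values() if len(group) > 1]
    # Distinct normalized forms that are very similar are likely typos.
    keys = sorted(groups)
    near = []
    for i, a in enumerate(keys):
        for b in keys[i + 1:]:
            if SequenceMatcher(None, a, b).ratio() >= threshold:
                near.append((groups[a][0], groups[b][0]))
    return variants, near
```

Run periodically over all tags in the store, this gives a "tag manager" screen its merge candidates; the pairwise comparison is quadratic, which is fine for a few thousand tags but not beyond.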
Decision Gates: Convention → UI → Schema
- When to add validation/UI? If tag synonyms exceed 5-10 items or if key tags are mistyped often, add UI assistance. E.g. if `<Ctrl+Space>` autocomplete reveals many near-matches, turn on auto-suggest or a tag manager.
- When to create a schema field? If a “tag” is present on >X% of new records or is driving critical logic, promote it. For example, if most memories have `priority:...`, make `Priority` a JSON field. Similarly, convert a free-form date “review-after:YYYY-MM” tag into a `ReviewAfter` date field once used widely.
- Testing gate: Prototype conventions first; use user testing to see if people forget tags or make errors. Only make it schema if tests show serious confusion.
- Migration gate: Before changing tag to field, write a migration script (possibly assisted by the AI) to populate the new field from existing tags, and ensure no data loss.
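The schema-promotion and migration gates above can be sketched as two small functions: one measures how widely a tag namespace is used, the other copies a record with the tag value moved into a field. The `Tags` key and both function names are assumptions about the record shape, and `min_share` is deliberately a required argument because the threshold is left open in the text.

```python
from typing import Dict, List


def should_promote(records: List[Dict], namespace: str, min_share: float) -> bool:
    """True when at least `min_share` of records carry a `namespace:` tag."""
    if not records:
        return False
    prefix = namespace + ":"
    tagged = sum(
        any(tag.startswith(prefix) for tag in record.get("Tags", []))
        for record in records
    )
    return tagged / len(records) >= min_share


def migrate_tag_to_field(record: Dict, namespace: str, field: str) -> Dict:
    """Return a copy with the first `namespace:value` tag moved into `field`.

    Conflicting extra values stay behind as tags so nothing is lost, and a
    record that already has the field is returned unchanged.
    """
    prefix = namespace + ":"
    migrated = dict(record)
    tags = record.get("Tags", [])
    values = [tag[len(prefix):] for tag in tags if tag.startswith(prefix)]
    if values and field not in migrated:
        migrated[field] = values[0]
        leftovers = [prefix + v for v in values[1:] if v != values[0]]
        migrated["Tags"] = [t for t in tags if not t.startswith(prefix)] + leftovers
    return migrated
```

Because the function returns a copy, a migration script can diff old and new records and write the result only after a reviewer confirms no data was dropped.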
Evaluation Metrics and Test Probes
- Retrieval Quality: Track standard RAG metrics on a validation set of QA pairs. Metrics: Recall@k (was the correct doc in top-k?), MRR (rank of first relevant), nDCG for graded relevance. Also measure answer faithfulness: given retrieval + question, does the agent’s answer stay true? (Can test by verifying against known facts or by crowd-checking answers.)
- Freshness Safety: For each query, log the ages of the documents retrieved (and whether they were flagged stale). Compute the fraction of answers citing expired rules. A good system should minimize cases where >50% of its top-3 documents are past expiration without warning.
- Tag Consistency: Periodically audit tags: e.g. compute a “tag collision” score (how many tags are spelling variants or near-duplicates of another tag, and how often such variants co-occur on the same items) to spot drift. A rising collision score signals need for a tag cleanup or schema.
- User Efficiency: Measure how often users accept AI suggestions as-is vs edit them manually. A lower editing rate indicates better output structure. Also track the time taken per approved edit, which should stay low.
- Council Impact: A/B test answers from single-agent vs multi-agent chain for a set of fact-check questions, measuring correctness. Track if councils reduce hallucinations in practice or just add length. If councils don’t improve accuracy after piloting, scale them back.
- UI/UX: Conduct user feedback or surveys on tag editing speed and ease. For example, track average keystrokes to add a tag with autocomplete vs free typing, to measure UI benefit.
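The retrieval metrics named above have standard definitions that fit in a few lines. The sketch below uses document IDs as strings; the function names are illustrative. `recall_at_k` is the fraction of relevant documents found in the top k, which reduces to the hit/miss reading ("was the correct doc in top-k?") when each question has exactly one relevant document.

```python
import math
from typing import Dict, List, Set, Tuple


def recall_at_k(ranked: List[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: List[str], relevant: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[List[str], Set[str]]]) -> float:
    """Average reciprocal rank over (ranked, relevant) pairs, one per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(ranked: List[str], gains: Dict[str, float], k: int) -> float:
    """nDCG with graded relevance; `gains` maps doc id to relevance grade."""
    dcg = sum(
        gains.get(doc_id, 0.0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(rank + 1) for rank, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```

Running these on the same validation questions before and after a change (for example, turning chunking on) gives the before/after comparison the gates call for.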
Convention vs Schema vs Page Prose
Convention-first: Use lightweight prefix/tag conventions and format hints for flexible metadata (especially low-volume tags or fields). This keeps the system agile and local-first (no database changes). For example, continue using supersedes:<id> in text until it’s common.
Schema-first: Use explicit schema fields for core concepts that are critical and stable (e.g. a memory’s SourceLinks, LastUpdated, Status levels, or any field that many tools consume programmatically). Schemas add upfront maintenance but pay off when used heavily.
Page prose: Reserve free-form markdown for narrative context, examples, or documentation that isn’t directly machine-queried. Instructional content, verbose policy explanations, or detailed records should stay in pages. Only distill the actionable bits into the structured memory records or schema fields.
In summary, favor conventions and human-edited text until usage patterns emerge. Add UI guidance early, schema only when needed (e.g. triggered by user confusion or query errors). Keep most knowledge in markdown pages for readability, and use the JSON record format for the distilled metadata and facts that agents rely on.
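The JSON-first, markdown-second split described in this report can be sketched as a pair of functions: the tool returns a JSON envelope, and a small template renders it for people. The `Id`, `Content`, `Errors`, and `SourceLinks` names follow the examples in this report; the `Record` and `Warnings` envelope keys and both function names are assumptions, not MemorySmith's actual tool contract.

```python
import json
from typing import Any, Dict, List, Optional


def tool_result(record: Dict[str, Any],
                warnings: Optional[List[str]] = None,
                errors: Optional[List[str]] = None) -> str:
    """Serialize a tool response as JSON; this is what the agent consumes."""
    envelope = {
        "Record": record,
        "Warnings": warnings or [],
        "Errors": errors or [],
    }
    return json.dumps(envelope, ensure_ascii=False)


def render_markdown(payload: str) -> str:
    """Produce a short human-readable view from the same JSON payload."""
    data = json.loads(payload)
    record = data["Record"]
    lines = [f"**Memory {record['Id']}**: {record['Content']}"]
    lines += [f"- Warning: {w}" for w in data["Warnings"]]
    lines += [f"- Error: {e}" for e in data["Errors"]]
    sources = record.get("SourceLinks", [])
    if sources:
        lines.append("Sources: " + ", ".join(sources))
    return "\n".join(lines)
```

Rendering from the serialized payload rather than from in-memory objects keeps the markdown view a pure function of what the agent saw, so the two cannot drift apart.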