Deep Research Prompt: Agent Memory, Search, and Wiki Architecture
Use this prompt in Microsoft Copilot, ChatGPT Deep Research, or a comparable research mode. The goal is to answer externally researchable questions from the Core Memory System Improvements RFC. Do not use this prompt to decide local-only MemorySmith policy by fiat; use it to gather evidence, patterns, trade-offs, and examples that can inform a later council review.
Prompt
You are performing deep technical research for MemorySmith, a local-first ASP.NET Core/Blazor app for structured project memory, markdown wiki pages, MCP tools, and memory-enhanced chat.
Context:
- MemorySmith stores structured memories as JSON records with fields like Id, Content, Status, Confidence, Tags, References, Conflicts, SourceLinks, UsageCount, and LastUpdated.
- Longer human-readable documentation lives in markdown pages.
- Search currently includes lexical, semantic, and hybrid retrieval. Semantic retrieval may use local ONNX embeddings or fallback token scoring. Hybrid uses rank fusion.
- MCP/chat tools can return memory search results, page results, context packs, and source-linked evidence.
- The current design question is whether to improve agent usefulness through lightweight conventions first, through explicit schema fields, or through a staged path from conventions to schema when evidence justifies it.
- Important local constraints: local-first operation, inspectable file-backed storage, source-grounded memory records, human-readable wiki pages, and no unnecessary service/database complexity unless evidence shows it is needed.
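Illustrative sketch only: the context above says hybrid search "uses rank fusion" without naming the method. The example below assumes reciprocal rank fusion (RRF), a common choice. The function name, the constant k = 60, and the record IDs are hypothetical and are not taken from MemorySmith's code.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of record IDs into one ranked list.

    Each ID scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used default, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, record_id in enumerate(ranking, start=1):
            scores[record_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda rid: (-scores[rid], rid))


lexical = ["mem-1", "mem-2", "mem-3"]   # hypothetical lexical ranking
semantic = ["mem-2", "mem-4", "mem-1"]  # hypothetical semantic ranking
fused = reciprocal_rank_fusion([lexical, semantic])
print(fused)
```

Records that appear in both lists accumulate score from each, so they tend to rise above records found by only one retriever.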
Research objective:
Produce an evidence-backed report that helps MemorySmith decide how to evolve agent memory, retrieval, wiki pages, and chat traces for long-term usefulness. Focus on practices supported by reputable sources, implementations, papers, product docs, or widely used open-source systems. Distinguish strong evidence from opinion.
Questions to research:
1. Namespaced tags and UI-assisted metadata
- In knowledge-base, note-taking, personal knowledge management, issue-tracking, or AI memory systems, when do tag conventions remain manageable, and when do they need UI-assisted chips, controlled vocabularies, validation, or schema fields?
- What conventions are commonly used for namespaced tags such as kind:rule, priority:critical, review-after:YYYY-MM, expires:YYYY-MM, supersedes:<id>, or superseded-by:<id>?
- What failure modes are documented for free-form tags in long-lived knowledge bases?
- What UI patterns best prevent typo drift while keeping editing fast?
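Illustrative sketch only: one lightweight way to catch typo drift in namespaced tags is a small validation table checked at save time. The namespaces, allowed values, and function name below are hypothetical examples, not an existing MemorySmith vocabulary.

```python
import re

YEAR_MONTH = re.compile(r"\d{4}-(0[1-9]|1[0-2])")

# Hypothetical controlled vocabulary: namespace -> validator for the value.
TAG_RULES = {
    "kind": lambda v: v in {"rule", "decision", "context"},
    "priority": lambda v: v in {"low", "normal", "critical"},
    "review-after": lambda v: YEAR_MONTH.fullmatch(v) is not None,
    "expires": lambda v: YEAR_MONTH.fullmatch(v) is not None,
}


def check_tag(tag):
    """Return None if the tag is acceptable, else a human-readable problem."""
    namespace, sep, value = tag.partition(":")
    if not sep:
        return None  # plain free-form tag: allowed, not validated
    rule = TAG_RULES.get(namespace)
    if rule is None:
        return f"unknown namespace '{namespace}'"
    if not rule(value):
        return f"invalid value '{value}' for '{namespace}'"
    return None


tags = ["kind:rule", "priorty:critical", "review-after:2025-13", "search"]
problems = {}
for tag in tags:
    problem = check_tag(tag)
    if problem is not None:
        problems[tag] = problem
print(problems)
```

The same table could drive UI chips or autocomplete, which is the kind of convention-to-validation step the questions above ask about.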
2. Expiration, review dates, and staleness handling
- How do mature systems handle expiring knowledge, review-after dates, deprecated docs, superseded docs, and stale search results?
- Is it better to warn, filter, demote, archive, or automatically move records to a deprecated state?
- What evidence exists about risks of time-decay ranking in knowledge retrieval, especially burying old but still-important rules?
- What ranking or warning patterns are recommended for LLM retrieval-augmented generation when records may be stale?
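Illustrative sketch only: of the options listed above, "warn" can be implemented without touching ranking at all. The example assumes review-after: and expires: tags in YYYY-MM form that have already been validated; the function names and result shape are hypothetical, and it is shown as one option to evaluate, not as a recommendation.

```python
from datetime import date


def staleness_warnings(tags, today):
    """Return warnings for review-after:/expires: tags whose month has passed.

    A tag month counts as passed once `today` falls in a later month.
    Assumes tag values were already validated as YYYY-MM.
    """
    warnings = []
    for tag in tags:
        namespace, _, value = tag.partition(":")
        if namespace not in ("review-after", "expires"):
            continue
        year, month = (int(part) for part in value.split("-"))
        if (today.year, today.month) > (year, month):
            label = "expired" if namespace == "expires" else "review overdue"
            warnings.append(f"{label} since {value}")
    return warnings


def annotate(results, today):
    """Attach warnings to each result without changing the ranking order."""
    return [{**r, "warnings": staleness_warnings(r["tags"], today)} for r in results]


results = [  # hypothetical search results, already ranked
    {"id": "mem-1", "tags": ["kind:rule", "review-after:2024-06"]},
    {"id": "mem-2", "tags": ["expires:2030-01"]},
]
annotated = annotate(results, today=date(2025, 1, 15))
print(annotated)
```

Because the order is untouched, an old but still-important rule keeps its rank and simply carries a visible warning.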
3. Page chunking and page embeddings thresholds
- What practical thresholds indicate that whole-page search is no longer enough and page chunking should be added?
- What chunk sizes, overlap, section-aware chunking, markdown-heading preservation, provenance fields, and citation practices are recommended for RAG over documentation?
- When should markdown pages be embedded alongside structured records?
- What evaluation metrics should be used before and after page chunking: MRR, recall@k, nDCG, answer faithfulness, citation accuracy, latency, or human usefulness ratings?
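Illustrative sketch only: two of the metrics named above, MRR and recall@k, can be computed from a small set of search probes. The probe format (a ranked list of IDs plus the set of relevant IDs) and the function names are assumptions made for this example.

```python
def reciprocal_rank(ranked_ids, relevant_ids):
    """Return 1/rank of the first relevant result, or 0.0 if none is found."""
    for rank, record_id in enumerate(ranked_ids, start=1):
        if record_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def recall_at_k(ranked_ids, relevant_ids, k):
    """Return the fraction of relevant IDs that appear in the top k results."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & set(relevant_ids))
    return hits / len(relevant_ids)


def evaluate(probes, k):
    """Average both metrics over all probes."""
    n = len(probes)
    mrr = sum(reciprocal_rank(p["ranked"], p["relevant"]) for p in probes) / n
    recall = sum(recall_at_k(p["ranked"], p["relevant"], k) for p in probes) / n
    return {"mrr": mrr, "recall_at_k": recall}


probes = [  # hypothetical probes: search output vs. hand-labelled relevant IDs
    {"ranked": ["a", "b", "c"], "relevant": {"b"}},
    {"ranked": ["d", "e", "f"], "relevant": {"f", "x"}},
]
metrics = evaluate(probes, k=2)
print(metrics)
```

Running the same probes before and after a chunking change gives a like-for-like comparison on the same corpus.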
4. JSON vs Markdown output for agent tools
- For agent-facing tools, what evidence or patterns support JSON as the default output instead of Markdown prose?
- How do MCP tools, LLM tool-calling APIs, agent frameworks, or structured-output systems recommend returning records, warnings, relationship metadata, and errors?
- What hybrid design works best when humans may inspect the same output: JSON default for tools, Markdown summaries for UI, or both?
5. Relationship typing: schema, graph store, or convention notes
- In file-backed knowledge systems or AI memory systems, when should relationships be represented as typed schema fields, graph edges, tags, or prose notes?
- Compare approaches for DependsOn, Supersedes, ConflictsWith, ResolvedBy, IsContextFor, and related edge types.
- What are practical migration strategies from flat References/Conflicts arrays to typed relationships?
- When is a lightweight graph store justified, and when is schema enough?
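Illustrative sketch only: one possible additive migration from the flat References and Conflicts arrays to typed relationships. The Relationships field, its Type and TargetId keys, and the generic RelatedTo type are hypothetical names introduced for this example; only References, Conflicts, and ConflictsWith come from the text above.

```python
def migrate_relationships(record):
    """Return a copy of the record with a typed Relationships list added.

    References carry no type information, so they become the generic
    'RelatedTo'; Conflicts map to 'ConflictsWith'. The flat arrays are kept
    so existing readers continue to work during the transition.
    """
    typed = [
        {"Type": "RelatedTo", "TargetId": target}
        for target in record.get("References", [])
    ]
    typed += [
        {"Type": "ConflictsWith", "TargetId": target}
        for target in record.get("Conflicts", [])
    ]
    return {**record, "Relationships": typed}


old = {"Id": "mem-7", "References": ["mem-1", "mem-2"], "Conflicts": ["mem-3"]}
new = migrate_relationships(old)
print(new["Relationships"])
```

Keeping the old arrays alongside the new field makes the change reversible, at the cost of temporary duplication that a later cleanup would need to remove.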
6. Agent write approval and governance
- What approval models are used for AI-generated durable knowledge, documentation changes, or memory writes?
- What role or permission model is typical for high-trust changes such as strict rules, Core records, deprecations, or schema-affecting updates?
- What evidence should an AI write proposal include before approval: source links, citations, confidence, dissent, trace, tests, related records, rollback notes?
- What UX patterns help humans review AI-proposed knowledge updates without turning approval into rubber-stamping?
7. Strict rules in markdown and extraction reliability
- Are GitHub Flavored Markdown alert blocks, admonitions, frontmatter, or structured sections commonly used to distinguish strict rules from ordinary context?
- What parsers or markdown AST approaches are recommended over regular expressions for robust extraction?
- What prompt or context-pack formatting helps LLMs respect retrieved rules while treating retrieved content as untrusted data?
8. Council-style multi-perspective review
- What evidence exists for multi-agent, multi-perspective, red-team, debate, or council workflows improving technical decisions or reducing blind spots?
- What are the costs and failure modes: groupthink, verbosity, false consensus, repeated assumptions, model collusion, or overconfidence?
- What minimal council workflow is practical for a small local-first developer tool?
Deliverables:
- Executive summary with the top 10 findings.
- A table mapping each research question to recommendations for MemorySmith.
- Evidence levels: strong evidence, moderate evidence, weak evidence/opinion.
- Concrete examples from real systems, docs, papers, or open-source projects, with links.
- Risks and anti-patterns to avoid.
- Recommended decision gates for when to stay with conventions, when to add validation/UI, and when to promote a convention into schema.
- Suggested evaluation metrics and test probes for retrieval quality and stale-context safety.
- A short answer to this key decision: for a local-first AI memory/wiki/chat system, what should be convention-first, what should be schema-first, and what should remain page prose?
Constraints for your answer:
- Do not assume a cloud database, SaaS tenancy, or distributed architecture is required.
- Prefer local-first and file-backed-friendly designs unless there is strong evidence that complexity is worth it.
- Distinguish external best practices from decisions that require local MemorySmith product judgment.
- Cite sources with URLs and explain why each source is relevant.
- If evidence is thin, say so directly.
Questions Excluded From External Research
These require local product judgment or code review more than external research:
- The exact role required to approve Agent writes in MemorySmith. External systems can inform the model, but final roles must fit MemorySmith's RBAC and local deployment assumptions.
- The exact threshold for MemorySmith page chunking. External RAG thresholds can guide this, but final thresholds should be based on MemorySmith corpus size, latency, and search-quality probes.
- Whether memorysmith_context_pack should change its default output. External tool-output patterns can inform this, but compatibility with current MCP/chat consumers must be checked locally.
How To Use The Results
After research returns, run a Council Workflow review before changing code or schema. The council should separate:
- findings that update the RFC immediately;
- findings that justify a prototype;
- findings that require local benchmarks or user testing;
- findings that should be rejected because they conflict with MemorySmith's local-first goals.
The latest captured response has been summarized in Deep Research Intake Notes.