Review Intelligence
Architecture Deep Dive

How to Actually Build the Customer Voice Skill

Three ways to structure your review data inside Mind OS — each optimized for different types of questions your team will ask. Based on Anthropic's own research, Karpathy's LLM Wiki pattern, and how your existing skills are built.

Can You Ask "Give Me 5 Amazing Reviews About :Rise"?

Short answer: yes, but only if the data is structured right. Here's why it matters:

Claude's context window is like a desk. You can spread a lot of papers on it (~500 pages worth). Claude can find specific items on that desk — but there's a catch called the "lost in the middle" problem.

The Science: Stanford and UC Berkeley found that AI models pay the most attention to what's at the beginning and end of the context — and their accuracy drops 30%+ for information buried in the middle. So if you dump 1,000 raw reviews in and ask for the best ones about taste, Claude will reliably find reviews near the top and bottom, but might miss great ones in the middle.

The fix isn't to use fewer reviews. It's to organize them so Claude knows where to look — with tags, headers, and an index. Anthropic's own testing found three techniques that dramatically improve retrieval:

  • Structure the data with XML-style tags and metadata, so reviews can be matched by tag instead of by position
  • Put the longform data at the top of the prompt, before the question and instructions
  • Ask Claude to quote the relevant passages first, then answer: in Anthropic's testing, this raised retrieval accuracy from 27% to 98%

All three options below use these techniques. The difference is how they organize the files.

How Much Review Data Actually Fits

Claude's context window is 200K tokens. But it's not all yours — the system, conversation, and other skills take up space. Here's the realistic budget:

What's Using the Space                            Tokens
System + tools + other skills                    ~20,000
Your conversation with Claude                    ~10,000
Claude's "thinking room" (needs space to reason) ~20,000
Available for review data                       ~150,000

A typical customer review (100-200 words + metadata tags) is about 200-300 tokens. That means the available budget holds roughly 500-750 curated reviews at once, far more than any single question should ever need to load.

Key insight: Anthropic recommends filling no more than ~60% of the context window. Past that, reasoning quality degrades. So the skill should be designed to load only what's relevant to each question — not everything every time.
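As a sanity check, the budget above works out in a few lines. All numbers are the estimates from the table, not measured values:

```python
# Back-of-envelope context budget for the review skill.
CONTEXT_WINDOW = 200_000
OVERHEAD = 20_000 + 10_000 + 20_000    # system/tools, conversation, thinking room
AVAILABLE = CONTEXT_WINDOW - OVERHEAD  # tokens left for review data

# A typical tagged review runs 200-300 tokens.
fits_low = AVAILABLE // 300    # conservative estimate of reviews that fit
fits_high = AVAILABLE // 200   # optimistic estimate

# Anthropic's ~60% guidance, applied to the whole window:
SAFE_BUDGET = int(CONTEXT_WINDOW * 0.60)
```

Running this gives 150K tokens available and roughly 500-750 reviews of theoretical capacity, while the 60% guidance caps total fill (data plus everything else) near 120K tokens. That gap is exactly why the skill should load files selectively rather than everything at once.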

How to Structure It

Option 01: Flexible

The Wiki — One File Per Topic

Inspired by Karpathy's LLM Wiki pattern. Every topic gets its own file — one for each product, one for each theme, one for each persona. Each file has a summary at the top and curated raw reviews below. An index file tells Claude where everything lives.

mudwtr-customer-voice/
  SKILL.md — instructions for Claude
  references/
    index.md — master catalog of all files
    products/
      rise.md — :rise summary + top 40 reviews
      rest.md
      balance.md
      starter-kit.md
    themes/
      taste.md — cross-product taste insights + reviews
      energy.md
      ritual.md
      quitting-coffee.md
    personas/
      coffee-quitter.md
      wellness-optimizer.md
      ritual-seeker.md
    swipe/
      ad-hooks.md — pre-built hooks from review language
      testimonial-blocks.md
      objection-handling.md

Example Question
"Give me 5 amazing reviews about :rise that talk about energy benefits"
Claude reads index.md, loads products/rise.md + themes/energy.md, finds tagged reviews that match both :rise AND energy, returns 5 verbatim reviews with context.
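That lookup is easy to picture in code. Here is a minimal sketch, assuming a hypothetical tagged-review format inside the reference files; the tag names and attributes are illustrative, not a fixed spec:

```python
import re

# Hypothetical tagged-review format inside a file like products/rise.md:
RISE_MD = """\
<review product=":rise" themes="energy,taste" rating="5">
Swapped my morning coffee for :rise and my energy is steady all day. No crash.
</review>
<review product=":rise" themes="ritual" rating="4">
The frothing step is my favorite part of the morning now.
</review>
"""

def find_reviews(text: str, product: str, theme: str, limit: int = 5):
    """Return up to `limit` verbatim reviews matching product AND theme tags."""
    pattern = re.compile(
        r'<review product="([^"]+)" themes="([^"]+)"[^>]*>\n(.*?)\n</review>',
        re.DOTALL,
    )
    hits = []
    for prod, themes, body in pattern.findall(text):
        if prod == product and theme in themes.split(","):
            hits.append(body.strip())
    return hits[:limit]

matches = find_reviews(RISE_MD, ":rise", "energy")
```

Claude does this matching natively rather than via regex, but the point is the same: because each review carries explicit tags, retrieval depends on the tags, not on where the review happens to sit in the file.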

Pros

  • Handles any type of question — product, theme, persona, or creative
  • Claude loads only relevant files (good token efficiency)
  • Each file is small and focused (avoids the "lost in the middle" problem)
  • Easy to update one file without touching others
  • Follows the Karpathy pattern used by leading AI practitioners

Cons

  • Most files to build upfront (~18-20 files)
  • Some reviews appear in multiple files (a great :rise taste review is in both rise.md and taste.md)
  • Claude has to decide which files to load — needs clear routing instructions
  • More maintenance when refreshing with new reviews

Option 02: Simple

The Product Bible — One Big File Per Product

The simplest approach. Each product gets one comprehensive file containing everything: summary, themes, persona notes, AND all the curated reviews for that product. Plus one cross-product file for brand-level insights and swipe files.

mudwtr-customer-voice/
  SKILL.md — instructions for Claude
  references/
    index.md — file catalog
    rise-reviews.md — everything about :rise (summary + 80 best reviews)
    rest-reviews.md
    balance-reviews.md
    starter-kit-reviews.md
    brand-overview.md — cross-product themes, personas, swipe file

Example Question
"Give me 5 amazing reviews about :rise that talk about energy benefits"
Claude loads rise-reviews.md, scans the "Energy" section header, finds tagged reviews, returns 5 verbatim reviews. Simple and fast.
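The same idea for the single-file layout: split the product file on its section headers and pull the reviews under the matching one. A minimal sketch, assuming a hypothetical layout for rise-reviews.md with "## Theme" headers and one bulleted review per line:

```python
# Hypothetical section layout inside rise-reviews.md:
RISE_REVIEWS_MD = """\
## Taste
- "Tastes like chai hot chocolate. I don't miss coffee at all."

## Energy
- "Steady focus until 5pm, zero jitters."
- "I stopped crashing at 2pm within the first week."
"""

def section(text: str, header: str) -> list[str]:
    """Return the bulleted reviews under the `## <header>` section."""
    current, out = None, []
    for line in text.splitlines():
        if line.startswith("## "):
            current = line[3:].strip()
        elif current == header and line.startswith("- "):
            out.append(line[2:].strip())
    return out

energy = section(RISE_REVIEWS_MD, "Energy")
```

Section headers make the scan simple, but note the tradeoff: the whole file still has to be in context before the scan happens, which is where the larger files of this option cost tokens.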

Pros

  • Fewest files — easiest to build and maintain (5-6 files total)
  • Dead simple mental model: "all :rise stuff is in one place"
  • No duplication — each review appears once
  • Fewer routing decisions for Claude to make
  • Mirrors your existing mudwtr-products skill structure (one file per product)

Cons

  • Files get large (8,000-15,000 tokens each) — higher "lost in the middle" risk
  • Cross-product questions need multiple files loaded at once
  • Always loads the full product file, even for narrow questions
  • Persona and swipe file content gets buried inside a big file

Side-by-Side

Factor                      01: Wiki                          02: Product Bible              03: Hybrid
Files to build              ~18-20                            ~5-6                           ~12-14
"Give me 5 real reviews"    Yes (scattered across files)      Yes (in one big file)          Yes (dedicated vault layer)
"Write me ad hooks"         Needs to generate from reviews    Needs to generate from reviews Pre-built in playbooks
Tokens per typical query    ~8-15K                            ~10-15K                        ~5-10K
"Lost in the middle" risk   Low (small files)                 Medium (large files)           Low (small, layered files)
Cross-product questions     Good (theme files span products)  Needs multiple files           Good (theme-map spans products)
Maintenance effort          Higher (many files)               Lowest                         Medium
Speed of common asks        Medium                            Medium                         Fastest (playbooks are pre-built)

Recommendation: Option 3 — The Hybrid

The three-layer design directly maps to how your team will actually use this data. Your creative strategist asking for ad hooks is a different workflow than your designer asking for real testimonial quotes, which is different from your brand strategist building a persona. The hybrid gives each of them a fast path.

The Playbooks layer is the secret weapon. Your most common requests — "give me hooks," "give me testimonials," "help me handle this objection" — are pre-built and ready to go. Claude doesn't need to re-derive them from raw reviews every time. It just serves them up.

And yes — "Give me 5 amazing reviews about :rise that talk about benefits" works. Claude loads the vault, scans the tags, and returns real verbatim reviews. The tagging and indexing techniques from Anthropic's research make this reliable, not a gamble.

The total data footprint (~63K tokens for everything, ~5-10K for a typical question) leaves your team plenty of room to have a real back-and-forth conversation with Claude on top of the data. No context window crunch.

Research Sources

  • Anthropic — Long Context Window Prompting Tips
  • Anthropic — Context Engineering for AI Agents
  • Anthropic — Using XML Tags to Structure Prompts
  • Anthropic — "Quote First" Retrieval Technique (27% to 98%)
  • Stanford/UC Berkeley — Lost in the Middle: How Language Models Use Long Contexts
  • Karpathy — LLM Wiki Knowledge Base Pattern (GitHub Gist)
  • VentureBeat — Karpathy's LLM Knowledge Base Architecture
  • MindStudio — LLM Wiki vs RAG Comparison
  • DTCskills — How to Run Your Ecommerce Brand with Claude
  • Multi-Needle Retrieval in Large Context Windows (2025)
  • Long Context vs RAG for LLMs (2025)