A production pipeline for generating large, diverse, persona-grounded natural-language queries that describe UI to be built — anchored to a 2,440-topic real-world corpus, validated by trigram-similarity dedup, and benchmarked across four generation strategies.
corpus controls what · persona controls who / how — two orthogonal control signals, with a 4-way horizontal ablation that Persona Hub's original paper did not run.
Most "let's just prompt an LLM for some queries" pipelines collapse to narrow, templated distributions that don't generalize. We attack that on three axes.
Each card maps to a real situation where a team needs "a batch of high-quality query data" — not a benchmark, not a demo, but data that holds up in production or research.
test-corpus-methods.js empirically shows corpus anchoring and persona injection are additive — both raise topic hit rate and authenticity independently. Three diversification layers (Layer-A dedup / opener hash / persona-tone) are open-sourced and directly reproducible as a baseline.
Inputs from the corpus channel (what) and the persona channel (who / how)
converge at buildCorpusPlan(); a single Claude call generates the query;
successful rows feed back to corpus_usage.json so the next batch picks least-used topics.
The diagram shows the corpus-direct path (production default). The legacy persona-driven path adds an extra LLM call for persona synthesis between Step 1 and Step 2; both share the same Step 4 scoring + persistence layer.
How Step 4 scores — per-record heuristic: Authenticity × 0.4 + Specificity × 0.4 + Diversity × 0.2, pass threshold ≥ 2.8 / 5.0. A and S get equal 40% because "sounds like a real user" and "specific enough to be useful" are the two signals most directly computable from a single query; D only gets 20% because a single query can't prove batch-level diversity — that responsibility sits upstream in the corpus channel and Layer-A state, not at the per-record scorer.
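The weighting above can be sketched in a few lines. This is a minimal illustration, not the repo's scorer: the weights and the 2.8 threshold come from the text, while the function name `scoreRecord` and the sub-score field names are assumptions made for the example.

```javascript
// Weights and pass threshold as stated in the text.
// Field names and scoreRecord() are hypothetical, chosen for illustration.
const WEIGHTS = { authenticity: 0.4, specificity: 0.4, diversity: 0.2 };
const PASS_THRESHOLD = 2.8;

// Combine three 0-5 sub-scores into one weighted total and a pass flag.
function scoreRecord(subscores) {
  const total =
    subscores.authenticity * WEIGHTS.authenticity +
    subscores.specificity * WEIGHTS.specificity +
    subscores.diversity * WEIGHTS.diversity;
  return { total, pass: total >= PASS_THRESHOLD };
}

// A query that sounds real and is reasonably specific passes...
const solid = scoreRecord({ authenticity: 4, specificity: 3, diversity: 2 });
// ...while a perfect diversity score cannot rescue weak A and S.
const weak = scoreRecord({ authenticity: 2, specificity: 2, diversity: 5 });
console.log(solid.pass, weak.pass);
```

The second example shows the consequence of the 20% weight: diversity alone cannot lift a record over the threshold.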
The "pre-stage" is the secret. We don't ask the LLM to come up with topics — we pre-allocate them from a curated corpus, then let the LLM render natural language around each anchor. Similarity-based scoring then verifies the output stays diverse.
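One way to picture the similarity check is a Jaccard score over character trigrams, reported as each query's highest similarity to any peer. The sketch below assumes character-level trigrams and Jaccard overlap; the repo's actual tokenization and similarity measure may differ, and all function names here are hypothetical.

```javascript
// Character trigrams of a lowercased, whitespace-collapsed string.
// Assumption: character-level trigrams; the repo may use word trigrams.
function trigrams(text) {
  const normalized = text.toLowerCase().replace(/\s+/g, " ").trim();
  const grams = new Set();
  for (let i = 0; i + 3 <= normalized.length; i++) {
    grams.add(normalized.slice(i, i + 3));
  }
  return grams;
}

// Jaccard similarity: shared trigrams divided by the size of the union.
function jaccard(a, b) {
  if (a.size === 0 && b.size === 0) return 0;
  let shared = 0;
  for (const gram of a) if (b.has(gram)) shared += 1;
  return shared / (a.size + b.size - shared);
}

// For each query, the highest similarity to any other query in the batch.
function maxPeerSimilarity(queries) {
  const sets = queries.map(trigrams);
  return sets.map((set, i) =>
    sets.reduce(
      (best, other, j) => (i === j ? best : Math.max(best, jaccard(set, other))),
      0
    )
  );
}

const sims = maxPeerSimilarity(["abc def", "abc def", "xyz"]);
console.log(sims);
```

A batch-level dedup pass could then drop or regenerate any query whose max-peer similarity exceeds a chosen cutoff; the cutoff itself is not stated in the text.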
A plain topics[i % length] rotation makes consecutive batches pick identical
corpus topics. We persist data/state/corpus_usage.json, which holds usage counts per
(l2_key, topic), and prefer the least-used topics for each new run. Topic overlap
with the previous batch drops from 100% to 0%.
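Least-used-first selection can be sketched as a sort over usage counts, with counts incremented only for rows that succeeded. This is an assumption-laden illustration: the `l2Key::topic` key format, the function names, and every topic except Travel Memory Scrapbook are made up for the example, and the real layout of corpus_usage.json may differ.

```javascript
// Hypothetical key format for the usage map; the real file layout may differ.
function usageKey(l2Key, topic) {
  return `${l2Key}::${topic}`;
}

// Pick the n least-used topics, breaking ties by original corpus order.
function pickLeastUsed(l2Key, topics, usage, n) {
  return topics
    .map((topic, index) => ({
      topic,
      index,
      count: usage[usageKey(l2Key, topic)] || 0,
    }))
    .sort((a, b) => a.count - b.count || a.index - b.index)
    .slice(0, n)
    .map((entry) => entry.topic);
}

// Count usage only for rows that generated successfully.
function recordSuccess(l2Key, succeededTopics, usage) {
  for (const topic of succeededTopics) {
    const key = usageKey(l2Key, topic);
    usage[key] = (usage[key] || 0) + 1;
  }
}

// Illustrative topics; only "Travel Memory Scrapbook" appears in the source.
const topics = ["Habit Tracker", "Travel Memory Scrapbook", "Budget Planner", "Plant Care Log"];
const usage = {};
const batch1 = pickLeastUsed("personal_life", topics, usage, 2);
recordSuccess("personal_life", batch1, usage);
const batch2 = pickLeastUsed("personal_life", topics, usage, 2);
console.log(batch1, batch2);
```

Because the second batch sees the updated counts, it skips the topics the first batch used, which is the behavior a modulo rotation cannot provide across runs.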
Build a mobile .... We deterministically hash query_id into one
of 5 opener buckets (Build a / Need a / Create a / Make a / no formal opener),
forcing uniform distribution. Idempotent across reruns.
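A deterministic bucket assignment of this kind might look like the sketch below. The five buckets come from the text; the choice of FNV-1a as the hash and the names `fnv1a32` and `openerFor` are assumptions, since the source does not say which hash the repo uses.

```javascript
// The five opener buckets named in the text; null stands for "no formal opener".
const OPENER_BUCKETS = ["Build a", "Need a", "Create a", "Make a", null];

// 32-bit FNV-1a over UTF-16 code units.
// Assumption: any stable string hash would serve; the repo's choice is not stated.
function fnv1a32(text) {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Same query_id always maps to the same bucket, so reruns are idempotent.
function openerFor(queryId) {
  return OPENER_BUCKETS[fnv1a32(queryId) % OPENER_BUCKETS.length];
}

// Tally how 500 sequential ids spread across the buckets.
const tally = new Map();
for (let i = 0; i < 500; i++) {
  const bucket = openerFor(`q_${i}`);
  tally.set(bucket, (tally.get(bucket) || 0) + 1);
}
console.log(openerFor("q_scene_002_001"), tally);
```

A hash gives an approximately even spread rather than an exact one; a strictly uniform split would need a counter or a shuffle instead.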
maker, planner, curator, operator, founder_like) and assign each task
the best-fit persona by L2 semantics (not random hash) via
scripts/corpus_persona_map.json. The prompt injects each persona's voice descriptor —
purely qualitative, no example phrases (which would make output rigid). A jargon blacklist
(modal / dashboard / auto-generate / swipeable / bottom sheet ...) reinforces what each persona
wouldn't say.
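A blacklist like this is easy to audit after generation. The sketch below checks a query against the five terms quoted in the text (the source list is longer and elided); `jargonHits` and `escapeRegExp` are hypothetical helper names, and whole-word matching is an assumption about how hits should be counted.

```javascript
// Only the terms quoted in the text; the full blacklist is elided in the source.
const JARGON = ["modal", "dashboard", "auto-generate", "swipeable", "bottom sheet"];

// Escape regex metacharacters so each term is matched literally.
function escapeRegExp(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Return the blacklisted terms that appear as whole words, case-insensitively.
function jargonHits(query) {
  return JARGON.filter((term) =>
    new RegExp(`\\b${escapeRegExp(term)}\\b`, "i").test(query)
  );
}

console.log(jargonHits("A dashboard that opens a bottom sheet for each trip"));
console.log(jargonHits("A travel memory scrapbook app for mobile"));
```

Whole-word matching means inflected forms such as "dashboards" are not caught; a stricter audit could add stemming or explicit variants.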
queries.jsonl + plan.jsonl pair. Subsequent runs auto-update on success;
failed rows are not counted toward usage. Inspect data/state/corpus_usage.json any time.
Five archetypes drawn from how people actually request UI work. Each archetype carries per-complexity style hints, so the same persona phrases a vague request differently from a complex one — preserving voice while scaling specificity.
Synthetic-data research has roughly four lines — instance-driven (Self-Instruct, Evol-Instruct), key-point-driven (GLAN), persona-driven (Persona Hub), and self-play (Magpie). This repo sits on the persona-driven line and adds three engineering reinforcements the original paper did not include.
| Dimension | Persona Hub (Tencent AI Lab, 2024) | This repo |
|---|---|---|
| Persona source | Generic 1B persona pool, reverse-derived from web text | 5 targeted archetypes reverse-derived from product scenarios + real user profiles |
| Distribution control | Black-box: relies on the large pool's natural spread | White-box: corpus channel tracks 2,440 topics, Layer-A least-used-first state |
| Horizontal ablation | No with/without-persona ablation; no head-to-head vs Self-Instruct / Magpie | 4-way: llm-direct / corpus-direct / persona-direct / corpus+persona |
| Typical setting | General-domain distillation at scale | Vertical product domain (UI vibe-coding) — where you can actually obtain a clean corpus |
Under the same base model (claude-sonnet-4-6), the same query total, and the same eval protocol,
four generation strategies go head-to-head on identical scenes. Each method isolates the contribution of one control signal,
either corpus (what) or persona (who / how), so the marginal effect of each channel becomes visible.
| Method | Topic adherence | Voice authenticity | LLM calls / query | Verdict |
|---|---|---|---|---|
| Scene-direct | ~60–80% | low | 1× | Baseline · drifts off-topic |
| Corpus-direct | ~100% | medium | 1× | Production default · best ROI |
| Persona-only | low | high | 1× | Strong voice · weak specificity |
| Persona + Corpus | ~100% | high | 2× | Best quality · 2× cost & latency |
Verbatim quotes from each dataset's published samples — no paraphrasing, no rewriting. Scope deliberately narrowed to web / UI code-generation datasets (the same application surface this repo targets), not general instruction-tuning corpora. Each row is a real string copied from the dataset card or release; the source link sits next to it so you can verify. Then one real bilingual query from this repo for shape contrast.
| Dataset · source | Verbatim query / instruction (as published) | Shape |
|---|---|---|
| WebSight v0.1 · HuggingFaceM4 · HF dataset card · screenshot ↔ HTML pairs · the text field shown here is the natural-language side | "Fashion Brand: A visually stunning layout with a full-width, rotating image carousel showcasing their latest collections, a bold, center-aligned logo, and a bottom navigation menu. The color palette is inspired by the latest fashion trends." | visual-structural description · "[type]: [layout pieces]" template · no user voice, no goal, no context |
| Web2Code · MBZUAI · NeurIPS 2024 · HF dataset card · 1.18M webpage-image ↔ HTML instruction pairs · conversations[0].value field shown | "<image>\nGenerate HTML corresponding to the webpage in the given image. See code developed with guidance from the principles of material design." | model-facing imperative + screenshot · "Generate HTML…" template · no end-user motivation, no product framing |
| WebGen-Bench / WebGen-Instruct · Lu et al. 2025 · arXiv 2505.03733 · HF dataset card · 101 test + 6,667 train instructions for end-to-end website generation | "Please implement a website for generating stock reports to provide stock information and analysis. The website should have the functionality to search and summarize stock information, and generate customized stock reports based on user requirements. Users should be able to input stock codes or names, select report formats and content, and the website will automatically generate the corresponding reports. The reports should include basic stock information, market trends, financial data, and more. Set the background color to white and the component color to navy." | third-person functional spec · "Users should be able to…" PRD voice · feature checklist with a tacked-on color rule · no first-person motivation |
| MM-WebGen-Bench · Microsoft · 2025 · HF dataset card · 120 multimodal webpage-design prompts · 11 scenes × 11 styles · input field shown | "Design a playful, vibrant event landing page titled 'Pixel Pop Creative Camp 2025' that showcases a weekend of digital design and comic illustration workshops. The overall layout should favor a two-column structure with deliberate asymmetries and whimsical details, all in a bright, candy-inspired palette. The page back…" | design-director brief · "Design a [adjective stack] [page type] titled '…' that …" · aesthetic vocabulary · still third-person, no real-user voice |
| ui-queryMaker ★ corpus + persona · corpus_run_v7_mobile_500/queries.jsonl · q_scene_002_001 · persona = curator · L2 = 个人生活类 · topic = Travel Memory Scrapbook | "A travel memory scrapbook app for mobile where each trip feels like flipping through a beautifully worn journal — photos layered with handwritten-style captions, little stamps marking the city and date, and a warm golden-hour color palette that gives everything that nostalgic, end-of-a-roll-of-film feeling. The home screen shows all your trips arranged like a loose stack of polaroids you can tap into, and once inside a trip, the memories are laid out in an organic, almost collage-like way where things slightly overlap and feel like someone actually pasted them onto a page by hand. The vibe should feel tactile and intimate, the kind of thing you'd want to screenshot and share just because it looks so good." | first-person user voice + complete 0-to-1 app framing · 117 words · top-level noun = app · no dev jargon · positive framing (NEG = 0) |
Every number below is computed from real verbatim data — 100 samples per dataset, identical algorithm applied to all five. Code: scripts/quality/*. Raw output: data/output/quality_report/*.json. Charts A+B run as pure-regex / pure-counting Node scripts; chart C runs locally on sentence-transformers/all-MiniLM-L6-v2 ONNX (90 MB, in models/) + van der Maaten's reference numpy t-SNE. Reproducible single-command pipeline; no LLM-as-judge, no API, no downstream training in scope.
| dataset | jargon/q ↓ (user voice) | words p50 (context) | distinct-3 ↑ (less template) | TTR ↑ (richer vocab) | #openers ↑ (variety, /100) | top opener · share ↓ (less template) | max-peer sim ↓ (fewer dupes) |
|---|---|---|---|---|---|---|---|

Legend: jargon/q: lower = ordinary-user voice · words p50: context-only, ~80-150 ideal · distinct-3: higher = less templating · TTR: higher = richer vocab · #openers: higher = more variety, out of 100 · top opener · share: lower share = less template · max-peer sim: lower = fewer near-dupes
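For readers unfamiliar with the counting metrics, here is a minimal sketch of distinct-3, type-token ratio, and top-opener share using their common definitions. The tokenizer, the two-word opener width, and the function names are assumptions; the scripts under scripts/quality/ may define these differently.

```javascript
// Lowercase word tokenizer; an assumption, the repo's tokenizer may differ.
function tokenize(text) {
  return text.toLowerCase().match(/[a-z0-9']+/g) || [];
}

// distinct-n: unique n-grams divided by total n-grams across the whole sample.
function distinctN(queries, n) {
  const seen = new Set();
  let total = 0;
  for (const query of queries) {
    const tokens = tokenize(query);
    for (let i = 0; i + n <= tokens.length; i++) {
      seen.add(tokens.slice(i, i + n).join(" "));
      total += 1;
    }
  }
  return total === 0 ? 0 : seen.size / total;
}

// Type-token ratio: unique tokens divided by total tokens.
function typeTokenRatio(queries) {
  const tokens = queries.flatMap(tokenize);
  return tokens.length === 0 ? 0 : new Set(tokens).size / tokens.length;
}

// Most frequent opener (first `width` words), its share, and the opener count.
function openerStats(queries, width = 2) {
  const counts = new Map();
  for (const query of queries) {
    const opener = tokenize(query).slice(0, width).join(" ");
    counts.set(opener, (counts.get(opener) || 0) + 1);
  }
  let top = ["", 0];
  for (const entry of counts) if (entry[1] > top[1]) top = entry;
  const share = queries.length === 0 ? 0 : top[1] / queries.length;
  return { topOpener: top[0], share, distinctOpeners: counts.size };
}

const sample = [
  "Build a habit tracker app",
  "Build a habit tracker tool",
  "A cozy recipe journal app",
];
console.log(distinctN(sample, 3), typeTokenRatio(sample), openerStats(sample));
```

On this toy sample the two near-identical "Build a habit tracker" queries pull distinct-3 below 1 and give "build a" a two-thirds opener share, which is the templating signal these columns are meant to expose.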
500 ui-queryMaker queries from corpus_run_v7_mobile_500, embedded with sentence-transformers/all-MiniLM-L6-v2 (ONNX, local, 384-dim, mean-pool + L2-norm) → t-SNE 2D (perplexity 30, 1000 iter, van der Maaten's reference numpy implementation). Click a color-mode button to recolor: L1 (12 categories, Excel-defined topic structure) · persona (5 archetypes, who/how channel) · style · complexity. The proper test isn't "are we visually different from other datasets" — different datasets target different scenarios. The test is: do queries cluster meaningfully along the structural fields we set out to spread across? If yes → the prompt didn't collapse. Same intra-dataset shape inspection as Code Aesthetics (arXiv 2510.23272).
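The "mean-pool + L2-norm" step turns per-token vectors into one unit-length sentence vector. The sketch below shows that arithmetic on plain arrays, assuming the token embeddings and attention mask have already been produced by the ONNX model; the function names are hypothetical and the toy vectors are 2-dimensional rather than 384.

```javascript
// Average the token vectors whose attention-mask entry is 1 (padding is skipped).
function meanPool(tokenEmbeddings, attentionMask) {
  const dim = tokenEmbeddings[0].length;
  const sum = new Array(dim).fill(0);
  let kept = 0;
  tokenEmbeddings.forEach((vector, i) => {
    if (!attentionMask[i]) return;
    kept += 1;
    for (let d = 0; d < dim; d++) sum[d] += vector[d];
  });
  return sum.map((value) => value / Math.max(kept, 1));
}

// Scale a vector to unit length so dot products equal cosine similarity.
function l2Normalize(vector) {
  const norm = Math.hypot(...vector);
  return norm === 0 ? vector.slice() : vector.map((value) => value / norm);
}

// Toy input: two real tokens and one padding token that must be ignored.
const pooled = meanPool([[6, 0], [0, 8], [9, 9]], [1, 1, 0]);
const unit = l2Normalize(pooled);
console.log(pooled, unit);
```

After normalization, comparing two queries reduces to a dot product, which is the input t-SNE needs for its pairwise affinities.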
A registered design-style layer on top of the core two-channel pipeline — orthogonal to what and who/how, controls the look. Eleven first-class styles, each with a Chinese persona-side hint and an English LLM-side instruction. New styles register in one line.
design_style: null
--design-styles "Dark,Glassmorphism,Cyberpunk"
--design-styles auto
Each stage is a real production-grade pain we discovered after running the previous version at scale, then a focused prompt or pipeline fix. Click any "Sample query" to read the actual generated text — same persona, side by side, before vs after.
corpus_usage.json state · 5-bucket opener hash (Build a / Need a / Create a / Make a / no opener) keyed by query_id · persona-tone semantic mapping from L2 → 5 ordinary-user archetypes.

page / screen / view / section / module / feature / widget as the top-level scope noun. Use app or a specific app type (tracker / tool / reminder / planner / calculator / logger / manager / timer). Existing 200-row batch retrofitted via scripts/rewrite-app-scope.js.

founder_like voice (was literally instructed to "explain what NOT to include as much as what to include"). (2) Flipped three "Do NOT" prompt rules into positive form ("Open with the substance" / "Use everyday vocabulary" / "Use 'app' as top-level noun"). LLMs primed by negation imperatives mirror them in output. (3) Added an explicit positive-framing rule limiting negation words to ≤1 per query.
Each stage is a single, focused commit on main — fully reproducible. The full negation-density audit script is one inline node command; quantitative deltas are computed against the same persona breakdown so improvements are not just topic luck.
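A negation-density check of the kind mentioned here can be approximated with a single regex. The word list below is an assumption, as the source does not enumerate which words the audit counts, and `negationCount` is a hypothetical name; only the limit of one negation per query comes from the text.

```javascript
// Assumed negation vocabulary; the repo's audit may count a different set.
// Straight apostrophes only; curly apostrophes would need extra variants.
const NEGATION_PATTERN =
  /\b(?:no|not|never|without|don't|doesn't|shouldn't|can't|won't)\b/gi;

// Limit stated in the text: at most one negation word per query.
const MAX_NEGATIONS = 1;

// Count whole-word negations in a query.
function negationCount(query) {
  return (query.match(NEGATION_PATTERN) || []).length;
}

// True when the query respects the positive-framing rule.
function isPositivelyFramed(query) {
  return negationCount(query) <= MAX_NEGATIONS;
}

console.log(negationCount("No clutter, and don't add a login, never show ads"));
console.log(isPositivelyFramed("A travel memory scrapbook app for mobile"));
```

Averaging `negationCount` per persona over a batch would give a density figure comparable across stages.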
Six queries picked from a real 200-query corpus-direct run, spanning six L1 categories. Toggle EN ↔ 中文 in the top right — the data ships bilingual out of the box.
From clone to 50 generated queries in under 10 minutes. Requires Node 18+ and the Claude CLI (or set a Packy-compatible API key).
queries.jsonl includes the English query, a Chinese translation,
word count, scene metadata, the corpus topic anchor, model name, and timing. Pipe directly
into training, or run scripts/generate-analysis-report.js for a self-contained HTML report.
Synthetic data has limits. Listing them is not self-sabotage: it rests credibility on a net of multiple indicators, and it tells adopters which piece they should reinforce in their own setting.
data/intermediate/scenario_specs/; the recommendation is to grow the corpus rather than stack more personas.

The shortest path to understanding the codebase.