ui-queryMaker.
Production pipeline · v2 corpus-direct branch

Realistic UI query data,
synthesized with rigor.

A production pipeline for generating large, diverse, persona-grounded natural-language queries that describe UI to be built — anchored to a 2,440-topic real-world corpus, validated by trigram-similarity dedup, and benchmarked across four generation strategies.

corpus controls what · persona controls who / how — two orthogonal control signals, with a 4-way horizontal ablation that Persona Hub's original paper did not run.

01 Real-world corpus grounding · 02 Persona-centered authenticity · 03 Design-style aware · 04 Similarity-validated diversity
2,440 topics
Corpus coverage
61 L2 scenes · 12 L1 categories
200 / 200
Run success rate
0 failures · claude-sonnet-4-6
~84 words
Avg query length
medium complexity, EN
~3.3 s
Per-query latency
200 queries in 11 min
// THE PROBLEM

Why synthesized UI queries usually fall short

Most "let's just prompt an LLM for some queries" pipelines collapse to narrow, templated distributions that don't generalize. We attack that on three axes.

Narrow distribution
A naive prompt returns 100 variations of "build me a dashboard" — covering < 5% of real product space.

Real-world corpus anchoring
Each generation is locked to a specific topic from a curated 2,440-entry corpus spanning 12 L1 categories.
Robotic phrasing
LLMs default to a single polite, structured voice that doesn't match how humans actually request UI.

Persona-driven voice
Five archetypes × three complexity styles produce first-person voice variation grounded in user goals.
Visually flat output
Generated queries rarely specify visual style, leaving downstream UI generation to default to one aesthetic.

Design-style awareness
Eleven registered design styles × three invocation modes (default / fixed / heuristic-auto) inject visual intent.
// WHO THIS IS FOR

Three concrete usage angles

Each card maps to a real situation where a team needs "a batch of high-quality query data" — not a benchmark, not a demo, but data that holds up in production or research.

🏭
Industry · Product
NL → mini-app code generation products
Teams behind Bolt, v0.dev, Builder.io, ByteDance Doubao, Ant Lingguang and similar products need high-quality seed query data for training, eval, and prompt iteration. This repo ships a drop-in 2,440-topic × 5-persona × 11-style dataset, distribution-faithful to scene weights, framed as 0-to-1 mini-apps (not single-page mocks).
🎓
Research
Synthetic data, instruction tuning, UI codegen
Controlled-variable benchmark in test-corpus-methods.js empirically shows corpus anchoring and persona injection are additive — both raise topic hit rate and authenticity independently. Three diversification layers (Layer-A dedup / opener hash / persona-tone) are open-sourced and directly reproducible as a baseline.
🛠
Industry · Internal teams
"A batch of seed queries, fast"
ML / engineering teams who need ad-hoc query batches for prompt iteration, UX test sets, pre-launch API stress tests, or product demos. ~3.3s per query · ~11 min for 200 · cross-batch dedup state · bilingual EN+ZH xlsx out-of-the-box.
// THE PIPELINE

How one UI query is synthesized — 4-step pipeline

Inputs from the corpus channel (what) and the persona channel (who / how) converge at buildCorpusPlan(); a single Claude call generates the query; successful rows feed back to corpus_usage.json so the next batch picks least-used topics.

[Diagram: ui-queryMaker corpus-direct pipeline · 4 steps with Layer-A feedback]

INPUTS
· scenario.xlsx · 61 L2 · 12 L1
· corpus_data.json · 2,440 topics · the WHAT
· 5 archetypes + L2 best-fit map · WHO/HOW
· corpus_usage.json · Layer-A state · feedback target

STEP 1 · PLAN · buildCorpusPlan()
↳ Layer-A least-used topic
↳ persona ← L2 semantic match
↳ design_style allocation
↳ complexity mix
→ plan.jsonl (deterministic)

STEP 2 · PROMPT ASSEMBLY · buildCorpusDirectQueryPrompt()
↳ corpus topic anchor
↳ persona voice descriptor (qualitative, no example phrases)
↳ design_style hint
↳ scope rule · app, not page/screen
↳ positive framing · NEG ≤ 1
↳ opener bucket hash · 5-way (Build/Need/Create/Make/∅)
↳ jargon blacklist (dashboard, modal, swipeable…)

STEP 3 · LM
claude-sonnet-4-6 · 1× call · ~3.3 s
No-API mode: Claude Code subagent · $0
6–15 parallel batches

STEP 4 · SCORE · DEDUP · PERSIST
scoreQuery() · A 40% · S 40% · D 20% · pass threshold ≥ 2.8 / 5.0
trigram Jaccard within scene · drop if ≥ 0.55 vs same-scene peers
queries.jsonl + queries.xlsx · dashboard.html
↻ saveCorpusUsage() · success rows → Layer-A state
Layer-A feedback · cross-batch topic overlap = 0%

REAL EXAMPLE · one row from data/output/corpus_run_v9_mobile_300/

PLAN TASK · STEP 1 OUTPUT
query_id: q_v9_mob_0142
l2_scene: Pomodoro / To-do (番茄钟 / 待办)
corpus_topic: Daily time-block focus list (每日时段聚焦清单)
persona: operator (L2 best-fit)
design_style: Minimalism
complexity: medium
opener_bucket: "no formal opener"
platform: mobile
generator: cc-subagent (sonnet)

step 2 + 3
QUERY · STEP 3 LM OUTPUT (verbatim)
"I keep losing track of which task I should focus on next when I sit down to work, so I want a simple pomodoro app where I can drop in my to-dos for the day and have it walk me through them one focus block at a time without me having to think about it — clean look, big timer, just the next thing in front of me."

step 4 · audit: word_count = 81 · opener = "I keep losing…" · NEG = 0 · scope_noun = "app" ✓
step 4 · score: Authenticity 4.2 · Specificity 4.0 · Diversity 4.5 → weighted 4.16 / 5.0
step 4 · jaccard: max trigram sim vs same-scene peers = 0.22 → distinct, kept
step 4 · verdict: PASS → written to queries.jsonl + queries.xlsx · Pomodoro (番茄钟) topic marked used in corpus_usage.json

The diagram shows the corpus-direct path (production default). The legacy persona-driven path adds an extra LLM call for persona synthesis between Step 1 and Step 2; both share the same Step 4 scoring + persistence layer.
How Step 4 scores — per-record heuristic: Authenticity × 0.4 + Specificity × 0.4 + Diversity × 0.2, pass threshold ≥ 2.8 / 5.0. A and S get equal 40% because "sounds like a real user" and "specific enough to be useful" are the two signals most directly computable from a single query; D only gets 20% because a single query can't prove batch-level diversity — that responsibility sits upstream in the corpus channel and Layer-A state, not at the per-record scorer.
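The weighting described above can be sketched in a few lines. This is a minimal illustration, assuming each sub-score is already on a 0 to 5 scale; the helper names `weightedScore` and `passes` are hypothetical and are not taken from the repo's scoreQuery().

```javascript
// Weights and pass threshold as stated in the text.
const WEIGHTS = { authenticity: 0.4, specificity: 0.4, diversity: 0.2 };
const PASS_THRESHOLD = 2.8;

// Hypothetical helper: combine three 0-5 sub-scores into one weighted score.
function weightedScore(sub) {
  return (
    sub.authenticity * WEIGHTS.authenticity +
    sub.specificity * WEIGHTS.specificity +
    sub.diversity * WEIGHTS.diversity
  );
}

// Hypothetical helper: a record passes when its weighted score reaches the threshold.
function passes(sub) {
  return weightedScore(sub) >= PASS_THRESHOLD;
}

const example = { authenticity: 4, specificity: 3, diversity: 2 };
console.log(weightedScore(example).toFixed(2), passes(example)); // 3.20 true
```

Because Diversity carries only 20%, a query with a weak diversity sub-score can still pass if it sounds authentic and specific, which matches the stated intent of handling batch-level diversity upstream.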

PILLAR 01

Diversity through real-world grounding

The "pre-stage" is the secret. We don't ask the LLM to come up with topics — we pre-allocate them from a curated corpus, then let the LLM render natural language around each anchor. Similarity-based scoring then verifies the output stays diverse.

// PRE-STAGE
Hierarchical topic allocation
Excel scenario spec is parsed into 61 L2 scenes across 12 L1 categories. Each task in the plan is anchored to a concrete corpus topic — not a templated slot.
spec × corpus_data → plan.jsonl
  // 200 tasks · L1-proportional · topic-anchored
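One way to read "L1-proportional" is that the task total is split across L1 categories in proportion to their weights. The sketch below assumes a largest-remainder rounding scheme and a plain `{ category: weight }` input; both are assumptions made for illustration, and `allocateProportional` is a hypothetical name, not the repo's planner.

```javascript
// Hypothetical helper: split `total` tasks across categories by weight,
// using largest-remainder rounding so the counts always sum to `total`.
function allocateProportional(weights, total) {
  const keys = Object.keys(weights);
  const sum = keys.reduce((s, k) => s + weights[k], 0);
  const rows = keys.map((key, index) => {
    const exact = (weights[key] / sum) * total;
    const count = Math.floor(exact);
    return { key, index, count, rem: exact - count };
  });
  let left = total - rows.reduce((s, r) => s + r.count, 0);
  // Leftover tasks go to the largest fractional remainders;
  // ties fall back to original order so the plan stays deterministic.
  const order = [...rows].sort((a, b) => b.rem - a.rem || a.index - b.index);
  for (const row of order) {
    if (left <= 0) break;
    row.count += 1;
    left -= 1;
  }
  return Object.fromEntries(rows.map((r) => [r.key, r.count]));
}

console.log(allocateProportional({ tools: 3, games: 1 }, 200)); // { tools: 150, games: 50 }
```

Whatever rounding rule the real planner uses, the property worth keeping is determinism: the same spec and total should always produce the same plan.jsonl.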
// POST-STAGE
Trigram Jaccard dedup
Each query is compared against its same-scene peers via trigram-set Jaccard similarity. The diversity score rewards low max-peer-similarity.
J(A,B) = |trigrams(A) ∩ trigrams(B)| / |trigrams(A) ∪ trigrams(B)|
sim < 0.55 · distinct enough · +1.0
sim < 0.30 · strongly distinct · +2.0
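The dedup check above can be sketched as follows. This is an illustration under stated assumptions: trigrams are taken over lowercase word tokens (the text does not say whether the repo uses word or character trigrams), the two bonus tiers are treated as non-cumulative, and the names `trigrams`, `jaccard`, `maxPeerSim`, and `diversityVerdict` are hypothetical.

```javascript
// Word-level trigram set for one query (tokenization is an assumption).
function trigrams(text) {
  const words = text.toLowerCase().match(/[\p{L}\p{N}']+/gu) || [];
  const out = new Set();
  for (let i = 0; i + 3 <= words.length; i++) {
    out.add(words.slice(i, i + 3).join(" "));
  }
  return out;
}

// Jaccard similarity: |A ∩ B| / |A ∪ B|, defined as 0 when both sets are empty.
function jaccard(a, b) {
  if (a.size === 0 && b.size === 0) return 0;
  let inter = 0;
  for (const t of a) if (b.has(t)) inter++;
  return inter / (a.size + b.size - inter);
}

// Highest similarity between a query and any of its same-scene peers.
function maxPeerSim(query, peers) {
  const q = trigrams(query);
  return peers.reduce((max, p) => Math.max(max, jaccard(q, trigrams(p))), 0);
}

// Thresholds from the text: drop at >= 0.55, +1.0 below 0.55, +2.0 below 0.30.
function diversityVerdict(sim) {
  if (sim >= 0.55) return { keep: false, bonus: 0 };
  return { keep: true, bonus: sim < 0.3 ? 2.0 : 1.0 };
}

console.log(diversityVerdict(0.22)); // { keep: true, bonus: 2 }
```

Comparing only within a scene keeps the check cheap and avoids penalizing two queries that share phrasing simply because they describe the same kind of app in different scenes.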
[Interactive chart: corpus topic tree. Each ⬤ inner node is an L1 category; each · outer dot is an L2 sub-scene that contains its corpus topics.]
// PAIN POINTS DISCOVERED IN BATCH GENERATION → FIXES SHIPPED
// CROSS-BATCH
Layer-A · Least-used topic sampling
Naive topics[i % length] rotation makes consecutive batches pick identical corpus topics. We persist data/state/corpus_usage.json — usage counts per (l2_key, topic) — and prefer least-used topics for each new run. 100% → 0% topic overlap with previous batch.
pickLeastUsedTopics(topics, n, usageMap)
  // stable tiebreak by original index
v3 → v4 · topic overlap with prior batch · 100% → 0%
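A minimal sketch of least-used sampling with a stable tiebreak is shown below. The function name and argument order come from the signature above, but the body is an assumption, as is the choice to model `usageMap` as a `Map` from topic to usage count (the repo persists counts per (l2_key, topic) in JSON).

```javascript
// Sketch only: picks the n least-used topics, ties broken by original index.
function pickLeastUsedTopics(topics, n, usageMap) {
  return topics
    .map((topic, index) => ({ topic, index, used: usageMap.get(topic) || 0 }))
    .sort((a, b) => a.used - b.used || a.index - b.index)
    .slice(0, n)
    .map((row) => row.topic);
}

const usage = new Map([
  ["habit streaks", 2],
  ["focus timer", 0],
]);
const batch = pickLeastUsedTopics(["habit streaks", "focus timer", "water log"], 2, usage);
console.log(batch); // [ 'focus timer', 'water log' ]
```

The index tiebreak matters for reproducibility: with an empty usage map the function degrades to "first n topics in corpus order", and every later batch moves on to topics the earlier ones did not touch.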
// PROMPT CONVERGENCE
Opener hash distribution
With "mobile H5" in the system prompt, 54% of v3 outputs converged on "Build a mobile ...". We deterministically hash query_id into one of 5 opener buckets (Build a / Need a / Create a / Make a / no formal opener), which spreads openers close to evenly across the batch. The assignment is idempotent across reruns.
opener = OPENERS[hash(query_id) % 5]
v3 · "Build a" share · 54%
v4 · all 4 main openers · 38–48 each
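The bucket assignment can be sketched as below. The text does not say which hash function the repo uses, so FNV-1a (32-bit) is swapped in here purely as a simple, stable stand-in; `openerFor` is a hypothetical name and `null` stands for the "no formal opener" bucket.

```javascript
const OPENERS = ["Build a", "Need a", "Create a", "Make a", null]; // null = no formal opener

// FNV-1a 32-bit: a stand-in hash, not necessarily the one used in the repo.
function hash(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0; // keep the value an unsigned 32-bit integer
  }
  return h >>> 0;
}

// Same query_id always maps to the same bucket, so reruns are idempotent.
function openerFor(queryId) {
  return OPENERS[hash(queryId) % OPENERS.length];
}

console.log(openerFor("q_v9_mob_0142") === openerFor("q_v9_mob_0142")); // true
```

Keying the bucket on query_id rather than on a random draw means a failed row can be regenerated later without shifting the opener mix of the rest of the batch.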
// VOICE / TONE
Layer-C · Persona-tone semantic mapping
v4 batches still read like PM specs ("GTD inbox dashboard", "auto-generate based on UV index", "bottom card swipe up to expand"). We define 5 ordinary-user personas (maker, planner, curator, operator, founder_like) and assign each task the best-fit persona by L2 semantics (not random hash) via scripts/corpus_persona_map.json. The prompt injects each persona's voice descriptor — purely qualitative, no example phrases (which would make output rigid). A jargon blacklist (modal / dashboard / auto-generate / swipeable / bottom sheet ...) reinforces what each persona wouldn't say.
persona = map[task.corpus_l2_key] || "maker"
  // 61 L2 entries · semantic best-fit · NOT hash-random
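curator: personal life · content creation · restaurant reviews · note editing
operator: pomodoro · to-do · health check-ins · edit / search / filter
planner: flashcards · itineraries · health tracking · study check-ins
founder_like: personal professional · posters / resume templates
maker: long-tail micro-tools · classic mini-games · Adding & Creating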
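The lookup and the blacklist check above might look like the sketch below. The two map keys are invented placeholders (the real scripts/corpus_persona_map.json has 61 L2 entries whose keys are not shown here), the blacklist contains only the terms named in the text, and `personaFor` and `jargonHits` are hypothetical helpers.

```javascript
// Hypothetical excerpt of the L2 -> persona map; keys are placeholders.
const personaMap = { pomodoro_todo: "operator", flashcards: "planner" };

// Terms named in the text; the full blacklist is longer.
const JARGON = ["modal", "dashboard", "auto-generate", "swipeable", "bottom sheet"];

// Semantic best-fit lookup with "maker" as the fallback persona.
function personaFor(task) {
  return personaMap[task.corpus_l2_key] || "maker";
}

// Returns the blacklisted terms found in a query (simple substring match).
function jargonHits(query) {
  const text = query.toLowerCase();
  return JARGON.filter((term) => text.includes(term));
}

console.log(personaFor({ corpus_l2_key: "pomodoro_todo" })); // operator
console.log(jargonHits("A GTD inbox dashboard with a modal")); // [ 'modal', 'dashboard' ]
```

Substring matching is deliberately crude and will also flag words such as "modality"; a word-boundary match would be stricter if false positives become a problem.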
Bootstrap from an existing run: the first time you enable Layer-A, seed the state file from a historical queries.jsonl + plan.jsonl pair. Subsequent runs update it automatically on success; failed rows are not counted toward usage. You can inspect data/state/corpus_usage.json at any time.
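Seeding the state from a historical run could be sketched like this. The field names (`query_id`, `l2_key`, `corpus_topic`), the `"l2::topic"` key format, and the assumption that queries.jsonl holds only successful rows are all illustrative guesses; `seedUsage` is a hypothetical helper, not a script in the repo.

```javascript
// Hypothetical helper: build usage counts from parsed plan rows and query rows.
function seedUsage(planRows, queryRows) {
  const succeeded = new Set(queryRows.map((q) => q.query_id));
  const usage = {};
  for (const task of planRows) {
    if (!succeeded.has(task.query_id)) continue; // failed rows do not count toward usage
    const key = `${task.l2_key}::${task.corpus_topic}`;
    usage[key] = (usage[key] || 0) + 1;
  }
  return usage;
}

const plan = [
  { query_id: "q1", l2_key: "todo", corpus_topic: "focus list" },
  { query_id: "q2", l2_key: "todo", corpus_topic: "habit streaks" },
];
const queries = [{ query_id: "q1" }]; // q2 failed and never reached queries.jsonl
console.log(seedUsage(plan, queries)); // { 'todo::focus list': 1 }
```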
PILLAR 02

Authenticity through persona

Five archetypes drawn from how people actually request UI work. Each archetype carries per-complexity style hints, so the same persona phrases a vague request differently from a complex one — preserving voice while scaling specificity.

// HOW THIS COMPARES

vs Persona Hub / Self-Instruct / Magpie

Synthetic-data research has roughly four lines — instance-driven (Self-Instruct, Evol-Instruct), key-point-driven (GLAN), persona-driven (Persona Hub), and self-play (Magpie). This repo sits on the persona-driven line and adds three engineering reinforcements the original paper did not include.

| Dimension | Persona Hub (Tencent AI Lab, 2024) | This repo |
|---|---|---|
| Persona source | Generic 1B persona pool, reverse-derived from web text | 5 targeted archetypes reverse-derived from product scenarios + real user profiles |
| Distribution control | Black-box: relies on the large pool's natural spread | White-box: corpus channel tracks 2,440 topics, Layer-A least-used-first state |
| Horizontal ablation | No with/without-persona ablation; no head-to-head vs Self-Instruct / Magpie | 4-way: llm-direct / corpus-direct / persona-direct / corpus+persona |
| Typical setting | General-domain distillation at scale | Vertical product domain (UI vibe-coding), where you can actually obtain a clean corpus |
Honest framing: Persona Hub's headline evidence is "1M-persona synthesis trains a 7B model to approach GPT-4-turbo on MATH" — strong end-to-end result evidence, but it does not directly isolate the marginal contribution of the persona mechanism itself vs other synthesis routes. This repo fills that ablation gap (see next section) and adds a corpus channel for white-box distribution anchoring.
Method lineage: Self-Instruct (2022) → Evol-Instruct (2023) → Magpie (2024) → Persona Hub (2024) → this repo.
PILLAR 04

The 4-way ablation Persona Hub skipped

With the same base model (claude-sonnet-4-6), the same query total, and the same eval protocol, four generation strategies run head-to-head on identical scenes. Each method isolates the contribution of one control signal, corpus (what) or persona (who / how), so the marginal effect of each channel becomes visible.

METHOD A · baseline
Scene-direct
Only L1/L2 scene context. No topic, no persona. Measures raw scene-conditioned quality.
METHOD B · primary
Corpus-direct
Scene + specific corpus topic. 1× LLM call. The production default — sweet spot.
METHOD C · voice
Persona-only
Scene + persona, no topic anchor. Tests whether persona alone yields specific queries.
METHOD D · combined
Persona + Corpus
Best on paper. 2× LLM calls (persona synthesis + query). Higher cost & latency.
| Method | Topic adherence | Voice authenticity | LLM calls / query | Verdict |
|---|---|---|---|---|
| Scene-direct | ~60–80% | low | | Baseline · drifts off-topic |
| Corpus-direct | ~100% | medium | 1× | Production default · best ROI |
| Persona-only | low | high | | Strong voice · weak specificity |
| Persona + Corpus | ~100% | high | 2× | Best quality · 2× cost & latency |
Honest takeaway: Persona+Corpus wins on paper, but at 2× the cost. For most production workloads, corpus-direct hits the sweet spot: ~100% topic adherence with a single LLM call. Persona+Corpus is reserved for high-value subsets where voice quality matters most.
Scoring rubric: Authenticity (40%) + Specificity (40%) + Diversity (20%). Pass threshold ≥ 2.8 / 5.0.
// VERBATIM QUERY SHAPE · vs WEB / UI CODEGEN DATASETS

What a query actually looks like in each public web-codegen dataset

Verbatim quotes from each dataset's published samples — no paraphrasing, no rewriting. Scope deliberately narrowed to web / UI code-generation datasets (the same application surface this repo targets), not general instruction-tuning corpora. Each row is a real string copied from the dataset card or release; the source link sits next to it so you can verify. Then one real bilingual query from this repo for shape contrast.

Dataset · source Verbatim query / instruction (as published)
WebSight v0.1
HuggingFaceM4
HF dataset card ↗
screenshot ↔ HTML pairs · the text field shown here is the natural-language side
"Fashion Brand: A visually stunning layout with a full-width, rotating image carousel showcasing their latest collections, a bold, center-aligned logo, and a bottom navigation menu. The color palette is inspired by the latest fashion trends."
shape · visual-structural description · "[type]: [layout pieces]" template · no user voice, no goal, no context
Web2Code
MBZUAI · NeurIPS 2024
HF dataset card ↗
1.18M webpage-image ↔ HTML instruction pairs · conversations[0].value field shown
"<image>\nGenerate HTML corresponding to the webpage in the given image. See code developed with guidance from the principles of material design."
shape · model-facing imperative + screenshot · "Generate HTML…" template · no end-user motivation, no product framing
WebGen-Bench / WebGen-Instruct
Lu et al. 2025 · arXiv 2505.03733
HF dataset card ↗
101 test + 6,667 train instructions for end-to-end website generation
"Please implement a website for generating stock reports to provide stock information and analysis. The website should have the functionality to search and summarize stock information, and generate customized stock reports based on user requirements. Users should be able to input stock codes or names, select report formats and content, and the website will automatically generate the corresponding reports. The reports should include basic stock information, market trends, financial data, and more. Set the background color to white and the component color to navy."
shape · third-person functional spec · "Users should be able to…" PRD voice · feature checklist with a tacked-on color rule · no first-person motivation
MM-WebGen-Bench
Microsoft · 2025
HF dataset card ↗
120 multimodal webpage-design prompts · 11 scenes × 11 styles · input field shown
"Design a playful, vibrant event landing page titled 'Pixel Pop Creative Camp 2025' that showcases a weekend of digital design and comic illustration workshops. The overall layout should favor a two-column structure with deliberate asymmetries and whimsical details, all in a bright, candy-inspired palette. The page back…"
shape · design-director brief · "Design a [adjective stack] [page type] titled '…' that …" · aesthetic vocabulary · still third-person, no real-user voice
ui-queryMaker
corpus + persona
corpus_run_v7_mobile_500/queries.jsonl · q_scene_002_001 · persona = curator · L2 = Personal life (个人生活类) · topic = Travel Memory Scrapbook
"A travel memory scrapbook app for mobile where each trip feels like flipping through a beautifully worn journal — photos layered with handwritten-style captions, little stamps marking the city and date, and a warm golden-hour color palette that gives everything that nostalgic, end-of-a-roll-of-film feeling. The home screen shows all your trips arranged like a loose stack of polaroids you can tap into, and once inside a trip, the memories are laid out in an organic, almost collage-like way where things slightly overlap and feel like someone actually pasted them onto a page by hand. The vibe should feel tactile and intimate, the kind of thing you'd want to screenshot and share just because it looks so good."
shape · first-person user voice + complete 0-to-1 app framing · 117 words · top-level noun = app · no dev jargon · positive framing (NEG = 0)
The shape gap, observed without rewriting: Quoted directly, the four public web-codegen datasets cluster into recognizable shapes — visual-structural description (WebSight), model-facing imperative (Web2Code), third-person functional spec (WebGen-Bench), and design-director brief (MM-WebGen-Bench). None of them read like an actual end user of Bolt / v0.dev / Doubao / Lingguang asking for an app — those users speak in first-person, with personal motivation and product framing. The "shape gap" between these benchmarks and real production queries is exactly what corpus + persona is designed to close.
// QUALITY REPORT · QUANTITATIVE · vs 4 PUBLIC WEB-CODEGEN DATASETS

Quantitative quality report — real numbers, identical metrics on all 5 datasets

Every number below is computed from real verbatim data — 100 samples per dataset, identical algorithm applied to all five. Code: scripts/quality/*. Raw output: data/output/quality_report/*.json. Charts A+B run as pure-regex / pure-counting Node scripts; chart C runs locally on sentence-transformers/all-MiniLM-L6-v2 ONNX (90 MB, in models/) + van der Maaten's reference numpy t-SNE. Reproducible single-command pipeline; no LLM-as-judge, no API, no downstream training in scope.

A · Surface signals — voice + lexical · 7 metrics × 5 datasets

Columns: dataset | jargon/q (↓ user voice) | words p50 (context) | distinct-3 (↑ less template) | TTR (↑ richer vocab) | #openers (↑ variety, /100) | top opener · share (↓ less template) | max-peer sim (↓ fewer dupes)
📖 metric reference · what each column means
jargon/q · lower = ordinary-user voice
Voice (vocab). Mean dev-jargon hits per query (dashboard, modal, swipeable, CTA…). Real end users don't know these terms. Our target: ~0.
words p50 · context-only · ~80-150 ideal
Length. Median word count per query. Too short = info-thin / one-line command; too long = spec or design brief, not a user's natural ask.
distinct-3 · higher = less templating
Lexical diversity. Unique trigrams ÷ total trigrams across the dataset (Li et al. 2016). 1.0 = every 3-word sequence is unique; lower = lots of repeated phrases.
TTR · higher = richer vocab
Vocab richness. Type-token ratio = unique words ÷ total words. Long texts naturally have lower TTR (words repeat).
#openers · higher = more variety · /100
Opener variety. Count of unique 3-word openers across 100 queries. Ceiling 100 (every query unique); WebGen-Bench at 4 = 100 queries use only 4 opening templates.
top opener · share · lower share = less template
Opener templating. The most common 3-word opener and its share. 76% = three out of four queries start with the same 3 words.
max-peer sim · lower = fewer near-dupes
Near-duplicate rate. For each query, its highest trigram-Jaccard similarity to any other peer; averaged across 100 queries.
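Three of the counting metrics defined above (distinct-3, TTR, and the opener statistics) are simple enough to sketch directly. The tokenizer is an assumption (lowercased word tokens), and the function names are hypothetical; the repo's own implementations live under scripts/quality/ and may differ in detail.

```javascript
// Lowercased word tokens (tokenization rule is an assumption).
function words(text) {
  return text.toLowerCase().match(/[\p{L}\p{N}']+/gu) || [];
}

// distinct-3: unique word trigrams / total word trigrams across the dataset.
function distinct3(queries) {
  const seen = new Set();
  let total = 0;
  for (const q of queries) {
    const w = words(q);
    for (let i = 0; i + 3 <= w.length; i++) {
      seen.add(w.slice(i, i + 3).join(" "));
      total++;
    }
  }
  return total === 0 ? 0 : seen.size / total;
}

// TTR: unique words / total words across the dataset.
function typeTokenRatio(queries) {
  const all = queries.flatMap(words);
  return all.length === 0 ? 0 : new Set(all).size / all.length;
}

// Opener stats: distinct 3-word openers, the most common one, and its share.
function openerStats(queries) {
  const counts = new Map();
  for (const q of queries) {
    const opener = words(q).slice(0, 3).join(" ");
    counts.set(opener, (counts.get(opener) || 0) + 1);
  }
  let top = null;
  let topCount = 0;
  for (const [opener, c] of counts) {
    if (c > topCount) {
      top = opener;
      topCount = c;
    }
  }
  return { unique: counts.size, top, share: queries.length ? topCount / queries.length : 0 };
}

const sample = ["Please implement a site", "Please implement a shop", "I want a small app"];
console.log(openerStats(sample).top); // please implement a
```

On this three-query sample, two of the seven trigrams coincide, so distinct-3 is 6/7, and the top opener covers two of three queries.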

B · Intra-dataset spread — does our pre-defined requirements distribution actually surface in the queries?

500 ui-queryMaker queries from corpus_run_v7_mobile_500, embedded with sentence-transformers/all-MiniLM-L6-v2 (ONNX, local, 384-dim, mean-pool + L2-norm) → t-SNE 2D (perplexity 30, 1000 iter, van der Maaten's reference numpy implementation). Click a color-mode button to recolor: L1 (12 categories, Excel-defined topic structure) · persona (5 archetypes, who/how channel) · style · complexity. The proper test isn't "are we visually different from other datasets" — different datasets target different scenarios. The test is: do queries cluster meaningfully along the structural fields we set out to spread across? If yes → the prompt didn't collapse. Same intra-dataset shape inspection as Code Aesthetics (arXiv 2510.23272).
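The "mean-pool + L2-norm" step mentioned above reduces a model's per-token vectors to one unit-length sentence vector. The sketch below shows only that arithmetic on plain arrays; it assumes padding tokens have already been removed (real pipelines usually weight by the attention mask), and `meanPool` and `l2Normalize` are hypothetical helper names.

```javascript
// Average a list of equal-length token vectors into one vector.
function meanPool(tokenVectors) {
  const dim = tokenVectors[0].length;
  const out = new Array(dim).fill(0);
  for (const v of tokenVectors) {
    for (let i = 0; i < dim; i++) out[i] += v[i];
  }
  return out.map((x) => x / tokenVectors.length);
}

// Scale a vector to unit Euclidean length (zero vectors are returned unchanged).
function l2Normalize(vec) {
  const norm = Math.sqrt(vec.reduce((s, x) => s + x * x, 0));
  return norm === 0 ? vec.slice() : vec.map((x) => x / norm);
}

const pooled = meanPool([[1, 2], [5, 6]]); // [3, 4]
console.log(l2Normalize(pooled)); // [ 0.6, 0.8 ]
```

After normalization, the dot product of two sentence vectors equals their cosine similarity, which makes distances between queries comparable regardless of query length.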

What the numbers actually show: Charts A + B (cross-dataset, lexical): ours leads on distinct-3 trigram diversity (0.76) and is the only dataset with non-trivial first-person voice (0.81%, ~4–5× the others). WebGen-Bench has 76% of openers starting with "please implement a" (opener entropy 1.11 bits, only 4 unique openers) — clear templating. Web2Code has the highest intra-dataset max-peer similarity (0.33) from the "Generate HTML…" template. MM-WebGen-Bench briefs average 2,600 words (25× ours).

Chart C (intra-ours, semantic): our 500 queries spread across all 12 L1 categories with each L1 forming a coherent cluster in MiniLM embedding space — the corpus channel's topic anchoring carries through to the semantic layer, not just the lexical layer. Switch the color mode to persona to see persona archetype distribution within each L1; switch to complexity to see complexity tiers.

What this proves and doesn't: measurable shape differences vs other web-codegen datasets and intra-dataset spread along our pre-defined requirements distribution. Cross-dataset 2D projection was intentionally removed — different datasets target different scenarios, so clustering apart proves nothing about quality. The "is ours better" question still needs downstream validation, honestly listed in §Limitations.
EXTENSION · DESIGN STYLE

Design-style aware generation

A registered design-style layer on top of the core two-channel pipeline — orthogonal to what and who/how, controls the look. Eleven first-class styles, each with a Chinese persona-side hint and an English LLM-side instruction. New styles register in one line.

MODE 01 · default
Free / contextual inference
No fixed style — let the visual direction emerge naturally from scene + topic context.
design_style: null
MODE 02 · fixed
Explicit rotation
Cycle through a user-supplied list. Guarantees coverage of priority styles.
--design-styles "Dark,Glassmorphism,Cyberpunk"
MODE 03 · auto
Heuristic scene-matching
inferStyles() ranks styles by scene keywords — e.g. health + meditation → Neumorphism, Minimalism, Dark.
--design-styles auto
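The three invocation modes can be read as one dispatch on the `--design-styles` value. The sketch below is an assumption about how that dispatch might work: `resolveDesignStyle` is a hypothetical helper, the ranking function is passed in rather than reproduced, and taking the top-ranked style in auto mode is a guess rather than documented behavior.

```javascript
// Hypothetical helper: map the --design-styles value to a style for task `index`.
// `inferStyles` is injected because its real ranking logic is not shown in the text.
function resolveDesignStyle(flagValue, index, scene, inferStyles) {
  if (!flagValue) return null; // MODE 01 · default: design_style stays null
  if (flagValue === "auto") {
    // MODE 03 · auto: take the top-ranked style for this scene (assumption)
    const ranked = inferStyles(scene);
    return ranked.length ? ranked[0] : null;
  }
  // MODE 02 · fixed: rotate through the user-supplied list
  const list = flagValue.split(",").map((s) => s.trim()).filter(Boolean);
  return list.length ? list[index % list.length] : null;
}

const stubInfer = () => ["Neumorphism", "Minimalism", "Dark"]; // stand-in ranking
console.log(resolveDesignStyle("Dark,Glassmorphism,Cyberpunk", 4, {}, stubInfer)); // Glassmorphism
console.log(resolveDesignStyle("auto", 0, {}, stubInfer)); // Neumorphism
```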
// HOW WE GOT HERE

Iteration story: 4 stages, 4 fixes

Each stage is a real production-grade pain we discovered after running the previous version at scale, then a focused prompt or pipeline fix. Click any "Sample query" to read the actual generated text — same persona, side by side, before vs after.

Stage 1
Naive baseline — single prompt, no scaffolding
PAIN
No persona, no opener distribution, no scope discipline. Same first-N corpus topics get reused across batches; "Build a mobile X" pattern dominates; query framed as a single page or screen, not a 0-to-1 app.
Sample query (illustrative)
BEFORE: Build a mobile travel scrapbook page where users can create entries with cover photo, trip title, date range, and a collage-style photo grid… [identical opener and "page" framing across the entire batch]
Stage 2
Three-layer diversification 040a427
FIX
Layer-A least-used topic dedup across batches via corpus_usage.json state · 5-bucket opener hash (Build a / Need a / Create a / Make a / no opener) keyed by query_id · persona-tone semantic mapping from L2 → 5 ordinary-user archetypes.
RESULT
Cross-batch corpus-topic overlap: 100% → 0%. "Build a" share: 54% → 21%. 5 distinct persona voices visible in batch.
NEW PAIN
Audit found 49.5% of queries still framed as "Build a XX page where…" — the opener is now diverse but the scope noun is still page-level, causing downstream LLMs to generate single-page mocks instead of complete apps.
Stage 3
App-scope rewrite 07ed4af
FIX
New rule 7 forbids page / screen / view / section / module / feature / widget as the top-level scope noun. Use app or a specific app type (tracker / tool / reminder / planner / calculator / logger / manager / timer). Existing 200-row batch retrofitted via scripts/rewrite-app-scope.js.
RESULT
Single-page-framed queries: 49.5% → 0%. Average word count unchanged (91 → 91, no length inflation).
NEW PAIN
Audit found 76% of EN queries / 80% of ZH contained negation words; 45% had outright grievance dump patterns ("no stock photo, no cartoon, no confetti…"). Real users describing 0-to-1 apps speak in positive terms — they're building something they want, not refining something they hate.
Sample query (real, founder_like persona, wedding card creator)
BEFORE: Make a wedding invitation card creator that feels personal and handcrafted, not like some cookie-cutter template factory — I want a small set of maybe four or five elegant layouts I can actually customize with our names, date, and a short line of text, and the font choices should lean traditional and warm, not trendy sans-serif stuff. No stock photo backgrounds, no cartoon illustrations, no confetti animations — just clean, tasteful design with maybe a soft floral border option. It should feel like something I made myself, not something that came off an assembly line.
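The Stage 3 scope rule (rule 7) lends itself to a simple audit. The sketch below uses a rough heuristic, stated here as an assumption: the first listed noun that appears in the query is treated as its top-level scope noun. The two word lists come from the rule text; `scopeNoun` and `scopeOk` are hypothetical helpers and not the logic of scripts/rewrite-app-scope.js.

```javascript
// Word lists taken from the rule text.
const FORBIDDEN = ["page", "screen", "view", "section", "module", "feature", "widget"];
const ALLOWED = ["app", "tracker", "tool", "reminder", "planner", "calculator", "logger", "manager", "timer"];

// Heuristic assumption: the first listed noun in the query is its scope noun.
function scopeNoun(query) {
  const tokens = query.toLowerCase().match(/[a-z]+/g) || [];
  return tokens.find((t) => FORBIDDEN.includes(t) || ALLOWED.includes(t)) || null;
}

// A query passes when its scope noun exists and is on the allowed list.
function scopeOk(query) {
  const noun = scopeNoun(query);
  return noun !== null && ALLOWED.includes(noun);
}

console.log(scopeNoun("Build a mobile travel scrapbook page where users can create entries")); // page
console.log(scopeOk("Create a resume builder app that has its own quiet identity")); // true
```

The heuristic is word-based, so it can misfire when a listed word is used as a verb ("view my trips") before the real scope noun appears; a production audit would likely need a stricter position or part-of-speech check.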
Stage 4
Positive-framing prompt rewrite — current ee04965
FIX
(1) Rewrote founder_like voice — was literally instructed to "explain what NOT to include as much as what to include". (2) Flipped three "Do NOT" prompt rules into positive-form ("Open with the substance" / "Use everyday vocabulary" / "Use 'app' as top-level noun"). LLMs primed by negation imperatives mirror them in output. (3) Added explicit positive-framing rule limiting negation words to ≤1 per query.
RESULT
NEG words / query: 1.08 → 0.66 (-39%). Grievance patterns: 0.58 → 0.22 (-62%). By persona: curator -56%, maker -57%, planner -39%. Residual negations are now feature-value descriptions ("auto-saves so you never lose"), not grievance dumps.
Sample query (real, founder_like persona, ATS resume builder)
AFTER: Create a resume builder app that has its own quiet identity — one of those tools you actually feel good using — built around a small, carefully chosen set of ATS-friendly templates that are clean but still carry a bit of character, where I can fill in my experience and skills section by section and watch a live preview come together in a format that looks genuinely considered to a human reader while staying structured enough for automated hiring systems to parse without a fuss.
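The Stage 4 negation limit (at most one negation word per query) can be audited with a small counter. The negation lexicon below is an assumption, since the text does not list the words the real audit uses; `negCount` and `violatesPositiveFraming` are hypothetical helpers.

```javascript
// Assumed negation lexicon; the audit's real word list is not shown in the text.
const NEG_WORDS = new Set(["no", "not", "never", "without", "none", "nothing"]);

// Count negation tokens, including contractions that end in "n't".
function negCount(query) {
  const tokens = query.toLowerCase().replace(/\u2019/g, "'").match(/[a-z']+/g) || [];
  return tokens.filter((t) => NEG_WORDS.has(t) || t.endsWith("n't")).length;
}

// Positive-framing rule from the text: at most `max` negation words per query.
function violatesPositiveFraming(query, max = 1) {
  return negCount(query) > max;
}

const grievance = "No stock photo backgrounds, no cartoon illustrations, no confetti animations";
console.log(negCount(grievance), violatesPositiveFraming(grievance)); // 3 true
console.log(negCount("It auto-saves so you never lose a draft")); // 1
```

A pure count cannot tell a grievance dump from a feature-value phrase like "never lose a draft", which is presumably why the limit is set at one rather than zero.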

Each stage is a single, focused commit on main — fully reproducible. The full negation-density audit script is one inline node command; quantitative deltas are computed against the same persona breakdown so improvements are not just topic luck.

// THE RECEIPTS

Real samples from a production run

Six queries picked from a real 200-query corpus-direct run, spanning six L1 categories. The data ships bilingual (English and Chinese) out of the box.

// RUN IT YOURSELF

Quickstart

From clone to 50 generated queries in under 10 minutes. Requires Node 18+ and the Claude CLI (or set a Packy-compatible API key).

```bash
# 1. Install
git clone https://github.com/PlevanTem/queryMaker.git
cd queryMaker && npm install

# 2. Dry-run the plan (no LLM call)
node scripts/run-corpus.js --total 50 --dry-run

# 3. Real run: 50 queries, ~3 min
node scripts/run-corpus.js --total 50 --concurrency 2

# 4. Inspect outputs
cat data/output/corpus_run/summary.json
head data/output/corpus_run/queries.jsonl
```
Each record in queries.jsonl includes the English query, a Chinese translation, word count, scene metadata, the corpus topic anchor, model name, and timing. Pipe directly into training, or run scripts/generate-analysis-report.js for a self-contained HTML report.
// HONEST BOUNDARIES

Limitations we're honest about

Synthetic data has edges. Listing them is not self-sabotage — it puts credibility on a multi-indicator evidence net, and it tells adopters which piece they should reinforce in their own setting.

Persona pool size
5 archetypes chosen by L2 semantic best-fit within the UI product domain. Good coverage on the head; long-tail user representativeness still needs reverse validation against real user logs.
Diversity measured at lexical layer
Diversity is currently measured at the trigram-Jaccard and corpus-distribution layers. Deeper evidence at the semantic / task-distribution / discriminator layers is out of scope for this repo.
No end-to-end downstream validation
The ultimate evidence — "model capability gain when this data is added to training" — is not in scope here due to downstream training resource limits. Adopters should run a controlled comparison in their own setting.
Narrow-corpus L2 scenes
Where an L2 scene's corpus is narrow, the persona channel cannot compensate — mode collapse can still happen. Known narrow scenes are flagged in data/intermediate/scenario_specs/; the recommendation is to grow the corpus rather than stack more personas.
Authenticity relies on expert review + heuristics
Currently design-expert blind review + 3-axis heuristic scoring (authenticity / specificity / diversity). Cold-metric counterparts (discriminator AUC, distribution distance) are not yet in place.
// THE MAP

What's inside

The shortest path to understanding the codebase.

mvp/query_factory_v2.js
Core engine: parsing, planning, generation, scoring (~2,600 LOC)
scripts/run-corpus.js
Production CLI for the corpus-direct pipeline
scripts/run-free.js
Persona-driven runner with --persona-scope control
scripts/test-corpus-methods.js
4-method evaluation harness on identical scenes
scripts/generate-analysis-report.js
Reusable: turn any scored JSONL into an interactive HTML report
scripts/corpus_data.json
The 2,440-topic corpus across 61 L2 scenes
data/output/corpus_run/
Latest production run — plan, queries, summary, xlsx export
README.md · ARCHIVE/
Full pipeline diagram and the original 5-stage research blueprint