ui-queryMaker — Realistic UI Query Data Synthesis at Scale

// THE PROBLEM

Why synthesized UI queries usually fall short

Most "let's just prompt an LLM for some queries" pipelines collapse to narrow, templated distributions that don't generalize. We attack that on three axes.

Narrow distribution

A naive prompt returns 100 variations of "build me a dashboard" — covering < 5% of real product space.

Real-world corpus anchoring

Each generation is locked to a specific topic from a curated 2,440-entry corpus spanning 12 L1 categories.

Robotic phrasing

LLMs default to a single polite, structured voice that doesn't match how humans actually request UI.

Persona-driven voice

Five archetypes × three complexity styles produce first-person voice variation grounded in user goals.

Visually flat output

Generated queries rarely specify visual style, leaving downstream UI generation to default to one aesthetic.

Design-style awareness

Eleven registered design styles × three invocation modes (default / fixed / heuristic-auto) inject visual intent.

// WHO THIS IS FOR

Three concrete usage angles

Each card maps to a real situation where a team needs "a batch of high-quality query data" — not a benchmark, not a demo, but data that holds up in production or research.

🏭

Industry · Product

NL → mini-app code generation products

Teams behind Bolt, v0.dev, Builder.io, ByteDance Doubao, Ant Lingguang and similar products need high-quality seed query data for training, eval, and prompt iteration. This repo ships a drop-in 2,440-topic × 5-persona × 11-style dataset, distribution-faithful to scene weights, framed as 0-to-1 mini-apps (not single-page mocks).

🎓

Research

Synthetic data, instruction tuning, UI codegen

Controlled-variable benchmark in test-corpus-methods.js empirically shows corpus anchoring and persona injection are additive — both raise topic hit rate and authenticity independently. Three diversification layers (Layer-A dedup / opener hash / persona-tone) are open-sourced and directly reproducible as a baseline.

🛠

Industry · Internal teams

"A batch of seed queries, fast"

ML / engineering teams who need ad-hoc query batches for prompt iteration, UX test sets, pre-launch API stress tests, or product demos. ~3.3s per query · ~7 min for 200 · cross-batch dedup state · bilingual EN+ZH xlsx out-of-the-box.

// THE PIPELINE

How one UI query is synthesized — 4-step pipeline

Inputs from the corpus channel (what) and the persona channel (who / how) converge at buildCorpusPlan(); a single Claude call generates the query; successful rows feed back to corpus_usage.json so the next batch picks least-used topics.

The diagram shows the corpus-direct path (production default). The legacy persona-driven path adds an extra LLM call for persona synthesis between Step 1 and Step 2; both share the same Step 4 scoring + persistence layer.
How Step 4 scores — per-record heuristic: Authenticity × 0.4 + Specificity × 0.4 + Diversity × 0.2, pass threshold ≥ 2.8 / 5.0. A and S get equal 40% because "sounds like a real user" and "specific enough to be useful" are the two signals most directly computable from a single query; D only gets 20% because a single query can't prove batch-level diversity — that responsibility sits upstream in the corpus channel and Layer-A state, not at the per-record scorer.
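As a minimal sketch of the weighted sum above: the weights and the 2.8 threshold come from the text, while `scoreRecord` and its field names are hypothetical, and how each axis score is produced is not shown here.

```javascript
// Hypothetical sketch of the Step 4 per-record score.
// Weights and threshold are quoted from the text; all names are illustrative.
const WEIGHTS = { authenticity: 0.4, specificity: 0.4, diversity: 0.2 };
const PASS_THRESHOLD = 2.8; // out of 5.0

function scoreRecord({ authenticity, specificity, diversity }) {
  const total =
    authenticity * WEIGHTS.authenticity +
    specificity * WEIGHTS.specificity +
    diversity * WEIGHTS.diversity;
  // Round to 2 decimals so float noise does not flip a borderline record.
  const rounded = Math.round(total * 100) / 100;
  return { total: rounded, pass: rounded >= PASS_THRESHOLD };
}

// A record scoring 3 / 3 / 2 lands exactly on the threshold.
console.log(scoreRecord({ authenticity: 3, specificity: 3, diversity: 2 }));
```

Rounding before the comparison is a choice made for this sketch, so that a record sitting exactly on the threshold is not rejected by floating-point error.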

PILLAR 01

Diversity through real-world grounding

The "pre-stage" is the secret. We don't ask the LLM to come up with topics — we pre-allocate them from a curated corpus, then let the LLM render natural language around each anchor. Similarity-based scoring then verifies the output stays diverse.

// PRE-STAGE

Hierarchical topic allocation

Excel scenario spec is parsed into 61 L2 scenes across 12 L1 categories. Each task in the plan is anchored to a concrete corpus topic — not a templated slot.

spec × corpus_data → plan.jsonl
// 200 tasks · L1-proportional · topic-anchored
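One way to turn scene weights into an L1-proportional task count is largest-remainder rounding. This is a sketch under assumptions: the `allocateProportional` name, the example weights, and the rounding method are illustrative and are not taken from the repo.

```javascript
// Hypothetical sketch: split `total` tasks across categories in proportion
// to their weights, using largest-remainder rounding so counts sum to total.
function allocateProportional(weights, total) {
  const keys = Object.keys(weights);
  const sum = keys.reduce((acc, k) => acc + weights[k], 0);
  // Multiply before dividing to keep integer inputs exact where possible.
  const exact = keys.map((k) => (weights[k] * total) / sum);
  const counts = exact.map(Math.floor);
  let remaining = total - counts.reduce((a, b) => a + b, 0);
  // Hand leftover tasks to the largest fractional parts; ties go to earlier keys.
  const order = exact
    .map((x, i) => ({ i, frac: x - Math.floor(x) }))
    .sort((a, b) => b.frac - a.frac || a.i - b.i);
  for (const { i } of order) {
    if (remaining === 0) break;
    counts[i] += 1;
    remaining -= 1;
  }
  return Object.fromEntries(keys.map((k, i) => [k, counts[i]]));
}

// Illustrative weights only; real L1 weights come from the scenario spec.
console.log(allocateProportional({ catA: 50, catB: 30, catC: 20 }, 200));
```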

// POST-STAGE

Trigram Jaccard dedup

Each query is compared against its same-scene peers via trigram-set Jaccard similarity. The diversity score rewards low max-peer-similarity.

J(A,B) = |trigrams(A) ∩ trigrams(B)| / |trigrams(A) ∪ trigrams(B)|

sim < 0.55 · distinct enough · +1.0
sim < 0.30 · strongly distinct · +2.0
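The similarity check can be sketched as follows. Two assumptions are made here and are not confirmed by the text: trigrams are taken over words (not characters), and the two thresholds are read as tiers rather than cumulative bonuses. The function names are hypothetical.

```javascript
// Assumption: word-level trigrams over lowercased tokens.
function trigrams(text) {
  const words = text.toLowerCase().match(/[\p{L}\p{N}']+/gu) || [];
  const set = new Set();
  for (let i = 0; i + 3 <= words.length; i++) {
    set.add(words.slice(i, i + 3).join(" "));
  }
  return set;
}

// Jaccard similarity of two trigram sets: |A ∩ B| / |A ∪ B|.
function jaccard(a, b) {
  if (a.size === 0 && b.size === 0) return 0;
  let inter = 0;
  for (const t of a) if (b.has(t)) inter++;
  return inter / (a.size + b.size - inter);
}

// Highest similarity between one query and any same-scene peer.
function maxPeerSimilarity(query, peers) {
  const q = trigrams(query);
  return peers.reduce((max, p) => Math.max(max, jaccard(q, trigrams(p))), 0);
}

// Assumption: thresholds are tiers, so the bonus is 2.0, 1.0, or 0.
function diversityBonus(sim) {
  if (sim < 0.3) return 2.0;
  if (sim < 0.55) return 1.0;
  return 0;
}

const sim = maxPeerSimilarity("a b c d", ["x y z w", "a b c e"]);
console.log(sim, diversityBonus(sim));
```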

// PAIN POINTS DISCOVERED IN BATCH GENERATION → FIXES SHIPPED

// CROSS-BATCH

Layer-A · Least-used topic sampling

Naive topics[i % length] rotation makes consecutive batches pick identical corpus topics. We persist usage counts per (l2_key, topic) in data/state/corpus_usage.json and prefer least-used topics for each new run. Topic overlap with the previous batch drops from 100% to 0%.

pickLeastUsedTopics(topics, n, usageMap)
// stable tiebreak by original index

v3 → v4 · topic overlap with prior batch: 100% → 0%
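A minimal sketch of least-used sampling with a stable tiebreak, assuming `usageMap` is a plain object from topic string to usage count for one L2 scene. The signature matches the one quoted above, but the body is illustrative and not the repo's implementation.

```javascript
// Sketch: prefer topics with the lowest usage count; break ties by the
// topic's original position so reruns with the same state are reproducible.
function pickLeastUsedTopics(topics, n, usageMap) {
  return topics
    .map((topic, index) => ({
      topic,
      index,
      used: Object.hasOwn(usageMap, topic) ? usageMap[topic] : 0,
    }))
    .sort((a, b) => a.used - b.used || a.index - b.index)
    .slice(0, n)
    .map((entry) => entry.topic);
}

// Hypothetical usage state: "a" used twice, "c" once, "b" and "d" never.
const picked = pickLeastUsedTopics(["a", "b", "c", "d"], 2, { a: 2, c: 1 });
console.log(picked);
```

Because successful rows increment the counts, the next batch sees a different ordering, so it avoids repeating the topics just used.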

// PROMPT CONVERGENCE

Opener hash distribution

With "mobile H5" in the system prompt, 54% of v3 outputs converged on "Build a mobile ...". We deterministically hash query_id into one of 5 opener buckets (Build a / Need a / Create a / Make a / no formal opener), forcing a near-uniform distribution. The assignment is idempotent across reruns.

opener = OPENERS[hash(query_id) % 5]

v3 · "Build a" share: 54%
v4 · all 4 main openers: 38–48 each
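The bucket assignment can be sketched like this. The repo's actual hash function is not shown in the text, so FNV-1a is used here purely as a stand-in, and `pickOpener` is a hypothetical name.

```javascript
// null stands for the "no formal opener" bucket.
const OPENERS = ["Build a", "Need a", "Create a", "Make a", null];

// Stand-in hash (32-bit FNV-1a); an assumption, not the repo's function.
function hash(str) {
  let h = 0x811c9dc5;
  for (let i = 0; i < str.length; i++) {
    h ^= str.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0; // keep the value an unsigned 32-bit int
  }
  return h;
}

// Same query_id always maps to the same bucket, so reruns are idempotent.
function pickOpener(queryId) {
  return OPENERS[hash(queryId) % OPENERS.length];
}

console.log(pickOpener("q_scene_002_001"));
```

Keying on query_id rather than a random draw means a rerun of the same plan reproduces the same opener per row.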

// VOICE / TONE

Layer-C · Persona-tone semantic mapping

v4 batches still read like PM specs ("GTD inbox dashboard", "auto-generate based on UV index", "bottom card swipe up to expand"). We define 5 ordinary-user personas (maker, planner, curator, operator, founder_like) and assign each task the best-fit persona by L2 semantics (not random hash) via scripts/corpus_persona_map.json. The prompt injects each persona's voice descriptor — purely qualitative, no example phrases (which would make output rigid). A jargon blacklist (modal / dashboard / auto-generate / swipeable / bottom sheet ...) reinforces what each persona wouldn't say.

persona = map[task.corpus_l2_key] || "maker"
// 61 L2 entries · semantic best-fit · NOT hash-random

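curator: personal life · content creation · restaurant reviews · note editing
operator: Pomodoro timer · to-do lists · health check-ins · edit / search / filter
planner: flashcards · trip itineraries · health tracking · study check-ins
founder_like: personal professional · poster / resume templates
maker: long-tail micro-tools · classic mini-games · Adding & Creating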
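A small sketch of the persona lookup and a jargon check, under stated assumptions: `resolvePersona` and `jargonHits` are hypothetical names, the map contents are made up for the example, and the blacklist holds only the five terms quoted above (the real list is longer).

```javascript
// Only the terms quoted in the text; the repo's full blacklist is not shown.
const JARGON = ["modal", "dashboard", "auto-generate", "swipeable", "bottom sheet"];

// Look up the best-fit persona by L2 key, falling back to "maker".
function resolvePersona(task, personaMap) {
  return personaMap[task.corpus_l2_key] || "maker";
}

// Return the blacklisted terms that appear as whole words in a query.
function jargonHits(query) {
  const text = query.toLowerCase();
  return JARGON.filter((term) => {
    const escaped = term.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
    return new RegExp(`\\b${escaped}\\b`).test(text);
  });
}

// Illustrative map entry; real keys live in scripts/corpus_persona_map.json.
const demoMap = { example_l2_key: "curator" };
console.log(resolvePersona({ corpus_l2_key: "example_l2_key" }, demoMap));
console.log(jargonHits("GTD inbox dashboard with a bottom sheet"));
```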

Bootstrap from an existing run · when enabling Layer-A for the first time, seed the state file from a historical queries.jsonl + plan.jsonl pair. Subsequent runs auto-update on success; failed rows are not counted toward usage. Inspect data/state/corpus_usage.json at any time.

PILLAR 02

Authenticity through persona

Five archetypes drawn from how people actually request UI work. Each archetype carries per-complexity style hints, so the same persona phrases a vague request differently from a complex one — preserving voice while scaling specificity.

// HOW THIS COMPARES

vs Persona Hub / Self-Instruct / Magpie

Synthetic-data research has roughly four lines — instance-driven (Self-Instruct, Evol-Instruct), key-point-driven (GLAN), persona-driven (Persona Hub), and self-play (Magpie). This repo sits on the persona-driven line and adds three engineering reinforcements the original paper did not include.

Dimension	Persona Hub (Tencent AI Lab, 2024)	This repo
Persona source	Generic 1B persona pool, reverse-derived from web text	5 targeted archetypes reverse-derived from product scenarios + real user profiles
Distribution control	Black-box: relies on the large pool's natural spread	White-box: corpus channel tracks 2,440 topics, Layer-A least-used-first state
Horizontal ablation	No with/without-persona ablation; no head-to-head vs Self-Instruct / Magpie	4-way: `llm-direct` / `corpus-direct` / `persona-direct` / `corpus+persona`
Typical setting	General-domain distillation at scale	Vertical product domain (UI vibe-coding) — where you can actually obtain a clean corpus

Honest framing: Persona Hub's headline evidence is "1M-persona synthesis trains a 7B model to approach GPT-4-turbo on MATH" — strong end-to-end result evidence, but it does not directly isolate the marginal contribution of the persona mechanism itself vs other synthesis routes. This repo fills that ablation gap (see next section) and adds a corpus channel for white-box distribution anchoring.

    Method lineage: Self-Instruct (2022) → Evol-Instruct (2023) → Magpie (2024) → Persona Hub (2024) → this repo.
  

PILLAR 04

The 4-way ablation Persona Hub skipped

Under the same base model (claude-sonnet-4-6), the same query total, and the same eval protocol, four generation strategies run head-to-head on identical scenes. Each method isolates the contribution of one control signal, corpus (what) or persona (who / how), so the marginal effect of each channel becomes visible.

METHOD A · baseline

Scene-direct

Only L1/L2 scene context. No topic, no persona. Measures raw scene-conditioned quality.

METHOD B · primary

Corpus-direct

Scene + specific corpus topic. 1× LLM call. The production default — sweet spot.

METHOD C · voice

Persona-only

Scene + persona, no topic anchor. Tests whether persona alone yields specific queries.

METHOD D · combined

Persona + Corpus

Best on paper. 2× LLM calls (persona synthesis + query). Higher cost & latency.

Method	Topic adherence	Voice authenticity	LLM calls / query	Verdict
Scene-direct	~60–80%	low	1×	Baseline · drifts off-topic
Corpus-direct	~100%	medium	1×	Production default · best ROI
Persona-only	low	high	1×	Strong voice · weak specificity
Persona + Corpus	~100%	high	2×	Best quality · 2× cost & latency

Honest takeaway: Persona+Corpus wins on paper, but at 2× the cost. For most production workloads, corpus-direct hits the sweet spot — 100% topic adherence with a single LLM call. Persona+Corpus is reserved for high-value subsets where voice quality matters most.

    Scoring rubric: Authenticity (40%) + Specificity (40%) + Diversity (20%). Pass threshold ≥ 2.8 / 5.0.
  

// VERBATIM QUERY SHAPE · vs WEB / UI CODEGEN DATASETS

What a query actually looks like in each public web-codegen dataset

Verbatim quotes from each dataset's published samples — no paraphrasing, no rewriting. Scope deliberately narrowed to web / UI code-generation datasets (the same application surface this repo targets), not general instruction-tuning corpora. Each row is a real string copied from the dataset card or release; the source link sits next to it so you can verify. Then one real bilingual query from this repo for shape contrast.

Dataset · source	Verbatim query / instruction (as published)
WebSight v0.1 HuggingFaceM4 HF dataset card ↗ screenshot ↔ HTML pairs · the `text` field shown here is the natural-language side	"Fashion Brand: A visually stunning layout with a full-width, rotating image carousel showcasing their latest collections, a bold, center-aligned logo, and a bottom navigation menu. The color palette is inspired by the latest fashion trends." shape · visual-structural description · "[type]: [layout pieces]" template · no user voice, no goal, no context
Web2Code MBZUAI · NeurIPS 2024 HF dataset card ↗ 1.18M webpage-image ↔ HTML instruction pairs · `conversations[0].value` field shown	"<image>\nGenerate HTML corresponding to the webpage in the given image. See code developed with guidance from the principles of material design." shape · model-facing imperative + screenshot · "Generate HTML…" template · no end-user motivation, no product framing
WebGen-Bench / WebGen-Instruct Lu et al. 2025 · arXiv 2505.03733 HF dataset card ↗ 101 test + 6,667 train instructions for end-to-end website generation	"Please implement a website for generating stock reports to provide stock information and analysis. The website should have the functionality to search and summarize stock information, and generate customized stock reports based on user requirements. Users should be able to input stock codes or names, select report formats and content, and the website will automatically generate the corresponding reports. The reports should include basic stock information, market trends, financial data, and more. Set the background color to white and the component color to navy." shape · third-person functional spec · "Users should be able to…" PRD voice · feature checklist with a tacked-on color rule · no first-person motivation
MM-WebGen-Bench Microsoft · 2025 HF dataset card ↗ 120 multimodal webpage-design prompts · 11 scenes × 11 styles · `input` field shown	"Design a playful, vibrant event landing page titled 'Pixel Pop Creative Camp 2025' that showcases a weekend of digital design and comic illustration workshops. The overall layout should favor a two-column structure with deliberate asymmetries and whimsical details, all in a bright, candy-inspired palette. The page back…" shape · design-director brief · "Design a [adjective stack] [page type] titled '…' that …" · aesthetic vocabulary · still third-person, no real-user voice
ui-queryMaker ★ corpus + persona `corpus_run_v7_mobile_500/queries.jsonl` · `q_scene_002_001` · persona = curator · L2 = 个人生活类 · topic = Travel Memory Scrapbook	"A travel memory scrapbook app for mobile where each trip feels like flipping through a beautifully worn journal — photos layered with handwritten-style captions, little stamps marking the city and date, and a warm golden-hour color palette that gives everything that nostalgic, end-of-a-roll-of-film feeling. The home screen shows all your trips arranged like a loose stack of polaroids you can tap into, and once inside a trip, the memories are laid out in an organic, almost collage-like way where things slightly overlap and feel like someone actually pasted them onto a page by hand. The vibe should feel tactile and intimate, the kind of thing you'd want to screenshot and share just because it looks so good." shape · first-person user voice + complete 0-to-1 app framing · 117 words · top-level noun = `app` · no dev jargon · positive framing (NEG = 0)

The shape gap, observed without rewriting: Quoted directly, the four public web-codegen datasets cluster into recognizable shapes — visual-structural description (WebSight), model-facing imperative (Web2Code), third-person functional spec (WebGen-Bench), and design-director brief (MM-WebGen-Bench). None of them read like an actual end user of Bolt / v0.dev / Doubao / Lingguang asking for an app — those users speak in first-person, with personal motivation and product framing. The "shape gap" between these benchmarks and real production queries is exactly what corpus + persona is designed to close.

// QUALITY REPORT · QUANTITATIVE · vs 4 PUBLIC WEB-CODEGEN DATASETS

Quantitative quality report — real numbers, identical metrics on all 5 datasets

Every number below is computed from real verbatim data — 100 samples per dataset, identical algorithm applied to all five. Code: scripts/quality/*. Raw output: data/output/quality_report/*.json. Charts A+B run as pure-regex / pure-counting Node scripts; chart C runs locally on sentence-transformers/all-MiniLM-L6-v2 ONNX (90 MB, in models/) + van der Maaten's reference numpy t-SNE. Reproducible single-command pipeline; no LLM-as-judge, no API, no downstream training in scope.

A · Surface signals — voice + lexical · 7 metrics × 5 datasets

dataset	`jargon/q` ↓ user voice	`words p50` context	`distinct-3` ↑ less template	`TTR` ↑ richer vocab	`#openers` ↑ variety /100	`top opener · share` ↓ less template	`max-peer sim` ↓ fewer dupes

📖 metric reference · what each column means

jargon/q (lower = ordinary-user voice): Voice (vocab). Mean dev-jargon hits per query (dashboard, modal, swipeable, CTA…). Real end users don't know these terms. Our target: ~0.
words p50 (context-only · ~80-150 ideal): Length. Median word count per query. Too short = info-thin / one-line command; too long = spec or design brief, not a user's natural ask.
distinct-3 (higher = less templating): Lexical diversity. Unique trigrams ÷ total trigrams across the dataset (Li et al. 2016). 1.0 = every 3-word sequence is unique; lower = lots of repeated phrases.
TTR (higher = richer vocab): Vocab richness. Type-token ratio = unique words ÷ total words. Long texts naturally have lower TTR (words repeat).
#openers (higher = more variety · /100): Opener variety. Count of unique 3-word openers across 100 queries. Ceiling 100 (every query unique); WebGen-Bench at 4 = 100 queries use only 4 opening templates.
top opener · share (lower share = less template): Opener templating. The most common 3-word opener and its share. 76% = three out of four queries start with the same 3 words.
max-peer sim (lower = fewer near-dupes): Near-duplicate rate. For each query, its highest trigram-Jaccard similarity to any other peer; averaged across 100 queries.
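Three of these columns are simple counting metrics and can be sketched in a few lines. This is an illustration, not the code in scripts/quality/*: the tokenizer, the function names, and the choice to pool TTR over the whole dataset (rather than per query) are assumptions.

```javascript
// Assumption: lowercase word tokens; the repo's tokenizer may differ.
function tokenize(text) {
  return text.toLowerCase().match(/[\p{L}\p{N}']+/gu) || [];
}

// distinct-3: unique word trigrams divided by total word trigrams.
function distinct3(queries) {
  const seen = new Set();
  let total = 0;
  for (const q of queries) {
    const w = tokenize(q);
    for (let i = 0; i + 3 <= w.length; i++) {
      seen.add(w.slice(i, i + 3).join(" "));
      total++;
    }
  }
  return total === 0 ? 0 : seen.size / total;
}

// TTR: unique words divided by total words, pooled over all queries.
function typeTokenRatio(queries) {
  const words = queries.flatMap(tokenize);
  return words.length === 0 ? 0 : new Set(words).size / words.length;
}

// Opener stats: unique 3-word openers, plus the most common one and its share.
function openerStats(queries) {
  if (queries.length === 0) return { uniqueOpeners: 0, topOpener: null, topShare: 0 };
  const counts = new Map();
  for (const q of queries) {
    const opener = tokenize(q).slice(0, 3).join(" ");
    counts.set(opener, (counts.get(opener) || 0) + 1);
  }
  const [topOpener, topCount] = [...counts.entries()].sort((a, b) => b[1] - a[1])[0];
  return { uniqueOpeners: counts.size, topOpener, topShare: topCount / queries.length };
}

const demo = [
  "please implement a website for stocks",
  "please implement a website for news",
  "build me a tiny app",
];
console.log(distinct3(demo), typeTokenRatio(demo), openerStats(demo));
```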

C · Intra-dataset spread: does our pre-defined requirements distribution actually surface in the queries?

500 ui-queryMaker queries from corpus_run_v7_mobile_500, embedded with sentence-transformers/all-MiniLM-L6-v2 (ONNX, local, 384-dim, mean-pool + L2-norm) → t-SNE 2D (perplexity 30, 1000 iter, van der Maaten's reference numpy implementation). Click a color-mode button to recolor: L1 (12 categories, Excel-defined topic structure) · persona (5 archetypes, who/how channel) · style · complexity. The proper test isn't "are we visually different from other datasets" — different datasets target different scenarios. The test is: do queries cluster meaningfully along the structural fields we set out to spread across? If yes → the prompt didn't collapse. Same intra-dataset shape inspection as Code Aesthetics (arXiv 2510.23272).
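The "mean-pool + L2-norm" step named above is a standard way to turn per-token model outputs into one sentence vector. The sketch below shows only that step on plain arrays; it does not load the ONNX model, and the function names are hypothetical.

```javascript
// Average token vectors, skipping padding positions (attention mask value 0).
function meanPool(tokenEmbeddings, attentionMask) {
  const dim = tokenEmbeddings[0].length;
  const sum = new Array(dim).fill(0);
  let count = 0;
  tokenEmbeddings.forEach((vec, t) => {
    if (!attentionMask[t]) return;
    count++;
    for (let d = 0; d < dim; d++) sum[d] += vec[d];
  });
  return sum.map((x) => x / Math.max(count, 1));
}

// Scale a vector to unit length so dot product equals cosine similarity.
function l2Normalize(vec) {
  const norm = Math.sqrt(vec.reduce((acc, x) => acc + x * x, 0));
  return norm === 0 ? vec.slice() : vec.map((x) => x / norm);
}

// Toy 2-dim example; the real model emits 384-dim token vectors.
const pooled = meanPool([[1, 2], [3, 4], [100, 100]], [1, 1, 0]);
console.log(pooled, l2Normalize([3, 4]));
```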

What the numbers actually show: Charts A + B (cross-dataset, lexical): ours leads on distinct-3 trigram diversity (0.76) and is the only dataset with non-trivial first-person voice (0.81%, ~4–5× the others). WebGen-Bench has 76% of openers starting with "please implement a" (opener entropy 1.11 bits, only 4 unique openers) — clear templating. Web2Code has the highest intra-dataset max-peer similarity (0.33) from the "Generate HTML…" template. MM-WebGen-Bench briefs average 2,600 words (25× ours).

Chart C (intra-ours, semantic): our 500 queries spread across all 12 L1 categories with each L1 forming a coherent cluster in MiniLM embedding space — the corpus channel's topic anchoring carries through to the semantic layer, not just the lexical layer. Switch the color mode to persona to see persona archetype distribution within each L1; switch to complexity to see complexity tiers.

What this proves and doesn't: measurable shape differences vs other web-codegen datasets and intra-dataset spread along our pre-defined requirements distribution. Cross-dataset 2D projection was intentionally removed — different datasets target different scenarios, so clustering apart proves nothing about quality. The "is ours better" question still needs downstream validation, honestly listed in §Limitations.

EXTENSION · DESIGN STYLE

Design-style aware generation

A registered design-style layer on top of the core two-channel pipeline — orthogonal to what and who/how, controls the look. Eleven first-class styles, each with a Chinese persona-side hint and an English LLM-side instruction. New styles register in one line.

MODE 01 · default

Free / contextual inference

No fixed style — let the visual direction emerge naturally from scene + topic context.

design_style: null

MODE 02 · fixed

Explicit rotation

Cycle through a user-supplied list. Guarantees coverage of priority styles.

--design-styles "Dark,Glassmorphism,Cyberpunk"

MODE 03 · auto

Heuristic scene-matching

inferStyles() ranks styles by scene keywords — e.g. health + meditation → Neumorphism, Minimalism, Dark.

--design-styles auto
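The three modes can be pictured as one small resolver. Everything below is a sketch under assumptions: `resolveDesignStyle` and `inferStylesSketch` are hypothetical names, the keyword table contains only the single health / meditation example from the text, and rotating through the ranked list in auto mode is a guess at the behavior, not a description of inferStyles().

```javascript
// Hypothetical keyword table; only this one ranking is quoted in the text.
const STYLE_HINTS = [
  { keywords: ["health", "meditation"], styles: ["Neumorphism", "Minimalism", "Dark"] },
];

// Stand-in for the repo's inferStyles(): first matching keyword group wins.
function inferStylesSketch(sceneText) {
  const text = sceneText.toLowerCase();
  const hit = STYLE_HINTS.find((h) => h.keywords.some((k) => text.includes(k)));
  return hit ? hit.styles : [];
}

// option: undefined/null (default), "auto", or a comma-separated style list.
function resolveDesignStyle(option, taskIndex, sceneText) {
  if (!option) return null; // MODE 01: no fixed style
  if (option === "auto") {
    // MODE 03: heuristic match, rotated by task index (assumption)
    const ranked = inferStylesSketch(sceneText);
    return ranked.length ? ranked[taskIndex % ranked.length] : null;
  }
  // MODE 02: cycle through the user-supplied list
  const list = option.split(",").map((s) => s.trim()).filter(Boolean);
  return list.length ? list[taskIndex % list.length] : null;
}

console.log(resolveDesignStyle("Dark,Glassmorphism,Cyberpunk", 4, ""));
console.log(resolveDesignStyle("auto", 0, "Health meditation timer"));
```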

// HOW WE GOT HERE

Iteration story: 4 stages, 4 fixes

Each stage pairs a real pain point we discovered after running the previous version at scale with a focused prompt or pipeline fix. The sample queries show actual generated text from the same persona, before vs after.

Stage 1

Naive baseline — single prompt, no scaffolding

PAIN

No persona, no opener distribution, no scope discipline. Same first-N corpus topics get reused across batches; "Build a mobile X" pattern dominates; query framed as a single page or screen, not a 0-to-1 app.

Sample query (illustrative)

BEFORE: Build a mobile travel scrapbook page where users can create entries with cover photo, trip title, date range, and a collage-style photo grid… [identical opener and "page" framing across the entire batch]

Stage 2

Three-layer diversification (commit 040a427)

FIX

Layer-A least-used topic dedup across batches via corpus_usage.json state · 5-bucket opener hash (Build a / Need a / Create a / Make a / no opener) keyed by query_id · persona-tone semantic mapping from L2 → 5 ordinary-user archetypes.

RESULT

Cross-batch corpus-topic overlap: 100% → 0%. "Build a" share: 54% → 21%. 5 distinct persona voices visible in batch.

NEW PAIN

Audit found 49.5% of queries still framed as "Build a XX page where…" — the opener is now diverse but the scope noun is still page-level, causing downstream LLMs to generate single-page mocks instead of complete apps.

Stage 3

App-scope rewrite (commit 07ed4af)

FIX

New rule 7 forbids page / screen / view / section / module / feature / widget as the top-level scope noun. Use app or a specific app type (tracker / tool / reminder / planner / calculator / logger / manager / timer). Existing 200-row batch retrofitted via scripts/rewrite-app-scope.js.

RESULT

Single-page-framed queries: 49.5% → 0%. Average word count unchanged (91 → 91, no length inflation).
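A rough heuristic for auditing rule 7 is to take the last word of the opening noun phrase and check it against the forbidden list. This is a sketch only: the clause-splitting rule and the function names are assumptions, and the repo's own audit and scripts/rewrite-app-scope.js may work differently.

```javascript
// Forbidden top-level scope nouns, as listed in rule 7.
const FORBIDDEN_SCOPE = new Set([
  "page", "screen", "view", "section", "module", "feature", "widget",
]);

// Heuristic: the scope noun is the last word before the first connector
// (where / that / which / with / for) or punctuation mark.
function topLevelScopeNoun(query) {
  const head = query.split(/\b(?:where|that|which|with|for)\b|[,.;:\u2014]/i)[0];
  const words = head.toLowerCase().match(/[a-z']+/g) || [];
  return words.length ? words[words.length - 1] : null;
}

function hasPageLevelScope(query) {
  return FORBIDDEN_SCOPE.has(topLevelScopeNoun(query));
}

console.log(hasPageLevelScope("Build a mobile travel scrapbook page where users can create entries"));
console.log(hasPageLevelScope("Create a resume builder app that has its own quiet identity"));
```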

NEW PAIN

Audit found 76% of EN queries / 80% of ZH contained negation words; 45% had outright grievance dump patterns ("no stock photo, no cartoon, no confetti…"). Real users describing 0-to-1 apps speak in positive terms — they're building something they want, not refining something they hate.

Sample query (real, founder_like persona, wedding card creator)

BEFORE: Make a wedding invitation card creator that feels personal and handcrafted, not like some cookie-cutter template factory — I want a small set of maybe four or five elegant layouts I can actually customize with our names, date, and a short line of text, and the font choices should lean traditional and warm, not trendy sans-serif stuff. No stock photo backgrounds, no cartoon illustrations, no confetti animations — just clean, tasteful design with maybe a soft floral border option. It should feel like something I made myself, not something that came off an assembly line.

Stage 4

Positive-framing prompt rewrite (current, commit ee04965)

FIX

(1) Rewrote founder_like voice — was literally instructed to "explain what NOT to include as much as what to include". (2) Flipped three "Do NOT" prompt rules into positive-form ("Open with the substance" / "Use everyday vocabulary" / "Use 'app' as top-level noun"). LLMs primed by negation imperatives mirror them in output. (3) Added explicit positive-framing rule limiting negation words to ≤1 per query.

RESULT

NEG words / query: 1.08 → 0.66 (-39%). Grievance patterns: 0.58 → 0.22 (-62%). By persona: curator -56%, maker -57%, planner -39%. Residual negations are now feature-value descriptions ("auto-saves so you never lose"), not grievance dumps.

Sample query (real, founder_like persona, ATS resume builder)

AFTER: Create a resume builder app that has its own quiet identity — one of those tools you actually feel good using — built around a small, carefully chosen set of ATS-friendly templates that are clean but still carry a bit of character, where I can fill in my experience and skills section by section and watch a live preview come together in a format that looks genuinely considered to a human reader while staying structured enough for automated hiring systems to parse without a fuss.

Each stage is a single, focused commit on main — fully reproducible. The full negation-density audit script is one inline node command; quantitative deltas are computed against the same persona breakdown so improvements are not just topic luck.
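A negation-density audit of this kind can be approximated with a word count. The word list and function names below are assumptions for illustration; the repo's inline audit command and its exact negation vocabulary are not shown here.

```javascript
// Hypothetical negation vocabulary; contractions ending in "n't" also count.
const NEGATION_WORDS = new Set(["no", "not", "never", "without", "none", "nothing"]);

// Count negation words in one query (straight apostrophes only).
function negationCount(query) {
  const words = query.toLowerCase().match(/[a-z']+/g) || [];
  return words.filter((w) => NEGATION_WORDS.has(w) || w.endsWith("n't")).length;
}

// Mean negation words per query across a batch.
function meanNegations(queries) {
  if (queries.length === 0) return 0;
  const total = queries.reduce((acc, q) => acc + negationCount(q), 0);
  return total / queries.length;
}

console.log(negationCount("No stock photo backgrounds, no cartoon illustrations, no confetti animations"));
console.log(meanNegations(["auto-saves so you never lose a draft", "a calm timer app"]));
```

Grouping the same count by persona, as the stage results do, only needs a persona field on each record; that field name is not specified in the text, so it is left out of the sketch.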

// THE RECEIPTS

Real samples from a production run

Six queries picked from a real 200-query corpus-direct run, spanning six L1 categories. The data ships bilingual (EN + ZH) out of the box.

// RUN IT YOURSELF

Quickstart

From clone to 50 generated queries in under 10 minutes. Requires Node 18+ and the Claude CLI (or set a Packy-compatible API key).

    bash
# 1. Install
git clone https://github.com/PlevanTem/queryMaker.git
cd queryMaker && npm install

# 2. Dry-run the plan (no LLM call)
node scripts/run-corpus.js --total 50 --dry-run

# 3. Real run — 50 queries, ~3 min
node scripts/run-corpus.js --total 50 --concurrency 2

# 4. Inspect outputs
cat data/output/corpus_run/summary.json
head data/output/corpus_run/queries.jsonl
  

Each record in queries.jsonl includes the English query, a Chinese translation, word count, scene metadata, the corpus topic anchor, model name, and timing. Pipe directly into training, or run scripts/generate-analysis-report.js for a self-contained HTML report.

// HONEST BOUNDARIES

Limitations we're honest about

Synthetic data has edges. Listing them is not self-sabotage: it rests credibility on multiple independent indicators, and it tells adopters which piece they should reinforce in their own setting.

Persona pool size

5 archetypes chosen by L2 semantic best-fit within the UI product domain. Good coverage on the head; long-tail user representativeness still needs reverse validation against real user logs.

Diversity measured at lexical layer

Diversity is currently measured at the trigram-Jaccard and corpus-distribution layers. Deeper evidence at the semantic / task-distribution / discriminator layers is out of scope for this repo.

No end-to-end downstream validation

The ultimate evidence — "model capability gain when this data is added to training" — is not in scope here due to downstream training resource limits. Adopters should run a controlled comparison in their own setting.

Narrow-corpus L2 scenes

Where an L2 scene's corpus is narrow, the persona channel cannot compensate — mode collapse can still happen. Known narrow scenes are flagged in data/intermediate/scenario_specs/; the recommendation is to grow the corpus rather than stack more personas.

Authenticity relies on expert review + heuristics

Authenticity currently rests on design-expert blind review plus 3-axis heuristic scoring (authenticity / specificity / diversity). Objective counterparts (discriminator AUC, distribution distance) are not yet in place.

// THE MAP

What's inside

The shortest path to understanding the codebase.

mvp/query_factory_v2.js

Core engine: parsing, planning, generation, scoring (~2,600 LOC)

scripts/run-corpus.js

Production CLI for the corpus-direct pipeline

scripts/run-free.js

Persona-driven runner with --persona-scope control

scripts/test-corpus-methods.js

4-method evaluation harness on identical scenes

scripts/generate-analysis-report.js

Reusable: turn any scored JSONL into an interactive HTML report

scripts/corpus_data.json

The 2,440-topic corpus across 61 L2 scenes

data/output/corpus_run/

Latest production run — plan, queries, summary, xlsx export

README.md · ARCHIVE/

Full pipeline diagram and the original 5-stage research blueprint