Problem
A community media platform wanted to scale long-form member-success stories without losing voice, fabricating details, or burying their editors under unstructured drafts. The bottleneck was not the writing — it was the grading and the gating.
Approach
A pipeline where every stage emits content and a structured eval. Voice fidelity, narrative clarity, fact grounding, sourcing tier: each is a small grader returning a score plus a one-line rationale. Pieces below threshold get rewritten or sent back to editors with the failing criteria pre-highlighted. Critically, the system never publishes on its own: the member sees and explicitly approves the draft before anything goes live. The evals are the product; the model is interchangeable.
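The gating step described above might look roughly like the sketch below. Everything in it is illustrative and not the platform's actual code: the type names, the `gate` function, the assumed 1 to 5 score scale, and especially the rewrite budget (`maxRewrites`) used to choose between an automatic rewrite and an editor hand-off. The source only says that failing pieces go one way or the other.

```typescript
// Hypothetical shapes for illustration only.
type Criterion = "voiceFidelity" | "narrativeClarity" | "factGrounding" | "sourcingTier";

interface GradeResult {
  criterion: Criterion;
  score: number;     // assumed 1-5 scale
  rationale: string; // one-line explanation returned by the grader
}

type Decision =
  | { action: "awaitMemberApproval" } // never auto-publish; the member approves first
  | { action: "rewrite"; failing: GradeResult[] }
  | { action: "sendToEditor"; failing: GradeResult[] };

// Compare each grade to its threshold and decide what happens to the draft.
// Assumption: a small rewrite budget decides between rewrite and editor review.
function gate(
  grades: GradeResult[],
  thresholds: Record<Criterion, number>,
  rewritesUsed: number,
  maxRewrites = 2,
): Decision {
  const failing = grades.filter((g) => g.score < thresholds[g.criterion]);
  if (failing.length === 0) return { action: "awaitMemberApproval" };
  return rewritesUsed < maxRewrites
    ? { action: "rewrite", failing }
    : { action: "sendToEditor", failing }; // failing criteria travel with the draft
}
```

Returning the failing `GradeResult` entries alongside the decision is what would let a dashboard pre-highlight the criteria and show each rationale without re-running the graders.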
Stack
TypeScript on Vercel, Anthropic for generation and grading, Supabase for state, intake through a form-builder webhook, output to Google Drive + the editor dashboard. The 8-point pass/fail + 5-point rubric lives in version control next to the prompts.
What shipped
In beta. Eleven member stories run end-to-end: voice fidelity averaging 4.7/5, all eleven landed concrete opening scenes, zero fabricated metrics across the batch. Editor green-lit pilot with four pre-launch gates locked in (intake validation, automated quantitative checks, member-approval workflow, press-tier sourcing protocol).
What’s next
Per-rubric calibration — surfacing where graders disagree with humans, and letting the team adjust thresholds without touching code.
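One way the calibration pass could be sketched, assuming each story has both a grader score and a human score per criterion, and that thresholds are kept in a plain key-value config rather than in code. The `ScoredPair` and `CalibrationRow` shapes and the `calibrate` function are hypothetical names, and the two measures (average score gap and count of flipped pass/fail verdicts) are one reasonable choice, not a description of what the team will build.

```typescript
// Hypothetical shapes for illustration only.
interface ScoredPair {
  criterion: string;
  graderScore: number;
  humanScore: number;
}

interface CalibrationRow {
  criterion: string;
  n: number;                 // number of grader/human pairs seen
  meanAbsGap: number;        // average |grader - human|
  verdictMismatches: number; // pairs where pass/fail differs at the current threshold
}

// Summarize grader-vs-human disagreement per criterion.
function calibrate(
  pairs: ScoredPair[],
  thresholds: Record<string, number>,
): CalibrationRow[] {
  const rows = new Map<string, CalibrationRow>();
  for (const p of pairs) {
    const t = thresholds[p.criterion];
    if (t === undefined) continue; // skip criteria with no configured threshold
    const row =
      rows.get(p.criterion) ??
      { criterion: p.criterion, n: 0, meanAbsGap: 0, verdictMismatches: 0 };
    row.n += 1;
    row.meanAbsGap += Math.abs(p.graderScore - p.humanScore); // running sum for now
    if ((p.graderScore >= t) !== (p.humanScore >= t)) row.verdictMismatches += 1;
    rows.set(p.criterion, row);
  }
  // Turn the running sums into averages.
  return Array.from(rows.values()).map((r) => ({ ...r, meanAbsGap: r.meanAbsGap / r.n }));
}
```

Because `thresholds` is just data, it could be loaded from a JSON file or a database row (for example via `JSON.parse`), which is what would let the team move a threshold and re-run the report without a deploy.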