Field note · April 22, 2026

How a production LLM-as-judge calibrates on voice

Brand Voice System is a five-dimensional voice judge that decides what ships from our B2B content pipeline. Here's how it went from a 40% false-positive rate to 92% agreement with human review.

The problem: fluent is not the same as authentic

We ship B2B content for a living. Social posts, blog pieces, LinkedIn drafts, outreach emails. Every output has to sound like the person it’s written for.

Generic quality checks (grammar, SEO, factuality) pass fluent AI-polished copy that still fails voice. Wrong cadence. Wrong vocabulary. Wrong confidence level. Wrong authenticity. The copy reads fine until the client reads their own content and says “this isn’t me.”

At scale, that is a deal-killer.

So we built Brand Voice System. A production LLM-as-judge that sits in front of every published output and decides what ships.

This post is the design story. What broke. What worked. How we knew.

First attempt: one judge, one long prompt

Version one was the obvious thing. One LLM call. A long prompt listing dozens of style rules. “Avoid passive voice. Don’t use adjectives like ‘comprehensive.’ Prefer short sentences after long ones. No em dashes. No semicolons.” You can imagine the rest.

It flagged too many things as off-voice and missed the real tells.

Measured against a set of outputs I had hand-labeled as on-voice or off-voice:

  • False positive rate around 40%. The judge flagged clean copy as off-voice because a rule misfired on a correct usage.
  • False negative rate around 25%. The judge let fluent AI copy through because the rule list couldn’t describe “sounds polished but fake.”

Both sides unusable. Too noisy to auto-send, too loose to auto-reject.

The rule list had a structural problem. It was a flat set of prohibitions. It couldn’t describe what the output was doing at a structural level. It only knew what to complain about.

What I tried, in order

1. Decompose the judge into five dimensions

Instead of one general “is this on-voice” question, the judge scores five dimensions separately, then weights them:

  • Rhythm: sentence length variance, fragment placement, opener style
  • Vocabulary: presence of the person’s actual word list, absence of banned words
  • Tone: register match (peer vs vendor, confident vs hedging)
  • Authenticity: does this sound like a real human wrote it, not a sanitized SaaS agent
  • Mirroring: conditional on recipient tone; how well the output adapts to the target

Each dimension returns 0-100. Weighted sum becomes the overall score.
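
Concretely, the weighted sum looks something like this. A minimal Python sketch: the dimension names are from the rubric above, but the weights are illustrative placeholders, not the production values.

```python
# Illustrative weights only; the real values are tuned per mode.
DEFAULT_WEIGHTS = {
    "rhythm": 0.20,
    "vocabulary": 0.20,
    "tone": 0.20,
    "authenticity": 0.30,
    "mirroring": 0.10,
}

def overall_score(dim_scores: dict[str, float],
                  weights: dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Each dimension scores 0-100; the overall score is their weighted sum.

    Normalizing by the active weights lets a conditional dimension
    (mirroring only fires when there is a recipient tone to mirror)
    be absent without dragging the score down.
    """
    active = {d: w for d, w in weights.items() if d in dim_scores}
    total = sum(active.values())
    return sum(dim_scores[d] * active[d] for d in active) / total

# Mirroring omitted: no recipient to mirror.
print(overall_score({"rhythm": 70, "vocabulary": 85,
                     "tone": 60, "authenticity": 42}))  # ~61.8
```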

This made the failure modes legible. Now I could see which dimension was wrong, not just that something was off.

Most false positives came from vocabulary (rule lists were too aggressive about word choice). Most false negatives came from authenticity (the rule set was too vague to flag AI-fluent tells).

Once the dimensions were split, I could target each one. Loosen the vocabulary rules that were generating false positives. Tighten the authenticity checks that were letting AI-fluency through. Move weight to the dimensions that correlated best with my own spot-checks.

2. Build mode switching

A judge that works for cold outreach does not work for an internal note. A judge that works for a LinkedIn post is too strict for a draft.

Five modes, same five-dimension rubric, different weight overrides and different pass thresholds:

  • Strict (cold outreach): any AI tell fails. Red-flag detection paramount. Pass at 55.
  • Standard (warm email, client reply): Pass at 55.
  • Conversational (voice-authenticity-heavy): weight overrides set authenticity 0.30, rhythm 0.25, tone 0.25. Pass at 55.
  • Loose (LinkedIn, blog): authenticity weight bumps to 0.35. Adds a thinking-moves dimension. Pass at 50.
  • Draft (internal notes): only flags outright voice breaks. Pass at 40.

Modes are not presets. They are behavioral policies with different weight functions. Changing a mode changes what “passing” means.

That is reward-function engineering, done in YAML because I don’t have anything better available.
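
The real config is YAML; here is the same shape rendered as a Python sketch. The structure and field names are mine, and only the weights and thresholds quoted above come from the actual modes.

```python
# Mode = behavioral policy: weight overrides plus a pass threshold.
DEFAULTS = {"rhythm": 0.20, "vocabulary": 0.20, "tone": 0.20,
            "authenticity": 0.30, "mirroring": 0.10}  # illustrative, as before

MODES = {
    "strict":         {"pass_at": 55, "overrides": {}},  # plus: any AI tell hard-fails
    "standard":       {"pass_at": 55, "overrides": {}},
    "conversational": {"pass_at": 55, "overrides": {"authenticity": 0.30,
                                                    "rhythm": 0.25,
                                                    "tone": 0.25}},
    "loose":          {"pass_at": 50, "overrides": {"authenticity": 0.35}},
    "draft":          {"pass_at": 40, "overrides": {}},
}

def weights_for(mode: str) -> dict[str, float]:
    # Overrides replace the defaults dimension by dimension, so
    # changing a mode literally changes the reward function.
    return {**DEFAULTS, **MODES[mode]["overrides"]}

def passing(score: float, mode: str) -> bool:
    # The same score can pass in draft and fail in strict.
    return score >= MODES[mode]["pass_at"]
```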

3. Make REPAIR feedback dimension-specific

The scoring layer produces three verdicts. The cutoffs here are the defaults; each mode's pass threshold overrides them:

  • PASS (score >= 60): auto-send.
  • REPAIR (40-59): return feedback to the generator, regenerate, re-judge.
  • HUMAN (< 40): flag for manual review.
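
The routing itself is two comparisons. A sketch, using the default cutoffs above; a mode's pass threshold shifts the first one.

```python
def verdict(score: float, pass_at: float = 60.0, human_below: float = 40.0) -> str:
    """PASS auto-sends, REPAIR loops back to the generator, HUMAN escalates."""
    if score >= pass_at:
        return "PASS"
    if score >= human_below:
        return "REPAIR"
    return "HUMAN"
```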

Early REPAIR loops had a subtle failure. The judge would return “this output is off, try again.” Claude would regenerate and make the same mistake. The loop could run twice and still fail, because nothing in the feedback pointed at the actual problem.

Fix: feedback names the dimension and the specific failure.

Not: “this output is off.” Instead: “Rhythm is off because the sentences are too uniform. Break one in half.” Or: “Authenticity is at 42. The post uses ‘leverage’ in paragraph two and reads as sanitized SaaS.”

REPAIR loops started converging in one or two rounds instead of looping indefinitely.
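
In practice that means feedback is a structured object, not a sentence. A sketch, with field names that are mine:

```python
from dataclasses import dataclass

@dataclass
class RepairFeedback:
    dimension: str    # which of the five dimensions failed
    score: float      # that dimension's 0-100 score
    evidence: str     # the specific tell, quoted or located in the output
    fix: str          # one concrete instruction for the regeneration prompt

feedback = RepairFeedback(
    dimension="authenticity",
    score=42,
    evidence="'leverage' in paragraph two reads as sanitized SaaS",
    fix="Swap 'leverage' for the client's own verb and roughen the polish.",
)
# The regeneration prompt interpolates all four fields, so the
# generator is told what to change instead of guessing.
```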

4. Add a calibration window

LLM judges drift. If you calibrate weights against 2024 outputs and the world shifts in 2026, your judge quietly starts rejecting things that are actually fine. Or worse, it starts passing things that are off-voice because the underlying model got better at polish.

So every six weeks, sample the PASS and REPAIR decisions from the last window. Compare against a human editor’s judgment. Re-fit weights if they drift.

One calibration moved the authenticity weight from 0.30 to 0.40 because the judge was under-penalizing AI-fluent tells introduced by a model update. Without the window, that drift would have cost us several weeks of shipped content before anyone noticed.

The inspiration came from a sister system that already did borderline-score sampling: any output scoring 65-75 always triggers a second evaluation pass, and 10% of high-scoring outputs get sampled for calibration as well. I borrowed the pattern. Every production LLM-as-judge should have it.
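
The sampling policy, sketched with the numbers above (function names are mine):

```python
import random

def needs_second_pass(score: float) -> bool:
    # Borderline scores always get a second, independent evaluation pass.
    return 65 <= score <= 75

def sampled_for_calibration(score: float, rate: float = 0.10) -> bool:
    # 10% of high-scoring outputs land in the human-labeled batch
    # that the six-week weight re-fit compares against.
    return score > 75 and random.random() < rate
```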

How I knew it worked

Three signals, in order of strongest to weakest.

REPAIR convergence. Outputs that went to REPAIR now resolve within two rounds in roughly 85% of cases, compared to the first version, where roughly half of REPAIR loops never converged.

Human review agreement. When the judge says PASS and a human reviews anyway, the human agrees with PASS in roughly 92% of cases in my own spot-check batches. When the judge says HUMAN-review, the reviewer confirms a real issue in roughly 88% of cases. These are internal samples, not published benchmarks. The absolute numbers are less important than the relative direction: both sides of the judgment line up with human intuition more than they diverge.

Client-side evidence. Voice complaints dropped to near-zero once the judge went live. No client has said “this isn’t me” about a published output in the six months since.

All three signals are lagging indicators. None of them tells you the rubric is right. They tell you the rubric is now good enough to trust at production scale.

What I carry forward

Three transferable lessons.

Positive enforcement matters more than negative enforcement. Rules about what not to do are easy to write and easy to work around. A banned word becomes a subtly different banned word. (I once told the judge no em dashes. It started using single hyphens as visual separators. I saw you.) Rules about what to do (“rhythm the way this person writes rhythms”) are harder to dodge because they describe a positive signal, not a negative list.

Judge modes are policy layers. Different contexts need different definitions of “passing.” That is not a threshold problem. It is a policy problem. Build it in from the start.

Calibration needs sampling. An LLM judge that was right six weeks ago might be wrong today. The way you catch that is not alerting. It is sampling. Ten percent of passing outputs, every cycle, compared to a human ground-truth batch. Borrowed from the sister system's judge; it should have been in the voice judge from day one.

The unexpected insight

The rubric stopped being about voice halfway through this process. The rubric became about designing a reward function that correlated with human taste.

That is the actual work. Dimensions, weights, modes, and feedback routing are all instruments for getting the reward function right. The LLM inside the judge is the easy part. The rubric is the craft.

Most people who try to build an LLM judge stop at the prompt. The prompt is the one-line abstract of a reward function that needs a full specification, calibration procedure, and feedback loop.

Once you treat it as reward-function engineering, the whole system clicks into place.

This is the case closest to reward-modeling work in my portfolio. If you’re working on evaluation infrastructure, LLM-as-judge systems, or reward models, I’d be happy to compare notes.
