Blog Writer Skill

LLM Ensemble Evaluation

Four skill versions (Alpha, V2, V3, V4) were evaluated by three LLM judges (ChatGPT, Gemini, Grok) across 10 writing parameters. Each score is on a 0–10 scale.

Performance radar

Alpha: 7.84 (ensemble avg)
V2: 8.64 (ensemble avg)
V3: 8.65 (ensemble avg)
V4: 8.97 (ensemble avg, champion)

Highlights

Ensemble winner: V4 (8.97 avg, 7 of 10 parameters)
Humanness champion: V3 (9.10 voice, 9.03 likeness)
Conciseness champion: Alpha (8.73 ensemble avg)
Divergent evaluator: Grok (ranked V2 #1, not V4)

Parameter breakdown: ensemble averages

[Chart not reproduced; series: Alpha, V2, V3, V4]

Per-model overall average

Evaluator          Alpha   V2     V3     V4     Ranked #1
ChatGPT            7.92    8.61   8.84   9.17   V4
Gemini             7.61    8.31   8.54   8.88   V4
Grok (divergent)   8.00    9.00   8.58   8.86   V2
Ensemble           7.84    8.64   8.65   8.97   V4
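
The Ensemble row is consistent with an unweighted mean of the three evaluators' overall averages, rounded to two decimals. The report does not state the aggregation method, so treat that as an assumption. The sketch below recomputes the row and each evaluator's top-ranked version from the figures in the table; the names VERSIONS, PER_MODEL, ensemble_averages, and top_ranked are illustrative and not part of the original evaluation.

```python
from statistics import mean

VERSIONS = ["Alpha", "V2", "V3", "V4"]

# Per-evaluator overall averages, copied from the table above.
PER_MODEL = {
    "ChatGPT": [7.92, 8.61, 8.84, 9.17],
    "Gemini": [7.61, 8.31, 8.54, 8.88],
    "Grok": [8.00, 9.00, 8.58, 8.86],
}


def ensemble_averages(per_model):
    """Unweighted mean across evaluators per version (assumed method)."""
    columns = zip(*per_model.values())  # one tuple of scores per version
    return [round(mean(col), 2) for col in columns]


def top_ranked(scores):
    """Return the version name with the highest score."""
    return VERSIONS[scores.index(max(scores))]


ENSEMBLE = ensemble_averages(PER_MODEL)
RANKED_FIRST = {name: top_ranked(s) for name, s in PER_MODEL.items()}

if __name__ == "__main__":
    print(dict(zip(VERSIONS, ENSEMBLE)))
    print(RANKED_FIRST)
```

Under that assumption the recomputed values match the published Ensemble row (7.84, 8.64, 8.65, 8.97), and Grok is the only evaluator whose top-ranked version is V2 instead of V4.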

Category champions

Overall champion: V4, 8.97 (no weak parameter, minimum 8.67)
Most human: V3, 9.10 (fuzzy memory, lived texture)
Most concise: Alpha, 8.73 (ruthless editing, clean draft)