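Four skill versions (Alpha, V2, V3, V4) were evaluated by three LLM judges across 10 writing parameters. Scores range from 0 to 10. The Ensemble row is the unweighted mean of the three judges' scores.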
| Evaluator | Alpha | V2 | V3 | V4 | Ranked #1 |
|---|---|---|---|---|---|
| ChatGPT | 7.92 | 8.61 | 8.84 | 9.17 | V4 |
| Gemini | 7.61 | 8.31 | 8.54 | 8.88 | V4 |
| Grok (divergent) | 8.00 | 9.00 | 8.58 | 8.86 | V2 |
| Ensemble | 7.84 | 8.64 | 8.65 | 8.97 | V4 |
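The Ensemble row and the "Ranked #1" column can be reproduced from the per-judge scores. The sketch below assumes the ensemble is a plain unweighted mean rounded to two decimals, which matches every value in the table. The function names `ensemble` and `top_version` are illustrative only.

```python
# Per-judge scores copied from the table above (0-10 scale).
scores = {
    "ChatGPT": {"Alpha": 7.92, "V2": 8.61, "V3": 8.84, "V4": 9.17},
    "Gemini":  {"Alpha": 7.61, "V2": 8.31, "V3": 8.54, "V4": 8.88},
    "Grok":    {"Alpha": 8.00, "V2": 9.00, "V3": 8.58, "V4": 8.86},
}
versions = ["Alpha", "V2", "V3", "V4"]


def ensemble(scores, versions):
    """Unweighted mean of each version's score across all judges, rounded to 2 dp (assumed method)."""
    return {
        v: round(sum(judge[v] for judge in scores.values()) / len(scores), 2)
        for v in versions
    }


def top_version(row):
    """Return the version with the highest score in one row of the table."""
    return max(row, key=row.get)


ens = ensemble(scores, versions)
for judge, row in scores.items():
    print(judge, top_version(row))
print("Ensemble", ens, top_version(ens))
```

Under this assumption, Grok is the only judge whose top pick (V2) differs from the ensemble's. In the ensemble, V2 and V3 are separated by just 0.01.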