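Next step · Wait until every run record has been evaluated, then review the overall verdict and release actions.
Capability: insufficient samples
How many more tasks the Skill lets the Agent complete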
—/ 100
—
▸ Raw data and formulas · min(capability, cost, stability)
Capability (capability)
Mean eval score: A — · B —
Pass rate: A — · B —
Δscore: —
avgEvalA = mean(0 runs) = —
avgEvalB = mean(0 runs) = —
Δscore = avgEvalB − avgEvalA = —
score = clamp(50 + — × 2.5, 0, 100) = —
docs/skill-ab-scoring.md §3.1
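The capability formulas above can be sketched as follows. This is a minimal illustration, not the scoring service's actual code: the function name `capability_score` is hypothetical, and returning `None` when either group has no runs is an assumption that mirrors the dash placeholder shown in the panel.

```python
from statistics import mean
from typing import Optional, Sequence


def capability_score(evals_a: Sequence[float],
                     evals_b: Sequence[float]) -> Optional[float]:
    """Hypothetical helper: score = clamp(50 + delta_score * 2.5, 0, 100)."""
    # Assumption: with 0 runs in either group the score is undefined,
    # which the panel renders as a dash.
    if not evals_a or not evals_b:
        return None
    # delta_score = avgEvalB - avgEvalA
    delta_score = mean(evals_b) - mean(evals_a)
    # Clamp the result to the 0..100 range.
    return max(0.0, min(100.0, 50 + delta_score * 2.5))
```

Under this sketch, equal averages in both groups give 50, and each point of Δscore moves the result by 2.5 until it hits 0 or 100.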
Cost (cost)
Tokens: A — · B —
Duration: A — · B —
Steps: A — · B —
ΔToken: —
avgTokensA = mean(0 runs) = —
avgTokensB = mean(0 runs) = —
Δtoken = (avgTokensB − avgTokensA) / avgTokensA × 100% = —%
baseCost = piecewise(Δtoken; 0%→100, 20%→80, 100%→40, 200%→0) = —
coupling = no capability score available for reference
score = clamp(baseCost + coupling, 0, 100) = —
Group A's average token count is missing or 0, so the cost dimension cannot be computed.
docs/skill-ab-scoring.md §4.1
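The cost formulas above can be sketched as follows, under several assumptions that the panel does not confirm: the piecewise mapping is taken to interpolate linearly between the listed points, values below 0% or above 200% are pinned to the end points, and the coupling term (defined in the referenced doc, not shown here) is passed in as a plain number defaulting to 0. The names `base_cost` and `cost_score` are hypothetical.

```python
from typing import Optional

# Knots from the panel: 0% -> 100, 20% -> 80, 100% -> 40, 200% -> 0.
KNOTS = [(0.0, 100.0), (20.0, 80.0), (100.0, 40.0), (200.0, 0.0)]


def base_cost(delta_token_pct: float) -> float:
    """Assumed linear interpolation between the listed knots."""
    # Assumption: outside the listed range, pin to the nearest end point.
    if delta_token_pct <= KNOTS[0][0]:
        return KNOTS[0][1]
    if delta_token_pct >= KNOTS[-1][0]:
        return KNOTS[-1][1]
    for (x0, y0), (x1, y1) in zip(KNOTS, KNOTS[1:]):
        if x0 <= delta_token_pct <= x1:
            t = (delta_token_pct - x0) / (x1 - x0)
            return y0 + t * (y1 - y0)
    raise AssertionError("unreachable: knots cover the whole range")


def cost_score(avg_tokens_a: Optional[float],
               avg_tokens_b: Optional[float],
               coupling: float = 0.0) -> Optional[float]:
    """Hypothetical helper: score = clamp(baseCost + coupling, 0, 100)."""
    # The panel reports the dimension as not computable when group A's
    # average token count is missing or 0.
    if not avg_tokens_a or avg_tokens_b is None:
        return None
    # delta_token = (avgTokensB - avgTokensA) / avgTokensA * 100%
    delta_token_pct = (avg_tokens_b - avg_tokens_a) * 100 / avg_tokens_a
    return max(0.0, min(100.0, base_cost(delta_token_pct) + coupling))
```

With these assumptions, a Skill that uses 20% more tokens than the baseline scores 80 before coupling, and one that triples token usage scores 0.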
Stability (stability)
Invocation rate: B —
Variance: B — (R=1)
invokeRate = 0 / 0 = —%
variance = 0 (fewer than 2 repeat rounds, so treated as 0)
1 − var/0.25 = 1
score = —% × 1 = —
Not enough repeat rounds; variance cannot be computed.
docs/skill-ab-scoring.md §5.1
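The stability formulas above can be sketched as follows. The function name `stability_score` is hypothetical, the variance is assumed to be supplied by the caller (the panel does not say what quantity it is the variance of), and returning `None` for a 0 / 0 invocation rate is an assumption that mirrors the dash placeholder.

```python
from typing import Optional


def stability_score(invoked: int,
                    total: int,
                    variance: Optional[float] = None,
                    rounds: int = 1) -> Optional[float]:
    """Hypothetical helper: score = invokeRate% * (1 - var / 0.25)."""
    # Assumption: 0 / 0 leaves the invocation rate undefined.
    if total == 0:
        return None
    invoke_rate_pct = invoked * 100 / total
    # As in the panel: with fewer than 2 repeat rounds the variance
    # is treated as 0, so the factor below becomes 1.
    var = 0.0 if rounds < 2 or variance is None else variance
    factor = 1 - var / 0.25
    return invoke_rate_pct * factor
```

For example, with a single round and the Skill invoked in 8 of 10 runs, the factor is 1 and the score equals the invocation rate, 80.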
Overall, weakest-link principle (verdict)
capability = —
cost = —
stability = —
total = unavailable (a dimension is missing, so no overall score can be produced)
docs/skill-ab-scoring.md §6.1
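The weakest-link rule above, min(capability, cost, stability), can be sketched as follows. The function name `overall_score` is hypothetical, and using `None` for a missing dimension is an assumption about how the missing values are represented.

```python
from typing import Optional


def overall_score(capability: Optional[float],
                  cost: Optional[float],
                  stability: Optional[float]) -> Optional[float]:
    """Hypothetical helper: total = min of the three dimension scores."""
    dims = (capability, cost, stability)
    # As in the panel: if any dimension is missing, no overall score.
    if any(d is None for d in dims):
        return None
    # Weakest-link principle: the lowest dimension sets the total.
    return min(dims)
```

Taking the minimum rather than an average means a strong capability score cannot offset a poor cost or stability score.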
SAMPLE N=0 / recommended ≥20 · 1 repeat round · confidence: low · policy: agent-skill-scoring-v2.1