PPO KAGGLE · 381 TASKS
These charts come from the 3-shard PPO + LoRA training we ran on free Kaggle T4s. Source data:
the per-shard Kaggle run logs (notebooks/shard{1,2,3}/training_kaggle{N}.json) plus the 381 scenarios in
rl-agent/scenarios/sim/{easy,medium,hard}/*.json, pre-bundled into
rl-agent/showcase_data.json by scripts/build_showcase_data.py.
Every visible number is computed from those files.
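For orientation, the bundling step can look roughly like the sketch below. This is an assumption about what scripts/build_showcase_data.py does; only the glob patterns follow the paths quoted above, and the "shards"/"scenarios" keys are mine.

```python
# Sketch of the bundling step (assumed; the real scripts/build_showcase_data.py
# may differ). Glob patterns follow the paths quoted above.
import glob
import json

def build_showcase_data(out_path="rl-agent/showcase_data.json"):
    # One training log per Kaggle shard.
    shard_logs = [json.load(open(p))
                  for p in sorted(glob.glob("notebooks/shard*/training_kaggle*.json"))]
    # The 381 simulated scenarios, split across easy/medium/hard.
    scenarios = [json.load(open(p))
                 for p in sorted(glob.glob("rl-agent/scenarios/sim/*/*.json"))]
    with open(out_path, "w") as f:
        json.dump({"shards": shard_logs, "scenarios": scenarios}, f)

if __name__ == "__main__":
    build_showcase_data()
```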
PPO reward analytics
The reward range is [-2.0, +1.0] by design. Most components are penalties, so the mean is expected to be negative early in training; what matters is the trend.
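A per-step reward under this scheme is then the sum of whichever components fire, clipped into that range. A minimal sketch follows; the bounds come from the text, but treating the total as a plain clipped sum is an assumption about how components combine.

```python
# Sketch: per-step reward as a clipped sum of the fired components.
# The [-2.0, +1.0] bounds are from the text; summing then clipping is assumed.
def step_reward(components: dict[str, float]) -> float:
    return max(-2.0, min(1.0, sum(components.values())))

# An early blind step that repeats a command: about -0.36.
print(step_reward({"step_cost": -0.01, "acting_blind": -0.20,
                   "repeat_command": -0.15}))
```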
| Mean reward sign | Tasks | Share |
| --- | --- | --- |
| Positive | 0 | 0.0% |
| Negative | 381 | 100.0% |
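Both counts are easy to recompute from the bundle. A rough sketch, assuming each scenario entry in showcase_data.json carries its per-episode rewards under a "rewards" key (the field names are assumptions):

```python
# Sketch: reproduce the positive/negative-mean counts from the bundle.
# "scenarios" and "rewards" are assumed field names.
import json
from statistics import mean

data = json.load(open("rl-agent/showcase_data.json"))
means = [mean(s["rewards"]) for s in data["scenarios"]]
positive = sum(m > 0 for m in means)
print(f"positive mean: {positive}/{len(means)}, "
      f"negative mean: {len(means) - positive}/{len(means)}")
```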
[Chart] Per-update mean reward · all 3 shards
[Chart] Reward distribution across the 381 tasks
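Both charts can be regenerated from the same bundle with matplotlib. A sketch under the same assumed field names, plus an assumed per-shard "updates" list carrying a "mean_reward" per PPO update:

```python
# Sketch: regenerate both charts from the bundle.
# "shards", "updates", "mean_reward", "scenarios", "rewards" are assumed keys.
import json
import matplotlib.pyplot as plt

data = json.load(open("rl-agent/showcase_data.json"))
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: per-update mean reward, one line per shard.
for i, shard in enumerate(data["shards"], start=1):
    ax1.plot([u["mean_reward"] for u in shard["updates"]], label=f"shard {i}")
ax1.set(title="Per-update mean reward", xlabel="PPO update", ylabel="mean reward")
ax1.legend()

# Right panel: distribution of per-task mean reward across the 381 tasks.
task_means = [sum(s["rewards"]) / len(s["rewards"]) for s in data["scenarios"]]
ax2.hist(task_means, bins=30)
ax2.set(title="Reward distribution (381 tasks)", xlabel="task mean reward")

fig.tight_layout()
plt.show()
```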
Reward signal palette
| Component | Sign | Magnitude |
| --- | --- | --- |
| step_cost | − | 0.01 / step |
| acting_blind | − | 0.20 |
| red_herring | − | 0.15 |
| repeat_command | − | 0.15 / repeat (cap 0.45) |
| wrong_service | − | 0.15 |
| time_penalty | − | 0.05 × max(0, step − 5) |
| phase_regression | − | 0.10 |
| blast_radius_increase | − | 0.10 |
| investigation_bonus | + | 0.05 / unique read |
| useful_log_query | + | 0.10 |
| correct_mitigation | + | 0.20 |
| root_cause_correct | + | 0.30 |
| postmortem_quality | + | up to 0.20 |
| phase_order | + | 0.10 |
| judge | ± | up to 0.15 |
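The two step-dependent rows reduce to one-line formulas, sketched below; the function names are illustrative, only the formulas come from the table.

```python
# Sketch of the two step-dependent penalties from the table above.
def time_penalty(step: int) -> float:
    # -0.05 per step beyond the fifth.
    return -0.05 * max(0, step - 5)

def repeat_penalty(n_repeats: int) -> float:
    # -0.15 per repeated command, capped at -0.45 total.
    return -min(0.15 * n_repeats, 0.45)

assert time_penalty(5) == 0.0 and time_penalty(6) == -0.05
assert repeat_penalty(1) == -0.15 and repeat_penalty(5) == -0.45
```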