LEGACY DATASET
These charts come from the kube-sre-gym-style heuristic and early notebook runs: the 11 hand-curated tasks in rl-agent/scenarios/{easy,medium,hard}/*.json, recorded into rl-agent/checkpoints/<run>/metrics.jsonl and colab/logs/reward_breakdown_history.jsonl. They do not include the 381-task PPO Kaggle run.
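The per-signal means and fire counts reported below can be recomputed from reward_breakdown_history.jsonl. A minimal sketch, assuming each JSONL line is one episode with its per-signal values under a "signals" key (the schema is an assumption for illustration, not taken from the repo):

```python
import json
from collections import defaultdict

def summarize(path):
    """Aggregate per-signal mean and fire count across episodes.

    Assumes one JSON object per line, shaped like
    {"episode": 3, "signals": {"step_cost": -0.01, ...}} -- hypothetical schema.
    """
    totals = defaultdict(float)  # running sum of each signal's value
    fired = defaultdict(int)     # episodes in which the signal was nonzero
    episodes = 0
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            record = json.loads(line)
            episodes += 1
            for name, value in record["signals"].items():
                totals[name] += value
                if value != 0:
                    fired[name] += 1
    return {n: {"mean": totals[n] / episodes, "fired": fired[n]} for n in totals}
```

Means here are taken over all episodes (zeros included); dividing by `fired[n]` instead would give the mean per firing episode.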

Reward signals

15 reward components observed across 36 training episodes. Source: training snapshot · last episode (task3).
Signal count: 15
Episodes: 36
Positive signals fired: 332
Penalties fired: 36

[Charts: Last-episode breakdown · Mean across all episodes]

Signal catalogue

| Signal | Type | Meaning | Last episode | Mean (n=36) | Fire count |
|---|---|---|---|---|---|
| step_cost | penalty | -0.01 per step (efficiency pressure) | -0.010 | -0.010 | 36 (↑0 ↓36) |
| root_cause_correct | positive | +0.30 for correct postmortem root cause | +0.300 | — | 13 (↑13 ↓0) |
| postmortem_quality | positive | +0 to +0.20 from postmortem grader | +0.000 | +0.072 | 36 (↑13 ↓0) |
| phase_order | positive | ± phase progression (triage → investigate → fix → verify) | +0.050 | +0.072 | 36 (↑36 ↓0) |
| judge | positive | LLM judge slice [-0.10, +0.15] | +0.075 | +0.068 | 36 (↑36 ↓0) |
| evidence_alignment | positive | Postmortem cites evidence found via logs | +0.100 | — | 7 (↑7 ↓0) |
| mitigation_efficiency | positive | Mitigation in ≤3 writes | +0.150 | — | 16 (↑16 ↓0) |
| diverse_evidence | positive | ≥3 distinct evidence sources | +0.050 | — | 16 (↑16 ↓0) |
| quick_root_cause_bonus | positive | Identified root cause within 4 steps | +0.100 | — | 9 (↑9 ↓0) |
| runbook_match_bonus | positive | Postmortem lists correct affected service | +0.050 | +0.050 | 26 (↑26 ↓0) |
| low_thrash_bonus | positive | ≤2 write actions in entire episode | +0.050 | +0.050 | 36 (↑36 ↓0) |
| fast_resolution_bonus | positive | Mitigation in ≤6 total steps | +0.080 | — | 16 (↑16 ↓0) |
| exploration_diversity | positive | Bonus per unique action type used (cap 0.08) | +0.030 | +0.043 | 36 (↑36 ↓0) |
| consistent_postmortem_bonus | positive | Required postmortem fields present | +0.060 | +0.060 | 36 (↑36 ↓0) |
| followup_quality_bonus | positive | Postmortem proposes ≥3 followups | +0.030 | +0.030 | 36 (↑36 ↓0) |
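If the episode's total shaped reward is simply the sum of these component values (a natural reading of the breakdown, but not confirmed by this snapshot), the combination can be sketched as below. The helper is illustrative; only the 0.08 cap on exploration_diversity comes from the catalogue above.

```python
def episode_reward(signals: dict) -> float:
    """Sum shaped reward components for one episode (illustrative helper).

    exploration_diversity is capped at 0.08, per the signal catalogue;
    all other components are summed as-is.
    """
    capped = dict(signals)
    if "exploration_diversity" in capped:
        capped["exploration_diversity"] = min(capped["exploration_diversity"], 0.08)
    return sum(capped.values())

# Hypothetical episode in which only these five signals fired:
example = {
    "step_cost": -0.010,
    "phase_order": 0.050,
    "judge": 0.075,
    "low_thrash_bonus": 0.050,
    "exploration_diversity": 0.030,
}
```

For the hypothetical episode above, `episode_reward(example)` sums to roughly 0.195.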