LEGACY DATASET
These charts come from the kube-sre-gym-style heuristic and early notebook runs: the 11 hand-curated tasks in rl-agent/scenarios/{easy,medium,hard}/*.json, recorded into rl-agent/checkpoints/<run>/metrics.jsonl and colab/logs/reward_breakdown_history.jsonl. They do not include the 381-task PPO Kaggle run.
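The per-signal means and fire counts reported below can be recomputed from reward_breakdown_history.jsonl. A minimal sketch, assuming each JSONL line is one episode with its per-signal values under a "signals" key (the schema is an assumption for illustration, not taken from the repo):

```python
import json
from collections import defaultdict

def summarize(path):
    """Aggregate per-signal mean and fire count across episodes.

    Assumes one JSON object per line, shaped like
    {"episode": 3, "signals": {"step_cost": -0.01, ...}} -- hypothetical schema.
    """
    totals = defaultdict(float)  # running sum of each signal's value
    fired = defaultdict(int)     # episodes in which the signal was nonzero
    episodes = 0
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue
            record = json.loads(line)
            episodes += 1
            for name, value in record["signals"].items():
                totals[name] += value
                if value != 0:
                    fired[name] += 1
    return {n: {"mean": totals[n] / episodes, "fired": fired[n]} for n in totals}
```

Means here are taken over all episodes (zeros included); dividing by `fired[n]` instead would give the mean per firing episode.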

Reward signals

15 reward components observed across 36 training episodes. Source: training snapshot · last episode (task3).
Signal count: 15
Episodes: 36
Positive signals fired: 332
Penalties fired: 36

[Charts: Last-episode breakdown · Mean across all episodes]

Signal catalogue

| Signal | Type | Meaning | Last episode | Mean (n=36) | Fire count |
|---|---|---|---|---|---|
| step_cost | penalty | -0.01 per step (efficiency pressure) | -0.010 | -0.010 | 36 (↑0 ↓36) |
| root_cause_correct | positive | +0.30 for correct postmortem root cause | +0.300 | — | 13 (↑13 ↓0) |
| postmortem_quality | positive | +0 to +0.20 from postmortem grader | +0.000 | +0.072 | 36 (↑13 ↓0) |
| phase_order | positive | ± phase progression (triage → investigate → fix → verify) | +0.050 | +0.072 | 36 (↑36 ↓0) |
| judge | positive | LLM judge slice [-0.10, +0.15] | +0.075 | +0.068 | 36 (↑36 ↓0) |
| evidence_alignment | positive | Postmortem cites evidence found via logs | +0.100 | — | 7 (↑7 ↓0) |
| mitigation_efficiency | positive | Mitigation in ≤3 writes | +0.150 | — | 16 (↑16 ↓0) |
| diverse_evidence | positive | ≥3 distinct evidence sources | +0.050 | — | 16 (↑16 ↓0) |
| quick_root_cause_bonus | positive | Identified root cause within 4 steps | +0.100 | — | 9 (↑9 ↓0) |
| runbook_match_bonus | positive | Postmortem lists correct affected service | +0.050 | +0.050 | 26 (↑26 ↓0) |
| low_thrash_bonus | positive | ≤2 write actions in entire episode | +0.050 | +0.050 | 36 (↑36 ↓0) |
| fast_resolution_bonus | positive | Mitigation in ≤6 total steps | +0.080 | — | 16 (↑16 ↓0) |
| exploration_diversity | positive | Bonus per unique action type used (cap 0.08) | +0.030 | +0.043 | 36 (↑36 ↓0) |
| consistent_postmortem_bonus | positive | Required postmortem fields present | +0.060 | +0.060 | 36 (↑36 ↓0) |
| followup_quality_bonus | positive | Postmortem proposes ≥3 followups | +0.030 | +0.030 | 36 (↑36 ↓0) |
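If the episode's total shaped reward is simply the sum of these component values (a natural reading of the breakdown, but not confirmed by this snapshot), the combination can be sketched as below. The helper is illustrative; only the 0.08 cap on exploration_diversity comes from the catalogue above.

```python
def episode_reward(signals: dict) -> float:
    """Sum shaped reward components for one episode (illustrative helper).

    exploration_diversity is capped at 0.08, per the signal catalogue;
    all other components are summed as-is.
    """
    capped = dict(signals)
    if "exploration_diversity" in capped:
        capped["exploration_diversity"] = min(capped["exploration_diversity"], 0.08)
    return sum(capped.values())

# Hypothetical episode in which only these five signals fired:
example = {
    "step_cost": -0.010,
    "phase_order": 0.050,
    "judge": 0.075,
    "low_thrash_bonus": 0.050,
    "exploration_diversity": 0.030,
}
```

For the hypothetical episode above, `episode_reward(example)` sums to roughly 0.195.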