LEGACY DATASET
These charts come from the kube-sre-gym-style heuristic and early notebook runs on the 11 hand-curated tasks in
rl-agent/scenarios/{easy,medium,hard}/*.json. Metrics were recorded into
rl-agent/checkpoints/<run>/metrics.jsonl and
colab/logs/reward_breakdown_history.jsonl. The charts do not include the 381-task PPO Kaggle run.
Reward signals
15 reward components observed across 36 training episodes.
Source: training snapshot · last episode (task3).
Positive signals fired: 332
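The 332 figure appears to be the sum of the ↑ (positive) counts in the Fire count column of the signal catalogue. A minimal check, with the per-signal counts copied by hand from that table (the dictionary name is illustrative, not taken from the codebase):

```python
# Positive (↑) fire counts, copied from the Fire count column of the signal catalogue.
positive_fires = {
    "step_cost": 0,
    "root_cause_correct": 13,
    "postmortem_quality": 13,
    "phase_order": 36,
    "judge": 36,
    "evidence_alignment": 7,
    "mitigation_efficiency": 16,
    "diverse_evidence": 16,
    "quick_root_cause_bonus": 9,
    "runbook_match_bonus": 26,
    "low_thrash_bonus": 36,
    "fast_resolution_bonus": 16,
    "exploration_diversity": 36,
    "consistent_postmortem_bonus": 36,
    "followup_quality_bonus": 36,
}

total_positive = sum(positive_fires.values())
print(len(positive_fires), total_positive)  # 15 332
```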
Signal catalogue
| Signal (type) | Meaning | Last episode | Mean (n=36) | Fire count (↑ positive, ↓ negative) |
| --- | --- | --- | --- | --- |
| step_cost penalty | -0.01 per step (efficiency pressure) | -0.010 | -0.010 | 36 (↑0 ↓36) |
| root_cause_correct positive | +0.30 for correct postmortem root cause | — | +0.300 | 13 (↑13 ↓0) |
| postmortem_quality positive | +0..0.20 from postmortem grader | +0.000 | +0.072 | 36 (↑13 ↓0) |
| phase_order positive | ± Phase progression (triage → investigate → fix → verify) | +0.050 | +0.072 | 36 (↑36 ↓0) |
| judge positive | LLM judge slice [-0.10, +0.15] | +0.075 | +0.068 | 36 (↑36 ↓0) |
| evidence_alignment positive | Postmortem cites evidence found via logs | — | +0.100 | 7 (↑7 ↓0) |
| mitigation_efficiency positive | Mitigation in ≤3 writes | — | +0.150 | 16 (↑16 ↓0) |
| diverse_evidence positive | ≥3 distinct evidence sources | — | +0.050 | 16 (↑16 ↓0) |
| quick_root_cause_bonus positive | Identified root cause within 4 steps | — | +0.100 | 9 (↑9 ↓0) |
| runbook_match_bonus positive | Postmortem lists correct affected service | +0.050 | +0.050 | 26 (↑26 ↓0) |
| low_thrash_bonus positive | ≤2 write actions in entire episode | +0.050 | +0.050 | 36 (↑36 ↓0) |
| fast_resolution_bonus positive | Mitigation in ≤6 total steps | — | +0.080 | 16 (↑16 ↓0) |
| exploration_diversity positive | Bonus per unique action type used (cap 0.08) | +0.030 | +0.043 | 36 (↑36 ↓0) |
| consistent_postmortem_bonus positive | Required postmortem fields present | +0.060 | +0.060 | 36 (↑36 ↓0) |
| followup_quality_bonus positive | Postmortem proposes ≥3 followups | +0.030 | +0.030 | 36 (↑36 ↓0) |
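The catalogue columns can be reproduced from a per-episode reward breakdown log. The sketch below is one way to do it, under assumptions: the record layout of colab/logs/reward_breakdown_history.jsonl is not shown here, so the code assumes each line is a JSON object mapping signal name to a float for one episode, with unfired signals simply absent. `SignalStats` and `summarize` are hypothetical names. Counting a signal as fired whenever it is present, and averaging over those recorded values, is an interpretation that matches the table (for example, postmortem_quality shows 36 fires but only 13 positive ones), not a confirmed definition.

```python
import json
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class SignalStats:
    fires: int = 0      # episodes in which the signal was recorded
    up: int = 0         # recorded values strictly greater than zero
    down: int = 0       # recorded values strictly less than zero
    total: float = 0.0  # sum of recorded values

    @property
    def mean(self) -> float:
        # Mean over recorded values only (assumed interpretation of the Mean column).
        return self.total / self.fires if self.fires else 0.0


def summarize(lines):
    """Aggregate per-signal stats from JSONL lines.

    Assumed schema (hypothetical): one JSON object per episode,
    mapping signal name -> float; absent signals did not fire.
    """
    stats = defaultdict(SignalStats)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        for name, value in json.loads(line).items():
            s = stats[name]
            s.fires += 1
            s.total += value
            if value > 0:
                s.up += 1
            elif value < 0:
                s.down += 1
    return dict(stats)


# Two made-up episodes, for illustration only (not real log data).
sample = [
    '{"step_cost": -0.01, "postmortem_quality": 0.0, "judge": 0.075}',
    '{"step_cost": -0.01, "postmortem_quality": 0.2, "root_cause_correct": 0.3}',
]
stats = summarize(sample)
for name, s in sorted(stats.items()):
    print(f"{name}: mean={s.mean:+.3f} fires={s.fires} (↑{s.up} ↓{s.down})")
```

To run it against a real log, pass an open file handle instead of `sample`, since `summarize` only needs an iterable of lines.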