PPO KAGGLE · 381 TASKS
These charts come from the 3-shard PPO + LoRA training we ran on free Kaggle T4s. Source data:
the per-shard Kaggle run logs (notebooks/shard{1,2,3}/training_kaggle{N}.json) plus the 381 scenarios in
rl-agent/scenarios/sim/{easy,medium,hard}/*.json, pre-bundled into
rl-agent/showcase_data.json by scripts/build_showcase_data.py.
Every visible number is computed from those files.
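For orientation, the bundling step can look roughly like the sketch below. This is an assumption about what scripts/build_showcase_data.py does; only the glob patterns follow the paths quoted above, and the "shards"/"scenarios" keys are mine.

```python
# Sketch of the bundling step (assumed; the real scripts/build_showcase_data.py
# may differ). Glob patterns follow the paths quoted above.
import glob
import json

def build_showcase_data(out_path="rl-agent/showcase_data.json"):
    # One training log per Kaggle shard.
    shard_logs = [json.load(open(p))
                  for p in sorted(glob.glob("notebooks/shard*/training_kaggle*.json"))]
    # The 381 simulated scenarios, split across easy/medium/hard.
    scenarios = [json.load(open(p))
                 for p in sorted(glob.glob("rl-agent/scenarios/sim/*/*.json"))]
    with open(out_path, "w") as f:
        json.dump({"shards": shard_logs, "scenarios": scenarios}, f)

if __name__ == "__main__":
    build_showcase_data()
```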
PPO reward analytics
The reward range is [-2.0, +1.0] by design. Most components are penalties, so the mean is expected to be negative early in training; what matters is the trend.
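A per-step reward under this scheme is then the sum of whichever components fire, clipped into that range. A minimal sketch follows; the bounds come from the text, but treating the total as a plain clipped sum is an assumption about how components combine.

```python
# Sketch: per-step reward as a clipped sum of the fired components.
# The [-2.0, +1.0] bounds are from the text; summing then clipping is assumed.
def step_reward(components: dict[str, float]) -> float:
    return max(-2.0, min(1.0, sum(components.values())))

# An early blind step that repeats a command: about -0.36.
print(step_reward({"step_cost": -0.01, "acting_blind": -0.20,
                   "repeat_command": -0.15}))
```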
| Mean reward sign | Tasks | Share |
| --- | --- | --- |
| Positive | 0 | 0.0% |
| Negative | 381 | 100.0% |
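Both counts are easy to recompute from the bundle. A rough sketch, assuming each scenario entry in showcase_data.json carries its per-episode rewards under a "rewards" key (the field names are assumptions):

```python
# Sketch: reproduce the positive/negative-mean counts from the bundle.
# "scenarios" and "rewards" are assumed field names.
import json
from statistics import mean

data = json.load(open("rl-agent/showcase_data.json"))
means = [mean(s["rewards"]) for s in data["scenarios"]]
positive = sum(m > 0 for m in means)
print(f"positive mean: {positive}/{len(means)}, "
      f"negative mean: {len(means) - positive}/{len(means)}")
```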
[Chart] Per-update mean reward · all 3 shards
[Chart] Reward distribution across the 381 tasks
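Both charts can be regenerated from the same bundle with matplotlib. A sketch under the same assumed field names, plus an assumed per-shard "updates" list carrying a "mean_reward" per PPO update:

```python
# Sketch: regenerate both charts from the bundle.
# "shards", "updates", "mean_reward", "scenarios", "rewards" are assumed keys.
import json
import matplotlib.pyplot as plt

data = json.load(open("rl-agent/showcase_data.json"))
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: per-update mean reward, one line per shard.
for i, shard in enumerate(data["shards"], start=1):
    ax1.plot([u["mean_reward"] for u in shard["updates"]], label=f"shard {i}")
ax1.set(title="Per-update mean reward", xlabel="PPO update", ylabel="mean reward")
ax1.legend()

# Right panel: distribution of per-task mean reward across the 381 tasks.
task_means = [sum(s["rewards"]) / len(s["rewards"]) for s in data["scenarios"]]
ax2.hist(task_means, bins=30)
ax2.set(title="Reward distribution (381 tasks)", xlabel="task mean reward")

fig.tight_layout()
plt.show()
```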
Reward signal palette
| Component | Sign | Magnitude |
| --- | --- | --- |
| step_cost | − | 0.01 / step |
| acting_blind | − | 0.20 |
| red_herring | − | 0.15 |
| repeat_command | − | 0.15 / repeat (cap 0.45) |
| wrong_service | − | 0.15 |
| time_penalty | − | 0.05 × max(0, step − 5) |
| phase_regression | − | 0.10 |
| blast_radius_increase | − | 0.10 |
| investigation_bonus | + | 0.05 / unique read |
| useful_log_query | + | 0.10 |
| correct_mitigation | + | 0.20 |
| root_cause_correct | + | 0.30 |
| postmortem_quality | + | up to 0.20 |
| phase_order | + | 0.10 |
| judge | ± | up to 0.15 |
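The two step-dependent rows reduce to one-line formulas, sketched below; the function names are illustrative, only the formulas come from the table.

```python
# Sketch of the two step-dependent penalties from the table above.
def time_penalty(step: int) -> float:
    # -0.05 per step beyond the fifth.
    return -0.05 * max(0, step - 5)

def repeat_penalty(n_repeats: int) -> float:
    # -0.15 per repeated command, capped at -0.45 total.
    return -min(0.15 * n_repeats, 0.45)

assert time_penalty(5) == 0.0 and time_penalty(6) == -0.05
assert repeat_penalty(1) == -0.15 and repeat_penalty(5) == -0.45
```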