PPO KAGGLE · 381 TASKS
These charts come from the 3-shard PPO + LoRA training we ran on free Kaggle T4s. Source data: the per-shard Kaggle run logs (shard {1,2,3}/training_kaggle{N}.json) plus the 381 scenarios in rl-agent/scenarios/sim/{easy,medium,hard}/*.json, pre-bundled into rl-agent/showcase_data.json by scripts/build_showcase_data.py. Every visible number is computed from those files.

PPO reward analytics

Reward range is [-2.0, +1.0] by design. Most components are penalties, so the mean is expected to be negative early in training; what matters is the trend.
Tasks with positive mean    0 (0.0%)
Tasks with negative mean    381 (100.0%)
Best task mean              -3.780
Worst task mean             -8.280

[Chart] Per-update mean reward · all 3 shards

[Chart] Reward distribution across the 381 tasks
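
To regenerate the two charts locally, a minimal matplotlib sketch follows. The per-shard log layout (a list of {"update", "mean_reward"} records) and the showcase_data.json keys are assumptions, not the verified file formats:

```python
import json
import matplotlib.pyplot as plt

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))

# Left panel: per-update mean reward, one line per shard. The log layout
# (a list of {"update": ..., "mean_reward": ...} records per shard file)
# is assumed for illustration.
for shard in (1, 2, 3):
    with open(f"training_kaggle{shard}.json") as f:
        log = json.load(f)
    ax1.plot([u["update"] for u in log],
             [u["mean_reward"] for u in log],
             label=f"shard {shard}")
ax1.set_xlabel("PPO update")
ax1.set_ylabel("mean reward")
ax1.legend()

# Right panel: histogram of the 381 per-task mean rewards
# (same assumed schema as the previous sketch).
with open("rl-agent/showcase_data.json") as f:
    means = [t["reward_mean"] for t in json.load(f)["tasks"]]
ax2.hist(means, bins=30)
ax2.set_xlabel("per-task mean reward")
ax2.set_ylabel("tasks")

plt.tight_layout()
plt.show()
```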

Reward signal palette

Component                Sign   Magnitude
step_cost                −      0.01 / step
acting_blind             −      0.20
red_herring              −      0.15
repeat_command           −      0.15 / repeat (cap 0.45)
wrong_service            −      0.15
time_penalty             −      0.05 × max(0, step − 5)
phase_regression         −      0.10
blast_radius_increase    −      0.10
investigation_bonus      +      0.05 / unique read
useful_log_query         +      0.10
correct_mitigation       +      0.20
root_cause_correct       +      0.30
postmortem_quality       +      up to 0.20
phase_order              +      0.10
judge                    ±      up to 0.15
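
To make the palette concrete, here is a minimal sketch of how these components could compose into one scalar step reward. Only the weights come from the table above; the StepInfo fields, their names, and the final clip to the design range are illustrative assumptions, not the environment's real API:

```python
from dataclasses import dataclass

@dataclass
class StepInfo:
    """Hypothetical per-step flags; the real environment state is richer."""
    step: int
    acted_blind: bool = False
    chased_red_herring: bool = False
    repeat_count: int = 0             # times this command was repeated
    wrong_service: bool = False
    phase_regressed: bool = False
    blast_radius_grew: bool = False
    unique_reads: int = 0
    useful_log_query: bool = False
    correct_mitigation: bool = False
    root_cause_correct: bool = False
    postmortem_quality: float = 0.0   # assumed in [0, 1]
    phase_order_kept: bool = False
    judge_score: float = 0.0          # assumed in [-1, 1]

def step_reward(s: StepInfo) -> float:
    r = -0.01                                   # step_cost
    r -= 0.20 * s.acted_blind                   # acting_blind
    r -= 0.15 * s.chased_red_herring            # red_herring
    r -= min(0.15 * s.repeat_count, 0.45)       # repeat_command (cap 0.45)
    r -= 0.15 * s.wrong_service                 # wrong_service
    r -= 0.05 * max(0, s.step - 5)              # time_penalty
    r -= 0.10 * s.phase_regressed               # phase_regression
    r -= 0.10 * s.blast_radius_grew             # blast_radius_increase
    r += 0.05 * s.unique_reads                  # investigation_bonus
    r += 0.10 * s.useful_log_query              # useful_log_query
    r += 0.20 * s.correct_mitigation            # correct_mitigation
    r += 0.30 * s.root_cause_correct            # root_cause_correct
    r += 0.20 * s.postmortem_quality            # postmortem_quality (up to 0.20)
    r += 0.10 * s.phase_order_kept              # phase_order
    r += 0.15 * s.judge_score                   # judge (± up to 0.15)
    return max(-2.0, min(1.0, r))               # assumed clip to design range
```

The clip on the last line is one way to enforce the stated [-2.0, +1.0] design range; whether the real implementation clips or simply never exceeds those bounds is not confirmed by the source data.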