PPO KAGGLE · 381 TASKS
These charts come from the 3-shard PPO + LoRA training we ran on free Kaggle T4s. Source data: the per-shard run logs (kaggle ran notebooks/shard {1,2,3}/training_kaggle{N}.json) plus the 381 scenarios in
rl-agent/scenarios/sim/{easy,medium,hard}/*.json, pre-bundled into
rl-agent/showcase_data.json by scripts/build_showcase_data.py.
Every visible number is computed from those files.
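As a rough illustration of how those numbers could be recomputed from the bundle, here is a minimal sketch. The schema (a `"shards"` list with per-update `"mean_rewards"` and a `"tasks"` count) is an assumption for illustration, not the actual layout of showcase_data.json:

```python
import json

def summarize(path="rl-agent/showcase_data.json"):
    """Recompute headline stats from the bundled showcase data.

    Assumes a hypothetical schema:
        {"shards": [{"mean_rewards": [...], "tasks": 127}, ...]}
    The real file produced by scripts/build_showcase_data.py may differ.
    """
    with open(path) as f:
        data = json.load(f)
    shards = data["shards"]
    finals = [s["mean_rewards"][-1] for s in shards]  # last update per shard
    return {
        "tasks_trained": sum(s["tasks"] for s in shards),
        "ppo_updates": sum(len(s["mean_rewards"]) for s in shards),
        "final_mean_reward": sum(finals) / len(finals),
    }
```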
PPO + LoRA · 3-Shard Overview
Phi-3.5-mini actor + DeepSeek-R1 critic, 4-bit, sharded across 3 free Kaggle T4s.
| Metric | Value | Note |
| --- | --- | --- |
| Scenarios | 381 | all simulated incidents |
| Tasks trained | 381 | 100% coverage |
| PPO updates | 180 | 60 × 3 shards |
| Wall time | 22.6 h | aggregate, all shards |
| Final mean reward | -0.533 | Δ vs. first -0.592 |
Per-shard final outcome
| Shard | Final mean reward | Tasks | Wall time |
| --- | --- | --- | --- |
| 1 | -0.523 | 127 | 454 min |
| 2 | -0.469 | 127 | 459 min |
| 3 | -0.607 | 127 | 441 min |
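The aggregate figures above follow directly from the per-shard numbers; a quick check (values copied from this page, field names illustrative):

```python
# Recompute the headline aggregates from the per-shard summaries.
shards = [
    {"final_mean_reward": -0.523, "tasks": 127, "minutes": 454},  # shard 1
    {"final_mean_reward": -0.469, "tasks": 127, "minutes": 459},  # shard 2
    {"final_mean_reward": -0.607, "tasks": 127, "minutes": 441},  # shard 3
]

# Task-weighted mean; with 127 tasks per shard this equals a plain mean.
total_tasks = sum(s["tasks"] for s in shards)
final_mean = sum(s["final_mean_reward"] * s["tasks"] for s in shards) / total_tasks

wall_hours = sum(s["minutes"] for s in shards) / 60

print(round(final_mean, 3))   # -0.533, matching the overview card
print(round(wall_hours, 1))   # 22.6
```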
Mean-reward trajectory per shard
Hyper-parameters
| Param | Value | Param | Value |
| --- | --- | --- | --- |
| actor | /kaggle/input/models/Microsoft/phi-3/pytorch/phi-3.5-mini-instruct/2 | critic | /kaggle/input/models/deepseek-ai/deepseek-r1-0528/transformers/deepseek-r1-0528-qwen3-8b/1 |
| lora_r | 16 | lora_alpha | 32 |
| lr | 1e-05 | gamma | 0.95 |
| gae_lambda | 0.92 | clip_eps | 0.2 |
| kl_coef | 0.02 | rollouts/update | 3 |
| max_steps/episode | 12 | updates/shard | 60 |
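For reference, this is how the PPO-specific values in the table fit together. The sketch below is a generic, dependency-free rendering of GAE and the clipped surrogate with a KL penalty, using the tabled values as defaults; it is not the repo's actual training code, and the KL term uses a simple log-prob-difference estimate:

```python
import math

def gae_advantages(rewards, values, gamma=0.95, lam=0.92):
    """Generalized Advantage Estimation over one episode (here <= 12 steps)."""
    T = len(rewards)
    adv = [0.0] * T
    last = 0.0
    for t in reversed(range(T)):
        next_v = values[t + 1] if t + 1 < T else 0.0  # bootstrap 0 past the end
        delta = rewards[t] + gamma * next_v - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv

def ppo_objective(new_logp, old_logp, adv, clip_eps=0.2, kl_coef=0.02):
    """Per-sample clipped surrogate minus a KL penalty (to be maximized)."""
    ratio = math.exp(new_logp - old_logp)
    clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps)
    surrogate = min(ratio * adv, clipped * adv)
    approx_kl = old_logp - new_logp  # crude estimate of KL(old || new)
    return surrogate - kl_coef * approx_kl
```

With identical old and new policies the ratio is 1 and the objective reduces to the raw advantage; when the ratio drifts past 1 ± clip_eps, the clipped branch caps the incentive to move further.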