PPO KAGGLE · 381 TASKS
These charts come from the 3-shard PPO + LoRA training run on free Kaggle T4s. Source data: the per-shard training logs (kaggle ran notebooks/shard {1,2,3}/training_kaggle{N}.json) plus the 381 scenarios in rl-agent/scenarios/sim/{easy,medium,hard}/*.json, pre-bundled into rl-agent/showcase_data.json by scripts/build_showcase_data.py. Every visible number is computed from those files.
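A minimal sketch of what a bundler like scripts/build_showcase_data.py could do, assuming it simply concatenates the shard logs and scenario JSONs into one blob (the real script's schema and field names are unknown; everything here is illustrative):

```python
import json
from pathlib import Path

def build_showcase_data(shard_logs, scenario_dir, out_path):
    """Bundle per-shard training logs and scenario files into one JSON file.

    shard_logs:   list of paths to training_kaggle{N}.json logs
    scenario_dir: directory containing easy/medium/hard subdirectories of *.json
    out_path:     destination for the bundled showcase data
    """
    bundle = {"shards": [], "scenarios": []}
    for log_path in shard_logs:
        with open(log_path) as f:
            bundle["shards"].append(json.load(f))
    # One level of difficulty subdirectories (easy/medium/hard), then *.json.
    for scn in sorted(Path(scenario_dir).glob("*/*.json")):
        with open(scn) as f:
            bundle["scenarios"].append(json.load(f))
    with open(out_path, "w") as f:
        json.dump(bundle, f)
    return bundle
```

Bundling ahead of time keeps the showcase page static: it reads one pre-computed JSON file instead of re-parsing 381 scenario files on every load.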

PPO + LoRA · 3-Shard Overview

Phi-3.5-mini actor + DeepSeek-R1 critic, 4-bit, sharded across 3 free Kaggle T4s.

Scenarios          381     all simulated incidents
Tasks trained      381     100% coverage
PPO updates        180     60 × 3 shards
Wall time          22.6 h  aggregate, all shards
Final mean reward  -0.533  Δ vs. first: -0.592

Per-shard final outcome

Shard 1  -0.523  final mean reward · 127 tasks · 454 min
Shard 2  -0.469  final mean reward · 127 tasks · 459 min
Shard 3  -0.607  final mean reward · 127 tasks · 441 min

Mean-reward trajectory per shard

Hyper-parameters

Param              Value
actor              /kaggle/input/models/Microsoft/phi-3/pytorch/phi-3.5-mini-instruct/2
critic             /kaggle/input/models/deepseek-ai/deepseek-r1-0528/transformers/deepseek-r1-0528-qwen3-8b/1
lora_r             16
lora_alpha         32
lr                 1e-05
gamma              0.95
gae_lambda         0.92
clip_eps           0.2
kl_coef            0.02
rollouts/update    3
max_steps/episode  12
updates/shard      60
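To make the PPO-specific knobs concrete, here is a minimal pure-Python sketch of where gamma, gae_lambda, clip_eps, and kl_coef typically enter a PPO update (textbook form, assuming standard GAE and the clipped surrogate with a KL penalty; this is not the repo's implementation):

```python
def compute_gae(rewards, values, gamma=0.95, lam=0.92):
    """Generalized Advantage Estimation.

    rewards: r_0..r_{T-1}; values: V(s_0)..V(s_T), one extra bootstrap value.
    Recurrence (backwards over time):
        delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        A_t     = delta_t + gamma * lam * A_{t+1}
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

def ppo_loss(ratio, advantage, kl, clip_eps=0.2, kl_coef=0.02):
    """Clipped surrogate objective with a KL penalty (scalar form).

    ratio = pi_new(a|s) / pi_old(a|s); kl = an estimate of KL(pi_old || pi_new).
    """
    clipped = max(min(ratio, 1 + clip_eps), 1 - clip_eps)
    surrogate = min(ratio * advantage, clipped * advantage)
    return -surrogate + kl_coef * kl
```

With clip_eps = 0.2, the ratio is confined to [0.8, 1.2] whenever that is the more pessimistic choice, bounding the size of any single policy update; the kl_coef term further penalizes drift from the rollout policy between the 60 updates each shard performs.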