PPO KAGGLE · 381 TASKS
These charts come from the 3-shard PPO + LoRA training we ran on free Kaggle T4s. Source data:
the Kaggle run notebooks (shard{1,2,3}/training_kaggle{N}.json) plus the 381 scenarios in
rl-agent/scenarios/sim/{easy,medium,hard}/*.json, pre-bundled into
rl-agent/showcase_data.json by scripts/build_showcase_data.py.
Every visible number is computed from those files.
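To recompute any of those numbers yourself, the bundle is plain JSON. A minimal inspection sketch follows; the exact schema is whatever scripts/build_showcase_data.py emits, so check the top-level keys before assuming any structure:

```python
import json
from pathlib import Path

# The bundle is plain JSON; its schema is whatever
# scripts/build_showcase_data.py emitted, so inspect before assuming keys.
data = json.loads(Path("rl-agent/showcase_data.json").read_text())
print(sorted(data) if isinstance(data, dict) else type(data))
```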
Judge configuration · PPO mode
During the Kaggle run we used the DeepSeek-R1 critic for value estimation, not the LLM judge; the judge remains fully wired and can be re-enabled for fine-tuning passes.
Two scoring paths
| Path | Used for | Where |
| --- | --- | --- |
| Critic value | Per-(state, action) value baseline V(s, a) consumed by GAE. | environment/llm_critic.py |
| LLM judge | Per-action 0–1 score blended into the reward; persona-styled (junior/senior/principal). | environment/judge.py |
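To make the split concrete, here is a minimal sketch of how the two signals could enter a PPO update: the judge score is folded into the step reward, and the critic's value estimates act as the baseline inside GAE. Helper names and the judge_weight knob are hypothetical; the real wiring lives in the files listed above.

```python
import numpy as np

def blend_reward(env_reward: float, judge_score: float, judge_weight: float = 0.5) -> float:
    """Fold the judge's 0-1 score into the raw step reward.

    judge_weight is a hypothetical knob; the actual mix lives in environment/judge.py.
    """
    return env_reward + judge_weight * judge_score

def gae_advantages(rewards, values, gamma: float = 0.99, lam: float = 0.95):
    """Generalized Advantage Estimation over one trajectory.

    values holds the critic's value estimate for each step plus one
    bootstrap value at the end (len(values) == len(rewards) + 1).
    """
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        # TD residual at step t, then the usual exponentially weighted sum.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```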
Critic prompt rubric (excerpt)
```text
You are an expert SRE auditor. Given:
- the current cluster observation,
- a candidate action the agent is about to take,
return a score 0..10:
  0 = catastrophic / clearly wrong
  3 = stalls progress
  5 = neutral
  7 = useful read or sane fix
  9 = directly addresses the root cause
Return ONLY the integer.
```
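Consuming that rubric defensively means parsing the reply strictly and clamping to the 0..10 range before rescaling. A sketch under those assumptions; the neutral fallback is our guess, not documented repo behavior:

```python
import re

def parse_critic_score(reply: str) -> int:
    """Extract the 0..10 integer the rubric demands, clamping off-scale replies."""
    match = re.search(r"-?\d+", reply)
    if match is None:
        return 5  # neutral fallback; an assumption, not documented repo behavior
    return min(max(int(match.group()), 0), 10)

def critic_value(reply: str) -> float:
    """Rescale the rubric score to the 0..1 range a value baseline usually expects."""
    return parse_critic_score(reply) / 10.0
```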
Switching at inference time
```bash
curl -X POST $BASE/judge/config \
  -H 'Content-Type: application/json' \
  -d '{"persona":"principal","use_llm":true}'
```