PPO KAGGLE · 381 TASKS
These charts come from the 3-shard PPO + LoRA training run we did on free Kaggle T4s. Source data: the per-shard run logs kaggle ran notebooks/shard {1,2,3}/training_kaggle{N}.json plus the 381 scenarios in rl-agent/scenarios/sim/{easy,medium,hard}/*.json, pre-bundled into rl-agent/showcase_data.json by scripts/build_showcase_data.py. Every visible number is computed from those files.
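The bundling step can be sketched roughly like this. This is a hypothetical shape for what a build_showcase_data.py-style bundler does; the function name and output keys are assumptions, not the repo's actual schema:

```python
import json
from pathlib import Path

def bundle_showcase(shard_log_paths, scenario_paths):
    """Merge per-shard training logs and scenario JSON files into one dict.
    Sketch only: the real build_showcase_data.py may use different keys
    and add derived statistics."""
    return {
        "shards": [json.loads(Path(p).read_text()) for p in shard_log_paths],
        "scenarios": [json.loads(Path(p).read_text()) for p in scenario_paths],
    }
```

Writing the result with json.dump then gives a single file the charts can load without touching the per-shard logs again.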

Judge configuration · PPO mode

During the Kaggle run we used the DeepSeek-R1 critic for value estimation rather than the LLM judge; the judge stays fully wired and can be switched on for fine-tuning passes.

Two scoring paths

Path           Used for                                                          Where
Critic value   Per-(state, action) value baseline V(s, a) consumed by GAE.       environment/llm_critic.py
LLM judge      Per-action 0–1 score blended into the reward; persona-styled      environment/judge.py
               (junior/senior/principal).
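The critic values in the first row feed GAE. A minimal per-trajectory sketch of how those per-step baselines turn into advantages (the gamma/lambda defaults here are illustrative, not the run's actual hyperparameters):

```python
def gae_advantages(rewards, values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one trajectory.
    `values` are the per-step critic baselines; the bootstrap value
    after a terminal step (or the last step) is taken as 0."""
    advantages = [0.0] * len(rewards)
    last = 0.0
    for t in reversed(range(len(rewards))):
        terminal = dones[t] or t == len(rewards) - 1
        next_value = 0.0 if terminal else values[t + 1]
        delta = rewards[t] + gamma * next_value - values[t]   # TD residual
        last = delta + gamma * lam * (0.0 if dones[t] else last)
        advantages[t] = last
    return advantages
```

In PPO the resulting advantages are then normalized per batch before computing the clipped policy loss.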

Critic prompt rubric (excerpt)

You are an expert SRE auditor. Given:
  - the current cluster observation,
  - a candidate action the agent is about to take,
return a score 0..10:
  0 = catastrophic / clearly wrong
  3 = stalls progress
  5 = neutral
  7 = useful read or sane fix
  9 = directly addresses the root cause
Return ONLY the integer.
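The rubric asks for a bare integer, but LLM replies are not always clean. A hedged sketch of how such a reply could be parsed and mapped onto a signed value signal; the real environment/llm_critic.py may handle this differently:

```python
import re

def parse_critic_score(reply, default=5):
    """Pull the first 0..10 integer out of the critic's reply and map it
    to [-1, 1], so 5 (neutral) becomes 0. Falls back to the neutral
    default when the model returns extra text or nothing parseable.
    Sketch only; the parsing and scaling are assumptions."""
    m = re.search(r"\b(10|[0-9])\b", reply)
    score = int(m.group(1)) if m else default
    return (score - 5) / 5.0
```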

Switching at inference time

curl -X POST $BASE/judge/config \
  -H 'Content-Type: application/json' \
  -d '{"persona":"principal","use_llm":true}'
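The same switch can be flipped from Python with only the standard library. A minimal equivalent of the curl call; the response format of /judge/config is not documented here, so the sketch returns the raw body rather than assuming a schema:

```python
import json
from urllib.request import Request, urlopen

def set_judge_config(base_url, persona="principal", use_llm=True):
    """POST the same payload as the curl example to $BASE/judge/config.
    Returns (HTTP status, raw response body); nothing is parsed because
    the endpoint's response shape is an unknown here."""
    body = json.dumps({"persona": persona, "use_llm": use_llm}).encode()
    req = Request(f"{base_url}/judge/config", data=body,
                  headers={"Content-Type": "application/json"}, method="POST")
    with urlopen(req) as resp:
        return resp.status, resp.read().decode()
```

Setting use_llm back to false would presumably revert scoring to the critic path, though that behavior is inferred from the table above rather than stated.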