PPO KAGGLE · 381 TASKS
These charts come from the 3-shard PPO + LoRA training we ran on free Kaggle T4s. Source data:
the Kaggle run notebooks (shard{1,2,3}/training_kaggle{N}.json) plus the 381 scenarios in
rl-agent/scenarios/sim/{easy,medium,hard}/*.json, pre-bundled into
rl-agent/showcase_data.json by scripts/build_showcase_data.py.
Every visible number is computed from those files.
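To recompute any of those numbers yourself, the bundle is plain JSON. A minimal inspection sketch follows; the exact schema is whatever scripts/build_showcase_data.py emits, so check the top-level keys before assuming any structure:

```python
import json
from pathlib import Path

# The bundle is plain JSON; its schema is whatever
# scripts/build_showcase_data.py emitted, so inspect before assuming keys.
data = json.loads(Path("rl-agent/showcase_data.json").read_text())
print(sorted(data) if isinstance(data, dict) else type(data))
```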
Judge configuration · PPO mode
During the Kaggle run we used the DeepSeek-R1 critic for value estimation, not the LLM judge; the judge remains fully wired and can be re-enabled for fine-tuning passes.
Two scoring paths
| Path | Used for | Where |
| --- | --- | --- |
| Critic value | Per-(state, action) value baseline V(s, a) consumed by GAE. | environment/llm_critic.py |
| LLM judge | Per-action 0–1 score blended into the reward; persona-styled (junior/senior/principal). | environment/judge.py |
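To make the split concrete, here is a minimal sketch of how the two signals could enter a PPO update: the judge score is folded into the step reward, and the critic's value estimates act as the baseline inside GAE. Helper names and the judge_weight knob are hypothetical; the real wiring lives in the files listed above.

```python
import numpy as np

def blend_reward(env_reward: float, judge_score: float, judge_weight: float = 0.5) -> float:
    """Fold the judge's 0-1 score into the raw step reward.

    judge_weight is a hypothetical knob; the actual mix lives in environment/judge.py.
    """
    return env_reward + judge_weight * judge_score

def gae_advantages(rewards, values, gamma: float = 0.99, lam: float = 0.95):
    """Generalized Advantage Estimation over one trajectory.

    values holds the critic's value estimate for each step plus one
    bootstrap value at the end (len(values) == len(rewards) + 1).
    """
    advantages = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        # TD residual at step t, then the usual exponentially weighted sum.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages
```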
Critic prompt rubric (excerpt)
```text
You are an expert SRE auditor. Given:
- the current cluster observation,
- a candidate action the agent is about to take,
return a score 0..10:
  0 = catastrophic / clearly wrong
  3 = stalls progress
  5 = neutral
  7 = useful read or sane fix
  9 = directly addresses the root cause
Return ONLY the integer.
```
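Consuming that rubric defensively means parsing the reply strictly and clamping to the 0..10 range before rescaling. A sketch under those assumptions; the neutral fallback is our guess, not documented repo behavior:

```python
import re

def parse_critic_score(reply: str) -> int:
    """Extract the 0..10 integer the rubric demands, clamping off-scale replies."""
    match = re.search(r"-?\d+", reply)
    if match is None:
        return 5  # neutral fallback; an assumption, not documented repo behavior
    return min(max(int(match.group()), 0), 10)

def critic_value(reply: str) -> float:
    """Rescale the rubric score to the 0..1 range a value baseline usually expects."""
    return parse_critic_score(reply) / 10.0
```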
Switching at inference time
```bash
curl -X POST $BASE/judge/config \
  -H 'Content-Type: application/json' \
  -d '{"persona":"principal","use_llm":true}'
```