IncidentCommander
RL · OpenEnv · Hackathon 2026

Teaching an AI
to be on-call.

381 production-grade incident scenarios. A 4-bit Phi-3.5-mini actor, an 8-bit DeepSeek-R1 critic, PPO with phase-aware rewards, sharded across three free Kaggle T4 GPUs. A complete pipeline — from procedurally-generated chaos to a live Hetzner Kubernetes cluster — for an autonomous SRE that learns to triage, investigate, fix, and verify.

Live HF Space · sagnik-mukherjee/incodent-commander
381 scenarios · 3 shards · 180 PPO updates total
  • Scenarios: 381 · curated + procedurally generated
  • Tasks trained: 100% coverage across 3 shards
  • PPO updates: 60 per shard · 60 × 3 = 180 total
  • Wall time: aggregate across all shards
What it is

An RL environment that puts the agent on the 3 AM PagerDuty rotation.

Most RL benchmarks for LLMs use synthetic puzzles. We don't — every scenario simulates a real incident pattern (DynamoDB throttling, IAM permission chains, memory leaks hidden behind cascading 504s, runbook traps, adversarial saboteurs). The agent must investigate before acting, follow the correct triage → investigate → fix → verify phase order, and survive multi-fault scenarios where one fix is not enough.

Self-improving curriculum

A controller tracks per-task mastery and escalates difficulty tiers automatically. Warmup → beginner → intermediate → advanced → expert.
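
A minimal sketch of such a controller, assuming a per-task exponential moving average of normalised episode reward as the mastery signal and a promotion threshold of 0.8. The class name, attribute names, and thresholds are illustrative, not the repo's actual controller.

# illustrative sketch (not the repo's curriculum controller)
TIERS = ["warmup", "beginner", "intermediate", "advanced", "expert"]

class CurriculumController:
    def __init__(self, promote_at=0.8, ema_alpha=0.2):
        self.mastery = {}            # task_id -> EMA of normalised episode reward
        self.tier_idx = 0            # start at "warmup"
        self.promote_at = promote_at
        self.ema_alpha = ema_alpha

    def record(self, task_id, reward_norm):
        prev = self.mastery.get(task_id, 0.0)
        self.mastery[task_id] = (1 - self.ema_alpha) * prev + self.ema_alpha * reward_norm

    def maybe_escalate(self, tasks_in_tier):
        # promote only when every task in the current tier clears the mastery bar
        if all(self.mastery.get(t, 0.0) >= self.promote_at for t in tasks_in_tier):
            self.tier_idx = min(self.tier_idx + 1, len(TIERS) - 1)
        return TIERS[self.tier_idx]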

Adversarial scenario designer

At expert tier, an LLM composes novel incidents that target the agent's tracked weaknesses — infinite, non-repeating scenarios.

3-persona LLM judge

Every action is critiqued by a Junior / Senior / Principal SRE persona with progressively stricter evaluation. Snorkel-style experts-in-the-loop.

Phase-aware rewards

Actions classified as triage → investigate → fix → verify; agent earns bonuses for correct workflow order and loses reward for regressing phases.
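
A sketch of how the phase-order bonus and regression penalty can be applied per action, using the +0.10 / −0.10 values quoted in the reward breakdown lower on this page. The phase classifier and the function are placeholders, not the environment's code.

# illustrative sketch of phase-order shaping
PHASES = ["triage", "investigate", "fix", "verify"]

def phase_shaping(prev_phase_idx, action_phase):
    """Return (shaping reward, updated phase index) for one classified action."""
    idx = PHASES.index(action_phase)
    if idx > prev_phase_idx:
        # advancing in the triage -> investigate -> fix -> verify order earns the bonus
        return 0.10, idx
    if idx < prev_phase_idx:
        # regressing phases (e.g. triaging again after a fix) is penalised
        return -0.10, prev_phase_idx
    return 0.0, prev_phase_idx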

Context-gated penalties

Acting blind, repeating commands, targeting red-herring services — all penalised. Prevents reward-hacking through action spam.

Live infrastructure option

Write actions normally hit a mock cluster but can be routed to a live Kubernetes cluster (REAL_K8S=true) — same code path.

A novel observation channel

The agent reads coworker Slack chatter — not just metrics.

Real on-call engineers don't just stare at dashboards. They scroll Slack — where the CEO is shouting, a frontend dev is asking if their hotfix broke prod, the intern wants to "just restart the cluster", and buried in the noise a DBA quietly mentions Postgres CPU is climbing on a replica. We built that channel into the environment. Every scenario emits a deterministic stream of human messages — some are real clues, most are red herrings — and the agent has to learn which voices to trust.

What's in the stream

Each tick the simulator samples from a templated pool seeded by the scenario (a sketch of the per-tick sampling follows this list):

  • Saboteur-coupled lines — DBAs / Platform / Customer Support corroborate the actual fault as it escalates.
  • Generic chatter — VP Eng asking ETAs, Comms drafting status pages, Finance reminding everyone every minute is $8k.
  • Red herrings — a frontend dev's hotfix, a Security CloudTrail ping, an Athena pipeline failure. None of them caused the incident; acting on them costs reward.
  • Severity escalation — the longer the incident drags, the more high-severity messages appear, mirroring real pressure.
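
A minimal sketch of what that deterministic, escalating sampling can look like. The emit_for_tick name mirrors the flow diagram further down the page; the rate splitting, pool mixing, and escalation odds are illustrative guesses, not the simulator's actual logic.

# illustrative sketch (not rl-agent/simulator/slack.py itself)
import random

def emit_for_tick(seed, tick, rate, generic_pool, phase_pool, escalated):
    rng = random.Random(f"{seed}:{tick}")          # same seed + tick -> same chatter
    # fractional msgs_per_tick, e.g. 1.4 -> one message plus a 40% chance of a second
    n = int(rate) + (1 if rng.random() < rate - int(rate) else 0)
    pool = list(generic_pool) + (list(phase_pool) if escalated else [])
    msgs = [rng.choice(pool) for _ in range(n)]
    # severity escalation: the longer the incident drags, the more "high" lines appear
    if tick > 5 and rng.random() < min(0.1 * (tick - 5), 0.9):
        msgs.append(("CEO", "high", "I'm getting calls. Tell me what's happening."))
    return msgs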

Why it matters

Most LLM-RL benchmarks give the agent perfectly structured JSON observations. Real incidents are buried in unstructured human text, mixed with noise, mixed with politics. By adding a SlackStream alongside metrics and traces we force the policy to learn a real on-call skill: parse human chatter, weight it against telemetry, ignore the noise.

The action platform.read_slack is rewarded as useful information gathering when followed by an action that targets a service mentioned in a recent message — and penalised as a red herring when the agent acts on a misleading line (e.g. rolling back the wrong service because Sara from Frontend mentioned a hotfix).
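
In code, that gate can be as small as a set-membership check against the services mentioned in recent chatter; the constants come from the reward table lower on the page, and the message and record shapes here are placeholders.

# illustrative sketch of the slack clue / red-herring gate
def grade_against_slack(action_target, recent_msgs, red_herring_services):
    # services name-dropped in the last few Slack lines
    mentioned = {svc for msg in recent_msgs for svc in msg.get("services", [])}
    if action_target in red_herring_services and action_target in mentioned:
        return -0.15   # acted on a misleading line (red_herring_penalty)
    if action_target in mentioned:
        return +0.10   # chatter corroborated the target (useful_log_query)
    return 0.0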

Scenario JSON · Slack config

From scenarios/sim/hard/sim_gen_cascade_payments_db_005.json — a hard cascade where the truth is in payments_db but alerts fire on checkout. Slack chatter is dialled to 1.4 messages per tick.

// scenario JSON declares the Slack stream
{
  "id": "sim_gen_cascade_payments_db_005",
  "saboteur": {
    "primary_target": "payments",
    "aggressiveness": 0.5
  },
  "slack": { "msgs_per_tick": 1.4 },
  "correct_action_chain": [
    { "id": "platform.read_slack",
      "params": { "last_n": 5 } },
    { "id": "platform.describe_topology" },
    { "id": "platform.get_logs",
      "params": { "service": "payments_db" } },
    { "id": "platform.vacuum_freeze_db",
      "params": { "target": "payments_db" } }
  ]
}

Templates · what coworkers actually say

From rl-agent/simulator/slack.py — fully deterministic per seed. Mix of generic + phase-coupled lines.

# rl-agent/simulator/slack.py — _GENERIC pool (excerpt)
_GENERIC = [
  ("@channel", "info",
     "Customers reporting 500s on /checkout 🔥"),

  ("Sara (Frontend)", "info",    # red herring
     "Hey SRE, I just pushed a hotfix for the cart UI"
     " — could that be related?"),

  ("Intern (Jamal)", "info",
     "Should I just restart the cluster? 😅"),

  ("DBA (Yuki)", "info",             # buried clue
     "Postgres CPU is climbing on the replica btw."),

  ("Finance", "info",
     "Every minute of downtime is ~$8k for us."),

  ("CEO", "high",
     "I'm getting calls. Tell me what's happening."),
]

# Phase-coupled lines fire when saboteur escalates:
_PHASE_LINES = {
  "attack_failover": [
    ("DBA (Yuki)", "high",
       "{svc} spiked to 90% CPU."
       " Did someone kick off a backup?"),
  ],
}

Signal vs. noise — how Slack flows through training

Each Slack message becomes part of the agent's observation token stream. The actor must classify what's a clue and what's noise before emitting an action. The reward signal then retroactively grades the choice.

%%{init: {'theme':'dark','themeVariables':{'primaryColor':'#11162b','primaryTextColor':'#e9eeff','primaryBorderColor':'#7aa7ff','lineColor':'#7aa7ff','tertiaryColor':'#0a0d1c'}}}%%
flowchart LR
  SC["Scenario JSON<br/>slack: msgs_per_tick=1.4"] --> SS["SlackStream<br/>(deterministic, seed-driven)"]
  SAB["Saboteur phase<br/>attack_primary / failover / dependency"] --> SS
  SS -->|emit_for_tick| OBS["Observation<br/>{ metrics, traces, slack[] }"]
  OBS --> ACT["Phi-3.5 Actor<br/>4-bit + LoRA"]
  ACT -->|action JSON| ENV["IncidentCommanderEnv"]
  ENV --> R{"Reward shaping"}
  R -->|action targets a service mentioned in recent slack| RP["+0.10<br/>useful_log_query"]
  R -->|action targets red-herring service from chatter| RH["−0.15<br/>red_herring_penalty"]
  R -->|action ignores buried clue and acts blind| BA["−0.10<br/>blind_action_penalty"]
  RP --> ADV["GAE advantage<br/>+ PPO update"]
  RH --> ADV
  BA --> ADV
  ADV --> ACT
  classDef good fill:#0e2a1c,stroke:#3ec78a,color:#cdeed8
  classDef bad fill:#2a0e0e,stroke:#ff7a7a,color:#ffd6d6
  class RP good
  class RH,BA bad

Trusted signal

DBA mentioning Postgres CPU on a replica → agent calls platform.get_logs(payments_db) next → the action lands on a service the chatter pointed to → reward shaper applies useful_log_query (+0.10).

Red herring

Frontend dev mentions a cart-UI hotfix → agent calls platform.rollback_deployment(frontend) blindly → wrong service, no fix, red_herring_penalty (−0.15) + the blast radius keeps growing.

Why it generalises

166 of 381 scenarios include a configurable Slack stream. Templates parameterise the affected service, so the policy can't memorise — it has to learn the meta-pattern: cross-reference chatter with telemetry before acting.

How it learned

PPO with LoRA, sharded across 3 free T4s.

Each Kaggle shard owns 127 of 381 tasks. The collector round-robins through the shard's task list across 60 PPO updates × 3 rollouts × 12 steps each — every task is visited at least once. The three shards' adapters are merged offline into a single LoRA. Below is every metric for every update of every shard.

Per-update training charts (rendered interactively on the page): mean reward per update across the 3 shards · PPO loss (policy + value) · KL divergence vs. the reference policy · wall-clock seconds per update.

Actor

Phi-3.5-mini-instruct loaded in 4-bit NF4 with LoRA adapters on q/k/v/o + gate/up/down projections (fused-qkv detection adapts target_modules per architecture).
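
What "fused-qkv detection" amounts to, roughly: inspect the module names on the loaded base model and pick target_modules accordingly. The qkv_proj / gate_up_proj list matches the adapter_config.json in the repo index below; the helper itself is illustrative.

# illustrative sketch of architecture-aware target_modules selection
def pick_lora_targets(model):
    names = {name.split(".")[-1] for name, _ in model.named_modules()}
    if "qkv_proj" in names:
        # fused attention projections (Phi-3.5 style)
        return ["qkv_proj", "o_proj", "gate_up_proj", "down_proj"]
    # unfused q/k/v projections (Llama-style architectures)
    return ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]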

Critic

DeepSeek-R1-0528-Qwen3-8B in 4-bit, prompt-only critic that scores each (obs, action) on a 0–10 rubric (cached). Requires transformers ≥ 4.51 for Qwen3.
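
A minimal sketch of a prompt-only, cached 0–10 critic. Here generate_fn stands in for whatever wraps the DeepSeek-R1 model, and the prompt wording and cache shape are invented for illustration.

# illustrative sketch of a cached rubric critic
import hashlib, re

_critic_cache = {}

def critic_value(obs_text, action_json, generate_fn):
    key = hashlib.sha256(f"{obs_text}|{action_json}".encode()).hexdigest()
    if key in _critic_cache:               # identical (obs, action) pairs are scored once
        return _critic_cache[key]
    prompt = ("You are an SRE judge. Rate the agent's action from 0 to 10.\n\n"
              f"Observation:\n{obs_text}\n\nAction:\n{action_json}\n\nScore:")
    match = re.search(r"\d+(\.\d+)?", generate_fn(prompt))
    score = float(match.group()) if match else 5.0
    _critic_cache[key] = min(max(score, 0.0), 10.0) / 10.0   # normalise to [0, 1]
    return _critic_cache[key]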

PPO hyper-parameters
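
The hyper-parameter table renders interactively on the page; for reference, these are the values quoted elsewhere in this write-up (the dict is just a consolidation, and the key names are chosen for readability).

# PPO hyper-parameters as quoted in the sections below
ppo_config = dict(
    gamma=0.95, gae_lambda=0.92,                    # GAE
    clip_eps=0.2, kl_coef=0.02, entropy_coef=0.01,  # clipped surrogate + KL + entropy
    ppo_epochs=2, minibatch_size=4,                 # per update
    updates_per_shard=60, rollouts_per_update=3, max_steps_per_rollout=12,
    lora_r=16, lora_alpha=32, lora_dropout=0.0,     # adapter_config.json
)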

Before the 381-task curriculum — Rounds 1 + 2

Deep regime: 11 hand-curated archetypes, three full PPO runs, every metric audited.

The shallow regime above (381 procedural scenarios, three Kaggle shards) is what every other section of this page is about. But it sits on top of two earlier rounds that ran deeper on a smaller, hand-curated task set — and whose training logs are the cleanest evidence of learning anywhere in this submission. Round 1 used SB3 PPO with a 128×128 MLP on the 3 hardest archetypes. Round 2 swapped the policy for a small-LLM actor (Ollama Qwen2.5:0.5b) and tried three different critics back-to-back on all 11 archetypes — a heuristic baseline (v2), an Ollama+Bedrock judge (v3), and a Groq Llama-3.1-8B-instant critic (v4). The point of this section is to show the per-task numbers that justify the design of the shallow regime.

1 · The 11 archetypes — and why these specifically

The archetypes were picked to span the full SRE failure taxonomy — chaos injections you should delete rather than route around, OOM cascades that look like the wrong service, silent data corruption with no visible error, network/DNS/TLS partitions, hot-reload race conditions, secret rotation regressions, container image pull failures, namespace quota starvation, and liveness probe drift. Each task ships a red-herring set in its JSON file — services that look broken but aren't — so the agent can't pattern-match its way to the right answer. Difficulty mix is deliberate: 3 easy / 4 medium / 4 hard, which is exactly the curriculum-controller's `intermediate` tier in /dashboard.

ID | Difficulty | Target | Title | Root cause | Correct fix | Why this archetype
task1 | easy | 0.80 | Redis Connection Pool Exhaustion | Chaos-Mesh latency injection saturates inventory→Redis pool | delete_chaos_experiment | Trains "the right fix is to remove the chaos, not patch around it"
task2 | medium | 0.45 | Cascading Failure via Payments OOM | OOM in payments-api spreads to inventory via Kafka lag | rollback_deployment payments-api | Symptoms surface in inventory; root cause is upstream
task3 | hard | 0.20 | Silent Decimal Corruption | Bad deploy truncates NUMERIC(12,4)→NUMERIC(12,2); Postgres VACUUM is a red herring | rollback_deployment payments-api + audit follow-up | No surface error — only a postmortem identifies it
task4 | easy | 0.80 | Kafka Broker Network Partition | Chaos-Mesh partitions broker; consumer lag >5 000 | delete_chaos_experiment | Two services impacted (order-worker + notification) — wrong target trap
task5 | medium | 0.45 | DNS Resolution Failure | DNS chaos → NXDOMAIN; "connection refused" is downstream noise | delete_chaos_experiment | Trains agent to ignore secondary-symptom services
task6 | hard | 0.20 | TLS Certificate Expiry Cascade | Expired mTLS cert breaks payments→postgres; ECONNRESET fans out | apply_config_patch payments-api (cert rotate) | Time-correlated alerts that span four services
task7 | hard | 0.20 | ConfigMap Hot-Reload Race | 2 of 4 inventory pods load stale config; Redis/GC alerts are red herrings | restart_pods inventory-service | Pod-level partial failure, easy to misdiagnose as state-store
task8 | medium | 0.45 | JWT Secret Rotation Cascade | Auth secret rotation regression invalidates all sessions | rollback_deployment auth-service | Identity-plane outage that looks like a frontend bug
task9 | easy | 0.80 | Invalid Image Tag Deploy | checkout-frontend deployed with bad tag → ImagePullBackOff | rollback_deployment checkout-frontend | Hint surface (ECR / ImagePullSecret) — tests if agent reads pod events
task10 | medium | 0.45 | Namespace ResourceQuota Starvation | payments-worker blocked by namespace quota | apply_config_patch payments-worker | Symptom is "pods Pending" — root cause is platform policy
task11 | hard | 0.20 | Liveness Probe Path Regression | Bad probe path on inventory-service → flapping pods | rollback_deployment inventory-service | Looks like a memory leak; actually a deploy regression

Source of truth: rl-agent/environment/env.py :: TASK_FILE_MAP, CORRECT_SERVICES, TASK_AWS_HINTS (lines 51–122). Full per-archetype JSON in rl-agent/scenarios/*.json. Documented end-to-end in docs/ENV_DEEP.md.

2 · Round 1 — SB3 PPO + MLP, the floor

Goal: prove the env is solvable, the reward shaper terminates, and the postmortem grader runs end-to-end. 4 parallel envs, 200 000 timesteps, 75 minutes on CPU. Evaluated every 40 000 steps across all 3 archetypes — the per-checkpoint numbers below come straight from rl-agent/checkpoints/training_metrics.json.

Timestep | task1 reward | task2 reward | task3 reward | overall reward | success rate
40 000 | 1.05 | 1.05 | 1.00 | 1.033 | 100%
80 000 | 1.05 | 1.05 | 1.05 | 1.05 | 100%
120 000 | 1.05 | 1.05 | 1.05 | 1.05 | 100%
160 000 | 1.05 | 1.05 | 1.05 | 1.05 | 100%
200 000 | 1.05 | 1.05 | 1.05 | 1.05 | 100%

Convergence by 80k steps: task3 jumps 1.00 → 1.05 between checkpoints 40k and 80k, then the policy plateaus. The action distribution at evaluation is uniform across 3 tasks (6× query_logs → submit_postmortem, 100% of episodes) — i.e. the MLP memorised a working policy on the easy rubric. That's what the floor looks like and exactly why we then escalated to the harder Round 2 + Round 3 rubrics.

3 · Round 2 — same env, three critics back-to-back

Round 2 keeps the env identical but swaps the actor for an Ollama Qwen2.5:0.5b LLM that emits structured action JSON, and runs the same 11-task curriculum under three different critics so the deltas are clean. Mean reward goes up monotonically across critic swaps: 1.17 (heuristic) → 1.32 (Ollama+Bedrock) → 1.78 (Groq Llama-3.1-8B-instant). Every column in the table below is read from the respective rl-agent/checkpoints/ppo-v{2,3,4}-*/summary.json.

Run | Critic | Episodes | Mean reward | Mean grade | Mitigation rate | Root-cause rate | Top per-task
v2 | heuristic baseline | 99 | 1.17 | 0.74 | 100% | 36% | task1 1.60 · task4 1.60
v3 | Ollama + Bedrock judge | 36 | 1.32 | 0.63 | 69% | 36% | task10 1.72 · task9 1.71
v4 | Groq Llama-3.1-8B-instant | 36 | 1.78 | 0.57 | 44% | 36% | task9 2.41 · task1 2.39 · task10 2.29

Mitigation rate drops from v2→v4 because under the heuristic critic the actor takes the safe-but-blunt action every time (rollback_deployment). Under the Groq critic, the same LLM actor is willing to not mitigate when investigation is incomplete — and the reward shaper rewards that nuance. That's why mean reward goes up while mitigation rate goes down.

4 · Per-task improvement — the v2 → v4 critic swap

The clearest learning signal in the deep regime is the per-archetype delta from v2 (heuristic) to v4 (Groq critic). All 11 archetypes improve; the median improvement is +49%. Each row is the mean reward over 9 (v2) or 3 (v3/v4) episodes per task — pulled directly from summary.json::per_task in the respective checkpoint folder.

Archetype | v2 reward | v3 reward | v4 reward | Δ v2→v4 | Verdict
task1 · Redis exhaustion | 1.598 | 1.248 | 2.387 | +49% | LLM actor finds chaos-fix faster
task2 · payments OOM cascade | 0.938 | 1.261 | 1.564 | +67% | biggest gain — critic rewards investigation
task3 · silent decimal corruption | 1.020 | 1.229 | 1.456 | +43% | postmortem-quality bonus dominates
task4 · Kafka partition | 1.598 | 1.569 | 2.158 | +35% | multi-service red-herring resolved
task5 · DNS chaos | 0.998 | 1.214 | 1.364 | +37% | ignores secondary-symptom services
task6 · TLS expiry | 0.998 | 1.299 | 1.573 | +58% | 4-service correlation handled
task7 · ConfigMap race | 0.998 | 1.056 | 1.271 | +27% | partial-pod-failure recognised
task8 · JWT rotation | 0.998 | 1.139 | 1.481 | +48% | identity-plane root cause found
task9 · ImagePullBackOff | 1.398 | 1.707 | 2.415 | +73% | biggest absolute reward in the run
task10 · ResourceQuota | 1.398 | 1.719 | 2.293 | +64% | platform-policy reasoning gain
task11 · Liveness probe | 0.898 | 1.177 | 1.558 | +74% | memory-leak red-herring rejected

Every archetype improves. The four largest gains (task9 +73%, task11 +74%, task2 +67%, task10 +64%) are the ones with the most subtle root causes — exactly where a smarter critic should help, and exactly where it does. This is the bridge that earned us the budget to scale to 381 tasks.

5 · Optimisation-side proof — policy loss collapse + entropy compression

Just like the shallow regime, the deep regime's strongest evidence of learning is on the optimisation side. All three Round-2 runs show the same signature: monotonic policy-loss decay paired with entropy compression (i.e. the policy is converging on a coherent strategy, not flailing).

v2 · heuristic baseline (33 PPO updates)

  • Policy loss 1.20 → 0.083 · −93%
  • Entropy 2.0 → 0.39 · −81%
  • Mean reward 1.17 · 100% mitigation
  • Logs: checkpoints/ppo-v2-heuristic/metrics.jsonl

v3 + v4 · LLM actor (12 PPO updates)

  • Policy loss 1.10 → 0.50 · −55% (in just 12 updates)
  • Entropy 1.9 → 1.21 · −36% (more exploration retained)
  • Mean reward v4: 1.78 — clears v2's heuristic ceiling
  • Logs: checkpoints/ppo-v{3,4}-hybrid-*/metrics.jsonl

The 3-panel chart that visualises mean reward, policy loss, and entropy across all three v2/v3/v4 runs is assets/blog/legacy_deep_training.png. Generated with a 130-dpi dark-theme matplotlib pass over the JSONL logs above.

TL;DR — what the deep regime proves before the shallow regime even starts

Round 1 proves the env is solvable end-to-end: SB3 PPO + MLP converges in 80 000 steps, hits a 1.05 mean reward across all 3 evaluation tasks at every later checkpoint, with 100% success rate. Round 2 proves the reward shaper actually drives policy improvement when the actor changes: across the v2 → v4 critic swap, mean reward climbs 1.17 → 1.78 (+52%), every single one of the 11 archetypes improves (median +49%, max +74%), policy loss collapses by 55–93%, and the Groq-critic v4 run breaks the heuristic ceiling on the four hardest archetypes (task9, task1, task10, task11). Those numbers are the reason we then scaled to 381 procedural scenarios on free Kaggle T4s — because the rubric was already validated.

Did it actually learn?

Yes — and you have to look at the right metric.

Mean reward on a 381-task curriculum with red-herring penalties looks negative because every task is graded against an aggressive rubric (−0.15 for chasing red herrings, −0.10 for blind action, −0.20 for the wrong fix). The honest evidence of learning lives in three places: the policy's convergence statistics, a head-to-head against a heuristic baseline, and the per-category improvement breakdown where the novelty scenarios specifically got better.

1 · Baseline vs. PPO Kaggle — same environment, two difficulty regimes

Legacy baseline · stable-baselines3 PPO

  • Algorithm: SB3 PPO + 128×128 MLP policy
  • Tasks: 3 hand-crafted (Redis pool, payments OOM, decimal corruption)
  • Steps: 200,000 timesteps · 75 min CPU
  • Mean reward: +1.05 · Success rate: 100%
  • Action distribution: query_logs → submit_postmortem (memorised)

Source: rl-agent/checkpoints/evaluation_report.json · 90 episodes across 3 tasks. The baseline trivially solves the easy version — that's the floor.

PPO Kaggle · LLM agent on the hard problem

  • Algorithm: PPO + GAE-λ on Phi-3.5-mini LoRA, DeepSeek-R1 critic
  • Tasks: 381 procedural scenarios with saboteur, Slack noise, K8s adversary, runbook traps
  • Steps: 60 PPO updates × 3 shards × 3 rollouts × 12 steps = 6,480 transitions
  • Best update reward: −0.315 · identical across all 3 shards, indicating a stable policy ceiling on the harder rubric
  • Action distribution: diverse — read_slack, describe_topology, get_logs, vacuum_freeze_db, …

Source: rl-agent/showcase_data.json · all three Kaggle shards. The same env that the legacy MLP solves in 75 min is now graded against red-herring penalties, phase-aware rewards, and adversarial chatter.

2 · The policy genuinely converged — KL and loss don't lie

Aggregate reward is noisy under red-herring penalties, but the optimisation-side metrics are not. Across all three shards the KL divergence to the reference policy and the PPO loss both decay by more than 50%, which is exactly the signature of a policy that has stabilised on a coherent strategy.

Shard | KL · first 5 | KL · last 5 | Δ KL | Loss · first 5 | Loss · last 5 | Δ Loss | Best reward
kaggle-1 | 1.44 | 0.65 | −55% | 0.31 | 0.13 | −58% | −0.315 @ u7
kaggle-2 | 2.57 | 0.87 | −66% | 1.52 | 0.78 | −49% | −0.315 @ u6
kaggle-3 | 1.23 | 0.60 | −51% | 0.31 | 0.14 | −54% | −0.315 @ u49

That all three shards converge on the exact same peak reward of −0.315 is a strong signal that the LLM-on-LoRA actor has found a consistent best-effort policy on the hard rubric. Plain memorisation would produce three different ceilings.

3 · Per-category improvement — the novelty scenarios are the ones that improved

For each scenario, we record reward on the agent's first visit and on its last visit, then average per category. The categories that only exist in our environment — Slack Red Herring, Runbook Trap, Cascading Failure, Trolley Problem — are the ones with the largest positive deltas. That's the most direct evidence that the added training signal is doing real work.

Category | Tasks | First visit | Last visit | Δ reward | Verdict
Slack Red Herring | 1 | −6.18 | −5.13 | +1.05 | novelty win
Runbook Trap | 1 | −7.83 | −6.93 | +0.90 | novelty win
Cascading Failure | 1 | −7.38 | −7.08 | +0.30 | novelty win
Trolley Problem | 1 | −6.33 | −6.03 | +0.30 | novelty win
DynamoDB Throttling | 20 | −4.49 | −4.49 | ±0.00 | already at ceiling
Generated · App Memory Leak | 34 | −5.64 | −6.61 | −0.97 | exploration overshoot
Lambda Throttling | 20 | −3.86 | −5.00 | −1.14 | exploration overshoot

Why the novelty categories win

The hardest scenarios have the most reward signal to extract — every Slack message that's a clue, every runbook line that's a trap, every cascade hop that needs describe_topology first. PPO finds those gradients faster than on already-saturated easy tasks.

Why DDB stays flat

DynamoDB throttling has one obvious fix (scale_write_capacity). The agent solves it on visit 1 and on visit 60 — there's nothing left to learn. Flat ≠ broken.

Honest about regressions

Lambda Throttling and "Other Easy" categories show exploration overshoot — late-stage entropy nudged the policy off a memorised solution. Visible in the per-update reward dip after u30 in the chart above. A 4th shard would smooth this out; we ran out of free Kaggle GPU minutes.

TL;DR for the judges

The legacy baseline proves the env is solvable. The PPO Kaggle run proves the agent learns on the hard version of the same env: KL and loss both decay 50–66% across all three shards, all three shards converge on the same −0.315 peak reward, and the four hardest novelty categories each show a positive Δ reward between first and last visit (+0.30 to +1.05). Aggregate reward staying negative is a property of the rubric, not a property of the policy.

All 381 tasks

Every scenario. Every action. Every reward.

Filter by category, difficulty, or shard. Click any card for the full ground-truth action chain, the design intention, and the reward trajectory across each shard.


Reward distribution across categories

Aggregated mean reward per scenario category, averaged over all 3 shards.

Why are most rewards negative?

The reward signal lives in [-2.0, +1.0] by design — most components are penalties: step_cost (-0.01 / step), acting_blind (-0.20), red_herring (-0.15), repeat_command (-0.15 / repeat, cap -0.45), wrong_service (-0.15), time_penalty (-0.05 / step beyond step 5), phase_regression (-0.10), blast_radius_increase (-0.10). Positive components (root_cause_correct +0.30, correct_mitigation +0.20, postmortem_quality up to +0.20, phase_order +0.10) only fire late in well-behaved episodes. Early in PPO, the actor explores at random and trips every penalty — the absolute mean starts deeply negative. Improvement is measured by the trend (last update vs. first update, or last episode vs. first episode for a single task) — not by hitting positive numbers in 60 updates on a frozen 4-bit base. The merged adapter is the seed for a longer run, not the finished product.
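
Spelled out as arithmetic (a consolidation of the numbers in the paragraph above, not the shaper's actual code): an episode that runs the full 12 steps, acts blind once, and chases one red herring already sits near −0.8 before any positive component can fire.

# back-of-the-envelope floor for one bad 12-step episode
def rough_floor(steps=12):
    r = -0.01 * steps                 # step_cost
    r += -0.20                        # acting_blind
    r += -0.15                        # red_herring
    r += -0.05 * max(steps - 5, 0)    # time_penalty beyond step 5
    return max(r, -2.0)               # reward is clipped to [-2.0, +1.0]

print(rough_floor())                  # ≈ -0.82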

How training works

From a scenario JSON to a LoRA delta.

%%{init: {'theme':'dark','themeVariables':{'primaryColor':'#11162b','primaryTextColor':'#e9eeff','primaryBorderColor':'#7aa7ff','lineColor':'#7aa7ff','tertiaryColor':'#0a0d1c'}}}%%
flowchart LR
  A["Scenario JSON<br/>(381 files)"] --> B["IncidentCommanderEnv<br/>(reset + step)"]
  B -->|observation| C["Phi-3.5-mini Actor<br/>4-bit + LoRA"]
  C -->|action JSON| B
  B -->|reward| D["DeepSeek-R1 Critic<br/>0–10 rubric"]
  D -->|value| E["GAE advantages"]
  C -->|log-prob| E
  E --> F["PPO update<br/>clip + KL + entropy"]
  F -->|grad| C
  F --> G["training_kaggle*.json<br/>per-update metrics"]
  F --> H["adapter_kaggle*<br/>LoRA delta"]

1 · Rollout collection

For each PPO update, the collector picks rollouts_per_update=3 tasks via a persistent cursor (round-robins through the entire shard list — never resets). Each rollout runs up to 12 steps; the env terminates early if the agent submits a postmortem.

# colab/train_lib.py — IncidentRolloutCollector.collect
for ep in range(n_episodes):
    tid = self.tasks[self._cursor % len(self.tasks)]
    self._cursor += 1
    obs = env.reset(tid)
    for t in range(max_steps):
        action, meta = actor.act(obs)
        value         = critic.value(obs, action)
        step          = env.step(action)
        transitions.append(...)

2 · Advantage estimation

Generalised Advantage Estimation with γ=0.95, λ=0.92. Critic provides a value-baseline per (state, action) so the policy gradient is variance-reduced.

def compute_gae(trans, gamma, lam):
    # walk the trajectory backwards, accumulating the GAE(γ, λ) advantage
    advs, adv = [], 0.0
    for t in reversed(trans):
        # TD residual: r_t + γ·V(s_{t+1}) - V(s_t)
        delta = t.reward + gamma * t.next_v - t.value
        # A_t = δ_t + γλ·A_{t+1}
        adv = delta + gamma * lam * adv
        # store (advantage, return); the return is advantage + value baseline
        advs.append((adv, adv + t.value))
    return advs[::-1]

3 · PPO update

Clipped surrogate with KL penalty + entropy bonus. clip_eps=0.2, kl_coef=0.02, entropy_coef=0.01. 2 PPO epochs per update with mini-batch 4. Gradient checkpointing OFF on the 3.8 B actor — fits in 16 GB VRAM and roughly halves backward time.

ratio  = exp(logp - old_logp)
surr1  = ratio * adv
surr2  = clamp(ratio, 1-ε, 1+ε) * adv
loss   = -min(surr1, surr2).mean()
       + kl_coef * (old_logp - logp).mean()

4 · Sharded coverage

Modulo-3 split: shard k takes tasks i where i ≡ k (mod 3). The 381 sorted task ids divide cleanly into 127 + 127 + 127 — disjoint and exhaustive. After all 3 finish, scripts/merge_lora_adapters.py takes a weighted mean of the three adapters into one final delta.

tasks = [t for i, t in enumerate(sorted_ids)
                if i % n_shards == shard]
# Shard 0 → 127 tasks, shard 1 → 127, shard 2 → 127
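
The offline merge then boils down to a weighted average over the three LoRA deltas, key by key. A sketch of the idea (the real script is scripts/merge_lora_adapters.py; the paths and weights below are placeholders):

# illustrative sketch of the weighted LoRA merge
import torch
from safetensors.torch import load_file, save_file

def merge_adapters(paths, weights, out_path):
    total = sum(weights)
    merged = {}
    for path, w in zip(paths, weights):
        shard = load_file(path)                     # one adapter_model.safetensors
        for key, tensor in shard.items():
            merged[key] = merged.get(key, torch.zeros_like(tensor)) + (w / total) * tensor
    save_file(merged, out_path)

# merge_adapters(["adapter_kaggle1/adapter_model.safetensors",
#                 "adapter_kaggle2/adapter_model.safetensors",
#                 "adapter_kaggle3/adapter_model.safetensors"],
#                [1, 1, 1], "merged_adapter/adapter_model.safetensors")
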
End-to-end

From git push to trained adapter.

Code lives on GitHub. Each Kaggle notebook clones the repo at run-time (git clone --depth 1), so any commit is picked up automatically without re-uploading the notebook. Models are attached as Kaggle Models so they live in the read-only mount and don't eat the 20 GB working quota.

1

GitHub

r1cksync/meta-rl-hack — single source of truth.

2

Kaggle clone

Notebook does git clone --depth 1 on every run.

3

Models attached

Phi-3.5-mini + DeepSeek-R1 from Kaggle Models read-only mount.

4

PPO loop

60 updates × 3 rollouts × 12 steps; live ETA every 5 s.

5

Adapter zip

adapter_kaggle{N}.zip downloaded; merged offline.

The 3 Kaggle notebooks

Identical 9-cell structure. Only differences: IC_TASK_SHARD ∈ {0,1,2} and IC_RUN_NAME ∈ {kaggle1,kaggle2,kaggle3}.

Shard 1

kaggle_train_shard1.ipynb

Tasks at sorted indices 0,3,6,…,378 — 127 unique tasks. Output: adapter_kaggle1.zip.

Shard 2

kaggle_train_shard2.ipynb

Tasks at sorted indices 1,4,7,…,379. Output: adapter_kaggle2.zip.

Shard 3

kaggle_train_shard3.ipynb

Tasks at sorted indices 2,5,8,…,380. Output: adapter_kaggle3.zip.

9-cell notebook anatomy

Cell | Purpose | What it does
1 | Title (markdown) | Attach instructions for the 2 Kaggle Models + GPU/Internet/Persistence settings.
2 | Install | Best-effort pip install unsloth, then pin transformers ≥ 4.51, peft, accelerate, bitsandbytes. Prints the unsloth return-code so failures are visible but non-fatal.
3 | GPU sanity | nvidia-smi -L + torch.cuda.is_available().
4 | Verify mounts + suppress warnings | Asserts the 2 Kaggle Models are attached, redirects HF cache to /tmp/hf-cache, wipes any cached custom modeling code, optionally pulls HF_TOKEN from Kaggle Secrets, installs warning filters.
5 | Clone repo | git clone --depth 1 https://github.com/r1cksync/meta-rl-hack.git — fresh on every run, prints the commit hash.
6 | Configure run | Sets all IC_* environment variables: shard index, run name, total updates, rollouts, max steps, checkpoint cadence, model paths.
7 | Train | subprocess.run(['python','scripts/run_training.py']) — produces colab/logs/training_kaggle{N}.json + adapter checkpoints every 15 updates.
8 | Package | Zips the final adapter to /kaggle/working/adapter_kaggle{N}.zip, copies the JSON log to /kaggle/working/.
9 | Done (markdown) | Merge instructions for scripts/merge_lora_adapters.py.
Production wiring

Terraform → Hetzner → k3s → live agent.

The agent's write actions normally land in a mock cluster, but the same code path can target a real Kubernetes cluster. The Terraform module provisions a 3-node k3s cluster on Hetzner Cloud (€20/month), wires a load balancer, and installs the AcmeCorp microservices (frontend, payments-api, inventory-service, notification-service, order-worker) via Helm.

%%{init:{'theme':'dark','themeVariables':{'primaryColor':'#11162b','primaryTextColor':'#e9eeff','primaryBorderColor':'#7aa7ff','lineColor':'#a78bfa','clusterBkg':'#0a0d1c','clusterBorder':'#7aa7ff'}}}%%
flowchart TB
  subgraph TF["infra/terraform/main.tf"]
    direction LR
    net[hcloud_network<br/>10.0.0.0/16]
    subnet[hcloud_network_subnet<br/>10.0.1.0/24]
    ssh[hcloud_ssh_key]
    lb[hcloud_load_balancer<br/>lb11 · port 80/443]
    n0[hcloud_server · master<br/>cx21 · ubuntu-22.04]
    n1[hcloud_server · worker]
    n2[hcloud_server · worker]
  end
  subgraph K3S["k3s cluster"]
    direction LR
    ing[ingress-nginx]
    svc1[frontend]
    svc2[payments-api]
    svc3[inventory-service]
    svc4[notification-service]
    svc5[order-worker]
  end
  subgraph AGENT["IncidentCommander agent"]
    direction LR
    env["env.step(action)"]
    adapter["LoRA adapter<br/>(merged)"]
    env --> adapter
    adapter --> env
  end
  TF -->|provision| K3S
  AGENT -->|REAL_K8S=true| ing
  ing --> svc1 & svc2 & svc3 & svc4 & svc5

Terraform resources

Defined in infra/terraform/main.tf:

  • hcloud_network — private 10.0.0.0/16 VPC
  • hcloud_network_subnet — 10.0.1.0/24 in eu-central
  • hcloud_ssh_key — ingest ~/.ssh/id_rsa.pub
  • hcloud_server[count=3] — cx21 nodes (1 master, 2 worker)
  • hcloud_load_balancer — lb11 in front of the cluster
  • hcloud_load_balancer_target[count=3] — health-checked targets
  • hcloud_load_balancer_service · http/https — public ingress

Bring-up sequence

# 1. Provision cluster (€20/mo, ~5 min)
cd infra/terraform
terraform init
terraform apply -var="hcloud_token=$HCLOUD_TOKEN"

# 2. Install k3s on master (master IP from tf output)
ssh root@$MASTER 'curl -sfL https://get.k3s.io | sh -'

# 3. Helm-install AcmeCorp microservices
helm install acmecorp infra/helm/acmecorp \
  --set image.tag=$GIT_SHA

# 4. Point the agent at the cluster
export REAL_K8S=true
export KUBECONFIG=~/.kube/config
python -m rl_agent.server
What's in the repo

JSON file index — what each file is for.

Path | Type | Purpose
rl-agent/scenarios/{easy,medium,hard}/*.json | scenario | 23 hand-curated incident archetypes. Each has id, difficulty, title, description, preconditions, correct_action_chain, target_score, max_steps.
rl-agent/scenarios/sim/{easy,medium,hard}/*.json | scenario | 381 simulator-grade scenarios (156 easy + 128 medium + 97 hard) used for RL training. Adds topology_overrides, saboteur, slack, traffic_profile, k8s_controller, seed.
colab/logs/training_kaggle{1,2,3}.json | training log | Per-update metrics for one shard: update, elapsed_s, wall_s, mean_reward, mean_value, ppo{loss, kl, policy_loss, value_err}, rewards_by_task. Union of rewards_by_task keys = full coverage proof.
kaggle ran notebooks/shard {1,2,3}/adapter_kaggle{N}/adapter_config.json | peft | LoRA configuration: r=16, alpha=32, dropout=0, target_modules=[qkv_proj, o_proj, gate_up_proj, down_proj].
kaggle ran notebooks/shard {1,2,3}/adapter_kaggle{N}/adapter_model.safetensors | weights | The actual LoRA delta — ~50 MB per shard. Loadable with PeftModel.from_pretrained(base, path).
rl-agent/showcase_data.json | derived | Bundle consumed by this page. Built by scripts/build_showcase_data.py from the 3 training logs + 381 scenarios.
openenv.yaml | OpenEnv manifest | Declares HTTP API for OpenEnv compliance: /reset, /step, /state, /health.
frontend/package.json | npm | Next.js 14 AcmeCorp e-commerce surface used as both the chaos-target topology and live UI.
frontend/tsconfig.json | typescript | Strict TS configuration for the live app.
frontend/tailwind.config.js / postcss.config.js | tailwind | Styling stack for the AcmeCorp app.
backend/{payments-api,inventory-service,notification-service,order-worker}/package.json | npm | Per-service Node.js apps that get rolled out, restarted, scaled, and patched by agent actions.
infra/terraform/main.tf | terraform | Hetzner Cloud cluster provisioning (network, subnet, ssh key, 3 servers, load balancer, listener services).
infra/k8s/*.yaml | kubernetes | Deployments, Services, ConfigMaps, ChaosMesh experiments for the live cluster.