IncidentCommander
RL · OpenEnv · Hackathon 2026

Teaching an AI
to be on-call.

381 production-grade incident scenarios. A 4-bit Phi-3.5-mini actor, an 8-bit DeepSeek-R1 critic, PPO with phase-aware rewards, sharded across three free Kaggle T4 GPUs. A complete pipeline — from procedurally-generated chaos to a live Hetzner Kubernetes cluster — for an autonomous SRE that learns to triage, investigate, fix, and verify.

Live HF Space · sagnik-mukherjee/incodent-commander
381 scenarios · 3 shards · 180 PPO updates total
  • Scenarios: 381 · curated + procedurally generated
  • Tasks trained: 100% coverage across 3 shards
  • PPO updates: 60 per shard · 60 × 3 = 180 total
  • Wall time: aggregate across all shards
What it is

An RL environment that puts the agent on the 3 AM PagerDuty rotation.

Most RL benchmarks for LLMs use synthetic puzzles. We don't — every scenario simulates a real incident pattern (DynamoDB throttling, IAM permission chains, memory leaks hidden behind cascading 504s, runbook traps, adversarial saboteurs). The agent must investigate before acting, follow the correct triage → investigate → fix → verify phase order, and survive multi-fault scenarios where one fix is not enough.

Self-improving curriculum

A controller tracks per-task mastery and escalates difficulty tiers automatically. Warmup → beginner → intermediate → advanced → expert.
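
A minimal sketch of such a controller, assuming a per-task exponential moving average of normalised episode reward as the mastery signal and a promotion threshold of 0.8. The class name, attribute names, and thresholds are illustrative, not the repo's actual controller.

# illustrative sketch (not the repo's curriculum controller)
TIERS = ["warmup", "beginner", "intermediate", "advanced", "expert"]

class CurriculumController:
    def __init__(self, promote_at=0.8, ema_alpha=0.2):
        self.mastery = {}            # task_id -> EMA of normalised episode reward
        self.tier_idx = 0            # start at "warmup"
        self.promote_at = promote_at
        self.ema_alpha = ema_alpha

    def record(self, task_id, reward_norm):
        prev = self.mastery.get(task_id, 0.0)
        self.mastery[task_id] = (1 - self.ema_alpha) * prev + self.ema_alpha * reward_norm

    def maybe_escalate(self, tasks_in_tier):
        # promote only when every task in the current tier clears the mastery bar
        if all(self.mastery.get(t, 0.0) >= self.promote_at for t in tasks_in_tier):
            self.tier_idx = min(self.tier_idx + 1, len(TIERS) - 1)
        return TIERS[self.tier_idx]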

Adversarial scenario designer

At expert tier, an LLM composes novel incidents that target the agent's tracked weaknesses — infinite, non-repeating scenarios.

3-persona LLM judge

Every action is critiqued by a Junior / Senior / Principal SRE persona with progressively stricter evaluation. Snorkel-style experts-in-the-loop.

Phase-aware rewards

Actions classified as triage → investigate → fix → verify; agent earns bonuses for correct workflow order and loses reward for regressing phases.
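
A sketch of how the phase-order bonus and regression penalty can be applied per action, using the +0.10 / −0.10 values quoted in the reward breakdown lower on this page. The phase classifier and the function are placeholders, not the environment's code.

# illustrative sketch of phase-order shaping
PHASES = ["triage", "investigate", "fix", "verify"]

def phase_shaping(prev_phase_idx, action_phase):
    """Return (shaping reward, updated phase index) for one classified action."""
    idx = PHASES.index(action_phase)
    if idx > prev_phase_idx:
        # advancing in the triage -> investigate -> fix -> verify order earns the bonus
        return 0.10, idx
    if idx < prev_phase_idx:
        # regressing phases (e.g. triaging again after a fix) is penalised
        return -0.10, prev_phase_idx
    return 0.0, prev_phase_idx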

Context-gated penalties

Acting blind, repeating commands, targeting red-herring services — all penalised. Prevents reward-hacking through action spam.

Live infrastructure option

Write actions normally hit a mock cluster but can be routed to a live Kubernetes cluster (REAL_K8S=true) — same code path.

A novel observation channel

The agent reads coworker Slack chatter — not just metrics.

Real on-call engineers don't just stare at dashboards. They scroll Slack — where the CEO is shouting, a frontend dev is asking if their hotfix broke prod, the intern wants to "just restart the cluster", and buried in the noise a DBA quietly mentions Postgres CPU is climbing on a replica. We built that channel into the environment. Every scenario emits a deterministic stream of human messages — some are real clues, most are red herrings — and the agent has to learn which voices to trust.

What's in the stream

Each tick the simulator samples from a templated pool seeded by the scenario (a sketch of the per-tick sampling follows this list):

  • Saboteur-coupled lines — DBAs / Platform / Customer Support corroborate the actual fault as it escalates.
  • Generic chatter — VP Eng asking ETAs, Comms drafting status pages, Finance reminding everyone every minute is $8k.
  • Red herrings — a frontend dev's hotfix, a Security CloudTrail ping, an Athena pipeline failure. None of them caused the incident; acting on them costs reward.
  • Severity escalation — the longer the incident drags, the more high-severity messages appear, mirroring real pressure.
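
A minimal sketch of what that deterministic, escalating sampling can look like. The emit_for_tick name mirrors the flow diagram further down the page; the rate splitting, pool mixing, and escalation odds are illustrative guesses, not the simulator's actual logic.

# illustrative sketch (not rl-agent/simulator/slack.py itself)
import random

def emit_for_tick(seed, tick, rate, generic_pool, phase_pool, escalated):
    rng = random.Random(f"{seed}:{tick}")          # same seed + tick -> same chatter
    # fractional msgs_per_tick, e.g. 1.4 -> one message plus a 40% chance of a second
    n = int(rate) + (1 if rng.random() < rate - int(rate) else 0)
    pool = list(generic_pool) + (list(phase_pool) if escalated else [])
    msgs = [rng.choice(pool) for _ in range(n)]
    # severity escalation: the longer the incident drags, the more "high" lines appear
    if tick > 5 and rng.random() < min(0.1 * (tick - 5), 0.9):
        msgs.append(("CEO", "high", "I'm getting calls. Tell me what's happening."))
    return msgs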

Why it matters

Most LLM-RL benchmarks give the agent perfectly structured JSON observations. Real incidents are buried in unstructured human text, mixed with noise, mixed with politics. By adding a SlackStream alongside metrics and traces we force the policy to learn a real on-call skill: parse human chatter, weight it against telemetry, ignore the noise.

The action platform.read_slack is rewarded as useful information gathering when followed by an action that targets a service mentioned in a recent message — and penalised as a red herring when the agent acts on a misleading line (e.g. rolling back the wrong service because Sara from Frontend mentioned a hotfix).
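
In code, that gate can be as small as a set-membership check against the services mentioned in recent chatter; the constants come from the reward table lower on the page, and the message and record shapes here are placeholders.

# illustrative sketch of the slack clue / red-herring gate
def grade_against_slack(action_target, recent_msgs, red_herring_services):
    # services name-dropped in the last few Slack lines
    mentioned = {svc for msg in recent_msgs for svc in msg.get("services", [])}
    if action_target in red_herring_services and action_target in mentioned:
        return -0.15   # acted on a misleading line (red_herring_penalty)
    if action_target in mentioned:
        return +0.10   # chatter corroborated the target (useful_log_query)
    return 0.0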

Scenario JSON · Slack config

From scenarios/sim/hard/sim_gen_cascade_payments_db_005.json — a hard cascade where the truth is in payments_db but alerts fire on checkout. Slack chatter is dialled to 1.4 messages per tick.

// scenario JSON declares the Slack stream
{
  "id": "sim_gen_cascade_payments_db_005",
  "saboteur": {
    "primary_target": "payments",
    "aggressiveness": 0.5
  },
  "slack": { "msgs_per_tick": 1.4 },
  "correct_action_chain": [
    { "id": "platform.read_slack",
      "params": { "last_n": 5 } },
    { "id": "platform.describe_topology" },
    { "id": "platform.get_logs",
      "params": { "service": "payments_db" } },
    { "id": "platform.vacuum_freeze_db",
      "params": { "target": "payments_db" } }
  ]
}

Templates · what coworkers actually say

From rl-agent/simulator/slack.py — fully deterministic per seed. Mix of generic + phase-coupled lines.

# rl-agent/simulator/slack.py — _GENERIC pool (excerpt)
_GENERIC = [
  ("@channel", "info",
     "Customers reporting 500s on /checkout 🔥"),

  ("Sara (Frontend)", "info",    # red herring
     "Hey SRE, I just pushed a hotfix for the cart UI"
     " — could that be related?"),

  ("Intern (Jamal)", "info",
     "Should I just restart the cluster? 😅"),

  ("DBA (Yuki)", "info",             # buried clue
     "Postgres CPU is climbing on the replica btw."),

  ("Finance", "info",
     "Every minute of downtime is ~$8k for us."),

  ("CEO", "high",
     "I'm getting calls. Tell me what's happening."),
]

# Phase-coupled lines fire when saboteur escalates:
_PHASE_LINES = {
  "attack_failover": [
    ("DBA (Yuki)", "high",
       "{svc} spiked to 90% CPU."
       " Did someone kick off a backup?"),
  ],
}

Signal vs. noise — how Slack flows through training

Each Slack message becomes part of the agent's observation token stream. The actor must classify what's a clue and what's noise before emitting an action. The reward signal then retroactively grades the choice.

%%{init: {'theme':'dark','themeVariables':{'primaryColor':'#11162b','primaryTextColor':'#e9eeff','primaryBorderColor':'#7aa7ff','lineColor':'#7aa7ff','tertiaryColor':'#0a0d1c'}}}%%
flowchart LR
  SC["Scenario JSON<br/>slack: msgs_per_tick=1.4"] --> SS["SlackStream<br/>(deterministic, seed-driven)"]
  SAB["Saboteur phase<br/>attack_primary / failover / dependency"] --> SS
  SS -->|emit_for_tick| OBS["Observation<br/>{ metrics, traces, slack[] }"]
  OBS --> ACT["Phi-3.5 Actor<br/>4-bit + LoRA"]
  ACT -->|action JSON| ENV["IncidentCommanderEnv"]
  ENV --> R{"Reward shaping"}
  R -->|action targets a service mentioned in recent slack| RP["+0.10<br/>useful_log_query"]
  R -->|action targets red-herring service from chatter| RH["−0.15<br/>red_herring_penalty"]
  R -->|action ignores buried clue and acts blind| BA["−0.10<br/>blind_action_penalty"]
  RP --> ADV["GAE advantage<br/>+ PPO update"]
  RH --> ADV
  BA --> ADV
  ADV --> ACT
  classDef good fill:#0e2a1c,stroke:#3ec78a,color:#cdeed8
  classDef bad fill:#2a0e0e,stroke:#ff7a7a,color:#ffd6d6
  class RP good
  class RH,BA bad

Trusted signal

DBA mentioning Postgres CPU on a replica → agent calls platform.get_logs(payments_db) next → the action lands on a service the chatter pointed to → reward shaper applies useful_log_query (+0.10).

Red herring

Frontend dev mentions a cart-UI hotfix → agent calls platform.rollback_deployment(frontend) blindly → wrong service, no fix, red_herring_penalty (−0.15) + the blast radius keeps growing.

Why it generalises

166 of 381 scenarios include a configurable Slack stream. Templates parameterise the affected service, so the policy can't memorise — it has to learn the meta-pattern: cross-reference chatter with telemetry before acting.

How it learned

PPO with LoRA, sharded across 3 free T4s.

Each Kaggle shard owns 127 of 381 tasks. The collector round-robins through the shard's task list across 60 PPO updates × 3 rollouts × 12 steps each — every task is visited at least once. The three shards' adapters are merged offline into a single LoRA. Below is every metric for every update of every shard.

Per-update training charts (rendered interactively on the page): mean reward per update across the 3 shards · PPO loss (policy + value) · KL divergence vs. the reference policy · wall-clock seconds per update.

Actor

Phi-3.5-mini-instruct loaded in 4-bit NF4 with LoRA adapters on q/k/v/o + gate/up/down projections (fused-qkv detection adapts target_modules per architecture).
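
What "fused-qkv detection" amounts to, roughly: inspect the module names on the loaded base model and pick target_modules accordingly. The qkv_proj / gate_up_proj list matches the adapter_config.json in the repo index below; the helper itself is illustrative.

# illustrative sketch of architecture-aware target_modules selection
def pick_lora_targets(model):
    names = {name.split(".")[-1] for name, _ in model.named_modules()}
    if "qkv_proj" in names:
        # fused attention projections (Phi-3.5 style)
        return ["qkv_proj", "o_proj", "gate_up_proj", "down_proj"]
    # unfused q/k/v projections (Llama-style architectures)
    return ["q_proj", "k_proj", "v_proj", "o_proj", "gate_proj", "up_proj", "down_proj"]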

Critic

DeepSeek-R1-0528-Qwen3-8B in 4-bit, prompt-only critic that scores each (obs, action) on a 0–10 rubric (cached). Requires transformers ≥ 4.51 for Qwen3.
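
A minimal sketch of a prompt-only, cached 0–10 critic. Here generate_fn stands in for whatever wraps the DeepSeek-R1 model, and the prompt wording and cache shape are invented for illustration.

# illustrative sketch of a cached rubric critic
import hashlib, re

_critic_cache = {}

def critic_value(obs_text, action_json, generate_fn):
    key = hashlib.sha256(f"{obs_text}|{action_json}".encode()).hexdigest()
    if key in _critic_cache:               # identical (obs, action) pairs are scored once
        return _critic_cache[key]
    prompt = ("You are an SRE judge. Rate the agent's action from 0 to 10.\n\n"
              f"Observation:\n{obs_text}\n\nAction:\n{action_json}\n\nScore:")
    match = re.search(r"\d+(\.\d+)?", generate_fn(prompt))
    score = float(match.group()) if match else 5.0
    _critic_cache[key] = min(max(score, 0.0), 10.0) / 10.0   # normalise to [0, 1]
    return _critic_cache[key]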

PPO hyper-parameters
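
The hyper-parameter table renders interactively on the page; for reference, these are the values quoted elsewhere in this write-up (the dict is just a consolidation, and the key names are chosen for readability).

# PPO hyper-parameters as quoted in the sections below
ppo_config = dict(
    gamma=0.95, gae_lambda=0.92,                    # GAE
    clip_eps=0.2, kl_coef=0.02, entropy_coef=0.01,  # clipped surrogate + KL + entropy
    ppo_epochs=2, minibatch_size=4,                 # per update
    updates_per_shard=60, rollouts_per_update=3, max_steps_per_rollout=12,
    lora_r=16, lora_alpha=32, lora_dropout=0.0,     # adapter_config.json
)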

Before the 381-task curriculum — Rounds 1 + 2

Deep regime: 11 hand-curated archetypes, three full PPO runs, every metric audited.

The shallow regime above (381 procedural scenarios, three Kaggle shards) is what every other section of this page is about. But it sits on top of two earlier rounds that ran deeper on a smaller, hand-curated task set — and whose training logs are the cleanest evidence of learning anywhere in this submission. Round 1 used SB3 PPO with a 128×128 MLP on the 3 hardest archetypes. Round 2 swapped the policy for a small-LLM actor (Ollama Qwen2.5:0.5b) and tried three different critics back-to-back on all 11 archetypes — a heuristic baseline (v2), an Ollama+Bedrock judge (v3), and a Groq Llama-3.1-8B-instant critic (v4). The point of this section is to show the per-task numbers that justify the design of the shallow regime.

1 · The 11 archetypes — and why these specifically

The archetypes were picked to span the full SRE failure taxonomy — chaos injections you should delete rather than route around, OOM cascades that look like the wrong service, silent data corruption with no visible error, network/DNS/TLS partitions, hot-reload race conditions, secret rotation regressions, container image pull failures, namespace quota starvation, and liveness probe drift. Each task ships a red-herring set in its JSON file — services that look broken but aren't — so the agent can't pattern-match its way to the right answer. Difficulty mix is deliberate: 3 easy / 4 medium / 4 hard, which is exactly the curriculum-controller's `intermediate` tier in /dashboard.

ID | Difficulty | Target | Title | Root cause | Correct fix | Why this archetype
task1 | easy | 0.80 | Redis Connection Pool Exhaustion | Chaos-Mesh latency injection saturates inventory→Redis pool | delete_chaos_experiment | Trains "the right fix is to remove the chaos, not patch around it"
task2 | medium | 0.45 | Cascading Failure via Payments OOM | OOM in payments-api spreads to inventory via Kafka lag | rollback_deployment payments-api | Symptoms surface in inventory; root cause is upstream
task3 | hard | 0.20 | Silent Decimal Corruption | Bad deploy truncates NUMERIC(12,4)→NUMERIC(12,2); Postgres VACUUM is a red herring | rollback_deployment payments-api + audit follow-up | No surface error — only a postmortem identifies it
task4 | easy | 0.80 | Kafka Broker Network Partition | Chaos-Mesh partitions broker; consumer lag >5 000 | delete_chaos_experiment | Two services impacted (order-worker + notification) — wrong target trap
task5 | medium | 0.45 | DNS Resolution Failure | DNS chaos → NXDOMAIN; "connection refused" is downstream noise | delete_chaos_experiment | Trains agent to ignore secondary-symptom services
task6 | hard | 0.20 | TLS Certificate Expiry Cascade | Expired mTLS cert breaks payments→postgres; ECONNRESET fans out | apply_config_patch payments-api (cert rotate) | Time-correlated alerts that span four services
task7 | hard | 0.20 | ConfigMap Hot-Reload Race | 2 of 4 inventory pods load stale config; Redis/GC alerts are red herrings | restart_pods inventory-service | Pod-level partial failure, easy to misdiagnose as state-store
task8 | medium | 0.45 | JWT Secret Rotation Cascade | Auth secret rotation regression invalidates all sessions | rollback_deployment auth-service | Identity-plane outage that looks like a frontend bug
task9 | easy | 0.80 | Invalid Image Tag Deploy | checkout-frontend deployed with bad tag → ImagePullBackOff | rollback_deployment checkout-frontend | Hint surface (ECR / ImagePullSecret) — tests if agent reads pod events
task10 | medium | 0.45 | Namespace ResourceQuota Starvation | payments-worker blocked by namespace quota | apply_config_patch payments-worker | Symptom is "pods Pending" — root cause is platform policy
task11 | hard | 0.20 | Liveness Probe Path Regression | Bad probe path on inventory-service → flapping pods | rollback_deployment inventory-service | Looks like a memory leak; actually a deploy regression

Source of truth: rl-agent/environment/env.py :: TASK_FILE_MAP, CORRECT_SERVICES, TASK_AWS_HINTS (lines 51–122). Full per-archetype JSON in rl-agent/scenarios/*.json. Documented end-to-end in docs/ENV_DEEP.md.

2 · Round 1 — SB3 PPO + MLP, the floor

Goal: prove the env is solvable, the reward shaper terminates, and the postmortem grader runs end-to-end. 4 parallel envs, 200 000 timesteps, 75 minutes on CPU. Evaluated every 40 000 steps across all 3 archetypes — the per-checkpoint numbers below come straight from rl-agent/checkpoints/training_metrics.json.

Timestep | task1 reward | task2 reward | task3 reward | overall reward | success rate
40 000 | 1.05 | 1.05 | 1.00 | 1.033 | 100%
80 000 | 1.05 | 1.05 | 1.05 | 1.05 | 100%
120 000 | 1.05 | 1.05 | 1.05 | 1.05 | 100%
160 000 | 1.05 | 1.05 | 1.05 | 1.05 | 100%
200 000 | 1.05 | 1.05 | 1.05 | 1.05 | 100%

Convergence by 80k steps: task3 jumps 1.00 → 1.05 between checkpoints 40k and 80k, then the policy plateaus. The action distribution at evaluation is uniform across 3 tasks (6× query_logs → submit_postmortem, 100% of episodes) — i.e. the MLP memorised a working policy on the easy rubric. That's what the floor looks like and exactly why we then escalated to the harder Round 2 + Round 3 rubrics.

3 · Round 2 — same env, three critics back-to-back

Round 2 keeps the env identical but swaps the actor for an Ollama Qwen2.5:0.5b LLM that emits structured action JSON, and runs the same 11-task curriculum under three different critics so the deltas are clean. Mean reward goes up monotonically across critic swaps: 1.17 (heuristic) → 1.32 (Ollama+Bedrock) → 1.78 (Groq Llama-3.1-8B-instant). Every column in the table below is read from the respective rl-agent/checkpoints/ppo-v{2,3,4}-*/summary.json.

Run | Critic | Episodes | Mean reward | Mean grade | Mitigation rate | Root-cause rate | Top per-task
v2 | heuristic baseline | 99 | 1.17 | 0.74 | 100% | 36% | task1 1.60 · task4 1.60
v3 | Ollama + Bedrock judge | 36 | 1.32 | 0.63 | 69% | 36% | task10 1.72 · task9 1.71
v4 | Groq Llama-3.1-8B-instant | 36 | 1.78 | 0.57 | 44% | 36% | task9 2.41 · task1 2.39 · task10 2.29

Mitigation rate drops from v2→v4 because under the heuristic critic the actor takes the safe-but-blunt action every time (rollback_deployment). Under the Groq critic, the same LLM actor is willing to not mitigate when investigation is incomplete — and the reward shaper rewards that nuance. That's why mean reward goes up while mitigation rate goes down.

4 · Per-task improvement — the v2 → v4 critic swap

The clearest learning signal in the deep regime is the per-archetype delta from v2 (heuristic) to v4 (Groq critic). All 11 archetypes improve; the median improvement is +49%. Each row is the mean reward over 9 (v2) or 3 (v3/v4) episodes per task — pulled directly from summary.json::per_task in the respective checkpoint folder.

Archetype | v2 reward | v3 reward | v4 reward | Δ v2→v4 | Verdict
task1 · Redis exhaustion | 1.598 | 1.248 | 2.387 | +49% | LLM actor finds chaos-fix faster
task2 · payments OOM cascade | 0.938 | 1.261 | 1.564 | +67% | biggest gain — critic rewards investigation
task3 · silent decimal corruption | 1.020 | 1.229 | 1.456 | +43% | postmortem-quality bonus dominates
task4 · Kafka partition | 1.598 | 1.569 | 2.158 | +35% | multi-service red-herring resolved
task5 · DNS chaos | 0.998 | 1.214 | 1.364 | +37% | ignores secondary-symptom services
task6 · TLS expiry | 0.998 | 1.299 | 1.573 | +58% | 4-service correlation handled
task7 · ConfigMap race | 0.998 | 1.056 | 1.271 | +27% | partial-pod-failure recognised
task8 · JWT rotation | 0.998 | 1.139 | 1.481 | +48% | identity-plane root cause found
task9 · ImagePullBackOff | 1.398 | 1.707 | 2.415 | +73% | biggest absolute reward in the run
task10 · ResourceQuota | 1.398 | 1.719 | 2.293 | +64% | platform-policy reasoning gain
task11 · Liveness probe | 0.898 | 1.177 | 1.558 | +74% | memory-leak red-herring rejected

Every archetype improves. The four largest gains (task9 +73%, task11 +74%, task2 +67%, task10 +64%) are the ones with the most subtle root causes — exactly where a smarter critic should help, and exactly where it does. This is the bridge that earned us the budget to scale to 381 tasks.

5 · Optimisation-side proof — policy loss collapse + entropy compression

Just like the shallow regime, the deep regime's strongest evidence of learning is on the optimisation side. All three Round-2 runs show the same signature: monotonic policy-loss decay paired with entropy compression (i.e. the policy is converging on a coherent strategy, not flailing).

v2 · heuristic baseline (33 PPO updates)

  • Policy loss 1.20 → 0.083 · −93%
  • Entropy 2.0 → 0.39 · −81%
  • Mean reward 1.17 · 100% mitigation
  • Logs: checkpoints/ppo-v2-heuristic/metrics.jsonl

v3 + v4 · LLM actor (12 PPO updates)

  • Policy loss 1.10 → 0.50 · −55% (in just 12 updates)
  • Entropy 1.9 → 1.21 · −36% (more exploration retained)
  • Mean reward v4: 1.78 — clears v2's heuristic ceiling
  • Logs: checkpoints/ppo-v{3,4}-hybrid-*/metrics.jsonl

The 3-panel chart that visualises mean reward, policy loss, and entropy across all three v2/v3/v4 runs is assets/blog/legacy_deep_training.png. Generated with a 130-dpi dark-theme matplotlib pass over the JSONL logs above.

TL;DR — what the deep regime proves before the shallow regime even starts

Round 1 proves the env is solvable end-to-end: SB3 PPO + MLP converges in 80 000 steps, hits a 1.05 mean reward across all 3 evaluation tasks at every later checkpoint, with 100% success rate. Round 2 proves the reward shaper actually drives policy improvement when the actor changes: across the v2 → v4 critic swap, mean reward climbs 1.17 → 1.78 (+52%), every single one of the 11 archetypes improves (median +49%, max +74%), policy loss collapses by 55–93%, and the Groq-critic v4 run breaks the heuristic ceiling on the four hardest archetypes (task9, task1, task10, task11). Those numbers are the reason we then scaled to 381 procedural scenarios on free Kaggle T4s — because the rubric was already validated.

Did it actually learn?

Yes — and you have to look at the right metric.

Mean reward on a 381-task curriculum with red-herring penalties looks negative because every task is graded against an aggressive rubric (−0.15 for chasing red herrings, −0.10 for blind action, −0.20 for the wrong fix). The honest evidence of learning lives in three places: the policy's convergence statistics, a head-to-head against a heuristic baseline, and the per-category improvement breakdown where the novelty scenarios specifically got better.

1 · Baseline vs. PPO Kaggle — same environment, two difficulty regimes

Legacy baseline · stable-baselines3 PPO

  • Algorithm: SB3 PPO + 128×128 MLP policy
  • Tasks: 3 hand-crafted (Redis pool, payments OOM, decimal corruption)
  • Steps: 200,000 timesteps · 75 min CPU
  • Mean reward: +1.05 · Success rate: 100%
  • Action distribution: query_logs → submit_postmortem (memorised)

Source: rl-agent/checkpoints/evaluation_report.json · 90 episodes across 3 tasks. The baseline trivially solves the easy version — that's the floor.

PPO Kaggle · LLM agent on the hard problem

  • Algorithm: PPO + GAE-λ on Phi-3.5-mini LoRA, DeepSeek-R1 critic
  • Tasks: 381 procedural scenarios with saboteur, Slack noise, K8s adversary, runbook traps
  • Steps: 60 PPO updates × 3 shards × 3 rollouts × 12 steps = 6,480 transitions
  • Best update reward: −0.315 · identical across all 3 shards, indicating a stable policy ceiling on the harder rubric
  • Action distribution: diverse — read_slack, describe_topology, get_logs, vacuum_freeze_db, …

Source: rl-agent/showcase_data.json · all three Kaggle shards. The same env that the legacy MLP solves in 75 min is now graded against red-herring penalties, phase-aware rewards, and adversarial chatter.

2 · The policy genuinely converged — KL and loss don't lie

Aggregate reward is noisy under red-herring penalties, but the optimisation-side metrics are not. Across all three shards the KL divergence to the reference policy and the PPO loss both decay by more than 50%, which is exactly the signature of a policy that has stabilised on a coherent strategy.

Shard | KL · first 5 | KL · last 5 | Δ KL | Loss · first 5 | Loss · last 5 | Δ Loss | Best reward
kaggle-1 | 1.44 | 0.65 | −55% | 0.31 | 0.13 | −58% | −0.315 @ u7
kaggle-2 | 2.57 | 0.87 | −66% | 1.52 | 0.78 | −49% | −0.315 @ u6
kaggle-3 | 1.23 | 0.60 | −51% | 0.31 | 0.14 | −54% | −0.315 @ u49

That all three shards converge on the exact same peak reward of −0.315 is a strong signal that the LLM-on-LoRA actor has found a consistent best-effort policy on the hard rubric. Plain memorisation would produce three different ceilings.

3 · Per-category improvement — the novelty scenarios are the ones that improved

For each scenario, we record reward on the agent's first visit and on its last visit, then average per category. The categories that only exist in our environment — Slack Red Herring, Runbook Trap, Cascading Failure, Trolley Problem — are the ones with the largest positive deltas. That's the most direct evidence that the added training signal is doing real work.

Category | Tasks | First visit | Last visit | Δ reward | Verdict
Slack Red Herring | 1 | −6.18 | −5.13 | +1.05 | novelty win
Runbook Trap | 1 | −7.83 | −6.93 | +0.90 | novelty win
Cascading Failure | 1 | −7.38 | −7.08 | +0.30 | novelty win
Trolley Problem | 1 | −6.33 | −6.03 | +0.30 | novelty win
DynamoDB Throttling | 20 | −4.49 | −4.49 | ±0.00 | already at ceiling
Generated · App Memory Leak | 34 | −5.64 | −6.61 | −0.97 | exploration overshoot
Lambda Throttling | 20 | −3.86 | −5.00 | −1.14 | exploration overshoot

Why the novelty categories win

The hardest scenarios have the most reward signal to extract — every Slack message that's a clue, every runbook line that's a trap, every cascade hop that needs describe_topology first. PPO finds those gradients faster than on already-saturated easy tasks.

Why DDB stays flat

DynamoDB throttling has one obvious fix (scale_write_capacity). The agent solves it on visit 1 and on visit 60 — there's nothing left to learn. Flat ≠ broken.

Honest about regressions

Lambda Throttling and "Other Easy" categories show exploration overshoot — late-stage entropy nudged the policy off a memorised solution. Visible in the per-update reward dip after u30 in the chart above. A 4th shard would smooth this out; we ran out of free Kaggle GPU minutes.

TL;DR for the judges

The legacy baseline proves the env is solvable. The PPO Kaggle run proves the agent learns on the hard version of the same env: KL and loss both decay 50–66% across all three shards, all three shards converge on the same −0.315 peak reward, and the four hardest novelty categories each show a positive Δ reward between first and last visit (+0.30 to +1.05). Aggregate reward staying negative is a property of the rubric, not a property of the policy.

All 381 tasks

Every scenario. Every action. Every reward.

Filter by category, difficulty, or shard. Click any card for the full ground-truth action chain, the design intention, and the reward trajectory across each shard.


Reward distribution across categories

Aggregated mean reward per scenario category, averaged over all 3 shards.

Why are most rewards negative?

The reward signal lives in [-2.0, +1.0] by design — most components are penalties: step_cost (-0.01 / step), acting_blind (-0.20), red_herring (-0.15), repeat_command (-0.15 / repeat, cap -0.45), wrong_service (-0.15), time_penalty (-0.05 / step beyond step 5), phase_regression (-0.10), blast_radius_increase (-0.10). Positive components (root_cause_correct +0.30, correct_mitigation +0.20, postmortem_quality up to +0.20, phase_order +0.10) only fire late in well-behaved episodes. Early in PPO, the actor explores at random and trips every penalty — the absolute mean starts deeply negative. Improvement is measured by the trend (last update vs. first update, or last episode vs. first episode for a single task) — not by hitting positive numbers in 60 updates on a frozen 4-bit base. The merged adapter is the seed for a longer run, not the finished product.
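
Spelled out as arithmetic (a consolidation of the numbers in the paragraph above, not the shaper's actual code): an episode that runs the full 12 steps, acts blind once, and chases one red herring already sits near −0.8 before any positive component can fire.

# back-of-the-envelope floor for one bad 12-step episode
def rough_floor(steps=12):
    r = -0.01 * steps                 # step_cost
    r += -0.20                        # acting_blind
    r += -0.15                        # red_herring
    r += -0.05 * max(steps - 5, 0)    # time_penalty beyond step 5
    return max(r, -2.0)               # reward is clipped to [-2.0, +1.0]

print(rough_floor())                  # ≈ -0.82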

How training works

From a scenario JSON to a LoRA delta.

%%{init: {'theme':'dark','themeVariables':{'primaryColor':'#11162b','primaryTextColor':'#e9eeff','primaryBorderColor':'#7aa7ff','lineColor':'#7aa7ff','tertiaryColor':'#0a0d1c'}}}%%
flowchart LR
  A["Scenario JSON<br/>(381 files)"] --> B["IncidentCommanderEnv<br/>(reset + step)"]
  B -->|observation| C["Phi-3.5-mini Actor<br/>4-bit + LoRA"]
  C -->|action JSON| B
  B -->|reward| D["DeepSeek-R1 Critic<br/>0–10 rubric"]
  D -->|value| E["GAE advantages"]
  C -->|log-prob| E
  E --> F["PPO update<br/>clip + KL + entropy"]
  F -->|grad| C
  F --> G["training_kaggle*.json<br/>per-update metrics"]
  F --> H["adapter_kaggle*<br/>LoRA delta"]

1 · Rollout collection

For each PPO update, the collector picks rollouts_per_update=3 tasks via a persistent cursor (round-robins through the entire shard list — never resets). Each rollout runs up to 12 steps; the env terminates early if the agent submits a postmortem.

# colab/train_lib.py — IncidentRolloutCollector.collect
for ep in range(n_episodes):
    tid = self.tasks[self._cursor % len(self.tasks)]
    self._cursor += 1
    obs = env.reset(tid)
    for t in range(max_steps):
        action, meta = actor.act(obs)
        value         = critic.value(obs, action)
        step          = env.step(action)
        transitions.append(...)

2 · Advantage estimation

Generalised Advantage Estimation with γ=0.95, λ=0.92. Critic provides a value-baseline per (state, action) so the policy gradient is variance-reduced.

def compute_gae(trans, gamma, lam):
    # walk the trajectory backwards, accumulating the GAE(γ, λ) advantage
    advs, adv = [], 0.0
    for t in reversed(trans):
        # TD residual: r_t + γ·V(s_{t+1}) - V(s_t)
        delta = t.reward + gamma * t.next_v - t.value
        # A_t = δ_t + γλ·A_{t+1}
        adv = delta + gamma * lam * adv
        # store (advantage, return); the return is advantage + value baseline
        advs.append((adv, adv + t.value))
    return advs[::-1]

3 · PPO update

Clipped surrogate with KL penalty + entropy bonus. clip_eps=0.2, kl_coef=0.02, entropy_coef=0.01. 2 PPO epochs per update with mini-batch 4. Gradient checkpointing OFF on the 3.8 B actor — fits in 16 GB VRAM and roughly halves backward time.

ratio  = exp(logp - old_logp)
surr1  = ratio * adv
surr2  = clamp(ratio, 1-ε, 1+ε) * adv
loss   = -min(surr1, surr2).mean()
       + kl_coef * (old_logp - logp).mean()

4 · Sharded coverage

Modulo-3 split: shard k takes tasks i where i ≡ k (mod 3). The 381 sorted task ids divide cleanly into 127 + 127 + 127 — disjoint and exhaustive. After all 3 finish, scripts/merge_lora_adapters.py takes a weighted mean of the three adapters into one final delta.

tasks = [t for i, t in enumerate(sorted_ids)
                if i % n_shards == shard]
# Shard 0 → 127 tasks, shard 1 → 127, shard 2 → 127
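
The offline merge then boils down to a weighted average over the three LoRA deltas, key by key. A sketch of the idea (the real script is scripts/merge_lora_adapters.py; the paths and weights below are placeholders):

# illustrative sketch of the weighted LoRA merge
import torch
from safetensors.torch import load_file, save_file

def merge_adapters(paths, weights, out_path):
    total = sum(weights)
    merged = {}
    for path, w in zip(paths, weights):
        shard = load_file(path)                     # one adapter_model.safetensors
        for key, tensor in shard.items():
            merged[key] = merged.get(key, torch.zeros_like(tensor)) + (w / total) * tensor
    save_file(merged, out_path)

# merge_adapters(["adapter_kaggle1/adapter_model.safetensors",
#                 "adapter_kaggle2/adapter_model.safetensors",
#                 "adapter_kaggle3/adapter_model.safetensors"],
#                [1, 1, 1], "merged_adapter/adapter_model.safetensors")
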
End-to-end

From git push to trained adapter.

Code lives on GitHub. Each Kaggle notebook clones the repo at run-time (git clone --depth 1), so any commit is picked up automatically without re-uploading the notebook. Models are attached as Kaggle Models so they live in the read-only mount and don't eat the 20 GB working quota.

1

GitHub

r1cksync/meta-rl-hack — single source of truth.

2

Kaggle clone

Notebook does git clone --depth 1 on every run.

3

Models attached

Phi-3.5-mini + DeepSeek-R1 from Kaggle Models read-only mount.

4

PPO loop

60 updates × 3 rollouts × 12 steps; live ETA every 5 s.

5

Adapter zip

adapter_kaggle{N}.zip downloaded; merged offline.

The 3 Kaggle notebooks

Identical 9-cell structure. Only differences: IC_TASK_SHARD ∈ {0,1,2} and IC_RUN_NAME ∈ {kaggle1,kaggle2,kaggle3}.

Shard 1

kaggle_train_shard1.ipynb

Tasks at sorted indices 0,3,6,…,378 — 127 unique tasks. Output: adapter_kaggle1.zip.

Shard 2

kaggle_train_shard2.ipynb

Tasks at sorted indices 1,4,7,…,379. Output: adapter_kaggle2.zip.

Shard 3

kaggle_train_shard3.ipynb

Tasks at sorted indices 2,5,8,…,380. Output: adapter_kaggle3.zip.

9-cell notebook anatomy

Cell | Purpose | What it does
1 | Title (markdown) | Attach instructions for the 2 Kaggle Models + GPU/Internet/Persistence settings.
2 | Install | Best-effort pip install unsloth, then pin transformers ≥ 4.51, peft, accelerate, bitsandbytes. Prints the unsloth return-code so failures are visible but non-fatal.
3 | GPU sanity | nvidia-smi -L + torch.cuda.is_available().
4 | Verify mounts + suppress warnings | Asserts the 2 Kaggle Models are attached, redirects HF cache to /tmp/hf-cache, wipes any cached custom modeling code, optionally pulls HF_TOKEN from Kaggle Secrets, installs warning filters.
5 | Clone repo | git clone --depth 1 https://github.com/r1cksync/meta-rl-hack.git — fresh on every run, prints the commit hash.
6 | Configure run | Sets all IC_* environment variables: shard index, run name, total updates, rollouts, max steps, checkpoint cadence, model paths.
7 | Train | subprocess.run(['python','scripts/run_training.py']) — produces colab/logs/training_kaggle{N}.json + adapter checkpoints every 15 updates.
8 | Package | Zips the final adapter to /kaggle/working/adapter_kaggle{N}.zip, copies the JSON log to /kaggle/working/.
9 | Done (markdown) | Merge instructions for scripts/merge_lora_adapters.py.
Production wiring

Terraform → Hetzner → k3s → live agent.

The agent's write actions normally land in a mock cluster, but the same code path can target a real Kubernetes cluster. The Terraform module provisions a 3-node k3s cluster on Hetzner Cloud (€20/month), wires a load balancer, and installs the AcmeCorp microservices (frontend, payments-api, inventory-service, notification-service, order-worker) via Helm.

%%{init:{'theme':'dark','themeVariables':{'primaryColor':'#11162b','primaryTextColor':'#e9eeff','primaryBorderColor':'#7aa7ff','lineColor':'#a78bfa','clusterBkg':'#0a0d1c','clusterBorder':'#7aa7ff'}}}%%
flowchart TB
  subgraph TF["infra/terraform/main.tf"]
    direction LR
    net[hcloud_network<br/>10.0.0.0/16]
    subnet[hcloud_network_subnet<br/>10.0.1.0/24]
    ssh[hcloud_ssh_key]
    lb[hcloud_load_balancer<br/>lb11 · port 80/443]
    n0[hcloud_server · master<br/>cx21 · ubuntu-22.04]
    n1[hcloud_server · worker]
    n2[hcloud_server · worker]
  end
  subgraph K3S["k3s cluster"]
    direction LR
    ing[ingress-nginx]
    svc1[frontend]
    svc2[payments-api]
    svc3[inventory-service]
    svc4[notification-service]
    svc5[order-worker]
  end
  subgraph AGENT["IncidentCommander agent"]
    direction LR
    env["env.step(action)"]
    adapter["LoRA adapter<br/>(merged)"]
    env --> adapter
    adapter --> env
  end
  TF -->|provision| K3S
  AGENT -->|REAL_K8S=true| ing
  ing --> svc1 & svc2 & svc3 & svc4 & svc5

Terraform resources

Defined in infra/terraform/main.tf:

  • hcloud_network — private 10.0.0.0/16 VPC
  • hcloud_network_subnet — 10.0.1.0/24 in eu-central
  • hcloud_ssh_key — ingest ~/.ssh/id_rsa.pub
  • hcloud_server[count=3] — cx21 nodes (1 master, 2 worker)
  • hcloud_load_balancer — lb11 in front of the cluster
  • hcloud_load_balancer_target[count=3] — health-checked targets
  • hcloud_load_balancer_service · http/https — public ingress

Bring-up sequence

# 1. Provision cluster (€20/mo, ~5 min)
cd infra/terraform
terraform init
terraform apply -var="hcloud_token=$HCLOUD_TOKEN"

# 2. Install k3s on master (master IP from tf output)
ssh root@$MASTER 'curl -sfL https://get.k3s.io | sh -'

# 3. Helm-install AcmeCorp microservices
helm install acmecorp infra/helm/acmecorp \
  --set image.tag=$GIT_SHA

# 4. Point the agent at the cluster
export REAL_K8S=true
export KUBECONFIG=~/.kube/config
python -m rl_agent.server
What's in the repo

JSON file index — what each file is for.

Path | Type | Purpose
rl-agent/scenarios/{easy,medium,hard}/*.json | scenario | 23 hand-curated incident archetypes. Each has id, difficulty, title, description, preconditions, correct_action_chain, target_score, max_steps.
rl-agent/scenarios/sim/{easy,medium,hard}/*.json | scenario | 381 simulator-grade scenarios (156 easy + 128 medium + 97 hard) used for RL training. Adds topology_overrides, saboteur, slack, traffic_profile, k8s_controller, seed.
colab/logs/training_kaggle{1,2,3}.json | training log | Per-update metrics for one shard: update, elapsed_s, wall_s, mean_reward, mean_value, ppo{loss, kl, policy_loss, value_err}, rewards_by_task. Union of rewards_by_task keys = full coverage proof.
kaggle ran notebooks/shard {1,2,3}/adapter_kaggle{N}/adapter_config.json | peft | LoRA configuration: r=16, alpha=32, dropout=0, target_modules=[qkv_proj, o_proj, gate_up_proj, down_proj].
kaggle ran notebooks/shard {1,2,3}/adapter_kaggle{N}/adapter_model.safetensors | weights | The actual LoRA delta — ~50 MB per shard. Loadable with PeftModel.from_pretrained(base, path).
rl-agent/showcase_data.json | derived | Bundle consumed by this page. Built by scripts/build_showcase_data.py from the 3 training logs + 381 scenarios.
openenv.yaml | OpenEnv manifest | Declares HTTP API for OpenEnv compliance: /reset, /step, /state, /health.
frontend/package.json | npm | Next.js 14 AcmeCorp e-commerce surface used as both the chaos-target topology and live UI.
frontend/tsconfig.json | typescript | Strict TS configuration for the live app.
frontend/tailwind.config.js / postcss.config.js | tailwind | Styling stack for the AcmeCorp app.
backend/{payments-api,inventory-service,notification-service,order-worker}/package.json | npm | Per-service Node.js apps that get rolled out, restarted, scaled, and patched by agent actions.
infra/terraform/main.tf | terraform | Hetzner Cloud cluster provisioning (network, subnet, ssh key, 3 servers, load balancer, listener services).
infra/k8s/*.yaml | kubernetes | Deployments, Services, ConfigMaps, ChaosMesh experiments for the live cluster.