381 production-grade incident scenarios. A 4-bit Phi-3.5-mini actor, an 8B DeepSeek-R1 critic (also 4-bit), PPO with phase-aware rewards, sharded across three free Kaggle T4 GPUs. A complete pipeline — from procedurally-generated chaos to a live Hetzner Kubernetes cluster — for an autonomous SRE that learns to triage, investigate, fix, and verify.
Most RL benchmarks for LLMs use synthetic puzzles. We don't — every scenario simulates a real incident pattern (DynamoDB throttling, IAM permission chains, memory leaks hidden behind cascading 504s, runbook traps, adversarial saboteurs). The agent must investigate before acting, follow the correct triage → investigate → fix → verify phase order, and survive multi-fault scenarios where one fix is not enough.
A controller tracks per-task mastery and escalates difficulty tiers automatically. Warmup → beginner → intermediate → advanced → expert.
At expert tier, an LLM composes novel incidents that target the agent's tracked weaknesses — infinite, non-repeating scenarios.
Every action is critiqued by a Junior / Senior / Principal SRE persona with progressively stricter evaluation. Snorkel-style experts-in-the-loop.
Actions classified as triage → investigate → fix → verify; agent earns bonuses for correct workflow order and loses reward for regressing phases.
Acting blind, repeating commands, targeting red-herring services — all penalised. Prevents reward-hacking through action spam.
Write actions normally hit a mock cluster but can be routed to a live Kubernetes cluster (REAL_K8S=true) — same code path.
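The routing is a single dispatch point. A minimal sketch, assuming hypothetical backend class names; only the REAL_K8S flag is from the repo:

```python
import os

class MockClusterBackend:
    """In-memory cluster state; the default target for write actions."""
    def execute(self, action: dict) -> dict:
        return {"status": "applied", "backend": "mock", "action": action["id"]}

class RealK8sBackend:
    """Live-cluster target, selected when REAL_K8S=true (would call the k8s API)."""
    def execute(self, action: dict) -> dict:
        return {"status": "applied", "backend": "k8s", "action": action["id"]}

def apply_write_action(action: dict) -> dict:
    # one dispatch point: both backends share the same code path
    live = os.environ.get("REAL_K8S", "").lower() == "true"
    backend = RealK8sBackend() if live else MockClusterBackend()
    return backend.execute(action)
```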
Real on-call engineers don't just stare at dashboards. They scroll Slack — where the CEO is shouting, a frontend dev is asking if their hotfix broke prod, the intern wants to "just restart the cluster", and buried in the noise a DBA quietly mentions Postgres CPU is climbing on a replica. We built that channel into the environment. Every scenario emits a deterministic stream of human messages — some are real clues, most are red herrings — and the agent has to learn which voices to trust.
Each tick the simulator samples from a templated pool seeded by the scenario: generic chatter plus phase-coupled lines, so that when the saboteur escalates, high-severity messages appear, mirroring real pressure.

Most LLM-RL benchmarks give the agent perfectly structured JSON observations. Real incidents are buried in unstructured human text, mixed with noise, mixed with politics. By adding a SlackStream alongside metrics and traces we force the policy to learn a real on-call skill: parse human chatter, weigh it against telemetry, ignore the noise.
The action platform.read_slack is rewarded as
useful information gathering when followed by an action that targets a service mentioned
in a recent message — and penalised as a red herring when the agent acts on a misleading
line (e.g. rolling back the wrong service because Sara from Frontend mentioned a hotfix).
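That grading rule reduces to a small check over recent chatter. A minimal sketch, assuming the scenario's ground truth supplies clue and red-herring service sets; all names here are illustrative, not the repo's actual shaper API:

```python
def grade_slack_followup(recent_msgs, action, clue_services, herring_services):
    """Grade the action that follows platform.read_slack.

    recent_msgs: [(author, severity, text), ...] from the last few ticks.
    clue_services / herring_services: ground truth from the scenario JSON.
    """
    params = action.get("params", {})
    target = params.get("service") or params.get("target")
    if target is None:
        return 0.0
    text = " ".join(t for _, _, t in recent_msgs).lower()
    mentioned = target.lower() in text or target.lower().replace("_", " ") in text
    if mentioned and target in clue_services:
        return +0.10   # useful_log_query: chatter corroborated by telemetry
    if mentioned and target in herring_services:
        return -0.15   # red_herring_penalty: acted on a misleading line
    return 0.0
```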
From scenarios/sim/hard/sim_gen_cascade_payments_db_005.json — a hard cascade where the truth is in payments_db but alerts fire on checkout. Slack chatter is dialled to 1.4 messages per tick.
```jsonc
// scenario JSON declares the Slack stream
{
  "id": "sim_gen_cascade_payments_db_005",
  "saboteur": { "primary_target": "payments", "aggressiveness": 0.5 },
  "slack": { "msgs_per_tick": 1.4 },
  "correct_action_chain": [
    { "id": "platform.read_slack", "params": { "last_n": 5 } },
    { "id": "platform.describe_topology" },
    { "id": "platform.get_logs", "params": { "service": "payments_db" } },
    { "id": "platform.vacuum_freeze_db", "params": { "target": "payments_db" } }
  ]
}
```
From rl-agent/simulator/slack.py — fully deterministic per seed. Mix of generic + phase-coupled lines.
```python
# rl-agent/simulator/slack.py — _GENERIC pool (excerpt)
[
    ("@channel", "info", "Customers reporting 500s on /checkout 🔥"),
    ("Sara (Frontend)", "info",  # red herring
     "Hey SRE, I just pushed a hotfix for the cart UI"
     " — could that be related?"),
    ("Intern (Jamal)", "info", "Should I just restart the cluster? 😅"),
    ("DBA (Yuki)", "info",  # buried clue
     "Postgres CPU is climbing on the replica btw."),
    ("Finance", "info", "Every minute of downtime is ~$8k for us."),
    ("CEO", "high", "I'm getting calls. Tell me what's happening."),
]

# Phase-coupled lines fire when saboteur escalates:
_PHASE_LINES = {
    "attack_failover": [
        ("DBA (Yuki)", "high",
         "{svc} spiked to 90% CPU."
         " Did someone kick off a backup?"),
    ],
}
```
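Determinism per seed can be had by seeding a local RNG with (seed, tick). A sketch of how the fractional msgs_per_tick=1.4 rate could be realised; the sampler shape is an assumption, only the rate semantics come from the scenario field:

```python
import random

def sample_tick(pool, seed, tick, msgs_per_tick=1.4):
    # fresh RNG per (seed, tick): replays are bit-identical, no global state
    rng = random.Random(f"{seed}:{tick}")
    base, frac = int(msgs_per_tick), msgs_per_tick % 1.0
    n = base + (1 if rng.random() < frac else 0)   # 1.4 → 1 or 2 messages per tick
    return [rng.choice(pool) for _ in range(n)]
```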
Each Slack message becomes part of the agent's observation token stream. The actor must classify what's a clue and what's noise before emitting an action. The reward signal then retroactively grades the choice.
DBA mentioning Postgres CPU on a replica → agent calls platform.get_logs(payments_db) next → the action lands on a service the chatter pointed to → reward shaper applies useful_log_query (+0.10).
Frontend dev mentions a cart-UI hotfix → agent calls platform.rollback_deployment(frontend) blindly → wrong service, no fix, red_herring_penalty (−0.15) + the blast radius keeps growing.
166 of 381 scenarios include a configurable Slack stream. Templates parameterise the affected service, so the policy can't memorise — it has to learn the meta-pattern: cross-reference chatter with telemetry before acting.
Each Kaggle shard owns 127 of 381 tasks. The collector round-robins through the shard's task list across 60 PPO updates × 3 rollouts × 12 steps each — every task is visited at least once. The three shards' adapters are merged offline into a single LoRA. Below is every metric for every update of every shard.
Actor: Phi-3.5-mini-instruct loaded in 4-bit NF4 with LoRA adapters on q/k/v/o + gate/up/down projections (fused-qkv detection adapts target_modules per architecture).

Critic: DeepSeek-R1-0528-Qwen3-8B in 4-bit, a prompt-only critic that scores each (obs, action) pair on a 0–10 rubric (cached). Requires transformers ≥ 4.51 for Qwen3.
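A prompt-only critic is one scoring call plus a cache. A minimal sketch: the rubric wording and the generate() wrapper are hypothetical stand-ins, only the 0–10 rubric and caching behaviour come from the description above:

```python
from functools import lru_cache

RUBRIC = ("You are a Principal SRE. Given the observation, score the proposed "
          "action 0-10 for how much it advances incident resolution. "
          "Reply with a single number.")

def generate(prompt: str, max_new_tokens: int = 4) -> str:
    """Stub standing in for a call to the quantised critic's generate loop."""
    return "7"

@lru_cache(maxsize=4096)   # (obs, action) pairs repeat across rollouts, so cache hard
def critic_value(obs: str, action: str) -> float:
    prompt = f"{RUBRIC}\n\nObservation:\n{obs}\n\nAction:\n{action}\n\nScore:"
    raw = generate(prompt)
    return float(raw.strip().split()[0]) / 10.0   # normalise to a 0-1 value baseline
```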
The shallow regime above (381 procedural scenarios, three Kaggle shards) is what every other section of this page is about. But it sits on top of two earlier rounds that ran deeper on a smaller, hand-curated task set — and whose training logs are the cleanest evidence of learning anywhere in this submission. Round 1 used SB3 PPO with a 128×128 MLP on the 3 hardest archetypes. Round 2 swapped the policy for a small-LLM actor (Ollama Qwen2.5:0.5b) and tried three different critics back-to-back on all 11 archetypes — a heuristic baseline (v2), an Ollama+Bedrock judge (v3), and a Groq Llama-3.1-8B-instant critic (v4). The point of this section is to show the per-task numbers that justify the design of the shallow regime.
The archetypes were picked to span the full SRE failure taxonomy — chaos injections you should delete rather than route around, OOM cascades that look like the wrong service, silent data corruption with no visible error, network/DNS/TLS partitions, hot-reload race conditions, secret rotation regressions, container image pull failures, namespace quota starvation, and liveness probe drift. Each task ships a red-herring set in its JSON file — services that look broken but aren't — so the agent can't pattern-match its way to the right answer. Difficulty mix is deliberate: 3 easy / 4 medium / 4 hard, which is exactly the curriculum-controller's `intermediate` tier in /dashboard.
| ID | Difficulty | Target score | Title | Root cause | Correct fix | Why this archetype |
|---|---|---|---|---|---|---|
| task1 | easy | 0.80 | Redis Connection Pool Exhaustion | Chaos-Mesh latency injection saturates inventory→Redis pool | delete_chaos_experiment | Trains "the right fix is to remove the chaos, not patch around it" |
| task2 | medium | 0.45 | Cascading Failure via Payments OOM | OOM in payments-api spreads to inventory via Kafka lag | rollback_deployment payments-api | Symptoms surface in inventory; root cause is upstream |
| task3 | hard | 0.20 | Silent Decimal Corruption | Bad deploy truncates NUMERIC(12,4)→NUMERIC(12,2); Postgres VACUUM is a red herring | rollback_deployment payments-api + audit follow-up | No surface error — only a postmortem identifies it |
| task4 | easy | 0.80 | Kafka Broker Network Partition | Chaos-Mesh partitions broker; consumer lag >5 000 | delete_chaos_experiment | Two services impacted (order-worker + notification) — wrong-target trap |
| task5 | medium | 0.45 | DNS Resolution Failure | DNS chaos → NXDOMAIN; "connection refused" is downstream noise | delete_chaos_experiment | Trains agent to ignore secondary-symptom services |
| task6 | hard | 0.20 | TLS Certificate Expiry Cascade | Expired mTLS cert breaks payments→postgres; ECONNRESET fans out | apply_config_patch payments-api (cert rotate) | Time-correlated alerts that span four services |
| task7 | hard | 0.20 | ConfigMap Hot-Reload Race | 2 of 4 inventory pods load stale config; Redis/GC alerts are red herrings | restart_pods inventory-service | Pod-level partial failure, easy to misdiagnose as state-store |
| task8 | medium | 0.45 | JWT Secret Rotation Cascade | Auth secret rotation regression invalidates all sessions | rollback_deployment auth-service | Identity-plane outage that looks like a frontend bug |
| task9 | easy | 0.80 | Invalid Image Tag Deploy | checkout-frontend deployed with bad tag → ImagePullBackOff | rollback_deployment checkout-frontend | Hint surface (ECR / ImagePullSecret) — tests if agent reads pod events |
| task10 | medium | 0.45 | Namespace ResourceQuota Starvation | payments-worker blocked by namespace quota | apply_config_patch payments-worker | Symptom is "pods Pending" — root cause is platform policy |
| task11 | hard | 0.20 | Liveness Probe Path Regression | Bad probe path on inventory-service → flapping pods | rollback_deployment inventory-service | Looks like a memory leak; actually a deploy regression |
Source of truth: rl-agent/environment/env.py :: TASK_FILE_MAP, CORRECT_SERVICES, TASK_AWS_HINTS
(lines 51–122). Full per-archetype JSON in rl-agent/scenarios/*.json.
Documented end-to-end in docs/ENV_DEEP.md.
Goal: prove the env is solvable, the reward shaper terminates, and the postmortem grader runs end-to-end.
4 parallel envs, 200 000 timesteps, 75 minutes on CPU. Evaluated every 40 000 steps across all 3 archetypes —
the per-checkpoint numbers below come straight from rl-agent/checkpoints/training_metrics.json.
| Timestep | task1 reward | task2 reward | task3 reward | overall reward | success rate |
|---|---|---|---|---|---|
| 40 000 | 1.05 | 1.05 | 1.00 | 1.033 | 100% |
| 80 000 | 1.05 | 1.05 | 1.05 | 1.05 | 100% |
| 120 000 | 1.05 | 1.05 | 1.05 | 1.05 | 100% |
| 160 000 | 1.05 | 1.05 | 1.05 | 1.05 | 100% |
| 200 000 | 1.05 | 1.05 | 1.05 | 1.05 | 100% |
Convergence by 80k steps: task3 jumps 1.00 → 1.05 between checkpoints 40k and 80k,
then the policy plateaus. The action distribution at evaluation is uniform across 3 tasks
(6× query_logs → submit_postmortem, 100% of episodes) — i.e. the MLP
memorised a working policy on the easy rubric. That's what the floor looks
like and exactly why we then escalated to the harder Round 2 + Round 3 rubrics.
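For reference, the whole Round-1 setup is a handful of Stable-Baselines3 lines. A sketch, assuming a hypothetical make_incident_env factory over the three archetypes:

```python
from stable_baselines3 import PPO
from stable_baselines3.common.env_util import make_vec_env

from rl_agent.environment.env import make_incident_env  # hypothetical factory import

venv = make_vec_env(make_incident_env, n_envs=4)        # 4 parallel envs
model = PPO("MlpPolicy", venv,
            policy_kwargs=dict(net_arch=[128, 128]),    # the Round-1 128×128 MLP
            verbose=1)
model.learn(total_timesteps=200_000)                    # ~75 min on CPU
model.save("rl-agent/checkpoints/ppo_round1")
```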
Round 2 keeps the env identical but swaps the actor for an Ollama Qwen2.5:0.5b LLM that emits structured
action JSON, and runs the same 11-task curriculum under three different critics
so the deltas are clean. Mean reward goes up monotonically across critic swaps: 1.17 (heuristic) →
1.32 (Ollama+Bedrock) → 1.78 (Groq Llama-3.1-8B-instant). Every column in the table below is read from the
respective rl-agent/checkpoints/ppo-v{2,3,4}-*/summary.json.
| Run | Critic | Episodes | Mean reward | Mean grade | Mitigation rate | Root-cause rate | Top per-task |
|---|---|---|---|---|---|---|---|
| v2 | heuristic baseline | 99 | 1.17 | 0.74 | 100% | 36% | task1 1.60 · task4 1.60 |
| v3 | Ollama + Bedrock judge | 36 | 1.32 | 0.63 | 69% | 36% | task10 1.72 · task9 1.71 |
| v4 | Groq Llama-3.1-8B-instant | 36 | 1.78 | 0.57 | 44% | 36% | task9 2.41 · task1 2.39 · task10 2.29 |
Mitigation rate drops from v2→v4 because under the heuristic critic the actor settles on the safe-but-blunt action every time (rollback_deployment). Under the Groq critic, the LLM actor is willing to not mitigate when investigation is incomplete — and the reward shaper rewards that nuance. That's why mean reward goes up while mitigation rate goes down.
The clearest learning signal in the deep regime is the per-archetype delta from v2 (heuristic) to v4 (Groq critic).
All 11 archetypes get better; the median improvement is +49%.
Each row is the mean reward over 9 (v2) or 3 (v3/v4) episodes per task — pulled directly from
summary.json::per_task in the respective checkpoint folder.
| Archetype | v2 reward | v3 reward | v4 reward | Δ v2→v4 | Verdict |
|---|---|---|---|---|---|
| task1 · Redis exhaustion | 1.598 | 1.248 | 2.387 | +49% | LLM actor finds chaos-fix faster |
| task2 · payments OOM cascade | 0.938 | 1.261 | 1.564 | +67% | big gain — critic rewards investigation |
| task3 · silent decimal corruption | 1.020 | 1.229 | 1.456 | +43% | postmortem-quality bonus dominates |
| task4 · Kafka partition | 1.598 | 1.569 | 2.158 | +35% | multi-service red herring resolved |
| task5 · DNS chaos | 0.998 | 1.214 | 1.364 | +37% | ignores secondary-symptom services |
| task6 · TLS expiry | 0.998 | 1.299 | 1.573 | +58% | 4-service correlation handled |
| task7 · ConfigMap race | 0.998 | 1.056 | 1.271 | +27% | partial-pod failure recognised |
| task8 · JWT rotation | 0.998 | 1.139 | 1.481 | +48% | identity-plane root cause found |
| task9 · ImagePullBackOff | 1.398 | 1.707 | 2.415 | +73% | biggest absolute reward in the run |
| task10 · ResourceQuota | 1.398 | 1.719 | 2.293 | +64% | platform-policy reasoning gain |
| task11 · Liveness probe | 0.898 | 1.177 | 1.558 | +74% | memory-leak red herring rejected |
Every archetype improves. The four largest gains
(task9 +73%, task11 +74%, task2 +67%,
task10 +64%) are the ones with the most subtle root causes — exactly where a smarter critic
should help, and exactly where it does. This is the bridge that earned us the budget to scale to 381 tasks.
Just like the shallow regime, the deep regime's strongest evidence of learning is on the optimisation side. All three Round-2 runs show the same signature: monotonic policy-loss decay paired with entropy compression (i.e. the policy is converging on a coherent strategy, not flailing).
Raw logs: checkpoints/ppo-v2-heuristic/metrics.jsonl and checkpoints/ppo-v{3,4}-hybrid-*/metrics.jsonl. The 3-panel chart that visualises mean reward, policy loss, and entropy across all three v2/v3/v4 runs is assets/blog/legacy_deep_training.png, generated with a 130-dpi dark-theme matplotlib pass over the JSONL logs above.
Round 1 proves the env is solvable end-to-end: SB3 PPO + MLP converges in
80 000 steps, hits a 1.05 mean reward across all 3 evaluation tasks at every later checkpoint, with 100% success
rate. Round 2 proves the reward shaper actually drives policy improvement
when the actor changes: across the v2 → v4 critic swap, mean reward climbs 1.17 → 1.78 (+52%),
every single one of the 11 archetypes improves (median +49%, max +74%), policy loss collapses by 55–93%, and the
Groq-critic v4 run breaks the heuristic ceiling on the four hardest archetypes
(task9, task1, task10,
task11). Those numbers are the reason we then scaled to 381 procedural scenarios on
free Kaggle T4s — because the rubric was already validated.
Mean reward on a 381-task curriculum with red-herring penalties looks negative because every task is graded against an aggressive rubric (−0.15 for chasing red herrings, −0.10 for blind action, −0.20 for the wrong fix). The honest evidence of learning lives in three places: the policy's convergence statistics, a head-to-head against a heuristic baseline, and the per-category improvement breakdown where the novelty scenarios specifically got better.
Legacy MLP baseline: query_logs → submit_postmortem (memorised). Source: rl-agent/checkpoints/evaluation_report.json · 90 episodes across 3 tasks. The baseline trivially solves the easy version — that's the floor.
PPO LLM agent: read_slack, describe_topology, get_logs, vacuum_freeze_db, … Source: rl-agent/showcase_data.json · all three Kaggle shards. The same env that the legacy MLP solves in 75 min is now graded against red-herring penalties, phase-aware rewards, and adversarial chatter.
Aggregate reward is noisy under red-herring penalties, but the optimisation-side metrics are not. Across all three shards the KL divergence to the reference policy and the PPO loss both decay by roughly half or more, which is exactly the signature of a policy that has stabilised on a coherent strategy.
| Shard | KL · first 5 | KL · last 5 | Δ KL | Loss · first 5 | Loss · last 5 | Δ Loss | Best reward |
|---|---|---|---|---|---|---|---|
| kaggle-1 | 1.44 | 0.65 | −55% | 0.31 | 0.13 | −58% | −0.315 @ u7 |
| kaggle-2 | 2.57 | 0.87 | −66% | 1.52 | 0.78 | −49% | −0.315 @ u6 |
| kaggle-3 | 1.23 | 0.60 | −51% | 0.31 | 0.14 | −54% | −0.315 @ u49 |
That all three shards converge on the exact same peak reward of −0.315 is a strong signal that the LLM-on-LoRA actor has found a consistent best-effort policy on the hard rubric. Plain memorisation would produce three different ceilings.
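The first-5 vs last-5 columns are cheap to recompute from the shard logs. A minimal sketch, assuming each entry in colab/logs/training_kaggleN.json is one PPO update whose ppo dict carries scalar kl and loss fields (the field layout is taken from the artifact table at the end of this page; the accessor is illustrative):

```python
import json

def trend(log_path: str, key: str):
    # one record per PPO update, e.g. log_path = "colab/logs/training_kaggle1.json"
    with open(log_path) as f:
        updates = json.load(f)
    vals = [u["ppo"][key] for u in updates]             # key = "kl" or "loss" (assumed)
    first = sum(vals[:5]) / 5
    last = sum(vals[-5:]) / 5
    return first, last, 100.0 * (last - first) / first  # the Δ% column
```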
For each scenario, we record reward on the agent's first visit and on its last visit, then average per category. The categories that only exist in our environment — Slack Red Herring, Runbook Trap, Cascading Failure, Trolley Problem — are the ones with the largest positive deltas. That's the most direct evidence that the added training signal is doing real work.
| Category | tasks | first visit | last visit | Δ reward | verdict |
|---|---|---|---|---|---|
| Slack Red Herring | 1 | −6.18 | −5.13 | +1.05 | novelty win |
| Runbook Trap | 1 | −7.83 | −6.93 | +0.90 | novelty win |
| Cascading Failure | 1 | −7.38 | −7.08 | +0.30 | novelty win |
| Trolley Problem | 1 | −6.33 | −6.03 | +0.30 | novelty win |
| DynamoDB Throttling | 20 | −4.49 | −4.49 | ±0.00 | already at ceiling |
| Generated · App Memory Leak | 34 | −5.64 | −6.61 | −0.97 | exploration overshoot |
| Lambda Throttling | 20 | −3.86 | −5.00 | −1.14 | exploration overshoot |
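The per-category deltas in this table reduce to a first/last fold over the chronological episode stream. A minimal sketch, with an assumed (task_id, category, reward) record shape:

```python
from collections import defaultdict

def category_deltas(episodes):
    # episodes: chronological [(task_id, category, reward), ...]
    first, last, cat_of = {}, {}, {}
    for tid, cat, reward in episodes:
        first.setdefault(tid, reward)   # reward on the task's first visit
        last[tid] = reward              # overwritten until the last visit
        cat_of[tid] = cat
    per_cat = defaultdict(list)
    for tid in first:
        per_cat[cat_of[tid]].append(last[tid] - first[tid])
    return {cat: sum(d) / len(d) for cat, d in per_cat.items()}
```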
The hardest scenarios have the most reward signal to extract — every Slack message that's a clue, every runbook line that's a trap, every cascade hop that needs describe_topology first. PPO finds those gradients faster than on already-saturated easy tasks.
DynamoDB throttling has one obvious fix (scale_write_capacity). The agent solves it on visit 1 and on visit 60 — there's nothing left to learn. Flat ≠ broken.
Lambda Throttling and "Other Easy" categories show exploration overshoot — late-stage entropy nudged the policy off a memorised solution. Visible in the per-update reward dip after u30 in the chart above. A 4th shard would smooth this out; we ran out of free Kaggle GPU minutes.
The legacy baseline proves the env is solvable. The PPO Kaggle run proves the agent learns on the hard version of the same env: KL and loss both decay 50–66% across all three shards, all three shards converge on the same −0.315 peak reward, and the four hardest novelty categories each show a positive Δ reward between first and last visit (+0.30 to +1.05). Aggregate reward staying negative is a property of the rubric, not a property of the policy.
Filter by category, difficulty, or shard. Click any card for the full ground-truth action chain, the design intention, and the reward trajectory across each shard.
Aggregated mean reward per scenario category, averaged over all 3 shards.
The reward signal lives in [-2.0, +1.0] by design — most components are penalties:

- step_cost −0.01 / step
- acting_blind −0.20
- red_herring −0.15
- repeat_command −0.15 / repeat, cap −0.45
- wrong_service −0.15
- time_penalty −0.05 / step beyond step 5
- phase_regression −0.10
- blast_radius_increase −0.10

Positive components (root_cause_correct +0.30, correct_mitigation +0.20, postmortem_quality up to +0.20, phase_order +0.10) only fire late in well-behaved episodes. Early in PPO, the actor explores at random and trips every penalty — the absolute mean starts deeply negative. Improvement is measured by the trend (last update vs. first update, or last episode vs. first episode for a single task) — not by hitting positive numbers in 60 updates on a frozen 4-bit base. The merged adapter is the seed for a longer run, not the finished product.
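Read as code, the shaper is a sum of fired components clipped to the design range. A minimal sketch: the weights are the documented values, the composition function and its signature are assumed:

```python
COMPONENTS = {
    "step_cost": -0.01, "acting_blind": -0.20, "red_herring": -0.15,
    "repeat_command": -0.15,           # multiplier capped at 3 → the −0.45 cap
    "wrong_service": -0.15, "time_penalty": -0.05,
    "phase_regression": -0.10, "blast_radius_increase": -0.10,
    "root_cause_correct": +0.30, "correct_mitigation": +0.20,
    "postmortem_quality": +0.20,       # "up to": scaled by grade in the real shaper
    "phase_order": +0.10,
}

def shape(fired: dict) -> float:
    # fired maps component name → multiplier (step count, repeat count, or a 0-1 grade)
    total = sum(COMPONENTS[name] * mult for name, mult in fired.items())
    return max(-2.0, min(1.0, total))  # reward lives in [-2.0, +1.0] by design
```

For example, shape({"step_cost": 7, "red_herring": 1}) evaluates to −0.22.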
For each PPO update, the collector picks rollouts_per_update=3 tasks via a persistent cursor (round-robins through the entire shard list — never resets). Each rollout runs up to 12 steps; the env terminates early if the agent submits a postmortem.
```python
# colab/train_lib.py — IncidentRolloutCollector.collect
for ep in range(n_episodes):
    tid = self.tasks[self._cursor % len(self.tasks)]   # persistent round-robin cursor
    self._cursor += 1
    obs = env.reset(tid)
    for t in range(max_steps):
        action, meta = actor.act(obs)
        value = critic.value(obs, action)
        step = env.step(action)          # env may terminate early on a postmortem
        transitions.append(...)
```
Generalised Advantage Estimation with γ=0.95, λ=0.92. Critic provides a value-baseline per (state, action) so the policy gradient is variance-reduced.
```python
def compute_gae(trans, gamma, lam):
    # reversed scan: delta_t = r_t + γ·V(s_{t+1}) − V(s_t); A_t = delta_t + γλ·A_{t+1}
    advs, ret = [], 0.0
    for t in reversed(trans):
        delta = t.reward + gamma * t.next_v - t.value
        ret = delta + gamma * lam * ret
        advs.append((ret, ret + t.value))   # (advantage, return)
    return advs[::-1]
```
Clipped surrogate with KL penalty + entropy bonus. clip_eps=0.2, kl_coef=0.02, entropy_coef=0.01. 2 PPO epochs per update with mini-batch 4. Gradient checkpointing OFF on the 3.8 B actor — fits in 16 GB VRAM and roughly halves backward time.
```
ratio = exp(logp - old_logp)
surr1 = ratio * adv
surr2 = clamp(ratio, 1-ε, 1+ε) * adv
loss  = -min(surr1, surr2).mean()
      + kl_coef * (old_logp - logp).mean()   # sample estimate of KL(old ‖ new)
# the entropy bonus (entropy_coef=0.01) is subtracted from this loss in the full update
```
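For completeness, the same objective in runnable PyTorch with the entropy bonus included; the tensor arguments are assumed to hold one scalar per sampled action:

```python
import torch

def ppo_loss(logp, old_logp, adv, entropy,
             clip_eps=0.2, kl_coef=0.02, entropy_coef=0.01):
    ratio = torch.exp(logp - old_logp)
    surr1 = ratio * adv
    surr2 = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    policy_loss = -torch.min(surr1, surr2).mean()
    kl_pen = (old_logp - logp).mean()        # sample estimate of KL(old ‖ new)
    return policy_loss + kl_coef * kl_pen - entropy_coef * entropy.mean()
```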
Modulo-3 split: shard k takes tasks i where i ≡ k (mod 3). The 381 sorted task ids divide cleanly into 127 + 127 + 127 — disjoint and exhaustive. After all 3 finish, scripts/merge_lora_adapters.py takes a weighted mean of the three adapters into one final delta.
```python
tasks = [t for i, t in enumerate(sorted_ids) if i % n_shards == shard]
# Shard 0 → 127 tasks, shard 1 → 127, shard 2 → 127
```
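scripts/merge_lora_adapters.py is described as taking a weighted mean of the three adapters. A sketch of that operation over the shard safetensors files, assuming equal weights and identical keys across shards:

```python
from safetensors.torch import load_file, save_file

def merge_adapters(paths, weights=(1/3, 1/3, 1/3)):
    # weighted mean of LoRA deltas; assumes all shards share identical keys/shapes
    merged = None
    for path, w in zip(paths, weights):
        shard = load_file(path)   # adapter_model.safetensors from one Kaggle shard
        if merged is None:
            merged = {k: w * v for k, v in shard.items()}
        else:
            for k, v in shard.items():
                merged[k] = merged[k] + w * v
    return merged

save_file(merge_adapters([f"adapter_kaggle{n}/adapter_model.safetensors"
                          for n in (1, 2, 3)]),
          "adapter_merged/adapter_model.safetensors")
```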
Code lives on GitHub. Each Kaggle notebook clones the repo at run-time
(git clone --depth 1), so any commit is picked up
automatically without re-uploading the notebook. Models are attached as
Kaggle Models so they live in the read-only mount and don't eat the 20 GB
working quota.
r1cksync/meta-rl-hack — single source of truth.
Notebook does git clone --depth 1 on every run.
Phi-3.5-mini + DeepSeek-R1 from Kaggle Models read-only mount.
60 updates × 3 rollouts × 12 steps; live ETA every 5 s.
adapter_kaggle{N}.zip downloaded; merged offline.
Identical 9-cell structure. Only differences: IC_TASK_SHARD ∈ {0,1,2} and IC_RUN_NAME ∈ {kaggle1,kaggle2,kaggle3}.
Tasks at sorted indices 0,3,6,…,378 — 127 unique tasks. Output: adapter_kaggle1.zip.
Tasks at sorted indices 1,4,7,…,379. Output: adapter_kaggle2.zip.
Tasks at sorted indices 2,5,8,…,380. Output: adapter_kaggle3.zip.
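Cell 6 (see the cell table below) is pure environment plumbing. A sketch: only IC_TASK_SHARD and IC_RUN_NAME are documented names; the remaining keys are illustrative stand-ins for the knobs the cell sets:

```python
import os

os.environ.update({
    "IC_TASK_SHARD": "0",             # documented: 0, 1, or 2
    "IC_RUN_NAME": "kaggle1",         # documented: kaggle1 / kaggle2 / kaggle3
    # illustrative names for the remaining knobs described in cell 6:
    "IC_TOTAL_UPDATES": "60",
    "IC_ROLLOUTS_PER_UPDATE": "3",
    "IC_MAX_STEPS": "12",
    "IC_CHECKPOINT_EVERY": "15",
})
```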
| Cell | Purpose | What it does |
|---|---|---|
| 1 | Title (markdown) | Attach instructions for the 2 Kaggle Models + GPU/Internet/Persistence settings. |
| 2 | Install | Best-effort pip install unsloth, then pin transformers ≥ 4.51, peft, accelerate, bitsandbytes. Prints the unsloth return-code so failures are visible but non-fatal. |
| 3 | GPU sanity | nvidia-smi -L + torch.cuda.is_available(). |
| 4 | Verify mounts + suppress warnings | Asserts the 2 Kaggle Models are attached, redirects HF cache to /tmp/hf-cache, wipes any cached custom modeling code, optionally pulls HF_TOKEN from Kaggle Secrets, installs warning filters. |
| 5 | Clone repo | git clone --depth 1 https://github.com/r1cksync/meta-rl-hack.git — fresh on every run, prints the commit hash. |
| 6 | Configure run | Sets all IC_* environment variables: shard index, run name, total updates, rollouts, max steps, checkpoint cadence, model paths. |
| 7 | Train | subprocess.run(['python','scripts/run_training.py']) — produces colab/logs/training_kaggle{N}.json + adapter checkpoints every 15 updates. |
| 8 | Package | Zips the final adapter to /kaggle/working/adapter_kaggle{N}.zip, copies the JSON log to /kaggle/working/. |
| 9 | Done (markdown) | Merge instructions for scripts/merge_lora_adapters.py. |
The agent's write actions normally land in a mock cluster, but the same code path can target a real Kubernetes cluster. The Terraform module provisions a 3-node k3s cluster on Hetzner Cloud (€20/month), wires a load balancer, and installs the AcmeCorp microservices (frontend, payments-api, inventory-service, notification-service, order-worker) via Helm.
Defined in infra/terraform/main.tf:
- hcloud_network — private 10.0.0.0/16 VPC
- hcloud_network_subnet — 10.0.1.0/24 in eu-central
- hcloud_ssh_key — uploads ~/.ssh/id_rsa.pub
- hcloud_server (count = 3) — cx21 nodes (1 master, 2 worker)
- hcloud_load_balancer — lb11 in front of the cluster
- hcloud_load_balancer_target (count = 3) — health-checked targets
- hcloud_load_balancer_service · http/https — public ingress

```bash
# 1. Provision cluster (€20/mo, ~5 min)
cd infra/terraform
terraform init
terraform apply -var="hcloud_token=$HCLOUD_TOKEN"

# 2. Install k3s on the master (master IP from `terraform output`)
ssh root@$MASTER 'curl -sfL https://get.k3s.io | sh -'

# 3. Helm-install the AcmeCorp microservices
helm install acmecorp infra/helm/acmecorp \
  --set image.tag=$GIT_SHA

# 4. Point the agent at the cluster
export REAL_K8S=true
export KUBECONFIG=~/.kube/config
python -m rl_agent.server
```
| Path | Type | Purpose |
|---|---|---|
| rl-agent/scenarios/{easy,medium,hard}/*.json | scenario | 23 hand-curated incident archetypes. Each has id, difficulty, title, description, preconditions, correct_action_chain, target_score, max_steps. |
| rl-agent/scenarios/sim/{easy,medium,hard}/*.json | scenario | 381 simulator-grade scenarios (156 easy + 128 medium + 97 hard) used for RL training. Adds topology_overrides, saboteur, slack, traffic_profile, k8s_controller, seed. |
| colab/logs/training_kaggle{1,2,3}.json | training log | Per-update metrics for one shard: update, elapsed_s, wall_s, mean_reward, mean_value, ppo{loss, kl, policy_loss, value_err}, rewards_by_task. Union of rewards_by_task keys = full coverage proof. |
| kaggle ran notebooks/shard {1,2,3}/adapter_kaggle{N}/adapter_config.json | peft | LoRA configuration: r=16, alpha=32, dropout=0, target_modules=[qkv_proj, o_proj, gate_up_proj, down_proj]. |
| kaggle ran notebooks/shard {1,2,3}/adapter_kaggle{N}/adapter_model.safetensors | weights | The actual LoRA delta — ~50 MB per shard. Loadable with PeftModel.from_pretrained(base, path). |
| rl-agent/showcase_data.json | derived | Bundle consumed by this page. Built by scripts/build_showcase_data.py from the 3 training logs + 381 scenarios. |
| openenv.yaml | OpenEnv manifest | Declares HTTP API for OpenEnv compliance: /reset, /step, /state, /health. |
| frontend/package.json | npm | Next.js 14 AcmeCorp e-commerce surface used as both the chaos-target topology and live UI. |
| frontend/tsconfig.json | typescript | Strict TS configuration for the live app. |
| frontend/tailwind.config.js / postcss.config.js | tailwind | Styling stack for the AcmeCorp app. |
| backend/{payments-api,inventory-service,notification-service,order-worker}/package.json | npm | Per-service Node.js apps that get rolled out, restarted, scaled, and patched by agent actions. |
| infra/terraform/main.tf | terraform | Hetzner Cloud cluster provisioning (network, subnet, ssh key, 3 servers, load balancer, listener services). |
| infra/k8s/*.yaml | kubernetes | Deployments, Services, ConfigMaps, ChaosMesh experiments for the live cluster. |