LEGACY DATASET
These charts come from the kube-sre-gym-style heuristic and early notebook runs: the 11 hand-curated tasks in
rl-agent/scenarios/{easy,medium,hard}/*.json, with metrics recorded to
rl-agent/checkpoints/<run>/metrics.jsonl and
colab/logs/reward_breakdown_history.jsonl. They do not include the 381-task PPO Kaggle run.
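Both log files are JSON Lines, so they can be read with a few lines of standard-library Python. The sketch below is illustrative only: `load_jsonl` is a hypothetical helper, not a function from the repository, and it makes no assumption about which fields each record carries.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_jsonl(path: Path) -> List[Dict[str, Any]]:
    """Read one JSON object per non-empty line; return [] if the file is absent.

    Hypothetical helper for illustration; the record schema is not assumed.
    """
    if not path.exists():
        return []
    records: List[Dict[str, Any]] = []
    with path.open("r", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # skip blank lines instead of failing on them
                records.append(json.loads(line))
    return records


if __name__ == "__main__":
    # Path taken from the text above; it may not exist in a fresh checkout.
    history = load_jsonl(Path("colab/logs/reward_breakdown_history.jsonl"))
    print(f"loaded {len(history)} records")
```

Returning an empty list for a missing file keeps a dashboard renderable before any run has written logs; a stricter loader could raise instead.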
Cluster health
Backend status snapshot (simulated EKS state)
Namespaces & deployments
| Namespace | Deployment | Replicas | Ready | Status |
| --- | --- | --- | --- | --- |
| acmecorp | payments-api | 3 | 3 | Running |
| acmecorp | inventory-service | 2 | 2 | Running |
| acmecorp | checkout-frontend | 2 | 2 | Running |
| acmecorp | auth-service | 2 | 1 | Degraded |
| acmecorp | order-worker | 3 | 3 | Running |
| acmecorp | notification-service | 2 | 2 | Running |
| chaos-mesh | chaos-controller-manager | 1 | 1 | Running |
| observability | prometheus-server | 1 | 1 | Running |
| observability | loki | 1 | 1 | Running |
| observability | grafana | 1 | 1 | Running |
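The Status column appears to follow directly from the Replicas and Ready counts: every row with all replicas ready is Running, and the one row with fewer ready replicas (auth-service, 1 of 2) is Degraded. A minimal sketch of that rule, assuming it is the only one in play; `DeploymentRow` and `derive_status` are hypothetical names, and the source does not say how zero ready replicas would be labelled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DeploymentRow:
    """One row of the snapshot table (hypothetical structure)."""
    namespace: str
    deployment: str
    replicas: int
    ready: int


def derive_status(row: DeploymentRow) -> str:
    """Inferred rule: Running when every replica is ready, Degraded otherwise."""
    return "Running" if row.ready >= row.replicas else "Degraded"


rows = [
    DeploymentRow("acmecorp", "payments-api", 3, 3),
    DeploymentRow("acmecorp", "auth-service", 2, 1),
]
for row in rows:
    print(f"{row.namespace}/{row.deployment}: {derive_status(row)}")
```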
Observability stack (live probes)
Each integration probes its configured URL, read from the environment variables PROMETHEUS_URL,
LOKI_URL, GRAFANA_URL, and ALERTMANAGER_URL. A service with no URL configured is reported as unset.
| Service | URL | Status | Detail |
| --- | --- | --- | --- |
| Prometheus | not set | unset | |
| Loki | not set | unset | |
| Grafana | not set | unset | |
| Alertmanager | not set | unset | |
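A probe of this kind can be sketched with the standard library alone. The environment variable names come from the text above; everything else is an assumption for illustration: the `probe` function, the `up` and `down` status labels, and the choice of a plain HTTP GET against the base URL rather than a service-specific health endpoint.

```python
import os
import urllib.error
import urllib.request
from typing import Dict, Mapping, Optional

# Service name -> environment variable holding its base URL (names from the text).
ENV_VARS = {
    "Prometheus": "PROMETHEUS_URL",
    "Loki": "LOKI_URL",
    "Grafana": "GRAFANA_URL",
    "Alertmanager": "ALERTMANAGER_URL",
}


def probe(service: str,
          env: Optional[Mapping[str, str]] = None,
          timeout: float = 2.0) -> Dict[str, str]:
    """Return one table row for `service`; 'up'/'down' are hypothetical labels."""
    env = os.environ if env is None else env
    url = env.get(ENV_VARS[service], "").strip()
    if not url:
        # Matches the rows shown above when nothing is configured.
        return {"service": service, "url": "not set", "status": "unset", "detail": ""}
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            detail = f"HTTP {response.status}"
        return {"service": service, "url": url, "status": "up", "detail": detail}
    except (urllib.error.URLError, ValueError, OSError) as exc:
        # URLError covers HTTP errors and refused connections; ValueError covers malformed URLs.
        return {"service": service, "url": url, "status": "down", "detail": str(exc)}


if __name__ == "__main__":
    for name in ENV_VARS:
        print(probe(name))
```

Passing `env` explicitly makes the function testable without touching the real environment.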
Fault injectors available
| Kind | Effect |
| --- | --- |
| image_tag_bump | Roll deployment to a nonexistent tag → ImagePullBackOff |
| env_mutation | Inject a bad env var to break startup |
| resource_limit | Shrink CPU/memory limits to force CPU throttling or an OOMKill |
| liveness_probe | Flip liveness probe to always-fail → CrashLoopBackOff |
| network_latency | tc netem injection between pod & Redis |
| configmap_corruption | Mutate ConfigMap key → race conditions |
| secret_rotation | Delete JWT secret → auth regressions |
| namespace_quota | Apply ResourceQuota that blocks scheduling |
| dns_chaos | CoreDNS rewrite → NXDOMAIN |
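As an illustration of what one injector might send to the cluster, the sketch below builds a strategic merge patch body for image_tag_bump: it swaps the tag on one container's image for a tag that does not exist, which leads to ImagePullBackOff once the rollout starts. The function names, the example image, and the `does-not-exist` tag are hypothetical; the source does not show how the injectors are implemented, and applying the patch (for example with kubectl or a Kubernetes client) is left out. Images pinned by digest are not handled.

```python
from typing import Any, Dict


def strip_tag(image: str) -> str:
    """Drop a trailing ':tag' while keeping a registry port such as 'registry:5000'."""
    last_segment = image.rsplit("/", 1)[-1]
    if ":" in last_segment:
        return image.rsplit(":", 1)[0]
    return image


def image_tag_bump_patch(container: str, image: str, bad_tag: str) -> Dict[str, Any]:
    """Build a Deployment patch body that points `container` at a nonexistent tag.

    Strategic merge patches match entries in `containers` by `name`,
    so only the named container's image is changed.
    """
    return {
        "spec": {
            "template": {
                "spec": {
                    "containers": [
                        {"name": container, "image": f"{strip_tag(image)}:{bad_tag}"}
                    ]
                }
            }
        }
    }


if __name__ == "__main__":
    import json

    patch = image_tag_bump_patch(
        "payments-api", "registry:5000/acme/payments-api:1.4.2", "does-not-exist"
    )
    print(json.dumps(patch, indent=2))
```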