LEGACY DATASET
These charts come from the kube-sre-gym-style heuristic and early notebook runs: the 11 hand-curated tasks in rl-agent/scenarios/{easy,medium,hard}/*.json, with results recorded in rl-agent/checkpoints/<run>/metrics.jsonl and colab/logs/reward_breakdown_history.jsonl. They do not include the 381-task PPO Kaggle run.
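
Both log files use the .jsonl extension, so they presumably hold one JSON object per line. A minimal sketch of reading them follows; the helper name load_jsonl is hypothetical, and no field names are assumed because the record schema is not described here.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Return a list with one parsed JSON value per non-blank line of `path`."""
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:  # tolerate blank lines between records
                records.append(json.loads(line))
    return records
```

For example, load_jsonl("colab/logs/reward_breakdown_history.jsonl") would return the recorded entries as Python dicts, assuming that file exists in the working tree.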

Cluster health

Backend status snapshot (simulated EKS state)

Namespaces & deployments

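| Namespace | Deployment | Replicas | Ready | Status |
|---|---|---|---|---|
| acmecorp | payments-api | 3 | 3 | Running |
| acmecorp | inventory-service | 2 | 2 | Running |
| acmecorp | checkout-frontend | 2 | 2 | Running |
| acmecorp | auth-service | 2 | 1 | Degraded |
| acmecorp | order-worker | 3 | 3 | Running |
| acmecorp | notification-service | 2 | 2 | Running |
| chaos-mesh | chaos-controller-manager | 1 | 1 | Running |
| observability | prometheus-server | 1 | 1 | Running |
| observability | loki | 1 | 1 | Running |
| observability | grafana | 1 | 1 | Running |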
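
The Status column appears to follow from the two counts: a deployment is Running when every desired replica is ready, and Degraded otherwise. The sketch below encodes that rule as an assumption (the backend's actual logic is not shown here) and applies it to the rows in the snapshot; deployment_status is a hypothetical name.

```python
# Rows copied from the snapshot: (namespace, deployment, replicas, ready).
ROWS = [
    ("acmecorp", "payments-api", 3, 3),
    ("acmecorp", "inventory-service", 2, 2),
    ("acmecorp", "checkout-frontend", 2, 2),
    ("acmecorp", "auth-service", 2, 1),
    ("acmecorp", "order-worker", 3, 3),
    ("acmecorp", "notification-service", 2, 2),
    ("chaos-mesh", "chaos-controller-manager", 1, 1),
    ("observability", "prometheus-server", 1, 1),
    ("observability", "loki", 1, 1),
    ("observability", "grafana", 1, 1),
]


def deployment_status(replicas, ready):
    """Assumed rule: Running only when all desired replicas are ready."""
    return "Running" if ready >= replicas else "Degraded"


degraded = [
    name
    for _, name, replicas, ready in ROWS
    if deployment_status(replicas, ready) == "Degraded"
]
print(degraded)  # ['auth-service']
```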

Observability stack (live probes)

Each integration probes the configured URL (env vars PROMETHEUS_URL, LOKI_URL, GRAFANA_URL, ALERTMANAGER_URL).
| Service | URL | Status | Detail |
|---|---|---|---|
| Prometheus | not set | unset | |
| Loki | not set | unset | |
| Grafana | not set | unset | |
| Alertmanager | not set | unset | |
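
A probe of this kind could be sketched as below. The four environment variable names come from the text above; the function name probe, the "ok" and "error" status labels, the plain GET request, and the 2-second timeout are all assumptions, since only the "unset" state appears in the snapshot.

```python
import os
import urllib.error
import urllib.request

# Service name -> environment variable holding its base URL (names from the text).
PROBE_ENV_VARS = {
    "Prometheus": "PROMETHEUS_URL",
    "Loki": "LOKI_URL",
    "Grafana": "GRAFANA_URL",
    "Alertmanager": "ALERTMANAGER_URL",
}


def probe(service, env=os.environ, timeout=2.0):
    """Return one table row (as a dict) for `service`."""
    url = env.get(PROBE_ENV_VARS[service])
    if not url:
        # Matches the snapshot: no URL configured, so nothing is contacted.
        return {"service": service, "url": "not set", "status": "unset", "detail": ""}
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            # "ok" is a hypothetical label for any successful HTTP response.
            return {"service": service, "url": url, "status": "ok",
                    "detail": f"HTTP {resp.status}"}
    except (urllib.error.URLError, OSError, ValueError) as exc:
        # Covers HTTP errors, refused connections, timeouts, and malformed URLs.
        return {"service": service, "url": url, "status": "error", "detail": str(exc)}


rows = [probe(name) for name in PROBE_ENV_VARS]
```

Passing env explicitly, for example probe("Loki", env={}), keeps the function easy to exercise without touching the real environment or the network.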

Fault injectors available

| Kind | Effect |
|---|---|
| image_tag_bump | Roll deployment to a nonexistent tag → ImagePullBackOff |
| env_mutation | Inject a bad env var to break startup |
| resource_limit | Shrink CPU/memory limit to force OOMKill |
| liveness_probe | Flip liveness probe to always-fail → CrashLoopBackOff |
| network_latency | tc netem injection between pod & Redis |
| configmap_corruption | Mutate ConfigMap key → race conditions |
| secret_rotation | Delete JWT secret → auth regressions |
| namespace_quota | Apply ResourceQuota that blocks scheduling |
| dns_chaos | CoreDNS rewrite → NXDOMAIN |
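
To make one of these concrete, the image_tag_bump fault can be pictured as a pure transformation of a Deployment manifest. The sketch below is an illustration only, not the project's injector: the function name bump_image_tag, the default tag "does-not-exist", and the sample image reference are hypothetical. It operates on the standard Deployment layout (spec.template.spec.containers[].image) held as a plain dict.

```python
import copy


def bump_image_tag(deployment, bad_tag="does-not-exist"):
    """Return a copy of `deployment` with every container image retagged.

    Limitation: digest-pinned references (image@sha256:...) are not handled.
    """
    mutated = copy.deepcopy(deployment)  # leave the caller's manifest untouched
    for container in mutated["spec"]["template"]["spec"]["containers"]:
        # Split on the last "/" first so a registry port (host:5000) is kept intact.
        repo_path, sep, last = container["image"].rpartition("/")
        name = last.split(":", 1)[0]  # drop the existing tag, if any
        container["image"] = f"{repo_path}{sep}{name}:{bad_tag}"
    return mutated


manifest = {
    "spec": {"template": {"spec": {"containers": [
        {"name": "api", "image": "registry.local:5000/acme/payments-api:1.4.2"},
    ]}}}
}
broken = bump_image_tag(manifest)
```

Applying the mutated manifest to a cluster would leave the kubelet unable to pull the image, which is the ImagePullBackOff effect listed in the table.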