LEGACY DATASET
These charts come from the kube-sre-gym-style heuristic and early notebook runs: the 11 hand-curated tasks in
rl-agent/scenarios/{easy,medium,hard}/*.json, recorded into
rl-agent/checkpoints/<run>/metrics.jsonl and
colab/logs/reward_breakdown_history.jsonl. They do not include the 381-task PPO Kaggle run.
AWS wiring & training evidence
Account 969831127386 · Region us-east-1 · Live services reachable here: 0/15 · Last training run: 2026-04-23 04:18 UTC
S3 objects uploaded: 7 (bucket: ic-checkpoints-969831127386-us-east-1)
CloudWatch datapoints: 48 (namespace: IncidentCommander)
S3 uploads (last run — prefix runs/1776917730-ppo-v4-hybrid-ollama-groq)
| Object key | Size |
| s3://ic-checkpoints-969831127386-us-east-1/runs/1776917730-ppo-v4-hybrid-ollama-groq/adversarial_history.jsonl | 792 bytes |
| s3://ic-checkpoints-969831127386-us-east-1/runs/1776917730-ppo-v4-hybrid-ollama-groq/cluster_snapshot.json | 1,375 bytes |
| s3://ic-checkpoints-969831127386-us-east-1/runs/1776917730-ppo-v4-hybrid-ollama-groq/critic_shaping.jsonl | 17,808 bytes |
| s3://ic-checkpoints-969831127386-us-east-1/runs/1776917730-ppo-v4-hybrid-ollama-groq/curriculum_snapshot.json | 2,677 bytes |
| s3://ic-checkpoints-969831127386-us-east-1/runs/1776917730-ppo-v4-hybrid-ollama-groq/metrics.jsonl | 5,524 bytes |
| s3://ic-checkpoints-969831127386-us-east-1/runs/1776917730-ppo-v4-hybrid-ollama-groq/reward_breakdown_history.jsonl | 17,526 bytes |
| s3://ic-checkpoints-969831127386-us-east-1/runs/1776917730-ppo-v4-hybrid-ollama-groq/summary.json | 2,977 bytes |
CloudWatch custom metrics published
| Metric | N | Mean | Min | Max |
| MeanReward | 12 | 1.777 | 1.401 | 2.074 |
| MitigationRate | 12 | 44.444 | 0.000 | 100.000 |
| RewardStd | 12 | 0.471 | 0.110 | 0.695 |
| RootCauseRate | 12 | 36.111 | 0.000 | 66.667 |
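The table above lists four metrics with 12 datapoints each, which matches the 48 CloudWatch datapoints reported and the `--updates 12` flag in the training command. A minimal sketch of how one update's stats could be turned into a `PutMetricData` payload follows. The helper names (`build_metric_data`, `publish`) and the choice of `Percent` as the unit for the two rate metrics are assumptions for illustration, not the trainer's actual code.

```python
from datetime import datetime, timezone

NAMESPACE = "IncidentCommander"  # namespace shown on this dashboard

# Assumption: the two rate metrics are percentages in the range 0-100.
UNITS = {"MitigationRate": "Percent", "RootCauseRate": "Percent"}


def build_metric_data(update_stats, timestamp=None):
    """Turn one update's {metric name: value} dict into a MetricData list."""
    ts = timestamp or datetime.now(timezone.utc)
    return [
        {
            "MetricName": name,
            "Value": float(value),
            "Unit": UNITS.get(name, "None"),  # "None" is CloudWatch's unitless unit
            "Timestamp": ts,
        }
        for name, value in sorted(update_stats.items())
    ]


def publish(update_stats):
    """Publish one datapoint per metric; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the payload builder works without it

    client = boto3.client("cloudwatch")
    client.put_metric_data(
        Namespace=NAMESPACE,
        MetricData=build_metric_data(update_stats),
    )
```

Calling `publish` once per PPO update with the four metrics would yield one datapoint per metric per update.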
DynamoDB curriculum writes
| Task | Mastery | Tier |
| DynamoDB table not configured (set DYNAMODB_CURRICULUM_TABLE). |
Live AWS service reachability (this process)
Each integration in environment/aws_integrations.py
is gated on env vars and boto3 credentials. The HF Space typically has no credentials,
so the trainer-side evidence above is the source of truth.
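The gating described above can be sketched roughly as follows. This is not the contents of environment/aws_integrations.py: the `REQUIRED_ENV` mapping, the `service_state` helper and the `"configured"` label are hypothetical. Only the variable names `S3_CHECKPOINT_BUCKET`, `DYNAMODB_CURRICULUM_TABLE` and `XRAY_ENABLED` appear on this page.

```python
import os

# Hypothetical mapping from service to the env vars that switch it on.
REQUIRED_ENV = {
    "S3": ["S3_CHECKPOINT_BUCKET"],
    "DynamoDB": ["DYNAMODB_CURRICULUM_TABLE"],
    "X-Ray": ["XRAY_ENABLED"],
}


def service_state(service, env=None, have_creds=False):
    """Return the dashboard state for a service.

    `have_creds` stands in for a real credential check, e.g.
    `boto3.Session().get_credentials() is not None`.
    """
    env = os.environ if env is None else env
    missing = [var for var in REQUIRED_ENV.get(service, []) if not env.get(var)]
    if missing or not have_creds:
        return "not configured here"
    return "configured"  # assumed label; the real dashboard may word it differently
```

With no env vars and no credentials, every service reports `not configured here`, which is what the table below shows for this process.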
| Service | State | Detail |
| CloudWatch Logs | not configured here | — |
| S3 | not configured here | prefix: grpo |
| DynamoDB | not configured here | — |
| CloudWatch Metrics | not configured here | namespace: IncidentCommander |
| SNS | not configured here | — |
| Secrets Manager | not configured here | — |
| SQS | not configured here | — |
| Lambda | not configured here | — |
| EventBridge | not configured here | bus: default · source: incident-commander.agent |
| X-Ray | not configured here | note: set XRAY_ENABLED=1 to probe |
| SSM | not configured here | note: GetParameter access |
| KMS | not configured here | — |
| EKS | not configured here | — |
| ECR | not configured here | — |
| Bedrock | not configured here | note: IAM policy grants bedrock:InvokeModel |
Architecture (provisioned by infra/aws/main.tf)
| Service | Role in IncidentCommander |
| VPC + subnets + NAT | Dedicated 10.20.0.0/16 network across 2 AZs. |
| EKS | Managed Kubernetes where AcmeCorp microservices + Chaos Mesh run. |
| ECR | Private registry for the env server + trainer images. |
| S3 | Versioned bucket for training snapshots, metrics & critic shaping logs. |
| DynamoDB | PITR-enabled table holding per-run / per-task mastery. |
| CloudWatch Logs | Two log groups: /aws/eks/.../application + /aws/incident-commander/agent. |
| CloudWatch Metrics | Custom IncidentCommander namespace + ErrorRate > 5% alarm. |
| SNS | Alert topic the agent publishes to when a mitigation is applied. |
| Secrets Manager | Stores OPENAI_API_KEY so no secrets are baked into the image. |
| IAM (IRSA) | Least-privilege pod role — S3/Dynamo/Logs/CW/SNS/Secrets/Bedrock. |
| Lambda | Subscribed to the SNS topic; POSTs /reset to trigger new episodes. |
| Bedrock | Optional second critic backend (Claude / Titan via InvokeModel). |
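The Lambda row above describes a function subscribed to the SNS topic that POSTs /reset. A minimal sketch of such a handler, using only the standard library, might look like this. The `ENV_SERVER_URL` environment variable, the empty JSON body and the return shape are assumptions; the page only states that the function POSTs to /reset.

```python
import json
import os
import urllib.request


def build_reset_request(base_url):
    """Build the POST /reset request for the env server."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/reset",
        data=json.dumps({}).encode("utf-8"),  # assumed: empty JSON body
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def handler(event, context):
    """Lambda entry point: one /reset call per SNS invocation."""
    base_url = os.environ["ENV_SERVER_URL"]  # hypothetical variable name
    records = event.get("Records", [])  # SNS delivers messages under "Records"
    with urllib.request.urlopen(build_reset_request(base_url), timeout=10) as resp:
        return {"status": resp.status, "records": len(records)}
```

Keeping the request construction separate from the network call makes the URL and method easy to check without a live env server.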
Recorder errors (last run)
| Operation | Error |
| No errors recorded. |
How the bucket gets data
python -m training.train_hybrid --critic groq --updates 12 \
--rollouts-per-update 3 --episode-limit 8 --all-tasks \
--out-dir checkpoints/ppo-v4-hybrid-ollama-groq \
--env-file ../.env.aws.local
# Trainer uploads the run dir to s3://$S3_CHECKPOINT_BUCKET/runs/<ts>-<name>/
# Publishes per-update CloudWatch metrics under namespace IncidentCommander
# Writes per-task mastery to DynamoDB ($DYNAMODB_CURRICULUM_TABLE if set)
# Saves aws_evidence.json so this dashboard can render real ARNs/keys.
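The upload step in the comments above maps a local run directory to keys under runs/<ts>-<name>/. A rough sketch of that mapping is shown here. The function names are hypothetical and the trainer's real implementation may differ; `upload_file(Filename, Bucket, Key)` is the standard boto3 S3 client call.

```python
from pathlib import Path


def s3_keys_for_run(run_dir, timestamp, name):
    """Map each file in run_dir to its key under runs/<ts>-<name>/."""
    prefix = f"runs/{timestamp}-{name}"
    run_dir = Path(run_dir)
    return {
        str(path): f"{prefix}/{path.relative_to(run_dir).as_posix()}"
        for path in sorted(run_dir.rglob("*"))
        if path.is_file()
    }


def upload_run(run_dir, bucket, timestamp, name):
    """Upload every file in run_dir; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the key mapping works without it

    s3 = boto3.client("s3")
    for local_path, key in s3_keys_for_run(run_dir, timestamp, name).items():
        s3.upload_file(local_path, bucket, key)
```

For the last run, a timestamp of 1776917730 and the name ppo-v4-hybrid-ollama-groq would give keys such as runs/1776917730-ppo-v4-hybrid-ollama-groq/metrics.jsonl, matching the object keys listed earlier on this page.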