LEGACY DATASET
These charts come from the kube-sre-gym-style heuristic and early notebook runs: the 11 hand-curated tasks in rl-agent/scenarios/{easy,medium,hard}/*.json, recorded into rl-agent/checkpoints/<run>/metrics.jsonl and colab/logs/reward_breakdown_history.jsonl. They do not include the 381-task PPO Kaggle run.

AWS wiring & training evidence

Account 969831127386 · Region us-east-1 · Live services reachable here: 0/15 · Last training run: 2026-04-23 04:18 UTC
S3 objects uploaded: 7 (bucket: ic-checkpoints-969831127386-us-east-1)
CloudWatch datapoints: 48 (namespace: IncidentCommander)
CloudWatch log events: 12
DynamoDB writes: 0
SNS messages: 0
Errors: 0

S3 uploads (last run — prefix runs/1776917730-ppo-v4-hybrid-ollama-groq)

Object key                                                                                                           Size
s3://ic-checkpoints-969831127386-us-east-1/runs/1776917730-ppo-v4-hybrid-ollama-groq/adversarial_history.jsonl       792 bytes
s3://ic-checkpoints-969831127386-us-east-1/runs/1776917730-ppo-v4-hybrid-ollama-groq/cluster_snapshot.json           1,375 bytes
s3://ic-checkpoints-969831127386-us-east-1/runs/1776917730-ppo-v4-hybrid-ollama-groq/critic_shaping.jsonl            17,808 bytes
s3://ic-checkpoints-969831127386-us-east-1/runs/1776917730-ppo-v4-hybrid-ollama-groq/curriculum_snapshot.json        2,677 bytes
s3://ic-checkpoints-969831127386-us-east-1/runs/1776917730-ppo-v4-hybrid-ollama-groq/metrics.jsonl                   5,524 bytes
s3://ic-checkpoints-969831127386-us-east-1/runs/1776917730-ppo-v4-hybrid-ollama-groq/reward_breakdown_history.jsonl  17,526 bytes
s3://ic-checkpoints-969831127386-us-east-1/runs/1776917730-ppo-v4-hybrid-ollama-groq/summary.json                    2,977 bytes

CloudWatch custom metrics published

Metric           N    Mean    Min      Max
MeanReward      12   1.777  1.401    2.074
MitigationRate  12  44.444  0.000  100.000
RewardStd       12   0.471  0.110    0.695
RootCauseRate   12  36.111  0.000   66.667
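
The four metrics above could be published once per PPO update with a single put_metric_data call. The sketch below is an assumption about how a trainer might do that, not the project's actual recorder: build_metric_data and publish_update are hypothetical names, and the "None" unit is an illustrative choice. Only the namespace and the metric names come from this page.

```python
from datetime import datetime, timezone

NAMESPACE = "IncidentCommander"  # namespace shown on this dashboard
METRIC_NAMES = ("MeanReward", "MitigationRate", "RewardStd", "RootCauseRate")


def build_metric_data(stats, timestamp=None):
    """Build the MetricData list for one PPO update (hypothetical helper)."""
    timestamp = timestamp or datetime.now(timezone.utc)
    missing = [name for name in METRIC_NAMES if name not in stats]
    if missing:
        raise KeyError(f"missing metrics: {missing}")
    return [
        {
            "MetricName": name,
            "Value": float(stats[name]),
            "Unit": "None",  # assumption: the real recorder may use other units
            "Timestamp": timestamp,
        }
        for name in METRIC_NAMES
    ]


def publish_update(stats):
    """Send one update's metrics to CloudWatch; needs boto3 and credentials."""
    import boto3  # imported lazily so the payload builder works without it

    client = boto3.client("cloudwatch")
    client.put_metric_data(
        Namespace=NAMESPACE,
        MetricData=build_metric_data(stats),
    )
```

With 4 metrics per update and 12 updates, this scheme yields 4 x 12 = 48 datapoints, which matches the CloudWatch datapoints counter above.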

DynamoDB curriculum writes

Task  Mastery  Tier
DynamoDB table not configured (set DYNAMODB_CURRICULUM_TABLE).

Live AWS service reachability (this process)

Each integration in environment/aws_integrations.py is gated on environment variables and boto3 credentials. The HF Space typically has no credentials, so the trainer-side evidence above is the source of truth.
Service             State                Detail
CloudWatch Logs     not configured here
S3                  not configured here  prefix: grpo
DynamoDB            not configured here
CloudWatch Metrics  not configured here  namespace: IncidentCommander
SNS                 not configured here
Secrets Manager     not configured here
SQS                 not configured here
Lambda              not configured here
EventBridge         not configured here  bus: default · source: incident-commander.agent
X-Ray               not configured here  note: set XRAY_ENABLED=1 to probe
SSM                 not configured here  note: GetParameter access
KMS                 not configured here
EKS                 not configured here
ECR                 not configured here
Bedrock             not configured here  note: IAM policy grants bedrock:InvokeModel
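
The gating described above (environment variables plus boto3 credentials) can be sketched as follows. This is a minimal illustration under stated assumptions, not the contents of environment/aws_integrations.py: REQUIRED_ENV, integration_state and the "configured" state label are hypothetical, and the mapping lists only variables that appear on this page.

```python
import os

# Hypothetical mapping from integration name to the env vars that enable it.
# Only these three variable names appear on this page; the real module may
# require more, or different ones.
REQUIRED_ENV = {
    "DynamoDB": ("DYNAMODB_CURRICULUM_TABLE",),
    "S3": ("S3_CHECKPOINT_BUCKET",),
    "X-Ray": ("XRAY_ENABLED",),
}


def boto3_credentials_available():
    """Return True if boto3 is installed and can resolve credentials."""
    try:
        import boto3
    except ImportError:
        return False
    return boto3.Session().get_credentials() is not None


def integration_state(name, env=None, has_credentials=None):
    """Report whether one integration is usable in this process."""
    env = os.environ if env is None else env
    if has_credentials is None:
        has_credentials = boto3_credentials_available()
    missing = [var for var in REQUIRED_ENV[name] if not env.get(var)]
    if missing or not has_credentials:
        return "not configured here"
    return "configured"  # assumption: the dashboard's label for this state
```

Passing env and has_credentials explicitly keeps the check testable without touching real AWS credentials.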

Architecture (provisioned by infra/aws/main.tf)

Service              Role in IncidentCommander
VPC + subnets + NAT  Dedicated 10.20.0.0/16 network across 2 AZs.
EKS                  Managed Kubernetes where AcmeCorp microservices + Chaos Mesh run.
ECR                  Private registry for the env server + trainer images.
S3                   Versioned bucket for training snapshots, metrics & critic shaping logs.
DynamoDB             PITR-enabled table holding per-run / per-task mastery.
CloudWatch Logs      Two log groups: /aws/eks/.../application + /aws/incident-commander/agent.
CloudWatch Metrics   Custom IncidentCommander namespace + ErrorRate > 5% alarm.
SNS                  Alert topic the agent publishes to when a mitigation is applied.
Secrets Manager      Stores OPENAI_API_KEY so no secrets are baked into the image.
IAM (IRSA)           Least-privilege pod role: S3/Dynamo/Logs/CW/SNS/Secrets/Bedrock.
Lambda               Subscribed to the SNS topic; POSTs /reset to trigger new episodes.
Bedrock              Optional second critic backend (Claude / Titan via InvokeModel).
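
The SNS-to-Lambda hop in the table above could look roughly like this. It is a sketch under assumptions, not the deployed function: the ENV_SERVER_URL variable, the empty JSON request body, and the choice of one reset per SNS record are all hypothetical. Only the POST to /reset and the SNS subscription come from this page.

```python
import json
import os
import urllib.request


def post_reset(base_url, timeout=5.0):
    """POST an empty JSON body to <base_url>/reset and return the HTTP status."""
    request = urllib.request.Request(
        base_url.rstrip("/") + "/reset",
        data=json.dumps({}).encode("utf-8"),  # assumption: /reset accepts {}
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


def handler(event, context, post=post_reset):
    """Lambda entry point: trigger one reset per SNS record in the event."""
    # ENV_SERVER_URL is a hypothetical setting naming the env server.
    base_url = os.environ.get("ENV_SERVER_URL", "http://localhost:8000")
    sns_records = [r for r in event.get("Records", []) if "Sns" in r]
    statuses = [post(base_url) for _ in sns_records]
    return {"resets": len(statuses), "statuses": statuses}
```

The post parameter exists only so the handler can be exercised locally with a stub; Lambda itself calls handler(event, context).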

Recorder errors (last run)

Operation  Error
No errors recorded.

How the bucket gets data

python -m training.train_hybrid --critic groq --updates 12 \
    --rollouts-per-update 3 --episode-limit 8 --all-tasks \
    --out-dir checkpoints/ppo-v4-hybrid-ollama-groq \
    --env-file ../.env.aws.local

# Trainer uploads the run dir to s3://$S3_CHECKPOINT_BUCKET/runs/<ts>-<name>/
# Publishes per-update CloudWatch metrics under namespace IncidentCommander
# Writes per-task mastery to DynamoDB ($DYNAMODB_CURRICULUM_TABLE if set)
# Saves aws_evidence.json so this dashboard can render real ARNs/keys.
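
The upload step in the comments above can be sketched like this. It is an illustration under assumptions rather than the trainer's code: run_prefix, object_key and upload_run_dir are hypothetical names, and <ts> is assumed to be Unix epoch seconds (1776917730 decodes to 2026-04-23 04:15:30 UTC, which is consistent with the last training run shown above).

```python
import os
from pathlib import Path


def run_prefix(started_at, name):
    """Build the runs/<ts>-<name>/ prefix from a timezone-aware start time."""
    return f"runs/{int(started_at.timestamp())}-{name}/"


def object_key(prefix, run_dir, path):
    """Map a file inside run_dir to its S3 object key under prefix."""
    relative = Path(path).relative_to(Path(run_dir))
    return prefix + relative.as_posix()


def upload_run_dir(run_dir, prefix):
    """Upload every file in run_dir; needs boto3, credentials and the bucket."""
    import boto3  # imported lazily so the key helpers work without it

    bucket = os.environ["S3_CHECKPOINT_BUCKET"]
    s3 = boto3.client("s3")
    for path in sorted(Path(run_dir).rglob("*")):
        if path.is_file():
            s3.upload_file(str(path), bucket, object_key(prefix, run_dir, path))
```

Keeping the key construction separate from the boto3 call makes the naming scheme checkable without network access.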