AWS wiring & training evidence

Account 969831127386 · Region us-east-1 · Live services reachable here: 0/15 · Last training run: 2026-04-23 04:18 UTC

S3 objects uploaded

bucket: ic-checkpoints-969831127386-us-east-1

CloudWatch datapoints

namespace: IncidentCommander

CloudWatch log events

DynamoDB writes

SNS messages

Errors

S3 uploads (last run — prefix `runs/1776917730-ppo-v4-hybrid-ollama-groq`)

Object key	Size
`s3://ic-checkpoints-969831127386-us-east-1/runs/1776917730-ppo-v4-hybrid-ollama-groq/adversarial_history.jsonl`	792 bytes
`s3://ic-checkpoints-969831127386-us-east-1/runs/1776917730-ppo-v4-hybrid-ollama-groq/cluster_snapshot.json`	1,375 bytes
`s3://ic-checkpoints-969831127386-us-east-1/runs/1776917730-ppo-v4-hybrid-ollama-groq/critic_shaping.jsonl`	17,808 bytes
`s3://ic-checkpoints-969831127386-us-east-1/runs/1776917730-ppo-v4-hybrid-ollama-groq/curriculum_snapshot.json`	2,677 bytes
`s3://ic-checkpoints-969831127386-us-east-1/runs/1776917730-ppo-v4-hybrid-ollama-groq/metrics.jsonl`	5,524 bytes
`s3://ic-checkpoints-969831127386-us-east-1/runs/1776917730-ppo-v4-hybrid-ollama-groq/reward_breakdown_history.jsonl`	17,526 bytes
`s3://ic-checkpoints-969831127386-us-east-1/runs/1776917730-ppo-v4-hybrid-ollama-groq/summary.json`	2,977 bytes
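
The object keys above follow the `runs/<ts>-<name>/` layout. The sketch below shows one way such keys could be derived from a local run directory. It is not the trainer's actual code: the helper names `run_prefix`, `plan_uploads` and `upload_run` are hypothetical, and only the boto3 `upload_file` call is a real API.

```python
from pathlib import Path
from typing import Iterator, Tuple


def run_prefix(timestamp: int, name: str) -> str:
    """Build the per-run prefix, e.g. runs/1776917730-ppo-v4-hybrid-ollama-groq."""
    return f"runs/{timestamp}-{name}"


def plan_uploads(run_dir: Path, prefix: str) -> Iterator[Tuple[Path, str]]:
    """Yield (local_path, object_key) for every file under run_dir, ordered by key."""
    pairs = [
        (path, f"{prefix}/{path.relative_to(run_dir).as_posix()}")
        for path in run_dir.rglob("*")
        if path.is_file()
    ]
    yield from sorted(pairs, key=lambda pair: pair[1])


def upload_run(run_dir: Path, bucket: str, prefix: str) -> int:
    """Upload every planned file and return the number of objects sent."""
    import boto3  # imported lazily so the planning helpers work without AWS deps

    s3 = boto3.client("s3")
    count = 0
    for path, key in plan_uploads(run_dir, prefix):
        s3.upload_file(str(path), bucket, key)
        count += 1
    return count
```

Keeping the key planning separate from the upload makes it possible to check the layout locally before any credentials are involved.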

CloudWatch custom metrics published

Metric	N	Mean	Min	Max
MeanReward	12	1.777	1.401	2.074
MitigationRate	12	44.444	0.000	100.000
RewardStd	12	0.471	0.110	0.695
RootCauseRate	12	36.111	0.000	66.667
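
As a rough illustration of how the columns in this table relate to per-update values, the sketch below aggregates a list of samples into N/Mean/Min/Max and builds a payload for CloudWatch `put_metric_data`. The function names and the choice of `Unit: "None"` are assumptions; the trainer's real publishing code is not shown on this page.

```python
from typing import Dict, List, Sequence


def summarize(samples: Sequence[float]) -> Dict[str, float]:
    """Aggregate per-update samples into the N / Mean / Min / Max columns."""
    if not samples:
        raise ValueError("need at least one sample")
    return {
        "N": len(samples),
        "Mean": sum(samples) / len(samples),
        "Min": min(samples),
        "Max": max(samples),
    }


def build_metric_data(values: Dict[str, float]) -> List[dict]:
    """Turn one update's metric values into a MetricData list (sorted by name)."""
    return [
        {"MetricName": name, "Value": float(value), "Unit": "None"}  # unit is an assumption
        for name, value in sorted(values.items())
    ]


def publish(namespace: str, values: Dict[str, float]) -> None:
    """Send one update's metrics to CloudWatch under the given namespace."""
    import boto3  # lazy import: the helpers above need no AWS dependency

    boto3.client("cloudwatch").put_metric_data(
        Namespace=namespace, MetricData=build_metric_data(values)
    )
```

With 12 updates, calling `publish("IncidentCommander", {...})` once per update would yield 12 datapoints per metric, which matches the N column above.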

DynamoDB curriculum writes

Task	Mastery	Tier
DynamoDB table not configured (set DYNAMODB_CURRICULUM_TABLE).

Live AWS service reachability (this process)

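Each integration in `environment/aws_integrations.py` is enabled only when its environment variables are set and boto3 can resolve credentials. The HF Space typically has no credentials, so every service below reports "not configured here" and the trainer-side evidence above is the source of truth.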

Service	State	Detail
CloudWatch Logs	not configured here	—
S3	not configured here	`prefix`: grpo
DynamoDB	not configured here	—
CloudWatch Metrics	not configured here	`namespace`: IncidentCommander
SNS	not configured here	—
Secrets Manager	not configured here	—
SQS	not configured here	—
Lambda	not configured here	—
EventBridge	not configured here	`bus`: default · `source`: incident-commander.agent
X-Ray	not configured here	`note`: set XRAY_ENABLED=1 to probe
SSM	not configured here	`note`: GetParameter access
KMS	not configured here	—
EKS	not configured here	—
ECR	not configured here	—
Bedrock	not configured here	`note`: IAM policy grants bedrock:InvokeModel
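
The gating behind this table can be sketched as follows. This is a minimal illustration, not the contents of `environment/aws_integrations.py`: the function names and the "configured" label are hypothetical, while `S3_CHECKPOINT_BUCKET` and `DYNAMODB_CURRICULUM_TABLE` are the variable names that appear on this page.

```python
import os
from typing import Callable, Mapping, Optional, Sequence


def boto_credentials_available() -> bool:
    """Return True only if boto3 is installed and can resolve credentials."""
    try:
        import boto3
    except ImportError:
        return False
    return boto3.Session().get_credentials() is not None


def integration_state(
    required_env: Sequence[str],
    env: Optional[Mapping[str, str]] = None,
    has_credentials: Optional[Callable[[], bool]] = None,
) -> str:
    """Report whether an integration may run in this process."""
    env = os.environ if env is None else env
    check = boto_credentials_available if has_credentials is None else has_credentials
    # Gate 1: every required environment variable must be set and non-empty.
    if any(not env.get(name) for name in required_env):
        return "not configured here"
    # Gate 2: credentials must be resolvable, otherwise no call can succeed.
    if not check():
        return "not configured here"
    return "configured"  # hypothetical label; the page only shows the negative state
```

For example, `integration_state(["DYNAMODB_CURRICULUM_TABLE"])` would report "not configured here" in a process where that variable is unset, regardless of credentials.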

Architecture (provisioned by `infra/aws/main.tf`)

Service	Role in IncidentCommander
VPC + subnets + NAT	Dedicated 10.20.0.0/16 network across 2 AZs.
EKS	Managed Kubernetes where AcmeCorp microservices + Chaos Mesh run.
ECR	Private registry for the env server + trainer images.
S3	Versioned bucket for training snapshots, metrics & critic shaping logs.
DynamoDB	PITR-enabled table holding per-run / per-task mastery.
CloudWatch Logs	Two log groups: /aws/eks/.../application + /aws/incident-commander/agent.
CloudWatch Metrics	Custom IncidentCommander namespace + ErrorRate > 5% alarm.
SNS	Alert topic the agent publishes to when a mitigation is applied.
Secrets Manager	Stores OPENAI_API_KEY so no secrets are baked into the image.
IAM (IRSA)	Least-privilege pod role covering S3, DynamoDB, CloudWatch Logs, CloudWatch Metrics, SNS, Secrets Manager and Bedrock.
Lambda	Subscribed to the SNS topic; POSTs /reset to trigger new episodes.
Bedrock	Optional second critic backend (Claude / Titan via InvokeModel).
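
To make the SNS-to-Lambda row concrete, here is a hedged sketch of a handler that POSTs `/reset` once per SNS record. The `ENV_SERVER_URL` variable, the empty JSON request body and the function names are assumptions for illustration; only the SNS event shape (`Records[].EventSource`, `Records[].Sns`) is standard.

```python
import os
import urllib.request
from typing import Callable, Optional


def _post_reset(url: str) -> int:
    """POST an empty JSON body to url and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=b"{}",  # assumed body; the real /reset contract is not shown here
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status


def reset_for_event(
    event: dict, base_url: str, post: Optional[Callable[[str], object]] = None
) -> int:
    """Trigger one /reset per SNS record and return how many were triggered."""
    post = _post_reset if post is None else post
    reset_url = base_url.rstrip("/") + "/reset"
    triggered = 0
    for record in event.get("Records", []):
        if record.get("EventSource") != "aws:sns":
            continue  # ignore anything that did not come from the SNS topic
        post(reset_url)
        triggered += 1
    return triggered


def handler(event, context):
    """Lambda entry point; ENV_SERVER_URL is a hypothetical variable name."""
    return {"triggered": reset_for_event(event, os.environ["ENV_SERVER_URL"])}
```

Passing the `post` callable in explicitly keeps the record filtering testable without a live environment server.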

Recorder errors (last run)

Operation	Error
No errors recorded.

How the bucket gets data

python -m training.train_hybrid --critic groq --updates 12 \
    --rollouts-per-update 3 --episode-limit 8 --all-tasks \
    --out-dir checkpoints/ppo-v4-hybrid-ollama-groq \
    --env-file ../.env.aws.local

# After the run, the trainer:
#   - uploads the run dir to s3://$S3_CHECKPOINT_BUCKET/runs/<ts>-<name>/
#   - publishes per-update CloudWatch metrics under namespace IncidentCommander
#   - writes per-task mastery to DynamoDB (only if $DYNAMODB_CURRICULUM_TABLE is set)
#   - saves aws_evidence.json so this dashboard can render real ARNs and keys