Here's the operational posture of a typical legacy application:
- CI/CD pipeline: none
- Automated tests: somewhere between "a few" and "aspirational"
- Infrastructure as Code: none
- Monitoring: the server admin checks if the process is running
- Alerting: someone calls when the website is down
- Runbooks: tribal knowledge
- SLOs: undefined
Sound familiar? You're not alone.
This is normal. These applications were built when "operations" meant "restart the service when it crashes." The observability architect agent designs everything from scratch — not just monitoring, but the entire operational posture for a cloud-native system.
SLOs that evolve with the migration
The agent defines two sets of SLOs: migration phase and post-stabilization. You can't hold a system in active migration to the same availability targets as a stable production system.
During migration, targets are relaxed — 99.5% availability for primary services, 99.0% for admin interfaces. Post-stabilization, they tighten — 99.9% for critical paths, 99.5% for supporting services.
The error budget policy is concrete: at 50% burn, alert and review. At 80% burn, freeze non-critical deploys. At 100% burn, incident review required before the next release.
These aren't aspirational numbers. They're calibrated to the architecture — multi-AZ ECS Fargate with Aurora Multi-AZ can realistically deliver 99.9%. A single-AZ deployment can't. The SLOs match the infrastructure investment.
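One way to sanity-check whether an architecture can realistically deliver three nines is serial availability math: a request path is at best the product of its dependencies' availabilities. A sketch with illustrative component numbers (not published SLA figures):

```python
def serial_availability(*components: float) -> float:
    """Availability of a request path that needs every component to be up."""
    a = 1.0
    for c in components:
        a *= c
    return a

# e.g. load balancer → multi-AZ ECS service → Aurora Multi-AZ (illustrative numbers)
print(round(serial_availability(0.9999, 0.9995, 0.9995), 4))  # → 0.9989
```

A single-AZ tier at, say, 99.5% drags the whole path below 99.9% no matter how good everything else is.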
Per-service monitoring
Every service in the target architecture gets a monitoring specification with specific metrics and alarm thresholds:
- Container services — CPU, memory, running task count, healthy host count, response time
- Serverless functions — invocations, errors, duration, throttles, concurrent executions
- Databases — CPU, freeable memory, connection count, read/write latency, replication lag
- Caches — engine CPU, cache hit rate, evictions, replication lag. The hit rate alarm matters most — if it drops below your target (typically 85-95% depending on workload), the cache isn't absorbing enough reads and the database is getting hammered
- API Gateway — 4xx/5xx error rates, latency, integration latency
Notification routing follows severity: P1-Critical → PagerDuty, P2-High → Slack #ops-alerts, P3-Medium → Slack #ops-info, P4-Low → email digest.
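As one concrete instance of the pattern, here is a sketch of the cache hit-rate alarm with severity-based routing. The topic ARNs, cluster name, and routing table are hypothetical; the dict is shaped like the kwargs for CloudWatch's `put_metric_alarm`:

```python
# Hypothetical severity → SNS destination map (the PagerDuty, Slack, and
# email integrations hang off these topics).
ROUTING = {
    "P1": "arn:aws:sns:us-east-1:123456789012:pagerduty-critical",
    "P2": "arn:aws:sns:us-east-1:123456789012:slack-ops-alerts",
    "P3": "arn:aws:sns:us-east-1:123456789012:slack-ops-info",
    "P4": "arn:aws:sns:us-east-1:123456789012:email-digest",
}

def cache_hit_rate_alarm(cluster_id: str, threshold: float, severity: str) -> dict:
    """Kwargs for cloudwatch.put_metric_alarm: fire when hit rate sags, sustained."""
    return {
        "AlarmName": f"{cluster_id}-cache-hit-rate-low",
        "Namespace": "AWS/ElastiCache",
        "MetricName": "CacheHitRate",
        "Dimensions": [{"Name": "CacheClusterId", "Value": cluster_id}],
        "Statistic": "Average",
        "Period": 300,               # 5-minute datapoints
        "EvaluationPeriods": 3,      # sustained drop, not a blip
        "Threshold": threshold,      # e.g. 85.0 (%)
        "ComparisonOperator": "LessThanThreshold",
        "AlarmActions": [ROUTING[severity]],
    }

alarm = cache_hit_rate_alarm("prod-redis", 85.0, "P2")
```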
Distributed tracing and structured logging
The agent designs X-Ray tracing across the full request path — from CloudFront through API Gateway, into containers or Lambda, through database calls, and back. Sampling rates vary by environment: 5-10% head-based sampling in production (for cost management), a higher rate in staging, and 100% in development. Consider tail-based sampling to ensure error and high-latency traces are always captured regardless of the head-based rate.
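Those environment-specific rates map onto X-Ray sampling rules. A sketch of the rule shape (matching the `SamplingRule` parameter of `create_sampling_rule`), with illustrative names and rates:

```python
# Illustrative per-environment sampling rule; FixedRate is the fraction of
# requests sampled once the per-second reservoir is exhausted.
def sampling_rule(env: str) -> dict:
    fixed_rate = {"production": 0.05, "staging": 0.5, "development": 1.0}[env]
    return {
        "RuleName": f"{env}-default",
        "Priority": 100,
        "FixedRate": fixed_rate,
        "ReservoirSize": 1,      # guarantee at least ~1 trace/sec regardless of rate
        "ServiceName": "*",
        "ServiceType": "*",
        "Host": "*",
        "HTTPMethod": "*",
        "URLPath": "*",
        "ResourceARN": "*",
        "Version": 1,
    }
```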
For legacy runtimes in containers (ColdFusion on Lucee, Java on Tomcat), X-Ray daemon runs as a sidecar. For Lambda, the X-Ray SDK layer. The key is correlating traces across the boundary between the legacy container and new serverless services — this is what makes the Strangler Fig transition debuggable.
All services emit JSON-structured logs with consistent fields: timestamp, level, service name, trace ID, correlation ID, request ID, user ID, action, duration, environment. These align with OpenTelemetry semantic conventions where applicable. This consistency is what makes CloudWatch Logs Insights queries work across services.
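A minimal Python sketch of a formatter that emits that field set as one JSON object per line; the exact field names here are an assumption matching the list above:

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line with the shared field set."""

    def __init__(self, service: str, env: str):
        super().__init__()
        self.service, self.env = service, env

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "service": self.service,
            "trace_id": getattr(record, "trace_id", None),        # from X-Ray, if present
            "correlation_id": getattr(record, "correlation_id", None),
            "request_id": getattr(record, "request_id", None),
            "user_id": getattr(record, "user_id", None),
            "action": record.getMessage(),
            "duration_ms": getattr(record, "duration_ms", None),
            "environment": self.env,
        })

logger = logging.getLogger("orders")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter("orders", "production"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("order.created", extra={"request_id": "req-123", "duration_ms": 42})
```

Because every service shares the field names, a single Logs Insights query like `filter level = "ERROR" | stats count() by service` works across the whole fleet.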
Log retention is tiered: hot (CloudWatch, 30 days), warm (S3 Standard-IA, 90 days), cold (S3 Glacier, 1 year). For regulated industries, audit log retention requirements may extend to 7 years or more — the agent flags this as a compliance consideration and adjusts the Glacier lifecycle policy accordingly.
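The warm and cold tiers live in an S3 lifecycle configuration. A sketch of its shape (as passed to `put_bucket_lifecycle_configuration`), using the retention days above; the prefix is hypothetical:

```python
# Objects land in S3 (e.g. via CloudWatch Logs export) and age through tiers.
LIFECYCLE = {
    "Rules": [{
        "ID": "tiered-log-retention",
        "Status": "Enabled",
        "Filter": {"Prefix": "logs/"},                              # hypothetical prefix
        "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],   # warm → cold
        "Expiration": {"Days": 365},  # extend for regulated audit logs (7+ years)
    }]
}
```

The hot tier is configured separately as a retention policy on the CloudWatch log group itself (30 days in this design).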
Observability cost considerations
Observability isn't free, and it can cause bill shock if not managed. CloudWatch Logs ingestion runs ~$0.50/GB, and X-Ray charges ~$5 per million traces recorded. For a high-traffic application producing gigabytes of logs daily, these costs add up.
The agent designs cost-aware observability: production sampling rates for X-Ray (5% rather than 100%), log level filtering (INFO and above in production, DEBUG in development), metric filters that extract signals from logs without storing every line, and CloudWatch Logs Insights for ad-hoc queries rather than always-on log analytics. Container Insights provides simplified ECS monitoring with pre-built dashboards — useful for teams new to CloudWatch.
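A metric filter is the cheapest of these levers: it turns matching log lines into a metric at ingestion time, so you can alarm on the signal without an always-on query. A sketch of the filter shape (as passed to `put_metric_filter`); log group and metric names are hypothetical:

```python
# Count ERROR-level JSON log lines as a CloudWatch metric at ingestion time.
ERROR_FILTER = {
    "logGroupName": "/ecs/orders",              # hypothetical log group
    "filterName": "error-count",
    "filterPattern": '{ $.level = "ERROR" }',   # JSON log filter syntax
    "metricTransformations": [{
        "metricName": "OrderServiceErrors",
        "metricNamespace": "Legacy/Migration",
        "metricValue": "1",                     # emit 1 per matching line
    }],
}
```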
CI/CD from scratch
Most legacy codebases have inadequate or entirely manual deployment processes. The agent designs a modern pipeline:
Build: Lint infrastructure templates (cfn-lint), run security rules (cdk-nag), execute unit tests, build container images, package Lambda functions, run contract tests between services (Pact).
Quality gates: Container images scanned by Amazon Inspector — Critical or High CVE fails the build. Infrastructure validated against security best practices. Contract tests verify inter-service API compatibility.
Deployment: ECS services get Blue/Green via CodeDeploy — traffic shifts to the new task set, and if the 5xx error rate on canary traffic exceeds a threshold (typically 0.5-1% for critical services, up to 2% for lower-priority services) during a bake period, CodeDeploy automatically rolls back. Lambda functions get canary deployments: 10% of traffic routes to the new version for 5 minutes, and if the error rate stays below 1%, traffic shifts to 100%. If errors spike during the canary window, the alias reverts to the previous version automatically. Database migrations are forward-only and backward-compatible — new columns are nullable, old columns retained, no destructive changes until the next release validates the migration.
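The Lambda canary boils down to a weighted alias plus a gate. A minimal sketch; the function name is hypothetical, and `RoutingConfig` matches the shape Lambda's `update_alias` expects:

```python
def canary_alias(new_version: str, weight: float = 0.10) -> dict:
    """Kwargs for lambda.update_alias: route `weight` of traffic to the new version."""
    return {
        "FunctionName": "orders-handler",   # hypothetical function
        "Name": "live",
        "RoutingConfig": {"AdditionalVersionWeights": {new_version: weight}},
    }

def canary_gate(error_rate: float, threshold: float = 0.01) -> str:
    """After the bake window: promote only if the canary stayed under the threshold."""
    return "promote-to-100%" if error_rate < threshold else "rollback-alias"
```

In practice CodeDeploy's `Canary10Percent5Minutes`-style configurations automate exactly this shift-watch-promote loop; the sketch just makes the decision explicit.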
Post-deploy: CloudWatch Synthetics canaries run critical user journeys continuously. If a canary fails shortly after deployment, the deployment rolls back automatically.
Chaos engineering
The agent designs AWS Fault Injection Service experiments:
- AZ failure — terminate all compute in one AZ, verify traffic shifts to survivors
- Database failover — force primary failover to replica, verify application reconnects
- Service throttling — inject throttling on extracted services, verify circuit breaker falls back to legacy
- Cache failure — force cache failover, verify application tolerates cold cache
- Network disruption — inject packet loss between compute and database tiers
- Dependency unavailability — simulate SQS/EventBridge/external API outages, verify graceful degradation
- Latency injection — add artificial delay to downstream calls, verify timeouts and circuit breakers behave correctly
These validate that the resilience designed into the architecture actually works. In other words: break it on purpose before your customers break it by accident. Run them in staging regularly, in production occasionally with an incident commander on-call.
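As a sketch of what one of these experiments looks like on paper, here is the AZ-failure drill shaped like an FIS experiment template (`create_experiment_template`). The ARNs are placeholders, and the target filter path is an assumption to check against the FIS action reference:

```python
# Illustrative FIS template: stop all ECS tasks in one AZ, abort automatically
# if the availability alarm fires (the stop condition is the safety net).
AZ_FAILURE_EXPERIMENT = {
    "description": "Terminate compute in us-east-1a; verify traffic shifts to survivors",
    "roleArn": "arn:aws:iam::123456789012:role/fis-experiment-role",   # placeholder
    "targets": {
        "tasks-in-1a": {
            "resourceType": "aws:ecs:task",
            "selectionMode": "ALL",
            "filters": [{"path": "AvailabilityZone", "values": ["us-east-1a"]}],
        }
    },
    "actions": {
        "stop-tasks": {
            "actionId": "aws:ecs:stop-task",
            "targets": {"Tasks": "tasks-in-1a"},
        }
    },
    "stopConditions": [{
        "source": "aws:cloudwatch:alarm",
        "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:availability-slo",
    }],
}
```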
Disaster recovery
The agent designs DR tiered by service criticality:
| Tier | RTO | RPO | Approach |
|---|---|---|---|
| Critical (primary app, API, database) | Minutes | ~1 min | Multi-AZ with auto-failover, cross-region read replica |
| Important (admin, auth) | Under an hour | Minutes | Multi-AZ, managed service HA |
| Standard (batch jobs, email, workflows) | Hours | ~1 hour | Idempotent handlers, SQS message retention, replay from DLQ |
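The Standard tier's replay-from-DLQ approach only works because handlers are idempotent: replaying a message that was already processed must be a no-op. A pure-Python sketch of that property:

```python
def replay(dlq: list[dict], handler, processed: set) -> list[dict]:
    """Drain dead-letter messages through an idempotent handler.

    `processed` tracks message IDs already handled, so a second replay
    (or a duplicate delivery) does nothing. Returns messages that failed again.
    """
    failed = []
    for msg in dlq:
        if msg["id"] in processed:
            continue                 # idempotency: already handled, skip
        try:
            handler(msg)
            processed.add(msg["id"])
        except Exception:
            failed.append(msg)       # leave for the next replay attempt
    return failed
```

In production the dedupe record would live somewhere durable (a conditional write to DynamoDB, say), not in an in-memory set.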
Regional DR uses a pilot light strategy: database cross-region read replica, container image replication, S3 cross-region replication, infrastructure pre-synthesized as CDK for rapid deployment.
The testing pyramid
The agent designs a complete testing strategy because modernization without testing is just a rewrite with extra steps:
| Layer | Purpose | Coverage Target | Rationale |
|---|---|---|---|
| Unit | Pure function / method testing | 80% for new code, 40% for migrated legacy | New code has no excuse for low coverage; legacy code gets pragmatic coverage focused on the riskiest paths (data access, auth, business rules) |
| Contract | Consumer-driven API contracts (Pact) | 100% of inter-service API surfaces | Every service boundary is a potential breaking point — contract tests are the safety net |
| Component | Single service in isolation | 70% per service including DB interactions | Validates service behavior with real database queries, not just mocks |
| Integration | Multi-service with real AWS resources | All CRUD operations succeed, data consistency verified across services | Catches issues that unit and component tests miss — network, serialization, auth |
| Performance | Load / stress / soak testing | P99 < target latency, sustain peak load | Establishes baseline and validates the architecture handles expected traffic |
| Chaos | Fault injection | Graceful degradation on AZ failure, DB failover | Validates resilience claims from the architecture |
| Synthetic | Production canary monitors | 99.9% success on critical user journeys | Continuous production validation — catches regressions between deployments |
The performance baseline step is critical: capture current metrics before cutover. If you don't know what "normal" looks like today, you can't tell if the cloud deployment is better or worse.
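"Normal" here usually means a percentile, not an average. A minimal nearest-rank P99 over captured latency samples:

```python
import math

def p99(samples: list[float]) -> float:
    """Nearest-rank P99: the latency at or below which 99% of samples fall."""
    s = sorted(samples)
    idx = max(0, math.ceil(0.99 * len(s)) - 1)
    return s[idx]

# Capture this from the legacy system before cutover, then compare after.
print(p99([float(i) for i in range(1, 101)]))  # → 99.0
```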
What we've learned
The observability gap is the most consistent finding across legacy codebases. In most applications we've analyzed, monitoring is minimal or nonexistent, test coverage is low, and deployment is manual. The observability architect's job is typically to design from scratch.
The good news: designing from scratch means you get to do it right. No legacy monitoring debt. No flaky test suites. No Jenkins server held together with duct tape. You start with the modern stack and build forward.
Cost signal: The observability stack isn't free — CloudWatch Logs ingestion, X-Ray traces, Container Insights, and tiered log retention (especially S3 Glacier for compliance) all carry costs that the cost analyzer accounts for in Part 6.
Next: Part 6 — The Money Talk — where the cost analyzer pulls real AWS pricing data and builds the TCO model that gets the project funded. The observability and security costs flagged in Parts 4-5 feed directly into the model.