Here's the operational posture of a typical legacy application:
- CI/CD pipeline: none
- Automated tests: somewhere between "a few" and "aspirational"
- Infrastructure as Code: none
- Monitoring: the server admin checks if the process is running
- Alerting: someone calls when the website is down
- Runbooks: tribal knowledge
- SLOs: undefined
Sound familiar? You're not alone.
This is normal. These applications were built when "operations" meant "restart the service when it crashes." The observability architect agent designs everything from scratch — not just monitoring, but the entire operational posture for a cloud-native system.
SLOs that evolve with the migration
The agent defines two sets of SLOs: migration phase and post-stabilization. You can't hold a system in active migration to the same availability targets as a stable production system.
During migration, targets are relaxed — 99.5% availability for primary services, 99.0% for admin interfaces. Post-stabilization, they tighten — 99.9% for critical paths, 99.5% for supporting services.
The error budget policy is concrete: at 50% burn, alert and review. At 80% burn, freeze non-critical deploys. At 100% burn, incident review required before the next release.
These aren't aspirational numbers. They're calibrated to the architecture — multi-AZ ECS Fargate with Aurora Multi-AZ can realistically deliver 99.9%. A single-AZ deployment can't. The SLOs match the infrastructure investment.
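One way to sanity-check whether an architecture can realistically deliver three nines is serial availability math: a request path is at best the product of its dependencies' availabilities. A sketch with illustrative component numbers (not published SLA figures):

```python
def serial_availability(*components: float) -> float:
    """Availability of a request path that needs every component to be up."""
    a = 1.0
    for c in components:
        a *= c
    return a

# e.g. load balancer → multi-AZ ECS service → Aurora Multi-AZ (illustrative numbers)
print(round(serial_availability(0.9999, 0.9995, 0.9995), 4))  # → 0.9989
```

A single-AZ tier at, say, 99.5% drags the whole path below 99.9% no matter how good everything else is.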
Per-service monitoring
Every service in the target architecture gets a monitoring specification with specific metrics and alarm thresholds:
- Container services — CPU, memory, running task count, healthy host count, response time
- Serverless functions — invocations, errors, duration, throttles, concurrent executions
- Databases — CPU, freeable memory, connection count, read/write latency, replication lag
- Caches — engine CPU, cache hit rate, evictions, replication lag. The hit rate alarm matters most — if it drops below your target (typically 85-95% depending on workload), the cache isn't absorbing enough reads and the database is getting hammered
- API Gateway — 4xx/5xx error rates, latency, integration latency
Notification routing follows severity: P1-Critical → PagerDuty, P2-High → Slack #ops-alerts, P3-Medium → Slack #ops-info, P4-Low → email digest.
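As one concrete instance of the pattern, here is a sketch of the cache hit-rate alarm with severity-based routing. The topic ARNs, cluster name, and routing table are hypothetical; the dict is shaped like the kwargs for CloudWatch's `put_metric_alarm`:

```python
# Hypothetical severity → SNS destination map (the PagerDuty, Slack, and
# email integrations hang off these topics).
ROUTING = {
    "P1": "arn:aws:sns:us-east-1:123456789012:pagerduty-critical",
    "P2": "arn:aws:sns:us-east-1:123456789012:slack-ops-alerts",
    "P3": "arn:aws:sns:us-east-1:123456789012:slack-ops-info",
    "P4": "arn:aws:sns:us-east-1:123456789012:email-digest",
}

def cache_hit_rate_alarm(cluster_id: str, threshold: float, severity: str) -> dict:
    """Kwargs for cloudwatch.put_metric_alarm: fire when hit rate sags, sustained."""
    return {
        "AlarmName": f"{cluster_id}-cache-hit-rate-low",
        "Namespace": "AWS/ElastiCache",
        "MetricName": "CacheHitRate",
        "Dimensions": [{"Name": "CacheClusterId", "Value": cluster_id}],
        "Statistic": "Average",
        "Period": 300,               # 5-minute datapoints
        "EvaluationPeriods": 3,      # sustained drop, not a blip
        "Threshold": threshold,      # e.g. 85.0 (%)
        "ComparisonOperator": "LessThanThreshold",
        "AlarmActions": [ROUTING[severity]],
    }

alarm = cache_hit_rate_alarm("prod-redis", 85.0, "P2")
```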
Distributed tracing and structured logging
The agent designs X-Ray tracing across the full request path — from CloudFront through API Gateway, into containers or Lambda, through database calls, and back. Sampling rates vary by environment: 5-10% head-based sampling in production (for cost management), a higher rate in staging, and 100% in development. Consider tail-based sampling to ensure error and high-latency traces are always captured regardless of the head-based rate.
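Those environment-specific rates map onto X-Ray sampling rules. A sketch of the rule shape (matching the `SamplingRule` parameter of `create_sampling_rule`), with illustrative names and rates:

```python
# Illustrative per-environment sampling rule; FixedRate is the fraction of
# requests sampled once the per-second reservoir is exhausted.
def sampling_rule(env: str) -> dict:
    fixed_rate = {"production": 0.05, "staging": 0.5, "development": 1.0}[env]
    return {
        "RuleName": f"{env}-default",
        "Priority": 100,
        "FixedRate": fixed_rate,
        "ReservoirSize": 1,      # guarantee at least ~1 trace/sec regardless of rate
        "ServiceName": "*",
        "ServiceType": "*",
        "Host": "*",
        "HTTPMethod": "*",
        "URLPath": "*",
        "ResourceARN": "*",
        "Version": 1,
    }
```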
For legacy runtimes in containers (ColdFusion on Lucee, Java on Tomcat), X-Ray daemon runs as a sidecar. For Lambda, the X-Ray SDK layer. The key is correlating traces across the boundary between the legacy container and new serverless services — this is what makes the Strangler Fig transition debuggable.
All services emit JSON-structured logs with consistent fields: timestamp, level, service name, trace ID, correlation ID, request ID, user ID, action, duration, environment. These align with OpenTelemetry semantic conventions where applicable. This consistency is what makes CloudWatch Logs Insights queries work across services.
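A minimal Python sketch of a formatter that emits that field set as one JSON object per line; the exact field names here are an assumption matching the list above:

```python
import json
import logging
import time

class JsonFormatter(logging.Formatter):
    """Emit one JSON object per log line with the shared field set."""

    def __init__(self, service: str, env: str):
        super().__init__()
        self.service, self.env = service, env

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(record.created)),
            "level": record.levelname,
            "service": self.service,
            "trace_id": getattr(record, "trace_id", None),        # from X-Ray, if present
            "correlation_id": getattr(record, "correlation_id", None),
            "request_id": getattr(record, "request_id", None),
            "user_id": getattr(record, "user_id", None),
            "action": record.getMessage(),
            "duration_ms": getattr(record, "duration_ms", None),
            "environment": self.env,
        })

logger = logging.getLogger("orders")
handler = logging.StreamHandler()
handler.setFormatter(JsonFormatter("orders", "production"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.info("order.created", extra={"request_id": "req-123", "duration_ms": 42})
```

Because every service shares the field names, a single Logs Insights query like `filter level = "ERROR" | stats count() by service` works across the whole fleet.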
Log retention is tiered: hot (CloudWatch, 30 days), warm (S3 Standard-IA, 90 days), cold (S3 Glacier, 1 year). For regulated industries, audit log retention requirements may extend to 7 years or more — the agent flags this as a compliance consideration and adjusts the Glacier lifecycle policy accordingly.
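The warm and cold tiers live in an S3 lifecycle configuration. A sketch of its shape (as passed to `put_bucket_lifecycle_configuration`), using the retention days above; the prefix is hypothetical:

```python
# Objects land in S3 (e.g. via CloudWatch Logs export) and age through tiers.
LIFECYCLE = {
    "Rules": [{
        "ID": "tiered-log-retention",
        "Status": "Enabled",
        "Filter": {"Prefix": "logs/"},                              # hypothetical prefix
        "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],   # warm → cold
        "Expiration": {"Days": 365},  # extend for regulated audit logs (7+ years)
    }]
}
```

The hot tier is configured separately as a retention policy on the CloudWatch log group itself (30 days in this design).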
Observability cost considerations
Observability isn't free, and it can cause bill shock if not managed. CloudWatch Logs ingestion runs ~$0.50/GB, and X-Ray charges ~$5 per million traces recorded. For a high-traffic application producing gigabytes of logs daily, these costs add up.
The agent designs cost-aware observability: production sampling rates for X-Ray (5% rather than 100%), log level filtering (INFO and above in production, DEBUG in development), metric filters that extract signals from logs without storing every line, and CloudWatch Logs Insights for ad-hoc queries rather than always-on log analytics. Container Insights provides simplified ECS monitoring with pre-built dashboards — useful for teams new to CloudWatch.
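A metric filter is the cheapest of these levers: it turns matching log lines into a metric at ingestion time, so you can alarm on the signal without an always-on query. A sketch of the filter shape (as passed to `put_metric_filter`); log group and metric names are hypothetical:

```python
# Count ERROR-level JSON log lines as a CloudWatch metric at ingestion time.
ERROR_FILTER = {
    "logGroupName": "/ecs/orders",              # hypothetical log group
    "filterName": "error-count",
    "filterPattern": '{ $.level = "ERROR" }',   # JSON log filter syntax
    "metricTransformations": [{
        "metricName": "OrderServiceErrors",
        "metricNamespace": "Legacy/Migration",
        "metricValue": "1",                     # emit 1 per matching line
    }],
}
```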
CI/CD from scratch
Most legacy codebases have inadequate or entirely manual deployment processes. The agent designs a modern pipeline:
Build: Lint infrastructure templates (cfn-lint), run security rules (cdk-nag), execute unit tests, build container images, package Lambda functions, run contract tests between services (Pact).
Quality gates: Container images scanned by Amazon Inspector — Critical or High CVE fails the build. Infrastructure validated against security best practices. Contract tests verify inter-service API compatibility.
Deployment: ECS services get Blue/Green via CodeDeploy — traffic shifts to the new task set, and if the 5xx error rate on canary traffic exceeds a threshold (typically 0.5-1% for critical services, up to 2% for lower-priority services) during a bake period, CodeDeploy automatically rolls back. Lambda functions get canary deployments: 10% of traffic routes to the new version for 5 minutes, and if the error rate stays below 1%, traffic shifts to 100%. If errors spike during the canary window, the alias reverts to the previous version automatically. Database migrations are forward-only and backward-compatible — new columns are nullable, old columns retained, no destructive changes until the next release validates the migration.
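The Lambda canary boils down to a weighted alias plus a gate. A minimal sketch; the function name is hypothetical, and `RoutingConfig` matches the shape Lambda's `update_alias` expects:

```python
def canary_alias(new_version: str, weight: float = 0.10) -> dict:
    """Kwargs for lambda.update_alias: route `weight` of traffic to the new version."""
    return {
        "FunctionName": "orders-handler",   # hypothetical function
        "Name": "live",
        "RoutingConfig": {"AdditionalVersionWeights": {new_version: weight}},
    }

def canary_gate(error_rate: float, threshold: float = 0.01) -> str:
    """After the bake window: promote only if the canary stayed under the threshold."""
    return "promote-to-100%" if error_rate < threshold else "rollback-alias"
```

In practice CodeDeploy's `Canary10Percent5Minutes`-style configurations automate exactly this shift-watch-promote loop; the sketch just makes the decision explicit.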
Post-deploy: CloudWatch Synthetics canaries run critical user journeys continuously. If a canary fails shortly after deployment, the deployment rolls back automatically.
Chaos engineering
The agent designs AWS Fault Injection Service experiments:
- AZ failure — terminate all compute in one AZ, verify traffic shifts to survivors
- Database failover — force primary failover to replica, verify application reconnects
- Service throttling — inject throttling on extracted services, verify circuit breaker falls back to legacy
- Cache failure — force cache failover, verify application tolerates cold cache
- Network disruption — inject packet loss between compute and database tiers
- Dependency unavailability — simulate SQS/EventBridge/external API outages, verify graceful degradation
- Latency injection — add artificial delay to downstream calls, verify timeouts and circuit breakers behave correctly
These validate that the resilience designed into the architecture actually works. In other words: break it on purpose before your customers break it by accident. Run them in staging regularly, in production occasionally with an incident commander on-call.
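As a sketch of what one of these experiments looks like on paper, here is the AZ-failure drill shaped like an FIS experiment template (`create_experiment_template`). The ARNs are placeholders, and the target filter path is an assumption to check against the FIS action reference:

```python
# Illustrative FIS template: stop all ECS tasks in one AZ, abort automatically
# if the availability alarm fires (the stop condition is the safety net).
AZ_FAILURE_EXPERIMENT = {
    "description": "Terminate compute in us-east-1a; verify traffic shifts to survivors",
    "roleArn": "arn:aws:iam::123456789012:role/fis-experiment-role",   # placeholder
    "targets": {
        "tasks-in-1a": {
            "resourceType": "aws:ecs:task",
            "selectionMode": "ALL",
            "filters": [{"path": "AvailabilityZone", "values": ["us-east-1a"]}],
        }
    },
    "actions": {
        "stop-tasks": {
            "actionId": "aws:ecs:stop-task",
            "targets": {"Tasks": "tasks-in-1a"},
        }
    },
    "stopConditions": [{
        "source": "aws:cloudwatch:alarm",
        "value": "arn:aws:cloudwatch:us-east-1:123456789012:alarm:availability-slo",
    }],
}
```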
Disaster recovery
The agent designs DR tiered by service criticality:
| Tier | RTO | RPO | Approach |
|---|---|---|---|
| Critical (primary app, API, database) | Minutes | ~1 min | Multi-AZ with auto-failover, cross-region read replica |
| Important (admin, auth) | Under an hour | Minutes | Multi-AZ, managed service HA |
| Standard (batch jobs, email, workflows) | Hours | ~1 hour | Idempotent handlers, SQS message retention, replay from DLQ |
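The Standard tier's replay-from-DLQ approach only works because handlers are idempotent: replaying a message that was already processed must be a no-op. A pure-Python sketch of that property:

```python
def replay(dlq: list[dict], handler, processed: set) -> list[dict]:
    """Drain dead-letter messages through an idempotent handler.

    `processed` tracks message IDs already handled, so a second replay
    (or a duplicate delivery) does nothing. Returns messages that failed again.
    """
    failed = []
    for msg in dlq:
        if msg["id"] in processed:
            continue                 # idempotency: already handled, skip
        try:
            handler(msg)
            processed.add(msg["id"])
        except Exception:
            failed.append(msg)       # leave for the next replay attempt
    return failed
```

In production the dedupe record would live somewhere durable (a conditional write to DynamoDB, say), not in an in-memory set.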
Regional DR uses a pilot light strategy: database cross-region read replica, container image replication, S3 cross-region replication, infrastructure pre-synthesized as CDK for rapid deployment.
The testing pyramid
The agent designs a complete testing strategy because modernization without testing is just a rewrite with extra steps:
| Layer | Purpose | Coverage Target | Rationale |
|---|---|---|---|
| Unit | Pure function / method testing | 80% for new code, 40% for migrated legacy | New code has no excuse for low coverage; legacy code gets pragmatic coverage focused on the riskiest paths (data access, auth, business rules) |
| Contract | Consumer-driven API contracts (Pact) | 100% of inter-service API surfaces | Every service boundary is a potential breaking point — contract tests are the safety net |
| Component | Single service in isolation | 70% per service including DB interactions | Validates service behavior with real database queries, not just mocks |
| Integration | Multi-service with real AWS resources | All CRUD operations succeed, data consistency verified across services | Catches issues that unit and component tests miss — network, serialization, auth |
| Performance | Load / stress / soak testing | P99 < target latency, sustain peak load | Establishes baseline and validates the architecture handles expected traffic |
| Chaos | Fault injection | Graceful degradation on AZ failure, DB failover | Validates resilience claims from the architecture |
| Synthetic | Production canary monitors | 99.9% success on critical user journeys | Continuous production validation — catches regressions between deployments |
The performance baseline step is critical: capture current metrics before cutover. If you don't know what "normal" looks like today, you can't tell if the cloud deployment is better or worse.
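"Normal" here usually means a percentile, not an average. A minimal nearest-rank P99 over captured latency samples:

```python
import math

def p99(samples: list[float]) -> float:
    """Nearest-rank P99: the latency at or below which 99% of samples fall."""
    s = sorted(samples)
    idx = max(0, math.ceil(0.99 * len(s)) - 1)
    return s[idx]

# Capture this from the legacy system before cutover, then compare after.
print(p99([float(i) for i in range(1, 101)]))  # → 99.0
```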
What we've learned
The observability gap is the most consistent finding across legacy codebases. In most applications we've analyzed, monitoring is minimal or nonexistent, test coverage is low, and deployment is manual. The observability architect's job is typically to design from scratch.
The good news: designing from scratch means you get to do it right. No legacy monitoring debt. No flaky test suites. No Jenkins server held together with duct tape. You start with the modern stack and build forward.
Cost signal: The observability stack isn't free — CloudWatch Logs ingestion, X-Ray traces, Container Insights, and tiered log retention (especially S3 Glacier for compliance) all carry costs that the cost analyzer accounts for in Part 6.
Next: Part 6 — The Money Talk — where the cost analyzer pulls real AWS pricing data and builds the TCO model that gets the project funded. The observability and security costs flagged in Parts 4-5 feed directly into the model.