ATLAS: The 5 Phases Every Architect Must Know

The Problem

In Article 1, we introduced ATLAS and GOTCHA. You saw the before and after. Vague prompt vs structured prompt. Chaos vs clarity.

Maybe you tried the challenge at the end. You took a prompt and rewrote it with the GOTCHA structure. And maybe it worked better. But maybe it still didn’t feel right. The AI gave you something closer, but not quite what you needed.

Here’s why: GOTCHA is only as good as the thinking you put into it. And the thinking is where ATLAS lives.

Most developers skip ATLAS. They read the framework, nod, and jump straight to writing a GOTCHA prompt. They treat ATLAS like a formality. “Yeah, I know what I want to build. Let me just tell the AI.” But they don’t really know. They have a vague idea, not a plan.

I’ve seen this in enterprise projects. Not with AI — with regular development. A team starts building before they map data flows. They skip the part where you figure out how services talk to each other. And three weeks later, the Kubernetes cluster is restarting pods in a loop because Service A is calling Service B, which calls Service C, which calls Service A again. A circular dependency that nobody saw because nobody traced it.

ATLAS exists to prevent that. It’s five phases of thinking that produce five concrete deliverables. Skip a phase, and you’ll pay for it later.

The Solution

ATLAS is not a methodology. It’s a checklist. Five questions you answer before you touch any tool — AI or not.

A — Architect

Question: What are we building, and what are the boundaries?

This is the foundation. Before you think about code, frameworks, or AI prompts, you define the problem space. What does this system do? What does it NOT do? What are the constraints?

Architect is about decisions, not details. You’re not writing code here. You’re drawing the box around your system and deciding what lives inside it.

A good Architect phase answers:

What is the core purpose of this system?
Who uses it? (Users, other services, both?)
What are the hard constraints? (Performance targets, compliance rules, budget limits)
What technology decisions are already made? (Database choice, cloud provider, language)
What is explicitly out of scope?

The last one is critical. Defining what you’re NOT building is as important as defining what you are. Without it, scope creeps in from every direction — and your AI prompts will reflect that confusion.

T — Trace

Question: How does data flow through the system?

Trace is where you follow a request from start to finish. A user clicks a button. What happens? Where does the request go? What services process it? What data gets read, transformed, and written?

This is the phase most people skip. And it’s the phase that causes the most problems. If you don’t trace your data flows, you’ll end up with:

Services that don’t know how to talk to each other
Missing transformations between layers
Circular dependencies
Race conditions in async workflows
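
One of these problems, circular dependencies, can be checked mechanically once you have written down who calls whom. Here is a minimal sketch in Python. The call graph, the service names, and the find_cycle helper are all assumptions made for illustration; they are not part of ATLAS itself.

```python
def find_cycle(calls):
    """Return one circular call path as a list of service names, or None.

    `calls` maps each service to the services it calls directly.
    """
    visiting, done = set(), set()
    path = []

    def visit(service):
        if service in done:
            return None
        if service in visiting:
            # We are back on a service that is still on the current path.
            return path[path.index(service):] + [service]
        visiting.add(service)
        path.append(service)
        for callee in calls.get(service, []):
            cycle = visit(callee)
            if cycle:
                return cycle
        path.pop()
        visiting.discard(service)
        done.add(service)
        return None

    for service in calls:
        cycle = visit(service)
        if cycle:
            return cycle
    return None


# Hypothetical call graph: A calls B, B calls C, C calls A again.
calls = {"A": ["B"], "B": ["C"], "C": ["A"]}
print(find_cycle(calls))
```

Running a check like this against the traced call graph is cheap, and it surfaces the loop before any pod has a chance to restart.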

A good Trace phase produces a flow diagram. It doesn’t need to be formal UML. A simple list works:

1. Client sends POST /orders with cart items
2. API Gateway validates JWT, routes to Order Service
3. Order Service validates items against Inventory Service (sync call)
4. If items available → create order in PostgreSQL (status: pending)
5. Publish OrderCreated event to message queue
6. Payment Service picks up event, processes payment
7. Payment Service publishes PaymentCompleted or PaymentFailed
8. Order Service updates status to confirmed or cancelled
9. Notification Service sends email/SMS to user

Nine steps. Each one is a decision point where things can go wrong. Without this trace, you’d write a prompt like “create an order service” and the AI would have to guess most of these steps. It might get some right. It will almost certainly miss the edge cases.
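
A trace also tells you which state changes are legal. As a rough sketch, the order statuses from steps 4 and 8 can be written down as an explicit transition table. The dictionary names and the next_status helper are hypothetical; only the statuses and event names come from the trace.

```python
# Allowed order status transitions, read off steps 4 and 8 of the trace.
TRANSITIONS = {
    "pending": {"confirmed", "cancelled"},
    "confirmed": set(),
    "cancelled": set(),
}

# Assumed mapping from the payment events in step 7 to the next status.
EVENT_TO_STATUS = {
    "PaymentCompleted": "confirmed",
    "PaymentFailed": "cancelled",
}


def next_status(current, event):
    """Return the new order status for a payment event, or raise if not allowed."""
    target = EVENT_TO_STATUS[event]
    if target not in TRANSITIONS[current]:
        raise ValueError(f"cannot go from {current} to {target}")
    return target


print(next_status("pending", "PaymentCompleted"))
```

An already confirmed order that receives a late PaymentFailed event raises an error here instead of silently flipping to cancelled. That is exactly the kind of edge case a prompt without a trace leaves to chance.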

L — Link

Question: How do components connect to each other?

Link is about integrations. Not what each service does internally, but how they communicate. What protocols, what contracts, what dependencies.

This is where architecture becomes visible. You decide:

Sync vs async communication (REST calls vs message queues)
Shared databases vs isolated data stores
API contracts and versioning
Authentication between services
Error propagation (what happens when a downstream service fails?)

Link is where you discover problems early. If Service A needs data from Service B, but Service B has no endpoint for that data — you find out now, not during implementation.

A good Link phase produces an integration map:

| From | To | Method | Contract | Failure mode |
|------|----|--------|----------|--------------|
| API Gateway | Order Service | REST/HTTPS | OpenAPI 3.0 | 502 → retry 3x |
| Order Service | Inventory Service | gRPC | Proto v2 | Circuit breaker, 5s timeout |
| Order Service | Message Queue | AMQP | OrderCreated event schema | Dead letter queue |
| Payment Service | Message Queue | AMQP | PaymentCompleted schema | Retry with backoff |
| Notification Service | Email provider | HTTPS | Provider SDK | Log and skip |

Five integrations. Five potential failure points. Each one with a clear plan for when things go wrong. This table is gold for your AI prompts later — you’ll feed it directly into the GOTCHA Context layer.
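
If you keep the integration map as data instead of a picture, you can render it into prompt text whenever you need it. The sketch below assumes a hypothetical Link dataclass and to_context helper. The rows are copied from the table above, and the output format is just one possible choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    source: str
    target: str
    method: str
    contract: str
    failure_mode: str


# Rows copied from the integration map.
LINKS = [
    Link("API Gateway", "Order Service", "REST/HTTPS", "OpenAPI 3.0", "502 → retry 3x"),
    Link("Order Service", "Inventory Service", "gRPC", "Proto v2", "Circuit breaker, 5s timeout"),
    Link("Order Service", "Message Queue", "AMQP", "OrderCreated event schema", "Dead letter queue"),
    Link("Payment Service", "Message Queue", "AMQP", "PaymentCompleted schema", "Retry with backoff"),
    Link("Notification Service", "Email provider", "HTTPS", "Provider SDK", "Log and skip"),
]


def to_context(links):
    """Render each link as one plain-text line, ready to paste into a prompt."""
    return "\n".join(
        f"{link.source} -> {link.target} via {link.method} "
        f"({link.contract}); on failure: {link.failure_mode}"
        for link in links
    )


def missing_failure_modes(links):
    """Return the links that have no failure mode written down yet."""
    return [link for link in links if not link.failure_mode.strip()]


print(to_context(LINKS))
```

The missing_failure_modes check is the useful part: an empty cell in the last column is a decision you have not made yet.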

A — Assemble

Question: In what order do we build this?

Assemble is the build plan. You’ve defined the system (Architect), mapped the data flows (Trace), and documented the integrations (Link). Now you decide what gets built first, second, third.

This sounds obvious, but the order matters more than you think. Build the database schema before the repository layer. Build the repository before the service layer. Build the service before the controller. Each layer depends on the one below it.

In a microservices project, Assemble also means deciding which service gets built first. Usually, it’s the one with the fewest external dependencies — because you can test it in isolation.
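
Build order is a dependency-ordering problem, so you can let a topological sort work it out for you. Here is a small sketch using graphlib from the Python standard library (Python 3.9 or newer). The component names are taken from the layers described above, and the DEPENDS_ON dictionary is an assumption made for illustration.

```python
from graphlib import TopologicalSorter

# Each component maps to the components it depends on.
DEPENDS_ON = {
    "schema": set(),
    "repository": {"schema"},
    "service": {"repository"},
    "controller": {"service"},
}

# static_order() yields every component after all of its dependencies.
build_order = list(TopologicalSorter(DEPENDS_ON).static_order())
print(build_order)
```

The same approach works one level up, with services instead of layers. If the graph contains a circular dependency, static_order() raises graphlib.CycleError, which is a useful signal that Trace or Link needs another pass.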

A good Assemble phase looks like:

Phase 1: Foundation
  - PostgreSQL schema (orders, order_items tables)
  - Message queue setup (topics and subscriptions)
  - Shared contract libraries (event schemas, error types)

Phase 2: Core Service
  - Order Service: repository → service → controller → middleware
  - Unit tests for each layer
  - Integration test with PostgreSQL (testcontainers)

Phase 3: Supporting Services
  - Inventory Service (can mock at first)
  - Payment Service (can mock at first)
  - Notification Service

Phase 4: Integration
  - Wire services together
  - End-to-end test: place order → payment → notification
  - Load test with target concurrency

Phase 5: Deployment
  - Kubernetes manifests (Deployment, Service, Ingress)
  - CI/CD pipeline
  - Monitoring and alerting

This is your roadmap. And when you ask the AI to help with each phase, you’ll give it exactly the right scope. Not “build an order system” but “build the repository layer for the orders table, following the repository pattern, with these specific methods.”

S — Stress-test

Question: How do we validate this under real conditions?

Stress-test is the last phase. You define how you’ll know the system actually works. Not “it compiles” or “the tests pass”, but “it works under load, with real data, in an environment that looks like production.”

This phase is about:

Performance targets (requests per second, latency percentiles)
Edge cases (what happens when the database is slow? when a service is down?)
Security validation (are JWT tokens validated correctly? are inputs sanitized?)
Data integrity (do transactions roll back properly? do events get lost?)

A good Stress-test phase defines concrete scenarios:

Scenario 1: Load
  - 500 concurrent users placing orders
  - P95 latency < 200ms for order creation
  - Zero data loss under load

Scenario 2: Failure
  - Kill Payment Service → orders stay pending, no errors to user
  - Database connection drops → circuit breaker activates within 3s
  - Invalid JWT → 401 response, no stack trace leaked

Scenario 3: Data integrity
  - Place 1000 orders → exactly 1000 rows in database
  - Simulate payment failure mid-transaction → order status = cancelled
  - Replay events → no duplicate orders (idempotency)

These scenarios become your acceptance criteria. And when you ask the AI to generate tests, you’ll give it these exact scenarios in the GOTCHA Args layer.
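
A target like “P95 latency < 200ms” is only useful if you can check it. Here is a minimal sketch of that check, assuming you already have latency samples in milliseconds from a load test. It uses the nearest-rank definition of a percentile, which is one common choice among several. The function names and the sample values are made up for illustration.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_latency_target(samples_ms, target_ms=200.0, pct=95):
    """True if the pct-th percentile latency is strictly below the target."""
    return percentile(samples_ms, pct) < target_ms


# Hypothetical latencies from a load test, in milliseconds.
latencies = [120, 130, 90, 180, 150, 110, 140, 160, 170, 450]
print(percentile(latencies, 95), meets_latency_target(latencies))
```

With only ten samples, a single slow request decides the result. A real load test needs far more samples before the P95 means anything.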

Execute

Let’s put it all together. Here’s ATLAS applied to a real project: a notification service for an e-commerce platform. This service listens for events (order confirmed, payment failed, shipment dispatched) and sends emails or SMS to users.

I’ll fill in each phase as if I were starting this project tomorrow.

Architect

PURPOSE: Notification service for e-commerce platform.
  Listens for business events and sends user notifications
  via email and SMS.

USERS: Internal services only (no direct user-facing API).
  Receives events from Order Service and Shipment Service.

CONSTRAINTS:
  - Must process events within 30 seconds of receipt
  - Must support at least 3 notification channels (email, SMS, push)
  - Must not lose notifications (at-least-once delivery)
  - Must respect user preferences (opt-out per channel)

TECH DECISIONS:
  - [.NET](https://dotnet.microsoft.com) 10 worker service (not a web API -- it's event-driven)
  - [PostgreSQL](https://www.postgresql.org/) for user preferences and notification log
  - [Azure Service Bus](https://learn.microsoft.com/en-us/azure/service-bus-messaging/) for event consumption
  - [Azure Communication Services](https://learn.microsoft.com/en-us/azure/communication-services/) for email and SMS
  - [Azure Notification Hubs](https://learn.microsoft.com/en-us/azure/notification-hubs/) for push notifications

OUT OF SCOPE:
  - Marketing emails (different system)
  - In-app notifications (frontend handles this)
  - Template management UI (templates are in code for now)

Trace

1. Order Service publishes OrderConfirmed event to Service Bus
2. Notification Service picks up event from subscription
3. Service queries PostgreSQL for user notification preferences
4. For each active channel (email, SMS, push):
   a. Load template for event type + channel
   b. Render template with event data
   c. Send via Azure Communication Services (email, SMS) or Notification Hubs (push)
   d. Log result in notification_log table
5. If send fails → retry 3 times with exponential backoff
6. If all retries fail → log as failed, publish NotificationFailed event
7. Acknowledge message from Service Bus only after processing
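
Steps 3 and 4 hide a small decision: which channels count as “active” for a given user? Here is a sketch of that lookup. The shape of the preferences data and the active_channels helper are assumptions. The default of all channels for users without stored preferences is taken from the stress-test scenarios later in this article.

```python
ALL_CHANNELS = ("email", "sms", "push")


def active_channels(preferences):
    """Return the channels to notify, in a fixed order.

    `preferences` maps channel name -> bool, where False means opted out.
    None or an empty mapping means no stored preferences: use every channel.
    """
    if not preferences:
        return list(ALL_CHANNELS)
    return [channel for channel in ALL_CHANNELS if preferences.get(channel, True)]


print(active_channels({"email": False}))
```

Note the second decision buried in preferences.get(channel, True): a channel the user never mentioned is treated as opted in. Whether that is acceptable is a product and compliance question, not a coding one.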

Link

| From | To | Method | Contract | Failure mode |
|------|----|--------|----------|--------------|
| Service Bus | Notification Service | AMQP subscription | Event JSON schema | Dead letter after 10 attempts |
| Notification Service | PostgreSQL | TCP/EF Core | User preferences schema | Retry 3x, then fail event |
| Notification Service | ACS Email | HTTPS REST | ACS Email SDK | Retry 3x, log as failed |
| Notification Service | ACS SMS | HTTPS REST | ACS SMS SDK | Retry 3x, log as failed |
| Notification Service | Notification Hubs | HTTPS REST | NH REST API | Retry 3x, log as failed |
| Notification Service | Service Bus | AMQP publish | NotificationFailed schema | Log locally if bus is down |
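
“Retry 3x” in the failure-mode column is short to write and easy to get subtly wrong. Here is a provider-agnostic sketch of retry with exponential backoff. The send_with_retry helper, the one-second base delay, and the fake provider are assumptions; a real service would more likely use the retry support of its SDK or a resilience library.

```python
import time


def send_with_retry(send, retries=3, base_delay=1.0, sleep=time.sleep):
    """Try `send()` once, then retry up to `retries` times with doubling delays.

    Returns True as soon as a call succeeds, False if every attempt raised.
    """
    for attempt in range(retries + 1):
        try:
            send()
            return True
        except Exception:
            if attempt == retries:
                return False  # caller logs the notification as failed
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s with the defaults
    return False


# Usage with a fake provider that fails twice, then succeeds.
# Delays are recorded instead of slept so the example runs instantly.
attempts, delays = [], []


def flaky_send():
    attempts.append(1)
    if len(attempts) < 3:
        raise RuntimeError("provider unavailable")


print(send_with_retry(flaky_send, sleep=delays.append), delays)
```

Passing sleep in as a parameter keeps the function testable without real waiting, which matters once these retries show up in the Stress-test scenarios.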

Assemble

Phase 1: Foundation
  - PostgreSQL schema: user_preferences, notification_log
  - Service Bus topic subscriptions
  - Event schema contracts (shared NuGet package)

Phase 2: Core
  - Event consumer (Service Bus listener)
  - Preference lookup (repository + service)
  - Template engine (simple string replacement for now)
  - Channel dispatcher (routes to correct provider)

Phase 3: Providers (all Azure)
  - Email provider (Azure Communication Services Email)
  - SMS provider (Azure Communication Services SMS)
  - Push provider (Azure Notification Hubs) -- can mock initially

Phase 4: Reliability
  - Retry logic with exponential backoff
  - Dead letter handling
  - Notification log persistence
  - Idempotency check (don't send same notification twice)

Phase 5: Deployment
  - Kubernetes Deployment (2 replicas minimum)
  - Health check endpoint
  - Prometheus metrics (events processed, notifications sent, failures)
  - Azure DevOps pipeline
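
The “simple string replacement” template engine from Phase 2 really can be that simple. Here is a sketch using string.Template from the Python standard library. The template texts, the field names, and the render helper are invented for illustration; the actual service in this plan would be written in .NET.

```python
from string import Template

# Hypothetical templates keyed by (event type, channel).
TEMPLATES = {
    ("OrderConfirmed", "email"): Template("Hi $name, your order $order_id is confirmed."),
    ("OrderConfirmed", "sms"): Template("Order $order_id confirmed."),
}


def render(event_type, channel, data):
    """Fill the template for this event and channel.

    Raises KeyError if the template or one of its fields is missing.
    """
    return TEMPLATES[(event_type, channel)].substitute(data)


print(render("OrderConfirmed", "sms", {"order_id": "A-1001"}))
```

Using substitute rather than safe_substitute is deliberate. A missing field fails loudly instead of sending a customer a message with a raw placeholder in it.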

Stress-test

Scenario 1: Throughput
  - Publish 1000 events in 60 seconds
  - All notifications processed within 30s of publish
  - No events lost

Scenario 2: Provider failure
  - ACS Email returns 500 → retry 3x → log as failed
  - ACS SMS timeout → circuit breaker after 5 failures
  - All providers down → events stay in Service Bus (not acknowledged)

Scenario 3: Idempotency
  - Deliver same event twice → only one notification sent
  - Check notification_log for duplicates

Scenario 4: User preferences
  - User opts out of email → no email sent, SMS still works
  - User has no preferences → use default (all channels)
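
Scenario 3 is worth sketching, because at-least-once delivery means the same event will eventually arrive twice. The example below keeps the already-sent keys in memory, which is an assumption made to keep it short. With two replicas, the check would have to live in shared storage, for example a unique constraint on notification_log. The make_sender helper and the event ID format are hypothetical.

```python
def make_sender(send):
    """Wrap `send` so each (event_id, channel) pair is delivered at most once."""
    seen = set()

    def send_once(event_id, channel, message):
        key = (event_id, channel)
        if key in seen:
            return False  # duplicate delivery of the same event: skip it
        send(channel, message)
        seen.add(key)  # record only after the send did not raise
        return True

    return send_once


# Usage: the same event delivered twice results in a single notification.
sent = []
send_once = make_sender(lambda channel, message: sent.append((channel, message)))
send_once("evt-1", "email", "Your order is confirmed.")
send_once("evt-1", "email", "Your order is confirmed.")
print(len(sent))
```

The key is recorded only after a successful send. A crash between the send and the record can still produce a duplicate, which is the usual trade-off of at-least-once delivery.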

That’s it. Five phases. Five deliverables. Each one feeds the next. And when you’re done, you have everything you need to write a precise GOTCHA prompt for the AI — or to start coding yourself.

The point is not that ATLAS is complicated. It’s that ATLAS forces you to make decisions before you start building. And those decisions are exactly what the AI needs to give you useful results.

Template

Here’s the ATLAS checklist you can copy and fill in for any project:

=== ATLAS CHECKLIST ===

[A] ARCHITECT
  Purpose:
  Users:
  Constraints:
  Tech decisions:
  Out of scope:

[T] TRACE
  1. (first step in the data flow)
  2.
  3.
  ...

[L] LINK
  | From | To | Method | Contract | Failure mode |
  |------|----|--------|----------|--------------|
  |      |    |        |          |              |

[A] ASSEMBLE
  Phase 1:
  Phase 2:
  Phase 3:
  ...

[S] STRESS-TEST
  Scenario 1:
  Scenario 2:
  Scenario 3:
  ...

Use it. Print it. Stick it next to your monitor. The five minutes you spend filling this in will save you hours of back-and-forth with the AI.

Challenge

Before Article 3, try this: pick a real project you’re working on — or one you want to build — and fill in the full ATLAS checklist. All five phases. Don’t skip Trace and Link. Those are the ones that matter most and the ones people always skip.

In Article 3, we’ll take the ATLAS checklist and show you exactly how each phase maps to a GOTCHA layer. You’ll see how your ATLAS thinking becomes the AI’s instructions — and why the mapping between human decisions and AI layers is the key to getting consistent results from a probabilistic system.

If this series helps you, consider buying me a coffee.