ATLAS + GOTCHA -- Part 6

Hands-On: A Production CI/CD Pipeline with Security Scanning

#ai #atlas #gotcha #azure-devops #cicd #kubernetes #security

The Problem

You built the API in Article 5. It works locally. Tests pass. Dockerfile builds.

Now you need to get it to production. And not just once — every time someone pushes a commit, the code should build, test, scan for vulnerabilities, build a container image, and deploy to AKS. Automatically. With gates that stop bad code from reaching production.

Most teams skip the security part. They build a pipeline that runs tests and deploys. That’s good, but it’s not enough. Container images contain vulnerable packages. Code has dependency vulnerabilities. JWT secrets end up hardcoded in commits. None of this shows up in unit tests.

I’ve worked on regulated infrastructure projects where a pipeline without security gates was not an option. GDPR, financial services, energy sector — all of them require automated security checks before production. Even if your project isn’t regulated, these checks catch real issues, every week.

The challenge: writing an Azure DevOps pipeline with security scanning is not trivial. The YAML gets long. The stages interact. The order matters. Secrets need to be handled carefully.

This is exactly the kind of complex, multi-step task where a structured approach pays off — and where AI can generate most of the YAML if you tell it exactly what you need.

The Solution

If you’ve been following this series, you know ATLAS and GOTCHA. If not, a quick recap: ATLAS is a 5-phase checklist (Architect, Trace, Link, Assemble, Stress-test) that forces you to think through a problem before you touch any tool. GOTCHA is the 6-layer prompt format (Goals, Orchestration, Tools, Context, Heuristics, Args) that translates that thinking into instructions an AI can process consistently. Together, they turn “write me a pipeline” into a precise specification.

We’ll design the pipeline with ATLAS first. Then translate to GOTCHA. Then look at the actual YAML.

The pipeline has five stages:

  1. Build — restore, compile, run unit tests, check formatting
  2. Scan — build the Docker image, scan it for CVEs with Trivy
  3. Image — push the scanned image to Azure Container Registry
  4. Deploy — rolling update to AKS
  5. Smoke test — verify the deployment is healthy

Here’s the ATLAS checklist for this pipeline — each letter maps to a phase of the design:

[A] ARCHITECT
  Purpose: Automated CI/CD pipeline for users-api.
    Build → test → security scan → deploy to AKS on every push to main.
    PRs run build + tests only (no deploy).
  Constraints:
    - Pipeline must fail if any security scan finds HIGH or CRITICAL issues
    - Container image must be signed before push to ACR
    - No secrets in pipeline YAML — all from Azure DevOps variable groups
    - Deploy must be zero-downtime (rolling update, readiness probe gated)
  Tech decisions: Azure DevOps, ACR, AKS, Trivy (container scan), dotnet-format
  Out of scope: blue/green deployments, performance testing, notification alerts

[T] TRACE
  On push to main:
    1. Agent checks out code
    2. dotnet restore → dotnet build → dotnet test (with coverage)
    3. Publish test results and coverage to pipeline
    4. dotnet format --verify-no-changes (fail on unformatted code)
    5. docker build → tag with build ID + git sha
    6. Trivy scan the image → fail if HIGH/CRITICAL CVE found
    7. docker push to ACR (only if scan passes)
    8. kubectl set image → wait for rollout to complete
    9. curl /healthz on the new pods → fail pipeline if not 200

  On PR (not main):
    Steps 1-3 only. No image build, no deploy.

[L] LINK
  | From           | To              | Auth               | Failure mode         |
  | -------------- | --------------- | ------------------ | -------------------- |
  | ADO pipeline   | ACR             | Service connection | Fail stage, alert    |
  | ADO pipeline   | AKS             | Service connection | Fail stage, alert    |
  | ADO agent      | Trivy           | Local binary       | Fail stage           |
  | ADO pipeline   | Variable group  | Azure Key Vault    | Pipeline blocked     |

[A] ASSEMBLE
  Phase 1: Pipeline skeleton (trigger, stages, agent pools)
  Phase 2: Build stage (restore, build, test, format check)
  Phase 3: Image stage (docker build, Trivy scan, ACR push)
  Phase 4: Deploy stage (kubectl rolling update, rollout watch)
  Phase 5: Smoke test stage (HTTP health check with retry)

[S] STRESS-TEST
  Scenario 1: Bad code
    - Push code with unit test failure → pipeline stops at Build stage
    - Push unformatted code → pipeline stops at format check
  Scenario 2: Vulnerable image
    - Base image with known HIGH CVE → pipeline stops at Image stage
    - Fix: update base image → pipeline proceeds
  Scenario 3: Failed deploy
    - New pod fails readiness probe → rollout stops, old pods stay up
    - Smoke test fails → pipeline marked failed, but service still running
  Scenario 4: Secret hygiene
    - No secrets in YAML, all resolved from variable group at runtime
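One concrete note on Scenario 4: in Azure DevOps, secret variables from a variable group are not injected into script steps automatically; each secret has to be mapped explicitly through env:. A minimal sketch (the script name is hypothetical; DB_CONNECTION is one of the variable-group secrets listed later in the prompt):

```yaml
steps:
  - script: ./run-db-task.sh          # hypothetical script that needs the secret
    displayName: Explicit secret mapping
    env:
      DB_CONNECTION: $(DB_CONNECTION) # secrets must be mapped by hand; their values never appear in the YAML
```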

Now we translate the ATLAS design into a GOTCHA prompt — the format the AI needs to generate consistent output. Each section maps to a GOTCHA layer:

=== GOALS (from Architect) ===
Generate an Azure DevOps YAML pipeline for a .NET 10 Web API that:
- Triggers on push to main (full pipeline) and on PRs (build + test only)
- Builds, tests, runs a static format check (dotnet format), scans the container with Trivy
- Fails immediately if any HIGH or CRITICAL CVE is found
- Builds and pushes Docker image to ACR (tagged: build ID + git sha)
- Deploys rolling update to AKS and waits for rollout to complete
- Runs HTTP smoke test (/healthz) after deploy

=== ORCHESTRATION (from Assemble) ===
Five stages in order, each depends on previous passing:
1. build: restore → build → test → format check → publish test results
2. scan: Trivy scan on local image (before push) → fail on HIGH/CRITICAL
3. image: docker push to ACR (only after scan passes)
4. deploy: kubectl set image + rollout status watch (5 min timeout)
5. smoketest: curl /healthz with 3 retries, 10s between retries

=== TOOLS (from Link) ===
- Azure DevOps YAML pipelines (multi-stage)
- dotnet 10 SDK (ubuntu-latest agent)
- Docker + Azure Container Registry
- Trivy (aquasec/trivy action or install script)
- kubectl + kubelogin (AKS deployment)
- Azure DevOps service connections (ACR + AKS)
- Azure Key Vault linked variable group for secrets

=== CONTEXT (from Trace + Link) ===
- Project: users-api (.NET 10 Web API)
- ACR name: myacr (variable: ACR_NAME)
- AKS cluster: my-cluster (variable: AKS_CLUSTER), namespace: users-api
- K8s deployment name: users-api
- Container name in deployment spec: users-api
- Secrets in variable group: DB_CONNECTION, JWT_SECRET
- Image tag pattern: $(Build.BuildId)-$(Build.SourceVersion | first 8 chars)

=== HEURISTICS ===
DO:
- Use dependsOn between stages to enforce order
- Use condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
  for deploy and smoketest stages
- Publish test results using PublishTestResults task (JUnit format)
- Use --exit-code 1 flag in Trivy to fail the pipeline on findings
- Set timeoutInMinutes on deploy stage to prevent hanging rollouts

DON'T:
- Don't hardcode secrets or connection strings in YAML
- Don't push the image before the Trivy scan passes
- Don't use latest tag — always use a specific, traceable tag
- Don't skip the rollout status check (kubectl rollout status blocks until ready)
- Don't ignore test coverage — publish it even if you don't gate on it yet

=== ARGS (from Stress-test) ===
Pool: ubuntu-latest
ACR_NAME: $(acrName) from variable group
AKS_CLUSTER: $(aksCluster) from variable group
AKS_RESOURCE_GROUP: $(aksResourceGroup) from variable group
API_HOSTNAME: $(apiHostname) from variable group (used by the smoke test)
Trivy severity: HIGH,CRITICAL
Rollout timeout: 5 minutes
Smoke test retries: 3, interval: 10 seconds
Test results format: JUnit

Execute

Here’s the pipeline YAML, generated from that prompt and lightly adjusted (the three things I mentioned in Article 5 happened here too — I’ll call them out as we go).

# azure-pipelines.yml
trigger:
  branches:
    include:
      - main
  paths:
    exclude:
      - "*.md"
      - "docs/**"

pr:
  branches:
    include:
      - main

variables:
  - group: users-api-secrets        # Azure Key Vault linked variable group
  - name: IMAGE_TAG
    value: "$(Build.BuildId)-$(Build.SourceVersion)"   # full commit SHA (there is no built-in short-SHA variable)
  - name: IMAGE_FULL
    value: "$(acrName).azurecr.io/users-api:$(IMAGE_TAG)"

stages:
  # ─────────────────────────────────────────
  - stage: build
    displayName: Build & Test
    jobs:
      - job: build
        displayName: Build, test, format check
        pool:
          vmImage: ubuntu-latest
        steps:
          - task: UseDotNet@2
            inputs:
              version: "10.x"

          - script: dotnet restore
            displayName: Restore

          - script: dotnet build --no-restore --configuration Release
            displayName: Build

          - script: |
              # the "junit" logger requires the JunitXml.TestLogger NuGet package in the test project
              dotnet test --no-build --configuration Release \
                --logger "junit;LogFilePath=$(Agent.TempDirectory)/test-results.xml" \
                --collect:"XPlat Code Coverage"
            displayName: Test

          - task: PublishTestResults@2
            condition: always()
            inputs:
              testResultsFormat: JUnit
              testResultsFiles: "$(Agent.TempDirectory)/test-results.xml"

          - script: dotnet format --verify-no-changes
            displayName: Format check

  # ─────────────────────────────────────────
  - stage: scan
    displayName: Security Scan
    dependsOn: build
    jobs:
      - job: trivy
        displayName: Trivy container scan
        pool:
          vmImage: ubuntu-latest
        steps:
          - task: Docker@2
            displayName: Build image (local, not pushed)
            inputs:
              command: build
              Dockerfile: Dockerfile
              repository: $(acrName).azurecr.io/users-api
              tags: $(IMAGE_TAG)

          - script: |
              curl -sfL https://raw.githubusercontent.com/aquasecurity/trivy/main/contrib/install.sh \
                | sh -s -- -b /usr/local/bin v0.62.1
              trivy image \
                --exit-code 1 \
                --severity HIGH,CRITICAL \
                --no-progress \
                $(IMAGE_FULL)
            displayName: Scan with Trivy

  # ─────────────────────────────────────────
  - stage: image
    displayName: Push to ACR
    dependsOn: scan
    condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
    jobs:
      - job: push
        displayName: Push image
        pool:
          vmImage: ubuntu-latest
        steps:
          - task: Docker@2
            displayName: Build and push to ACR
            inputs:
              command: buildAndPush   # rebuilds on this agent; the scan stage ran on a separate agent
              containerRegistry: acr-service-connection
              repository: users-api
              tags: $(IMAGE_TAG)

  # ─────────────────────────────────────────
  - stage: deploy
    displayName: Deploy to AKS
    dependsOn: image
    condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
    timeoutInMinutes: 10
    jobs:
      - deployment: deploy_aks
        displayName: Rolling deploy
        pool:
          vmImage: ubuntu-latest
        environment: production   # for cluster auth, this environment needs a linked AKS resource or the tasks need an explicit service connection
        strategy:
          runOnce:
            deploy:
              steps:
                - task: KubernetesManifest@1
                  displayName: Set image
                  inputs:
                    action: patch
                    resourceToPatch: deployment/users-api
                    namespace: users-api
                    patch: |
                      spec:
                        template:
                          spec:
                            containers:
                            - name: users-api
                              image: $(acrName).azurecr.io/users-api:$(IMAGE_TAG)

                - task: Kubernetes@1
                  displayName: Wait for rollout
                  inputs:
                    command: rollout
                    arguments: "status deployment/users-api -n users-api --timeout=5m"

  # ─────────────────────────────────────────
  - stage: smoketest
    displayName: Smoke Test
    dependsOn: deploy
    condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
    jobs:
      - job: healthcheck
        displayName: Health check
        pool:
          vmImage: ubuntu-latest
        steps:
          - script: |
              for i in 1 2 3; do
                echo "Attempt $i..."
                STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
                  https://$(apiHostname)/healthz)
                if [ "$STATUS" = "200" ]; then
                  echo "Health check passed (HTTP $STATUS)"
                  exit 0
                fi
                echo "Got HTTP $STATUS, retrying in 10s..."
                sleep 10
              done
              echo "Health check failed after 3 attempts"
              exit 1
            displayName: Check /healthz (3 retries)
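The retry loop in the smoke-test step generalizes to a small helper. Here it is factored into a reusable shell function — a sketch of the same logic; the `retry` name is my own, not part of the pipeline:

```shell
# retry ATTEMPTS INTERVAL CMD [ARGS...]
# Runs CMD up to ATTEMPTS times, sleeping INTERVAL seconds between
# failures; returns 0 on the first success, 1 if every attempt fails.
retry() {
  local attempts=$1 interval=$2
  shift 2
  local i
  for i in $(seq 1 "$attempts"); do
    if "$@"; then
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$interval"
    fi
  done
  return 1
}
```

The smoke-test step above is then roughly equivalent to `retry 3 10 curl -sf https://$(apiHostname)/healthz`.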

What We Adjusted

The AI generated a solid pipeline. Three adjustments:

1. The image tag. The AI used $(Build.BuildId) alone. We changed it to $(Build.BuildId)-$(Build.SourceVersion), build ID plus commit SHA. (Note: $(Build.SourceVersion) is the full 40-character SHA; Azure DevOps has no built-in short form, so the 8-character tag the prompt asked for needs an extra script step.) Either way, every image tag is traceable to the exact commit — useful when you need to know what’s running in production.

2. Trivy install script. The AI used a pinned GitHub Action for Trivy, which only works in GitHub Actions, not Azure DevOps. We switched to the install script that works on any Ubuntu agent.

3. The condition on deploy and smoketest. The AI used succeeded() only, which meant deploys would also run on PR pipelines if the branch happened to be named main on the PR side. We added the eq(variables['Build.SourceBranch'], 'refs/heads/main') check to be explicit.
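On adjustment 1: if you do want the 8-character tag the GOTCHA prompt specified, a small script step can compute it at runtime, since there is no built-in short-SHA variable. A sketch — the SHORT_SHA variable name is my own:

```yaml
steps:
  - bash: |
      # derive an 8-char short SHA and expose it as a pipeline variable
      SHORT_SHA="${BUILD_SOURCEVERSION:0:8}"
      echo "##vso[task.setvariable variable=SHORT_SHA]$SHORT_SHA"
    displayName: Compute short SHA
  # later steps can then tag the image with $(Build.BuildId)-$(SHORT_SHA)
```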

This last adjustment is a good example of why the Heuristics layer in GOTCHA matters — it’s where you write explicit DO/DON’T rules for the AI. We had this in the prompt:

Use condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
for deploy and smoketest stages

And the AI still got it slightly wrong. Not because the heuristic was unclear, but because the AI applied the condition at the stage level while also leaving a default succeeded() condition at the job level. We removed the job-level condition and kept only the stage-level one.
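Expressed as a fragment, the corrected shape looks like this (sketch; only the relevant keys shown):

```yaml
- stage: deploy
  dependsOn: image
  # the gate lives here, at the stage level, and nowhere else
  condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
  jobs:
    - deployment: deploy_aks   # no job-level condition; the job inherits the stage gate
```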

The lesson: review the generated code. A structured prompt doesn’t make AI infallible. It makes the output much closer to what you need, but you’re still the architect. The structured approach compresses the review from hours to minutes — but you still need to look.

Template

Here’s the ATLAS checklist for any CI/CD pipeline project:

=== CI/CD PIPELINE ATLAS CHECKLIST ===

[A] ARCHITECT
  Purpose: (what triggers it, what it produces)
  Trigger: (branches, events, PR vs merge)
  Constraints: (what must fail the pipeline, what must never happen)
  Secrets: (where they come from, never YAML)
  Out of scope: (what this pipeline doesn't do)

[T] TRACE
  On merge to main:
    1. (first step)
    2.
    ...
  On PR:
    (which steps only)

[L] LINK
  | From | To | Auth | Failure mode |

[A] ASSEMBLE
  Stage 1: (name + what it does)
  Stage 2:
  ...

[S] STRESS-TEST
  Scenario 1: What happens when tests fail?
  Scenario 2: What happens when a CVE is found?
  Scenario 3: What happens when the deploy fails?
  Scenario 4: Are there any secrets that could leak?

Challenge

You now have an API and a pipeline. Together, they form a complete system: code → test → scan → deploy → validate.

Before Article 7, try this: take the GOTCHA prompt from this article and change three things — a different container registry, a different cluster, and add a stage you need (maybe a database migration step, or a Slack notification on failure). See how much of the prompt you can reuse and what you need to adjust.
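If you pick the database-migration option, here is a sketch of what that stage could look like — not part of this article’s pipeline; the project path and connection-string key are my assumptions:

```yaml
- stage: migrate
  displayName: DB Migration
  dependsOn: image
  condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
  jobs:
    - job: ef_migrate
      pool:
        vmImage: ubuntu-latest
      steps:
        - script: |
            dotnet tool install --global dotnet-ef
            dotnet ef database update --project src/UsersApi   # assumed project path
          displayName: Apply EF Core migrations
          env:
            ConnectionStrings__Default: $(DB_CONNECTION)   # secret mapped explicitly from the variable group
```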

That exercise is a preview of what’s coming. In Article 7, we’ll look at the most common mistakes developers make when scaling this approach across bigger projects — the anti-gotchas — and build a library of ready-to-use prompt templates.

If this series helps you, consider buying me a coffee.
