Hands-On: A Production CI/CD Pipeline with Security Scanning

The Problem

You built the API in Article 5. It works locally. The tests pass. The Dockerfile builds.

Now you need to get it to production. And not just once: every time someone pushes a commit, the code should build, run its tests, get scanned for vulnerabilities, be packaged into a container image, and deploy to AKS. Automatically. With gates that stop bad code before it reaches production.

Most teams skip the security part. They build a pipeline that runs tests and deploys. That is fine, but it is not enough. Container images ship vulnerable packages. Dependencies carry known vulnerabilities. JWT secrets end up hardcoded in commits. None of this shows up in unit tests.

I have worked on regulated infrastructure projects where a pipeline without security gates was not an option. GDPR, financial services, the energy sector: all of them require automated security checks before production. Even if your project is not regulated, these checks find real problems, every week.

The challenge: writing an Azure DevOps pipeline with security scanning is not trivial. The YAML gets long. The stages interact. Order matters. Secrets need careful handling.

This is exactly the kind of complex, multi-step task where a structured approach pays off, and where AI can generate most of the YAML if you tell it exactly what you need.

The Solution

If you have followed this series, you already know ATLAS and GOTCHA. If not, a quick recap: ATLAS is a 5-phase checklist (Architect, Trace, Link, Assemble, Stress-test) that forces you to think a problem through before touching any tool. GOTCHA is the 6-layer prompt format (Goals, Orchestration, Tools, Context, Heuristics, Args) that translates that thinking into instructions an AI can process consistently. Together, they turn "write me a pipeline" into a precise specification.

First we design the pipeline with ATLAS. Then we translate it into GOTCHA. Then we look at the actual YAML.

The pipeline has five stages:

Build: restore, compile, run unit tests
SAST: static analysis (finding vulnerable code patterns)
Image: build the Docker image, scan it for CVEs, push to Azure Container Registry
Deploy: rolling update to AKS
Smoke test: verify that the deployment is healthy

Here is the ATLAS checklist for this pipeline. Each letter maps to a design phase:

[A] ARCHITECT
  Purpose: Automated CI/CD pipeline for users-api.
    Build → test → security scan → deploy to AKS on every push to main.
    PRs run build + tests only (no deploy).
  Constraints:
    - Pipeline must fail if any security scan finds HIGH or CRITICAL issues
    - Container image must be signed before push to ACR
    - No secrets in pipeline YAML — all from Azure DevOps variable groups
    - Deploy must be zero-downtime (rolling update, readiness probe gated)
  Tech decisions: Azure DevOps, ACR, AKS, Trivy (container scan), dotnet-format
  Out of scope: blue/green deployments, performance testing, notification alerts

[T] TRACE
  On push to main:
    1. Agent checks out code
    2. dotnet restore → dotnet build → dotnet test (with coverage)
    3. Publish test results and coverage to pipeline
    4. dotnet format --verify-no-changes (fail on unformatted code)
    5. docker build → tag with build ID + git sha
    6. Trivy scan the image → fail if HIGH/CRITICAL CVE found
    7. docker push to ACR (only if scan passes)
    8. kubectl set image → wait for rollout to complete
    9. curl /healthz on the new pods → fail pipeline if not 200

  On PR (not main):
    Steps 1-3 only. No image build, no deploy.

[L] LINK
  | From           | To              | Auth               | Failure mode         |
  | -------------- | --------------- | ------------------ | -------------------- |
  | ADO pipeline   | ACR             | Service connection | Fail stage, alert    |
  | ADO pipeline   | AKS             | Service connection | Fail stage, alert    |
  | ADO agent      | Trivy           | Local binary       | Fail stage           |
  | ADO pipeline   | Variable group  | Azure Key Vault    | Pipeline blocked     |

[A] ASSEMBLE
  Phase 1: Pipeline skeleton (trigger, stages, agent pools)
  Phase 2: Build stage (restore, build, test, format check)
  Phase 3: Image stage (docker build, Trivy scan, ACR push)
  Phase 4: Deploy stage (kubectl rolling update, rollout watch)
  Phase 5: Smoke test stage (HTTP health check with retry)

[S] STRESS-TEST
  Scenario 1: Bad code
    - Push code with unit test failure → pipeline stops at Build stage
    - Push unformatted code → pipeline stops at format check
  Scenario 2: Vulnerable image
    - Base image with known HIGH CVE → pipeline stops at Image stage
    - Fix: update base image → pipeline proceeds
  Scenario 3: Failed deploy
    - New pod fails readiness probe → rollout stops, old pods stay up
    - Smoke test fails → pipeline marked failed, but service still running
  Scenario 4: Secret hygiene
    - No secrets in YAML, all resolved from variable group at runtime

Now we translate the ATLAS design into a GOTCHA prompt, the format the AI needs to produce consistent output. Each section maps to a GOTCHA layer:

=== GOALS (from Architect) ===
Generate an Azure DevOps YAML pipeline for a .NET 10 Web API that:
- Triggers on push to main (full pipeline) and on PRs (build + test only)
- Builds, tests, runs SAST (dotnet format), scans container with Trivy
- Fails immediately if any HIGH or CRITICAL CVE is found
- Builds and pushes Docker image to ACR (tagged: build ID + git sha)
- Deploys rolling update to AKS and waits for rollout to complete
- Runs HTTP smoke test (/healthz) after deploy

=== ORCHESTRATION (from Assemble) ===
Five stages in order, each depends on previous passing:
1. build: restore → build → test → format check → publish test results
2. scan: Trivy scan on local image (before push) → fail on HIGH/CRITICAL
3. image: docker push to ACR (only after scan passes)
4. deploy: kubectl set image + rollout status watch (5 min timeout)
5. smoketest: curl /healthz with 3 retries, 10s between retries

=== TOOLS (from Link) ===
- Azure DevOps YAML pipelines (multi-stage)
- dotnet 10 SDK (ubuntu-latest agent)
- Docker + Azure Container Registry
- Trivy (aquasec/trivy action or install script)
- kubectl + kubelogin (AKS deployment)
- Azure DevOps service connections (ACR + AKS)
- Azure Key Vault linked variable group for secrets

=== CONTEXT (from Trace + Link) ===
- Project: users-api (.NET 10 Web API)
- ACR name: myacr (variable: ACR_NAME)
- AKS cluster: my-cluster (variable: AKS_CLUSTER), namespace: users-api
- K8s deployment name: users-api
- Container name in deployment spec: users-api
- Secrets in variable group: DB_CONNECTION, JWT_SECRET
- Image tag pattern: $(Build.BuildId)-$(Build.SourceVersion | first 8 chars)

=== HEURISTICS ===
DO:
- Use dependsOn between stages to enforce order
- Use condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
  for deploy and smoketest stages
- Publish test results using PublishTestResults task (JUnit format)
- Use --exit-code 1 flag in Trivy to fail the pipeline on findings
- Set timeoutInMinutes on deploy stage to prevent hanging rollouts

DON'T:
- Don't hardcode secrets or connection strings in YAML
- Don't push the image before the Trivy scan passes
- Don't use latest tag — always use a specific, traceable tag
- Don't skip the rollout status check (kubectl rollout status blocks until ready)
- Don't ignore test coverage — publish it even if you don't gate on it yet

=== ARGS (from Stress-test) ===
Pool: ubuntu-latest
ACR_NAME: $(acrName) from variable group
AKS_CLUSTER: $(aksCluster) from variable group
AKS_RESOURCE_GROUP: $(aksResourceGroup) from variable group
Trivy severity: HIGH,CRITICAL
Rollout timeout: 5 minutes
Smoke test retries: 3, interval: 10 seconds
Test results format: JUnit

Execute

Here is the pipeline YAML, generated from that prompt and lightly adjusted (the three things I mentioned in Article 5 happened here too; I point them out as we go).

# azure-pipelines.yml
trigger:
  branches:
    include:
      - main
  paths:
    exclude:
      - "*.md"
      - "docs/**"

pr:
  branches:
    include:
      - main

variables:
  - group: users-api-secrets        # Azure Key Vault linked variable group
  - name: IMAGE_TAG
    value: "$(Build.BuildId)-$(Build.SourceVersion)"
  - name: IMAGE_FULL
    value: "$(acrName).azurecr.io/users-api:$(IMAGE_TAG)"

stages:
  # ─────────────────────────────────────────
  - stage: build
    displayName: Build & Test
    jobs:
      - job: build
        displayName: Build, test, format check
        pool:
          vmImage: ubuntu-latest
        steps:
          - task: UseDotNet@2
            inputs:
              version: "10.x"

          - script: dotnet restore
            displayName: Restore

          - script: dotnet build --no-restore --configuration Release
            displayName: Build

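          # The junit logger is not built into dotnet test; this assumes the test
          # project references the JunitXml.TestLogger NuGet package.
          - script: |
              dotnet test --no-build --configuration Release \
                --logger "junit;LogFilePath=$(Agent.TempDirectory)/test-results.xml" \
                --collect:"XPlat Code Coverage"
            displayName: Test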

          - task: PublishTestResults@2
            condition: always()
            inputs:
              testResultsFormat: JUnit
              testResultsFiles: "$(Agent.TempDirectory)/test-results.xml"

          - script: dotnet format --verify-no-changes
            displayName: Format check

  # ─────────────────────────────────────────
  - stage: scan
    displayName: Security Scan
    dependsOn: build
    jobs:
      - job: trivy
        displayName: Trivy container scan
        pool:
          vmImage: ubuntu-latest
        steps:
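          # Plain docker build so the local image carries the exact reference
          # that Trivy scans in the next step. A full registry/repo:tag string
          # is not a valid value for the Docker@2 tags input.
          - script: docker build -t $(IMAGE_FULL) -f Dockerfile .
            displayName: Build image (local, not pushed)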

          - script: |
              curl -sfL https://raw.githubusercontent.com/aquasecurity/trivy/main/contrib/install.sh \
                | sh -s -- -b /usr/local/bin v0.62.1
              trivy image \
                --exit-code 1 \
                --severity HIGH,CRITICAL \
                --no-progress \
                $(IMAGE_FULL)
            displayName: Scan with Trivy

  # ─────────────────────────────────────────
  - stage: image
    displayName: Push to ACR
    dependsOn: scan
    condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
    jobs:
      - job: push
        displayName: Push image
        pool:
          vmImage: ubuntu-latest
        steps:
          - task: Docker@2
            displayName: Build and push to ACR
            inputs:
              command: buildAndPush
              containerRegistry: acr-service-connection
              repository: users-api
              tags: $(IMAGE_TAG)

  # ─────────────────────────────────────────
  - stage: deploy
    displayName: Deploy to AKS
    dependsOn: image
    condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
    jobs:
      - deployment: deploy_aks
        displayName: Rolling deploy
        timeoutInMinutes: 10  # job-level setting; stages do not accept timeoutInMinutes
        pool:
          vmImage: ubuntu-latest
        environment: production
        strategy:
          runOnce:
            deploy:
              steps:
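                - task: KubernetesManifest@1
                  displayName: Set image
                  inputs:
                    action: patch
                    # Assumes the production environment supplies the cluster
                    # connection; otherwise add kubernetesServiceConnection here.
                    namespace: users-api
                    resourceToPatch: name   # patch by kind + name, not by manifest file
                    kind: deployment
                    name: users-api
                    mergeStrategy: strategic
                    patch: |
                      spec:
                        template:
                          spec:
                            containers:
                            - name: users-api
                              image: $(acrName).azurecr.io/users-api:$(IMAGE_TAG)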

                - task: Kubernetes@1
                  displayName: Wait for rollout
                  inputs:
                    command: rollout
                    arguments: "status deployment/users-api -n users-api --timeout=5m"

  # ─────────────────────────────────────────
  - stage: smoketest
    displayName: Smoke Test
    dependsOn: deploy
    condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
    jobs:
      - job: healthcheck
        displayName: Health check
        pool:
          vmImage: ubuntu-latest
        steps:
          - script: |
              for i in 1 2 3; do
                echo "Attempt $i..."
                STATUS=$(curl -s -o /dev/null -w "%{http_code}" \
                  https://$(apiHostname)/healthz)
                if [ "$STATUS" = "200" ]; then
                  echo "Health check passed (HTTP $STATUS)"
                  exit 0
                fi
                echo "Got HTTP $STATUS, retrying in 10s..."
                sleep 10
              done
              echo "Health check failed after 3 attempts"
              exit 1
            displayName: Check /healthz (3 retries)
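
One thing to watch in the YAML above: the scan and image stages run as separate jobs, and each job on a Microsoft-hosted agent starts on a fresh machine, so buildAndPush rebuilds the image instead of pushing the exact one that was scanned. A minimal sketch of one way to close that gap is to build, scan, and push inside a single script step. The function name scan_then_push is hypothetical, and the sketch assumes docker and trivy are on PATH and that the agent is already logged in to the registry.

```shell
# Sketch only: build, scan, and push in one job so the pushed image is the
# image that was scanned. scan_then_push is a hypothetical helper name.
# Assumes docker and trivy are installed and the registry login already happened.
scan_then_push() {
  image="$1"         # full image reference, for example registry/repo:tag
  context="${2:-.}"  # build context directory, defaults to the current one

  docker build -t "$image" "$context" || return 1

  # With --exit-code 1, Trivy exits non-zero when it finds vulnerabilities
  # at the listed severities, which is what gates the push below.
  if ! trivy image --exit-code 1 --severity HIGH,CRITICAL --no-progress "$image"; then
    echo "Trivy reported HIGH/CRITICAL findings or failed; not pushing $image" >&2
    return 1
  fi

  docker push "$image"
}
```

In a pipeline this would be called as scan_then_push "$(IMAGE_FULL)" from a script step that follows a registry login step. The trade-off is that the scan and the push no longer show up as separate stages in the run view.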

What We Adjusted

The AI generated a solid pipeline. Three adjustments:

1. The image tag. The AI used $(Build.BuildId) alone. We changed it to $(Build.BuildId)-$(Build.SourceVersion), so every image tag is traceable to the exact commit, which is useful when you need to know what is running in production. Note that $(Build.SourceVersion) expands to the full commit SHA; shortening it to the first 8 characters, as the prompt asked, takes an extra script step.

2. The Trivy installation. The AI used a pinned GitHub Action for Trivy, which only works in GitHub Actions, not in Azure DevOps. We replaced it with the install script, which works on any Ubuntu agent.

3. The condition on deploy and smoketest. The AI used only succeeded(), which meant the deploy would also run on PR validation runs, where Build.SourceBranch is refs/pull/<id>/merge rather than refs/heads/main. We added the eq(variables['Build.SourceBranch'], 'refs/heads/main') check to be explicit.
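
Going back to the first adjustment, here is a minimal sketch of how the tag could be shortened to the build ID plus the first 8 characters of the SHA. The helper name make_image_tag is hypothetical. BUILD_BUILDID and BUILD_SOURCEVERSION are the environment-variable forms that Azure DevOps agents expose for $(Build.BuildId) and $(Build.SourceVersion).

```shell
# Sketch only: derive "<build id>-<first 8 chars of the commit SHA>".
# make_image_tag is a hypothetical helper, not part of the original pipeline.
make_image_tag() {
  # $1 = build ID, $2 = full commit SHA
  short_sha=$(printf '%s' "$2" | cut -c1-8)
  printf '%s-%s\n' "$1" "$short_sha"
}

# On an agent, publish the tag for later steps of the same job using the
# task.setvariable logging command. Outside a pipeline both variables are
# unset, so this part is skipped.
if [ -n "${BUILD_BUILDID:-}" ] && [ -n "${BUILD_SOURCEVERSION:-}" ]; then
  echo "##vso[task.setvariable variable=IMAGE_TAG]$(make_image_tag "$BUILD_BUILDID" "$BUILD_SOURCEVERSION")"
fi
```

A variable set this way is visible to subsequent steps in the same job only. Passing it to another stage would need an output variable, so the simplest layout is to compute the tag in the job that builds and pushes the image.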

The third adjustment is a good example of why the Heuristics layer in GOTCHA matters: it is where you write explicit DO/DON'T rules for the AI. We had this in the prompt:

Use condition: and(succeeded(), eq(variables['Build.SourceBranch'], 'refs/heads/main'))
for deploy and smoketest stages

And the AI still got it slightly wrong. Not because the heuristic was unclear, but because the AI applied it to the stage condition and also left a default succeeded() at the job level. We removed the job-level condition and kept only the stage-level one.
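
To make the branch check concrete, the same rule can be restated as a shell guard. This is a sketch rather than a replacement for the stage condition: the function names are hypothetical, and BUILD_SOURCEBRANCH is the environment-variable form of $(Build.SourceBranch) on the agent.

```shell
# Sketch only: script-level restatement of the stage condition.
# is_main_branch and deploy_guard are hypothetical helper names.
is_main_branch() {
  [ "${1:-}" = "refs/heads/main" ]
}

deploy_guard() {
  # $1 = value of Build.SourceBranch for the current run
  if is_main_branch "${1:-}"; then
    echo "deploy allowed: $1"
  else
    echo "deploy skipped: ${1:-unset}"
    return 1
  fi
}
```

Calling deploy_guard "$BUILD_SOURCEBRANCH" at the top of a deploy script gives a second line of defense if someone later edits or removes the stage condition.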

The lesson: review the generated code. A structured prompt does not make the AI infallible. It gets the output much closer to what you need, but you are still the architect. The structured approach compresses review from hours to minutes, but you still need to look.

Template

Here is the ATLAS checklist for any CI/CD pipeline project:

=== CI/CD PIPELINE ATLAS CHECKLIST ===

[A] ARCHITECT
  Purpose: (what triggers it, what it produces)
  Trigger: (branches, events, PR vs merge)
  Constraints: (what must fail the pipeline, what must never happen)
  Secrets: (where they come from, never YAML)
  Out of scope: (what this pipeline doesn't do)

[T] TRACE
  On merge to main:
    1. (first step)
    2.
    ...
  On PR:
    (which steps only)

[L] LINK
  | From | To | Auth | Failure mode |

[A] ASSEMBLE
  Stage 1: (name + what it does)
  Stage 2:
  ...

[S] STRESS-TEST
  Scenario 1: What happens when tests fail?
  Scenario 2: What happens when a CVE is found?
  Scenario 3: What happens when the deploy fails?
  Scenario 4: Are there any secrets that could leak?

Challenge

You now have an API and a pipeline. Together they form a complete system: code → test → scan → deploy → validate.

Before Article 7, try this: take the GOTCHA prompt from this article and change three things: a different container registry, a different cluster, and an extra stage you need (perhaps a database migration step, or a Slack notification on failure). See how much of the prompt you can reuse and what you need to adjust.
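
One way to approach the exercise is to treat the parts of the prompt that change as parameters. The sketch below renders a CONTEXT layer from three inputs. The function name render_context and the example values are hypothetical; only the layer heading and the users-api deployment name come from the prompt in this article.

```shell
# Sketch only: render the CONTEXT layer of a GOTCHA prompt from parameters.
# render_context and its arguments are hypothetical, for illustration.
render_context() {
  # $1 = container registry, $2 = cluster name, $3 = namespace
  cat <<EOF
=== CONTEXT ===
- Container registry: $1
- Cluster: $2, namespace: $3
- K8s deployment name: users-api
EOF
}
```

For example, render_context ghcr.io/acme staging-cluster users-api prints a CONTEXT block for a different registry and cluster, while the Goals and Heuristics layers stay as they are.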

That exercise is a preview of what comes next. In Article 7, we will look at the most common mistakes developers make when scaling this approach to larger projects (the anti-gotchas) and build a library of ready-to-use prompt templates.
