The Infrastructure Hub -- Part 8

Drift Detection: When Your Infrastructure Doesn't Match Your Code

#platform-engineering #backstage #terraform #drift #infrastructure #ai

The Problem

It’s Monday morning. You run terraform plan on client ACME’s AKS module. Expected output: “No changes.” Actual output: 14 resources will be modified.

Nobody changed the code. Nobody merged a PR. But someone — maybe an engineer during an incident, maybe a client admin with portal access, maybe an automated process you forgot about — changed something directly in Azure. Now the cloud resource and the Terraform code disagree.

This is drift. And it happens in every organization that manages infrastructure with code. Not because people are careless — because reality is messy. Incidents need hotfixes. Clients make changes in the portal. Azure updates default settings. Kubernetes auto-scales and leaves behind config that Terraform didn’t expect.

The problem isn’t the drift itself — it’s not knowing about it. Most teams discover drift by accident: a terraform plan that shows unexpected changes, or worse, a terraform apply that reverts a manual fix someone made last week. By that point, the damage is done.

What you need is a system that detects drift before you run terraform plan, tells you what changed, explains why it matters, and lets you decide whether to fix it — all from Backstage, without opening a terminal.

The Solution

A drift detection pipeline that runs on a schedule for every Terraform module in the catalog:

  1. Run terraform plan in a read-only mode — no apply, just detection
  2. Parse the plan output — which resources drifted, what changed
  3. Send the diff to AI — get a human-readable explanation of what drifted and what the risk is
  4. Store the result — link it to the catalog entity with a drift status annotation
  5. Show it in Backstage — a dashboard with all modules, their drift status, and one-click remediation
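
Step 4's catalog link can be as simple as a pair of annotations on the module's entity. A sketch; the annotation keys (infra-hub/drift-status, infra-hub/drift-risk) are invented here for illustration, not a Backstage standard — use whatever convention your org already has:

```yaml
# Hypothetical catalog-info.yaml fragment — the infra-hub/* annotation
# keys are illustrative only
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: tf-azurerm-aks
  annotations:
    infra-hub/drift-status: drift   # clean | drift | unknown
    infra-hub/drift-risk: medium
spec:
  type: terraform-module
  owner: platform-team
```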

What AI Adds

Without AI, drift detection gives you raw Terraform plan output — 200 lines of diff that an engineer has to read and interpret. With AI, you get:

“The AKS cluster’s node pool was manually scaled from 3 to 5 nodes (likely during the incident on March 28). The Terraform code still says 3. If you apply, Terraform will scale it back down to 3. Recommendation: update the code to match the current state (5 nodes) before applying.”

That’s the difference between “14 resources will be modified” and knowing exactly what happened, why, and what to do about it.

Execute

The Drift Detection Endpoint

app.MapPost("/api/drift/analyze", async (DriftRequest request, IConfiguration config) =>
{
    if (string.IsNullOrWhiteSpace(request.PlanOutput))
        return Results.BadRequest(new { error = "Plan output is required." });

    var endpoint = config["AI:Endpoint"];
    var apiKey = config["AI:Key"];
    var model = config["AI:ChatModel"] ?? "mistral-small-3.2-24b-instruct-2506";
    var provider = config["AI:Provider"] ?? "openai";

    ChatClient chatClient = provider.ToLowerInvariant() switch
    {
        "azure" => new AzureOpenAIClient(
            new Uri(endpoint!), new ApiKeyCredential(apiKey!))
            .GetChatClient(model),
        _ => new OpenAIClient(
            new ApiKeyCredential(apiKey!),
            new OpenAIClientOptions { Endpoint = new Uri(endpoint!) })
            .GetChatClient(model),
    };

    var systemPrompt = $"""
        You are an infrastructure drift analyst.
        A Terraform module was planned without any code changes.
        Any differences in the plan output represent drift — something
        changed in the real infrastructure that doesn't match the code.

        Module: {request.Module}
        Client: {request.Client}

        Analyze the Terraform plan output and return a JSON object with:
        - "driftDetected": boolean
        - "resourceCount": number of resources that drifted
        - "summary": 2-3 sentence explanation of what drifted and likely why
        - "risk": "none" | "low" | "medium" | "high" | "critical"
        - "resources": array of objects, each with:
          - "address": the Terraform resource address
          - "change": "update" | "create" | "destroy" | "replace"
          - "description": what changed, in plain English
          - "likelyCause": why this probably happened
        - "recommendation": what to do next (update code, apply, or investigate)

        If the plan shows "No changes", return driftDetected: false.
        Be specific about what changed. Don't guess causes you can't infer
        from the plan output.
        Return ONLY valid JSON, no markdown.
        """;

    try
    {
        var completion = await chatClient.CompleteChatAsync(
        [
            new SystemChatMessage(systemPrompt),
            new UserChatMessage($"Terraform plan output:\n\n{request.PlanOutput}"),
        ]);

        var raw = completion.Value.Content[0].Text.Trim();
        // Strip a markdown code fence if the model added one despite the prompt
        var json = raw;
        if (json.StartsWith("```"))
        {
            var nl = json.IndexOf('\n');
            json = (nl >= 0 ? json[(nl + 1)..] : "").TrimEnd('`', '\n', '\r', ' ').Trim();
        }

        var analysis = JsonSerializer.Deserialize<DriftAnalysis>(
            json, SerializerOptions.Default);

        if (analysis is null)
            return Results.UnprocessableEntity(new { error = "AI returned invalid analysis." });

        // Store the drift result
        var connStr = config["Rag:PostgresConnection"];
        if (!string.IsNullOrEmpty(connStr))
        {
            await using var dataSource = NpgsqlDataSource.Create(connStr);
            await using var cmd = dataSource.CreateCommand();
            cmd.CommandText = """
                INSERT INTO drift_results
                    (module, client, drift_detected, resource_count, risk,
                     summary, analysis_json, detected_at)
                VALUES ($1, $2, $3, $4, $5, $6, $7, NOW())
                ON CONFLICT (module)
                DO UPDATE SET client = EXCLUDED.client,
                    drift_detected = EXCLUDED.drift_detected,
                    resource_count = EXCLUDED.resource_count,
                    risk = EXCLUDED.risk,
                    summary = EXCLUDED.summary,
                    analysis_json = EXCLUDED.analysis_json,
                    detected_at = NOW()
                """;
            cmd.Parameters.AddWithValue(request.Module);
            cmd.Parameters.AddWithValue(request.Client ?? (object)DBNull.Value);
            cmd.Parameters.AddWithValue(analysis.DriftDetected);
            cmd.Parameters.AddWithValue(analysis.ResourceCount);
            cmd.Parameters.AddWithValue(analysis.Risk);
            cmd.Parameters.AddWithValue(analysis.Summary);
            cmd.Parameters.AddWithValue(json);

            await cmd.ExecuteNonQueryAsync();
        }

        return Results.Ok(analysis);
    }
    catch (ClientResultException ex) when (ex.Status == 401)
    {
        return Results.Json(new { error = "AI provider authentication failed." }, statusCode: 503);
    }
    catch (Exception ex)
    {
        return Results.Json(new { error = $"AI error: {ex.Message}" }, statusCode: 502);
    }
});

record DriftRequest(string Module, string? Client, string PlanOutput);

record DriftAnalysis(
    [property: JsonPropertyName("driftDetected")] bool DriftDetected,
    [property: JsonPropertyName("resourceCount")] int ResourceCount,
    [property: JsonPropertyName("summary")] string Summary,
    [property: JsonPropertyName("risk")] string Risk,
    [property: JsonPropertyName("resources")] List<DriftedResource> Resources,
    [property: JsonPropertyName("recommendation")] string Recommendation);

record DriftedResource(
    [property: JsonPropertyName("address")] string Address,
    [property: JsonPropertyName("change")] string Change,
    [property: JsonPropertyName("description")] string Description,
    [property: JsonPropertyName("likelyCause")] string LikelyCause);

The Database Schema

CREATE TABLE drift_results (
    id SERIAL PRIMARY KEY,
    module VARCHAR(255) NOT NULL UNIQUE,
    client VARCHAR(100),
    drift_detected BOOLEAN NOT NULL,
    resource_count INTEGER NOT NULL DEFAULT 0,
    risk VARCHAR(20) NOT NULL DEFAULT 'none',
    summary TEXT,
    analysis_json TEXT,
    detected_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);

-- module already gets an index from its UNIQUE constraint
CREATE INDEX idx_drift_client ON drift_results(client);
CREATE INDEX idx_drift_risk ON drift_results(risk);

The UNIQUE on module means each module has one drift result — the latest. The ON CONFLICT DO UPDATE in the insert replaces the previous result every time the scan runs.
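
As an analogy for the upsert: a Map keyed by module name holds exactly one entry per module, and a second write for the same module replaces the first, which is all ON CONFLICT (module) DO UPDATE does.

```typescript
// A Map keyed by module mirrors UNIQUE(module) + ON CONFLICT DO UPDATE:
// one row per module, always the latest scan.
interface DriftRow {
  module: string;
  risk: string;
  detectedAt: Date;
}

const latest = new Map<string, DriftRow>();

function upsert(row: DriftRow): void {
  // Equivalent of INSERT ... ON CONFLICT (module) DO UPDATE
  latest.set(row.module, row);
}
```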

The Drift Scan Pipeline

A GitHub Actions workflow that runs terraform plan for each module and sends the output to the AI service. This runs on a schedule — daily or every 12 hours:

# .github/workflows/drift-scan.yml
name: Drift Detection

on:
  schedule:
    - cron: '0 6 * * *'  # Every day at 6am UTC
  workflow_dispatch: {}   # Manual trigger

jobs:
  scan:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        module:
          - tf-azurerm-vnet
          - tf-azurerm-aks
          - tf-azurerm-sql
          - tf-azurerm-storage
    steps:
      - uses: actions/checkout@v4

      - uses: hashicorp/setup-terraform@v3
        with:
          terraform_version: '1.9'

      - name: Fetch secrets
        env:
          QUANTUMAPI_KEY: ${{ secrets.QUANTUMAPI_KEY }}
        run: |
          curl -sfL https://cli.quantumapi.eu/install.sh | sh -s -- -b /usr/local/bin
          ARM_CLIENT_ID=$(qapi vault get ${{ vars.ARM_CLIENT_ID_SECRET }})
          ARM_CLIENT_SECRET=$(qapi vault get ${{ vars.ARM_CLIENT_SECRET_ID }})
          ARM_TENANT_ID=$(qapi vault get ${{ vars.ARM_TENANT_ID_SECRET }})
          ARM_SUBSCRIPTION_ID=$(qapi vault get ${{ vars.ARM_SUBSCRIPTION_ID_SECRET }})
          echo "::add-mask::$ARM_CLIENT_ID"
          echo "::add-mask::$ARM_CLIENT_SECRET"
          echo "::add-mask::$ARM_TENANT_ID"
          echo "::add-mask::$ARM_SUBSCRIPTION_ID"
          echo "ARM_CLIENT_ID=$ARM_CLIENT_ID" >> $GITHUB_ENV
          echo "ARM_CLIENT_SECRET=$ARM_CLIENT_SECRET" >> $GITHUB_ENV
          echo "ARM_TENANT_ID=$ARM_TENANT_ID" >> $GITHUB_ENV
          echo "ARM_SUBSCRIPTION_ID=$ARM_SUBSCRIPTION_ID" >> $GITHUB_ENV

      - name: Terraform plan (drift detection)
        id: plan
        working-directory: modules/${{ matrix.module }}
        run: |
          terraform init -input=false
          set +e
          terraform plan -no-color -detailed-exitcode -out=plan.out 2>&1 | tee plan.txt
          # PIPESTATUS captures the exit code of terraform, not tee
          echo "exit_code=${PIPESTATUS[0]}" >> $GITHUB_OUTPUT
          set -e
        continue-on-error: true

      - name: Send to AI for analysis
        if: steps.plan.outputs.exit_code == '2'
        env:
          AI_SERVICE_URL: ${{ vars.AI_SERVICE_URL }}
        run: |
          # --rawfile avoids the ARG_MAX limits that $(cat ...) would hit on large plans
          curl -sf "$AI_SERVICE_URL/api/drift/analyze" \
            -H "Content-Type: application/json" \
            -d "$(jq -n \
              --arg module "${{ matrix.module }}" \
              --arg client "${{ vars.CLIENT_NAME }}" \
              --rawfile plan "modules/${{ matrix.module }}/plan.txt" \
              '{module: $module, client: $client, planOutput: $plan}')"

      - name: No drift
        if: steps.plan.outputs.exit_code == '0'
        env:
          AI_SERVICE_URL: ${{ vars.AI_SERVICE_URL }}
        run: |
          curl -sf "$AI_SERVICE_URL/api/drift/analyze" \
            -H "Content-Type: application/json" \
            -d "$(jq -n \
              --arg module "${{ matrix.module }}" \
              --arg client "${{ vars.CLIENT_NAME }}" \
              '{module: $module, client: $client, planOutput: "No changes. Your infrastructure matches your configuration."}')"

Key details:

  • -detailed-exitcode makes Terraform return exit code 2 when there are changes (drift), exit code 0 when clean, and exit code 1 on errors. This is how we distinguish “no drift” from “drift detected” without parsing the output.
  • continue-on-error: true as a safety net. The tee pipe already swallows Terraform's exit code (so exit code 2 doesn't fail the step), but this keeps a genuine plan failure from cancelling the rest of the matrix.
  • Secrets use the ::add-mask:: pattern from article 5 — fetch from QuantumVault, mask, then export.
  • Both outcomes (drift and no drift) are sent to the AI service, so the drift_results table always has the latest status for every module.

For Azure DevOps, replace the GitHub Actions syntax with an equivalent azure-pipelines.yml using the same structure — the terraform plan -detailed-exitcode trick works the same way.
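
The exit-code handling is worth pinning down. A sketch of the mapping the pipeline applies (TypeScript for illustration; the codes themselves are Terraform's documented -detailed-exitcode values):

```typescript
// Map terraform plan -detailed-exitcode values to a scan outcome.
// 0 = no changes, 2 = changes present (drift); anything else is an
// error and should be investigated rather than reported as clean.
type ScanOutcome = 'clean' | 'drift' | 'error';

function outcomeForExitCode(code: number): ScanOutcome {
  switch (code) {
    case 0:
      return 'clean';
    case 2:
      return 'drift';
    default:
      return 'error';
  }
}
```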

The Drift Dashboard

// plugins/drift-dashboard/src/components/DriftDashboard.tsx
import React, { useEffect, useState } from 'react';
import {
  Page, Header, Content, Table, TableColumn,
  StatusOK, StatusError, StatusWarning, StatusPending,
} from '@backstage/core-components';
import { Chip } from '@material-ui/core';
import { useApi, fetchApiRef, discoveryApiRef } from '@backstage/core-plugin-api';

interface DriftResult {
  module: string;
  client: string | null;
  driftDetected: boolean;
  resourceCount: number;
  risk: string;
  summary: string;
  detectedAt: string;
}

const RiskIndicator = ({ risk }: { risk: string }) => {
  switch (risk) {
    case 'none': return <StatusOK>Clean</StatusOK>;
    case 'low': return <StatusOK>Low</StatusOK>;
    case 'medium': return <StatusWarning>Medium</StatusWarning>;
    case 'high': return <StatusError>High</StatusError>;
    case 'critical': return <StatusError>Critical</StatusError>;
    default: return <StatusPending>Unknown</StatusPending>;
  }
};

export const DriftDashboard = () => {
  const fetchApi = useApi(fetchApiRef);
  const discoveryApi = useApi(discoveryApiRef);
  const [results, setResults] = useState<DriftResult[]>([]);

  useEffect(() => {
    const load = async () => {
      const proxyUrl = await discoveryApi.getBaseUrl('proxy');
      const res = await fetchApi.fetch(
        `${proxyUrl}/ai-service/api/drift/results`,
      );
      if (res.ok) {
        const data = await res.json();
        setResults(data);
      }
    };
    load();
  }, [fetchApi, discoveryApi]);

  const columns: TableColumn<DriftResult>[] = [
    { title: 'Module', field: 'module' },
    { title: 'Client', field: 'client',
      render: row => row.client || 'internal' },
    { title: 'Status', field: 'driftDetected',
      render: row => row.driftDetected
        ? <Chip label="DRIFT" color="secondary" size="small" />
        : <Chip label="CLEAN" color="default" size="small" /> },
    { title: 'Resources', field: 'resourceCount', type: 'numeric' },
    { title: 'Risk', field: 'risk',
      render: row => <RiskIndicator risk={row.risk} /> },
    { title: 'Summary', field: 'summary',
      render: row => row.summary
        ? row.summary.length > 120
          ? `${row.summary.slice(0, 120)}...`
          : row.summary
        : '' },
    { title: 'Last Scan', field: 'detectedAt',
      render: row => new Date(row.detectedAt).toLocaleDateString() },
  ];

  const driftCount = results.filter(r => r.driftDetected).length;

  return (
    <Page themeId="tool">
      <Header
        title="Infrastructure Drift"
        subtitle={`${driftCount} module${driftCount !== 1 ? 's' : ''} with drift detected`}
      />
      <Content>
        <Table
          columns={columns}
          data={results}
          title={`${results.length} modules scanned`}
          options={{
            paging: true,
            pageSize: 20,
            search: true,
            sorting: true,
          }}
        />
      </Content>
    </Page>
  );
};

The Results Endpoint

The dashboard needs an endpoint to fetch all drift results:

app.MapGet("/api/drift/results", async (string? client, IConfiguration config) =>
{
    var connStr = config["Rag:PostgresConnection"];
    if (string.IsNullOrEmpty(connStr))
        return Results.Json(new { error = "Not configured." }, statusCode: 503);

    await using var dataSource = NpgsqlDataSource.Create(connStr);
    await using var cmd = dataSource.CreateCommand();
    cmd.CommandText = """
        SELECT module, client, drift_detected, resource_count, risk,
               summary, detected_at
        FROM drift_results
        WHERE ($1 = '' OR client = $1)
        ORDER BY
            CASE risk
                WHEN 'critical' THEN 0
                WHEN 'high' THEN 1
                WHEN 'medium' THEN 2
                WHEN 'low' THEN 3
                ELSE 4
            END,
            detected_at DESC
        """;
    cmd.Parameters.AddWithValue(client ?? "");

    var results = new List<object>();
    await using var reader = await cmd.ExecuteReaderAsync();
    while (await reader.ReadAsync())
    {
        results.Add(new
        {
            Module = reader.GetString(0),
            Client = reader.IsDBNull(1) ? null : reader.GetString(1),
            DriftDetected = reader.GetBoolean(2),
            ResourceCount = reader.GetInt32(3),
            Risk = reader.GetString(4),
            Summary = reader.IsDBNull(5) ? null : reader.GetString(5),
            DetectedAt = reader.GetFieldValue<DateTimeOffset>(6),
        });
    }

    return Results.Ok(results);
});

Results are sorted by risk (critical first, then high, medium, low, clean) so the most important drifts appear at the top.
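
If you ever need the same ordering client-side (say, after filtering rows in the dashboard), it's a small rank lookup. A sketch mirroring the SQL CASE expression:

```typescript
// Rank risks the same way the SQL CASE expression does; unknown
// values sort last, matching the ELSE 4 branch.
const RISK_RANK: Record<string, number> = {
  critical: 0,
  high: 1,
  medium: 2,
  low: 3,
};

interface RankedRow {
  module: string;
  risk: string;
}

function sortByRisk(rows: RankedRow[]): RankedRow[] {
  return [...rows].sort(
    (a, b) => (RISK_RANK[a.risk] ?? 4) - (RISK_RANK[b.risk] ?? 4),
  );
}
```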

What the Dashboard Shows

The platform team opens /drift and sees:

Module              Client  Status  Resources  Risk    Summary                                                   Last Scan
tf-azurerm-aks      acme    DRIFT   3          Medium  Node pool scaled from 3→5 during incident. Code says 3.   Apr 7
tf-azurerm-sql      globex  DRIFT   1          Low     Firewall rule added manually for temp debugging access.   Apr 7
tf-azurerm-vnet     acme    CLEAN   0          None                                                              Apr 7
tf-azurerm-storage  acme    CLEAN   0          None                                                              Apr 7

Two modules have drift. One is medium risk (scaling change that Terraform would revert). One is low risk (a firewall rule that should either be committed or removed). The engineer sees the summary, understands the impact, and decides what to do — without running a single terraform plan locally.

Connecting Drift to the Chat

The infra chat from article 7 can now answer drift questions. Add drift_results to the context gathering:

// In GatherInfraContext, add after the CAB history section:

// 3. Drift status
await using (var cmd = dataSource.CreateCommand())
{
    cmd.CommandText = """
        SELECT module, drift_detected, resource_count, risk, summary
        FROM drift_results
        WHERE drift_detected = true
          AND ($1 = '' OR client = $1)
        ORDER BY detected_at DESC
        LIMIT 10
        """;
    cmd.Parameters.AddWithValue(request.Client ?? "");

    var drifts = new List<string>();
    await using var reader = await cmd.ExecuteReaderAsync();
    while (await reader.ReadAsync())
    {
        drifts.Add(
            $"- {reader.GetString(0)}: {reader.GetInt32(2)} resources drifted " +
            $"(risk: {reader.GetString(3)}) — {reader.GetString(4)}");
    }

    if (drifts.Count > 0)
        sections.Add($"DRIFT STATUS:\n{string.Join("\n", drifts)}");
}

Now an engineer can ask the chat: “Which modules have drift?” or “Is ACME’s AKS clean?” and get an answer based on the latest scan.

Registering the Dashboard

// plugins/drift-dashboard/src/plugin.ts
import {
  createPlugin,
  createRouteRef,
  createRoutableExtension,
} from '@backstage/core-plugin-api';

const rootRouteRef = createRouteRef({ id: 'drift-dashboard' });

export const driftDashboardPlugin = createPlugin({
  id: 'drift-dashboard',
  routes: { root: rootRouteRef },
});

export const DriftPage = driftDashboardPlugin.provide(
  createRoutableExtension({
    name: 'DriftPage',
    component: () =>
      import('./components/DriftDashboard').then(m => m.DriftDashboard),
    mountPoint: rootRouteRef,
  }),
);

In packages/app/src/App.tsx:

import { DriftPage } from '@internal/plugin-drift-dashboard';

// Inside <FlatRoutes>:
<Route path="/drift" element={<DriftPage />} />

Sidebar:

<SidebarItem icon={CompareArrowsIcon} to="drift" text="Drift" />

Checklist

  • /api/drift/analyze endpoint accepts plan output and returns structured analysis
  • /api/drift/results endpoint returns all drift results sorted by risk
  • drift_results table created with UNIQUE on module
  • Drift scan pipeline runs on schedule (terraform plan -detailed-exitcode)
  • Exit code 2 (changes) triggers AI analysis
  • Exit code 0 (no changes) updates status to clean
  • Pipeline secrets fetched from QuantumVault with ::add-mask::
  • Dashboard shows all modules with drift status, risk, and summary
  • Results sorted by risk (critical first)
  • DateTimeOffset used correctly for detected_at
  • Drift data integrated into the infra chat context
  • ON CONFLICT DO UPDATE keeps only the latest result per module

Challenge

Before the next article:

  1. Run the drift scan on one module — verify it detects a clean state
  2. Make a manual change in the cloud portal (scale something, add a tag)
  3. Run the scan again — verify drift is detected and the AI explanation is accurate
  4. Open the infra chat and ask “Which modules have drift?” — verify the answer includes the module you just changed

In the last article, we bring everything together: the Reference Architecture for the complete Infrastructure Hub — all 8 articles connected, deployed to Kubernetes, with the full plugin map and a guide for extending the platform.

The full code is on GitHub.

If this series helps you, consider buying me a coffee.

This is article 8 of the Infrastructure Hub series. Previous: Chat with Your Infrastructure. Next: Reference Architecture — the complete map of the Infrastructure Hub.
