The Infrastructure Hub -- Part 8
Drift Detection: When Your Infrastructure Doesn't Match Your Code
The Problem
It’s Monday morning. You run terraform plan on client ACME’s AKS module. Expected output: “No changes.” Actual output: 14 resources will be modified.
Nobody changed the code. Nobody merged a PR. But someone — maybe an engineer during an incident, maybe a client admin with portal access, maybe an automated process you forgot about — changed something directly in Azure. Now the cloud resource and the Terraform code disagree.
This is drift. And it happens in every organization that manages infrastructure with code. Not because people are careless — because reality is messy. Incidents need hotfixes. Clients make changes in the portal. Azure updates default settings. Kubernetes auto-scales and leaves behind config that Terraform didn’t expect.
The problem isn’t the drift itself — it’s not knowing about it. Most teams discover drift by accident: a terraform plan that shows unexpected changes, or worse, a terraform apply that reverts a manual fix someone made last week. By that point, the damage is done.
What you need is a system that detects drift before you run terraform plan, tells you what changed, explains why it matters, and lets you decide whether to fix it — all from Backstage, without opening a terminal.
The Solution
A drift detection pipeline that runs on a schedule for every Terraform module in the catalog:
- Run `terraform plan` in read-only mode — no apply, just detection
- Parse the plan output — which resources drifted, what changed
- Send the diff to AI — get a human-readable explanation of what drifted and what the risk is
- Store the result — link it to the catalog entity with a drift status annotation
- Show it in Backstage — a dashboard with all modules, their drift status, and one-click remediation
What AI Adds
Without AI, drift detection gives you a Terraform plan output — 200 lines of HCL diff that an engineer needs to read and interpret. With AI, you get:
“The AKS cluster’s node pool was manually scaled from 3 to 5 nodes (likely during the incident on March 28). The Terraform code still says 3. If you apply, Terraform will scale it back down to 3. Recommendation: update the code to match the current state (5 nodes) before applying.”
That’s the difference between “14 resources will be modified” and knowing exactly what happened, why, and what to do about it.
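To make that concrete, here is a sketch of the JSON shape the endpoint prompts the model for (the values are illustrative, not real scan output), with a quick `jq` shape check:

```shell
# Illustrative analysis payload matching the schema the prompt asks for.
# Values are made up; a real response comes from the model.
analysis='{
  "driftDetected": true,
  "resourceCount": 1,
  "summary": "Node pool was manually scaled from 3 to 5; the code still says 3.",
  "risk": "medium",
  "resources": [
    {
      "address": "azurerm_kubernetes_cluster_node_pool.default",
      "change": "update",
      "description": "node_count changed from 3 to 5",
      "likelyCause": "manual scale-up during an incident"
    }
  ],
  "recommendation": "Update the code to 5 nodes before applying."
}'
# Quick shape check: pull out the risk and the drifted-resource count.
echo "$analysis" | jq -r '.risk'               # medium
echo "$analysis" | jq -r '.resources | length' # 1
```

Anything the dashboard or chat later consumes comes from this one structure, so it is worth pinning down early.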
Execute
The Drift Detection Endpoint
app.MapPost("/api/drift/analyze", async (DriftRequest request, IConfiguration config) =>
{
if (string.IsNullOrWhiteSpace(request.PlanOutput))
return Results.BadRequest(new { error = "Plan output is required." });
var endpoint = config["AI:Endpoint"];
var apiKey = config["AI:Key"];
var model = config["AI:ChatModel"] ?? "mistral-small-3.2-24b-instruct-2506";
var provider = config["AI:Provider"] ?? "openai";
ChatClient chatClient = provider.ToLowerInvariant() switch
{
"azure" => new AzureOpenAIClient(
new Uri(endpoint!), new ApiKeyCredential(apiKey!))
.GetChatClient(model),
_ => new OpenAIClient(
new ApiKeyCredential(apiKey!),
new OpenAIClientOptions { Endpoint = new Uri(endpoint!) })
.GetChatClient(model),
};
var systemPrompt = $"""
You are an infrastructure drift analyst.
A Terraform module was planned without any code changes.
Any differences in the plan output represent drift — something
changed in the real infrastructure that doesn't match the code.
Module: {request.Module}
Client: {request.Client}
Analyze the Terraform plan output and return a JSON object with:
- "driftDetected": boolean
- "resourceCount": number of resources that drifted
- "summary": 2-3 sentence explanation of what drifted and likely why
- "risk": "none" | "low" | "medium" | "high" | "critical"
- "resources": array of objects, each with:
  - "address": the Terraform resource address
  - "change": "update" | "create" | "destroy" | "replace"
  - "description": what changed, in plain English
  - "likelyCause": why this probably happened
- "recommendation": overall next step (update code, apply, or investigate)
If the plan shows "No changes", return driftDetected: false.
Be specific about what changed. Don't guess causes you can't infer
from the plan output.
Return ONLY valid JSON, no markdown.
""";
try
{
var completion = await chatClient.CompleteChatAsync(
[
new SystemChatMessage(systemPrompt),
new UserChatMessage($"Terraform plan output:\n\n{request.PlanOutput}"),
]);
var raw = completion.Value.Content[0].Text.Trim();
// Strip a markdown code fence if the model added one despite the prompt
var json = raw.StartsWith("```")
? raw.Split('\n', 2)[1].TrimEnd('`').Trim()
: raw;
var analysis = JsonSerializer.Deserialize<DriftAnalysis>(
json, SerializerOptions.Default);
if (analysis is null)
return Results.UnprocessableEntity(new { error = "AI returned invalid analysis." });
// Store the drift result
var connStr = config["Rag:PostgresConnection"];
if (!string.IsNullOrEmpty(connStr))
{
await using var dataSource = NpgsqlDataSource.Create(connStr);
await using var cmd = dataSource.CreateCommand();
cmd.CommandText = """
INSERT INTO drift_results
(module, client, drift_detected, resource_count, risk,
summary, analysis_json, detected_at)
VALUES ($1, $2, $3, $4, $5, $6, $7, NOW())
ON CONFLICT (module)
DO UPDATE SET client = EXCLUDED.client,
drift_detected = EXCLUDED.drift_detected,
resource_count = EXCLUDED.resource_count,
risk = EXCLUDED.risk,
summary = EXCLUDED.summary,
analysis_json = EXCLUDED.analysis_json,
detected_at = NOW()
""";
cmd.Parameters.AddWithValue(request.Module);
cmd.Parameters.AddWithValue(request.Client ?? (object)DBNull.Value);
cmd.Parameters.AddWithValue(analysis.DriftDetected);
cmd.Parameters.AddWithValue(analysis.ResourceCount);
cmd.Parameters.AddWithValue(analysis.Risk);
cmd.Parameters.AddWithValue(analysis.Summary);
cmd.Parameters.AddWithValue(json);
await cmd.ExecuteNonQueryAsync();
}
return Results.Ok(analysis);
}
catch (ClientResultException ex) when (ex.Status == 401)
{
return Results.Json(new { error = "AI provider authentication failed." }, statusCode: 503);
}
catch (Exception ex)
{
return Results.Json(new { error = $"AI error: {ex.Message}" }, statusCode: 502);
}
});
record DriftRequest(string Module, string? Client, string PlanOutput);
record DriftAnalysis(
[property: JsonPropertyName("driftDetected")] bool DriftDetected,
[property: JsonPropertyName("resourceCount")] int ResourceCount,
[property: JsonPropertyName("summary")] string Summary,
[property: JsonPropertyName("risk")] string Risk,
[property: JsonPropertyName("resources")] List<DriftedResource> Resources,
[property: JsonPropertyName("recommendation")] string Recommendation);
record DriftedResource(
[property: JsonPropertyName("address")] string Address,
[property: JsonPropertyName("change")] string Change,
[property: JsonPropertyName("description")] string Description,
[property: JsonPropertyName("likelyCause")] string LikelyCause);
The Database Schema
CREATE TABLE drift_results (
id SERIAL PRIMARY KEY,
module VARCHAR(255) NOT NULL UNIQUE,
client VARCHAR(100),
drift_detected BOOLEAN NOT NULL,
resource_count INTEGER NOT NULL DEFAULT 0,
risk VARCHAR(20) NOT NULL DEFAULT 'none',
summary TEXT,
analysis_json TEXT,
detected_at TIMESTAMPTZ NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_drift_module ON drift_results(module);
CREATE INDEX idx_drift_client ON drift_results(client);
CREATE INDEX idx_drift_risk ON drift_results(risk);
The UNIQUE on module means each module has one drift result — the latest. The ON CONFLICT DO UPDATE in the insert replaces the previous result every time the scan runs.
The Drift Scan Pipeline
A GitHub Actions workflow that runs terraform plan for each module and sends the output to the AI service. This runs on a schedule — daily or every 12 hours:
# .github/workflows/drift-scan.yml
name: Drift Detection
on:
schedule:
- cron: '0 6 * * *' # Every day at 6am UTC
workflow_dispatch: {} # Manual trigger
jobs:
scan:
runs-on: ubuntu-latest
strategy:
matrix:
module:
- tf-azurerm-vnet
- tf-azurerm-aks
- tf-azurerm-sql
- tf-azurerm-storage
steps:
- uses: actions/checkout@v4
- uses: hashicorp/setup-terraform@v3
with:
terraform_version: '1.9'
- name: Fetch secrets
env:
QUANTUMAPI_KEY: ${{ secrets.QUANTUMAPI_KEY }}
run: |
curl -sfL https://cli.quantumapi.eu/install.sh | sh -s -- -b /usr/local/bin
ARM_CLIENT_ID=$(qapi vault get ${{ vars.ARM_CLIENT_ID_SECRET }})
ARM_CLIENT_SECRET=$(qapi vault get ${{ vars.ARM_CLIENT_SECRET_ID }})
ARM_TENANT_ID=$(qapi vault get ${{ vars.ARM_TENANT_ID_SECRET }})
ARM_SUBSCRIPTION_ID=$(qapi vault get ${{ vars.ARM_SUBSCRIPTION_ID_SECRET }})
echo "::add-mask::$ARM_CLIENT_ID"
echo "::add-mask::$ARM_CLIENT_SECRET"
echo "::add-mask::$ARM_TENANT_ID"
echo "::add-mask::$ARM_SUBSCRIPTION_ID"
echo "ARM_CLIENT_ID=$ARM_CLIENT_ID" >> $GITHUB_ENV
echo "ARM_CLIENT_SECRET=$ARM_CLIENT_SECRET" >> $GITHUB_ENV
echo "ARM_TENANT_ID=$ARM_TENANT_ID" >> $GITHUB_ENV
echo "ARM_SUBSCRIPTION_ID=$ARM_SUBSCRIPTION_ID" >> $GITHUB_ENV
- name: Terraform plan (drift detection)
id: plan
working-directory: modules/${{ matrix.module }}
run: |
terraform init -input=false
set +e
terraform plan -no-color -detailed-exitcode -out=plan.out 2>&1 | tee plan.txt
# PIPESTATUS captures the exit code of terraform, not tee
echo "exit_code=${PIPESTATUS[0]}" >> $GITHUB_OUTPUT
set -e
continue-on-error: true
- name: Send to AI for analysis
if: steps.plan.outputs.exit_code == '2'
env:
AI_SERVICE_URL: ${{ vars.AI_SERVICE_URL }}
run: |
PLAN_OUTPUT=$(cat modules/${{ matrix.module }}/plan.txt)
curl -sf "$AI_SERVICE_URL/api/drift/analyze" \
-H "Content-Type: application/json" \
-d "$(jq -n \
--arg module "${{ matrix.module }}" \
--arg client "${{ vars.CLIENT_NAME }}" \
--arg plan "$PLAN_OUTPUT" \
'{module: $module, client: $client, planOutput: $plan}')"
- name: No drift
if: steps.plan.outputs.exit_code == '0'
env:
AI_SERVICE_URL: ${{ vars.AI_SERVICE_URL }}
run: |
curl -sf "$AI_SERVICE_URL/api/drift/analyze" \
-H "Content-Type: application/json" \
-d "$(jq -n \
--arg module "${{ matrix.module }}" \
--arg client "${{ vars.CLIENT_NAME }}" \
'{module: $module, client: $client, planOutput: "No changes. Your infrastructure matches your configuration."}')"
Key details:
- `-detailed-exitcode` makes Terraform return exit code 2 when there are changes (drift), exit code 0 when clean, and exit code 1 on errors. This is how we distinguish “no drift” from “drift detected” without parsing the output.
- `continue-on-error: true` because exit code 2 would normally fail the step.
- Secrets use the `::add-mask::` pattern from article 5 — fetch from QuantumVault, mask, then export.
- Both outcomes (drift and no drift) are sent to the AI service, so the `drift_results` table always has the latest status for every module.
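The exit-code capture through `tee` is the easiest part to get wrong, so here it is in isolation. A bash sketch with a stand-in function instead of the real terraform binary (`fake_plan` is hypothetical):

```shell
#!/usr/bin/env bash
# Stand-in for `terraform plan -detailed-exitcode`: exit code 2 means "changes".
fake_plan() { echo "Plan: 2 to change, 0 to destroy."; return 2; }

set +e                                # don't abort the script on the nonzero exit
fake_plan | tee plan.txt >/dev/null   # the pipeline's exit code is tee's (0)...
code=${PIPESTATUS[0]}                 # ...so read terraform's real code here
set -e

echo "exit_code=$code"                # exit_code=2
```

Without `PIPESTATUS`, the step would always see `tee`'s exit code 0 and drift would never be detected.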
For Azure DevOps, replace the GitHub Actions syntax with an equivalent azure-pipelines.yml using the same structure — the terraform plan -detailed-exitcode trick works the same way.
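A sketch of that azure-pipelines.yml, assuming a `moduleName` variable per module and using the logging-command equivalent of `$GITHUB_OUTPUT` (names and the pool are assumptions):

```yaml
# azure-pipelines.yml (sketch; moduleName and AI_SERVICE_URL are assumptions)
schedules:
  - cron: '0 6 * * *'
    displayName: Daily drift scan
    branches:
      include: [main]
    always: true          # run on schedule even when the repo hasn't changed

steps:
  - script: |
      terraform init -input=false
      set +e
      terraform plan -no-color -detailed-exitcode -out=plan.out 2>&1 | tee plan.txt
      echo "##vso[task.setvariable variable=planExitCode]${PIPESTATUS[0]}"
      set -e
    workingDirectory: modules/$(moduleName)
    displayName: Terraform plan (drift detection)

  - script: |
      # POST plan.txt to $(AI_SERVICE_URL)/api/drift/analyze, as in the GitHub job
      echo "Drift detected, sending plan to AI service"
    condition: eq(variables['planExitCode'], '2')
    displayName: Send to AI for analysis
```

`##vso[task.setvariable]` passes the exit code between steps the way `$GITHUB_OUTPUT` does in the GitHub Actions version.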
The Drift Dashboard
// plugins/drift-dashboard/src/components/DriftDashboard.tsx
import React, { useEffect, useState } from 'react';
import {
Page, Header, Content, Table, TableColumn,
StatusOK, StatusError, StatusWarning, StatusPending,
} from '@backstage/core-components';
import { Chip } from '@material-ui/core';
import { useApi, fetchApiRef, discoveryApiRef } from '@backstage/core-plugin-api';
interface DriftResult {
module: string;
client: string | null;
driftDetected: boolean;
resourceCount: number;
risk: string;
summary: string;
detectedAt: string;
}
const RiskIndicator = ({ risk }: { risk: string }) => {
switch (risk) {
case 'none': return <StatusOK>Clean</StatusOK>;
case 'low': return <StatusOK>Low</StatusOK>;
case 'medium': return <StatusWarning>Medium</StatusWarning>;
case 'high': return <StatusError>High</StatusError>;
case 'critical': return <StatusError>Critical</StatusError>;
default: return <StatusPending>Unknown</StatusPending>;
}
};
export const DriftDashboard = () => {
const fetchApi = useApi(fetchApiRef);
const discoveryApi = useApi(discoveryApiRef);
const [results, setResults] = useState<DriftResult[]>([]);
useEffect(() => {
const load = async () => {
const proxyUrl = await discoveryApi.getBaseUrl('proxy');
const res = await fetchApi.fetch(
`${proxyUrl}/ai-service/api/drift/results`,
);
if (res.ok) {
const data = await res.json();
setResults(data);
}
};
load();
}, [fetchApi, discoveryApi]);
const columns: TableColumn<DriftResult>[] = [
{ title: 'Module', field: 'module' },
{ title: 'Client', field: 'client',
render: row => row.client || 'internal' },
{ title: 'Status', field: 'driftDetected',
render: row => row.driftDetected
? <Chip label="DRIFT" color="secondary" size="small" />
: <Chip label="CLEAN" color="default" size="small" /> },
{ title: 'Resources', field: 'resourceCount', type: 'numeric' },
{ title: 'Risk', field: 'risk',
render: row => <RiskIndicator risk={row.risk} /> },
{ title: 'Summary', field: 'summary',
render: row => row.summary
? row.summary.substring(0, 120) + (row.summary.length > 120 ? '…' : '')
: '' },
{ title: 'Last Scan', field: 'detectedAt',
render: row => new Date(row.detectedAt).toLocaleDateString() },
];
const driftCount = results.filter(r => r.driftDetected).length;
return (
<Page themeId="tool">
<Header
title="Infrastructure Drift"
subtitle={`${driftCount} module${driftCount !== 1 ? 's' : ''} with drift detected`}
/>
<Content>
<Table
columns={columns}
data={results}
title={`${results.length} modules scanned`}
options={{
paging: true,
pageSize: 20,
search: true,
sorting: true,
}}
/>
</Content>
</Page>
);
};
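The fetch path above goes through the Backstage proxy, so app-config.yaml needs an entry mapping `/ai-service` to the C# service (the target URL is an assumption):

```yaml
# app-config.yaml -- proxy entry the dashboard's fetch path relies on
proxy:
  endpoints:
    '/ai-service':
      target: http://ai-service:8080   # wherever the C# service runs
      changeOrigin: true
```

Without this entry, the `${proxyUrl}/ai-service/...` call returns 404 from the Backstage backend.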
The Results Endpoint
The dashboard needs an endpoint to fetch all drift results:
app.MapGet("/api/drift/results", async (string? client, IConfiguration config) =>
{
var connStr = config["Rag:PostgresConnection"];
if (string.IsNullOrEmpty(connStr))
return Results.Json(new { error = "Not configured." }, statusCode: 503);
await using var dataSource = NpgsqlDataSource.Create(connStr);
await using var cmd = dataSource.CreateCommand();
cmd.CommandText = """
SELECT module, client, drift_detected, resource_count, risk,
summary, detected_at
FROM drift_results
WHERE ($1 = '' OR client = $1)
ORDER BY
CASE risk
WHEN 'critical' THEN 0
WHEN 'high' THEN 1
WHEN 'medium' THEN 2
WHEN 'low' THEN 3
ELSE 4
END,
detected_at DESC
""";
cmd.Parameters.AddWithValue(client ?? "");
var results = new List<object>();
await using var reader = await cmd.ExecuteReaderAsync();
while (await reader.ReadAsync())
{
results.Add(new
{
Module = reader.GetString(0),
Client = reader.IsDBNull(1) ? null : reader.GetString(1),
DriftDetected = reader.GetBoolean(2),
ResourceCount = reader.GetInt32(3),
Risk = reader.GetString(4),
Summary = reader.IsDBNull(5) ? null : reader.GetString(5),
DetectedAt = reader.GetFieldValue<DateTimeOffset>(6),
});
}
return Results.Ok(results);
});
Results are sorted by risk (critical first, then high, medium, low, clean) so the most important drifts appear at the top.
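The CASE expression is just a rank lookup. The same ordering, sketched in shell purely for illustration:

```shell
# Rank lookup equivalent of the SQL CASE: lower rank sorts first,
# and anything outside the known risk levels falls to the bottom.
ordered=$(printf 'tf-azurerm-vnet none\ntf-azurerm-aks medium\ntf-azurerm-sql critical\n' |
  awk 'BEGIN { rank["critical"]=0; rank["high"]=1; rank["medium"]=2; rank["low"]=3 }
       { r = ($2 in rank) ? rank[$2] : 4; print r, $0 }' |
  sort -n | cut -d" " -f2-)
echo "$ordered"
# tf-azurerm-sql critical
# tf-azurerm-aks medium
# tf-azurerm-vnet none
```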
What the Dashboard Shows
The platform team opens /drift and sees:
| Module | Client | Status | Resources | Risk | Summary | Last Scan |
|---|---|---|---|---|---|---|
| tf-azurerm-aks | acme | DRIFT | 3 | Medium | Node pool scaled from 3→5 during incident. Code says 3. | Apr 7 |
| tf-azurerm-sql | globex | DRIFT | 1 | Low | Firewall rule added manually for temp debugging access. | Apr 7 |
| tf-azurerm-vnet | acme | CLEAN | 0 | None | | Apr 7 |
| tf-azurerm-storage | acme | CLEAN | 0 | None | | Apr 7 |
Two modules have drift. One is medium risk (scaling change that Terraform would revert). One is low risk (a firewall rule that should either be committed or removed). The engineer sees the summary, understands the impact, and decides what to do — without running a single terraform plan locally.
Connecting Drift to the Chat
The infra chat from article 7 can now answer drift questions. Add drift_results to the context gathering:
// In GatherInfraContext, add after the CAB history section:
// 3. Drift status
await using (var cmd = dataSource.CreateCommand())
{
cmd.CommandText = """
SELECT module, drift_detected, resource_count, risk, summary
FROM drift_results
WHERE drift_detected = true
AND ($1 = '' OR client = $1)
ORDER BY detected_at DESC
LIMIT 10
""";
cmd.Parameters.AddWithValue(request.Client ?? "");
var drifts = new List<string>();
await using var reader = await cmd.ExecuteReaderAsync();
while (await reader.ReadAsync())
{
drifts.Add(
$"- {reader.GetString(0)}: {reader.GetInt32(2)} resources drifted " +
$"(risk: {reader.GetString(3)}) — {reader.GetString(4)}");
}
if (drifts.Count > 0)
sections.Add($"DRIFT STATUS:\n{string.Join("\n", drifts)}");
}
Now an engineer can ask the chat: “Which modules have drift?” or “Is ACME’s AKS clean?” and get an answer based on the latest scan.
Registering the Dashboard
// plugins/drift-dashboard/src/plugin.ts
import {
createPlugin,
createRouteRef,
createRoutableExtension,
} from '@backstage/core-plugin-api';
const rootRouteRef = createRouteRef({ id: 'drift-dashboard' });
export const driftDashboardPlugin = createPlugin({
id: 'drift-dashboard',
routes: { root: rootRouteRef },
});
export const DriftPage = driftDashboardPlugin.provide(
createRoutableExtension({
name: 'DriftPage',
component: () =>
import('./components/DriftDashboard').then(m => m.DriftDashboard),
mountPoint: rootRouteRef,
}),
);
In packages/app/src/App.tsx:
import { DriftPage } from '@internal/plugin-drift-dashboard';
// Inside <FlatRoutes>:
<Route path="/drift" element={<DriftPage />} />
Sidebar:
<SidebarItem icon={CompareArrowsIcon} to="drift" text="Drift" />
Checklist
- `/api/drift/analyze` endpoint accepts plan output and returns structured analysis
- `/api/drift/results` endpoint returns all drift results sorted by risk
- `drift_results` table created with `UNIQUE` on module
- Drift scan pipeline runs on schedule (`terraform plan -detailed-exitcode`)
- Exit code 2 (changes) triggers AI analysis
- Exit code 0 (no changes) updates status to clean
- Pipeline secrets fetched from QuantumVault with `::add-mask::`
- Dashboard shows all modules with drift status, risk, and summary
- Results sorted by risk (critical first)
- `DateTimeOffset` used correctly for `detected_at`
- Drift data integrated into the infra chat context
- `ON CONFLICT DO UPDATE` keeps only the latest result per module
Challenge
Before the next article:
- Run the drift scan on one module — verify it detects a clean state
- Make a manual change in the cloud portal (scale something, add a tag)
- Run the scan again — verify drift is detected and the AI explanation is accurate
- Open the infra chat and ask “Which modules have drift?” — verify the answer includes the module you just changed
In the last article, we bring everything together: the Reference Architecture for the complete Infrastructure Hub — all 8 articles connected, deployed to Kubernetes, with the full plugin map and a guide for extending the platform.
The full code is on GitHub.
If this series helps you, consider buying me a coffee.
This is article 8 of the Infrastructure Hub series. Previous: Chat with Your Infrastructure. Next: Reference Architecture — the complete map of the Infrastructure Hub.