The Infrastructure Hub -- Part 7

Chat with Your Infrastructure

#platform-engineering #backstage #ai #chat #infrastructure #rag

The Problem

“Which Kubernetes version does client ACME run?”

You know this. You’ve seen it. But you manage infrastructure for 5 clients across 3 clouds, and right now you can’t remember if ACME is on 1.29 or 1.30. So you open the catalog, find the AKS module, click through to the repo, open variables.tf, scroll to the kubernetes_version default. 1.29. Took 90 seconds.

Now multiply that by the 15 questions you answer every day: “What’s the CIDR range for Globex’s VNet?”, “Which modules did we change last week for client Stark?”, “Is the storage account module using private endpoints?”, “What’s the approval policy for ACME’s production changes?”

Each answer exists somewhere — in the catalog, in the Terraform code, in the documentation, in the CAB history. But finding it takes time. And half the time someone asks in Slack instead, and an engineer stops what they’re doing to answer.

The IDP series built an Ask widget for single questions about documentation. But infrastructure questions need more context. They need multi-turn conversations. “Show me ACME’s modules” → “Which ones have drift?” → “What changed in the VNet module last month?” Each question builds on the previous one.

The Solution

A conversational chat panel in Backstage that:

  1. Reads the catalog — knows every module, every client, every owner, every lifecycle status
  2. Searches documentation via the RAG system from the IDP series
  3. Queries the CAB history from article 6 — recent changes, approvals, risk levels
  4. Maintains conversation context — each message builds on the previous ones
  5. Works with any AI provider — same OpenAI-compatible pattern as every other endpoint in the series

The chat is not a generic chatbot. It’s scoped to your platform data. Every answer comes from your catalog, your docs, or your change history — not from the model’s training data.

Execute

The Chat Endpoint

A new endpoint in the AI service that handles multi-turn conversations with infrastructure context:

app.MapPost("/api/infra/chat", async (InfraChatRequest request, IConfiguration config) =>
{
    if (string.IsNullOrWhiteSpace(request.Message))
        return Results.BadRequest(new { error = "Message is required." });

    var endpoint = config["AI:Endpoint"];
    var apiKey = config["AI:Key"];
    var model = config["AI:ChatModel"] ?? "mistral-small-3.2-24b-instruct-2506";
    var provider = config["AI:Provider"] ?? "openai";

    ChatClient chatClient = provider.ToLowerInvariant() switch
    {
        "azure" => new AzureOpenAIClient(
            new Uri(endpoint!), new ApiKeyCredential(apiKey!))
            .GetChatClient(model),
        _ => new OpenAIClient(
            new ApiKeyCredential(apiKey!),
            new OpenAIClientOptions { Endpoint = new Uri(endpoint!) })
            .GetChatClient(model),
    };

    // Build context from multiple sources
    var context = await GatherInfraContext(request, config);

    var messages = new List<ChatMessage>
    {
        new SystemChatMessage($"""
            You are an infrastructure assistant for a Backstage-based platform.
            You help engineers find information about Terraform modules, clients,
            pipelines, and infrastructure changes.

            AVAILABLE DATA:
            {context}

            RULES:
            - Answer ONLY from the data above. If the answer isn't in the data, say so.
            - When referencing a module, include its lifecycle status and owner.
            - When referencing a change, include the CR ID and risk level.
            - Be concise. Engineers are busy.
            - If asked about costs or budgets, say you don't have that data.
            """),
    };

    // Add conversation history (multi-turn)
    foreach (var turn in request.History ?? [])
    {
        messages.Add(turn.Role == "user"
            ? new UserChatMessage(turn.Content)
            : new AssistantChatMessage(turn.Content));
    }

    // Add the current message
    messages.Add(new UserChatMessage(request.Message));

    try
    {
        var completion = await chatClient.CompleteChatAsync(messages);
        var reply = completion.Value.Content[0].Text.Trim();

        return Results.Ok(new
        {
            reply,
            sourcesUsed = context.Length > 0
                ? new[] { "catalog", "cab-history", "documentation" }
                : Array.Empty<string>(),
        });
    }
    catch (ClientResultException ex) when (ex.Status == 401)
    {
        return Results.Json(new { error = "AI provider authentication failed." }, statusCode: 503);
    }
    catch (Exception ex)
    {
        return Results.Json(new { error = $"AI error: {ex.Message}" }, statusCode: 502);
    }
});

record InfraChatRequest(
    string Message,
    string? Client,
    List<ChatTurn>? History); // nullable: the endpoint handles a missing history with ?? []

record ChatTurn(string Role, string Content);
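On the wire, the records serialize as camelCase JSON (the ASP.NET minimal-API default). A sketch of what a request body might look like — the client name, message, and history content are invented for illustration:

```typescript
// Hypothetical request payload for POST /api/infra/chat.
// Field names mirror InfraChatRequest/ChatTurn after camelCase serialization.
interface ChatTurn {
  role: 'user' | 'assistant';
  content: string;
}

interface InfraChatRequest {
  message: string;
  client: string | null;
  history: ChatTurn[];
}

const request: InfraChatRequest = {
  message: 'Which ones have drift?',
  client: 'acme',
  history: [
    { role: 'user', content: "Show me ACME's modules" },
    { role: 'assistant', content: 'ACME has 4 Terraform modules...' },
  ],
};

const body = JSON.stringify(request);
```

The follow-up question only makes sense because the previous turns travel with it — that is the multi-turn contract.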

Gathering Infrastructure Context

The key difference from a generic chatbot: before answering, we read your actual data.

async Task<string> GatherInfraContext(InfraChatRequest request, IConfiguration config)
{
    var connStr = config["Rag:PostgresConnection"];
    var sections = new List<string>();

    if (string.IsNullOrEmpty(connStr))
        return "No infrastructure data configured.";

    await using var dataSource = NpgsqlDataSource.Create(connStr);

    // 1. Catalog modules (filtered by client if specified)
    await using (var cmd = dataSource.CreateCommand())
    {
        // Query the Backstage catalog database for terraform modules
        // In production, call the catalog API instead of querying directly
        cmd.CommandText = """
            SELECT metadata->>'name', metadata->>'description',
                   spec->>'lifecycle', spec->>'owner', spec->>'system'
            FROM entities
            WHERE spec->>'type' = 'terraform-module'
              AND ($1 = '' OR spec->>'system' LIKE '%' || $1 || '%')
            ORDER BY metadata->>'name'
            LIMIT 50
            """;
        cmd.Parameters.AddWithValue(request.Client ?? "");

        var modules = new List<string>();
        await using var reader = await cmd.ExecuteReaderAsync();
        while (await reader.ReadAsync())
        {
            modules.Add(
                $"- {reader.GetString(0)}: {(reader.IsDBNull(1) ? "(no description)" : reader.GetString(1))} " +
                $"(lifecycle: {reader.GetString(2)}, owner: {reader.GetString(3)})");
        }

        if (modules.Count > 0)
            sections.Add($"MODULES:\n{string.Join("\n", modules)}");
    }

    // 2. Recent CAB approvals
    await using (var cmd = dataSource.CreateCommand())
    {
        cmd.CommandText = """
            SELECT change_request_id, module, client, risk_level,
                   approved_by, approved_at
            FROM cab_approvals
            WHERE approved_at >= NOW() - INTERVAL '30 days'
              AND ($1 = '' OR client = $1)
            ORDER BY approved_at DESC
            LIMIT 20
            """;
        cmd.Parameters.AddWithValue(request.Client ?? "");

        var changes = new List<string>();
        await using var reader = await cmd.ExecuteReaderAsync();
        while (await reader.ReadAsync())
        {
            changes.Add(
                $"- {reader.GetString(0)}: {reader.GetString(1)} " +
                $"(client: {(reader.IsDBNull(2) ? "internal" : reader.GetString(2))}, " +
                $"risk: {reader.GetString(3)}, " +
                $"approved by: {reader.GetString(4)}, " +
                $"date: {reader.GetFieldValue<DateTimeOffset>(5):yyyy-MM-dd})");
        }

        if (changes.Count > 0)
            sections.Add($"RECENT CHANGES (last 30 days):\n{string.Join("\n", changes)}");
    }

    // 3. RAG search — if the message looks like a question, search the docs
    if (request.Message.Contains('?') || request.Message.StartsWith("what", StringComparison.OrdinalIgnoreCase)
        || request.Message.StartsWith("how", StringComparison.OrdinalIgnoreCase)
        || request.Message.StartsWith("which", StringComparison.OrdinalIgnoreCase))
    {
        var embeddingModel = config["AI:EmbeddingModel"] ?? "bge-multilingual-gemma2";
        var aiEndpoint = config["AI:Endpoint"];
        var aiKey = config["AI:Key"];

        try
        {
            var openAiClient = new OpenAIClient(
                new ApiKeyCredential(aiKey!),
                new OpenAIClientOptions { Endpoint = new Uri(aiEndpoint!) });
            var embeddingClient = openAiClient.GetEmbeddingClient(embeddingModel);

            var questionEmbedding = await embeddingClient.GenerateEmbeddingAsync(request.Message);
            var vector = questionEmbedding.Value.ToFloats();
            // InvariantCulture so decimals serialize with '.' regardless of server locale
            var vectorStr = "[" + string.Join(",",
                vector.ToArray().Select(f => f.ToString("G", CultureInfo.InvariantCulture))) + "]";

            await using var cmd = dataSource.CreateCommand();
            cmd.CommandText = """
                SELECT doc_path, content,
                       1 - (embedding <=> $1::vector) AS similarity
                FROM doc_chunks
                WHERE ($2 = '' OR entity_ref LIKE '%' || $2 || '%')
                ORDER BY embedding <=> $1::vector
                LIMIT 3
                """;
            cmd.Parameters.AddWithValue(vectorStr);
            cmd.Parameters.AddWithValue(request.Client ?? "");

            var docs = new List<string>();
            await using var reader = await cmd.ExecuteReaderAsync();
            while (await reader.ReadAsync())
            {
                var similarity = reader.GetDouble(2); // the computed similarity is double precision
                if (similarity > 0.5)
                {
                    docs.Add($"[{reader.GetString(0)}]:\n{reader.GetString(1)}");
                }
            }

            if (docs.Count > 0)
                sections.Add($"DOCUMENTATION:\n{string.Join("\n\n", docs)}");
        }
        catch
        {
            // RAG not available — continue without docs
        }
    }

    return string.Join("\n\n", sections);
}
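pgvector's `<=>` operator computes cosine *distance*, which is why the query wraps it as `1 - (embedding <=> $1::vector)` to get a similarity score. A minimal sketch of that relationship:

```typescript
// Cosine similarity between two vectors. pgvector's `<=>` operator
// returns the cosine distance, which is 1 minus this value.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

const cosineDistance = (a: number[], b: number[]) => 1 - cosineSimilarity(a, b);
```

Identical directions give similarity 1 (distance 0); orthogonal vectors give similarity 0 (distance 1) — which is why the 0.5 threshold in the query roughly means "closer than halfway to unrelated."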

The context gathering has three layers:

  • Catalog modules — always included, filtered by client if the user specifies one
  • CAB history — recent changes and approvals from article 6
  • Documentation — RAG search from IDP article 5, only triggered for questions

This means the AI always knows what modules exist and what changed recently. For specific questions (“what’s the retry policy?”), it also searches the docs.
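The question heuristic from GatherInfraContext can be sketched as a small predicate (illustrative only — the real check lives in the C# endpoint):

```typescript
// Mirrors the server-side heuristic: trigger a RAG search only when the
// message looks like a question, not a command.
function looksLikeQuestion(message: string): boolean {
  const lower = message.toLowerCase();
  return (
    message.includes('?') ||
    lower.startsWith('what') ||
    lower.startsWith('how') ||
    lower.startsWith('which')
  );
}
```

Commands like "show me ACME's modules" skip the embedding call entirely, saving an API round trip per message.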

The Backstage Chat Panel

A full conversational panel, not a single-question widget:

// plugins/infra-chat/src/components/InfraChatPanel.tsx
import React, { useState, useRef, useEffect } from 'react';
import {
  Paper,
  TextField,
  IconButton,
  Typography,
  Select,
  MenuItem,
  InputLabel,
  FormControl,
  Box,
  CircularProgress,
} from '@material-ui/core';
import SendIcon from '@material-ui/icons/Send';
import { useApi, fetchApiRef, discoveryApiRef } from '@backstage/core-plugin-api';

interface Message {
  role: 'user' | 'assistant';
  content: string;
}

export const InfraChatPanel = () => {
  const fetchApi = useApi(fetchApiRef);
  const discoveryApi = useApi(discoveryApiRef);
  const [messages, setMessages] = useState<Message[]>([]);
  const [input, setInput] = useState('');
  const [client, setClient] = useState('');
  const [loading, setLoading] = useState(false);
  const messagesEndRef = useRef<HTMLDivElement>(null);

  useEffect(() => {
    messagesEndRef.current?.scrollIntoView({ behavior: 'smooth' });
  }, [messages]);

  const handleSend = async () => {
    if (!input.trim() || loading) return;

    const userMessage: Message = { role: 'user', content: input };
    setMessages(prev => [...prev, userMessage]);
    setInput('');
    setLoading(true);

    try {
      const proxyUrl = await discoveryApi.getBaseUrl('proxy');
      const res = await fetchApi.fetch(`${proxyUrl}/ai-service/api/infra/chat`, {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          message: input,
          client: client || null,
          history: messages.slice(-10), // Keep last 10 messages (~5 turns) for context window
        }),
      });

      if (!res.ok) throw new Error(`${res.status}`);

      const data = await res.json();
      setMessages(prev => [
        ...prev,
        { role: 'assistant', content: data.reply },
      ]);
    } catch {
      setMessages(prev => [
        ...prev,
        { role: 'assistant', content: 'Failed to get a response. Check the AI service.' },
      ]);
    } finally {
      setLoading(false);
    }
  };

  return (
    <Paper
      style={{
        display: 'flex',
        flexDirection: 'column',
        height: '600px',
        padding: 16,
      }}
    >
      <Box display="flex" alignItems="center" mb={2} style={{ gap: 16 }}>
        <Typography variant="h6" style={{ flex: 1 }}>
          Infrastructure Chat
        </Typography>
        <FormControl size="small" style={{ minWidth: 150 }}>
          <InputLabel>Client filter</InputLabel>
          <Select
            value={client}
            onChange={e => setClient(e.target.value as string)}
            label="Client filter"
          >
            <MenuItem value="">All clients</MenuItem>
            <MenuItem value="acme">ACME</MenuItem>
            <MenuItem value="globex">Globex</MenuItem>
            <MenuItem value="stark">Stark</MenuItem>
          </Select>
        </FormControl>
      </Box>

      <Box
        style={{
          flex: 1,
          overflowY: 'auto',
          marginBottom: 16,
          padding: 8,
        }}
      >
        {messages.length === 0 && (
          <Typography
            color="textSecondary"
            style={{ textAlign: 'center', marginTop: 80 }}
          >
            Ask about your infrastructure. Try: "Which modules does ACME use?"
            or "What changed last week?"
          </Typography>
        )}
        {messages.map((msg, i) => (
          <Box
            key={i}
            style={{
              display: 'flex',
              justifyContent: msg.role === 'user' ? 'flex-end' : 'flex-start',
              marginBottom: 8,
            }}
          >
            <Paper
              elevation={1}
              style={{
                padding: '8px 12px',
                maxWidth: '75%',
                backgroundColor: msg.role === 'user' ? '#e3f2fd' : '#f5f5f5',
                borderRadius: 12,
              }}
            >
              <Typography
                variant="body2"
                style={{ whiteSpace: 'pre-wrap' }}
              >
                {msg.content}
              </Typography>
            </Paper>
          </Box>
        ))}
        {loading && (
          <Box display="flex" justifyContent="flex-start" mb={1}>
            <CircularProgress size={20} />
          </Box>
        )}
        <div ref={messagesEndRef} />
      </Box>

      <Box display="flex" style={{ gap: 8 }}>
        <TextField
          fullWidth
          variant="outlined"
          size="small"
          placeholder="Ask about modules, clients, changes..."
          value={input}
          onChange={e => setInput(e.target.value)}
          onKeyDown={e => e.key === 'Enter' && !e.shiftKey && handleSend()}
          disabled={loading}
        />
        <IconButton
          color="primary"
          onClick={handleSend}
          disabled={loading || !input.trim()}
        >
          <SendIcon />
        </IconButton>
      </Box>
    </Paper>
  );
};

Adding the Chat Page to Backstage

// plugins/infra-chat/src/plugin.ts
import {
  createPlugin,
  createRouteRef,
  createRoutableExtension,
} from '@backstage/core-plugin-api';

const rootRouteRef = createRouteRef({ id: 'infra-chat' });

export const infraChatPlugin = createPlugin({
  id: 'infra-chat',
  routes: { root: rootRouteRef },
});

export const InfraChatPage = infraChatPlugin.provide(
  createRoutableExtension({
    name: 'InfraChatPage',
    component: () =>
      import('./components/InfraChatPanel').then(m => m.InfraChatPanel),
    mountPoint: rootRouteRef,
  }),
);

In packages/app/src/App.tsx:

import { InfraChatPage } from '@internal/plugin-infra-chat';

// Inside <FlatRoutes>:
<Route path="/infra-chat" element={<InfraChatPage />} />

And in the sidebar:

<SidebarItem icon={ChatIcon} to="infra-chat" text="Infra Chat" />

What a Conversation Looks Like

Engineer: “Which modules does client ACME use?”

Assistant: ACME has 4 Terraform modules in the catalog:

  • tf-azurerm-vnet — Hub VNet with 3 subnets (production, owner: team-acme)
  • tf-azurerm-aks — AKS cluster with 5 node pools (production, owner: team-acme)
  • tf-azurerm-sql — Azure SQL with private endpoints (production, owner: team-acme)
  • tf-azurerm-storage — Storage account with lifecycle policies (experimental, owner: team-acme)

Engineer: “What changed in the last 2 weeks?”

Assistant: 2 changes for ACME in the last 14 days:

  • CR-1712401234: tf-azurerm-aks — Scaled node pool from 3 to 5 nodes (risk: medium, approved by victor.zaragoza, 2026-04-01)
  • CR-1712398765: tf-azurerm-storage — Added lifecycle policy for blob deletion after 90 days (risk: low, auto-approved, 2026-03-28)

Engineer: “Is the AKS module on Kubernetes 1.30?”

Assistant: According to the documentation for tf-azurerm-aks, the kubernetes_version variable defaults to 1.29. The recent change CR-1712401234 scaled the node pool but didn’t upgrade the Kubernetes version. ACME is still on 1.29.

Three questions, 10 seconds. The same investigation without the chat would have taken 5 minutes of clicking through Backstage, GitHub, and the CAB history.

Context Window Management

The chat sends the last 10 messages as history. This keeps the context window manageable while maintaining conversation coherence. For longer investigations, the engineer can start a new conversation.

The history is truncated client-side:

history: messages.slice(-10), // Keep last 10 messages (~5 turns) for context window

This means roughly 5 question-answer pairs. For Mistral Small 3.2 (128K context), this leaves plenty of room for the infrastructure context (catalog + changes + docs).
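The truncation itself is trivial; a sketch with the Message shape from the panel:

```typescript
interface Message {
  role: 'user' | 'assistant';
  content: string;
}

// Keep only the most recent `max` messages (default 10, ~5 question-answer
// pairs) so history plus the gathered context stays inside the model window.
function truncateHistory(messages: Message[], max = 10): Message[] {
  return messages.slice(-max);
}

// 7 full turns = 14 messages; only the last 10 survive.
const fullHistory: Message[] = Array.from({ length: 14 }, (_, i): Message => ({
  role: i % 2 === 0 ? 'user' : 'assistant',
  content: `message ${i}`,
}));
const kept = truncateHistory(fullHistory);
```

Because `slice` drops from the front, the oldest turns fall away first — the conversation's recent thread is what the model actually needs.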

Client Filtering

The dropdown at the top filters everything: catalog queries, CAB history, and RAG search are all scoped to the selected client. Switch to “Globex” and you see Globex’s modules, Globex’s changes, and Globex’s documentation.

For the platform team, select “All clients” to see everything. For a team lead responsible for one client, select their client and they only see what’s relevant.
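All three SQL queries use the same `($1 = '' OR …)` pattern: an empty parameter matches everything, a non-empty one scopes to that client. The predicate, sketched outside SQL with invented rows:

```typescript
// Mirrors the SQL pattern ($1 = '' OR client = $1): an empty filter
// matches every row; a non-empty filter matches only that client.
function matchesClient(filter: string, client: string): boolean {
  return filter === '' || client === filter;
}

const rows = [
  { module: 'tf-azurerm-vnet', client: 'acme' },
  { module: 'tf-azurerm-aks', client: 'globex' },
];

const all = rows.filter(r => matchesClient('', r.client));        // both rows
const scoped = rows.filter(r => matchesClient('acme', r.client)); // ACME only
```

Keeping the filter inside the query (rather than filtering in C#) means the LIMIT applies after scoping, so a busy client can't crowd out another client's rows.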

Provider Agnosticism

The chat uses the same ChatClient pattern as every other endpoint:

AI:Provider = openai  → Scaleway, Mistral, OpenAI
AI:Provider = azure   → Azure AI Foundry (GPT-5, Claude Sonnet, Mistral Large)

Change the environment variables, same code. The chat works with any model that supports multi-turn conversations — which is all of them.
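The provider switch can be sketched as follows (a simplified mirror of the C# `switch`; the real code constructs ChatClient instances as shown earlier):

```typescript
interface ProviderConfig {
  kind: 'azure' | 'openai';
  endpoint: string;
}

// Same shape as the endpoint's switch: 'azure' gets its own client type,
// everything else falls through to the OpenAI-compatible path.
function resolveProvider(provider: string, endpoint: string): ProviderConfig {
  switch (provider.toLowerCase()) {
    case 'azure':
      return { kind: 'azure', endpoint };
    default:
      return { kind: 'openai', endpoint };
  }
}
```

The default branch is what makes the service portable: any OpenAI-compatible endpoint (Scaleway, Mistral, a local server) works without a dedicated case.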

Checklist

  • /api/infra/chat endpoint handles multi-turn conversations
  • Context gathered from catalog, CAB history, and RAG docs
  • Client filter scopes all queries
  • Conversation history limited to last 10 turns
  • Chat panel registered as a Backstage page (/infra-chat)
  • Sidebar item visible to all users
  • RAG search only triggered for questions (not for commands)
  • Similarity threshold filters out irrelevant docs (> 0.5)
  • DateTimeOffset used correctly for CAB approval dates
  • Returns 503 on AI provider auth failures, 502 on other AI errors

Challenge

Before the next article:

  1. Open the infra chat and ask “Which modules are in production?”
  2. Filter to one client and ask “What changed this month?”
  3. Ask a follow-up question that references a previous answer — verify the context carries over
  4. Ask something that’s NOT in the data — verify the AI says “I don’t have that information” instead of making something up

In the next article, we build Drift Detection — detect when real infrastructure doesn’t match the Terraform state, and use AI to explain what drifted, why it matters, and how to fix it. With PQC-encrypted state comparisons.

The full code is on GitHub.

If this series helps you, consider buying me a coffee.

This is article 7 of the Infrastructure Hub series. Previous: CAB Automation. Next: Drift Detection — find and fix infrastructure that doesn’t match your code.
