The AI-Native IDP -- Part 7
AI-Assisted Incident Response
The Problem
The alert fires at 3:12am: invoice-api, 500 errors, error rate above threshold.
The on-call engineer opens the monitoring dashboard. Sees a spike in errors starting at 3:08am. Opens the logs. Scrolls through pages of structured JSON. Finds an NpgsqlException: connection refused. Checks the database. PostgreSQL is running. Checks the connection string. Looks correct. Checks recent deployments. There was a deployment 20 minutes ago. Looks at the diff. Someone changed the connection string format from Host= to Server=. Npgsql doesn’t accept Server=.
That took 40 minutes. The fix takes 2 minutes — revert the deployment.
The problem is not the fix. The problem is the investigation. The on-call engineer didn’t know that a deployment happened 20 minutes ago. Didn’t know the invoice-api uses PostgreSQL. Didn’t know the connection string format matters. All of this information exists in the system — in the catalog, in the deployment history, in the GOTCHA.md — but it takes a human 40 minutes to connect the dots.
The Solution
An incident response plugin that does the investigation automatically. When an alert fires:
- Identifies the affected service from the alert
- Reads the catalog: what does this service depend on?
- Checks recent deployments: what changed?
- Reads the logs: what errors are appearing?
- Reads the GOTCHA.md: what are the known heuristics?
- Sends all of this to the AI model
- Returns a diagnosis with probable cause and suggested actions
The engineer still decides. But instead of starting from zero, they start with a hypothesis.
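The steps above can be sketched as a small pipeline. The names here are illustrative placeholders, not the plugin's real API:

```typescript
// Illustrative sketch of the investigation pipeline; the names are
// placeholders, not the plugin API.
interface IncidentContext {
  dependencies: string[];
  recentDeployments: string[];
  recentErrors: string[];
  heuristics: string[];
}

// Each source is gathered independently -- a failure in one (say, GitHub is
// down) degrades the context instead of aborting the whole investigation.
function gatherContext(sources: Partial<IncidentContext>): IncidentContext {
  return {
    dependencies: sources.dependencies ?? [],
    recentDeployments: sources.recentDeployments ?? [],
    recentErrors: sources.recentErrors ?? [],
    heuristics: sources.heuristics ?? [],
  };
}

// The prompt is just the flattened context plus the alert, handed to the
// model in one shot.
function buildPrompt(alert: string, ctx: IncidentContext): string {
  return [
    `Alert: ${alert}`,
    `Dependencies: ${ctx.dependencies.join(', ')}`,
    `Recent deployments:\n${ctx.recentDeployments.join('\n')}`,
    `Recent errors:\n${ctx.recentErrors.join('\n')}`,
    `Known heuristics:\n${ctx.heuristics.join('\n')}`,
  ].join('\n\n');
}
```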
Execute
The Incident Analysis Endpoint
app.MapPost("/api/incident/analyze", async (IncidentRequest request, IConfiguration config) =>
{
var endpoint = config["AI:Endpoint"];
var apiKey = config["AI:Key"];
var model = config["AI:ChatModel"] ?? "mistral-small-3.2-24b-instruct-2506";
var provider = config["AI:Provider"] ?? "openai";
ChatClient chatClient = provider.ToLowerInvariant() switch
{
"azure" => new AzureOpenAIClient(
new Uri(endpoint!), new ApiKeyCredential(apiKey!))
.GetChatClient(model),
_ => new OpenAIClient(
new ApiKeyCredential(apiKey!),
new OpenAIClientOptions { Endpoint = new Uri(endpoint!) })
.GetChatClient(model),
};
var systemPrompt = $"""
You are an incident response assistant for a cloud platform.
A service is experiencing issues. Analyze the available information
and provide a diagnosis.
SERVICE CONTEXT:
Name: {request.ServiceName}
Description: {request.ServiceDescription}
Dependencies: {string.Join(", ", request.Dependencies)}
Tags: {string.Join(", ", request.Tags)}
RECENT DEPLOYMENTS:
{request.RecentDeployments}
RECENT ERRORS (from logs):
{request.RecentErrors}
ARCHITECTURAL RULES (from GOTCHA.md):
{request.GotchaHeuristics}
Provide:
1. PROBABLE CAUSE — what most likely caused this incident, based on the evidence
2. EVIDENCE — which specific pieces of information led you to this conclusion
3. SUGGESTED ACTIONS — ordered list of steps to investigate and fix
4. RELATED SERVICES — which other services might be affected, based on dependencies
Be specific. Reference actual error messages, deployment changes, and service dependencies.
If you don't have enough information to diagnose, say so and suggest what data to collect.
""";
try
{
var completion = await chatClient.CompleteChatAsync(
[
new SystemChatMessage(systemPrompt),
new UserChatMessage(
$"Alert: {request.AlertTitle}\nSeverity: {request.Severity}\nStarted: {request.StartedAt}"),
]);
var analysis = completion.Value.Content[0].Text.Trim();
return Results.Ok(new { analysis });
}
catch (ClientResultException ex) when (ex.Status == 401)
{
return Results.Json(new { error = "AI provider authentication failed." }, statusCode: 503);
}
catch (Exception ex)
{
return Results.Json(new { error = $"AI provider error: {ex.Message}" }, statusCode: 502);
}
});
record IncidentRequest(
string ServiceName,
string ServiceDescription,
string[] Dependencies,
string[] Tags,
string RecentDeployments,
string RecentErrors,
string GotchaHeuristics,
string AlertTitle,
string Severity,
string StartedAt);
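A call to this endpoint might look like the following sketch. The values are illustrative; the field names mirror the IncidentRequest record, which ASP.NET Core binds from JSON case-insensitively:

```typescript
// Illustrative request body for POST /api/incident/analyze. The values are
// made up; the keys match the C# IncidentRequest record.
const body = {
  serviceName: 'invoice-api',
  serviceDescription: 'Invoicing REST API',
  dependencies: ['postgres', 'service-bus'],
  tags: ['dotnet', 'api'],
  recentDeployments: 'a3f7c2d — refactor config to use Server= syntax',
  recentErrors: 'NpgsqlException: Failed to connect to 10.0.1.5:5432',
  gotchaHeuristics: 'No secrets in code — all from environment variables',
  alertTitle: 'invoice-api error rate above threshold',
  severity: 'critical',
  startedAt: new Date().toISOString(),
};

// baseUrl is wherever the AI service runs, e.g. http://localhost:5100.
async function analyze(baseUrl: string): Promise<string> {
  const res = await fetch(`${baseUrl}/api/incident/analyze`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(body),
  });
  if (!res.ok) throw new Error(`Analysis failed: ${res.status}`);
  const { analysis } = await res.json();
  return analysis;
}
```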
The Backstage Incident Plugin
The plugin listens for alerts (via webhook from your monitoring system) and gathers all context before calling the AI:
// plugins/ai-incident/src/plugin.ts
import {
coreServices,
createBackendPlugin,
} from '@backstage/backend-plugin-api';
import { catalogServiceRef } from '@backstage/plugin-catalog-node';
import { createRouter } from './router';
export const aiIncidentPlugin = createBackendPlugin({
pluginId: 'ai-incident',
register(env) {
env.registerInit({
deps: {
logger: coreServices.logger,
httpRouter: coreServices.httpRouter,
config: coreServices.rootConfig,
catalog: catalogServiceRef,
auth: coreServices.auth,
},
async init({ logger, httpRouter, config, catalog, auth }) {
const aiServiceUrl = config.getString('forge.aiServiceUrl');
const router = await createRouter({
logger, catalog, auth, aiServiceUrl,
});
httpRouter.use(router);
httpRouter.addAuthPolicy({
path: '/webhook/alert',
allow: 'unauthenticated',
});
logger.info('AI Incident Response plugin initialized');
},
});
},
});
The Context Gatherer
This is the core logic — it collects information from multiple sources:
// plugins/ai-incident/src/gather.ts
import type { Entity } from '@backstage/catalog-model';
import type { LoggerService } from '@backstage/backend-plugin-api';
import { Octokit } from '@octokit/rest';
interface IncidentContext {
serviceName: string;
serviceDescription: string;
dependencies: string[];
tags: string[];
recentDeployments: string;
recentErrors: string;
gotchaHeuristics: string;
}
export async function gatherIncidentContext(
entity: Entity,
logger: LoggerService,
): Promise<IncidentContext> {
const slug =
entity.metadata.annotations?.['github.com/project-slug'] ?? '';
const [owner, repo] = slug.split('/');
const octokit = new Octokit({ auth: process.env.GITHUB_TOKEN });
const serviceName = entity.metadata.name;
const serviceDescription = entity.metadata.description ?? 'No description';
const tags = (entity.metadata.tags as string[]) ?? [];
const dependencies = tags; // In production, read from catalog relations
// 1. Recent deployments (last 5 GitHub releases or deployment events)
let recentDeployments = 'No deployment data available.';
try {
const { data: commits } = await octokit.repos.listCommits({
owner,
repo,
per_page: 5,
});
recentDeployments = commits
.map(
c =>
`${c.commit.author?.date} — ${c.commit.message} (${c.sha.slice(0, 7)})`,
)
.join('\n');
} catch {
logger.info(`Could not fetch commits for ${slug}`);
}
// 2. GOTCHA.md heuristics
let gotchaHeuristics = 'No GOTCHA.md found.';
try {
const { data: gotchaFile } = await octokit.repos.getContent({
owner,
repo,
path: 'GOTCHA.md',
mediaType: { format: 'raw' },
});
const gotchaContent = gotchaFile as unknown as string;
const heuristicsMatch = gotchaContent.match(
/## HEURISTICS\s*\n([\s\S]*?)(?=\n## [A-Z]|\n---|$)/,
);
if (heuristicsMatch) {
gotchaHeuristics = heuristicsMatch[1].trim();
}
} catch {
// No GOTCHA.md
}
// 3. Recent errors (in production, query your log aggregator)
// This is a placeholder — replace with your actual log source
const recentErrors =
'Connect to log aggregator API to fetch recent errors.';
return {
serviceName,
serviceDescription,
dependencies,
tags,
recentDeployments,
recentErrors,
gotchaHeuristics,
};
}
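The dependencies field above is a placeholder (it reuses the tags). In production you would read the catalog's resolved relations instead. The sketch below mirrors the shape of Entity.relations from @backstage/catalog-model, but defines a local type so the snippet stands alone:

```typescript
// Minimal local shape of a catalog entity relation (mirrors
// EntityRelation from @backstage/catalog-model: { type, targetRef }).
type EntityRelation = { type: string; targetRef: string };

// Derive dependencies from resolved `dependsOn` relations -- these come
// from spec.dependsOn in catalog-info.yaml once the catalog processes it.
function dependenciesFromRelations(relations: EntityRelation[] = []): string[] {
  return relations
    .filter(r => r.type === 'dependsOn')
    .map(r => r.targetRef); // e.g. "resource:default/invoice-db"
}
```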
The Alert Router
// plugins/ai-incident/src/router.ts
import { Router, json } from 'express';
import type { LoggerService, AuthService } from '@backstage/backend-plugin-api';
import type { CatalogService } from '@backstage/plugin-catalog-node';
import { gatherIncidentContext } from './gather';
interface RouterOptions {
logger: LoggerService;
catalog: CatalogService;
auth: AuthService;
aiServiceUrl: string;
}
export async function createRouter(options: RouterOptions): Promise<Router> {
const { logger, catalog, auth, aiServiceUrl } = options;
const router = Router();
router.use(json());
// Webhook from monitoring system (Prometheus Alertmanager, Grafana, etc.)
router.post('/webhook/alert', async (req, res) => {
const { serviceName, alertTitle, severity, startedAt, errors } =
req.body;
logger.info(`Alert received: ${alertTitle} for ${serviceName}`);
// Look up the service
const credentials = await auth.getOwnServiceCredentials();
const entities = await catalog.getEntities(
{
filter: {
kind: 'Component',
'metadata.name': serviceName,
},
},
{ credentials },
);
if (entities.items.length === 0) {
logger.info(`No catalog entity for ${serviceName}`);
res.status(200).json({ skipped: 'not in catalog' });
return;
}
const entity = entities.items[0];
// Gather context from multiple sources
const context = await gatherIncidentContext(entity, logger);
// Call AI service
const aiRes = await fetch(`${aiServiceUrl}/api/incident/analyze`, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
...context,
recentErrors: errors ?? context.recentErrors,
alertTitle,
severity,
startedAt,
}),
});
if (!aiRes.ok) {
logger.error(`AI incident analysis failed: ${aiRes.status}`);
res.status(500).json({ error: 'AI analysis failed' });
return;
}
const analysis = await aiRes.json();
logger.info(`Incident analysis complete for ${serviceName}`);
res.status(200).json(analysis);
});
// Manual analysis from the Backstage UI
router.post('/analyze', async (req, res) => {
const { entityRef, alertTitle, errors } = req.body;
// Parse entityRef (e.g. "component:default/invoice-api")
const name = entityRef.split('/').pop();
const credentials = await auth.getOwnServiceCredentials();
const entities = await catalog.getEntities(
{
filter: {
kind: 'Component',
'metadata.name': name,
},
},
{ credentials },
);
if (entities.items.length === 0) {
res.status(404).json({ error: 'Entity not found' });
return;
}
const entity = entities.items[0];
const context = await gatherIncidentContext(entity, logger);
const aiRes = await fetch(`${aiServiceUrl}/api/incident/analyze`, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
...context,
recentErrors: errors ?? context.recentErrors,
alertTitle: alertTitle ?? 'Manual analysis',
severity: 'unknown',
startedAt: new Date().toISOString(),
}),
});
const analysis = await aiRes.json();
res.status(200).json(analysis);
});
return router;
}
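The /webhook/alert route expects a flat body, but Prometheus Alertmanager posts a grouped payload ({ status, alerts: [...] }). A small adapter can translate before the route logic runs. This sketch assumes your alert rules attach a service label naming the catalog component — adjust to your labeling conventions:

```typescript
// Subset of the Alertmanager webhook payload relevant here.
interface AlertmanagerAlert {
  status: string; // "firing" | "resolved"
  labels: Record<string, string>;
  annotations: Record<string, string>;
  startsAt: string;
}

interface AlertmanagerPayload {
  status: string;
  alerts: AlertmanagerAlert[];
}

// Map each firing alert to the flat shape the webhook route expects.
// Assumes a `service` label identifies the catalog component.
function toIncidentAlerts(payload: AlertmanagerPayload) {
  return payload.alerts
    .filter(a => a.status === 'firing')
    .map(a => ({
      serviceName: a.labels.service ?? 'unknown',
      alertTitle: a.annotations.summary ?? a.labels.alertname ?? 'Alert',
      severity: a.labels.severity ?? 'unknown',
      startedAt: a.startsAt,
    }));
}
```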
The Incident Card Component
A card that appears on the service entity page, allowing on-demand analysis:
// plugins/ai-incident/src/components/IncidentCard.tsx
import React, { useState } from 'react';
import {
Button,
TextField,
Typography,
CircularProgress,
} from '@material-ui/core';
import WarningIcon from '@material-ui/icons/Warning';
import { InfoCard } from '@backstage/core-components';
import { useEntity } from '@backstage/plugin-catalog-react';
import { useApi, fetchApiRef, discoveryApiRef } from '@backstage/core-plugin-api';
export const IncidentCard = () => {
const { entity } = useEntity();
const fetchApi = useApi(fetchApiRef);
const discoveryApi = useApi(discoveryApiRef);
const [errors, setErrors] = useState('');
const [analysis, setAnalysis] = useState<string | null>(null);
const [loading, setLoading] = useState(false);
const entityRef = `component:default/${entity.metadata.name}`;
const handleAnalyze = async () => {
setLoading(true);
setAnalysis(null);
try {
const baseUrl = await discoveryApi.getBaseUrl('ai-incident');
const res = await fetchApi.fetch(`${baseUrl}/analyze`, {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({
entityRef,
errors: errors || undefined,
}),
});
if (!res.ok) {
throw new Error(`Analysis failed: ${res.status}`);
}
const data = await res.json();
setAnalysis(data.analysis);
} catch {
setAnalysis('Failed to analyze. Check the AI service connection.');
} finally {
setLoading(false);
}
};
return (
<InfoCard title="Incident Analysis" subheader="AI-assisted diagnosis">
<TextField
fullWidth
multiline
rows={3}
variant="outlined"
placeholder="Paste error messages or log excerpts here (optional)"
value={errors}
onChange={e => setErrors(e.target.value)}
disabled={loading}
/>
<Button
variant="contained"
color="primary"
onClick={handleAnalyze}
disabled={loading}
startIcon={loading ? <CircularProgress size={16} /> : <WarningIcon />}
style={{ marginTop: 8 }}
>
{loading ? 'Analyzing...' : 'Analyze Incident'}
</Button>
{analysis && (
<Typography
variant="body1"
style={{
whiteSpace: 'pre-wrap',
marginTop: 16,
padding: 16,
borderRadius: 4,
fontFamily: 'monospace',
fontSize: '0.85rem',
}}
>
{analysis}
</Typography>
)}
</InfoCard>
);
};
What an Analysis Looks Like
The on-call engineer opens the invoice-api in Backstage. Pastes the error: Npgsql.NpgsqlException: Failed to connect to 10.0.1.5:5432. Clicks “Analyze.” Gets back:
## PROBABLE CAUSE
The most recent deployment (20 minutes ago, commit a3f7c2d) changed the
database connection string format. The commit message says "refactor config
to use Server= syntax." Npgsql uses `Host=`, not `Server=`.
## EVIDENCE
- Error: `NpgsqlException: Failed to connect` — indicates a connection-level failure
- Recent deployment: `a3f7c2d — refactor config to use Server= syntax` (deployed 3:08am)
- Error spike started at 3:08am — matches the deployment time exactly
- Service dependencies: PostgreSQL (from catalog)
- GOTCHA heuristic: "No secrets in code — all from environment variables" — the
connection string should come from DB_CONNECTION env var
## SUGGESTED ACTIONS
1. **Immediate**: Revert deployment a3f7c2d or hotfix the connection string
to use `Host=` instead of `Server=`
2. **Verify**: Check the DB_CONNECTION environment variable in Kubernetes —
confirm the format uses `Host=`
3. **After fix**: Add a startup health check that validates the DB connection
before accepting traffic
## RELATED SERVICES
- notification-service: consumes events from invoice-api via Service Bus.
If invoice-api can't write to the DB, no events are published,
so notifications will stop.
40 minutes of investigation, done in 8 seconds. The engineer reads the analysis, confirms it makes sense, reverts the deployment. Done.
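If you want to render the four sections separately in the UI rather than as one monospace blob, a small parser over the ## headings works. This is a sketch that assumes the model follows the four-section format the prompt asks for:

```typescript
// Split a markdown analysis into its `## SECTION` parts.
// Assumes the model follows the heading format requested in the prompt.
function parseAnalysis(markdown: string): Record<string, string> {
  const sections: Record<string, string> = {};
  const re = /^## +(.+)$/gm;
  const headings: { name: string; index: number; end: number }[] = [];
  let match: RegExpExecArray | null;
  while ((match = re.exec(markdown)) !== null) {
    headings.push({ name: match[1].trim(), index: match.index, end: re.lastIndex });
  }
  headings.forEach((h, i) => {
    const bodyEnd = i + 1 < headings.length ? headings[i + 1].index : markdown.length;
    sections[h.name] = markdown.slice(h.end, bodyEnd).trim();
  });
  return sections;
}
```

If the model ignores the format, the result is an empty object — so the UI should fall back to showing the raw text.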
Connecting to Real Log Sources
The placeholder in gatherIncidentContext should connect to your actual log aggregator. Here’s an example with Grafana Loki:
async function fetchRecentErrors(
serviceName: string,
minutes: number = 30,
): Promise<string> {
const lokiUrl = process.env.LOKI_URL;
if (!lokiUrl) return 'Log aggregator not configured.';
const query = encodeURIComponent(
`{app="${serviceName}"} |~ "error|exception"`,
);
const end = Date.now() * 1_000_000; // nanoseconds
const start = end - minutes * 60 * 1_000_000_000;
const res = await fetch(
`${lokiUrl}/loki/api/v1/query_range?query=${query}&start=${start}&end=${end}&limit=50`,
);
if (!res.ok) return 'Failed to query log aggregator.';
const data = await res.json();
const lines = data.data.result
.flatMap((stream: { values: string[][] }) =>
stream.values.map(v => v[1]),
)
.slice(0, 20);
return lines.join('\n') || 'No recent errors found.';
}
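For Azure-hosted services, a similar function against the Application Insights query API (api.applicationinsights.io/v1/apps/{appId}/query) would do the same job. The cloud_RoleName filter below assumes your services report their catalog name as the role name — adjust to your telemetry conventions:

```typescript
// Build a KQL query for recent exceptions from one service.
// Assumes cloud_RoleName carries the catalog component name.
function buildExceptionsQuery(serviceName: string, minutes = 30): string {
  return [
    'exceptions',
    `| where timestamp > ago(${minutes}m)`,
    `| where cloud_RoleName == "${serviceName}"`,
    '| project timestamp, type, outerMessage',
    '| take 50',
  ].join('\n');
}

async function fetchAppInsightsErrors(
  appId: string,
  apiKey: string,
  serviceName: string,
): Promise<string> {
  const query = encodeURIComponent(buildExceptionsQuery(serviceName));
  const res = await fetch(
    `https://api.applicationinsights.io/v1/apps/${appId}/query?query=${query}`,
    { headers: { 'x-api-key': apiKey } },
  );
  if (!res.ok) return 'Failed to query Application Insights.';
  const data = await res.json();
  // The query API returns { tables: [{ columns, rows }] }.
  const rows: unknown[][] = data.tables?.[0]?.rows ?? [];
  return rows.map(r => r.join(' ')).join('\n') || 'No recent errors found.';
}
```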
Registering the Plugin
In packages/backend/src/index.ts:
import { aiIncidentPlugin } from '@internal/plugin-ai-incident';
backend.add(aiIncidentPlugin);
The IncidentCard is a separate frontend plugin (@internal/plugin-ai-incident-widget). Add it to the entity page:
import { IncidentCard } from '@internal/plugin-ai-incident-widget';
// In the overviewContent grid:
<Grid item md={6}>
<IncidentCard />
</Grid>
The frontend widget uses discoveryApiRef and fetchApiRef to call the backend — the same pattern we use in the Ask AI widget and the Governance Dashboard.
Checklist
- /api/incident/analyze endpoint accepts service context + errors and returns analysis
- Alert webhook gathers context from catalog, GitHub commits, and GOTCHA.md
- Manual analysis available from the entity page via IncidentCard
- Analysis includes probable cause, evidence, suggested actions, and related services
- Log aggregator integration (Loki, Application Insights, etc.) connected
- Plugin registered in Backstage backend
Before the Next Article
We’ve built six AI features into the platform: catalog enrichment, smart scaffolding, context-aware code review, documentation RAG, governance dashboard, and incident response. Each one reads the catalog. Each one uses the GOTCHA prompt. Each one logs its usage for governance.
But they’re separate plugins. The final article brings everything together into a reference architecture — how these pieces connect, how to deploy them, and how to extend the platform with new AI features.
That’s article 8: The Reference Architecture.
The full code is on GitHub.
Troubleshooting
IncidentCard shows “Failed to analyze”
Make sure the AI service is running on port 5100 and forge.aiServiceUrl is set in app-config.yaml. The backend plugin calls the AI service directly, not through the proxy.
Webhook returns “not in catalog”
The serviceName in the alert body must match a component’s metadata.name in the Backstage catalog. Check with curl http://localhost:7007/api/catalog/entities to see what entities exist.
GitHub commits not loading
The plugin reads commits from the repo in github.com/project-slug. Make sure the annotation exists in the entity’s catalog-info.yaml and GITHUB_TOKEN is set.
If this series helps you, consider buying me a coffee.
This is article 7 of the AI-Native IDP series. Previous: The AI Governance Dashboard. Next: The Reference Architecture.