
Decision Infrastructure in ITSM: Governing Incidents, Changes, and SLAs

Written by Navdeep Singh Gill | Feb 5, 2026 11:53:54 AM

What Is Decision Infrastructure in IT Service Management (ITSM)?

IT Service Management has perfected the art of tracking work. Incidents are logged. Problems are linked. Changes are scheduled. Requests are fulfilled. SLAs are measured.

The ticket system captures everything that happens. But when an outage cascades, a change breaks production, or an incident recurs for the third time, leaders ask questions the ticket system can't answer.

"Why was this incident prioritized P3 when it affected a critical system?"

"Who approved this change, and what did they know about the risk?"

"Why do we keep having the same problem?"

The ticket system tells you what happened. It doesn't tell you why decisions were made—or whether those decisions were right.

This is where decision infrastructure transforms ITSM: from tracking tickets to governing the decisions embedded in every incident, problem, change, and request.

The Hidden Decisions in IT Service Management

| Ticket | Decisions Embedded | What Gets Lost |
| --- | --- | --- |
| Incident | Priority, assignment, escalation, resolution path | Why this priority? Why this resolver? Why this approach? |
| Problem | Root cause determination, workaround approval, fix timeline | Why this root cause? Why accept this workaround? |
| Change | Risk assessment, approval, implementation window | Why approve despite risk? What was considered? |
| Request | Fulfillment path, exception approval | Why this approach? Why grant exception? |

These decisions happen constantly. A single major incident might involve dozens of decisions:

  1. Initial triage and priority

  2. Assignment to resolver group

  3. Escalation decisions

  4. Workaround approval

  5. Communication decisions

  6. Resolution verification

Each decision is captured as a ticket update or comment—unstructured, unsearchable, disconnected from the reasoning.

What is missing in traditional ITSM systems?
 Traditional ITSM systems track actions but do not capture decision reasoning.
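What structured capture could look like is straightforward to sketch. The snippet below models a decision record with inputs, policies, outcome, reasoning, and attribution as first-class fields; the Python dataclass and its field names are illustrative, not a product schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class DecisionRecord:
    """One decision, captured as structured data rather than a free-text ticket comment."""
    decision_id: str        # e.g. "INC0012847-PRIORITY"
    ticket_id: str          # the incident, problem, change, or request it belongs to
    decision_type: str      # "priority", "assignment", "escalation", "approval", ...
    inputs: dict            # facts considered, with their sources
    policies: list          # policies evaluated and their results
    outcome: str            # the decision itself
    reasoning: str          # why, in one or two sentences
    decided_by: list        # attribution chain
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

triage = DecisionRecord(
    decision_id="INC0012847-PRIORITY",
    ticket_id="INC0012847",
    decision_type="priority",
    inputs={"affected_users": 847, "revenue_impact_per_hour": 47_000, "service_tier": "Tier-1"},
    policies=[{"policy": "priority_matrix", "version": "3.2", "result": "P1"}],
    outcome="P1",
    reasoning="Tier-1 service, more than 500 users, revenue impact above threshold.",
    decided_by=["service_desk_agent_jdoe", "incident_manager_schen"],
)
```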

Layer 1: Context Graphs for Service Operations

The Problem with Ticket-Centric Data

Traditional ITSM shows you the ticket:

Ticket Snapshot
• Incident: INC0012847
• Priority: P1
• Status: In Progress
• Affected CI: SRV-PROD-4521
• Assignment Group: Platform Engineering

This tells you the ticket exists. It doesn't tell you what it means.

Context Graphs Connect What Matters

A context graph assembles operational reality around the incident:

Incident: INC0012847
├── AFFECTS → SRV-PROD-4521
│   ├── HOSTS → App-CustomerPortal (Tier-1)
│   │   ├── SERVES → 847 active users
│   │   └── REVENUE_IMPACT → $47K / hour
│   ├── HOSTS → App-PaymentGateway (PCI-Scope)
│   │   └── PROCESSES → [PII, Payment-Card-Data]
│   ├── OWNED_BY → Platform Engineering
│   └── ON_CALL → jsmith@company.com
├── RECENT_CHANGES
│   └── CHG0012841 (Network config, 2 hours ago)
│       ├── MODIFIED → Network-Config-East
│       └── IMPLEMENTED_BY → network_team
├── SIMILAR_INCIDENTS
│   ├── INC0011234 (3 weeks ago, same CI)
│   │   ├── ROOT_CAUSE → Memory leak in App-CustomerPortal
│   │   └── RESOLUTION → Restart service
│   └── INC0009876 (2 months ago, same symptoms)
│       ├── ROOT_CAUSE → Network configuration error
│       └── RESOLUTION → Rollback change
├── RELATED_PROBLEMS
│   └── PRB0004521 (Open, investigating memory leak pattern)
├── KNOWN_WORKAROUNDS
│   └── KB0004521 (Service restart procedure)
└── SLA_STATUS
    └── P1 → 4 hours remaining
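To make the mechanics concrete, here is a minimal sketch that assembles a few of the relationships above into a directed graph and walks outward from the affected CI. It uses the networkx library purely for illustration; the node and relationship names come from the example, not from any specific ITSM product.

```python
import networkx as nx

g = nx.DiGraph()
# Incident-to-CI and CI-to-application relationships from the example above
g.add_edge("INC0012847", "SRV-PROD-4521", rel="AFFECTS")
g.add_edge("INC0011234", "SRV-PROD-4521", rel="AFFECTS")      # similar incident, 3 weeks ago
g.add_edge("SRV-PROD-4521", "App-CustomerPortal", rel="HOSTS")
g.add_edge("SRV-PROD-4521", "App-PaymentGateway", rel="HOSTS")
g.add_edge("CHG0012841", "Network-Config-East", rel="MODIFIED")
g.add_edge("Network-Config-East", "SRV-PROD-4521", rel="CONNECTS")

# The CI this incident affects
ci = next(v for _, v, d in g.out_edges("INC0012847", data=True) if d["rel"] == "AFFECTS")

# Blast radius: everything reachable downstream of that CI
blast_radius = nx.descendants(g, ci)                           # {'App-CustomerPortal', 'App-PaymentGateway'}

# History: other incidents that point at the same CI
similar_incidents = [u for u, _, d in g.in_edges(ci, data=True)
                     if d["rel"] == "AFFECTS" and u != "INC0012847"]   # ['INC0011234']

# Recent-change suspects: changes that modified infrastructure connected to the CI
suspect_changes = [u for u, _, d in g.in_edges("Network-Config-East", data=True)
                   if d["rel"] == "MODIFIED"]                  # ['CHG0012841']
```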

What Do Context Graphs Enable?

When the incident is created, the analyst immediately sees:

  1. Business impact: 847 users, $47K/hour, Tier-1 application
  2. Recent changes: Network config change 2 hours ago (likely related?)
  3. Similar incidents: Two previous incidents with same CI—one was memory leak, one was network config
  4. Related problem: Open problem record investigating memory leak pattern
  5. Known workaround: KB article with service restart procedure
  6. Escalation path: On-call contact for Platform Engineering

Time saved: 15–20 minutes of investigation → seconds.

Query: “What else might be affected if this is a network issue?”

Network Configuration: Network-Config-East
├── CONNECTS → [SRV-PROD-4521, SRV-PROD-4522, SRV-PROD-4523]
├── SERVES → [App-CustomerPortal, App-PaymentGateway, App-Authentication]
└── IMPACT_IF_DOWN → 3 Tier-1 apps, 2,341 users, authentication for all systems

Query: “Show me all P1 incidents in the last 30 days involving network changes”

Returns correlated data showing patterns of network-change-related incidents — enabling proactive problem management and root cause analysis.
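That second query can be read as a simple correlation between incident and change records. The sketch below assumes illustrative field names (priority, opened_at, affected_ci, affected_cis, category) rather than any particular ITSM schema.

```python
from datetime import datetime, timedelta, timezone

def p1_incidents_with_network_changes(incidents, changes, days=30):
    """P1 incidents from the last `days` that touch a CI modified by a network change."""
    cutoff = datetime.now(timezone.utc) - timedelta(days=days)
    network_cis = {ci for c in changes
                   if c["category"] == "Network Configuration"
                   for ci in c["affected_cis"]}
    return [i for i in incidents
            if i["priority"] == "P1"
            and i["opened_at"] >= cutoff
            and i["affected_ci"] in network_cis]
```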

What changes for an analyst during incident creation?
They see impact, history, and risk immediately.

Layer 2: Decision Traces for Service Operations

Priority Decisions
Every priority assignment is a decision, and the reasoning behind that decision matters long after the incident is closed.

Decision Trace: Incident Priority

decision_type: incident_priority
decision_id: INC0012847-PRIORITY
incident: INC0012847
timestamp: 2024-11-15T14:32:17Z

Inputs Considered

| Fact | Value | Source | Confidence |
| --- | --- | --- | --- |
| Affected Users | 847 | Service Mapping | 1.0 |
| Revenue Impact | $47K / hour | Business Impact Analysis | 0.95 |
| Service Tier | Tier-1 | Service Catalog | 1.0 |
| Data Classification | PII | Data Governance | 1.0 |
| Similar Incidents (Recent) | 2 | Incident Analytics | 1.0 |

Policies Evaluated
• Priority Matrix (v3.2) → P1 criteria met (Tier-1 service, revenue > $10K/hour, users > 500)
• Executive Escalation (v2.1) → required (P1 on Tier-1 service)

Final Decision
Priority: P1
Reasoning: Tier-1 CustomerPortal affecting 847 users with $47K/hour revenue impact. PII data involved.

Attribution Chain
• Initial triage → service_desk_agent_jdoe (14:30)
• Priority confirmation → incident_manager_schen (14:32)
• Executive notification → vp_operations_automated (14:32)

Post-incident review: Complete trace of why P1 was assigned, what was known, and who decided.
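As a sketch of how the Priority Matrix evaluation could be made executable, and therefore traceable, the function below applies the three criteria cited in the trace (Tier-1 service, revenue > $10K/hour, users > 500). The fallback to P2 when the criteria are not all met is an illustrative simplification, not the full matrix.

```python
def evaluate_priority_matrix(inputs: dict) -> dict:
    """Return the priority plus the criteria that fired, so the decision is traceable."""
    criteria = {
        "tier_1_service": inputs["service_tier"] == "Tier-1",
        "revenue_over_10k_per_hour": inputs["revenue_impact_per_hour"] > 10_000,
        "users_over_500": inputs["affected_users"] > 500,
    }
    priority = "P1" if all(criteria.values()) else "P2"   # simplification of the full matrix
    return {
        "policy": "priority_matrix", "version": "3.2", "priority": priority,
        "criteria_matched": [name for name, met in criteria.items() if met],
    }

decision = evaluate_priority_matrix(
    {"service_tier": "Tier-1", "revenue_impact_per_hour": 47_000, "affected_users": 847}
)
# -> priority 'P1', with all three criteria recorded as matched
```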

Change Approval Decisions
Decision Trace: Change Approval

decision_type: change_approval
decision_id: CHG0012841-APPROVAL
change_id: CHG0012841
timestamp: 2024-11-15T10:00:00Z
change_details
type: Normal
category: Network Configuration
affected_cis: [Network-Config-East]
implementation_window: 2024-11-15 12:00–14:00
inputs_considered
• risk_assessment → medium (production network, business hours, tested in staging)
• affected_services → [CustomerPortal, PaymentGateway, Authentication]
• affected_users → 2341
• rollback_plan → documented and tested
• similar_changes_success_rate → 94%
• blackout_status → not in blackout
policies_evaluated
• change_approval_matrix (v2.3) → cab_approval_required
• business_hours_change_policy (v1.5) → approved_with_conditions
  conditions: enhanced monitoring, immediate rollback capability
decision
status: approved
conditions:
– Enhanced monitoring during implementation
– Rollback within 15 minutes if issues detected
– Communication to affected teams 30 minutes prior
reasoning
Medium-risk change with tested rollback plan and a 94% success rate for similar changes. Approved with safeguards due to production impact and business-hour execution.
attribution_chain
requester → network_engineer_mjohnson
technical_approver → network_architect_lsmith
cab_chair → change_manager_klee
business_approver → service_owner_operations

When the change causes an incident: Full trace of what was known, what risks were assessed, who approved, what conditions were set.
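A sketch of how those policy evaluations might be computed so the approval, its conditions, and the policies evaluated land in a single trace record. The policy names, versions, and conditions come from the example above; the risk-based CAB rule is an assumption.

```python
def evaluate_change_approval(change: dict) -> dict:
    """Evaluate approval policies and return the approval, conditions, and policy results together."""
    policies, conditions = [], []

    # change_approval_matrix v2.3: medium/high-risk changes go to the CAB (assumed rule)
    cab_required = change["risk"] in {"medium", "high"}
    policies.append({"policy": "change_approval_matrix", "version": "2.3",
                     "result": "cab_approval_required" if cab_required else "standard_approval"})

    # business_hours_change_policy v1.5: business-hours changes approved only with safeguards
    if change["business_hours"]:
        policies.append({"policy": "business_hours_change_policy", "version": "1.5",
                         "result": "approved_with_conditions"})
        conditions += ["enhanced monitoring during implementation",
                       "rollback within 15 minutes if issues detected",
                       "communication to affected teams 30 minutes prior"]

    return {"status": "approved", "conditions": conditions, "policies_evaluated": policies}

trace_fragment = evaluate_change_approval({"risk": "medium", "business_hours": True})
```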

Decision Trace: Workaround Approval

decision_type: workaround_approval
decision_id: PRB0004521-WA-001
problem_id: PRB0004521
timestamp: 2024-10-01T09:00:00Z
Workaround Details
description → Restart CustomerPortal service when memory exceeds 85%
implementation → Automated script triggered by monitoring threshold
impact → 30-second service interruption during restart
Inputs Considered
incident_frequency → 3 per week (incident_analytics)
mttr_with_workaround → 2 minutes (pilot_testing)
mttr_without_workaround → 45 minutes (historical_average)
permanent_fix_timeline → Q1 2025 (development_roadmap)
service_impact → 30-second interruption (testing)
Policies Evaluated
policy → workaround_approval_policy
version → v1.2
result → service_owner_approval_required
Decision
decision → approved
reasoning → Reduces MTTR from 45 minutes to 2 minutes. 30-second interruption acceptable compared to 45-minute outage.
Attribution Chain
proposer → problem_analyst_rwilson
technical_reviewer → platform_architect_jbrown
service_owner → digital_product_owner_amendes
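The reasoning above reduces to a small, checkable calculation using the figures in the trace (the weekly-downtime framing is ours):

```python
# The trade-off behind the workaround approval, using the figures in the trace
incidents_per_week = 3        # incident frequency
mttr_without = 45             # minutes, historical average
mttr_with = 2                 # minutes, from pilot testing
interruption = 0.5            # 30-second restart, in minutes

weekly_downtime_without = incidents_per_week * mttr_without              # 135 minutes
weekly_downtime_with = incidents_per_week * (mttr_with + interruption)   # 7.5 minutes
print(f"Downtime avoided per week: {weekly_downtime_without - weekly_downtime_with:.1f} minutes")  # 127.5
```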

What changes in post-incident reviews?
Reviews are evidence-based instead of opinion-based.

Layer 3: Decision Boundaries for Service Operations

Change Approval Boundaries

A change approval isn’t a permanent authorization. It’s valid only under specific conditions.

During Implementation · Decision Boundary
Decision ID: CHG0012841-APPROVAL
Status: APPROVED
Scope: Network-Config-East only
Validity Conditions
• Within implementation window · 12:00 → 14:00 (UTC) · ACTIVE
• Rollback capability confirmed · check: pre-implementation · PENDING
• Enhanced monitoring active · check: pre-implementation · PENDING
Stop Conditions
  • Implementation window exceeded
  • Error rate above 5%
  • Affected service degradation detected
  • Rollback triggered
  • P1 incident on affected services
Automatic Actions
On stop condition: Auto-rollback
Notifications sent to: Change Manager, Service Owner, Incident Manager
During implementation:
Boundary Violation Detected
14:15:00 — Boundary Check: within_implementation_window
Status: VIOLATED (window ended 14:00)
Action: IMPLEMENTATION HALTED
Notification: change_manager, network_engineer
Required: Window extension approval or abort
Error rate spikes:
13:45:00 — Boundary Check
Condition: error_rate_above_5%
Current Value: 7.2%
Status: STOP_CONDITION_TRIGGERED
Action: AUTO_ROLLBACK_INITIATED
Notification: all_stakeholders

The change doesn’t continue when boundaries are violated. The system enforces what policy requires.
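A minimal sketch of what that enforcement could look like: evaluate the stop conditions attached to the approval and halt or roll back as soon as one is violated. The condition names mirror the example; the function, thresholds, and printed action are illustrative.

```python
from datetime import datetime, timezone

# Boundary taken from the approval above; names are illustrative
BOUNDARY = {
    "window_end": datetime(2024, 11, 15, 14, 0, tzinfo=timezone.utc),
    "max_error_rate": 0.05,
}

def check_boundary(now, error_rate, p1_on_affected_services):
    """Return violated stop conditions; an empty list means the approval is still valid."""
    violations = []
    if now > BOUNDARY["window_end"]:
        violations.append("implementation_window_exceeded")
    if error_rate > BOUNDARY["max_error_rate"]:
        violations.append("error_rate_above_5_percent")
    if p1_on_affected_services:
        violations.append("p1_incident_on_affected_services")
    return violations

# 13:45, error rate 7.2%: a stop condition fires, triggering auto-rollback and notifications
violations = check_boundary(datetime(2024, 11, 15, 13, 45, tzinfo=timezone.utc), 0.072, False)
if violations:
    print("STOP_CONDITION_TRIGGERED:", violations)
```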

Workaround Boundaries

Workarounds should be temporary — but they often become permanent.

Decision Boundary: Workaround Approval
decision_id: PRB0004521-WA-001
decision: workaround_approved

boundaries:
validity_conditions:
• permanent_fix_not_deployed (weekly · VALID)
• workaround_still_effective (MTTR 2.1 minutes · VALID)
• service_impact_acceptable (32 second interruption · VALID)

expiry: 2025-03-31
expiry_reason: Q1 2025 fix deadline

stop_conditions:
• permanent_fix_deployed
• workaround_ineffective
• service_impact_exceeds_threshold
• security_concern_identified

escalation_on_expiry: problem_manager

boundary_status:
• still_admissible: true
• days_until_expiry: 135
• next_review: 2024-12-01

When Q1 2025 arrives without a permanent fix:

2025-04-01 - Boundary Check: expiry
Status: EXPIRED
Action: WORKAROUND_REQUIRES_REAUTHORIZATION
Notification: problem_manager, service_owner
Required: Extend with new justification or implement fix

The workaround doesn't silently continue forever. The boundary forces a decision.
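The expiry check itself is simple. A sketch, using the dates from the example:

```python
from datetime import date

def workaround_still_admissible(expiry, permanent_fix_deployed, today):
    """A workaround stays admissible only until its expiry date or the permanent fix, whichever comes first."""
    return not permanent_fix_deployed and today <= expiry

# 2025-04-01: past the Q1 2025 expiry with no permanent fix -> reauthorization required
if not workaround_still_admissible(date(2025, 3, 31), False, date(2025, 4, 1)):
    print("WORKAROUND_REQUIRES_REAUTHORIZATION -> notify problem_manager, service_owner")
```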

What problem do expiries prevent?
Long-term operational debt caused by forgotten workarounds.

Practical Applications

Application 1: Major Incident Review

Without decision infrastructure:

  • Review team reconstructs timeline from ticket updates

  • "Why was this P3 initially?" requires finding the person who triaged

  • Lessons learned are opinions, not data

With decision infrastructure:

  • Every decision traced with reasoning

  • Query: "Show me all priority changes and their justifications"

  • Pattern analysis: "Are we consistently under-prioritizing database incidents?"

Application 2: Change Success Analysis

Without decision infrastructure:

  • "Why do our network changes fail more often?"

  • Analysis limited to ticket metadata

  • Approval quality isn't measurable

With decision infrastructure:

  • Query: "Show me failed changes where risk was assessed as 'low'"

  • Pattern: "Changes approved without tested rollback plans fail 3x more often"

  • Improvement: Adjust approval criteria based on evidence
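The "fail 3x more often" pattern in Application 2 is exactly the kind of query decision traces make possible. A sketch, assuming illustrative fields for rollback testing and incident causation:

```python
from collections import Counter

def failure_rate_by_rollback_status(changes):
    """Compare how often changes cause incidents, split by whether the rollback plan was tested."""
    totals, failures = Counter(), Counter()
    for c in changes:
        key = "tested_rollback" if c["rollback_tested"] else "no_tested_rollback"
        totals[key] += 1
        failures[key] += 1 if c["caused_incident"] else 0
    return {group: failures[group] / totals[group] for group in totals}
```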

Application 3: Problem Management

Without decision infrastructure:

  • Workarounds accumulate without review

  • "Why do we have 47 active workarounds?"

  • No visibility into workaround health

With decision infrastructure:

  • Dashboard: "Workarounds by status—23 valid, 12 expiring, 8 expired, 4 ineffective"

  • Alert: "Workaround WA-047 has exceeded impact threshold"

  • Query: "Show me workarounds older than 12 months without associated fix plans"

Application 4: SLA Performance

Without decision infrastructure:

  • SLA missed—but why?

  • Was it a priority decision? Assignment delay? Resolution approach?

  • Root cause is opinion

With decision infrastructure:

  • Trace shows: Initial triage took 15 minutes (should be 5)

  • Pattern: "P2 incidents triaged by overnight shift have 40% longer triage times"

  • Action: Training or staffing adjustment based on evidence

Implementation Path

Phase 1: Context Foundation (Months 1-2)

Connect tickets to operational context:
  • Link incidents to affected services and users

  • Link changes to dependent systems

  • Link problems to related incidents

  • Surface recent changes when incidents occur

Immediate value: Analysts see context instantly, not after investigation.

Phase 2: Decision Capture (Months 2-4)

Start tracing key decisions:

  • Priority assignments with reasoning

  • Change approvals with risk assessment

  • Escalation decisions

  • Workaround approvals

Immediate value: "Why did we decide this?" becomes a query.

Phase 3: Boundary Implementation (Months 4-6)

Add validity constraints:

  • Change approvals bounded by implementation window

  • Workarounds bounded by effectiveness metrics

  • Exceptions bounded by expiry dates

Immediate value: Stale decisions are flagged, not perpetuated.

Phase 4: Continuous Improvement (Months 6+)

Use decision data for improvement:

  • Pattern analysis across decisions

  • Success rate by decision type

  • Boundary violations as leading indicators

  • AI-assisted triage with governed decisions

Immediate value: ITSM improves based on decision quality, not just ticket metrics.

The Transformation

| Dimension | Ticket System | Decision Infrastructure |
| --- | --- | --- |
| Incidents | What happened | Why it was prioritized, who decided |
| Changes | What was done | Why it was approved, what risks were accepted |
| Problems | What's being investigated | Why workarounds were accepted, when they expire |
| Requests | What was fulfilled | Why exceptions were granted |
| SLAs | Whether we met them | Why we missed them, which decisions contributed |

The Bottom Line

The ticket system was the foundation of IT Service Management for decades.
It told you what work was done.

Decision infrastructure tells you why decisions were made, whether they were right, and whether the reasoning still applies.

Every ticket is a decision—or a series of decisions. Priority. Assignment. Escalation. Resolution approach.
Workaround acceptance.

Track the tickets, and you track the work. Govern the decisions, and you govern the outcomes.

Without decision infrastructure, ITSM is a system of records—excellent at documenting what happened,
incapable of explaining why.
With it, ITSM becomes a system of decisions—where every choice is traceable, every approval has boundaries, and every pattern informs improvement.