Confidently, Elegantly Wrong: How AI is Disrupting IT

Winning the Acronym Bingo: SRE, EUC, DEX, and ITSM

The AI-Powered Help Desk Has a Bloat Problem

The hype cycle surrounding generative AI in IT Service Management (ITSM) has focused heavily on the promise of the autonomous service desk. The prevailing narrative suggests that if organizations simply feed enough historical tickets and knowledge base articles into a large language model (LLM), the machine will eventually outpace human agents in both speed and accuracy. This vision assumes that the primary bottleneck in IT support is a lack of accessible information. However, it overlooks two fundamental realities of the modern enterprise desktop: the technical limits of how LLMs process context, and the chaotic, non-deterministic nature of the endpoint itself.

To navigate this, we propose a shift in perspective. What if we stopped viewing the service desk as a repository of historical fixes and started viewing the endpoint as a complex microservice, one that calls for a discipline closer to Site Reliability Engineering (SRE) than to traditional support?

The Weight of Information and the Reality of Context Rot

One of the most significant hurdles in using AI for incident response is a phenomenon known as context rot. In an effort to be thorough, modern monitoring systems often flood an LLM with every conceivable piece of data related to an issue. This includes autonomic detection logs, user reports, and system configuration snapshots. While providing this level of detail seems like a benefit, it often creates a noise floor that obscures the signal.

When a context window becomes saturated, the model's performance degrades. This is not merely a matter of processing speed but of focus. Models often struggle to retrieve information from the middle of a long prompt, a problem commonly called "lost in the middle." In a service management scenario, an LLM might correctly identify the user's identity and the final error message because they appear at the start and end of the data stream, yet completely miss a subtle driver conflict or the exact timestamp of a service failure buried in the middle of a log.

Furthermore, as the context grows, the original instructions given to the model become diluted. The model starts to favor patterns in the newly supplied data over the knowledge it acquired through pretraining. In ITSM, this produces answers that look technical but fail to address the specific reality of the current incident. More context does not inherently lead to more clarity. Instead, it can yield a diluted intelligence that is more likely to hallucinate a solution from surface patterns in its context window than to reason from what it learned through pretraining and reinforcement learning.

End User Computing Knowledge Base Limitations

The second core challenge is the reliance on historical data as a foundation for AI reasoning. Traditional ITSM is built on the historical record, specifically the knowledge base and automated scripts. These are inherently backward-looking tools. They assume that because a specific set of symptoms led to a specific fix in the past, the same causality will hold true today.

In the realm of End User Computing (EUC), this logic is deteriorating at an accelerating rate. The enterprise desktop is no longer a static environment controlled entirely by centralized IT deployment. We have entered an era of constant, non-IT-deployed change. Consider the typical high-performance workstation today. Browser plug-ins update silently in the background. Cloud-based applications push features and configuration changes weekly. Operating system patches are delivered with varying degrees of transparency. Personal productivity tools often bypass traditional change management entirely.

When the environment is changing this rapidly, a knowledge base article written six months ago is effectively a historical artifact rather than a functional guide. An automated script designed for a previous version of an application might not only fail but could potentially exacerbate the problem. History is a poor predictor of the present when the variables of the present are in a state of constant, decentralized flux. If an AI is encouraged to leverage old knowledge base articles to solve a problem that was created by a silent update thirty minutes ago, it will inevitably reach the wrong conclusion.
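
One way to make this concrete is to record, for each knowledge base article, the component versions it was last validated against and compare them with what the endpoint reports right now. The sketch below assumes that annotation exists; the `KbArticle` type, its `validated_versions` field, and the article ID in the usage note are hypothetical, and most knowledge bases do not capture this today.

```python
from dataclasses import dataclass


@dataclass
class KbArticle:
    """A KB article annotated with the component versions it was validated against.

    `validated_versions` is a hypothetical annotation used only for illustration.
    """
    article_id: str
    validated_versions: dict[str, str]  # component name -> version string


def stale_components(article: KbArticle, live_versions: dict[str, str]) -> list[str]:
    """Return components whose live version differs from the validated one.

    A component that is missing from live telemetry also counts as a mismatch.
    """
    return sorted(
        name
        for name, version in article.validated_versions.items()
        if live_versions.get(name) != version
    )


def is_applicable(article: KbArticle, live_versions: dict[str, str]) -> bool:
    # Trust the article only if every component it depends on still matches live state.
    return not stale_components(article, live_versions)
```

For example, an article validated against browser 124.0 would be flagged as stale on a device whose telemetry reports browser 126.1, even though nothing in the ticket history changed.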

Applying SRE Principles to the Endpoint

To solve the twin problems of context rot and historical irrelevance, we must apply the pillars of SRE to the enterprise desktop. This transition creates what we can call Endpoint Reliability Engineering. In a traditional SRE environment, engineers do not rely on a manual to tell them why a server is slow. They rely on observability: the ability to understand the internal state of a system by looking at its external outputs.

In EUC, the external outputs are the telemetry points coming off the edge. If we treat each laptop or workstation as a critical node in a distributed system, we can begin to manage them using Service Level Objectives (SLOs) focused on the digital employee experience (DEX) rather than just ticket throughput.
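
As a minimal sketch of what a DEX-focused SLO could look like, the code below treats an SLI as the fraction of good events (here, application launches at or under a latency threshold) and derives how much error budget remains against a target. The `DexSlo` type, the launch-latency metric, and the 3-second / 99% values mentioned below are illustrative assumptions, not recommended settings.

```python
from dataclasses import dataclass


@dataclass
class DexSlo:
    """An experience SLO for one endpoint metric (illustrative values only)."""
    name: str
    threshold_ms: float  # an event is "good" if its latency is at or below this
    target: float        # required fraction of good events, e.g. 0.99


def sli(latencies_ms: list[float], threshold_ms: float) -> float:
    """SLI as the fraction of good events over all valid events."""
    if not latencies_ms:
        return 1.0  # no events means nothing violated the objective
    good = sum(1 for x in latencies_ms if x <= threshold_ms)
    return good / len(latencies_ms)


def error_budget_remaining(latencies_ms: list[float], slo: DexSlo) -> float:
    """Fraction of the error budget still unspent; negative once the SLO is breached."""
    total = len(latencies_ms)
    if total == 0:
        return 1.0
    allowed_bad = (1.0 - slo.target) * total
    bad = sum(1 for x in latencies_ms if x > slo.threshold_ms)
    if allowed_bad == 0:
        return 1.0 if bad == 0 else float("-inf")
    return 1.0 - bad / allowed_bad
```

With a 3,000 ms threshold, a 0.99 target, and 200 launches of which one is slow, the SLI is 0.995 and roughly half of the budget (two allowed slow launches) remains.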

The following table contrasts the traditional ITSM approach with this new engineering-centric model:

SRE Principle | Traditional ITSM Approach | Reliability Engineering Approach
Metrics | KPIs focus on ticket counts, and SLAs focus on server uptime. | SLIs focus on device health, app latency, and user sentiment, and SLOs focus on business outcomes.
Observability | Reliance on static logs and manual user reports. | Real-time contextual telemetry from the edge.
Toil Reduction | Automated scripts and KB articles. | Autonomic self-healing based on live state.
Embracing Risk | Rigid change management and golden images. | Blast-radius visibility into decentralized, non-IT-deployed updates.
Source of Truth | Runbooks. | The current state of the edge defines the solution.
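
The blast-radius row can be illustrated by grouping devices on the observed version of a silently updated component and comparing SLO breach rates across cohorts. This is a hedged sketch: the device-record schema and the `blast_radius` helper are hypothetical, and a real analysis would also need to account for confounders such as hardware model or location.

```python
from collections import defaultdict


def blast_radius(devices: list[dict], component: str) -> dict[str, dict[str, float]]:
    """Group devices by the observed version of one component and report per-cohort failure rates.

    Each device record is assumed (hypothetically) to look like
    {"id": str, "versions": {component: version, ...}, "slo_breached": bool}.
    """
    cohorts: dict[str, list[bool]] = defaultdict(list)
    for device in devices:
        # Devices that do not report the component land in an "absent" cohort.
        version = device["versions"].get(component, "absent")
        cohorts[version].append(device["slo_breached"])
    return {
        version: {
            "devices": len(flags),
            "failure_rate": sum(flags) / len(flags),
        }
        for version, flags in cohorts.items()
    }
```

If the cohort on a newly pushed plug-in version breaches its SLO far more often than the cohort still on the previous version, that difference points at the update before any ticket has been correlated by hand.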

The Case for Edge-First DEX Contextual Telemetry

Shifting toward edge-first telemetry allows IT organizations to move away from the "guess and check" methodology of traditional troubleshooting. It moves the discipline toward a state where the AI is not a librarian searching through old records but an engineer diagnosing a live system.

Edge-first telemetry involves capturing deep, contextual data from the device itself in real time: hardware performance, software interactions, and the user's actual experience as it unfolds. These observations drive SLIs, which are computed continuously in near real time. When this high-fidelity data is fed into an LLM, the objective is not to find a match in a database of old tickets. The goal is to perform a deterministic analysis of the live environment.
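
Computing an SLI continuously usually means maintaining it over a sliding time window as observations arrive. The class below is one minimal way to do that, assuming in-order timestamps in seconds and a boolean good/bad classification per observation; `RollingSli` and the window length are illustrative and not part of any named product.

```python
from collections import deque


class RollingSli:
    """Maintain an SLI over a sliding time window as observations stream in."""

    def __init__(self, window_s: float) -> None:
        self.window_s = window_s
        self._events: deque = deque()  # (timestamp in seconds, is_good) pairs
        self._good = 0                 # count of good events currently in the window

    def observe(self, timestamp: float, is_good: bool) -> None:
        """Record one observation; timestamps are assumed to be non-decreasing."""
        self._events.append((timestamp, is_good))
        self._good += int(is_good)
        self._evict(timestamp)

    def _evict(self, now: float) -> None:
        # Drop observations that have fallen out of the window (older than now - window_s).
        while self._events and self._events[0][0] <= now - self.window_s:
            _, was_good = self._events.popleft()
            self._good -= int(was_good)

    def value(self) -> float:
        """Current fraction of good observations (1.0 when the window is empty)."""
        if not self._events:
            return 1.0
        return self._good / len(self._events)
```

Keeping a running count makes each update cost proportional only to the number of expired events, so the SLI can be refreshed on every observation rather than recomputed in batches.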

This approach solves the problem of context rot by prioritizing relevance over volume. By focusing on the specific telemetry that indicates why a process failed in the current moment, the AI can operate within a smaller, more effective context window. It also addresses the problem of non-IT change by observing the actual state of the machine rather than assuming it matches a pre-defined golden image.
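
To illustrate relevance over volume, the sketch below scopes raw telemetry to the affected device and a time window around the failure, then caps the number of events, keeping those closest to the failure, before anything is handed to a model. The `TelemetryEvent` schema, the window sizes, and the cap are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class TelemetryEvent:
    device_id: str
    timestamp: float  # seconds since epoch
    source: str       # e.g. "driver", "app", "os"
    message: str


def scope_context(
    events: list[TelemetryEvent],
    device_id: str,
    failure_ts: float,
    before_s: float = 300.0,
    after_s: float = 30.0,
    max_events: int = 50,
) -> list[TelemetryEvent]:
    """Keep only events from the affected device within a window around the failure.

    Window sizes and the event cap are illustrative defaults, not tuned values.
    """
    in_scope = [
        e for e in events
        if e.device_id == device_id
        and failure_ts - before_s <= e.timestamp <= failure_ts + after_s
    ]
    # If too many remain, keep the events nearest in time to the failure.
    nearest = sorted(in_scope, key=lambda e: abs(e.timestamp - failure_ts))[:max_events]
    # Return them in chronological order so the prompt reads as a timeline.
    return sorted(nearest, key=lambda e: e.timestamp)
```

The point of the design is that a driver conflict logged seconds before a crash stays in a short, ordered context instead of being buried in the middle of a full log dump.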

Two Diagnostic Paradigms

The table below captures the structural difference between feeding an LLM historical context and feeding it real-time telemetry. These aren't just different data sources; they represent fundamentally different assumptions about how diagnosis should work.

Dimension | History-Based (KB / Tickets / Runbooks) | Telemetry-Based (Edge-First DEX)
Diagnostic question | "Have we seen this before?" | "What do we see right now?"
Resolution question | "What did we do last time?" | "What does the current state tell us to do?"
Assumes the environment is... | Static between documented states | Continuously changing
Captures non-IT-deployed change | No | Yes
Context volume per incident | High: pulls in a historical corpus | Low: scoped to the device and the moment
Context rot risk | High: more tokens, worse reasoning | Low: narrow, causal input
Handles novel failures | Poorly: no prior match to retrieve | Well: observes actual conditions
Aligns with SRE principles | No: prescriptive rather than observability-driven | Yes: SLO-based, real-time
Problem solving | One by one, through trial and error | AI-driven classification into causes
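
The last row contrasts one-by-one troubleshooting with classifying incidents into causes. As a simplified, deterministic stand-in for that classification step, the sketch below clusters incidents by a signature built from live telemetry (the failing process plus the most recent change observed on the device), so one diagnosis can cover many tickets. The field names and the `group_by_cause` helper are hypothetical; an AI-driven classifier would use richer signals.

```python
from collections import defaultdict


def group_by_cause(incidents: list[dict]) -> dict[tuple[str, str], list[str]]:
    """Cluster incident IDs that share the same observed failure signature.

    Each incident is assumed (hypothetically) to carry "id", "failing_process",
    and "latest_change" fields taken from live endpoint telemetry.
    """
    groups: dict[tuple[str, str], list[str]] = defaultdict(list)
    for incident in incidents:
        signature = (incident["failing_process"], incident["latest_change"])
        groups[signature].append(incident["id"])
    # Largest clusters first: diagnosing these resolves the most tickets at once.
    return dict(sorted(groups.items(), key=lambda kv: len(kv[1]), reverse=True))
```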

The left column describes most AI-powered ITSM implementations today. The right column describes where the architecture needs to go.

The future of AI in ITSM is not found in bigger databases or longer context windows. It is found in the ability to distinguish between what is historical and what is actual. In an era where the enterprise desktop evolves faster than documentation can be written, the only way to maintain control is to give AI the live, edge-based context it needs to see the whole picture. By embracing Endpoint Reliability Engineering, IT can stop being a historian of past failures and start being a steward of present performance.

Source: Lakeside Software