We are looking at how to address infrastructure observability from a holistic point of view. Today we have 25+ tools that do not share information, operated by various groups in silos; we operate across 4 states, 400 sites, 8 data centers, and 60K users. How have you built (or how are you in the process of building) an enterprise-wide observability architecture that ties network, storage, compute, services, etc. together for reporting/analytics, action (we use ServiceNow), and dashboarding? Our goal is a single view of the infrastructure for dashboarding/reporting, event management (why is application X slow? network? storage? alert the right team), and automated response (low disk: increase storage allocation; poor network performance for an app from a site: fix erroneous QoS settings). We're not looking for 'the tool' - but the architecture/strategy for how to solve the problem, choose the right tool(s), etc.
From the information provided, we suggest the three-pronged approach below:
Step 1:
1) Classify the tools by functionality, version, support coverage, end-of-life status, etc.
2) Based on this classification, identify candidates for phase-out or retirement among tools that are NOT "fit for purpose". This narrows the current tool landscape.
3) Prepare a matrix of the remaining "fit-for-purpose" tools against support coverage and key functionalities.
4) This matrix will expose functionality that is required but not served by the current tools; these gaps should then be assessed against new tools.
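Steps 3 and 4 above can be sketched programmatically. This is a minimal illustration, assuming made-up tool names and capability categories (your inventory would replace them):

```python
# Sketch: fit-for-purpose capability matrix and gap analysis.
# Tool names and capability categories are illustrative assumptions.
CAPABILITIES = ["network", "storage", "compute", "logs", "traces", "dashboards"]

tools = {
    "ToolA": {"covers": {"network", "dashboards"}, "supported": True},
    "ToolB": {"covers": {"storage"}, "supported": False},  # end-of-life
    "ToolC": {"covers": {"compute", "logs"}, "supported": True},
}

# Keep only supported, fit-for-purpose tools (step 3).
fit_for_purpose = {name: t for name, t in tools.items() if t["supported"]}

# Capabilities no remaining tool provides are the gaps to assess
# new tooling against (step 4).
covered = set().union(*(t["covers"] for t in fit_for_purpose.values()))
gaps = [c for c in CAPABILITIES if c not in covered]
print(gaps)  # ['storage', 'traces']
```

In practice the matrix would also carry support-contract dates and version data per tool, so the same pass flags both functional gaps and renewal risks.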
Step 2:
1) Organizations correlate different types of events depending on their IT environment and needs, but several common categories recur: system, operating system, application, database, web server, storage, and network events.
2) The goal of event correlation is to identify all events related to a single problem: the events that stem from the root cause, plus the symptomatic events generated as the original failure impacts other components. With analytics, event correlation software can also give you insight into event-management metrics, which help you improve the effectiveness of your enterprise's event management effort. Look at raw event volumes and the reductions achieved by deduplication and filtering; evaluate enrichment statistics, signal-to-noise ratios, and false-positive percentages. You can also examine event frequency by source to find the most common hardware and software problems and become more proactive in preventing issues.
3) The right event correlation software improves an organization's resilience and moves it out of a purely reactive, firefighting mindset. Additional downstream benefits include automation of key processes, faster resolution, and smarter root-cause analysis.
a. AI-driven event correlation: machine learning and deep learning give event correlation solutions the power to learn from event data and automatically generate new correlation patterns, marking the application of artificial intelligence to event correlation.
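To make the correlation and deduplication ideas above concrete, here is a minimal sketch of time-window event correlation. The event fields, resource names, and 5-minute window are illustrative assumptions, not a specific product's behavior:

```python
from collections import defaultdict

# Sketch: correlate raw events into incidents when they hit the same
# resource within a time window (field names are illustrative).
WINDOW = 300  # seconds: events on one resource within 5 min correlate

events = [
    {"resource": "db01", "ts": 100, "msg": "disk latency high"},
    {"resource": "db01", "ts": 160, "msg": "query timeout"},
    {"resource": "app01", "ts": 170, "msg": "HTTP 500 spike"},
    {"resource": "db01", "ts": 900, "msg": "disk latency high"},
]

# Group events per resource, then split groups on gaps wider than WINDOW.
by_resource = defaultdict(list)
for e in sorted(events, key=lambda e: e["ts"]):
    by_resource[e["resource"]].append(e)

incidents = []
for group in by_resource.values():
    current = [group[0]]
    for e in group[1:]:
        if e["ts"] - current[-1]["ts"] <= WINDOW:
            current.append(e)  # same incident
        else:
            incidents.append(current)
            current = [e]
    incidents.append(current)

# Deduplication ratio: one of the metrics discussed above; the lower
# the ratio, the more noise correlation has removed.
dedup_ratio = len(incidents) / len(events)
print(len(incidents), dedup_ratio)  # 3 incidents from 4 raw events
```

A real deployment would correlate across resources too (e.g. the app01 HTTP errors caused by db01 latency, via topology or AI-learned patterns), which is exactly where the AI-driven correlation in point (a) comes in.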
Step 3: Develop Observability
Observability refers to the ability to monitor, measure, and understand the state of a system or application by examining its outputs, logs, and performance metrics. In modern software systems and cloud computing, Observability plays an increasingly crucial role in ensuring the reliability, performance, and security of applications and infrastructure.
The importance of Observability has grown due to the increasing complexity of software systems, the widespread adoption of microservices, and the growing reliance on distributed architectures.
Observability absorbs and extends classic monitoring and helps teams identify the root cause of issues. It allows stakeholders to answer questions about their application and business, including forecasts and predictions of what could go wrong. A diverse collection of tools and technologies is in use, which leads to a large matrix of possible deployments; this has architectural consequences, so teams must understand how to set up their observability systems in a way that works for them. Leading observability tools include Dynatrace, New Relic, Splunk, Datadog, and Honeycomb.
Action: Prepare an observability solution selection checklist so you can rank candidate tools against your requirements and choose the right solution for your observability needs.
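One way to run that checklist is a weighted scoring model. The criteria, weights, vendor names, and scores below are illustrative assumptions (not an endorsement of any product); you would substitute your own evaluation data:

```python
# Sketch: weighted selection checklist for ranking observability
# platforms. All weights and scores here are illustrative.
criteria = {                         # weight = relative importance
    "metrics_traces_logs": 3,
    "servicenow_integration": 3,     # event/action integration matters here
    "ai_event_correlation": 2,
    "dashboarding": 2,
    "cost": 1,
}

# Scores 1-5 per vendor per criterion, filled in during evaluation.
scores = {
    "VendorA": {"metrics_traces_logs": 5, "servicenow_integration": 4,
                "ai_event_correlation": 4, "dashboarding": 5, "cost": 2},
    "VendorB": {"metrics_traces_logs": 4, "servicenow_integration": 5,
                "ai_event_correlation": 3, "dashboarding": 4, "cost": 4},
}

def weighted_total(vendor_scores):
    """Sum of weight * score across all criteria."""
    return sum(criteria[c] * s for c, s in vendor_scores.items())

ranking = sorted(scores, key=lambda v: weighted_total(scores[v]), reverse=True)
print(ranking)  # ['VendorA', 'VendorB']
```

Agreeing on the weights up front, before anyone scores a vendor, keeps the ranking honest and defensible across the siloed teams.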
To make meaningful progress, the first step is to build a dedicated team devoted entirely to this goal. This initiative requires full-time commitment, and someone must take clear ownership of it.
Additionally, you must ensure that this team possesses the requisite authority, visibility, and, most importantly, knowledge to drive effective change. A team that lacks the understanding of the underlying technologies and their interconnected capabilities is bound to encounter difficulties.
Once this solid foundation is established, you can proceed with other essential tasks, such as classification, contract review, and the development of a capability matrix. These subsequent steps will build upon the groundwork laid by the dedicated and knowledgeable team, ultimately leading you towards successful implementation.