We are looking at how to address infrastructure observability from a holistic point of view. Today we have 25+ tools that do not share information, operated by various groups in silos; we operate across 4 states, 400 sites, 8 data centers, and 60K users. How have you built (or how are you in the process of building) an enterprise-wide observability architecture that ties network, storage, compute, services, etc. together for reporting/analytics, action (we use ServiceNow), and dashboarding? Our goal is a single view of the infrastructure for dashboarding/reporting, event management (why is application X slow? network? storage? alert the right team), and automated response (low disk: increase storage allocation; poor network performance for an app from a site: fix erroneous QoS settings). We're not looking for 'the tool' - but the architecture/strategy for how to solve the problem, choose the right tool(s), etc.
From the information provided, we suggest the three-pronged approach below:
Step 1:
1) Classify the tools by functionality, version, support coverage, end-of-life status, etc.
2) Based on this classification, identify candidates for phase-out or retirement among tools that are NOT "fit for purpose". This narrows the current tool landscape.
3) Prepare a matrix of the remaining "fit-for-purpose" tools against support coverage and key functionalities.
4) This matrix will expose functionality that is required but not served by the current tools; these gaps should then be assessed against new tools.
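Steps 3 and 4 above can be sketched programmatically. This is a minimal illustration, assuming made-up tool names and capability categories (your inventory would replace them):

```python
# Sketch: fit-for-purpose capability matrix and gap analysis.
# Tool names and capability categories are illustrative assumptions.
CAPABILITIES = ["network", "storage", "compute", "logs", "traces", "dashboards"]

tools = {
    "ToolA": {"covers": {"network", "dashboards"}, "supported": True},
    "ToolB": {"covers": {"storage"}, "supported": False},  # end-of-life
    "ToolC": {"covers": {"compute", "logs"}, "supported": True},
}

# Keep only supported, fit-for-purpose tools (step 3).
fit_for_purpose = {name: t for name, t in tools.items() if t["supported"]}

# Capabilities no remaining tool provides are the gaps to assess
# new tooling against (step 4).
covered = set().union(*(t["covers"] for t in fit_for_purpose.values()))
gaps = [c for c in CAPABILITIES if c not in covered]
print(gaps)  # ['storage', 'traces']
```

In practice the matrix would also carry support-contract dates and version data per tool, so the same pass flags both functional gaps and renewal risks.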
Step 2:
1) Organizations correlate different types of events depending on their IT environment and needs, but several common categories recur: system, operating system, application, database, web server, storage, and network events.
2) The goal of event correlation is to identify all events related to a single problem: the events that stem from the root cause, plus the symptomatic events generated as the original failure impacts other components. With analytics, event correlation software can also give you insight into event-management metrics, which help you improve the effectiveness of your enterprise's event management effort. Look at raw event volumes and the reductions achieved by deduplication and filtering; evaluate enrichment statistics, signal-to-noise ratios, and false-positive percentages. You can also examine event frequency by source to find the most common hardware and software problems and become more proactive in preventing issues.
3) The right event correlation software improves an organization's resilience and moves it out of a purely reactive, firefighting mindset. Additional downstream benefits include automation of key processes, faster resolution, and smarter root-cause analysis.
a. AI-driven event correlation: machine learning and deep learning give event correlation solutions the power to learn from event data and automatically generate new correlation patterns, marking the application of artificial intelligence to event correlation.
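To make the correlation and deduplication ideas above concrete, here is a minimal sketch of time-window event correlation. The event fields, resource names, and 5-minute window are illustrative assumptions, not a specific product's behavior:

```python
from collections import defaultdict

# Sketch: correlate raw events into incidents when they hit the same
# resource within a time window (field names are illustrative).
WINDOW = 300  # seconds: events on one resource within 5 min correlate

events = [
    {"resource": "db01", "ts": 100, "msg": "disk latency high"},
    {"resource": "db01", "ts": 160, "msg": "query timeout"},
    {"resource": "app01", "ts": 170, "msg": "HTTP 500 spike"},
    {"resource": "db01", "ts": 900, "msg": "disk latency high"},
]

# Group events per resource, then split groups on gaps wider than WINDOW.
by_resource = defaultdict(list)
for e in sorted(events, key=lambda e: e["ts"]):
    by_resource[e["resource"]].append(e)

incidents = []
for group in by_resource.values():
    current = [group[0]]
    for e in group[1:]:
        if e["ts"] - current[-1]["ts"] <= WINDOW:
            current.append(e)  # same incident
        else:
            incidents.append(current)
            current = [e]
    incidents.append(current)

# Deduplication ratio: one of the metrics discussed above; the lower
# the ratio, the more noise correlation has removed.
dedup_ratio = len(incidents) / len(events)
print(len(incidents), dedup_ratio)  # 3 incidents from 4 raw events
```

A real deployment would correlate across resources too (e.g. the app01 HTTP errors caused by db01 latency, via topology or AI-learned patterns), which is exactly where the AI-driven correlation in point (a) comes in.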
Step 3: Develop Observability
Observability refers to the ability to monitor, measure, and understand the state of a system or application by examining its outputs, logs, and performance metrics. In modern software systems and cloud computing, Observability plays an increasingly crucial role in ensuring the reliability, performance, and security of applications and infrastructure.
The importance of Observability has grown due to the increasing complexity of software systems, the widespread adoption of microservices, and the growing reliance on distributed architectures.
Observability absorbs and extends classic monitoring and helps teams identify the root cause of issues. It allows stakeholders to answer questions about their application and business, including forecasts and predictions of what could go wrong. A diverse collection of tools and technologies is in use, which leads to a large matrix of possible deployments; this has architectural consequences, so teams must understand how to set up their observability systems in a way that works for them. Leading observability tools include Dynatrace, New Relic, Splunk, Datadog, and Honeycomb.
Action: Prepare an observability solution selection checklist so you can rank candidate tools against your requirements and choose the right solution for your observability needs.
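One way to run that checklist is a weighted scoring model. The criteria, weights, vendor names, and scores below are illustrative assumptions (not an endorsement of any product); you would substitute your own evaluation data:

```python
# Sketch: weighted selection checklist for ranking observability
# platforms. All weights and scores here are illustrative.
criteria = {                         # weight = relative importance
    "metrics_traces_logs": 3,
    "servicenow_integration": 3,     # event/action integration matters here
    "ai_event_correlation": 2,
    "dashboarding": 2,
    "cost": 1,
}

# Scores 1-5 per vendor per criterion, filled in during evaluation.
scores = {
    "VendorA": {"metrics_traces_logs": 5, "servicenow_integration": 4,
                "ai_event_correlation": 4, "dashboarding": 5, "cost": 2},
    "VendorB": {"metrics_traces_logs": 4, "servicenow_integration": 5,
                "ai_event_correlation": 3, "dashboarding": 4, "cost": 4},
}

def weighted_total(vendor_scores):
    """Sum of weight * score across all criteria."""
    return sum(criteria[c] * s for c, s in vendor_scores.items())

ranking = sorted(scores, key=lambda v: weighted_total(scores[v]), reverse=True)
print(ranking)  # ['VendorA', 'VendorB']
```

Agreeing on the weights up front, before anyone scores a vendor, keeps the ranking honest and defensible across the siloed teams.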
To make meaningful progress, the first step is to build a dedicated team devoted entirely to this goal. This initiative requires full-time commitment, and someone must take clear ownership of it.
Additionally, you must ensure that this team possesses the requisite authority, visibility, and, most importantly, knowledge to drive effective change. A team that lacks the understanding of the underlying technologies and their interconnected capabilities is bound to encounter difficulties.
Once this solid foundation is established, you can proceed with other essential tasks, such as classification, contract review, and the development of a capability matrix. These subsequent steps will build upon the groundwork laid by the dedicated and knowledgeable team, ultimately leading you towards successful implementation.