Cisco

Recommendations for SAN Fabric Dashboards

IT departments need to monitor their storage area network fabric infrastructures, as well as purchase and implement the tools and processes to do so. IT must quickly identify and repair problems before outages occur.

Overview

Organizations must monitor the status of their storage area network (SAN) fabrics and take preemptive action to ensure that SLAs are met. An inability to detect failures, high-availability exposures or degradations in the SAN fabric systems can lead to unscheduled application outages.

Key Findings

SAN fabric dashboards and reports are primarily performance and event-oriented.
The failure of a storage network port or connection is the most important primary indicator, and network congestion is the second-most-important indicator.
Security issues, such as intruder detection or alerting, are not commonly monitored.
The following five high-level SAN fabric metrics are the most often implemented: port failure, congestion monitoring, bandwidth, late running and component failures.

Recommendations

Add security monitoring indicators, such as role-based access control, and intruder alerting to your security or SAN fabric dashboard, as well as reporting systems to meet change control and operational security requirements.
The SAN fabric is a critical data center infrastructure component that must be monitored to ensure storage availability, so combine or include SAN fabric dashboards in your operations room monitoring and alerting system.
SAN fabric switch vendor-monitoring tools, combined with storage array monitoring tools, provide most of the high-level primary indicators that are needed to monitor the SAN.
Detailed secondary reporting, end-to-end analysis and root cause determination in a heterogeneous SAN fabric require a specialist monitoring and reporting tool.

Analysis

Fibre Channel (FC) SAN fabrics are mature, reliable and, therefore, much simpler to manage than the first FC SANs, which became available in the late 1990s. Because FC is used in high-performance and high-availability environments, SAN fabrics are a critical part of the IT infrastructure and, therefore, must be actively monitored. As servers are consolidated using hypervisors in virtual server environments, greater demands are placed on the storage infrastructure, because each physical server contains more applications and uses more SAN fabric resources. As a result, the loss of a storage connection to a single physical server that hosts many virtual hosts has a greater service impact.

Similarly, the increase in virtualized FC connections increases complexity and reduces the ability of an administrator to track which storage a virtual server is using and how it is configured. The increased complexity of server virtualization is counterbalanced by the improvements in SAN fabric bandwidth and management tools; however, SAN fabrics are still exposed to environmental, human and procedural errors. Therefore, organizations must monitor their SAN fabric infrastructures and take preemptive actions to fix SAN fabric problems before they suffer a storage outage.

The adoption of FC over Ethernet (FCoE) is mirroring that of FC. FCoE is going through development and early adoption issues, such as standards compliance and device interoperability, similar to what FC experienced a decade earlier. Adoption of FCoE has been slow and, even though an increasing number of servers support FCoE, few storage arrays do. Therefore, although we expect FCoE monitoring requirements to be similar, at a high level, to those for FC-based SAN fabrics, we do not cover FCoE dashboards or monitoring in this research.

Overview

SAN fabrics predominantly use the FC protocol and are composed of FC devices, such as server host bus adapters (HBAs), SAN switches and directors. Specialized devices that are used to extend the geographical distance of SAN fabrics, enable compression and/or provide encryption must be inserted into the data path between the server and storage. Therefore, they must also be monitored, because they become part of the SAN fabric and can cause a system or service outage. SAN fabric designs are commonly implemented using two independent networks, which, together with host and storage multipathing software, enable highly available and serviceable environments.

Optical SAN fabric connections, such as small-form-factor pluggable converters and gigabit interface converters (GBICs), have relatively short service lives and, therefore, need to be continuously monitored for any degradation in performance or reliability. Few present-day monitoring tools, other than the switch vendors' element managers, are able to monitor FCoE storage networks.

When monitoring IT infrastructures, especially large installations that have many FC devices and thousands of nodes, it is easy to become overwhelmed by the complexity, the amount of data and the components that need to be monitored. Therefore, it is important to be able to consolidate the reporting data into a high-level, simple-to-understand dashboard. IT operations staff ideally should also be able to investigate in detail and have the ability to drill down into the root cause of any problem or exception.

This research describes the primary or high-level indicators that would be used in the central operational control center, or bridge, to provide a status display of the SAN fabric environment. In addition, we also list the secondary indicators that need to be monitored and available for operators to determine the severity of any event or exception, first-level problem determination and short-term forecasting. Outside the central control room, many other sections in the IT department can create customized reports from the metrics available, as determined by their individual requirements, such as exception reporting or trending.

No two organizations are the same, and these are the indicators that Gartner recommends from our interactions with clients and vendors. However, organizations may find that they have specific requirements or indicators that are different from those highlighted in this research. Therefore, the recommendations in this report can be used as a starting point for organizations that do not monitor their SAN fabrics. Organizations that already monitor their SAN fabrics can use these recommendations to check and compare against their existing dashboards, monitoring or reporting processes.

Don't Blame the Victim: Understand the Problem

IT departments purchased server, application, database and network monitoring tools long before FC SANs were implemented. Therefore, administrators often have a large number of tools to manage and interrogate these components, but few that report on the storage and SAN fabric. In the absence of management tools, performance and status of the SAN fabric and storage cannot be determined. Hence, it is often convenient to blame performance problems on the storage and SAN fabric infrastructure.

When there are too many vehicles on a road, it's easy to blame the road, but conveniently ignore its overuse. It is easy and part of human nature to blame an entity that cannot defend itself – in this case, due to the lack of any factual data concerning the SAN fabric. Because of this lack of storage and SAN fabric visibility, many IT departments incorrectly spend time solving or looking for storage problems that don't exist. The classic example is a badly coded program that, instead of searching a small subset of data via a key or an index, inefficiently reads an excessively large amount of data that is an order of magnitude larger than what's required. This "bad code," which increases the traffic and SAN fabric payload, is the cause of SAN fabric bandwidth and disk storage contention.

This problem can also manifest itself during a full restore of applications or systems, when an exceptional amount of data, which is again an order of magnitude larger than what is present during normal operational hours, needs to be sent over the SAN fabric. In this case, the source problem is a storage issue, not a SAN fabric issue. In both cases, a SAN fabric management tool is required to understand and determine who is the consumer of SAN fabric resources.
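As an illustration of determining who is consuming SAN fabric resources, the following minimal sketch ranks hosts by the volume of data they move across the fabric. The input format, a list of (host, bytes transferred) samples, and the function name top_consumers are assumptions made for this example; a real monitoring tool would collect such counters from switch ports or HBAs in its own format.

```python
from collections import defaultdict


def top_consumers(samples, n=3):
    """Rank hosts by total bytes sent over the fabric.

    `samples` is an iterable of (host, bytes_transferred) pairs.
    This record shape is hypothetical and used only for illustration.
    Returns up to `n` tuples of (host, total_bytes, share_of_traffic).
    """
    totals = defaultdict(int)
    for host, nbytes in samples:
        totals[host] += nbytes

    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

    return [
        (host, nbytes, nbytes / grand_total if grand_total else 0.0)
        for host, nbytes in ranked[:n]
    ]


if __name__ == "__main__":
    # Hypothetical samples: one database host dominates the traffic.
    samples = [("db01", 900), ("web01", 60), ("db01", 100), ("backup01", 40)]
    for host, nbytes, share in top_consumers(samples, n=2):
        print(f"{host}: {nbytes} bytes ({share:.0%} of fabric traffic)")
```

A report of this kind shows whether contention comes from a single heavy consumer, such as an inefficient query or a full restore, rather than from the fabric itself.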

Primary SAN Fabric Indicators

The purpose of these indicators (see Table 1) is to summarize, detect and highlight any exceptions, exposures and errors in the SAN fabric infrastructure. A color-coded system (see Note 1) is often used to display these on a high-level dashboard.

Table 1. Primary SAN Fabric Dashboard Indicators
[Table 1 image not reproduced]

Secondary, Detailed SAN Fabric Reporting Metrics/Indicators

Once a problem or potential problem has been highlighted, something needs to be done about it, and this will require more-accurate data. Therefore, although high-level dashboards are critical, IT operations also require in-depth information, so that false positives and the root cause of the problem can be determined, then corrected as soon as possible.

Although there are many options available for obtaining and tracking detailed data, Table 2 lists the most common indicators and data sources that are used to monitor and solve SAN fabric problems. Even when these secondary components are identified, further and more-detailed investigation is likely to be required to determine and correct the root cause. Therefore, IT staff cannot rely on the SAN fabric-monitoring product to perform all problem diagnosis; they are likely to need information from other systems, using device-specific management or diagnostic utilities, to identify and solve the root cause of the problem.

Table 2. Secondary SAN Fabric Dashboard Indicators
[Table 2 image not reproduced]

Other Factors That Also Must Be Monitored

Many other metrics and issues that are out of the scope of this research may need to be monitored to ensure that an IT department can provide a reliable and sustainable SAN fabric infrastructure. These are briefly outlined below:

Compliance with dual-SAN (see Note 2) design rules and topology guidelines must be monitored, so that any change or failure that deviates from these design principles does not create an exposure by introducing a SPOF, such as an ISL cross-coupling, which creates a single SAN.
Introduction of new technologies and platform support, such as in-line compression devices that reduce the amount of data to be stored, and wave division multiplexers or extenders that stretch the distance of SANs between data centers.
Capacity planning is a process that must be performed by the SAN fabric-monitoring products (if possible), or by storage resource management tools or general-purpose capacity reporting tools, if they have this capability. These tools forecast when the SAN fabric will run out of ports, links or capacity.
Hardware update – older and slower SAN switches and HBAs need to be replaced and upgraded as the performance demands of the applications increase.
Security – access to the SAN fabric hardware and management tools needs to be logged, audited, secured and monitored.
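The capacity planning item above can be sketched as a simple trend forecast. The example below fits a least-squares line to a history of used port counts and estimates how many reporting periods remain before the fabric runs out of ports. The linear growth model, the function name periods_until_full and the sample figures are all assumptions for illustration; commercial tools may use different models and inputs.

```python
def periods_until_full(used_ports, total_ports):
    """Estimate the reporting periods left until all ports are in use.

    `used_ports` is a list of used-port counts, one per reporting period
    (oldest first). A straight line is fitted by ordinary least squares;
    linear growth is an assumption of this sketch.
    Returns None if there are fewer than two samples or usage is not growing.
    """
    n = len(used_ports)
    if n < 2:
        return None

    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(used_ports) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, used_ports))

    slope = sxy / sxx
    if slope <= 0:
        return None  # Flat or shrinking usage: no exhaustion forecast.

    intercept = mean_y - slope * mean_x
    # Period index at which the fitted line reaches the port limit.
    x_full = (total_ports - intercept) / slope
    # Express the result relative to the most recent sample.
    return max(0.0, x_full - (n - 1))


if __name__ == "__main__":
    # Hypothetical monthly history for a fabric with 160 ports.
    history = [100, 110, 120, 130]
    print(periods_until_full(history, total_ports=160))
```

The same approach could be applied to inter-switch link utilization or bandwidth, provided the growth is roughly linear over the period sampled.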

Summary

IT departments must monitor the status of the SAN fabric infrastructure, and purchase and implement the tools and processes to do so. IT departments must be able to quickly identify and repair any problems before an outage occurs and fix any degradation in performance or availability in the SAN fabric before IT service is negatively affected.

Evidence

The Gartner storage analyst team takes thousands of customer inquiries each year from clients concerning storage-related questions and issues. The following vendors also provided information concerning the key SAN fabric indicators requested by their customers and provided by their products: Brocade, Cisco and Virtual Instruments.

Source: Gartner RAS Core Research Note G00213188, Valdis Filks, Bob Passmore, 2 June 2011


Note 1. Traffic Light Color Schemes as Dashboard Indicators

Colors similar to those of traffic light systems are commonly used to indicate the status of various metrics at a high, visual level. The accepted color indicators are:

Green = no problems; completed successfully
Yellow/amber = completed, with some errors, or an exception occurred that may require investigation
Red = process did not complete or failed, leading to an error that requires operator or administrator intervention
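The color scheme above can be expressed as a small classification function. The sketch below is one possible encoding: the names Status, classify and rollup are hypothetical, and the rule that a dashboard shows the worst status among its indicators is an assumption of this example, not something stated in this research.

```python
from enum import IntEnum


class Status(IntEnum):
    """Traffic light states, ordered so that a higher value is worse."""
    GREEN = 0
    AMBER = 1
    RED = 2


def classify(completed: bool, error_count: int) -> Status:
    """Map one indicator's outcome to a traffic light color."""
    if not completed:
        return Status.RED      # Did not complete or failed.
    if error_count > 0:
        return Status.AMBER    # Completed, with errors or exceptions.
    return Status.GREEN        # Completed successfully.


def rollup(statuses) -> Status:
    """Summarize several indicators as the worst one (assumed policy)."""
    return max(statuses, default=Status.GREEN)


if __name__ == "__main__":
    indicators = [
        classify(completed=True, error_count=0),
        classify(completed=True, error_count=2),
    ]
    print(rollup(indicators).name)
```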

Note 2. Dual SANs

It is an accepted best practice in the storage industry to design FC SANs to have dual SANs. This design stipulates that connections between servers and storage devices are via two separate paths that do not have any physical dependence or connection between them. This includes devices such as switches, extenders, compression or acceleration devices that the connection or path passes through. For comparison, traditional Ethernet and Internet Protocol (IP) networks are designed and implemented differently, using single paths, connections and networks.
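A dual-SAN compliance check can be sketched as a test that the two fabrics share no devices and that no inter-switch link (ISL) joins them. The data model below (sets of device names per fabric and ISLs as pairs of device names) and the function names are assumptions for illustration only; an actual tool would read the topology from the switches.

```python
def shared_devices(fabric_a, fabric_b):
    """Return devices that appear in both fabrics (should be none)."""
    return sorted(set(fabric_a) & set(fabric_b))


def cross_coupled_isls(fabric_a, fabric_b, isls):
    """Return ISLs that connect a device in one fabric to the other.

    Any such link merges the two fabrics into a single SAN and
    removes the independence that the dual-SAN design relies on.
    """
    a, b = set(fabric_a), set(fabric_b)
    return [
        (s1, s2)
        for s1, s2 in isls
        if (s1 in a and s2 in b) or (s1 in b and s2 in a)
    ]


if __name__ == "__main__":
    # Hypothetical topology with one offending link between the fabrics.
    fabric_a = {"sw-a1", "sw-a2"}
    fabric_b = {"sw-b1", "sw-b2"}
    isls = [("sw-a1", "sw-a2"), ("sw-b1", "sw-b2"), ("sw-a2", "sw-b1")]

    print(shared_devices(fabric_a, fabric_b))
    print(cross_coupled_isls(fabric_a, fabric_b, isls))
```

Running a check like this after every change would flag a violation before it becomes a single point of failure.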