When reporting system uptime for a SaaS product, how are you handling partial impact or intermittent issues where the system is running but some customers are experiencing issues while others are unaffected? Is there a best practice for determining what is reported in system uptime metrics as "down"? With all of the redundancies built in, the application we're supporting is rarely truly down; most issues fall into this partial/intermittent category, where only a limited set of users/customers is impacted. Unfortunately, it is often challenging to identify exactly which customers had issues and for how long. How have you handled similar issues?

Senior Enterprise Architect, Application Consulting in Healthcare and Biotech, 2 years ago

I've advised clients to combine application throughput metrics with ticket metrics. If there is a disparity among users performing different business functions or working at different locations, it becomes apparent very quickly. For example, we have uncovered both database bottlenecks and network appliance degradation using this approach.

CTO in Transportation, 2 years ago

We are a SaaS product and we have alarms for error rates at different levels in the infrastructure.
Ping services drive our general status pages, but as you mentioned, these do not always reflect what some of our customers are experiencing.
Alarms on the load balancers allow us to detect issues in the HTTP layer and between services.
Alarms on the application logs allow us to detect issues at the application level that may not result in HTTP errors, because they are handled correctly, but still present the user with an operational error.
Alarms on the logs for our UI layer: we have error log reporting in the UI, which allows us to detect client-side errors, such as JS errors and failures on complex interactive pages, that may not be caught in any of the other cases.
All our requests are tied to an identifier that allows us to identify what customers are affected.
Based on that we also maintain status pages and we update the affected customers.
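
As a rough illustration of the request-identifier approach described above, here is a minimal sketch that groups request outcomes by customer and flags those whose error rate crosses a threshold. The record shape (`customer_id`, `is_error`), the function name, and the 5% default threshold are assumptions made for the example, not details from this comment.

```python
from collections import defaultdict

def affected_customers(requests, threshold=0.05):
    """Return {customer_id: error_rate} for customers above the threshold.

    `requests` is an iterable of (customer_id, is_error) pairs. Both the
    record shape and the default threshold are hypothetical.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for customer_id, is_error in requests:
        totals[customer_id] += 1
        if is_error:
            errors[customer_id] += 1

    flagged = {}
    for customer_id, total in totals.items():
        rate = errors[customer_id] / total
        if rate > threshold:
            flagged[customer_id] = rate
    return flagged

# Hypothetical sample: one customer sees errors, the other does not.
sample = [
    ("acme", False), ("acme", True), ("acme", True),
    ("globex", False), ("globex", False),
]
print(affected_customers(sample))
```

In practice the same grouping would usually be done in the log or metrics store rather than in application code; the point is only that a per-request customer identifier makes partial impact measurable.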

Sr. Director, Head of Global Omnichannel Capabilities Delivery Center in Manufacturing, 2 years ago

We treat any downtime as downtime, even if it is intermittent or does not affect the full functionality of the system. We do measure the exact length of each outage, so a short-lived one generally doesn't raise much concern, but it is still reported as an outage in the overall count.

VP of IT in Software, 2 years ago

For the most part, we count a partial outage as a full outage unless it affects only very limited functionality. For intermittent disruptions, if the system is repeatedly going up and down, we count the entire period; otherwise we evaluate it on a case-by-case basis and may count a large percentage of the time as outage time.
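
One way to make the "count the entire period" rule for flapping systems concrete is to merge outage windows that sit close together before summing downtime. The sketch below assumes incidents arrive as (start, end) pairs and that a 10-minute gap is the cutoff for "very up and down"; both the function names and the cutoff are illustrative choices, not something stated in this comment.

```python
from datetime import datetime, timedelta

def merged_downtime(incidents, max_gap=timedelta(minutes=10)):
    """Total downtime after merging windows separated by at most max_gap.

    `incidents` is a list of (start, end) datetime pairs. The gap between
    two merged windows is counted as downtime too.
    """
    total = timedelta(0)
    current_start = current_end = None
    for start, end in sorted(incidents):
        if current_end is not None and start - current_end <= max_gap:
            # Close enough to the previous window: extend it.
            current_end = max(current_end, end)
        else:
            if current_end is not None:
                total += current_end - current_start
            current_start, current_end = start, end
    if current_end is not None:
        total += current_end - current_start
    return total

def uptime_percent(downtime, period):
    """Uptime as a percentage of the reporting period."""
    return 100.0 * (1 - downtime / period)

# Hypothetical incidents: two blips three minutes apart, one later outage.
day = datetime(2024, 1, 1)
incidents = [
    (day.replace(hour=10, minute=0), day.replace(hour=10, minute=5)),
    (day.replace(hour=10, minute=8), day.replace(hour=10, minute=12)),
    (day.replace(hour=11, minute=0), day.replace(hour=11, minute=10)),
]
print(merged_downtime(incidents))
```

The first two windows merge into a single 12-minute outage because they are only three minutes apart, while the third is counted separately.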

Manager in Construction, 2 years ago

I'd agree with the other comment here: describe the key issue types and the % impact on the user base. If it is affecting a small number of users, then it's a P3.

P1 - System down for all users
P2 - System up but critical functionality down for more than 5% users
P3 - System up but critical functionality down for less than 1% users
P4 - Feature issue.
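
The priority levels above can be expressed as a small classification function. This is only a sketch: the function name and inputs are hypothetical, and because the list leaves the 1% to 5% band unspecified, the code assumes anything at or below the P2 threshold falls to P3.

```python
def classify_priority(affected_fraction, critical, p2_threshold=0.05):
    """Map an incident to P1-P4 using the scheme above.

    affected_fraction: share of users impacted, from 0.0 to 1.0.
    critical: True if critical functionality (or the whole system) is down.
    Assumption: impact at or below p2_threshold is treated as P3, since
    the original list does not cover the 1%-5% band.
    """
    if not critical:
        return "P4"  # feature issue
    if affected_fraction >= 1.0:
        return "P1"  # down for all users
    if affected_fraction > p2_threshold:
        return "P2"
    return "P3"

print(classify_priority(0.2, critical=True))
```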
