When reporting system uptime for a SaaS product, how are you handling partial impact or intermittent issues, where the system is running but some customers are experiencing issues while others are unimpacted? Is there a best practice for determining what is reported as "down" in system uptime metrics? With all of the redundancies built in, the application we're supporting is rarely truly down; most issues fall into this partial/intermittent category, where only a limited set of users/customers is impacted. Unfortunately, it can often be challenging to identify exactly which customers had issues and for how long. How have you handled similar issues?
We are a SaaS product and we have alarms for error rates at different levels in the infrastructure.
Ping services feed our general status pages, but as you mentioned, that does not always reflect what some of our customers are experiencing.
Alarms on the load balancers allow us to detect issues in the HTTP layer and between services.
Alarms on the application logs allow us to detect issues at the application level that may not result in HTTP errors, because they are handled correctly, but that still present the user with an operational error.
Alarms on the logs for our UI layer: we have error log reporting in the UI, which allows us to detect client-side errors (JS errors, problems in complex interactive pages) that may not show up in any of the other cases.
All our requests are tied to an identifier that allows us to identify which customers are affected.
Based on that we also maintain status pages and we update the affected customers.
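To illustrate the identifier-based approach above, here is a minimal sketch of turning request logs into a list of affected customers. The record shape `(customer_id, is_error)`, the function name `affected_customers`, and both thresholds are assumptions made up for the example, not something from our actual setup.

```python
from collections import defaultdict


def affected_customers(requests, min_requests=5, error_rate_threshold=0.1):
    """Return {customer_id: error_rate} for customers at or above the threshold.

    `requests` is an iterable of (customer_id, is_error) pairs. This shape is
    hypothetical; real logs would need to be parsed into it first.
    """
    totals = defaultdict(int)
    errors = defaultdict(int)
    for customer_id, is_error in requests:
        totals[customer_id] += 1
        if is_error:
            errors[customer_id] += 1

    affected = {}
    for customer_id, total in totals.items():
        # Skip customers with too little traffic to judge (assumed cutoff).
        if total < min_requests:
            continue
        rate = errors[customer_id] / total
        if rate >= error_rate_threshold:
            affected[customer_id] = rate
    return affected


# Hypothetical sample: one customer partly failing, one healthy, one fully failing.
sample = (
    [("acme", False)] * 8 + [("acme", True)] * 2
    + [("globex", False)] * 10
    + [("initech", True)] * 5
)
print(affected_customers(sample))
```

Running the same calculation per time window would also give a rough answer to "for how long" for each customer.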
We treat any downtime, even if it is intermittent or does not impact the full functionality of the system, as downtime. We do measure the exact length of each outage, so short-lived ones generally don't raise much concern, but they are still counted as outages in the totals.
For the most part, we count a partial outage as a full outage unless the partial affects only very limited functionality. For intermittent disruptions, if it's very up and down we count the entire period; otherwise we evaluate it on a case-by-case basis and might count a large percentage of the time as outage time.
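One way to make "count the entire period" concrete is to merge outage intervals that sit close together before summing downtime. This is only a sketch: times are plain minute offsets, and the `max_gap` of 5 minutes and the function names are assumptions picked for illustration, not a standard.

```python
def merge_flapping(intervals, max_gap):
    """Merge (start, end) outage intervals whose gap is at most `max_gap`."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] <= max_gap:
            # Close enough to the previous outage: treat it as one period.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(m) for m in merged]


def downtime_minutes(intervals, max_gap=5):
    """Total downtime after merging flapping intervals."""
    return sum(end - start for start, end in merge_flapping(intervals, max_gap))


def availability_percent(intervals, period_minutes, max_gap=5):
    """Availability over the reporting period, as a percentage."""
    return 100.0 * (1 - downtime_minutes(intervals, max_gap) / period_minutes)


# Hypothetical month: three short flaps close together, then one separate outage.
outages = [(0, 2), (4, 6), (8, 10), (60, 65)]
print(merge_flapping(outages, max_gap=5))        # flaps collapse into (0, 10)
print(downtime_minutes(outages))                 # 15, versus 11 if summed raw
print(availability_percent(outages, 30 * 24 * 60))
```

A larger `max_gap` is the stricter choice, since more of the "up" time between flaps gets counted as down.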
I'd agree with the other comment here: describe the key issue types and the % impact on the user base. If it is affecting a small number of users, then it's a P3.
P1 - System down for all users
P2 - System up but critical functionality down for more than 5% of users
P3 - System up but critical functionality down for less than 1% of users
P4 - Feature issue.
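The levels above could be encoded roughly as follows. Note that the list leaves the range between 1% and 5% of users unspecified; this sketch assumes anything at or below the P2 threshold falls to P3, which is my assumption and not part of the list. The function and parameter names are hypothetical.

```python
def classify_severity(system_up, critical_function_down, affected_fraction,
                      p2_threshold=0.05):
    """Map an incident to P1-P4 using the levels listed above.

    `affected_fraction` is the share of users impacted (0.0 to 1.0).
    Assumption: incidents with critical functionality down for 5% of users
    or fewer are treated as P3, including the unspecified 1%-5% range.
    """
    if not system_up:
        return "P1"  # System down for all users
    if critical_function_down:
        return "P2" if affected_fraction > p2_threshold else "P3"
    return "P4"      # Feature issue, critical functionality intact


print(classify_severity(False, True, 1.0))    # P1
print(classify_severity(True, True, 0.20))    # P2
print(classify_severity(True, True, 0.005))   # P3
print(classify_severity(True, False, 0.30))   # P4
```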
I've advised clients to combine application throughput metrics with ticket metrics. If there is a disparity among users performing different business functions or working at different locations, it becomes apparent very quickly. For example, we have uncovered both database bottlenecks and network appliance degradation using this approach.
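As a rough sketch of that comparison, the snippet below flags groups (locations or business functions) where throughput has dropped against a baseline while tickets have come in. The dictionary inputs, the 30% drop threshold, and the minimum ticket count are all assumptions for illustration; real thresholds would depend on the product.

```python
def flag_disparities(throughput, baseline, tickets,
                     drop_threshold=0.3, min_tickets=3):
    """Return groups with a throughput drop that is backed by ticket volume.

    `throughput` and `baseline` map group -> transactions in the window;
    `tickets` maps group -> tickets opened in the same window.
    """
    flagged = []
    for group, base in baseline.items():
        if base <= 0:
            continue  # No baseline traffic to compare against.
        drop = 1 - throughput.get(group, 0) / base
        if drop >= drop_threshold and tickets.get(group, 0) >= min_tickets:
            flagged.append(group)
    return sorted(flagged)


# Hypothetical numbers: only one location is both slow and raising tickets.
baseline = {"london": 1000, "austin": 800, "tokyo": 500}
current = {"london": 980, "austin": 400, "tokyo": 490}
tickets = {"austin": 7, "london": 1}
print(flag_disparities(current, baseline, tickets))
```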