5 ways to ensure your SLA metrics backfire

This is a continuation of my series of posts on the Zen of Service Level Agreement Metrics.

Zen Habit #2: Do the math

You have just finished defining your Service Level Agreement metrics and targets. You are satisfied that you now have a true measure of your service and your outsourcing contract. However, if you have not done your math, this could backfire with unintended consequences.

Here are five mathematical traps that can set your SLA definitions up for failure.

The Percentage SLA Trap

  • How did you define a Service Level Agreement metric target – is it a percentage or a number? Are you measuring Quality metrics or Performance metrics?
  • A percentage target focuses the operation on acceptable behavior across the total number of events, whereas a count target focuses it on the individual outages.
  • Percentage value SLA metrics are fine for measuring overall performance, but you might want to use number value SLA metrics for critical events that can damage the reputation of your service. Therefore a very visible and critical service is better served by counting the maximum number of allowed outages per time period instead of only measuring a percentage of availability.
  • A single critical outage that attracts attention can damage your operation's reputation even while the percentage target is met: with a percentage-based SLA metric, the law of averages can work against you.
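
To make the difference concrete, here is a minimal sketch in Python. The function names, the 30-day month, the 99.9% target and the 30-minute "critical" threshold are all illustrative assumptions of mine, not values from any real contract.

```python
# Illustrative sketch: all names and thresholds below are assumptions.

def availability_pct(outage_minutes, period_minutes):
    """Availability over the period, as a percentage."""
    downtime = sum(outage_minutes)
    return 100.0 * (period_minutes - downtime) / period_minutes


def meets_percentage_sla(outage_minutes, period_minutes, target_pct):
    """True if overall availability reaches the percentage target."""
    return availability_pct(outage_minutes, period_minutes) >= target_pct


def meets_count_sla(outage_minutes, critical_minutes, max_critical_outages):
    """True if the number of long (critical) outages stays within the cap."""
    critical = [m for m in outage_minutes if m >= critical_minutes]
    return len(critical) <= max_critical_outages


PERIOD = 30 * 24 * 60      # minutes in a 30-day month
many_blips = [1] * 40      # forty one-minute blips
one_long = [40]            # a single, very visible 40-minute outage

for name, outages in (("many blips", many_blips), ("one long outage", one_long)):
    print(
        name,
        "| percentage SLA met:", meets_percentage_sla(outages, PERIOD, 99.9),
        "| count SLA met:", meets_count_sla(outages, 30, 0),
    )
```

Both months have the same total downtime, so the percentage target treats them identically. Only the count-based target notices the one outage that your users will remember.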

The Expensive Mean SLA Trap

  • It sounds counter-intuitive: An SLA should not reflect the service you expect from the provider.
  • Mathematically, the expectation of a service is its statistical mean. On a normal (Gaussian) distribution of events, this means that about half of the events will be worse than the mean.
  • An SLA defined at the statistical mean is an example of what I call a “Set to Fail” SLA. A good working relationship cannot be established on such an SLA definition.
  • An ideal SLA should be set to the left of the mean, that is, below the service level you expect. This guarantees a minimum service for you at a level of risk your provider can accept.
  • You might be tempted to say “well, let the provider handle this risk” – but there is no free lunch. You will be unnecessarily paying the risk premium that your provider would calculate into the price.
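
A small sketch of the arithmetic, assuming the service measure is a score where higher is better and that it is roughly normally distributed. The mean of 95 and the standard deviation of 2 are made-up numbers for illustration only.

```python
from statistics import NormalDist


def share_below_target(mean, std_dev, target):
    """Share of events expected to fall below the target,
    assuming a normal distribution (an assumption, not a given)."""
    return NormalDist(mu=mean, sigma=std_dev).cdf(target)


# Hypothetical service: mean score 95, standard deviation 2.
at_mean = share_below_target(95, 2, 95)        # target set at the mean
left_of_mean = share_below_target(95, 2, 91)   # target two std devs below

print(f"Target at the mean:      {at_mean:.1%} of events miss it")
print(f"Target left of the mean: {left_of_mean:.1%} of events miss it")
```

With the target at the mean, half of all events miss it. Moving the target two standard deviations to the left brings the share of misses down to a few percent, which is a risk a provider can price sensibly.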

The Law of Small Numbers Trap

  • The total number of events also plays an unwitting role – this is when the law of small numbers can work against what you want to achieve.
  • If the number of events in a time period is very small, the likelihood of an SLA breach increases.
  • If the likelihood of an SLA breach is high, then you have forced the provider to include the penalty payment in the price, which means you are paying more than you need to for the service.
  • You are also encouraging the provider to leave an issue unresolved once an SLA breach occurs.
  • A provider who has already taken a penalty payment against margin is likely to spare the effort of resolving an issue that has already been penalized. You lose out in the end. This is also a “Set to Fail” SLA.
  • So if you are defining an SLA metric on infrequent occurrences, you have two options in your SLA Management: either combine similar services to increase the sample size, or use the ladder principle and scale the leniency of the SLA metric to the sample size, so that a smaller sample gets a more lenient target.
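
The effect of a small sample can be sketched with a simple binomial model. I am assuming here that each event fails independently with the same probability, which real operations rarely satisfy exactly. The 5% failure rate and the 90% target are hypothetical.

```python
from math import comb


def breach_probability(n_events, failure_rate, target_pct):
    """Probability that a period with n_events breaches a percentage SLA,
    assuming independent events with a fixed failure rate.

    target_pct is an integer percentage, e.g. 90 for "90% of events OK".
    """
    # Whole failures that the target tolerates in this period.
    allowed_failures = n_events * (100 - target_pct) // 100
    p_within_sla = sum(
        comb(n_events, k)
        * failure_rate**k
        * (1 - failure_rate) ** (n_events - k)
        for k in range(allowed_failures + 1)
    )
    return 1.0 - p_within_sla


# Same provider quality (5% failures), same 90% target, different volumes.
for n in (5, 100):
    print(f"{n:>3} events per period: breach probability "
          f"{breach_probability(n, 0.05, 90):.1%}")
```

With only 5 events per period the target tolerates no failure at all, so a provider who is comfortably better than the target still breaches it in more than one period out of five. With 100 events the same provider breaches only rarely.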

Sizing for Statistical Significance Trap

  • The observation time for your SLA should allow statistical significance for the measure you have in mind.
  • If the time period is too short, then you are again approaching the “set-to-fail” danger.
  • Example: Let's take an SLA that allows a maximum of 3 critical errors per work order in User Acceptance Test, measured per quarter. This assumes that you have enough work orders in a quarter to reach a statistically meaningful sample size. If your operation does not have this volume of work orders, you reach the “set-to-fail” threshold, and you might want to extend the measurement period to six months to capture a large enough sample of work orders.
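
As a rough sizing aid, the observation period can be derived from the volume of work orders. The minimum sample size of 30 used below is a common rule of thumb that I am assuming for illustration; your own threshold may differ.

```python
from math import ceil


def months_needed(work_orders_per_month, min_sample_size):
    """Months of observation needed to collect min_sample_size work orders.

    min_sample_size is an assumed threshold, not a universal constant.
    """
    if work_orders_per_month <= 0:
        raise ValueError("work_orders_per_month must be positive")
    return ceil(min_sample_size / work_orders_per_month)


# Hypothetical operation with 5 work orders a month and a minimum sample of 30.
print(months_needed(5, 30))   # a quarter would only yield 15 work orders
```

In this hypothetical case a quarterly measure sees only half the sample it needs, so the measurement period has to be six months.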

The Safety Net Trap

  • Be aware that if a provider pays penalties due to an unresolved event that causes an SLA breach, they are no longer motivated to resolve this event.
  • Resolving such an event means spending more effort and eating further into a margin that has already been eroded.
  • Your SLA mechanism should catch all such breach-events and incentivize their resolution.
  • Software programming languages have exception handling as a language feature – you need “exception handling” in your SLA strategy as well.
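
One possible form of such “exception handling” is a penalty that keeps growing while a breached event stays open. This is only a sketch of one mechanism under my own assumptions; the function name and the amounts are hypothetical, and your contract may need a different incentive.

```python
def total_penalty(days_open_after_breach, base_penalty, daily_escalation=0):
    """Penalty owed for a breached event that is still unresolved.

    With daily_escalation == 0 this is a flat, one-off penalty.
    A positive daily_escalation is the assumed "exception handler":
    every further day the event stays open costs the provider more.
    """
    return base_penalty + daily_escalation * days_open_after_breach


# Flat penalty: resolving on day 1 or on day 30 costs the provider the same.
flat_day_1 = total_penalty(1, 1000)
flat_day_30 = total_penalty(30, 1000)

# Escalating penalty: waiting until day 30 is clearly more expensive.
escalating_day_1 = total_penalty(1, 1000, daily_escalation=50)
escalating_day_30 = total_penalty(30, 1000, daily_escalation=50)

print("Flat:      ", flat_day_1, "vs", flat_day_30)
print("Escalating:", escalating_day_1, "vs", escalating_day_30)
```

Under the flat scheme the provider gains nothing by resolving the event quickly. Under the escalating scheme every day of delay has a price, so the incentive to resolve survives the breach.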

Service Level measurement is all about ensuring that the mathematics encourages the right behaviour. Doing the math diligently ensures that well-intentioned Service Level Agreements actually achieve the desired effect.

Have you done the math on your SLA metric definitions? Are they encouraging the right behaviour?

photo credit: solofotones via photopin cc
