Why averages are plain wrong

Let’s say you served 10 requests and they took, respectively, 100, 110, 130, 145, 980, 945, 110, 135, 120 and 125 milliseconds.

Why the jump to 980? Maybe you were in the middle of a backup. Maybe the network was a bit congested. Maybe anything else.

Your reported average for these requests will be 290ms. Think about this value for a second:

  • if you access your page right now, does it accurately predict what your response time will be? NO.
  • does it give you any indication of the spikes that happen during backups? NO.
  • is it as useful as a totally abstract number? YES.
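To make the point concrete, here is a small sketch that computes the average for the sample above and compares it to the actual observations. The variable names (`mean`, `closest`, `below`) are only illustrative.

```javascript
// Response times from the example above, in milliseconds.
const data = [100, 110, 130, 145, 980, 945, 110, 135, 120, 125];

// Arithmetic mean: the sum of all response times divided by their count.
const mean = data.reduce((sum, t) => sum + t, 0) / data.length;

// The observed response time that lies closest to the mean.
const closest = data.reduce((best, t) =>
  Math.abs(t - mean) < Math.abs(best - mean) ? t : best
);

// How many requests were actually faster than the mean.
const below = data.filter((t) => t < mean).length;

console.log(mean);    // 290
console.log(closest); // 145
console.log(below);   // 8
```

No request took anywhere near 290ms: the closest one took 145ms, and 8 out of 10 requests were faster than the average. The two slow requests drag the number up without showing up as spikes.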

What to use instead

Use percentiles. Always. We’ll illustrate how percentiles work by showing you what questions they answer. We’ll use the previous data for these examples.

  • What time is the fastest 50% of all my traffic served in?
  • How slow, at minimum, are the slowest 10% of my requests?

And the answers for our data set are (using stats-percentile for Node, a neat and simple lib):

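// assumes the lib is installed and exposes calc() as used below
const percentile = require('stats-percentile');
let data = [100, 110, 130, 145, 980, 945, 110, 135, 120, 125];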
percentile.calc(data, 50)
=> 125
// "50% of your requests are served in at most 125ms"
percentile.calc(data, 80)
=> 145
// "80% of your requests are served in at most 145ms"
percentile.calc(data, 90)
=> 945
// "90% of your requests are served in at most 945ms"

Now that is telling us something.
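If you are curious how such numbers can be derived, here is a minimal sketch of the nearest-rank method, one common way to compute percentiles. It reproduces the values above for our data set, but whether stats-percentile uses this exact method is an assumption; the function name `nearestRankPercentile` is made up for this example.

```javascript
// Nearest-rank percentile: sort the values, then pick the value at
// position ceil(p% of n), counting from 1.
// Hypothetical helper for illustration, not the stats-percentile API.
function nearestRankPercentile(values, p) {
  if (values.length === 0) {
    throw new RangeError("values must not be empty");
  }
  if (p <= 0 || p > 100) {
    throw new RangeError("p must be in the range (0, 100]");
  }
  // Copy before sorting so the caller's array is left untouched;
  // the comparator forces numeric rather than lexicographic order.
  const sorted = [...values].sort((a, b) => a - b);
  // Multiply before dividing to keep the arithmetic exact for integer p.
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

const data = [100, 110, 130, 145, 980, 945, 110, 135, 120, 125];

console.log(nearestRankPercentile(data, 50)); // 125
console.log(nearestRankPercentile(data, 80)); // 145
console.log(nearestRankPercentile(data, 90)); // 945
```

Sorted, the data reads 100, 110, 110, 120, 125, 130, 135, 145, 945, 980. The 50th percentile is the 5th value, the 80th is the 8th, and the 90th is the 9th, which is where the backup spikes finally show up.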

Other resources

The topic has been discussed widely. Here are some great resources if you want to get into the nitty-gritty details of it all.