Monitoring – part 1


You have a system in prod. All's well, right? The frontend loads and users aren't complaining – those are the metrics, right?

Suddenly the users are complaining, the frontend doesn't load, or transactions aren't flowing. Now what? The purpose of monitoring is to prove, at various levels of synthesis, that your system serves users and performs its various functions quickly and correctly. 

At the top level, monitoring asks:

  • is it up?
  • is the latency acceptable?
  • what are the total transactions per second (TPS)?
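These top-level questions take very little machinery to answer. As an illustrative sketch (the names and thresholds here are invented, not from any particular system), a counter plus a clock yields TPS, and a recency check stands in for "is it up?" – clock values are passed in explicitly so the logic is testable, but in production you would pass `time.monotonic()`:

```python
class TopLevel:
    """Answers two of the top-level questions for one service.

    Hypothetical sketch: clock values are injected for testability;
    in production, pass time.monotonic() as `now`.
    """

    def __init__(self, now):
        self.started = now
        self.transactions = 0
        self.last_seen = now

    def on_transaction(self, now):
        self.transactions += 1
        self.last_seen = now

    def is_up(self, now, max_silence_s=30.0):
        # "up" here means: the system processed a transaction recently
        return now - self.last_seen <= max_silence_s

    def tps(self, now):
        # total transactions per second since startup
        elapsed = now - self.started
        return self.transactions / elapsed if elapsed > 0 else 0.0


svc = TopLevel(now=0.0)
for i in range(50):
    svc.on_transaction(now=i * 0.2)   # one transaction every 200ms
print(svc.tps(now=10.0), svc.is_up(now=10.0))
```

Latency, the remaining question, comes from timing the transactions themselves rather than from a separate probe – which foreshadows the app-specific-measurement point below.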

As you answer these questions, more specific questions emerge:

  • What are the average, min, max, and standard deviation of latency for dataflow through each piece of the pipeline?
  • How many warnings, errors, or exceptions occurred within the past 10 minutes?
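As a minimal sketch of how those per-stage questions might be answered (all names here are illustrative, not from any particular system), each pipeline stage can fold its latencies into summary statistics and keep a rolling window of error timestamps:

```python
import statistics
import time
from collections import deque


class StageStats:
    """Per-stage latency summary plus a rolling error count.

    Hypothetical sketch: timestamps are injected for testability;
    by default time.time() is used.
    """

    def __init__(self, window_seconds=600):
        self.latencies = []           # seconds per item through this stage
        self.error_times = deque()    # timestamps of warnings/errors/exceptions
        self.window = window_seconds  # 10 minutes by default

    def record_latency(self, seconds):
        self.latencies.append(seconds)

    def record_error(self, now=None):
        self.error_times.append(now if now is not None else time.time())

    def errors_in_window(self, now=None):
        now = now if now is not None else time.time()
        # drop timestamps older than the window, then count what remains
        while self.error_times and now - self.error_times[0] > self.window:
            self.error_times.popleft()
        return len(self.error_times)

    def summary(self):
        return {
            "avg": statistics.mean(self.latencies),
            "min": min(self.latencies),
            "max": max(self.latencies),
            "stdev": statistics.stdev(self.latencies) if len(self.latencies) > 1 else 0.0,
        }


stage = StageStats()
for s in (0.012, 0.015, 0.011, 0.042):
    stage.record_latency(s)
stage.record_error(now=100.0)
print(stage.summary(), stage.errors_in_window(now=200.0))
```

In a real system these summaries would be aggregated per stage and shipped somewhere queryable; the point is only that the questions above map onto very small amounts of code.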


The Goals

Each system’s goals need to be clearly articulated in plain English, and any monitoring must exist to provide proof that those goals are met – including the ever-present need to stay within SLA (Service Level Agreement).

While incidents will happen, postmortems need to be easy too, and monitoring is central to achieving that.

Regardless of the technology used, your monitoring systems must prove the goals of the base system are being met. The monitoring system and the system it monitors need to be separated for resilience and architectural clarity.

Any agreement on service or performance targets can be met to the letter and still produce dissatisfied users and customers — that’s a signal the agreement needs changing.

Each shop tends to have its own idea of which tools to use, so we adapt to circumstances and frequently do not make those large choices. However, there are considerations that sit above the tools and can lead to significant deviation from common practice:

  • Near-realtime systems (millisecond or finer granularities) are typically poorly served by out-of-the-box monitoring from AWS, Azure, New Relic, Datadog, etc. That is not to say you cannot use it – many, many people do. But then their incidents are unnecessarily unpleasant.
  • While we typically hew to the 12-factor app methodology (https://12factor.net/) in system design, we part company with it when it comes to the purpose of logs, because the thinking behind that purpose is, to our mind, archaic, insufficient, and fit only for slow, weak-SLA systems.
  • A monitor that tells you granular facts is less valuable than one which correctly synthesizes the situation and alerts you to a failure state.
  • We believe synthetic (CPU, memory, IO etc) measurements are inherently less interesting and flexible than app-generated, app-specific measurements.
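To make the synthesis point concrete, here is a minimal, hypothetical sketch (function name and thresholds are invented for illustration): rather than paging a human with each granular fact – a latency number, an error count – a small rule folds them into a single named failure state:

```python
def diagnose(heartbeat_age_s, p99_latency_ms, errors_last_10m):
    """Fold granular facts into one named failure state.

    Illustrative only: real thresholds come from the system's goals and SLA,
    and real rules come from the failure modes you have actually seen.
    """
    if heartbeat_age_s > 30:
        return "DOWN: no heartbeat in 30s"
    if errors_last_10m > 50 and p99_latency_ms > 500:
        return "DEGRADED: error spike with high latency - suspect upstream failure"
    if p99_latency_ms > 500:
        return "SLOW: p99 latency over 500ms"
    return "OK"


print(diagnose(heartbeat_age_s=2, p99_latency_ms=120, errors_last_10m=3))
```

The value is in the ordering and the combination: the same two facts (errors plus latency) produce a different, more actionable alert than either fact would alone.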

You might reasonably wonder why, if a failure state is known beforehand, the system isn't self-healing. About all I can say is: if you've ever had a data vendor arbitrarily change a column name on you, or hit any of the greater infinity of other random things that can go wrong, you'd expect not to be able to heal them all. That doesn't mean alarms can't tell you more!

In conclusion

Monitoring is an “easy to learn, hard to master” topic in system design and the above illustrate some of the dilemmas and principles we consider as we design and iterate systems. In the next post in this series, we’ll look at the 12-factor app!