A study of the golden signals

Site reliability engineering (SRE) is on the rise, and service monitoring is a big part of it. While monitoring a complex solution is an engineering endeavor in itself, the key to success is to start with a few simple metrics: latency, traffic, errors and saturation (a.k.a. the four golden signals).

If you only measure four metrics of your user facing systems, focus on those four

SRE Book

The four golden signals

| Name | Description |
| --- | --- |
| Latency | The time it takes to service a request. Also known as response time. Note: mixing the latency of successful requests with the latency of failed requests can lead to the wrong conclusion. For example, 403 Forbidden errors are usually returned very quickly, but that doesn't suggest the service is in good health. |
| Traffic | A measure of how much demand is being placed on your system. For a web service, this measurement is usually HTTP requests per second. It's also known as throughput. |
| Error | The rate of requests that fail. Explicit failure: e.g. HTTP 400s and 500s. Implicit failure: e.g. an HTTP 200 coupled with the wrong content. Policy violation: e.g. if you committed to one-second response times, any request over one second is an error. |
| Saturation | How overloaded your service is. This is measured directly by the utilisation of CPU, memory and bandwidth, the number of messages in the ESB, and so on. |
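To make the definitions above concrete, here is a minimal sketch of how the four signals could be derived from a batch of request records. Everything in it is an assumption made for illustration: the `Request` record shape, the `content_ok` flag for implicit failures, the one-second `SLO_SECONDS` policy, and the use of queue depth over queue capacity as the saturation measure. A real system would normally get these numbers from its monitoring stack rather than compute them by hand.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    """One served request (hypothetical record shape, for illustration only)."""
    duration: float          # seconds taken to serve the request
    status: int              # HTTP status code
    content_ok: bool = True  # False when a 200 carried the wrong content


SLO_SECONDS = 1.0  # assumed policy: slower responses count as errors


def is_error(req: Request) -> bool:
    """Classify a request using the three kinds of failure from the table."""
    explicit = req.status >= 400                         # e.g. HTTP 400s, 500s
    implicit = req.status < 400 and not req.content_ok   # 200 with wrong content
    policy = req.duration > SLO_SECONDS                  # broke the latency promise
    return explicit or implicit or policy


def golden_signals(requests, window_seconds, queue_depth, queue_capacity):
    """Summarise one observation window as the four golden signals."""
    # Keep successful and failed latencies apart so fast failures
    # (such as a quick 403) do not make the service look healthy.
    ok = [r.duration for r in requests if r.status < 400]
    failed = [r.duration for r in requests if r.status >= 400]
    errors = sum(is_error(r) for r in requests)
    return {
        "latency_ok_median": median(ok) if ok else None,
        "latency_failed_median": median(failed) if failed else None,
        "traffic_rps": len(requests) / window_seconds,
        "error_rate": errors / len(requests) if requests else 0.0,
        "saturation": queue_depth / queue_capacity,
    }


# Made-up sample data: five requests observed over a ten-second window.
sample = [
    Request(0.20, 200),
    Request(0.40, 200),
    Request(1.50, 200),                    # policy violation (over one second)
    Request(0.01, 403),                    # explicit failure, returned quickly
    Request(0.30, 200, content_ok=False),  # implicit failure
]
signals = golden_signals(sample, window_seconds=10,
                         queue_depth=250, queue_capacity=1000)
print(signals)
```

Reporting the latency of successful and failed requests separately follows the note in the Latency row: the 403 in the sample returns in 10 ms, and averaging it in with the rest would flatter the numbers.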

The four golden signals are very popular, but it is worth noting that there are at least two other methods:

Definitely worth reading if you are interested in service monitoring or SRE in general.
