A study of the golden signals

Site reliability engineering (SRE) is on the rise, and service monitoring is a big part of it. While monitoring a complex solution is an engineering endeavor in itself, the key to success is to start with a few simple metrics: latency, traffic, errors and saturation (a.k.a. the four golden signals).

If you can only measure four metrics of your user-facing system, focus on these four.
SRE Book

The four golden signals

Name	Description
Latency	The time it takes to service a request. Also known as response time. Note: mixing the latency of successful requests with the latency of failed requests can lead to the wrong conclusion. For example, 403 Forbidden errors are usually returned very quickly, but that doesn’t suggest the service is in good health.
Traffic	A measure of how much demand is being placed on your system. For a web service, this measurement is usually HTTP requests per second. It’s also known as throughput.
Error	The rate of requests that fail. A failure can be explicit (e.g. HTTP 400s and 500s), implicit (e.g. HTTP 200 but with the wrong content), or a policy violation (e.g. if you committed to one-second response times, any request that takes over one second is an error).
Saturation	How overloaded your service is. This is typically measured by the utilisation of CPU, memory and bandwidth, the number of messages queued in the ESB, and so on.
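
As a rough illustration of the table above, here is a minimal Python sketch that derives the four signals from one window of request records. The `Request` record, the `golden_signals` function, the one-second `slo_s` threshold and the use of CPU utilisation as the saturation figure are all assumptions made for this example, not part of any particular monitoring tool. Implicit failures (HTTP 200 with the wrong content) cannot be seen from the status code alone, so the sketch does not count them.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    """Hypothetical request record; field names are assumptions."""
    status: int        # HTTP status code
    duration_s: float  # time taken to serve the request, in seconds


def golden_signals(requests, window_s, cpu_utilisation, slo_s=1.0):
    """Summarise one observation window of request records."""
    # Latency: keep successful and failed requests apart, because
    # fast failures (e.g. 403) would otherwise flatter the numbers.
    ok = [r.duration_s for r in requests if r.status < 400]
    failed = [r.duration_s for r in requests if r.status >= 400]

    # Errors: explicit failures (4xx/5xx) plus policy violations
    # (requests slower than the assumed response-time commitment).
    errors = sum(
        1 for r in requests if r.status >= 400 or r.duration_s > slo_s
    )

    return {
        "latency_ok_median_s": median(ok) if ok else None,
        "latency_failed_median_s": median(failed) if failed else None,
        "traffic_rps": len(requests) / window_s,
        "error_rate": errors / len(requests) if requests else 0.0,
        # Saturation: passed in directly; CPU is just one possible proxy.
        "saturation": cpu_utilisation,
    }


sample = [
    Request(200, 0.2),
    Request(200, 0.4),
    Request(200, 1.5),   # successful, but slower than the 1 s commitment
    Request(403, 0.01),  # fast failure
    Request(500, 0.03),
]
signals = golden_signals(sample, window_s=10, cpu_utilisation=0.7)
print(signals)
```

In this sample the failed requests are much faster than the successful ones, which is exactly why the two latency figures are reported separately. In practice a percentile (such as the 99th) is often more telling than the median used here.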

The four golden signals are very popular, but it is worth noting that there are at least two other methods:

Brendan Gregg’s USE method – utilisation, saturation and errors
Tom Wilkie’s RED method – rate, errors and duration

Both are definitely worth reading about if you are interested in service monitoring or SRE in general.
