
The Three Pillars

Overview


The three pillars of observability are:

Logs

Logs are probably the most familiar pillar. We use Logs to refer to a collection of Log Records. A Log Record is a recording of an Event: by definition, an immutable, timestamped payload that can carry metadata.
In our case, we currently use a structured approach in JSON format, where the payload contains the data defined at code level (so it is the developer who controls this information), and the metadata carries several attributes such as labels, fields, and trace IDs.
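To make the shape of such a record concrete, here is a minimal sketch of a structured JSON Log Record. The field names (`payload`, `metadata`, `labels`, `trace_id`) are illustrative assumptions, not the exact schema our systems use:

```python
import json
import time

def make_log_record(message, labels=None, trace_id=None):
    """Build a structured Log Record: an immutable, timestamped payload
    plus metadata. Field names are illustrative, not our exact schema."""
    return json.dumps({
        "timestamp": time.time(),          # when the Event occurred
        "payload": {"message": message},   # data defined at code level
        "metadata": {
            "labels": labels or {},       # e.g. which service emitted it
            "trace_id": trace_id,         # links the record to a Trace
        },
    })

record = make_log_record(
    "order created",
    labels={"service": "checkout"},
    trace_id="abc123",
)
```

Keeping the trace ID in the metadata is what later lets us jump from a trace straight to its relevant log lines.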

Metrics

Metrics are the raw representation of the health of components and systems over intervals of time. They range from low-level summaries (like CPU and memory levels) to high-level data generated by components (like the number of HTTP 500 responses).

Some systems, like Amazon CloudWatch, allow the creation of custom Metrics based on the output of a component's business logic (https://docs.aws.amazon.com/AmazonCloudWatch/latest/logs/MonitoringLogData.html).

Collecting metrics allows us to apply statistical transformations to the collected data and, from that output, define an alerting system based on the levels we consider healthy or non-healthy.

Traces

A Trace is a representation of a request flow across the distributed components. This is of great help when analyzing and understanding the life-cycle of a request, its execution forks, latency, and responses.
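A trace can be pictured as a set of spans, each recording one unit of work in one component, linked by parent IDs into the request flow. The sketch below (field and component names are hypothetical) shows how end-to-end latency falls out of that structure:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Span:
    """One unit of work in a Trace; parent_id links spans into the
    request flow across components. All names are illustrative."""
    span_id: str
    parent_id: Optional[str]
    component: str
    start_ms: int
    end_ms: int

def total_latency(spans):
    """End-to-end latency of the request represented by the trace."""
    return max(s.end_ms for s in spans) - min(s.start_ms for s in spans)

trace = [
    Span("a", None, "api-gateway", 0, 120),      # root span
    Span("b", "a", "orders-service", 10, 80),    # execution fork 1
    Span("c", "a", "billing-service", 15, 110),  # execution fork 2
]
```

From the same structure you can also read off which fork dominates latency or which component returned the error response.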

The need for a context between the three

Having direct access to Logs, Metrics & Traces, while useful, doesn't fully unlock our ability to build robust systems. As stated previously, one of the key aspects when defining observability is having a context that connects those three.

As part of creating that context, there's a set of techniques and utilities that enable us to build it into the software and components we deliver; that procedure is what's called Instrumentation. Tools like Grafana or AWS X-Ray provide powerful mechanisms to use that context, allowing us to navigate seamlessly between the three pillars.

For example, we could have a system that relies on Metrics to alert us when an overloaded Lambda function starts responding with 500 errors. From those metrics, we can view exemplar traces of the errors, observe the component map of the solution, and identify whether other components are also failing. Last, but not least, we could inspect the logs relevant to the failing trace and pinpoint precisely what's wrong at code level.
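The last hop in that workflow, navigating from a failing trace to its relevant log lines, works because both share a trace ID. A minimal sketch, assuming log records carry a `trace_id` field as described earlier (the record shape is illustrative):

```python
def logs_for_trace(log_records, trace_id):
    """Jump from a failing trace to its relevant Log Records by matching
    on the shared trace ID: the context between the pillars in action."""
    return [r for r in log_records if r.get("trace_id") == trace_id]

# Illustrative records; a real query would hit the log backend instead.
records = [
    {"trace_id": "t1", "level": "ERROR", "message": "db timeout"},
    {"trace_id": "t2", "level": "INFO", "message": "ok"},
]
failing = logs_for_trace(records, "t1")
```

Tools like Grafana or AWS X-Ray perform essentially this join for us, which is why consistent trace ID propagation during instrumentation matters so much.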