
How To Build An Open Source Cloud Observability Stack

Observability is the practice of understanding a system's internal state from the outside, without knowledge of its intricate inner details. It helps engineering and operations teams troubleshoot and debug problems dynamically by identifying previously unknown patterns.

With observability, these teams should be able to answer questions such as: “Do I have a problem?” “Where is the problem?” and “Why is this problem happening?” In this sense, observability differs from monitoring, which checks metrics and logs against a predefined set of expected behaviors.

Instrumentation Of The Observability Stack


Observability provides visibility into the reliability of a system through traces, metrics and logs, and, from the customer-experience perspective, into how well service-level objectives (SLOs) are being met. For a system to become observable, therefore, it must be instrumented with code that emits signals (telemetry) as traces, metrics and logs.

Adopting vendor-specific solutions to collect your system's telemetry data can lead to higher maintenance overhead, higher cost and service performance issues. These options may also be unable to scale with the needs of a growing platform. For this reason, adopting open standards-based data collection formats, such as OpenTelemetry for tracing and metrics, may be preferable.

OpenTelemetry decouples your instrumentation code from the observability back end, regardless of which back end you choose or how it collects these signals. This gives you the flexibility to port data to multiple platforms and to avoid vendor lock-in. There are several commercial SaaS observability back ends, as well as a few well-known open source tools, such as Prometheus and Jaeger:


Prometheus is an open source monitoring tool that captures metrics as time series, i.e., changes over a period of time, in a multidimensional data model. Prometheus provides a basic UI but relies on another open source tool, Grafana, for advanced dashboarding and reporting capabilities.
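Prometheus works on a pull model: it periodically scrapes metrics endpoints exposed by your services. A minimal, illustrative `prometheus.yml` sketch is below; the job name, target host and port are assumptions, not details from this article:

```yaml
# Illustrative Prometheus scrape configuration (names are hypothetical).
global:
  scrape_interval: 15s        # how often Prometheus pulls metrics

scrape_configs:
  - job_name: "checkout-service"
    static_configs:
      - targets: ["checkout:8080"]   # endpoint exposing /metrics
```

Each scraped sample is stored as a time series keyed by the metric name and its label set, which is what enables the multidimensional queries mentioned above.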


Jaeger is an end-to-end distributed tracing tool that provides native OpenTelemetry support, capturing microservice interactions through distributed context propagation and distributed transaction monitoring. Its dashboard UI shows how requests flow through different services as transactions propagate.

Distributed Tracing And Metrics Collection

Microservices instrumented with OpenTelemetry's open standards-based code send metrics and distributed tracing telemetry using the OpenTelemetry Protocol (OTLP). The OTLP specification describes the encoding, transport and delivery of telemetry data between telemetry sources, intermediary nodes such as collectors, and telemetry back ends.

The OpenTelemetry Collector, a proxy that receives data over OTLP, processes and exports telemetry data. In this case, after the data is received, metrics are exported to Prometheus using the Prometheus exporter, and distributed tracing data is exported to Jaeger using the Jaeger exporter. Both Prometheus and Jaeger can then serve as data sources for the open source visualization web application Grafana.
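The pipeline described above can be sketched as a Collector configuration. Endpoints and names below are illustrative; note also that newer Collector releases typically forward traces to Jaeger over OTLP rather than via the legacy Jaeger exporter:

```yaml
# Illustrative OpenTelemetry Collector config (endpoints are assumptions).
receivers:
  otlp:
    protocols:
      grpc:          # services send telemetry here over OTLP/gRPC
      http:

exporters:
  prometheus:
    endpoint: "0.0.0.0:8889"   # Prometheus scrapes the Collector here
  otlp/jaeger:
    endpoint: "jaeger:4317"    # Jaeger's OTLP ingest endpoint
    tls:
      insecure: true

service:
  pipelines:
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
    traces:
      receivers: [otlp]
      exporters: [otlp/jaeger]
```

The `service.pipelines` section is what wires each signal type from its receiver to the matching back end.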

SLAs And Commitments To Business Stakeholders

The question then becomes: How do you leverage these cloud observability capabilities to ensure the service-level indicators (SLIs) are measured and compliant with the SLOs? To understand the answer, let’s look at the difference between service-level agreements (SLAs) and SLOs:

• SLAs are customer-centric commitments and should not specify what or how many back-end services are involved in delivering that experience.
• SLOs define the goals and objectives that fulfill the commitments made to the business through SLAs.

Not every metric is important. Focus on the SLOs that impact customer experience, as these matter most, and choose SLIs carefully so that they effectively measure the system's compliance with those SLOs. To achieve this level of cloud observability, site reliability engineering (SRE) teams must instrument the code across the stack in an intelligent way.

Errors and transaction failures are a given; 100% reliability of a system is practically impossible. Google recommends that SRE teams plan an error budget, which accounts for the number of service errors tolerable over a period of time before users become unhappy, and bake those estimates into their SLOs.
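The error-budget arithmetic is simple enough to sketch directly. The 99.9% availability target and 30-day window below are illustrative assumptions, not figures from this article:

```python
# Sketch: deriving an error budget from an SLO target (illustrative numbers).
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given SLO."""
    total_minutes = window_days * 24 * 60
    return (1 - slo) * total_minutes

budget = error_budget_minutes(0.999)   # 99.9% availability target
print(f"Allowed downtime: {budget:.1f} minutes per 30 days")
```

For a 99.9% SLO over 30 days, the budget works out to roughly 43 minutes of unavailability, which is the allowance the team can spend on failures, deployments and experiments before users are impacted.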

Once expectations are set with stakeholders that a 100% SLO is not achievable, SRE engineers must focus on identifying the right SLO percentages and which SLIs to measure. Google recommends preparing SLI specifications that define the service outcomes that matter to the end user. As necessary, there can be more than one SLI implementation for each specification.
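One possible SLI implementation for an availability specification ("the proportion of requests served successfully") can be sketched as follows; the request counts and SLO target are illustrative assumptions:

```python
# Sketch: one SLI implementation for an availability SLI specification.
def availability_sli(successful: int, total: int) -> float:
    """Proportion of successful requests; an empty window counts as compliant."""
    return successful / total if total else 1.0

# Illustrative counts for a measurement window.
sli = availability_sli(successful=99_950, total=100_000)
slo_target = 0.999
print(f"SLI = {sli:.4%}, meets SLO: {sli >= slo_target}")
```

The same specification could be implemented differently, e.g., from load-balancer logs instead of application metrics, which is why one specification may map to several implementations.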


Adopting cloud observability tools and strategies to identify SLOs and measure SLIs is a broad topic, and it is important to factor in SRE experience when instrumenting code to capture the appropriate SLIs. Beyond that, engineers and teams must continue learning about best practices, and many valuable resources are available: the Prometheus and Jaeger websites provide technical details, for instance, and Google has published a wealth of great content on SRE.

Originally published by the Forbes Technology Council.

For enquiries, mail to

Contributed for Sage IT by
Srini Gajula
