At Sensible, we prioritize observability in our software. We aim to provide stable, minimally-invasive tooling and better data for our engineers.
To do this, we built our MVP with instrumentation using the OpenTelemetry standards and associated tooling. This has already provided a return on investment, allowing us to more rapidly diagnose and remedy bugs, and identify performance issues within our services.
As a whole, instrumented code has given us the tools to improve the reliability and performance of our codebase. Below are possible next steps for a unified telemetry system that combines logging, metrics, and distributed tracing.
Knowing what is going on inside your software is essential to operating a highly available service.
Our team wanted to be able to answer questions like:
Telemetry is necessary to find relevant logs (or even to know to look for them), and to make sense of how they connect to the rest of the system, especially in a distributed context.
As such, we needed a way to:
At Sensible, we use OpenTelemetry as a way of tying together observability requirements into a single system.
OpenTelemetry is a project incubated by the Cloud Native Computing Foundation, which aims to unify the disparate telemetry APIs and standards into a single set of adapters that work across systems. Because there are many competing and overlapping standards in this space, OpenTelemetry is a unified effort to combine them and support the primary use cases we outlined above.
At a high level, the major pieces of infrastructure we added are:
We start with an instrumented application, which we achieve by applying the OpenTelemetry SDKs and APIs to the code. We can do this through direct instrumentation of spans, manually adding code that defines these components, or through automatic instrumentation.
Automatic instrumentation relies on tools that inject instrumented versions of our common libraries into the code, providing context-aware traces and observability with minimal changes to the application stack. This approach is best suited to dynamic languages, where this kind of library injection is easily possible.
This configures an instrumentation context and runs auto-instrumentation on any application deployment configured with the following annotation:
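A hedged example of such an annotation, based on the opentelemetry-operator documentation (the language suffix varies by runtime, e.g. `inject-python` shown here):

```yaml
metadata:
  annotations:
    instrumentation.opentelemetry.io/inject-python: "true"
```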
The goal of this auto-instrumentation step is to provide an easy pipeline of tracing and metric data to the OpenTelemetry collectors, so they can distribute telemetry data to the correct backend (traces→Jaeger, logs→Elasticsearch, metrics→Prometheus).
Auto-instrumentation (where possible) allows us to quickly adopt the complete telemetry toolkit and get value instantly. Once you have a working telemetry stack in place, you can refine the telemetry data you collect by applying more advanced manual instrumentation where needed.
Auto-instrumentation is only half the story, and isn't available as an easy option for Golang due to the nature of compiled languages. Within Python, automatically instrumenting some of our concurrent (async) code introduced race conditions that caused our Celery workers to crash.
Here’s a list of available Golang instrumented libraries: https://opentelemetry.io/registry/?language=go&component=instrumentation
Our MVP uses OpenTelemetry to collect all metrics, logs, and traces, and then offloads this telemetry data to a ‘collector’ agent. The collector can do some simple post-processing on the telemetry data (such as injecting an environment tag to designate which environment the span came from) and distribute data to its appropriate backend.
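As a sketch, a collector configuration along these lines could inject the environment tag and route each signal to its backend. All names and endpoints are illustrative, and the `prometheus` and `elasticsearch` exporters are assumptions based on the collector-contrib distribution:

```yaml
receivers:
  otlp:
    protocols:
      grpc:
      http:
processors:
  # Post-processing step: tag every span with the environment it came from
  attributes/env:
    actions:
      - key: deployment.environment
        value: dev
        action: insert
exporters:
  jaeger:
    endpoint: jaeger-collector:14250   # illustrative endpoint
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889
  elasticsearch:
    endpoints: [http://elasticsearch:9200]
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [attributes/env]
      exporters: [jaeger]
    metrics:
      receivers: [otlp]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      exporters: [elasticsearch]
```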
To accomplish this, we installed the OpenTelemetry Operator to manage the lifecycle of the collector processes, along with the auto-instrumentation features referenced above. To add a collector to your application deployment, you configure an OpenTelemetry Collector CRD in your manifest, then let the installed operator manage the lifecycle of your collector.
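A minimal sketch of such a CRD, assuming the operator is installed (the resource name and exporter endpoint are illustrative):

```yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: sidecar-collector
spec:
  mode: sidecar          # injected into annotated application pods
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
    exporters:
      jaeger:
        endpoint: jaeger-collector:14250
        tls:
          insecure: true
    service:
      pipelines:
        traces:
          receivers: [otlp]
          exporters: [jaeger]
```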
Collectors can be deployed in a number of configurations. These include having one collector per Kubernetes worker node (daemonset), having a pool of collectors per namespace (deployment), or having a colocated process that runs alongside each application container within the same pod (sidecar).
Since the collector process is generally lightweight and often configured alongside the application, we opted for the sidecar configuration, which the operator manages automatically. This means that to send telemetry from your application to the collector, you simply access it on 127.0.0.1 over the right port (usually 4317 for gRPC or 4318 for HTTP), since the collector is colocated in the same pod as the application.
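For example, the standard OTLP environment variables can point an instrumented application at the sidecar (the service name here is illustrative):

```shell
# gRPC endpoint; use 4318 for the HTTP protocol
export OTEL_EXPORTER_OTLP_ENDPOINT="http://127.0.0.1:4317"
# Illustrative service name, used to tag emitted telemetry
export OTEL_SERVICE_NAME="my-service"
```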
You can also explore other configurations for the collectors, but in our current view, the sidecar approach offers the best compromise between maintainability and performance without negatively affecting the running application. Deploying a copy of the same process throughout your infrastructure does introduce some inefficiency, but the tradeoff for our system is a much simpler architecture and service topology. It also means that in the case of failures, we only lose a single service's telemetry data instead of that of the entire namespace or cluster.
Documentation on specific configuration options and use cases for the OpenTelemetry Operator can be found here: https://github.com/open-telemetry/opentelemetry-operator
The choice to use Jaeger at this point in time is motivated by the fact that it has the largest community adoption and support, along with the most modern and stable feature set among the alternatives we surveyed. It's open to discussion for us as a team. If we have a good reason to use another tracing backend, we'll have the right scaffolding in place to make this transition completely transparent to the application engineer.
To make it easier to onboard this new observability stack, we opted to first get everything online using the single-container package called “Jaeger-all-in-one”.
Before scaling up our tracing efforts (but after completing basic tracing coverage/instrumentation across all our services), we will need to upgrade our Jaeger deployment to a multi-container build. This will be either through the Jaeger Kubernetes operator or through a more manually managed Helm deployment, where we pick and choose which Jaeger subcomponents we actually want to deploy.
For a more in-depth look at what these components are, see: https://www.jaegertracing.io/docs/1.31/architecture/
The all-in-one tracing backend does not persist data beyond memory, so it will require an upgrade as soon as you need persistence, or need to hold more traces than can reasonably fit in the memory of the Jaeger-all-in-one process.
To access the tracing UI, configure your kubectl to point to the dev cluster, and then run:
```shell
kubectl port-forward -n tools jaeger-all-in-one-0 16686
```
Next, navigate in your browser to: http://localhost:16686
Make a request to an instrumented service and see the results when you search for your recent request in the UI.
At this point in time, the MVP has set up basic infrastructure for OpenTelemetry, but has yet to pipe through metrics and logging.
Metrics are overall more stable in the OpenTelemetry project than log exports, but both are less stable than the tracing functionality outlined above. We should explore the feasibility of adding these with the existing tools, and weigh that against the trade-off of a separate, more stable system (such as Fluentd) for log exports.
Given the pace of adoption for OpenTelemetry, along with the large corporate and OSS support for this project, our team feels the risk of adopting a less stable technology is fairly low at this stage. However, we are not going to disable our existing metrics infrastructure until we are certain we are happy with what OpenTelemetry provides us in exchange.
There are a number of hosted backends that we can elect to use downstream of our collectors. Some of these options only target a single piece of managed observability infrastructure, while most others now offer a suite of observability tools.
Some options to consider in this space include:
Get excited about tracing and increased observability!