Observability in 2024: Logs, Metrics, and Traces

Introduction

Modern distributed systems are inherently complex. A single user request might traverse dozens of microservices, interact with multiple databases, pass through message queues, and touch external APIs before returning a response. When something goes wrong in this intricate web of services, traditional monitoring approaches—checking CPU usage and scanning log files—simply cannot provide the answers you need. This is where observability comes in.

Observability goes beyond monitoring. While monitoring tells you that something is broken, observability helps you understand why it is broken. The concept, borrowed from control theory, refers to the ability to understand the internal state of a system by examining its external outputs. In the context of software engineering, this means being able to ask arbitrary questions about your system's behavior without having to pre-define what those questions might be. The three pillars of observability—logs, metrics, and traces—form the foundation of this capability.

In 2024, the observability landscape has matured significantly. OpenTelemetry has emerged as the de facto standard for instrumentation, replacing the fragmented ecosystem of vendor-specific agents and SDKs. Grafana's LGTM stack (Loki, Grafana, Tempo, Mimir/Prometheus) provides a powerful open-source alternative to expensive commercial platforms. This guide walks you through building a complete observability stack, from understanding the fundamentals to implementing production-grade pipelines.

Understanding Observability: Core Concepts

The Three Pillars

Observability rests on three complementary data types, each providing a different lens through which to examine your system.

Logs are discrete event records that capture what happened at a specific point in time. They are the most familiar telemetry data type—every developer has used console.log() or written to a log file at some point. Structured logging, where each log entry contains consistent key-value pairs such as requestId, userId, and duration, transforms logs from unstructured text blobs into queryable, analyzable data. Modern log aggregation systems like Grafana Loki index metadata labels rather than full log content, making them far more cost-efficient than traditional full-text indexing solutions like Elasticsearch.
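
To make the idea of structured logging concrete, here is a minimal sketch of a formatter that emits one JSON object per event. The field names (requestId, userId, durationMs) and the formatLogLine helper are illustrative assumptions, not a specific library's API; in practice a logger such as pino does this for you.

```typescript
// Hypothetical field set; the point is that every entry carries the same keys.
interface LogFields {
  requestId: string;
  userId: string;
  durationMs: number;
}

type Level = "debug" | "info" | "warn" | "error";

// Serialize one event as a single JSON line so a log backend can parse it field by field.
function formatLogLine(
  level: Level,
  message: string,
  fields: LogFields,
  now: Date = new Date(),
): string {
  return JSON.stringify({
    timestamp: now.toISOString(),
    level,
    message,
    ...fields,
  });
}

const line = formatLogLine(
  "info",
  "order placed",
  { requestId: "req-123", userId: "user-42", durationMs: 87 },
  new Date("2024-01-01T00:00:00.000Z"),
);
console.log(line);
```

Because each entry is valid JSON with stable keys, a query can filter on requestId or aggregate on durationMs instead of pattern-matching free text.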

Metrics are numerical measurements captured over time. They answer questions like "What is the current request rate?" or "How much memory is this service using?" Metrics are inherently more efficient to store and query than logs because they are aggregated by nature. A single metric data point (a metric name, a set of labels, a timestamp, and a numeric value) encodes far less information than a log line, but at scale, the ability to graph and alert on metrics makes them indispensable. The most common metric types are counters (monotonically increasing values like total request count), gauges (values that go up and down like current memory usage), histograms (bucketed distributions of values like request latency, from which percentiles are estimated), and summaries (similar to histograms, but with quantiles calculated on the client side).
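
The differences between these metric types are easier to see in code. The classes below are a simplified in-memory sketch written for this article, not the OpenTelemetry or Prometheus client API: real SDKs also attach labels and timestamps and handle export.

```typescript
// A counter only ever goes up (for example, total requests served).
class Counter {
  private total = 0;

  add(delta: number): void {
    if (delta < 0) throw new Error("counter increments must be non-negative");
    this.total += delta;
  }

  get value(): number {
    return this.total;
  }
}

// A gauge holds the latest observed value and can move in either direction.
class Gauge {
  private current = 0;

  set(value: number): void {
    this.current = value;
  }

  get value(): number {
    return this.current;
  }
}

// A histogram counts observations per bucket; the last slot is the overflow (+Inf) bucket.
class Histogram {
  readonly bounds: number[];
  readonly counts: number[];
  sum = 0;

  constructor(bounds: number[]) {
    this.bounds = bounds;
    this.counts = new Array<number>(bounds.length + 1).fill(0);
  }

  record(value: number): void {
    const idx = this.bounds.findIndex((bound) => value <= bound);
    this.counts[idx === -1 ? this.bounds.length : idx] += 1;
    this.sum += value;
  }
}

const requests = new Counter();
requests.add(1);
requests.add(1);

const memoryMb = new Gauge();
memoryMb.set(512);
memoryMb.set(480);

const latencySeconds = new Histogram([0.1, 0.5, 1]);
[0.05, 0.3, 0.7, 2].forEach((v) => latencySeconds.record(v));

console.log(requests.value, memoryMb.value, latencySeconds.counts);
```

Note that the histogram never stores individual observations, only per-bucket counts and a running sum, which is why percentiles derived from it are estimates.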

Traces capture the end-to-end journey of a request through your distributed system. A trace is composed of spans, where each span represents a single unit of work—a database query, an HTTP call, a message queue operation. Spans are linked together in a parent-child hierarchy, forming a directed acyclic graph that shows exactly how a request flows through your architecture. Distributed tracing is particularly powerful for identifying latency bottlenecks: when a user reports slow page loads, you can examine the trace to see that a downstream service is making an unexpected N+1 query to a database.
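
As a rough illustration of the parent-child structure, the sketch below rebuilds the hierarchy from a flat list of spans and finds the span with the most self time (its own duration minus its direct children). The Span shape is a pared-down assumption for this example; real spans also carry attributes, status, and events, and children may overlap in time.

```typescript
// Hypothetical minimal span shape used only for this example.
interface Span {
  spanId: string;
  parentSpanId: string | null;
  name: string;
  startMs: number;
  endMs: number;
}

// Group spans by their parent so the hierarchy can be walked from the root.
function childrenByParent(spans: Span[]): Map<string | null, Span[]> {
  const index = new Map<string | null, Span[]>();
  for (const span of spans) {
    const siblings = index.get(span.parentSpanId) ?? [];
    siblings.push(span);
    index.set(span.parentSpanId, siblings);
  }
  return index;
}

// Self time = span duration minus time spent in direct children (assumes no overlap).
function selfTimeMs(span: Span, index: Map<string | null, Span[]>): number {
  const children = index.get(span.spanId) ?? [];
  const childTime = children.reduce((sum, c) => sum + (c.endMs - c.startMs), 0);
  return span.endMs - span.startMs - childTime;
}

// Return the span that accounts for the most time on its own.
function slowestBySelfTime(spans: Span[]): Span {
  const index = childrenByParent(spans);
  return spans.reduce((worst, s) =>
    selfTimeMs(s, index) > selfTimeMs(worst, index) ? s : worst,
  );
}

const exampleTrace: Span[] = [
  { spanId: "a", parentSpanId: null, name: "GET /orders", startMs: 0, endMs: 500 },
  { spanId: "b", parentSpanId: "a", name: "auth check", startMs: 10, endMs: 40 },
  { spanId: "c", parentSpanId: "a", name: "SELECT orders", startMs: 50, endMs: 480 },
];

console.log(slowestBySelfTime(exampleTrace).name); // prints "SELECT orders"
```

This is the same reasoning a trace waterfall view supports visually: the root span looks slow, but almost all of its time is spent waiting on one child.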

Correlation Is Key

The real power of observability emerges when you correlate data across all three pillars. Imagine your metrics dashboard shows a spike in error rates at 3:42 PM. You click on the spike and see related traces that show a particular downstream service timing out. From the trace, you find the trace ID and use it to pull the exact log entries from the failing service, revealing a database connection pool exhaustion issue. This workflow—metrics to traces to logs—is the golden path of observability debugging, and it requires that all three data types share common identifiers like trace IDs, span IDs, and request IDs.
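
The metrics-to-traces-to-logs path can be sketched as two lookups joined on a trace ID. The record shapes and function names below are assumptions made for illustration; in a real stack the first step is a Grafana exemplar click and the second is a Loki query.

```typescript
// Hypothetical record shapes; real backends return richer objects.
interface Exemplar {
  timestampMs: number;
  latencyMs: number;
  traceId: string;
}

interface LogRecord {
  traceId: string;
  service: string;
  message: string;
}

// Metrics to traces: collect the trace IDs attached to slow data points.
function traceIdsAbove(exemplars: Exemplar[], thresholdMs: number): string[] {
  return exemplars
    .filter((e) => e.latencyMs > thresholdMs)
    .map((e) => e.traceId);
}

// Traces to logs: keep only the log records that carry one of those trace IDs.
function logsForTraces(logs: LogRecord[], traceIds: string[]): LogRecord[] {
  const wanted = new Set(traceIds);
  return logs.filter((l) => wanted.has(l.traceId));
}

const sampleExemplars: Exemplar[] = [
  { timestampMs: 1000, latencyMs: 120, traceId: "t-1" },
  { timestampMs: 2000, latencyMs: 2300, traceId: "t-2" },
];

const sampleLogs: LogRecord[] = [
  { traceId: "t-1", service: "order-service", message: "order placed" },
  { traceId: "t-2", service: "inventory-service", message: "connection pool exhausted" },
];

const slowTraceIds = traceIdsAbove(sampleExemplars, 1000);
const relatedLogs = logsForTraces(sampleLogs, slowTraceIds);
console.log(relatedLogs.map((l) => `${l.service}: ${l.message}`));
```

The join only works because both record types carry the same traceId field, which is why consistent identifier propagation matters more than any individual tool.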

Architecture and Design Patterns

The LGTM Stack Architecture

The Grafana LGTM stack represents one of the most popular open-source observability architectures in production today. Each component handles a specific pillar of observability.

Grafana serves as the visualization and alerting layer. It connects to multiple data sources simultaneously, allowing you to build dashboards that combine log queries from Loki, metric queries from Prometheus, and trace queries from Tempo on a single screen. Grafana's dashboard provisioning system supports GitOps workflows where you define dashboards as JSON files in a repository and let Grafana sync them automatically.

Loki is a horizontally scalable log aggregation system inspired by Prometheus. Unlike Elasticsearch, which indexes the full text of every log line, Loki only indexes labels attached to log streams. This design makes Loki dramatically cheaper to operate at scale because the index grows with the number of unique label combinations, not with the volume of log data. Log content itself is stored in compressed chunks in object storage like S3, GCS, or MinIO, so the bulk of the data sits on inexpensive storage. The trade-off is that Loki does not support full-text search as efficiently as Elasticsearch: queries must filter by labels first, then scan log lines within matching streams using grep-like filters.
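
The two-phase query model can be sketched in a few lines. This is a toy model written for this article, not Loki's implementation: it selects streams by exact label match, then scans the surviving lines for a substring, loosely mirroring a LogQL query of the form {app="order-service"} |= "timeout".

```typescript
type Labels = Record<string, string>;

// A stream is the set of log lines that share one exact label combination.
interface LogStream {
  labels: Labels;
  lines: string[];
}

// A stream matches when every label in the selector has the same value on the stream.
function matchesSelector(labels: Labels, selector: Labels): boolean {
  return Object.keys(selector).every((key) => labels[key] === selector[key]);
}

// Phase 1 narrows by label; phase 2 scans only the lines of the matching streams.
function queryLogs(streams: LogStream[], selector: Labels, contains: string): string[] {
  const hits: string[] = [];
  for (const stream of streams) {
    if (!matchesSelector(stream.labels, selector)) continue;
    for (const logLine of stream.lines) {
      if (logLine.includes(contains)) hits.push(logLine);
    }
  }
  return hits;
}

const sampleStreams: LogStream[] = [
  {
    labels: { app: "order-service", env: "prod" },
    lines: ["request ok", "upstream timeout after 2s"],
  },
  {
    labels: { app: "cart-service", env: "prod" },
    lines: ["timeout talking to redis"],
  },
];

console.log(queryLogs(sampleStreams, { app: "order-service" }, "timeout"));
```

The cart-service line also contains "timeout" but is never scanned, because its stream was excluded in the label phase. That is the source of both Loki's efficiency and its reliance on well-chosen labels.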

Prometheus, or Grafana Mimir when you need multi-tenant long-term storage, handles metrics collection and storage. Prometheus uses a pull-based model: it scrapes HTTP endpoints exposed by your applications at regular intervals, ingesting the time-series data into its local TSDB. PromQL, Prometheus's query language, is extremely powerful for aggregating, filtering, and performing math across time series. For example, histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m]))) estimates the 99th percentile request latency across all series over the last 5-minute window.
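
It helps to know roughly what histogram_quantile does with the bucket counts. The sketch below follows the documented approach for classic histograms: find the bucket that contains the target rank, then interpolate linearly inside it. It is a simplified illustration that assumes non-negative bucket bounds and leaves out other edge cases; the Bucket type and function name are made up for this example.

```typescript
// Cumulative bucket: count of observations <= le. Use Infinity for the +Inf bucket.
interface Bucket {
  le: number;
  count: number;
}

function histogramQuantile(q: number, buckets: Bucket[]): number {
  const sorted = [...buckets].sort((a, b) => a.le - b.le);
  const last = sorted[sorted.length - 1];
  // A usable histogram needs at least one finite bucket plus the +Inf bucket.
  if (sorted.length < 2 || last.le !== Infinity || last.count === 0) return NaN;

  const rank = q * last.count;
  const idx = sorted.findIndex((b) => b.count >= rank);

  // If the rank falls in the +Inf bucket, the best answer is the highest finite bound.
  if (idx === sorted.length - 1) return sorted[sorted.length - 2].le;

  const lowerBound = idx === 0 ? 0 : sorted[idx - 1].le;
  const countBelow = idx === 0 ? 0 : sorted[idx - 1].count;
  const countInBucket = sorted[idx].count - countBelow;

  // Assume observations are spread evenly inside the bucket.
  return lowerBound + (sorted[idx].le - lowerBound) * ((rank - countBelow) / countInBucket);
}

const latencyBuckets: Bucket[] = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 99 },
  { le: Infinity, count: 100 },
];

console.log(histogramQuantile(0.5, latencyBuckets));
console.log(histogramQuantile(0.95, latencyBuckets));
```

Because the result is interpolated, its accuracy depends on bucket boundaries: a p99 that falls inside a wide bucket can be off by a large margin, which is a good reason to choose boundaries around your latency targets.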

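Tempo is a distributed tracing backend that stores and queries traces. It is designed to be cost-efficient: rather than maintaining a large search index over span content, as Jaeger or Zipkin deployments backed by Elasticsearch do, Tempo keeps traces in object storage keyed by trace ID. The most common workflow is therefore to find a trace ID from a log line or a metric exemplar first, then look up the full trace in Tempo. TraceQL adds attribute-based search on top of this for cases where you have no trace ID to start from.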

Data Collection with OpenTelemetry

OpenTelemetry provides a vendor-neutral instrumentation layer. Instead of installing separate agents for each observability vendor, you instrument your code once using the OTel SDK and export all telemetry data through the OTel Collector, which can route data to multiple backends. The Collector architecture consists of three components: receivers accept data in various formats, processors transform data, and exporters send data to backends.
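
The receiver, processor, and exporter split can be modeled as plain function composition. The sketch below is a conceptual toy, not the Collector's Go API: the types, the runPipeline helper, and the attribute values are all assumptions chosen to show how one batch flows through ordered processors and then fans out to several exporters.

```typescript
interface TelemetryItem {
  name: string;
  attributes: Record<string, string>;
}

type Processor = (batch: TelemetryItem[]) => TelemetryItem[];
type Exporter = (batch: TelemetryItem[]) => void;

// Run a received batch through each processor in order, then fan out to every exporter.
function runPipeline(
  received: TelemetryItem[],
  processors: Processor[],
  exporters: Exporter[],
): void {
  const processed = processors.reduce((batch, process) => process(batch), received);
  for (const exportBatch of exporters) exportBatch(processed);
}

// Processor: drop health-check noise before it reaches a backend.
const dropHealthChecks: Processor = (batch) =>
  batch.filter((item) => item.attributes["url.path"] !== "/healthz");

// Processor: enrich every item with a (hypothetical) cluster attribute.
const addCluster: Processor = (batch) =>
  batch.map((item) => ({
    ...item,
    attributes: { ...item.attributes, "k8s.cluster.name": "demo" },
  }));

// Two in-memory "backends" standing in for real exporters.
const backendA: TelemetryItem[] = [];
const backendB: TelemetryItem[] = [];

runPipeline(
  [
    { name: "GET /healthz", attributes: { "url.path": "/healthz" } },
    { name: "POST /orders", attributes: { "url.path": "/orders" } },
  ],
  [dropHealthChecks, addCluster],
  [
    (batch) => { backendA.push(...batch); },
    (batch) => { backendB.push(...batch); },
  ],
);

console.log(backendA.length, backendB.length);
```

Processor order matters in the same way it does in a Collector config: filtering before enrichment avoids doing work on data that will be dropped anyway.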

The Observability Pipeline Pattern

A mature observability deployment follows a pipeline pattern:

Instrumentation – Applications emit telemetry using OTel SDKs
Collection – OTel Collector agents run as sidecars or DaemonSets
Processing – The Collector batches, filters, samples, and enriches data
Routing – Processed data is exported to appropriate backends
Storage – Backends store data with appropriate retention policies
Visualization and Alerting – Grafana provides unified dashboards and alert rules

Step-by-Step Implementation

Setting Up the OTel Collector

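First, deploy the OpenTelemetry Collector as a DaemonSet on Kubernetes. One caveat before copying the example: the tail_sampling processor can only decide correctly when every span of a trace reaches the same Collector instance, so in multi-node clusters the sampling step usually runs on a gateway tier that the DaemonSet agents forward to. The Collector configuration defines which receivers, processors, and exporters to use: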

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  filelog:
    include: ["/var/log/containers/*.log"]
    operators:
      - type: json_parser
        parse_from: body
 
processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  k8sattributes:
    extract:
      # Metadata keys use the full semantic-convention attribute names
      metadata: [k8s.namespace.name, k8s.deployment.name, k8s.pod.name]
  tail_sampling:
    decision_wait: 10s
    policies:
      - name: error-policy
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: latency-policy
        type: latency
        latency: { threshold_ms: 500 }
 
exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls: { insecure: true }
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
  prometheusremotewrite:
    endpoint: http://mimir:9009/api/v1/push
 
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, tail_sampling, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp, filelog]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [loki]
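
The two tail-sampling policies in the configuration above amount to a simple rule: keep a trace if any span has an error status, or if the whole trace took longer than the latency threshold. The sketch below expresses that rule in TypeScript for illustration only. The FinishedSpan type and keepTrace function are hypothetical, and the real processor also handles decision timing and span buffering.

```typescript
// Hypothetical finished-span shape with only the fields the decision needs.
interface FinishedSpan {
  status: "OK" | "ERROR" | "UNSET";
  startMs: number;
  endMs: number;
}

// Keep the trace if any policy matches: an error span, or total duration over the threshold.
function keepTrace(spans: FinishedSpan[], thresholdMs: number = 500): boolean {
  if (spans.length === 0) return false;

  const hasError = spans.some((s) => s.status === "ERROR");

  // Trace duration runs from the earliest span start to the latest span end.
  const start = Math.min(...spans.map((s) => s.startMs));
  const end = Math.max(...spans.map((s) => s.endMs));

  return hasError || end - start > thresholdMs;
}

const fastOk: FinishedSpan[] = [{ status: "OK", startMs: 0, endMs: 120 }];
const fastError: FinishedSpan[] = [
  { status: "OK", startMs: 0, endMs: 120 },
  { status: "ERROR", startMs: 20, endMs: 60 },
];
const slowOk: FinishedSpan[] = [
  { status: "OK", startMs: 0, endMs: 300 },
  { status: "OK", startMs: 300, endMs: 900 },
];

console.log(keepTrace(fastOk), keepTrace(fastError), keepTrace(slowOk)); // prints false true true
```

The decision needs every span of the trace, which is why it can only be made after the request has finished and why all spans must arrive at the same place.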

Instrumenting a Node.js Application

Auto-instrumentation captures telemetry from common libraries like Express, HTTP, PostgreSQL, and Redis without modifying application code:

// tracing.ts — load before your application code
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc';
import { OTLPLogExporter } from '@opentelemetry/exporter-logs-otlp-grpc';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { BatchLogRecordProcessor } from '@opentelemetry/sdk-logs';
import { Resource } from '@opentelemetry/resources';
import { ATTR_SERVICE_NAME } from '@opentelemetry/semantic-conventions';
 
const sdk = new NodeSDK({
  resource: new Resource({
    [ATTR_SERVICE_NAME]: 'order-service',
    'deployment.environment': process.env.NODE_ENV || 'development',
  }),
  traceExporter: new OTLPTraceExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT }),
    exportIntervalMillis: 30000,
  }),
  logRecordProcessor: new BatchLogRecordProcessor(
    new OTLPLogExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT })
  ),
  instrumentations: [
    getNodeAutoInstrumentations({
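      '@opentelemetry/instrumentation-http': {
        // ignoreIncomingPaths was removed from the HTTP instrumentation; use the hook instead
        ignoreIncomingRequestHook: (req) =>
          ['/healthz', '/readyz', '/metrics'].includes(req.url ?? ''),
      },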
      '@opentelemetry/instrumentation-pg': { enhancedDatabaseReporting: true },
    }),
  ],
});
 
sdk.start();
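// Flush pending telemetry before the process exits
process.on('SIGTERM', () => {
  sdk.shutdown()
    .catch((err) => console.error('OTel SDK shutdown failed', err))
    .finally(() => process.exit(0));
});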

Custom Metrics with the OTel Metrics API

Beyond auto-instrumentation, you can define custom business metrics:

import { metrics } from '@opentelemetry/api';
 
const meter = metrics.getMeter('order-service');
 
const ordersPlaced = meter.createCounter('orders.placed', {
  description: 'Total number of orders placed',
  unit: 'orders',
});
 
const orderValue = meter.createHistogram('orders.value', {
  description: 'Distribution of order values in USD',
  unit: 'USD',
  advice: { explicitBucketBoundaries: [10, 25, 50, 100, 250, 500, 1000] },
});
 
// Order and processOrder are assumed to be defined elsewhere in the service
async function placeOrder(order: Order) {
  const result = await processOrder(order);
  ordersPlaced.add(1, {
    'order.type': order.type,
    'payment.method': order.paymentMethod,
  });
  orderValue.record(order.totalUSD);
  return result;
}

Real-World Use Cases and Case Studies

Use Case 1: Debugging Intermittent Latency Spikes

A production API starts experiencing intermittent 2-second latency spikes every few minutes. The SRE team opens the Grafana metrics dashboard and sees the P99 latency spike correlated with a surge in database query count. They click on a spike in the latency histogram, which links to traces in Tempo. The trace waterfall reveals that a single HTTP request is triggering 150 sequential database queries—an N+1 problem introduced by a recent ORM change. They find the trace ID, search for it in Loki, and locate the exact log entries showing the query pattern. The fix is deployed within an hour.

Use Case 2: Capacity Planning for a Product Launch

A team is preparing for a major product launch expected to increase traffic by 10x. Using Prometheus metrics from the past 30 days, they build Grafana dashboards showing current resource utilization trends. They correlate request rate metrics with resource consumption to create capacity models. By extrapolating current usage patterns, they determine they need to scale from 3 to 12 replicas and increase the database connection pool from 50 to 200. The team sets up Grafana alerts at 70% and 85% capacity thresholds to get early warnings during the launch.

Use Case 3: Security Incident Detection

An engineer notices unusual patterns in the observability data: a specific API endpoint shows a sudden increase in 403 responses correlated with a burst of failed authentication attempts in Loki logs. Using Grafana's log-to-trace correlation, they trace the requests to identify the source IP addresses and request patterns. The traces reveal automated credential-stuffing attacks. The team immediately implements rate limiting and blocks the offending IPs, all while monitoring the Grafana dashboard to verify the attack volume decreases.

Best Practices for Production

Use structured logging from day one: Emit logs as JSON objects with consistent field names across all services. Use correlation IDs in every log entry to enable cross-pillar debugging. Libraries like pino for Node.js or zerolog for Go make structured logging nearly free in terms of performance overhead.
Sample traces intelligently: In high-throughput systems, collecting 100% of traces is prohibitively expensive. Use tail-based sampling in the OTel Collector to always keep traces with errors or high latency, while sampling a percentage of normal traces. This ensures you never miss the interesting cases while controlling storage costs.
Set cardinality limits on metrics: Prometheus and Mimir can struggle when metric label combinations explode. Set cardinality limits in your instrumentation to prevent runaway metric series growth. A good rule of thumb is to keep unique label combinations under 10,000 per metric.
Use exemplars to bridge metrics and traces: Configure your metrics instrumentation to attach trace IDs as exemplars on histogram buckets. This allows Grafana dashboard users to click on a high-latency data point in a graph and jump directly to a representative trace in Tempo.
Implement alerting with multi-window SLOs: Instead of alerting on raw metrics, define Service Level Objectives and alert when you are consuming your error budget too quickly. Use Grafana's SLO plugin or recording rules to implement multi-burn-rate alerting.
Separate hot and cold storage: Configure Loki and Tempo to store recent data on fast storage and older data in object storage. Set retention policies that balance cost against debugging needs—typically 7-14 days hot for logs, 30-90 days cold, and 3-7 days for traces.
Use the Collector for data enrichment: Add Kubernetes metadata at the Collector level rather than in application code. This keeps instrumentation simple and ensures consistent metadata across all telemetry types.
Test your observability pipeline in staging: Before deploying instrumentation changes to production, verify that your Collector configuration, sampling rules, and dashboard queries work correctly in a staging environment.
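
The multi-burn-rate idea from the SLO practice above comes down to a small calculation. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), and a page fires only when both a long and a short window exceed the threshold. The sketch below is illustrative: the type and function names are assumptions, and the 14.4 threshold is the commonly cited value for a 1-hour window against a 30-day budget.

```typescript
// Request counts observed in one evaluation window.
interface WindowCounts {
  errors: number;
  total: number;
}

// Burn rate 1 means the error budget would be used up exactly at the end of the SLO period.
function burnRate(counts: WindowCounts, sloTarget: number): number {
  if (counts.total === 0) return 0;
  const errorRatio = counts.errors / counts.total;
  const errorBudget = 1 - sloTarget;
  return errorRatio / errorBudget;
}

// Require both windows to burn fast: the long one shows the problem is significant,
// the short one shows it is still happening.
function shouldPage(
  longWindow: WindowCounts,
  shortWindow: WindowCounts,
  sloTarget: number,
  threshold: number,
): boolean {
  return (
    burnRate(longWindow, sloTarget) > threshold &&
    burnRate(shortWindow, sloTarget) > threshold
  );
}

const slo = 0.999;
const lastHour: WindowCounts = { errors: 200, total: 10000 };
const lastFiveMinutes: WindowCounts = { errors: 30, total: 1000 };
const recoveredFiveMinutes: WindowCounts = { errors: 0, total: 1000 };

console.log(shouldPage(lastHour, lastFiveMinutes, slo, 14.4));
console.log(shouldPage(lastHour, recoveredFiveMinutes, slo, 14.4));
```

The second call shows why the short window matters: the hour still looks bad, but the incident has already stopped, so paging someone would only add noise.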

Common Pitfalls and Solutions

Pitfall	Impact	Solution
High-cardinality metric labels (e.g., user IDs)	Prometheus OOM, slow queries	Use traces for high-cardinality data; keep metric labels to a bounded set
Missing correlation between logs, metrics, and traces	Unable to debug across pillars	Ensure all services propagate traceId and include it in log entries
Collecting 100% of traces in production	Massive storage costs, Collector overload	Implement tail-based sampling: keep all errors, sample 1-10% of normal traffic
Log volume explosion from debug-level logging	Loki storage costs spike	Use runtime-adjustable log levels; default to INFO in production
Alert fatigue from too many low-value alerts	Operators ignore real incidents	Implement SLO-based alerting with multi-burn-rate windows
Vendor lock-in from proprietary SDKs	Expensive migration when switching vendors	Use OpenTelemetry SDKs exclusively
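
The cardinality pitfall in the table above is easy to underestimate because series counts multiply. As a back-of-the-envelope sketch, the worst-case number of series for one metric is the product of the distinct values of each label. The label names and counts below are made-up examples, and the 10,000 budget is the rule of thumb from the best practices section.

```typescript
// Worst-case series count for one metric: the product of distinct values per label.
function worstCaseSeries(labelValueCounts: Record<string, number>): number {
  return Object.values(labelValueCounts).reduce((product, n) => product * n, 1);
}

function withinBudget(labelValueCounts: Record<string, number>, budget: number): boolean {
  return worstCaseSeries(labelValueCounts) <= budget;
}

const seriesBudget = 10000;

// Bounded labels: 5 methods x 6 status classes x 40 route templates.
const boundedLabels = { method: 5, status: 6, route: 40 };

// Adding a per-user label multiplies every existing combination by the user count.
const withUserId = { ...boundedLabels, userId: 50000 };

console.log(worstCaseSeries(boundedLabels), withinBudget(boundedLabels, seriesBudget));
console.log(worstCaseSeries(withUserId), withinBudget(withUserId, seriesBudget));
```

Real series counts are usually below the worst case because not every combination occurs, but the product is a useful upper bound when reviewing a new label.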

Performance Optimization

Reducing Collector Overhead

The OTel Collector can become a bottleneck if not tuned properly. Key optimizations include batching more aggressively and using the memory limiter to prevent OOM conditions:

processors:
  batch:
    timeout: 10s
    send_batch_size: 8192
  memory_limiter:
    check_interval: 500ms
    limit_mib: 1024
    spike_limit_mib: 256
  filter/drop-healthchecks:
    logs:
      exclude:
        match_type: regexp
        record_attributes:
          - key: url.path
            value: '/(healthz|readyz|metrics)'

Deploy the Collector as a Gateway for low-throughput environments, or as an Agent DaemonSet for high-throughput clusters. In Agent mode, each node runs its own Collector instance, reducing network overhead and preventing a single Collector from becoming a bottleneck.

Prometheus Query Optimization

Slow PromQL queries can impact both dashboard load times and alert evaluation. Always filter by service or job label before aggregating:

# BAD: Aggregates every service and keeps the high-cardinality path label
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, path))
 
# GOOD: Filter by service first, then aggregate
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{service="order-service"}[5m])) by (le)
)

Comparison with Alternatives

Feature	Grafana LGTM	Datadog	New Relic	Elastic Stack
Cost Model	Open-source (infra costs only)	Per-host + per-GB ingest	Per-GB ingest	Open-source or licensed
Traces	Tempo (low-cost)	APM (indexed, expensive)	Distributed tracing	APM
Logs	Loki (label-only indexing)	Full indexing	Logs (queried with NRQL)	Elasticsearch
Metrics	Prometheus/Mimir	Custom metrics extra	Metrics	Elasticsearch (via Elastic Agent or Metricbeat)
Learning Curve	High (multiple components)	Low (unified platform)	Medium	High
Vendor Lock-in	None	High	High	Medium
Query Language	PromQL, LogQL, TraceQL	Proprietary query syntax	NRQL	KQL, Lucene
Best For	Cost-conscious teams	Turnkey solution	Full-stack observability	Log-heavy workloads

Advanced Patterns and Techniques

Generating Dashboards with Grafonnet

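Instead of manually creating Grafana dashboards in the UI, define them programmatically using Grafonnet, a Jsonnet library for Grafana. The example uses the original grafonnet-lib API (grafana.libsonnet), which Grafana has since deprecated in favor of a newer generated Grafonnet library, so treat the function names as specific to that version: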

local grafana = import 'grafonnet/grafana.libsonnet';
local dashboard = grafana.dashboard;
local prometheus = grafana.prometheus;
local graphPanel = grafana.graphPanel;
 
dashboard.new('Order Service SLOs', tags=['slo', 'order-service'])
.addPanel(
  graphPanel.new('Request Latency P99', datasource='Prometheus', format='ms')
  .addTarget(
    prometheus.target(
      'histogram_quantile(0.99, rate(http_request_duration_seconds_bucket{service="order-service"}[5m])) * 1000',
      legendFormat='P99 Latency'
    )
  ),
  gridPos={ x: 0, y: 0, w: 12, h: 8 }
)

Testing Strategies

Test your observability pipeline with synthetic traffic to verify data flows correctly:

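import { trace } from '@opentelemetry/api';
 
// Assumes the SDK from tracing.ts is already loaded; otherwise the API returns no-op spans
describe('Observability Pipeline', () => {
  it('should export traces to Tempo', async () => {
    const tracer = trace.getTracer('test');
    const span = tracer.startSpan('test-span');
    span.setAttribute('test.run', 'true');
    span.end();
    // Give the batch processor and the Collector time to flush
    await new Promise(resolve => setTimeout(resolve, 5000));
    // The tags parameter is logfmt-encoded, so the "=" must be URL-encoded
    const tags = encodeURIComponent('test.run=true');
    const response = await fetch(`http://tempo:3200/api/search?tags=${tags}`);
    expect(response.ok).toBe(true);
    const traces = await response.json();
    expect(traces.traces.length).toBeGreaterThan(0);
  }, 15000); // the 5 s wait alone would exceed Jest's default 5 s test timeout
});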

Future Outlook

The observability landscape is converging on OpenTelemetry as the universal instrumentation standard. Major cloud providers now ship their own OpenTelemetry distributions or accept OTLP directly, which reduces the operational burden of running your own Collector fleet. Grafana's introduction of TraceQL, a query language specifically designed for searching and filtering traces, represents a significant advancement in trace analysis capabilities.

The rise of eBPF-based observability tools promises zero-code instrumentation that automatically captures telemetry from network traffic without requiring any SDK integration. Finally, the convergence of observability and security is gaining momentum—the same telemetry pipeline used for performance monitoring can detect anomalous behavior and potential security incidents without deploying separate security monitoring infrastructure.

Conclusion

Building a production-grade observability stack requires understanding the complementary roles of logs, metrics, and traces and how to correlate them effectively. The Grafana LGTM stack paired with OpenTelemetry provides a powerful, vendor-neutral foundation that can scale from a small startup to a large enterprise deployment.

Key takeaways:

Observability is not monitoring—monitoring checks predefined conditions while observability enables arbitrary exploration of system state
Use OpenTelemetry for instrumentation as the vendor-neutral standard
Correlate across pillars for the real debugging power
Sample intelligently with tail-based sampling to preserve visibility into errors while controlling costs
Structure your data with consistent fields and bounded label cardinality
Automate dashboards with Grafonnet or Terraform
Implement SLO-based alerting on error budget burn rates

For further learning, explore the OpenTelemetry documentation, the Grafana Labs blog, and the Prometheus best practices guide.

Minh Vo
