Observability in 2024: Logs, Metrics, and Traces

Complete observability stack: OpenTelemetry, Grafana, Loki, Tempo, and Prometheus.

Tags: Observability, DevOps, Monitoring, Grafana

By MinhVo

Introduction

Modern distributed systems are inherently complex. A single user request might traverse dozens of microservices, interact with multiple databases, pass through message queues, and touch external APIs before returning a response. When something goes wrong in this intricate web of services, traditional monitoring approaches—checking CPU usage and scanning log files—simply cannot provide the answers you need. This is where observability comes in.

Observability goes beyond monitoring. While monitoring tells you that something is broken, observability helps you understand why it is broken. The concept, borrowed from control theory, refers to the ability to understand the internal state of a system by examining its external outputs. In the context of software engineering, this means being able to ask arbitrary questions about your system's behavior without having to pre-define what those questions might be. The three pillars of observability—logs, metrics, and traces—form the foundation of this capability.

In 2024, the observability landscape has matured significantly. OpenTelemetry has emerged as the de facto standard for instrumentation, replacing the fragmented ecosystem of vendor-specific agents and SDKs. Grafana's LGTM stack (Loki, Grafana, Tempo, Mimir/Prometheus) provides a powerful open-source alternative to expensive commercial platforms. This guide walks you through building a complete observability stack, from understanding the fundamentals to implementing production-grade pipelines.

[Figure: Observability Stack Overview]

Understanding Observability: Core Concepts

The Three Pillars

Observability rests on three complementary data types, each providing a different lens through which to examine your system.

Logs are discrete event records that capture what happened at a specific point in time. They are the most familiar telemetry data type—every developer has used console.log() or written to a log file at some point. Structured logging, where each log entry contains consistent key-value pairs such as requestId, userId, and duration, transforms logs from unstructured text blobs into queryable, analyzable data. Modern log aggregation systems like Grafana Loki index metadata labels rather than full log content, making them far more cost-efficient than traditional full-text indexing solutions like Elasticsearch.
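As a rough illustration of structured logging, the sketch below serializes each entry as one JSON line with consistent keys. The `formatLog` helper and its field names are hypothetical and not part of any logging library; a production service would more likely use a library such as pino.

```typescript
// Minimal structured-logging sketch; the helper and field names are illustrative.
type LogLevel = 'debug' | 'info' | 'warn' | 'error';

interface LogFields {
  requestId: string;
  userId?: string;
  durationMs?: number;
  [key: string]: unknown;
}

// Serialize one log entry as a single JSON line.
// Fixed keys are written last so caller-supplied fields cannot overwrite them.
function formatLog(
  level: LogLevel,
  message: string,
  fields: LogFields,
  now: Date = new Date(),
): string {
  return JSON.stringify({
    ...fields,
    timestamp: now.toISOString(),
    level,
    message,
  });
}

console.log(
  formatLog('info', 'order placed', { requestId: 'req-123', userId: 'u-42', durationMs: 87 }),
);
```

Because every entry carries the same keys, a log backend can filter on `requestId` or aggregate `durationMs` without parsing free-form text.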

Metrics are numerical measurements captured over time. They answer questions like "What is the current request rate?" or "How much memory is this service using?" Metrics are inherently more efficient to store and query than logs because they are aggregated by nature. A single metric data point (a metric name, a set of labels, a timestamp, and a numeric value) encodes far less information than a log line, but at scale, the ability to graph and alert on metrics makes them indispensable. The most common metric types are counters (monotonically increasing values like total request count), gauges (values that go up and down like current memory usage), histograms (observations counted into buckets, from which latency percentiles can be estimated), and summaries (quantiles precomputed on the client side, which cannot be meaningfully aggregated across instances).
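To make the difference between the metric types concrete, here is a minimal in-memory sketch of a counter, a gauge, and a bucketed histogram. The class names are hypothetical and this is not the OpenTelemetry or Prometheus client API; it only shows what each type stores.

```typescript
// Counter: only ever increases.
class Counter {
  private value = 0;
  add(n: number = 1): void {
    if (n < 0) throw new Error('counters cannot decrease');
    this.value += n;
  }
  get(): number { return this.value; }
}

// Gauge: holds the latest value, which may go up or down.
class Gauge {
  private value = 0;
  set(v: number): void { this.value = v; }
  get(): number { return this.value; }
}

// Histogram: counts observations into buckets defined by upper bounds.
class Histogram {
  private bounds: number[];
  private counts: number[];
  constructor(bounds: number[]) {
    this.bounds = bounds.slice().sort((a, b) => a - b);
    // One extra slot for values above the highest bound (the +Inf bucket).
    this.counts = this.bounds.map(() => 0).concat([0]);
  }
  record(v: number): void {
    const i = this.bounds.findIndex(b => v <= b);
    this.counts[i === -1 ? this.bounds.length : i] += 1;
  }
  // Cumulative counts per upper bound, in the style of Prometheus "le" buckets.
  cumulative(): Array<{ le: number; count: number }> {
    let running = 0;
    return this.counts.map((c, i) => {
      running += c;
      return { le: i < this.bounds.length ? this.bounds[i] : Infinity, count: running };
    });
  }
}

const latency = new Histogram([0.1, 0.5, 1]);
[0.05, 0.3, 0.3, 2].forEach(v => latency.record(v));
console.log(latency.cumulative());
```

The histogram keeps only one integer per bucket regardless of how many requests it observes, which is why metrics stay cheap at high request volumes.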

Traces capture the end-to-end journey of a request through your distributed system. A trace is composed of spans, where each span represents a single unit of work—a database query, an HTTP call, a message queue operation. Spans are linked together in a parent-child hierarchy, forming a directed acyclic graph that shows exactly how a request flows through your architecture. Distributed tracing is particularly powerful for identifying latency bottlenecks: when a user reports slow page loads, you can examine the trace to see that a downstream service is making an unexpected N+1 query to a database.
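The N+1 pattern mentioned above shows up in a trace as many sibling spans with the same name under one parent. The following sketch flags that shape. The `Span` interface is a simplified, hypothetical model rather than the OpenTelemetry span type, and the threshold is an arbitrary example value.

```typescript
interface Span {
  spanId: string;
  parentSpanId?: string; // undefined for the root span
  name: string;
  durationMs: number;
}

interface RepeatedCall {
  parentSpanId: string;
  name: string;
  count: number;
}

// Count children per (parent, operation name) and report the ones that repeat
// at least `threshold` times, a common signature of an N+1 query pattern.
function findRepeatedChildren(spans: Span[], threshold: number): RepeatedCall[] {
  const counts = new Map<string, RepeatedCall>();
  for (const span of spans) {
    if (span.parentSpanId === undefined) continue;
    const key = `${span.parentSpanId}|${span.name}`;
    const entry = counts.get(key) ?? { parentSpanId: span.parentSpanId, name: span.name, count: 0 };
    entry.count += 1;
    counts.set(key, entry);
  }
  const repeated: RepeatedCall[] = [];
  counts.forEach(entry => {
    if (entry.count >= threshold) repeated.push(entry);
  });
  return repeated;
}

// Example trace: one request that loads orders, then queries items once per order.
const exampleTrace: Span[] = [
  { spanId: 'a', name: 'GET /orders', durationMs: 420 },
  { spanId: 'b', parentSpanId: 'a', name: 'SELECT orders', durationMs: 12 },
  ...['c', 'd', 'e', 'f', 'g'].map(id => ({
    spanId: id, parentSpanId: 'a', name: 'SELECT order_items', durationMs: 70,
  })),
];

console.log(findRepeatedChildren(exampleTrace, 3));
```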

Correlation Is Key

The real power of observability emerges when you correlate data across all three pillars. Imagine your metrics dashboard shows a spike in error rates at 3:42 PM. You click on the spike and see related traces that show a particular downstream service timing out. From the trace, you find the trace ID and use it to pull the exact log entries from the failing service, revealing a database connection pool exhaustion issue. This workflow—metrics to traces to logs—is the golden path of observability debugging, and it requires that all three data types share common identifiers like trace IDs, span IDs, and request IDs.
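One common way to share those identifiers is the W3C Trace Context `traceparent` header, which carries the trace ID and parent span ID between services. The sketch below parses version `00` of that header and stamps the IDs onto a log entry. The helper names are hypothetical; in practice the OpenTelemetry SDK handles propagation for you.

```typescript
interface TraceContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string;  // 16 lowercase hex characters
  sampled: boolean;
}

// traceparent format, version 00: 00-<trace-id>-<parent-id>-<flags>
const TRACEPARENT_V00 = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT_V00.exec(header.trim());
  if (!match) return null;
  const [, traceId, spanId, flags] = match;
  // All-zero IDs are invalid according to the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 1) === 1 };
}

// Attach the trace identifiers to a structured log entry so the log line
// can later be found by searching for the trace ID.
function logWithTrace(message: string, ctx: TraceContext): string {
  return JSON.stringify({ message, traceId: ctx.traceId, spanId: ctx.spanId });
}

const incoming = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
if (incoming) {
  console.log(logWithTrace('connection pool exhausted', incoming));
}
```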

[Figure: Three Pillars of Observability]

Architecture and Design Patterns

The LGTM Stack Architecture

The Grafana LGTM stack represents one of the most popular open-source observability architectures in production today. Each component handles a specific pillar of observability.

Grafana serves as the visualization and alerting layer. It connects to multiple data sources simultaneously, allowing you to build dashboards that combine log queries from Loki, metric queries from Prometheus, and trace queries from Tempo on a single screen. Grafana's dashboard provisioning system supports GitOps workflows where you define dashboards as JSON files in a repository and let Grafana sync them automatically.

Loki is a horizontally scalable log aggregation system inspired by Prometheus. Unlike Elasticsearch, which indexes the full text of every log line, Loki only indexes labels attached to log streams. This design makes Loki dramatically cheaper to operate at scale because the index grows with the number of unique label combinations, not with the volume of log data. Log content itself is stored in compressed chunks in object storage like S3, GCS, or MinIO, so storage cost still tracks log volume, but at object-storage prices. The trade-off is that Loki does not support full-text search as efficiently as Elasticsearch: queries must filter by labels first, then scan log lines within matching streams using grep-like filters.
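The label-first query model can be sketched with a toy in-memory store. This is an illustration of the idea only, not how Loki is implemented: the class and method names are hypothetical, and real Loki stores compressed chunks in object storage rather than arrays of strings.

```typescript
type Labels = Record<string, string>;

interface Stream {
  labels: Labels;
  lines: string[];
}

class LabelIndexedLogStore {
  // The "index": one entry per unique label combination, not per log line.
  private streams = new Map<string, Stream>();

  push(labels: Labels, line: string): void {
    const key = Object.keys(labels).sort().map(k => `${k}=${labels[k]}`).join(',');
    let stream = this.streams.get(key);
    if (!stream) {
      stream = { labels, lines: [] };
      this.streams.set(key, stream);
    }
    stream.lines.push(line);
  }

  // Step 1: select streams whose labels match the selector (index lookup).
  // Step 2: scan only those streams' lines with a substring filter.
  query(selector: Labels, contains: string): string[] {
    const results: string[] = [];
    this.streams.forEach(stream => {
      const matches = Object.keys(selector).every(k => stream.labels[k] === selector[k]);
      if (!matches) return;
      for (const line of stream.lines) {
        if (line.indexOf(contains) !== -1) results.push(line);
      }
    });
    return results;
  }

  streamCount(): number {
    return this.streams.size;
  }
}

const store = new LabelIndexedLogStore();
store.push({ app: 'api', env: 'prod' }, 'request completed in 40ms');
store.push({ app: 'api', env: 'prod' }, 'db timeout after 5s');
store.push({ app: 'worker', env: 'prod' }, 'job timeout after 30s');
console.log(store.query({ app: 'api' }, 'timeout'));
```

The index holds two entries for three log lines here. Adding a high-cardinality label such as a user ID would create one stream per user, which is why labels should stay bounded.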

Prometheus (or Grafana Mimir, when you need multi-tenant long-term storage) handles metrics collection and storage. Prometheus uses a pull-based model: it scrapes HTTP endpoints exposed by your applications at regular intervals, ingesting the time-series data into its local TSDB. PromQL, Prometheus's query language, is extremely powerful for aggregating, filtering, and performing math across time series. For example, histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) estimates the 99th percentile request latency for each series from the bucket rates over the last 5 minutes; wrap the rate in sum by (le) (...) to aggregate across instances.
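To show why a histogram quantile is an estimate, the sketch below reimplements the core idea: find the bucket that contains the target rank, then interpolate linearly inside it. It is a simplified approximation of what histogram_quantile does, assuming cumulative counts and a final +Inf bucket. The bucket values are made up for illustration.

```typescript
interface Bucket {
  le: number;    // upper bound of the bucket; Infinity for the last one
  count: number; // cumulative count of observations <= le
}

function histogramQuantile(q: number, buckets: Bucket[]): number {
  if (q < 0 || q > 1 || buckets.length < 2) return NaN;
  const sorted = buckets.slice().sort((a, b) => a.le - b.le);
  const total = sorted[sorted.length - 1].count;
  if (total === 0) return NaN;

  const rank = q * total;
  const i = sorted.findIndex(b => b.count >= rank);
  // A rank inside the +Inf bucket cannot be interpolated, so the highest
  // finite bound is returned instead.
  if (i === sorted.length - 1) return sorted[sorted.length - 2].le;

  const lower = i === 0 ? 0 : sorted[i - 1].le;
  const prevCount = i === 0 ? 0 : sorted[i - 1].count;
  const inBucket = sorted[i].count - prevCount;
  if (inBucket === 0) return lower;
  // Assume observations are spread evenly inside the bucket.
  return lower + (sorted[i].le - lower) * ((rank - prevCount) / inBucket);
}

// Hypothetical latency buckets in seconds.
const latencyBuckets: Bucket[] = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1, count: 99 },
  { le: Infinity, count: 100 },
];

console.log(histogramQuantile(0.95, latencyBuckets));
```

The accuracy of the result depends on the bucket boundaries: a quantile that falls in a wide bucket can be off by most of that bucket's width.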

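Tempo is a distributed tracing backend that stores and queries traces. It is designed to be cost-efficient: it needs only object storage and, unlike Jaeger or Zipkin deployments backed by Elasticsearch or Cassandra, it does not maintain a full index of span content. Looking up a trace by ID is the cheapest and fastest path, so a common workflow is to find a trace ID from a log line or a metric exemplar first, then open the full trace in Tempo. Recent Tempo versions also support attribute-based search through TraceQL, at the cost of scanning trace blocks.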

Data Collection with OpenTelemetry

OpenTelemetry provides a vendor-neutral instrumentation layer. Instead of installing separate agents for each observability vendor, you instrument your code once using the OTel SDK and export all telemetry data through the OTel Collector, which can route data to multiple backends. A Collector pipeline is built from three main component types: receivers accept data in various formats, processors transform data, and exporters send data to backends.
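The receiver, processor, exporter flow can be modeled as plain function composition. The sketch below is a conceptual illustration, not Collector code: the types, the `buildPipeline` helper, and the attribute names are assumptions chosen for the example.

```typescript
interface LogRecord {
  body: string;
  attributes: Record<string, string>;
}

type Processor = (batch: LogRecord[]) => LogRecord[];
type Exporter = (batch: LogRecord[]) => void;

// A pipeline runs each received batch through the processors in order,
// then hands the result to every exporter.
function buildPipeline(processors: Processor[], exporters: Exporter[]): (batch: LogRecord[]) => void {
  return received => {
    const processed = processors.reduce((batch, process) => process(batch), received);
    exporters.forEach(exportBatch => exportBatch(processed));
  };
}

// Processor: drop health-check noise.
const dropHealthChecks: Processor = batch =>
  batch.filter(record => record.attributes['url.path'] !== '/healthz');

// Processor factory: enrich every record with a cluster name.
const addCluster = (cluster: string): Processor => batch =>
  batch.map(record => ({
    ...record,
    attributes: { ...record.attributes, 'k8s.cluster.name': cluster },
  }));

// Exporter: collect into an array here; a real exporter would send over the network.
const exported: LogRecord[] = [];
const memoryExporter: Exporter = batch => {
  batch.forEach(record => exported.push(record));
};

const receive = buildPipeline([dropHealthChecks, addCluster('prod-eu')], [memoryExporter]);
receive([
  { body: 'GET /healthz 200', attributes: { 'url.path': '/healthz' } },
  { body: 'POST /orders 201', attributes: { 'url.path': '/orders' } },
]);
console.log(exported);
```

Processor order matters in the same way it does in a Collector configuration: filtering before enrichment avoids doing work on records that will be dropped.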

The Observability Pipeline Pattern

A mature observability deployment follows a pipeline pattern:

  1. Instrumentation – Applications emit telemetry using OTel SDKs
  2. Collection – OTel Collector agents run as sidecars or DaemonSets
  3. Processing – The Collector batches, filters, samples, and enriches data
  4. Routing – Processed data is exported to appropriate backends
  5. Storage – Backends store data with appropriate retention policies
  6. Visualization and Alerting – Grafana provides unified dashboards and alert rules

[Figure: Observability Pipeline]

Step-by-Step Implementation

Setting Up the OTel Collector

First, deploy the OpenTelemetry Collector as a DaemonSet on Kubernetes. The Collector configuration defines which receivers, processors, and exporters to use:

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
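  filelog:
    include: ["/var/log/containers/*.log"]
    operators:
      # Unwrap the container runtime format (CRI/Docker) before parsing the app's JSON;
      # the container operator requires a recent Collector contrib release
      - type: container
      - type: json_parser
        parse_from: body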
 
processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  k8sattributes:
    extract:
      metadata: [k8s.namespace.name, k8s.deployment.name, k8s.pod.name]
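  tail_sampling:
    # Tail sampling needs every span of a trace on the same Collector instance;
    # with per-node agents, run this processor in a gateway tier instead.
    decision_wait: 10s
    policies:
      - name: error-policy
        type: status_code
        status_code: { status_codes: [ERROR] }
      - name: latency-policy
        type: latency
        latency: { threshold_ms: 500 }
      # Keep a baseline share of normal traces (the percentage is an example value)
      - name: baseline-policy
        type: probabilistic
        probabilistic: { sampling_percentage: 10 }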
 
exporters:
  otlp/tempo:
    endpoint: tempo:4317
    tls: { insecure: true }
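  # The dedicated loki exporter is deprecated in recent Collector releases; Loki 3.x
  # also accepts OTLP natively (otlphttp exporter with endpoint http://loki:3100/otlp)
  loki:
    endpoint: http://loki:3100/loki/api/v1/push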
  prometheusremotewrite:
    endpoint: http://mimir:9009/api/v1/push
 
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, tail_sampling, batch]
      exporters: [otlp/tempo]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [prometheusremotewrite]
    logs:
      receivers: [otlp, filelog]
      processors: [memory_limiter, k8sattributes, batch]
      exporters: [loki]

Instrumenting a Node.js Application

Auto-instrumentation captures telemetry from common libraries like Express, HTTP, PostgreSQL, and Redis without modifying application code:

// tracing.ts — load before your application code
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc';
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc';
import { OTLPLogExporter } from '@opentelemetry/exporter-logs-otlp-grpc';
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics';
import { BatchLogRecordProcessor } from '@opentelemetry/sdk-logs';
import { Resource } from '@opentelemetry/resources';
import { ATTR_SERVICE_NAME } from '@opentelemetry/semantic-conventions';
 
const sdk = new NodeSDK({
  resource: new Resource({
    [ATTR_SERVICE_NAME]: 'order-service',
    'deployment.environment': process.env.NODE_ENV || 'development',
  }),
  traceExporter: new OTLPTraceExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT }),
    exportIntervalMillis: 30000,
  }),
  logRecordProcessor: new BatchLogRecordProcessor(
    new OTLPLogExporter({ url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT })
  ),
  instrumentations: [
    getNodeAutoInstrumentations({
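      '@opentelemetry/instrumentation-http': {
        // ignoreIncomingPaths was removed from the HTTP instrumentation; use the hook instead
        ignoreIncomingRequestHook: (req) =>
          ['/healthz', '/readyz', '/metrics'].includes(req.url ?? ''),
      },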
      '@opentelemetry/instrumentation-pg': { enhancedDatabaseReporting: true },
    }),
  ],
});
 
sdk.start();
// Flush pending telemetry before the process exits
process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0));
});

Custom Metrics with the OTel Metrics API

Beyond auto-instrumentation, you can define custom business metrics:

import { metrics } from '@opentelemetry/api';
 
const meter = metrics.getMeter('order-service');
 
const ordersPlaced = meter.createCounter('orders.placed', {
  description: 'Total number of orders placed',
  unit: '{order}', // UCUM-style annotation for a dimensionless count
});
 
const orderValue = meter.createHistogram('orders.value', {
  description: 'Distribution of order values in USD',
  unit: 'USD',
  advice: { explicitBucketBoundaries: [10, 25, 50, 100, 250, 500, 1000] },
});
 
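// Hypothetical domain types, declared only so the example type-checks
interface Order { type: string; paymentMethod: string; totalUSD: number }
declare function processOrder(order: Order): Promise<unknown>;

async function placeOrder(order: Order) {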
  const result = await processOrder(order);
  ordersPlaced.add(1, {
    'order.type': order.type,
    'payment.method': order.paymentMethod,
  });
  orderValue.record(order.totalUSD);
  return result;
}

Real-World Use Cases and Case Studies

Use Case 1: Debugging Intermittent Latency Spikes

A production API starts experiencing intermittent 2-second latency spikes every few minutes. The SRE team opens the Grafana metrics dashboard and sees the P99 latency spike correlated with a surge in database query count. They click on a spike in the latency histogram, which links to traces in Tempo. The trace waterfall reveals that a single HTTP request is triggering 150 sequential database queries—an N+1 problem introduced by a recent ORM change. They find the trace ID, search for it in Loki, and locate the exact log entries showing the query pattern. The fix is deployed within an hour.

Use Case 2: Capacity Planning for a Product Launch

A team is preparing for a major product launch expected to increase traffic by 10x. Using Prometheus metrics from the past 30 days, they build Grafana dashboards showing current resource utilization trends. They correlate request rate metrics with resource consumption to create capacity models. By extrapolating current usage patterns, they determine they need to scale from 3 to 12 replicas and increase the database connection pool from 50 to 200. The team sets up Grafana alerts at 70% and 85% capacity thresholds to get early warnings during the launch.
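The replica estimate in this scenario comes down to simple arithmetic. The sketch below uses assumed inputs: the peak request rate, the per-replica capacity, and the 70% utilization target are hypothetical numbers chosen so the result matches the 12 replicas in the scenario, not measurements from a real system.

```typescript
interface CapacityInputs {
  currentPeakRps: number;       // hypothetical: observed peak requests per second
  growthFactor: number;         // expected traffic multiplier for the launch
  rpsPerReplica: number;        // hypothetical: load at which one replica saturates
  targetUtilizationPct: number; // how full each replica may run, in percent
}

function replicasNeeded(input: CapacityInputs): number {
  const projectedRps = input.currentPeakRps * input.growthFactor;
  // Only a share of each replica's capacity is usable if headroom is kept.
  const usableRpsPerReplica = (input.rpsPerReplica * input.targetUtilizationPct) / 100;
  return Math.ceil(projectedRps / usableRpsPerReplica);
}

const launchReplicas = replicasNeeded({
  currentPeakRps: 600,
  growthFactor: 10,
  rpsPerReplica: 720,
  targetUtilizationPct: 70,
});
console.log(launchReplicas); // 12
```

The per-replica capacity is the input most worth validating with a load test, since extrapolating from current utilization assumes throughput scales linearly.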

Use Case 3: Security Incident Detection

An engineer notices unusual patterns in the observability data: a specific API endpoint shows a sudden increase in 403 responses correlated with a burst of failed authentication attempts in Loki logs. Using Grafana's log-to-trace correlation, they trace the requests to identify the source IP addresses and request patterns. The traces reveal automated credential-stuffing attacks. The team immediately implements rate limiting and blocks the offending IPs, all while monitoring the Grafana dashboard to verify the attack volume decreases.

Best Practices for Production

  1. Use structured logging from day one: Emit logs as JSON objects with consistent field names across all services. Use correlation IDs in every log entry to enable cross-pillar debugging. Libraries like pino for Node.js or zerolog for Go make structured logging nearly free in terms of performance overhead.

  2. Sample traces intelligently: In high-throughput systems, collecting 100% of traces is prohibitively expensive. Use tail-based sampling in the OTel Collector to always keep traces with errors or high latency, while sampling a percentage of normal traces. This ensures you never miss the interesting cases while controlling storage costs.

  3. Set cardinality limits on metrics: Prometheus and Mimir can struggle when metric label combinations explode. Set cardinality limits in your instrumentation to prevent runaway metric series growth. A good rule of thumb is to keep unique label combinations under 10,000 per metric.

  4. Use exemplars to bridge metrics and traces: Configure your metrics instrumentation to attach trace IDs as exemplars on histogram buckets. This allows Grafana dashboard users to click on a high-latency data point in a graph and jump directly to a representative trace in Tempo.

  5. Implement alerting with multi-window SLOs: Instead of alerting on raw metrics, define Service Level Objectives and alert when you are consuming your error budget too quickly. Use Grafana's SLO plugin or recording rules to implement multi-burn-rate alerting.

  6. Separate hot and cold storage: Configure Loki and Tempo to store recent data on fast storage and older data in object storage. Set retention policies that balance cost against debugging needs—typically 7-14 days hot for logs, 30-90 days cold, and 3-7 days for traces.

  7. Use the Collector for data enrichment: Add Kubernetes metadata at the Collector level rather than in application code. This keeps instrumentation simple and ensures consistent metadata across all telemetry types.

  8. Test your observability pipeline in staging: Before deploying instrumentation changes to production, verify that your Collector configuration, sampling rules, and dashboard queries work correctly in a staging environment.
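The tail-based sampling practice from the list above can be sketched as a decision function that runs once all spans of a trace have arrived. This is a simplified illustration, not the Collector's implementation: the types are hypothetical, the latency check looks at individual span durations rather than the whole trace, and the hash is a basic string hash picked for determinism.

```typescript
interface FinishedSpan {
  traceId: string;
  durationMs: number;
  isError: boolean;
}

interface SamplingPolicy {
  latencyThresholdMs: number;
  baselinePercent: number; // share of ordinary traces to keep, 0 to 100
}

// Map a trace ID to a stable bucket in [0, 100) so that every Collector
// instance makes the same baseline decision for the same trace.
function bucketOf(traceId: string): number {
  let hash = 0;
  for (let i = 0; i < traceId.length; i++) {
    hash = (hash * 31 + traceId.charCodeAt(i)) >>> 0;
  }
  return hash % 100;
}

function shouldKeepTrace(spans: FinishedSpan[], policy: SamplingPolicy): boolean {
  if (spans.length === 0) return false;
  // Always keep traces that contain an error.
  if (spans.some(span => span.isError)) return true;
  // Always keep traces that contain a slow span.
  if (spans.some(span => span.durationMs >= policy.latencyThresholdMs)) return true;
  // Otherwise keep only a baseline percentage.
  return bucketOf(spans[0].traceId) < policy.baselinePercent;
}

const policy: SamplingPolicy = { latencyThresholdMs: 500, baselinePercent: 10 };
const failedTrace: FinishedSpan[] = [{ traceId: 'trace-1', durationMs: 20, isError: true }];
const slowTrace: FinishedSpan[] = [{ traceId: 'trace-2', durationMs: 900, isError: false }];
console.log(shouldKeepTrace(failedTrace, policy), shouldKeepTrace(slowTrace, policy));
```

The decision can only be made after the trace is complete, which is why the Collector buffers spans for a waiting period before deciding.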

Common Pitfalls and Solutions

| Pitfall | Impact | Solution |
|---|---|---|
| High-cardinality metric labels (e.g., user IDs) | Prometheus OOM, slow queries | Use traces for high-cardinality data; keep metric labels to a bounded set |
| Missing correlation between logs, metrics, and traces | Unable to debug across pillars | Ensure all services propagate traceId and include it in log entries |
| Collecting 100% of traces in production | Massive storage costs, Collector overload | Implement tail-based sampling: keep all errors, sample 1-10% of normal traffic |
| Log volume explosion from debug-level logging | Loki storage costs spike | Use runtime-adjustable log levels; default to INFO in production |
| Alert fatigue from too many low-value alerts | Operators ignore real incidents | Implement SLO-based alerting with multi-burn-rate windows |
| Vendor lock-in from proprietary SDKs | Expensive migration when switching vendors | Use OpenTelemetry SDKs exclusively |
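The multi-burn-rate alerting recommended for alert fatigue rests on one formula: burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). The sketch below pages only when both a long and a short window burn fast. The 14.4 threshold is the commonly cited value for a 1-hour and 5-minute window pair, corresponding to 2% of a 30-day budget consumed in one hour; the function names and sample ratios are illustrative assumptions.

```typescript
// Burn rate 1 means the error budget is consumed exactly over the SLO period.
function burnRate(errorRatio: number, sloTarget: number): number {
  return errorRatio / (1 - sloTarget);
}

// Page only if the long window shows sustained burn AND the short window
// confirms the problem is still happening.
function shouldPage(
  longWindowErrorRatio: number,
  shortWindowErrorRatio: number,
  sloTarget: number,
  threshold: number = 14.4,
): boolean {
  return (
    burnRate(longWindowErrorRatio, sloTarget) >= threshold &&
    burnRate(shortWindowErrorRatio, sloTarget) >= threshold
  );
}

const slo = 0.999; // 99.9% availability target
console.log(burnRate(0.02, slo));           // roughly 20x budget consumption
console.log(shouldPage(0.02, 0.03, slo));   // both windows burning fast
console.log(shouldPage(0.02, 0.001, slo));  // short window has recovered
```

Requiring the short window to agree is what lets the alert reset quickly once the incident is over, instead of firing until the long window drains.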

Performance Optimization

Reducing Collector Overhead

The OTel Collector can become a bottleneck if not tuned properly. Key optimizations include batching more aggressively and using the memory limiter to prevent OOM conditions:

processors:
  batch:
    timeout: 10s
    send_batch_size: 8192
  memory_limiter:
    check_interval: 500ms
    limit_mib: 1024
    spike_limit_mib: 256
  filter/drop-healthchecks:
    logs:
      exclude:
        match_type: regexp
        record_attributes:
          - key: url.path
            value: '^/(healthz|readyz|metrics)$'

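The Collector can run as an Agent (a DaemonSet with one instance per node), as a Gateway (a centrally scaled Deployment), or both. In Agent mode, each node runs its own Collector instance, which keeps telemetry traffic node-local and prevents a single Collector from becoming a bottleneck. A Gateway tier suits small environments and is the better home for processing that needs a global view: tail-based sampling, for instance, only works when all spans of a trace reach the same Collector instance. High-throughput clusters commonly combine the two, with Agents forwarding to a Gateway.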

Prometheus Query Optimization

Slow PromQL queries can impact both dashboard load times and alert evaluation. Always filter by service or job label before aggregating:

# BAD: Full scan of high-cardinality metric
histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (le, path))
 
# GOOD: Filter by service first, then aggregate
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{service="order-service"}[5m])) by (le)
)

Comparison with Alternatives

| Feature | Grafana LGTM | Datadog | New Relic | Elastic Stack |
|---|---|---|---|---|
| Cost Model | Open-source (infra costs only) | Per-host + per-GB ingest | Per-GB ingest | Open-source or licensed |
| Traces | Tempo (low-cost) | APM (indexed, expensive) | Distributed tracing | APM |
| Logs | Loki (label-only indexing) | Full indexing | NRQL | Elasticsearch |
| Metrics | Prometheus/Mimir | Custom metrics extra | Metrics | Elasticsearch (via Metricbeat or Elastic Agent) |
| Learning Curve | High (multiple components) | Low (unified platform) | Medium | High |
| Vendor Lock-in | None | High | High | Medium |
| Query Language | PromQL, LogQL, TraceQL | Proprietary query syntax | NRQL | KQL, Lucene |
| Best For | Cost-conscious teams | Turnkey solution | Full-stack observability | Log-heavy workloads |

Advanced Patterns and Techniques

Generating Dashboards with Grafonnet

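Instead of manually creating Grafana dashboards in the UI, define them programmatically using Grafonnet, a Jsonnet library for Grafana. The example uses the API of the original grafonnet-lib, which has since been deprecated in favor of the newer generated Grafonnet library, so treat it as an illustration of the approach: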

local grafana = import 'grafonnet/grafana.libsonnet';
local dashboard = grafana.dashboard;
local prometheus = grafana.prometheus;
local graphPanel = grafana.graphPanel;
 
dashboard.new('Order Service SLOs', tags=['slo', 'order-service'])
.addPanel(
  graphPanel.new('Request Latency P99', datasource='Prometheus', format='ms')
  .addTarget(
    prometheus.target(
      'histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket{service="order-service"}[5m]))) * 1000',
      legendFormat='P99 Latency'
    )
  ),
  gridPos={ x: 0, y: 0, w: 12, h: 8 }
)

Testing Strategies

Test your observability pipeline with synthetic traffic to verify data flows correctly:

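import { trace } from '@opentelemetry/api';
 
// Assumes the OTel SDK (tracing.ts) is loaded in the test process and that
// the Collector's sampling rules let this span through to Tempo.
describe('Observability Pipeline', () => {
  it('should export traces to Tempo', async () => {
    const tracer = trace.getTracer('test');
    const span = tracer.startSpan('test-span');
    span.setAttribute('test.run', 'true');
    span.end();
    // Wait for the batch export and any sampling decision window
    await new Promise(resolve => setTimeout(resolve, 15000));
    const tags = encodeURIComponent('test.run=true');
    const response = await fetch(`http://tempo:3200/api/search?tags=${tags}`);
    expect(response.ok).toBe(true);
    const result = await response.json();
    expect(result.traces.length).toBeGreaterThan(0);
  }, 30000); // raise the per-test timeout above the wait
});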

Future Outlook

The observability landscape is converging on OpenTelemetry as the universal instrumentation standard. The major cloud providers now ship their own OpenTelemetry distributions or accept OTLP in their managed monitoring services, which reduces the operational burden of running your own Collector fleet. Grafana's TraceQL, a query language specifically designed for searching and filtering traces, represents a significant advancement in trace analysis capabilities.

The rise of eBPF-based observability tools promises zero-code instrumentation that automatically captures telemetry from network traffic without requiring any SDK integration. Finally, the convergence of observability and security is gaining momentum—the same telemetry pipeline used for performance monitoring can detect anomalous behavior and potential security incidents without deploying separate security monitoring infrastructure.

Conclusion

Building a production-grade observability stack requires understanding the complementary roles of logs, metrics, and traces and how to correlate them effectively. The Grafana LGTM stack paired with OpenTelemetry provides a powerful, vendor-neutral foundation that can scale from a small startup to a large enterprise deployment.

Key takeaways:

  1. Observability is not monitoring—monitoring checks predefined conditions while observability enables arbitrary exploration of system state
  2. Use OpenTelemetry for instrumentation as the vendor-neutral standard
  3. Correlate across pillars for the real debugging power
  4. Sample intelligently with tail-based sampling to preserve visibility into errors while controlling costs
  5. Structure your data with consistent fields and bounded label cardinality
  6. Automate dashboards with Grafonnet or Terraform
  7. Implement SLO-based alerting on error budget burn rates

For further learning, explore the OpenTelemetry documentation, the Grafana Labs blog, and the Prometheus best practices guide.