Observability-Driven Development: Test in Production Safely

No staging environment perfectly replicates production. Different data volumes, user behavior patterns, network conditions, infrastructure configurations, and third-party service states mean that some bugs only manifest under real traffic. Observability-Driven Development (ODD) embraces this reality: instead of trying to make staging perfect, instrument your code thoroughly, deploy safely with progressive rollout strategies, and use production telemetry to validate correctness.

Testing in production doesn't mean shipping untested code. It means using production as the ultimate test environment with safety nets in place — feature flags to control exposure, canary deployments to limit blast radius, and comprehensive observability to detect problems before users report them.

The shift toward ODD has been driven by the adoption of microservices, where a single user request might traverse dozens of services. Stepping through code with a debugger is impractical when the failing request spans many processes and hosts. Instead, you need systems that tell you what's happening in real time, across every service boundary, without requiring you to predict failure modes in advance.

The Three Pillars of Observability

Observability rests on three complementary telemetry signals: logs, metrics, and traces. Raw telemetry isn't enough — the key is structured, correlated data that lets you ask questions you didn't anticipate when you wrote the instrumentation.

Structured Logging

Unstructured log messages are nearly impossible to search, aggregate, or alert on. Structured logging with consistent fields enables powerful querying and automated analysis:

import pino from "pino";
 
const logger = pino({
  level: process.env.LOG_LEVEL || "info",
  formatters: {
    level: (label) => ({ level: label }),
  },
  redact: ["req.headers.authorization", "password", "token", "creditCard"],
  serializers: {
    err: pino.stdSerializers.err,
    req: pino.stdSerializers.req,
    res: pino.stdSerializers.res,
  },
});
 
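// In request handler. Assumes `tracer` is an OpenTelemetry tracer,
// e.g. trace.getTracer("checkout-service") from "@opentelemetry/api".
function handleCheckout(req: Request): Response {
  const span = tracer.startSpan("checkout");
  const log = logger.child({
    traceId: span.spanContext().traceId,
    spanId: span.spanContext().spanId,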
    userId: req.user.id,
    orderId: req.body.orderId,
    requestId: req.headers["x-request-id"],
  });
 
  log.info(
    { itemCount: req.body.items.length, total: req.body.total },
    "checkout started"
  );
 
  try {
    const result = processPayment(req.body);
    log.info(
      {
        total: result.total,
        processor: result.processor,
        processingTime: result.duration,
      },
      "payment succeeded"
    );
    return result;
  } catch (error) {
    log.error(
      { err: error, processor: req.body.processor, amount: req.body.total },
      "payment failed"
    );
    throw error;
  } finally {
    span.end();
  }
}

Key principles for structured logging:

Use consistent field names across all services (userId, orderId, traceId)
Redact sensitive data automatically (passwords, tokens, PII)
Include context — trace IDs, request IDs, and user IDs enable correlation
Log at appropriate levels — debug for development detail, info for business events, warn for degraded state, error for failures
Ship logs to a centralized system — Loki, Elasticsearch, or Datadog for cross-service querying
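
To make the principles above concrete, here is a dependency-free sketch of what a child logger roughly does: merge the bound context into every record and mask sensitive keys before serializing. The names `makeLogger`, `redact`, and `REDACTED_KEYS` are hypothetical and exist only for illustration; in a real service pino's `child` and `redact` options do this work.

```typescript
type Fields = Record<string, unknown>;

// Hypothetical set of top-level keys to mask before a record is written.
const REDACTED_KEYS = new Set(["password", "token", "creditCard"]);

function redact(fields: Fields): Fields {
  const out: Fields = {};
  for (const key of Object.keys(fields)) {
    out[key] = REDACTED_KEYS.has(key) ? "[Redacted]" : fields[key];
  }
  return out;
}

// Returns a function that renders one JSON log line with the bound context merged in.
function makeLogger(context: Fields) {
  return (level: string, fields: Fields, msg: string): string =>
    JSON.stringify({ level, ...redact({ ...context, ...fields }), msg });
}

const checkoutLog = makeLogger({ traceId: "abc123", userId: "u-42" });
const line = checkoutLog("info", { total: 99.5, token: "secret" }, "checkout started");
// line is:
// {"level":"info","traceId":"abc123","userId":"u-42","total":99.5,"token":"[Redacted]","msg":"checkout started"}
```

Because every line carries the same `traceId` and `userId` fields, a log backend can filter on them without parsing free text.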

Metrics That Drive Decisions

Metrics are numerical measurements aggregated over time. They answer questions like "how many?" and "how fast?" and enable alerting on anomalies:

import { Counter, Histogram, Gauge, Registry } from "prom-client";
 
const register = new Registry();
 
const checkoutTotal = new Counter({
  name: "checkout_total",
  help: "Total checkout attempts",
  labelNames: ["status", "payment_method", "region"],
  registers: [register],
});
 
const checkoutDuration = new Histogram({
  name: "checkout_duration_seconds",
  help: "Checkout processing duration",
  labelNames: ["status"],
  buckets: [0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  registers: [register],
});
 
const activeCarts = new Gauge({
  name: "active_carts", // no _total suffix: Prometheus naming reserves it for counters
  help: "Number of active shopping carts",
  labelNames: ["region"],
  registers: [register],
});
 
const paymentErrors = new Counter({
  name: "payment_errors_total",
  help: "Payment processing errors by type",
  labelNames: ["error_type", "processor", "region"],
  registers: [register],
});
 
// Usage in application code
async function processCheckout(cart: Cart): Promise<Result> {
  const timer = checkoutDuration.startTimer(); // the status label is supplied when the timer stops
  activeCarts.inc({ region: cart.region });
 
  try {
    const result = await chargePayment(cart);
    checkoutTotal.inc({
      status: "success",
      payment_method: cart.paymentMethod,
      region: cart.region,
    });
    timer({ status: "success" });
    return result;
  } catch (error) {
    checkoutTotal.inc({
      status: "failure",
      payment_method: cart.paymentMethod,
      region: cart.region,
    });
    paymentErrors.inc({
      error_type: error.code,
      processor: cart.processor,
      region: cart.region,
    });
    timer({ status: "failure" });
    throw error;
  } finally {
    activeCarts.dec({ region: cart.region });
  }
}

The RED method (Rate, Errors, Duration) provides a consistent framework for service-level metrics:

Metric Type	What It Measures	Example
Rate	Requests per second	`rate(http_requests_total[5m])`
Errors	Failed requests as a ratio	`sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))`
Duration	Request latency distribution	`histogram_quantile(0.99, sum by (le) (rate(http_duration_bucket[5m])))`
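
To show what the three RED numbers mean independently of PromQL, the sketch below computes them from a list of raw request samples. It is a simplified illustration: `RequestSample`, `summarizeRed`, and the nearest-rank percentile are assumptions made for this example, whereas Prometheus estimates quantiles from histogram buckets.

```typescript
interface RequestSample {
  status: number;     // HTTP status code
  durationMs: number; // observed latency in milliseconds
}

interface RedSummary {
  ratePerSecond: number;
  errorRatio: number;
  p99Ms: number;
}

// Nearest-rank percentile computed on a sorted copy of the data.
function percentile(values: number[], q: number): number {
  if (values.length === 0) return 0;
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil(q * sorted.length));
  return sorted[rank - 1];
}

function summarizeRed(samples: RequestSample[], windowSeconds: number): RedSummary {
  const errors = samples.filter((s) => s.status >= 500).length;
  return {
    ratePerSecond: samples.length / windowSeconds,
    errorRatio: samples.length === 0 ? 0 : errors / samples.length,
    p99Ms: percentile(samples.map((s) => s.durationMs), 0.99),
  };
}

// 98 fast successes and 2 slow server errors observed over a 50 second window.
const samples: RequestSample[] = [];
for (let i = 0; i < 98; i++) samples.push({ status: 200, durationMs: 10 });
samples.push({ status: 503, durationMs: 500 }, { status: 500, durationMs: 500 });
const red = summarizeRed(samples, 50);
```

Here the rate is 2 requests per second and the error ratio is 2%, while the p99 is dominated by the two slow failures, which is exactly the kind of tail that an average would hide.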

Distributed Tracing

Traces follow a request across service boundaries, showing the complete execution path and time spent in each service. This is invaluable for identifying bottleneck services, unexpected network hops, and cascading failures:

import { trace, Span, SpanStatusCode } from "@opentelemetry/api";
 
const tracer = trace.getTracer("checkout-service", "1.0.0");
 
// Runs fn inside an active span, records failures on it, and always ends it,
// so a throw in any step cannot leave a span open.
async function withSpan<T>(
  name: string,
  orderId: string,
  fn: (span: Span) => Promise<T>
): Promise<T> {
  return tracer.startActiveSpan(name, async (span) => {
    span.setAttribute("order.id", orderId);
    try {
      return await fn(span);
    } catch (error) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: (error as Error).message });
      span.recordException(error as Error);
      throw error;
    } finally {
      span.end();
    }
  });
}
 
async function fulfillOrder(orderId: string): Promise<void> {
  return withSpan("fulfillOrder", orderId, async (span) => {
    span.setAttribute("order.type", "standard");
 
    // Step 1: Reserve inventory
    await withSpan("reserveInventory", orderId, async (invSpan) => {
      const reserved = await inventory.reserve(orderId);
      invSpan.setAttribute("items.count", reserved.length);
      invSpan.setAttribute("items.total_value", reserved.reduce((s, i) => s + i.price, 0));
    });
 
    // Step 2: Charge payment
    await withSpan("chargePayment", orderId, async (paySpan) => {
      const charge = await payments.charge(orderId);
      paySpan.setAttribute("payment.id", charge.id);
      paySpan.setAttribute("payment.amount", charge.amount);
      paySpan.setAttribute("payment.processor", charge.processor);
    });
 
    // Step 3: Send confirmation
    await withSpan("sendConfirmation", orderId, async () => {
      await notifications.sendOrderConfirmation(orderId);
    });
 
    span.setStatus({ code: SpanStatusCode.OK });
  });
}
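
What lets a trace cross a service boundary is a small header that carries the trace and parent span IDs. The sketch below parses and formats the W3C `traceparent` header by hand, purely to show what travels on the wire; the function names are hypothetical, only version `00` is handled, and in practice an OpenTelemetry propagator does this for you.

```typescript
interface TraceContext {
  traceId: string; // 32 lowercase hex characters
  spanId: string;  // 16 lowercase hex characters
  sampled: boolean;
}

// version "00" - trace-id - parent-id - trace-flags
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceContext | null {
  const match = TRACEPARENT_RE.exec(header.trim());
  if (!match) return null;
  const traceId = match[1];
  const spanId = match[2];
  const flags = match[3];
  // All-zero IDs are invalid in the W3C Trace Context format.
  if (/^0+$/.test(traceId) || /^0+$/.test(spanId)) return null;
  return { traceId, spanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

function formatTraceparent(ctx: TraceContext): string {
  return `00-${ctx.traceId}-${ctx.spanId}-${ctx.sampled ? "01" : "00"}`;
}

const incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
const ctx = parseTraceparent(incoming);
```

A downstream service keeps the same `traceId`, generates a new `spanId` for its own work, and forwards the updated header on its outgoing calls.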

Setting Up OpenTelemetry

OpenTelemetry (OTel) is the vendor-neutral standard for collecting telemetry data. It provides a unified API and SDK for logs, metrics, and traces, eliminating the need to integrate with multiple vendor-specific libraries:

import { NodeSDK } from "@opentelemetry/sdk-node";
import { getNodeAutoInstrumentations } from "@opentelemetry/auto-instrumentations-node";
import { OTLPTraceExporter } from "@opentelemetry/exporter-trace-otlp-grpc";
import { OTLPMetricExporter } from "@opentelemetry/exporter-metrics-otlp-grpc";
import { PeriodicExportingMetricReader } from "@opentelemetry/sdk-metrics";
import { Resource } from "@opentelemetry/resources";
import { SemanticResourceAttributes } from "@opentelemetry/semantic-conventions";
import { W3CTraceContextPropagator } from "@opentelemetry/core";
 
const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: "checkout-service",
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.APP_VERSION || "unknown",
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV || "development",
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || "http://otel-collector:4317",
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || "http://otel-collector:4317",
    }),
    exportIntervalMillis: 15000,
  }),
  textMapPropagator: new W3CTraceContextPropagator(),
  instrumentations: [
    getNodeAutoInstrumentations({
      "@opentelemetry/instrumentation-http": { enabled: true },
      "@opentelemetry/instrumentation-express": { enabled: true },
      "@opentelemetry/instrumentation-pg": { enabled: true },
      "@opentelemetry/instrumentation-redis": { enabled: true },
    }),
  ],
});
 
sdk.start();
process.on("SIGTERM", () => sdk.shutdown());

With auto-instrumentation, HTTP requests, database queries, and Redis calls are automatically traced without manual span creation. The OTel Collector acts as an intermediary that receives telemetry from your services and routes it to backends like Jaeger, Prometheus, and Loki:

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
 
processors:
  batch:
    timeout: 5s
    send_batch_size: 1024
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
 
exporters:
  # Jaeger ingests OTLP natively; recent Collector releases no longer ship the dedicated jaeger exporter
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: true
  prometheus:
    endpoint: 0.0.0.0:8889
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
 
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]

SLOs, SLIs, and Error Budgets

Service Level Objectives (SLOs) define what "good enough" looks like for your users. Service Level Indicators (SLIs) are the actual measurements. Error budgets are the gap between your target and 100%, and they dictate how much risk you can take with deployments.

Defining Meaningful SLOs

# slo-definitions.yaml
slos:
  - name: checkout-availability
    service: checkout-service
    sli: |
      sum(rate(http_requests_total{status!~"5.."}[30d]))
      /
      sum(rate(http_requests_total[30d]))
    target: 0.999    # 99.9% availability
    window: 30d
    alerting:
      burn_rate_threshold: 14.4  # alert if burning budget 14.4x faster than sustainable
      page_after: 1h
 
  - name: checkout-latency
    service: checkout-service
    sli: |
      sum(rate(http_duration_bucket{le="0.5"}[30d]))
      /
      sum(rate(http_duration_bucket{le="+Inf"}[30d]))
    target: 0.995    # 99.5% of requests under 500ms
    window: 30d
 
  - name: checkout-correctness
    service: checkout-service
    sli: |
      sum(rate(checkout_total{status="success"}[30d]))
      /
      sum(rate(checkout_total[30d]))
    target: 0.998    # 99.8% successful checkouts
    window: 30d
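
The relationship between an SLO target and its error budget is simple arithmetic, sketched below. The function names are hypothetical, and the calculation assumes a time-based budget over the full window.

```typescript
const MINUTES_PER_DAY = 24 * 60;

// Total "bad" minutes the SLO tolerates over the window.
function errorBudgetMinutes(target: number, windowDays: number): number {
  return (1 - target) * windowDays * MINUTES_PER_DAY;
}

// Fraction of the budget still unspent, given the SLI observed so far in the window.
function budgetRemaining(target: number, observedSli: number): number {
  const allowedBadRatio = 1 - target;
  const actualBadRatio = 1 - observedSli;
  return Math.max(0, 1 - actualBadRatio / allowedBadRatio);
}

const availabilityBudget = errorBudgetMinutes(0.999, 30); // about 43.2 minutes
const remaining = budgetRemaining(0.999, 0.9995);         // about 0.5, half the budget left
```

Tightening the target from 99.9% to 99.99% shrinks the budget tenfold, to roughly 4.3 minutes per 30 days, which is why each extra nine should be a deliberate product decision.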

Burn Rate Alerting

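Instead of alerting on raw error rates, alert on error budget burn rate. A 99.9% SLO over 30 days gives you 43.2 minutes of downtime (0.1% of 43,200 minutes). A burn rate of 1 spends that budget exactly over the window; at 14.4x the sustainable rate you spend 2% of it every hour and exhaust it in about 2 days (30 days / 14.4 ≈ 50 hours):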

# Prometheus burn rate alert
groups:
  - name: slo-alerts
    rules:
      - alert: HighBurnRate
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[1h]))
              /
              sum(rate(http_requests_total[1h]))
            )
          ) / (1 - 0.999) > 14.4
        for: 5m
        labels:
          severity: critical
          slo: checkout-availability
        annotations:
          summary: "Error budget burning at 14.4x rate"
          description: "At this rate, the monthly error budget will be exhausted in 2 days"
          runbook_url: "https://wiki.internal/runbooks/high-burn-rate"
 
      - alert: ModerateBurnRate
        expr: |
          (
            1 - (
              sum(rate(http_requests_total{status!~"5.."}[6h]))
              /
              sum(rate(http_requests_total[6h]))
            )
          ) / (1 - 0.999) > 6
        for: 30m
        labels:
          severity: warning
          slo: checkout-availability
        annotations:
          summary: "Error budget burning at 6x rate over 6h window"
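
The thresholds in these rules follow from a little arithmetic, sketched here with hypothetical helper names. The burn rate is the observed error ratio divided by the ratio the SLO allows; from it you can derive how long the budget lasts and how much of it an alert window consumes.

```typescript
// How many times faster than "sustainable" the budget is being spent.
function burnRate(errorRatio: number, target: number): number {
  return errorRatio / (1 - target);
}

// Hours until the whole budget is gone if the current burn rate continues.
function hoursToExhaustion(rate: number, windowDays: number): number {
  return rate <= 0 ? Infinity : (windowDays * 24) / rate;
}

// Fraction of the total budget consumed during one alert window at this burn rate.
function budgetConsumed(rate: number, alertWindowHours: number, windowDays: number): number {
  return (rate * alertWindowHours) / (windowDays * 24);
}

const fastBurn = burnRate(0.0144, 0.999);             // about 14.4
const fastHours = hoursToExhaustion(14.4, 30);        // about 50 hours
const fastShare = budgetConsumed(14.4, 1, 30);        // about 0.02, or 2% of the budget in 1 hour
const slowShare = budgetConsumed(6, 6, 30);           // about 0.05, or 5% of the budget in 6 hours
```

Read this way, the critical rule pages when 2% of the monthly budget disappears within an hour, and the warning rule fires when 5% disappears within six hours.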

Error Budget Policy

Define what happens when error budgets are consumed:

Budget Remaining	Deployment Policy	Risk Tolerance
> 50%	Normal deployments, experiments welcome	High
25-50%	Canary deployments required, no risky experiments	Medium
10-25%	Only critical fixes, no feature releases	Low
< 10%	Emergency changes only, full incident response	None
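
A policy like this is easiest to enforce when it is encoded in the deployment pipeline rather than kept in a wiki. The sketch below maps remaining budget to a policy tier; the tier names and the handling of exact boundary values (50%, 25%, 10%) are assumptions, since the table leaves them open.

```typescript
type DeploymentPolicy =
  | "normal"
  | "canary-required"
  | "critical-fixes-only"
  | "emergency-only";

// remaining is the unspent fraction of the error budget, from 0 to 1.
function deploymentPolicy(remaining: number): DeploymentPolicy {
  if (remaining > 0.5) return "normal";
  if (remaining >= 0.25) return "canary-required";
  if (remaining >= 0.1) return "critical-fixes-only";
  return "emergency-only";
}

const policyNow = deploymentPolicy(0.3); // "canary-required"
```

A CI job could call a function like this with the current budget figure and fail the release step when the change type is not allowed under the returned tier.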

Feature Flags for Safe Rollouts

Feature flags decouple deployment from release. Deploy code to production behind a flag, then enable it gradually based on confidence and business requirements:

interface FeatureFlag {
  name: string;
  description: string;
  enabled: boolean;
  rolloutPercentage: number;
  allowedUsers?: string[];
  allowedRegions?: string[];
  enabledEnvironments?: string[];
  createdAt: string;
  owner: string;
}
 
class FeatureFlagService {
  private flags: Map<string, FeatureFlag> = new Map();
  private cache: Map<string, { value: boolean; expiry: number }> = new Map();
  private cacheTTL = 60_000; // 1 minute; also the worst-case delay before a flag change (including a kill switch) takes effect
 
  isEnabled(flagName: string, context: UserContext): boolean {
    // Check cache first
    const cached = this.cache.get(`${flagName}:${context.userId}`);
    if (cached && Date.now() < cached.expiry) {
      return cached.value;
    }
 
    const flag = this.flags.get(flagName);
    if (!flag || !flag.enabled) {
      this.setCache(flagName, context.userId, false);
      return false;
    }
 
    // Environment check (NODE_ENV may be unset, so fall back to an empty string)
    if (flag.enabledEnvironments?.length &&
        !flag.enabledEnvironments.includes(process.env.NODE_ENV ?? "")) {
      return false;
    }
 
    // Allowlist check (for testing with specific users)
    if (flag.allowedUsers?.includes(context.userId)) {
      this.setCache(flagName, context.userId, true);
      return true;
    }
 
    // Region check
    if (flag.allowedRegions?.includes(context.region)) {
      this.setCache(flagName, context.userId, true);
      return true;
    }
 
    // Percentage rollout (deterministic per user)
    const hash = this.hashUser(flagName, context.userId);
    const result = (hash % 100) < flag.rolloutPercentage;
    this.setCache(flagName, context.userId, result);
    return result;
  }
 
  private hashUser(flagName: string, userId: string): number {
    let hash = 0;
    const str = `${flagName}:${userId}`;
    for (let i = 0; i < str.length; i++) {
      hash = ((hash << 5) - hash + str.charCodeAt(i)) | 0;
    }
    return Math.abs(hash);
  }
 
  private setCache(flag: string, userId: string, value: boolean): void {
    this.cache.set(`${flag}:${userId}`, {
      value,
      expiry: Date.now() + this.cacheTTL,
    });
  }
}
 
// Usage
const flags = new FeatureFlagService();
 
async function checkout(cart: Cart, user: UserContext): Promise<Result> {
  if (flags.isEnabled("new-checkout-flow", user)) {
    return newCheckoutFlow(cart);
  }
  return legacyCheckoutFlow(cart);
}

Feature Flag Best Practices

Name flags descriptively — new-checkout-flow not flag-123
Track flag owners — every flag should have a responsible team
Clean up flags — remove flags after full rollout; stale flags accumulate debt
Use percentage rollouts — start at 1%, increase to 5%, 10%, 25%, 50%, 100%
Monitor metrics per flag variant — compare error rates, latency, and conversion between control and treatment groups
Add kill switches — flags that instantly disable a feature without a redeployment
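
Percentage rollouts only work well if a user's assignment is stable as the percentage grows: someone who saw the feature at 5% should still see it at 25%. The sketch below shows that property with deterministic bucketing. It uses a 32-bit FNV-1a hash, swapped in here for illustration rather than taken from the service above, and the function names are hypothetical.

```typescript
// 32-bit FNV-1a hash, returned as an unsigned integer.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// Each (flag, user) pair maps to a fixed bucket from 0 to 99.
function bucketFor(flagName: string, userId: string): number {
  return fnv1a(`${flagName}:${userId}`) % 100;
}

// A user is in the rollout when their bucket is below the current percentage.
function inRollout(flagName: string, userId: string, percentage: number): boolean {
  return bucketFor(flagName, userId) < percentage;
}

const userIds: string[] = [];
for (let i = 0; i < 1000; i++) userIds.push(`user-${i}`);
const at5 = userIds.filter((id) => inRollout("new-checkout-flow", id, 5));
const at25 = userIds.filter((id) => inRollout("new-checkout-flow", id, 25));
```

Every user in `at5` is also in `at25`, because raising the percentage only adds buckets. Including the flag name in the hash input keeps the same users from always being the early adopters for every flag.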

Canary Deployments

Deploy new code to a small percentage of traffic first, compare error rates and latency against the baseline, and roll back automatically if metrics degrade:

# Kubernetes canary deployment with Flagger
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: checkout-service
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-service
  progressDeadlineSeconds: 600
  analysis:
    # Canary analysis schedule
    interval: 30s
    threshold: 5          # max failed checks before rollback
    maxWeight: 50         # max percentage of traffic to canary
    stepWeight: 10        # increment traffic by 10% per step
    stepWeightPromotion: 100
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99         # rollback if success rate drops below 99%
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500        # rollback if p99 latency exceeds 500ms
        interval: 1m
      - name: error-rate    # not a Flagger built-in: needs a templateRef to a MetricTemplate (omitted here)
        thresholdRange:
          max: 1          # rollback if error rate exceeds 1%
        interval: 1m
    webhooks:
      - name: acceptance-test
        type: pre-rollout
        url: http://flagger-loadtester.test/
        timeout: 30s
        metadata:
          type: bash
          cmd: "curl -sd 'test' http://checkout-service-canary.production/health"
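
The interaction between `stepWeight`, `maxWeight`, and `threshold` is easier to follow as a small simulation. The model below is a deliberately simplified assumption about how such an analysis loop behaves, not Flagger's actual implementation: each interval either passes all metric checks and shifts more traffic, or fails and counts toward the rollback threshold.

```typescript
interface CanaryConfig {
  stepWeight: number; // traffic percentage added per successful interval
  maxWeight: number;  // highest canary traffic percentage before promotion
  threshold: number;  // failed checks tolerated before rollback
}

interface CanaryOutcome {
  result: "promoted" | "rolled-back" | "in-progress";
  weights: number[]; // canary weight after each successful step
}

// checks[i] is true when every metric check passed in analysis interval i.
function simulateCanary(config: CanaryConfig, checks: boolean[]): CanaryOutcome {
  let weight = 0;
  let failedChecks = 0;
  const weights: number[] = [];

  for (const passed of checks) {
    if (!passed) {
      failedChecks += 1;
      if (failedChecks >= config.threshold) {
        return { result: "rolled-back", weights };
      }
      continue; // hold the current weight and check again next interval
    }
    if (weight >= config.maxWeight) {
      return { result: "promoted", weights };
    }
    weight = Math.min(weight + config.stepWeight, config.maxWeight);
    weights.push(weight);
  }
  return { result: "in-progress", weights };
}

const canaryConfig: CanaryConfig = { stepWeight: 10, maxWeight: 50, threshold: 5 };
const healthy = simulateCanary(canaryConfig, [true, true, true, true, true, true]);
const failing = simulateCanary(canaryConfig, [true, false, false, false, false, false]);
```

With the values from the manifest, a healthy release walks through 10, 20, 30, 40, and 50 percent before promotion, while five failed checks trigger a rollback with the canary never having served more than 10% of traffic.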

Manual Canary Process

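If you don't use Flagger, you can approximate a canary manually with plain Kubernetes: run two Deployments whose pods share the app: checkout-service label, and front them with a single Service that selects only on that label. Traffic then splits roughly in proportion to replica counts (9:1 here). The manifests below are abbreviated and omit the pod spec: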

# Primary deployment (receives 90% traffic)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service-primary
spec:
  replicas: 9
  selector:
    matchLabels:
      app: checkout-service
      track: primary
  template:
    metadata:
      labels:
        app: checkout-service
        track: primary
        version: v1.2.0
---
# Canary deployment (receives 10% traffic)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: checkout-service-canary
spec:
  replicas: 1
  selector:
    matchLabels:
      app: checkout-service
      track: canary
  template:
    metadata:
      labels:
        app: checkout-service
        track: canary
        version: v1.3.0

Progressive Delivery Strategies

Progressive delivery extends canary deployments with additional rollout strategies that minimize risk:

Blue-Green Deployments

Run two identical environments and switch traffic atomically. This provides instant rollback by routing traffic back to the blue environment:

# Argo Rollouts blue-green strategy
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: checkout-service
spec:
  replicas: 5
  strategy:
    blueGreen:
      activeService: checkout-active
      previewService: checkout-preview
      autoPromotionEnabled: false
      prePromotionAnalysis:
        templates:
          - templateName: smoke-tests
        args:
          - name: service-name
            value: checkout-preview.production.svc.cluster.local
      postPromotionAnalysis:
        templates:
          - templateName: production-metrics
        args:
          - name: service-name
            value: checkout-active.production.svc.cluster.local

Shadow Traffic (Dark Launch)

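Mirror production traffic to a new version without affecting users. Nginx discards the responses to mirrored subrequests, so behavioral differences have to be detected out of band, by comparing the two versions' logs, metrics, or captured responses. For an endpoint with side effects such as checkout, make sure the shadow version stubs out payment charges and other writes (the X-Mirror-Request header below gives it a way to tell):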

# Nginx traffic mirroring
upstream primary {
    server checkout-v1:3000;
}
 
upstream shadow {
    server checkout-v2:3000;
}
 
server {
    location /api/checkout {
        mirror /mirror;
        proxy_pass http://primary;
    }
 
    location = /mirror {
        internal;
        proxy_pass http://shadow$request_uri;
        proxy_set_header X-Mirror-Request "true";
    }
}
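
Once responses from both versions are captured (for example, logged by each service under a shared request ID), the comparison itself can be a small function. This is a sketch under assumptions: `CapturedResponse`, `diffResponses`, and the volatile field names are hypothetical, and values are compared through their JSON serialization, which is sensitive to key order in nested objects.

```typescript
interface CapturedResponse {
  status: number;
  body: Record<string, unknown>;
}

// Fields expected to differ between versions on every request (hypothetical names).
const VOLATILE_FIELDS = ["requestId", "timestamp"];

function diffResponses(primary: CapturedResponse, shadow: CapturedResponse): string[] {
  const diffs: string[] = [];
  if (primary.status !== shadow.status) {
    diffs.push(`status: ${primary.status} != ${shadow.status}`);
  }
  const keys = Object.keys(primary.body).concat(
    Object.keys(shadow.body).filter((k) => !(k in primary.body))
  );
  for (const key of keys) {
    if (VOLATILE_FIELDS.indexOf(key) !== -1) continue;
    if (JSON.stringify(primary.body[key]) !== JSON.stringify(shadow.body[key])) {
      diffs.push(`body.${key}`);
    }
  }
  return diffs;
}

const mismatches = diffResponses(
  { status: 200, body: { total: 42.5, currency: "USD", requestId: "a" } },
  { status: 200, body: { total: 42.49, currency: "USD", requestId: "b" } }
);
// mismatches is ["body.total"]
```

Counting mismatches per field over time turns the shadow launch into a metric you can alert on before any real user is routed to the new version.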

Chaos Engineering

Chaos engineering tests system resilience by injecting controlled failures. Combined with observability, it validates that your monitoring catches problems and your systems degrade gracefully:

import { ChaosMonkey } from "./chaos-engine";
 
const chaos = new ChaosMonkey({
  enabled: process.env.CHAOS_ENABLED === "true",
  rules: [
    {
      name: "payment-latency",
      target: "payment-service",
      type: "latency",
      delayMs: 3000,
      probability: 0.05,    // 5% of requests
      duration: "10m",
    },
    {
      name: "inventory-error",
      target: "inventory-service",
      type: "error",
      statusCode: 503,
      probability: 0.02,    // 2% of requests
      duration: "5m",
    },
    {
      name: "database-connection-pool",
      target: "postgres",
      type: "connection_drain",
      connectionsToDrop: 10,
      probability: 1.0,     // Always active during experiment
      duration: "2m",
    },
  ],
  observability: {
    logExperiments: true,
    metricsPrefix: "chaos",
    alertOnExperimentStart: true,
    slackChannel: "#chaos-engineering",
  },
});
 
// Middleware to inject chaos
app.use(async (req, res, next) => {
  const chaosResult = await chaos.evaluate(req);
  if (chaosResult.shouldInject) {
    logger.warn({
      chaos: true,
      rule: chaosResult.ruleName,
      type: chaosResult.type,
      requestId: req.headers["x-request-id"],
    }, "Chaos injection active");
 
    chaosInjectedTotal.inc({
      rule: chaosResult.ruleName,
      type: chaosResult.type,
    });
 
    if (chaosResult.type === "latency") {
      await delay(chaosResult.delayMs);
    }
    if (chaosResult.type === "error") {
      return res.status(chaosResult.statusCode).json({
        error: "Chaos injection",
        rule: chaosResult.ruleName,
      });
    }
  }
  next();
});
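
The `./chaos-engine` module is not shown above, so here is a minimal, hypothetical sketch of the decision at its core: given the configured rules and the target of a request, decide whether to inject a fault. The random source is a parameter so the behavior can be tested deterministically; none of these names come from a real library.

```typescript
interface ChaosRule {
  name: string;
  target: string;
  type: "latency" | "error";
  probability: number; // 0 to 1, share of matching requests to affect
}

interface ChaosDecision {
  shouldInject: boolean;
  ruleName?: string;
  type?: ChaosRule["type"];
}

function evaluateChaos(
  rules: ChaosRule[],
  target: string,
  enabled: boolean,
  random: () => number = Math.random
): ChaosDecision {
  // The global switch wins: when chaos is disabled, nothing is ever injected.
  if (!enabled) return { shouldInject: false };
  for (const rule of rules) {
    if (rule.target === target && random() < rule.probability) {
      return { shouldInject: true, ruleName: rule.name, type: rule.type };
    }
  }
  return { shouldInject: false };
}

const chaosRules: ChaosRule[] = [
  { name: "payment-latency", target: "payment-service", type: "latency", probability: 0.05 },
];
const hit = evaluateChaos(chaosRules, "payment-service", true, () => 0.01);
const miss = evaluateChaos(chaosRules, "payment-service", true, () => 0.5);
```

Checking the enabled flag first matters operationally: it gives you a single switch that stops every experiment at once if a real incident starts during a test.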

GameDay Playbooks

Run regular GameDays where you simulate failure scenarios and validate your response:

# gameday-checkout-deps.yaml
name: "Checkout Dependency Failure GameDay"
date: "2024-03-15"
participants: ["payments-team", "platform-team", "sre-team"]
hypothesis: "When the payment service becomes unavailable, checkout gracefully degrades and users see a clear error message within 5 seconds"
 
scenarios:
  - name: "Payment service timeout"
    action: "Block all traffic to payment-service for 5 minutes"
    expected:
      - "Error rate increases to 100% for checkout"
      - "p99 latency stays under 5s (timeout)"
      - "Alert fires within 2 minutes"
      - "Runbook executed within 5 minutes"
    metrics_to_watch:
      - "checkout_error_rate"
      - "checkout_p99_latency"
      - "alert_response_time"
 
  - name: "Database connection exhaustion"
    action: "Drop 90% of database connections"
    expected:
      - "Checkout success rate drops to ~10%"
      - "Connection pool recovers within 30s after chaos stops"
      - "No data corruption"

Alerting That Doesn't Cry Wolf

Good alerts are actionable, have clear ownership, and include runbooks. Bad alerts wake you up at 3 AM for a transient blip and train on-call engineers to ignore pages.

# Prometheus alerting rules
groups:
  - name: checkout-alerts
    rules:
      - alert: HighCheckoutErrorRate
        expr: |
          sum(rate(checkout_total{status="failure"}[5m]))
          /
          sum(rate(checkout_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
          team: payments
          service: checkout
        annotations:
          summary: "Checkout error rate above 5% for 5 minutes"
          description: "Error rate is {{ $value | humanizePercentage }}"
          runbook_url: "https://wiki.internal/runbooks/checkout-errors"
          dashboard_url: "https://grafana.internal/d/checkout-overview"
 
      - alert: CheckoutLatencyDegraded
        expr: |
          histogram_quantile(0.99,
            sum(rate(checkout_duration_seconds_bucket[5m])) by (le)
          ) > 2
        for: 10m
        labels:
          severity: warning
          team: payments
          service: checkout
        annotations:
          summary: "Checkout p99 latency above 2 seconds"
 
      - alert: CheckoutLatencyCritical
        expr: |
          histogram_quantile(0.99,
            sum(rate(checkout_duration_seconds_bucket[5m])) by (le)
          ) > 5
        for: 5m
        labels:
          severity: critical
          team: payments
          service: checkout
        annotations:
          summary: "Checkout p99 latency above 5 seconds — user impact likely"

Alert Hygiene Rules

Every alert must be actionable — if the on-call engineer can't take action, the alert shouldn't exist
Every alert must have an owner — assign alerts to teams, not individuals
Every critical alert must have a runbook — document the investigation and mitigation steps
Use for durations — transient spikes shouldn't page anyone; require sustained degradation
Review alert frequency quarterly — remove alerts that page frequently but never require action
Use severity levels consistently — critical for user-facing impact, warning for degradation, info for awareness
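
The effect of a `for` duration can be modeled in a few lines. The sketch below is a simplified model of the pending and firing states, not Prometheus's implementation, and `alertStates` is a hypothetical name: an alert only fires once its condition has held continuously for the configured duration, and any healthy evaluation resets the clock.

```typescript
type AlertState = "inactive" | "pending" | "firing";

// breaches[i] is true when the alert expression exceeded its threshold at evaluation i.
function alertStates(
  breaches: boolean[],
  evalIntervalSec: number,
  forSec: number
): AlertState[] {
  const states: AlertState[] = [];
  let activeSince: number | null = null;
  for (let i = 0; i < breaches.length; i++) {
    const now = i * evalIntervalSec;
    if (!breaches[i]) {
      activeSince = null; // condition cleared, so the timer resets
      states.push("inactive");
      continue;
    }
    if (activeSince === null) activeSince = now;
    states.push(now - activeSince >= forSec ? "firing" : "pending");
  }
  return states;
}

// Evaluated every 60s with a 5 minute hold, like "for: 5m".
const blip = alertStates([true, true, false, true], 60, 300);
const sustained = alertStates([true, true, true, true, true, true, true], 60, 300);
```

The two-minute blip never leaves the pending state, so nobody is paged, while the sustained breach starts firing at the sixth evaluation, five minutes after it began.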

Safe Experimentation

A/B Testing with Observability

Combine feature flags with metrics to run controlled experiments:

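// Experiment metrics need their own instruments: prom-client throws when a label
// (here `variant`) was not declared in labelNames, so checkoutDuration can't be reused.
const experimentEvents = new Counter({
  name: "experiment_events_total",
  help: "Experiment exposures by variant",
  labelNames: ["experiment", "variant"],
  registers: [register],
});
 
const experimentDuration = new Histogram({
  name: "experiment_checkout_duration_seconds",
  help: "Checkout duration by experiment variant",
  labelNames: ["variant", "status"],
  buckets: [0.1, 0.25, 0.5, 1, 2.5, 5, 10],
  registers: [register],
});
 
async function runCheckoutExperiment(
  cart: Cart,
  user: UserContext
): Promise<ExperimentResult> {
  const variant = flags.isEnabled("checkout-experiment-v2", user)
    ? "treatment"
    : "control";
 
  experimentEvents.inc({ experiment: "checkout-v2", variant });
 
  const startTime = Date.now();
  try {
    const result = variant === "treatment"
      ? await newCheckoutFlow(cart)
      : await legacyCheckoutFlow(cart);
 
    experimentDuration.observe(
      { variant, status: "success" },
      (Date.now() - startTime) / 1000
    );
 
    return { variant, success: true, duration: Date.now() - startTime };
  } catch (error) {
    experimentDuration.observe(
      { variant, status: "failure" },
      (Date.now() - startTime) / 1000
    );
    throw error;
  }
}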

Statistical Significance

Don't make decisions on small sample sizes. Use proper statistical methods:

function isExperimentSignificant(
  control: ExperimentMetrics,
  treatment: ExperimentMetrics,
  minSampleSize: number = 1000,
  confidenceLevel: number = 0.95
): SignificanceResult {
  if (control.sampleSize < minSampleSize || treatment.sampleSize < minSampleSize) {
    return { significant: false, reason: "Insufficient sample size" };
  }
 
  const zScore = calculateZScore(control, treatment);
  const pValue = calculatePValue(zScore);
  const significant = pValue < (1 - confidenceLevel);
 
  return {
    significant,
    pValue,
    zScore,
    improvement: (treatment.mean - control.mean) / control.mean,
    confidenceInterval: calculateCI(control, treatment, confidenceLevel),
  };
}
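
The helpers `calculateZScore` and `calculatePValue` are not defined above. One common choice for conversion-style metrics is a two-proportion z-test, sketched below as an assumption about what they might do rather than as the original implementation. The normal CDF uses the Abramowitz and Stegun polynomial approximation of the error function, which is accurate to roughly seven decimal places.

```typescript
interface Proportion {
  successes: number; // e.g. completed checkouts
  total: number;     // e.g. users exposed to the variant
}

// Abramowitz and Stegun formula 7.1.26 approximation of erf(x).
function erf(x: number): number {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t +
      0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

function normalCdf(z: number): number {
  return 0.5 * (1 + erf(z / Math.SQRT2));
}

// Two-sided test of whether the two success rates differ.
function twoProportionZTest(
  control: Proportion,
  treatment: Proportion
): { zScore: number; pValue: number } {
  const p1 = control.successes / control.total;
  const p2 = treatment.successes / treatment.total;
  const pooled =
    (control.successes + treatment.successes) / (control.total + treatment.total);
  const standardError = Math.sqrt(
    pooled * (1 - pooled) * (1 / control.total + 1 / treatment.total)
  );
  if (standardError === 0) return { zScore: 0, pValue: 1 };
  const zScore = (p2 - p1) / standardError;
  const pValue = 2 * (1 - normalCdf(Math.abs(zScore)));
  return { zScore, pValue };
}

// 10.0% vs 13.0% conversion with 1,000 users per variant.
const abTest = twoProportionZTest(
  { successes: 100, total: 1000 },
  { successes: 130, total: 1000 }
);
```

In this example the z-score is about 2.1 and the p-value about 0.036, which clears a 0.05 threshold. Note that checking significance repeatedly while an experiment is still running inflates the false positive rate, so fix the sample size or use a sequential method before looking.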

Common Pitfalls

Pitfall	Impact	Solution
Unstructured log messages	Can't search or aggregate logs effectively	Use structured JSON logging with consistent fields across all services
Too many alerts	Alert fatigue — real incidents get ignored in noise	Every alert must be actionable with a clear owner; review quarterly
No feature flags	Big-bang releases with high risk of outages	Decouple deployment from release; use progressive rollouts
Missing traces	Can't debug cross-service issues	Instrument all service boundaries with OpenTelemetry
Staging-only testing	Bugs that only appear under real traffic	Use canary deployments with automatic rollback and production observability
Logging sensitive data	Privacy violations, compliance failures	Redact PII, tokens, and credentials automatically in your logging configuration
No alerting runbooks	Slow incident response	Document investigation and mitigation steps for every critical alert
Metrics without context	Hard to correlate signals	Put trace IDs in logs and attach them to metrics as exemplars; trace IDs as metric labels would explode cardinality
No SLOs defined	No objective way to measure service health	Define SLOs based on user experience, not infrastructure metrics
Ignoring error budgets	Reckless deployments during degraded periods	Use error budgets to gate deployment velocity

Best Practices

Instrument at service boundaries — log and trace every incoming request, outgoing dependency call, and database query
Use the RED method — Rate, Errors, and Duration for every service endpoint
Correlate signals — include trace IDs in logs and attach them to metrics as exemplars (not labels, which would explode cardinality) so you can jump between signals
Deploy with feature flags — never ship code without a kill switch
Use canary deployments — limit blast radius to 5-10% of traffic during initial rollout
Write runbooks before incidents — document investigation steps for every critical alert
Set SLOs, not just SLIs — define target error rates and latency thresholds; alert on budget burn rate
Run chaos experiments regularly — validate that your systems degrade gracefully under failure
Automate rollback — human judgment is slow; let metrics drive rollback decisions
Review and iterate — observability is a practice, not a one-time setup; refine instrumentation based on incidents

Conclusion

Observability-Driven Development shifts the question from "does this work in staging?" to "how do I know this works in production?" The answer is instrumentation, progressive rollout, and automated safety nets. Structured logs let you debug issues you didn't predict. Metrics tell you when behavior changes. Traces show you where time is spent across service boundaries. Feature flags let you deploy continuously without releasing recklessly. Canary deployments catch regressions before they reach all users. SLOs and error budgets provide objective guardrails for deployment velocity. Chaos engineering validates that your safety nets actually work when failures occur.

Together, these practices create a feedback loop where production telemetry informs development decisions, deployments are safe by default, and incidents are detected and, ideally, resolved before most users notice. This is the foundation of modern SRE and the key to shipping fast without breaking things.

Minh Vo
