Cloud-Native Observability: OpenTelemetry Collector

Deploy OTel Collector: receivers, processors, exporters, and pipeline configuration.

Tags: OpenTelemetry, Observability, Cloud-Native, DevOps

By MinhVo

Introduction

Observability is the cornerstone of operating reliable cloud-native systems. When your application consists of dozens of microservices running across multiple clusters and regions, understanding what's happening inside the system requires more than just logs. You need a unified approach to collecting, processing, and exporting telemetry data—traces, metrics, and logs—from every component in your stack.

OpenTelemetry (OTel) has emerged as the industry standard for instrumenting cloud-native applications. It provides vendor-neutral APIs, SDKs, and tools for generating and collecting telemetry data. At the center of the OpenTelemetry architecture sits the OpenTelemetry Collector, a proxy that receives telemetry data from your applications, processes it (filtering, sampling, enriching), and exports it to one or more observability backends like Jaeger, Prometheus, Datadog, or Grafana.

[Figure: Observability architecture]

The Collector decouples your applications from your observability backend. Applications send telemetry to the Collector using a standard protocol, and the Collector handles routing, transformation, and delivery. This architecture provides flexibility to switch backends without re-instrumenting your applications, and allows centralized control over telemetry data processing.

Understanding OpenTelemetry: Core Concepts

OpenTelemetry works with three primary signals, often called the three pillars of observability: traces (distributed request flows across services), metrics (numerical measurements of system behavior over time), and logs (discrete events with timestamps and severity levels). The Collector can handle all three signal types through a unified pipeline architecture.

The Collector itself consists of three components: receivers accept telemetry data in various formats (OTLP, Jaeger, Zipkin, Prometheus), processors transform the data (batching, filtering, sampling, attribute enrichment), and exporters send the data to backends. These components are connected through pipelines that define the flow of data from receivers through processors to exporters.

[Figure: OpenTelemetry pipeline architecture]

The Collector supports two deployment modes. The agent mode runs as a sidecar or daemonset alongside your applications, collecting telemetry locally and forwarding it to a central Collector. The gateway mode runs as a standalone service that receives telemetry from multiple agents, performs centralized processing, and exports to backends. Most production deployments use both: agents for local collection and a gateway for centralized processing.

The OpenTelemetry Protocol (OTLP) is the native protocol for transmitting telemetry data. It uses Protocol Buffers over gRPC or HTTP and supports traces, metrics, and logs. OTLP is designed for efficiency—batching, compression, and streaming reduce network overhead while maintaining low latency for real-time observability.
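
To make the protocol less abstract, here is a minimal sketch of the JSON form of an OTLP trace export request, which the OTLP/HTTP receiver accepts alongside the Protocol Buffers encoding. Only a small subset of fields is modeled, and the helper name `buildTraceRequest`, the scope name, and the IDs are assumptions made up for illustration.

```typescript
// Minimal shape of an OTLP/HTTP JSON trace export request.
// Only a subset of the fields defined by the protocol is modeled here.
interface OtlpSpan {
  traceId: string;           // 32 hex characters
  spanId: string;            // 16 hex characters
  name: string;
  kind: number;              // 2 = SPAN_KIND_SERVER
  startTimeUnixNano: string; // 64-bit integers are encoded as strings in JSON
  endTimeUnixNano: string;
}

interface OtlpTraceRequest {
  resourceSpans: {
    resource: { attributes: { key: string; value: { stringValue: string } }[] };
    scopeSpans: { scope: { name: string }; spans: OtlpSpan[] }[];
  }[];
}

// Hypothetical helper: wraps the spans of one service into a request body.
function buildTraceRequest(serviceName: string, spans: OtlpSpan[]): OtlpTraceRequest {
  return {
    resourceSpans: [
      {
        resource: {
          attributes: [{ key: 'service.name', value: { stringValue: serviceName } }],
        },
        scopeSpans: [{ scope: { name: 'manual-example' }, spans }],
      },
    ],
  };
}

const exampleRequest = buildTraceRequest('my-api-service', [
  {
    traceId: '5b8efff798038103d269b633813fc60c',
    spanId: 'eee19b7ec3c1b174',
    name: 'GET /orders',
    kind: 2,
    startTimeUnixNano: '1700000000000000000',
    endTimeUnixNano: '1700000000250000000',
  },
]);

// The body would be POSTed to the Collector's OTLP/HTTP receiver, for example
// http://localhost:4318/v1/traces with Content-Type: application/json.
const exampleBody: string = JSON.stringify(exampleRequest);
```

In practice the SDK exporters build and send these requests for you; writing one by hand is mainly useful for smoke-testing a receiver.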

Architecture and Design Patterns

Pipeline Architecture

The Collector's pipeline architecture is its most powerful feature. A pipeline connects one or more receivers to one or more exporters through a series of processors. You can define multiple pipelines for different signal types (traces, metrics, logs) and route data through different processing paths based on attributes.

# otel-collector-config.yaml
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  prometheus:
    config:
      scrape_configs:
        - job_name: 'my-app'
          scrape_interval: 15s
          static_configs:
            - targets: ['localhost:8080']
 
processors:
  batch:
    timeout: 5s
    send_batch_size: 1000
  memory_limiter:
    check_interval: 1s
    limit_mib: 512
    spike_limit_mib: 128
  attributes:
    actions:
      - key: environment
        value: production
        action: upsert
 
exporters:
  otlp/jaeger:
    endpoint: jaeger:4317
    tls:
      insecure: false
  prometheus:
    endpoint: 0.0.0.0:8889
  loki:
    endpoint: http://loki:3100/loki/api/v1/push
 
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, attributes, batch]
      exporters: [otlp/jaeger]
    metrics:
      receivers: [otlp, prometheus]
      processors: [memory_limiter, batch]
      exporters: [prometheus]
    logs:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [loki]
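
A common mistake in configurations like this is a pipeline that references a component ID that was never defined. The sketch below shows the idea of that check on a config that has already been parsed into an object. The type names and `findUndefinedComponents` are hypothetical and are not part of the Collector, which performs its own validation at startup.

```typescript
type ComponentMap = Record<string, unknown>;

interface Pipeline {
  receivers: string[];
  processors?: string[];
  exporters: string[];
}

interface CollectorConfig {
  receivers: ComponentMap;
  processors: ComponentMap;
  exporters: ComponentMap;
  service: { pipelines: Record<string, Pipeline> };
}

// Returns one message per pipeline reference that has no matching definition.
function findUndefinedComponents(config: CollectorConfig): string[] {
  const problems: string[] = [];
  for (const name of Object.keys(config.service.pipelines)) {
    const pipeline = config.service.pipelines[name];
    const check = (kind: 'receivers' | 'processors' | 'exporters', ids: string[]): void => {
      for (const id of ids) {
        if (!(id in config[kind])) {
          problems.push(`pipeline "${name}": ${kind} "${id}" is not defined`);
        }
      }
    };
    check('receivers', pipeline.receivers);
    check('processors', pipeline.processors ?? []);
    check('exporters', pipeline.exporters);
  }
  return problems;
}

// The logs pipeline refers to a "loki" exporter that was never declared.
const sampleConfig: CollectorConfig = {
  receivers: { otlp: {} },
  processors: { memory_limiter: {}, batch: {} },
  exporters: { 'otlp/jaeger': {} },
  service: {
    pipelines: {
      traces: {
        receivers: ['otlp'],
        processors: ['memory_limiter', 'batch'],
        exporters: ['otlp/jaeger'],
      },
      logs: { receivers: ['otlp'], processors: ['batch'], exporters: ['loki'] },
    },
  },
};

const configProblems = findUndefinedComponents(sampleConfig);
// configProblems: ['pipeline "logs": exporters "loki" is not defined']
```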

Sampling Strategies

Sampling is critical for controlling the volume of telemetry data in production. The Collector supports head-based sampling (deciding at the start of a trace) and tail-based sampling (deciding after the trace completes, keeping interesting traces like errors or slow requests).

processors:
  tail_sampling:
    decision_wait: 10s
    num_traces: 100000
    policies:
      - name: error-policy
        type: status_code
        status_code:
          status_codes: [ERROR]
      - name: slow-traces-policy
        type: latency
        latency:
          threshold_ms: 1000
      - name: probabilistic-policy
        type: probabilistic
        probabilistic:
          sampling_percentage: 10
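
The three policies above amount to "keep the trace if any policy says so". The following sketch illustrates that decision on a completed trace. It is not the processor's actual implementation: the types, the function names, and the way the trace ID is mapped to a percentage are assumptions chosen to keep the example deterministic.

```typescript
interface SpanSummary {
  status: 'OK' | 'ERROR' | 'UNSET';
  startMs: number;
  endMs: number;
}

interface TraceSummary {
  traceId: string; // hex string
  spans: SpanSummary[];
}

// Trace duration: earliest start to latest end across all spans.
function traceDurationMs(trace: TraceSummary): number {
  const start = Math.min(...trace.spans.map((s) => s.startMs));
  const end = Math.max(...trace.spans.map((s) => s.endMs));
  return end - start;
}

// Deterministic stand-in for probabilistic sampling: map the last 8 hex
// digits of the trace ID to a number in [0, 100).
function traceIdPercentile(traceId: string): number {
  const tail = parseInt(traceId.slice(-8), 16);
  return (tail / 0x100000000) * 100;
}

function shouldKeep(trace: TraceSummary, thresholdMs = 1000, samplingPercentage = 10): boolean {
  if (trace.spans.some((s) => s.status === 'ERROR')) return true; // error-policy
  if (traceDurationMs(trace) > thresholdMs) return true;          // slow-traces-policy
  return traceIdPercentile(trace.traceId) < samplingPercentage;   // probabilistic-policy
}

const fastHealthy: TraceSummary = {
  traceId: '5b8efff798038103d269b633ffffffff',
  spans: [{ status: 'OK', startMs: 0, endMs: 120 }],
};
const fastWithError: TraceSummary = {
  traceId: '5b8efff798038103d269b633ffffffff',
  spans: [
    { status: 'OK', startMs: 0, endMs: 120 },
    { status: 'ERROR', startMs: 10, endMs: 60 },
  ],
};
const slowHealthy: TraceSummary = {
  traceId: '5b8efff798038103d269b633ffffffff',
  spans: [{ status: 'OK', startMs: 0, endMs: 1500 }],
};
```

With these inputs, `shouldKeep` keeps `fastWithError` and `slowHealthy`, and drops `fastHealthy` because its trace ID maps above the 10% cut-off.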

Agent-Gateway Pattern

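The agent-gateway pattern deploys lightweight Collector instances as agents (sidecars or daemonsets) that forward telemetry to a centralized gateway Collector. The gateway handles heavy processing like tail-based sampling, which requires seeing all spans from a trace before making a decision. If the gateway runs more than one replica, all spans of a trace must reach the same instance, for example by routing on trace ID with the contrib loadbalancing exporter.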

# Agent config (lightweight, deployed as sidecar)
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
 
processors:
  batch:
    timeout: 1s
    send_batch_size: 500
  memory_limiter:
    check_interval: 1s   # required; the processor rejects a zero interval
    limit_mib: 128
 
exporters:
  otlp:
    endpoint: otel-gateway:4317
    tls:
      insecure: false
 
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [memory_limiter, batch]
      exporters: [otlp]
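
When several gateway instances sit behind the agents, tail-based sampling only works if every span of a trace lands on the same instance. A minimal sketch of trace-ID-based routing is shown below. It uses a plain hash-and-modulo scheme for brevity, whereas production load balancers typically use consistent hashing so that scaling the gateway moves fewer traces. The function names and gateway addresses are hypothetical.

```typescript
// FNV-1a 32-bit hash over the characters of a string.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep the value an unsigned 32-bit integer
  }
  return hash;
}

// Picks the gateway instance responsible for a trace.
function pickGateway(traceId: string, gateways: string[]): string {
  if (gateways.length === 0) throw new Error('no gateway instances configured');
  return gateways[fnv1a(traceId) % gateways.length];
}

// Hypothetical gateway addresses.
const gatewayInstances = ['otel-gateway-0:4317', 'otel-gateway-1:4317', 'otel-gateway-2:4317'];

// Two spans of the same trace are always sent to the same instance.
const firstSpanTarget = pickGateway('5b8efff798038103d269b633813fc60c', gatewayInstances);
const secondSpanTarget = pickGateway('5b8efff798038103d269b633813fc60c', gatewayInstances);
```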

[Figure: Collector deployment pattern]

Step-by-Step Implementation

Instrumenting a Node.js Application

// tracing.ts - Initialize OpenTelemetry before any other imports
import { NodeSDK } from '@opentelemetry/sdk-node'
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-grpc'
import { OTLPMetricExporter } from '@opentelemetry/exporter-metrics-otlp-grpc'
import { PeriodicExportingMetricReader } from '@opentelemetry/sdk-metrics'
import { Resource } from '@opentelemetry/resources'
import { SemanticResourceAttributes } from '@opentelemetry/semantic-conventions'
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node'
 
const sdk = new NodeSDK({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'my-api-service',
    [SemanticResourceAttributes.SERVICE_VERSION]: '1.0.0',
    [SemanticResourceAttributes.DEPLOYMENT_ENVIRONMENT]: process.env.NODE_ENV
  }),
  traceExporter: new OTLPTraceExporter({
    url: process.env.OTEL_COLLECTOR_URL || 'http://localhost:4317'
  }),
  metricReader: new PeriodicExportingMetricReader({
    exporter: new OTLPMetricExporter({
      url: process.env.OTEL_COLLECTOR_URL || 'http://localhost:4317'
    }),
    exportIntervalMillis: 15000
  }),
  instrumentations: [getNodeAutoInstrumentations()]
})
 
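sdk.start()

// Flush buffered telemetry before the process exits
process.on('SIGTERM', () => {
  sdk.shutdown().finally(() => process.exit(0))
})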

Deploying the Collector on Kubernetes

# collector-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: otel-collector
  namespace: observability
spec:
  replicas: 3
  selector:
    matchLabels:
      app: otel-collector
  template:
    metadata:
      labels:
        app: otel-collector
    spec:
      containers:
        - name: collector
          image: otel/opentelemetry-collector-contrib:0.96.0
          args: ["--config=/etc/otel-collector-config.yaml"]
          ports:
            - containerPort: 4317 # OTLP gRPC
            - containerPort: 4318 # OTLP HTTP
            - containerPort: 8889 # Prometheus exporter
          resources:
            requests:
              memory: "256Mi"
              cpu: "250m"
            limits:
              memory: "512Mi"
              cpu: "500m"
          volumeMounts:
            - name: config
              mountPath: /etc/otel-collector-config.yaml
              subPath: otel-collector-config.yaml
      volumes:
        - name: config
          configMap:
            name: otel-collector-config

Custom Span Attributes

// app.ts - Adding custom attributes to traces
// fetchOrder and chargePayment are application helpers defined elsewhere
import { trace, SpanStatusCode } from '@opentelemetry/api'
 
const tracer = trace.getTracer('my-api-service')
 
async function processOrder(orderId: string) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    try {
      span.setAttribute('order.id', orderId)
      span.setAttribute('order.priority', 'high')
 
      const order = await fetchOrder(orderId)
      span.setAttribute('order.total', order.total)
      span.setAttribute('order.item_count', order.items.length)
 
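      // Create child span for payment processing; the active context
      // makes it a child of processOrder
      const paymentSpan = tracer.startSpan('processPayment')
      try {
        paymentSpan.setAttribute('payment.method', order.paymentMethod)
        await chargePayment(order)
      } finally {
        // End the child span even if the charge throws
        paymentSpan.end()
      }
 
      span.setStatus({ code: SpanStatusCode.OK })
      return order
    } catch (error) {
      // `error` is `unknown` in strict TypeScript, so narrow it first
      const err = error instanceof Error ? error : new Error(String(error))
      span.setStatus({
        code: SpanStatusCode.ERROR,
        message: err.message
      })
      span.recordException(err)
      throw error
    } finally {
      span.end()
    }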
  })
}

Real-World Use Cases

Microservices Debugging

When a user reports a slow request in a microservices architecture, distributed tracing through the Collector allows you to visualize the entire request flow across services. Each service adds its span to the trace, and the Collector correlates them using trace context propagation. This makes it possible to identify the exact service and operation causing the slowdown.
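
Trace context propagation usually travels in the W3C `traceparent` HTTP header, which carries a version, the trace ID, the caller's span ID, and trace flags. The sketch below parses the version `00` format and builds the header a service would send downstream. The function and type names are hypothetical, and in practice the OpenTelemetry SDK propagators handle this for you.

```typescript
interface TraceParent {
  version: string;
  traceId: string;  // 32 hex characters, shared by every span in the trace
  parentId: string; // 16 hex characters, the caller's span ID
  sampled: boolean;
}

const TRACEPARENT_PATTERN = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceParent(header: string): TraceParent | null {
  const match = TRACEPARENT_PATTERN.exec(header.trim());
  if (!match) return null;
  const [, version, traceId, parentId, flags] = match;
  // Version ff and all-zero IDs are invalid per the specification.
  if (version === 'ff' || /^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// Header a service sends downstream: same trace ID, its own span ID as parent.
function childTraceParent(incoming: TraceParent, ownSpanId: string): string {
  const flags = incoming.sampled ? '01' : '00';
  return `00-${incoming.traceId}-${ownSpanId}-${flags}`;
}

const incomingContext = parseTraceParent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
const outgoingHeader = incomingContext
  ? childTraceParent(incomingContext, 'b7ad6b7169203331')
  : null;
// outgoingHeader: '00-4bf92f3577b34da6a3ce929d0e0e4736-b7ad6b7169203331-01'
```

Because the trace ID is unchanged from hop to hop, the backend can stitch the spans from every service into one trace.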

Cost Optimization

The Collector's sampling and filtering capabilities help control observability costs. With tail-based sampling that keeps every error trace and slow trace and samples healthy traffic at a low rate such as 1%, stored trace volume can fall by 90% or more, depending on how much of the traffic is erroring or slow, while the traces most useful for debugging are retained.
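
The size of the saving follows from simple arithmetic: every error and slow trace is kept, and only the remaining healthy traffic is sampled. Here is a small sketch of that estimate. The traffic profile is an invented example, not a measurement, and the names are hypothetical.

```typescript
interface TrafficProfile {
  errorFraction: number;    // share of all traces that contain an error
  slowFraction: number;     // share of all traces that are slow but error-free
  baselineSampling: number; // probability of keeping a healthy trace
}

// Fraction of traces retained when errors and slow traces are always kept.
function retainedFraction(p: TrafficProfile): number {
  const healthy = 1 - p.errorFraction - p.slowFraction;
  return p.errorFraction + p.slowFraction + healthy * p.baselineSampling;
}

// Assumed profile: 2% errors, 3% slow, healthy traffic sampled at 1%.
const retainedShare = retainedFraction({
  errorFraction: 0.02,
  slowFraction: 0.03,
  baselineSampling: 0.01,
});
// 0.02 + 0.03 + 0.95 * 0.01 = 0.0595, so roughly 6% of traces are stored
```

Under these assumptions about 94% of trace volume is dropped. A service with a higher error rate would retain proportionally more.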

Multi-Backend Migration

When migrating from one observability vendor to another, the Collector's multi-exporter capability allows you to send telemetry to both the old and new backend simultaneously. This enables gradual migration without losing visibility during the transition.

Compliance and Data Privacy

The Collector's attributes processor can delete or hash sensitive attributes (PII, credentials, tokens) before telemetry reaches the backend, and the contrib distribution also ships a redaction processor for masking values that match blocked patterns. This supports compliance with GDPR, HIPAA, and other data privacy regulations.
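
Conceptually, redaction combines two operations: dropping attributes whose keys are known to be sensitive, and masking sensitive values inside attributes that are otherwise worth keeping. The sketch below illustrates both on a plain attribute map. The key list, the email pattern, and the function name are assumptions for illustration and do not mirror any processor's configuration.

```typescript
type Attributes = Record<string, string | number | boolean>;

// Hypothetical deny-list of attribute keys that must never leave the Collector.
const BLOCKED_KEYS = ['http.request.header.authorization', 'http.request.header.cookie'];

// Deliberately simple email matcher, good enough for a sketch.
const EMAIL_PATTERN = /[^\s@]+@[^\s@]+\.[^\s@]+/g;

function redactAttributes(attrs: Attributes): Attributes {
  const result: Attributes = {};
  for (const key of Object.keys(attrs)) {
    if (BLOCKED_KEYS.indexOf(key) !== -1) continue; // drop the attribute entirely
    const value = attrs[key];
    // Mask email addresses inside string values; leave other types untouched.
    result[key] = typeof value === 'string' ? value.replace(EMAIL_PATTERN, '[REDACTED]') : value;
  }
  return result;
}

const redactedAttributes = redactAttributes({
  'http.request.header.authorization': 'Bearer abc123',
  'user.note': 'contact jane@example.com now',
  'http.status_code': 200,
});
// redactedAttributes: { 'user.note': 'contact [REDACTED] now', 'http.status_code': 200 }
```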

Best Practices for Production

  1. Set memory limits: Always configure the memory_limiter processor to prevent the Collector from consuming unbounded memory. Set the limit based on your available resources and expected telemetry volume.

  2. Use batching: The batch processor reduces the number of export calls and improves throughput. Configure appropriate batch sizes and timeouts based on your latency requirements and backend capabilities.

  3. Deploy as a daemonset on Kubernetes: Running the Collector as a daemonset ensures one instance per node, reducing network hops for telemetry collection. Use the Kubernetes attributes processor to enrich telemetry with pod and node metadata.

  4. Monitor the Collector itself: The Collector exposes its own metrics on port 8888. Monitor queue lengths, dropped spans, memory usage, and export latency to detect issues before they impact observability.

  5. Use tail-based sampling for production: Head-based sampling decides before the outcome of a request is known, so it can discard traces that later turn out to contain errors or high latency. Tail-based sampling keeps error traces and slow traces while sampling healthy traffic, at the cost of buffering spans in memory until the decision is made.

  6. Implement retry logic: Configure retry settings on exporters to handle transient failures. Most exporters support a retry_on_failure block with exponential backoff, usually paired with a sending_queue that buffers data while the backend is unavailable.

  7. Separate pipelines by signal type: Use different pipelines for traces, metrics, and logs. This allows independent processing, sampling, and routing for each signal type.

  8. Use the Collector Contrib distribution: The contrib distribution includes additional receivers, processors, and exporters not in the core distribution. Use it for production deployments to access community-contributed components.
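
To get a feel for what exponential backoff means for an exporter, the sketch below computes the sequence of waits between retries until a total time budget is exhausted. It is a simplified model without the random jitter that real implementations add, and the setting names are hypothetical camelCase stand-ins rather than actual Collector configuration keys.

```typescript
interface RetrySettings {
  initialIntervalMs: number; // wait before the first retry
  multiplier: number;        // growth factor applied after each attempt
  maxIntervalMs: number;     // upper bound for a single wait
  maxElapsedMs: number;      // total time budget before the data is dropped
}

// Returns the wait before each retry until the elapsed-time budget is spent.
function backoffSchedule(s: RetrySettings): number[] {
  if (s.initialIntervalMs <= 0) throw new Error('initialIntervalMs must be positive');
  const delays: number[] = [];
  let next = s.initialIntervalMs;
  let elapsed = 0;
  while (elapsed + next <= s.maxElapsedMs) {
    delays.push(next);
    elapsed += next;
    next = Math.min(next * s.multiplier, s.maxIntervalMs);
  }
  return delays;
}

const retrySchedule = backoffSchedule({
  initialIntervalMs: 5000,
  multiplier: 2,
  maxIntervalMs: 30000,
  maxElapsedMs: 60000,
});
// retrySchedule: [5000, 10000, 20000]
```

Here the fourth wait would be capped at 30 seconds, but it would push the total past the 60 second budget, so the batch is given up after three retries. A sending queue needs to be sized to hold data for at least that long.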

Common Pitfalls and Solutions

| Pitfall | Impact | Solution |
| --- | --- | --- |
| No memory limits on Collector | OOM crashes, data loss | Configure memory_limiter processor with appropriate limits |
| Missing tail-based sampling | High storage costs, noise | Implement tail_sampling with error and latency policies |
| No batching | High network overhead, rate limiting | Configure batch processor with appropriate size and timeout |
| Sending telemetry directly to backend | Vendor lock-in, no processing flexibility | Use Collector as intermediary |
| Not monitoring Collector health | Silent data loss | Export Collector's own metrics to monitoring system |
| Using head sampling for distributed traces | Losing important traces | Use tail-based sampling to keep errors and slow traces |
| Not configuring retry on exporters | Data loss during transient failures | Configure retry_on_failure with exponential backoff |
| Ignoring context propagation | Broken traces across services | Use W3C TraceContext propagation headers |

Performance Optimization

The Collector's performance depends on its configuration and deployment topology. For high-throughput environments, deploy multiple Collector instances behind a load balancer and use the batch processor to reduce export overhead.

# High-throughput Collector configuration
processors:
  batch:
    timeout: 2s
    send_batch_size: 5000
    send_batch_max_size: 10000
  memory_limiter:
    check_interval: 500ms
    limit_mib: 1024
    spike_limit_mib: 256
  # Delete sensitive or high-cardinality attributes. The filter processor
  # would drop whole spans, not individual attributes.
  attributes/drop-headers:
    actions:
      - key: http.request.header.authorization
        action: delete
      - key: http.request.header.cookie
        action: delete

Use the attributes processor to delete noisy or high-cardinality attributes that consume memory without providing debugging value, and reserve the filter processor for dropping entire spans, metrics, or log records that are pure noise, such as health checks. Configure the k8sattributes processor to enrich telemetry with Kubernetes metadata, reducing the need for application-level attribute instrumentation.
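
The effect of the batch processor's size setting can be sketched in a few lines: items accumulate in a buffer and are exported together once the buffer reaches the configured size, with a timer flushing whatever is left. This is an illustrative model, not the Collector's implementation. The class and its names are hypothetical, and the timer is replaced by an explicit `flush()` call to keep the example deterministic.

```typescript
class Batcher<T> {
  private buffer: T[] = [];
  private readonly sendBatchSize: number;
  private readonly exportBatch: (items: T[]) => void;

  constructor(sendBatchSize: number, exportBatch: (items: T[]) => void) {
    this.sendBatchSize = sendBatchSize;
    this.exportBatch = exportBatch;
  }

  add(item: T): void {
    this.buffer.push(item);
    if (this.buffer.length >= this.sendBatchSize) this.flush();
  }

  // Called when the size threshold is hit; a real implementation also
  // calls this from a timer so small batches are not held indefinitely.
  flush(): void {
    if (this.buffer.length === 0) return;
    const batch = this.buffer;
    this.buffer = [];
    this.exportBatch(batch);
  }
}

const exportedBatchSizes: number[] = [];
const spanBatcher = new Batcher<number>(3, (items) => {
  exportedBatchSizes.push(items.length);
});

for (let i = 0; i < 7; i++) spanBatcher.add(i);
spanBatcher.flush(); // stands in for the timeout firing
// exportedBatchSizes: [3, 3, 1]
```

Seven items produce three export calls instead of seven, which is where the throughput gain comes from. Larger batch sizes reduce calls further but hold more data in memory and delay delivery.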

Comparison with Alternatives

| Feature | OTel Collector | Jaeger Collector | Prometheus Agent | Fluentd |
| --- | --- | --- | --- | --- |
| Signal Types | Traces, Metrics, Logs | Traces only | Metrics only | Logs primarily |
| Protocol Support | OTLP, Jaeger, Zipkin, Prometheus, +50 more | Jaeger, Zipkin | Prometheus remote write | 50+ input plugins |
| Processing | Receivers, Processors, Exporters | Storage adapters | Relabel, scrape | Filters, transforms |
| Vendor Neutral | Yes | Jaeger-specific | Prometheus-specific | Yes |
| Deployment | Agent or Gateway | Gateway | Agent | Agent or Aggregator |
| Tail Sampling | Yes | No | No | No |
| Custom Extensions | Yes (Go components) | Limited | No | Yes (Ruby plugins) |

Advanced Patterns

Multi-Tenant Routing

Route telemetry from different tenants to different backends using the routing processor.

processors:
  routing:
    # Read the tenant from a resource attribute named "tenant"
    from_attribute: tenant
    attribute_source: resource
    table:
      - value: tenant-a
        exporters: [otlp/tenant-a-backend]
      - value: tenant-b
        exporters: [otlp/tenant-b-backend]
    default_exporters: [otlp/default-backend]
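
The routing decision itself is a table lookup with a fallback. A minimal sketch of that logic follows, assuming the tenant is carried in a resource attribute. The function name and types are hypothetical and only model the behavior described above.

```typescript
interface Route {
  value: string;       // attribute value to match
  exporters: string[]; // exporters that receive matching telemetry
}

function selectExporters(
  resourceAttributes: Record<string, string>,
  fromAttribute: string,
  table: Route[],
  defaultExporters: string[],
): string[] {
  const actual = resourceAttributes[fromAttribute];
  for (const route of table) {
    if (route.value === actual) return route.exporters;
  }
  // No match, or the attribute is missing: fall back to the defaults.
  return defaultExporters;
}

const routingTable: Route[] = [
  { value: 'tenant-a', exporters: ['otlp/tenant-a-backend'] },
  { value: 'tenant-b', exporters: ['otlp/tenant-b-backend'] },
];

const tenantARoute = selectExporters({ tenant: 'tenant-a' }, 'tenant', routingTable, ['otlp/default-backend']);
const unknownRoute = selectExporters({ tenant: 'tenant-z' }, 'tenant', routingTable, ['otlp/default-backend']);
const missingRoute = selectExporters({}, 'tenant', routingTable, ['otlp/default-backend']);
```

Both the unknown tenant and the missing attribute end up at the default backend, so a default that is safe to receive unattributed data matters in a multi-tenant setup.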

Log-to-Trace Correlation

Connect logs to their corresponding traces using the transform processor to add trace context to log records.

processors:
  transform:
    log_statements:
      - context: log
        statements:
          - set(attributes["trace_id"], trace_id.string)
          - set(attributes["span_id"], span_id.string)

Metrics from Traces

Generate RED (Rate, Error, Duration) metrics from trace data using the spanmetrics connector.

connectors:
  spanmetrics:
    histogram:
      explicit:
        buckets: [5ms, 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 5s]
    dimensions:
      - name: http.method
      - name: http.status_code
 
service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [spanmetrics, otlp/jaeger]
    metrics:
      receivers: [spanmetrics]
      exporters: [prometheus]
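
What the connector derives from spans can be illustrated with a small aggregation: a call counter, an error counter, and a duration histogram per service and operation. The sketch below uses the same bucket bounds as the configuration above, expressed in milliseconds. The types and function name are hypothetical, and the real connector emits proper metric data points with the configured dimensions.

```typescript
interface FinishedSpan {
  service: string;
  name: string;
  durationMs: number;
  isError: boolean;
}

interface RedMetrics {
  calls: number;          // Rate: divide by the observation window
  errors: number;         // Errors
  bucketCounts: number[]; // Duration: one count per bound plus an overflow bucket
}

const BUCKET_BOUNDS_MS = [5, 10, 25, 50, 100, 250, 500, 1000, 5000];

function aggregateRed(spans: FinishedSpan[]): Record<string, RedMetrics> {
  const result: Record<string, RedMetrics> = {};
  for (const span of spans) {
    const key = `${span.service}/${span.name}`;
    if (!result[key]) {
      result[key] = { calls: 0, errors: 0, bucketCounts: BUCKET_BOUNDS_MS.map(() => 0).concat(0) };
    }
    const metrics = result[key];
    metrics.calls += 1;
    if (span.isError) metrics.errors += 1;
    // First bucket whose upper bound covers the duration, else the overflow bucket.
    let index = BUCKET_BOUNDS_MS.length;
    for (let i = 0; i < BUCKET_BOUNDS_MS.length; i++) {
      if (span.durationMs <= BUCKET_BOUNDS_MS[i]) {
        index = i;
        break;
      }
    }
    metrics.bucketCounts[index] += 1;
  }
  return result;
}

const redByOperation = aggregateRed([
  { service: 'api', name: 'GET /orders', durationMs: 8, isError: false },
  { service: 'api', name: 'GET /orders', durationMs: 120, isError: true },
  { service: 'api', name: 'GET /orders', durationMs: 9000, isError: false },
]);
// redByOperation['api/GET /orders']: calls 3, errors 1,
// one span each in the 10 ms bucket, the 250 ms bucket, and the overflow bucket
```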

Testing Strategies

Test your Collector configuration in a staging environment before deploying to production. Use the debug exporter to verify that telemetry is being processed correctly.

# Test configuration with debug exporter
exporters:
  debug:
    verbosity: detailed
    sampling_initial: 10
    sampling_thereafter: 100
 
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [debug]

Validate sampling policies by sending known test traces and verifying that the correct traces are retained. Test memory limits by generating high-cardinality telemetry and verifying that the Collector handles backpressure gracefully.
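
When testing memory limits, it helps to know which thresholds you expect to hit. The memory_limiter derives a soft limit by subtracting spike_limit_mib from limit_mib. Above the soft limit it starts refusing data, and above the hard limit it additionally forces garbage collection. The sketch below models those thresholds. The names are hypothetical, and the exact boundary comparisons are an assumption.

```typescript
interface MemoryLimiterSettings {
  limitMiB: number;      // hard limit (limit_mib)
  spikeLimitMiB: number; // expected spike between checks (spike_limit_mib)
}

type LimiterState = 'accept' | 'refuse' | 'refuse_and_force_gc';

function limiterState(usedMiB: number, s: MemoryLimiterSettings): LimiterState {
  const softLimitMiB = s.limitMiB - s.spikeLimitMiB;
  if (usedMiB >= s.limitMiB) return 'refuse_and_force_gc';
  if (usedMiB >= softLimitMiB) return 'refuse';
  return 'accept';
}

// Settings from the first configuration in this article: 512 MiB limit, 128 MiB spike.
const limiterSettings: MemoryLimiterSettings = { limitMiB: 512, spikeLimitMiB: 128 };

const stateAt300 = limiterState(300, limiterSettings); // 'accept' (below the 384 MiB soft limit)
const stateAt400 = limiterState(400, limiterSettings); // 'refuse'
const stateAt600 = limiterState(600, limiterSettings); // 'refuse_and_force_gc'
```

A load test should therefore show refused data once usage crosses 384 MiB with these settings, well before the container limit is reached. Senders that retry will see this as backpressure rather than data loss.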

Future Outlook

OpenTelemetry continues to evolve: a profiling signal for continuous profiling is under development, events build on the logs data model to represent structured occurrences, and client-side web and mobile telemetry remains experimental. The Collector's processing capabilities keep growing as new components are added to the contrib distribution.

Performance work on the Collector is also ongoing, with the aim of handling very high span throughput with minimal resource consumption.

Community Resources and Further Learning

The technology landscape evolves rapidly, making continuous learning essential for maintaining expertise. Building a systematic approach to staying current with developments in your technology stack ensures you can leverage new features and avoid deprecated patterns.

Curated Learning Pathways

Rather than consuming content randomly, create structured learning pathways aligned with your current projects and career goals. Start with official documentation and specification documents, which provide the most accurate and comprehensive information. Follow this with hands-on tutorials and workshops that reinforce concepts through practical application.

Technical blogs from framework maintainers and core team members often provide deeper insights into design decisions and upcoming features. Subscribe to the official blogs of your primary frameworks and libraries to stay ahead of breaking changes and deprecation timelines.

Contributing to Open Source

Contributing to open-source projects in your technology stack provides unparalleled learning opportunities. Start with documentation improvements and bug reports, then progress to fixing small issues tagged as "good first issue" in your favorite projects. This direct engagement with maintainers and the codebase accelerates your understanding far beyond what passive learning can achieve.

# Setting up for contribution
git clone https://github.com/project/repository.git
cd repository
git checkout -b fix/issue-description
 
# Run the project's contribution setup (script names here are placeholders; check the project's contributing guide)
npm run setup:dev
npm run test  # Ensure tests pass before making changes
 
# Make your changes, then run the full test suite
npm run test:full
npm run lint
npm run build
 
# Submit your contribution
git add -A
git commit -m "fix: description of the fix
 
Closes #1234"
git push origin fix/issue-description

Building a Technical Knowledge Base

Maintain a personal knowledge base that captures insights, solutions, and patterns you discover during your work. Tools like Obsidian, Notion, or even a simple Markdown repository can serve as an external memory that grows more valuable over time.

Organize your notes by topic rather than chronologically, and include code examples, links to relevant documentation, and explanations of why certain approaches work better than others. When you encounter a particularly insightful article or conference talk, write a summary that captures the key takeaways and how they apply to your current projects.

Follow key conferences and their published talks to stay informed about emerging patterns and best practices. Many conferences publish recorded talks on YouTube within weeks of the event, making world-class technical content freely accessible.

Join relevant Discord servers, Slack communities, and forums where practitioners discuss real-world challenges and solutions. These communities provide early warning about emerging issues and access to collective wisdom that isn't available through formal documentation.

Mentorship and Knowledge Sharing

Teaching others is one of the most effective ways to deepen your own understanding. Consider writing technical blog posts, giving talks at local meetups, or mentoring junior developers. The process of explaining concepts to others forces you to organize your knowledge and identify gaps in your understanding.

Pair programming sessions with colleagues of different experience levels create mutual learning opportunities. Senior developers gain fresh perspectives on problems they've solved the same way for years, while junior developers benefit from exposure to production-grade thinking and decision-making processes.

Conclusion

The OpenTelemetry Collector is the central hub for cloud-native observability. Its pipeline architecture provides the flexibility to receive telemetry from any source, process it with powerful transformations, and export it to any backend. By decoupling instrumentation from backend selection, the Collector future-proofs your observability investment.

Key takeaways:

  1. Deploy the Collector as an intermediary between your applications and observability backends to avoid vendor lock-in and enable centralized processing.
  2. Use tail-based sampling to control costs while retaining the most valuable traces—errors and slow requests.
  3. Monitor the Collector itself to detect data loss, memory pressure, and export failures before they impact your observability.
  4. Start with the contrib distribution for access to the full ecosystem of receivers, processors, and exporters.

Start by instrumenting one service with OpenTelemetry, deploying a Collector instance, and exporting to your existing observability backend. Gradually expand to cover your entire stack. Refer to the OpenTelemetry documentation for detailed guides and the Collector configuration reference for all available components.