Prometheus and Grafana: Monitoring Stack Guide

Introduction

Observability is the foundation of reliable software systems, and the combination of Prometheus and Grafana has become the industry-standard open-source monitoring stack. Prometheus, originally built at SoundCloud and now a Cloud Native Computing Foundation (CNCF) graduated project, provides a powerful time-series database and query language designed specifically for monitoring metrics. Grafana, the leading open-source visualization platform, transforms Prometheus data into actionable dashboards that give teams real-time visibility into system health, performance trends, and anomalies.

The Prometheus-Grafana stack offers a self-hosted, fully customizable alternative to commercial monitoring solutions like Datadog, New Relic, and Dynatrace, and it scales from a single-node application to thousands of microservices. Companies like DigitalOcean, GitLab, and JPMorgan Chase are reported to run this stack at large scale.

This guide covers the complete monitoring pipeline: instrumenting your applications with Prometheus metrics, configuring scrape targets and recording rules, and building Grafana dashboards with alerting. It starts from a Docker Compose deployment and touches on Kubernetes service discovery and high availability.

Understanding Prometheus and Grafana: Core Concepts

The Prometheus Data Model

Prometheus stores all data as time series — streams of timestamped values belonging to a metric with a set of label key-value pairs. A metric name describes what is being measured (e.g., http_requests_total), while labels provide dimensional context (e.g., method="POST", status="200", handler="/api/users"). This dimensional data model enables powerful ad-hoc querying without predefined table schemas.
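
To make series identity concrete, here is a minimal TypeScript sketch. It is not Prometheus code: `seriesKey` and `countSeries` are hypothetical helpers introduced only for illustration. It assumes a series is identified by its metric name plus its full label set, regardless of label order.

```typescript
// A label set is a plain map of label names to values.
type Labels = Record<string, string>;
type Observation = { metric: string; labels: Labels };

// Build a canonical identity for a series: metric name plus labels sorted by
// label name, so the same label set always yields the same key.
function seriesKey(metric: string, labels: Labels): string {
  const pairs = Object.keys(labels)
    .sort()
    .map((k) => `${k}="${labels[k]}"`);
  return `${metric}{${pairs.join(",")}}`;
}

// Count how many distinct series a list of observations belongs to.
function countSeries(observations: Observation[]): number {
  return new Set(observations.map((o) => seriesKey(o.metric, o.labels))).size;
}

const observed: Observation[] = [
  { metric: "http_requests_total", labels: { method: "POST", status: "200", handler: "/api/users" } },
  // Same series as above: only the order of the labels differs.
  { metric: "http_requests_total", labels: { status: "200", handler: "/api/users", method: "POST" } },
  // A new series: the status label has a different value.
  { metric: "http_requests_total", labels: { method: "POST", status: "500", handler: "/api/users" } },
];

const distinctSeries = countSeries(observed); // 2
```

Changing any single label value, such as status, yields a different key and therefore a separate time series.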

The four core metric types are:

Counter: A monotonically increasing value that resets to zero on process restart. Use for request counts, error counts, bytes transferred, and anything that only goes up.
Gauge: A value that can go up or down. Use for temperatures, queue depths, memory usage, and in-flight request counts.
Histogram: Samples observations and counts them in configurable buckets. Use for request duration, response sizes, and any value where you need percentile calculations.
Summary: Similar to histograms but calculates quantiles on the client side. Use sparingly — histograms with server-side aggregation are preferred for most use cases.
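
The counter's reset behavior is what lets Prometheus compute increases across process restarts. The sketch below shows the idea under a simplifying assumption: any drop in value is treated as a reset to zero. `counterIncrease` is a hypothetical helper, and the real `rate()` and `increase()` functions additionally extrapolate to the edges of the query window, which this sketch leaves out.

```typescript
// Total increase of a counter across time-ordered samples, treating any drop
// in value as a reset to zero (for example after a process restart).
function counterIncrease(samples: number[]): number {
  let increase = 0;
  let prev: number | undefined;
  for (const curr of samples) {
    if (prev !== undefined) {
      // After a reset the counter restarted from 0, so the whole current
      // value counts as new increase.
      increase += curr >= prev ? curr - prev : curr;
    }
    prev = curr;
  }
  return increase;
}

// The counter climbs to 120, the process restarts, then it climbs to 30.
const totalIncrease = counterIncrease([100, 110, 120, 5, 30]); // 10 + 10 + 5 + 25 = 50
```

A gauge needs no such handling, because a falling value is meaningful there rather than a sign of a restart.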

Prometheus Architecture

Prometheus operates on a pull-based model: it scrapes metrics from instrumented targets at regular intervals via HTTP endpoints (typically /metrics). This pull model provides several advantages over push-based systems: targets do not need to know about Prometheus, you can run Prometheus against targets without modifying them, and the monitoring system controls when and how often data is collected.

┌─────────────┐     scrape     ┌──────────────┐
│  Application │ ◄──────────── │  Prometheus   │
│  /metrics    │               │  Server       │
└─────────────┘               │  ┌──────────┐ │
                               │  │ TSDB     │ │
┌─────────────┐     scrape     │  └──────────┘ │
│  Node        │ ◄──────────── │  ┌──────────┐ │
│  Exporter    │               │  │ PromQL   │ │
└─────────────┘               │  └──────────┘ │
                               └──────┬───────┘
┌─────────────┐     scrape            │ query
│  Kubernetes │ ◄────────────         │
│  Metrics    │                       ▼
└─────────────┘               ┌──────────────┐
                               │   Grafana    │
                               │  Dashboards  │
                               └──────────────┘
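
What Prometheus receives on each scrape is plain text. As a rough illustration of the text exposition format served at /metrics, the following sketch renders one metric family by hand. `renderMetric` and `escapeLabelValue` are hypothetical helpers; in practice a client library produces this output for you.

```typescript
type Sample = { labels: Record<string, string>; value: number };

// Escape backslash, double quote, and newline in label values, as the text
// exposition format requires.
function escapeLabelValue(v: string): string {
  return v.replace(/\\/g, "\\\\").replace(/"/g, '\\"').replace(/\n/g, "\\n");
}

// Render one metric family: a HELP line, a TYPE line, then one line per sample.
function renderMetric(
  metric: string,
  help: string,
  kind: "counter" | "gauge",
  samples: Sample[],
): string {
  const lines = [`# HELP ${metric} ${help}`, `# TYPE ${metric} ${kind}`];
  for (const s of samples) {
    const labelText = Object.entries(s.labels)
      .map(([k, v]) => `${k}="${escapeLabelValue(v)}"`)
      .join(",");
    lines.push(labelText ? `${metric}{${labelText}} ${s.value}` : `${metric} ${s.value}`);
  }
  return lines.join("\n") + "\n";
}

const exposition = renderMetric("http_requests_total", "Total HTTP requests", "counter", [
  { labels: { method: "POST", status: "200" }, value: 1027 },
]);
// exposition is:
// # HELP http_requests_total Total HTTP requests
// # TYPE http_requests_total counter
// http_requests_total{method="POST",status="200"} 1027
```

Because the payload is this simple, any process that can serve HTTP can be a scrape target.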

Grafana's Role

Grafana connects to Prometheus as a data source and provides a rich visualization layer. It supports dashboards with panels for graphs, gauges, tables, heatmaps, and alert lists. Grafana's alerting engine evaluates PromQL expressions on a schedule and routes notifications to channels like Slack, PagerDuty, email, or webhook endpoints.

Architecture and Design Patterns

Multi-Service Monitoring Architecture

In a production environment with multiple services, you deploy Prometheus alongside service-specific exporters. Each exporter translates a service's native metrics format into Prometheus format. The Prometheus server scrapes all exporters and stores the time series in its local time-series database (TSDB).

For high availability, run two identical Prometheus servers scraping the same targets. Grafana can query either replica (add each as its own data source, or place both behind a load balancer), while Thanos or Cortex add deduplication, long-term storage, and global querying across clusters.

Recording Rules and Federation

Recording rules pre-compute expensive PromQL expressions at regular intervals and store the results as new time series. This trades storage space for query speed: a dashboard backed by recording rules reads a handful of precomputed series instead of re-evaluating raw PromQL on every refresh.

# prometheus-rules.yml
groups:
  - name: recording_rules
    interval: 30s
    rules:
      # Pre-compute request rate per service
      - record: service:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (service, method, status)
 
      # Pre-compute P99 latency
      - record: service:http_duration:p99
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le))
 
      # Pre-compute error rate
      - record: service:http_errors:ratio
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service)
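
The P99 rule above relies on `histogram_quantile`, which estimates a quantile from cumulative bucket counts. The following TypeScript sketch shows the general idea, assuming linear interpolation inside the bucket that contains the target rank. `histogramQuantile` is a hypothetical helper, not the Prometheus implementation, and it ignores edge cases such as q = 0 or negative bucket bounds.

```typescript
// One cumulative histogram bucket: `le` is the upper bound, `count` is the
// number of observations less than or equal to that bound.
type Bucket = { le: number; count: number };

// Estimate the q-quantile (0 < q <= 1) from cumulative buckets sorted by `le`,
// where the last bucket is +Inf.
function histogramQuantile(q: number, buckets: Bucket[]): number {
  if (buckets.length < 2) return NaN;
  // Counts are cumulative, so the largest count is the total.
  const total = Math.max(0, ...buckets.map((b) => b.count));
  if (total === 0) return NaN;

  const rank = q * total;
  let lowerBound = 0; // the first bucket is assumed to start at 0
  let countBelow = 0;
  for (const b of buckets) {
    if (b.count >= rank) {
      // The rank falls in the +Inf bucket: report the highest finite bound.
      if (b.le === Infinity) return lowerBound;
      // Interpolate linearly between the bucket's lower and upper bounds.
      return lowerBound + (b.le - lowerBound) * ((rank - countBelow) / (b.count - countBelow));
    }
    lowerBound = b.le;
    countBelow = b.count;
  }
  return NaN;
}

const latencyBuckets: Bucket[] = [
  { le: 0.1, count: 40 },
  { le: 0.5, count: 90 },
  { le: 1, count: 99 },
  { le: Infinity, count: 100 },
];

const p50 = histogramQuantile(0.5, latencyBuckets); // about 0.18 seconds
const p95 = histogramQuantile(0.95, latencyBuckets); // about 0.778 seconds
```

The estimate can only be as precise as the bucket layout, so choose bucket boundaries close to the latencies you care about.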

Label Strategy

A consistent label strategy is critical for maintainable Prometheus deployments. Labels should be low-cardinality — avoid using user IDs, request IDs, or timestamps as label values. Each unique combination of metric name and labels creates a new time series, so high-cardinality labels cause memory and storage explosion.
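
A quick back-of-the-envelope calculation shows why cardinality matters. The sketch below multiplies the number of distinct values per label to get a worst-case series count for one metric. `maxSeriesCount` and the value counts are illustrative assumptions, not measurements.

```typescript
type LabelCardinality = { label: string; distinctValues: number };

// Worst-case number of time series for one metric: the product of the number
// of distinct values each label can take.
function maxSeriesCount(labels: LabelCardinality[]): number {
  return labels.reduce((product, l) => product * l.distinctValues, 1);
}

const boundedLabels: LabelCardinality[] = [
  { label: "method", distinctValues: 5 },
  { label: "status", distinctValues: 10 },
  { label: "handler", distinctValues: 20 },
];

const boundedSeries = maxSeriesCount(boundedLabels); // 1,000

// Adding a user_id label with 100,000 distinct users multiplies everything.
const unboundedSeries = maxSeriesCount([
  ...boundedLabels,
  { label: "user_id", distinctValues: 100_000 },
]); // 100,000,000
```

Real workloads rarely hit every combination, but the product is the ceiling that a single unbounded label raises by orders of magnitude.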

Step-by-Step Implementation

Installing Prometheus with Docker Compose

# docker-compose.yml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:v2.50.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./rules:/etc/prometheus/rules
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--storage.tsdb.retention.size=10GB'
      - '--web.enable-lifecycle'
 
  grafana:
    image: grafana/grafana:10.3.0
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      # Default credentials for local testing only; change before exposing Grafana
      GF_SECURITY_ADMIN_PASSWORD: admin
 
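  node-exporter:
    image: prom/node-exporter:v1.7.0
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
    command:
      # Point the exporter at the mounted host paths; without these flags the mounts are unused
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'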
 
volumes:
  prometheus_data:
  grafana_data:

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
 
rule_files:
  - /etc/prometheus/rules/*.yml
 
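# Assumes an Alertmanager instance reachable as 'alertmanager' (not defined in the Compose file above)
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']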
 
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
 
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
 
  - job_name: 'api-server'
    metrics_path: '/metrics'
    scrape_interval: 10s
    static_configs:
      - targets: ['api:8080']
        labels:
          service: 'api'
          environment: 'production'
 
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)

Instrumenting a Node.js Application

import express from 'express';
import { register, Counter, Histogram, Gauge, collectDefaultMetrics } from 'prom-client';
 
// Collect default Node.js metrics (GC, event loop, memory, etc.)
collectDefaultMetrics({ prefix: 'myapp_' });
 
// Define custom metrics
const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'path', 'status'] as const,
});
 
const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'path'] as const,
  buckets: [0.01, 0.05, 0.1, 0.3, 0.5, 1, 2, 5],
});
 
const activeConnections = new Gauge({
  name: 'active_connections',
  help: 'Number of active connections',
});
 
const app = express();
 
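// Metrics middleware
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  activeConnections.inc();

  res.on('finish', () => {
    // Label with the matched route pattern (e.g. '/api/users/:id'), not the raw URL,
    // so path parameters do not create a new time series per request
    const path = req.route?.path ?? 'unmatched';
    end({ method: req.method, path });
    httpRequestsTotal.inc({ method: req.method, path, status: String(res.statusCode) });
    activeConnections.dec();
  });

  next();
});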
 
// Expose metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
 
app.get('/api/users', (req, res) => {
  res.json([{ id: 1, name: 'Alice' }]);
});
 
app.listen(8080, () => console.log('Server running on :8080'));

Essential PromQL Queries

# Request rate per second (5-minute window)
rate(http_requests_total[5m])
 
# Error rate as a percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
 
# P95 request duration
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
 
# Top 5 endpoints by request volume
topk(5, sum(rate(http_requests_total[5m])) by (path))
 
# Memory usage growth over 1 hour (if default metrics use a prefix such as 'myapp_', query myapp_process_resident_memory_bytes)
delta(process_resident_memory_bytes[1h])
 
# Requests per second by status code (for stacked graphs)
sum(rate(http_requests_total[5m])) by (status)

Real-World Use Cases

Use Case 1: Microservices Health Monitoring

In a microservices architecture, Prometheus scrapes metrics from every service instance. A Grafana dashboard shows the health of each service — request rate, error rate, latency percentiles, and resource consumption. When error rates spike, automated alerts page the on-call engineer with the specific service and error pattern.

Use Case 2: Kubernetes Cluster Monitoring

The Prometheus Node Exporter collects hardware and OS metrics from every node in the cluster. The kube-state-metrics exporter exposes Kubernetes object state (pod status, deployment replicas, resource requests). Grafana dashboards show cluster capacity, pod scheduling latency, and resource utilization trends that inform scaling decisions.

Use Case 3: Business Metrics Pipeline

Beyond technical metrics, Prometheus tracks business KPIs — orders per minute, payment success rates, cart abandonment rates, and average order value. Recording rules aggregate these metrics hourly and daily for executive dashboards. Alerts trigger when business metrics deviate from expected ranges, catching issues before customers notice.

Use Case 4: Database Performance Monitoring

The PostgreSQL and Redis exporters expose connection pool usage, query latency, cache hit ratios, and replication lag. Grafana dashboards correlate database performance with application metrics, helping teams identify whether slow responses are caused by database bottlenecks or application-level issues.

Best Practices for Production

Use histograms, not summaries: Histograms can be aggregated across instances and time ranges using PromQL. Summary quantiles are calculated on the client side and cannot be meaningfully aggregated across instances (only the _sum and _count series can), which makes them a poor fit for multi-instance monitoring.
Keep label cardinality low: Each unique label combination creates a separate time series. Avoid labels with high cardinality values like user IDs, email addresses, or request IDs. Aim for fewer than 10 unique values per label in most cases.
Use recording rules for dashboards: Pre-compute the expensive PromQL expressions used in your dashboards as recording rules. This speeds up dashboard loading and reduces load on Prometheus during peak query times.
Set appropriate scrape intervals: Use 15-30 second intervals for most services. For critical paths requiring faster detection, use 10-second intervals. Shorter intervals increase storage requirements and scrape load — do not go below 5 seconds without strong justification.
Implement alerting rules with proper thresholds: Use multi-window, multi-burn-rate alerting for SLO-based alerts. Alert on error budgets, not raw error rates, to avoid alert fatigue from transient spikes.
Deploy with persistent storage: Configure Prometheus with persistent volumes to retain data across pod restarts. Set retention based on your needs — 15 days for short-term, 90+ days with Thanos or Cortex for long-term analysis.
Use Alertmanager for routing: Route critical alerts to PagerDuty, warning alerts to Slack, and informational alerts to email. Implement grouping and silencing rules to prevent alert storms during incidents.
Monitor the monitoring stack: Prometheus exposes its own metrics at /metrics. Monitor scrape duration, TSDB compaction, and storage usage to ensure the monitoring infrastructure itself is healthy.
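
The multi-window, multi-burn-rate advice above can be sketched in a few lines. Burn rate is the observed error ratio divided by the error budget (1 minus the SLO target). The threshold of 14.4 used here is the value commonly cited for a 1-hour window against a 30-day SLO (it corresponds to spending 2% of the budget in one hour). Treat it, along with the `burnRate` and `shouldPage` helpers, as an illustrative assumption rather than a prescription.

```typescript
// Burn rate: how many times faster than "exactly on budget" the error budget
// is being consumed. A burn rate of 1 uses up the budget exactly at the end
// of the SLO period.
function burnRate(errorRatio: number, sloTarget: number): number {
  return errorRatio / (1 - sloTarget);
}

// Page only when both the long and the short window exceed the threshold, so
// a brief spike that has already ended does not page anyone.
function shouldPage(
  longWindowErrorRatio: number,
  shortWindowErrorRatio: number,
  sloTarget: number,
  threshold: number,
): boolean {
  return (
    burnRate(longWindowErrorRatio, sloTarget) > threshold &&
    burnRate(shortWindowErrorRatio, sloTarget) > threshold
  );
}

const slo = 0.999; // 99.9% availability target
const pageThreshold = 14.4;

// Sustained problem: 2% errors over both the 1h and the 5m window (burn rate about 20).
const sustainedOutage = shouldPage(0.02, 0.02, slo, pageThreshold); // true

// Transient spike: 3% errors in the last 5m, but only 0.2% over the last hour.
const transientSpike = shouldPage(0.002, 0.03, slo, pageThreshold); // false
```

In Prometheus itself, each window's error ratio would typically come from a recording rule, and the comparison would live in an alerting rule expression.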

Common Pitfalls and Solutions

Pitfall	Impact	Solution
High-cardinality labels (user IDs, request IDs)	Prometheus runs out of memory, scrapes slow down	Remove or bucket high-cardinality labels; use logs for request-level tracing
Missing alerting rules	Incidents detected by customers instead of monitoring	Define alerts for all SLO violations; unit-test rules with `promtool test rules` and check notification routing with `amtool`
No retention or storage limits	Disk fills up, Prometheus crashes	Set `--storage.tsdb.retention.time` and `--storage.tsdb.retention.size`
Scraping too many targets from one server	Scrape duration exceeds interval, gaps in data	Use federation or sharding; split jobs across multiple Prometheus instances
Dashboard queries returning too many series	Grafana panels time out or render slowly	Add recording rules; use `topk()` or `bottomk()` to limit series; aggregate by service
Not monitoring Prometheus itself	Silent failures in the monitoring infrastructure	Add Prometheus to its own scrape targets; monitor TSDB stats and scrape health
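
The sharding remedy in the table can be pictured as hashing each target address and keeping only the targets whose hash falls into a given shard. The sketch below uses a 32-bit FNV-1a hash as a stand-in; Prometheus's `hashmod` relabel action uses a different hash, and `shardFor` and `targetsForShard` are hypothetical helpers for illustration only.

```typescript
// 32-bit FNV-1a hash of a string (a stand-in for the hash Prometheus uses).
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0; // keep the value an unsigned 32-bit integer
  }
  return hash;
}

// Deterministically map a target address to one of `shardCount` shards.
function shardFor(target: string, shardCount: number): number {
  return fnv1a(target) % shardCount;
}

// The targets a given Prometheus shard should keep; all others are dropped.
function targetsForShard(targets: string[], shard: number, shardCount: number): string[] {
  return targets.filter((t) => shardFor(t, shardCount) === shard);
}

const allTargets = ["api-1:8080", "api-2:8080", "api-3:8080", "db-1:9187", "cache-1:9121"];
const shardZero = targetsForShard(allTargets, 0, 2);
const shardOne = targetsForShard(allTargets, 1, 2);
// Every target lands in exactly one of the two shards.
```

Because the assignment depends only on the target address and the shard count, each Prometheus instance can compute its own share independently.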

Performance Optimization

Prometheus TSDB Tuning

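# prometheus.yml (global settings)
global:
  scrape_interval: 15s
  scrape_timeout: 10s
 
# TSDB and query settings are command-line flags, not prometheus.yml keys:
# --storage.tsdb.wal-compression          (already the default in recent 2.x releases)
# --storage.tsdb.min-block-duration=2h
# --storage.tsdb.max-block-duration=36h   (use 2h instead when running a Thanos sidecar)
# --query.max-samples=50000000            (the default; lower it to cap per-query memory)
# --query.timeout=2m                      (the default)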
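
When choosing retention flags, it helps to estimate disk usage first. A common rule of thumb is retention time multiplied by ingested samples per second multiplied by bytes per sample, with roughly 1 to 2 bytes per sample after compression. The sketch below applies that rule; `estimateDiskBytes` and the input numbers are illustrative assumptions, and real usage depends on churn and data shape.

```typescript
// Rough disk estimate: retention (seconds) x samples ingested per second x bytes per sample.
function estimateDiskBytes(
  activeSeries: number,
  scrapeIntervalSeconds: number,
  retentionDays: number,
  bytesPerSample: number,
): number {
  const samplesPerSecond = activeSeries / scrapeIntervalSeconds;
  const retentionSeconds = retentionDays * 24 * 60 * 60;
  return retentionSeconds * samplesPerSecond * bytesPerSample;
}

// 100,000 active series scraped every 15s, kept for 30 days, at 2 bytes per sample.
const estimatedBytes = estimateDiskBytes(100_000, 15, 30, 2);
const estimatedGigabytes = estimatedBytes / 1e9; // about 34.6 GB
```

If the estimate exceeds the value passed to `--storage.tsdb.retention.size`, the size limit will trigger first and data will be removed before the time-based retention is reached.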

Efficient PromQL Patterns

# BAD: Selecting every raw bucket series of a histogram
http_request_duration_seconds_bucket
 
# GOOD: Average request duration from the _sum and _count series
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
 
# BAD: Fetching all series
http_requests_total
 
# GOOD: Filter early with label matchers
http_requests_total{service="api", status=~"5.."}
 
# Use recording rules for dashboard queries
# prometheus-rules.yml
groups:
  - name: grafana_recording
    rules:
      - record: instance:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (instance, job)

Grafana Query Optimization

Use Grafana variables to keep dashboards compact: define variables for service, instance, and environment selectors, then reference them in PromQL expressions so that one dashboard serves many targets instead of duplicating panels. For repeated queries, query caching is available in Grafana Enterprise and Grafana Cloud.

Comparison with Alternatives

Feature	Prometheus + Grafana	Datadog	New Relic	ELK Stack
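Cost	Free software (self-hosted; you pay for infrastructure)	~$15-23/host/month (list price)	~$0.30/GB ingested (list price)	Free software (self-hosted; you pay for infrastructure)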
Data Model	Dimensional time series	Unified metrics, logs, traces	NRDB (proprietary)	Documents (Elasticsearch)
Query Language	PromQL	Custom query language	NRQL	Lucene/KQL
Alerting	Alertmanager (separate component) or Grafana Alerting	Built-in	Built-in	Watcher/Alerting
Long-term Storage	Thanos, Cortex, Mimir	Built-in	Built-in	Built-in
Visualization	Grafana (best in class)	Built-in	Built-in	Kibana
Kubernetes Native	Yes (CNCF graduated)	Yes	Yes	Yes
Learning Curve	Medium — PromQL is powerful but unique	Low	Medium	High

Advanced Patterns

Multi-Tenant Monitoring with Thanos

Thanos extends Prometheus with global query view, long-term storage, and downsampling. Deploy Thanos Sidecar alongside each Prometheus instance to ship data to object storage (S3, GCS, Azure Blob).

# thanos-sidecar configuration (volumes and volumeMounts for /prometheus and /etc/thanos omitted for brevity)
apiVersion: v1
kind: Pod
metadata:
  name: prometheus-with-thanos
spec:
  containers:
    - name: prometheus
      image: prom/prometheus:v2.50.0
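      args:
        - '--storage.tsdb.path=/prometheus'
        # The Thanos sidecar expects min and max block duration to be equal,
        # which disables local compaction before blocks are uploaded
        - '--storage.tsdb.min-block-duration=2h'
        - '--storage.tsdb.max-block-duration=2h'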
    - name: thanos-sidecar
      image: thanosio/thanos:v0.34.0
      args:
        - 'sidecar'
        - '--tsdb.path=/prometheus'
        - '--objstore.config-file=/etc/thanos/bucket.yml'

Service Discovery with Consul

scrape_configs:
  - job_name: 'consul-services'
    consul_sd_configs:
      - server: 'consul:8500'
        tags:
          - 'prometheus'
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: service
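      # Combine the node address with the port taken from a 'prometheus-port-<n>' tag;
      # writing only the port to __address__ would drop the host
      - source_labels: [__meta_consul_address, __meta_consul_tags]
        regex: '([^;]+);,(?:[^,]+,)*prometheus-port-(\d+),.*'
        target_label: __address__
        replacement: '${1}:${2}'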

Testing Strategies

Validating Configuration and Rules with promtool

Use promtool to validate recording and alerting rules before deploying them:

# Validate configuration
promtool check config prometheus.yml
 
# Validate rules
promtool check rules rules/*.yml
 
# Unit-test rules offline against synthetic series (no running Prometheus needed)
promtool test rules test-cases.yml

# test-cases.yml
rule_files:
  - alerts.yml
 
evaluation_interval: 1m
 
tests:
  - interval: 1m
    input_series:
      - series: 'http_requests_total{service="api", status="500"}'
        values: '0+10x30'  # 10 errors per minute for 30 minutes
      - series: 'http_requests_total{service="api", status="200"}'
        values: '0+100x30'  # 100 requests per minute
 
    alert_rule_test:
      - eval_time: 5m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              service: api
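
The `values` strings in the test file use promtool's expanding notation, where 'a+bxn' produces n+1 samples starting at a and increasing by b. The sketch below expands that notation and computes the error ratio the synthetic data implies. `expandSeries` and `increaseBetween` are hypothetical helpers that cover only this one form of the notation, and the contents of alerts.yml are not shown in this guide, so whether HighErrorRate fires depends on the threshold defined there.

```typescript
// Expand promtool's 'a+bxn' notation into n+1 samples: a, a+b, ..., a+n*b.
function expandSeries(notation: string): number[] {
  const match = /^(-?\d+(?:\.\d+)?)\+(\d+(?:\.\d+)?)x(\d+)$/.exec(notation);
  if (!match) throw new Error(`unsupported notation: ${notation}`);
  const start = Number(match[1]);
  const step = Number(match[2]);
  const repeats = Number(match[3]);
  const values: number[] = [];
  for (let i = 0; i <= repeats; i++) values.push(start + i * step);
  return values;
}

// Increase of a counter between two sample indexes (the synthetic data has no resets).
function increaseBetween(values: number[], from: number, to: number): number {
  return (values[to] ?? NaN) - (values[from] ?? NaN);
}

const errorSamples = expandSeries("0+10x30"); // 31 samples: 0, 10, ..., 300
const successSamples = expandSeries("0+100x30"); // 31 samples: 0, 100, ..., 3000

// With a 1m interval, indexes 0..5 cover the first five minutes.
const errorIncrease = increaseBetween(errorSamples, 0, 5); // 50
const requestIncrease = errorIncrease + increaseBetween(successSamples, 0, 5); // 550
const syntheticErrorRatio = errorIncrease / requestIncrease; // about 0.091
```

An error ratio of roughly 9% at the 5-minute mark is what the alert expression would see, which is useful to know when picking `eval_time` and the expected alerts.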

Future Outlook

The Prometheus ecosystem is evolving toward a unified observability platform. OpenTelemetry is becoming the standard for metrics, traces, and logs collection, with native Prometheus compatibility built in. Grafana Mimir provides horizontally scalable, highly available long-term storage for Prometheus metrics with native Grafana integration.

Prometheus's remote write protocol is being revised to support more efficient compression and batching, enabling better integration with long-term storage backends. Prometheus Agent mode, aimed at edge and IoT deployments, scrapes targets and forwards samples via remote write while keeping only a write-ahead log on disk instead of a queryable local TSDB, which lowers its resource requirements.

Conclusion

The Prometheus-Grafana stack provides a powerful, flexible, and cost-effective monitoring solution that scales from development to enterprise production environments. By combining Prometheus's dimensional data model and PromQL with Grafana's visualization capabilities, teams gain comprehensive visibility into application and infrastructure health.

Key takeaways from this guide:

The pull-based model simplifies service discovery and makes monitoring infrastructure agnostic to application deployment patterns.
Recording rules pre-compute expensive queries, making dashboards fast and reducing query load during incidents.
Label discipline is critical — keep cardinality low, use consistent naming conventions, and avoid high-churn labels.
Alert on SLOs, not symptoms — define error budgets and alert when the budget is being consumed too quickly.
Monitor the monitoring stack — Prometheus, Grafana, and Alertmanager all expose metrics that should be scraped and alerted on.

Start by deploying Prometheus and Grafana with Docker Compose, instrumenting your application with the Prometheus client library, and building your first dashboard. As your system grows, add recording rules, alerting, and long-term storage with Thanos or Mimir. The Prometheus documentation and Grafana documentation provide comprehensive references for every feature discussed in this guide.

Minh Vo
