Introduction
Observability is the foundation of reliable software systems, and the combination of Prometheus and Grafana has become the industry-standard open-source monitoring stack. Prometheus, originally built at SoundCloud and now a Cloud Native Computing Foundation (CNCF) graduated project, provides a powerful time-series database and query language designed specifically for monitoring metrics. Grafana, the leading open-source visualization platform, transforms Prometheus data into actionable dashboards that give teams real-time visibility into system health, performance trends, and anomalies.
The Prometheus-Grafana stack can replace commercial monitoring products such as Datadog, New Relic, and Dynatrace with a self-hosted, customizable alternative that scales from a single-node application to thousands of microservices. Companies like DigitalOcean, GitLab, and JPMorgan Chase run this stack at large scale, monitoring very large numbers of time series while keeping queries interactive.
This guide covers the complete monitoring pipeline — from instrumenting your applications with Prometheus metrics, through configuring scrape targets and recording rules, to building production-grade Grafana dashboards with alerting, all deployed on Kubernetes with high availability.
Understanding Prometheus and Grafana: Core Concepts
The Prometheus Data Model
Prometheus stores all data as time series — streams of timestamped values belonging to a metric with a set of label key-value pairs. A metric name describes what is being measured (e.g., http_requests_total), while labels provide dimensional context (e.g., method="POST", status="200", handler="/api/users"). This dimensional data model enables powerful ad-hoc querying without predefined table schemas.
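To make the idea of series identity concrete, here is a minimal sketch in TypeScript. The helper name `seriesKey` is hypothetical and not part of any Prometheus library; it only illustrates that a series is identified by its metric name plus its full label set, regardless of label order.

```typescript
// Hypothetical helper (not a Prometheus API): builds a canonical identity
// for a time series from a metric name and its label set.
type Labels = Record<string, string>;

function seriesKey(metric: string, labels: Labels): string {
  // Sort label names so that insertion order does not affect identity.
  const pairs = Object.keys(labels)
    .sort()
    .map((name) => `${name}="${labels[name]}"`);
  return `${metric}{${pairs.join(",")}}`;
}

// Same labels in a different order: the same series.
const postOk = seriesKey("http_requests_total", { method: "POST", status: "200" });
const postOkReordered = seriesKey("http_requests_total", { status: "200", method: "POST" });

// One differing label value: a separate series.
const postError = seriesKey("http_requests_total", { method: "POST", status: "500" });
```

Here `postOk` and `postOkReordered` are identical strings, while `postError` differs, which is why every new label value adds another series to store.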
The four core metric types are:
- Counter: A monotonically increasing value that resets to zero on process restart. Use for request counts, error counts, bytes transferred, and anything that only goes up.
- Gauge: A value that can go up or down. Use for temperatures, queue depths, memory usage, and in-flight request counts.
- Histogram: Samples observations and counts them in configurable buckets. Use for request duration, response sizes, and any value where you need percentile calculations.
- Summary: Similar to histograms but calculates quantiles on the client side. Use sparingly — histograms with server-side aggregation are preferred for most use cases.
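Two of these types have semantics that are easier to see in code. The sketch below is a simplified illustration, not Prometheus's implementation: `counterIncrease` shows how a counter's increase can be computed while treating any drop as a restart, and `estimateQuantile` shows how a percentile can be estimated from cumulative histogram buckets by linear interpolation. Both function names and the `Bucket` shape are assumptions made for this example.

```typescript
// Total increase of a counter over a list of samples, treating any drop
// in value as a reset to zero (for example, after a process restart).
function counterIncrease(samples: number[]): number {
  let total = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i] - samples[i - 1];
    // After a reset the counter restarts from zero, so the new value
    // itself is the increase since the reset.
    total += delta >= 0 ? delta : samples[i];
  }
  return total;
}

// A histogram bucket: the cumulative count of observations <= le.
interface Bucket {
  le: number;
  count: number;
}

// Estimate quantile q (0..1) by interpolating linearly inside the bucket
// that contains the target rank.
function estimateQuantile(q: number, buckets: Bucket[]): number {
  if (buckets.length === 0) return NaN;
  const sorted = [...buckets].sort((x, y) => x.le - y.le);
  const rank = q * sorted[sorted.length - 1].count;
  let lowerBound = 0;
  let lowerCount = 0;
  for (const bucket of sorted) {
    if (bucket.count >= rank) {
      // The +Inf bucket has no upper bound to interpolate toward.
      if (bucket.le === Infinity) return lowerBound;
      const inBucket = bucket.count - lowerCount;
      if (inBucket === 0) return bucket.le;
      return lowerBound + (bucket.le - lowerBound) * ((rank - lowerCount) / inBucket);
    }
    lowerBound = bucket.le;
    lowerCount = bucket.count;
  }
  return lowerBound;
}
```

Because the estimate interpolates within a bucket, its accuracy depends on how well the bucket boundaries match the real latency distribution.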
Prometheus Architecture
Prometheus operates on a pull-based model: it scrapes metrics from instrumented targets at regular intervals via HTTP endpoints (typically /metrics). This pull model provides several advantages over push-based systems: targets do not need to know about Prometheus, you can run Prometheus against targets without modifying them, and the monitoring system controls when and how often data is collected.
┌─────────────┐    scrape     ┌──────────────┐
│ Application │ ◄──────────── │  Prometheus  │
│  /metrics   │               │    Server    │
└─────────────┘               │ ┌──────────┐ │
                              │ │   TSDB   │ │
┌─────────────┐    scrape     │ └──────────┘ │
│    Node     │ ◄──────────── │ ┌──────────┐ │
│  Exporter   │               │ │  PromQL  │ │
└─────────────┘               │ └──────────┘ │
                              └──────┬───────┘
┌─────────────┐    scrape            │ query
│ Kubernetes  │ ◄────────────        │
│   Metrics   │                      ▼
└─────────────┘               ┌──────────────┐
                              │   Grafana    │
                              │  Dashboards  │
                              └──────────────┘
Grafana's Role
Grafana connects to Prometheus as a data source and provides a rich visualization layer. It supports dashboards with panels for graphs, gauges, tables, heatmaps, and alert lists. Grafana's alerting engine evaluates PromQL expressions on a schedule and routes notifications to channels like Slack, PagerDuty, email, or webhook endpoints.
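Under the hood, a graph panel boils down to a range query against the Prometheus HTTP API (`/api/v1/query_range`). The sketch below builds such a request URL and picks a step from the time range. The helper names are hypothetical, the base URL is a placeholder, and the step logic is a simplification of what Grafana does, shown only to illustrate why wider time ranges produce coarser resolution.

```typescript
// Choose a query step so that a panel receives at most maxDataPoints points,
// never going below minStepSec (typically the scrape interval).
function stepFor(rangeSec: number, maxDataPoints: number, minStepSec: number): number {
  return Math.max(minStepSec, Math.ceil(rangeSec / maxDataPoints));
}

// Build the URL for a range query against the Prometheus HTTP API.
function rangeQueryUrl(
  baseUrl: string,
  query: string,
  startSec: number,
  endSec: number,
  stepSec: number,
): string {
  const params = [
    `query=${encodeURIComponent(query)}`,
    `start=${startSec}`,
    `end=${endSec}`,
    `step=${stepSec}`,
  ];
  return `${baseUrl}/api/v1/query_range?${params.join("&")}`;
}

// One hour of data for a panel that can draw roughly 1000 points.
const panelStep = stepFor(3600, 1000, 15);
const panelUrl = rangeQueryUrl(
  "http://prometheus:9090", // placeholder address
  'up{job="api"}',
  1700000000,
  1700003600,
  panelStep,
);
```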
Architecture and Design Patterns
Multi-Service Monitoring Architecture
In a production environment with multiple services, you deploy Prometheus alongside service-specific exporters. Each exporter translates a service's native metrics format into Prometheus format. The Prometheus server scrapes all exporters and stores the time series in its local time-series database (TSDB).
For high availability, run two identical Prometheus servers scraping the same targets. Use Grafana's built-in support for multiple data sources to query both replicas, or use Thanos or Cortex for long-term storage and global querying across clusters.
Recording Rules and Federation
Recording rules pre-compute expensive PromQL expressions at regular intervals and store the results as new time series. This trades storage space for query speed: dashboards backed by recording rules read precomputed series instead of re-evaluating the raw expression on every refresh, so they load much faster.
# prometheus-rules.yml
groups:
  - name: recording_rules
    interval: 30s
    rules:
      # Pre-compute request rate per service
      - record: service:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (service, method, status)
      # Pre-compute P99 latency
      - record: service:http_duration:p99
        expr: histogram_quantile(0.99, sum(rate(http_request_duration_seconds_bucket[5m])) by (service, le))
      # Pre-compute error rate
      - record: service:http_errors:ratio
        expr: sum(rate(http_requests_total{status=~"5.."}[5m])) by (service) / sum(rate(http_requests_total[5m])) by (service)
Label Strategy
A consistent label strategy is critical for maintainable Prometheus deployments. Labels should be low-cardinality — avoid using user IDs, request IDs, or timestamps as label values. Each unique combination of metric name and labels creates a new time series, so high-cardinality labels cause memory and storage explosion.
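A quick back-of-the-envelope calculation shows why. The sketch below multiplies the number of distinct values per label to get an upper bound on the series a single metric can create; the function name and the example cardinalities are made up for illustration, and the real count is usually lower because not every combination occurs.

```typescript
// Upper bound on the number of series for one metric: the product of the
// number of distinct values each label can take.
function estimateSeries(labelCardinalities: Record<string, number>): number {
  return Object.keys(labelCardinalities).reduce(
    (product, name) => product * labelCardinalities[name],
    1,
  );
}

// Bounded labels keep the series count manageable.
const boundedSeries = estimateSeries({ service: 20, method: 5, status: 8 });

// Adding a user_id label with 100,000 values multiplies everything.
const withUserIdSeries = estimateSeries({
  service: 20,
  method: 5,
  status: 8,
  user_id: 100000,
});
```

With these example numbers, `boundedSeries` is 800 while `withUserIdSeries` is 80,000,000, all from adding one label.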
Step-by-Step Implementation
Installing Prometheus with Docker Compose
# docker-compose.yml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:v2.50.0
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./rules:/etc/prometheus/rules
      - prometheus_data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--storage.tsdb.retention.size=10GB'
      - '--web.enable-lifecycle'
  grafana:
    image: grafana/grafana:10.3.0
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
    environment:
      # Default credentials for local use only; change before exposing Grafana
      GF_SECURITY_ADMIN_PASSWORD: admin
  node-exporter:
    image: prom/node-exporter:v1.7.0
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
    command:
      # Read host metrics from the mounts above rather than the container's own /proc and /sys
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
volumes:
  prometheus_data:
  grafana_data:
Prometheus Configuration
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
rule_files:
  - /etc/prometheus/rules/*.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: 'api-server'
    metrics_path: '/metrics'
    scrape_interval: 10s
    static_configs:
      - targets: ['api:8080']
        labels:
          service: 'api'
          environment: 'production'
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
Instrumenting a Node.js Application
import express from 'express';
import { register, Counter, Histogram, Gauge, collectDefaultMetrics } from 'prom-client';
// Collect default Node.js metrics (GC, event loop, memory, etc.)
collectDefaultMetrics({ prefix: 'myapp_' });
// Define custom metrics
const httpRequestsTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'path', 'status'] as const,
});
const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'path'] as const,
  buckets: [0.01, 0.05, 0.1, 0.3, 0.5, 1, 2, 5],
});
const activeConnections = new Gauge({
  name: 'active_connections',
  help: 'Number of active connections',
});
const app = express();
// Metrics middleware
app.use((req, res, next) => {
  // Start the timer without labels; the matched route is only known after routing
  const end = httpRequestDuration.startTimer();
  activeConnections.inc();
  res.on('finish', () => {
    // Label by route pattern (e.g. /api/users/:id), not the raw URL path,
    // so that path parameters do not create unbounded label values
    const path = req.route ? String(req.route.path) : 'unmatched';
    end({ method: req.method, path });
    httpRequestsTotal.inc({ method: req.method, path, status: String(res.statusCode) });
    activeConnections.dec();
  });
  next();
});
// Expose metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
app.get('/api/users', (req, res) => {
  res.json([{ id: 1, name: 'Alice' }]);
});
app.listen(8080, () => console.log('Server running on :8080'));
Essential PromQL Queries
# Request rate per second (5-minute window)
rate(http_requests_total[5m])
# Error rate as a percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
# P95 request duration
histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket[5m])) by (le))
# Top 5 endpoints by request volume
topk(5, sum(rate(http_requests_total[5m])) by (path))
# Memory usage growth over 1 hour
delta(process_resident_memory_bytes[1h])
# Requests per second by status code (for stacked graphs)
sum(rate(http_requests_total[5m])) by (status)
Real-World Use Cases
Use Case 1: Microservices Health Monitoring
In a microservices architecture, Prometheus scrapes metrics from every service instance. A Grafana dashboard shows the health of each service — request rate, error rate, latency percentiles, and resource consumption. When error rates spike, automated alerts page the on-call engineer with the specific service and error pattern.
Use Case 2: Kubernetes Cluster Monitoring
The Prometheus Node Exporter collects hardware and OS metrics from every node in the cluster. The kube-state-metrics exporter exposes Kubernetes object state (pod status, deployment replicas, resource requests). Grafana dashboards show cluster capacity, pod scheduling latency, and resource utilization trends that inform scaling decisions.
Use Case 3: Business Metrics Pipeline
Beyond technical metrics, Prometheus tracks business KPIs — orders per minute, payment success rates, cart abandonment rates, and average order value. Recording rules aggregate these metrics hourly and daily for executive dashboards. Alerts trigger when business metrics deviate from expected ranges, catching issues before customers notice.
Use Case 4: Database Performance Monitoring
The PostgreSQL and Redis exporters expose connection pool usage, query latency, cache hit ratios, and replication lag. Grafana dashboards correlate database performance with application metrics, helping teams identify whether slow responses are caused by database bottlenecks or application-level issues.
Best Practices for Production
- Use histograms, not summaries: Histograms can be aggregated across instances and time ranges using PromQL. Summaries calculate quantiles on the client side, and those precomputed quantiles cannot be meaningfully aggregated across instances (only a summary's _sum and _count series can), which makes summaries a poor fit for multi-instance latency percentiles.
- Keep label cardinality low: Each unique label combination creates a separate time series. Avoid labels with high cardinality values like user IDs, email addresses, or request IDs. Aim for fewer than 10 unique values per label in most cases.
- Use recording rules for dashboards: Pre-compute the PromQL expressions used in your dashboards as recording rules. This speeds up dashboard loading and reduces load on Prometheus during peak query times.
- Set appropriate scrape intervals: Use 15-30 second intervals for most services. For critical paths requiring faster detection, use 10-second intervals. Shorter intervals increase storage requirements and scrape load, so do not go below 5 seconds without strong justification.
- Implement alerting rules with proper thresholds: Use multi-window, multi-burn-rate alerting for SLO-based alerts. Alert on error budgets, not raw error rates, to avoid alert fatigue from transient spikes.
- Deploy with persistent storage: Configure Prometheus with persistent volumes to retain data across pod restarts. Set retention based on your needs: 15 days for short-term, 90+ days with Thanos or Cortex for long-term analysis.
- Use Alertmanager for routing: Route critical alerts to PagerDuty, warning alerts to Slack, and informational alerts to email. Implement grouping and silencing rules to prevent alert storms during incidents.
- Monitor the monitoring stack: Prometheus exposes its own metrics at /metrics. Monitor scrape duration, TSDB compaction, and storage usage to ensure the monitoring infrastructure itself is healthy.
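The multi-window, multi-burn-rate idea from the list above is easier to follow with numbers. In this sketch, burn rate is the observed error ratio divided by the error budget (1 minus the SLO target), and a page fires only when both a long and a short window exceed the threshold. The function names are hypothetical; the 14.4 threshold in the usage note is a commonly cited value for a 99.9% SLO over 30 days and is used here as an assumption, not a recommendation for every service.

```typescript
// Burn rate: how many times faster than "exactly on budget" errors are occurring.
function burnRate(errorRatio: number, sloTarget: number): number {
  const errorBudget = 1 - sloTarget;
  return errorRatio / errorBudget;
}

// Fraction of the whole error budget consumed by sustaining a burn rate
// for windowHours out of an SLO period of periodHours.
function budgetConsumed(rate: number, windowHours: number, periodHours: number): number {
  return (rate * windowHours) / periodHours;
}

// Page only if both windows are burning too fast: the long window shows the
// problem is significant, the short window shows it is still happening.
function shouldPage(
  longWindowErrorRatio: number,
  shortWindowErrorRatio: number,
  sloTarget: number,
  threshold: number,
): boolean {
  return (
    burnRate(longWindowErrorRatio, sloTarget) > threshold &&
    burnRate(shortWindowErrorRatio, sloTarget) > threshold
  );
}
```

For a 99.9% target, a sustained 2% error ratio is a burn rate of about 20. Sustaining a burn rate of 14.4 for one hour uses roughly 2% of a 30-day (720-hour) budget, which is why that threshold is often paired with a one-hour long window and a five-minute short window.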
Common Pitfalls and Solutions
| Pitfall | Impact | Solution |
|---|---|---|
| High-cardinality labels (user IDs, request IDs) | Prometheus runs out of memory, scrapes slow down | Remove or bucket high-cardinality labels; use logs for request-level tracing |
| Missing alerting rules | Incidents detected by customers instead of monitoring | Define alerts for all SLO violations; unit-test rules with promtool test rules and check Alertmanager routing with amtool |
| No retention or storage limits | Disk fills up, Prometheus crashes | Set --storage.tsdb.retention.time and --storage.tsdb.retention.size |
| Pulling from too many targets | Scrape duration exceeds interval, gaps in data | Use federation or sharding; split jobs across multiple Prometheus instances |
| Dashboard queries returning too many series | Grafana panels time out or render slowly | Add recording rules; use topk() or bottomk() to limit series; aggregate by service |
| Not monitoring Prometheus itself | Silent failures in the monitoring infrastructure | Add Prometheus to its own scrape targets; monitor TSDB stats and scrape health |
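The sharding fix in the table above amounts to assigning each target to exactly one Prometheus instance by hashing its address. The sketch below shows the idea with a simple FNV-1a hash; this is an illustrative stand-in and not the hash that Prometheus's own `hashmod` relabel action uses, and the function names and target addresses are hypothetical.

```typescript
// 32-bit FNV-1a hash over the string's UTF-16 code units
// (equivalent to bytes for plain ASCII addresses).
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash;
}

// Deterministically map a target address to a shard index in [0, shardCount).
function shardFor(address: string, shardCount: number): number {
  return fnv1a(address) % shardCount;
}

// The subset of targets that one Prometheus shard should scrape.
function targetsForShard(targets: string[], shard: number, shardCount: number): string[] {
  return targets.filter((target) => shardFor(target, shardCount) === shard);
}

// Hypothetical target list split across three shards.
const allTargets = ["api-1:8080", "api-2:8080", "db-1:9187", "cache-1:9121", "node-1:9100"];
const shardZeroTargets = targetsForShard(allTargets, 0, 3);
```

Because the mapping depends only on the address and the shard count, every instance can compute its own subset independently, at the cost of reshuffling targets when the shard count changes.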
Performance Optimization
Prometheus TSDB Tuning
# prometheus.yml (tuned for high throughput)
global:
  scrape_interval: 15s
  scrape_timeout: 10s
# Command-line flags for TSDB optimization
# --storage.tsdb.wal-compression
# --storage.tsdb.min-block-duration=2h
# --storage.tsdb.max-block-duration=36h
# --query.max-samples=50000000
# --query.timeout=2m
Efficient PromQL Patterns
# BAD: Graphing a raw, ever-increasing histogram series
http_request_duration_seconds_sum
# GOOD: Average request duration over a 5-minute window
rate(http_request_duration_seconds_sum[5m]) / rate(http_request_duration_seconds_count[5m])
# BAD: Fetching all series
http_requests_total
# GOOD: Filter early with label matchers
http_requests_total{service="api", status=~"5.."}
# Use recording rules for dashboard queries
# prometheus-rules.yml
groups:
  - name: grafana_recording
    rules:
      - record: instance:http_requests:rate5m
        expr: sum(rate(http_requests_total[5m])) by (instance, job)
Grafana Query Optimization
Use Grafana variables to reduce the number of queries per dashboard. Define variables for service, instance, and environment selectors, then use them in PromQL expressions. Query caching for repeated queries is available in Grafana Enterprise and Grafana Cloud.
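As a rough illustration of how dashboard variables end up in a query, the sketch below substitutes `$name` placeholders in a PromQL expression, joining multi-value selections into a regex alternation for use with `=~`. The `interpolate` function is hypothetical and much simpler than Grafana's real interpolation (it does no regex escaping, for instance).

```typescript
// Replace $name placeholders with the selected variable values.
// Multiple selections become a regex alternation such as (api|web).
function interpolate(expr: string, variables: Record<string, string[]>): string {
  return expr.replace(/\$(\w+)/g, (match: string, name: string) => {
    const values = variables[name];
    // Leave unknown placeholders untouched.
    if (values === undefined) return match;
    return values.length === 1 ? values[0] : `(${values.join("|")})`;
  });
}

// One panel query serves every service selected in the dashboard dropdown.
const panelQuery = interpolate(
  'sum(rate(http_requests_total{service=~"$service", environment="$environment"}[5m]))',
  { service: ["api", "web"], environment: ["production"] },
);
```

A single templated panel like this can stand in for one hard-coded panel per service.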
Comparison with Alternatives
| Feature | Prometheus + Grafana | Datadog | New Relic | ELK Stack |
|---|---|---|---|---|
| Cost | Free software, self-hosted (infrastructure costs apply) | $15-23/host/month | $0.30/GB ingested | Free software, self-hosted (infrastructure costs apply) |
| Data Model | Dimensional time series | Unified metrics, logs, traces | NRDB (proprietary) | Documents (Elasticsearch) |
| Query Language | PromQL | Custom query language | NRQL | Lucene/KQL |
| Alerting | Alertmanager (separate component) or Grafana Alerting | Built-in | Built-in | Watcher/Alerting |
| Long-term Storage | Thanos, Cortex, Mimir | Built-in | Built-in | Built-in |
| Visualization | Grafana (best in class) | Built-in | Built-in | Kibana |
| Kubernetes Native | Yes (CNCF graduated) | Yes | Yes | Yes |
| Learning Curve | Medium — PromQL is powerful but unique | Low | Medium | High |
Advanced Patterns
Multi-Tenant Monitoring with Thanos
Thanos extends Prometheus with global query view, long-term storage, and downsampling. Deploy Thanos Sidecar alongside each Prometheus instance to ship data to object storage (S3, GCS, Azure Blob).
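# thanos-sidecar configuration
# Fragment: shared volume mounts for /prometheus and /etc/thanos are omitted
apiVersion: v1
kind: Pod
metadata:
  name: prometheus-with-thanos
spec:
  containers:
    - name: prometheus
      image: prom/prometheus:v2.50.0
      args:
        - '--storage.tsdb.path=/prometheus'
        - '--storage.tsdb.min-block-duration=2h'
        # Match min and max block duration so Prometheus does not compact
        # blocks locally before the sidecar uploads them
        - '--storage.tsdb.max-block-duration=2h'
    - name: thanos-sidecar
      image: thanosio/thanos:v0.34.0
      args:
        - 'sidecar'
        - '--tsdb.path=/prometheus'
        - '--objstore.config-file=/etc/thanos/bucket.yml'
Service Discovery with Consul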
scrape_configs:
  - job_name: 'consul-services'
    consul_sd_configs:
      - server: 'consul:8500'
        tags:
          - 'prometheus'
    relabel_configs:
      - source_labels: [__meta_consul_service]
        target_label: service
      # Rewrite the scrape address to <node address>:<port from the prometheus-port-N tag>
      - source_labels: [__meta_consul_address, __meta_consul_tags]
        regex: '([^;]+);,(?:[^,]+,)*prometheus-port-(\d+),.*'
        target_label: __address__
        replacement: '${1}:${2}'
Testing Strategies
Validating Rules with promtool
Use promtool to validate configuration and rules before deploying them:
# Validate configuration
promtool check config prometheus.yml
# Validate rules
promtool check rules rules/*.yml
# Unit-test alerting rules offline against synthetic series (no running Prometheus needed)
promtool test rules test-cases.yml
# test-cases.yml
rule_files:
  - alerts.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'http_requests_total{service="api", status="500"}'
        values: '0+10x30' # 10 errors per minute for 30 minutes
      - series: 'http_requests_total{service="api", status="200"}'
        values: '0+100x30' # 100 requests per minute
    alert_rule_test:
      - eval_time: 5m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              service: api
Future Outlook
The Prometheus ecosystem is evolving toward a unified observability platform. OpenTelemetry is becoming the standard for metrics, traces, and logs collection, with native Prometheus compatibility built in. Grafana Mimir provides horizontally scalable, highly available long-term storage for Prometheus metrics with native Grafana integration.
Prometheus's remote write protocol is being enhanced to support more efficient compression and batching, enabling better integration with long-term storage backends. Prometheus Agent mode, well suited to edge and IoT deployments, scrapes targets and forwards samples via remote write while keeping only a write-ahead log rather than a queryable local TSDB, which substantially reduces its resource requirements.
Conclusion
The Prometheus-Grafana stack provides a powerful, flexible, and cost-effective monitoring solution that scales from development to enterprise production environments. By combining Prometheus's dimensional data model and PromQL with Grafana's visualization capabilities, teams gain comprehensive visibility into application and infrastructure health.
Key takeaways from this guide:
- The pull-based model simplifies service discovery and makes monitoring infrastructure agnostic to application deployment patterns.
- Recording rules pre-compute expensive queries, making dashboards fast and reducing query load during incidents.
- Label discipline is critical — keep cardinality low, use consistent naming conventions, and avoid high-churn labels.
- Alert on SLOs, not symptoms — define error budgets and alert when the budget is being consumed too quickly.
- Monitor the monitoring stack — Prometheus, Grafana, and Alertmanager all expose metrics that should be scraped and alerted on.
Start by deploying Prometheus and Grafana with Docker Compose, instrumenting your application with the Prometheus client library, and building your first dashboard. As your system grows, add recording rules, alerting, and long-term storage with Thanos or Mimir. The Prometheus documentation and Grafana documentation provide comprehensive references for every feature discussed in this guide.