Kubernetes Autoscaling: HPA, VPA, and KEDA in Depth
Autoscaling is one of Kubernetes' most powerful capabilities, enabling workloads to adapt dynamically to changing demand. But choosing the right autoscaling strategy—and configuring it correctly—requires understanding the underlying mechanics, trade-offs, and operational pitfalls. This guide dives deep into the three primary autoscaling mechanisms: HPA, VPA, and KEDA, along with Cluster Autoscaler, production patterns, and cost optimization strategies.
Why Autoscaling Matters
Without autoscaling, you face two fundamental problems:
- Over-provisioning: Allocating peak-capacity resources wastes money during low-traffic periods. A cluster sized for Black Friday traffic runs at 5% utilization the other 364 days.
- Under-provisioning: Allocating too few resources causes latency spikes, dropped requests, and outages during traffic surges.
Autoscaling solves both by dynamically adjusting resources to match actual demand. In a cloud environment where you pay per resource-hour, right-sizing translates directly to cost savings—often 30-60% compared to static provisioning.
The Cost of Getting It Wrong
Consider a real-world scenario: an e-commerce platform running 50 pods at 2 vCPU each during normal traffic. During a flash sale, traffic increases 10x. Without autoscaling, the platform either crashes (under-provisioned) or wastes $15,000/month running 500 pods year-round (over-provisioned). With HPA, the platform scales from 50 to 500 pods in minutes, then back down after the sale—paying only for what it used.
The Three Autoscaling Dimensions
Kubernetes autoscaling operates across three dimensions:
- Horizontal (HPA): Adds or removes pod replicas. Best for stateless workloads that can distribute load across instances.
- Vertical (VPA): Adjusts CPU and memory requests/limits for existing pods. Best for stateful workloads or batch jobs where horizontal scaling is impractical.
- Event-driven (KEDA): Scales based on external event sources like message queues, databases, or custom metrics. Best for event-driven architectures and can scale to zero.
In production, you often combine them: HPA for CPU/memory-based scaling during traffic spikes, KEDA for queue-based scaling in async pipelines, and VPA for right-sizing resource requests over time.
HPA: Horizontal Pod Autoscaler
How HPA Works Under the Hood
The HPA controller runs in the kube-controller-manager and follows a control loop:
- Fetches current metric values from the Metrics API (or custom metrics API)
- Calculates the desired number of replicas using the formula:
desiredReplicas = ceil(currentReplicas * (currentMetricValue / targetMetricValue))
- Compares desired replicas with current replicas
- If the difference exceeds the tolerance (default 10%), applies scaling
- Respects stabilization windows to prevent flapping
Example calculation: If you have 3 pods at 80% CPU utilization with a target of 40%, the HPA computes ceil(3 * (80/40)) = 6 pods.
The tolerance parameter (default 0.1 or 10%) prevents unnecessary scaling when the metric is close to the target. If current utilization is 62% and target is 60%, the 3.3% difference is within tolerance, so no scaling occurs.
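The formula and the tolerance check can be sketched in a few lines of JavaScript. This is a simplified illustration, not the controller's actual code: the function name `desiredReplicas` is hypothetical, and the real HPA also accounts for unready pods, missing metrics, and the min/max replica bounds.

```javascript
// Simplified sketch of the HPA replica calculation (hypothetical helper).
function desiredReplicas(currentReplicas, currentMetric, targetMetric, tolerance = 0.1) {
  const ratio = currentMetric / targetMetric;
  // If the ratio is within the tolerance band around 1.0, do nothing.
  if (Math.abs(ratio - 1) <= tolerance) {
    return currentReplicas;
  }
  return Math.ceil(currentReplicas * ratio);
}

console.log(desiredReplicas(3, 80, 40)); // 6 (the worked example above)
console.log(desiredReplicas(3, 62, 60)); // 3 (ratio 1.033 is within tolerance)
```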
Installing the Metrics Server
The Metrics Server is a prerequisite for HPA. It collects resource metrics from kubelet and exposes them through the Kubernetes Metrics API.
# Install Metrics Server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
# For development clusters with self-signed certificates, add:
# --kubelet-insecure-tls to the metrics-server deployment args
# Verify installation
kubectl get deployment metrics-server -n kube-system
kubectl top nodes
kubectl top pods
If kubectl top returns "Metrics API not available," check that the metrics-server pods are running and that the API group metrics.k8s.io is registered:
kubectl get apiservice v1beta1.metrics.k8s.io
Basic HPA Configuration
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-app
spec:
replicas: 2
selector:
matchLabels:
app: web-app
template:
metadata:
labels:
app: web-app
spec:
containers:
- name: web
image: nginx:1.25
resources:
requests:
cpu: "100m"
memory: "128Mi"
limits:
cpu: "500m"
memory: "256Mi"
ports:
- containerPort: 80
readinessProbe:
httpGet:
path: /healthz
port: 80
initialDelaySeconds: 5
periodSeconds: 10
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: web-app-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: web-app
minReplicas: 2
maxReplicas: 10
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 60
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 70
behavior:
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 25
periodSeconds: 60
scaleUp:
stabilizationWindowSeconds: 0
policies:
- type: Percent
value: 100
periodSeconds: 30
The behavior field (available in autoscaling/v2) gives you fine-grained control over scaling velocity. The configuration above allows aggressive scale-up (double capacity every 30 seconds) but conservative scale-down (remove at most 25% of pods per minute, with a 5-minute stabilization window).
Custom Metrics with Prometheus
For application-specific metrics like request latency or queue depth, you need the Prometheus Adapter:
# Install Prometheus Adapter
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus-adapter prometheus-community/prometheus-adapter \
--set prometheus.url=http://prometheus-server.prometheus.svc
# prometheus-adapter config: map Prometheus metric to Kubernetes custom metric
rules:
custom:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
resources:
overrides:
namespace: {resource: "namespace"}
pod: {resource: "pod"}
name:
matches: "^(.*)_total$"
as: "${1}_per_second"
metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
# HPA using custom metrics from Prometheus
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: web-app-hpa-custom
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: web-app
minReplicas: 2
maxReplicas: 20
metrics:
- type: Pods
pods:
metric:
name: http_requests_per_second
target:
type: AverageValue
averageValue: "1000"
- type: Object
object:
metric:
name: requests-per-second
describedObject:
apiVersion: networking.k8s.io/v1
kind: Ingress
name: web-app-ingress
target:
type: Value
value: "10k"
Multi-Metric HPA with Weighted Scoring
When using multiple metrics, the HPA evaluates each independently and selects the highest desired replica count:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: api-server-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: api-server
minReplicas: 3
maxReplicas: 50
metrics:
# Scale on CPU
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 70
# Scale on memory
- type: Resource
resource:
name: memory
target:
type: Utilization
averageUtilization: 75
# Scale on request latency (from Prometheus)
- type: Pods
pods:
metric:
name: http_request_duration_p99
target:
type: AverageValue
averageValue: "250m" # 250ms p99 latency
# Scale on queue depth (from Prometheus)
- type: Pods
pods:
metric:
name: job_queue_depth
target:
type: AverageValue
averageValue: "100"
In this example, if CPU wants 5 replicas but latency wants 12 replicas, HPA scales to 12. This conservative approach (take the max) prevents under-provisioning.
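The "take the max" rule can be sketched as follows. This is an illustrative model only: `desiredFromMetrics` and the sample metric values are hypothetical, and the tolerance check is left out for brevity.

```javascript
// Each metric proposes a replica count; the highest proposal wins,
// then the result is clamped to the minReplicas/maxReplicas bounds.
function desiredFromMetrics(currentReplicas, metrics, minReplicas, maxReplicas) {
  const proposals = metrics.map((m) =>
    Math.ceil(currentReplicas * (m.current / m.target))
  );
  const highest = Math.max(...proposals);
  return Math.min(maxReplicas, Math.max(minReplicas, highest));
}

// Hypothetical readings: CPU at 87.5% vs a 70% target, p99 latency 0.75s vs 0.25s.
const metrics = [
  { name: 'cpu', current: 87.5, target: 70 },       // proposes 5 replicas
  { name: 'latency', current: 0.75, target: 0.25 }, // proposes 12 replicas
];
console.log(desiredFromMetrics(4, metrics, 3, 50)); // 12
```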
HPA Behavior Tuning for Production
The behavior field controls scaling velocity. Getting this wrong causes flapping (rapid scale-up followed by immediate scale-down):
behavior:
scaleUp:
stabilizationWindowSeconds: 60
selectPolicy: Max
policies:
- type: Pods
value: 4
periodSeconds: 60
- type: Percent
value: 100
periodSeconds: 60
scaleDown:
stabilizationWindowSeconds: 300 # 5 minutes
selectPolicy: Min
policies:
- type: Percent
value: 10
periodSeconds: 60
Key tuning guidelines:
- scaleDown stabilizationWindowSeconds: 300+ — Prevents premature scale-down after transient load spikes. The HPA uses the highest recommendation within the window.
- scaleUp stabilizationWindowSeconds: 0-60 — Fast reaction to load increases. Users experience latency if scale-up is too slow.
- selectPolicy: Min for scaleDown — Use the most conservative (smallest) scale-down to avoid removing too many pods at once.
- selectPolicy: Max for scaleUp — Use the most aggressive scale-up to handle sudden traffic.
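To see how selectPolicy chooses between policies, here is a rough sketch using the policy values from the example above. The helper `allowedChange` is hypothetical and its rounding is simplified; the real controller applies its own rounding and tracks replica counts at the start of each period.

```javascript
// How many pods each policy allows adding or removing in one period,
// and which one selectPolicy picks.
function allowedChange(replicas, policies, selectPolicy) {
  const changes = policies.map((p) =>
    p.type === 'Pods' ? p.value : Math.ceil((replicas * p.value) / 100)
  );
  return selectPolicy === 'Min' ? Math.min(...changes) : Math.max(...changes);
}

const scaleUp = [
  { type: 'Pods', value: 4 },
  { type: 'Percent', value: 100 },
];
const scaleDown = [{ type: 'Percent', value: 10 }];

console.log(10 + allowedChange(10, scaleUp, 'Max'));   // 20: Percent policy wins
console.log(2 + allowedChange(2, scaleUp, 'Max'));     // 6: Pods policy wins at low counts
console.log(10 - allowedChange(10, scaleDown, 'Min')); // 9: at most 1 pod removed
```

The Pods policy matters mostly at small replica counts, where a percentage alone would allow only one or two extra pods per period.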
Cooldown Periods and Flapping Prevention
Flapping occurs when HPA scales up, traffic drops briefly, HPA scales down, then traffic returns—creating a cycle of scaling that wastes resources and destabilizes the application. The stabilization window is your primary defense:
# Anti-flapping configuration
behavior:
scaleUp:
stabilizationWindowSeconds: 30 # React quickly to demand
policies:
- type: Percent
value: 100
periodSeconds: 15
scaleDown:
stabilizationWindowSeconds: 600 # 10-minute cooldown
policies:
- type: Percent
value: 10
periodSeconds: 120
The 10-minute scale-down window means HPA won't scale down unless the metric has been below target for 10 continuous minutes. This is critical for traffic patterns with natural dips (lunch breaks, shift changes).
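The scale-down stabilization window can be modeled as "use the highest recommendation seen within the window". The sketch below assumes a hypothetical in-memory history of recommendations with timestamps in seconds; it illustrates the idea and is not the controller's implementation.

```javascript
// Pick the highest replica recommendation made within the last windowSeconds.
// Assumes the history always contains at least one entry inside the window.
function stabilizedRecommendation(history, nowSeconds, windowSeconds) {
  const recent = history
    .filter((r) => nowSeconds - r.time <= windowSeconds)
    .map((r) => r.replicas);
  return Math.max(...recent);
}

const history = [
  { time: 0, replicas: 20 },
  { time: 200, replicas: 8 },  // brief dip
  { time: 400, replicas: 18 }, // traffic returns
  { time: 650, replicas: 6 },
];

console.log(stabilizedRecommendation(history, 650, 600));  // 18: the dip is ignored
console.log(stabilizedRecommendation(history, 1100, 600)); // 6: load stayed low long enough
```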
VPA: Vertical Pod Autoscaler
VPA Architecture
The Vertical Pod Autoscaler consists of three components:
- VPA Recommender — Analyzes historical resource usage via the Metrics API and recommends CPU/memory requests. Uses exponential histogram decay to weight recent data more heavily.
- VPA Updater — Evicts pods whose resource requests deviate significantly from recommendations. Only evicts pods that can be safely disrupted (respects PodDisruptionBudgets).
- VPA Admission Controller — Intercepts pod creation requests via a mutating webhook and injects recommended resource values.
The VPA does not modify running pods directly. It evicts pods that have drifted from recommendations, and the admission controller sets new values when the pod is recreated.
Installing VPA
# Install VPA from the official autoscaler repository
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh
# Verify all three components are running
kubectl get pods -n kube-system | grep vpa
VPA Configuration Modes
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: postgres-vpa
spec:
targetRef:
apiVersion: apps/v1
kind: StatefulSet
name: postgres
updatePolicy:
updateMode: "Auto" # Auto | Off | Initial | Recreate
resourcePolicy:
containerPolicies:
- containerName: postgres
minAllowed:
cpu: "250m"
memory: "512Mi"
maxAllowed:
cpu: "4"
memory: "16Gi"
controlledResources: ["cpu", "memory"]
controlledValues: RequestsOnly
Update modes explained:
| Mode | Behavior | Use Case |
|---|---|---|
| Off | Only provides recommendations, no changes | Observing before automating |
| Initial | Sets values only at pod creation | New deployments |
| Recreate | Evicts and recreates pods with new values | Stateless workloads |
| Auto | Currently equivalent to Recreate: applies recommendations at pod creation and by evicting running pods | Production workloads, after an observation period in Off mode |
Critical warning: Never use VPA with HPA on the same resource metric. If HPA scales on CPU utilization and VPA adjusts CPU requests, they create a feedback loop: VPA increases requests → utilization drops → HPA scales down → load increases → HPA scales up → utilization spikes → VPA increases requests. Use HPA on CPU and VPA on memory, or use HPA with custom metrics while VPA handles resource right-sizing.
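A small calculation shows why the two controllers interfere. The numbers are hypothetical: each pod uses 400m CPU, the HPA target is 60%, and VPA raises the CPU request from 500m to 1000m while actual load stays the same.

```javascript
// CPU utilization as the HPA sees it: usage relative to the request.
function utilizationPercent(usageMillicores, requestMillicores) {
  return (usageMillicores * 100) / requestMillicores;
}

// Basic HPA formula, without tolerance or bounds.
function hpaDesired(currentReplicas, utilization, target) {
  return Math.ceil(currentReplicas * (utilization / target));
}

const before = utilizationPercent(400, 500);  // 80% with a 500m request
const after = utilizationPercent(400, 1000);  // 40% after VPA doubles the request

console.log(hpaDesired(4, before, 60)); // 6: HPA wants to scale up
console.log(hpaDesired(4, after, 60));  // 3: same load, but HPA now scales down
```

The load never changed, yet the HPA's decision flipped from scale-up to scale-down purely because VPA moved the denominator.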
VPA Recommendation Analysis
Check VPA recommendations before enabling Auto mode:
kubectl describe vpa postgres-vpa
Output includes:
Status:
Recommendation:
Container Recommendations:
Container Name: postgres
Lower Bound: cpu: 200m, memory: 400Mi
Target: cpu: 500m, memory: 1Gi
Upper Bound: cpu: 2, memory: 4Gi
Uncapped Target: cpu: 800m, memory: 1.5Gi
- Target: The recommended value VPA will apply
- Lower Bound: Minimum value VPA considers safe
- Upper Bound: Maximum value VPA considers reasonable
- Uncapped Target: The recommendation without min/max constraints applied
VPA Right-Sizing Workflow
The recommended production workflow for VPA is:
- Week 1-2: Deploy VPA in Off mode. Collect recommendations without making changes.
- Week 3-4: Review recommendations. Adjust min/max constraints based on observed patterns.
- Week 5+: Enable Auto mode with conservative constraints. Monitor closely for eviction storms.
VPA for Stateful Workloads
VPA is particularly valuable for stateful workloads like databases where horizontal scaling is complex:
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: redis-vpa
spec:
targetRef:
apiVersion: apps/v1
kind: StatefulSet
name: redis
updatePolicy:
updateMode: "Auto"
resourcePolicy:
containerPolicies:
- containerName: redis
minAllowed:
cpu: "100m"
memory: "256Mi"
maxAllowed:
cpu: "4"
memory: "8Gi"
controlledResources: ["cpu", "memory"]
For databases, VPA can reduce costs by 20-40% by right-sizing requests that were initially over-provisioned based on guesswork. The key is running in recommendation mode long enough to capture all workload patterns (peak hours, batch jobs, backups).
VPA vs HPA: When to Use Which
| Workload Type | Recommended | Reason |
|---|---|---|
| Stateless web servers | HPA | Easy to add/remove replicas |
| Batch processing | KEDA | Event-driven, can scale to zero |
| Databases | VPA | Horizontal scaling is complex |
| Message consumers | KEDA | Scale based on queue depth |
| CPU-bound services | HPA | Scale based on CPU utilization |
| Memory-bound services | VPA | Adjust memory requests |
| Mixed workloads | HPA + VPA | HPA on custom metrics, VPA on resources |
# Production VPA with conservative constraints
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
name: api-server-vpa
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: api-server
updatePolicy:
updateMode: "Auto"
resourcePolicy:
containerPolicies:
- containerName: api
minAllowed:
cpu: "100m"
memory: "128Mi"
maxAllowed:
cpu: "2"
memory: "4Gi"
controlledResources: ["cpu", "memory"]
controlledValues: RequestsOnly
KEDA: Kubernetes Event-Driven Autoscaling
Why KEDA Exists
Standard HPA has limitations:
- Cannot scale to zero replicas (minimum is 1)
- Primarily designed for CPU/memory metrics
- Requires the Metrics Server or Prometheus Adapter for custom metrics
- Not designed for event-driven architectures (queues, streams, topics)
KEDA addresses all of these. It bridges Kubernetes autoscaling with external event sources, enabling workloads to scale based on queue depth, message rates, database queries, or any custom metric.
KEDA Architecture
KEDA consists of two components:
- KEDA Operator — Watches ScaledObject resources and manages the lifecycle of the underlying HPA. When the event source reports zero events, the operator scales the deployment to zero.
- Metrics Server — Implements the Kubernetes external metrics API, translating event source metrics into values the HPA can consume.
KEDA creates an HPA resource under the hood. The HPA uses KEDA's external metrics for scaling decisions. When the metric drops to zero, KEDA scales the deployment to zero—a capability the standard HPA does not support.
Installing KEDA
# Install with Helm
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda \
--namespace keda \
--create-namespace
# Verify installation
kubectl get pods -n keda
KEDA ScaledObject for RabbitMQ
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: order-processor-scaledobject
spec:
scaleTargetRef:
name: order-processor
pollingInterval: 15
cooldownPeriod: 300
minReplicaCount: 0
maxReplicaCount: 50
triggers:
- type: rabbitmq
metadata:
host: amqp://user:password@rabbitmq.rabbitmq.svc:5672
queueName: order-processing
queueLength: "10" # Target 10 messages per pod
authenticationRef:
name: rabbitmq-auth
---
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
name: rabbitmq-auth
spec:
secretTargetRef:
- parameter: host
name: rabbitmq-secret
key: connection-string
With this configuration:
- When the order-processing queue has 0 messages, KEDA scales to 0 replicas
- When 100 messages arrive, KEDA scales to ceil(100/10) = 10 pods
- When the queue drains, KEDA waits 300 seconds (cooldown) before scaling back to 0
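The queue-length arithmetic, including the min/max bounds, can be sketched like this. The function `queueReplicas` is a hypothetical model: in a real cluster the decision passes through the HPA that KEDA manages, and the cooldown period delays the final drop to zero.

```javascript
// Target replicas for a queue trigger: one pod per queueLength messages,
// clamped to the ScaledObject's minReplicaCount/maxReplicaCount.
function queueReplicas(messages, queueLength, minReplicaCount, maxReplicaCount) {
  const wanted = Math.ceil(messages / queueLength);
  return Math.min(maxReplicaCount, Math.max(minReplicaCount, wanted));
}

console.log(queueReplicas(0, 10, 0, 50));       // 0: empty queue, scale to zero
console.log(queueReplicas(100, 10, 0, 50));     // 10: ceil(100/10)
console.log(queueReplicas(10000, 100, 0, 50));  // 50: capped by maxReplicaCount
```

The last call is a reminder that maxReplicaCount always wins: a deep queue cannot push the workload past the configured ceiling.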
KEDA with Multiple Triggers
KEDA supports multiple triggers on a single ScaledObject. It evaluates each trigger independently and selects the highest replica count:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: image-processor-scaledobject
spec:
scaleTargetRef:
name: image-processor
minReplicaCount: 0
maxReplicaCount: 100
triggers:
# Scale on SQS queue depth
- type: aws-sqs-queue
metadata:
queueURL: https://sqs.us-east-1.amazonaws.com/123456789/image-processing
queueLength: "5"
authenticationRef:
name: aws-credentials
# Scale on Prometheus metric (GPU utilization)
- type: prometheus
metadata:
serverAddress: http://prometheus.monitoring.svc:9090
metricName: gpu_utilization_avg
threshold: "80"
query: avg(gpu_utilization{app="image-processor"})
# Scale on Kafka lag
- type: kafka
metadata:
bootstrapServers: kafka-cluster.kafka.svc:9092
consumerGroup: image-processor-group
topic: image-uploads
lagThreshold: "50"
KEDA Scalers Reference
KEDA supports 50+ scalers out of the box. Here are the most commonly used:
| Scaler | Trigger Type | Use Case |
|---|---|---|
| RabbitMQ | rabbitmq | Message queue processing |
| Apache Kafka | kafka | Stream processing with consumer lag |
| AWS SQS | aws-sqs-queue | AWS queue-based processing |
| Azure Service Bus | azure-servicebus | Azure message processing |
| Prometheus | prometheus | Any metric from Prometheus |
| PostgreSQL | postgresql | Scale based on query results |
| MySQL | mysql | Scale based on query results |
| Redis | redis | Scale based on Redis list length |
| Cron | cron | Time-based scaling |
| CPU | cpu | Built-in CPU scaling (cannot scale to zero on its own; pair with an event trigger) |
| Memory | memory | Built-in memory scaling (cannot scale to zero on its own; pair with an event trigger) |
| Google Cloud Pub/Sub | gcp-pubsub | GCP message processing |
| AWS CloudWatch | aws-cloudwatch | AWS metric-based scaling |
| Datadog | datadog | Datadog metric-based scaling |
| Elasticsearch | elasticsearch | Scale based on search query results |
| MongoDB | mongodb | Scale based on query results |
| NATS JetStream | nats-jetstream | NATS message processing |
| Apache Pulsar | pulsar | Pulsar message processing |
| Huawei Cloudeye | huawei-cloudeye | Huawei Cloud metrics |
| Liiklus | liiklus | Reactive streams |
| OpenStack Metrics | openstack-metric | OpenStack metrics |
KEDA with ScaledJob
For batch workloads that run to completion, use ScaledJob instead of ScaledObject:
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
name: data-migration-job
spec:
jobTargetRef:
template:
spec:
containers:
- name: migration
image: data-migrator:latest
command: ["./migrate"]
resources:
requests:
cpu: "500m"
memory: "512Mi"
restartPolicy: Never
pollingInterval: 30
maxReplicaCount: 20
successfulJobsHistoryLimit: 5
failedJobsHistoryLimit: 3
triggers:
- type: rabbitmq
metadata:
host: amqp://user:password@rabbitmq.rabbitmq.svc:5672
queueName: migration-tasks
queueLength: "1"
authenticationRef:
name: rabbitmq-auth
ScaledJob creates a new Job for each message in the queue, up to maxReplicaCount. When the queue empties, no new Jobs are created and completed Jobs are cleaned up based on history limits.
Cron-Based Scaling
For predictable traffic patterns, combine KEDA's cron scaler with event-driven scaling:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: api-server-scaledobject
spec:
scaleTargetRef:
name: api-server
minReplicaCount: 1
maxReplicaCount: 100
triggers:
- type: cron
metadata:
timezone: America/New_York
start: "0 8 * * *" # Scale up at 8 AM
end: "0 20 * * *" # Scale down at 8 PM
desiredReplicas: "10"
- type: prometheus
metadata:
serverAddress: http://prometheus.monitoring.svc:9090
metricName: http_requests_per_second
threshold: "500"
query: sum(rate(http_requests_total{app="api-server"}[1m]))
During business hours (8 AM - 8 PM), KEDA ensures at least 10 replicas. Outside business hours, it scales down to 1 (the minReplicaCount) unless Prometheus metrics indicate higher demand.
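How the cron and Prometheus triggers combine can be modeled roughly as below. This sketch assumes the Prometheus trigger behaves as an average-value target (replicas of about value divided by threshold) and reduces the cron window to whole hours; `effectiveReplicas` and the `config` object are hypothetical names that mirror the ScaledObject above.

```javascript
// Values copied from the ScaledObject example; field names are illustrative.
const config = {
  minReplicaCount: 1,
  maxReplicaCount: 100,
  cronStartHour: 8,
  cronEndHour: 20,
  cronDesiredReplicas: 10,
  threshold: 500, // requests per second per replica
};

function effectiveReplicas(hour, requestsPerSecond, c = config) {
  const inWindow = hour >= c.cronStartHour && hour < c.cronEndHour;
  const cronWants = inWindow ? c.cronDesiredReplicas : 0;
  const metricWants = Math.ceil(requestsPerSecond / c.threshold);
  // The highest trigger wins, then min/max bounds apply.
  const highest = Math.max(cronWants, metricWants);
  return Math.min(c.maxReplicaCount, Math.max(c.minReplicaCount, highest));
}

console.log(effectiveReplicas(10, 2000));  // 10: cron floor during business hours
console.log(effectiveReplicas(22, 2000));  // 4: only the metric applies at night
console.log(effectiveReplicas(22, 0));     // 1: falls back to minReplicaCount
console.log(effectiveReplicas(10, 60000)); // 100: capped by maxReplicaCount
```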
Cluster Autoscaler
How Cluster Autoscaler Works
While HPA, VPA, and KEDA scale pods, the Cluster Autoscaler scales nodes. When pods can't be scheduled due to insufficient cluster capacity, the Cluster Autoscaler adds nodes. When nodes are underutilized, it removes them.
# Install Cluster Autoscaler (AWS EKS example)
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
--set autoDiscovery.clusterName=my-cluster \
--set awsRegion=us-east-1 \
--set "extraArgs.balance-similar-node-groups=true" \
--set "extraArgs.skip-nodes-with-local-storage=false"
Cluster Autoscaler Configuration
# Cluster Autoscaler deployment arguments
extraArgs:
scale-down-delay-after-add: "10m" # Wait 10 min after adding a node before considering removal
scale-down-unneeded-time: "10m" # Node must be unneeded for 10 min before removal
scale-down-utilization-threshold: "0.5" # Remove nodes below 50% utilization
max-node-provision-time: "15m" # Max time to wait for a new node
skip-nodes-with-local-storage: "false" # Consider nodes with local storage for removal
expander: "least-waste" # Choose node type with least wasted resources
Node Group Strategies
# Terraform: Multiple node groups for different workload types
resource "aws_eks_node_group" "general" {
cluster_name = aws_eks_cluster.main.name
node_group_name = "general-purpose"
instance_types = ["m5.large", "m5.xlarge"]
scaling_config {
min_size = 2
max_size = 20
desired_size = 3
}
labels = {
workload-type = "general"
}
}
resource "aws_eks_node_group" "compute" {
cluster_name = aws_eks_cluster.main.name
node_group_name = "compute-optimized"
instance_types = ["c5.2xlarge", "c5.4xlarge"]
scaling_config {
min_size = 0
max_size = 10
desired_size = 0
}
labels = {
workload-type = "compute"
}
taint {
key = "workload-type"
value = "compute"
effect = "NO_SCHEDULE"
}
}
HPA + Cluster Autoscaler Coordination
When HPA scales pods but the cluster lacks capacity, the Cluster Autoscaler must add nodes. This creates a two-step scaling delay:
Traffic spike → HPA wants more pods → Pods pending (no capacity) → Cluster Autoscaler adds node → Pods scheduled
To minimize this delay:
# HPA configuration for fast scaling
behavior:
scaleUp:
stabilizationWindowSeconds: 0
policies:
- type: Percent
value: 100
periodSeconds: 15
# Pod Disruption Budget to ensure minimum availability
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: api-pdb
spec:
minAvailable: "50%"
selector:
matchLabels:
app: api-server
# Priority Class to ensure critical pods are scheduled first
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
name: high-priority
value: 1000000
globalDefault: false
description: "High priority for user-facing services"
Real-World Scaling Scenarios
Scenario 1: Traffic Spike Handling
A news website experiences 50x traffic during breaking news events:
# HPA for baseline scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: news-frontend-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: news-frontend
minReplicas: 5
maxReplicas: 200
metrics:
- type: Resource
resource:
name: cpu
target:
type: Utilization
averageUtilization: 60
behavior:
scaleUp:
stabilizationWindowSeconds: 0
policies:
- type: Percent
value: 100
periodSeconds: 15
scaleDown:
stabilizationWindowSeconds: 600
policies:
- type: Percent
value: 10
periodSeconds: 120
Scenario 2: Batch Processing with KEDA
A data pipeline processes files from S3:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: data-processor-scaledobject
spec:
scaleTargetRef:
name: data-processor
minReplicaCount: 0
maxReplicaCount: 50
cooldownPeriod: 300
triggers:
- type: aws-sqs-queue
metadata:
queueURL: https://sqs.us-east-1.amazonaws.com/123456789/data-processing
queueLength: "100"
authenticationRef:
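name: aws-credentials
When 10,000 files are queued, the trigger asks for ceil(10000/100) = 100 pods, but maxReplicaCount caps the deployment at 50. Raise maxReplicaCount if you need the full 100. As the queue drains, pods scale back to zero.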
Scenario 3: Queue-Based Microservices
An order processing system with multiple stages:
# Order ingestion service
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: order-ingestion-scaledobject
spec:
scaleTargetRef:
name: order-ingestion
minReplicaCount: 1
maxReplicaCount: 20
triggers:
- type: prometheus
metadata:
serverAddress: http://prometheus.monitoring.svc:9090
metricName: http_requests_per_second
threshold: "1000"
query: sum(rate(http_requests_total{app="order-ingestion"}[1m]))
---
# Payment processing service
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: payment-processor-scaledobject
spec:
scaleTargetRef:
name: payment-processor
minReplicaCount: 0
maxReplicaCount: 50
triggers:
- type: rabbitmq
metadata:
host: amqp://user:password@rabbitmq.rabbitmq.svc:5672
queueName: payment-processing
queueLength: "50"
authenticationRef:
name: rabbitmq-auth
Production Scaling Patterns
Pattern 1: Warm Pool with Startup Probe
Cold-start latency is a major problem with scale-to-zero. Use a startup probe and pre-warmed pods:
spec:
template:
spec:
containers:
- name: api
startupProbe:
httpGet:
path: /ready
port: 8080
failureThreshold: 30
periodSeconds: 2
readinessProbe:
httpGet:
path: /healthz
port: 8080
periodSeconds: 5
To keep a warm pool, set KEDA's minReplicaCount above zero so a few pods stay running even when no events arrive. KEDA also has an idleReplicaCount field, but it must be lower than minReplicaCount and currently only supports 0, so it cannot hold warm pods:
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: api-scaledobject
spec:
scaleTargetRef:
name: api-server
minReplicaCount: 2 # Keep 2 warm pods when idle
maxReplicaCount: 50
# triggers omitted for brevity
Pattern 2: Scaling with Custom Application Metrics
Expose application-specific metrics from your code and scale on them:
// Node.js Express app exposing metrics for Prometheus
const promClient = require('prom-client');
const httpRequestDuration = new promClient.Histogram({
name: 'http_request_duration_seconds',
help: 'Duration of HTTP requests in seconds',
labelNames: ['method', 'route', 'status_code'],
buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
});
const activeConnections = new promClient.Gauge({
name: 'active_connections',
help: 'Number of active connections'
});
const queueDepth = new promClient.Gauge({
name: 'job_queue_depth',
help: 'Current depth of the processing queue'
});
// Middleware to record request duration
app.use((req, res, next) => {
const end = httpRequestDuration.startTimer();
res.on('finish', () => {
end({ method: req.method, route: req.route?.path || 'unknown', status_code: res.statusCode });
});
next();
});
// Expose /metrics endpoint
app.get('/metrics', async (req, res) => {
res.set('Content-Type', promClient.register.contentType);
res.end(await promClient.register.metrics());
});
# Prometheus ServiceMonitor to scrape the app
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: api-server-monitor
spec:
selector:
matchLabels:
app: api-server
endpoints:
- port: http
path: /metrics
interval: 15s
---
# HPA scaling on p99 latency
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
name: api-latency-hpa
spec:
scaleTargetRef:
apiVersion: apps/v1
kind: Deployment
name: api-server
minReplicas: 3
maxReplicas: 50
metrics:
- type: Pods
pods:
metric:
name: http_request_duration_seconds_p99
target:
type: AverageValue
averageValue: "500m" # Scale when p99 > 500ms
Cost Optimization with Autoscaling
Right-Sizing Resources
Autoscaling only saves money if your resource requests are accurate. Over-provisioned requests waste money even with autoscaling:
# Check actual vs requested resources
kubectl top pods -l app=api-server --containers
# Use VPA recommendations to right-size
kubectl describe vpa api-server-vpa | grep -A 10 "Recommendation"
Spot Instances and Autoscaling
Combine Cluster Autoscaler with spot instances for up to 90% cost savings:
# Terraform: Spot instance node group
resource "aws_eks_node_group" "spot" {
cluster_name = aws_eks_cluster.main.name
node_group_name = "spot-instances"
instance_types = ["m5.large", "m5.xlarge", "m5.2xlarge", "m5a.large"]
capacity_type = "SPOT"
scaling_config {
min_size = 0
max_size = 50
desired_size = 5
}
labels = {
"node-lifecycle" = "spot"
}
taint {
key = "spot"
value = "true"
effect = "NO_SCHEDULE"
}
}
Scaling Down Aggressively
During off-peak hours, scale down aggressively to minimize costs:
# KEDA cron-based scaling for cost optimization
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
name: batch-processor-scaledobject
spec:
scaleTargetRef:
name: batch-processor
minReplicaCount: 0
maxReplicaCount: 100
triggers:
- type: cron
metadata:
timezone: UTC
start: "0 2 * * *" # Scale up at 2 AM for batch processing
end: "0 6 * * *" # Scale down at 6 AM
desiredReplicas: "50"
- type: prometheus
metadata:
serverAddress: http://prometheus.monitoring.svc:9090
metricName: queue_depth
threshold: "1000"
query: sum(job_queue_depth{app="batch-processor"})
Cost Monitoring
Set up cost monitoring to track autoscaling impact:
# Prometheus alert for excessive scaling
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: autoscaling-alerts
spec:
groups:
- name: autoscaling
rules:
- alert: ExcessiveScaling
expr: |
changes(kube_horizontalpodautoscaler_status_desired_replicas{namespace="production"}[1h]) > 5
for: 10m
labels:
severity: warning
annotations:
summary: "HPA scaling excessively in {{ $labels.namespace }}"
description: "HPA {{ $labels.horizontalpodautoscaler }} has been scaling rapidly"
Autoscaling Anti-Patterns
Anti-Pattern 1: Scaling Without Resource Requests
HPA requires resource requests to calculate utilization. The Metrics Server still reports raw usage, but without a request there is nothing to divide it by, so the HPA cannot compute a utilization percentage and shows <unknown>:
# Wrong: No resource requests
containers:
- name: api
image: api:latest
# No resources section
# Correct: Resource requests defined
containers:
- name: api
image: api:latest
resources:
requests:
cpu: "100m"
memory: "128Mi"
limits:
cpu: "500m"
memory: "256Mi"
Anti-Pattern 2: Overly Aggressive Scale-Down
Setting scale-down policies too aggressively causes thrashing during traffic fluctuations:
# Wrong: Removes 100% of pods immediately
behavior:
scaleDown:
stabilizationWindowSeconds: 0
policies:
- type: Percent
value: 100
periodSeconds: 60
# Correct: Conservative scale-down
behavior:
scaleDown:
stabilizationWindowSeconds: 300
policies:
- type: Percent
value: 10
periodSeconds: 120
Anti-Pattern 3: Using VPA with HPA on Same Metric
The classic feedback loop problem:
# Wrong: Both scale on CPU
# VPA increases CPU requests → utilization drops → HPA scales down
---
# Correct: HPA on custom metrics, VPA on resources
# Or HPA on CPU, VPA on memory only
Anti-Pattern 4: Ignoring Pod Disruption Budgets
HPA scale-down is not governed by PDBs, but VPA evictions and Cluster Autoscaler node drains are. Without PDBs, those operations can evict too many pods at once:
# Always define PDBs for critical workloads
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
name: critical-app-pdb
spec:
minAvailable: "50%"
selector:
matchLabels:
app: critical-app
Anti-Pattern 5: Not Testing Scaling Behavior
Deploying autoscaling to production without testing is risky. Use load testing tools:
# Install k6 for load testing
brew install k6
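# Run load test (the load profile comes from the stages defined in load-test.js;
# passing --vus/--duration on the command line would override them)
k6 run load-test.js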
# Monitor HPA during test
watch kubectl get hpa
// load-test.js
import http from 'k6/http';
import { sleep } from 'k6';
export const options = {
stages: [
{ duration: '2m', target: 50 }, // Ramp up
{ duration: '5m', target: 50 }, // Stay at 50 users
{ duration: '2m', target: 200 }, // Spike to 200 users
{ duration: '5m', target: 200 }, // Stay at 200 users
{ duration: '2m', target: 0 }, // Ramp down
],
};
export default function () {
http.get('https://api.example.com/health');
sleep(1);
}
Monitoring and Debugging Autoscaling
Essential HPA Commands
# Check HPA status
kubectl get hpa
kubectl describe hpa web-app-hpa
# View current metrics
kubectl get hpa web-app-hpa -o yaml
# Check events for scaling decisions
kubectl get events --field-selector reason=SuccessfulRescale
# Monitor HPA in real-time
watch kubectl get hpa
Common HPA Issues and Solutions
| Issue | Symptom | Solution |
|---|---|---|
| Flapping | Pods scale up/down rapidly | Increase stabilizationWindowSeconds |
| No scaling | HPA shows <unknown> | Verify Metrics Server is running |
| Slow scale-up | Latency during traffic spikes | Reduce stabilizationWindowSeconds, increase policy values |
| Metrics delay | Scaling lags behind traffic | Reduce Metrics Server --metric-resolution |
| Insufficient capacity | Pods stuck in Pending | Configure Cluster Autoscaler, add node groups |
| VPA conflict | HPA and VPA fighting | Remove VPA from CPU or use custom metrics only |
VPA Debugging
# Check VPA recommendations
kubectl describe vpa postgres-vpa
# Compare recommendations vs actual usage
kubectl top pods -l app=postgres
# Check VPA updater events
kubectl get events --field-selector reason=EvictedByVPA
KEDA Debugging
# Check ScaledObject status
kubectl get scaledobject
kubectl describe scaledobject order-processor-scaledobject
# Check KEDA operator logs
kubectl logs -n keda -l app=keda-operator
# Check trigger authentication
kubectl get triggerauthentication
kubectl describe triggerauthentication rabbitmq-auth
# Verify external metrics
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq .
Best Practices Summary
- Always set resource requests: HPA and VPA both depend on accurate resource requests. Without them, the scheduler cannot make informed decisions.
- Use PDBs with autoscaling: Pod Disruption Budgets limit how many pods VPA evictions and Cluster Autoscaler node drains can take down at once.
- Test scaling behavior in staging: Simulate traffic patterns before deploying autoscaling to production. Use tools like hey or k6 to generate load.
- Monitor scaling decisions: Set up alerts for rapid scaling events, pods stuck in Pending, and HPA metrics showing <unknown>.
- Combine resource and event triggers for complex workloads: KEDA creates and manages its own HPA for each ScaledObject, so do not attach a separate HPA to the same deployment. Add KEDA's cpu or memory triggers alongside the event triggers in one ScaledObject instead.
- Start with conservative settings: Begin with narrow scaling ranges (min: 3, max: 10) and expand as you understand your traffic patterns.
- Use VPA in recommendation-only mode first: Run VPA with updateMode: "Off" for several weeks to understand resource patterns before enabling automatic adjustments.
- Right-size before scaling: Ensure resource requests are accurate before enabling autoscaling. Autoscaling with wrong requests wastes money.
- Use Cluster Autoscaler with multiple node groups: Different workload types benefit from different instance types. Use labels and taints to route workloads appropriately.
- Implement cost monitoring: Track the impact of autoscaling on your cloud bill. Set alerts for unexpected scaling behavior.
Kubernetes autoscaling is not a set-and-forget feature. It requires ongoing monitoring, tuning, and adaptation as your workloads evolve. Start with the basics, measure the impact, and iterate toward more sophisticated scaling strategies. The combination of HPA, VPA, KEDA, and Cluster Autoscaler gives you complete control over your infrastructure's responsiveness and cost efficiency.