Kubernetes Autoscaling: HPA, VPA, and KEDA in Depth

Master Kubernetes autoscaling with Horizontal Pod Autoscaler, Vertical Pod Autoscaler, and KEDA. Learn scaling algorithms, custom metrics, event-driven autoscaling, and production-grade scaling strategies.

Tags: kubernetes, autoscaling, hpa, vpa, keda, cloud-native, devops

By MinhOmega

Autoscaling is one of Kubernetes' most powerful capabilities, enabling workloads to adapt dynamically to changing demand. But choosing the right autoscaling strategy—and configuring it correctly—requires understanding the underlying mechanics, trade-offs, and operational pitfalls. This guide dives deep into the three primary autoscaling mechanisms: HPA, VPA, and KEDA, along with Cluster Autoscaler, production patterns, and cost optimization strategies.

Why Autoscaling Matters

Without autoscaling, you face two fundamental problems:

  1. Over-provisioning: Allocating peak-capacity resources wastes money during low-traffic periods. A cluster sized for Black Friday traffic runs at 5% utilization the other 364 days.
  2. Under-provisioning: Allocating too few resources causes latency spikes, dropped requests, and outages during traffic surges.

Autoscaling solves both by dynamically adjusting resources to match actual demand. In a cloud environment where you pay per resource-hour, right-sizing translates directly to cost savings—often 30-60% compared to static provisioning.

The Cost of Getting It Wrong

Consider an illustrative scenario: an e-commerce platform running 50 pods at 2 vCPU each during normal traffic. During a flash sale, traffic increases 10x. Without autoscaling, the platform either crashes (under-provisioned) or wastes $15,000/month running 500 pods year-round (over-provisioned). With HPA, and enough node capacity behind it, the platform scales from 50 to 500 pods in minutes, then back down after the sale, paying only for what it used.

The Three Autoscaling Dimensions

Kubernetes autoscaling operates across three dimensions:

  • Horizontal (HPA): Adds or removes pod replicas. Best for stateless workloads that can distribute load across instances.
  • Vertical (VPA): Adjusts CPU and memory requests/limits for existing pods. Best for stateful workloads or batch jobs where horizontal scaling is impractical.
  • Event-driven (KEDA): Scales based on external event sources like message queues, databases, or custom metrics. Best for event-driven architectures and can scale to zero.

In production, you often combine them: HPA for CPU/memory-based scaling during traffic spikes, KEDA for queue-based scaling in async pipelines, and VPA for right-sizing resource requests over time.

HPA: Horizontal Pod Autoscaler

How HPA Works Under the Hood

The HPA controller runs in the kube-controller-manager and executes a control loop (every 15 seconds by default):

  1. Fetches current metric values from the Metrics API (or custom metrics API)
  2. Calculates the desired number of replicas using the formula: desiredReplicas = ceil(currentReplicas * (currentMetricValue / targetMetricValue))
  3. Skips the change if the ratio currentMetricValue / targetMetricValue is within the tolerance of 1.0 (default 10%)
  4. Otherwise clamps the result to minReplicas/maxReplicas and applies any behavior policies
  5. Respects stabilization windows to prevent flapping

Example calculation: If you have 3 pods at 80% CPU utilization with a target of 40%, the HPA computes ceil(3 * (80/40)) = 6 pods.

The tolerance parameter (default 0.1 or 10%) prevents unnecessary scaling when the metric is close to the target. If current utilization is 62% and target is 60%, the 3.3% difference is within tolerance, so no scaling occurs.
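The formula and the tolerance check can be sketched in a few lines of Python. This is a simplified model rather than the controller's actual code: the function name is made up for illustration, and it ignores pod readiness, missing metrics, and the minReplicas/maxReplicas bounds.

```python
import math


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, tolerance: float = 0.1) -> int:
    """Replica count suggested by the HPA formula, including the tolerance check."""
    ratio = current_value / target_value
    # Inside the tolerance band the replica count is left unchanged.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


print(desired_replicas(3, 80, 40))  # 3 pods at 80% with a 40% target
print(desired_replicas(5, 62, 60))  # ratio 1.033 is within tolerance
```

The first call reproduces the 3-to-6 example above; the second leaves the replica count at 5 because the ratio sits inside the 10% band.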

Installing the Metrics Server

The Metrics Server is a prerequisite for HPA. It collects resource metrics from kubelet and exposes them through the Kubernetes Metrics API.

# Install Metrics Server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml
 
# For development clusters with self-signed certificates, add:
# --kubelet-insecure-tls to the metrics-server deployment args
 
# Verify installation
kubectl get deployment metrics-server -n kube-system
kubectl top nodes
kubectl top pods

If kubectl top returns "Metrics API not available," check that the metrics-server pods are running and that the API group metrics.k8s.io is registered:

kubectl get apiservice v1beta1.metrics.k8s.io

Basic HPA Configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
spec:
  replicas: 2
  selector:
    matchLabels:
      app: web-app
  template:
    metadata:
      labels:
        app: web-app
    spec:
      containers:
      - name: web
        image: nginx:1.25
        resources:
          requests:
            cpu: "100m"
            memory: "128Mi"
          limits:
            cpu: "500m"
            memory: "256Mi"
        ports:
        - containerPort: 80
        readinessProbe:
          httpGet:
            path: /  # stock nginx serves no /healthz endpoint, so probe the root path
            port: 80
          initialDelaySeconds: 5
          periodSeconds: 10
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 25
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30

The behavior field (available in autoscaling/v2) gives you fine-grained control over scaling velocity. The configuration above allows aggressive scale-up (double capacity every 30 seconds) but conservative scale-down (remove at most 25% of pods per minute, with a 5-minute stabilization window).
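To get a feel for these velocities, the sketch below counts how many policy periods the limits need to move between 2 and 10 replicas. It is an approximation: the helper names are hypothetical, and the rounding (up when scaling up, down when scaling down) is an assumption about how Percent policies are applied.

```python
import math


def scale_up_periods(start: int, goal: int, percent: int) -> int:
    """Periods needed to reach `goal` when each period may add `percent`% of the current replicas."""
    if start < 1:
        raise ValueError("start must be at least 1")
    replicas, periods = start, 0
    while replicas < goal:
        replicas = min(goal, math.ceil(replicas * (1 + percent / 100)))
        periods += 1
    return periods


def scale_down_periods(start: int, goal: int, percent: int) -> int:
    """Periods needed to shrink to `goal` when each period may remove `percent`% of the current replicas."""
    replicas, periods = start, 0
    while replicas > goal:
        replicas = max(goal, math.floor(replicas * (1 - percent / 100)))
        periods += 1
    return periods


up = scale_up_periods(2, 10, percent=100)     # 2 -> 4 -> 8 -> 10
down = scale_down_periods(10, 2, percent=25)  # 10 -> 7 -> 5 -> 3 -> 2
print(f"scale-up: {up} periods of 30s, scale-down: {down} periods of 60s")
```

Under these assumptions a full scale-up takes about 90 seconds, while a full scale-down takes about four minutes on top of the 5-minute stabilization window.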

Custom Metrics with Prometheus

For application-specific metrics like request latency or queue depth, you need the Prometheus Adapter:

# Install Prometheus Adapter
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus-adapter prometheus-community/prometheus-adapter \
  --set prometheus.url=http://prometheus-server.prometheus.svc

# prometheus-adapter config (Helm values): map Prometheus metric to Kubernetes custom metric
rules:
  custom:
  - seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
    resources:
      overrides:
        namespace: {resource: "namespace"}
        pod: {resource: "pod"}
    name:
      matches: "^(.*)_total$"
      as: "${1}_per_second"
    metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'

# HPA using custom metrics from Prometheus
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-app-hpa-custom
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web-app
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"
  - type: Object
    object:
      metric:
        name: requests-per-second
      describedObject:
        apiVersion: networking.k8s.io/v1
        kind: Ingress
        name: web-app-ingress
      target:
        type: Value
        value: "10k"
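The name rule in the adapter config is what connects the Prometheus series http_requests_total to the http_requests_per_second metric the HPA asks for. A small Python sketch of that renaming, assuming plain regex substitution (the adapter writes the capture group as ${1}, Python writes it as \1; the function name is hypothetical):

```python
import re


def adapter_metric_name(series_name, matches=r"^(.*)_total$",
                        as_template=r"\1_per_second"):
    """Rename a Prometheus series like the `name` rule does; None if the rule does not match."""
    if re.match(matches, series_name) is None:
        return None
    return re.sub(matches, as_template, series_name)


print(adapter_metric_name("http_requests_total"))
print(adapter_metric_name("process_cpu_seconds"))
```

If the HPA references a metric name the rule never produces, the metric shows up as unavailable, so it is worth checking the mapping against the exact series names in Prometheus.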

Multi-Metric HPA: Highest Recommendation Wins

When using multiple metrics, the HPA evaluates each independently and selects the highest desired replica count:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-server-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
  # Scale on CPU
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  # Scale on memory
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 75
  # Scale on request latency (from Prometheus)
  - type: Pods
    pods:
      metric:
        name: http_request_duration_p99
      target:
        type: AverageValue
        averageValue: "250m"  # 0.25 in quantity notation, i.e. 250 ms p99 if the metric is in seconds
  # Scale on queue depth (from Prometheus)
  - type: Pods
    pods:
      metric:
        name: job_queue_depth
      target:
        type: AverageValue
        averageValue: "100"

In this example, if CPU wants 5 replicas but latency wants 12 replicas, HPA scales to 12. This conservative approach (take the max) prevents under-provisioning.
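The take-the-max rule can be written out as a short Python sketch. The metric values are invented so that the numbers match the example (CPU proposes 5, latency proposes 12), and the function names are hypothetical.

```python
import math


def desired_for_metric(current_replicas, current_value, target_value):
    """Per-metric proposal using the standard HPA ratio formula."""
    return math.ceil(current_replicas * current_value / target_value)


def desired_for_hpa(current_replicas, metrics):
    """metrics maps a name to (current_value, target_value); the highest proposal wins."""
    proposals = {
        name: desired_for_metric(current_replicas, current, target)
        for name, (current, target) in metrics.items()
    }
    return max(proposals.values()), proposals


winner, proposals = desired_for_hpa(4, {
    "cpu": (80, 70),                 # 80% utilization against a 70% target
    "latency_p99": (0.75, 0.25),     # 750 ms against a 250 ms target
})
print(proposals, "->", winner)
```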

HPA Behavior Tuning for Production

The behavior field controls scaling velocity. Getting this wrong causes flapping (rapid scale-up followed by immediate scale-down):

behavior:
  scaleUp:
    stabilizationWindowSeconds: 60
    selectPolicy: Max
    policies:
    - type: Pods
      value: 4
      periodSeconds: 60
    - type: Percent
      value: 100
      periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 300  # 5 minutes
    selectPolicy: Min
    policies:
    - type: Percent
      value: 10
      periodSeconds: 60

Key tuning guidelines:

  • scaleDown stabilizationWindowSeconds: 300+ — Prevents premature scale-down after transient load spikes. The HPA uses the highest recommendation within the window.
  • scaleUp stabilizationWindowSeconds: 0-60 — Fast reaction to load increases. Users experience latency if scale-up is too slow.
  • selectPolicy: Min for scaleDown — Use the most conservative (smallest) scale-down to avoid removing too many pods at once.
  • selectPolicy: Max for scaleUp — Use the most aggressive scale-up to handle sudden traffic.
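How selectPolicy chooses between a Pods policy and a Percent policy can be illustrated with a small Python sketch. It models only the size of the allowed change in one period; the helper name is hypothetical and rounding up for Percent is an assumption.

```python
import math


def allowed_change(current, policies, select_policy="Max"):
    """Replica change permitted in one period for a list of (type, value) policies."""
    changes = []
    for kind, value in policies:
        if kind == "Pods":
            changes.append(value)
        elif kind == "Percent":
            changes.append(math.ceil(current * value / 100))
        else:
            raise ValueError(f"unknown policy type: {kind}")
    return max(changes) if select_policy == "Max" else min(changes)


scale_up = [("Pods", 4), ("Percent", 100)]
print(allowed_change(3, scale_up, "Max"))   # small deployment: the Pods policy wins
print(allowed_change(10, scale_up, "Max"))  # larger deployment: the Percent policy wins
```

With Max, the fixed Pods policy helps small deployments grow quickly, while the Percent policy takes over once the deployment is large.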

Cooldown Periods and Flapping Prevention

Flapping occurs when HPA scales up, traffic drops briefly, HPA scales down, then traffic returns—creating a cycle of scaling that wastes resources and destabilizes the application. The stabilization window is your primary defense:

# Anti-flapping configuration
behavior:
  scaleUp:
    stabilizationWindowSeconds: 30  # React quickly to demand
    policies:
    - type: Percent
      value: 100
      periodSeconds: 15
  scaleDown:
    stabilizationWindowSeconds: 600  # 10-minute cooldown
    policies:
    - type: Percent
      value: 10
      periodSeconds: 120

The 10-minute scale-down window means the HPA acts on the highest replica recommendation from the past 10 minutes, so it only scales down once the metric has stayed low enough for that entire window. This is critical for traffic patterns with natural dips (lunch breaks, shift changes).
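A minimal Python sketch of the scale-down stabilization idea, assuming the controller keeps timestamped recommendations and picks the highest one still inside the window (the function name and sample history are made up):

```python
def stabilized_scale_down(recommendations, now, window_seconds):
    """recommendations: list of (timestamp_seconds, desired_replicas), including the latest one."""
    recent = [replicas for ts, replicas in recommendations if now - ts <= window_seconds]
    return max(recent)


history = [(0, 20), (120, 8), (300, 6), (540, 6)]
print(stabilized_scale_down(history, now=540, window_seconds=600))  # spike still in window
print(stabilized_scale_down(history, now=660, window_seconds=600))  # spike has aged out
```

While the recommendation of 20 is still inside the window, the deployment stays at 20 replicas even though later recommendations are much lower.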

VPA: Vertical Pod Autoscaler

VPA Architecture

The Vertical Pod Autoscaler consists of three components:

  1. VPA Recommender — Analyzes historical resource usage via the Metrics API and recommends CPU/memory requests. Uses exponential histogram decay to weight recent data more heavily.
  2. VPA Updater — Evicts pods whose resource requests deviate significantly from recommendations. Only evicts pods that can be safely disrupted (respects PodDisruptionBudgets).
  3. VPA Admission Controller — Intercepts pod creation requests via a mutating webhook and injects recommended resource values.

In the eviction-based update modes covered here, the VPA does not modify running pods directly. It evicts pods that have drifted from recommendations, and the admission controller sets new values when the pod is recreated.

Installing VPA

# Install VPA from the official autoscaler repository
git clone https://github.com/kubernetes/autoscaler.git
cd autoscaler/vertical-pod-autoscaler
./hack/vpa-up.sh
 
# Verify all three components are running
kubectl get pods -n kube-system | grep vpa

VPA Configuration Modes

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: postgres-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: postgres
  updatePolicy:
    updateMode: "Auto"  # Auto | Off | Initial | Recreate
  resourcePolicy:
    containerPolicies:
    - containerName: postgres
      minAllowed:
        cpu: "250m"
        memory: "512Mi"
      maxAllowed:
        cpu: "4"
        memory: "16Gi"
      controlledResources: ["cpu", "memory"]
      controlledValues: RequestsOnly

Update modes explained:

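| Mode     | Behavior                                                                 | Use Case                   |
|----------|--------------------------------------------------------------------------|----------------------------|
| Off      | Only provides recommendations, no changes                                | Observing before automating |
| Initial  | Sets values only at pod creation                                         | New deployments            |
| Recreate | Evicts and recreates pods with new values                                | Stateless workloads        |
| Auto     | Applies values at creation and to running pods; currently behaves like Recreate | Production workloads       |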

Critical warning: Never use VPA with HPA on the same resource metric. If HPA scales on CPU utilization and VPA adjusts CPU requests, they create a feedback loop: VPA increases requests → utilization drops → HPA scales down → load increases → HPA scales up → utilization spikes → VPA increases requests. Use HPA on CPU and VPA on memory, or use HPA with custom metrics while VPA handles resource right-sizing.
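The first half of that feedback loop is easy to reproduce with arithmetic. In this Python sketch the per-pod usage stays constant while the CPU request is doubled, and the HPA's proposal drops as a result. All numbers and function names are illustrative assumptions.

```python
import math


def cpu_utilization(usage_millicores, request_millicores):
    """Utilization as the HPA sees it: usage relative to the request."""
    return 100 * usage_millicores / request_millicores


def hpa_desired(current_replicas, utilization, target_utilization):
    return math.ceil(current_replicas * utilization / target_utilization)


replicas, usage, target = 6, 400, 60  # 6 pods, each using 400m CPU, 60% target

before = hpa_desired(replicas, cpu_utilization(usage, 500), target)   # request 500m
after = hpa_desired(replicas, cpu_utilization(usage, 1000), target)   # request raised to 1000m
print(before, after)
```

Nothing about the load changed between the two calls; only the request did, yet the HPA moves from wanting more pods to wanting fewer.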

VPA Recommendation Analysis

Check VPA recommendations before enabling Auto mode:

kubectl describe vpa postgres-vpa

Output includes:

Status:
  Recommendation:
    Container Recommendations:
      Container Name:  postgres
      Lower Bound:     cpu: 200m, memory: 400Mi
      Target:          cpu: 500m, memory: 1Gi
      Upper Bound:     cpu: 2, memory: 4Gi
      Uncapped Target: cpu: 800m, memory: 1.5Gi

  • Target: The recommended value VPA will apply
  • Lower Bound: Minimum value VPA considers safe
  • Upper Bound: Maximum value VPA considers reasonable
  • Uncapped Target: The recommendation without min/max constraints applied
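The relationship between Uncapped Target and Target can be thought of as a clamp against minAllowed and maxAllowed. A rough Python sketch for CPU values, using the 250m and 4 bounds from the postgres-vpa policy (the helper names and the simplified quantity parser are assumptions):

```python
def parse_cpu(quantity: str) -> int:
    """Convert a CPU quantity such as '250m' or '4' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def capped_target(uncapped: str, min_allowed: str, max_allowed: str) -> int:
    """Clamp the raw recommendation into the [minAllowed, maxAllowed] range, in millicores."""
    return max(parse_cpu(min_allowed), min(parse_cpu(uncapped), parse_cpu(max_allowed)))


print(capped_target("6", "250m", "4"))     # raw recommendation above maxAllowed
print(capped_target("100m", "250m", "4"))  # raw recommendation below minAllowed
```

A large gap between the two values is a hint that the constraints, not the workload, are deciding the final request.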

VPA Right-Sizing Workflow

The recommended production workflow for VPA is:

  1. Week 1-2: Deploy VPA in Off mode. Collect recommendations without making changes.
  2. Week 3-4: Review recommendations. Adjust min/max constraints based on observed patterns.
  3. Week 5+: Enable Auto mode with conservative constraints. Monitor closely for eviction storms.

VPA for Stateful Workloads

VPA is particularly valuable for stateful workloads like databases where horizontal scaling is complex:

apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: redis-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: redis
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: redis
      minAllowed:
        cpu: "100m"
        memory: "256Mi"
      maxAllowed:
        cpu: "4"
        memory: "8Gi"
      controlledResources: ["cpu", "memory"]

For databases, VPA can reduce costs by 20-40% by right-sizing requests that were initially over-provisioned based on guesswork. The key is running in recommendation mode long enough to capture all workload patterns (peak hours, batch jobs, backups).

VPA vs HPA: When to Use Which

| Workload Type          | Recommended | Reason                                  |
|------------------------|-------------|-----------------------------------------|
| Stateless web servers  | HPA         | Easy to add/remove replicas             |
| Batch processing       | KEDA        | Event-driven, can scale to zero         |
| Databases              | VPA         | Horizontal scaling is complex           |
| Message consumers      | KEDA        | Scale based on queue depth              |
| CPU-bound services     | HPA         | Scale based on CPU utilization          |
| Memory-bound services  | VPA         | Adjust memory requests                  |
| Mixed workloads        | HPA + VPA   | HPA on custom metrics, VPA on resources |

# Production VPA with conservative constraints
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: api-server-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  updatePolicy:
    updateMode: "Auto"
  resourcePolicy:
    containerPolicies:
    - containerName: api
      minAllowed:
        cpu: "100m"
        memory: "128Mi"
      maxAllowed:
        cpu: "2"
        memory: "4Gi"
      controlledResources: ["cpu", "memory"]
      controlledValues: RequestsOnly

KEDA: Kubernetes Event-Driven Autoscaling

Why KEDA Exists

Standard HPA has limitations:

  • Cannot scale to zero replicas by default (minReplicas must be at least 1 unless the alpha HPAScaleToZero feature gate is enabled)
  • Primarily designed for CPU/memory metrics
  • Requires the Metrics Server or Prometheus Adapter for custom metrics
  • Not designed for event-driven architectures (queues, streams, topics)

KEDA addresses all of these. It bridges Kubernetes autoscaling with external event sources, enabling workloads to scale based on queue depth, message rates, database queries, or any custom metric.

KEDA Architecture

KEDA consists of two components:

  1. KEDA Operator: Watches ScaledObject resources and manages the lifecycle of the underlying HPA. When the event source reports zero events, the operator scales the deployment to zero.
  2. KEDA Metrics Server (metrics adapter): Implements the Kubernetes external metrics API, translating event source metrics into values the HPA can consume. It is separate from the Kubernetes Metrics Server installed earlier.

KEDA creates an HPA resource under the hood. The HPA uses KEDA's external metrics for scaling decisions. When the metric drops to zero, KEDA scales the deployment to zero—a capability the standard HPA does not support.

Installing KEDA

# Install with Helm
helm repo add kedacore https://kedacore.github.io/charts
helm repo update
helm install keda kedacore/keda \
  --namespace keda \
  --create-namespace
 
# Verify installation
kubectl get pods -n keda

KEDA ScaledObject for RabbitMQ

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor-scaledobject
spec:
  scaleTargetRef:
    name: order-processor
  pollingInterval: 15
  cooldownPeriod: 300
  minReplicaCount: 0
  maxReplicaCount: 50
  triggers:
  - type: rabbitmq
    metadata:
      # host (the AMQP connection string) comes from the TriggerAuthentication below,
      # which keeps credentials out of the ScaledObject
      queueName: order-processing
      mode: QueueLength
      value: "10"  # Target 10 messages per pod
    authenticationRef:
      name: rabbitmq-auth
---
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: rabbitmq-auth
spec:
  secretTargetRef:
  - parameter: host
    name: rabbitmq-secret
    key: connection-string

With this configuration:

  • When the order-processing queue has 0 messages, KEDA scales to 0 replicas
  • When 100 messages arrive, KEDA scales to ceil(100/10) = 10 pods
  • When the queue drains, KEDA waits 300 seconds (cooldown) before scaling back to 0
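The replica arithmetic in these bullets can be sketched in Python. This models only the steady-state target (queue length divided by the per-pod target, bounded by the min and max counts); polling and the cooldown period are not modeled, and the function name is hypothetical.

```python
import math


def keda_replicas(queue_length, target_per_pod, min_replicas=0, max_replicas=50):
    """Steady-state replica count for a queue-length trigger."""
    if queue_length == 0:
        return min_replicas
    wanted = math.ceil(queue_length / target_per_pod)
    return max(min_replicas, min(max_replicas, wanted))


for depth in (0, 3, 100, 10_000):
    print(depth, "->", keda_replicas(depth, target_per_pod=10))
```

The last case shows the ceiling in action: however deep the queue gets, the result never exceeds maxReplicaCount.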

KEDA with Multiple Triggers

KEDA supports multiple triggers on a single ScaledObject. It evaluates each trigger independently and selects the highest replica count:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: image-processor-scaledobject
spec:
  scaleTargetRef:
    name: image-processor
  minReplicaCount: 0
  maxReplicaCount: 100
  triggers:
  # Scale on SQS queue depth
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/123456789/image-processing
      queueLength: "5"
      awsRegion: us-east-1  # required by the aws-sqs-queue scaler
    authenticationRef:
      name: aws-credentials
  # Scale on Prometheus metric (GPU utilization)
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090
      metricName: gpu_utilization_avg
      threshold: "80"
      query: avg(gpu_utilization{app="image-processor"})
  # Scale on Kafka lag
  - type: kafka
    metadata:
      bootstrapServers: kafka-cluster.kafka.svc:9092
      consumerGroup: image-processor-group
      topic: image-uploads
      lagThreshold: "50"

KEDA Scalers Reference

KEDA supports 50+ scalers out of the box. Here are the most commonly used:

| Scaler               | Trigger Type      | Use Case                                                   |
|----------------------|-------------------|------------------------------------------------------------|
| RabbitMQ             | rabbitmq          | Message queue processing                                   |
| Apache Kafka         | kafka             | Stream processing with consumer lag                        |
| AWS SQS              | aws-sqs-queue     | AWS queue-based processing                                 |
| Azure Service Bus    | azure-servicebus  | Azure message processing                                   |
| Prometheus           | prometheus        | Any metric from Prometheus                                 |
| PostgreSQL           | postgresql        | Scale based on query results                               |
| MySQL                | mysql             | Scale based on query results                               |
| Redis                | redis             | Scale based on Redis list length                           |
| Cron                 | cron              | Time-based scaling                                         |
| CPU                  | cpu               | Built-in CPU scaling (cannot scale to zero on its own)     |
| Memory               | memory            | Built-in memory scaling (cannot scale to zero on its own)  |
| Google Cloud Pub/Sub | gcp-pubsub        | GCP message processing                                     |
| AWS CloudWatch       | aws-cloudwatch    | AWS metric-based scaling                                   |
| Datadog              | datadog           | Datadog metric-based scaling                               |
| Elasticsearch        | elasticsearch     | Scale based on query results                               |
| MongoDB              | mongodb           | Scale based on query results                               |
| NATS JetStream       | nats-jetstream    | NATS message processing                                    |
| Apache Pulsar        | pulsar            | Pulsar message processing                                  |
| Huawei Cloudeye      | huawei-cloudeye   | Huawei Cloud metrics                                       |
| Liiklus              | liiklus           | Reactive streams                                           |
| OpenStack Metrics    | openstack-metric  | OpenStack metrics                                          |

KEDA with ScaledJob

For batch workloads that run to completion, use ScaledJob instead of ScaledObject:

apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: data-migration-job
spec:
  jobTargetRef:
    template:
      spec:
        containers:
        - name: migration
          image: data-migrator:latest
          command: ["./migrate"]
          resources:
            requests:
              cpu: "500m"
              memory: "512Mi"
        restartPolicy: Never
  pollingInterval: 30
  maxReplicaCount: 20
  successfulJobsHistoryLimit: 5
  failedJobsHistoryLimit: 3
  triggers:
  - type: rabbitmq
    metadata:
      # host comes from the rabbitmq-auth TriggerAuthentication defined earlier
      queueName: migration-tasks
      mode: QueueLength
      value: "1"
    authenticationRef:
      name: rabbitmq-auth

ScaledJob creates a new Job for each message in the queue, up to maxReplicaCount. When the queue empties, no new Jobs are created and completed Jobs are cleaned up based on history limits.

Cron-Based Scaling

For predictable traffic patterns, combine KEDA's cron scaler with event-driven scaling:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-server-scaledobject
spec:
  scaleTargetRef:
    name: api-server
  minReplicaCount: 1
  maxReplicaCount: 100
  triggers:
  - type: cron
    metadata:
      timezone: America/New_York
      start: "0 8 * * *"     # Scale up at 8 AM
      end: "0 20 * * *"      # Scale down at 8 PM
      desiredReplicas: "10"
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090
      metricName: http_requests_per_second
      threshold: "500"
      query: sum(rate(http_requests_total{app="api-server"}[1m]))

During business hours (8 AM - 8 PM), KEDA ensures at least 10 replicas. Outside business hours, it scales down to 1 (the minReplicaCount) unless Prometheus metrics indicate higher demand.
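How the two triggers combine over a day can be sketched in Python. This is a simplified model: it takes the local hour as an integer instead of evaluating cron expressions and time zones, assumes the Prometheus trigger proposes ceil(value / threshold) replicas, and uses a hypothetical function name.

```python
import math


def replicas_at(hour, requests_per_second, cron_start=8, cron_end=20,
                cron_replicas=10, threshold=500, min_replicas=1, max_replicas=100):
    """Replica count from the cron and Prometheus triggers; the higher proposal wins."""
    cron = cron_replicas if cron_start <= hour < cron_end else 0
    prometheus = math.ceil(requests_per_second / threshold)
    return max(min_replicas, min(max_replicas, max(cron, prometheus)))


print(replicas_at(9, 1000))    # business hours, light traffic: cron floor applies
print(replicas_at(9, 12000))   # business hours, heavy traffic: Prometheus wins
print(replicas_at(22, 200))    # night, light traffic: falls back to minReplicaCount
print(replicas_at(22, 3000))   # night, unexpected load: Prometheus still scales up
```

The cron trigger acts as a time-boxed floor rather than a fixed size, which is why unexpected night-time load is still handled.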

Cluster Autoscaler

How Cluster Autoscaler Works

While HPA, VPA, and KEDA scale pods, the Cluster Autoscaler scales nodes. When pods can't be scheduled due to insufficient cluster capacity, the Cluster Autoscaler adds nodes. When nodes are underutilized, it removes them.

# Install Cluster Autoscaler (AWS EKS example)
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --set autoDiscovery.clusterName=my-cluster \
  --set awsRegion=us-east-1 \
  --set "extraArgs.balance-similar-node-groups=true" \
  --set "extraArgs.skip-nodes-with-local-storage=false"

Cluster Autoscaler Configuration

# Cluster Autoscaler deployment arguments
extraArgs:
  scale-down-delay-after-add: "10m"      # Wait 10 min after adding a node before considering removal
  scale-down-unneeded-time: "10m"         # Node must be unneeded for 10 min before removal
  scale-down-utilization-threshold: "0.5" # Remove nodes below 50% utilization
  max-node-provision-time: "15m"          # Max time to wait for a new node
  skip-nodes-with-local-storage: "false"  # Consider nodes with local storage for removal
  expander: "least-waste"                 # Choose node type with least wasted resources
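The interaction between the utilization threshold and the unneeded-time setting can be sketched as follows. This is a simplified Python model with hypothetical function names: it assumes utilization is computed from pod requests (not live usage) relative to allocatable capacity, and it leaves out the check that every pod on the node can be rescheduled elsewhere.

```python
def node_utilization(requested_cpu, allocatable_cpu, requested_mem, allocatable_mem):
    """Higher of the CPU and memory request ratios for a node."""
    return max(requested_cpu / allocatable_cpu, requested_mem / allocatable_mem)


def removable(utilization, unneeded_minutes, threshold=0.5, unneeded_time=10):
    """True when the node is under the threshold and has been unneeded long enough."""
    return utilization < threshold and unneeded_minutes >= unneeded_time


# 1000m of 4000m CPU and 6 of 16 GiB memory requested
util = node_utilization(1000, 4000, 6, 16)
print(util, removable(util, unneeded_minutes=12), removable(util, unneeded_minutes=5))
```

Because requests drive the calculation, over-sized requests keep nodes alive even when actual usage is low.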

Node Group Strategies

# Terraform: Multiple node groups for different workload types
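resource "aws_eks_node_group" "general" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "general-purpose"
  node_role_arn   = aws_iam_role.node.arn     # required; the role resource name is an assumption
  subnet_ids      = aws_subnet.private[*].id  # required; the subnet resource name is an assumption
  instance_types  = ["m5.large", "m5.xlarge"]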
  
  scaling_config {
    min_size     = 2
    max_size     = 20
    desired_size = 3
  }
  
  labels = {
    workload-type = "general"
  }
}
 
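resource "aws_eks_node_group" "compute" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "compute-optimized"
  node_role_arn   = aws_iam_role.node.arn     # required; the role resource name is an assumption
  subnet_ids      = aws_subnet.private[*].id  # required; the subnet resource name is an assumption
  instance_types  = ["c5.2xlarge", "c5.4xlarge"]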
  
  scaling_config {
    min_size     = 0
    max_size     = 10
    desired_size = 0
  }
  
  labels = {
    workload-type = "compute"
  }
  
  taint {
    key    = "workload-type"
    value  = "compute"
    effect = "NO_SCHEDULE"
  }
}

HPA + Cluster Autoscaler Coordination

When HPA scales pods but the cluster lacks capacity, the Cluster Autoscaler must add nodes. This creates a two-step scaling delay:

Traffic spike → HPA wants more pods → Pods pending (no capacity) → Cluster Autoscaler adds node → Pods scheduled

To minimize this delay:

# HPA configuration for fast scaling
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0
    policies:
    - type: Percent
      value: 100
      periodSeconds: 15
 
# Pod Disruption Budget to ensure minimum availability
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: "50%"
  selector:
    matchLabels:
      app: api-server
 
# Priority Class to ensure critical pods are scheduled first
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority
value: 1000000
globalDefault: false
description: "High priority for user-facing services"

Real-World Scaling Scenarios

Scenario 1: Traffic Spike Handling

A news website experiences 50x traffic during breaking news events:

# HPA for baseline scaling
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: news-frontend-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: news-frontend
  minReplicas: 5
  maxReplicas: 200
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 15
    scaleDown:
      stabilizationWindowSeconds: 600
      policies:
      - type: Percent
        value: 10
        periodSeconds: 120

Scenario 2: Batch Processing with KEDA

A data pipeline processes files from S3:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: data-processor-scaledobject
spec:
  scaleTargetRef:
    name: data-processor
  minReplicaCount: 0
  maxReplicaCount: 50
  cooldownPeriod: 300
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/123456789/data-processing
      queueLength: "100"
      awsRegion: us-east-1  # required by the aws-sqs-queue scaler
    authenticationRef:
      name: aws-credentials

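When 10,000 files are queued, the trigger asks for ceil(10,000 / 100) = 100 pods, but maxReplicaCount caps the deployment at 50. Raise maxReplicaCount to 100 if the full fan-out is wanted. As the queue drains, pods scale back to zero.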

Scenario 3: Queue-Based Microservices

An order processing system with multiple stages:

# Order ingestion service
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-ingestion-scaledobject
spec:
  scaleTargetRef:
    name: order-ingestion
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090
      metricName: http_requests_per_second
      threshold: "1000"
      query: sum(rate(http_requests_total{app="order-ingestion"}[1m]))
---
# Payment processing service
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: payment-processor-scaledobject
spec:
  scaleTargetRef:
    name: payment-processor
  minReplicaCount: 0
  maxReplicaCount: 50
  triggers:
  - type: rabbitmq
    metadata:
      # host comes from the rabbitmq-auth TriggerAuthentication
      queueName: payment-processing
      mode: QueueLength
      value: "50"
    authenticationRef:
      name: rabbitmq-auth

Production Scaling Patterns

Pattern 1: Warm Pool with Startup Probe

Cold-start latency is a major problem with scale-to-zero. Use a startup probe and pre-warmed pods:

spec:
  template:
    spec:
      containers:
      - name: api
        startupProbe:
          httpGet:
            path: /ready
            port: 8080
          failureThreshold: 30
          periodSeconds: 2
        readinessProbe:
          httpGet:
            path: /healthz
            port: 8080
          periodSeconds: 5

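To maintain a warm pool instead of scaling all the way to zero, set minReplicaCount to the number of pods that should always be ready. KEDA's idleReplicaCount is not suited to this: it must be lower than minReplicaCount, and current releases only support a value of 0.

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: api-scaledobject
spec:
  scaleTargetRef:
    name: api-server
  minReplicaCount: 2  # Keep 2 warm pods even when idle
  maxReplicaCount: 50
  # triggers: at least one trigger is required, e.g. the prometheus trigger shown earlier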

Pattern 2: Scaling with Custom Application Metrics

Expose application-specific metrics from your code and scale on them:

// Node.js Express app exposing metrics for Prometheus
const express = require('express');
const promClient = require('prom-client');

const app = express();
 
const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.01, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
});
 
const activeConnections = new promClient.Gauge({
  name: 'active_connections',
  help: 'Number of active connections'
});
 
const queueDepth = new promClient.Gauge({
  name: 'job_queue_depth',
  help: 'Current depth of the processing queue'
});
 
// Middleware to record request duration
app.use((req, res, next) => {
  const end = httpRequestDuration.startTimer();
  res.on('finish', () => {
    end({ method: req.method, route: req.route?.path || 'unknown', status_code: res.statusCode });
  });
  next();
});
 
// Expose /metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', promClient.register.contentType);
  res.end(await promClient.register.metrics());
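});

// Port 8080 is an assumption; match it to the Service port that Prometheus scrapes
app.listen(8080);

# Prometheus ServiceMonitor to scrape the app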
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-server-monitor
spec:
  selector:
    matchLabels:
      app: api-server
  endpoints:
  - port: http
    path: /metrics
    interval: 15s
---
# HPA scaling on p99 latency
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-latency-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api-server
  minReplicas: 3
  maxReplicas: 50
  metrics:
  - type: Pods
    pods:
      metric:
        name: http_request_duration_seconds_p99  # must be produced by a Prometheus Adapter rule, e.g. via histogram_quantile
      target:
        type: AverageValue
        averageValue: "500m"  # Scale when p99 > 500ms
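The histogram itself only exports cumulative bucket counts, so a p99 value has to be derived from them, typically with PromQL's histogram_quantile. The Python sketch below approximates that interpolation to show why bucket boundaries matter for a latency target. The bucket counts are invented, the function is a simplification, and it assumes a non-empty histogram with q greater than 0.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from (upper_bound, cumulative_count) pairs sorted by bound.

    The last bucket is expected to be the +Inf bucket holding the total count.
    """
    total = buckets[-1][1]
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Nothing to interpolate towards: fall back to the highest finite bound.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


buckets = [(0.1, 900), (0.25, 980), (0.5, 995), (1.0, 1000), (math.inf, 1000)]
print(round(histogram_quantile(0.99, buckets), 3))
```

With these counts the estimate lands between the 0.25 and 0.5 second buckets; coarser buckets around the target would make the scaling signal correspondingly coarser.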

Cost Optimization with Autoscaling

Right-Sizing Resources

Autoscaling only saves money if your resource requests are accurate. Over-provisioned requests waste money even with autoscaling:

# Check actual vs requested resources
kubectl top pods -l app=api-server --containers
 
# Use VPA recommendations to right-size
kubectl describe vpa api-server-vpa | grep -A 10 "Recommendation"

Spot Instances and Autoscaling

Combine Cluster Autoscaler with spot instances for up to 90% cost savings:

# Terraform: Spot instance node group
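resource "aws_eks_node_group" "spot" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "spot-instances"
  node_role_arn   = aws_iam_role.node.arn     # required; the role resource name is an assumption
  subnet_ids      = aws_subnet.private[*].id  # required; the subnet resource name is an assumption
  instance_types  = ["m5.large", "m5.xlarge", "m5.2xlarge", "m5a.large"]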
  
  capacity_type = "SPOT"
  
  scaling_config {
    min_size     = 0
    max_size     = 50
    desired_size = 5
  }
  
  labels = {
    "node-lifecycle" = "spot"
  }
  
  taint {
    key    = "spot"
    value  = "true"
    effect = "NO_SCHEDULE"
  }
}

Scaling Down Aggressively

During off-peak hours, scale down aggressively to minimize costs:

# KEDA cron-based scaling for cost optimization
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: batch-processor-scaledobject
spec:
  scaleTargetRef:
    name: batch-processor
  minReplicaCount: 0
  maxReplicaCount: 100
  triggers:
  - type: cron
    metadata:
      timezone: UTC
      start: "0 2 * * *"     # Scale up at 2 AM for batch processing
      end: "0 6 * * *"       # Scale down at 6 AM
      desiredReplicas: "50"
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090
      metricName: queue_depth
      threshold: "1000"
      query: sum(job_queue_depth{app="batch-processor"})
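
When a ScaledObject has several triggers, each one proposes a replica count and the largest proposal wins. The sketch below is a simplified model of that rule for the two triggers above, not KEDA's implementation: the function and field names are hypothetical, the cron window is reduced to whole UTC hours, and the cron trigger is assumed to contribute nothing outside its window.

```javascript
// Simplified model: take the largest replica count proposed by any trigger,
// then clamp it to the configured min/max bounds.
function desiredReplicas({ hourUtc, queueDepth }, cfg) {
  // Cron trigger: contributes its replica count only inside the window.
  const inWindow = hourUtc >= cfg.cronStartHour && hourUtc < cfg.cronEndHour;
  const fromCron = inWindow ? cfg.cronReplicas : 0;

  // Queue trigger: one replica per `queueThreshold` queued jobs, rounded up.
  const fromQueue = Math.ceil(queueDepth / cfg.queueThreshold);

  const proposed = Math.max(fromCron, fromQueue);
  return Math.min(cfg.maxReplicas, Math.max(cfg.minReplicas, proposed));
}

// Values copied from the ScaledObject above.
const batchConfig = {
  minReplicas: 0,
  maxReplicas: 100,
  cronStartHour: 2,
  cronEndHour: 6,
  cronReplicas: 50,
  queueThreshold: 1000,
};

console.log(desiredReplicas({ hourUtc: 3, queueDepth: 0 }, batchConfig));
console.log(desiredReplicas({ hourUtc: 12, queueDepth: 2500 }, batchConfig));
```

The practical consequence is that the cron trigger sets a floor during the batch window, while the queue trigger can still push the count higher at any hour.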

Cost Monitoring

Set up cost monitoring to track autoscaling impact:

# Prometheus alert for excessive scaling
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: autoscaling-alerts
spec:
  groups:
  - name: autoscaling
    rules:
    - alert: ExcessiveScaling
      expr: |
        changes(kube_horizontalpodautoscaler_status_desired_replicas{namespace="production"}[1h]) > 5
      for: 10m
      labels:
        severity: warning
      annotations:
        summary: "HPA scaling excessively in {{ $labels.namespace }}"
        description: "HPA {{ $labels.horizontalpodautoscaler }} has been scaling rapidly"
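
The desired replica count is a gauge, so one way to quantify "excessive" scaling is to count how often its value changed within a window, which is what PromQL's changes() function does. The sketch below shows that idea on an array of samples; the function names and the limit of 5 changes are illustrative assumptions.

```javascript
// Count how many times consecutive samples of a gauge differ.
function countChanges(samples) {
  let changes = 0;
  for (let i = 1; i < samples.length; i++) {
    if (samples[i] !== samples[i - 1]) changes++;
  }
  return changes;
}

// Treat more than `maxChanges` changes in the window as excessive scaling.
function isScalingExcessively(samples, maxChanges = 5) {
  return countChanges(samples) > maxChanges;
}

// Hypothetical desired-replica samples taken over one hour.
const steady = [3, 3, 5, 5, 5, 4, 4];
const flapping = [3, 6, 3, 6, 3, 6, 3];
console.log(isScalingExcessively(steady), isScalingExcessively(flapping));
```

Counting changes catches flapping even when the replica count ends the hour where it started, which a simple before-and-after comparison would miss.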

Autoscaling Anti-Patterns

Anti-Pattern 1: Scaling Without Resource Requests

HPA calculates utilization as a percentage of each container's resource requests. Without requests there is nothing to divide by, so the HPA cannot compute CPU or memory utilization and shows <unknown>:

# Wrong: No resource requests
containers:
- name: api
  image: api:latest
  # No resources section
 
# Correct: Resource requests defined
containers:
- name: api
  image: api:latest
  resources:
    requests:
      cpu: "100m"
      memory: "128Mi"
    limits:
      cpu: "500m"
      memory: "256Mi"
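
The dependency on requests is easy to see in the arithmetic. This sketch computes a utilization percentage as total usage over total requests and returns null when a request is missing. The function and field names are hypothetical, and the rounding is a simplification.

```javascript
// Utilization = total usage across pods / total requests, as a percentage.
// Returns null when any pod lacks a request, since the ratio is then undefined.
function cpuUtilizationPercent(pods) {
  let usage = 0;
  let requests = 0;
  for (const pod of pods) {
    if (pod.requestMillicores == null) return null;
    usage += pod.usageMillicores;
    requests += pod.requestMillicores;
  }
  return requests === 0 ? null : Math.floor((usage * 100) / requests);
}

// Hypothetical pods: the second set is missing a request on one pod.
const withRequests = [
  { usageMillicores: 80, requestMillicores: 100 },
  { usageMillicores: 60, requestMillicores: 100 },
];
const missingRequest = [
  { usageMillicores: 80, requestMillicores: 100 },
  { usageMillicores: 60 },
];
console.log(cpuUtilizationPercent(withRequests));
console.log(cpuUtilizationPercent(missingRequest));
```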

Anti-Pattern 2: Overly Aggressive Scale-Down

Setting scale-down policies too aggressively causes thrashing during traffic fluctuations:

# Wrong: Allows removing up to 100% of pods every 60 seconds, with no stabilization window
behavior:
  scaleDown:
    stabilizationWindowSeconds: 0
    policies:
    - type: Percent
      value: 100
      periodSeconds: 60
 
# Correct: Conservative scale-down
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300
    policies:
    - type: Percent
      value: 10
      periodSeconds: 120
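
To get a feel for how differently the two policies behave, the following sketch steps a Percent scale-down policy through successive periods. It is a simplified model under stated assumptions: the stabilization window is ignored, each period removes at most the given percentage of the replicas present at its start, and the result is rounded down. The function name is hypothetical.

```javascript
// Step a Percent scale-down policy period by period until `target` is reached.
// Rounding down and ignoring the stabilization window are assumptions of this sketch.
function scaleDownTimeline(start, target, percent, periodSeconds) {
  if (percent <= 0) throw new RangeError('percent must be positive');
  const timeline = [{ t: 0, replicas: start }];
  let replicas = start;
  let t = 0;
  while (replicas > target) {
    const lowestAllowed = Math.floor((replicas * (100 - percent)) / 100);
    replicas = Math.max(target, lowestAllowed);
    t += periodSeconds;
    timeline.push({ t, replicas });
  }
  return timeline;
}

// Going from 50 replicas to a target of 5 under each policy shown above.
const aggressive = scaleDownTimeline(50, 5, 100, 60);
const conservative = scaleDownTimeline(50, 5, 10, 120);
console.log('aggressive seconds:', aggressive[aggressive.length - 1].t);
console.log('conservative seconds:', conservative[conservative.length - 1].t);
```

In this model the aggressive policy reaches the target within a single period, while the conservative one takes on the order of half an hour, which leaves room to absorb a traffic rebound without scaling back up from scratch.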

Anti-Pattern 3: Using VPA with HPA on Same Metric

The classic feedback loop problem:

# Wrong: Both scale on CPU
# VPA increases CPU requests → utilization drops → HPA scales down
---
# Correct: HPA on custom metrics, VPA on resources
# Or HPA on CPU, VPA on memory only
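
The feedback loop can be reproduced with a few lines of arithmetic. This sketch holds the load constant, doubles the CPU request as VPA might, and recomputes the HPA's desired replica count with the standard ratio formula. The function names and numbers are illustrative, and the HPA tolerance band is ignored.

```javascript
// Core HPA ratio: desired = ceil(current * currentUtilization / targetUtilization).
function hpaDesiredReplicas(currentReplicas, utilizationPercent, targetPercent) {
  return Math.ceil((currentReplicas * utilizationPercent) / targetPercent);
}

// Compare HPA's decision before and after a request change, with usage unchanged.
function simulateRequestChange(state, newRequestMillicores) {
  const before = (state.usageMillicores * 100) / state.requestMillicores;
  const after = (state.usageMillicores * 100) / newRequestMillicores;
  return {
    utilizationBefore: before,
    utilizationAfter: after,
    replicasBefore: hpaDesiredReplicas(state.replicas, before, state.targetPercent),
    replicasAfter: hpaDesiredReplicas(state.replicas, after, state.targetPercent),
  };
}

// Hypothetical workload: 10 pods, each using 140m against a 200m request, 70% target.
const loopResult = simulateRequestChange(
  { replicas: 10, usageMillicores: 140, requestMillicores: 200, targetPercent: 70 },
  400 // VPA raises the request to 400m
);
console.log(loopResult);
```

Nothing about the traffic changed, yet the higher request lowers the computed utilization and the HPA removes pods; the remaining pods then run hotter, which can prompt VPA to raise requests again.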

Anti-Pattern 4: Ignoring Pod Disruption Budgets

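Without PDBs, voluntary evictions triggered by autoscaling, such as VPA restarting pods with new requests or Cluster Autoscaler draining a node, can take down too many replicas at once: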

# Always define PDBs for critical workloads
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: critical-app-pdb
spec:
  minAvailable: "50%"
  selector:
    matchLabels:
      app: critical-app
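
A percentage-based minAvailable is resolved against the number of pods the workload is expected to run, rounding up. The sketch below shows how many voluntary evictions that leaves room for; the function name is hypothetical and only minAvailable is modeled.

```javascript
// How many pods may be voluntarily evicted right now under a minAvailable PDB.
// Percentages are resolved against the expected pod count and rounded up.
function allowedDisruptions(expectedPods, healthyPods, minAvailable) {
  const isPercent = typeof minAvailable === 'string' && minAvailable.endsWith('%');
  const required = isPercent
    ? Math.ceil((expectedPods * parseInt(minAvailable, 10)) / 100)
    : Number(minAvailable);
  return Math.max(0, healthyPods - required);
}

// With 7 replicas and minAvailable "50%", 4 pods must stay up.
console.log(allowedDisruptions(7, 7, '50%'));
// If some pods are already unhealthy, no further evictions are allowed.
console.log(allowedDisruptions(4, 2, '50%'));
```

Evictions that would exceed the budget are refused, so a node drain proceeds only as fast as replacement pods become healthy.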

Anti-Pattern 5: Not Testing Scaling Behavior

Deploying autoscaling to production without testing is risky. Use load testing tools:

# Install k6 for load testing
brew install k6
 
# Run load test (the stages in load-test.js define the load profile;
# passing --vus or --duration here would override them)
k6 run load-test.js
 
# Monitor HPA during test
watch kubectl get hpa

// load-test.js
import http from 'k6/http';
import { sleep } from 'k6';
 
export const options = {
  stages: [
    { duration: '2m', target: 50 },   // Ramp up
    { duration: '5m', target: 50 },   // Stay at 50 users
    { duration: '2m', target: 200 },  // Spike to 200 users
    { duration: '5m', target: 200 },  // Stay at 200 users
    { duration: '2m', target: 0 },    // Ramp down
  ],
};
 
export default function () {
  http.get('https://api.example.com/health');
  sleep(1);
}

Monitoring and Debugging Autoscaling

Essential HPA Commands

# Check HPA status
kubectl get hpa
kubectl describe hpa web-app-hpa
 
# View current metrics
kubectl get hpa web-app-hpa -o yaml
 
# Check events for scaling decisions
kubectl get events --field-selector reason=SuccessfulRescale
 
# Monitor HPA in real-time
watch kubectl get hpa

Common HPA Issues and Solutions

| Issue | Symptom | Solution |
| --- | --- | --- |
| Flapping | Pods scale up/down rapidly | Increase stabilizationWindowSeconds |
| No scaling | HPA shows <unknown> | Verify Metrics Server is running |
| Slow scale-up | Latency during traffic spikes | Reduce stabilizationWindowSeconds, increase policy values |
| Metrics delay | Scaling lags behind traffic | Reduce Metrics Server --metric-resolution |
| Insufficient capacity | Pods stuck in Pending | Configure Cluster Autoscaler, add node groups |
| VPA conflict | HPA and VPA fighting | Remove VPA from CPU or use custom metrics only |
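
Several of these symptoms surface in the HPA's status conditions, so a quick triage can be scripted. The sketch below takes an HPA object as parsed from `kubectl get hpa <name> -o json` and lists likely problems. The function name and the wording of the findings are hypothetical; the condition types are those reported by autoscaling/v2.

```javascript
// List likely problems based on an HPA object's status conditions.
function diagnoseHpa(hpa) {
  const findings = [];
  const conditions = (hpa.status && hpa.status.conditions) || [];
  for (const c of conditions) {
    if (c.type === 'ScalingActive' && c.status === 'False') {
      findings.push(`Metrics unavailable (${c.reason}): ${c.message}`);
    }
    if (c.type === 'AbleToScale' && c.status === 'False') {
      findings.push(`Cannot scale target (${c.reason}): ${c.message}`);
    }
    if (c.type === 'ScalingLimited' && c.status === 'True') {
      findings.push(`Replica count capped by min/max bounds (${c.reason})`);
    }
  }
  return findings;
}

// Hypothetical status, shaped like the JSON returned by the API server.
const sampleHpa = {
  status: {
    conditions: [
      { type: 'AbleToScale', status: 'True', reason: 'ReadyForNewScale', message: '' },
      { type: 'ScalingActive', status: 'False', reason: 'FailedGetResourceMetric', message: 'missing request for cpu' },
      { type: 'ScalingLimited', status: 'True', reason: 'TooManyReplicas', message: '' },
    ],
  },
};
console.log(diagnoseHpa(sampleHpa));
```

The same conditions appear in the output of `kubectl describe hpa`, so the script adds convenience rather than new information.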

VPA Debugging

# Check VPA recommendations
kubectl describe vpa postgres-vpa
 
# Compare recommendations vs actual usage
kubectl top pods -l app=postgres
 
# Check VPA updater events
kubectl get events --field-selector reason=EvictedByVPA

KEDA Debugging

# Check ScaledObject status
kubectl get scaledobject
kubectl describe scaledobject order-processor-scaledobject
 
# Check KEDA operator logs
kubectl logs -n keda -l app=keda-operator
 
# Check trigger authentication
kubectl get triggerauthentication
kubectl describe triggerauthentication rabbitmq-auth
 
# Verify external metrics
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq .

Best Practices Summary

  1. Always set resource requests: HPA utilization targets are computed against requests, and the scheduler places pods based on them. Without requests, neither can make informed decisions.

  2. Use PDBs with autoscaling: Pod Disruption Budgets limit how many pods VPA evictions and Cluster Autoscaler node drains can take down at once.

  3. Test scaling behavior in staging: Simulate traffic patterns before deploying autoscaling to production. Use tools like hey or k6 to generate load.

  4. Monitor scaling decisions: Set up alerts for rapid scaling events, pods stuck in Pending, and HPA metrics showing <unknown>.

  5. Combine resource and event-driven triggers for complex workloads: KEDA creates and manages its own HPA for each ScaledObject, so do not attach a separate HPA to the same Deployment. Instead, add KEDA's cpu or memory triggers alongside the event-driven triggers in a single ScaledObject.

  6. Start with conservative settings: Begin with narrow scaling ranges (min: 3, max: 10) and expand as you understand your traffic patterns.

  7. Use VPA in recommendation-only mode first: Run VPA with updateMode: "Off" for several weeks to understand resource patterns before enabling automatic adjustments.

  8. Right-size before scaling: Ensure resource requests are accurate before enabling autoscaling. Autoscaling with wrong requests wastes money.

  9. Use Cluster Autoscaler with multiple node groups: Different workload types benefit from different instance types. Use labels and taints to route workloads appropriately.

  10. Implement cost monitoring: Track the impact of autoscaling on your cloud bill. Set alerts for unexpected scaling behavior.

Kubernetes autoscaling is not a set-and-forget feature. It requires ongoing monitoring, tuning, and adaptation as your workloads evolve. Start with the basics, measure the impact, and iterate toward more sophisticated scaling strategies. The combination of HPA, VPA, KEDA, and Cluster Autoscaler gives you complete control over your infrastructure's responsiveness and cost efficiency.