Chaos Engineering: Breaking Systems to Make Them Resilient

Introduction

In 2012, Netflix engineers deliberately terminated hundreds of production instances to verify that their streaming service could survive the loss of an entire AWS availability zone. This was not reckless behavior — it was a carefully orchestrated experiment that confirmed their system's resilience and revealed a subtle failover bug that would have caused a major outage during the next real failure. This practice, now known as chaos engineering, has become a cornerstone of modern reliability engineering.

Chaos engineering is the discipline of experimenting on a system to build confidence in its ability to withstand turbulent conditions in production. Rather than waiting for failures to occur and then scrambling to respond, chaos engineering proactively injects failures to discover weaknesses before they cause outages. It is the difference between testing a parachute by jumping out of a plane and testing it in a wind tunnel — both reveal problems, but only one does so safely.

This article covers the principles, practices, and tools of chaos engineering, with practical examples of fault injection experiments, game day planning, and building a culture of resilience within engineering organizations.

Understanding Chaos Engineering: Core Concepts

The Steady-State Hypothesis

Every chaos experiment begins with a hypothesis about the system's steady state. The steady state is the normal operating condition — the metrics and behaviors you expect to see when the system is healthy. Before injecting any fault, you must define what "healthy" looks like.

A steady-state hypothesis typically includes:

Latency percentiles: p50, p95, p99 response times remain within acceptable bounds
Error rates: The percentage of failed requests stays below a defined threshold
Throughput: The system continues to process requests at expected rates
Resource utilization: CPU, memory, disk, and network usage remain stable
Business metrics: Orders are processed, messages are delivered, data is consistent

interface SteadyStateHypothesis {
  name: string;
  metrics: Array<{
    name: string;
    query: string;        // PromQL or similar
    threshold: number;
    comparison: "lt" | "gt" | "eq";
  }>;
  duration: string;       // How long to observe
}
 
const apiSteadyState: SteadyStateHypothesis = {
  name: "API Gateway Health",
  metrics: [
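    // Buckets are summed by "le" so the quantile covers all instances; threshold is in seconds
    { name: "p99 latency", query: "histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))", threshold: 0.5, comparison: "lt" },
    // Both sides are aggregated with sum(); without it the division only matches 5xx series against themselves and yields 1
    { name: "error rate", query: "sum(rate(http_requests_total{status=~'5..'}[5m])) / sum(rate(http_requests_total[5m]))", threshold: 0.01, comparison: "lt" },
    // Total requests per second across all series
    { name: "throughput", query: "sum(rate(http_requests_total[5m]))", threshold: 100, comparison: "gt" },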
  ],
  duration: "10m",
};

Blast Radius and Scope

Chaos experiments must be carefully scoped to limit their impact. The blast radius defines how many users, services, or infrastructure components are affected by an experiment. Start with the smallest possible blast radius and expand only as confidence grows.

The blast radius is controlled by three dimensions:

Scope: Which components are affected (single instance, single service, entire cluster)
Magnitude: How severe the failure is (10% latency increase vs. complete outage)
Duration: How long the experiment runs (seconds, minutes, or until manually stopped)

A typical progression starts with injecting a 10% latency increase on a single instance for 5 minutes. If the system handles this gracefully, increase the magnitude to 50%, then expand the scope to multiple instances, and finally simulate a complete instance failure.
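
The progression above could be encoded as data so that escalation only happens after a stage passes. This is a minimal sketch: `BlastRadiusStage`, `stages`, and `nextStage` are illustrative names, and the instance count of 3 is an assumed value for "multiple instances".

```typescript
// Hypothetical shape: one step in a blast-radius escalation plan.
type StageFault =
  | { kind: "latency"; increasePct: number } // magnitude of the slowdown
  | { kind: "terminate" };                   // complete instance failure

interface BlastRadiusStage {
  instances: number;       // scope: how many instances are targeted
  fault: StageFault;       // magnitude
  durationMinutes: number; // duration
}

// Mirrors the progression in the text: raise magnitude first, then scope,
// then simulate a complete instance failure.
const stages: BlastRadiusStage[] = [
  { instances: 1, fault: { kind: "latency", increasePct: 10 }, durationMinutes: 5 },
  { instances: 1, fault: { kind: "latency", increasePct: 50 }, durationMinutes: 5 },
  { instances: 3, fault: { kind: "latency", increasePct: 50 }, durationMinutes: 5 },
  { instances: 1, fault: { kind: "terminate" }, durationMinutes: 5 },
];

// Returns the next stage to run, or null when escalation should stop:
// either the current stage failed or the plan is finished.
function nextStage(
  plan: BlastRadiusStage[],
  currentIndex: number,
  currentPassed: boolean
): BlastRadiusStage | null {
  if (!currentPassed) return null;
  return plan[currentIndex + 1] ?? null;
}
```

Keeping the plan as plain data makes it easy to review before an experiment and to stop at whichever stage first exposes a weakness.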

Fault Injection Categories

Chaos experiments inject faults across several categories:

Infrastructure faults: Instance termination, network partition, disk failure, CPU pressure, memory pressure.

Network faults: Latency injection, packet loss, DNS failure, connection timeout, bandwidth throttling.

Application faults: Exception injection, slow responses, resource exhaustion, dependency failure.

State faults: Data corruption, clock skew, configuration drift, certificate expiration.

Each category reveals different classes of bugs. Infrastructure faults test failover mechanisms. Network faults test timeout handling and retry logic. Application faults test error handling and circuit breakers. State faults test data consistency and validation.

Architecture and Design Patterns

Chaos Engineering Platform Architecture

A chaos engineering platform consists of several components:

Experiment engine: Orchestrates fault injection, monitors metrics, and determines experiment outcomes.
Fault injectors: Agents that apply faults to target systems (process killers, network manipulators, resource stressors).
Observability stack: Collects and analyzes metrics during experiments to determine if steady-state was maintained.
Experiment library: Reusable experiment definitions that encode common failure scenarios.
Safety mechanisms: Automatic rollback, emergency stop buttons, and kill switches that halt experiments if they exceed blast radius limits.

interface ChaosExperiment {
  id: string;
  name: string;
  hypothesis: SteadyStateHypothesis;
  fault: FaultDefinition;
  target: TargetDefinition;
  rollback: RollbackStrategy;
  schedule?: string;      // Cron expression, e.g. "0 3 * * 1"
}
 
interface FaultDefinition {
  type: "latency" | "failure" | "stress" | "network" | "state";
  magnitude: number;
  duration: string;
  parameters: Record<string, any>;
}
 
interface TargetDefinition {
  service: string;
  instances: "random" | "all" | "specific";
  count?: number;
  labels?: Record<string, string>;
}
 
interface RollbackStrategy {
  automatic: boolean;
  conditions: Array<{
    metric: string;
    threshold: number;
    comparison: "lt" | "gt";
  }>;
  timeout: string;
}
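
How the rollback conditions are evaluated is left open by the interfaces above. One possible reading, sketched below, is that each condition describes the situation that should trigger a rollback (for example, error rate greater than 5%). The function name `violatedConditions` and the choice to treat a missing reading as a violation are assumptions made for this example.

```typescript
interface RollbackCondition {
  metric: string;
  threshold: number;
  comparison: "lt" | "gt";
}

// Returns the conditions that currently call for a rollback.
// Assumption: a metric with no reading (or NaN) counts as violated, because
// losing visibility during an experiment is itself a reason to stop.
function violatedConditions(
  conditions: RollbackCondition[],
  readings: Record<string, number | undefined>
): RollbackCondition[] {
  return conditions.filter((c) => {
    const value = readings[c.metric];
    if (value === undefined || Number.isNaN(value)) return true;
    return c.comparison === "gt" ? value > c.threshold : value < c.threshold;
  });
}

// Example: abort when the error rate rises above 5% or throughput drops below 50 req/s.
const abortWhen: RollbackCondition[] = [
  { metric: "error_rate", threshold: 0.05, comparison: "gt" },
  { metric: "throughput", threshold: 50, comparison: "lt" },
];
```

An experiment engine could call this on every observation tick and revert the fault as soon as the returned list is non-empty.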

Game Day Framework

A game day is a planned chaos engineering event where the entire team participates in running experiments and responding to failures. Game days simulate real incident scenarios in a controlled environment, providing valuable training for on-call engineers and revealing gaps in runbooks and tooling.

interface GameDay {
  id: string;
  date: Date;
  participants: string[];
  scenarios: GameDayScenario[];
  duration: string;
  objectives: string[];
}
 
interface GameDayScenario {
  name: string;
  description: string;
  experiment: ChaosExperiment;
  expectedResponse: string;
  actualResponse?: string;
  runbook: string;
  timeToDetect?: number;
  timeToResolve?: number;
  lessonsLearned?: string[];
}
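
After a game day, the per-scenario timings can be rolled up into a short summary. The sketch below works on a reduced version of `GameDayScenario`; `ScenarioTiming` and `summarizeGameDay` are illustrative names, and seconds are an assumed unit since the interface above does not specify one.

```typescript
// Reduced view of a game day scenario: only the fields needed for the summary.
interface ScenarioTiming {
  name: string;
  timeToDetect?: number;  // assumed to be seconds
  timeToResolve?: number; // assumed to be seconds
}

function mean(values: number[]): number | null {
  if (values.length === 0) return null;
  return values.reduce((a, b) => a + b, 0) / values.length;
}

function summarizeGameDay(scenarios: ScenarioTiming[]) {
  const isNumber = (v: number | undefined): v is number => v !== undefined;
  return {
    meanTimeToDetect: mean(scenarios.map((s) => s.timeToDetect).filter(isNumber)),
    meanTimeToResolve: mean(scenarios.map((s) => s.timeToResolve).filter(isNumber)),
    // Scenarios nobody noticed are often the most useful finding of the day.
    undetected: scenarios.filter((s) => s.timeToDetect === undefined).map((s) => s.name),
  };
}
```

Undetected scenarios are reported separately instead of being averaged in, since a missing detection time usually points to an alerting gap.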

Automated Experiment Pipeline

class ChaosPipeline {
  private experiments: ChaosExperiment[] = [];
  private observer: MetricsObserver;
  private injector: FaultInjector;
 
  constructor(observer: MetricsObserver, injector: FaultInjector) {
    this.observer = observer;
    this.injector = injector;
  }
 
  async runExperiment(experiment: ChaosExperiment): Promise<ExperimentResult> {
    console.log(`Starting experiment: ${experiment.name}`);
 
    // Phase 1: Verify steady state
    console.log("Verifying steady state...");
    const baseline = await this.observer.captureBaseline(
      experiment.hypothesis,
      "2m"
    );
    if (!baseline.stable) {
      throw new Error("System is not in steady state. Aborting experiment.");
    }
 
    // Phase 2: Inject fault
    console.log(`Injecting fault: ${experiment.fault.type}`);
    const injection = await this.injector.apply(experiment.fault, experiment.target);

    // Phase 3: Observe during fault
    let duringFault: ObservationResult;
    try {
      console.log("Observing system behavior during fault...");
      duringFault = await this.observer.observe(
        experiment.hypothesis,
        experiment.fault.duration
      );
    } finally {
      // Phase 4: Rollback. Runs even if observation throws, so the fault
      // never outlives the experiment.
      console.log("Rolling back fault...");
      await this.injector.revert(injection);
    }
 
    // Phase 5: Verify recovery
    console.log("Verifying recovery...");
    const afterRecovery = await this.observer.observe(
      experiment.hypothesis,
      "5m"
    );
 
    // Phase 6: Determine result
    const result: ExperimentResult = {
      experiment: experiment.name,
      hypothesis: experiment.hypothesis.name,
      passed: duringFault.withinThreshold && afterRecovery.withinThreshold,
      baseline,
      duringFault,
      afterRecovery,
      timestamp: new Date(),
    };
 
    console.log(`Experiment ${result.passed ? "PASSED" : "FAILED"}`);
    return result;
  }
}

Step-by-Step Implementation

Building a Fault Injection Service

import type { Server } from "bun"; // Server is only a type; Bun itself is a global
 
interface FaultConfig {
  latencyMs?: number;
  failureRate?: number;
  errorCode?: number;
  cpuStress?: number;
  memoryStress?: number;
}
 
class ChaosAgent {
  private activeFaults: Map<string, FaultConfig> = new Map();
  private server: Server;
 
  constructor(port: number) {
    this.server = Bun.serve({
      port,
      fetch: this.handleRequest.bind(this),
    });
  }
 
  private async handleRequest(req: Request): Promise<Response> {
    const url = new URL(req.url);
 
    if (url.pathname === "/faults" && req.method === "POST") {
      const config = await req.json() as { target: string; fault: FaultConfig };
      this.activeFaults.set(config.target, config.fault);
      return Response.json({ status: "active", target: config.target });
    }
 
    if (url.pathname === "/faults" && req.method === "DELETE") {
      const target = url.searchParams.get("target");
      if (target) {
        this.activeFaults.delete(target);
        return Response.json({ status: "removed", target });
      }
      this.activeFaults.clear();
      return Response.json({ status: "cleared" });
    }
 
    if (url.pathname === "/faults") {
      return Response.json(Object.fromEntries(this.activeFaults));
    }
 
    return new Response("Not found", { status: 404 });
  }
 
  getMiddleware() {
    return async (req: Request, next: () => Promise<Response>): Promise<Response> => {
      const target = new URL(req.url).pathname;
      const fault = this.activeFaults.get(target) ?? this.activeFaults.get("*");
 
      if (!fault) return next();
 
      // Inject latency
      if (fault.latencyMs) {
        await Bun.sleep(fault.latencyMs);
      }
 
      // Inject random failure
      if (fault.failureRate && Math.random() < fault.failureRate) {
        return new Response("Chaos: Injected failure", {
          status: fault.errorCode ?? 500,
        });
      }
 
      return next();
    };
  }
}
 
// Usage
const chaos = new ChaosAgent(9090);
const middleware = chaos.getMiddleware();
 
Bun.serve({
  port: 3000,
  async fetch(req) {
    return middleware(req, async () => {
      return Response.json({ message: "Hello, world!" });
    });
  },
});

Kubernetes Chaos Experiments with Labels

import { execFileSync } from "node:child_process";

// Run kubectl with an argument array so that no value is interpreted by a shell.
function kubectl(...args: string[]): string {
  return execFileSync("kubectl", args, { encoding: "utf8" });
}

class KubernetesChaos {
  // Resolve the names of all pods matching a label selector.
  private listPods(namespace: string, labelSelector: string): string[] {
    const out = kubectl("get", "pods", "-n", namespace, "-l", labelSelector, "-o", "json");
    return JSON.parse(out).items.map((p: any) => p.metadata.name) as string[];
  }

  async killPods(namespace: string, labelSelector: string, count: number): Promise<string[]> {
    const pods = this.listPods(namespace, labelSelector);

    // Fisher-Yates shuffle; sorting with a random comparator gives a biased order.
    for (let i = pods.length - 1; i > 0; i--) {
      const j = Math.floor(Math.random() * (i + 1));
      [pods[i], pods[j]] = [pods[j], pods[i]];
    }
    const targets = pods.slice(0, count);

    for (const pod of targets) {
      kubectl("delete", "pod", "-n", namespace, pod);
      console.log(`Killed pod: ${pod}`);
    }

    return targets;
  }

  async addNetworkLatency(
    namespace: string,
    labelSelector: string,
    latencyMs: number
  ): Promise<void> {
    // kubectl exec does not accept a label selector, so resolve pod names first.
    // The container image must ship tc and the pod needs the NET_ADMIN capability.
    for (const pod of this.listPods(namespace, labelSelector)) {
      kubectl(
        "exec", "-n", namespace, pod, "--",
        "tc", "qdisc", "add", "dev", "eth0", "root", "netem", "delay", `${latencyMs}ms`
      );
    }
    console.log(`Added ${latencyMs}ms latency to pods matching ${labelSelector}`);
  }

  async drainNode(nodeName: string): Promise<void> {
    kubectl("drain", nodeName, "--ignore-daemonsets", "--delete-emptydir-data", "--force");
    console.log(`Drained node: ${nodeName}`);
  }

  async cordonNode(nodeName: string): Promise<void> {
    kubectl("cordon", nodeName);
    console.log(`Cordoned node: ${nodeName} (no new pods will be scheduled)`);
  }

  async fillDisk(namespace: string, pod: string, sizeMB: number): Promise<void> {
    kubectl(
      "exec", "-n", namespace, pod, "--",
      "dd", "if=/dev/zero", "of=/tmp/fill", "bs=1M", `count=${sizeMB}`
    );
    console.log(`Filled ${sizeMB}MB on ${pod}`);
  }
}

Network Partition Simulation

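import { execFileSync } from "node:child_process";

// Every method changes the network stack of the host it runs on and needs
// root (or CAP_NET_ADMIN). Commands use argument arrays to avoid shell injection.
class NetworkChaos {
  async partition(serviceA: string, serviceB: string): Promise<void> {
    // Run this on serviceA's host: it drops all traffic to and from serviceB's address.
    // serviceA is used for logging only.
    execFileSync("iptables", ["-A", "OUTPUT", "-d", serviceB, "-j", "DROP"]);
    execFileSync("iptables", ["-A", "INPUT", "-s", serviceB, "-j", "DROP"]);
    console.log(`Network partition: ${serviceA} <-> ${serviceB}`);
  }

  async heal(serviceA: string, serviceB: string): Promise<void> {
    execFileSync("iptables", ["-D", "OUTPUT", "-d", serviceB, "-j", "DROP"]);
    execFileSync("iptables", ["-D", "INPUT", "-s", serviceB, "-j", "DROP"]);
    console.log(`Network healed: ${serviceA} <-> ${serviceB}`);
  }

  async addLatency(target: string, latencyMs: number, jitterMs = 10): Promise<void> {
    // "replace" instead of "add": a second "add" fails once a netem qdisc is in place.
    // netem affects all egress traffic on eth0; target is used for logging only.
    execFileSync("tc", [
      "qdisc", "replace", "dev", "eth0", "root", "netem",
      "delay", `${latencyMs}ms`, `${jitterMs}ms`,
    ]);
    console.log(`Added ${latencyMs}ms ± ${jitterMs}ms latency to ${target}`);
  }

  async addPacketLoss(target: string, lossPercent: number): Promise<void> {
    // Replaces any netem settings applied earlier, including injected latency.
    execFileSync("tc", ["qdisc", "replace", "dev", "eth0", "root", "netem", "loss", `${lossPercent}%`]);
    console.log(`Added ${lossPercent}% packet loss to ${target}`);
  }

  async reset(): Promise<void> {
    try {
      execFileSync("tc", ["qdisc", "del", "dev", "eth0", "root"], { stdio: "ignore" });
    } catch {
      // Nothing to delete: no custom qdisc was configured.
    }
    console.log("Network conditions reset");
  }
}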

Steady-State Observer

interface MetricsClient {
  query(promql: string): Promise<number[]>;
}
 
class SteadyStateObserver {
  private metrics: MetricsClient;
  private prometheusUrl: string;
 
  constructor(prometheusUrl: string) {
    this.prometheusUrl = prometheusUrl;
    this.metrics = {
      query: async (promql: string) => {
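        const response = await fetch(
          `${this.prometheusUrl}/api/v1/query?query=${encodeURIComponent(promql)}`
        );
        // Fail loudly instead of parsing an error body as if it were metric data
        if (!response.ok) {
          throw new Error(`Prometheus query failed with HTTP ${response.status}`);
        }
        const data = await response.json();
        return data.data.result.map((r: any) => parseFloat(r.value[1]));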
      },
    };
  }
 
  async captureBaseline(
    hypothesis: SteadyStateHypothesis,
    duration: string
  ): Promise<BaselineResult> {
    const results: Record<string, number[]> = {};
 
    for (const metric of hypothesis.metrics) {
      const values = await this.metrics.query(metric.query);
      results[metric.name] = values;
    }
 
    return {
      stable: this.evaluateHypothesis(hypothesis, results),
      metrics: results,
      timestamp: new Date(),
    };
  }
 
  async observe(
    hypothesis: SteadyStateHypothesis,
    duration: string
  ): Promise<ObservationResult> {
    const startTime = Date.now();
    const endTime = startTime + this.parseDuration(duration);
    const observations: Array<{ timestamp: Date; metrics: Record<string, number[]> }> = [];
 
    while (Date.now() < endTime) {
      const metrics: Record<string, number[]> = {};
      for (const metric of hypothesis.metrics) {
        metrics[metric.name] = await this.metrics.query(metric.query);
      }
      observations.push({ timestamp: new Date(), metrics });
      await Bun.sleep(10000); // Observe every 10 seconds
    }
 
    const allWithinThreshold = observations.every((obs) =>
      this.evaluateHypothesis(hypothesis, obs.metrics)
    );
 
    return {
      withinThreshold: allWithinThreshold,
      observations,
      duration: Date.now() - startTime,
    };
  }
 
  private evaluateHypothesis(
    hypothesis: SteadyStateHypothesis,
    metrics: Record<string, number[]>
  ): boolean {
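    return hypothesis.metrics.every((metric) => {
      const values = metrics[metric.name] ?? [];
      // No data is not evidence of health: an empty result must fail the check
      // instead of averaging to 0 and passing every "lt" comparison.
      if (values.length === 0) return false;
      const avg = values.reduce((a, b) => a + b, 0) / values.length;

      switch (metric.comparison) {
        case "lt": return avg < metric.threshold;
        case "gt": return avg > metric.threshold;
        case "eq": return Math.abs(avg - metric.threshold) < 0.001;
      }
    });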
  }
 
  private parseDuration(duration: string): number {
    const match = duration.match(/^(\d+)(s|m|h)$/);
    // Reject unknown formats instead of silently observing for one minute
    if (!match) throw new Error(`Unsupported duration: ${duration}`);
    const value = parseInt(match[1], 10);
    switch (match[2]) {
      case "s": return value * 1000;
      case "m": return value * 60 * 1000;
      default: return value * 60 * 60 * 1000; // "h"
    }
  }
}

Real-World Use Cases

Database Failover Testing

Simulate primary database failure to verify that your application correctly fails over to the replica. Kill the primary database instance, observe how long it takes for the replica to be promoted, and verify that no data is lost during the transition. This reveals issues with connection pool configuration, failover detection, and data replication lag.
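
One way to quantify how long the failover takes is to probe the application at a fixed interval during the experiment and compute the longest run of failed probes afterwards. The sketch below assumes the probe samples have already been collected; `ProbeSample` and `longestOutageMs` are illustrative names, not part of any tool mentioned in this article.

```typescript
// One health probe result, e.g. a trivial query issued through the application.
interface ProbeSample {
  timestampMs: number;
  ok: boolean;
}

// Longest continuous outage, measured from the first failed probe of a run
// to the first successful probe after it. Samples must be in time order.
function longestOutageMs(samples: ProbeSample[]): number {
  let longest = 0;
  let outageStart: number | null = null;
  let lastTimestamp = 0;

  for (const s of samples) {
    lastTimestamp = s.timestampMs;
    if (!s.ok && outageStart === null) {
      outageStart = s.timestampMs; // outage begins
    } else if (s.ok && outageStart !== null) {
      longest = Math.max(longest, s.timestampMs - outageStart); // outage ends
      outageStart = null;
    }
  }

  // An outage still open at the end is counted up to the last sample.
  if (outageStart !== null) {
    longest = Math.max(longest, lastTimestamp - outageStart);
  }
  return longest;
}
```

The measured value is only as precise as the probe interval, so a one-second interval gives roughly one-second resolution on the promotion time.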

Microservice Dependency Failure

Inject failures into individual microservice dependencies to verify that circuit breakers, retry logic, and fallback mechanisms work correctly. For example, make the payment service return 503 errors and verify that the order service queues failed payments for retry instead of returning errors to users.
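
For readers who have not implemented one, the mechanism under test can be sketched as a small state machine. This is a minimal circuit breaker under stated assumptions: the class name, the threshold and timeout parameters, and the injectable clock are all choices made for this example, not the API of any particular library.

```typescript
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly failureThreshold: number;
  private readonly resetTimeoutMs: number;
  private readonly now: () => number;

  // The clock is injectable so the breaker can be tested without waiting.
  constructor(failureThreshold: number, resetTimeoutMs: number, now: () => number = Date.now) {
    this.failureThreshold = failureThreshold;
    this.resetTimeoutMs = resetTimeoutMs;
    this.now = now;
  }

  state(): BreakerState {
    if (this.openedAt === null) return "closed";
    // After the reset timeout, one trial request is allowed through.
    return this.now() - this.openedAt >= this.resetTimeoutMs ? "half-open" : "open";
  }

  // Callers check this before contacting the dependency.
  allowRequest(): boolean {
    return this.state() !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null;
  }

  recordFailure(): void {
    const state = this.state();
    if (state === "open") return; // requests are already blocked
    if (state === "half-open") {
      this.openedAt = this.now(); // trial request failed: reopen and restart the timer
      return;
    }
    this.failures += 1;
    if (this.failures >= this.failureThreshold) {
      this.openedAt = this.now();
    }
  }
}
```

In the payment example, the order service would consult `allowRequest()` before calling the payment service and queue the payment for retry whenever the breaker is open.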

Region Failover Simulation

Simulate the loss of an entire cloud region to verify multi-region failover procedures. This is the most complex chaos experiment, involving DNS failover, database replication, and load balancer reconfiguration. It should only be attempted after single-service experiments have been validated.

Certificate Expiration

Inject expired TLS certificates into service mesh connections to verify that certificate rotation works correctly and that services handle certificate errors gracefully. This prevents the all-too-common production outage caused by expired certificates.

Best Practices for Production

Start small and expand gradually: Begin with low-impact experiments in non-production environments. Graduate to production experiments only after building confidence with smaller blast radii.
Get organizational buy-in: Chaos engineering requires support from management and stakeholders. Explain the value in terms of reduced outage frequency and duration, not just technical curiosity.
Automate experiments: Manual chaos experiments are valuable for learning but should be automated for continuous validation. Schedule experiments to run regularly, ensuring that new code changes don't break resilience mechanisms.
Use feature flags: Wrap chaos experiments in feature flags so they can be quickly disabled if something goes wrong. This provides a safety net that encourages experimentation.
Document everything: Record every experiment — its hypothesis, setup, observations, and results. This creates a knowledge base that helps new team members understand the system's resilience characteristics.
Integrate with incident response: Use chaos experiments to test incident response procedures. Verify that alerts fire, runbooks are followed, and escalation paths work correctly.
Measure time to detect and resolve: Track how quickly your monitoring detects injected failures and how quickly your team responds. These metrics directly indicate your production resilience.
Celebrate findings, not failures: When a chaos experiment reveals a weakness, treat it as a success — a bug found in a controlled environment rather than during a production outage.

Common Pitfalls and Solutions

Pitfall	Impact	Solution
Running experiments without monitoring	Unable to determine impact	Always set up observability before injecting faults
Too large blast radius	Production outage	Start with single instances and expand gradually
No rollback plan	Extended outages	Implement automatic rollback with timeout
Testing in production without preparation	Unexpected cascading failures	Validate in staging first, then use canary production
Not involving the team	Missed learning opportunities	Run game days with full team participation
Ignoring experiment results	Repeated failures	Create action items for every weakness discovered

Debugging Failed Experiments

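When a chaos experiment reveals that the system did not maintain steady state, capture the investigation in a structured record: which metric broke the hypothesis, the root cause, and the follow-up work with an owner and a deadline.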

interface ExperimentFailureAnalysis {
  experiment: string;
  hypothesis: SteadyStateHypothesis;
  failurePoint: {
    metric: string;
    expected: string;
    actual: string;
  };
  rootCause: string;
  actionItems: Array<{
    description: string;
    owner: string;
    priority: "P0" | "P1" | "P2";
    deadline: Date;
  }>;
}
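
The `failurePoint` field can be filled in mechanically from the observed metrics. The sketch below finds the first check that did not hold; `MetricCheck`, `findFailurePoint`, and the "no data" label are names chosen for this example, and the 0.001 tolerance for "eq" mirrors the observer code earlier in the article.

```typescript
interface MetricCheck {
  name: string;
  threshold: number;
  comparison: "lt" | "gt" | "eq";
}

interface FailurePoint {
  metric: string;
  expected: string;
  actual: string;
}

const comparisonSymbols = { lt: "<", gt: ">", eq: "=" } as const;

function holds(check: MetricCheck, value: number): boolean {
  switch (check.comparison) {
    case "lt": return value < check.threshold;
    case "gt": return value > check.threshold;
    case "eq": return Math.abs(value - check.threshold) < 0.001;
  }
}

// Returns the first metric that violated the hypothesis, or null if all held.
// A metric without an observed value is reported as a failure with "no data".
function findFailurePoint(
  checks: MetricCheck[],
  observed: Record<string, number | undefined>
): FailurePoint | null {
  for (const check of checks) {
    const value = observed[check.name];
    if (value === undefined || !holds(check, value)) {
      return {
        metric: check.name,
        expected: `${comparisonSymbols[check.comparison]} ${check.threshold}`,
        actual: value === undefined ? "no data" : String(value),
      };
    }
  }
  return null;
}
```

The root cause and action items still have to come from people; only the symptom can be derived automatically.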

Performance Optimization

Efficient Metric Collection

During chaos experiments, metric collection must be efficient to avoid adding overhead to the already-stressed system:

class EfficientMetricCollector {
  private cache: Map<string, { value: number; timestamp: number }> = new Map();
  private cacheTtl: number;
 
  constructor(cacheTtlMs: number = 5000) {
    this.cacheTtl = cacheTtlMs;
  }
 
  async query(prometheusUrl: string, promql: string): Promise<number> {
    const cached = this.cache.get(promql);
    if (cached && Date.now() - cached.timestamp < this.cacheTtl) {
      return cached.value;
    }
 
    const response = await fetch(
      `${prometheusUrl}/api/v1/query?query=${encodeURIComponent(promql)}`
    );
    const data = await response.json();
    // An empty result becomes NaN, not 0, so missing data cannot pass a "less than" check
    const value = parseFloat(data.data.result[0]?.value[1] ?? "NaN");
 
    this.cache.set(promql, { value, timestamp: Date.now() });
    return value;
  }
}

Comparison with Alternatives

Approach	Proactive	Safe	Automated	Cost
Chaos Engineering	Yes	Controlled	Yes	Medium
Load Testing	Partial	Yes	Yes	Low
Manual Failover Drills	Yes	High risk	No	High
Post-Incident Review	No	N/A	No	Variable
Static Analysis	Partial	Yes	Yes	Low
Game Days	Yes	Controlled	Partial	High

When to Use Each Approach

Chaos engineering is most valuable when combined with other approaches. Load testing verifies performance under high traffic. Chaos engineering verifies resilience under failure conditions. Static analysis catches common defect patterns before the code runs. Post-incident reviews capture lessons from real failures. Together, these approaches provide comprehensive coverage.

Advanced Patterns

Automated Chaos in CI/CD

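import { execSync } from "node:child_process";

interface CiExperiment {
  name: string;
  fault: Record<string, unknown>;
  duration: string;
}

// Hypothetical runner signature: wire this to your own experiment engine.
type ExperimentRunner = (experiment: CiExperiment) => Promise<{ passed: boolean }>;

class ChaosCI {
  private runExperiment: ExperimentRunner;

  constructor(runExperiment: ExperimentRunner) {
    this.runExperiment = runExperiment;
  }

  async runChaosTests(): Promise<void> {
    // Deploy to staging (assumes a deploy:staging script exists in package.json)
    execSync("bun run deploy:staging", { stdio: "inherit" });

    // Wait for deployment to stabilize
    await Bun.sleep(30000);

    // Run chaos experiments
    const experiments: CiExperiment[] = [
      { name: "API latency", fault: { latencyMs: 200 }, duration: "2m" },
      { name: "DB connection drop", fault: { failureRate: 0.1 }, duration: "1m" },
      { name: "Cache eviction", fault: { cacheFlush: true }, duration: "5m" },
    ];

    for (const experiment of experiments) {
      const result = await this.runExperiment(experiment);
      if (!result.passed) {
        // A non-zero exit code fails the CI job
        console.error(`Chaos test failed: ${experiment.name}`);
        process.exit(1);
      }
    }

    console.log("All chaos tests passed");
  }
}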

Chaos Monkey as a Service

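interface ChaosSchedule {
  businessHours: { start: number; end: number }; // Local hours, 0-23
  probability: number;                            // Chance of an attack per check, 0-1
  enabledFaults: string[];
}

// Hypothetical injector signature: wire this to your fault injection tooling.
type InjectFault = (service: string, fault: string) => Promise<void>;

class ChaosMonkeyService {
  private schedule: Map<string, ChaosSchedule> = new Map();
  private injectFault: InjectFault;

  constructor(injectFault: InjectFault) {
    this.injectFault = injectFault;
  }

  registerService(service: string, config: ChaosSchedule): void {
    this.schedule.set(service, config);
  }

  async start(): Promise<void> {
    setInterval(async () => {
      for (const [service, config] of this.schedule) {
        if (!this.shouldAttack(config)) continue;
        try {
          await this.attack(service, config);
        } catch (err) {
          // Keep the scheduler alive if a single attack fails
          console.error(`Chaos Monkey attack on ${service} failed:`, err);
        }
      }
    }, 60000); // Check every minute
  }

  private shouldAttack(config: ChaosSchedule): boolean {
    // Attack only during business hours, when engineers are around to respond
    const hour = new Date().getHours();
    const inBusinessHours = hour >= config.businessHours.start && hour < config.businessHours.end;
    return inBusinessHours && Math.random() < config.probability;
  }

  private async attack(service: string, config: ChaosSchedule): Promise<void> {
    const faults = config.enabledFaults;
    if (faults.length === 0) return;
    console.log(`Chaos Monkey attacking: ${service}`);
    // Randomly select a fault type
    const fault = faults[Math.floor(Math.random() * faults.length)];
    await this.injectFault(service, fault);
  }
}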

Testing Strategies

Chaos Experiment Unit Tests

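import { test, expect } from "bun:test";
// Assumes ChaosAgent is exported from a local module; adjust the path to your layout.
import { ChaosAgent } from "./chaos-agent";

test("chaos agent applies latency fault", async () => {
  const adminPort = 9191; // Any free local port
  const agent = new ChaosAgent(adminPort);
  const middleware = agent.getMiddleware();

  // Activate the fault through the agent's admin endpoint
  await fetch(`http://localhost:${adminPort}/faults`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ target: "*", fault: { latencyMs: 100 } }),
  });

  const start = Date.now();
  const response = await middleware(
    new Request("http://localhost/test"),
    async () => new Response("OK")
  );
  const elapsed = Date.now() - start;

  expect(response.status).toBe(200);
  expect(elapsed).toBeGreaterThanOrEqual(90); // Allow some timer variance
});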

Future Outlook

Chaos engineering is evolving from a niche practice to a mainstream reliability engineering discipline. Cloud providers now offer managed chaos engineering services (AWS Fault Injection Service, formerly Fault Injection Simulator, and Azure Chaos Studio), making it easier for organizations to get started. Integrating AI/ML into chaos platforms could enable automated experiment selection based on production traffic patterns and historical failure data.

The rise of serverless and edge computing introduces new failure modes that chaos engineering must address. Cold starts, function timeouts, and edge node failures require new fault injection techniques. As systems become more distributed, chaos engineering will become essential for verifying resilience across increasingly complex architectures.

Conclusion

Chaos engineering transforms resilience from an assumption into a verified property of your system. By proactively injecting failures and observing system behavior, you build confidence that your system can withstand the inevitable failures that occur in production.

Key takeaways:

Start with a hypothesis: Define what steady state looks like before injecting any fault. Without a hypothesis, you cannot determine if the experiment succeeded or failed.
Control the blast radius: Start small and expand gradually. The goal is to learn, not to cause outages.
Automate experiments: Manual experiments are valuable for learning, but automated experiments provide continuous validation.
Integrate with incident response: Use chaos experiments to test not just system resilience, but also team resilience.
Build a culture of experimentation: Encourage teams to view failures as learning opportunities. Celebrate bugs found through chaos engineering, because every bug found is a production outage prevented.

Start by running one chaos experiment this week. Pick a single service, inject a simple fault, and observe what happens. The insights you gain will change how you think about system resilience.

Minh Vo
