WebGPU Compute Shaders: GPGPU in the Browser

Run compute shaders in the browser: matrix operations, physics simulation, and ML inference.

Tags: WebGPU, Compute, GPU, Frontend

By MinhVo

Introduction

General-Purpose GPU (GPGPU) computing has traditionally been the domain of native applications using CUDA or OpenCL. WebGPU changes this by bringing compute shader capabilities directly to the browser, enabling developers to harness the massive parallelism of modern GPUs for tasks like matrix operations, physics simulations, image processing, and machine learning inference—all without plugins or native code.

The compute shader model in WebGPU differs fundamentally from WebGL's fragment shader workaround for GPGPU. While developers previously shoehorned general-purpose computations into WebGL's rendering pipeline by encoding data as textures and reading results through render passes, WebGPU provides first-class compute support with dedicated compute pipelines, storage buffers, workgroup memory, and atomic operations. This makes GPU programming in the browser more natural, more performant, and more accessible to developers who aren't graphics specialists.

In this guide, we will explore WebGPU compute shaders from foundational concepts through production-ready implementations. You will learn how to set up compute pipelines, write WGSL shader code, implement real-world algorithms like matrix multiplication and particle simulation, and optimize for performance across different GPU architectures. Whether you are building data visualization tools, scientific simulations, or client-side ML inference, this guide provides the knowledge and patterns you need.

[Figure: GPU Computing Architecture]

Understanding WebGPU Compute Shaders: Core Concepts

The Compute Pipeline Model

A WebGPU compute workload is built from three pieces: the shader module (written in WGSL), the pipeline layout (defining bind group structures), and the dispatch call that launches the computation. Unlike render pipelines, compute pipelines have no vertex or fragment stages; they operate purely on data and typically produce output through storage buffers.

The execution model follows the SIMT (Single Instruction, Multiple Threads) paradigm that GPUs use natively. When you dispatch a compute shader, the GPU launches thousands of lightweight threads organized into workgroups. Each thread executes the same shader code but operates on different data, identified by a unique global_invocation_id. For workloads where each element can be processed independently, this parallelism can make the GPU orders of magnitude faster than a CPU.

Workgroups and Invocation

Threads are organized hierarchically. A workgroup contains a fixed number of threads (specified in the shader with @workgroup_size), and a dispatch launches a grid of workgroups. The total number of threads is workgroup_size_x * workgroup_size_y * workgroup_size_z * dispatch_x * dispatch_y * dispatch_z. Each thread accesses its position through built-in variables: local_invocation_id (position within the workgroup), workgroup_id (position of the workgroup in the grid), and global_invocation_id (absolute position).
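The index arithmetic described above can be sketched on the CPU. This is an illustrative helper, not part of the WebGPU API; the names `Vec3`, `totalInvocations`, and `globalInvocationId` are made up for this example.

```typescript
type Vec3 = [number, number, number];

// Total invocations launched by one dispatch: the product of all six dimensions.
function totalInvocations(workgroupSize: Vec3, dispatch: Vec3): number {
  return (
    workgroupSize[0] * workgroupSize[1] * workgroupSize[2] *
    dispatch[0] * dispatch[1] * dispatch[2]
  );
}

// Per axis: global_invocation_id = workgroup_id * workgroup_size + local_invocation_id.
function globalInvocationId(workgroupId: Vec3, localId: Vec3, workgroupSize: Vec3): Vec3 {
  return [
    workgroupId[0] * workgroupSize[0] + localId[0],
    workgroupId[1] * workgroupSize[1] + localId[1],
    workgroupId[2] * workgroupSize[2] + localId[2],
  ];
}

// Example: @workgroup_size(16, 16, 1) dispatched as (4, 2, 1).
const wgSize: Vec3 = [16, 16, 1];
console.log(totalInvocations(wgSize, [4, 2, 1]));                        // 2048
console.log(globalInvocationId([3, 1, 0], [5, 7, 0], wgSize).join(',')); // 53,23,0
```

Inside a shader the GPU supplies these values as built-ins; the sketch is only meant to make the relationship between the three IDs concrete.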

Workgroup size matters for performance. GPUs schedule threads in fixed-width groups: warps on NVIDIA (32 threads) and wavefronts on AMD (64 threads on GCN; newer RDNA hardware can also run 32-wide). Choosing a workgroup size that is a multiple of 64 fills these groups on both vendors. Common choices are 64, 128, or 256 threads per workgroup, depending on register pressure and shared memory usage; 256 is also WebGPU's default limit for maxComputeInvocationsPerWorkgroup.
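A rough way to reason about these numbers is sketched below. It uses a deliberately simplified model (a workgroup occupies whole warps or wavefronts and nothing else affects occupancy), and `workgroupsFor` and `laneUtilization` are hypothetical helper names.

```typescript
// Number of workgroups needed so that each of `elementCount` items gets a thread.
function workgroupsFor(elementCount: number, workgroupSize: number): number {
  return Math.ceil(elementCount / workgroupSize);
}

// Fraction of hardware lanes doing useful work when a workgroup of
// `workgroupSize` threads is scheduled in groups of `subgroupSize` lanes.
// Simplified model: a partially filled warp/wavefront still occupies all its lanes.
function laneUtilization(workgroupSize: number, subgroupSize: number): number {
  const subgroups = Math.ceil(workgroupSize / subgroupSize);
  return workgroupSize / (subgroups * subgroupSize);
}

console.log(workgroupsFor(1000, 64));  // 16 (the last workgroup has 24 idle threads)
console.log(laneUtilization(64, 32));  // 1
console.log(laneUtilization(48, 32));  // 0.75
console.log(laneUtilization(96, 64));  // 0.75
```

Because the last workgroup is usually only partly filled, the shader still needs a bounds check such as `if (i >= count) { return; }`.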

Storage Buffers and Bind Groups

Compute shaders read and write data through storage buffers. These are GPU buffers created with the storage usage flag and bound to the shader through bind groups. A bind group is a set of resources (buffers, textures, samplers) that are collectively bound to a specific slot in the pipeline layout. Each binding in the group has a binding number that corresponds to a @binding attribute in the WGSL shader.
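To make the correspondence between `@binding` numbers and bind group entries concrete, here is a small sketch that lists the bindings a shader declares and reports which ones a set of entries leaves unbound. It is a naive regex scan that only recognizes declarations written as `@group(G) @binding(B) var<...> name`, not a WGSL parser, and `listBindings` and `unboundBindings` are hypothetical helpers.

```typescript
interface BindingDecl { group: number; binding: number; name: string; }

// Collect every `@group(G) @binding(B) var<...> name` declaration in the source.
function listBindings(wgsl: string): BindingDecl[] {
  const pattern = /@group\((\d+)\)\s*@binding\((\d+)\)\s*var(?:<[^>]*>)?\s+(\w+)/g;
  const found: BindingDecl[] = [];
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(wgsl)) !== null) {
    found.push({ group: Number(m[1]), binding: Number(m[2]), name: m[3] ?? '' });
  }
  return found;
}

// Names of declared bindings in `group` that have no matching entry.
function unboundBindings(
  decls: BindingDecl[],
  group: number,
  entries: Array<{ binding: number }>
): string[] {
  const provided = new Set(entries.map((e) => e.binding));
  return decls
    .filter((d) => d.group === group && !provided.has(d.binding))
    .map((d) => d.name);
}

const shaderSource = `
@group(0) @binding(0) var<storage, read> A: array<f32>;
@group(0) @binding(1) var<storage, read_write> C: array<f32>;
@group(0) @binding(2) var<uniform> params: Params;
`;
const decls = listBindings(shaderSource);
console.log(decls.map((d) => `${d.binding}:${d.name}`).join(', ')); // 0:A, 1:C, 2:params
console.log(unboundBindings(decls, 0, [{ binding: 0 }, { binding: 2 }]).join(', ')); // C
```

In real code WebGPU validation reports a mismatched bind group itself; a check like this is only a convenience for catching the mistake earlier, for example in a unit test.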

The data flow for a typical compute operation is: create a GPU buffer with input data, copy it to a storage buffer, dispatch the compute shader, then read the results back from the output buffer. WebGPU's buffer mapping API handles the CPU-GPU synchronization, and staging buffers optimize transfers by avoiding direct CPU access to GPU-local memory.

WGSL: The WebGPU Shading Language

WGSL (WebGPU Shading Language) is the shader language for WebGPU. It is a statically typed, C-like language designed for safety and portability. WGSL supports structured programming with functions, control flow, and a rich set of built-in types including vectors (vec2f, vec3f, vec4f), matrices (mat4x4f), and arrays. For compute shaders, the most important types are scalar numerics (f32, i32, u32) and arrays.
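These types have alignment requirements that matter when you fill a buffer from JavaScript: for example, a vec3f occupies 12 bytes but must start on a 16-byte boundary. The sketch below computes member offsets for a flat struct using the WGSL alignment and size values for a handful of types. It ignores arrays, nested structs, and the extra rules of the uniform address space, and `structLayout` is a hypothetical helper.

```typescript
type WgslType = 'f32' | 'u32' | 'i32' | 'vec2f' | 'vec3f' | 'vec4f' | 'mat4x4f';

// Alignment and size in bytes for a few WGSL types.
const LAYOUT: Record<WgslType, { align: number; size: number }> = {
  f32: { align: 4, size: 4 },
  u32: { align: 4, size: 4 },
  i32: { align: 4, size: 4 },
  vec2f: { align: 8, size: 8 },
  vec3f: { align: 16, size: 12 },
  vec4f: { align: 16, size: 16 },
  mat4x4f: { align: 16, size: 64 },
};

const roundUp = (n: number, multiple: number): number =>
  Math.ceil(n / multiple) * multiple;

// Each member starts at the next multiple of its alignment; the struct size
// is rounded up to the largest member alignment.
function structLayout(members: Array<[string, WgslType]>): {
  offsets: Record<string, number>;
  size: number;
} {
  let offset = 0;
  let structAlign = 4;
  const offsets: Record<string, number> = {};
  for (const [memberName, type] of members) {
    const { align, size } = LAYOUT[type];
    offset = roundUp(offset, align);
    offsets[memberName] = offset;
    offset += size;
    structAlign = Math.max(structAlign, align);
  }
  return { offsets, size: roundUp(offset, structAlign) };
}

const layout = structLayout([['pos', 'vec3f'], ['mass', 'f32'], ['vel', 'vec3f']]);
console.log(layout.offsets['pos'], layout.offsets['mass'], layout.offsets['vel']); // 0 12 16
console.log(layout.size); // 32
```

Note how `mass` fits into the 4 bytes of padding after `pos`, while `vel` is pushed to offset 16 and the struct is padded from 28 to 32 bytes.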

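WGSL includes synchronization primitives essential for compute workloads. workgroupBarrier() makes every thread in a workgroup wait at the same point and orders their accesses to workgroup (shared) memory; storageBarrier() does the same for accesses to storage buffers. Both act only within a single workgroup: neither synchronizes across workgroups, so data produced by one workgroup for another is normally consumed in a later dispatch. These barriers are critical for correctness in algorithms where threads must cooperate, such as parallel reductions or prefix sums.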

[Figure: Compute Pipeline Architecture]

Architecture and Design Patterns

The Buffer Lifecycle Pattern

WebGPU buffers have a well-defined lifecycle: creation, mapping (CPU writes data), unmapping (buffer becomes available to GPU), use in GPU operations, and optional readback (GPU results copied to CPU). Understanding this lifecycle prevents synchronization bugs and performance pitfalls.

// Buffer lifecycle pattern
async function createAndPopulateBuffer(
  device: GPUDevice,
  data: Float32Array
): Promise<GPUBuffer> {
  // Create a staging buffer (mappable) for CPU upload
  const stagingBuffer = device.createBuffer({
    size: data.byteLength,
    usage: GPUBufferUsage.MAP_WRITE | GPUBufferUsage.COPY_SRC,
    mappedAtCreation: true,
  });
 
  // Write data while the buffer is mapped
  new Float32Array(stagingBuffer.getMappedRange()).set(data);
  stagingBuffer.unmap();
 
  // Create a storage buffer (GPU-local) for compute shader access
  const storageBuffer = device.createBuffer({
    size: data.byteLength,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST,
  });
 
  // Copy from staging to storage
  const encoder = device.createCommandEncoder();
  encoder.copyBufferToBuffer(stagingBuffer, 0, storageBuffer, 0, data.byteLength);
  device.queue.submit([encoder.finish()]);
 
  // Staging buffer can be destroyed after copy
  stagingBuffer.destroy();
  return storageBuffer;
}

The Compute-Readback Pattern

The most common pattern for GPU computing is: upload data, dispatch compute, read results. This requires careful buffer management because GPU operations are asynchronous.

async function gpuCompute(
  device: GPUDevice,
  pipeline: GPUComputePipeline,
  bindGroup: GPUBindGroup,
  outputBuffer: GPUBuffer,
  workgroupCount: [number, number, number]
): Promise<Float32Array> {
  // Encode compute commands
  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(...workgroupCount);
  pass.end();
 
  // Create a readback buffer
  const readbackBuffer = device.createBuffer({
    size: outputBuffer.size,
    usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
  });
 
  // Copy results to readback buffer
  encoder.copyBufferToBuffer(outputBuffer, 0, readbackBuffer, 0, outputBuffer.size);
  device.queue.submit([encoder.finish()]);
 
  // Wait for GPU to finish and map the readback buffer
  await readbackBuffer.mapAsync(GPUMapMode.READ);
  const result = new Float32Array(readbackBuffer.getMappedRange().slice(0));
  readbackBuffer.unmap();
  readbackBuffer.destroy();
 
  return result;
}

Pipeline Caching and Reuse

Creating compute pipelines is expensive. Cache pipelines by their configuration and reuse them across dispatches. The pipeline layout (bind group layouts) can also be shared between pipelines that use the same resource bindings.

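class PipelineCache {
  // Keyed first by layout object, then by entry point and shader source, so the
  // same shader built against two different layouts is never conflated.
  private cache = new Map<GPUPipelineLayout | 'auto', Map<string, GPUComputePipeline>>();

  constructor(private device: GPUDevice) {}

  getPipeline(
    shaderCode: string,
    entryPoint: string,
    layout: GPUPipelineLayout | 'auto'
  ): GPUComputePipeline {
    let byShader = this.cache.get(layout);
    if (!byShader) {
      byShader = new Map();
      this.cache.set(layout, byShader);
    }

    const key = `${entryPoint}:${shaderCode}`;
    const cached = byShader.get(key);
    if (cached) return cached;

    const module = this.device.createShaderModule({ code: shaderCode });
    const pipeline = this.device.createComputePipeline({
      layout,
      compute: { module, entryPoint },
    });

    byShader.set(key, pipeline);
    return pipeline;
  }
}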

Step-by-Step Implementation

Setting Up WebGPU

// webgpu-init.ts
async function initWebGPU(): Promise<{
  adapter: GPUAdapter;
  device: GPUDevice;
}> {
  if (!navigator.gpu) {
    throw new Error('WebGPU not supported in this browser');
  }

  const adapter = await navigator.gpu.requestAdapter({
    powerPreference: 'high-performance',
  });
  if (!adapter) throw new Error('No GPU adapter found');

  // Request the largest buffer limits this adapter supports
  const device = await adapter.requestDevice({
    requiredLimits: {
      maxStorageBufferBindingSize: adapter.limits.maxStorageBufferBindingSize,
      maxBufferSize: adapter.limits.maxBufferSize,
    },
  });

  device.lost.then((info) => {
    console.error('GPU device lost:', info.message);
  });

  // Compute-only setup: no canvas context is needed
  return { adapter, device };
}

Matrix Multiplication with Compute Shaders

Matrix multiplication is the canonical GPU compute workload. Here is a complete implementation that stages 16×16 tiles in workgroup (shared) memory so each value is read from the storage buffers once per tile rather than once per thread:

// matrix-multiply.wgsl
@group(0) @binding(0) var<storage, read> A: array<f32>;
@group(0) @binding(1) var<storage, read> B: array<f32>;
@group(0) @binding(2) var<storage, read_write> C: array<f32>;
 
struct Params {
  M: u32,
  N: u32,
  K: u32,
};
@group(0) @binding(3) var<uniform> params: Params;
 
const TILE_SIZE: u32 = 16u;
 
var<workgroup> tileA: array<array<f32, 16>, 16>;
var<workgroup> tileB: array<array<f32, 16>, 16>;
 
@compute @workgroup_size(16, 16, 1)
fn main(
  @builtin(global_invocation_id) globalId: vec3u,
  @builtin(local_invocation_id) localId: vec3u,
  @builtin(workgroup_id) workgroupId: vec3u
) {
  let row = globalId.y;
  let col = globalId.x;
 
  var sum: f32 = 0.0;
  let numTiles = (params.K + TILE_SIZE - 1u) / TILE_SIZE;
 
  for (var t: u32 = 0u; t < numTiles; t++) {
    // Load tile from A into shared memory
    let aCol = t * TILE_SIZE + localId.x;
    if (row < params.M && aCol < params.K) {
      tileA[localId.y][localId.x] = A[row * params.K + aCol];
    } else {
      tileA[localId.y][localId.x] = 0.0;
    }
 
    // Load tile from B into shared memory
    let bRow = t * TILE_SIZE + localId.y;
    if (bRow < params.K && col < params.N) {
      tileB[localId.y][localId.x] = B[bRow * params.N + col];
    } else {
      tileB[localId.y][localId.x] = 0.0;
    }
 
    workgroupBarrier();
 
    // Compute partial sum for this tile
    for (var k: u32 = 0u; k < TILE_SIZE; k++) {
      sum += tileA[localId.y][k] * tileB[k][localId.x];
    }
 
    workgroupBarrier();
  }
 
  if (row < params.M && col < params.N) {
    C[row * params.N + col] = sum;
  }
}

// matrix-compute.ts
async function gpuMatrixMultiply(
  device: GPUDevice,
  a: Float32Array, b: Float32Array,
  M: number, N: number, K: number
): Promise<Float32Array> {
  const shaderCode = await fetch('/shaders/matrix-multiply.wgsl').then(r => r.text());
  const module = device.createShaderModule({ code: shaderCode });
 
  const pipeline = device.createComputePipeline({
    layout: 'auto',
    compute: { module, entryPoint: 'main' },
  });
 
  // Create buffers
  const bufferSize = (arr: Float32Array) => Math.max(arr.byteLength, 16);
  const createStorageBuffer = (data: Float32Array) => {
    const buf = device.createBuffer({
      size: bufferSize(data),
      usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST,
    });
    device.queue.writeBuffer(buf, 0, data);
    return buf;
  };
 
  const bufA = createStorageBuffer(a);
  const bufB = createStorageBuffer(b);
  const bufC = device.createBuffer({
    size: M * N * 4,
    usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC,
  });
 
  const params = new Uint32Array([M, N, K]);
  const paramBuffer = device.createBuffer({
    size: 16,
    usage: GPUBufferUsage.UNIFORM | GPUBufferUsage.COPY_DST,
  });
  device.queue.writeBuffer(paramBuffer, 0, params);
 
  const bindGroup = device.createBindGroup({
    layout: pipeline.getBindGroupLayout(0),
    entries: [
      { binding: 0, resource: { buffer: bufA } },
      { binding: 1, resource: { buffer: bufB } },
      { binding: 2, resource: { buffer: bufC } },
      { binding: 3, resource: { buffer: paramBuffer } },
    ],
  });
 
  const TILE = 16;
  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroups(
    Math.ceil(N / TILE),
    Math.ceil(M / TILE),
    1
  );
  pass.end();
 
  // Readback
  const readback = device.createBuffer({
    size: M * N * 4,
    usage: GPUBufferUsage.MAP_READ | GPUBufferUsage.COPY_DST,
  });
  encoder.copyBufferToBuffer(bufC, 0, readback, 0, M * N * 4);
  device.queue.submit([encoder.finish()]);
 
  await readback.mapAsync(GPUMapMode.READ);
  const result = new Float32Array(readback.getMappedRange().slice(0));
  readback.unmap();
 
  bufA.destroy(); bufB.destroy(); bufC.destroy();
  paramBuffer.destroy(); readback.destroy();
 
  return result;
}
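When checking a GPU kernel like this one, it helps to have a plain CPU reference to compare against. The following is a minimal sketch using the same row-major layout (A is M×K, B is K×N, C is M×N); `cpuMatMul` is a name chosen for this example.

```typescript
// Naive triple-loop matrix multiply on row-major Float32Arrays.
function cpuMatMul(
  a: Float32Array, b: Float32Array,
  M: number, N: number, K: number
): Float32Array {
  const c = new Float32Array(M * N);
  for (let row = 0; row < M; row++) {
    for (let col = 0; col < N; col++) {
      let sum = 0;
      for (let k = 0; k < K; k++) {
        sum += a[row * K + k] * b[k * N + col];
      }
      c[row * N + col] = sum;
    }
  }
  return c;
}

const product = cpuMatMul(
  new Float32Array([1, 2, 3, 4]),
  new Float32Array([5, 6, 7, 8]),
  2, 2, 2
);
console.log(Array.from(product).join(', ')); // 19, 22, 43, 50
```

The CPU sums in double precision before storing to float32, while the shader accumulates in f32, so compare the two with a tolerance rather than exact equality.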

[Figure: GPU Compute Workflow]

Real-World Use Cases

Use Case 1: N-Body Particle Simulation

Physics simulations with thousands of interacting particles are embarrassingly parallel and ideally suited for GPU compute. Each particle computes gravitational forces from every other particle, yielding O(n²) computations that the GPU handles efficiently.

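// n-body.wgsl
struct Particle { pos: vec4f, vel: vec4f, mass: f32, _pad: array<f32, 3> };
// Read the previous state and write the new state to a second buffer, so no
// thread reads a position that another thread has already updated this step.
@group(0) @binding(0) var<storage, read> particlesIn: array<Particle>;
@group(0) @binding(1) var<storage, read_write> particlesOut: array<Particle>;
struct SimParams { dt: f32, epsilon: f32, numParticles: u32, _pad: u32 };
@group(0) @binding(2) var<uniform> params: SimParams;

@compute @workgroup_size(256)
fn main(@builtin(global_invocation_id) id: vec3u) {
  let i = id.x;
  if (i >= params.numParticles) { return; }

  var acc = vec3f(0.0);
  let p = particlesIn[i];
  let myPos = p.pos.xyz;

  for (var j: u32 = 0u; j < params.numParticles; j++) {
    let diff = particlesIn[j].pos.xyz - myPos;
    // A nonzero epsilon softens close encounters and keeps the j == i term finite
    let distSq = dot(diff, diff) + params.epsilon * params.epsilon;
    let invDist = inverseSqrt(distSq);
    acc += particlesIn[j].mass * diff * invDist * invDist * invDist;
  }

  // WGSL does not allow assigning to a swizzle such as vel.xyz, so build new vectors
  let newVel = p.vel.xyz + acc * params.dt;
  let newPos = myPos + newVel * params.dt;
  particlesOut[i].pos = vec4f(newPos, p.pos.w);
  particlesOut[i].vel = vec4f(newVel, p.vel.w);
  particlesOut[i].mass = p.mass;
}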
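A CPU version of the same update is useful for validating the shader on small inputs. The sketch below reads the previous state and returns a new array (the CPU equivalent of double-buffering) and uses the same softened-gravity formula; `PointMass` and `stepBodies` are names invented for this example, and the gravitational constant is folded into the masses as in the shader.

```typescript
type Triple = [number, number, number];
interface PointMass { pos: Triple; vel: Triple; mass: number; }

// One integration step: accumulate acceleration from every body, then update
// velocity first and position second (semi-implicit Euler).
function stepBodies(bodies: PointMass[], dt: number, epsilon: number): PointMass[] {
  return bodies.map((current) => {
    const acc = [0, 0, 0];
    for (const other of bodies) {
      const d = [
        other.pos[0] - current.pos[0],
        other.pos[1] - current.pos[1],
        other.pos[2] - current.pos[2],
      ];
      // epsilon must be nonzero, otherwise the self-interaction term divides by zero
      const distSq = d[0] * d[0] + d[1] * d[1] + d[2] * d[2] + epsilon * epsilon;
      const invDist = 1 / Math.sqrt(distSq);
      const scale = other.mass * invDist * invDist * invDist;
      acc[0] += d[0] * scale;
      acc[1] += d[1] * scale;
      acc[2] += d[2] * scale;
    }
    const vel: Triple = [
      current.vel[0] + acc[0] * dt,
      current.vel[1] + acc[1] * dt,
      current.vel[2] + acc[2] * dt,
    ];
    const pos: Triple = [
      current.pos[0] + vel[0] * dt,
      current.pos[1] + vel[1] * dt,
      current.pos[2] + vel[2] * dt,
    ];
    return { pos, vel, mass: current.mass };
  });
}

// Two equal masses at rest attract each other symmetrically.
const next = stepBodies(
  [
    { pos: [-1, 0, 0], vel: [0, 0, 0], mass: 1 },
    { pos: [1, 0, 0], vel: [0, 0, 0], mass: 1 },
  ],
  0.1,
  0.01
);
console.log(next[0].vel[0] > 0, next[1].vel[0] < 0); // true true
```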

Use Case 2: Prefix Sum (Scan) for Stream Compaction

Prefix sum is a fundamental parallel primitive used in sorting, stream compaction, and histogram computation. The GPU implementation uses the work-efficient Blelloch scan algorithm with workgroup-shared memory.
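The structure of the Blelloch scan is easier to see in a sequential sketch. The version below runs on the CPU and is not a shader; each inner loop corresponds to work that independent GPU threads would do between barriers. `blellochScan` and `compact` are names chosen for this example.

```typescript
// Exclusive prefix sum with the Blelloch up-sweep / down-sweep structure.
function blellochScan(input: number[]): number[] {
  // Pad to a power of two so the tree is balanced.
  let n = 1;
  while (n < input.length) n *= 2;
  const data = new Array<number>(n).fill(0);
  for (let i = 0; i < input.length; i++) data[i] = input[i];

  // Up-sweep: build partial sums up the tree.
  for (let stride = 1; stride < n; stride *= 2) {
    for (let i = 2 * stride - 1; i < n; i += 2 * stride) {
      data[i] += data[i - stride];
    }
  }

  // Down-sweep: clear the root, then push prefixes back down.
  data[n - 1] = 0;
  for (let stride = n / 2; stride >= 1; stride /= 2) {
    for (let i = 2 * stride - 1; i < n; i += 2 * stride) {
      const left = data[i - stride];
      data[i - stride] = data[i];
      data[i] += left;
    }
  }
  return data.slice(0, input.length);
}

// Stream compaction: the scan of the keep-flags gives each survivor its output slot.
function compact<T>(items: T[], keep: (item: T) => boolean): T[] {
  const flags: number[] = items.map((item) => (keep(item) ? 1 : 0));
  const offsets = blellochScan(flags);
  const out: T[] = [];
  items.forEach((item, i) => {
    if (flags[i] === 1) out[offsets[i]] = item;
  });
  return out;
}

console.log(blellochScan([1, 2, 3, 4]).join(', '));              // 0, 1, 3, 6
console.log(compact([5, -1, 7, -3, 9], (x) => x > 0).join(', ')); // 5, 7, 9
```

On the GPU the same two phases run inside one workgroup on `var<workgroup>` memory, with a workgroupBarrier() after each stride; arrays larger than one workgroup need an extra pass over the per-workgroup totals.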

Use Case 3: Image Convolution

Applying convolution kernels (blur, edge detection, sharpening) to images is a natural fit for GPU compute. Each thread computes one output pixel by sampling the kernel neighborhood from the input image, achieving real-time processing of high-resolution images.
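The per-pixel work can be sketched on the CPU as follows. Each iteration of the two outer loops is what a single GPU thread would compute. This sketch assumes a single-channel image, a square kernel, and clamp-to-edge sampling at the borders; `convolve2D` is a hypothetical helper.

```typescript
// Convolve a single-channel, row-major image with a square kernel.
// Samples outside the image are clamped to the nearest edge pixel.
function convolve2D(
  src: Float32Array, width: number, height: number,
  kernel: number[], kernelSize: number
): Float32Array {
  const out = new Float32Array(width * height);
  const half = Math.floor(kernelSize / 2);
  for (let y = 0; y < height; y++) {
    for (let x = 0; x < width; x++) {
      let sum = 0;
      for (let ky = 0; ky < kernelSize; ky++) {
        for (let kx = 0; kx < kernelSize; kx++) {
          const sx = Math.min(width - 1, Math.max(0, x + kx - half));
          const sy = Math.min(height - 1, Math.max(0, y + ky - half));
          sum += src[sy * width + sx] * kernel[ky * kernelSize + kx];
        }
      }
      out[y * width + x] = sum;
    }
  }
  return out;
}

// A 3x3 box blur spreads a single bright pixel evenly over its neighborhood.
const boxBlur = new Array<number>(9).fill(1 / 9);
const image = new Float32Array([0, 0, 0, 0, 9, 0, 0, 0, 0]);
const blurred = convolve2D(image, 3, 3, boxBlur, 3);
console.log(blurred[4].toFixed(2)); // 1.00
```

In a compute shader the two outer loops disappear: the thread's global_invocation_id supplies x and y, and a tile of the input can be staged in workgroup memory because neighboring threads sample overlapping pixels.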

Use Case 4: Client-Side Data Aggregation

When visualizing large datasets in the browser, GPU compute can perform aggregations (sums, histograms, binning) on millions of data points in milliseconds, enabling interactive filtering and drill-down without server round-trips.
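As an illustration of the kind of aggregation meant here, the sketch below bins values into a fixed-width histogram on the CPU. On the GPU, one plausible mapping is a thread per value incrementing an `array<atomic<u32>>` with atomicAdd; the function name and the inclusive-range convention are assumptions made for this example.

```typescript
// Count values into `binCount` equal-width bins over [min, max].
// Values outside the range (and NaN) are skipped; `max` itself lands in the last bin.
function histogram(
  values: Float32Array, min: number, max: number, binCount: number
): Uint32Array {
  const bins = new Uint32Array(binCount);
  const scale = binCount / (max - min);
  for (let i = 0; i < values.length; i++) {
    const v = values[i];
    if (!(v >= min && v <= max)) continue;
    const bin = Math.min(binCount - 1, Math.floor((v - min) * scale));
    bins[bin] += 1;
  }
  return bins;
}

const counts = histogram(new Float32Array([0, 0.1, 0.5, 0.9, 1.0, 2.0]), 0, 1, 2);
console.log(Array.from(counts).join(', ')); // 2, 3
```

A CPU reference like this is also a convenient baseline for measuring whether the GPU version pays off for a given dataset size.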

Best Practices for Production

  1. Choose workgroup sizes aligned with warp/wavefront size — Use 64, 128, or 256 threads per workgroup. Sizes that are multiples of 32 (NVIDIA warp size) or 64 (AMD wavefront size) ensure full hardware utilization. Non-aligned sizes waste execution lanes.

  2. Use workgroup (shared) memory for data reuse — When multiple threads in a workgroup read the same data, load it into var<workgroup> memory once. Workgroup memory is on-chip and has much lower latency than storage buffers, so this usually beats having each thread re-read the same values independently; the size of the gain depends on the hardware, so measure it. The tiled matrix multiply pattern demonstrates this optimization.

  3. Minimize buffer creation per frame — Creating GPU buffers is expensive. Pre-allocate buffers at initialization and reuse them across frames. Use buffer sub-allocation to manage multiple data structures within a single large buffer.

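  4. Batch small dispatches, and use indirect dispatch for GPU-sized work — Instead of many small compute dispatches, combine them into fewer, larger ones to cut per-dispatch overhead. Separately, when the workload size is determined by GPU output, use dispatchWorkgroupsIndirect() with an indirect buffer so the CPU does not have to read the count back.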

  5. Avoid readback when possible — Reading results from GPU to CPU is the slowest part of the pipeline. Chain compute shader output directly to another compute pass or to a render pass (e.g., rendering computed particle positions). Only read back when the CPU needs the data.

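  6. Profile each pass — WebGPU's optional timestamp-query feature lets you record GPU timestamps at the start and end of a compute pass, which gives per-pass GPU timings; browser performance tools help on the CPU side. Use these measurements to identify bottlenecks. If a compute pass takes more than about 4 ms (roughly a quarter of a 60 fps frame budget), consider splitting it or reducing workgroup register pressure.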

  7. Handle GPU device loss gracefully — GPU devices can be lost due to driver updates, resource exhaustion, or browser tab throttling. Listen for the device.lost promise and implement recovery logic (re-request adapter, recreate resources).

  8. Use writeBuffer for small data, staging buffers for large uploads — device.queue.writeBuffer() is convenient and efficient for small transfers because the implementation manages the intermediate copy for you. For larger uploads (as a rule of thumb, above a few megabytes), explicitly create staging buffers to control the transfer.
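The sub-allocation idea from item 3 can be sketched as a simple bump allocator that hands out aligned byte ranges inside one large buffer. The 256-byte default matches WebGPU's default minStorageBufferOffsetAlignment, but treat it as an assumption and read the actual value from device.limits; `BufferSubAllocator` is a hypothetical class that only does the bookkeeping, not the GPU calls.

```typescript
interface SubAllocation { offset: number; size: number; }

// Hands out non-overlapping, aligned ranges from a fixed-capacity buffer.
class BufferSubAllocator {
  private cursor = 0;
  private readonly capacity: number;
  private readonly alignment: number;

  constructor(capacity: number, alignment: number = 256) {
    this.capacity = capacity;
    this.alignment = alignment;
  }

  allocate(byteLength: number): SubAllocation {
    // Offsets used for bind group entries must respect the offset alignment limit.
    const offset = Math.ceil(this.cursor / this.alignment) * this.alignment;
    // Round sizes up to 4 bytes, the granularity required for buffer copies.
    const size = Math.ceil(byteLength / 4) * 4;
    if (offset + size > this.capacity) {
      throw new RangeError('sub-allocator out of space');
    }
    this.cursor = offset + size;
    return { offset, size };
  }

  // Reuse the whole buffer, for example at the start of a frame.
  reset(): void {
    this.cursor = 0;
  }
}

const allocator = new BufferSubAllocator(1024);
console.log(allocator.allocate(100).offset); // 0
console.log(allocator.allocate(10).offset);  // 256
```

Each returned range can then be passed as `{ buffer, offset, size }` in a bind group entry, so several data structures share one GPUBuffer.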

Common Pitfalls and Solutions

| Pitfall | Impact | Solution |
| --- | --- | --- |
| Missing workgroupBarrier() between shared memory writes and reads | Race conditions produce garbage data | Always place barriers between phases: load → barrier → compute → barrier → store |
| Dispatching too many workgroups | Validation error when a dimension exceeds the limit; very long-running dispatches can hang the GPU or get the tab killed | Respect the maxComputeWorkgroupsPerDimension limit (65535 by default); split large dispatches |
| Incorrect buffer alignment | Validation errors or silent data corruption | Ensure buffer sizes are multiples of 4 bytes; struct fields follow WGSL alignment rules |
| Reading stale buffer data | Using data from a previous frame | Use double-buffering; await device.queue.onSubmittedWorkDone() (or mapAsync() on the readback buffer) before reuse |
| Too many bindings per bind group | Hits maxBindingsPerBindGroup or a per-stage limit such as maxStorageBuffersPerShaderStage | Consolidate resources; pack related data into one larger storage buffer instead of separate bindings |
| Exceeding workgroup shared memory | Pipeline creation fails with a validation error | Keep shared memory under maxComputeWorkgroupStorageSize (16 KB by default; larger values must be requested from the adapter) |
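For the "too many workgroups" case, one common approach is to fold a large one-dimensional workgroup count into a two-dimensional grid. The sketch below assumes the default limit of 65535 per dimension (check device.limits.maxComputeWorkgroupsPerDimension on real hardware); `splitDispatch` is a hypothetical helper.

```typescript
// Turn a 1D workgroup count into an [x, y, 1] grid with x * y >= total
// and neither dimension above the limit.
function splitDispatch(
  totalWorkgroups: number,
  maxPerDimension: number = 65535
): [number, number, number] {
  if (totalWorkgroups <= maxPerDimension) return [totalWorkgroups, 1, 1];
  const y = Math.ceil(totalWorkgroups / maxPerDimension);
  if (y > maxPerDimension) {
    throw new RangeError('workload too large for a 2D dispatch');
  }
  const x = Math.ceil(totalWorkgroups / y);
  return [x, y, 1];
}

console.log(splitDispatch(1000).join(', '));   // 1000, 1, 1
console.log(splitDispatch(100000).join(', ')); // 50000, 2, 1
```

Because x * y can slightly exceed the requested total, the shader has to rebuild the flat index, for example `workgroup_id.y * num_workgroups.x + workgroup_id.x`, and return early for indices past the end of the data.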

Performance Optimization

The key to GPU compute performance is maximizing memory bandwidth utilization and minimizing thread divergence. GPUs achieve peak performance when all threads in a warp execute the same instruction on contiguous memory addresses.

// Optimized parallel reduction using shared memory
@group(0) @binding(0) var<storage, read> inputData: array<f32>;
@group(0) @binding(1) var<storage, read_write> outputData: array<f32>;

var<workgroup> sharedData: array<f32, 256>;

@compute @workgroup_size(256)
fn reduceSum(
  @builtin(local_invocation_id) localId: vec3u,
  @builtin(workgroup_id) groupId: vec3u
) {
  // Each thread loads one element into shared memory (0.0 past the end of the input)
  let globalIdx = groupId.x * 256u + localId.x;
  var value = 0.0;
  if (globalIdx < arrayLength(&inputData)) {
    value = inputData[globalIdx];
  }
  sharedData[localId.x] = value;
  workgroupBarrier();

  // Tree reduction in shared memory: halve the active range each step
  for (var stride = 128u; stride > 0u; stride >>= 1u) {
    if (localId.x < stride) {
      sharedData[localId.x] += sharedData[localId.x + stride];
    }
    workgroupBarrier();
  }

  // Thread 0 writes the workgroup's partial sum; a further pass combines the partials
  if (localId.x == 0u) {
    outputData[groupId.x] = sharedData[0];
  }
}
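Each workgroup in the shader above produces one partial sum, so reducing a large array takes several passes. The CPU sketch below mirrors that structure to show how the passes chain together; it is a model of the algorithm, not GPU code, and the function names are invented for this example.

```typescript
// Mirrors one workgroup: a tree reduction over a power-of-two number of slots.
function reduceWorkgroup(values: number[]): number {
  const shared = values.slice();
  for (let stride = shared.length >> 1; stride > 0; stride >>= 1) {
    // On the GPU these iterations run in parallel, followed by a barrier.
    for (let lane = 0; lane < stride; lane++) {
      shared[lane] += shared[lane + stride];
    }
  }
  return shared[0];
}

// One pass: every workgroup reduces `workgroupSize` inputs to one partial sum.
function reducePass(input: number[], workgroupSize: number): number[] {
  const groups = Math.ceil(input.length / workgroupSize);
  const partials: number[] = [];
  for (let g = 0; g < groups; g++) {
    const chunk = new Array<number>(workgroupSize).fill(0);
    for (let i = 0; i < workgroupSize; i++) {
      const idx = g * workgroupSize + i;
      if (idx < input.length) chunk[i] = input[idx];
    }
    partials.push(reduceWorkgroup(chunk));
  }
  return partials;
}

// Repeat passes until a single value remains.
function reduceAll(
  input: number[],
  workgroupSize: number = 256
): { sum: number; passes: number } {
  let data = input;
  let passes = 0;
  while (data.length > 1) {
    data = reducePass(data, workgroupSize);
    passes++;
  }
  return { sum: data.length === 1 ? data[0] : 0, passes };
}

const oneToThousand = Array.from({ length: 1000 }, (_, i) => i + 1);
const reduced = reduceAll(oneToThousand);
console.log(reduced.sum, reduced.passes); // 500500 2
```

With 256 threads per workgroup, 1000 inputs shrink to 4 partial sums after the first pass and to a single value after the second.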

Comparison with Alternatives

| Feature | WebGPU Compute | WebGL (fragment hacks) | WebAssembly (CPU) | CUDA/OpenCL |
| --- | --- | --- | --- | --- |
| Parallelism | Thousands of GPU threads | Limited to fragment shader | Multi-core CPU | Full GPU |
| Ease of Use | Native compute API | Awkward render-to-texture | Familiar languages | Specialized APIs |
| Performance | Near-native GPU | Limited by pipeline overhead | Near-native CPU | Native GPU |
| Browser Support | Chrome 113+, Edge, Firefox (flag) | Universal | Universal | N/A (native only) |
| Data Sharing | Storage buffers, shared memory | Textures | Shared memory | Full GPU memory |
| Use Cases | General compute, ML, physics | Basic GPGPU via rendering | General computation | HPC, ML training |

Advanced Patterns

Indirect Dispatch with GPU-Generated Workgroup Counts

// Dispatch a variable number of workgroups determined by GPU output
function indirectDispatch(
  device: GPUDevice,
  pipeline: GPUComputePipeline,
  indirectBuffer: GPUBuffer, // Contains [x, y, z] workgroup counts
  bindGroup: GPUBindGroup
) {
  const encoder = device.createCommandEncoder();
  const pass = encoder.beginComputePass();
  pass.setPipeline(pipeline);
  pass.setBindGroup(0, bindGroup);
  pass.dispatchWorkgroupsIndirect(indirectBuffer, 0);
  pass.end();
  device.queue.submit([encoder.finish()]);
}

Double-Buffering for Read-Modify-Write Patterns

// Swap input/output buffers each frame to avoid hazards
class DoubleBuffer {
  private buffers: [GPUBuffer, GPUBuffer];
  private current = 0;
 
  constructor(device: GPUDevice, size: number) {
    this.buffers = [
      device.createBuffer({ size, usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST }),
      device.createBuffer({ size, usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST }),
    ];
  }
 
  get input() { return this.buffers[this.current]; }
  get output() { return this.buffers[1 - this.current]; }
  swap() { this.current = 1 - this.current; }
}

Testing Strategies

import { describe, it, expect, beforeAll } from 'vitest';
 
describe('GPU Matrix Multiply', () => {
  let device: GPUDevice;
 
  beforeAll(async () => {
    const adapter = await navigator.gpu.requestAdapter();
    device = await adapter!.requestDevice();
  });
 
  it('produces correct result for 2x2 matrices', async () => {
    const a = new Float32Array([1, 2, 3, 4]);
    const b = new Float32Array([5, 6, 7, 8]);
    const result = await gpuMatrixMultiply(device, a, b, 2, 2, 2);
 
    // Expected: [19, 22, 43, 50]
    expect(result[0]).toBeCloseTo(19);
    expect(result[1]).toBeCloseTo(22);
    expect(result[2]).toBeCloseTo(43);
    expect(result[3]).toBeCloseTo(50);
  });
 
  it('handles large matrices efficiently', async () => {
    const N = 512;
    const a = new Float32Array(N * N).map(() => Math.random());
    const b = new Float32Array(N * N).map(() => Math.random());
 
    const start = performance.now();
    const result = await gpuMatrixMultiply(device, a, b, N, N, N);
    const elapsed = performance.now() - start;
 
    expect(result.length).toBe(N * N);
    // Loose sanity bound: the timing includes shader fetch and pipeline creation, so tune it per machine
    expect(elapsed).toBeLessThan(100);
  });
});
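GPU and CPU floating-point results rarely match bit for bit, so tests on larger inputs usually compare against a reference with a tolerance. A minimal sketch of such a comparison follows; `maxRelativeError` is a hypothetical helper and the 1e-6 floor is an arbitrary choice to avoid dividing by values near zero.

```typescript
// Largest relative difference between two arrays of the same length.
function maxRelativeError(actual: Float32Array, expected: Float32Array): number {
  if (actual.length !== expected.length) {
    throw new Error('length mismatch: ' + actual.length + ' vs ' + expected.length);
  }
  let worst = 0;
  for (let i = 0; i < actual.length; i++) {
    const denom = Math.max(Math.abs(expected[i]), 1e-6);
    worst = Math.max(worst, Math.abs(actual[i] - expected[i]) / denom);
  }
  return worst;
}

const errorValue = maxRelativeError(
  new Float32Array([1, 2.0001]),
  new Float32Array([1, 2])
);
console.log(errorValue < 1e-4); // true
```

In a test this could be used as `expect(maxRelativeError(gpuResult, reference)).toBeLessThan(1e-4)`, where the threshold is an assumption that depends on matrix size and value range.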

Future Outlook

WebGPU compute shaders are evolving rapidly. Subgroup operations enable efficient cross-thread communication within a subgroup (warp/wavefront) without workgroup barriers, which can significantly improve reduction and scan performance where they are available. Proposed atomic float operations would unlock more algorithms that require floating-point synchronization. More flexible buffer addressing, such as storage buffer arrays and buffer device addresses, could enable data structures like linked lists and trees on the GPU if it is adopted.

Conclusion

WebGPU compute shaders bring genuine GPGPU capabilities to the browser, enabling parallel workloads that were previously impractical without native code or WebGL workarounds. The combination of WGSL's safety guarantees, the explicit memory model, and the massive parallelism of modern GPUs makes this a powerful tool for web developers.

Key takeaways:

  1. WebGPU compute shaders provide first-class GPGPU support, unlike WebGL's rendering-pipeline workarounds
  2. WGSL is the shader language—statically typed, safe, and designed for portability across GPU vendors
  3. Workgroup shared memory is critical for performance—use it for data reuse and inter-thread communication
  4. The buffer lifecycle (create → upload → compute → readback) requires careful synchronization
  5. Choose workgroup sizes aligned with warp/wavefront size for full hardware utilization
  6. Minimize CPU-GPU readback by chaining compute outputs to other passes
  7. Profile individual passes to tell memory bandwidth bottlenecks from compute bottlenecks

Start by implementing a simple parallel operation like vector addition, then progress to tiled matrix multiplication and particle simulation. The tooling is maturing rapidly, and the performance gains over CPU-only JavaScript are dramatic for parallel workloads.