


Edge AI: Running Machine Learning in the Browser

Run ML models in the browser with ONNX Runtime, TensorFlow.js, and WebGPU.

Tags: AI, Edge AI, Machine Learning, Frontend

By MinhVo

Introduction

Machine learning has traditionally been the domain of powerful cloud servers equipped with GPUs and vast memory pools. Sending data to a remote server, waiting for inference, and receiving results back has been the standard pattern for ML-powered applications. But a paradigm shift is underway: running machine learning models directly in the browser, on the user's own device, with zero server round-trips. This is Edge AI, and it is transforming how we build intelligent web applications.

The browser has evolved from a simple document renderer into a sophisticated computing platform. With APIs like WebGPU providing direct GPU access, WebAssembly enabling near-native execution speeds, and JavaScript engines optimized for numerical computation, modern browsers can now run inference on models that previously required dedicated servers. This means lower latency, better privacy, offline capability, and reduced infrastructure costs.

In this comprehensive guide, we'll explore the three major frameworks for browser-based ML — TensorFlow.js, ONNX Runtime Web, and the emerging WebGPU-powered solutions. You'll learn how to load pre-trained models, optimize them for browser execution, handle real-time inference on video and audio streams, and build production-ready applications that leverage the full power of Edge AI.

[Figure: Edge AI architecture showing neural network processing on browser devices]

Understanding Edge AI: Core Concepts

Edge AI refers to running artificial intelligence algorithms locally on a device rather than in the cloud. When we specifically talk about browser-based Edge AI, we mean executing ML inference within the browser's JavaScript runtime, using the user's CPU, GPU, and memory resources. This stands in stark contrast to the traditional cloud AI pattern where data leaves the device, travels to a server, gets processed, and returns.

The fundamental advantage is proximity to data. When a user's webcam feed is processed in the browser, frames never leave the device. When a text model runs locally, the user's private documents stay private. This architectural decision has profound implications for privacy, latency, and cost structure.

Several technological breakthroughs have made browser-based ML practical. WebAssembly provides a compilation target that runs at near-native speed, allowing frameworks like ONNX Runtime to execute optimized inference pipelines. WebGPU, the successor to WebGL, gives developers direct access to GPU compute shaders, enabling massively parallel tensor operations. And JavaScript engine optimizations like V8's TurboFan compiler have made numerical computation in JavaScript surprisingly efficient.

The model formats that work in browsers include TensorFlow models (SavedModel or Keras, converted to TensorFlow.js format), ONNX models (run via ONNX Runtime Web), and increasingly, custom WebGPU shaders for specialized operations. Each format has trade-offs in model size, inference speed, and operator support that we will explore in depth. The ecosystem has matured significantly: TensorFlow.js has been around since 2018 and offers the broadest model zoo, ONNX Runtime Web, backed by Microsoft, provides excellent cross-platform compatibility, and newer projects like Transformers.js from Hugging Face are making state-of-the-art NLP models accessible directly in the browser with just a few lines of code.

Browser ML Fundamentals

At the core of browser-based ML lies the concept of tensors — multi-dimensional arrays that represent data flowing through neural networks. A color image of 224x224 pixels is represented as a 3D tensor of shape [224, 224, 3] where the last dimension holds RGB values. Text is tokenized and encoded as 1D tensors of integer indices into a vocabulary. Audio waveforms become 1D tensors of amplitude samples.
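To make the shape arithmetic concrete, here is a minimal sketch in plain TypeScript (no ML library involved). The helper names `elementCount` and `flatIndex` are made up for this illustration; the sketch assumes the usual row-major, height-width-channel layout described above.

```typescript
// Shape of a 224x224 RGB image in height-width-channel (HWC) order.
const imageShape: readonly number[] = [224, 224, 3];

// Number of scalar values in a tensor with the given shape.
function elementCount(shape: readonly number[]): number {
  return shape.reduce((product, dim) => product * dim, 1);
}

// Position of pixel (row, col), channel c, in the flat row-major buffer.
function flatIndex(shape: readonly number[], row: number, col: number, c: number): number {
  const width = shape[1];
  const channels = shape[2];
  return (row * width + col) * channels + c;
}

const values = elementCount(imageShape); // 224 * 224 * 3 = 150528 values
const float32Bytes = values * 4; // 602112 bytes when stored as float32
const greenOfSecondPixel = flatIndex(imageShape, 0, 1, 1); // index 4
```

A single float32 input frame is already about 0.6MB, which is why per-frame allocations add up quickly in real-time applications.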

The inference process follows a universal pattern regardless of framework: load model weights into memory, convert input data into tensor format, execute the forward pass through the model's layers (convolution, attention, pooling, etc.), and interpret the output tensors. The challenge in the browser is doing this efficiently within the constraints of the JavaScript runtime — limited memory, single-threaded execution by default, and no direct access to system-level GPU drivers.

[Figure: WebGPU architecture diagram showing browser GPU pipeline]

Architecture and Design Patterns

Browser-based ML inference follows a distinct architecture pattern that differs from server-side inference. Understanding this architecture is critical for building performant applications.

The Inference Pipeline

Every browser ML application follows the same basic pipeline: model loading, preprocessing, inference, and postprocessing. Model loading involves downloading the model weights (which can range from 1MB for a MobileNet to 2GB for a large language model) and initializing the runtime. Preprocessing converts raw input — pixels, audio samples, text tokens — into the tensor format the model expects. Inference runs the forward pass through the model's layers. Postprocessing interprets the output tensors into human-readable results.

Each stage has different optimization opportunities. Model loading can be optimized through caching, streaming, and progressive loading. Preprocessing can leverage Web Workers or OffscreenCanvas to avoid blocking the main thread. Inference benefits from WebGL or WebGPU acceleration. Postprocessing is typically lightweight but can become a bottleneck with large output tensors.
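The postprocessing stage is the easiest one to show without a model. The sketch below assumes a classifier that outputs raw logits, which is common but not universal, and turns them into a ranked list of labels. The names `softmax`, `topK`, and `Prediction` are illustrative and not part of any framework API.

```typescript
interface Prediction {
  label: string;
  probability: number;
}

// Numerically stable softmax: subtract the max logit before exponentiating.
function softmax(logits: ArrayLike<number>): number[] {
  let max = -Infinity;
  for (let i = 0; i < logits.length; i++) max = Math.max(max, logits[i]);
  const exps = Array.from(logits, (v) => Math.exp(v - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((v) => v / sum);
}

// Pairs each probability with its label and keeps the k most likely classes.
function topK(logits: ArrayLike<number>, labels: string[], k: number): Prediction[] {
  return softmax(logits)
    .map((probability, i) => ({ label: labels[i], probability }))
    .sort((a, b) => b.probability - a.probability)
    .slice(0, k);
}

// Example: three classes, keep the best two ('dog' first, then 'bird').
const ranked = topK(new Float32Array([1, 3, 2]), ['cat', 'dog', 'bird'], 2);
```

For a 1000-class model this is cheap, but the same loop over a segmentation mask or a large vocabulary is where postprocessing starts to show up in profiles.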

Hardware Acceleration Backends

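TensorFlow.js ships several execution backends for the browser: the CPU backend for maximum compatibility, a WASM backend for faster CPU execution, the WebGL backend for GPU acceleration in most browsers, and the WebGPU backend for cutting-edge performance where it is supported. The WASM and WebGPU backends are distributed as separate packages. The backend selection is typically automatic but can be configured manually: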

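import * as tf from '@tensorflow/tfjs';
// The WebGPU backend is a separate package; importing it registers the backend
import '@tensorflow/tfjs-backend-webgpu';
 
// Set WebGL backend for broad GPU support
await tf.setBackend('webgl');
 
// Or use WebGPU for maximum performance (Chrome 113+)
await tf.setBackend('webgpu');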
 
// Check which backend was actually selected
console.log(tf.getBackend());

ONNX Runtime Web similarly supports multiple execution providers: WASM for CPU execution (with SIMD and multi-threading where the browser allows it), WebGL and WebGPU for GPU acceleration, and WebNN for hardware-specific acceleration when available.
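Because backend availability differs from device to device, it helps to keep the selection logic separate from the detection logic. The following is a minimal sketch of that idea; `Capabilities` and `chooseBackend` are hypothetical names for this example, not part of TensorFlow.js or ONNX Runtime Web.

```typescript
type Backend = 'webgpu' | 'webgl' | 'wasm' | 'cpu';

// Which accelerated paths the current device reports as usable.
interface Capabilities {
  webgpu: boolean;
  webgl: boolean;
  wasm: boolean;
}

// Walks the preference list and returns the first supported backend.
// 'cpu' is treated as always available, so it is the final fallback.
function chooseBackend(
  caps: Capabilities,
  order: readonly Backend[] = ['webgpu', 'webgl', 'wasm', 'cpu'],
): Backend {
  for (const backend of order) {
    if (backend === 'cpu' || caps[backend]) return backend;
  }
  return 'cpu';
}

// Example: an older laptop with WebGL but no WebGPU.
const selected = chooseBackend({ webgpu: false, webgl: true, wasm: true });
```

In a browser, the capability flags could be filled in with checks such as `'gpu' in navigator` for WebGPU and `typeof WebAssembly === 'object'` for WASM. A positive check only says the API exists, so the backend's own initialization can still fail and should be handled too.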

Memory Management

Tensor allocation and disposal are critical concerns in browser ML. With the WebGL and WebGPU backends, tensor data lives in GPU textures and buffers that the JavaScript garbage collector does not reclaim, so dropping a reference does not free the memory. TensorFlow.js requires explicit tensor disposal to prevent memory leaks that can crash the browser tab:

// Memory leak: tensors accumulate
for (let i = 0; i < 1000; i++) {
  const tensor = tf.tensor([1, 2, 3]);
  // tensor is never disposed — GPU memory leaks
}
 
// Correct: use tf.tidy for automatic cleanup
const results = tf.tidy(() => {
  const tensors = [];
  for (let i = 0; i < 1000; i++) {
    tensors.push(tf.tensor([1, 2, 3]));
  }
  return tf.stack(tensors).mean();
});
// All intermediate tensors are disposed automatically; the returned
// scalar is kept alive, so call results.dispose() once you are done with it

Threading Model

The browser's main thread handles UI rendering and user interaction. Running ML inference on the main thread causes UI freezes that degrade user experience. The solution is to offload inference to Web Workers, which run JavaScript in isolated threads. Communication between the main thread and workers happens through postMessage, which can transfer ArrayBuffers without copying for zero-overhead data transfer.

// main.ts — send image data to worker
const worker = new Worker(new URL('./inference-worker.ts', import.meta.url));
 
// Transfer the buffer with zero-copy
const imageData = canvas.getContext('2d')!.getImageData(0, 0, 224, 224);
worker.postMessage(
  { type: 'inference', buffer: imageData.data.buffer },
  [imageData.data.buffer] // Transferable — no copy
);
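A transfer moves ownership of the buffer instead of copying it, which means the sender can no longer use it afterwards. The sketch below demonstrates this without a worker by using `structuredClone` with a transfer list, which follows the same transfer rules as `postMessage`. It assumes a runtime that supports that option (current browsers and recent Node.js versions).

```typescript
// 224 x 224 RGBA pixels, the same size as the ImageData buffer above.
const pixelBuffer = new ArrayBuffer(224 * 224 * 4);
new Uint8ClampedArray(pixelBuffer)[0] = 255;

const sizeBefore = pixelBuffer.byteLength; // 200704 bytes

// Transfer the buffer, as worker.postMessage(message, [buffer]) would.
const received = structuredClone(pixelBuffer, { transfer: [pixelBuffer] });

// The sender's buffer is now detached and reports a length of 0.
const sizeAfter = pixelBuffer.byteLength;

// The receiving side owns the original bytes.
const firstByte = new Uint8ClampedArray(received)[0]; // 255
```

The practical consequence is that a frame handed to a worker this way must not be read again on the main thread. If the main thread still needs the pixels, copy them first or use a SharedArrayBuffer.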

Step-by-Step Implementation

Let's build a complete real-time image classification application using TensorFlow.js. This will demonstrate all the core concepts in practice.

Setting Up TensorFlow.js

First, install the necessary packages and initialize the runtime:

npm install @tensorflow/tfjs @tensorflow-models/mobilenet
import * as tf from '@tensorflow/tfjs';
import * as mobilenet from '@tensorflow-models/mobilenet';
 
async function initializeTensorFlow() {
  await tf.ready();
  console.log(`TensorFlow.js backend: ${tf.getBackend()}`);
 
  // Log memory info in development
  if (process.env.NODE_ENV === 'development') {
    const mem = tf.memory();
    console.log(`Tensors: ${mem.numTensors}, Bytes: ${mem.numBytes}`);
  }
}

Loading and Running a Pre-trained Model

MobileNet is a lightweight convolutional neural network designed for mobile and edge devices. It's an excellent starting point for browser-based classification:

async function loadModel() {
  // Load the model — this downloads ~16MB of weights
  // Subsequent loads use the browser cache
  const model = await mobilenet.load({
    version: 2,
    alpha: 1.0, // Full model (1.0) or a smaller variant (0.5, 0.75; version 1 also offers 0.25)
  });
 
  console.log('MobileNet loaded successfully');
  return model;
}
 
async function classifyImage(
  model: mobilenet.MobileNet,
  imageElement: HTMLImageElement
) {
  // TensorFlow.js handles preprocessing (resize, normalize) automatically
  const predictions = await model.classify(imageElement, 5);
 
  predictions.forEach((pred) => {
    console.log(`${pred.className}: ${(pred.probability * 100).toFixed(2)}%`);
  });
 
  return predictions;
}

Real-time Video Processing

For webcam-based applications, we need to process video frames continuously while keeping the UI responsive:

class RealTimeClassifier {
  private model: mobilenet.MobileNet | null = null;
  private video: HTMLVideoElement | null = null;
  private isProcessing = false;
 
  async initialize() {
    this.model = await mobilenet.load({ version: 2, alpha: 0.5 });
    await this.setupCamera();
  }
 
  private async setupCamera() {
    this.video = document.createElement('video');
    this.video.setAttribute('playsinline', 'true');
 
    const stream = await navigator.mediaDevices.getUserMedia({
      video: { width: 224, height: 224, facingMode: 'environment' },
    });
 
    this.video.srcObject = stream;
    await new Promise((resolve) => {
      this.video!.onloadedmetadata = resolve;
    });
    await this.video.play();
  }
 
  async classifyFrame(): Promise<Array<{ className: string; probability: number }>> {
    if (!this.model || !this.video || this.isProcessing) return [];
    this.isProcessing = true;
 
    // tf.tidy cannot wrap async work (it rejects Promise return values),
    // so the frame tensor is disposed manually once classification finishes.
    const frame = tf.browser.fromPixels(this.video);
    try {
      return await this.model.classify(frame);
    } finally {
      frame.dispose();
      this.isProcessing = false;
    }
  }
 
  stop() {
    const stream = this.video?.srcObject as MediaStream;
    stream?.getTracks().forEach((t) => t.stop());
  }
}

Using ONNX Runtime Web

For models trained in PyTorch or other frameworks, ONNX Runtime Web provides a universal solution:

import * as ort from 'onnxruntime-web';
 
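// Create the session once and reuse it; building it on every call would
// reload and re-optimize the model each time.
let sessionPromise: Promise<ort.InferenceSession> | null = null;
 
function getSession() {
  // Multi-threaded WASM also requires a cross-origin isolated page
  ort.env.wasm.numThreads = navigator.hardwareConcurrency;
  sessionPromise ??= ort.InferenceSession.create('./model.onnx', {
    executionProviders: ['webgl', 'wasm'],
    graphOptimizationLevel: 'all',
  });
  return sessionPromise;
}
 
async function runONNXInference(imageData: Float32Array) {
  const session = await getSession();
 
  // Create input tensor in NCHW format
  const inputTensor = new ort.Tensor('float32', imageData, [1, 3, 224, 224]);
 
  // Use the names declared by the model instead of assuming 'input'
  const results = await session.run({ [session.inputNames[0]]: inputTensor });
  const output = results[session.outputNames[0]].data as Float32Array;
 
  return output;
}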
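The NCHW input above has to be built from canvas pixels, which arrive as interleaved RGBA bytes. Here is a minimal sketch of that conversion. Scaling to the 0..1 range is an assumption made for this example; many models expect mean and standard deviation normalization instead, so check the preprocessing your model was trained with. `rgbaToNchw` is a hypothetical helper name.

```typescript
// Converts interleaved RGBA bytes (as found in ImageData.data) into a
// planar NCHW Float32Array: all red values, then all green, then all blue.
function rgbaToNchw(rgba: Uint8ClampedArray, width: number, height: number): Float32Array {
  const planeSize = width * height;
  if (rgba.length !== planeSize * 4) {
    throw new Error(`expected ${planeSize * 4} bytes, got ${rgba.length}`);
  }

  const out = new Float32Array(3 * planeSize);
  for (let i = 0; i < planeSize; i++) {
    out[i] = rgba[i * 4] / 255; // red plane
    out[planeSize + i] = rgba[i * 4 + 1] / 255; // green plane
    out[2 * planeSize + i] = rgba[i * 4 + 2] / 255; // blue plane
    // The alpha byte at rgba[i * 4 + 3] is dropped.
  }
  return out;
}
```

For a 224x224 frame the result has 3 * 224 * 224 values and can be passed to `new ort.Tensor('float32', data, [1, 3, 224, 224])` as in the function above.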

[Figure: ML inference pipeline diagram showing data flow from input to output]

WebGPU Compute with Transformers.js

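Hugging Face's Transformers.js brings state-of-the-art NLP models to the browser. The @xenova/transformers package used here runs models through ONNX Runtime Web on WASM by default; WebGPU execution arrived with version 3, which is published as @huggingface/transformers: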

import { pipeline } from '@xenova/transformers';
 
// Sentiment analysis running entirely in the browser
async function analyzeSentiment(text: string) {
  const classifier = await pipeline(
    'sentiment-analysis',
    'Xenova/distilbert-base-uncased-finetuned-sst-2-english'
  );
 
  const result = await classifier(text);
  return result;
  // [{ label: 'POSITIVE', score: 0.9998 }]
}
 
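// Text generation with quantized models
async function generateText(prompt: string) {
  // LaMini-Flan-T5 is an encoder-decoder (T5) model, so it uses the
  // 'text2text-generation' task rather than 'text-generation'
  const generator = await pipeline(
    'text2text-generation',
    'Xenova/LaMini-Flan-T5-783M' // 783M parameters; quantized weights load by default
  );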
 
  const output = await generator(prompt, {
    max_new_tokens: 100,
    temperature: 0.7,
  });
 
  return output[0].generated_text;
}
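Both functions above create a new pipeline on every call, which repeats the expensive model setup each time. One way to avoid that is to cache the loading promise per model name. The helper below is a generic sketch of that pattern; `memoizeAsync` is a hypothetical name and not part of Transformers.js.

```typescript
// Wraps an async loader so the expensive work runs once per key.
// Concurrent callers share the same in-flight promise.
function memoizeAsync<T>(load: (key: string) => Promise<T>): (key: string) => Promise<T> {
  const cache = new Map<string, Promise<T>>();

  return (key) => {
    let pending = cache.get(key);
    if (!pending) {
      pending = load(key);
      // Forget failed loads so that a later call can retry.
      pending.catch(() => cache.delete(key));
      cache.set(key, pending);
    }
    return pending;
  };
}
```

With Transformers.js this could be used as `const getClassifier = memoizeAsync((model) => pipeline('sentiment-analysis', model));`, after which `analyzeSentiment` would await `getClassifier(...)` instead of calling `pipeline` directly.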

Real-World Use Cases and Case Studies

Use Case 1: Real-Time Document Scanning

Edge AI enables browser-based document scanning without uploading sensitive documents. A model runs locally to detect document edges, apply perspective correction, and enhance text readability. The user's documents never leave their device, providing complete privacy. This pattern is used by online notary services and healthcare portals where document privacy is legally mandated under regulations like HIPAA.

Use Case 2: Accessibility Features

Browser-based speech recognition and text-to-speech powered by Edge AI provide accessibility features without requiring users to install additional software. Models like Whisper-tiny (about 39M parameters) can run in the browser for real-time transcription, enabling live captions for video content without sending audio to external servers. This is especially valuable for users on shared or restricted devices.

Use Case 3: Content Moderation at Upload Time

Social media platforms use Edge AI to detect inappropriate content before it reaches the server. A lightweight NSFW detection model runs in the browser, flagging potentially problematic images in real-time. This reduces server-side processing costs by filtering at the edge and provides instant feedback to users about content policy violations before upload completes.

Use Case 4: Augmented Reality Shopping

E-commerce sites use Edge AI to power virtual try-on experiences. Face detection models run locally to identify facial landmarks, then overlay glasses, makeup, or hats in real-time. The entire pipeline runs in the browser at 30+ FPS using WebGL-accelerated inference, creating an engaging shopping experience that can convert better than static product images.

Best Practices for Production

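  1. Choose the right model size: For mobile browsers, keep models under 10MB. For desktop, 50MB is acceptable. Use model quantization to reduce size by up to 4x, usually with a small accuracy loss. Width multipliers trade accuracy for speed: MobileNet 0.25 is several times cheaper to run than MobileNet 1.0, but the original MobileNet paper reports roughly 20 points lower top-1 accuracy on ImageNet (about 50% versus about 70%), so measure accuracy on your own data before shrinking the model.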

  2. Implement progressive loading: Load model architecture first, then stream weights. Show loading progress to users with percentage indicators. Cache models in IndexedDB for instant subsequent loads — model loading drops from seconds to milliseconds on repeat visits.

  3. Use Web Workers for inference: Never run inference on the main thread. Transfer tensors to workers using SharedArrayBuffer when available, or use Transferable objects for zero-copy data transfer. This keeps the UI at 60fps even during heavy inference.

  4. Implement fallback strategies: Not all browsers support WebGPU or WebGL compute. Always have a WASM fallback. Detect capabilities at runtime and adjust model complexity accordingly — use a smaller model on low-end devices.

  5. Monitor memory usage: Use tf.memory() in TensorFlow.js to track tensor counts and byte usage. Implement explicit cleanup with tf.tidy() and tensor.dispose(). Set memory budgets per feature and reject inference if budget is exceeded.

  6. Batch inference requests: Instead of processing one input at a time, batch multiple inputs together. This dramatically improves GPU utilization. Collect requests over a 16ms frame and process them together for 3-5x throughput improvement.

  7. Leverage model caching: Store downloaded models in IndexedDB or Cache API. Check for model updates using versioned URLs. Implement differential updates to download only changed weights, reducing update sizes by 90%+.

  8. Profile on low-end devices: Test on devices like budget Android phones and older laptops, not just developer MacBooks. Use Chrome DevTools Performance panel to identify bottlenecks. A model that runs at 30fps on an M1 MacBook may run at 2fps on a budget phone.
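The batching advice in item 6 comes down to a grouping rule: requests that arrive within the same short window are run together. The sketch below shows only that rule as a pure function over timestamped requests. The 16ms window and the batch size cap are illustrative defaults, requests are assumed to be sorted by arrival time, and `InferenceRequest` and `groupIntoBatches` are hypothetical names.

```typescript
interface InferenceRequest<T> {
  arrivedAt: number; // arrival time in milliseconds
  input: T;
}

// Groups requests into batches. A batch closes when windowMs has elapsed
// since its first request, or when it reaches maxBatchSize.
function groupIntoBatches<T>(
  requests: InferenceRequest<T>[],
  windowMs = 16,
  maxBatchSize = 8,
): T[][] {
  const batches: T[][] = [];
  let current: T[] = [];
  let windowStart = 0;

  for (const request of requests) {
    const expired = current.length > 0 && request.arrivedAt - windowStart >= windowMs;
    if (expired || current.length >= maxBatchSize) {
      batches.push(current);
      current = [];
    }
    if (current.length === 0) windowStart = request.arrivedAt;
    current.push(request.input);
  }

  if (current.length > 0) batches.push(current);
  return batches;
}
```

In a live application the same rule would be driven by a timer or by requestAnimationFrame rather than by a finished list, and each batch would be stacked into one input tensor before inference.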

Common Pitfalls and Solutions

| Pitfall | Impact | Solution |
| --- | --- | --- |
| Loading large models synchronously | Blocks main thread for 5-30 seconds | Use async loading with progress indicators and cache in IndexedDB |
| Not disposing tensors | Memory leaks causing browser tab crashes | Wrap synchronous tensor code in tf.tidy() and call dispose() on tensors that outlive it |
| Processing every video frame | 100% CPU usage, device overheating | Throttle to 15-30 FPS using requestAnimationFrame with frame skipping |
| Ignoring model quantization | Models up to 4x larger than necessary | Use INT8 (about 4x smaller) or FLOAT16 (about 2x smaller) quantization, then re-check accuracy |
| Running inference on main thread | UI freezes during inference | Offload to Web Workers, passing data as Transferable objects or via SharedArrayBuffer |
| Hardcoding model input sizes | Crashes on different model versions | Read input dimensions from model metadata dynamically |
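The frame-skipping fix from the table can be sketched as a small helper that decides, per animation frame, whether enough time has passed to run inference again. `createFrameThrottle` is a hypothetical name, and the helper only makes the decision; running the model is left to the caller.

```typescript
// Returns a function that reports whether the frame at `nowMs` should be
// processed, given a target processing rate in frames per second.
function createFrameThrottle(targetFps: number): (nowMs: number) => boolean {
  const minIntervalMs = 1000 / targetFps;
  let lastProcessedMs = -Infinity;

  return (nowMs) => {
    if (nowMs - lastProcessedMs < minIntervalMs) return false; // skip this frame
    lastProcessedMs = nowMs;
    return true;
  };
}

// Example: a display ticking every 16ms with a 20 FPS (50ms) target.
const shouldProcess = createFrameThrottle(20);
const frameTimes = [0, 16, 32, 48, 64, 80, 96, 112, 128];
const processedFrames = frameTimes.filter((t) => shouldProcess(t)); // [0, 64, 128]
```

In the browser, the timestamp passed to the requestAnimationFrame callback can be fed straight into `shouldProcess`. Because decisions only happen on display frames, the effective rate ends up slightly below the target, as the example shows.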

Performance Optimization

WebGPU represents the biggest performance leap for browser ML. Early benchmarks show 3-10x speedup over WebGL for transformer-based models. Here's how to benchmark and leverage it:

async function benchmarkBackends() {
  const backends = ['cpu', 'webgl', 'webgpu'] as const;
  const results: Record<string, number> = {};
 
  for (const backend of backends) {
    try {
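      // setBackend resolves to false when the backend cannot be initialized
      const ok = await tf.setBackend(backend);
      if (!ok) throw new Error(`${backend} unavailable`);
      await tf.ready();
 
      // Runs one 3x3 convolution and waits until the result is computed
      const runOnce = async () => {
        const output = tf.tidy(() => {
          const input = tf.randomNormal([1, 224, 224, 3]) as tf.Tensor4D;
          const kernel = tf.randomNormal([3, 3, 3, 32]) as tf.Tensor4D;
          return tf.conv2d(input, kernel, 1, 'same');
        });
        await output.data(); // GPU backends run asynchronously; wait before timing stops
        output.dispose();
      };
 
      // Warm up: the first run is slow due to shader compilation
      await runOnce();
 
      // Benchmark 50 iterations
      const iterations = 50;
      const start = performance.now();
 
      for (let i = 0; i < iterations; i++) {
        await runOnce();
      }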
 
      const elapsed = performance.now() - start;
      results[backend] = elapsed / iterations;
      console.log(`${backend}: ${(elapsed / iterations).toFixed(2)}ms/inference`);
    } catch {
      console.log(`${backend}: not available`);
    }
  }
 
  return results;
}
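The benchmark above reports a mean, which hides the occasional slow frame that users actually notice. If you record each iteration's time separately, a median and a high percentile give a fuller picture. The sketch below uses the nearest-rank percentile definition; `percentile` and `summarizeLatency` are hypothetical helper names.

```typescript
// Nearest-rank percentile of a sample set, with p in the range 0..100.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Summarizes per-inference timings in milliseconds.
function summarizeLatency(samplesMs: number[]) {
  const mean = samplesMs.reduce((a, b) => a + b, 0) / samplesMs.length;
  return {
    mean,
    median: percentile(samplesMs, 50),
    p95: percentile(samplesMs, 95),
  };
}

// Ten timings with one slow outlier, e.g. a garbage collection pause.
const latency = summarizeLatency([12, 11, 13, 12, 90, 11, 12, 13, 12, 11]);
```

Here the median stays at 12ms while the 95th percentile is 90ms, and the mean of 19.7ms describes neither the typical nor the worst case well.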

Comparison with Alternatives

| Feature | TensorFlow.js | ONNX Runtime Web | Transformers.js | Cloud API |
| --- | --- | --- | --- | --- |
| Model Ecosystem | Large (TF Hub) | Universal (any framework) | Hugging Face Hub | Any |
| Browser Support | All (WebGL) | All (WASM) | All (WASM) | All |
| GPU Acceleration | WebGL, WebGPU | WebGL, WebGPU | WebGPU (emerging) | Server GPUs |
| Privacy | Complete (local) | Complete (local) | Complete (local) | Data leaves device |
| Model Size Limit | ~200MB practical | ~200MB practical | ~500MB practical | Unlimited |
| Latency | 10-100ms | 10-100ms | 50-500ms | 100-2000ms |
| Offline Support | Yes | Yes | Yes | No |
| Setup Complexity | Low | Medium | Low | Lowest |

Advanced Patterns

Model Pipelining with Web Workers

For complex applications requiring multiple models, pipeline inference across workers:

class MLPipeline {
  private workers: Worker[] = [];
 
  async initialize(models: string[]) {
    for (const modelPath of models) {
      const worker = new Worker(new URL('./ml-worker.ts', import.meta.url));
      worker.postMessage({ type: 'load', modelPath });
      this.workers.push(worker);
    }
  }
 
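  async processFrame(imageData: ImageData) {
    return new Promise((resolve) => {
      // Chain the stages: each worker's result is forwarded to the next worker,
      // and the last worker's result resolves the promise.
      // The { type, input } message shape is an assumption about ml-worker.ts.
      this.workers.forEach((worker, i) => {
        worker.onmessage = (e) => {
          const next = this.workers[i + 1];
          if (next) next.postMessage({ type: 'process', input: e.data.result });
          else resolve(e.data.result);
        };
      });
 
      const buffer = imageData.data.buffer;
      this.workers[0].postMessage({ type: 'detect', buffer }, [buffer]);
    });
  }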
}

Quantized Model Conversion

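import * as tf from '@tensorflow/tfjs';
 
async function quantizeModel(modelPath: string) {
  const model = await tf.loadLayersModel(modelPath);
 
  // Note: model.save() stores weights as they are; it does not quantize.
  // Real size reduction happens at conversion time with tensorflowjs_converter
  // (for example its --quantize_uint8 or --quantize_float16 flags).
  // The loop below only simulates 8-bit quantization (quantize, then
  // dequantize) so you can measure the accuracy impact in the browser.
  model.layers.forEach((layer) => {
    const simulated = layer.getWeights().map((w) =>
      tf.tidy(() => {
        const min = w.min();
        const range = w.max().sub(min);
        // Guard against constant tensors, where the range is 0
        const scale = tf.maximum(range.div(255), 1e-8);
        const levels = w.sub(min).div(scale).round(); // integer levels 0..255
        return levels.mul(scale).add(min); // back to float32
      })
    );
    layer.setWeights(simulated);
    simulated.forEach((t) => t.dispose());
  });
 
  await model.save('indexeddb://quantized-model');
  return model;
}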
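To see where the 4x figure comes from, here is the same affine 8-bit scheme on plain typed arrays, without TensorFlow.js. It is a minimal sketch of uniform min/max quantization, one of several schemes in use; `quantizeUint8` and `dequantize` are hypothetical helper names.

```typescript
interface QuantizedWeights {
  values: Uint8Array; // one byte per weight instead of four
  scale: number;
  min: number;
}

// Maps each float weight to one of 256 evenly spaced levels between min and max.
function quantizeUint8(weights: Float32Array): QuantizedWeights {
  let min = Infinity;
  let max = -Infinity;
  for (let i = 0; i < weights.length; i++) {
    min = Math.min(min, weights[i]);
    max = Math.max(max, weights[i]);
  }
  // Fall back to a scale of 1 for constant or empty tensors.
  const scale = (max - min) / 255 || 1;
  const values = Uint8Array.from(weights, (w) => Math.round((w - min) / scale));
  return { values, scale, min };
}

// Reconstructs approximate float weights from the stored levels.
function dequantize(q: QuantizedWeights): Float32Array {
  return Float32Array.from(q.values, (v) => v * q.scale + q.min);
}

const original = new Float32Array([-1, -0.2, 0.3, 1]);
const packed = quantizeUint8(original);
const restored = dequantize(packed);
```

The stored array takes `packed.values.byteLength` = 4 bytes versus `original.byteLength` = 16 bytes, and each restored weight differs from the original by at most about half of `packed.scale`. Whether that error is acceptable depends on the model, which is why accuracy should be re-measured after quantizing.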

Testing Strategies

Browser ML applications require specialized testing approaches. Model accuracy should be validated against known datasets before deployment. Inference timing should be benchmarked across target devices:

import * as tf from '@tensorflow/tfjs';
import * as mobilenet from '@tensorflow-models/mobilenet';
 
describe('MobileNet Classification', () => {
  let model: mobilenet.MobileNet;
 
  beforeAll(async () => {
    model = await mobilenet.load({ version: 2, alpha: 0.5 });
  });
 
  test('classifies known image correctly', async () => {
    const img = document.createElement('img');
    img.src = '/test-images/cat.jpg';
    await new Promise((resolve) => { img.onload = resolve; });
 
    const predictions = await model.classify(img, 3);
    expect(predictions[0].className).toContain('cat');
    expect(predictions[0].probability).toBeGreaterThan(0.5);
  });
 
  test('inference completes within time budget', async () => {
    const img = document.createElement('img');
    img.src = '/test-images/dog.jpg';
    await new Promise((resolve) => { img.onload = resolve; });
 
    const start = performance.now();
    await model.classify(img);
    const elapsed = performance.now() - start;
    expect(elapsed).toBeLessThan(200);
  });
 
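  test('no memory leaks after repeated inference', async () => {
    // Load an image explicitly; the test document contains no <img> element
    const img = document.createElement('img');
    img.src = '/test-images/cat.jpg';
    await new Promise((resolve) => { img.onload = resolve; });
 
    const memBefore = tf.memory();
    for (let i = 0; i < 100; i++) {
      await model.classify(img);
    }
    const memAfter = tf.memory();
    // Allow some variance but catch major leaks
    expect(memAfter.numTensors - memBefore.numTensors).toBeLessThan(10);
  });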
});

Future Outlook

The browser ML ecosystem is advancing rapidly. WebGPU adoption will unlock performance levels previously impossible in browsers. WebNN, a new W3C specification, will provide a dedicated neural network API that can leverage NPUs (Neural Processing Units) found in modern chips like Apple's Neural Engine and Intel's Meteor Lake.

The rise of quantized and distilled models means increasingly powerful models will fit within browser memory constraints. Browser demos built on WebGPU and on WASM builds of llama.cpp already run quantized models with billions of parameters on capable hardware. If that trend continues, running GPT-3.5 equivalent models locally in the browser could become practical within the next two years.

Edge AI will become the default for privacy-sensitive applications. Regulatory frameworks like GDPR and CCPA already incentivize local processing, and browser vendors are investing heavily in the primitives needed for production-grade ML inference.

Conclusion

Edge AI in the browser represents a fundamental shift in how we build intelligent applications. By running ML inference locally, we gain privacy, reduce latency, eliminate server costs, and enable offline functionality. The ecosystem — TensorFlow.js, ONNX Runtime Web, and Transformers.js — has matured to the point where production deployment is practical.

Key takeaways:

  1. Choose TensorFlow.js for broadest ecosystem support, ONNX Runtime for framework-agnostic models, and Transformers.js for cutting-edge NLP
  2. Always implement proper memory management with tf.tidy() and explicit disposal
  3. Offload inference to Web Workers to keep the UI responsive
  4. Implement fallback strategies for browsers without GPU support
  5. Use model quantization to reduce download sizes by 4x with minimal accuracy loss
  6. Cache models in IndexedDB for instant subsequent loads

Start small with a simple classification model, measure performance on real devices, and progressively add complexity. The browser is now a legitimate ML runtime — treat it as such, and your users will experience intelligent applications that are faster, more private, and more reliable than their cloud-dependent alternatives.