Building AI Chatbots with Streaming Responses

Introduction

The era of waiting for a complete AI response before displaying anything to users is over. Modern chatbot interfaces deliver a fluid, token-by-token experience that mirrors natural conversation pacing, keeps users engaged, and dramatically reduces perceived latency. Streaming responses transform the user experience from a frustrating "please wait" spinner into a dynamic, real-time dialogue where each word materializes as the language model generates it.

Building a streaming chatbot involves orchestrating several moving parts: a backend that connects to an LLM provider and forwards tokens as they arrive, a transport layer that maintains an open connection between server and client, and a frontend that incrementally renders the incoming text. The most common transport mechanisms are Server-Sent Events (SSE) for unidirectional server-to-client streaming and WebSockets for bidirectional communication. For chat applications, SSE is often sufficient and simpler to implement.

In this comprehensive guide, we will build a production-ready streaming chatbot from scratch. We will cover the architecture decisions, implement the backend with Node.js and the OpenAI API, wire up the frontend with React, handle edge cases like connection drops and error recovery, and discuss advanced patterns such as function calling and tool use within a streaming context.

AI Chatbot streaming interface showing real-time token generation

Understanding Streaming Responses: Core Concepts

Why Streaming Matters

When a user sends a message to a chatbot powered by a large language model, the model generates its response token by token. Without streaming, the server must wait for the entire response to be generated before sending it to the client. For a 500-token response, this can mean 5-15 seconds of staring at a blank screen, which feels interminable in a conversational context.

With streaming, the server begins sending tokens to the client as soon as the first one is generated. The user typically sees the first words within a few hundred milliseconds (the exact figure depends on the model and provider), and subsequent words arrive at roughly reading pace. This transforms the perceived performance from "slow" to "responsive," even though the total time to generate the full response is identical.

Industry UX research has long treated perceived latency as being as important as actual latency. Users who see progressive output tend to report higher satisfaction than users who wait for a complete response, even when the total wait time is the same. Streaming taps into the same psychological principle that makes skeleton screens feel faster than spinners.

The Token Stream Architecture

Language models like GPT-4, Claude, and Llama produce text one token at a time. Each token is a small piece of text, typically a word, a word fragment, or a punctuation mark. The model's API exposes this generation process as a stream of events, where each event contains one or more tokens along with metadata such as the model name. The finish reason arrives with the last content chunk, and usage statistics are usually reported only at the end of the stream (with some providers, only on request).

The streaming pipeline flows from the LLM API through your backend to the frontend. First, the client sends a message via HTTP POST or WebSocket. The backend forwards the request to the LLM API with the stream: true parameter. The LLM API returns a stream of Server-Sent Events. The backend parses each event and forwards tokens to the client. Finally, the client appends each token to the displayed message in real time.

This architecture introduces unique challenges around error handling (what happens if the connection drops mid-stream?), state management (how do you track partial responses?), and user experience (should the user be able to send another message while a response is streaming?).

Server-Sent Events vs WebSockets

Server-Sent Events (SSE) provide a simple, HTTP-based mechanism for unidirectional streaming from server to client. The client opens a long-lived HTTP connection, and the server writes events to it as plain text: each event consists of one or more data: lines and is terminated by a blank line. Browsers expose SSE through the EventSource API, which reconnects automatically when the connection drops, and SSE works well with HTTP/2 multiplexing. EventSource only issues GET requests, however, so chat endpoints that accept a POST body (like the one built in this guide) read the same event format with fetch and a stream reader, and must handle reconnection themselves.
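
To make the wire format concrete, here is a minimal sketch of serializing and incrementally parsing events. The names formatSseEvent and parseSseChunk are illustrative and not part of any library. The parser assumes \n line endings and ignores the event:, id:, and retry: fields that the full specification also defines.

```typescript
// Serialize one event in SSE wire format: each line of the payload becomes a
// "data:" field, and a blank line terminates the event.
function formatSseEvent(payload: string): string {
  return payload.split('\n').map(line => `data: ${line}`).join('\n') + '\n\n';
}

// Incrementally parse SSE text. Returns the complete events found so far plus
// the unconsumed remainder, which the caller prepends to the next chunk.
function parseSseChunk(buffer: string): { events: string[]; rest: string } {
  const blocks = buffer.split('\n\n');
  const rest = blocks.pop() ?? '';
  const events: string[] = [];

  for (const block of blocks) {
    const dataLines: string[] = [];
    for (const line of block.split('\n')) {
      if (line.startsWith(':')) continue; // comment line, e.g. a keepalive
      if (line.startsWith('data:')) {
        // A single leading space after the colon is not part of the value
        const value = line.slice(5);
        dataLines.push(value.startsWith(' ') ? value.slice(1) : value);
      }
    }
    // Multiple data: lines in one event are joined with newlines
    if (dataLines.length > 0) events.push(dataLines.join('\n'));
  }
  return { events, rest };
}
```

Because network chunks can split an event anywhere, the caller keeps rest and concatenates it with the next decoded chunk before parsing again.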

WebSockets provide full-duplex bidirectional communication. They are more complex to set up but allow the client and server to send messages in both directions simultaneously. For chat applications, SSE is often the better choice because the primary data flow is unidirectional (server to client) and it runs over ordinary HTTP, with no protocol upgrade to negotiate. WebSockets shine when you need real-time bidirectional communication, such as collaborative editing or live cursor tracking.

Architecture diagram showing streaming data flow

Architecture and Design Patterns

The Proxy Pattern

A common architecture for streaming chatbots is the proxy pattern. The frontend never communicates directly with the LLM API. Instead, it sends requests to your backend server, which proxies the request to the LLM and forwards the streaming response. This pattern provides several critical benefits: API key security (your LLM API keys never leave the server), rate limiting (you can enforce per-user rate limits on your backend), logging and analytics (every request passes through your server for monitoring), prompt injection protection (you can sanitize inputs before they reach the LLM), and conversation management (the server maintains conversation history and context).
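
As one illustration of what the proxy makes possible, the per-user rate limiting mentioned above could be sketched as a fixed-window counter. FixedWindowRateLimiter is a hypothetical class written for this example, not a library API. It keeps state in memory and never evicts old entries, so treat it as a starting point rather than something to deploy behind multiple server instances.

```typescript
// Hypothetical fixed-window limiter: at most `limit` requests per user per window.
class FixedWindowRateLimiter {
  private readonly limit: number;
  private readonly windowMs: number;
  private readonly now: () => number;
  private readonly windows = new Map<string, { start: number; count: number }>();

  // `now` is injectable so the limiter can be tested without real time passing
  constructor(limit: number, windowMs: number, now: () => number = () => Date.now()) {
    this.limit = limit;
    this.windowMs = windowMs;
    this.now = now;
  }

  // Returns true if the request may proceed, false if the user is over the limit
  allow(userId: string): boolean {
    const t = this.now();
    const current = this.windows.get(userId);

    // Start a fresh window on first use or once the previous window has expired
    if (current === undefined || t - current.start >= this.windowMs) {
      this.windows.set(userId, { start: t, count: 1 });
      return true;
    }
    if (current.count >= this.limit) return false;
    current.count += 1;
    return true;
  }
}
```

In the proxy, the check would run before any SSE headers are written, so that a rejected request can still receive an ordinary 429 response.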

Message Protocol Design

Define a clear protocol for streaming messages between your backend and frontend. A well-designed protocol uses typed events that the client can switch on, which keeps the client code simple and lets you add new event types later. A token event carries the incremental text content. A done event signals completion and may include usage statistics. An error event communicates failures with structured error codes. A function_call event indicates that the model wants to invoke a tool, and a tool_result event can carry that tool's output back to the client.

interface StreamEvent {
  type: 'token' | 'done' | 'error' | 'function_call' | 'tool_result';
  content?: string;
  functionCall?: {
    name: string;
    arguments: string;
  };
  usage?: {
    promptTokens: number;
    completionTokens: number;
    totalTokens: number;
  };
  error?: {
    code: string;
    message: string;
  };
}
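
Since these events cross a network boundary, the client may want to validate them before switching on type. The following is a rough sketch under that assumption. parseStreamEvent and ParsedEvent are hypothetical names, and ParsedEvent mirrors only the fields this example checks.

```typescript
type StreamEventType = 'token' | 'done' | 'error' | 'function_call' | 'tool_result';

// Minimal local mirror of the event shape, limited to the fields checked here
interface ParsedEvent {
  type: StreamEventType;
  content?: string;
}

const KNOWN_TYPES: readonly string[] = [
  'token', 'done', 'error', 'function_call', 'tool_result',
];

// Returns a typed event, or null when the payload is not a recognizable event
function parseStreamEvent(raw: string): ParsedEvent | null {
  let value: unknown;
  try {
    value = JSON.parse(raw);
  } catch {
    return null; // not valid JSON
  }
  if (typeof value !== 'object' || value === null) return null;

  const candidate = value as Record<string, unknown>;
  const type = candidate['type'];
  if (typeof type !== 'string' || KNOWN_TYPES.indexOf(type) === -1) return null;

  // A token event without string content carries nothing to render
  if (type === 'token' && typeof candidate['content'] !== 'string') return null;

  return candidate as unknown as ParsedEvent;
}
```

Returning null instead of throwing lets the read loop skip a malformed event without abandoning the rest of the stream.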

State Management for Chat

Streaming introduces unique state management challenges. While a response is being streamed, the UI must be in a "receiving" state that prevents the user from sending another message (or allows queuing). The conversation history must be updated incrementally as tokens arrive, and the final complete message must be stored once streaming finishes.

A robust state machine for a single message transitions through: idle (no message being generated), sending (user message sent, waiting for first token), streaming (tokens arriving and being rendered), complete (full response received and stored), and error (an error occurred during generation). Each transition should be explicit and logged for debugging purposes.
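
The state machine described above can be sketched as a transition table. The action names (submit, first_token, done, fail, reset) are assumptions chosen for this example, and the optional log callback stands in for whatever logging the application uses.

```typescript
type ChatState = 'idle' | 'sending' | 'streaming' | 'complete' | 'error';
type ChatAction = 'submit' | 'first_token' | 'done' | 'fail' | 'reset';

// Allowed transitions; any (state, action) pair not listed is ignored
const TRANSITIONS: Record<ChatState, Partial<Record<ChatAction, ChatState>>> = {
  idle: { submit: 'sending' },
  sending: { first_token: 'streaming', done: 'complete', fail: 'error' },
  streaming: { done: 'complete', fail: 'error' },
  complete: { submit: 'sending', reset: 'idle' },
  error: { submit: 'sending', reset: 'idle' },
};

// Returns the next state and reports every attempted transition through `log`
function transition(
  state: ChatState,
  action: ChatAction,
  log: (message: string) => void = () => {},
): ChatState {
  const next = TRANSITIONS[state][action];
  if (next === undefined) {
    log(`ignored "${action}" while ${state}`);
    return state;
  }
  log(`${state} -> ${next} (${action})`);
  return next;
}
```

Keeping the table as data means an unexpected event, such as a done arriving while idle, is logged and ignored rather than silently corrupting the UI state.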

Step-by-Step Implementation

Backend: Express Server with SSE Streaming

Set up the backend server that proxies requests to the OpenAI API and streams responses to the client. Install the required dependencies and create the main server file:

npm init -y
npm install express openai cors
npm install -D typescript @types/express @types/node @types/cors

Create the streaming server with proper SSE headers, error handling, and cleanup:

import express from 'express';
import cors from 'cors';
import OpenAI from 'openai';
 
const app = express();
app.use(cors());
app.use(express.json());
 
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
 
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}
 
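app.post('/api/chat/stream', async (req, res) => {
  const { messages, model = 'gpt-4', temperature = 0.7 } = req.body;

  // Reject malformed requests before switching the response to SSE
  if (!Array.isArray(messages) || messages.length === 0) {
    res.status(400).json({ error: 'messages must be a non-empty array' });
    return;
  }

  // Set SSE headers - critical for streaming to work through proxies
  res.setHeader('Content-Type', 'text/event-stream');
  res.setHeader('Cache-Control', 'no-cache');
  res.setHeader('Connection', 'keep-alive');
  res.setHeader('X-Accel-Buffering', 'no'); // Disable nginx buffering
  res.flushHeaders();

  // Handle client disconnect: cancel the upstream request so generation stops.
  // Listening on res rather than req avoids firing once the request body has been read.
  const upstream = new AbortController();
  res.on('close', () => upstream.abort());

  const send = (event: object) => {
    res.write(`data: ${JSON.stringify(event)}\n\n`);
  };

  try {
    const stream = await openai.chat.completions.create(
      {
        model,
        messages: messages as ChatMessage[],
        temperature,
        stream: true,
        // Request token usage; it arrives in a final chunk that has no choices
        stream_options: { include_usage: true },
      },
      { signal: upstream.signal },
    );

    let usage:
      | { promptTokens: number; completionTokens: number; totalTokens: number }
      | undefined;

    for await (const chunk of stream) {
      const delta = chunk.choices[0]?.delta;
      if (delta?.content) {
        send({ type: 'token', content: delta.content });
      }
      if (chunk.usage) {
        usage = {
          promptTokens: chunk.usage.prompt_tokens,
          completionTokens: chunk.usage.completion_tokens,
          totalTokens: chunk.usage.total_tokens,
        };
      }
    }

    // Always signal completion, whatever the finish reason (stop, length, ...)
    send({ type: 'done', usage });
  } catch (error: any) {
    // If the client disconnected, there is nobody left to notify
    if (!upstream.signal.aborted) {
      send({
        type: 'error',
        error: { code: 'GENERATION_ERROR', message: error.message },
      });
    }
  } finally {
    res.end();
  }
});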
 
app.listen(3001, () => console.log('Chat server running on port 3001'));

Frontend: React Streaming Hook

Create a custom hook that manages the streaming connection and message state. The hook handles the entire lifecycle: sending the initial request, reading the SSE stream, accumulating tokens, and managing error states:

import { useState, useCallback, useRef } from 'react';
 
interface Message {
  id: string;
  role: 'user' | 'assistant';
  content: string;
  status: 'sending' | 'streaming' | 'complete' | 'error';
  timestamp: Date;
}
 
export function useChat() {
  const [messages, setMessages] = useState<Message[]>([]);
  const [isStreaming, setIsStreaming] = useState(false);
  const abortControllerRef = useRef<AbortController | null>(null);
 
  const sendMessage = useCallback(async (content: string) => {
    const userMessage: Message = {
      id: crypto.randomUUID(),
      role: 'user',
      content,
      status: 'complete',
      timestamp: new Date(),
    };
 
    const assistantMessage: Message = {
      id: crypto.randomUUID(),
      role: 'assistant',
      content: '',
      status: 'sending',
      timestamp: new Date(),
    };
 
    setMessages(prev => [...prev, userMessage, assistantMessage]);
    setIsStreaming(true);
 
    const controller = new AbortController();
    abortControllerRef.current = controller;
 
    try {
      const apiMessages = [...messages, userMessage].map(m => ({
        role: m.role,
        content: m.content,
      }));
 
      const response = await fetch('/api/chat/stream', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({ messages: apiMessages }),
        signal: controller.signal,
      });
 
      if (!response.ok) throw new Error(`HTTP ${response.status}`);
 
      const reader = response.body!.getReader();
      const decoder = new TextDecoder();
      let buffer = '';
 
      setMessages(prev => prev.map(m =>
        m.id === assistantMessage.id
          ? { ...m, status: 'streaming' }
          : m
      ));
 
      while (true) {
        const { done, value } = await reader.read();
        if (done) break;
 
        buffer += decoder.decode(value, { stream: true });
        const lines = buffer.split('\n');
        buffer = lines.pop() || '';
 
        for (const line of lines) {
          if (!line.startsWith('data: ')) continue;
          const data = JSON.parse(line.slice(6));
 
          if (data.type === 'token') {
            setMessages(prev => prev.map(m =>
              m.id === assistantMessage.id
                ? { ...m, content: m.content + data.content }
                : m
            ));
          } else if (data.type === 'done') {
            setMessages(prev => prev.map(m =>
              m.id === assistantMessage.id
                ? { ...m, status: 'complete' }
                : m
            ));
          } else if (data.type === 'error') {
            setMessages(prev => prev.map(m =>
              m.id === assistantMessage.id
                ? { ...m, status: 'error', content: data.error.message }
                : m
            ));
          }
        }
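      }

      // The server may end the response without a 'done' event, so settle
      // any message that is still marked as streaming
      setMessages(prev => prev.map(m =>
        m.id === assistantMessage.id && m.status === 'streaming'
          ? { ...m, status: 'complete' }
          : m
      ));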
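    } catch (error: any) {
      // A user-initiated stop keeps the partial text; anything else is shown as an error
      const stopped = error.name === 'AbortError';
      setMessages(prev => prev.map(m =>
        m.id === assistantMessage.id
          ? stopped
            ? { ...m, status: 'complete' }
            : { ...m, status: 'error', content: error.message }
          : m
      ));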
    } finally {
      setIsStreaming(false);
      abortControllerRef.current = null;
    }
  }, [messages]);
 
  const stopGeneration = useCallback(() => {
    abortControllerRef.current?.abort();
    setIsStreaming(false);
  }, []);
 
  return { messages, sendMessage, stopGeneration, isStreaming };
}

The Chat Component

Build the React component that renders the streaming conversation with auto-scrolling, a blinking cursor during streaming, and a stop button:

import { useState, useRef, useEffect } from 'react';
import { useChat } from './useChat';
 
export function ChatBot() {
  const [input, setInput] = useState('');
  const { messages, sendMessage, stopGeneration, isStreaming } = useChat();
  const messagesEndRef = useRef<HTMLDivElement>(null);
 
  useEffect(() => {
    messagesEndRef.current?.scrollIntoView({ behavior: 'smooth' });
  }, [messages]);
 
  const handleSubmit = (e: React.FormEvent) => {
    e.preventDefault();
    if (!input.trim() || isStreaming) return;
    sendMessage(input.trim());
    setInput('');
  };
 
  return (
    <div className="chat-container">
      <div className="messages">
        {messages.map(msg => (
          <div key={msg.id} className={`message ${msg.role}`}>
            <div className="avatar">
              {msg.role === 'user' ? '👤' : '🤖'}
            </div>
            <div className="content">
              {msg.content}
              {msg.status === 'streaming' && (
                <span className="cursor-blink">▊</span>
              )}
              {msg.status === 'error' && (
                <span className="error-badge">⚠ Error</span>
              )}
            </div>
          </div>
        ))}
        <div ref={messagesEndRef} />
      </div>
      <form onSubmit={handleSubmit} className="input-form">
        <input
          value={input}
          onChange={e => setInput(e.target.value)}
          placeholder="Type a message..."
          disabled={isStreaming}
        />
        {isStreaming ? (
          <button type="button" onClick={stopGeneration}>
            Stop
          </button>
        ) : (
          <button type="submit" disabled={!input.trim()}>
            Send
          </button>
        )}
      </form>
    </div>
  );
}

React chat interface with streaming animation

Real-World Use Cases

Customer Support Automation

Streaming chatbots excel in customer support where response speed directly impacts satisfaction. Companies like Intercom and Zendesk integrate streaming LLM responses to provide instant answers to common questions. The streaming experience reassures the user that the system is actively working on their query, reducing the tendency to abandon the chat or submit duplicate requests. A typical implementation combines a RAG pipeline with streaming: the system retrieves relevant knowledge base articles, constructs a prompt with the retrieved context, and streams the generated answer to the user. Because the first token is visible almost immediately after retrieval completes, users start reading within a second or two instead of waiting for the whole answer to be generated.

Developer Documentation Assistants

Technical documentation sites increasingly embed AI assistants that answer questions about their APIs and frameworks. Streaming is essential here because developer questions often require detailed, multi-paragraph responses with code examples. Seeing the code blocks materialize token by token gives developers time to start reading from the top while the rest loads, improving comprehension and reducing wait time. Vercel's AI chatbot for Next.js documentation and Stripe's developer assistant are prominent examples that demonstrate this pattern effectively.

Interactive Coding Tutors

Educational platforms use streaming chatbots to create interactive tutoring experiences. When a student asks how to implement a binary search tree, the AI tutor streams its explanation step by step, interspersing conceptual explanations with code examples. The streaming format allows the student to follow along naturally, and the server can pace delivery by briefly delaying writes between sections, creating a teaching rhythm that mimics a human instructor.

Real-Time Content Generation

Content creation tools integrate streaming chatbots to provide a collaborative writing experience. As the user describes what they want — say, a blog post about cloud architecture — the AI generates the content in real time. The user can observe the direction the content is taking and interrupt at any point to redirect, ask for changes, or expand on a section. This tight feedback loop is only possible with streaming; with batch responses, the user would have to wait for the entire output before providing feedback.

Best Practices for Production

Implement connection keep-alive: Send periodic heartbeat events every 15-30 seconds to prevent proxies and load balancers from closing idle connections. A simple comment line (: keepalive\n\n) suffices for SSE connections and is ignored by compliant parsers.
Set appropriate timeouts: Configure both server-side and client-side timeouts. The server should set a maximum generation time (e.g., 120 seconds) to prevent runaway generations. The client should implement a read timeout that triggers reconnection if no data arrives within a threshold.
Handle backpressure gracefully: If the client reads tokens more slowly than the server sends them, res.write() returns false and Node.js buffers the unsent data in memory. Wait for the drain event before writing more, and consider batching tokens into fewer writes so that slow clients cannot grow server memory without bound.
Sanitize streaming content: LLMs can generate harmful content mid-stream. Implement content filtering that checks accumulated chunks using a sliding window approach, because harmful patterns may span multiple tokens. Check the last N characters for prohibited content after each token arrives.
Persist conversation state server-side: Never rely solely on client-side state for conversation history. The server should maintain a conversation store (Redis, database) that associates conversation IDs with message arrays. This enables session resumption after page refreshes and provides a foundation for analytics.
Use exponential backoff for retries: When the LLM API returns a transient error (429, 503), implement exponential backoff with jitter. Stream an error message to the user explaining the retry, then attempt the request again. Show the user that recovery is in progress rather than failing silently.
Implement token counting: Track token usage for both the conversation history and the generated response. This allows you to implement per-user quotas, prevent context window overflow, and optimize costs. Libraries like tiktoken provide fast, accurate token counting for OpenAI models.
Compress conversation history: As conversations grow, the context window fills up. Implement a summarization strategy where older messages are compressed into a summary, preserving the most recent N messages in full. This keeps the prompt within the model's context window while retaining the gist of earlier exchanges.
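
The retry practice above can be sketched as follows. This is one possible shape, using the "full jitter" variant of exponential backoff. backoffDelayMs and withRetry are hypothetical helpers, the default base and cap values are arbitrary, and deciding which errors count as transient is left to the caller.

```typescript
// Full-jitter exponential backoff: a random delay in [0, min(cap, base * 2^attempt)).
// `random` is injectable so the calculation can be checked deterministically.
function backoffDelayMs(
  attempt: number,
  baseMs = 500,
  capMs = 8000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

// Runs `operation`, retrying transient failures up to `maxAttempts` times in total
async function withRetry<T>(
  operation: () => Promise<T>,
  isTransient: (error: unknown) => boolean,
  maxAttempts = 4,
  sleep: (ms: number) => Promise<void> = ms =>
    new Promise<void>(resolve => setTimeout(resolve, ms)),
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await operation();
    } catch (error) {
      // Give up on non-transient errors or once the attempt budget is spent
      if (attempt + 1 >= maxAttempts || !isTransient(error)) throw error;
      await sleep(backoffDelayMs(attempt));
    }
  }
}
```

In the streaming endpoint, a wrapper like this would surround only the call that opens the upstream stream. Once tokens have been forwarded to the client, retrying from scratch would duplicate output.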

Common Pitfalls and Solutions

Pitfall	Impact	Solution
Buffering by reverse proxy	Client receives no data until stream ends, defeating the entire purpose of streaming	Set `X-Accel-Buffering: no` header for nginx; disable proxy buffering in your load balancer configuration
Not handling connection drops	User sees partial response with no way to recover or resume the conversation	Implement resumable streams with `Last-Event-ID` or persist conversation state server-side
Memory leaks from unclosed streams	Server memory grows unbounded under sustained load, eventually causing OOM	Always call `res.end()` in finally blocks; use AbortController for cleanup on client disconnect
Race conditions in state updates	Messages appear out of order, duplicate tokens appear, or UI flickers	Use functional state updates (`prev =>`) and ensure single-flight request guards prevent concurrent requests
Token encoding mismatches	Token counts are inaccurate, leading to unexpected context overflow errors	Use the correct tokenizer for your model; different models use different tokenization schemes
Missing error events in SSE	Client never learns about generation failures, leaving it stuck in "streaming" state indefinitely	Always send an error event before ending the stream; the client must handle both success and error paths

Performance Optimization

Batching Tokens for Rendering

Rendering every individual token as a separate React state update can cause performance issues, especially with fast models that generate hundreds of tokens per second. Use requestAnimationFrame to collect all tokens that arrive within one frame (about 16ms on a 60Hz display) and apply them in a single state update:

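import { useCallback, useEffect, useRef, useState } from 'react';

// Collects tokens in a ref and flushes them to state at most once per animation frame
const useTokenBatcher = () => {
  const bufferRef = useRef('');
  const frameRef = useRef<number>(0);
  const [content, setContent] = useState('');

  const appendToken = useCallback((token: string) => {
    bufferRef.current += token;
    if (!frameRef.current) {
      frameRef.current = requestAnimationFrame(() => {
        // Snapshot and clear the buffer before calling setContent: React may run
        // the updater later, by which time bufferRef.current is already empty
        const pending = bufferRef.current;
        bufferRef.current = '';
        frameRef.current = 0;
        setContent(prev => prev + pending);
      });
    }
  }, []);

  // Cancel a pending frame if the component unmounts mid-stream
  useEffect(() => () => cancelAnimationFrame(frameRef.current), []);

  return { content, appendToken };
};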

Reducing Bundle Size

The OpenAI SDK is large. For the frontend, never import the full SDK — only send requests to your backend proxy. On the backend, use tree-shaking-friendly imports and consider using openai only for type definitions while making raw HTTP calls for the streaming endpoint if bundle size is a concern.

Using the Vercel AI SDK

The Vercel AI SDK (ai package) provides built-in streaming utilities that handle SSE formatting, token buffering, and React hooks out of the box. In the 3.x releases, StreamingTextResponse handles the SSE boilerplate on the backend; later major versions replaced it with helpers on the streamText result, so check the documentation for the version you install. On the frontend, the useChat hook manages the streaming lifecycle, including message state and error handling. For a standard chat UI, this can remove most of the custom streaming code shown in this guide.

// Frontend: useChat hook from the AI SDK (3.x import path shown;
// newer releases export the hook from '@ai-sdk/react' with a changed interface)
import { useChat } from 'ai/react';
 
export function Chat() {
  const { messages, input, handleInputChange, handleSubmit, isLoading, stop } =
    useChat({ api: '/api/chat' });
  // The hook handles streaming, message state, and error state for you
}

Comparison with Alternatives

Feature	SSE (Server-Sent Events)	WebSockets	HTTP Polling	gRPC Streaming
Direction	Unidirectional (server→client)	Bidirectional	Client-initiated	Bidirectional
Protocol	HTTP/1.1 or HTTP/2	WS/WSS	HTTP	HTTP/2
Reconnection	Automatic with EventSource; manual with fetch	Manual implementation	N/A	Manual
Browser support	Excellent (EventSource API)	Excellent	Excellent	Limited (requires grpc-web proxy)
Complexity	Low	Medium	Low	High
Latency	Very low	Very low	High (poll interval)	Very low
Best for	LLM streaming, notifications	Chat, gaming, collaboration	Simple status checks	Microservice communication

For LLM chatbot streaming, SSE is the clear winner. It provides the lowest complexity with excellent performance. WebSockets add unnecessary bidirectional complexity for a use case that is fundamentally server-to-client. HTTP polling wastes bandwidth and adds latency. gRPC streaming offers excellent performance but requires additional infrastructure (protobuf, grpc-web proxy) that is rarely justified for a chatbot application.

Advanced Patterns

Function Calling with Streaming

Modern LLMs support function calling, where the model outputs a structured function call instead of natural language. Handling this in a streaming context requires accumulating the function call arguments across multiple tokens, since the JSON arguments arrive incrementally:

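// Note: function_call is the legacy request/response shape. Newer OpenAI models
// expose the same idea through tools / tool_calls, and the arguments are
// accumulated across chunks in the same way.
let functionCallBuffer = { name: '', arguments: '' };
 
for await (const chunk of stream) {
  const delta = chunk.choices[0]?.delta;
 
  // Both the name and the JSON arguments can arrive as fragments
  if (delta?.function_call?.name) {
    functionCallBuffer.name += delta.function_call.name;
  }
  if (delta?.function_call?.arguments) {
    functionCallBuffer.arguments += delta.function_call.arguments;
  }
 
  if (chunk.choices[0]?.finish_reason === 'function_call') {
    // The arguments are only valid JSON once the call is complete
    const args = JSON.parse(functionCallBuffer.arguments);
    const result = await executeFunction(functionCallBuffer.name, args);
    // The API expects the assistant's call to precede the function result
    messages.push({
      role: 'assistant',
      content: null,
      function_call: { ...functionCallBuffer },
    });
    messages.push({
      role: 'function',
      name: functionCallBuffer.name,
      content: JSON.stringify(result),
    });
    const newStream = await openai.chat.completions.create({
      model, messages, stream: true,
    });
    functionCallBuffer = { name: '', arguments: '' };
    // Continue forwarding tokens from newStream...
  }
}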

Markdown Rendering During Streaming

As tokens arrive, the accumulated content may contain partial Markdown that is syntactically incomplete (e.g., an opening code fence without its closing counterpart). A robust approach uses an incremental Markdown renderer that gracefully handles incomplete syntax by auto-closing open code blocks during rendering:

import ReactMarkdown from 'react-markdown';
import remarkGfm from 'remark-gfm';
 
function StreamingMessage({ content, isStreaming }: {
  content: string; isStreaming: boolean;
}) {
  const renderableContent = isStreaming
    ? closeFencedCodeBlocks(content)
    : content;
  return (
    <ReactMarkdown remarkPlugins={[remarkGfm]}>
      {renderableContent}
    </ReactMarkdown>
  );
}
 
function closeFencedCodeBlocks(content: string): string {
  const backtickCount = (content.match(/```/g) || []).length;
  if (backtickCount % 2 !== 0) return content + '\n```';
  return content;
}

Multi-Model Routing

Advanced chatbots route requests to different models based on query complexity. Simple questions go to a fast, cheap model (GPT-3.5-turbo), while complex queries are routed to a more capable model (GPT-4). The streaming interface remains identical regardless of which model generates the response, so the client code does not need to change.
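
A routing decision like this can start out as a simple heuristic. The sketch below is one hypothetical approach: the word-count threshold, the keyword list, and the chooseModel name are all assumptions made for illustration, and a production router would more likely use a classifier or measured quality data.

```typescript
interface RouterOptions {
  fastModel: string;
  capableModel: string;
  maxSimpleWords: number; // queries longer than this go to the capable model
}

interface RouteDecision {
  model: string;
  reason: string;
}

const DEFAULT_OPTIONS: RouterOptions = {
  fastModel: 'gpt-3.5-turbo',
  capableModel: 'gpt-4',
  maxSimpleWords: 40,
};

// Picks a model from rough signals of query complexity
function chooseModel(query: string, options: RouterOptions = DEFAULT_OPTIONS): RouteDecision {
  const wordCount = query.trim().split(/\s+/).filter(Boolean).length;
  const hasCodeBlock = query.includes('```');
  const asksForReasoning =
    /\b(step[- ]by[- ]step|compare|analy[sz]e|debug|prove|refactor)\b/i.test(query);

  if (hasCodeBlock) return { model: options.capableModel, reason: 'contains code' };
  if (asksForReasoning) return { model: options.capableModel, reason: 'reasoning keyword' };
  if (wordCount > options.maxSimpleWords) {
    return { model: options.capableModel, reason: 'long query' };
  }
  return { model: options.fastModel, reason: 'short, simple query' };
}
```

The returned model name can be passed straight to the streaming endpoint's model parameter, and logging the reason makes it easier to tune the heuristic later.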

Testing Strategies

Unit Testing the Streaming Hook

Test the streaming hook by mocking the fetch API and simulating an SSE stream:

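// renderHook ships with @testing-library/react (v13.1+); the separate
// @testing-library/react-hooks package does not support React 18
import { renderHook, act } from '@testing-library/react';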
import { useChat } from './useChat';
 
function createMockStream(chunks: string[]) {
  const encoder = new TextEncoder();
  return new ReadableStream({
    start(controller) {
      for (const chunk of chunks) {
        controller.enqueue(encoder.encode(`data: ${chunk}\n\n`));
      }
      controller.close();
    },
  });
}
 
describe('useChat', () => {
  it('should accumulate tokens from stream', async () => {
    const stream = createMockStream([
      JSON.stringify({ type: 'token', content: 'Hello' }),
      JSON.stringify({ type: 'token', content: ' world' }),
      JSON.stringify({ type: 'done' }),
    ]);
    global.fetch = jest.fn().mockResolvedValue({ ok: true, body: stream });
 
    const { result } = renderHook(() => useChat());
    await act(async () => {
      await result.current.sendMessage('Hi');
    });
 
    const assistantMsg = result.current.messages.find(m => m.role === 'assistant');
    expect(assistantMsg?.content).toBe('Hello world');
    expect(assistantMsg?.status).toBe('complete');
  });
});

Integration Testing with Playwright

Test the full streaming experience in a real browser:

import { test, expect } from '@playwright/test';
 
test('chatbot streams response token by token', async ({ page }) => {
  await page.goto('/chat');
  await page.fill('input[placeholder="Type a message..."]', 'Hello');
  await page.click('button:has-text("Send")');
  await expect(page.locator('.cursor-blink')).toBeVisible();
  await expect(page.locator('.cursor-blink')).toBeHidden({ timeout: 30000 });
  const assistantContent = await page.locator('.message.assistant .content')
    .last().textContent();
  expect(assistantContent!.length).toBeGreaterThan(0);
});

Future Outlook

The streaming chatbot landscape is evolving rapidly. Speculative decoding can cut generation latency substantially (speedups of roughly 2-3x are commonly reported), meaning streams will arrive faster and the token-by-token experience will feel even more natural. Multimodal streaming is emerging, where the model streams not just text but also image tokens, code execution results, and structured data visualizations in real time.

The rise of local and edge-deployed models (via Ollama, llama.cpp, and WebGPU) means streaming chatbots will increasingly run without any cloud dependency. This eliminates network latency entirely and enables offline-first AI experiences. The Vercel AI SDK and similar frameworks are already abstracting over local vs. remote models, making the transition seamless.

Finally, agentic streaming represents the next frontier. Instead of a simple question-answer stream, the model will stream its reasoning process, tool calls, and intermediate results as it works through complex tasks. Users will see the agent's "thought process" in real time, building trust and enabling course correction before the final output arrives.

Conclusion

Building a streaming chatbot is one of the highest-impact features you can add to a modern web application. The key takeaways from this guide are:

Use SSE over WebSockets for LLM streaming: it is simpler, runs over plain HTTP, and is well suited to server-to-client data flow (automatic reconnection applies when you use EventSource, not a fetch-based reader)
Implement the proxy pattern to keep API keys secure and enable server-side processing of the stream
Manage streaming state carefully with a clear state machine (sending → streaming → complete/error) to prevent race conditions and UI glitches
Batch token rendering using requestAnimationFrame to prevent excessive React re-renders and maintain smooth 60fps performance
Handle errors at every layer — network failures, LLM errors, and content filtering all require graceful degradation
Test both unit and integration — mock the stream for fast unit tests and use Playwright for full browser-based integration tests

Start by implementing the basic SSE proxy pattern described in this guide, then incrementally add features like function calling, markdown rendering, and conversation persistence. The streaming experience will fundamentally change how users perceive your application's responsiveness and intelligence.

Minh Vo
