Building Production LLM Applications

Production LLM patterns: guardrails, evaluation, cost optimization, and reliability.

AI, LLM, Production, Machine Learning

By MinhVo

Introduction

Getting a large language model to produce an impressive demo is easy. Getting it to work reliably, safely, and cost-effectively in production is one of the hardest engineering challenges in software today. The gap between a demo and a production LLM application spans guardrails (preventing harmful outputs), evaluation (measuring quality systematically), cost optimization (managing unpredictable API bills), reliability (handling rate limits, timeouts, and model failures), observability (understanding what the model is doing and why), and user experience (managing latency and setting expectations).

Production LLM applications fail in ways that traditional software does not. The same prompt can produce different outputs on different days. A model upgrade can silently change the quality of your application. A user can craft an input that bypasses your safety filters. Costs can spike unpredictably when a viral moment drives traffic. These failure modes require new engineering patterns and disciplines that go beyond traditional software engineering.

In this guide, we will cover the essential patterns for building production LLM applications: prompt engineering at scale, output validation and guardrails, evaluation frameworks, cost management, reliability patterns (retries, fallbacks, circuit breakers), observability and monitoring, and deployment strategies. These patterns are battle-tested across companies shipping LLM-powered features to millions of users.

Understanding Production LLM Challenges: Core Concepts

The Non-Determinism Problem

Unlike traditional APIs that return the same output for the same input, LLMs are probabilistic. The same prompt can produce different outputs across calls due to sampling randomness (temperature, top-p), model updates (providers periodically update their models), and floating-point non-determinism (parallel GPU computations can vary across runs). This means you cannot rely on snapshot testing, you must evaluate outputs semantically rather than exactly, and you need regression testing frameworks that compare output quality rather than output content.
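One way to test without snapshots is to assert properties that every acceptable output must have, whatever its exact wording. The sketch below is only an illustration of that idea; `OutputExpectation` and `checkOutput` are hypothetical names, and a real suite would likely combine checks like these with a judge model or embedding similarity.

```typescript
// Hypothetical property-based check: assert what the output must satisfy,
// not what it must literally equal.
interface OutputExpectation {
  mustInclude: string[];     // facts or keywords that must appear (case-insensitive)
  mustNotInclude: string[];  // phrases that must never appear
  maxWords: number;          // upper bound on length
}

interface CheckResult {
  passed: boolean;
  failures: string[];
}

function checkOutput(output: string, expectation: OutputExpectation): CheckResult {
  const text = output.toLowerCase();
  const failures: string[] = [];

  for (const term of expectation.mustInclude) {
    if (!text.includes(term.toLowerCase())) failures.push(`missing: ${term}`);
  }
  for (const term of expectation.mustNotInclude) {
    if (text.includes(term.toLowerCase())) failures.push(`forbidden: ${term}`);
  }

  const wordCount = output.trim().split(/\s+/).filter(Boolean).length;
  if (wordCount > expectation.maxWords) {
    failures.push(`too long: ${wordCount} words`);
  }

  return { passed: failures.length === 0, failures };
}

// Two differently worded answers satisfy the same expectation.
const expectation: OutputExpectation = {
  mustInclude: ['paris'],
  mustNotInclude: ['as an ai'],
  maxWords: 20,
};
const runA = checkOutput('The capital of France is Paris.', expectation);
const runB = checkOutput('Paris is the French capital.', expectation);
```

Both runs pass even though the strings differ, which is exactly the case where a snapshot test would fail.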

The Latency Spectrum

LLM latency varies dramatically based on model size, prompt length, output length, and provider load. A simple GPT-3.5-turbo call might return in 200ms, while a complex GPT-4 call with a long prompt might take 30 seconds. Streaming mitigates perceived latency but does not reduce actual latency. Production applications must handle this variability with timeout policies, fallback models, and user experience patterns that manage expectations.
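A timeout policy can be as small as a helper that races the model call against a timer. This is a minimal sketch under the assumption that the provider call is an ordinary promise; `withTimeout` is a hypothetical helper, and the two simulated calls stand in for a real SDK.

```typescript
// Hypothetical helper: reject if `work` does not settle within `ms` milliseconds.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;

  const deadline = new Promise<never>((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`LLM call timed out after ${ms}ms`)),
      ms
    );
  });

  // Whichever settles first wins; always clear the timer so it cannot leak.
  return Promise.race([work, deadline]).finally(() => {
    if (timer !== undefined) clearTimeout(timer);
  });
}

// Simulated model calls (stand-ins for a real provider SDK).
const fastCall = new Promise<string>(resolve => setTimeout(() => resolve('ok'), 10));
const slowCall = new Promise<string>(resolve => setTimeout(() => resolve('late'), 200));

// Resolves with 'ok' because the call beats the 100ms deadline.
const fastResult = withTimeout(fastCall, 100);

// Times out after 50ms; the catch handler is where a fallback model would go.
const slowResult = withTimeout(slowCall, 50)
  .catch((error: Error) => `fallback: ${error.message}`);
```

Note that racing a promise does not cancel the underlying request. If the provider client accepts an abort signal, wiring one in stops you from paying for a response nobody will read.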

The Cost Structure

LLM costs are proportional to token count, which varies unpredictably per request. A user asking "What is 2+2?" costs a fraction of a cent, while a user asking "Write a 2000-word essay about climate change" costs several cents. At scale, this variability makes cost forecasting difficult. Production applications need per-user cost tracking, prompt optimization to reduce token usage, and tiered model strategies (cheap models for simple tasks, expensive models for complex ones).

Architecture and Design Patterns

The LLM Gateway Pattern

Introduce a gateway layer between your application code and the LLM provider. This gateway handles provider selection, retries, fallbacks, caching, rate limiting, and observability. Your application code never calls the LLM provider directly — it always goes through the gateway:

interface LLMGateway {
  complete(request: LLMRequest): Promise<LLMResponse>;
  stream(request: LLMRequest): AsyncIterable<LLMChunk>;
}
 
class ProductionLLMGateway implements LLMGateway {
  private providers: LLMProvider[];
  private cache: ResponseCache;
  private rateLimiter: RateLimiter;
  private metrics: MetricsCollector;
 
  async complete(request: LLMRequest): Promise<LLMResponse> {
    // Check cache first
    const cached = await this.cache.get(request);
    if (cached) {
      this.metrics.increment('llm.cache_hit');
      return cached;
    }
 
    // Check rate limits
    await this.rateLimiter.acquire(request.userId);
 
    // Try providers in order with fallback
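    let lastError: unknown; // Keep the last failure so the final error can report it
    for (const provider of this.providers) {
      const start = Date.now();
      try {
        const response = await this.withTimeout(
          provider.complete(request),
          request.timeout || 30000
        );

        // Validate output
        const validated = await this.validateOutput(response, request);
        await this.cache.set(request, validated);
        // Measure this call's latency directly rather than reading it off the provider
        this.metrics.record('llm.latency', Date.now() - start);
        return validated;
      } catch (error) {
        lastError = error;
        this.metrics.increment('llm.error', { provider: provider.name });
        // Fall through to the next provider
      }
    }

    throw new Error(`All LLM providers failed: ${String(lastError)}`);
  }

  // stream(), withTimeout(), and validateOutput() are omitted here for brevity
}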

The Guardrails Pipeline

Every LLM output must pass through a guardrails pipeline before reaching the user. The pipeline consists of validators that check for harmful content, factual accuracy, format compliance, and business rules:

interface Guardrail {
  name: string;
  validate(output: string, context: RequestContext): Promise<GuardrailResult>;
}
 
class GuardrailsPipeline {
  private guards: Guardrail[];
 
  async validate(output: string, context: RequestContext): Promise<string> {
    let current = output;

    for (const guard of this.guards) {
      const result = await guard.validate(current, context);
      if (!result.passed && result.action === 'reject') {
        throw new GuardrailViolation(guard.name, result.reason);
      }
      // Apply transforms even when the guard reports passed: true,
      // as JSONFormatGuard does after extracting JSON from a code block
      if (result.action === 'transform' && result.transformed !== undefined) {
        current = result.transformed;
      }
    }

    return current;
  }
}
 
// Content safety guardrail
class ContentSafetyGuard implements Guardrail {
  name = 'content_safety';
 
  async validate(output: string): Promise<GuardrailResult> {
    const response = await openai.moderations.create({ input: output });
    const flagged = response.results[0].flagged;
    return {
      passed: !flagged,
      action: flagged ? 'reject' : 'pass',
      reason: flagged ? 'Content flagged by safety filter' : undefined,
    };
  }
}
 
// Format compliance guardrail
class JSONFormatGuard implements Guardrail {
  name = 'json_format';
 
  async validate(output: string): Promise<GuardrailResult> {
    try {
      JSON.parse(output);
      return { passed: true, action: 'pass' };
    } catch {
      // Try to extract JSON from markdown code blocks
      const match = output.match(/```(?:json)?\n([\s\S]*?)```/);
      if (match) {
        try {
          JSON.parse(match[1]);
          return { passed: true, action: 'transform', transformed: match[1] };
        } catch {}
      }
      return { passed: false, action: 'reject', reason: 'Invalid JSON output' };
    }
  }
}
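The summarization template later in this guide lists a `length_check` guardrail that is never shown. Here is one way it could look, as a sketch rather than the original implementation: `LengthCheckGuard`, its two limits, and the truncation strategy are all assumptions, and `GuardrailResult` is restated from the objects returned above so the block stands alone.

```typescript
// Shape inferred from the guardrail results used above; restated so the
// sketch is self-contained.
interface GuardrailResult {
  passed: boolean;
  action: 'pass' | 'reject' | 'transform';
  reason?: string;
  transformed?: string;
}

// Hypothetical 'length_check' guardrail: pass short outputs, truncate
// moderately long ones, reject outputs beyond a hard limit.
class LengthCheckGuard {
  name = 'length_check';

  constructor(private maxWords: number, private hardLimitWords: number) {}

  // Async wrapper so the class fits the Guardrail interface.
  async validate(output: string): Promise<GuardrailResult> {
    return this.check(output);
  }

  // Synchronous core, kept separate so it is easy to unit test.
  check(output: string): GuardrailResult {
    const words = output.trim().split(/\s+/).filter(Boolean);

    if (words.length <= this.maxWords) {
      return { passed: true, action: 'pass' };
    }
    if (words.length > this.hardLimitWords) {
      return {
        passed: false,
        action: 'reject',
        reason: `Output has ${words.length} words, hard limit is ${this.hardLimitWords}`,
      };
    }
    // Between the soft and hard limit: keep the first maxWords words.
    return {
      passed: false,
      action: 'transform',
      reason: `Output truncated to ${this.maxWords} words`,
      transformed: words.slice(0, this.maxWords).join(' '),
    };
  }
}

const guard = new LengthCheckGuard(5, 10);
const shortResult = guard.check('one two three');
const truncated = guard.check('a b c d e f g');
const rejected = guard.check('a b c d e f g h i j k');
```

Truncating on a word boundary can cut a sentence in half, so in practice you may prefer to reject and re-prompt with a tighter length instruction.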

Evaluation Framework

Build an evaluation framework that measures LLM output quality systematically:

interface TestCase {
  id: string;
  input: string;
  expectedOutput?: string;
  expectedBehavior: string;
  category: string;
}
 
interface EvalResult {
  testId: string;
  passed: boolean;
  score: number;
  metrics: Record<string, number>;
  output: string;
  latency: number;
}
 
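class EvaluationRunner {
  // The runner sends every test case through the same gateway the app uses
  constructor(private gateway: LLMGateway) {}

  async runSuite(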
    testCases: TestCase[],
    config: EvalConfig
  ): Promise<EvalReport> {
    const results: EvalResult[] = [];
 
    for (const testCase of testCases) {
      const start = Date.now();
      const output = await this.gateway.complete({
        prompt: testCase.input,
        model: config.model,
      });
      const latency = Date.now() - start;
 
      const score = await this.evaluate(output.content, testCase);
      results.push({
        testId: testCase.id,
        passed: score >= config.passThreshold,
        score,
        metrics: { latency },
        output: output.content,
        latency,
      });
    }
 
    return this.generateReport(results);
  }
 
  private async evaluate(output: string, testCase: TestCase): Promise<number> {
    // Use a judge model to evaluate quality
    const judgeResponse = await this.gateway.complete({
      prompt: `Evaluate this response on a scale of 0-10.
 
Question: ${testCase.input}
Expected behavior: ${testCase.expectedBehavior}
Actual response: ${output}
 
Score (0-10):`,
      model: 'gpt-4',
      temperature: 0,
    });
 
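    // Parse the first integer in the reply and clamp it to the 0-10 scale
    const raw = parseInt(judgeResponse.content.match(/\d+/)?.[0] ?? '0', 10);
    const score = Math.min(10, Math.max(0, raw));
    return score / 10;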
  }
}
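The runner calls `generateReport`, which is not shown. A minimal sketch of what it might compute follows; the `EvalReport` fields and the `hasRegressed` helper are assumptions chosen for illustration, and `EvalResult` is restated so the block is self-contained.

```typescript
// Restated from the listing above.
interface EvalResult {
  testId: string;
  passed: boolean;
  score: number;
  metrics: Record<string, number>;
  output: string;
  latency: number;
}

// Hypothetical report shape.
interface EvalReport {
  total: number;
  passed: number;
  passRate: number;
  meanScore: number;
  meanLatencyMs: number;
  failedTestIds: string[];
}

function generateReport(results: EvalResult[]): EvalReport {
  const total = results.length;
  const passed = results.filter(r => r.passed).length;
  const sum = (values: number[]) => values.reduce((acc, v) => acc + v, 0);

  return {
    total,
    passed,
    passRate: total === 0 ? 0 : passed / total,
    meanScore: total === 0 ? 0 : sum(results.map(r => r.score)) / total,
    meanLatencyMs: total === 0 ? 0 : sum(results.map(r => r.latency)) / total,
    failedTestIds: results.filter(r => !r.passed).map(r => r.testId),
  };
}

// Flag a regression when the pass rate drops by more than `tolerance`
// relative to a stored baseline report.
function hasRegressed(current: EvalReport, baseline: EvalReport, tolerance = 0.05): boolean {
  return current.passRate < baseline.passRate - tolerance;
}

const sampleResults: EvalResult[] = [
  { testId: 't1', passed: true, score: 0.75, metrics: {}, output: 'sample output', latency: 100 },
  { testId: 't2', passed: false, score: 0.25, metrics: {}, output: 'sample output', latency: 300 },
];
const report = generateReport(sampleResults);
```

Storing the report from the last known-good prompt version as the baseline lets a CI job call `hasRegressed` and block a prompt change that lowers the pass rate.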

Step-by-Step Implementation

Prompt Template Management

Organize prompts as versioned templates rather than inline strings:

import Handlebars from 'handlebars';
 
interface PromptTemplate {
  id: string;
  version: number;
  template: string;
  model: string;
  temperature: number;
  maxTokens: number;
  guardrails: string[];
}
 
class PromptRegistry {
  private templates = new Map<string, PromptTemplate>();
 
  register(template: PromptTemplate): void {
    this.templates.set(`${template.id}:v${template.version}`, template);
  }
 
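  render(id: string, variables: Record<string, any>, version?: number): LLMRequest {
    // Compare against undefined so that version 0 is not treated as "latest"
    const key = version !== undefined ? `${id}:v${version}` : this.getLatestVersion(id);
    const template = this.templates.get(key);
    if (!template) throw new Error(`Template ${key} not found`);

    // noEscape: prompts are plain text, so skip Handlebars' default HTML escaping
    const compiled = Handlebars.compile(template.template, { noEscape: true });
    return {
      prompt: compiled(variables),
      model: template.model,
      temperature: template.temperature,
      maxTokens: template.maxTokens,
      metadata: { templateId: id, templateVersion: template.version },
    };
  }

  // Return the registry key of the highest registered version for this id
  private getLatestVersion(id: string): string {
    const versions = [...this.templates.values()]
      .filter(t => t.id === id)
      .map(t => t.version);
    if (versions.length === 0) throw new Error(`Template ${id} not found`);
    return `${id}:v${Math.max(...versions)}`;
  }
}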
 
// Register templates
const registry = new PromptRegistry();
registry.register({
  id: 'summarize',
  version: 3,
  template: `Summarize the following text in {{style}} style.
Target length: {{length}} words.
 
Text:
{{text}}
 
Summary:`,
  model: 'gpt-4',
  temperature: 0.3,
  maxTokens: 500,
  guardrails: ['content_safety', 'length_check'],
});

Cost Tracking and Optimization

Implement per-request cost tracking and optimization strategies:

interface CostTracker {
  recordUsage(userId: string, usage: TokenUsage, model: string): Promise<void>;
  getDailyCost(userId: string): Promise<number>;
  getMonthlyCost(): Promise<CostReport>;
}
 
class TokenCostCalculator {
  // Illustrative USD rates per 1K tokens; check your provider's current price list
  private pricing: Record<string, { input: number; output: number }> = {
    'gpt-4': { input: 0.03, output: 0.06 },
    'gpt-4-turbo': { input: 0.01, output: 0.03 },
    'gpt-3.5-turbo': { input: 0.0005, output: 0.0015 },
  };

  calculate(model: string, usage: TokenUsage): number {
    const rates = this.pricing[model];
    // Fail loudly instead of reading properties off undefined
    if (!rates) throw new Error(`No pricing configured for model ${model}`);
    return (usage.promptTokens * rates.input +
            usage.completionTokens * rates.output) / 1000;
  }
}

// Cost-aware model selection
function selectModel(task: TaskDescriptor): string {
  const complexity = estimateComplexity(task);

  if (complexity < 0.3) return 'gpt-3.5-turbo';  // $0.0005 in / $0.0015 out per 1K tokens
  if (complexity < 0.7) return 'gpt-4-turbo';    // $0.01 in / $0.03 out per 1K tokens
  return 'gpt-4';                                // $0.03 in / $0.06 out per 1K tokens
}
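The `CostTracker` interface above has no implementation in this guide. As a rough sketch of the per-user part, here is an in-memory version with a daily budget check. `InMemoryCostTracker`, `recordCost`, and `isOverBudget` are hypothetical names, the methods are synchronous for brevity, and a production tracker would persist totals in a shared store.

```typescript
// Hypothetical in-memory tracker: accumulates USD cost per user per UTC day.
class InMemoryCostTracker {
  // Key: `${userId}:${YYYY-MM-DD}` -> accumulated cost in USD
  private dailyTotals = new Map<string, number>();

  constructor(private dailyBudgetUsd: number) {}

  private key(userId: string, day: Date): string {
    // toISOString is always UTC, so day boundaries are consistent across servers
    return `${userId}:${day.toISOString().slice(0, 10)}`;
  }

  recordCost(userId: string, costUsd: number, at: Date = new Date()): void {
    const k = this.key(userId, at);
    this.dailyTotals.set(k, (this.dailyTotals.get(k) ?? 0) + costUsd);
  }

  getDailyCost(userId: string, day: Date = new Date()): number {
    return this.dailyTotals.get(this.key(userId, day)) ?? 0;
  }

  // True once the user has spent their daily budget.
  isOverBudget(userId: string, day: Date = new Date()): boolean {
    return this.getDailyCost(userId, day) >= this.dailyBudgetUsd;
  }
}

const tracker = new InMemoryCostTracker(1.0);
const dayOne = new Date('2025-01-01T12:00:00Z');
const dayTwo = new Date('2025-01-02T12:00:00Z');

tracker.recordCost('user-1', 0.5, dayOne);
tracker.recordCost('user-1', 0.75, dayOne);

const spentDayOne = tracker.getDailyCost('user-1', dayOne);
const blockedDayOne = tracker.isOverBudget('user-1', dayOne);
const blockedDayTwo = tracker.isOverBudget('user-1', dayTwo);
```

Checking `isOverBudget` before the gateway call, and feeding the calculator's result into `recordCost` after it, gives you a hard per-user ceiling on spend.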

Retry and Fallback Strategy

Implement resilient retry logic with model fallbacks:

class ResilientLLMClient {
  private retryConfig = {
    maxRetries: 3,
    baseDelay: 1000,
    maxDelay: 10000,
  };
 
  async complete(request: LLMRequest): Promise<LLMResponse> {
    // Deduplicate so a fallback equal to request.model is not tried twice
    const models = [...new Set([request.model, 'gpt-4-turbo', 'gpt-3.5-turbo'])];
    let lastError: unknown = null;

    for (const model of models) {
      for (let attempt = 0; attempt <= this.retryConfig.maxRetries; attempt++) {
        try {
          return await this.callProvider({ ...request, model });
        } catch (error: any) {
          lastError = error;

          // Non-retryable error or retries exhausted: move on to the next model
          // without sleeping first
          if (!this.isRetryable(error) || attempt === this.retryConfig.maxRetries) break;

          // Exponential backoff with jitter, capped at maxDelay
          const delay = Math.min(
            this.retryConfig.baseDelay * Math.pow(2, attempt) +
            Math.random() * 1000,
            this.retryConfig.maxDelay
          );
          await new Promise(r => setTimeout(r, delay));
        }
      }
    }

    throw new Error(`All models and retries exhausted: ${String(lastError)}`);
  }
 
  private isRetryable(error: any): boolean {
    return error.status === 429 || error.status === 503 || error.status === 500;
  }
}

Real-World Use Cases

Customer Support Automation

Production LLM customer support systems handle thousands of conversations simultaneously. The key challenges are maintaining response quality across diverse topics, escalating to human agents when the LLM is uncertain, and providing consistent brand voice. Guardrails prevent the LLM from making promises the company cannot keep (refunds, policy exceptions) and ensure responses cite actual product documentation rather than hallucinating features.

Code Generation Platforms

Code generation tools like GitHub Copilot and Cursor process millions of code completions daily. Production challenges include latency (completions must appear within 200ms to be useful), accuracy (incorrect suggestions erode trust), and cost (each keystroke potentially triggers an API call). These platforms use aggressive caching (similar code contexts produce similar suggestions), tiered models (fast small models for inline suggestions, large models for complex generation), and feedback loops (accepted suggestions improve future models).

Document Processing and Extraction

LLM-powered document processing extracts structured data from unstructured documents (invoices, contracts, medical records). Production requirements include high accuracy (errors have financial or legal consequences), format consistency (outputs must be valid JSON/XML), and auditability (every extraction must be traceable to the source document and the prompt used). Guardrails validate extracted data against schemas and flag low-confidence extractions for human review.

Content Personalization

LLM-powered content personalization generates tailored marketing copy, product descriptions, and email content for each user. Production challenges include brand consistency (all generated content must match the brand voice), A/B testing (comparing LLM-generated vs human-written content), and regulatory compliance (generated content must not include prohibited claims or discriminatory language).

Best Practices for Production

  1. Version your prompts: Treat prompts as code. Version them, review changes, and maintain a history. When a prompt change causes quality regression, you need to be able to roll back quickly. Store prompts in a registry with semantic versioning.

  2. Implement output validation: Never trust LLM output directly. Validate it against expected schemas, check for prohibited content, verify factual claims against a knowledge base, and ensure it meets length and format requirements. Use Zod or Pydantic for schema validation.

  3. Set up comprehensive observability: Log every LLM request and response (redacting PII), track latency percentiles (p50, p95, p99), monitor token usage and costs per model and feature, and alert on quality degradation (increasing error rates, decreasing evaluation scores).

  4. Use structured output: When you need the LLM to produce structured data (JSON, function calls), use the provider's structured output features (JSON mode, function calling) rather than parsing free-form text. This reduces parsing failures and makes validation straightforward.

  5. Implement circuit breakers: If an LLM provider starts returning errors at a rate above a threshold, stop sending requests temporarily and fall back to a cached response, a simpler model, or a graceful degradation. This prevents cascading failures and protects costs.

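  6. Cache aggressively with semantic similarity: Traditional caching (exact prompt match) has low hit rates because prompts vary slightly. Use embedding-based semantic caching: if a new prompt is semantically similar to a cached prompt (cosine similarity > 0.95), return the cached response. For workloads with many near-duplicate prompts, this can reduce costs by 30-50%. Tune the threshold against your evaluation suite, because a threshold that is too loose returns answers to a different question.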

  7. Implement per-user rate limiting: Prevent abuse and cost spikes by limiting the number of LLM requests per user per time period. Differentiate between interactive requests (lower limit, lower latency) and batch requests (higher limit, flexible latency).

  8. Test with adversarial inputs: Regularly test your guardrails with prompt injection attempts, jailbreaks, and edge cases. Red team exercises where security engineers try to make the LLM produce harmful outputs are essential for production readiness.
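The circuit breaker from practice 5 can be sketched as a small state machine. This version is a simplification: it trips on consecutive failures rather than on an error rate, and the `CircuitBreaker` class, its parameters, and the injectable clock are assumptions made so the example can be tested without waiting.

```typescript
type BreakerState = 'closed' | 'open' | 'half_open';

// Hypothetical circuit breaker: opens after N consecutive failures and
// lets traffic through again as a trial once the cooldown has elapsed.
class CircuitBreaker {
  private state: BreakerState = 'closed';
  private consecutiveFailures = 0;
  private openedAt = 0;

  constructor(
    private failureThreshold: number,
    private cooldownMs: number,
    private now: () => number = Date.now  // injectable clock for tests
  ) {}

  // Returns true if a request may be sent to the provider right now.
  canRequest(): boolean {
    if (this.state === 'open' && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = 'half_open'; // cooldown over: let requests through as a trial
    }
    return this.state !== 'open';
  }

  recordSuccess(): void {
    this.consecutiveFailures = 0;
    this.state = 'closed';
  }

  recordFailure(): void {
    this.consecutiveFailures++;
    // A failed trial request, or too many failures in a row, opens the circuit.
    if (this.state === 'half_open' || this.consecutiveFailures >= this.failureThreshold) {
      this.state = 'open';
      this.openedAt = this.now();
    }
  }

  getState(): BreakerState {
    return this.state;
  }
}

// Simulate three failures, a cooldown, and a successful trial request.
let fakeTime = 0;
const breaker = new CircuitBreaker(3, 1000, () => fakeTime);

breaker.recordFailure();
breaker.recordFailure();
breaker.recordFailure();
const blocked = !breaker.canRequest();

fakeTime = 1000;
const trialAllowed = breaker.canRequest();
breaker.recordSuccess();
```

While `canRequest()` returns false, the caller skips the provider entirely and serves a cached response or a simpler model, which is what keeps a provider outage from turning into a pile of slow, billable retries.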

Common Pitfalls and Solutions

| Pitfall | Impact | Solution |
| --- | --- | --- |
| No output validation | Hallucinated data, broken JSON, harmful content reaches users | Always validate output against schemas and guardrails before returning to users |
| Inline prompt strings | Cannot version, review, or A/B test prompts; changes require code deployment | Use a prompt registry with versioning and hot-reload capability |
| No cost tracking | Unexpected $10K+ API bills at end of month | Track costs per request, per user, per feature; set budget alerts |
| Single model dependency | Provider outage takes down your entire application | Implement multi-provider fallback with at least two providers |
| No evaluation framework | Cannot detect quality regressions after model updates | Build automated eval suites that run on every prompt change and model update |
| Ignoring latency p99 | 99% of users have a great experience, 1% wait 60+ seconds | Set aggressive timeouts, implement fallbacks for slow requests |
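To see why the last row matters, compare the median with the tail on the same data. The sketch below uses the nearest-rank method; `percentile` is a hypothetical helper and the latency samples are made up for illustration.

```typescript
// Nearest-rank percentile: the smallest sample such that at least p% of
// samples are less than or equal to it.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error('no samples');
  if (p <= 0 || p > 100) throw new Error('p must be in (0, 100]');

  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[rank - 1];
}

// Made-up data: 197 requests at 300ms and 3 requests stuck for 60 seconds.
const latenciesMs = [
  ...Array.from({ length: 197 }, () => 300),
  ...Array.from({ length: 3 }, () => 60000),
];

const p50 = percentile(latenciesMs, 50);  // 300
const p99 = percentile(latenciesMs, 99);  // 60000
const meanMs = latenciesMs.reduce((acc, v) => acc + v, 0) / latenciesMs.length;  // 1195.5
```

The median says everything is fine and the mean only hints at trouble, while p99 shows that some users are waiting a full minute.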

Performance Optimization

Prompt Optimization

Reduce token usage without sacrificing quality. Compress system prompts by removing redundant instructions. Use few-shot examples strategically (2-3 examples are often sufficient). Take advantage of provider-side prompt caching, which bills repeated prompt prefixes at a discount (OpenAI launched the feature with 50% off cached input tokens). Summarize long conversation histories instead of sending the full transcript.
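History summarization can be sketched as: keep the most recent turns verbatim within a token budget and collapse everything older into one summary message. The names below (`compactHistory`, `estimateTokens`) are hypothetical, the four-characters-per-token estimate is a rough heuristic for English text, and the `summarize` callback stands in for what would normally be a call to a cheap model.

```typescript
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

// Rough heuristic (an assumption): about 4 characters per token for English.
const estimateTokens = (text: string): number => Math.ceil(text.length / 4);

function compactHistory(
  messages: ChatMessage[],
  maxTokens: number,
  summarize: (older: ChatMessage[]) => string
): ChatMessage[] {
  const kept: ChatMessage[] = [];
  let used = 0;

  // Walk backwards so the most recent turns are kept verbatim.
  let i = messages.length - 1;
  for (; i >= 0; i--) {
    const cost = estimateTokens(messages[i].content);
    if (used + cost > maxTokens) break;
    used += cost;
    kept.unshift(messages[i]);
  }

  const older = messages.slice(0, i + 1);
  if (older.length === 0) return kept;

  // Replace everything older with a single summary message.
  // The summary's own tokens are not counted against maxTokens here.
  return [{ role: 'system', content: summarize(older) }, ...kept];
}

// Four messages of about 10 tokens each, with a budget of 20 tokens.
const transcript: ChatMessage[] = [
  { role: 'user', content: 'a'.repeat(40) },
  { role: 'assistant', content: 'b'.repeat(40) },
  { role: 'user', content: 'c'.repeat(40) },
  { role: 'assistant', content: 'd'.repeat(40) },
];
const compacted = compactHistory(
  transcript,
  20,
  older => `Summary of ${older.length} earlier messages`
);
```

Here the last two messages survive unchanged and the first two are replaced by one short summary, so the request shrinks from four messages to three while keeping the recent context intact.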

Batching and Caching

For non-interactive workloads (document processing, content generation), batch requests to take advantage of provider batch APIs (which are typically 50% cheaper). Implement semantic caching for interactive workloads where similar prompts produce similar results.
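A semantic cache boils down to a nearest-neighbour lookup over prompt embeddings. The sketch below assumes the embeddings are computed elsewhere (by whatever embedding model you use) and uses toy three-dimensional vectors. `SemanticCache` is a hypothetical class, and the linear scan would be replaced by a vector index at any real scale.

```typescript
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('dimension mismatch');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Hypothetical semantic cache keyed by prompt embedding.
class SemanticCache {
  private entries: { embedding: number[]; response: string }[] = [];

  constructor(private threshold = 0.95) {}

  // Linear scan: return the most similar cached response above the threshold.
  get(embedding: number[]): string | undefined {
    let best: { score: number; response: string } | undefined;
    for (const entry of this.entries) {
      const score = cosineSimilarity(embedding, entry.embedding);
      if (score > this.threshold && (!best || score > best.score)) {
        best = { score, response: entry.response };
      }
    }
    return best?.response;
  }

  set(embedding: number[], response: string): void {
    this.entries.push({ embedding, response });
  }
}

// Toy vectors standing in for real prompt embeddings.
const semanticCache = new SemanticCache(0.95);
semanticCache.set([1, 0, 0], 'Paris');

const nearHit = semanticCache.get([0.99, 0.1, 0]);  // similarity is about 0.995
const miss = semanticCache.get([0, 1, 0]);          // similarity is 0
```

A near-duplicate prompt returns the stored answer without a model call, while an unrelated prompt falls through to the provider.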

Comparison with Alternatives

| Approach | Cost | Quality | Latency | Control |
| --- | --- | --- | --- | --- |
| GPT-4 API | High | Excellent | High | Low |
| GPT-3.5 API | Low | Good | Low | Low |
| Self-hosted (Llama, Mistral) | Moderate (GPU) | Good-Excellent | Moderate | High |
| Fine-tuned model | Low per request | Excellent (domain) | Low | High |
| Hybrid (routing) | Optimized | Best per task | Optimized | Moderate |

Future Outlook

The LLM production landscape is evolving toward smaller, faster, more specialized models. Instead of routing all requests to GPT-4, production systems will maintain a portfolio of models — small fast models for simple tasks, large models for complex reasoning, and specialized fine-tuned models for domain-specific work. The routing layer will use classifiers to automatically select the right model for each request.
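To make the routing idea concrete, here is a deliberately simple stand-in for a trained classifier. Everything in it is an assumption for illustration: the `ModelTier` names, the keyword list, the weights, and the 0.3 threshold borrowed from the model selection code earlier. A production router would more likely use a small trained model than keyword matching.

```typescript
type ModelTier = 'small' | 'large' | 'specialized';

interface RoutableRequest {
  prompt: string;
  domain?: string;  // set when the caller knows the request's domain
}

// Hypothetical signals that a prompt needs multi-step reasoning.
const REASONING_HINTS = ['why', 'explain', 'compare', 'prove', 'step by step', 'analyze'];

// Heuristic complexity score in [0, 1]: half from length, half from hints.
function estimateComplexity(prompt: string): number {
  const text = prompt.toLowerCase();
  const lengthScore = Math.min(prompt.length / 2000, 1);
  const hintCount = REASONING_HINTS.filter(hint => text.includes(hint)).length;
  const hintScore = Math.min(hintCount / 2, 1);
  return 0.5 * lengthScore + 0.5 * hintScore;
}

function routeRequest(request: RoutableRequest, fineTunedDomains: Set<string>): ModelTier {
  // Prefer a fine-tuned model whenever one exists for the request's domain.
  if (request.domain !== undefined && fineTunedDomains.has(request.domain)) {
    return 'specialized';
  }
  return estimateComplexity(request.prompt) < 0.3 ? 'small' : 'large';
}

const fineTuned = new Set(['legal']);

const simpleTier = routeRequest({ prompt: 'What is 2+2?' }, fineTuned);
const complexTier = routeRequest(
  { prompt: 'Explain step by step why quicksort is O(n log n) on average and compare it with mergesort.' },
  fineTuned
);
const domainTier = routeRequest({ prompt: 'Review this clause.', domain: 'legal' }, fineTuned);
```

Whatever classifier you end up with, routing decisions should be logged alongside evaluation scores so you can tell when the router is sending hard requests to a model that cannot handle them.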

Conclusion

Building production LLM applications requires engineering discipline beyond what demos and prototypes demand. The key takeaways are:

  1. Use an LLM gateway to centralize retries, fallbacks, caching, and observability
  2. Implement guardrails as a pipeline that validates every output before it reaches the user
  3. Version and evaluate prompts systematically with automated evaluation suites
  4. Track costs per request and implement tiered model strategies to optimize spend
  5. Build for reliability with retries, circuit breakers, and multi-provider fallbacks
  6. Monitor everything — latency, costs, quality scores, and error rates

Start with the gateway pattern and basic guardrails, then incrementally add evaluation, cost optimization, and advanced reliability patterns as your application matures and your understanding of failure modes deepens.