Running AI Locally: Ollama, llama.cpp, and vLLM

Introduction

The era of needing a cloud API key to experiment with large language models is over. Thanks to advances in model quantization, optimized inference engines, and the open-source release of powerful models like Llama 3, Mistral, and Gemma, running capable AI models on consumer hardware is not just possible—it's practical. Whether you're a developer who needs offline AI capabilities, a privacy-conscious organization that can't send data to third-party APIs, or a researcher building custom workflows, local LLM inference puts the power back in your hands.

In this comprehensive guide, we'll explore the three dominant tools for local LLM inference: Ollama for its dead-simple setup experience, llama.cpp for its C/C++ performance and broad hardware support, and vLLM for high-throughput serving at scale. You'll learn how model quantization works under the hood, what hardware you actually need for different model sizes, and how to build production applications that seamlessly switch between local and cloud inference.

Understanding Model Quantization: Core Concepts

Quantization is the technique that makes local LLM inference practical. An unquantized 70-billion-parameter model needs approximately 140GB of GPU memory in FP16 (2 bytes per parameter), far beyond any consumer hardware. Quantization reduces the precision of model weights, dramatically cutting memory requirements while preserving most of the model's capability.

How Quantization Works

Neural network weights are typically stored as 16-bit floating-point numbers (FP16). Quantization maps these values to lower-precision representations. The most common quantization methods for LLMs are:

GPTQ: Post-training quantization that uses a calibration dataset to determine optimal weight mappings. Produces 4-bit or 3-bit models with good quality preservation. Requires GPU for both quantization and inference.
GGUF (llama.cpp format): The file format that succeeded GGML. It is a container rather than a single algorithm, and it supports multiple quantization levels from Q2_K (2-bit) to Q8_0 (8-bit). Each level trades off model size against quality. Q4_K_M is the sweet spot for most users: roughly 4.5 to 5 bits per weight with minimal quality loss.
AWQ (Activation-aware Weight Quantization): Identifies which weights are most important by analyzing activation patterns, then preserves those weights at higher precision. Produces 4-bit models that often outperform GPTQ at the same bit width.
BitsAndBytes (BnB): Runtime quantization that loads full-precision weights and quantizes on-the-fly. Supports 4-bit NormalFloat (NF4) and 8-bit quantization. Used primarily with Hugging Face Transformers.
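
The core idea shared by these methods can be sketched in a few lines. The example below is a deliberately simplified round-to-nearest scheme with one shared scale per block of weights. It is not the actual GPTQ, AWQ, or GGUF K-quant algorithm, and the function names and sample weights are made up for illustration.

```python
def quantize_block(weights, bits=4):
    """Map a block of floats to signed integers plus one shared scale."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # avoid dividing by zero
    quantized = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_block(quantized, scale):
    """Recover approximate floats from the integers and the shared scale."""
    return [q * scale for q in quantized]


# Illustrative weights, not taken from a real model
block = [0.12, -0.53, 0.91, -0.07, 0.33, -0.88, 0.02, 0.45]
q, scale = quantize_block(block)
restored = dequantize_block(q, scale)
max_error = max(abs(a - b) for a, b in zip(block, restored))
print(q, round(scale, 4), round(max_error, 4))
```

Each weight now costs 4 bits plus a share of the per-block scale, and the rounding error per weight is bounded by half the scale. Production schemes add refinements such as calibration data or mixed precision on top of this idea.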

Quality vs Size Tradeoffs

The impact of quantization varies by model and task, so treat the following figures as rough guidance rather than guarantees. For coding tasks, quantization to Q4 typically causes less than 2% accuracy loss. For mathematical reasoning, the degradation is more noticeable at 3-bit quantization (5-8% loss). Creative writing quality is remarkably resilient to quantization, and even Q3 models produce coherent, engaging text. Always test quantized models on your specific use case before committing to a quantization level.
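
A minimal way to run such a test is to score the baseline and the quantized model on the same reference answers and compare. The sketch below assumes exact-match scoring, which suits short factual or numeric answers only. The sample answers are invented for illustration and are not measurements.

```python
def accuracy(predictions, references):
    """Fraction of predictions that exactly match the reference answer."""
    correct = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return correct / len(references)


def relative_drop(baseline_acc, quantized_acc):
    """Accuracy lost by the quantized model, relative to the baseline."""
    if baseline_acc == 0:
        raise ValueError("baseline accuracy is zero; relative drop is undefined")
    return (baseline_acc - quantized_acc) / baseline_acc


# Invented answers for illustration only
references = ["4", "9", "16", "25"]
baseline_outputs = ["4", "9", "16", "25"]
quantized_outputs = ["4", "9", "16", "24"]

base = accuracy(baseline_outputs, references)
quant = accuracy(quantized_outputs, references)
print(f"baseline={base:.2f} quantized={quant:.2f} drop={relative_drop(base, quant):.0%}")
```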

Hardware Requirements and Architecture

GPU Memory Requirements

The primary bottleneck for local LLM inference is GPU VRAM. Here's what you need for popular model sizes:

Model Size	FP16	Q8 (8-bit)	Q4_K_M (4-bit)	Q2_K (2-bit)
7B params	14 GB	7 GB	4 GB	2.5 GB
13B params	26 GB	13 GB	7.5 GB	5 GB
34B params	68 GB	34 GB	20 GB	12 GB
70B params	140 GB	70 GB	40 GB	24 GB

For a comfortable experience, you need a GPU with at least the VRAM shown in the Q4_K_M column, plus 2-4GB for the KV cache and CUDA overhead. An NVIDIA RTX 4090 (24GB VRAM) can run 70B models at Q4 quantization only by offloading a large share of the layers to system RAM (roughly 40GB of weights against 24GB of VRAM), which slows generation considerably. The RTX 3060 12GB is the sweet spot for budget builds, handling 7B models at Q8 or 13B models at Q4 comfortably.
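
The figures in the table follow from simple arithmetic: parameter count times bits per weight, divided by 8 bits per byte. The sketch below reproduces that estimate. The 3GB overhead default is an assumption taken from the middle of the 2-4GB range above, and the function names are illustrative.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate size of the weights alone, in GB."""
    return params_billions * bits_per_weight / 8


def fits_in_vram(params_billions, bits_per_weight, vram_gb, overhead_gb=3.0):
    """True if weights plus an assumed KV cache/runtime overhead fit in VRAM."""
    return weight_memory_gb(params_billions, bits_per_weight) + overhead_gb <= vram_gb


print(weight_memory_gb(70, 16))    # FP16 70B
print(weight_memory_gb(7, 4.5))    # 7B at about 4.5 bits per weight
print(fits_in_vram(13, 4.5, 12))   # 13B at Q4 on a 12GB card
print(fits_in_vram(70, 4.5, 24))   # 70B at Q4 on a 24GB card
```

Real files come out slightly larger than this estimate because some tensors are kept at higher precision, so treat the result as a lower bound.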

CPU-Only Inference

Not everyone has a powerful GPU. llama.cpp supports CPU-only inference using AVX2/AVX-512 instructions and Apple's Accelerate framework. On a modern CPU (Apple M2 Pro, AMD Ryzen 9 7950X, Intel i9-13900K), expect 5-15 tokens per second for 7B Q4 models—adequate for interactive use but too slow for high-throughput applications. Apple Silicon is particularly impressive due to unified memory architecture, where the GPU and CPU share the same RAM pool.
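
Token generation is usually limited by memory bandwidth rather than raw compute, because producing each token requires reading essentially all of the weights once. That gives a rough upper bound on speed, sketched below. The bandwidth figures are illustrative assumptions, not measurements of any specific machine.

```python
def max_tokens_per_second(model_size_gb, memory_bandwidth_gb_s):
    """Upper bound on decode speed if every token reads all weights once."""
    return memory_bandwidth_gb_s / model_size_gb


# A 7B Q4 model is roughly 4 GB
print(max_tokens_per_second(4, 50))    # assumed dual-channel desktop RAM
print(max_tokens_per_second(4, 200))   # assumed unified-memory laptop chip
```

Measured speeds land below this bound, but the ratio explains why unified-memory systems with high bandwidth do well on CPU-class hardware.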

NVMe Offloading

For models that exceed GPU VRAM, llama.cpp and Ollama can keep some layers in system RAM and run them on the CPU. llama.cpp also memory-maps model files, so weights can be paged in from NVMe storage when RAM is short. This is slower than full GPU inference but allows running models that wouldn't otherwise fit. The penalty is usually steeper than the fraction of layers offloaded, because the slower CPU layers dominate the time per token: with 50% of layers on the GPU, expect well under 50% of full GPU speed.
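
A rough way to reason about partial offload is to add up the time spent in GPU layers and CPU layers for each token. The sketch below does this under the assumptions that all layers cost the same and that transfer overhead is negligible. The throughput numbers are hypothetical.

```python
def tokens_per_second(gpu_fraction, gpu_tps, cpu_tps):
    """Estimate speed when gpu_fraction of the layers run on the GPU.

    gpu_tps and cpu_tps are the speeds with all layers on that device.
    """
    time_per_token = gpu_fraction / gpu_tps + (1 - gpu_fraction) / cpu_tps
    return 1 / time_per_token


# Hypothetical hardware: 100 tok/s fully on GPU, 10 tok/s fully on CPU
for fraction in (1.0, 0.75, 0.5, 0.0):
    print(fraction, round(tokens_per_second(fraction, 100, 10), 1))
```

With these numbers, putting half the layers on the GPU gives about 18 tokens per second, far less than half of the full GPU speed, which is why fitting the whole model in VRAM matters so much.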

Step-by-Step Implementation

Setting Up Ollama

Ollama provides the simplest path to local LLM inference. Installation is a single command on each platform:

# macOS
brew install ollama
 
# Linux
curl -fsSL https://ollama.com/install.sh | sh
 
# Start the server
ollama serve &
 
# Pull and run a model
ollama pull llama3.1:8b
ollama run llama3.1:8b
 
# Pull a specific quantized variant (tag names vary; check the model's tags page)
ollama pull llama3.1:70b-instruct-q4_K_M

Ollama exposes an OpenAI-compatible API on port 11434, making it a drop-in replacement for cloud APIs:

// Using Ollama with the OpenAI SDK
import OpenAI from 'openai';
 
const ollama = new OpenAI({
  baseURL: 'http://localhost:11434/v1',
  apiKey: 'ollama', // Required by SDK but not used
});
 
async function chat(prompt: string) {
  const response = await ollama.chat.completions.create({
    model: 'llama3.1:8b',
    messages: [
      { role: 'system', content: 'You are a helpful coding assistant.' },
      { role: 'user', content: prompt },
    ],
    temperature: 0.7,
    max_tokens: 2048,
    stream: true,
  });
 
  for await (const chunk of response) {
    process.stdout.write(chunk.choices[0]?.delta?.content || '');
  }
}

Building with llama.cpp Directly

For more control over inference parameters and model loading, use llama.cpp's Python bindings:

# install: pip install llama-cpp-python
from llama_cpp import Llama
 
llm = Llama(
    model_path="./models/llama-3.1-8b-Q4_K_M.gguf",
    n_ctx=8192,         # Context window
    n_gpu_layers=35,    # Number of layers offloaded to GPU
    n_threads=8,        # CPU threads for non-GPU layers
    verbose=False,
)
 
# Generate completion
output = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a Python expert."},
        {"role": "user", "content": "Write a binary search implementation."},
    ],
    temperature=0.3,
    max_tokens=1024,
)
 
print(output["choices"][0]["message"]["content"])

High-Throughput Serving with vLLM

vLLM is designed for serving LLMs at scale with features like continuous batching, PagedAttention for memory efficiency, and tensor parallelism for multi-GPU setups:

# Install vLLM
pip install vllm
 
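# Start an OpenAI-compatible server
# --tensor-parallel-size 2 assumes two GPUs; omit it on a single-GPU machine.
# Add --quantization awq only when serving an AWQ-quantized checkpoint;
# the base meta-llama repository ships unquantized weights.
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --tensor-parallel-size 2 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.9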

// Client code using vLLM's OpenAI-compatible endpoint
import OpenAI from 'openai';
 
const vllm = new OpenAI({
  baseURL: 'http://localhost:8000/v1',
  apiKey: 'not-needed',
});
 
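// Example prompts; replace with your own workload
const prompts = ['Summarize PagedAttention.', 'Explain continuous batching.'];

// vLLM batches concurrent requests on the server, so it handles
// much higher concurrency than Ollama
const batch = await Promise.all(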
  prompts.map(prompt =>
    vllm.chat.completions.create({
      model: 'meta-llama/Llama-3.1-8B-Instruct',
      messages: [{ role: 'user', content: prompt }],
      max_tokens: 512,
    })
  )
);

Real-World Use Cases and Case Studies

Use Case 1: Healthcare Documentation Assistant

A hospital network deployed Ollama running Llama 3 70B (Q4) on an air-gapped server with dual A100 GPUs. Doctors dictate notes, and the model generates structured clinical summaries, ICD-10 codes, and referral letters—all without any patient data leaving the premises. The system processes 500 documents per day with a human-in-the-loop review step. Latency averages 3 seconds per document, which is acceptable since the previous manual process took 15-20 minutes per note.

Use Case 2: Code Review Pipeline

A software consultancy integrated llama.cpp into their CI/CD pipeline. On each pull request, a 13B code model reviews the diff, identifies potential bugs, suggests improvements, and checks for security vulnerabilities. Running locally eliminated the $2,000/month API cost they'd been spending on cloud LLMs. The model runs on a dedicated build server with an RTX 4090, processing each PR in under 30 seconds.

Use Case 3: Offline Mobile AI

A field service application for oil rig technicians needed AI assistance without reliable internet connectivity. They deployed a 7B model via Ollama on ruggedized laptops with RTX 3060 GPUs. The model answers technical questions, interprets sensor readings, and guides troubleshooting procedures. When connectivity is available, the app syncs conversation logs for quality monitoring and model feedback.

Use Case 4: Multi-Tenant AI Platform

A startup built a multi-tenant AI platform using vLLM with LoRA adapters. Each customer's fine-tuned model is loaded as a lightweight adapter on top of a shared base model. vLLM's multi-LoRA support allows serving hundreds of personalized models with little more than the GPU memory footprint of a single base model, since each adapter adds only a small fraction of the base model's weights. The system handles 1,000 requests per second across all tenants on a 4-GPU server.

Best Practices for Production

Start with Q4_K_M quantization for initial testing: This quantization level offers the best balance of quality, size, and speed for most applications. Only move to Q8 if quality testing reveals meaningful degradation, or to Q2 if you're severely memory-constrained.
Use OpenAI-compatible endpoints for flexibility: Both Ollama and vLLM expose OpenAI-compatible APIs. This means your application code can switch between local and cloud inference by changing a single baseURL parameter, enabling graceful fallback when local resources are insufficient.
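Implement request queuing for concurrent users: Ollama and llama.cpp serve only a limited number of requests in parallel per loaded model, and every extra parallel slot needs its own KV cache memory. Use a job queue (Bull, BullMQ, or a simple in-memory queue) to bound concurrency and prevent memory exhaustion from simultaneous inference attempts. vLLM batches concurrent requests itself, so it needs a cap on queue depth rather than strict serialization.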
Monitor VRAM usage actively: Set up Prometheus metrics or simple logging to track GPU memory utilization. Models can OOM during long context processing even if they load successfully. Implement a maximum context length that leaves 10-15% VRAM buffer.
Load models with the right context length: llama.cpp and vLLM size their KV cache from the maximum context length set at load time. If your application only needs 2K context, don't load the model with 32K context. It wastes VRAM and reduces the batch size vLLM can serve.
Use model-specific prompt templates: Each model family has a specific prompt format (Llama 3, ChatML, Alpaca, etc.). Using the wrong template degrades output quality significantly. Ollama handles this automatically. With llama.cpp, the chat completion endpoints apply the template stored in the GGUF metadata, but raw completion calls require you to apply the template yourself.
Implement streaming for better user experience: Local inference generates tokens at 20-80 tokens per second depending on hardware and model size. Streaming partial responses to the user creates the perception of instant response even when full generation takes several seconds.
Keep models on fast storage: Load model weights from NVMe SSDs, not HDDs or network storage. Initial model loading takes 5-30 seconds depending on model size and storage speed. Keep models loaded in memory between requests rather than reloading for each query.
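
The request queuing practice above can be sketched with a semaphore that caps how many requests reach the model at once. This is a minimal in-memory version. `run_inference` is a hypothetical stand-in for a real call to a local model server, and `InferenceQueue` is an illustrative name.

```python
import asyncio


async def run_inference(prompt):
    """Hypothetical stand-in for a real request to a local model server."""
    await asyncio.sleep(0.01)
    return f"response to: {prompt}"


class InferenceQueue:
    """Caps the number of inference calls that run at the same time."""

    def __init__(self, max_concurrent=1):
        self._slots = asyncio.Semaphore(max_concurrent)
        self._active = 0
        self.peak = 0  # highest concurrency observed, for monitoring

    async def submit(self, prompt):
        async with self._slots:  # waits here while all slots are busy
            self._active += 1
            self.peak = max(self.peak, self._active)
            try:
                return await run_inference(prompt)
            finally:
                self._active -= 1


async def main():
    queue = InferenceQueue(max_concurrent=1)
    results = await asyncio.gather(*(queue.submit(f"q{i}") for i in range(5)))
    return queue.peak, results


peak, results = asyncio.run(main())
print(peak, len(results))
```

Raising `max_concurrent` to match the number of parallel slots the server is configured for keeps the GPU busy without letting a burst of requests exhaust memory.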

Common Pitfalls and Solutions

Pitfall	Impact	Solution
Wrong prompt template	Incoherent or poor-quality outputs	Always verify the model's expected prompt format; use Ollama's auto-detection or check Hugging Face model cards
Ignoring KV cache memory	OOM errors with long contexts	Reserve 20-30% of VRAM for KV cache; reduce `max_model_len` if needed
Running quantized models without quality testing	Subtle accuracy degradation in production	Test on representative queries before deploying; compare outputs against the FP16 baseline
Over-provisioning context length	Wasted VRAM and slower inference	Set context length to your actual maximum need, not the model's theoretical limit
No fallback under overload	Application downtime when local inference is overloaded	Implement a circuit breaker pattern: queue locally, fall back to cloud API when queue depth exceeds threshold
Mixing quantization formats	Model won't load or produces garbage	Ensure the quantization format matches the inference engine: GGUF for llama.cpp/Ollama, GPTQ/AWQ for vLLM
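
The fallback row in the table can be sketched as a queue-depth check. This is a simplified stand-in for a full circuit breaker, which would also track failures and recovery timeouts. The class name, threshold, and "local"/"cloud" labels are illustrative assumptions.

```python
from collections import deque


class FallbackRouter:
    """Sends requests to the local queue until it is full, then to the cloud."""

    def __init__(self, max_queue_depth=8):
        self.max_queue_depth = max_queue_depth
        self.local_queue = deque()

    def route(self, request):
        """Return 'local' and enqueue, or 'cloud' when the local queue is full."""
        if len(self.local_queue) >= self.max_queue_depth:
            return "cloud"
        self.local_queue.append(request)
        return "local"

    def complete_one(self):
        """Call when the local model finishes a request, freeing a queue slot."""
        if self.local_queue:
            self.local_queue.popleft()


router = FallbackRouter(max_queue_depth=2)
decisions = [router.route(f"req-{i}") for i in range(3)]
router.complete_one()
decisions.append(router.route("req-3"))
print(decisions)
```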

Performance Optimization

Benchmarking Your Hardware

# llama.cpp benchmark
./llama-bench -m models/llama-3.1-8b-Q4_K_M.gguf \
  -t 8 -ngl 99 \
  -n 128 -p 512
 
# Ollama benchmark
time ollama run llama3.1:8b "Write a 500-word essay about AI" --verbose
 
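# vLLM: this only starts the server; measure throughput by sending
# concurrent requests to it from a separate load-generation client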
python -m vllm.entrypoints.openai.api_server \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --disable-log-requests

KV Cache Optimization

The key-value (KV) cache stores the attention keys and values computed for previous tokens so they are not recomputed at every step, and it is the primary memory bottleneck for long contexts. vLLM's PagedAttention allocates KV cache memory on demand in fixed-size blocks, which nearly eliminates fragmentation. The vLLM authors report up to 24x higher throughput than Hugging Face Transformers as a result.

# vLLM with optimized KV cache
from vllm import LLM, SamplingParams
 
llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",
    gpu_memory_utilization=0.92,
    max_model_len=4096,
    enable_prefix_caching=True,  # Cache system prompts across requests
    swap_space=4,  # GB of CPU swap for KV cache overflow
)
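
To see why the KV cache dominates memory at long contexts, it helps to compute its size directly. The sketch below uses the standard per-token formula for a transformer with grouped-query attention. The Llama 3.1 8B figures (32 layers, 8 KV heads, head dimension 128) reflect the published model configuration as I understand it and should be checked against the model's config file.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_tokens, bytes_per_value=2):
    """KV cache size for one sequence; bytes_per_value=2 corresponds to FP16."""
    # Factor of 2: one key tensor and one value tensor per layer
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * context_tokens


per_token = kv_cache_bytes(32, 8, 128, context_tokens=1)
per_sequence_gib = kv_cache_bytes(32, 8, 128, context_tokens=4096) / 2**30
print(per_token, per_sequence_gib)
```

Under these assumptions each token costs 128 KiB, so one 4096-token sequence needs 0.5 GiB and sixteen concurrent sequences need 8 GiB. This is the memory that `max_model_len` and `gpu_memory_utilization` are budgeting.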

Multi-GPU Scaling

For models that don't fit on a single GPU, vLLM supports tensor parallelism to split the model across multiple GPUs:

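# --quantization awq expects an AWQ-quantized checkpoint. The path below is a
# placeholder; substitute the repository ID or local path of a real AWQ build.
vllm serve path/to/Llama-3.1-70B-Instruct-AWQ \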
  --tensor-parallel-size 4 \
  --quantization awq \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.95

This splits the 70B model across 4 GPUs. With AWQ 4-bit quantization each GPU holds roughly 9-10GB of weights, leaving the rest of a 24GB consumer card such as the RTX 4090 for the KV cache and activations.

Comparison with Alternatives

Feature	Ollama	llama.cpp	vLLM	Cloud APIs
Setup complexity	One command	Moderate (build from source)	Moderate (pip install)	API key only
Performance	Good	Excellent	Excellent (highest throughput)	Variable (depends on provider)
Multi-GPU	Layer splitting (automatic)	Layer or row splitting	Full tensor parallelism	N/A
LoRA support	Basic	Yes	Advanced (multi-LoRA)	Limited
OpenAI-compatible	Yes	Via server mode	Yes	Yes
Best for	Developers, prototyping	Low-level control, embedded	Production serving, high throughput	When you need no infrastructure

Ollama is ideal for development and prototyping—its one-command setup and automatic model management mean you spend zero time on infrastructure. llama.cpp is the right choice when you need fine-grained control over inference parameters or are targeting resource-constrained environments. vLLM dominates production deployments where throughput, concurrency, and multi-tenant serving matter.

Advanced Patterns and Techniques

RAG with Local Models

Retrieval-Augmented Generation combines local LLMs with your own knowledge base:

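// Assumes `documents` (an array of LangChain Document objects), `query` (a string),
// and the `ollama` OpenAI client from the earlier example are already defined.
import { OllamaEmbeddings } from '@langchain/community/embeddings/ollama';
import { Chroma } from '@langchain/community/vectorstores/chroma';
 
const embeddings = new OllamaEmbeddings({ model: 'nomic-embed-text' });
// Chroma needs a collection name; 'local-docs' is a placeholder
const vectorStore = await Chroma.fromDocuments(documents, embeddings, {
  collectionName: 'local-docs',
});
 
const retriever = vectorStore.asRetriever({ k: 5 });
const context = await retriever.invoke(query);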
 
const response = await ollama.chat.completions.create({
  model: 'llama3.1:8b',
  messages: [
    {
      role: 'system',
      content: `Answer based on the provided context:\n${context.map(d => d.pageContent).join('\n\n')}`,
    },
    { role: 'user', content: query },
  ],
});

Function Calling with Local Models

Newer models like Llama 3.1 support function calling natively. Ollama exposes this through its API:

const response = await ollama.chat.completions.create({
  model: 'llama3.1:8b',
  messages: [{ role: 'user', content: 'What is the weather in London?' }],
  tools: [{
    type: 'function',
    function: {
      name: 'get_weather',
      description: 'Get current weather for a location',
      parameters: {
        type: 'object',
        properties: {
          location: { type: 'string', description: 'City name' },
          unit: { type: 'string', enum: ['celsius', 'fahrenheit'] },
        },
        required: ['location'],
      },
    },
  }],
});
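
The request above only declares the tool. When the model decides to use it, the response carries a `tool_calls` entry that your code must execute and send back as a `tool` message. The sketch below shows that dispatch step against a mocked response in the OpenAI chat format. `get_weather` returns canned data, and the mock message and call ID are invented for illustration.

```python
import json


def get_weather(location, unit="celsius"):
    """Hypothetical tool implementation; returns canned data."""
    return {"location": location, "unit": unit, "temperature": 18}


TOOLS = {"get_weather": get_weather}


def dispatch_tool_calls(message):
    """Run each requested tool and build the 'tool' messages to send back."""
    results = []
    for call in message.get("tool_calls") or []:
        name = call["function"]["name"]
        if name not in TOOLS:
            raise ValueError(f"model requested unknown tool: {name}")
        args = json.loads(call["function"]["arguments"])  # arguments arrive as a JSON string
        results.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(TOOLS[name](**args)),
        })
    return results


# Mocked assistant message, shaped like an OpenAI-compatible response
mock_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_0",
        "type": "function",
        "function": {"name": "get_weather", "arguments": '{"location": "London"}'},
    }],
}

tool_messages = dispatch_tool_calls(mock_message)
print(tool_messages)
```

Appending the assistant message and these tool messages to the conversation, then calling the model again, lets it produce the final natural-language answer. Smaller local models sometimes emit malformed arguments, so validating the parsed JSON before calling the tool is worthwhile.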

Future Outlook

The local LLM landscape is evolving rapidly. New quantization methods like QuIP# and AQLM push quality preservation down to around 2 bits per weight, which brings 70B models within reach of a single 24GB GPU. Apple's MLX framework is creating a native, high-performance inference stack for Apple Silicon that may eventually rival CUDA performance on M-series chips.

Specialized AI hardware (Groq's LPU, Cerebras' wafer-scale engine, and Qualcomm's AI accelerators) promises faster and more energy-efficient inference, although the first two are datacenter systems rather than hardware you run locally. Meanwhile, model architectures like Mixture of Experts (MoE) approach the quality of large models at the per-token compute cost of small ones by activating only a subset of parameters for each token. All of the experts' weights must still be held in memory, so MoE reduces compute more than it reduces VRAM requirements.

Hybrid inference patterns are emerging where simple queries are handled locally while complex reasoning tasks are routed to cloud APIs. This "local-first, cloud-fallback" approach optimizes both cost and latency while maintaining quality for challenging tasks.

Conclusion

Running AI locally has transitioned from a research curiosity to a practical reality. Ollama makes it accessible to any developer with a modern GPU, llama.cpp provides the performance and flexibility for custom deployments, and vLLM delivers production-grade serving for teams building AI-powered products at scale.

Key takeaways:

Start with Ollama for the quickest path to local LLM inference—one command to install, one command to run
Choose Q4_K_M quantization as your default; it preserves 95%+ model quality while cutting memory requirements to roughly a third of FP16
Use OpenAI-compatible endpoints so your application code works identically with local and cloud models
Plan for VRAM limitations by choosing model sizes that fit comfortably with headroom for the KV cache
Implement streaming from day one—local inference at 20-80 tokens/second feels fast with streaming but slow without it
Use vLLM for production deployments requiring high throughput, concurrent serving, or multi-GPU scaling
Build hybrid architectures that use local models for speed and privacy, with cloud fallback for complex tasks

The ability to run powerful AI models locally gives developers unprecedented control over their data, costs, and capabilities. Whether you're building a privacy-first healthcare application, a cost-optimized code review pipeline, or an offline field service tool, the tools covered in this guide put enterprise-grade AI within reach of any development team.

For further learning, explore the Ollama documentation, the llama.cpp GitHub repository, the vLLM documentation, and the Hugging Face Model Hub for discovering new models.

Minh Vo
