MinhVo

Slaying code & making it lit fr fr 🔥

Hey there 👋 I'm an AI Engineer with 7 years of experience building scalable web and mobile applications. Currently at Neurond AI (May 2025 to present), architecting an Enterprise AI Assistant Platform with multi-tenant RAG on pgvector, multi-provider LLM orchestration, and Azure-native infrastructure. Previously spent 5+ years at SNAPTEC (Sep 2019 to Apr 2025), leading SaaS themes, admin dashboards, and e-commerce platforms, and earned the Hero of the Year award in 2021. I specialize in TypeScript, React, Next.js, and AI-Native engineering with Claude Code and Cursor.

Fine-Tuning LLMs with LoRA and QLoRA: Efficient Custom AI Models

Fine-tuning LLMs with LoRA and QLoRA for custom tasks. Adapter methods, quantization, training data, evaluation, deployment.

fine-tuning, LoRA, QLoRA, LLM, machine learning, AI, transformers

By MinhVo

Introduction

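Fully fine-tuning a 70B-parameter model is out of reach for most teams: the weights alone occupy about 280GB in 32-bit precision (140GB in 16-bit), before gradients and optimizer state are counted. LoRA trains only about 0.1-1% of the parameters while reaching roughly 90-99% of full fine-tuning performance. QLoRA adds 4-bit quantization of the base model, which brings fine-tuning a 70B-class model within reach of a single 48GB GPU. Together they open LLM customization to startups, research labs, and individual developers.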
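
To make the memory numbers concrete, here is a rough back-of-the-envelope sketch. It shows that 280GB corresponds to storing 70B weights in 32-bit precision, and that a 4-bit copy of the same weights is small enough for a 48GB card. The 16 bytes per parameter used for full fine-tuning is an assumption (a common mixed-precision AdamW accounting), and activations are ignored entirely, so treat the output as an estimate rather than a measurement.

```python
GB = 1e9  # decimal gigabytes, matching the 280GB figure


def weights_gb(n_params: float, bits_per_param: float) -> float:
    """Memory needed just to store the weights, in GB."""
    return n_params * bits_per_param / 8 / GB


def full_finetune_gb(n_params: float, bytes_per_param: float = 16) -> float:
    """Rough footprint of full fine-tuning, excluding activations.

    Assumed bytes per parameter: 2 (16-bit weights) + 2 (16-bit gradients)
    + 4 (32-bit master weights) + 8 (two 32-bit Adam moments) = 16.
    """
    return n_params * bytes_per_param / GB


n_params = 70e9
for label, bits in [("32-bit", 32), ("16-bit", 16), ("4-bit", 4)]:
    print(f"{label} weights: {weights_gb(n_params, bits):.0f} GB")
print(f"full fine-tuning, weights + grads + Adam: {full_finetune_gb(n_params):.0f} GB")
```

Under these assumptions the 4-bit weights come to about 35GB, which leaves room on a 48GB GPU for adapters and activations.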

LoRA: Low-Rank Adaptation

LoRA expresses the weight update as the product of two low-rank matrices, B (d x r) and A (r x d), where the rank r is typically 8-64. The update is delta W = BA. The original weights stay frozen and only A and B are trained, so each adapted d x d matrix has 2dr trainable parameters instead of d^2, a reduction by a factor of d/(2r). For a 7B model with rank 16, only about 10-20M parameters are trainable. LoRA is usually applied to the attention layers: the query, key, value, and output projections.
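
The parameter count can be checked with a few lines of arithmetic. The sketch below assumes a Llama-style 7B layout (hidden size 4096, 32 layers, four square attention projections per layer); those dimensions are an illustrative assumption, not something stated above. Counting both matrices, each adapted d x d projection has 2dr trainable parameters, so the per-matrix reduction is d/(2r).

```python
def lora_params_per_matrix(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters for one adapted matrix: B is d_out x r, A is r x d_in."""
    return r * (d_in + d_out)


def lora_trainable_params(hidden: int, n_layers: int, r: int, n_targets: int = 4) -> int:
    """Total LoRA parameters when n_targets square projections per layer are adapted."""
    return n_layers * n_targets * lora_params_per_matrix(hidden, hidden, r)


# Assumed Llama-style 7B dimensions (illustrative only).
hidden, n_layers, rank = 4096, 32, 16

total = lora_trainable_params(hidden, n_layers, rank)
reduction = (hidden * hidden) / lora_params_per_matrix(hidden, hidden, rank)

print(f"trainable LoRA parameters: {total:,}")      # 16,777,216
print(f"per-matrix reduction factor: {reduction}")  # 128.0, i.e. d / (2r)
```

With these dimensions the total lands at about 16.8M, inside the 10-20M range quoted for a 7B model at rank 16.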

QLoRA: Quantized LoRA

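QLoRA combines 4-bit NormalFloat (NF4) quantization with LoRA. The base weights are quantized to 4 bits and frozen, while the adapters are trained in 16-bit precision. Double quantization, which quantizes the quantization constants themselves, reduces memory further, and paged optimizers absorb memory spikes during training. The QLoRA authors report that a 65B model fine-tuned this way on a single 48GB GPU matches full 16-bit fine-tuning performance. In practice, QLoRA setups use bitsandbytes for quantization and PEFT for the adapters.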
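
The effect of double quantization is easiest to see as bits per parameter. This sketch assumes the block sizes described in the QLoRA paper (one 32-bit constant per block of 64 weights, then 8-bit constants grouped in blocks of 256 with a 32-bit second-level constant); if your bitsandbytes settings differ, the numbers change accordingly.

```python
def bits_per_param(
    weight_bits: float = 4,
    block_size: int = 64,        # weights sharing one quantization constant (assumed)
    const_bits: float = 32,      # precision of a first-level constant
    double_quant: bool = False,
    dq_const_bits: float = 8,    # precision of constants after double quantization
    dq_block_size: int = 256,    # constants sharing one second-level constant (assumed)
) -> float:
    """Average storage cost per weight, including quantization constants."""
    if double_quant:
        overhead = dq_const_bits / block_size + const_bits / (block_size * dq_block_size)
    else:
        overhead = const_bits / block_size
    return weight_bits + overhead


def model_gb(n_params: float, bits: float) -> float:
    """Weight storage in decimal GB for a given bits-per-parameter cost."""
    return n_params * bits / 8 / 1e9


n_params = 65e9
plain = bits_per_param()
double = bits_per_param(double_quant=True)
print(f"without double quantization: {plain:.3f} bits/param, {model_gb(n_params, plain):.1f} GB")
print(f"with double quantization:    {double:.3f} bits/param, {model_gb(n_params, double):.1f} GB")
```

Under these assumptions the constants cost 0.5 bits per weight without double quantization and about 0.127 bits with it, which works out to roughly 3GB saved on a 65B model.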

Training Data and Configuration

Data quality is the most important factor. A common approach is to generate synthetic training data with a larger model and use it to train a smaller one. Examples must be formatted with the chat template of the specific base model. Typical hyperparameters are a learning rate of 1e-4 to 5e-4, a batch size of 4-16, 1-3 epochs, a LoRA rank of 8-64, and alpha set to 2x the rank. Gradient checkpointing and Flash Attention 2 reduce memory use.
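
As one possible starting point, the hyperparameters above map onto the Hugging Face stack roughly as follows. This is a configuration sketch, not a tuned recipe: the specific values are picked from inside the quoted ranges, the `target_modules` names assume a Llama-style model, and `lora_dropout`, `gradient_accumulation_steps`, and the `output_dir` path are assumptions added for illustration. It requires `torch`, `transformers`, `peft`, and (for the 4-bit path) `bitsandbytes`.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig, TrainingArguments

# 4-bit NF4 quantization of the frozen base model (the QLoRA part).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# LoRA adapters on the attention projections; alpha is 2x the rank.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,  # assumed value, not from the ranges above
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # Llama-style names
    bias="none",
    task_type="CAUSAL_LM",
)

# Effective batch size 4 x 4 = 16, at the upper end of the 4-16 range.
training_args = TrainingArguments(
    output_dir="lora-out",  # hypothetical path
    learning_rate=2e-4,
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    num_train_epochs=2,
    gradient_checkpointing=True,
    logging_steps=10,
)

# Flash Attention 2 is selected when loading the model, for example by passing
# attn_implementation="flash_attention_2" to AutoModelForCausalLM.from_pretrained.
```

The quantization config is passed to `from_pretrained` as `quantization_config`, and the LoRA config is applied to the loaded model with `peft.get_peft_model`.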

Evaluation and Deployment

MT-Bench and AlpacaEval measure instruction-following quality, and the lm-evaluation-harness provides standardized benchmarks. A fine-tuned model is deployed as the base model plus its adapter weights. vLLM and TGI can switch adapters dynamically at request time, so one base model can serve multiple specialized tasks by swapping adapters. GPTQ or AWQ quantization optimizes inference by shrinking the served weights.

Cost Analysis

7B model on an A100: ~4 hours, ~$20-40. 70B QLoRA on an A100: ~24 hours, ~$120-200. Compared with full fine-tuning, the cost advantage is roughly 5-10x for small models and 10-50x for large models. Fine-tuning excels at style adaptation, domain terminology, and consistent formatting. RAG is better for frequently changing knowledge, and few-shot prompting may suffice for simple tasks.
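
These estimates are just hours multiplied by an hourly GPU price, so they are easy to redo with your own provider's rates. The sketch below works backwards from the quoted figures (about 4 hours for 20-40 dollars, about 24 hours for 120-200 dollars) to the hourly rate they imply; the helper names are illustrative, and actual cloud pricing varies.

```python
def training_cost(hours: float, hourly_rate_usd: float) -> float:
    """Total cost of a run at a flat hourly GPU price."""
    return hours * hourly_rate_usd


def implied_hourly_rate(hours: float, low_usd: float, high_usd: float) -> tuple[float, float]:
    """Hourly price range implied by a quoted total cost range."""
    return (low_usd / hours, high_usd / hours)


# Figures quoted above: (hours, low total, high total).
runs = {"7B LoRA": (4, 20, 40), "70B QLoRA": (24, 120, 200)}

for name, (hours, low, high) in runs.items():
    rate_low, rate_high = implied_hourly_rate(hours, low, high)
    print(f"{name}: {hours} h implies ${rate_low:.2f}-${rate_high:.2f} per GPU hour")
```

Both estimates imply a rate of about 5-10 dollars per GPU hour, so swapping in a cheaper or pricier A100 offering scales the totals proportionally.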

Conclusion

LoRA and QLoRA bring LLM fine-tuning within reach of a single GPU: freeze the base model, train small low-rank adapters, and quantize the base weights to 4 bits when memory is tight. Invest first in data quality, evaluate with standard benchmarks, and deploy adapters on a shared base model. Reach for RAG or few-shot prompting instead when the task depends on frequently changing knowledge rather than style, terminology, or format.