Introduction
Full fine-tuning of a 70B model requires 280GB of GPU memory. LoRA fine-tunes 0.1-1% of parameters achieving 90-99% of full fine-tuning performance. QLoRA adds quantization enabling 70B fine-tuning on a single 48GB GPU. This democratizes LLM customization for startups, research labs, and individual developers.
Why Fine-Tuning Matters
Full fine-tuning of a 70B model requires 280GB of GPU memory. LoRA fine-tunes 0.1-1% of parameters achieving 90-99% of full fine-tuning performance. QLoRA adds quantization enabling 70B fine-tuning on a single 48GB GPU. This democratizes LLM customization for startups, research labs, and individual developers.
LoRA Low-Rank Adaptation
LoRA decomposes weight updates into low-rank matrices A (d x r) and B (r x d) where r is 8-64. The update is delta W = BA. Original weights are frozen, only A and B are trained. This reduces trainable parameters by d/r factor. For a 7B model with rank 16, only 10-20M parameters are trainable. LoRA is applied to attention layers: query, key, value, output projections.
QLoRA Quantized LoRA
QLoRA combines 4-bit NF4 quantization with LoRA. Base weights are quantized and frozen, adapters trained in 16-bit. Double quantization reduces memory further. Paged optimizers handle memory spikes. A 65B model fine-tuned with QLoRA on a single 48GB GPU matches full 16-bit performance. Uses bitsandbytes for quantization and PEFT for adapters.
Training Data and Configuration
Data quality is the most important factor. Synthetic data generation using larger models for smaller model training is common. Data formatting uses chat templates specific to each base model. Key hyperparameters: learning rate 1e-4 to 5e-4, batch size 4-16, epochs 1-3, LoRA rank 8-64, alpha 2x rank. Gradient checkpointing and Flash Attention 2 reduce memory.
Evaluation and Deployment
MT-Bench and AlpacaEval measure instruction-following quality. The lm-evaluation-harness provides standardized benchmarks. Fine-tuned models deploy as base model plus adapter weights. vLLM and TGI support dynamic adapter switching at request time. One base model serves multiple specialized tasks by swapping adapters. GPTQ or AWQ quantization optimizes inference.
Cost Analysis
7B model on A100: ~4 hours, ~120-200. Cost advantage 5-10x for small models, 10-50x for large models. Fine-tuning excels at style adaptation, domain terminology, consistent formatting. RAG is better for frequently changing knowledge. Few-shot prompting may suffice for simple tasks.
Conclusion
The topics covered in this article represent important developments in modern software engineering. By understanding these concepts deeply and applying them in your projects, you can build more robust, scalable, and maintainable systems. Continue exploring, experimenting, and building — the technology landscape rewards those who stay curious and keep learning.