
AI Fine-Tuning: Adapting LLMs for Your Domain

Fine-tune LLMs: LoRA, QLoRA, dataset preparation, evaluation, and deployment.

Tags: AI, Fine-Tuning, LLM, Machine Learning

By MinhVo

Introduction

Pre-trained large language models are powerful generalists, but production applications often demand specialist-level performance in specific domains. A general-purpose model might handle casual conversation well but struggle with medical diagnosis, legal contract analysis, or your company's proprietary coding standards. Fine-tuning bridges this gap: it adapts a pre-trained model to your specific domain, terminology, and output format using a relatively small dataset of examples. On that narrow task, a well-tuned small model can match or beat general-purpose models many times its size.

[Figure: Fine-tuning LLMs for domain-specific tasks]

The economics of fine-tuning are compelling. Training a foundation model from scratch costs millions of dollars and requires massive datasets. Fine-tuning a pre-trained model costs anywhere from $5 to $500 depending on model size and dataset, and takes hours rather than weeks. With techniques like LoRA (Low-Rank Adaptation) and QLoRA (Quantized LoRA), you can fine-tune models in the 7-13B-parameter range on a single consumer GPU with 24GB of VRAM, and the QLoRA paper reports fine-tuning a 65B-parameter model on a single 48GB GPU.

This guide covers the complete fine-tuning workflow: from deciding whether fine-tuning is right for your use case, through dataset preparation, training, evaluation, and deployment. Whether you're adapting a model for customer support, code generation, content creation, or domain-specific question answering, these patterns will help you achieve production-quality results.

Understanding Fine-Tuning: Core Concepts

Why Fine-Tune?

Fine-tuning is most valuable when prompt engineering and RAG aren't sufficient. If you can get 90% of the way there with a well-crafted prompt, fine-tuning might not be worth the investment. Fine-tuning shines when you need consistent output formatting, domain-specific reasoning, specialized vocabulary, or behavior that's difficult to describe in a prompt.

The decision framework is: Prompt Engineering → Few-Shot Examples → RAG → Fine-Tuning → Training from Scratch. Each step requires more investment but offers more control. Most teams should exhaust prompt engineering and RAG before considering fine-tuning.

Supervised Fine-Tuning (SFT)

The most common fine-tuning approach is supervised fine-tuning, where you provide examples of ideal input-output pairs. The model learns to produce outputs that match your examples. This is effective for style transfer, format compliance, and task-specific behavior.

LoRA and Parameter-Efficient Fine-Tuning

Full fine-tuning updates all model parameters, which requires enormous GPU memory. LoRA (Low-Rank Adaptation) dramatically reduces this by freezing the original model weights and training only small "adapter" matrices injected into each layer. These adapters have far fewer parameters (typically 0.1-1% of the full model), making fine-tuning feasible on consumer hardware.
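To make the "0.1-1%" figure concrete, here is a small sketch of the arithmetic for one weight matrix. The 4096 x 4096 projection size and rank 16 are illustrative assumptions, not values taken from a specific model config.

```typescript
// Parameter count for a LoRA adapter on a single weight matrix W (dOut x dIn).
// LoRA learns B (dOut x r) and A (r x dIn), so the update B*A has rank at most r.
function loraParamCount(dOut: number, dIn: number, rank: number): number {
  return rank * (dOut + dIn);
}

// Fraction of the original matrix's parameters that the adapter trains.
function loraFraction(dOut: number, dIn: number, rank: number): number {
  return loraParamCount(dOut, dIn, rank) / (dOut * dIn);
}

// Illustrative values (assumed): a 4096 x 4096 projection with rank 16.
const adapterParams = loraParamCount(4096, 4096, 16); // 131072
const fraction = loraFraction(4096, 4096, 16);        // 0.0078125, about 0.78%
console.log(adapterParams, `${(fraction * 100).toFixed(2)}%`);
```

Because the count grows with r * (dOut + dIn) rather than dOut * dIn, doubling the rank doubles the adapter size while the frozen matrix stays untouched.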

QLoRA goes further by combining LoRA with 4-bit quantization of the base model. Storing the frozen weights in 4 bits instead of 16 cuts their memory footprint by roughly 4x, which brings 7-13B models within reach of a single 24GB GPU; the QLoRA paper reports fine-tuning a 65B model on a single 48GB GPU.
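A rough back-of-the-envelope sketch of where the savings come from: the memory needed to hold the frozen base weights scales with bits per parameter. The function below is an assumption-laden estimate that covers weights only; activations, optimizer state, adapter weights, and quantization metadata all add to the real total.

```typescript
// Approximate memory needed just to hold the base model weights, in GB (1e9 bytes).
// Ignores activations, optimizer state, adapter weights, and quantization overhead.
function weightMemoryGB(paramCount: number, bitsPerParam: number): number {
  const bytes = paramCount * (bitsPerParam / 8);
  return bytes / 1e9;
}

const params8B = 8e9; // illustrative model size
console.log(weightMemoryGB(params8B, 16)); // 16 (GB at 16-bit)
console.log(weightMemoryGB(params8B, 4));  // 4 (GB at 4-bit)
```

Running the same estimate for a larger model before renting hardware is a quick sanity check on whether the weights alone will fit.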

[Figure: LoRA adapter architecture]

RLHF and DPO

Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) are techniques for aligning model behavior with human preferences. While SFT teaches the model what to output by imitation, RLHF and DPO train on comparisons between responses, so the model learns which of several plausible answers people prefer and which kinds of output to avoid. These techniques are widely used to reduce harmful outputs and improve helpfulness.

Architecture and Design Patterns

The Data Pipeline Pattern

Fine-tuning quality is 80% data quality. A robust data pipeline collects raw examples, cleans and validates them, formats them into the model's expected chat template, and splits them into training and evaluation sets.
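The clean-and-validate stage of such a pipeline might look like the sketch below. The `RawPair` shape, the minimum answer length, and the normalization rules are all assumptions for illustration; real pipelines usually add domain-specific checks on top.

```typescript
interface RawPair {
  question: string;
  answer: string;
}

// Trim whitespace, drop empty or too-short pairs, and remove duplicate questions.
function cleanPairs(pairs: RawPair[], minAnswerLength = 10): RawPair[] {
  const seen = new Set<string>();
  const cleaned: RawPair[] = [];
  for (const pair of pairs) {
    const question = pair.question.trim();
    const answer = pair.answer.trim();
    if (question.length === 0 || answer.length < minAnswerLength) continue;
    // Normalize case and internal whitespace so near-identical questions collide.
    const key = question.toLowerCase().replace(/\s+/g, ' ');
    if (seen.has(key)) continue;
    seen.add(key);
    cleaned.push({ question, answer });
  }
  return cleaned;
}
```

Keeping the first occurrence of each question is an arbitrary choice; if later examples tend to be better curated, iterate in reverse instead.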

The Iterative Training Pattern

Rather than training one model with all available data, train iteratively: start with a small high-quality dataset, evaluate, identify weaknesses, collect targeted data for those weaknesses, and train again. This produces better results than throwing all data at the model at once.
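The "identify weaknesses" step can be as simple as grouping evaluation outcomes by category and ranking by failure rate. This is a minimal sketch; the `category` label and the pass/fail outcome are hypothetical fields you would attach to your own eval set.

```typescript
interface EvalOutcome {
  category: string; // hypothetical label, e.g. "billing" or "login"
  passed: boolean;
}

interface CategoryReport {
  category: string;
  total: number;
  failureRate: number;
}

// Group evaluation outcomes by category and rank categories by failure rate,
// so the next round of data collection can target the weakest ones first.
function findWeakCategories(outcomes: EvalOutcome[], minSamples = 1): CategoryReport[] {
  const counts = new Map<string, { total: number; failed: number }>();
  for (const outcome of outcomes) {
    const entry = counts.get(outcome.category) ?? { total: 0, failed: 0 };
    entry.total += 1;
    if (!outcome.passed) entry.failed += 1;
    counts.set(outcome.category, entry);
  }
  return Array.from(counts.entries())
    .filter(([, c]) => c.total >= minSamples)
    .map(([category, c]) => ({ category, total: c.total, failureRate: c.failed / c.total }))
    .sort((a, b) => b.failureRate - a.failureRate);
}
```

Raising `minSamples` avoids chasing categories where one or two failures dominate the rate.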

The Adapter Stacking Pattern

With LoRA, you can train multiple adapters for different tasks and swap them at inference time. A base model can serve customer support, content generation, and code review by loading the appropriate adapter for each request.

The Evaluation-Driven Pattern

Define your evaluation metrics before training. Automated metrics (BLEU, ROUGE, accuracy on held-out test set) provide fast feedback, but human evaluation on real-world examples is the ultimate quality signal.
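As one example of a cheap automated metric, the sketch below computes a unigram-overlap F1 score in the spirit of ROUGE-1. It is a simplified illustration, not a reference ROUGE implementation, and the tokenizer assumes ASCII English text.

```typescript
// Lowercase and split on non-alphanumeric characters (ASCII-only assumption).
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter(t => t.length > 0);
}

// Unigram-overlap F1 between a prediction and a reference (ROUGE-1-style).
function unigramF1(predicted: string, reference: string): number {
  const predTokens = tokenize(predicted);
  const refTokens = tokenize(reference);
  if (predTokens.length === 0 || refTokens.length === 0) return 0;

  // Count reference tokens so repeated words are matched only as often as they occur.
  const refCounts = new Map<string, number>();
  for (const t of refTokens) refCounts.set(t, (refCounts.get(t) ?? 0) + 1);

  let overlap = 0;
  for (const t of predTokens) {
    const remaining = refCounts.get(t) ?? 0;
    if (remaining > 0) {
      overlap += 1;
      refCounts.set(t, remaining - 1);
    }
  }
  if (overlap === 0) return 0;
  const precision = overlap / predTokens.length;
  const recall = overlap / refTokens.length;
  return (2 * precision * recall) / (precision + recall);
}
```

A score like this is useful for tracking trends between training runs, but two answers can share most words and still differ in correctness, which is why human review remains the final check.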

Step-by-Step Implementation

Preparing a Fine-Tuning Dataset

Data format is critical. Most fine-tuning APIs expect JSONL format with messages in the chat template:

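// Dataset preparation script
import fs from 'fs';

// Shape of one raw support record (field names match their usage below)
interface RawExample {
  customerQuestion: string;
  idealResponse: string;
}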
 
interface TrainingExample {
  messages: {
    role: 'system' | 'user' | 'assistant';
    content: string;
  }[];
}
 
function prepareTrainingData(rawExamples: RawExample[]): TrainingExample[] {
  return rawExamples.map(example => ({
    messages: [
      {
        role: 'system',
        content: 'You are a technical support agent for CloudPlatform. Be concise, accurate, and empathetic. Always provide specific steps the user can follow.'
      },
      {
        role: 'user',
        content: example.customerQuestion,
      },
      {
        role: 'assistant',
        content: example.idealResponse,
      },
    ],
  }));
}
 
// Write to JSONL format
function writeJSONL(examples: TrainingExample[], filepath: string): void {
  const jsonl = examples.map(e => JSON.stringify(e)).join('\n');
  fs.writeFileSync(filepath, jsonl);
}
 
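// Split into train/eval (90/10) using an unbiased Fisher-Yates shuffle on a copy,
// so the caller's array is not reordered
function splitDataset(examples: TrainingExample[]): { train: TrainingExample[]; eval: TrainingExample[] } {
  const shuffled = [...examples];
  for (let i = shuffled.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [shuffled[i], shuffled[j]] = [shuffled[j], shuffled[i]];
  }
  const splitIndex = Math.floor(shuffled.length * 0.9);
  return {
    train: shuffled.slice(0, splitIndex),
    eval: shuffled.slice(splitIndex),
  };
}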
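Before uploading a JSONL file, it can help to check every line against the chat format. The validator below is a sketch under the assumption that each line should hold a `messages` array ending with an assistant turn; the exact rules a given provider enforces may differ.

```typescript
type Role = 'system' | 'user' | 'assistant';
const VALID_ROLES: Role[] = ['system', 'user', 'assistant'];

// Returns a list of problems found in one JSONL line; an empty list means the line looks usable.
function validateJsonlLine(line: string): string[] {
  let parsed: unknown;
  try {
    parsed = JSON.parse(line);
  } catch {
    return ['not valid JSON'];
  }
  if (typeof parsed !== 'object' || parsed === null) return ['not a JSON object'];

  const messages = (parsed as { messages?: unknown }).messages;
  if (!Array.isArray(messages) || messages.length === 0) {
    return ['missing or empty "messages" array'];
  }

  const problems: string[] = [];
  messages.forEach((m, i) => {
    const role = (m as { role?: unknown } | null)?.role;
    const content = (m as { content?: unknown } | null)?.content;
    if (VALID_ROLES.indexOf(role as Role) === -1) problems.push(`message ${i}: unexpected role`);
    if (typeof content !== 'string' || content.trim() === '') problems.push(`message ${i}: empty content`);
  });

  const last = messages[messages.length - 1] as { role?: unknown } | null;
  if (last?.role !== 'assistant') problems.push('last message is not from the assistant');
  return problems;
}
```

Running this over every line and printing the line number alongside any problems catches most formatting mistakes before a training job is started.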

Fine-Tuning with OpenAI's API

import OpenAI from 'openai';
import fs from 'fs';
 
const openai = new OpenAI();
 
async function startFineTuning(): Promise<string> {
  // Upload training file
  const file = await openai.files.create({
    file: fs.createReadStream('training_data.jsonl'),
    purpose: 'fine-tune',
  });
 
  // Start fine-tuning job
  const job = await openai.fineTuning.jobs.create({
    training_file: file.id,
    model: 'gpt-4o-mini-2024-07-18',
    hyperparameters: {
      n_epochs: 3,
      batch_size: 'auto',
      learning_rate_multiplier: 'auto',
    },
    suffix: 'support-agent-v1',
  });
 
  console.log(`Fine-tuning job started: ${job.id}`);
  return job.id;
}
 
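// Monitor training progress by polling every 30 seconds
async function monitorTraining(jobId: string): Promise<void> {
  const terminalStatuses = ['succeeded', 'failed', 'cancelled'];
  const interval = setInterval(async () => {
    try {
      const job = await openai.fineTuning.jobs.retrieve(jobId);
      console.log(`Status: ${job.status}`);

      // The event list can be empty early in the job, so guard the lookup
      const events = await openai.fineTuning.jobs.listEvents(jobId);
      const latest = events.data[0];
      if (latest) console.log(`Latest: ${latest.message}`);

      if (terminalStatuses.includes(job.status)) {
        clearInterval(interval);
        if (job.status === 'succeeded') {
          console.log(`Training complete. Model: ${job.fine_tuned_model}`);
        } else {
          console.error(`Training ended with status "${job.status}".`);
        }
      }
    } catch (err) {
      // A failed poll should not kill the monitor; the next tick retries
      console.error('Polling failed, will retry:', err);
    }
  }, 30000);
}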
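Once the job succeeds, requests to the fine-tuned model should use the same system prompt the training examples used. One way to guarantee that is to keep the prompt in a single constant shared by the dataset script and the inference code. The helper below is a sketch; `SUPPORT_SYSTEM_PROMPT` and `buildInferenceMessages` are illustrative names, not part of the OpenAI SDK.

```typescript
// Keep the system prompt in one place so training and inference cannot drift apart.
const SUPPORT_SYSTEM_PROMPT =
  'You are a technical support agent for CloudPlatform. Be concise, accurate, and empathetic. Always provide specific steps the user can follow.';

interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

// Build the message list for an inference request against the fine-tuned model.
function buildInferenceMessages(userQuestion: string): ChatMessage[] {
  return [
    { role: 'system', content: SUPPORT_SYSTEM_PROMPT },
    { role: 'user', content: userQuestion.trim() },
  ];
}
```

The resulting array can be passed as `messages` to `openai.chat.completions.create`, with `model` set to the `fine_tuned_model` name reported by the finished job.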

Fine-Tuning with LoRA Locally (Hugging Face Transformers)

# Set up QLoRA fine-tuning with Hugging Face Transformers and PEFT
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
import torch

model_id = "meta-llama/Llama-3.1-8B-Instruct"

# Quantization settings are passed through BitsAndBytesConfig
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
)

# Load tokenizer and base model in 4-bit
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)

# Configure LoRA
lora_config = LoraConfig(
    r=16,                    # Rank: higher = more expressive, more memory
    lora_alpha=32,           # Scaling factor
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

model = prepare_model_for_kbit_training(model)
model = get_peft_model(model, lora_config)
# print_trainable_parameters() prints its own summary and returns None
model.print_trainable_parameters()
# The adapter-wrapped model can now be handed to a trainer such as trl's SFTTrainer

[Figure: Training pipeline for fine-tuning]

Real-World Use Cases

Customer Support Automation

Fine-tuning excels at customer support because the task requires consistent tone, specific product knowledge, and adherence to support policies. A fine-tuned model learns your support playbook, escalation procedures, and common solutions without needing to include them in every prompt.

Code Generation for Internal Frameworks

Every company has internal frameworks, coding standards, and architectural patterns that general-purpose models don't know. Fine-tuning on your codebase produces a model that generates code consistent with your team's conventions, uses your internal APIs correctly, and follows your error handling patterns.

Content Creation in Brand Voice

Marketing and content teams use fine-tuning to produce content that matches their brand voice. By training on examples of approved content, the model learns the specific tone, vocabulary, and style guidelines that define the brand.

Domain-specific fine-tuning is especially valuable in specialized fields where accuracy is critical. A model fine-tuned on medical literature typically handles clinical terminology, drug interactions, and diagnostic criteria more reliably than a general-purpose model.

Best Practices for Production

  1. Start with data quality, not quantity: 500 high-quality, diverse examples often outperform 10,000 noisy ones. Invest time in curating and validating your training data.

  2. Use a held-out test set from day one: Reserve 10-20% of your data for evaluation and never train on it. This gives you an unbiased measure of model quality.

  3. Train for fewer epochs than you think: Overfitting is the most common failure mode. Start with 1-3 epochs and monitor evaluation loss. More training doesn't always mean better results.

  4. Validate with real-world examples: Automated metrics don't capture everything. Test your fine-tuned model with actual use cases and have domain experts evaluate the outputs.

  5. Version your datasets and models: Track which dataset version produced which model version. This enables reproducibility and rollback when quality regresses.

  6. Use system prompts with fine-tuned models: Fine-tuning and prompt engineering are complementary. Use system prompts to provide dynamic context while the fine-tuned model handles domain-specific behavior.

  7. Monitor for regression: After deploying a fine-tuned model, continuously monitor output quality. The model's weights stay fixed, but the inputs it sees in production shift over time, and new edge cases may require additional training data.

  8. Consider cost vs. benefit: Fine-tuning has upfront costs (data collection, training, evaluation) and ongoing costs (model hosting). Ensure the performance improvement justifies the investment over simpler alternatives.
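For the versioning practice, one lightweight option is to fingerprint each dataset with a content hash and store it next to the model name. This is a sketch using Node's built-in `crypto` module; the `ModelRecord` shape is a hypothetical example, and the hash depends on key order within each example, so serialize consistently.

```typescript
import { createHash } from 'crypto';

// Deterministic fingerprint of a dataset: SHA-256 over its JSONL serialization.
function datasetFingerprint(examples: object[]): string {
  const jsonl = examples.map(e => JSON.stringify(e)).join('\n');
  return createHash('sha256').update(jsonl, 'utf8').digest('hex');
}

// Hypothetical record linking a trained model to the exact data that produced it.
interface ModelRecord {
  modelName: string;
  datasetSha256: string;
  createdAt: string;
}

function recordModel(modelName: string, examples: object[]): ModelRecord {
  return {
    modelName,
    datasetSha256: datasetFingerprint(examples),
    createdAt: new Date().toISOString(),
  };
}
```

When quality regresses, comparing the stored hash against the current dataset tells you immediately whether the data changed between the two model versions.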

Common Pitfalls and Solutions

| Pitfall | Impact | Solution |
| --- | --- | --- |
| Low-quality training data | Poor model outputs, hallucinations | Rigorous data validation, human review of all examples |
| Overfitting to training data | Poor generalization to new inputs | Use fewer epochs, more diverse data, regularization |
| Catastrophic forgetting | Model loses general capabilities | Use LoRA (preserves base weights), include diverse examples |
| Wrong base model selection | Suboptimal performance for your task | Benchmark multiple base models before committing |
| Insufficient evaluation | Deploying a model that regresses | Automated + human evaluation on diverse test cases |
| Data leakage | Inflated evaluation metrics | Strict train/eval/test split, no overlap |
| Ignoring prompt format | Model expects different format in production | Use identical system prompts during training and inference |
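The data leakage row is easy to check mechanically. The sketch below flags evaluation inputs that also appear in the training inputs after light normalization; it only catches exact matches up to case and whitespace, so paraphrased duplicates would need a fuzzier comparison.

```typescript
// Normalize text so trivial differences (case, spacing) do not hide duplicates.
function normalizeText(text: string): string {
  return text.toLowerCase().replace(/\s+/g, ' ').trim();
}

// Return eval inputs that also appear in the training inputs after normalization.
function findLeakedInputs(trainInputs: string[], evalInputs: string[]): string[] {
  const trainSet = new Set(trainInputs.map(normalizeText));
  return evalInputs.filter(input => trainSet.has(normalizeText(input)));
}
```

An empty result is the goal; anything returned should be removed from one of the two splits before metrics are reported.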

Detecting and Fixing Overfitting

Overfitting occurs when the model memorizes training examples rather than learning generalizable patterns. The telltale sign is evaluation loss increasing while training loss continues to decrease.

interface TrainingMetrics {
  epoch: number;
  trainLoss: number;
  evalLoss: number;
  learningRate: number;
}
 
function detectOverfitting(metrics: TrainingMetrics[]): boolean {
  // Look for divergence between train and eval loss
  for (let i = 1; i < metrics.length; i++) {
    const trainDecreasing = metrics[i].trainLoss < metrics[i - 1].trainLoss;
    const evalIncreasing = metrics[i].evalLoss > metrics[i - 1].evalLoss;
    
    if (trainDecreasing && evalIncreasing) {
      console.warn(`Overfitting detected at epoch ${metrics[i].epoch}`);
      return true;
    }
  }
  return false;
}
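A single noisy epoch can trip a check like the one above, so a common remedy is early stopping with patience: keep the checkpoint with the lowest evaluation loss and stop only after several epochs without improvement. The sketch below assumes one evaluation per epoch and a default patience of 2, both of which are illustrative choices.

```typescript
interface EpochLoss {
  epoch: number;
  evalLoss: number;
}

// Returns the epoch with the lowest eval loss, and whether training should stop
// because eval loss has not improved for `patience` consecutive epochs.
function earlyStopping(
  history: EpochLoss[],
  patience = 2
): { bestEpoch: number | null; shouldStop: boolean } {
  let best: EpochLoss | null = null;
  let sinceImprovement = 0;
  for (const point of history) {
    if (best === null || point.evalLoss < best.evalLoss) {
      best = point;
      sinceImprovement = 0;
    } else {
      sinceImprovement += 1;
    }
  }
  return { bestEpoch: best ? best.epoch : null, shouldStop: sinceImprovement >= patience };
}
```

When `shouldStop` is true, the checkpoint saved at `bestEpoch` is the one to evaluate further and deploy, not the final one.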

Performance Optimization

Fine-tuning performance depends on hardware, model size, and training configuration. One of the most useful optimizations when memory is tight is gradient accumulation: simulating a larger batch size by summing gradients over several forward and backward passes before each weight update. This enables training with limited GPU memory.

Mixed precision training (FP16 or BF16) roughly halves the memory needed for activations, with minimal quality loss. Gradient checkpointing trades computation for memory by recomputing activations during the backward pass instead of storing them.
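To see why gradient accumulation is equivalent to a larger batch, here is a toy sketch with a one-parameter model y = w * x and mean squared error. The model, data, and function names are assumptions chosen to keep the arithmetic visible; the equivalence shown holds when all micro-batches have the same size.

```typescript
interface Sample {
  x: number;
  y: number;
}

// Gradient of mean squared error with respect to w for the model y = w * x.
function mseGradient(w: number, batch: Sample[]): number {
  const sum = batch.reduce((acc, s) => acc + 2 * (w * s.x - s.y) * s.x, 0);
  return sum / batch.length;
}

// Accumulate gradients over equally sized micro-batches, then average them.
// With equal micro-batch sizes this matches the gradient of one large batch.
function accumulatedGradient(w: number, data: Sample[], microBatchSize: number): number {
  let total = 0;
  let steps = 0;
  for (let i = 0; i < data.length; i += microBatchSize) {
    total += mseGradient(w, data.slice(i, i + microBatchSize));
    steps += 1;
  }
  return total / steps;
}

const data: Sample[] = [
  { x: 1, y: 2 }, { x: 2, y: 4 }, { x: 3, y: 6 }, { x: 4, y: 8 },
];
const fullBatch = mseGradient(0.5, data);              // -22.5
const accumulated = accumulatedGradient(0.5, data, 2); // -22.5
console.log(Math.abs(fullBatch - accumulated) < 1e-12); // true
```

Only one micro-batch of activations is in memory at a time, which is where the savings come from; the cost is more forward and backward passes per weight update.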

// Training configuration optimization
interface TrainingConfig {
  model: string;
  batchSize: number;
  gradientAccumulationSteps: number;
  learningRate: number;
  warmupSteps: number;
  maxSteps: number;
  fp16: boolean;
  gradientCheckpointing: boolean;
}
 
const optimizedConfig: TrainingConfig = {
  model: 'meta-llama/Llama-3.1-8B-Instruct',
  batchSize: 4,
  gradientAccumulationSteps: 8, // Effective batch size = 32
  learningRate: 2e-4,
  warmupSteps: 100,
  maxSteps: 1000,
  fp16: true,
  gradientCheckpointing: true,
};
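It is worth checking how a step budget translates into epochs, since the two are easy to set inconsistently. The sketch below does that arithmetic for the configuration above; the dataset size of 5,000 examples is an assumption for illustration.

```typescript
// Samples that contribute to each optimizer update.
function effectiveBatchSize(perDeviceBatch: number, accumulationSteps: number, deviceCount = 1): number {
  return perDeviceBatch * accumulationSteps * deviceCount;
}

// Optimizer updates needed to see every training example once.
function stepsPerEpoch(datasetSize: number, effectiveBatch: number): number {
  return Math.ceil(datasetSize / effectiveBatch);
}

const effective = effectiveBatchSize(4, 8);     // 32
const steps = stepsPerEpoch(5000, effective);   // 157 (5000 is an assumed dataset size)
const epochsCovered = 1000 / steps;             // about 6.4 epochs for maxSteps = 1000
console.log(effective, steps, epochsCovered.toFixed(1));
```

Under that assumed dataset size, 1,000 steps would run well past the 1-3 epochs suggested earlier, so either the step cap or the dataset size would need adjusting.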

Comparison with Alternatives

| Approach | Cost | Time | Data Required | Quality | Best For |
| --- | --- | --- | --- | --- | --- |
| Prompt Engineering | Per-token | Minutes | 0 examples | Good | Simple tasks, broad capabilities |
| Few-Shot Learning | Per-token | Minutes | 5-20 examples | Better | Format consistency, style |
| RAG | Infrastructure | Hours | Documents | Better | Knowledge-intensive tasks |
| LoRA Fine-Tuning | $5-500 | Hours | 100-10K examples | Excellent | Domain adaptation, style |
| Full Fine-Tuning | $500-50K | Days | 10K-1M examples | Excellent | Maximum customization |
| Training from Scratch | $1M+ | Weeks-Months | Millions of examples | Varies | Novel architectures |

Advanced Patterns

Multi-Task Fine-Tuning

Train a single model on multiple related tasks by including task-specific prefixes in your training data. This produces a model that can handle various tasks without loading separate adapters.

interface MultiTaskExample {
  task: 'summarize' | 'classify' | 'extract' | 'generate';
  input: string;
  output: string;
}
 
function formatMultiTask(example: MultiTaskExample): TrainingExample {
  return {
    messages: [
      {
        role: 'system',
        content: `You are a versatile assistant. Task: ${example.task}. Respond appropriately for the given task type.`
      },
      { role: 'user', content: example.input },
      { role: 'assistant', content: example.output },
    ],
  };
}

Synthetic Data Generation

When real training data is scarce, use a larger model to generate synthetic examples. This "teacher-student" approach can bootstrap a fine-tuning dataset from a small seed of high-quality examples.

Continuous Fine-Tuning

Implement a feedback loop where production model outputs that receive positive user feedback are added to the training dataset. Periodically retrain the model on this growing dataset to continuously improve quality.
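The harvesting half of that loop could look like the sketch below. The `FeedbackRecord` shape and the thumbs-up/down rating are hypothetical; `existingInputs` is assumed to hold trimmed, lowercased user inputs already in the dataset.

```typescript
interface FeedbackRecord {
  userInput: string;
  modelOutput: string;
  rating: 'up' | 'down' | null; // hypothetical thumbs-up/down signal
}

interface ChatExample {
  messages: { role: 'system' | 'user' | 'assistant'; content: string }[];
}

// Convert positively rated production outputs into new training examples,
// skipping inputs that are already present in the dataset.
function harvestFeedback(
  records: FeedbackRecord[],
  existingInputs: Set<string>,
  systemPrompt: string
): ChatExample[] {
  const seen = new Set(existingInputs);
  const harvested: ChatExample[] = [];
  for (const record of records) {
    const key = record.userInput.trim().toLowerCase();
    if (record.rating !== 'up' || key.length === 0 || seen.has(key)) continue;
    seen.add(key);
    harvested.push({
      messages: [
        { role: 'system', content: systemPrompt },
        { role: 'user', content: record.userInput.trim() },
        { role: 'assistant', content: record.modelOutput.trim() },
      ],
    });
  }
  return harvested;
}
```

A thumbs-up is a noisy signal, so passing harvested examples through the same review applied to the original dataset keeps the quality bar from slipping over successive retraining rounds.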

Testing Strategies

Evaluate fine-tuned models with a combination of automated metrics and human evaluation. Automated metrics provide fast feedback during development, while human evaluation catches issues that metrics miss.

interface EvaluationResult {
  testCase: string;
  expected: string;
  predicted: string;
  metrics: {
    exactMatch: boolean;
    bleuScore: number;
    containsKeyPhrases: boolean;
    humanRating?: number; // 1-5 scale
  };
}
 
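// Helpers assumed to exist elsewhere in your codebase (not defined in this post)
declare function generateWithModel(modelId: string, input: string): Promise<string>;
declare function calculateBLEU(predicted: string, expected: string): number;
declare function extractKeyPhrases(text: string): string[];

async function evaluateModel(
  modelId: string,
  testSet: { input: string; expected: string }[]
): Promise<void> {
  // Avoid dividing by zero when computing the averages below
  if (testSet.length === 0) {
    console.warn('Empty test set; nothing to evaluate.');
    return;
  }
  const results: EvaluationResult[] = [];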
  
  for (const testCase of testSet) {
    const predicted = await generateWithModel(modelId, testCase.input);
    results.push({
      testCase: testCase.input,
      expected: testCase.expected,
      predicted,
      metrics: {
        exactMatch: predicted.trim() === testCase.expected.trim(),
        bleuScore: calculateBLEU(predicted, testCase.expected),
        containsKeyPhrases: extractKeyPhrases(testCase.expected)
          .every(phrase => predicted.toLowerCase().includes(phrase.toLowerCase())),
      },
    });
  }
  
  const exactMatchRate = results.filter(r => r.metrics.exactMatch).length / results.length;
  const avgBLEU = results.reduce((sum, r) => sum + r.metrics.bleuScore, 0) / results.length;
  
  console.log(`Exact Match: ${(exactMatchRate * 100).toFixed(1)}%`);
  console.log(`Average BLEU: ${avgBLEU.toFixed(3)}`);
}

Future Outlook

Fine-tuning is becoming more accessible and more powerful. The trend toward parameter-efficient methods continues with new techniques like DoRA (Weight-Decomposed Low-Rank Adaptation) and rsLoRA (rank-stabilized LoRA) that improve quality with fewer trainable parameters.

Automated fine-tuning platforms are emerging that handle the entire pipeline — data preparation, hyperparameter tuning, evaluation, and deployment — with minimal human intervention. These platforms make fine-tuning accessible to teams without ML expertise.

The convergence of fine-tuning with retrieval-augmented generation is creating hybrid systems that combine the specialized knowledge of fine-tuned models with the up-to-date information access of RAG. This combination addresses the fundamental limitation of fine-tuning: the model's knowledge is frozen at training time.

Conclusion

Fine-tuning transforms general-purpose language models into domain-specific experts at a fraction of the cost of training from scratch. With techniques like LoRA and QLoRA, the barrier to entry has dropped dramatically — you can fine-tune powerful models on consumer hardware with hundreds of examples.

Key takeaways:

  1. Exhaust prompt engineering and RAG before considering fine-tuning — it's a bigger investment
  2. Data quality is 80% of fine-tuning success — invest heavily in curating training examples
  3. LoRA and QLoRA make fine-tuning accessible on consumer hardware (a 24GB GPU is enough for 7-13B models with QLoRA)
  4. Use 1-3 epochs to avoid overfitting — more training isn't always better
  5. Evaluate with both automated metrics and human review on real-world examples
  6. Version your datasets and models for reproducibility and rollback
  7. Fine-tuning and prompt engineering are complementary — use both together

Start by identifying a specific task where prompt engineering falls short. Collect 100-500 high-quality examples of ideal outputs, and fine-tune a small model (7-8B parameters) using LoRA. Evaluate the results against your baseline, and iterate from there.