Large Language Models: How They Really Work (Internals)

LLM internals: transformer architecture, attention mechanisms, tokenization, training process, inference optimization, scaling laws.

Tags: large language models, transformers, attention, AI, machine learning, deep learning

By MinhVo

Introduction

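The transformer architecture processes all tokens of an input sequence in parallel using self-attention, rather than stepping through them one at a time as a recurrent network does. An LLM consists of an embedding layer, a stack of transformer layers, and an output layer that maps the final hidden states to scores over the vocabulary. Its knowledge is stored in billions of parameters learned during training, not in a lookup table or database. Some capabilities only become apparent once models reach a certain scale, which is why they are often called emergent.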
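To make "billions of parameters" concrete, here is a rough sketch that counts the weights in the three parts named above. It rests on several assumptions that are not from this article: a decoder-only model, an FFN hidden size of 4x the model width, an output layer that shares weights with the embedding, and no biases, norms, or positional parameters. The example configuration is the published GPT-3 175B shape, used purely for illustration.

```python
def estimate_parameters(vocab_size, d_model, n_layers,
                        ffn_multiplier=4, tied_output=True):
    """Rough weight count for a decoder-only transformer.

    Ignores biases, LayerNorm parameters and positional embeddings,
    so the result is an approximation, not an exact count.
    """
    # Embedding layer: one d_model-sized vector per vocabulary entry.
    embedding = vocab_size * d_model
    # Attention: Q, K, V and output projections, each d_model x d_model.
    attention = 4 * d_model * d_model
    # FFN: two linear maps, d_model -> hidden and hidden -> d_model.
    ffn = 2 * d_model * (ffn_multiplier * d_model)
    layers = n_layers * (attention + ffn)
    # Output layer: free if it reuses the embedding matrix (assumption).
    output = 0 if tied_output else vocab_size * d_model
    return {
        "embedding": embedding,
        "layers": layers,
        "output": output,
        "total": embedding + layers + output,
    }


# Illustrative configuration (assumed, not taken from this article).
counts = estimate_parameters(vocab_size=50257, d_model=12288, n_layers=96)
print(f"{counts['total'] / 1e9:.1f}B parameters")  # prints "174.6B parameters"
```

Under these assumptions almost all of the weights sit in the layer stack (12 x d_model^2 per layer), and the embedding table is well under 1% of the total.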

Tokenization

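Modern LLMs use subword tokenization, typically BPE (byte-pair encoding), WordPiece, or a unigram model, often through the SentencePiece library. BPE starts from individual characters (or bytes) and iteratively merges the most frequent adjacent pair into a new token until the vocabulary reaches a target size. Different models use different tokenizers, so the same text can map to a different number of tokens in each. Context windows are measured in tokens, not words. Tokenization also explains why LLMs struggle with character-level tasks: the model sees token IDs rather than letters, so a question such as how many times a letter occurs in a word asks about structure it never observes directly.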
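The BPE merge loop described above can be sketched in a few lines. This is a toy version under stated assumptions: it works on a list of whole words, uses no end-of-word marker or byte fallback, and breaks ties by first occurrence. The function names (`train_bpe`, `merge_pair`, `encode`) and the tiny corpus are made up for illustration and do not come from any particular tokenizer library.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def train_bpe(words, num_merges):
    """Learn an ordered list of merges from a list of words (toy BPE)."""
    # Every word starts as a tuple of single characters, weighted by frequency.
    corpus = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs across the whole corpus.
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair; ties go to the pair seen first.
        best = max(pair_counts, key=pair_counts.get)
        merges.append(best)
        # Apply the merge everywhere before counting again.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            new_corpus[merge_pair(symbols, best)] += freq
        corpus = new_corpus
    return merges


def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Hypothetical toy corpus: word frequencies are chosen for illustration.
words = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
merges = train_bpe(words, num_merges=4)
print(merges)
print(encode("lowest", merges))  # ['low', 'est']
```

Note that "lowest" never appears in the toy corpus, yet it splits into two learned subwords. That is the practical appeal of subword vocabularies: unseen words decompose into known pieces instead of becoming an unknown token.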

Transformer Architecture

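Each layer has a multi-head self-attention sublayer and a feed-forward network (FFN). Attention computes softmax(QK^T/√d_k) to get weights, then multiplies those weights by V. Here Q, K, and V are the query, key, and value projections of the layer input, and d_k is the key dimension; dividing by √d_k keeps the dot products from growing with dimension and saturating the softmax. Multi-head attention runs several attention operations in parallel, each with its own projections, so different heads can capture different types of relationships. The FFN applies two linear transformations with a nonlinearity between them, independently at each position. LayerNorm and residual connections around both sublayers stabilize training.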
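A single attention head can be written out directly from the formula. The sketch below uses plain Python lists so that every step is visible; real implementations use batched tensor operations. The `causal` flag is an addition not discussed in the text: it masks out later positions, which is what decoder-only LLMs do so that a token cannot attend to the future.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V, causal=False):
    """Single-head scaled dot-product attention.

    Q, K, V are lists of row vectors, one row per position.
    Returns (outputs, weights), where weights[i][j] is how much
    position i attends to position j.
    """
    d_k = len(K[0])
    outputs, all_weights = [], []
    for i, q in enumerate(Q):
        scores = []
        for j, k in enumerate(K):
            if causal and j > i:
                # Masked position: exp(-inf) becomes a weight of exactly 0.
                scores.append(float("-inf"))
            else:
                dot = sum(a * b for a, b in zip(q, k))
                scores.append(dot / math.sqrt(d_k))
        weights = softmax(scores)
        # Output is the weighted average of the value vectors.
        out = [sum(w * v[c] for w, v in zip(weights, V))
               for c in range(len(V[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights


# Two positions with 2-dimensional vectors (made-up numbers).
X = [[1.0, 0.0], [0.0, 1.0]]
outputs, weights = attention(X, X, X, causal=True)
print(weights[0])  # [1.0, 0.0]: the first token can only attend to itself
print(weights[1])  # both positions get weight, summing to 1
```

Multi-head attention repeats this computation with different learned projections for Q, K, and V, then concatenates the per-head outputs.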

Training Process

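Pre-training optimizes next-token prediction on large text corpora, up to trillions of tokens for recent models. The loss is the cross-entropy between the predicted distribution over the vocabulary and the actual next token. Through this single objective the model picks up grammar, facts, and reasoning patterns. Data quality and the mixing ratio between data sources are critical. Post-training includes supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF). Mixed-precision training in FP16 or BF16 reduces memory use and increases throughput.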
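The next-token objective is small enough to write out by hand. The sketch below assumes the model has already produced one vector of logits per position, where the logits at position t are scored against the token at position t + 1. The function names and the toy numbers are illustrative only.

```python
import math


def log_softmax(logits):
    """Log-probabilities from raw logits, computed stably."""
    peak = max(logits)
    log_total = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_total for x in logits]


def next_token_loss(logits_per_position, token_ids):
    """Mean cross-entropy for next-token prediction.

    logits_per_position[t] is the model's prediction after seeing
    token_ids[: t + 1], so its target is token_ids[t + 1].
    """
    targets = token_ids[1:]
    losses = [
        -log_softmax(logits)[target]
        for logits, target in zip(logits_per_position, targets)
    ]
    return sum(losses) / len(losses)


# A model that knows nothing: uniform logits over a 4-token vocabulary.
uniform = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
tokens = [0, 1, 2]
loss = next_token_loss(uniform, tokens)
print(loss, math.exp(loss))  # loss is ln(4); perplexity is the vocabulary size

# A model that puts most of its mass on the correct next token.
confident = [[0.0, 5.0, 0.0, 0.0], [0.0, 0.0, 5.0, 0.0]]
print(next_token_loss(confident, tokens))  # much lower than ln(4)
```

A uniform guess over V tokens always costs ln(V), which is a useful sanity check: an untrained model should start near that value, and training pushes the loss down from there.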

Scaling Laws and Inference

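Chinchilla scaling law: for a fixed compute budget, training is roughly compute-optimal at about 20 tokens per parameter. Emergent capabilities have been reported at roughly 10-100B parameters, although the threshold depends on the task and on how it is measured. Inference optimizations include KV caching (reusing the keys and values of earlier tokens instead of recomputing them at every step), Flash Attention (reported 2-4x speedup), Grouped-Query Attention (several query heads share one key/value head, which shrinks the KV cache), quantization (8-bit, 4-bit), speculative decoding, and continuous batching.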
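Two of these claims reduce to simple arithmetic, sketched below. The 20 tokens per parameter figure is from the text; everything else is an assumption added for illustration: the common approximation that training costs about 6 FLOPs per parameter per token, 2 bytes per cached value (FP16/BF16), and a hypothetical model shape of 32 layers, 32 heads, and head dimension 128.

```python
def chinchilla_tokens(n_params, tokens_per_param=20):
    """Approximate compute-optimal number of training tokens."""
    return n_params * tokens_per_param


def training_flops(n_params, n_tokens):
    """Rough training cost using the C ≈ 6 * N * D rule of thumb (assumption)."""
    return 6 * n_params * n_tokens


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_value=2):
    """Memory for cached keys and values during generation.

    The leading 2 accounts for storing both a key and a value vector
    per layer, per KV head, per position.
    """
    return (2 * n_layers * n_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_value)


# Compute-optimal data budget for a 70B-parameter model.
n_params = 70 * 10**9
n_tokens = chinchilla_tokens(n_params)
print(n_tokens)                            # 1400000000000 (1.4 trillion)
print(training_flops(n_params, n_tokens))  # about 5.88e23 FLOPs

# Hypothetical model shape: standard attention vs. Grouped-Query Attention.
full = kv_cache_bytes(n_layers=32, n_kv_heads=32, head_dim=128, seq_len=4096)
gqa = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4096)
print(full / 2**30, gqa / 2**30)  # 2.0 GiB vs. 0.5 GiB per sequence
```

The cache grows linearly with sequence length and batch size, which is why reducing the number of KV heads or quantizing the cache has such a direct effect on how many requests fit on one accelerator.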

Interpretability and Future

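Mechanistic interpretability analyzes what individual neurons and attention heads compute. Circuit analysis traces the path from input to output through the specific components that implement a behavior. Candidate future directions include state-space models (Mamba), mixture-of-experts (MoE) layers for scaling capacity without a proportional increase in per-token compute, multi-token prediction, and test-time compute for improved reasoning quality.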

Conclusion

LLMs are easier to reason about once the pieces are separated: tokenization turns text into tokens, transformer layers mix information across them with attention, pre-training and post-training set the parameters, and scaling laws and inference optimizations determine what it costs to train and serve the result. Understanding these internals helps you predict where a model will struggle and where engineering effort pays off. Keep exploring and experimenting; interpretability research and new architectures are still changing the picture.