MinhVo

Hey there 👋 I'm an AI Engineer with 7 years of experience building scalable web and mobile applications. Currently at Neurond AI (May 2025 to present), architecting an Enterprise AI Assistant Platform with multi-tenant RAG on pgvector, multi-provider LLM orchestration, and Azure-native infrastructure. Previously spent 5+ years at SNAPTEC (Sep 2019 to Apr 2025), leading SaaS themes, admin dashboards, and e-commerce platforms, and earned the Hero of the Year award in 2021. I specialize in TypeScript, React, Next.js, and AI-Native engineering with Claude Code and Cursor.


Introduction to Machine Learning with Python

Master machine learning fundamentals with Python and scikit-learn. Comprehensive guide covering supervised learning, unsupervised learning, model evaluation, feature engineering, and production deployment patterns with real-world code examples.

Tags: Machine Learning, Python, scikit-learn, AI, Data Science

By MinhVo

Machine learning enables computers to learn patterns from data and make predictions without being explicitly programmed for each task. Python has emerged as the dominant language for ML due to its rich ecosystem of libraries including scikit-learn, pandas, NumPy, TensorFlow, and PyTorch. Whether you're building spam classifiers, recommendation systems, or predictive maintenance models, understanding ML fundamentals with Python gives you the foundation to tackle real-world problems across every industry.

This comprehensive guide covers the complete machine learning workflow from data preparation through production deployment, with practical code examples you can run immediately.

[Figure: Machine learning concept visualization]

Understanding Machine Learning Fundamentals

Machine learning is a subset of artificial intelligence where systems improve their performance on a task through experience rather than explicit programming. Instead of writing rules manually, you provide data and let algorithms discover patterns automatically.

Types of Machine Learning

The three primary categories of machine learning each address different problem types and data availability scenarios.

Supervised Learning trains models on labeled data where both input features and correct outputs are known. The model learns a mapping function from inputs to outputs that generalizes to unseen data. Supervised learning dominates commercial ML applications and includes two main subcategories:

  • Classification predicts discrete categories. Examples include email spam detection (spam vs. not spam), image recognition (cat vs. dog vs. bird), medical diagnosis (positive vs. negative), and sentiment analysis (positive vs. negative vs. neutral). The output is a class label, often accompanied by a probability score indicating confidence.

  • Regression predicts continuous numerical values. Examples include house price prediction, stock price forecasting, temperature prediction, and demand forecasting. The output is a real number, and evaluation metrics differ significantly from classification.

Unsupervised Learning discovers hidden patterns in unlabeled data without predefined outputs. The algorithm must find structure on its own:

  • Clustering groups similar data points together. K-Means, DBSCAN, and hierarchical clustering are popular algorithms. Applications include customer segmentation, anomaly detection, and document grouping.

  • Dimensionality Reduction reduces the number of features while preserving essential information. Principal Component Analysis (PCA) and t-SNE are widely used. This helps with visualization, noise reduction, and combating the curse of dimensionality.

  • Association Rule Learning discovers relationships between variables in large datasets. Market basket analysis (customers who buy X also buy Y) is the classic application.
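
As a rough illustration of the first two ideas, the sketch below clusters the Iris measurements with K-Means and projects them to two dimensions with PCA. The choice of three clusters, `n_init=10`, and the standardization step are assumptions made for this example, not recommendations.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Labels are ignored on purpose: unsupervised methods only see the features
X = load_iris().data
X_std = StandardScaler().fit_transform(X)

# Clustering: assign each flower to one of three groups
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
cluster_ids = kmeans.fit_predict(X_std)

# Dimensionality reduction: compress 4 features into 2 components
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_std)

print("Cluster sizes:", [int((cluster_ids == k).sum()) for k in range(3)])
print("Reduced shape:", X_2d.shape)
print("Variance kept by 2 components:",
      round(float(pca.explained_variance_ratio_.sum()), 3))
```

Neither step uses the species labels, which is what distinguishes this from the supervised examples later in the guide.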

Reinforcement Learning trains agents through trial and error interactions with an environment. The agent receives rewards for good actions and penalties for bad ones, learning an optimal policy over time. Applications include game playing, robotics, autonomous driving, and resource management.

The Machine Learning Workflow

Most ML projects follow a similar workflow. Understanding it helps you avoid common mistakes and keeps results reproducible:

  1. Problem Definition — Clearly define what you're predicting, why it matters, and how success will be measured. Align ML objectives with business goals.

  2. Data Collection — Gather relevant, representative data. More data generally helps, but quality matters more than quantity. Ensure your data covers the scenarios you want to handle.

  3. Exploratory Data Analysis — Understand distributions, correlations, missing values, and outliers. Visualize your data before building models.

  4. Data Preprocessing — Clean, transform, and prepare data. Handle missing values, encode categorical variables, scale numerical features, and remove duplicates.

  5. Feature Engineering — Create meaningful features from raw data. Domain knowledge here often matters more than algorithm choice. Good features can make simple models outperform complex ones on raw features.

  6. Model Selection — Choose appropriate algorithms based on problem type, data size, interpretability requirements, and computational constraints.

  7. Training and Evaluation — Fit models to training data and evaluate on held-out test data using appropriate metrics. Use cross-validation for robust estimates.

  8. Hyperparameter Tuning — Optimize model configuration using grid search, random search, or Bayesian optimization.

  9. Deployment — Serve predictions in production. Monitor model performance over time and retrain as data distributions shift.
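
The middle steps of this workflow can be sketched end to end in a few lines. This is a minimal example on the bundled Iris dataset, assuming a scaler plus logistic regression as the model; bundling both in a pipeline is one way to keep preprocessing inside each cross-validation fold.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Steps 1-2: the task is species classification on a small bundled dataset
X, y = load_iris(return_X_y=True)

# Hold out a test set before any fitted preprocessing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Steps 4 and 6: preprocessing and model bundled together, so the scaler
# is re-fitted on the training portion of every cross-validation fold
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Step 7: cross-validated estimate on the training data only
cv_scores = cross_val_score(model, X_train, y_train, cv=5, scoring="accuracy")
print(f"CV accuracy: {cv_scores.mean():.3f}")

# Final check on the held-out test set
model.fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {test_accuracy:.3f}")
```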

[Figure: Data science workflow]

Setting Up Your Python ML Environment

A well-configured environment ensures reproducibility and prevents dependency conflicts.

Installation

# Create a virtual environment
python -m venv ml-env
source ml-env/bin/activate  # Linux/Mac
# ml-env\Scripts\activate   # Windows
 
# Install core libraries
pip install scikit-learn pandas numpy matplotlib seaborn jupyter
 
# Optional: deep learning frameworks
pip install tensorflow torch
 
# Optional: advanced tools
pip install xgboost lightgbm optuna shap

Essential Imports

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix, roc_auc_score,
    mean_squared_error, r2_score, mean_absolute_error
)
import matplotlib.pyplot as plt
import seaborn as sns
 
# Set random seed for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

Jupyter Notebook Setup

Jupyter notebooks are ideal for ML experimentation because they combine code, visualizations, and narrative in a single document:

# Configure matplotlib for inline display
%matplotlib inline
plt.rcParams['figure.figsize'] = (10, 6)
plt.rcParams['font.size'] = 12
sns.set_style('whitegrid')
 
# Display all columns in pandas
pd.set_option('display.max_columns', None)
pd.set_option('display.width', None)

Data Preparation and Feature Engineering

Data quality determines model quality. The adage "garbage in, garbage out" applies directly to machine learning.

Loading and Exploring Data

from sklearn.datasets import load_iris, fetch_california_housing
 
# Load the Iris dataset (classification)
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.Series(iris.target, name='species')
 
# Basic exploration
print(f"Dataset shape: {X.shape}")
print(f"Class distribution:\n{y.value_counts()}")
print(f"\nFeature statistics:\n{X.describe()}")
print(f"\nMissing values:\n{X.isnull().sum()}")
 
# Visualize feature distributions
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
for idx, col in enumerate(X.columns):
    ax = axes[idx // 2, idx % 2]
    for species in range(3):
        X[y == species][col].hist(alpha=0.5, ax=ax, label=iris.target_names[species])
    ax.set_title(col)
    ax.legend()
plt.tight_layout()
plt.show()
 
# Correlation heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(X.corr(), annot=True, cmap='coolwarm', center=0)
plt.title('Feature Correlations')
plt.show()

Handling Missing Values

Real-world datasets almost always have missing values. The strategy you choose significantly impacts model performance:

from sklearn.impute import SimpleImputer, KNNImputer
 
# Strategy 1: Simple imputation with mean/median/mode
imputer_mean = SimpleImputer(strategy='mean')
X_imputed = pd.DataFrame(
    imputer_mean.fit_transform(X),
    columns=X.columns
)
 
# Strategy 2: KNN imputation (uses nearest neighbors)
knn_imputer = KNNImputer(n_neighbors=5)
X_knn_imputed = pd.DataFrame(
    knn_imputer.fit_transform(X),
    columns=X.columns
)
 
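# Strategy 3: Add missing indicator features
# (Iris has no missing values, so the mask is empty here; by default only
# columns that contained missing values during fit get an indicator)
from sklearn.impute import MissingIndicator
indicator = MissingIndicator()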
X_missing_mask = indicator.fit_transform(X)
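
Because Iris contains no gaps, the effect of imputation is easier to see on a tiny array that does. The data below is hypothetical, and median imputation with `add_indicator=True` is just one reasonable combination.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data with one gap in each column
X_gaps = np.array([
    [1.0, 10.0],
    [np.nan, 20.0],
    [3.0, np.nan],
    [5.0, 40.0],
])

# Median imputation, plus one 0/1 indicator column per feature that had gaps
imputer = SimpleImputer(strategy="median", add_indicator=True)
X_filled = imputer.fit_transform(X_gaps)

# Columns: feature 0, feature 1, "feature 0 was missing", "feature 1 was missing"
print(X_filled)
# [[ 1. 10.  0.  0.]
#  [ 3. 20.  1.  0.]
#  [ 3. 20.  0.  1.]
#  [ 5. 40.  0.  0.]]
```

The indicator columns let a downstream model learn whether the fact that a value was missing is itself informative.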

Feature Scaling

Many algorithms perform significantly better with scaled features. Distance-based algorithms (KNN, SVM, K-Means) and gradient-based algorithms (neural networks, logistic regression) are particularly sensitive to feature scales:

from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
 
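# StandardScaler: zero mean, unit variance (a common default)
# Fitted on the full dataset here for illustration only; in a real workflow,
# fit on the training split and reuse the fitted scaler on the test split.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)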
 
# MinMaxScaler: scale to [0, 1] range (good for neural networks)
minmax = MinMaxScaler()
X_minmax = minmax.fit_transform(X)
 
# RobustScaler: uses median and IQR (resistant to outliers)
robust = RobustScaler()
X_robust = robust.fit_transform(X)
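
To make the arithmetic behind the first two scalers concrete, the sketch below recomputes them by hand with NumPy and checks the result against scikit-learn. The small matrix is made up for the example.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical feature matrix: two columns on very different scales
X_demo = np.array([[1.0, 100.0], [2.0, 300.0], [3.0, 500.0], [4.0, 700.0]])

# Standardization: subtract the column mean, divide by the column std
manual_standard = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# Min-max scaling: map each column's minimum to 0 and its maximum to 1
col_min, col_max = X_demo.min(axis=0), X_demo.max(axis=0)
manual_minmax = (X_demo - col_min) / (col_max - col_min)

standard_matches = np.allclose(manual_standard, StandardScaler().fit_transform(X_demo))
minmax_matches = np.allclose(manual_minmax, MinMaxScaler().fit_transform(X_demo))
print(standard_matches, minmax_matches)  # True True
```

After either transformation both columns live on comparable scales, so a distance computation no longer lets the second column dominate.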

Encoding Categorical Variables

Machine learning algorithms require numerical input. Categorical variables must be encoded appropriately:

from sklearn.preprocessing import LabelEncoder, OneHotEncoder, OrdinalEncoder
 
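# Ordinal encoding for ordered categories, with the order passed explicitly.
# LabelEncoder is intended for target labels and sorts classes alphabetically
# (high=0, low=1, medium=2), which would not preserve low < medium < high.
ordinal = OrdinalEncoder(categories=[['low', 'medium', 'high']])
levels = np.array(['low', 'medium', 'high', 'medium', 'low']).reshape(-1, 1)
y_encoded = ordinal.fit_transform(levels).ravel()  # [0., 1., 2., 1., 0.]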
 
# One-hot encoding for nominal categories
encoder = OneHotEncoder(sparse_output=False, drop='first')
categories = np.array(['red', 'blue', 'green', 'red', 'blue']).reshape(-1, 1)
encoded = encoder.fit_transform(categories)
print(encoder.get_feature_names_out(['color']))
 
# Using pandas get_dummies (convenient for DataFrames)
df = pd.DataFrame({'color': ['red', 'blue', 'green'], 'size': ['S', 'M', 'L']})
df_encoded = pd.get_dummies(df, drop_first=True)
print(df_encoded)

Train-Test Split

Always split data before any preprocessing that uses statistics from the data. This prevents data leakage where information from the test set influences training:

from sklearn.model_selection import train_test_split
 
X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=RANDOM_STATE,
    stratify=y  # Maintain class distribution in both sets
)
 
print(f"Training set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")
 
# Scale AFTER splitting (fit on train, transform both)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

Supervised Learning: Classification

Classification is one of the most common ML tasks. Let's explore multiple algorithms and understand when to use each.

Logistic Regression

Despite its name, logistic regression is a classification algorithm. It models the probability of class membership using a logistic function:

from sklearn.linear_model import LogisticRegression
 
lr_model = LogisticRegression(
    max_iter=1000,
    random_state=RANDOM_STATE,
    C=1.0  # Regularization strength (inverse)
)
lr_model.fit(X_train_scaled, y_train)
lr_predictions = lr_model.predict(X_test_scaled)
lr_probabilities = lr_model.predict_proba(X_test_scaled)
 
print(f"Logistic Regression Accuracy: {accuracy_score(y_test, lr_predictions):.4f}")
print(f"\nClassification Report:")
print(classification_report(y_test, lr_predictions, target_names=iris.target_names))
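
The logistic function itself is short enough to write out. The sketch below covers the binary case on synthetic data (an assumption for this example; with three Iris classes scikit-learn uses the softmax generalization instead) and checks that applying the function to the linear score reproduces `predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

def sigmoid(z):
    """Logistic function: maps any real score to a value in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic binary problem (assumed for illustration)
X_bin, y_bin = make_classification(n_samples=200, n_features=4, random_state=42)
clf = LogisticRegression(max_iter=1000).fit(X_bin, y_bin)

# Linear score for each sample: w . x + b
scores = X_bin @ clf.coef_[0] + clf.intercept_[0]
manual_proba = sigmoid(scores)

# Should match the positive-class column of predict_proba
proba_matches = np.allclose(manual_proba, clf.predict_proba(X_bin)[:, 1])
print(proba_matches)
print(sigmoid(0.0))  # a score of 0 corresponds to probability 0.5
```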

Random Forest

Random Forest builds multiple decision trees on random subsets of data and features, then aggregates their predictions. It handles non-linear relationships well and provides feature importance scores:

from sklearn.ensemble import RandomForestClassifier
 
rf_model = RandomForestClassifier(
    n_estimators=200,
    max_depth=10,
    min_samples_split=5,
    min_samples_leaf=2,
    random_state=RANDOM_STATE,
    n_jobs=-1  # Use all CPU cores
)
rf_model.fit(X_train_scaled, y_train)
rf_predictions = rf_model.predict(X_test_scaled)
 
print(f"Random Forest Accuracy: {accuracy_score(y_test, rf_predictions):.4f}")
 
# Feature importance
feature_importance = pd.DataFrame({
    'feature': iris.feature_names,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)
print(f"\nFeature Importance:\n{feature_importance}")
 
# Visualize feature importance
plt.figure(figsize=(8, 4))
sns.barplot(data=feature_importance, x='importance', y='feature')
plt.title('Random Forest Feature Importance')
plt.tight_layout()
plt.show()
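
The aggregation step can be inspected directly, since a fitted forest exposes its trees through `estimators_`. This sketch, using a small forest of 25 trees chosen only to keep it fast, averages the per-tree class probabilities and compares the result with the forest's own output.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=25, random_state=42).fit(X, y)

# Each fitted tree is available in estimators_; collect their class probabilities
per_tree_proba = np.stack([tree.predict_proba(X) for tree in forest.estimators_])

# Aggregation: average over trees, then pick the most probable class
averaged_proba = per_tree_proba.mean(axis=0)
manual_predictions = forest.classes_[averaged_proba.argmax(axis=1)]

proba_matches = np.allclose(averaged_proba, forest.predict_proba(X))
print("Averaged tree probabilities match the forest:", proba_matches)
print("Shape (trees, samples, classes):", per_tree_proba.shape)
```

Averaging probabilities rather than counting hard votes is how scikit-learn's implementation combines trees; other random forest implementations may use majority voting.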

Support Vector Machine

SVM finds the optimal hyperplane that maximizes the margin between classes. It's particularly effective in high-dimensional spaces:

from sklearn.svm import SVC
 
svm_model = SVC(
    kernel='rbf',
    C=1.0,
    gamma='scale',
    probability=True,
    random_state=RANDOM_STATE
)
svm_model.fit(X_train_scaled, y_train)
svm_predictions = svm_model.predict(X_test_scaled)
 
print(f"SVM Accuracy: {accuracy_score(y_test, svm_predictions):.4f}")
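
The idea of a maximum-margin hyperplane is easiest to see with a linear kernel on a handful of points. The four points below are hypothetical, and the very large `C` is an assumption used to approximate a hard margin; for this layout the boundary should land near x = 1 with a margin width close to 2.

```python
import numpy as np
from sklearn.svm import SVC

# Four hypothetical points: class 0 sits on x=0, class 1 on x=2
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y_toy = np.array([0, 0, 1, 1])

# A very large C approximates a hard margin on separable data
svm_linear = SVC(kernel="linear", C=1e6).fit(X_toy, y_toy)

w = svm_linear.coef_[0]             # normal vector of the hyperplane
b = svm_linear.intercept_[0]        # offset
margin_width = 2.0 / np.linalg.norm(w)

print("Hyperplane normal w:", np.round(w, 3))
print("Offset b:", round(float(b), 3))
print("Margin width:", round(float(margin_width), 3))
print("Number of support vectors:", len(svm_linear.support_vectors_))
```

With the RBF kernel used above, the same principle applies in an implicit feature space, so `coef_` is not available and the boundary is no longer a straight line in the original features.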

Gradient Boosting

Gradient Boosting builds trees sequentially, with each tree correcting errors from previous ones. XGBoost and LightGBM are optimized implementations:

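from sklearn.ensemble import GradientBoostingClassifier
# xgboost is an optional install (see Installation); LightGBM's LGBMClassifier
# follows the same fit/predict interface and can be swapped in the same way.
from xgboost import XGBClassifier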
 
# Scikit-learn gradient boosting
gb_model = GradientBoostingClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=3,
    random_state=RANDOM_STATE
)
gb_model.fit(X_train_scaled, y_train)
print(f"Gradient Boosting Accuracy: {accuracy_score(y_test, gb_model.predict(X_test_scaled)):.4f}")
 
# XGBoost
xgb_model = XGBClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=3,
    random_state=RANDOM_STATE,
    eval_metric='mlogloss'
)
xgb_model.fit(X_train_scaled, y_train)
print(f"XGBoost Accuracy: {accuracy_score(y_test, xgb_model.predict(X_test_scaled)):.4f}")
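
The phrase "each tree correcting errors from previous ones" can be made concrete with a hand-rolled loop. This is a simplified sketch for squared-error regression on synthetic data, not what the libraries above do internally in full: each shallow tree is fitted to the current residuals and added with a small learning rate.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data (assumed for illustration)
rng = np.random.default_rng(42)
X_demo = np.sort(rng.uniform(0, 6, size=200)).reshape(-1, 1)
y_demo = np.sin(X_demo).ravel() + rng.normal(0, 0.1, size=200)

learning_rate = 0.1
prediction = np.full_like(y_demo, y_demo.mean())  # start from a constant model
mse_history = [float(np.mean((y_demo - prediction) ** 2))]

for _ in range(50):
    residuals = y_demo - prediction        # what the ensemble still gets wrong
    tree = DecisionTreeRegressor(max_depth=2, random_state=42)
    tree.fit(X_demo, residuals)            # the next tree models those errors
    prediction += learning_rate * tree.predict(X_demo)
    mse_history.append(float(np.mean((y_demo - prediction) ** 2)))

print(f"Training MSE after  0 trees: {mse_history[0]:.4f}")
print(f"Training MSE after 50 trees: {mse_history[-1]:.4f}")
```

The training error shrinks with every added tree, which is also why boosting needs a held-out set or early stopping to notice when it starts fitting noise.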

Model Evaluation Metrics

Choosing the right evaluation metric is critical. Accuracy alone can be misleading, especially with imbalanced datasets.

Classification Metrics

from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score,
    classification_report, confusion_matrix, roc_auc_score, roc_curve
)
 
y_pred = rf_predictions
y_true = y_test
 
print(f"Accuracy:  {accuracy_score(y_true, y_pred):.4f}")
print(f"Precision: {precision_score(y_true, y_pred, average='weighted'):.4f}")
print(f"Recall:    {recall_score(y_true, y_pred, average='weighted'):.4f}")
print(f"F1 Score:  {f1_score(y_true, y_pred, average='weighted'):.4f}")
 
# Confusion matrix visualization
cm = confusion_matrix(y_true, y_pred)
plt.figure(figsize=(8, 6))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=iris.target_names,
            yticklabels=iris.target_names)
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
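
To see why accuracy can mislead, it helps to compute the metrics by hand for a binary problem. The labels and predictions below are invented for the example; the Iris code above uses weighted averages over three classes, while this sketch scores only the positive class.

```python
# Hypothetical imbalanced test set: 95 negatives (0) and 5 positives (1)
y_true = [0] * 95 + [1] * 5

# A "model" that always predicts the majority class
y_pred_majority = [0] * 100

# A model that finds 4 of the 5 positives at the cost of 3 false alarms
y_pred_model = [1] * 3 + [0] * 92 + [1] * 4 + [0] * 1

def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall, and F1 for the positive class (1)."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

baseline = binary_metrics(y_true, y_pred_majority)
model = binary_metrics(y_true, y_pred_model)
print("Majority baseline:", baseline)  # accuracy 0.95, recall 0.0
print("Actual model:     ", model)     # accuracy 0.96, recall 0.8
```

The two accuracies differ by one percentage point, yet the baseline never finds a single positive case. Recall and F1 expose that difference immediately.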

Regression Metrics

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.datasets import fetch_california_housing
 
housing = fetch_california_housing()
X_reg = pd.DataFrame(housing.data, columns=housing.feature_names)
y_reg = housing.target
 
X_reg_train, X_reg_test, y_reg_train, y_reg_test = train_test_split(
    X_reg, y_reg, test_size=0.2, random_state=RANDOM_STATE
)
 
scaler_reg = StandardScaler()
X_reg_train_scaled = scaler_reg.fit_transform(X_reg_train)
X_reg_test_scaled = scaler_reg.transform(X_reg_test)
 
reg_model = GradientBoostingRegressor(n_estimators=200, max_depth=4, random_state=RANDOM_STATE)
reg_model.fit(X_reg_train_scaled, y_reg_train)
y_reg_pred = reg_model.predict(X_reg_test_scaled)
 
print(f"R² Score:  {r2_score(y_reg_test, y_reg_pred):.4f}")
print(f"RMSE:      {np.sqrt(mean_squared_error(y_reg_test, y_reg_pred)):.4f}")
print(f"MAE:       {mean_absolute_error(y_reg_test, y_reg_pred):.4f}")
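
The three regression metrics are simple enough to verify on paper. The sketch below computes them with NumPy for four made-up predictions.

```python
import numpy as np

# Hypothetical targets and predictions
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.0, 5.0, 8.0, 11.0])

errors = y_true - y_pred
mae = np.mean(np.abs(errors))                    # average absolute error
mse = np.mean(errors ** 2)                       # squaring weights large errors more
rmse = np.sqrt(mse)                              # back in the units of the target
ss_res = np.sum(errors ** 2)                     # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation around the mean
r2 = 1.0 - ss_res / ss_tot                       # share of variation explained

print(f"MAE:  {mae:.4f}")   # 1.0000
print(f"RMSE: {rmse:.4f}")  # 1.2247
print(f"R²:   {r2:.4f}")    # 0.7000
```

The single error of size 2 contributes half of the absolute error but two thirds of the squared error, which is why RMSE comes out above MAE.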

Cross-Validation

Cross-validation provides more robust performance estimates than a single train-test split:

from sklearn.model_selection import cross_val_score, StratifiedKFold
 
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
cv_scores = cross_val_score(rf_model, X_train_scaled, y_train, cv=cv, scoring='accuracy')
 
print(f"CV Scores: {cv_scores}")
print(f"Mean: {cv_scores.mean():.4f} (+/- {cv_scores.std() * 2:.4f})")
 
# Compare multiple models with cross-validation
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=RANDOM_STATE),
    'Random Forest': RandomForestClassifier(n_estimators=100, random_state=RANDOM_STATE),
    'SVM': SVC(kernel='rbf', random_state=RANDOM_STATE),
    'Gradient Boosting': GradientBoostingClassifier(n_estimators=100, random_state=RANDOM_STATE),
}
 
results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_train_scaled, y_train, cv=cv, scoring='accuracy')
    results[name] = scores
    print(f"{name}: {scores.mean():.4f} (+/- {scores.std() * 2:.4f})")
 
# Visualize model comparison
plt.figure(figsize=(10, 6))
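plt.boxplot(list(results.values()))
plt.title('Model Comparison (5-Fold CV)')
plt.ylabel('Accuracy')
# Label the boxes via xticks; boxplot's `labels` argument is deprecated in newer Matplotlib
plt.xticks(range(1, len(results) + 1), list(results.keys()), rotation=15)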
plt.show()
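
What `cross_val_score` does under the hood can be sketched with plain NumPy. The example below uses synthetic data and a hand-written nearest-centroid classifier, both assumptions made to keep it dependency-light. It is plain k-fold; `StratifiedKFold` additionally keeps class ratios equal across folds.

```python
import numpy as np

rng = np.random.default_rng(42)

# Two well-separated synthetic blobs (assumed data for illustration)
X_demo = np.vstack([rng.normal(0, 1, size=(50, 2)), rng.normal(5, 1, size=(50, 2))])
y_demo = np.array([0] * 50 + [1] * 50)

def nearest_centroid_accuracy(X_tr, y_tr, X_va, y_va):
    """Fit class centroids on the training fold, score on the validation fold."""
    centroids = np.array([X_tr[y_tr == c].mean(axis=0) for c in (0, 1)])
    distances = np.linalg.norm(X_va[:, None, :] - centroids[None, :, :], axis=2)
    return float(np.mean(distances.argmin(axis=1) == y_va))

k = 5
folds = np.array_split(rng.permutation(len(X_demo)), k)  # k disjoint index sets

scores = []
for i, val_idx in enumerate(folds):
    # Every fold serves as validation data once; the rest is training data
    train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
    scores.append(nearest_centroid_accuracy(
        X_demo[train_idx], y_demo[train_idx], X_demo[val_idx], y_demo[val_idx]
    ))

print("Fold scores:", scores)
print(f"Mean: {np.mean(scores):.3f} (+/- {2 * np.std(scores):.3f})")
```

Each sample is used for validation exactly once, so the mean score draws on all of the data without ever scoring a model on points it was trained on.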

[Figure: Model evaluation visualization]

Hyperparameter Tuning

Hyperparameters control the learning process itself. Proper tuning can significantly improve model performance.

from sklearn.model_selection import GridSearchCV
 
param_grid = {
    'n_estimators': [100, 200, 300],
    'max_depth': [3, 5, 10, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}
 
grid_search = GridSearchCV(
    RandomForestClassifier(random_state=RANDOM_STATE),
    param_grid,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    verbose=1
)
grid_search.fit(X_train_scaled, y_train)
 
print(f"Best parameters: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")
print(f"Test score: {accuracy_score(y_test, grid_search.predict(X_test_scaled)):.4f}")
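
Random search samples a fixed number of configurations from the given distributions, which is usually cheaper than an exhaustive grid when the search space is large:

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import randint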
 
param_distributions = {
    'n_estimators': randint(50, 500),
    'max_depth': [3, 5, 10, 15, 20, None],
    'min_samples_split': randint(2, 20),
    'min_samples_leaf': randint(1, 10),
    'max_features': ['sqrt', 'log2', None],
}
 
random_search = RandomizedSearchCV(
    RandomForestClassifier(random_state=RANDOM_STATE),
    param_distributions,
    n_iter=50,
    cv=5,
    scoring='accuracy',
    n_jobs=-1,
    random_state=RANDOM_STATE
)
random_search.fit(X_train_scaled, y_train)
 
print(f"Best parameters: {random_search.best_params_}")
print(f"Best CV score: {random_search.best_score_:.4f}")
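
The cost difference between the two strategies is easy to quantify. This sketch counts the cross-validation fits each search performs, reusing the grid from the grid search example. It leaves out the one extra refit on the full training data that both searches do by default.

```python
from sklearn.model_selection import ParameterGrid

# Same grid as in the grid search example
param_grid = {
    "n_estimators": [100, 200, 300],
    "max_depth": [3, 5, 10, None],
    "min_samples_split": [2, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

cv_folds = 5
n_grid_candidates = len(ParameterGrid(param_grid))  # 3 * 4 * 3 * 3 = 108
grid_fits = n_grid_candidates * cv_folds

n_random_candidates = 50  # n_iter in the randomized search
random_fits = n_random_candidates * cv_folds

print(f"Grid search:   {n_grid_candidates} candidates, {grid_fits} fits")    # 108, 540
print(f"Random search: {n_random_candidates} candidates, {random_fits} fits")  # 50, 250
```

Adding one more hyperparameter with three values would triple the grid's cost, while the random search budget stays at whatever `n_iter` is set to.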

Handling Imbalanced Datasets

Real-world datasets often have imbalanced class distributions. A fraud detection model might see 99.9% legitimate transactions.

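# imbalanced-learn is a separate package: pip install imbalanced-learn
from sklearn.utils.class_weight import compute_class_weight
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline as ImbPipeline
 
# Method 1: Class weights
# 'balanced' weights each class by n_samples / (n_classes * class_count)
class_weights = compute_class_weight('balanced', classes=np.unique(y_train), y=y_train)
weight_dict = dict(zip(np.unique(y_train), class_weights))
rf_weighted = RandomForestClassifier(
    n_estimators=200,
    class_weight=weight_dict,  # same weights as class_weight='balanced'
    random_state=RANDOM_STATE
)
rf_weighted.fit(X_train_scaled, y_train)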
 
# Method 2: SMOTE oversampling
smote_pipeline = ImbPipeline([
    ('smote', SMOTE(random_state=RANDOM_STATE)),
    ('classifier', RandomForestClassifier(n_estimators=200, random_state=RANDOM_STATE))
])
smote_pipeline.fit(X_train_scaled, y_train)
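
Since the Iris classes are perfectly balanced, the weights computed above all come out as 1. The sketch below uses hypothetical labels with a 90/10 split to show what the 'balanced' heuristic produces and how to reproduce it by hand.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 90 majority-class and 10 minority-class samples
y_demo = np.array([0] * 90 + [1] * 10)
classes = np.unique(y_demo)

# weight_c = n_samples / (n_classes * count_c)
counts = np.bincount(y_demo)
manual_weights = len(y_demo) / (len(classes) * counts)

sklearn_weights = compute_class_weight("balanced", classes=classes, y=y_demo)

print("Manual: ", np.round(manual_weights, 4))   # [0.5556 5.    ]
print("sklearn:", np.round(sklearn_weights, 4))  # [0.5556 5.    ]
```

A minority-class mistake therefore costs nine times as much as a majority-class one during training, which pushes the model away from always predicting the majority.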

Common Pitfalls

Pitfall | Impact | Solution
Data leakage | Inflated performance metrics that don't generalize | Split data before preprocessing; fit scalers only on training data
Not scaling features | Poor performance for distance/gradient-based algorithms | Apply StandardScaler or MinMaxScaler after train-test split
Ignoring class imbalance | Model biased toward majority class | Use class weights, SMOTE, or stratified sampling
Overfitting | Model memorizes training data, fails on new data | Cross-validation, regularization, simpler models, more data
Feature/target leakage | Unrealistic accuracy from features that encode the target | Careful feature engineering; verify features are available at prediction time
Evaluating on training data | Misleading performance estimates | Always evaluate on held-out test set or use cross-validation
Tuning on test set | Optimistic bias in final evaluation | Use validation set or nested cross-validation

Best Practices

  1. Always split data before preprocessing — fit scalers and encoders only on training data to prevent data leakage
  2. Use cross-validation — single train/test splits can give misleading results due to random variation
  3. Start simple — begin with logistic regression or decision trees before trying complex models; establish a baseline
  4. Feature engineering matters — domain-informed features often improve performance more than algorithm choice
  5. Handle imbalanced data proactively — use class weights, resampling, or appropriate metrics (F1, AUC) instead of accuracy
  6. Monitor for overfitting — large gaps between training and validation performance indicate the model isn't generalizing
  7. Version your experiments — track data versions, hyperparameters, and results for reproducibility
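
The train/validation gap from point 6 is simple to measure. The sketch below compares an unconstrained decision tree with a depth-limited one. The synthetic dataset with deliberately noisy labels and the depth limit of 3 are assumptions picked to make the effect visible.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with 20% of labels reassigned at random (assumed for illustration)
X_noisy, y_noisy = make_classification(
    n_samples=600, n_features=10, n_informative=4, flip_y=0.2, random_state=42
)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_noisy, y_noisy, test_size=0.3, random_state=42, stratify=y_noisy
)

results = {}
for name, depth in [("unlimited", None), ("depth_3", 3)]:
    # max_depth=None lets the tree grow until every leaf is pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=42).fit(X_tr, y_tr)
    results[name] = {"train": tree.score(X_tr, y_tr), "val": tree.score(X_va, y_va)}
    gap = results[name]["train"] - results[name]["val"]
    print(f"{name}: train={results[name]['train']:.3f}, "
          f"validation={results[name]['val']:.3f}, gap={gap:.3f}")
```

The unconstrained tree fits the training set perfectly, noise included, so its training score says nothing about how it will behave on new data.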

Python ML Environment Setup

Setting up a proper Python machine learning environment requires careful dependency management. Use conda or virtual environments to isolate project dependencies. Install core packages like NumPy, Pandas, scikit-learn, and Matplotlib using pip or conda. For deep learning, install PyTorch or TensorFlow with GPU support if available. Use Jupyter notebooks for exploratory analysis and experimentation, but structure production code as Python modules with proper testing. Version control your data and models using DVC (Data Version Control) alongside your code in Git.

Model Evaluation Metrics

Choose evaluation metrics that align with your problem type and business objectives. For classification tasks, accuracy alone can be misleading with imbalanced datasets — use precision, recall, F1-score, and AUC-ROC instead. For regression tasks, MSE penalizes large errors more than MAE, making it suitable when outlier errors are costly. Use cross-validation to get reliable performance estimates that generalize to unseen data. Always establish a baseline model (like logistic regression or mean prediction) to contextualize your advanced model's performance improvements.

Conclusion

Machine learning with Python and scikit-learn provides a powerful, consistent toolkit for building predictive models. The key to success isn't choosing the fanciest algorithm — it's understanding your data, engineering good features, and evaluating rigorously with cross-validation and appropriate metrics.

Start with supervised learning problems where labeled data is available. Build a baseline model quickly, then iterate by improving features, tuning hyperparameters, and trying different algorithms. Use cross-validation to compare models fairly and watch for overfitting throughout the process. As you gain experience, explore unsupervised learning for pattern discovery and reinforcement learning for sequential decision-making problems. The fundamentals covered here form the foundation for everything else in machine learning, from deep learning to production ML systems.