This document provides a comprehensive reference for the mathematical and theoretical concepts underlying the machine learning curriculum (Days 38-67). Each section builds from first principles to practical applications in modern ML systems.
Linear algebra provides the mathematical language for representing and manipulating data in machine learning.
A vector is an ordered collection of numbers that can represent:
Notation: A vector v in ℝⁿ is written as:
v = [v₁, v₂, ..., vₙ]ᵀ
Key Operations:
Dot Product: v · w = Σᵢ vᵢwᵢ = ||v|| ||w|| cos(θ)
Norm (Length): ||v|| = √(Σᵢ vᵢ²)
ML Application: Feature vectors represent observations; dot products measure similarity between data points.
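A minimal NumPy sketch of these operations, using illustrative values (not lesson code):

```python
import numpy as np

# Two feature vectors representing observations (illustrative values).
v = np.array([1.0, 2.0, 3.0])
w = np.array([4.0, 5.0, 6.0])

dot = v @ w                       # Σᵢ vᵢwᵢ = 32.0
norm_v = np.linalg.norm(v)        # ||v|| = √(1 + 4 + 9) ≈ 3.742
cos_theta = dot / (np.linalg.norm(v) * np.linalg.norm(w))  # cosine similarity
print(dot, norm_v, cos_theta)
```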
A matrix is a rectangular array of numbers with dimensions m × n (m rows, n columns).
Notation:
A = [aᵢⱼ] where i ∈ {1,...,m}, j ∈ {1,...,n}
Key Operations:
ML Application: Datasets are stored as matrices where rows are observations and columns are features. Matrix multiplication implements linear transformations and neural network layers.
For a square matrix A, an eigenvector v and eigenvalue λ satisfy:
Av = λv
This means A stretches v by a factor of λ without changing its direction.
Eigendecomposition (for symmetric A):
A = QΛQᵀ
where Q is an orthogonal matrix whose columns are the eigenvectors and Λ is a diagonal matrix of eigenvalues. (For a general diagonalizable matrix, this becomes A = QΛQ⁻¹.)
ML Application: PCA diagonalizes the data covariance matrix via eigendecomposition (see the PCA section below).
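A small NumPy check of the eigendecomposition on a symmetric matrix (illustrative values):

```python
import numpy as np

# A symmetric matrix, so A = QΛQᵀ holds with orthogonal Q.
A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

eigvals, Q = np.linalg.eigh(A)   # eigh is for symmetric/Hermitian matrices
Lam = np.diag(eigvals)

# Verify Av = λv for the first eigenpair, and the full reconstruction.
v, lam = Q[:, 0], eigvals[0]
assert np.allclose(A @ v, lam * v)
assert np.allclose(Q @ Lam @ Q.T, A)
```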
For any m × n matrix A:
A = UΣVᵀ
where U (m × m) contains the left singular vectors, Σ (m × n) is diagonal with non-negative singular values, and V (n × n) contains the right singular vectors.
ML Application: truncated SVD gives the best low-rank approximation of a matrix, which underpins dimensionality reduction; PCA can be computed via the SVD of the centered data matrix.
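A NumPy sketch of SVD and a low-rank approximation, on synthetic data:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
assert np.allclose(U @ np.diag(s) @ Vt, A)      # exact reconstruction

# Rank-2 approximation: keep the two largest singular values.
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.linalg.norm(A - A_k))                   # Frobenius-norm error
```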
Calculus enables us to find optimal model parameters by following gradients of loss functions.
The derivative measures the rate of change of a function:
f'(x) = lim(h→0) [f(x + h) - f(x)] / h
Common Derivatives:
d/dx xⁿ = nxⁿ⁻¹, d/dx eˣ = eˣ, d/dx ln x = 1/x
ML Application: Derivatives tell us how much the loss changes when we adjust a parameter.
For multivariate functions f(x₁, x₂, …, xₙ), the partial derivative with respect to xᵢ is:
∂f/∂xᵢ = lim(h→0) [f(..., xᵢ + h, ...) - f(..., xᵢ, ...)] / h
The gradient is the vector of all partial derivatives:
∇f = [∂f/∂x₁, ∂f/∂x₂, ..., ∂f/∂xₙ]ᵀ
Properties:
∇f points in the direction of steepest ascent (and -∇f in the direction of steepest descent); ∇f = 0 at stationary points.
ML Application: Gradient descent algorithms follow -∇L(θ) to minimize the loss function L with respect to parameters θ.
The chain rule enables differentiation of composite functions:
d/dx [f(g(x))] = f'(g(x)) · g'(x)
For multivariate compositions:
∂z/∂x = (∂z/∂y)(∂y/∂x)
ML Application: Backpropagation in neural networks repeatedly applies the chain rule to compute gradients through multiple layers.
Gradient Descent is an iterative optimization algorithm:
θₜ₊₁ = θₜ - α∇L(θₜ)
where:
θₜ is the parameter vector at step t, α is the learning rate, and ∇L(θₜ) is the gradient of the loss at θₜ.
Variants:
Batch Gradient Descent: Uses entire dataset to compute gradient
∇L(θ) = (1/n)Σᵢ ∇Lᵢ(θ)
Stochastic Gradient Descent (SGD): Uses single random example
θₜ₊₁ = θₜ - α∇Lᵢ(θₜ)
Mini-batch Gradient Descent: Uses small batch of examples
∇L(θ) = (1/|B|)Σ_{i∈B} ∇Lᵢ(θ)
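A minimal mini-batch gradient descent loop for linear regression with MSE loss, on synthetic data (hyperparameters are illustrative). Setting the batch size to the dataset size recovers batch gradient descent; setting it to 1 recovers SGD:

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
true_theta = np.array([1.0, -2.0, 0.5])
y = X @ true_theta + 0.1 * rng.normal(size=200)

theta = np.zeros(3)
alpha, batch_size = 0.1, 32

for epoch in range(100):
    perm = rng.permutation(len(X))            # shuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = (2 / len(idx)) * Xb.T @ (Xb @ theta - yb)  # mini-batch ∇L(θ)
        theta -= alpha * grad

print(theta)   # ≈ [1.0, -2.0, 0.5]
```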
Advanced Optimizers:
Momentum: Accumulates velocity to smooth updates
vₜ = βvₜ₋₁ + ∇L(θₜ)
θₜ₊₁ = θₜ - αvₜ
Adam: Adaptive learning rates with momentum
mₜ = β₁mₜ₋₁ + (1-β₁)∇L(θₜ)
vₜ = β₂vₜ₋₁ + (1-β₂)(∇L(θₜ))²
θₜ₊₁ = θₜ - α·m̂ₜ/(√v̂ₜ + ε)
where m̂ₜ = mₜ/(1-β₁ᵗ) and v̂ₜ = vₜ/(1-β₂ᵗ) are the bias-corrected moment estimates.
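A NumPy sketch of a single Adam step, applied to the toy objective f(θ) = ||θ||² (step size and iteration count are illustrative):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, alpha=0.1,
              beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns new (theta, m, v)."""
    m = beta1 * m + (1 - beta1) * grad            # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad**2         # second-moment estimate
    m_hat = m / (1 - beta1**t)                    # bias correction
    v_hat = v / (1 - beta2**t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(θ) = ||θ||², whose gradient is 2θ.
theta = np.array([3.0, -4.0])
m = v = np.zeros_like(theta)
for t in range(1, 501):
    grad = 2 * theta
    theta, m, v = adam_step(theta, grad, m, v, t)
print(theta)   # approaches [0, 0]
```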
A function f is convex if:
f(λx + (1-λ)y) ≤ λf(x) + (1-λ)f(y) for all λ ∈ [0,1]
Properties:
Every local minimum of a convex function is a global minimum; sums and non-negative scalings of convex functions are convex.
ML Application: Many ML loss functions are convex, guaranteeing convergence to optimal solutions.
Probability theory provides the foundation for reasoning about uncertainty in data and predictions.
Probability Axioms:
1. P(A) ≥ 0 for any event A
2. P(Ω) = 1 for the sample space Ω
3. P(A ∪ B) = P(A) + P(B) for disjoint events A and B
Conditional Probability:
P(A|B) = P(A ∩ B) / P(B)
Bayes’ Theorem:
P(A|B) = P(B|A)P(A) / P(B)
ML Application: Bayesian inference, Naive Bayes classifiers, probabilistic modeling.
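A worked Bayes' theorem example with hypothetical numbers (a diagnostic test with 1% prevalence, 95% sensitivity, 90% specificity):

```python
# Hypothetical numbers, chosen only to illustrate the calculation.
p_disease = 0.01
p_pos_given_disease = 0.95      # P(B|A): sensitivity
p_pos_given_healthy = 0.10      # false positive rate = 1 - specificity

# P(B): total probability of a positive test (law of total probability).
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_healthy * (1 - p_disease))

# Bayes' theorem: P(A|B) = P(B|A)P(A) / P(B)
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(p_disease_given_pos)      # ≈ 0.088: even a positive test leaves low probability
```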
A random variable X maps outcomes to real numbers.
Expectation (Mean):
E[X] = Σᵢ xᵢP(X = xᵢ) (discrete)
E[X] = ∫ xf(x)dx (continuous)
Variance:
Var(X) = E[(X - E[X])²] = E[X²] - (E[X])²
Common Distributions:
Normal (Gaussian):
f(x) = (1/√(2πσ²))exp(-(x-μ)²/(2σ²))
Bernoulli:
P(X = 1) = p, P(X = 0) = 1-p
Categorical (single draw of a multinomial):
P(X = k) = pₖ where Σₖ pₖ = 1
Poisson:
P(X = k) = λᵏe^(−λ) / k!
Find parameters θ that maximize the likelihood of observed data:
θ̂_MLE = argmax_θ L(θ|x) = argmax_θ P(x|θ)
Often use log-likelihood for convenience:
θ̂_MLE = argmax_θ log L(θ|x) = argmax_θ Σᵢ log P(xᵢ|θ)
ML Application: Training most ML models is MLE under specific distributional assumptions.
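A tiny MLE sketch for a Bernoulli model: the closed-form MLE is the sample mean, which a grid search over the log-likelihood confirms (data values are illustrative):

```python
import numpy as np

# Observed coin flips (1 = heads). Under Bernoulli(p), the log-likelihood
# Σᵢ log P(xᵢ|p) is maximized at p̂ = sample mean.
x = np.array([1, 0, 1, 1, 0, 1, 1, 0, 1, 1])
p_hat = x.mean()                                   # closed-form MLE

# Numerical check: p̂ maximizes the log-likelihood over a grid of p values.
grid = np.linspace(0.01, 0.99, 99)
loglik = x.sum() * np.log(grid) + (len(x) - x.sum()) * np.log(1 - grid)
print(p_hat, grid[np.argmax(loglik)])              # both ≈ 0.7
```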
Hypothesis Testing: formulate a null hypothesis H₀, compute a test statistic from the data, and reject H₀ when the p-value falls below the chosen significance level α.
Confidence Intervals:
CI = θ̂ ± z_(α/2) · SE(θ̂)
where SE is the standard error.
ML Application: Evaluating model significance, A/B testing, feature selection.
Supervised learning involves learning a mapping from inputs X to outputs Y given labeled training data.
Training Data: {(x₁, y₁), (x₂, y₂), …, (xₙ, yₙ)}
Goal: Learn function f: X → Y such that f(x) ≈ y
Types:
Regression: y is continuous
Classification: y is discrete (class labels)
Model:
ŷ = θ₀ + θ₁x₁ + θ₂x₂ + ... + θₚxₚ = θᵀx
Loss Function (Mean Squared Error):
L(θ) = (1/n)Σᵢ (yᵢ - θᵀxᵢ)²
Normal Equation (Closed Form):
θ̂ = (XᵀX)⁻¹Xᵀy
Gradient:
∇L(θ) = (2/n)Xᵀ(Xθ - y)
Assumptions:
Coefficient Interpretation:
Model (Binary Classification):
P(y = 1|x) = σ(θᵀx) = 1 / (1 + exp(−θᵀx))
where σ is the sigmoid function.
Decision Boundary:
Predict y = 1 if P(y = 1|x) ≥ 0.5
⟺ θᵀx ≥ 0
Loss Function (Binary Cross-Entropy):
L(θ) = -(1/n)Σᵢ [yᵢ log(ŷᵢ) + (1-yᵢ)log(1-ŷᵢ)]
Gradient:
∇L(θ) = (1/n)Xᵀ(σ(Xθ) - y)
Multi-class Extension (Softmax):
P(y = k|x) = exp(θₖᵀx) / Σⱼ exp(θⱼᵀx)
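A minimal gradient-descent sketch for binary logistic regression on synthetic, roughly linearly separable data (learning rate and iteration count are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2))
y = (X[:, 0] - X[:, 1] > 0).astype(float)   # labels from a linear rule

theta = np.zeros(2)
alpha = 0.5
for _ in range(500):
    p = sigmoid(X @ theta)
    grad = X.T @ (p - y) / len(y)           # ∇L(θ) = (1/n)Xᵀ(σ(Xθ) - y)
    theta -= alpha * grad

preds = sigmoid(X @ theta) >= 0.5           # decision boundary θᵀx = 0
print(theta, (preds == y.astype(bool)).mean())
```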
Goal: Find hyperplane that maximally separates classes with largest margin.
Hard Margin SVM:
minimize (1/2)||w||²
subject to yᵢ(wᵀxᵢ + b) ≥ 1 for all i
Soft Margin SVM (with slack variables ξᵢ):
minimize (1/2)||w||² + C·Σᵢ ξᵢ
subject to yᵢ(wᵀxᵢ + b) ≥ 1 - ξᵢ, ξᵢ ≥ 0
Kernel Trick: Map data to higher dimensions using kernel functions:
RBF (Gaussian): K(x, x′) = exp(−γ||x − x′||²)
ML Application: Effective for high-dimensional spaces, memory-efficient (uses support vectors only).
Splitting Criterion:
For regression, use Mean Squared Error:
MSE = (1/n)Σᵢ (yᵢ - ȳ)²
For classification, use Gini Impurity or Entropy:
Gini = 1 - Σₖ pₖ²
Entropy = -Σₖ pₖ log(pₖ)
Information Gain:
IG = H(parent) - Σ(|child|/|parent|)H(child)
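A short NumPy sketch computing Gini impurity, entropy, and the information gain of one candidate split (labels are illustrative):

```python
import numpy as np

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p**2)

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left, right = parent[:3], parent[3:]        # one candidate split

# IG = H(parent) - Σ (|child|/|parent|) H(child)
ig = entropy(parent) - (len(left) / len(parent) * entropy(left)
                        + len(right) / len(parent) * entropy(right))
print(gini(parent), entropy(parent), ig)    # 0.5, 1.0, ≈ 0.549
```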
Algorithm:
Advantages:
Disadvantages:
Neural networks are compositions of simple non-linear functions that can approximate complex mappings.
The basic building block:
z = Σᵢ wᵢxᵢ + b = wᵀx + b
a = g(z)
where:
x is the input vector, w the weights, b the bias, g the activation function, and a the neuron's output.
Purpose: Introduce non-linearity (without them, deep networks collapse to linear models).
Common Activations:
Sigmoid:
σ(z) = 1 / (1 + e⁻ᶻ)
σ'(z) = σ(z)(1 - σ(z))
Tanh:
tanh(z) = (eᶻ - e⁻ᶻ) / (eᶻ + e⁻ᶻ)
tanh'(z) = 1 - tanh²(z)
ReLU (Rectified Linear Unit):
ReLU(z) = max(0, z)
ReLU'(z) = 1 if z > 0, else 0
Leaky ReLU:
LeakyReLU(z) = max(αz, z) where α ≈ 0.01
Softmax (Output Layer for Multi-class):
softmax(zᵢ) = exp(zᵢ) / Σⱼ exp(zⱼ)
Architecture:
Layer 1: a⁽¹⁾ = g⁽¹⁾(W⁽¹⁾x + b⁽¹⁾)
Layer 2: a⁽²⁾ = g⁽²⁾(W⁽²⁾a⁽¹⁾ + b⁽²⁾)
...
Output: ŷ = a⁽ᴸ⁾
Universal Approximation Theorem: A neural network with a single hidden layer and enough neurons can approximate any continuous function arbitrarily well.
Forward Pass: Compute predictions layer by layer.
Backward Pass: Compute gradients using chain rule.
Output Layer Gradient:
δ⁽ᴸ⁾ = ∂L/∂z⁽ᴸ⁾ = (a⁽ᴸ⁾ - y) ⊙ g'(z⁽ᴸ⁾)
(for MSE loss; with a sigmoid or softmax output and cross-entropy loss, this simplifies to δ⁽ᴸ⁾ = a⁽ᴸ⁾ - y)
Hidden Layer Gradient:
δ⁽ˡ⁾ = (W⁽ˡ⁺¹⁾ᵀδ⁽ˡ⁺¹⁾) ⊙ g'(z⁽ˡ⁾)
Parameter Gradients:
∂L/∂W⁽ˡ⁾ = δ⁽ˡ⁾a⁽ˡ⁻¹⁾ᵀ
∂L/∂b⁽ˡ⁾ = δ⁽ˡ⁾
Update Rule:
W⁽ˡ⁾ ← W⁽ˡ⁾ - α(∂L/∂W⁽ˡ⁾)
b⁽ˡ⁾ ← b⁽ˡ⁾ - α(∂L/∂b⁽ˡ⁾)
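A compact end-to-end sketch of forward and backward passes for a one-hidden-layer network on a synthetic XOR-like task. The architecture, learning rate, and data are illustrative; it should reach high training accuracy:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 2))
y = ((X[:, 0] * X[:, 1]) > 0).astype(float).reshape(-1, 1)  # XOR-like labels

# One tanh hidden layer + sigmoid output, binary cross-entropy loss.
W1, b1 = 0.5 * rng.normal(size=(2, 8)), np.zeros(8)
W2, b2 = 0.5 * rng.normal(size=(8, 1)), np.zeros(1)
alpha = 0.5

for _ in range(2000):
    # Forward pass
    z1 = X @ W1 + b1
    a1 = np.tanh(z1)
    z2 = a1 @ W2 + b2
    a2 = 1 / (1 + np.exp(-z2))              # ŷ

    # Backward pass (chain rule). For sigmoid + BCE, δ⁽ᴸ⁾ = ŷ - y.
    d2 = (a2 - y) / len(X)
    dW2, db2 = a1.T @ d2, d2.sum(axis=0)
    d1 = (d2 @ W2.T) * (1 - a1**2)          # ⊙ tanh'(z1)
    dW1, db1 = X.T @ d1, d1.sum(axis=0)

    # Gradient descent updates
    W1 -= alpha * dW1; b1 -= alpha * db1
    W2 -= alpha * dW2; b2 -= alpha * db2

print(((a2 >= 0.5) == y).mean())            # training accuracy
```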
Regression:
Mean Squared Error (MSE):
L = (1/n)Σᵢ (yᵢ - ŷᵢ)²
Mean Absolute Error (MAE):
L = (1/n)Σᵢ |yᵢ - ŷᵢ|
Classification:
Binary Cross-Entropy:
L = -(1/n)Σᵢ [yᵢ log(ŷᵢ) + (1-yᵢ)log(1-ŷᵢ)]
Categorical Cross-Entropy:
L = -(1/n)Σᵢ Σₖ yᵢₖ log(ŷᵢₖ)
Problem: Poor initialization can cause vanishing/exploding gradients.
Xavier/Glorot Initialization:
W ~ U[-√(6/(nᵢₙ + nₒᵤₜ)), √(6/(nᵢₙ + nₒᵤₜ))]
He Initialization (for ReLU):
W ~ N(0, 2/nᵢₙ)
Normalize activations within each mini-batch:
x̂ᵢ = (xᵢ - μ_B) / √(σ²_B + ε)
yᵢ = γx̂ᵢ + β
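A minimal sketch of the batch-norm forward pass at training time (inference additionally uses running statistics, omitted here):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each feature over the mini-batch, then scale and shift."""
    mu = x.mean(axis=0)                     # μ_B, per feature
    var = x.var(axis=0)                     # σ²_B
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = rng.normal(loc=5.0, scale=3.0, size=(32, 4))   # one mini-batch
out = batch_norm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=0).round(6), out.std(axis=0).round(3))  # ≈ 0 and ≈ 1
```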
Benefits:
Regularization prevents overfitting by constraining model complexity.
Modified Loss:
L_ridge(θ) = L(θ) + λ||θ||₂² = L(θ) + λΣⱼ θⱼ²
Effect:
θ̂ = (XᵀX + λI)⁻¹Xᵀy
Bayesian Interpretation: Equivalent to placing Gaussian prior on parameters.
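A NumPy sketch of the ridge closed form on synthetic data, showing the shrinkage effect relative to ordinary least squares (λ is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X[:, 0] - 2 * X[:, 1] + 0.1 * rng.normal(size=50)

lam = 1.0
p = X.shape[1]
# Ridge closed form: θ̂ = (XᵀX + λI)⁻¹Xᵀy
theta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)
theta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# Shrinkage: the ridge solution has smaller norm than the OLS solution.
print(np.linalg.norm(theta_ridge) < np.linalg.norm(theta_ols))  # True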
Modified Loss:
L_lasso(θ) = L(θ) + λ||θ||₁ = L(θ) + λΣⱼ |θⱼ|
Effect:
Bayesian Interpretation: Equivalent to placing Laplace prior on parameters.
Combines L1 and L2:
L_elastic(θ) = L(θ) + λ₁||θ||₁ + λ₂||θ||₂²
Advantages:
Training: Randomly set a fraction p of neurons to zero in each iteration.
Effect:
Inference: Scale activations by (1-p) or train with inverted dropout.
Method: Monitor validation loss during training; stop when it starts increasing.
Effect: acts as implicit regularization by limiting the model's capacity to fit the training data.
Increase effective dataset size by applying transformations:
Effect: Regularizes by exposing model to variations.
Ensemble methods combine multiple models to achieve better performance than individual models.
Algorithm:
Prediction:
ŷ = (1/m)Σᵢ fᵢ(x) (regression)
ŷ = mode{f₁(x), ..., fₘ(x)} (classification)
Random Forest: Bagging with decision trees + random feature selection at each split.
Feature Importance:
Importance(xⱼ) = Σ_trees Σ_splits ΔImpurity(xⱼ)
Out-of-Bag (OOB) Error:
Idea: Train models sequentially, each correcting errors of previous ones.
AdaBoost (Adaptive Boosting):
Gradient Boosting:
General framework for boosting any differentiable loss function.
rᵢₜ = -∂L(yᵢ, F(xᵢ))/∂F(xᵢ) |_(F=Fₜ₋₁)
γₜ = argmin_γ Σᵢ L(yᵢ, Fₜ₋₁(xᵢ) + γhₜ(xᵢ))
XGBoost (Extreme Gradient Boosting):
Adds regularization to gradient boosting:
Obj = Σᵢ L(yᵢ, ŷᵢ) + Σₜ Ω(fₜ)
where Ω(f) = γT + (1/2)λ||w||² (T = number of leaves)
Two-Level Architecture:
Level 0 (Base Models):
Level 1 (Meta-Model):
Algorithm:
Unsupervised learning discovers patterns in unlabeled data.
Objective: Minimize within-cluster variance:
minimize Σₖ Σ_{xᵢ∈Cₖ} ||xᵢ - μₖ||²
Algorithm (Lloyd’s Algorithm):
Assignment step: Cₖ = {xᵢ : ||xᵢ - μₖ|| ≤ ||xᵢ - μⱼ|| for all j}
Update step: μₖ = (1/|Cₖ|)Σ_{xᵢ∈Cₖ} xᵢ
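A compact sketch of Lloyd's algorithm on two well-separated synthetic clusters (a production version would also handle clusters that empty out):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Lloyd's algorithm: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest centroid for each point.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: centroid = mean of assigned points.
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):
            break                            # converged
        centers = new_centers
    return centers, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(4, 0.5, (50, 2))])
centers, labels = kmeans(X, k=2)
print(centers.round(2))    # ≈ the two cluster means, near (0, 0) and (4, 4)
```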
Choosing k:
Limitations:
Agglomerative (Bottom-Up):
Linkage Criteria:
Dendrogram: Tree diagram showing merge sequence.
Parameters:
Point Types:
Advantages:
Goal: Find orthogonal directions of maximum variance.
Mathematical Formulation:
Σ = (1/n)X_centeredᵀX_centered
Σ = QΛQᵀ
Z = X_centered · Q[:, :k] (keep top k components)
Variance Explained:
Variance explained by PC_j = λⱼ / Σᵢ λᵢ
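A NumPy sketch of PCA exactly as formulated above (center, form the covariance matrix, eigendecompose, project), on synthetic anisotropic data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) @ np.array([[3.0, 0.0, 0.0],
                                          [0.0, 1.0, 0.0],
                                          [0.0, 0.0, 0.1]])  # stretched data

X_centered = X - X.mean(axis=0)
Sigma = X_centered.T @ X_centered / len(X)        # covariance matrix
eigvals, Q = np.linalg.eigh(Sigma)                # ascending eigenvalues
order = eigvals.argsort()[::-1]                   # sort descending
eigvals, Q = eigvals[order], Q[:, order]

k = 2
Z = X_centered @ Q[:, :k]                         # project onto top-k PCs
explained = eigvals / eigvals.sum()
print(explained.round(3))                         # variance explained per PC
```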
Applications:
Goal: Preserve local structure in low-dimensional embedding.
Algorithm:
Compute pairwise similarities in high-dimensional space:
p_{j|i} = exp(-||xᵢ - xⱼ||²/(2σᵢ²)) / Σₖ≠ᵢ exp(-||xᵢ - xₖ||²/(2σᵢ²))
Compute symmetrized similarities:
pᵢⱼ = (p_{j|i} + p_{i|j}) / (2n)
Define similarities in low-dimensional space using t-distribution:
qᵢⱼ = (1 + ||yᵢ - yⱼ||²)⁻¹ / Σ_{k≠l} (1 + ||yₖ - yₗ||²)⁻¹
Minimize KL divergence using gradient descent:
KL(P||Q) = Σᵢ Σⱼ pᵢⱼ log(pᵢⱼ/qᵢⱼ)
Properties:
Similar to t-SNE but:
Probabilistic models explicitly represent uncertainty using probability distributions.
Assumption: Features are conditionally independent given class:
P(x₁, ..., xₚ | y) = ∏ⱼ P(xⱼ | y)
Classification Rule:
ŷ = argmax_y P(y) ∏ⱼ P(xⱼ | y)
Variants:
Gaussian Naive Bayes: P(xⱼ | y) ~ N(μⱼᵧ, σ²ⱼᵧ)
Advantages:
Model: Data generated from mixture of k Gaussian distributions:
p(x) = Σₖ πₖ N(x | μₖ, Σₖ)
where πₖ are mixing coefficients (Σₖ πₖ = 1).
Latent Variable: zᵢ indicates which Gaussian generated xᵢ.
Complete-Data Likelihood:
p(x, z | θ) = ∏ᵢ ∏ₖ [πₖ N(xᵢ | μₖ, Σₖ)]^{zᵢₖ}
General Framework for Latent Variable Models:
Maximize: p(x | θ) = Σ_z p(x, z | θ)
E-Step: Compute the posterior over latents z under the current parameters, and form the expected complete-data log-likelihood:
Q(θ | θ⁽ᵗ⁾) = E_{z|x,θ⁽ᵗ⁾}[log p(x, z | θ)]
M-Step: Maximize it with respect to θ:
θ⁽ᵗ⁺¹⁾ = argmax_θ Q(θ | θ⁽ᵗ⁾)
For GMM:
E-Step (Responsibility):
γᵢₖ = πₖ N(xᵢ | μₖ, Σₖ) / Σⱼ πⱼ N(xᵢ | μⱼ, Σⱼ)
M-Step:
Nₖ = Σᵢ γᵢₖ
πₖ = Nₖ / n
μₖ = (1/Nₖ)Σᵢ γᵢₖxᵢ
Σₖ = (1/Nₖ)Σᵢ γᵢₖ(xᵢ - μₖ)(xᵢ - μₖ)ᵀ
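A one-dimensional EM sketch for a two-component GMM, iterating the E- and M-steps above on synthetic data (initialization and iteration count are illustrative):

```python
import numpy as np

def gaussian_pdf(x, mu, var):
    return np.exp(-(x - mu)**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2, 1, 300), rng.normal(3, 0.5, 200)])

# Initialize a 2-component 1-D GMM.
pi = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
var = np.array([1.0, 1.0])

for _ in range(100):
    # E-step: responsibilities γᵢₖ ∝ πₖ N(xᵢ | μₖ, σₖ²).
    dens = pi * gaussian_pdf(x[:, None], mu, var)
    gamma = dens / dens.sum(axis=1, keepdims=True)
    # M-step: re-estimate πₖ, μₖ, σₖ² from the responsibilities.
    Nk = gamma.sum(axis=0)
    pi = Nk / len(x)
    mu = (gamma * x[:, None]).sum(axis=0) / Nk
    var = (gamma * (x[:, None] - mu)**2).sum(axis=0) / Nk

print(pi.round(2), mu.round(2), var.round(2))   # ≈ the generating mixture
```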
Components:
Hidden states: S = {1, ..., N}, with initial distribution π
Transition probabilities: A, where aᵢⱼ = P(sₜ₊₁ = j | sₜ = i)
Emission probabilities: B, where bⱼ(oₜ) = P(oₜ | sₜ = j)
Forward Algorithm (Compute Likelihood):
α_t(i) = P(o₁, ..., oₜ, sₜ=i)
α_t(j) = [Σᵢ α_{t-1}(i)aᵢⱼ]bⱼ(oₜ)
p(O) = Σᵢ α_T(i)
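A toy forward-algorithm sketch for a two-state HMM with two observation symbols (all probabilities are hypothetical):

```python
import numpy as np

A = np.array([[0.7, 0.3],       # transition probabilities aᵢⱼ
              [0.4, 0.6]])
B = np.array([[0.9, 0.1],       # emission probabilities bⱼ(o)
              [0.2, 0.8]])
pi = np.array([0.5, 0.5])       # initial state distribution
obs = [0, 1, 1, 0]              # observed symbol sequence

# Forward recursion: α_t(j) = [Σᵢ α_{t-1}(i) aᵢⱼ] bⱼ(oₜ)
alpha = pi * B[:, obs[0]]
for o in obs[1:]:
    alpha = (alpha @ A) * B[:, o]

print(alpha.sum())              # p(O) = Σᵢ α_T(i), likelihood of the sequence
```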
Viterbi Algorithm (Most Likely State Sequence):
δ_t(i) = max_{s₁,...,s_{t-1}} P(s₁,...,s_{t-1},sₜ=i,o₁,...,oₜ)
δ_t(j) = [maxᵢ δ_{t-1}(i)aᵢⱼ]bⱼ(oₜ)
Applications:
Bayes’ Theorem for Parameters:
p(θ | D) = p(D | θ)p(θ) / p(D)
where:
p(θ): Prior distribution
p(D | θ): Likelihood
p(θ | D): Posterior distribution
p(D): Evidence (marginal likelihood)
Posterior Predictive Distribution:
p(x_new | D) = ∫ p(x_new | θ)p(θ | D)dθ
Maximum A Posteriori (MAP) Estimation:
θ̂_MAP = argmax_θ p(θ | D) = argmax_θ [p(D | θ)p(θ)]
Conjugate Priors: Prior and posterior have same functional form.
Examples:
Time series data consists of observations ordered in time, exhibiting temporal dependencies.
Decomposition:
Y_t = T_t + S_t + R_t
where:
T_t is the trend, S_t the seasonal component, and R_t the residual (remainder).
Multiplicative Model:
Y_t = T_t × S_t × R_t
A time series is (weakly) stationary if its mean, variance, and autocovariance structure do not change over time.
Tests:
Making Non-Stationary Data Stationary:
Autocorrelation Function (ACF):
ρ_k = Cov(Y_t, Y_{t-k}) / Var(Y_t)
Partial Autocorrelation Function (PACF):
Correlation between Y_t and Y_{t-k} after removing linear dependence on Y_{t-1}, …, Y_{t-k+1}.
AR(p) - Autoregressive:
Y_t = c + Σᵢ φᵢY_{t-i} + ε_t
MA(q) - Moving Average:
Y_t = μ + ε_t + Σᵢ θᵢε_{t-i}
ARMA(p,q):
Y_t = c + Σᵢ φᵢY_{t-i} + Σⱼ θⱼε_{t-j} + ε_t
ARIMA(p,d,q): ARMA on differenced series (d differences)
Model Selection:
SARIMA(p,d,q)(P,D,Q)_s:
Combines non-seasonal and seasonal components:
φ(B)Φ(Bˢ)(1-B)ᵈ(1-Bˢ)ᴰY_t = θ(B)Θ(Bˢ)ε_t
where:
B is the backshift operator (BYₜ = Yₜ₋₁); φ and θ are the non-seasonal AR and MA polynomials, Φ and Θ their seasonal counterparts; d and D are the non-seasonal and seasonal differencing orders; and s is the seasonal period.
SARIMAX: SARIMA + exogenous variables
Simple Exponential Smoothing:
ŷ_{t+1} = αy_t + (1-α)ŷ_t
Holt’s Linear Trend:
Level: ℓ_t = αy_t + (1-α)(ℓ_{t-1} + b_{t-1})
Trend: b_t = β(ℓ_t - ℓ_{t-1}) + (1-β)b_{t-1}
Forecast: ŷ_{t+h} = ℓ_t + hb_t
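A short sketch of Holt's linear trend method on an illustrative trending series (smoothing parameters are assumptions, not tuned values):

```python
import numpy as np

def holt_forecast(y, alpha=0.5, beta=0.3, horizon=3):
    """Holt's linear trend: recursive level/trend updates, then extrapolate."""
    level, trend = y[0], y[1] - y[0]          # simple initialization
    for t in range(1, len(y)):
        prev_level = level
        level = alpha * y[t] + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return np.array([level + h * trend for h in range(1, horizon + 1)])

y = np.array([10.0, 12.0, 13.5, 15.2, 16.8, 18.1])   # trending series
print(holt_forecast(y))   # continues the upward trend
```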
Holt-Winters (Seasonal):
Adds seasonal component:
Level: ℓ_t = α(y_t - s_{t-m}) + (1-α)(ℓ_{t-1} + b_{t-1})
Trend: b_t = β(ℓ_t - ℓ_{t-1}) + (1-β)b_{t-1}
Season: s_t = γ(y_t - ℓ_t) + (1-γ)s_{t-m}
Forecast: ŷ_{t+h} = ℓ_t + hb_t + s_{t+h-m}
Additive Model:
y(t) = g(t) + s(t) + h(t) + ε_t
where:
g(t) is the trend, s(t) the seasonality, h(t) the holiday effects, and ε_t the error term.
Advantages:
Mean Absolute Error (MAE):
MAE = (1/n)Σ|y_t - ŷ_t|
Root Mean Squared Error (RMSE):
RMSE = √[(1/n)Σ(y_t - ŷ_t)²]
Mean Absolute Percentage Error (MAPE):
MAPE = (100/n)Σ|y_t - ŷ_t|/|y_t|
Symmetric MAPE (sMAPE):
sMAPE = (100/n)Σ 2|y_t - ŷ_t|/(|y_t| + |ŷ_t|)
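A small helper computing all four forecast metrics at once (toy values shown for illustration):

```python
import numpy as np

def forecast_metrics(y_true, y_pred):
    err = y_true - y_pred
    return {
        "MAE": np.mean(np.abs(err)),
        "RMSE": np.sqrt(np.mean(err**2)),
        "MAPE": 100 * np.mean(np.abs(err) / np.abs(y_true)),
        "sMAPE": 100 * np.mean(2 * np.abs(err)
                               / (np.abs(y_true) + np.abs(y_pred))),
    }

y_true = np.array([100.0, 110.0, 120.0, 130.0])
y_pred = np.array([ 98.0, 112.0, 118.0, 135.0])
print(forecast_metrics(y_true, y_pred))
```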
Convolution Operation:
(f * g)[i,j] = ΣₘΣₙ f[m,n]g[i-m, j-n]
(In practice, deep learning frameworks implement cross-correlation, indexing g[i+m, j+n] as in the layer formula below, but conventionally still call it convolution.)
Convolutional Layer:
Output[i,j,k] = σ(Σₘ Σₙ Σ_c Input[i+m, j+n, c] × Kernel[m,n,c,k] + Bias[k])
Key Concepts:
Pooling Layers:
Max Pooling:
Output[i,j] = max_{m,n ∈ window} Input[i×s+m, j×s+n]
Average Pooling:
Output[i,j] = mean_{m,n ∈ window} Input[i×s+m, j×s+n]
Architecture Patterns:
Input → [Conv → ReLU → Pool]×N → Flatten → [FC → ReLU]×M → Output
Classic Architectures:
Vanilla RNN:
h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)
y_t = W_hy h_t + b_y
Problems:
Gradient Flow:
∂L/∂h_0 = ∂L/∂h_T × ∏_{t=1}^T ∂h_t/∂h_{t-1}
Gates:
Forget gate: f_t = σ(W_f[h_{t-1}, x_t] + b_f)
Input gate: i_t = σ(W_i[h_{t-1}, x_t] + b_i)
Output gate: o_t = σ(W_o[h_{t-1}, x_t] + b_o)
Cell candidate: C̃_t = tanh(W_C[h_{t-1}, x_t] + b_C)
Cell state: C_t = f_t ⊙ C_{t-1} + i_t ⊙ C̃_t
Hidden state: h_t = o_t ⊙ tanh(C_t)
Key Innovation: Cell state C_t provides uninterrupted gradient flow.
Simplified LSTM with fewer parameters:
Reset gate: r_t = σ(W_r[h_{t-1}, x_t])
Update gate: z_t = σ(W_z[h_{t-1}, x_t])
Candidate: h̃_t = tanh(W[r_t ⊙ h_{t-1}, x_t])
Hidden state: h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
Problem: Fixed-length context vector is information bottleneck.
Solution: Allow decoder to “attend” to different parts of input.
Attention Scores:
e_{t,i} = score(h_t, h̄_i) = h_t^T W h̄_i (or dot product, or MLP)
Attention Weights (via softmax):
α_{t,i} = exp(e_{t,i}) / Σⱼ exp(e_{t,j})
Context Vector:
c_t = Σᵢ α_{t,i} h̄_i
Decoder with Attention:
h_t = RNN(h_{t-1}, [y_{t-1}; c_t])
Core Idea: Replace recurrence with self-attention.
Scaled Dot-Product Attention:
Attention(Q, K, V) = softmax(QK^T / √d_k)V
where:
Q (queries), K (keys), and V (values) are matrices of projected token representations, and d_k is the key dimension; the √d_k scaling keeps dot products from growing with dimension.
Multi-Head Attention:
MultiHead(Q,K,V) = Concat(head₁, ..., head_h)W^O
where head_i = Attention(QW^Q_i, KW^K_i, VW^V_i)
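A single-head NumPy sketch of scaled dot-product attention (multi-head attention applies the same computation per head after separate linear projections; shapes here are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)    # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(QKᵀ/√d_k) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)            # pairwise query-key similarity
    weights = softmax(scores, axis=-1)         # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_k = 5, 8
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_k))

out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.sum(axis=-1))   # (5, 8); each attention row sums to 1
```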
Transformer Block:
1. Multi-head self-attention
2. Add & Normalize (residual connection)
3. Feed-forward network (2-layer MLP)
4. Add & Normalize
Positional Encoding:
Since no recurrence, inject position information:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Advantages:
Popular Models:
Variational Autoencoders (VAE):
Encoder (Recognition Model):
q_φ(z|x) ≈ p(z|x)
Decoder (Generative Model):
p_θ(x|z)
Loss (ELBO - Evidence Lower Bound):
L = E_q[log p_θ(x|z)] - KL(q_φ(z|x) || p(z))
Reconstruction loss + Regularization term
Reparameterization Trick:
z = μ + σ ⊙ ε where ε ~ N(0, I)
Allows backpropagation through sampling.
Generative Adversarial Networks (GANs):
Two Networks:
Minimax Game:
min_G max_D E_x[log D(x)] + E_z[log(1 - D(G(z)))]
Training:
Challenges:
Improvements:
Prediction Error Decomposition:
E[(y - ŷ)²] = Bias² + Variance + Irreducible Error
Bias: Error from approximating complex functions with simple models
Variance: Error from sensitivity to training set fluctuations
Tradeoff: Increasing model complexity reduces bias but increases variance.
K-Fold Cross-Validation:
Stratified K-Fold: Maintain class proportions in each fold.
Leave-One-Out (LOO): K = n (expensive but low-bias estimate)
Time Series Split: Respect temporal ordering
Fold 1: Train[1:100], Test[101:150]
Fold 2: Train[1:150], Test[151:200]
...
Confusion Matrix:

                Predicted
                Pos    Neg
Actual   Pos    TP     FN
         Neg    FP     TN
Metrics:
Accuracy:
Accuracy = (TP + TN) / (TP + TN + FP + FN)
Precision (Positive Predictive Value):
Precision = TP / (TP + FP)
Recall (Sensitivity, True Positive Rate):
Recall = TP / (TP + FN)
F1 Score (Harmonic Mean of Precision and Recall):
F1 = 2 × (Precision × Recall) / (Precision + Recall)
Specificity (True Negative Rate):
Specificity = TN / (TN + FP)
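A small helper computing these metrics directly from binary label arrays (toy values shown; a library such as scikit-learn would normally be used):

```python
import numpy as np

def classification_metrics(y_true, y_pred):
    tp = np.sum((y_true == 1) & (y_pred == 1))
    tn = np.sum((y_true == 0) & (y_pred == 0))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "specificity": tn / (tn + fp),
    }

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])
y_pred = np.array([1, 0, 1, 0, 0, 1, 0, 1])
print(classification_metrics(y_true, y_pred))   # accuracy 0.75, F1 0.75
```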
ROC Curve: Plot TPR vs FPR at various thresholds
TPR = TP / (TP + FN) (y-axis)
FPR = FP / (FP + TN) (x-axis)
AUC (Area Under Curve):
Precision-Recall Curve: Better for imbalanced datasets
Mean Squared Error (MSE):
MSE = (1/n)Σ(yᵢ - ŷᵢ)²
Root Mean Squared Error (RMSE):
RMSE = √MSE
Mean Absolute Error (MAE):
MAE = (1/n)Σ|yᵢ - ŷᵢ|
R² Score (Coefficient of Determination):
R² = 1 - SS_res/SS_tot = 1 - Σ(yᵢ - ŷᵢ)²/Σ(yᵢ - ȳ)²
Adjusted R²:
R²_adj = 1 - (1 - R²)(n - 1)/(n - p - 1)
Penalizes adding features that don’t improve fit.
Akaike Information Criterion (AIC):
AIC = 2k - 2log(L̂)
where k = number of parameters, L̂ = maximum likelihood
Bayesian Information Criterion (BIC):
BIC = k·log(n) - 2log(L̂)
Stronger penalty for model complexity than AIC.
Principle: Lower is better. Balance fit quality with model simplicity.
Grid Search:
Random Search:
Bayesian Optimization:
Learning Curves:
Plot training and validation performance vs:
This theory document covers the mathematical foundations underlying the ML curriculum from Days 38-67. Each concept builds on previous ones, forming a coherent framework for understanding modern machine learning:
For practical implementations of these concepts, refer to the corresponding lesson days (38-67) in the curriculum. Each lesson provides executable code, worked examples, and exercises to solidify understanding.
Linear Algebra:
Calculus and Optimization:
Probability and Statistics:
Machine Learning:
Deep Learning:
This theory document is maintained as part of the Coding for MBA curriculum. For questions or suggestions, please open an issue on the GitHub repository.