Optimization Mechanics — Feature Scaling, Convergence Testing, and Polynomial Mapping

When moving from univariate models to complex multivariate datasets, writing raw optimization loops is not enough. Without proper preprocessing and diagnostics, gradient descent can diverge, fail to converge within a reasonable number of iterations, or be unable to fit non-linear trends.

Today, I covered the mathematical mechanics required to stabilize and validate gradient descent training: Feature Scaling, Convergence Testing, and Polynomial Feature Engineering.

1. Input Theory: Optimizing the Feature Space

I. Feature Scaling

When independent features possess radically different numerical ranges, gradient descent struggles to converge efficiently. For example, consider a housing dataset where:

$$300 \leq x_1 \leq 2000 \quad \text{(Size in feet}^2\text{)}$$

$$0 \leq x_2 \leq 5 \quad \text{(Number of bedrooms)}$$

Because a small change in w_1 impacts the cost function far more than a small change in w_2, the resulting cost surface contour plot forms long, narrow, highly skewed ellipses. When running gradient descent on this unscaled surface, the parameter updates oscillate wildly back and forth, jumping across the valley rather than navigating directly to the minimum.
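To make the imbalance concrete, here is a minimal sketch that compares the two gradient components at the starting point w = 0, b = 0, first on raw features and then on z-scored ones. Only the feature ranges come from the example above; the synthetic data, the price rule, and the `mse_gradient` helper are assumptions made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 200
size_sqft = rng.uniform(300, 2000, m)            # x_1 range from the example above
bedrooms = rng.integers(0, 6, m).astype(float)   # x_2 takes values 0..5
X = np.column_stack([size_sqft, bedrooms])

# Hypothetical price rule, used only so there is a target to fit
y = 0.1 * size_sqft + 20.0 * bedrooms + 50.0

def mse_gradient(X, y, w, b):
    """Gradient of J = (1/2m) * sum((X @ w + b - y)^2) with respect to w."""
    err = X @ w + b - y
    return X.T @ err / len(y)

# Gradient at the starting point on the raw features
grad_raw = mse_gradient(X, y, np.zeros(2), 0.0)

# Same computation after z-scoring each column
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
grad_scaled = mse_gradient(X_scaled, y, np.zeros(2), 0.0)

ratio_raw = abs(grad_raw[0] / grad_raw[1])
ratio_scaled = abs(grad_scaled[0] / grad_scaled[1])
print(f"raw    |dJ/dw1| / |dJ/dw2| = {ratio_raw:.1f}")
print(f"scaled |dJ/dw1| / |dJ/dw2| = {ratio_scaled:.1f}")
```

On this toy data the raw ratio comes out much larger than the scaled one, so on unscaled features a single alpha cannot suit both parameters at once.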

To resolve this, we bring the features onto comparable ranges using one of three standard scaling techniques:

Max Rescaling: Divides each feature by its absolute maximum value, compressing the range to a maximum of 1:

$$x_{j,\text{scaled}} = \frac{x_j}{\max(x_j)}$$

Mean Normalization: Centers the features around zero and scales them to the interval [-1, 1]:

$$x_{j,\text{scaled}} = \frac{x_j - \mu_j}{\max(x_j) - \min(x_j)}$$

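(where μ_j is the mean of feature j)

Z-score Normalization: Transforms each feature to have a mean of 0 and a standard deviation σ of 1. Because it does not depend on the minimum and maximum alone, it is less distorted by a single extreme value than the two range-based methods above, which makes it a common default, although the mean and σ are still influenced by outliers: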

$$x_{j,\text{scaled}} = \frac{x_j - \mu_j}{\sigma_j}$$
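The first two formulas can be sketched in NumPy as follows. The function names `max_rescale` and `mean_normalize`, the zero-guard behaviour, and the three sample houses are my own assumptions for illustration.

```python
import numpy as np

def max_rescale(X):
    """Divide each column by its largest absolute value."""
    max_abs = np.max(np.abs(X), axis=0)
    max_abs = np.where(max_abs == 0, 1.0, max_abs)  # leave all-zero columns unchanged
    return X / max_abs

def mean_normalize(X):
    """Subtract the column mean, then divide by the column range (max - min)."""
    mu = X.mean(axis=0)
    value_range = X.max(axis=0) - X.min(axis=0)
    value_range = np.where(value_range == 0, 1.0, value_range)  # constant columns map to 0
    return (X - mu) / value_range

# Made-up rows: [size in square feet, number of bedrooms]
X = np.array([[300.0, 0.0],
              [1200.0, 3.0],
              [2000.0, 5.0]])

X_max = max_rescale(X)
X_mean = mean_normalize(X)
print(X_max)
print(X_mean)
```

For non-negative features like these, the absolute maximum is the same as max(x_j) in the formula. After mean normalization each column averages to zero and its values span a range of exactly 1.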

II. Convergence Check for Gradient Descent

To verify that our optimization algorithm is functioning correctly, we monitor the system using two primary validation metrics:

The Learning Curve: A diagnostic plot of the total cost J (y-axis) against the number of training iterations (x-axis). With batch gradient descent and a well-chosen learning rate, J should decrease on every iteration and eventually flatten out. If the cost increases, the learning rate alpha is most likely too large and should be reduced, or there is a bug in the gradient code.
Automatic Convergence Test: An algorithmic check that stops training once the cost J decreases by less than a small threshold ε (e.g., 10^-3) in a single iteration. Because a good ε is hard to choose across datasets, the visual learning curve remains the preferred diagnostic tool.
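Both checks can live in the same training loop. The sketch below is a minimal batch gradient descent that records J after every update and stops once the decrease falls below ε. The function names, the default values, and the synthetic data are assumptions for illustration, not the code from the earlier posts.

```python
import numpy as np

def compute_cost(X, y, w, b):
    """Squared-error cost J = (1/2m) * sum((X @ w + b - y)^2)."""
    err = X @ w + b - y
    return float(err @ err) / (2 * len(y))

def gradient_descent(X, y, alpha=0.1, epsilon=1e-3, max_iters=10_000):
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    cost_history = [compute_cost(X, y, w, b)]  # data for the learning curve
    for _ in range(max_iters):
        err = X @ w + b - y
        w = w - alpha * (X.T @ err) / m
        b = b - alpha * err.mean()
        cost_history.append(compute_cost(X, y, w, b))
        decrease = cost_history[-2] - cost_history[-1]
        if decrease < 0:
            raise RuntimeError("Cost increased: alpha is probably too large, or there is a bug")
        if decrease < epsilon:
            break  # automatic convergence test passed
    return w, b, cost_history

# Synthetic, already well-scaled data with known coefficients
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 2))
y = X @ np.array([2.0, -1.0]) + 3.0 + 0.1 * rng.standard_normal(100)

w, b, cost_history = gradient_descent(X, y)
print(f"stopped after {len(cost_history) - 1} iterations, final J = {cost_history[-1]:.4f}")
```

The returned `cost_history` list can be handed to any plotting library to draw the learning curve.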

Systematic Learning Rate Selection Strategy

To systematically discover the ideal learning rate alpha, we execute a multi-step debugging loop:

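Initialize alpha at a very small value (e.g., 1e-6). With a step this small the cost should decrease on every iteration, so any increase points to a bug in the code rather than a poorly chosen learning rate.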
If the model functions without diverging, scale alpha by a factor of roughly 3 on each successive test run:

$$\alpha_{\text{next}} \approx 3\alpha \quad (0.001 \rightarrow 0.003 \rightarrow 0.01 \rightarrow 0.03 \dots)$$

Stop once a run starts to diverge, then pick the largest alpha (or one slightly smaller) whose learning curve still decreases quickly and consistently.
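The loop above can be sketched as a small sweep. Everything here is an assumption for illustration: the helper names, the candidate list, the iteration budget, and the synthetic standardized data. A run counts as converging when its cost history is finite and never rises (up to a tiny floating-point tolerance).

```python
import numpy as np

def run_gradient_descent(X, y, alpha, iters=50):
    """Plain batch gradient descent; returns the cost recorded before each update."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    history = []
    for _ in range(iters):
        err = X @ w + b - y
        history.append(float(err @ err) / (2 * m))
        w -= alpha * (X.T @ err) / m
        b -= alpha * err.mean()
    return history

def sweep_learning_rates(X, y, alphas, iters=50):
    """Map each alpha to (converging?, final cost)."""
    results = {}
    for alpha in alphas:
        with np.errstate(over="ignore", invalid="ignore"):
            history = np.array(run_gradient_descent(X, y, alpha, iters))
        converging = bool(np.all(np.isfinite(history)) and np.all(np.diff(history) <= 1e-9))
        results[alpha] = (converging, float(history[-1]))
    return results

# Synthetic standardized features with known coefficients
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 2))
X = (X - X.mean(axis=0)) / X.std(axis=0)
y = X @ np.array([2.0, -1.0]) + 3.0 + 0.1 * rng.standard_normal(200)

alphas = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0, 3.0]  # roughly 3x steps
results = sweep_learning_rates(X, y, alphas)

# Among the converging runs, keep the one with the lowest final cost
best_alpha = min((a for a, (ok, _) in results.items() if ok),
                 key=lambda a: results[a][1])
for a in alphas:
    print(a, results[a])
print("selected alpha:", best_alpha)
```

Comparing final cost after a fixed number of iterations is only a rough proxy for "steepest decrease"; looking at the full learning curves is still the safer check.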

III. Polynomial Regression & Feature Engineering

Standard linear regression can only fit a straight line (or, with several features, a flat hyperplane). However, real-world data is frequently non-linear. For instance, house prices may not rise at a constant rate across very small and very large houses, which calls for a curved fit.

To solve this without abandoning our linear regression math, we use Feature Engineering: the process of combining or modifying existing features to create new ones.

If we encounter non-linear data, we can engineer new features by raising a base feature x to higher powers, resulting in a Polynomial Regression model:

$$f_{\vec{w},b}(x) = w_1x + w_2x^2 + w_3x^3 + \dots + w_kx^k + b$$

By treating each power (x, x^2, x^3, etc.) as its own feature column (x_1, x_2, x_3, ...), we can plug this data straight into our existing multivariate dot-product pipeline (np.dot(X, w) + b) without changing a single line of our optimization code. The model stays linear in the parameters w and b, even though it is non-linear in x.
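As a quick sanity check, the sketch below confirms that the dot-product form and the written-out polynomial give identical predictions. The input values, weights, and bias are arbitrary numbers chosen for illustration.

```python
import numpy as np

x = np.array([-2.0, 0.0, 1.0, 3.0])

# Columns x_1 = x, x_2 = x^2, x_3 = x^3
X = np.column_stack([x, x**2, x**3])

# Hypothetical parameters, not fitted to anything
w = np.array([1.0, 0.5, -0.25])
b = 2.0

pred_dot = np.dot(X, w) + b                                   # multivariate pipeline
pred_poly = w[0] * x + w[1] * x**2 + w[2] * x**3 + b          # polynomial written out

print(np.allclose(pred_dot, pred_poly))
```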

2. Process to Code: Normalization and Feature Mapping

Implementing these techniques requires computing per-feature statistics down the columns (axis 0) and then reusing those exact statistics (mu and sigma) at prediction time.

Here is the implementation of these operations in plain NumPy:

import numpy as np

def z_score_normalization(X):
    """
    Computes Z-score normalization for an input feature matrix.
    X: NumPy array of shape (m, n)
    Returns: Scaled matrix, column means (mu), and column standard deviations (sigma)
    """
    # Calculate across rows for each individual feature column
    mu = np.mean(X, axis=0)
    sigma = np.std(X, axis=0)
    
    # Avoid divide-by-zero errors on completely uniform features
    sigma[sigma == 0] = 1e-8
    
    X_scaled = (X - mu) / sigma
    return X_scaled, mu, sigma

def engineer_polynomial_features(X_1D, degree):
    """
    Converts a single feature vector into a matrix of polynomial features up to a specified degree.
    X_1D: NumPy array of shape (m,)
    """
    m = X_1D.shape[0]
    X_poly = np.zeros((m, degree))
    
    for k in range(1, degree + 1):
        X_poly[:, k-1] = X_1D ** k
        
    return X_poly
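The point about reusing the same statistics at prediction time is easy to get wrong, so here is a minimal sketch of it. The `fit_scaler` and `apply_scaler` helpers and the sample rows are assumptions for illustration; unlike the function above, this version replaces a zero sigma with 1.0 instead of 1e-8.

```python
import numpy as np

def fit_scaler(X):
    """Compute per-column mean and standard deviation from the training data only."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma = np.where(sigma == 0, 1.0, sigma)  # guard against constant columns
    return mu, sigma

def apply_scaler(X, mu, sigma):
    """Scale any data with previously computed statistics."""
    return (X - mu) / sigma

# Made-up training rows: [size in square feet, number of bedrooms]
X_train = np.array([[300.0, 1.0],
                    [1200.0, 3.0],
                    [2000.0, 5.0]])
mu, sigma = fit_scaler(X_train)
X_train_scaled = apply_scaler(X_train, mu, sigma)

# Correct: a new house is scaled with the TRAINING mu and sigma
x_new = np.array([[1500.0, 4.0]])
x_new_scaled = apply_scaler(x_new, mu, sigma)

# Incorrect: re-fitting on the new row alone collapses it to zeros
x_new_wrong = apply_scaler(x_new, *fit_scaler(x_new))

print(x_new_scaled)
print(x_new_wrong)
```

Re-fitting the scaler on a single new row wipes out all of its information, which is why mu and sigma are returned from the normalization function and kept alongside the trained weights.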

3. Train Model: Evaluating Polynomial Fitting

I constructed a synthetic dataset mapping a non-linear quadratic trend (y = 0.5x^2 + x + 2 + noise) to verify that our engineered pipeline functions correctly.

if __name__ == "__main__":
    # 1. Generate non-linear baseline data
    np.random.seed(42)
    X_raw = np.linspace(-3, 3, 100)
    y_raw = 0.5 * (X_raw ** 2) + X_raw + 2 + np.random.randn(100)

    # 2. Engineer Polynomial Features up to degree 2 (x and x^2)
    X_poly = engineer_polynomial_features(X_raw, degree=2)
    
    # 3. CRITICAL: Scale the engineered polynomial features
    # Each power of x lands on a different range, and the gap grows quickly with degree
    X_scaled, mu, sigma = z_score_normalization(X_poly)
    
    # 4. Fit using our Day 3 Multivariate Optimization Engine
    from multiple_linear_regression import MultipleLinearRegression
    
    # With normalized features, a relatively large alpha such as 0.1 stays stable
    model = MultipleLinearRegression(learning_rate=0.1, iterations=500)
    model.fit(X_scaled, y_raw)
    
    print("Polynomial Convergence Successful.")
    print(f"Normalized Feature Weights (w): {model.w}")
    print(f"Model Bias Scalar (b): {model.b:.4f}")

The Output Trace

Polynomial Convergence Successful.
Normalized Feature Weights (w): [1.5321 1.0456]
Model Bias Scalar (b): 4.1204

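Because the polynomial features were scaled with Z-score normalization, gradient descent handled the x and x^2 columns without overflow, even at a learning rate of 0.1, and fitted the quadratic trend using the same dot-product operations as before. Note that w and b are expressed in the normalized feature space, so they are not directly comparable to the coefficients (1 and 0.5) and intercept (2) of the generating function.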
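To compare fitted parameters with the generating function, they can be mapped back to the original feature scale. Since each scaled feature is (x_j - μ_j) / σ_j, the equivalent original-scale parameters are w_j / σ_j for the weights and b - Σ w_j μ_j / σ_j for the bias. The sketch below checks this identity. As assumptions: it uses closed-form least squares (`np.linalg.lstsq`) as a stand-in for the gradient descent model, and a different random generator than the run above, so its numbers will not match the output trace.

```python
import numpy as np

rng = np.random.default_rng(42)
x = np.linspace(-3, 3, 100)
y = 0.5 * x**2 + x + 2 + rng.standard_normal(100)

# Engineer [x, x^2] and z-score the columns
X_poly = np.column_stack([x, x**2])
mu, sigma = X_poly.mean(axis=0), X_poly.std(axis=0)
X_scaled = (X_poly - mu) / sigma

# Stand-in for the gradient descent fit: least squares on [X_scaled, 1]
A = np.column_stack([X_scaled, np.ones(len(x))])
coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
w_scaled, b_scaled = coeffs[:2], coeffs[2]

# Map the parameters back to the original feature scale
w_orig = w_scaled / sigma
b_orig = b_scaled - np.sum(w_scaled * mu / sigma)

# Both parameterizations must give the same predictions
pred_scaled = X_scaled @ w_scaled + b_scaled
pred_orig = X_poly @ w_orig + b_orig
print(np.allclose(pred_scaled, pred_orig))
print("original-scale weights:", w_orig, "bias:", b_orig)
```

The recovered values should land near 1, 0.5, and 2, up to the noise added to y.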

Next Step: This completes our deep-dive into Regression frameworks. In the next post, I am jumping into Course 1 Week 3: Classification. We will modify our model's output boundaries using the Sigmoid function to switch from predicting continuous real numbers to calculating discrete category probabilities.
