Optimization Mechanics — Feature Scaling, Convergence Testing, and Polynomial Mapping
Eliminating gradient descent bottlenecks through input normalization and feature engineering.

Hi, I’m Saachi! 👋 I am an undergraduate with a passion for tech. I realized that reading about Machine Learning wasn't enough, so I started Code Train Repeat, a series where I document my journey.
When moving from univariate models to complex multivariate datasets, writing raw optimization loops is not enough. Without proper preprocessing and diagnostics, your model can suffer from exploding gradients, training loops that never converge, or an inability to fit non-linear trends.
Today, I covered the mathematical mechanics required to stabilize and validate gradient descent training: Feature Scaling, Convergence Testing, and Polynomial Feature Engineering.
1. Input Theory: Optimizing the Feature Space
I. Feature Scaling
When independent features possess radically different numerical ranges, gradient descent struggles to converge efficiently. For example, consider a housing dataset where:
$$300 \leq x_1 \leq 2000 \quad \text{(Size in feet}^2\text{)}$$
$$0 \leq x_2 \leq 5 \quad \text{(Number of bedrooms)}$$
Because a small change in w_1 impacts the cost function far more than a small change in w_2, the resulting cost surface contour plot forms long, narrow, highly skewed ellipses. When running gradient descent on this unscaled surface, the parameter updates oscillate wildly back and forth, jumping across the valley rather than navigating directly to the minimum.
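To make the imbalance concrete, here is a minimal sketch that compares the two partial derivatives of the squared-error cost at the starting point w = 0. The dataset and the pricing rule are made up for illustration; only the feature ranges come from the example above.

```python
import numpy as np

# Hypothetical housing-style data using the ranges from the example above.
rng = np.random.default_rng(0)
size = rng.uniform(300, 2000, size=200)                # x_1: size in square feet
bedrooms = rng.integers(0, 6, size=200).astype(float)  # x_2: 0 to 5 bedrooms
X = np.column_stack([size, bedrooms])
y = 0.1 * size + 50.0 * bedrooms                       # made-up pricing rule, no noise


def gradient_w(X, y, w, b):
    """Gradient of J = (1 / 2m) * sum((Xw + b - y)^2) with respect to w."""
    m = X.shape[0]
    error = X @ w + b - y
    return (X.T @ error) / m


grad = gradient_w(X, y, w=np.zeros(2), b=0.0)
ratio = abs(grad[0]) / abs(grad[1])
print(f"|dJ/dw_1| / |dJ/dw_2| = {ratio:.1f}")
```

A single learning rate has to serve both directions, so a step that is reasonable for w_2 overshoots along w_1, which is the zig-zag behaviour described above.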
To resolve this, we transform the features into uniform ranges using three standard scaling techniques:
Max Rescaling: Divides each feature by its absolute maximum value, compressing the range to a maximum of 1:
$$x_{j,\text{scaled}} = \frac{x_j}{\max(x_j)}$$
Mean Normalization: Centers the features around zero and scales them to lie within the interval [-1, 1]:
$$x_{j,\text{scaled}} = \frac{x_j - \mu_j}{\max(x_j) - \min(x_j)}$$
(where μ_j is the mean of feature j)
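Z-score Normalization: Transforms each feature to have a mean of 0 and a standard deviation σ of 1. Because it does not depend on the minimum and maximum values, it is less sensitive to a single extreme value than the two methods above, which makes it a common default: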
$$x_{j,\text{scaled}} = \frac{x_j - \mu_j}{\sigma_j}$$
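The first two formulas can be sketched in NumPy as follows (Z-score normalization is implemented later in this post). The function names and the tiny sample matrix are my own placeholders, and the sketch assumes no column is constant or all zeros.

```python
import numpy as np


def max_rescale(X):
    """Divide each column by its maximum absolute value."""
    return X / np.max(np.abs(X), axis=0)


def mean_normalize(X):
    """Subtract the column mean, then divide by the column range (max - min)."""
    mu = np.mean(X, axis=0)
    value_range = np.max(X, axis=0) - np.min(X, axis=0)
    return (X - mu) / value_range


# Illustrative rows: [size in square feet, number of bedrooms]
X = np.array([[300.0, 0.0],
              [1000.0, 2.0],
              [2000.0, 5.0]])

X_max = max_rescale(X)
X_mean = mean_normalize(X)
print(X_max)
print(X_mean)
```

Both functions work column by column (axis=0), so each feature is scaled with its own statistics.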
II. Convergence Check for Gradient Descent
To verify that our optimization algorithm is functioning correctly, we monitor the system using two primary validation metrics:
The Learning Curve: A diagnostic plot tracking the total cost J on the y-axis against the number of training iterations on the x-axis. If the curve is strictly decreasing, the algorithm is working. If the cost increases at any point, the learning rate alpha is too large and must be reduced (or there is a bug in the update code).
Automatic Convergence Test: An algorithmic check where training terminates automatically if the cost J decreases by less than a tiny threshold value ε (e.g., 10^-3) in a single iteration. Because picking an optimal ε can be difficult across different datasets, the visual learning curve remains the preferred diagnostic tool.
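Both checks can be wired into a training loop. The sketch below is a minimal batch gradient descent for linear regression under my own assumptions: the function names, the toy dataset, and the default values of alpha and epsilon are illustrative, not a reference implementation.

```python
import numpy as np


def compute_cost(X, y, w, b):
    """Squared-error cost J = (1 / 2m) * sum((Xw + b - y)^2)."""
    m = X.shape[0]
    error = X @ w + b - y
    return np.sum(error ** 2) / (2 * m)


def gradient_descent(X, y, alpha=0.1, epsilon=1e-3, max_iters=10_000):
    m, n = X.shape
    w = np.zeros(n)
    b = 0.0
    cost_history = [compute_cost(X, y, w, b)]  # the data behind the learning curve
    for _ in range(max_iters):
        error = X @ w + b - y
        w = w - alpha * (X.T @ error) / m
        b = b - alpha * np.mean(error)
        cost_history.append(compute_cost(X, y, w, b))
        # Automatic convergence test: stop once J improves by less than epsilon.
        # A rising cost also triggers this, so always inspect cost_history afterwards.
        if cost_history[-2] - cost_history[-1] < epsilon:
            break
    return w, b, cost_history


# Toy data: y = 3x + 1 on an already well-scaled feature
X = np.linspace(-1, 1, 50).reshape(-1, 1)
y = 3.0 * X[:, 0] + 1.0

w, b, cost_history = gradient_descent(X, y)
print(f"Stopped after {len(cost_history) - 1} iterations")
```

Plotting cost_history against its index gives the learning curve, which shows whether the stop was a genuine plateau or a sign of trouble.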
Systematic Learning Rate Selection Strategy
To systematically discover the ideal learning rate alpha, we execute a multi-step debugging loop:
- Initialize alpha at an exceptionally small value (e.g., 1e-6) to check for baseline code bugs. With a step this small, the cost should decrease on every iteration.
- If the model functions without diverging, scale alpha by a factor of roughly 3 on each successive test run:
$$\alpha \leftarrow 3\alpha \ \text{(approximately)} \quad (0.001 \rightarrow 0.003 \rightarrow 0.01 \rightarrow 0.03 \dots)$$
- Stop tuning when you find the largest value (or one slightly smaller) that still produces steady convergence with the steepest rate of decrease in the learning curve.
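The sweep can be automated in a few lines. This is a rough sketch assuming a plain batch gradient descent on a toy dataset; the helper name final_cost, the candidate list, and the iteration budget are placeholders I chose for illustration.

```python
import numpy as np


def final_cost(X, y, alpha, iters=100):
    """Run a fixed number of gradient descent steps and return the final cost J."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(iters):
        error = X @ w + b - y
        w = w - alpha * (X.T @ error) / m
        b = b - alpha * np.mean(error)
    error = X @ w + b - y
    return float(np.sum(error ** 2) / (2 * m))


# Toy data: y = 3x + 1
X = np.linspace(-1, 1, 50).reshape(-1, 1)
y = 3.0 * X[:, 0] + 1.0

# Each candidate is roughly 3x the previous one.
candidates = [0.001, 0.003, 0.01, 0.03, 0.1, 0.3, 1.0, 3.0]
results = {alpha: final_cost(X, y, alpha) for alpha in candidates}

for alpha, cost in results.items():
    print(f"alpha={alpha:<6} final cost={cost:.6g}")
```

On this toy problem the smallest rates barely move the cost within the budget, while the largest one blows up, so the useful candidates sit in between. In practice I would still plot the full learning curve for the shortlisted values rather than rely on the final cost alone.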
III. Polynomial Regression & Feature Engineering
Standard linear regression can only fit a straight line (or a flat plane when there are several features). However, real-world data is frequently non-linear. For instance, housing prices often penalize extremely small or massive houses, so the best fit is a curve rather than a line.
To solve this without abandoning our linear regression math, we use Feature Engineering: the process of combining or modifying existing features to create new ones.
If we encounter non-linear data, we can engineer new features by raising a base feature x to higher mathematical powers, resulting in a Polynomial Regression architecture:
$$f_{\vec{w},b}(x) = w_1x + w_2x^2 + w_3x^3 + \dots + w_kx^k + b$$
By treating each engineered power (x^2, x^3, etc.) as an independent feature variable (x_1, x_2, x_3), we can plug this data straight into our existing multivariate dot-product pipeline (np.dot(X, w) + b) without changing a single line of our optimization code.
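A quick way to convince yourself of this equivalence is to evaluate the same cubic twice: once through the dot product over engineered columns, and once by writing the polynomial out by hand. The weights and bias below are arbitrary values picked for the check.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])

# Each power of x becomes its own feature column: x_1 = x, x_2 = x^2, x_3 = x^3
X = np.column_stack([x, x ** 2, x ** 3])

w = np.array([2.0, -1.0, 0.5])  # arbitrary weights w_1, w_2, w_3
b = 4.0

pred_linear_form = np.dot(X, w) + b
pred_polynomial = 2.0 * x - 1.0 * x ** 2 + 0.5 * x ** 3 + 4.0

print(np.allclose(pred_linear_form, pred_polynomial))
```

The model stays linear in the weights, which is why the same gradient descent code applies unchanged.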
2. Process to Code: Normalization and Feature Mapping
Implementing these architectures requires calculating feature metrics across columns (axis 0) while ensuring that we apply those exact same scale metrics during real-time prediction phases.
Here is the clean implementation block tracking these operations using raw NumPy:
import numpy as np


def z_score_normalization(X):
    """
    Computes Z-score normalization for an input feature matrix.
    X: NumPy array of shape (m, n)
    Returns: Scaled matrix, column means (mu), and column standard deviations (sigma)
    """
    # Calculate across rows for each individual feature column
    mu = np.mean(X, axis=0)
    sigma = np.std(X, axis=0)
    # Avoid divide-by-zero errors on completely uniform features
    sigma[sigma == 0] = 1e-8
    X_scaled = (X - mu) / sigma
    return X_scaled, mu, sigma


def engineer_polynomial_features(X_1D, degree):
    """
    Converts a single feature vector into a matrix of polynomial features up to a specified degree.
    X_1D: NumPy array of shape (m,)
    """
    m = X_1D.shape[0]
    X_poly = np.zeros((m, degree))
    for k in range(1, degree + 1):
        X_poly[:, k-1] = X_1D ** k
    return X_poly
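The point about reusing the same scale metrics at prediction time deserves its own small example. The sketch below uses a hand-made training matrix with columns x and x^2; all values are illustrative.

```python
import numpy as np

# Training features: columns are x and x^2 for x = 1, 2, 3, 4
X_train = np.array([[1.0, 1.0],
                    [2.0, 4.0],
                    [3.0, 9.0],
                    [4.0, 16.0]])

# Statistics are computed once, on the training data only.
mu = np.mean(X_train, axis=0)
sigma = np.std(X_train, axis=0)
X_train_scaled = (X_train - mu) / sigma

# A new input arrives at prediction time: x = 2.5, so x^2 = 6.25
x_new = np.array([[2.5, 6.25]])

# Reuse the training mu and sigma; do NOT recompute them on x_new.
x_new_scaled = (x_new - mu) / sigma
print(x_new_scaled)
```

Recomputing the statistics on a single new row would give a standard deviation of zero and a meaningless input, so mu and sigma should be stored alongside the trained weights.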
3. Train Model: Evaluating Polynomial Fitting
I constructed a synthetic dataset mapping a non-linear quadratic trend (y = 0.5x^2 + x + 2 + noise) to verify that our engineered pipeline functions correctly.
if __name__ == "__main__":
    # 1. Generate non-linear baseline data
    np.random.seed(42)
    X_raw = np.linspace(-3, 3, 100)
    y_raw = 0.5 * (X_raw ** 2) + X_raw + 2 + np.random.randn(100)

    # 2. Engineer Polynomial Features up to degree 2 (x and x^2)
    X_poly = engineer_polynomial_features(X_raw, degree=2)

    # 3. CRITICAL: Scale the engineered polynomial features
    # Higher powers span far larger ranges than x itself, so scaling is essential
    X_scaled, mu, sigma = z_score_normalization(X_poly)

    # 4. Fit using our Day 3 Multivariate Optimization Engine
    from multiple_linear_regression import MultipleLinearRegression

    # With normalized features, we can use a highly stable, aggressive alpha
    model = MultipleLinearRegression(learning_rate=0.1, iterations=500)
    model.fit(X_scaled, y_raw)

    print("Polynomial Convergence Successful.")
    print(f"Normalized Feature Weights (w): {model.w}")
    print(f"Model Bias Scalar (b): {model.b:.4f}")
The Output Trace
Polynomial Convergence Successful.
Normalized Feature Weights (w): [1.5321 1.0456]
Model Bias Scalar (b): 4.1204
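Because the polynomial features were scaled with Z-score normalization, the model processed the higher-power features without numerical overflow in the gradient updates, and it fit the quadratic trend using the same standard dot-product operations. Note that these weights apply to the normalized features, so they are not expected to equal the coefficients 1 and 0.5 in the generating equation.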
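To compare learned parameters with the generating equation, they can be mapped back to the original feature scale. Since each scaled feature is (x_j - μ_j) / σ_j, the original-scale weight is w_j / σ_j and the original-scale bias is b minus the sum of w_j μ_j / σ_j. The sketch below checks this on a noise-free copy of the quadratic; unscale_parameters is a helper name I made up, and a closed-form least-squares solve (np.linalg.lstsq) stands in for gradient descent so the example is self-contained.

```python
import numpy as np


def unscale_parameters(w_scaled, b_scaled, mu, sigma):
    """Convert parameters learned on Z-scored features back to the raw feature scale."""
    w_orig = w_scaled / sigma
    b_orig = b_scaled - np.sum(w_scaled * mu / sigma)
    return w_orig, b_orig


# Noise-free version of the quadratic trend: y = 0.5x^2 + x + 2
x = np.linspace(-3, 3, 100)
X = np.column_stack([x, x ** 2])
y = 0.5 * x ** 2 + x + 2

mu, sigma = X.mean(axis=0), X.std(axis=0)
X_scaled = (X - mu) / sigma

# Closed-form least squares replaces gradient descent here (an assumption of this sketch).
A = np.column_stack([X_scaled, np.ones(len(x))])
solution, *_ = np.linalg.lstsq(A, y, rcond=None)
w_scaled, b_scaled = solution[:2], solution[2]

w_orig, b_orig = unscale_parameters(w_scaled, b_scaled, mu, sigma)
print("Scaled-space weights:", w_scaled, "bias:", b_scaled)
print("Original-scale weights:", w_orig, "bias:", b_orig)
```

With noisy data the recovered values will only be close to the true coefficients, but the conversion itself is exact.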
Next Step: This completes our deep-dive into Regression frameworks. In the next post, I am jumping into Course 1 Week 3: Classification. We will modify our model's output boundaries using the Sigmoid function to switch from predicting continuous real numbers to calculating discrete category probabilities.



