Upgrading to Multiple Linear Regression

Univariate linear regression is a great sandbox, but real-world datasets never rely on just a single feature. If you want to predict something complex—like house prices—you need to account for multiple independent variables: size, number of bedrooms, floors, and age.

Today, I upgraded my ML-Journey repository to handle Multiple Linear Regression. This forced me to ditch basic scalar math and transition fully into linear algebra, matrix operations, and vectorized execution using NumPy on my M1 Mac.

1. Input Theory: The Multi-Feature Transition

When we introduce multiple features, our notation must expand. Here is the strict terminology framework I am using to track our data structures:

n: The total number of features (independent variables) in our model.
x_j: The j-th individual feature (where j ranges from 1 to n).
x_vec_i: A row vector containing all feature values for the i-th specific training example, represented mathematically as x^i.
x_j_i: The specific value of feature j within the $i$-th training example, represented mathematically as xj^i.

Our model function f can no longer simply be wx + b. It must scale to handle every feature with its own dedicated weight:

$$f_{\vec{w},b}(\vec{x}) = w_1x_1 + w_2x_2 + \dots + w_n x_n + b$$

Where:

w_vec is a weight vector of shape (n,) containing parameters $[w_1, w_2, \dots, w_n]$, represented mathematically as $\vec{w}$.
x_vec is a feature vector of shape (n,), represented mathematically as $\vec{x}$.
b remains a single scalar bias parameter.

Expressing the Model as a Dot Product

Writing out every feature explicitly becomes mathematically messy and highly inefficient to compute. By grouping our weights into vector w_vec and our features into vector x_vec, we can compress this entire equation into a single vector dot product:

$$f_{\vec{w},b}(\vec{x}) = \vec{w} \cdot \vec{x} + b$$

2. Process to Code: Ditching For-Loops for Vectorization

The biggest bottleneck in Python is loop execution speed. If we implement our new hypothesis function using a naive for loop, Python has to step through every feature sequentially.

Naive vs. Vectorized Architectures

# Architecture A: Naive For-Loop (Sequential CPU overhead)
def f_naive(w, x, b, n):
    f = 0
    for i in range(n):
        f += w[i] * x[i]
    return f + b

# Architecture B: Vectorized Dot Product (Parallel Processing)
import numpy as np
def f_vectorized(w, x, b):
    return np.dot(w, x) + b

Vectorization leverages parallel processing hardware. Instead of calculating one multiplication at a time, NumPy taps directly into low-level routines that handle these calculations simultaneously. On an M1 chip, this completely bypasses massive overhead, converting loop steps into highly optimized operations.

Vectorized Gradient Descent Math

The cost function returns a single scalar error value. However, because we now have n separate weights, we must calculate the partial derivative for every single weight w_j from 1 to n.

The update step for every parameter j during an iteration becomes:

$$w_j = w_j - \alpha \frac{1}{m} \sum_{i=1}^{m} (\vec{w} \cdot \vec{x}^{(i)} + b - y^{(i)})x_j^{(i)}$$

The bias parameter b updates exactly as it did before, utilizing the identical error residual term.

Train Model

Vectorized Execution I implemented this multi-feature model inside my local environment. To match the mathematics, the training loop computes the predictions across all m examples simultaneously using matrix-vector multiplication via np.dot.

Here is the functional script added to multiple_linear_regression.py:

import numpy as np

class MultipleLinearRegression:
    def __init__(self, learning_rate=0.01, iterations=1000):
        self.lr = learning_rate
        self.iterations = iterations
        self.w = None
        self.b = None

    def fit(self, X, y):
        # m = examples, n = features
        m, n = X.shape
        
        # Initialize weights to zero vector, bias to scalar zero
        self.w = np.zeros(n)
        self.b = 0.0
        
        for _ in range(self.iterations):
            # Calculate predictions for all m examples simultaneously
            # X is shape (m, n), self.w is shape (n,) -> y_pred is shape (m,)
            y_pred = np.dot(X, self.w) + self.b
            errors = y_pred - y
            
            # Vectorized gradient calculations
            dw = (1 / m) * np.dot(X.T, errors)
            db = (1 / m) * np.sum(errors)
            
            # Simultaneous parameter adjustment
            self.w -= self.lr * dw
            self.b -= self.lr * db

    def predict(self, X):
        return np.dot(X, self.w) + self.b

if __name__ == "__main__":
    # Mock Dataset: 3 features per example (Size, Bedrooms, Age)
    X_train = np.array([
        [2104, 5, 45],
        [1416, 3, 40],
        [852, 2, 35]
    ], dtype=float)
    
    # Ground truth targets (Prices in thousands)
    y_train = np.array([460, 232, 178], dtype=float)

    # Note: Features are unscaled, so we use a very conservative learning rate
    model = MultipleLinearRegression(learning_rate=1e-7, iterations=100)
    model.fit(X_train, y_train)
    
    print(f"Trained Weight Vector (w): {model.w}")
    print(f"Trained Bias Scalar (b): {model.b:.4f}")

Observation Note: Running this mock data with completely raw, unscaled features requires an incredibly tiny learning rate like 1e-7. Because the house sizes are in the thousands while bedrooms are single digits, our gradients can explode instantly if alpha is even slightly too large.

Upgrading to Multiple Linear Regression