Cost function in linear regression

I am moving past pure definitions. To truly understand how a model evaluates its performance, I need to look closely at the mechanics of the mathematical loss functions used in optimization.

Today, I broke down Univariate Linear Regression and the exact geometry of the Mean Squared Error (MSE) Cost Function. The goal is simple: understand the calculus conceptually so that implementing it in vectorized code later doesn't result in hidden dimension alignment bugs.

1. Input Theory: Defining the Engine

When a model makes a prediction, it relies on a specific mathematical function. For a univariate linear regression model—meaning a model with exactly one input variable (x)—the hypothesis function f is written as:

$$f_{w,b}(x) = wx + b$$

The parameters that control the behavior of this straight line are:

w (Weight): Controls the slope of the line.
b (Bias): Controls the y-intercept.

To evaluate how accurately these parameters fit a given training set of size m, we measure the average distance between the model's predictions (ŷ) and the true target values (y). This evaluation is handled by the squared error cost function, denoted as J(w,b):

$$J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} (f_{w,b}(x^{(i)}) - y^{(i)})^2$$

Why the 2 in the denominator? The extra 1/2 term is a deliberate mathematical convenience. When we take the partial derivative of J(w,b)with respect to our parameters later during gradient descent, the exponent 2 brings down a multiplier that cancels out perfectly with the 2 in the denominator, leaving behind a much cleaner gradient update formula.

2. Process to Code: Calculating the Error Surface

To visualize exactly why this cost function behaves predictable way, I mapped a basic, clean linear dataset containing four points where y = x:

$${(x,y)} = {(0,0), (1,1), (2,2), (3,3)}$$

If we calculate the exact cost polynomial by plugging these coordinates into our equation, the summation expands to a direct function of w and b:

$$J(w,b) = \frac{1}{4}(2b^2 + 6b(w-1) + 7(w-1)^2)$$

When you map out this cost function across various values of $w$ and $b$, you get an ideal mathematical surface known as a convex function (or a bowl shape).

Notice the structure of the parameter space below:

Here is the clean implementation I ended up with in VS Code:

import numpy as np

class LinearRegressionScratch:
    def __init__(self, learning_rate=0.01, iterations=1000):
        self.lr = learning_rate
        self.iterations = iterations
        self.w = None
        self.b = None

    def fit(self, X, y):
        # m = number of training examples
        m = X.shape[0]
        self.w = 0.0
        self.b = 0.0

        for _ in range(self.iterations):
            # 1. Predict
            y_pred = (self.w * X) + self.b
            
            # 2. Calculate Gradients (The Calculus Part)
            dw = (1 / m) * np.sum((y_pred - y) * X)
            db = (1 / m) * np.sum(y_pred - y)
            
            # 3. Update Parameters
            self.w -= self.lr * dw
            self.b -= self.lr * db

    def predict(self, X):
        return (self.w * X) + self.b

Train Model

if __name__ == "__main__":
    # Generate mock data: y = 2x + 5 + some noise
    np.random.seed(42)
    X = 2 * np.random.rand(100, 1)
    y = 5 + 2 * X + np.random.randn(100, 1)

    # Reshape to 1D arrays for our simple implementation
    X = X.flatten()
    y = y.flatten()

    # Train
    model = LinearRegressionScratch(learning_rate=0.1, iterations=500)
    model.fit(X, y)

    print(f"Trained Weight (w): {model.w:.4f} (Expected close to 2.0)")
    print(f"Trained Bias (b): {model.b:.4f} (Expected close to 5.0)")

3. Train Model: Analyzing the Geometry

When you slice this 3D bowl horizontally, you generate a contour plot. Each elliptical line represents a collection of coordinates $(w, b)$ that share the exact same cost value (w,b).

The Outer Rings: Represent pairs of w and b that yield a high cost value. These lines translate to a poor line-of-best-fit that misses our training points by a wide margin.
The Center Orbit: As the ellipses shrink toward the center point, the total cost value decreases toward zero.
The Global Minimum: Located at precisely (w,b) = (1,0). At this point, the cost function drops to zero, yielding the mathematically perfect linear model: y = 1x + 0.

$$\frac{\partial}{\partial b} J(w,b) = \frac{1}{m} \sum_{i=1}^{m} (f_{w,b}(x^{(i)}) - y^{(i)})$$

Understanding this topology is critical. Because the squared error cost function is strictly convex, it guarantees exactly one global minimum. There are no local traps or false dead-ends.