Cost function in linear regression
Dissecting the Squared Error Cost Function in Linear Regression

Hi, I’m Saachi! 👋 I am an undergraduate with a passion for tech. I realized that reading about Machine Learning wasn't enough, so I started Code Train Repeat, a series where I document my journey.
I am moving past pure definitions. To truly understand how a model evaluates its performance, I need to look closely at the mechanics of the mathematical loss functions used in optimization.
Today, I broke down Univariate Linear Regression and the exact geometry of the Mean Squared Error (MSE) Cost Function. The goal is simple: understand the calculus conceptually so that implementing it in vectorized code later doesn't result in hidden dimension alignment bugs.
1. Input Theory: Defining the Engine
When a model makes a prediction, it relies on a specific mathematical function. For a univariate linear regression model—meaning a model with exactly one input variable (x)—the hypothesis function f is written as:
$$f_{w,b}(x) = wx + b$$
The parameters that control the behavior of this straight line are:
w (Weight): Controls the slope of the line.
b (Bias): Controls the y-intercept.
To evaluate how accurately these parameters fit a given training set of size m, we average the squared differences between the model's predictions (ŷ) and the true target values (y). This evaluation is handled by the squared error cost function, denoted as J(w,b):
$$J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} (f_{w,b}(x^{(i)}) - y^{(i)})^2$$
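To make the formula concrete, here is a minimal sketch of the hypothesis and the cost in NumPy. The function names `predict` and `compute_cost` and the three-point demo dataset are my own choices for illustration, not part of any library.

```python
import numpy as np

def predict(x, w, b):
    """Hypothesis f_{w,b}(x) = w*x + b, applied element-wise to a 1D array."""
    return w * x + b

def compute_cost(x, y, w, b):
    """Squared error cost J(w,b) = 1/(2m) * sum((f(x_i) - y_i)^2)."""
    m = x.shape[0]                      # number of training examples
    residuals = predict(x, w, b) - y    # prediction minus target, per example
    return np.sum(residuals ** 2) / (2 * m)

# Illustrative data (an assumption for this sketch): three points on y = 2x
x_demo = np.array([1.0, 2.0, 3.0])
y_demo = np.array([2.0, 4.0, 6.0])

cost_perfect = compute_cost(x_demo, y_demo, w=2.0, b=0.0)  # line passes through every point
cost_flat = compute_cost(x_demo, y_demo, w=0.0, b=0.0)     # predicts 0 everywhere
print(cost_perfect, cost_flat)
```

The perfect line gives a cost of zero, while the flat line gives (4 + 16 + 36) / 6, so a worse fit shows up directly as a larger J.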
Why the 2 in the denominator? The extra 1/2 term is a deliberate mathematical convenience. When we take the partial derivative of J(w,b) with respect to our parameters later during gradient descent, the exponent 2 comes down as a multiplier that cancels exactly with the 2 in the denominator, leaving behind a much cleaner gradient update formula.
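Here is that cancellation written out for the weight w, using the chain rule and the fact that the derivative of f with respect to w is just x:

$$\frac{\partial}{\partial w} J(w,b) = \frac{1}{2m} \sum_{i=1}^{m} 2\,(f_{w,b}(x^{(i)}) - y^{(i)})\,x^{(i)} = \frac{1}{m} \sum_{i=1}^{m} (f_{w,b}(x^{(i)}) - y^{(i)})\,x^{(i)}$$

The same steps apply to b, except that the inner derivative is 1 instead of x, so the trailing factor disappears.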
2. Process to Code: Calculating the Error Surface
To visualize exactly why this cost function behaves in a predictable way, I mapped a basic, clean linear dataset containing four points where y = x:
$$\{(x,y)\} = \{(0,0), (1,1), (2,2), (3,3)\}$$
If we calculate the exact cost polynomial by plugging these coordinates into our equation, the summation expands to a direct function of w and b:
$$J(w,b) = \frac{1}{4}(2b^2 + 6b(w-1) + 7(w-1)^2)$$
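I didn't want to trust my own algebra here, so below is a quick sanity check that compares the expanded polynomial against the original summation at random points. It is a sketch: the function names and the sampling range of -5 to 5 are arbitrary choices of mine. (The expansion uses m = 4, Σx = 6 and Σx² = 14 for this dataset.)

```python
import numpy as np

# The four training points where y = x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 1.0, 2.0, 3.0])

def cost_from_definition(w, b):
    """J(w,b) computed directly from the summation."""
    m = x.shape[0]
    return np.sum((w * x + b - y) ** 2) / (2 * m)

def cost_closed_form(w, b):
    """J(w,b) from the expanded polynomial for this specific dataset."""
    return (2 * b**2 + 6 * b * (w - 1) + 7 * (w - 1) ** 2) / 4

# Compare both versions at 1000 random (w, b) pairs
rng = np.random.default_rng(0)
samples = rng.uniform(-5.0, 5.0, size=(1000, 2))
max_gap = max(
    abs(cost_from_definition(w, b) - cost_closed_form(w, b)) for w, b in samples
)
print(f"largest disagreement: {max_gap:.2e}")
```

Any disagreement should be at the level of floating-point rounding, which tells me the polynomial really is the same function as the summation.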
When you map out this cost function across various values of $w$ and $b$, you get an ideal mathematical surface known as a convex function (or a bowl shape).
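To convince myself that the bowl claim holds for this dataset, I took the second derivatives of the polynomial above. This is a quick check for these four points only, not a general proof:

$$\nabla^2 J = \begin{bmatrix} \frac{\partial^2 J}{\partial w^2} & \frac{\partial^2 J}{\partial w \, \partial b} \\ \frac{\partial^2 J}{\partial b \, \partial w} & \frac{\partial^2 J}{\partial b^2} \end{bmatrix} = \begin{bmatrix} \frac{7}{2} & \frac{3}{2} \\ \frac{3}{2} & 1 \end{bmatrix}$$

$$\det(\nabla^2 J) = \frac{7}{2} \cdot 1 - \left(\frac{3}{2}\right)^2 = \frac{5}{4} > 0, \qquad \operatorname{tr}(\nabla^2 J) = \frac{9}{2} > 0$$

A positive determinant and a positive trace mean both eigenvalues are positive, so the surface curves upward in every direction.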
Notice the structure of the parameter space below:
Here is the clean implementation I ended up with in VS Code:
```python
import numpy as np


class LinearRegressionScratch:
    def __init__(self, learning_rate=0.01, iterations=1000):
        self.lr = learning_rate
        self.iterations = iterations
        self.w = None
        self.b = None

    def fit(self, X, y):
        # m = number of training examples
        m = X.shape[0]
        self.w = 0.0
        self.b = 0.0
        for _ in range(self.iterations):
            # 1. Predict
            y_pred = (self.w * X) + self.b
            # 2. Calculate Gradients (The Calculus Part)
            dw = (1 / m) * np.sum((y_pred - y) * X)
            db = (1 / m) * np.sum(y_pred - y)
            # 3. Update Parameters
            self.w -= self.lr * dw
            self.b -= self.lr * db

    def predict(self, X):
        return (self.w * X) + self.b


# Train Model
if __name__ == "__main__":
    # Generate mock data: y = 2x + 5 + some noise
    np.random.seed(42)
    X = 2 * np.random.rand(100, 1)
    y = 5 + 2 * X + np.random.randn(100, 1)
    # Reshape to 1D arrays for our simple implementation
    X = X.flatten()
    y = y.flatten()
    # Train
    model = LinearRegressionScratch(learning_rate=0.1, iterations=500)
    model.fit(X, y)
    print(f"Trained Weight (w): {model.w:.4f} (Expected close to 2.0)")
    print(f"Trained Bias (b): {model.b:.4f} (Expected close to 5.0)")
```
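The flattening step in the training script is there for a reason. Here is a small sketch of the kind of dimension alignment bug I mentioned at the start: if the inputs stay as a column of shape (100, 1) while the targets are a flat array of shape (100,), NumPy broadcasting quietly produces a full matrix instead of 100 residuals. The array names and the all-ones values are placeholders of mine.

```python
import numpy as np

X_col = np.ones((100, 1))   # column vector, shape (100, 1)
y_flat = np.ones(100)       # flat array, shape (100,)

# Broadcasting expands (100, 1) - (100,) into a (100, 100) matrix, with no error raised
bad_residuals = X_col - y_flat
print(bad_residuals.shape)   # (100, 100)

# Flattening the column first gives one residual per training example
good_residuals = X_col.ravel() - y_flat
print(good_residuals.shape)  # (100,)
```

Summing the (100, 100) version would still return a number, which is what makes this bug so easy to miss. Printing shapes before trusting a gradient is a cheap habit.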
3. Train Model: Analyzing the Geometry
When you slice this 3D bowl horizontally, you generate a contour plot. Each elliptical line represents a collection of coordinates $(w, b)$ that share the exact same cost value $J(w,b)$.
The Outer Rings: Represent pairs of w and b that yield a high cost value. These lines translate to a poor line-of-best-fit that misses our training points by a wide margin.
The Center Orbit: As the ellipses shrink toward the center point, the total cost value decreases toward zero.
The Global Minimum: Located at precisely (w,b) = (1,0). At this point, the cost function drops to zero, yielding the mathematically perfect linear model: y = 1x + 0.
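The minimum sits where the slope of the surface vanishes in every direction. For example, the partial derivative with respect to b, which equals zero at (w,b) = (1,0), is:
$$\frac{\partial}{\partial b} J(w,b) = \frac{1}{m} \sum_{i=1}^{m} (f_{w,b}(x^{(i)}) - y^{(i)})$$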
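A habit I'm trying to build is checking calculus numerically before relying on it. The sketch below compares the derivative above (and its counterpart for w, which carries an extra factor of x) against central finite differences on the four-point dataset. The function names, the test point (0.5, 2.0), and the step size `eps` are my own choices.

```python
import numpy as np

# The four training points where y = x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 1.0, 2.0, 3.0])

def cost(w, b):
    return np.sum((w * x + b - y) ** 2) / (2 * x.shape[0])

def analytic_gradients(w, b):
    """Partial derivatives of J from the closed-form gradient formulas."""
    m = x.shape[0]
    error = w * x + b - y
    dw = np.sum(error * x) / m
    db = np.sum(error) / m
    return dw, db

def numeric_gradients(w, b, eps=1e-6):
    """Central finite differences: nudge one parameter at a time."""
    dw = (cost(w + eps, b) - cost(w - eps, b)) / (2 * eps)
    db = (cost(w, b + eps) - cost(w, b - eps)) / (2 * eps)
    return dw, db

w0, b0 = 0.5, 2.0
dw_a, db_a = analytic_gradients(w0, b0)
dw_n, db_n = numeric_gradients(w0, b0)
print(f"analytic: dw={dw_a:.6f}, db={db_a:.6f}")
print(f"numeric:  dw={dw_n:.6f}, db={db_n:.6f}")

# At the minimum both partial derivatives should be zero
dw_min, db_min = analytic_gradients(1.0, 0.0)
```

If the two rows agree to several decimal places, the formulas used in the update step are very likely right.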
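Understanding this topology is critical. The squared error cost for linear regression is convex, and it is strictly convex whenever the training inputs are not all identical (as in the four-point dataset here), so it has exactly one global minimum. There are no local traps or false dead-ends.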
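To see the single-minimum claim in action, here is a small experiment on the four-point dataset: run plain gradient descent from a few very different starting points and check where each run ends up. The starting points, learning rate, and iteration count are arbitrary values I picked for this sketch.

```python
import numpy as np

# The four training points where y = x
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 1.0, 2.0, 3.0])

def gradient_descent(w, b, learning_rate=0.1, iterations=2000):
    """Batch gradient descent on the squared error cost, starting from (w, b)."""
    m = x.shape[0]
    for _ in range(iterations):
        error = w * x + b - y
        dw = np.sum(error * x) / m
        db = np.sum(error) / m
        w -= learning_rate * dw
        b -= learning_rate * db
    return w, b

# Hypothetical starting points scattered around the parameter space
starts = [(-5.0, 5.0), (10.0, -10.0), (0.0, 0.0)]
results = [gradient_descent(w0, b0) for w0, b0 in starts]

for (w0, b0), (w, b) in zip(starts, results):
    print(f"start=({w0:6.1f}, {b0:6.1f}) -> w={w:.6f}, b={b:.6f}")
```

Every run should land at essentially the same point, (w,b) = (1,0), no matter where it started. On a surface with several local minima, different starting points could end up in different places.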




