
Machine Learning Foundations - The Way It Finally Clicked

Weights, bias, cost functions, gradients, gradient descent, and the Normal Equation explained from first principles.


The Big Picture

A machine learning model is trying to learn a relationship between inputs and outputs.

House Size → House Price
Years of Experience → Salary
Study Hours → Exam Marks

The goal is to make predictions that are as close as possible to the actual values.


1. Weights

Definition

A weight determines how much an input affects the prediction.

$$\text{Salary} = w \times \text{Experience}$$

If $w = 5000$, every year of experience adds ₹5,000 to the salary.

Intuition

Think of weight as importance or rate of change:

| Context | Weight means |
| --- | --- |
| Taxi | Cost per kilometer |
| Salary | Increase per year of experience |
| House price | Price per square foot |

2. Bias

Definition

Bias is a learnable starting value added to the prediction:

$$\hat{y} = w \cdot x + b$$

Why Do We Need It?

Real-world relationships rarely start at zero.

Without bias:

$$\text{Salary} = 5000 \times \text{Experience}$$

At $\text{Experience} = 0$, the predicted salary is ₹0. Unrealistic.

With bias:

$$\text{Salary} = 5000 \times \text{Experience} + 30000$$

Now a fresh graduate earns ₹30,000 — much more realistic.

Intuition

$$w = \text{slope of the line} \qquad b = \text{where the line starts}$$

Or: $w$ is the per-kilometer rate, $b$ is the base taxi fare.


3. Prediction Function

A linear model with $n$ features:

$$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n$$

Where $\beta_0$ is the bias and $\beta_1, \ldots, \beta_n$ are the weights. For example:

$$\text{House Price} = 50 + 0.2 \times \text{Size} + 10 \times \text{Bedrooms}$$
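As a minimal sketch, the prediction function fits in a few lines of Python. The function name `predict` and the sample house (size 1000, 3 bedrooms) are illustrative assumptions, not part of the original example.

```python
def predict(features, weights, bias):
    """Linear prediction: bias + sum of weight * feature."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return bias + sum(w * x for w, x in zip(weights, features))


# Coefficients from the house-price example above.
bias = 50.0
weights = [0.2, 10.0]  # [per unit of size, per bedroom]

# Hypothetical house: size 1000, 3 bedrooms.
price = predict([1000.0, 3.0], weights, bias)
print(price)  # 50 + 0.2 * 1000 + 10 * 3 = 280
```

Keeping the bias as a separate argument mirrors the $\beta_0$ term; the weights line up one-to-one with the features.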


4. Error

After making a prediction:

$$\text{Error} = y - \hat{y}$$

If the actual price is 300 and the model predicts 280, the error is 20.


5. Cost Function

Purpose

The cost function measures how bad the current weights and bias are. The model wants to minimise it.

Mean Squared Error

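$$J(\beta) = \frac{1}{2m} \sum_{i=1}^{m} \left(y^{(i)} - \hat{y}^{(i)}\right)^2$$

The extra factor of $\frac{1}{2}$ is a convention: it cancels the 2 that appears when the square is differentiated, and it does not change which $\beta$ minimises the cost.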

Why Square the Errors?

Errors can be positive or negative. Squaring makes them all non-negative, so they don't cancel each other out. It also penalises large mistakes much more heavily than small ones: $10^2 = 100$ versus $2^2 = 4$.

Why Divide by $m$?

Without averaging, a model trained on 100 examples would look worse than one trained on 10 even if both make the same average mistake, simply because more squared errors are being summed. Dividing by $m$ means the cost measures model quality, not dataset size.
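The cost formula can be checked by hand on a tiny example. The sketch below assumes two made-up actual/predicted price pairs; the function name `cost` is illustrative.

```python
def cost(y_true, y_pred):
    """Half mean squared error: J = 1/(2m) * sum((y - y_hat)^2)."""
    m = len(y_true)
    if m == 0 or m != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)) / (2 * m)


# Made-up actual and predicted prices.
actual = [300.0, 200.0]
predicted = [280.0, 210.0]

# Errors are 20 and -10; squares are 400 and 100; J = 500 / (2 * 2) = 125.
print(cost(actual, predicted))

# Repeating the same data ten times leaves the cost unchanged,
# which is the point of dividing by m.
print(cost(actual * 10, predicted * 10))
```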


6. Gradient

The gradient tells us how the cost changes when we nudge the weights:

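$$\nabla J = \frac{\partial J}{\partial w}$$

With more than one parameter, the gradient is the vector of partial derivatives, one per parameter.

Think of it as direction + steepness: it points in the direction in which the cost rises fastest, so it answers "which way should I move to reduce error?" (the opposite way).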
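For the one-feature model $\hat{y} = wx + b$ and the cost $J$ from the previous section, differentiating gives $\frac{\partial J}{\partial w} = -\frac{1}{m}\sum (y - \hat{y})\,x$ and $\frac{\partial J}{\partial b} = -\frac{1}{m}\sum (y - \hat{y})$. The sketch below computes both and compares the first against a finite-difference "nudge". The data and function names are made up for illustration.

```python
def cost(w, b, xs, ys):
    """J(w, b) = 1/(2m) * sum((y - (w*x + b))^2)."""
    m = len(xs)
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / (2 * m)


def gradient(w, b, xs, ys):
    """Analytic partial derivatives (dJ/dw, dJ/db) of the cost above."""
    m = len(xs)
    errors = [y - (w * x + b) for x, y in zip(xs, ys)]
    dj_dw = -sum(e * x for e, x in zip(errors, xs)) / m
    dj_db = -sum(errors) / m
    return dj_dw, dj_db


# Made-up data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

w, b = 0.0, 0.0
dj_dw, dj_db = gradient(w, b, xs, ys)

# Finite-difference check: nudge w slightly and see how the cost changes.
eps = 1e-6
numeric_dj_dw = (cost(w + eps, b, xs, ys) - cost(w - eps, b, xs, ys)) / (2 * eps)

print(dj_dw, dj_db)    # both negative: increasing w and b lowers the cost
print(numeric_dj_dw)   # should agree closely with dj_dw
```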


7. Gradient Descent

Gradient descent uses the gradient to update weights iteratively:

$$w \leftarrow w - \alpha \frac{\partial J}{\partial w}$$

Where $\alpha$ is the learning rate: how large a step we take on each update. The bias $b$ is updated the same way, using $\frac{\partial J}{\partial b}$.

Intuition: You're standing on a mountain and want to reach the valley. The gradient tells you which direction is uphill. Gradient descent tells you to walk the other way. Repeat until you're at the bottom.
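The mountain picture can be tried on a toy cost, $J(w) = (w - 3)^2$, chosen here only because its minimum at $w = 3$ is known in advance. The learning rate, starting point, and step count below are assumed values.

```python
def dj_dw(w):
    """Derivative of the toy cost J(w) = (w - 3)^2."""
    return 2.0 * (w - 3.0)


alpha = 0.1   # learning rate (assumed value)
w = 0.0       # starting point

for step in range(50):
    # Move against the gradient: downhill on the cost surface.
    w = w - alpha * dj_dw(w)

print(w)  # approaches 3, the bottom of the valley
```

With this cost, each step shrinks the distance to the minimum by a factor of $1 - 2\alpha = 0.8$. A learning rate above 1 would make the same loop diverge.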


8. How Learning Actually Happens

The model starts with random weights, e.g. $w = 0.73$, $b = -1.42$.

Training loop:

$$\text{Random weights} \;\to\; \hat{y} \;\to\; \text{Error} \;\to\; J(\beta) \;\to\; \nabla J \;\to\; \text{Update} \;\to\; \text{Repeat}$$

With a suitable learning rate, the cost eventually settles near its minimum and the model has learned useful weights.
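The whole loop fits in one short function. This is a sketch for a single feature, starting from the $w$ and $b$ quoted above. The data, learning rate, and number of steps are assumptions chosen so the loop converges.

```python
def train(xs, ys, w, b, alpha, steps):
    """Fit y_hat = w*x + b by batch gradient descent on J = 1/(2m) * sum(error^2)."""
    m = len(xs)
    for _ in range(steps):
        # 1. Predict, 2. measure the errors.
        errors = [y - (w * x + b) for x, y in zip(xs, ys)]
        # 3. Gradient of the cost with respect to w and b.
        dj_dw = -sum(e * x for e, x in zip(errors, xs)) / m
        dj_db = -sum(errors) / m
        # 4. Update both parameters, then repeat.
        w -= alpha * dj_dw
        b -= alpha * dj_db
    return w, b


# Made-up data lying exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]

# Starting values taken from the text; alpha and steps are assumed.
w, b = train(xs, ys, w=0.73, b=-1.42, alpha=0.05, steps=2000)
print(w, b)  # close to the true slope 2 and intercept 1
```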


9. Matrix Form of the Cost Function

For a dataset of $m$ examples and $n$ features, predictions can be written as the matrix product $X\beta$, where $X \in \mathbb{R}^{m \times (n+1)}$ is the design matrix. The extra column is a column of ones, so that $\beta_0$ acts as the bias. The cost function becomes:

$$J(\beta) = \frac{1}{2m}(y - X\beta)^\top(y - X\beta)$$

This is identical to the scalar form, just written in linear algebra notation. The vector $y - X\beta$ is the error vector, and $(y - X\beta)^\top(y - X\beta)$ is the sum of its squared elements.
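To see that the two forms agree, the sketch below evaluates both on the same made-up data. It uses plain Python lists instead of a linear algebra library so it runs without dependencies; the helper names are illustrative.

```python
def matvec(X, beta):
    """Matrix-vector product X @ beta for a list-of-rows matrix."""
    return [sum(x_ij * b_j for x_ij, b_j in zip(row, beta)) for row in X]


def cost_matrix(X, y, beta):
    """J = 1/(2m) * (y - X beta)^T (y - X beta)."""
    m = len(y)
    residual = [y_i - p_i for y_i, p_i in zip(y, matvec(X, beta))]
    return sum(r * r for r in residual) / (2 * m)  # r^T r is a dot product


def cost_scalar(features, y, bias, weights):
    """The same cost written example by example."""
    m = len(y)
    total = 0.0
    for row, y_i in zip(features, y):
        y_hat = bias + sum(w * x for w, x in zip(weights, row))
        total += (y_i - y_hat) ** 2
    return total / (2 * m)


# Made-up data: two features (size, bedrooms) for three houses.
features = [[1000.0, 3.0], [1500.0, 4.0], [800.0, 2.0]]
y = [300.0, 380.0, 220.0]

# Design matrix: prepend a 1 to every row so beta[0] acts as the bias.
X = [[1.0] + row for row in features]
beta = [50.0, 0.2, 10.0]

j_matrix = cost_matrix(X, y, beta)
j_scalar = cost_scalar(features, y, beta[0], beta[1:])
print(j_matrix, j_scalar)  # errors 20, -10, -10 give 600 / (2 * 3) = 100
```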


10. Batch, Stochastic, and Mini-Batch Gradient Descent

| Variant | Data per update | Pros | Cons |
| --- | --- | --- | --- |
| Batch GD | Full dataset | Stable, accurate gradient | Slow |
| Stochastic GD (SGD) | 1 example | Fast | Noisy updates |
| Mini-Batch GD | $k$ examples (e.g. 32) | Fast + stable | Needs tuning of $k$ |

Mini-batch is the default in neural networks and deep learning.
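The three variants differ only in how many examples feed each gradient computation. One hypothetical helper, here called `minibatches`, covers all of them by changing the batch size. The data and the seed are made up.

```python
import random


def minibatches(examples, batch_size, rng):
    """Yield shuffled batches of at most batch_size examples (one epoch)."""
    indices = list(range(len(examples)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [examples[i] for i in indices[start:start + batch_size]]


# Ten made-up (x, y) pairs.
examples = [(float(x), 2.0 * x + 1.0) for x in range(10)]
rng = random.Random(0)  # fixed seed so runs are repeatable

full = list(minibatches(examples, len(examples), rng))  # batch GD: 1 update per epoch
sgd = list(minibatches(examples, 1, rng))               # SGD: 10 updates per epoch
mini = list(minibatches(examples, 4, rng))              # mini-batch: 3 updates (4 + 4 + 2)

print(len(full), len(sgd), len(mini))
```

Each yielded batch would be passed to the gradient computation in place of the full dataset; smaller batches mean more updates per pass, each based on a noisier gradient estimate.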


11. Evaluation Metrics

$R^2$ Score

Measures how much variation in the data the model explains:

$$R^2 = 1 - \frac{\sum(y - \hat{y})^2}{\sum(y - \bar{y})^2}$$

  • $R^2 = 1$: perfect fit
  • $R^2 = 0$: model is no better than predicting the mean
  • $R^2 < 0$: model is worse than predicting the mean

Adjusted $R^2$

Penalises adding features that don't improve the fit:

$$\bar{R}^2 = 1 - (1 - R^2)\frac{m - 1}{m - n - 1}$$
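Both scores are short to compute by hand. The sketch below assumes made-up values with $m = 4$ examples and $n = 1$ feature; the function names are illustrative.

```python
def r2_score(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_tot (assumes y_true is not constant)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot


def adjusted_r2(r2, m, n):
    """Adjusted R^2 for m examples and n features (requires m > n + 1)."""
    return 1.0 - (1.0 - r2) * (m - 1) / (m - n - 1)


# Made-up values.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.0]

r2 = r2_score(y_true, y_pred)
adj = adjusted_r2(r2, m=4, n=1)
print(r2)   # 1 - 0.5 / 20, about 0.975
print(adj)  # 1 - 0.025 * 3 / 2, about 0.9625

# Predicting the mean for every example scores exactly zero.
baseline = r2_score(y_true, [6.0, 6.0, 6.0, 6.0])
print(baseline)
```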

RMSE

Root Mean Squared Error is the typical prediction error, in the same units as $y$:

$$\text{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)} - \hat{y}^{(i)}\right)^2}$$

Penalises large errors heavily. Lower is better.

MAE

Mean Absolute Error is the average size of the prediction error, with every unit of error weighted equally:

$$\text{MAE} = \frac{1}{m}\sum_{i=1}^{m}\left|y^{(i)} - \hat{y}^{(i)}\right|$$

Lower is better.
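The difference between the two metrics shows up when one error is much larger than the rest. The sketch below uses two made-up prediction sets with the same total absolute error.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    m = len(y_true)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / m)


def mae(y_true, y_pred):
    """Mean absolute error."""
    m = len(y_true)
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / m


y_true = [10.0, 10.0, 10.0, 10.0]
even = [8.0, 12.0, 8.0, 12.0]        # four errors of size 2
outlier = [10.0, 10.0, 10.0, 18.0]   # one error of size 8

print(mae(y_true, even), rmse(y_true, even))        # 2.0 and 2.0
print(mae(y_true, outlier), rmse(y_true, outlier))  # 2.0 and 4.0
```

MAE cannot tell the two cases apart, while RMSE doubles for the set with the single large miss.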


12. Normal Equation

Instead of iterating with gradient descent, we can solve directly for the optimal weights in closed form:

$$\beta = (X^\top X)^{-1} X^\top y$$

Provided $X^\top X$ is invertible, this gives the exact $\beta$ that minimises the cost function: no iterations, no learning rate, no gradient descent required.

Pros: exact solution, simple.

Cons: requires computing $(X^\top X)^{-1}$, which has complexity $O(n^3)$ in the number of features $n$. Impractical for large $n$.
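For a single feature, $X$ has two columns (ones and $x$), so $X^\top X$ is a 2×2 matrix that can be inverted by hand. The sketch below does exactly that in plain Python. It is restricted to one feature as a simplifying assumption, and a numerical library would normally handle larger $n$. The data is generated from the salary example so the expected answer is known.

```python
def normal_equation_1d(xs, ys):
    """Solve beta = (X^T X)^{-1} X^T y for X = [1, x] (one feature plus bias)."""
    m = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))

    # X^T X = [[m, sx], [sx, sxx]] and X^T y = [sy, sxy].
    det = m * sxx - sx * sx
    if det == 0:
        raise ValueError("X^T X is singular (all x values are identical)")

    # Apply the inverse of the 2x2 matrix explicitly.
    beta0 = (sxx * sy - sx * sxy) / det
    beta1 = (m * sxy - sx * sy) / det
    return beta0, beta1


# Salaries generated from the earlier example: 30000 + 5000 * experience.
experience = [0.0, 1.0, 2.0, 3.0, 4.0]
salary = [30000.0 + 5000.0 * x for x in experience]

beta0, beta1 = normal_equation_1d(experience, salary)
print(beta0, beta1)  # recovers the bias 30000 and the weight 5000
```

The singular case in the guard is the practical face of the invertibility condition: if every example has the same $x$, no unique line fits the data.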


Final Mental Model

$$\text{Features} \;\xrightarrow{w, b}\; \hat{y} \;\xrightarrow{}\; \text{Error} \;\xrightarrow{}\; J(\beta) \;\xrightarrow{\nabla J}\; \text{Update} \;\xrightarrow{}\; \text{Repeat}$$

Everything in basic machine learning is built on this pipeline.