Chitresh

Machine Learning Foundations - The Way It Finally Clicked

The Big Picture

A machine learning model is trying to learn a relationship between inputs and outputs.

House Size → House Price
Years of Experience → Salary
Study Hours → Exam Marks

The goal is to make predictions that are as close as possible to the actual values.

1. Weights

Definition

A weight determines how much an input affects the prediction.

$\text{Salary} = w \times \text{Experience}$

If $w = 5000$, every year of experience adds ₹5,000 to the salary.

Intuition

Think of weight as importance or rate of change:

Context	Weight means
Taxi	Cost per kilometer
Salary	Increase per year of experience
House price	Price per square foot

2. Bias

Definition

Bias is a learnable starting value added to the prediction:

$\hat{y} = w \cdot x + b$

Why Do We Need It?

Real-world relationships rarely start at zero.

Without bias:

$\text{Salary} = 5000 \times \text{Experience}$

At $\text{Experience} = 0$, the predicted salary is ₹0, which is unrealistic.

With bias:

$\text{Salary} = 5000 \times \text{Experience} + 30000$

Now a fresh graduate earns ₹30,000 — much more realistic.

Intuition

$w = \text{slope of the line} \qquad b = \text{intercept, i.e. the prediction at } x = 0$

Or: $w$ is the per-kilometer rate, $b$ is the base taxi fare.
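
A minimal sketch of the salary example, assuming the illustrative values $w = 5000$ and $b = 30000$ used above. The function name `predict_salary` is hypothetical and only serves to show how the bias shifts every prediction by a fixed amount.

```python
def predict_salary(experience_years, w=5000.0, b=30000.0):
    """Linear prediction: weight times input, plus bias."""
    return w * experience_years + b

# With zero experience the prediction equals the bias.
print(predict_salary(0))  # 30000.0
# Each extra year adds w to the prediction.
print(predict_salary(3))  # 45000.0
```

Setting `b=0.0` reproduces the "without bias" case, where zero experience predicts a salary of zero.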

3. Prediction Function

A linear model with $n$ features:

$\hat{y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_n x_n$

Where $\beta_0$ is the bias and $\beta_1, \ldots, \beta_n$ are the weights. For example:

$\text{House Price} = 50 + 0.2 \times \text{Size} + 10 \times \text{Bedrooms}$
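
The general formula can be sketched as a weighted sum plus a bias. The helper name `predict` and the input values (a size of 1000 and 3 bedrooms) are assumptions made up for illustration; only the coefficients 50, 0.2 and 10 come from the example above.

```python
def predict(features, weights, bias):
    """Compute y_hat = bias + sum(weight_j * feature_j)."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return bias + sum(w * x for w, x in zip(weights, features))

# House price example: 50 + 0.2 * Size + 10 * Bedrooms.
# Size = 1000 and Bedrooms = 3 are made-up inputs.
price = predict(features=[1000, 3], weights=[0.2, 10], bias=50)
print(price)  # about 280
```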

4. Error

After making a prediction:

$\text{Error} = y - \hat{y}$

If the actual price is 300 and the model predicts 280, the error is 20.

5. Cost Function

Purpose

The cost function measures how bad the current weights and bias are, as a single number. Training searches for the weights and bias that minimise it.

Mean Squared Error

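$J(\beta) = \frac{1}{2m} \sum_{i=1}^{m} \left(y^{(i)} - \hat{y}^{(i)}\right)^2$

Here $m$ is the number of training examples. The extra factor of $\frac{1}{2}$ is a convention: it cancels the 2 that appears when the square is differentiated, and it does not change which weights minimise the cost.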

Why Square the Errors?

Errors can be positive or negative. Squaring makes them all positive, so they don't cancel each other out. It also penalises large mistakes much more heavily than small ones: $10^2 = 100$ versus $2^2 = 4$.

Why Divide by $m$?

Without averaging, the summed cost grows with the number of examples, so a model evaluated on 100 examples would look worse than one evaluated on 10 even if both make the same average mistake. Dividing by $m$ makes the cost reflect model quality rather than dataset size.
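
The two points above (squaring stops errors cancelling, averaging removes the dependence on dataset size) can be checked with a small sketch of the cost function. The function name `cost` and all numbers are made up for illustration.

```python
def cost(y_true, y_pred):
    """Halved mean squared error: (1 / 2m) * sum of squared errors."""
    m = len(y_true)
    if m == 0 or m != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(y_true, y_pred)]
    return sum(squared_errors) / (2 * m)

# Errors of +2 and -2 would cancel if summed directly; squared, they do not.
small = cost([10, 10], [8, 12])
# The same per-example mistakes on a dataset 5x larger give the same cost.
large = cost([10, 10] * 5, [8, 12] * 5)
# A single error of 10 costs far more than an error of 2.
big_miss = cost([10], [0])
print(small, large, big_miss)  # 2.0 2.0 50.0
```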

6. Gradient

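The gradient tells us how the cost changes when we nudge each parameter. It collects one partial derivative per parameter:

$\nabla J = \left(\frac{\partial J}{\partial w}, \frac{\partial J}{\partial b}\right)$

Think of it as direction plus steepness: it points in the direction in which the cost rises fastest, so its negative answers "which way should I move to reduce error?"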
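
As a sketch, here are the partial derivatives for a single-feature model $\hat{y} = w x + b$ under the cost from section 5, namely $\frac{\partial J}{\partial w} = -\frac{1}{m}\sum (y - \hat{y})\,x$ and $\frac{\partial J}{\partial b} = -\frac{1}{m}\sum (y - \hat{y})$. The function names and the toy data are assumptions for illustration. The last lines compare the formula against a direct "nudge $w$ and see what happens" estimate.

```python
def cost(w, b, xs, ys):
    """Halved mean squared error for the model y_hat = w * x + b."""
    m = len(xs)
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / (2 * m)

def gradient(w, b, xs, ys):
    """Analytic partial derivatives of the cost with respect to w and b."""
    m = len(xs)
    errors = [y - (w * x + b) for x, y in zip(xs, ys)]
    dJ_dw = -sum(e * x for e, x in zip(errors, xs)) / m
    dJ_db = -sum(errors) / m
    return dJ_dw, dJ_db

# Toy data generated from y = 2x + 1 (made up for illustration).
xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]

dJ_dw, dJ_db = gradient(0.0, 0.0, xs, ys)

# Sanity check: nudge w slightly in both directions and measure the slope.
eps = 1e-6
numeric_dJ_dw = (cost(eps, 0.0, xs, ys) - cost(-eps, 0.0, xs, ys)) / (2 * eps)
print(dJ_dw, dJ_db, numeric_dJ_dw)
```

Both derivatives come out negative at $w = 0$, $b = 0$, which says that increasing either parameter lowers the cost from that starting point.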

7. Gradient Descent

Gradient descent uses the gradient to update the weights and bias iteratively:

$w \leftarrow w - \alpha \frac{\partial J}{\partial w} \qquad b \leftarrow b - \alpha \frac{\partial J}{\partial b}$

Where $\alpha$ is the learning rate: how large a step we take on each update.

Intuition: You're standing on a mountain and want to reach the valley. The gradient tells you which direction is uphill. Gradient descent tells you to walk the other way. Repeat until you're at the bottom.
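
To make the valley picture concrete, here is a minimal sketch on a hypothetical one-parameter cost $J(w) = (w - 3)^2$, whose lowest point is at $w = 3$. The cost, the starting point and both learning rates are assumptions chosen only to show the update rule at work.

```python
def gradient_descent(grad, w_start, alpha, steps):
    """Repeatedly apply w <- w - alpha * dJ/dw and return the final w."""
    w = w_start
    for _ in range(steps):
        w = w - alpha * grad(w)
    return w

# Derivative of the hypothetical bowl-shaped cost J(w) = (w - 3)^2.
def dJ_dw(w):
    return 2 * (w - 3)

# A modest learning rate walks downhill and settles near the minimum.
w_final = gradient_descent(dJ_dw, w_start=0.0, alpha=0.1, steps=100)
print(w_final)  # close to 3

# Too large a learning rate overshoots the valley on every step
# and ends up further away than where it started.
w_diverged = gradient_descent(dJ_dw, w_start=0.0, alpha=1.1, steps=100)
print(abs(w_diverged - 3))  # very large
```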

8. How Learning Actually Happens

The model starts with random weights, e.g. $w = 0.73$, $b = -1.42$.

Training loop:

$\text{Random weights} \;\to\; \hat{y} \;\to\; \text{Error} \;\to\; J(\beta) \;\to\; \nabla J \;\to\; \text{Update} \;\to\; \text{Repeat}$

With a suitable learning rate, the cost eventually settles at a small value and the model has learned useful weights.
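
The loop above can be sketched end to end for a single-feature model. This is one possible implementation under stated assumptions, not a reference one: the function name `train`, the learning rate, the number of passes, the random seed and the toy data are all made up for illustration.

```python
import random

def train(xs, ys, alpha=0.05, epochs=5000, seed=0):
    """Fit y_hat = w * x + b by batch gradient descent on the halved MSE."""
    rng = random.Random(seed)
    # Start from random parameters.
    w = rng.uniform(-1.0, 1.0)
    b = rng.uniform(-1.0, 1.0)
    m = len(xs)
    history = []
    for _ in range(epochs):
        # Predict, then measure the error on every example.
        preds = [w * x + b for x in xs]
        errors = [y - p for y, p in zip(ys, preds)]
        # Record the cost so convergence can be inspected afterwards.
        history.append(sum(e * e for e in errors) / (2 * m))
        # Gradient of the cost with respect to w and b.
        dJ_dw = -sum(e * x for e, x in zip(errors, xs)) / m
        dJ_db = -sum(errors) / m
        # Step against the gradient.
        w -= alpha * dJ_dw
        b -= alpha * dJ_db
    return w, b, history

# Toy data generated from y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
w, b, history = train(xs, ys)
print(w, b)                     # close to 2 and 1
print(history[0], history[-1])  # cost falls from its starting value to near zero
```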

9. Matrix Form of the Cost Function

For a dataset of $m$ examples and $n$ features, predictions can be written as the matrix product $X\beta$, where $X \in \mathbb{R}^{m \times (n+1)}$ is the design matrix: one row per example, with a leading column of ones so that $\beta_0$ acts as the bias. The cost function becomes:

$J(\beta) = \frac{1}{2m}(y - X\beta)^\top(y - X\beta)$

This is identical to the scalar form, just written in linear algebra notation. The vector $y - X\beta$ is the error vector, and $(y - X\beta)^\top(y - X\beta)$ is the sum of its squared elements.
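
A small sketch, using plain Python lists instead of a linear algebra library, to check that the matrix form and the scalar form give the same number. All helper names and data values are made up for illustration; the coefficients reuse the house price example from section 3.

```python
def design_matrix(rows):
    """Prepend a column of ones so beta[0] acts as the bias."""
    return [[1.0] + list(row) for row in rows]

def matvec(X, beta):
    """Matrix-vector product X @ beta, one prediction per row."""
    return [sum(x_ij * b_j for x_ij, b_j in zip(row, beta)) for row in X]

def cost_matrix(X, y, beta):
    """(1 / 2m) * (y - X beta)^T (y - X beta)."""
    residual = [y_i - p_i for y_i, p_i in zip(y, matvec(X, beta))]
    return sum(r * r for r in residual) / (2 * len(y))

def cost_scalar(rows, y, beta):
    """The same cost written as an explicit sum over examples."""
    total = 0.0
    for row, y_i in zip(rows, y):
        y_hat = beta[0] + sum(b * x for b, x in zip(beta[1:], row))
        total += (y_i - y_hat) ** 2
    return total / (2 * len(y))

rows = [[1000.0, 3.0], [1500.0, 4.0], [800.0, 2.0]]  # made-up size, bedrooms
y = [300.0, 400.0, 220.0]                            # made-up prices
beta = [50.0, 0.2, 10.0]                             # bias, then weights

X = design_matrix(rows)
j_matrix = cost_matrix(X, y, beta)
j_scalar = cost_scalar(rows, y, beta)
print(j_matrix, j_scalar)  # both about 100
```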

10. Batch, Stochastic, and Mini-Batch Gradient Descent

Variant	Data per update	Pros	Cons
Batch GD	Full dataset	Stable, accurate gradient	Slow
Stochastic GD (SGD)	1 example	Fast	Noisy updates
Mini-Batch GD	$k$ examples (e.g. 32)	Fast + stable	Needs tuning of $k$

Mini-batch gradient descent is the usual default for training neural networks.
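
The three variants differ only in how many examples feed each update. A minimal sketch, assuming a hypothetical helper `batches` that shuffles example indices and hands them out in chunks (the dataset size of 10 and batch size of 4 are made up):

```python
import random

def batches(n_examples, batch_size, rng):
    """Yield lists of shuffled example indices, batch_size at a time."""
    indices = list(range(n_examples))
    rng.shuffle(indices)
    for start in range(0, n_examples, batch_size):
        yield indices[start:start + batch_size]

rng = random.Random(0)
m = 10
full_batch = list(batches(m, batch_size=m, rng=rng))  # batch GD: 1 update per pass
stochastic = list(batches(m, batch_size=1, rng=rng))  # SGD: m updates per pass
mini_batch = list(batches(m, batch_size=4, rng=rng))  # mini-batch: 3 updates per pass

print(len(full_batch), len(stochastic), len(mini_batch))  # 1 10 3
```

In a training loop, each yielded index list would select the examples used to compute the gradient for one update. The last mini-batch is smaller when the batch size does not divide the dataset size evenly.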

11. Evaluation Metrics

$R^2$ Score

Measures the fraction of the variation in $y$ that the model explains:

$R^2 = 1 - \frac{\sum(y - \hat{y})^2}{\sum(y - \bar{y})^2}$

$R^2 = 1$ : perfect fit
$R^2 = 0$ : model is no better than predicting the mean
$R^2 < 0$ : model is worse than predicting the mean

Adjusted $R^2$

Penalises adding features that don't improve the fit:

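$\bar{R}^2 = 1 - (1 - R^2)\frac{m - 1}{m - n - 1}$

Here $m$ is the number of examples and $n$ is the number of features, so the formula requires $m > n + 1$.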

RMSE

Root Mean Squared Error is the typical prediction error, expressed in the same units as $y$:

$\text{RMSE} = \sqrt{\frac{1}{m}\sum_{i=1}^{m}(y^{(i)} - \hat{y}^{(i)})^2}$

Penalises large errors heavily. Lower is better.

MAE

Mean Absolute Error is the average size of the prediction error. Each mistake counts in proportion to its size, so large errors get no extra penalty:

$\text{MAE} = \frac{1}{m}\sum_{i=1}^{m}\left|y^{(i)} - \hat{y}^{(i)}\right|$

Lower is better.
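
The four metrics can be computed together from the same predictions. A minimal sketch, with a hypothetical function name and made-up values; it assumes the true values are not all identical (otherwise the $R^2$ denominator is zero) and that $m > n + 1$.

```python
import math

def regression_metrics(y_true, y_pred, n_features):
    """Return R^2, adjusted R^2, RMSE and MAE for a set of predictions."""
    m = len(y_true)
    mean_y = sum(y_true) / m
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)             # total sum of squares
    r2 = 1 - ss_res / ss_tot
    adj_r2 = 1 - (1 - r2) * (m - 1) / (m - n_features - 1)
    rmse = math.sqrt(ss_res / m)
    mae = sum(abs(y - p) for y, p in zip(y_true, y_pred)) / m
    return {"r2": r2, "adj_r2": adj_r2, "rmse": rmse, "mae": mae}

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]
metrics = regression_metrics(y_true, y_pred, n_features=1)
print(metrics)  # r2 about 0.9, adj_r2 about 0.85, rmse about 0.707, mae 0.5
```

Predicting the mean for every example gives $R^2 = 0$, matching the interpretation listed above.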

12. Normal Equation

Instead of iterating with gradient descent, we can solve directly for the optimal weights in closed form:

$\beta = (X^\top X)^{-1} X^\top y$

This gives the exact $\beta$ that minimises the cost function — no iterations, no learning rate, no gradient descent required.

Pros: exact solution, simple.

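Cons: requires inverting the $(n+1) \times (n+1)$ matrix $X^\top X$, which costs roughly $O(n^3)$ in the number of features and fails when $X^\top X$ is not invertible (for example, when two features are perfectly correlated). Impractical for large $n$.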
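
A minimal sketch of the closed-form solution in plain Python. Note the swap: instead of forming $(X^\top X)^{-1}$ explicitly, it solves the equivalent linear system $(X^\top X)\beta = X^\top y$ by Gaussian elimination, which yields the same $\beta$ when $X^\top X$ is invertible. The function name and the toy data are assumptions for illustration.

```python
def normal_equation(X, y):
    """Solve (X^T X) beta = X^T y by Gaussian elimination with pivoting."""
    n = len(X[0])
    # Build A = X^T X and c = X^T y.
    A = [[sum(row[i] * row[j] for row in X) for j in range(n)] for i in range(n)]
    c = [sum(row[i] * y_k for row, y_k in zip(X, y)) for i in range(n)]
    # Forward elimination on the system A beta = c.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("X^T X is singular; there is no unique solution")
        A[col], A[pivot] = A[pivot], A[col]
        c[col], c[pivot] = c[pivot], c[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for k in range(col, n):
                A[r][k] -= factor * A[col][k]
            c[r] -= factor * c[col]
    # Back substitution.
    beta = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(A[i][k] * beta[k] for k in range(i + 1, n))
        beta[i] = (c[i] - tail) / A[i][i]
    return beta

# Design matrix with a leading column of ones; data follow y = 2x + 1.
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0]]
y = [3.0, 5.0, 7.0, 9.0]
beta = normal_equation(X, y)
print(beta)  # approximately [1.0, 2.0]: bias 1, weight 2
```

No learning rate or iteration count appears anywhere, which is the practical appeal of the closed form when the number of features is small.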

Final Mental Model

$\text{Features} \;\xrightarrow{w, b}\; \hat{y} \;\xrightarrow{}\; \text{Error} \;\xrightarrow{}\; J(\beta) \;\xrightarrow{\nabla J}\; \text{Update} \;\xrightarrow{}\; \text{Repeat}$

Most of basic supervised machine learning is built on this pipeline.
