Machine Learning Foundations - The Way It Finally Clicked
Weights, bias, cost functions, gradients, gradient descent, and the Normal Equation explained from first principles.
The Big Picture
A machine learning model is trying to learn a relationship between inputs and outputs.
House Size → House Price
Years of Experience → Salary
Study Hours → Exam Marks
The goal is to make predictions that are as close as possible to the actual values.
1. Weights
Definition
A weight determines how much an input affects the prediction.
If $w = 5000$, every year of experience adds ₹5,000 to the salary.
Intuition
Think of weight as importance or rate of change:
| Context | Weight means |
|---|---|
| Taxi | Cost per kilometer |
| Salary | Increase per year of experience |
| House price | Price per square foot |
2. Bias
Definition
Bias is a learnable starting value added to the prediction. For a single input $x$ with weight $w$ and bias $b$:
$$\hat{y} = wx + b$$
Why Do We Need It?
Real-world relationships rarely start at zero.
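Without bias:
$$\hat{y} = 5000x$$
At $x = 0$ years of experience, the predicted salary is ₹0, which is unrealistic.
With bias:
$$\hat{y} = 5000x + 30000$$
Now a fresh graduate ($x = 0$) is predicted to earn ₹30,000, which is much more realistic.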
Intuition
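Geometrically, the weight $w$ is the slope of the line and the bias $b$ is its intercept.
Or, in taxi terms: $w$ is the per-kilometer rate, $b$ is the base taxi fare.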
3. Prediction Function
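A linear model with $n$ features:
$$\hat{y} = b + w_1 x_1 + w_2 x_2 + \dots + w_n x_n$$
Where $b$ is the bias and $w_1, \dots, w_n$ are the weights. For example, with $n = 2$ features:
$$\hat{y} = b + w_1 x_1 + w_2 x_2$$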
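As a rough illustration of the prediction function, here is a minimal Python sketch. The function name `predict` and its argument order are made up for this example; the salary numbers reuse the ₹5,000 per year and ₹30,000 starting values from the bias section.

```python
def predict(features, weights, bias):
    """Return the linear prediction b + w_1*x_1 + ... + w_n*x_n."""
    if len(features) != len(weights):
        raise ValueError("features and weights must have the same length")
    return bias + sum(w * x for w, x in zip(weights, features))


# Salary model with one feature (years of experience):
# weight 5,000 per year, bias 30,000.
print(predict([0], [5000], 30000))  # fresh graduate: 30000
print(predict([4], [5000], 30000))  # four years of experience: 50000
```

With more features the call looks the same; only the two lists get longer.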
4. Error
After making a prediction $\hat{y}$, compare it with the actual value $y$:
$$\text{error} = y - \hat{y}$$
If the actual price is 300 and the model predicts 280, the error is 20.
5. Cost Function
Purpose
The cost function measures how bad the current weights and bias are. The model wants to minimise it.
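Mean Squared Error
$$J = \frac{1}{m} \sum_{i=1}^{m} \left(y^{(i)} - \hat{y}^{(i)}\right)^2$$
Here $m$ is the number of training examples, $y^{(i)}$ is the actual value for example $i$, and $\hat{y}^{(i)}$ is the model's prediction for it.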
Why Square the Errors?
Errors can be positive or negative. Squaring makes them all positive, so they don't cancel each other out. It also penalises large mistakes much more heavily than small ones: an error of 10 contributes $10^2 = 100$ to the sum, versus $2^2 = 4$ for an error of 2.
Why Divide by $m$?
Without averaging, a model trained on 100 examples would always look worse than one trained on 10, even if both make the same average mistake. Dividing by $m$, the number of examples, means the cost measures model quality, not dataset size.
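A small sketch of the mean squared error in Python may make the two points above concrete. It assumes the cost is the plain average of squared errors (actual minus predicted); the function name and the numbers are made up for illustration.

```python
def mean_squared_error(actual, predicted):
    """Average of squared errors (actual - predicted) over all examples."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    m = len(actual)
    return sum((y - y_hat) ** 2 for y, y_hat in zip(actual, predicted)) / m


# Made-up numbers for illustration only.
actual = [300, 200, 100]
predicted = [280, 210, 100]

# Errors are 20, -10 and 0, so the cost is (400 + 100 + 0) / 3.
print(mean_squared_error(actual, predicted))

# Repeating the same examples doubles the dataset but leaves the cost unchanged.
print(mean_squared_error(actual * 2, predicted * 2))
```

The error of 20 contributes four times as much as the error of 10, and the positive and negative errors do not cancel.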
6. Gradient
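The gradient tells us how the cost $J$ changes when we nudge the weight $w$ and the bias $b$. It collects the partial derivatives of the cost with respect to each parameter:
$$\nabla J = \left( \frac{\partial J}{\partial w}, \; \frac{\partial J}{\partial b} \right)$$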
Think of it as direction + steepness — it answers "which way should I move to reduce error?"
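To make this concrete, assume a one-feature model $\hat{y} = wx + b$ and a cost equal to the average of squared errors over $m$ examples. Under those assumptions the partial derivatives work out to $\partial J / \partial w = -\frac{2}{m} \sum (y - \hat{y})\,x$ and $\partial J / \partial b = -\frac{2}{m} \sum (y - \hat{y})$. The sketch below computes them and compares them with a finite-difference estimate; all function names and data are illustrative.

```python
def cost(w, b, xs, ys):
    """Mean squared error of the model y_hat = w*x + b."""
    m = len(xs)
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / m


def gradient(w, b, xs, ys):
    """Analytic partial derivatives (dJ/dw, dJ/db) of the MSE cost."""
    m = len(xs)
    errors = [y - (w * x + b) for x, y in zip(xs, ys)]
    dw = -2.0 / m * sum(e * x for e, x in zip(errors, xs))
    db = -2.0 / m * sum(errors)
    return dw, db


def numerical_gradient(w, b, xs, ys, eps=1e-6):
    """Central finite-difference estimate, used only as a sanity check."""
    dw = (cost(w + eps, b, xs, ys) - cost(w - eps, b, xs, ys)) / (2 * eps)
    db = (cost(w, b + eps, xs, ys) - cost(w, b - eps, xs, ys)) / (2 * eps)
    return dw, db


xs = [1.0, 2.0, 3.0]
ys = [3.0, 5.0, 7.0]  # made-up data lying exactly on y = 2x + 1

print(gradient(0.0, 0.0, xs, ys))            # both negative: increase w and b
print(numerical_gradient(0.0, 0.0, xs, ys))  # should agree closely
print(gradient(2.0, 1.0, xs, ys))            # zero at the perfect fit
```

Both components are negative at $w = 0$, $b = 0$, which says the cost falls if either parameter is increased. At the perfect fit the gradient vanishes.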
7. Gradient Descent
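Gradient descent uses the gradient to update weights iteratively:
$$w := w - \alpha \frac{\partial J}{\partial w}, \qquad b := b - \alpha \frac{\partial J}{\partial b}$$
Where $\alpha$ is the learning rate: how large a step we take on each update.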
Intuition: You're standing on a mountain and want to reach the valley. The gradient tells you which direction is uphill. Gradient descent tells you to walk the other way. Repeat until you're at the bottom.
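The walk down the mountain can be sketched on a toy one-parameter cost. The cost $J(w) = (w - 3)^2$, the starting point, and the learning rates below are all chosen purely for illustration.

```python
def gradient_descent(grad, start, learning_rate, steps):
    """Repeatedly step against the gradient: w := w - alpha * grad(w)."""
    w = start
    for _ in range(steps):
        w = w - learning_rate * grad(w)
    return w


# Toy cost J(w) = (w - 3)^2, whose derivative is 2 * (w - 3).
# Its minimum (the "valley") is at w = 3.
def toy_grad(w):
    return 2 * (w - 3)


w_final = gradient_descent(toy_grad, start=0.0, learning_rate=0.1, steps=100)
print(w_final)  # very close to 3

# A step size that is too large overshoots the valley further each time.
w_diverged = gradient_descent(toy_grad, start=0.0, learning_rate=1.1, steps=50)
print(w_diverged)  # far away from 3
```

For this particular toy cost, any learning rate above 1.0 makes each step overshoot by more than it corrects, so the iterates move away from the minimum instead of towards it.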
8. How Learning Actually Happens
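The model starts with random values for the weight $w$ and the bias $b$.
Training loop:
- Predict $\hat{y}$ for the training examples using the current $w$ and $b$.
- Compute the cost.
- Compute the gradient of the cost with respect to $w$ and $b$.
- Update $w$ and $b$ by a small step against the gradient.
- Repeat.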
Eventually the cost converges to a small value and the model has learned useful weights.
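Putting the pieces together, a training loop for a one-feature model might look like the sketch below. It assumes the mean squared error cost, a fixed number of passes, and a starting point of zero rather than random values (any start works for this cost); the learning rate, epoch count, and data are illustrative.

```python
def train(xs, ys, learning_rate=0.05, epochs=2000):
    """Fit y_hat = w*x + b by batch gradient descent on the MSE cost."""
    w, b = 0.0, 0.0  # assumed starting point; random values also work
    m = len(xs)
    for _ in range(epochs):
        # 1. Predict and measure the errors (actual - predicted).
        errors = [y - (w * x + b) for x, y in zip(xs, ys)]
        # 2. Gradient of the mean squared error with respect to w and b.
        dw = -2.0 / m * sum(e * x for e, x in zip(errors, xs))
        db = -2.0 / m * sum(errors)
        # 3. Step against the gradient.
        w -= learning_rate * dw
        b -= learning_rate * db
    return w, b


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # made-up data generated from y = 2x + 1
w, b = train(xs, ys)
print(w, b)  # approaches 2 and 1
```

A fixed epoch count keeps the sketch short. A common alternative is to stop once the cost changes by less than some small threshold between passes.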
9. Matrix Form of the Cost Function
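For a dataset of $m$ examples and $n$ features, predictions can be written as the matrix product $X\theta$, where $X$ is the design matrix (one row per example, with a leading column of ones so that the bias is included) and $\theta$ is the vector holding the bias and the weights. The cost function becomes:
$$J(\theta) = \frac{1}{m} (y - X\theta)^T (y - X\theta)$$
This is identical to the scalar form, just written in linear algebra notation. The vector $y - X\theta$ is the error vector, and $(y - X\theta)^T (y - X\theta)$ is the sum of its squared elements.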
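The matrix form can be checked with plain Python lists. This sketch assumes a cost of $\frac{1}{m}$ times the squared length of the error vector, and a design matrix whose first column is all ones so that the first parameter acts as the bias. The function names and data are illustrative.

```python
def predict_all(X, theta):
    """Matrix-vector product X theta: one prediction per row of X."""
    return [sum(x_ij * t_j for x_ij, t_j in zip(row, theta)) for row in X]


def cost_matrix_form(X, y, theta):
    """(1/m) * r^T r, where r = y - X theta is the error vector."""
    residual = [y_i - p_i for y_i, p_i in zip(y, predict_all(X, theta))]
    return sum(r * r for r in residual) / len(y)


# Made-up design matrix: the leading 1 in each row multiplies the bias,
# so theta = [bias, weight].
X = [[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
y = [3.0, 5.0, 7.0]

print(cost_matrix_form(X, y, [1.0, 2.0]))  # 0.0, since y = 2x + 1 fits exactly
print(cost_matrix_form(X, y, [0.0, 0.0]))  # (9 + 25 + 49) / 3
```

In practice a numerical library would do the matrix product, but the arithmetic is the same sum of squared errors as in the scalar form.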
10. Batch, Stochastic, and Mini-Batch Gradient Descent
| Variant | Data per update | Pros | Cons |
|---|---|---|---|
| Batch GD | Full dataset | Stable, accurate gradient | Slow |
| Stochastic GD (SGD) | 1 example | Fast | Noisy updates |
| Mini-Batch GD | $B$ examples (e.g. 32) | Fast + stable | Needs tuning of the batch size $B$ |
Mini-batch is the default in neural networks and deep learning.
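The three variants differ only in how many examples feed each update, which a small batching helper can illustrate. The generator below is a sketch: its name, the fixed seed, and the stand-in data are assumptions made for the example.

```python
import random


def minibatches(examples, batch_size, seed=0):
    """Yield shuffled batches; the last one may be smaller."""
    indices = list(range(len(examples)))
    random.Random(seed).shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [examples[i] for i in indices[start:start + batch_size]]


examples = list(range(10))  # stand-in for 10 training examples

# Mini-batch: three updates per pass, using 4, 4 and 2 examples.
print([len(batch) for batch in minibatches(examples, batch_size=4)])

# batch_size=1 behaves like stochastic GD: one update per example.
print(len(list(minibatches(examples, batch_size=1))))

# batch_size=len(examples) behaves like batch GD: one update per pass.
print(len(list(minibatches(examples, batch_size=len(examples)))))
```

Each batch would then go through the usual predict, gradient, update steps.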
11. Evaluation Metrics
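$R^2$ Score
Measures how much of the variation in the data the model explains:
$$R^2 = 1 - \frac{\sum_{i} \left(y^{(i)} - \hat{y}^{(i)}\right)^2}{\sum_{i} \left(y^{(i)} - \bar{y}\right)^2}$$
where $\bar{y}$ is the mean of the actual values.
- $R^2 = 1$: perfect fit
- $R^2 = 0$: model is no better than predicting the mean
- $R^2 < 0$: model is worse than predicting the mean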
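Adjusted $R^2$
Penalises adding features that don't improve the fit. With $m$ examples and $n$ features:
$$R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(m - 1)}{m - n - 1}$$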
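Both scores are short to compute by hand. The sketch below assumes the usual definitions: $R^2$ as one minus the ratio of the residual sum of squares to the total sum of squares, and the adjusted score with $m$ examples and $n$ features. Function names and data are made up for illustration.

```python
def r2_score(actual, predicted):
    """1 - SS_res / SS_tot, the fraction of variation explained."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_tot = sum((y - mean_y) ** 2 for y in actual)
    return 1 - ss_res / ss_tot


def adjusted_r2(r2, m, n):
    """Adjust R^2 for m examples and n features."""
    return 1 - (1 - r2) * (m - 1) / (m - n - 1)


actual = [1.0, 2.0, 3.0, 4.0]     # made-up values
predicted = [1.0, 2.0, 3.0, 5.0]  # one prediction is off by 1

r2 = r2_score(actual, predicted)
print(r2)                         # about 0.8
print(adjusted_r2(r2, m=4, n=1))  # about 0.7, lower than the raw score

# Predicting the mean for every example gives a score of zero.
print(r2_score(actual, [2.5] * 4))
```

The adjusted score is always at or below the raw score, and the gap widens as $n$ grows relative to $m$.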
RMSE
Root Mean Squared Error: the typical prediction error, in the same units as $y$:
$$\text{RMSE} = \sqrt{\frac{1}{m} \sum_{i=1}^{m} \left(y^{(i)} - \hat{y}^{(i)}\right)^2}$$
Penalises large errors heavily. Lower is better.
MAE
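Mean Absolute Error: the average size of the prediction error, with no extra penalty for large mistakes:
$$\text{MAE} = \frac{1}{m} \sum_{i=1}^{m} \left| y^{(i)} - \hat{y}^{(i)} \right|$$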
Lower is better.
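The difference between the two error metrics shows up most clearly on a pair of prediction sets with the same total absolute error. This is a minimal sketch with made-up numbers, assuming the standard definitions of RMSE and MAE.

```python
import math


def rmse(actual, predicted):
    """Square root of the mean squared error."""
    m = len(actual)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(actual, predicted)) / m)


def mae(actual, predicted):
    """Mean of the absolute errors."""
    m = len(actual)
    return sum(abs(y - p) for y, p in zip(actual, predicted)) / m


actual = [10.0, 10.0, 10.0, 10.0]        # made-up values
small_errors = [9.0, 11.0, 9.0, 11.0]    # four errors of size 1
one_outlier = [10.0, 10.0, 10.0, 14.0]   # one error of size 4

print(mae(actual, small_errors), rmse(actual, small_errors))  # 1.0 1.0
print(mae(actual, one_outlier), rmse(actual, one_outlier))    # 1.0 2.0
```

MAE is identical in both cases, while RMSE doubles when the same total error is concentrated in a single large mistake.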
12. Normal Equation
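Instead of iterating with gradient descent, we can solve directly for the optimal weights in closed form:
$$\theta = (X^T X)^{-1} X^T y$$
Here $X$ is the design matrix, $y$ is the vector of actual values, and $\theta$ is the vector of parameters. This gives the exact $\theta$ that minimises the cost function, with no iterations, no learning rate, and no gradient descent required.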
Pros: exact solution, simple.
Cons: requires computing $(X^T X)^{-1}$, which has $O(n^3)$ complexity in the number of features $n$. Impractical for large $n$.
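For a single feature plus a bias, the closed-form solution is small enough to write out by hand. The sketch below assumes a design matrix with rows $[1, x]$, so $X^T X$ is a 2×2 matrix that can be inverted directly. The function name and data are illustrative, and a real implementation would normally hand the general case to a linear algebra library.

```python
def normal_equation_1d(xs, ys):
    """Solve (X^T X) theta = X^T y for the model y_hat = b + w*x.

    X is the design matrix with rows [1, x], so X^T X is 2x2 and can be
    inverted by hand.
    """
    m = len(xs)
    sx = sum(xs)
    sxx = sum(x * x for x in xs)
    sy = sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))

    det = m * sxx - sx * sx  # determinant of X^T X
    if det == 0:
        raise ValueError("X^T X is singular; all x values are identical")

    # theta = (X^T X)^{-1} X^T y, written out for the 2x2 case.
    b = (sxx * sy - sx * sxy) / det
    w = (m * sxy - sx * sy) / det
    return b, w


xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]  # made-up data generated from y = 2x + 1
b, w = normal_equation_1d(xs, ys)
print(b, w)  # 1.0 2.0
```

The singular case in the sketch is a reminder that the formula needs $X^T X$ to be invertible, which fails when the inputs carry no variation.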
Final Mental Model
Data → Prediction → Error → Cost → Gradient → Update weights and bias → Repeat
Everything in basic machine learning is built on this pipeline.