Fundamentals of Classification
Goal: Predict whether a customer buys a computer.
| # | Age | Income | Student | Credit | Buys |
|---|---|---|---|---|---|
| 1 | Young | High | No | Excellent | No |
| 2 | Young | High | No | Excellent | No |
| 3 | Middle | High | No | Excellent | Yes |
| 4 | Senior | Medium | No | Excellent | Yes |
| 5 | Senior | Low | Yes | Excellent | Yes |
| 6 | Senior | Low | Yes | Fair | No |
| 7 | Middle | Low | Yes | Fair | Yes |
| 8 | Young | Medium | No | Fair | No |
| 9 | Young | Low | Yes | Excellent | Yes |
| 10 | Senior | Medium | Yes | Excellent | Yes |
| 11 | Young | Medium | Yes | Fair | Yes |
| 12 | Middle | Medium | No | Fair | Yes |
| 13 | Middle | High | Yes | Excellent | Yes |
| 14 | Senior | Medium | No | Fair | No |
Attributes to test: Age, Income, Student, Credit
Target: Buys (Yes / No) — 9 Yes, 5 No
ID3 Algorithm — How it works
Starting from the full dataset \(S\), the algorithm recursively builds the tree:
1. Compute the entropy \(H(S)\) of the current node.
2. Compute the information gain \(IG(S, A)\) of every remaining attribute \(A\).
3. Split on the attribute with the highest gain.
4. Recurse on each subset; a branch stops when its node is pure (entropy 0) or no attributes remain.
Key Formulas
Entropy — measures disorder: \[H(S) = -\sum_{i=1}^{c} p_i \log_2(p_i)\]
Information Gain — entropy reduction after split: \[IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)\]
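As a sketch, the two formulas translate directly into Python (the helper names `entropy` and `information_gain` are ours):

```python
from math import log2

def entropy(counts):
    """H(S) in bits, from a list of class counts, e.g. [9, 5]."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """IG(S, A): parent = class counts of S; children = one count list
    per value of A, e.g. [[2, 3], [4, 0], [3, 2]] for Age."""
    total = sum(parent)
    weighted = sum(sum(ch) / total * entropy(ch) for ch in children)
    return entropy(parent) - weighted
```

For the dataset above, `entropy([9, 5])` gives roughly 0.940 and `information_gain([9, 5], [[2, 3], [4, 0], [3, 2]])` roughly 0.246, matching the hand computations that follow.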
Full dataset: 14 records — 9 Yes, 5 No
\[p_{yes} = \frac{9}{14} \approx 0.643 \qquad p_{no} = \frac{5}{14} \approx 0.357\]
\[H(S) = -(0.643 \times \log_2 0.643 + 0.357 \times \log_2 0.357)\]
\[= -(0.643 \times (-0.637) + 0.357 \times (-1.485))\]
\[= -(-0.410 - 0.530)\]
\[\boxed{H(S) = 0.940}\]
This is our baseline. Every IG is computed relative to this value.
Test Age on the 14 records — subsets:
| Value | Records | Yes | No | \(H(S_v)\) |
|---|---|---|---|---|
| Young | 5 (1,2,8,9,11) | 2 | 3 | \(-(\frac{2}{5}\log_2\frac{2}{5}+\frac{3}{5}\log_2\frac{3}{5})=0.971\) |
| Middle | 4 (3,7,12,13) | 4 | 0 | \(0\) (pure) |
| Senior | 5 (4,5,6,10,14) | 3 | 2 | \(-(\frac{3}{5}\log_2\frac{3}{5}+\frac{2}{5}\log_2\frac{2}{5})=0.971\) |
Weighted entropy: \[H(S, Age) = \frac{5}{14}(0.971) + \frac{4}{14}(0) + \frac{5}{14}(0.971)\] \[= 0.347 + 0 + 0.347 = 0.694\]
\[\boxed{IG(Age) = 0.940 - 0.694 = \mathbf{0.246}}\]
Test Income on the 14 records:
| Value | Records | Yes | No | \(H(S_v)\) |
|---|---|---|---|---|
| High | 4 (1,2,3,13) | 2 | 2 | \(-(\frac{2}{4}\log_2\frac{2}{4}+\frac{2}{4}\log_2\frac{2}{4})=1.0\) |
| Medium | 6 (4,8,10,11,12,14) | 4 | 2 | \(-(\frac{4}{6}\log_2\frac{4}{6}+\frac{2}{6}\log_2\frac{2}{6})=0.918\) |
| Low | 4 (5,6,7,9) | 3 | 1 | \(-(\frac{3}{4}\log_2\frac{3}{4}+\frac{1}{4}\log_2\frac{1}{4})=0.811\) |
Weighted entropy: \[H(S, Income) = \frac{4}{14}(1.0) + \frac{6}{14}(0.918) + \frac{4}{14}(0.811)\] \[= 0.286 + 0.393 + 0.232 = 0.911\]
\[\boxed{IG(Income) = 0.940 - 0.911 = \mathbf{0.029}}\]
Test Student on the 14 records:
| Value | Records | Yes | No | \(H(S_v)\) |
|---|---|---|---|---|
| No | 7 (1,2,3,4,8,12,14) | 3 | 4 | \(-(\frac{3}{7}\log_2\frac{3}{7}+\frac{4}{7}\log_2\frac{4}{7})=0.985\) |
| Yes | 7 (5,6,7,9,10,11,13) | 6 | 1 | \(-(\frac{6}{7}\log_2\frac{6}{7}+\frac{1}{7}\log_2\frac{1}{7})=0.592\) |
Weighted entropy: \[H(S, Student) = \frac{7}{14}(0.985) + \frac{7}{14}(0.592)\] \[= 0.493 + 0.296 = 0.789\]
\[\boxed{IG(Student) = 0.940 - 0.789 = \mathbf{0.151}}\]
Test Credit on the 14 records:
| Value | Records | Yes | No | \(H(S_v)\) |
|---|---|---|---|---|
| Excellent | 8 (1,2,3,4,5,9,10,13) | 6 | 2 | \(-(\frac{6}{8}\log_2\frac{6}{8}+\frac{2}{8}\log_2\frac{2}{8})=0.811\) |
| Fair | 6 (6,7,8,11,12,14) | 3 | 3 | \(-(\frac{3}{6}\log_2\frac{3}{6}+\frac{3}{6}\log_2\frac{3}{6})=1.0\) |
Weighted entropy: \[H(S, Credit) = \frac{8}{14}(0.811) + \frac{6}{14}(1.0)\] \[= 0.463 + 0.429 = 0.892\]
\[\boxed{IG(Credit) = 0.940 - 0.892 = \mathbf{0.048}}\]
Summary of all Information Gains at the root:
| Attribute | \(IG\) | |
|---|---|---|
| Age | 0.246 | ✅ Maximum → Root node |
| Student | 0.151 | |
| Credit | 0.048 | |
| Income | 0.029 |
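The whole root-level comparison can be reproduced from the per-value class counts in the tables above; a minimal sketch, with the counts hard-coded as [Yes, No]:

```python
from math import log2

def H(counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

# [Yes, No] counts per attribute value, copied from the tables above
splits = {
    "Age":     [[2, 3], [4, 0], [3, 2]],
    "Income":  [[2, 2], [4, 2], [3, 1]],
    "Student": [[3, 4], [6, 1]],
    "Credit":  [[6, 2], [3, 3]],
}
root, n = [9, 5], 14
gains = {a: H(root) - sum(sum(s) / n * H(s) for s in subs)
         for a, subs in splits.items()}
for attr, g in sorted(gains.items(), key=lambda kv: -kv[1]):
    print(f"{attr:8s} IG = {g:.3f}")
```

Sorting the gains reproduces the ranking in the summary table; small differences in the third decimal come from rounding intermediate values by hand.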
graph TD
A["🌳 Age ← ROOT (IG = 0.246)"]
A -- Young --> B["? (5 records)"]
A -- Middle --> C["? (4 records)"]
A -- Senior --> D["? (5 records)"]
style A fill:#f9f,stroke:#333,stroke-width:3px
Now we recurse on each of the 3 branches independently.
Middle subset (Records 3, 7, 12, 13): 4 Yes / 0 No
| # | Age | Income | Student | Credit | Buys |
|---|---|---|---|---|---|
| 3 | Middle | High | No | Excellent | Yes |
| 7 | Middle | Low | Yes | Fair | Yes |
| 12 | Middle | Medium | No | Fair | Yes |
| 13 | Middle | High | Yes | Excellent | Yes |
All 4 records → Yes. \[H(S_{Middle}) = 0 \quad \text{(pure node)}\]
🍃 Leaf: YES — No further splitting needed.
Young subset (Records 1, 2, 8, 9, 11): 2 Yes / 3 No
\[H(S_{Young}) = -\!\left(\tfrac{2}{5}\log_2\tfrac{2}{5}+\tfrac{3}{5}\log_2\tfrac{3}{5}\right) = 0.971\]
Not pure → we must test the remaining attributes: Income, Student, Credit.
| # | Income | Student | Credit | Buys |
|---|---|---|---|---|
| 1 | High | No | Excellent | No |
| 2 | High | No | Excellent | No |
| 8 | Medium | No | Fair | No |
| 9 | Low | Yes | Excellent | Yes |
| 11 | Medium | Yes | Fair | Yes |
Test Income on the 5 Young records:
| Value | Records | Yes | No | \(H(S_v)\) |
|---|---|---|---|---|
| High | 1, 2 | 0 | 2 | \(0\) (pure No) |
| Medium | 8, 11 | 1 | 1 | \(1.0\) (max disorder) |
| Low | 9 | 1 | 0 | \(0\) (pure Yes) |
\[H(S_{Young}, Income) = \frac{2}{5}(0)+\frac{2}{5}(1.0)+\frac{1}{5}(0) = 0.400\]
\[\boxed{IG(Income \mid Young) = 0.971 - 0.400 = 0.571}\]
Test Student on the 5 Young records:
| Value | Records | Yes | No | \(H(S_v)\) |
|---|---|---|---|---|
| No | 1, 2, 8 | 0 | 3 | \(0\) (pure No) |
| Yes | 9, 11 | 2 | 0 | \(0\) (pure Yes) |
\[H(S_{Young}, Student) = \frac{3}{5}(0)+\frac{2}{5}(0) = 0\]
\[\boxed{IG(Student \mid Young) = 0.971 - 0 = \mathbf{0.971}}\]
→ Perfect split! Both subsets are pure.
Test Credit on the 5 Young records:
| Value | Records | Yes | No | \(H(S_v)\) |
|---|---|---|---|---|
| Excellent | 1, 2, 9 | 1 | 2 | \(-(\frac{1}{3}\log_2\frac{1}{3}+\frac{2}{3}\log_2\frac{2}{3})=0.918\) |
| Fair | 8, 11 | 1 | 1 | \(1.0\) (max disorder) |
\[H(S_{Young}, Credit) = \frac{3}{5}(0.918)+\frac{2}{5}(1.0) = 0.951\]
\[\boxed{IG(Credit \mid Young) = 0.971 - 0.951 = 0.020}\]
| Attribute | \(IG\) | |
|---|---|---|
| Student | 0.971 | ✅ Maximum |
| Income | 0.571 | |
| Credit | 0.020 |
→ Student splits the Young branch perfectly!
Student = No → Records 1, 2, 8 → all No → \(H=0\) → 🍃 Leaf: NO
Student = Yes → Records 9, 11 → all Yes → \(H=0\) → 🍃 Leaf: YES
graph TD
A["Age = Young"] --> B{Student?}
B -- No --> C(["🍃 NO (3/3)"])
B -- Yes --> D(["🍃 YES (2/2)"])
style C fill:#f99,stroke:#333
style D fill:#9f9,stroke:#333
Senior subset (Records 4, 5, 6, 10, 14): 3 Yes / 2 No
\[H(S_{Senior}) = -\!\left(\tfrac{3}{5}\log_2\tfrac{3}{5}+\tfrac{2}{5}\log_2\tfrac{2}{5}\right) = 0.971\]
Not pure → we test the remaining attributes: Income, Student, Credit.
| # | Income | Student | Credit | Buys |
|---|---|---|---|---|
| 4 | Medium | No | Excellent | Yes |
| 5 | Low | Yes | Excellent | Yes |
| 6 | Low | Yes | Fair | No |
| 10 | Medium | Yes | Excellent | Yes |
| 14 | Medium | No | Fair | No |
Test Income on the 5 Senior records:
| Value | Records | Yes | No | \(H(S_v)\) |
|---|---|---|---|---|
| Medium | 4, 10, 14 | 2 | 1 | \(-(\frac{2}{3}\log_2\frac{2}{3}+\frac{1}{3}\log_2\frac{1}{3})=0.918\) |
| Low | 5, 6 | 1 | 1 | \(1.0\) (max disorder) |
\[H(S_{Senior}, Income) = \frac{3}{5}(0.918)+\frac{2}{5}(1.0) = 0.951\]
\[\boxed{IG(Income \mid Senior) = 0.971 - 0.951 = 0.020}\]
Test Student on the 5 Senior records:
| Value | Records | Yes | No | \(H(S_v)\) |
|---|---|---|---|---|
| No | 4, 14 | 1 | 1 | \(1.0\) (max disorder) |
| Yes | 5, 6, 10 | 2 | 1 | \(-(\frac{2}{3}\log_2\frac{2}{3}+\frac{1}{3}\log_2\frac{1}{3})=0.918\) |
\[H(S_{Senior}, Student) = \frac{2}{5}(1.0)+\frac{3}{5}(0.918) = 0.951\]
\[\boxed{IG(Student \mid Senior) = 0.971 - 0.951 = 0.020}\]
Test Credit on the 5 Senior records:
| Value | Records | Yes | No | \(H(S_v)\) |
|---|---|---|---|---|
| Excellent | 4, 5, 10 | 3 | 0 | \(0\) (pure Yes) |
| Fair | 6, 14 | 0 | 2 | \(0\) (pure No) |
\[H(S_{Senior}, Credit) = \frac{3}{5}(0)+\frac{2}{5}(0) = 0\]
\[\boxed{IG(Credit \mid Senior) = 0.971 - 0 = \mathbf{0.971}}\]
→ Perfect split! Both subsets are pure.
| Attribute | \(IG\) | |
|---|---|---|
| Credit | 0.971 | ✅ Maximum |
| Income | 0.020 | |
| Student | 0.020 |
→ Credit splits the Senior branch perfectly!
Credit = Excellent → Records 4, 5, 10 → all Yes → \(H=0\) → 🍃 Leaf: YES
Credit = Fair → Records 6, 14 → all No → \(H=0\) → 🍃 Leaf: NO
graph TD
A["Age = Senior"] --> B{Credit?}
B -- Excellent --> C(["🍃 YES (3/3)"])
B -- Fair --> D(["🍃 NO (2/2)"])
style C fill:#9f9,stroke:#333
style D fill:#f99,stroke:#333
Assembling all branches:
graph TD
ROOT["🌳 Age?"]
ROOT -- Young --> SY{Student?}
ROOT -- Middle --> LM(["🍃 YES<br/>4/4"])
ROOT -- Senior --> SC{Credit?}
SY -- No --> LN(["🍃 NO<br/>3/3"])
SY -- Yes --> LY(["🍃 YES<br/>2/2"])
SC -- Excellent --> LE(["🍃 YES<br/>3/3"])
SC -- Fair --> LF(["🍃 NO<br/>2/2"])
style ROOT fill:#f9f,stroke:#333,stroke-width:3px
style LM fill:#9f9,stroke:#333,stroke-width:2px
style LY fill:#9f9,stroke:#333,stroke-width:2px
style LE fill:#9f9,stroke:#333,stroke-width:2px
style LN fill:#f99,stroke:#333,stroke-width:2px
style LF fill:#f99,stroke:#333,stroke-width:2px
All 14 records correctly classified:
| Branch | Path | Records | Prediction | Correct? |
|---|---|---|---|---|
| Middle | Age=Middle | 3,7,12,13 | YES | ✅ 4/4 |
| Young+Student=No | Age=Young, Student=No | 1,2,8 | NO | ✅ 3/3 |
| Young+Student=Yes | Age=Young, Student=Yes | 9,11 | YES | ✅ 2/2 |
| Senior+Credit=Excellent | Age=Senior, Credit=Excellent | 4,5,10 | YES | ✅ 3/3 |
| Senior+Credit=Fair | Age=Senior, Credit=Fair | 6,14 | NO | ✅ 2/2 |
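As a check, the finished tree can be encoded as nested dicts and replayed over any record; this is a sketch whose dict layout is our own:

```python
# Leaves are class labels; each internal node maps an attribute to its branches.
tree = {"Age": {"Young":  {"Student": {"No": "No", "Yes": "Yes"}},
                "Middle": "Yes",
                "Senior": {"Credit": {"Excellent": "Yes", "Fair": "No"}}}}

def predict(node, record):
    """Walk the tree until a leaf (a plain string) is reached."""
    while isinstance(node, dict):
        attribute = next(iter(node))   # the attribute tested at this node
        node = node[attribute][record[attribute]]
    return node

record9 = {"Age": "Young", "Income": "Low", "Student": "Yes", "Credit": "Excellent"}
print(predict(tree, record9))  # "Yes" — matches record 9 in the table
```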
Tip
Tree depth = 2: every path from the root tests at most two attributes. Three of the four attributes appear in the tree (Age, Student, Credit); Income, which had the lowest gain at the root, was never selected at any split.
Logistic Regression
Admission model: predict the probability of admission from an exam score.
Sigmoid Function:
\[P(y=1 \mid x) = \sigma(z) = \frac{1}{1 + e^{-z}}\]
where \(z = 0.05 \times \text{score} - 3\)
Questions:
1. What is the probability of admission for a score of 60?
2. What is it for a score of 80?
3. For which score is \(P = 0.5\)?
Why Logistic Regression?
The sigmoid maps any real value to \((0, 1)\), making it ideal for binary classification. The decision boundary is the score where \(\sigma(z) = 0.5\), i.e. \(z = 0\).
1. Score = 60: \[z = 0.05 \times 60 - 3 = 0\]
\[P(y=1 \mid x=60) = \frac{1}{1 + e^{0}} = \frac{1}{2} = \boxed{0.5}\]
→ Exactly on the decision boundary.
2. Score = 80: \[z = 0.05 \times 80 - 3 = 1\]
\[P(y=1 \mid x=80) = \frac{1}{1 + e^{-1}} = \frac{1}{1 + 0.368} \approx \boxed{0.731}\]
→ 73% chance of admission.
3. Score for \(P = 0.5\):
\(\sigma(z) = 0.5 \Leftrightarrow z = 0 \Leftrightarrow 0.05 \times \text{score} - 3 = 0\)
\[\text{score} = \frac{3}{0.05} = \boxed{60}\]
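A quick numeric check of the three answers, using the weights 0.05 and −3 given above:

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def admission_prob(score):
    z = 0.05 * score - 3          # the model defined above
    return sigmoid(z)

print(admission_prob(60))             # 0.5 — exactly on the decision boundary
print(round(admission_prob(80), 3))   # 0.731
print(round(3 / 0.05, 6))             # 60.0 — score where P = 0.5
```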
Points in 2D space: A(2, 3) Red, B(3, 4) Red, C(5, 6) Blue, D(5, 4) Blue, E(7, 8) Blue.
Target point: P(4, 5).
What is its class for \(k = 3\)?
k-NN Algorithm: compute the distance from the target to every labeled point, keep the \(k\) nearest, and take a majority vote among their classes.
Euclidean Distance:
\[d(P, Q) = \sqrt{(x_P - x_Q)^2 + (y_P - y_Q)^2}\]
Distances from P(4, 5):
\[d(P, A) = \sqrt{(4-2)^2 + (5-3)^2} = \sqrt{4+4} = \sqrt{8} \approx \boxed{2.828}\]
\[d(P, B) = \sqrt{(4-3)^2 + (5-4)^2} = \sqrt{1+1} = \sqrt{2} \approx \boxed{1.414}\]
\[d(P, C) = \sqrt{(4-5)^2 + (5-6)^2} = \sqrt{1+1} = \sqrt{2} \approx \boxed{1.414}\]
\[d(P, D) = \sqrt{(4-5)^2 + (5-4)^2} = \sqrt{1+1} = \sqrt{2} \approx \boxed{1.414}\]
\[d(P, E) = \sqrt{(4-7)^2 + (5-8)^2} = \sqrt{9+9} = \sqrt{18} \approx \boxed{4.243}\]
Sorted by ascending distance:
| Rank | Point | Distance | Class |
|---|---|---|---|
| 1 | B(3,4) | 1.414 | Red |
| 2 | C(5,6) | 1.414 | Blue |
| 3 | D(5,4) | 1.414 | Blue |
| 4 | A(2,3) | 2.828 | Red |
| 5 | E(7,8) | 4.243 | Blue |
For k = 3, neighbors are B, C, D.
Vote count among {B, C, D}: Red = 1 (B), Blue = 2 (C, D).
Result: 2 Blue vs 1 Red.
\[\boxed{\text{P is classified as Blue}}\]
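The whole procedure fits in a few lines of Python (point data taken from the table above; `math.dist` is the Euclidean distance, and the function name is ours):

```python
from collections import Counter
from math import dist

# (coordinates, class) for points A..E from the table
points = [((2, 3), "Red"), ((3, 4), "Red"), ((5, 6), "Blue"),
          ((5, 4), "Blue"), ((7, 8), "Blue")]

def knn_classify(target, data, k=3):
    # sort by Euclidean distance, keep the k nearest, majority vote
    nearest = sorted(data, key=lambda pc: dist(target, pc[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

print(knn_classify((4, 5), points))  # Blue
```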
Naive Bayes
Goal: Predict whether an email is Spam or Ham.
The email contains: “free” and “money”.
| # | “free” | “money” | Class |
|---|---|---|---|
| 1 | Yes | No | Spam |
| 2 | Yes | Yes | Spam |
| 3 | No | No | Ham |
| 4 | No | No | Ham |
| 5 | Yes | Yes | Spam |
Bayes’ Theorem:
\[P(C \mid X) = \frac{P(X \mid C) \cdot P(C)}{P(X)}\]
For Naive Bayes classification (independence assumption):
\[P(C \mid x_1, x_2, \dots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)\]
Terminology: \(P(C)\) is the prior, \(P(x_i \mid C)\) a likelihood, \(P(C \mid x_1, \dots, x_n)\) the posterior, and the evidence \(P(X)\) is a constant that can be dropped when comparing classes (hence the \(\propto\)).
Naïve Assumption
Features \(x_1, \dots, x_n\) are assumed conditionally independent given the class. This simplifies computation enormously, and works surprisingly well in practice (e.g., spam filtering).
Step 1 — Prior Probabilities
\[P(Spam) = \frac{3}{5} = 0.6 \qquad P(Ham) = \frac{2}{5} = 0.4\]
Step 2 — Likelihoods for {“free”, “money”}
For Spam (3 emails):
\[P(\text{free} \mid Spam) = \frac{3}{3} = 1.0 \qquad P(\text{money} \mid Spam) = \frac{2}{3} \approx 0.667\]
For Ham (2 emails):
\[P(\text{free} \mid Ham) = \frac{0}{2} = 0 \qquad P(\text{money} \mid Ham) = \frac{0}{2} = 0\]
Step 3 — Posterior Probabilities
\[P(Spam \mid \text{free, money}) \propto P(Spam) \times P(\text{free} \mid Spam) \times P(\text{money} \mid Spam)\]
\[= 0.6 \times 1.0 \times 0.667 = \boxed{0.4}\]
\[P(Ham \mid \text{free, money}) \propto P(Ham) \times P(\text{free} \mid Ham) \times P(\text{money} \mid Ham)\]
\[= 0.4 \times 0 \times 0 = \boxed{0}\]
Verdict: The email is classified as SPAM
\[\frac{P(Spam \mid X)}{P(Spam \mid X) + P(Ham \mid X)} = \frac{0.4}{0.4 + 0} = 100\%\]
📌 What does this formula measure?
This is a normalization step — since Naive Bayes computes scores that are only proportional to probabilities (\(\propto\)), we divide each class score by the sum of all scores to obtain a true probability between 0 and 1.
A single zero likelihood eliminates the class entirely. Here Ham scores 0 because neither “free” nor “money” ever appeared in a Ham email: one unseen word is enough to discard the whole class.
⚠️ Zero Probability Problem → Laplace Smoothing
To avoid this, add 1 to every count before computing likelihoods: \[P(x_i \mid C) = \frac{\text{count}(x_i, C) + 1}{\text{count}(C) + |V|}\] where \(|V|\) is the vocabulary size. This ensures no probability is ever exactly 0.
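Both computations can be sketched on the five training emails, with an optional Laplace term following the formula above; here the vocabulary is {“free”, “money”}, so \(|V| = 2\) (the function name and data encoding are ours):

```python
# Training set from the table: (set of words present, class)
emails = [({"free"}, "Spam"), ({"free", "money"}, "Spam"),
          (set(), "Ham"), (set(), "Ham"), ({"free", "money"}, "Spam")]
VOCAB_SIZE = 2  # |V|: {"free", "money"}

def nb_score(words, cls, alpha=0):
    """Unnormalized P(C) * prod P(w|C); alpha=1 applies Laplace smoothing."""
    docs = [w for w, c in emails if c == cls]
    score = len(docs) / len(emails)            # prior P(C)
    for w in words:
        count = sum(w in d for d in docs)      # count(w, C)
        score *= (count + alpha) / (len(docs) + alpha * VOCAB_SIZE)
    return score

print(round(nb_score({"free", "money"}, "Spam"), 3))  # 0.4
print(nb_score({"free", "money"}, "Ham"))             # 0.0 — the zero-probability problem
print(nb_score({"free", "money"}, "Ham", alpha=1))    # positive after smoothing
```

With `alpha=1` both classes get nonzero scores, and Spam still wins by a wide margin.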
Essential Formulas
Entropy: \(H(S) = -\sum_{i=1}^{c} p_i \log_2(p_i)\)
Information Gain: \(IG(S,A) = H(S) - \sum_{v} \frac{|S_v|}{|S|} H(S_v)\)
Sigmoid: \(\sigma(z) = \frac{1}{1 + e^{-z}}\)
Euclidean Distance: \(d(P,Q) = \sqrt{\sum_{i=1}^{n}(p_i - q_i)^2}\)
Bayes: \(P(C \mid X) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)\)
Thank you for your attention!
📧 Contact: abdallah.khemais@example.com
📚 Resources: GitHub/TD-Classification