Detailed Correction - TD1

Fundamentals of Classification

Abdallah Khemais

Part 1 : Decision Trees

Exercise 1.1 : Problem Statement

Goal: Predict whether a customer buys a computer.

| #  | Age    | Income | Student | Credit    | Buys |
|----|--------|--------|---------|-----------|------|
| 1  | Young  | High   | No      | Excellent | No   |
| 2  | Young  | High   | No      | Excellent | No   |
| 3  | Middle | High   | No      | Excellent | Yes  |
| 4  | Senior | Medium | No      | Excellent | Yes  |
| 5  | Senior | Low    | Yes     | Excellent | Yes  |
| 6  | Senior | Low    | Yes     | Fair      | No   |
| 7  | Middle | Low    | Yes     | Fair      | Yes  |
| 8  | Young  | Medium | No      | Fair      | No   |
| 9  | Young  | Low    | Yes     | Excellent | Yes  |
| 10 | Senior | Medium | Yes     | Excellent | Yes  |
| 11 | Young  | Medium | Yes     | Fair      | Yes  |
| 12 | Middle | Medium | No      | Fair      | Yes  |
| 13 | Middle | High   | Yes     | Excellent | Yes  |
| 14 | Senior | Medium | No      | Fair      | No   |

Attributes to test: Age, Income, Student, Credit
Target: Buys (Yes / No) — 9 Yes, 5 No

1.1 : Algorithm & Definitions

ID3 Algorithm — How it works

Starting from the full dataset \(S\), the algorithm recursively builds the tree:

  1. Stop condition (leaf): All examples same class → leaf with that class. No attributes left → leaf with majority class.
  2. Attribute selection: Compute \(IG(S, A)\) for every remaining attribute \(A\), pick \(A^* = \arg\max_A\, IG(S, A)\).
  3. Branching: For each value \(v\) of \(A^*\), create a branch for \(S_v\) and recurse without \(A^*\).

Key Formulas

Entropy — measures disorder: \[H(S) = -\sum_{i=1}^{c} p_i \log_2(p_i)\]

Information Gain — entropy reduction after split: \[IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)\]
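Both formulas translate directly into a few lines of Python (the helper names `entropy` and `info_gain` are ours, not part of the exercise):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(S) of a list of class labels, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(attr, labels):
    """IG(S, A) = H(S) - sum over values v of |S_v|/|S| * H(S_v)."""
    n = len(labels)
    weighted = 0.0
    for v in set(attr):
        sub = [lab for a, lab in zip(attr, labels) if a == v]
        weighted += len(sub) / n * entropy(sub)
    return entropy(labels) - weighted

# Records 1..14 from the exercise
buys = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]
age  = ["Young", "Young", "Middle", "Senior", "Senior", "Senior", "Middle",
        "Young", "Young", "Senior", "Young", "Middle", "Middle", "Senior"]

print(round(entropy(buys), 3))       # 0.94
print(round(info_gain(age, buys), 3))  # 0.247 (0.246 when using rounded intermediates)
```

With these two helpers, every entropy and gain in the steps that follow can be reproduced.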

1.1 : Step 0 — Initial Entropy \(H(S)\)

Full dataset: 14 records — 9 Yes, 5 No

\[p_{yes} = \frac{9}{14} \approx 0.643 \qquad p_{no} = \frac{5}{14} \approx 0.357\]

\[H(S) = -(0.643 \times \log_2 0.643 + 0.357 \times \log_2 0.357)\]

\[= -(0.643 \times (-0.637) + 0.357 \times (-1.485))\]

\[= -(-0.410 - 0.530)\]

\[\boxed{H(S) = 0.940}\]

This is our baseline. Every IG is computed relative to this value.

1.1 : Step 1 — IG for Attribute Age

Subsets from the 14 records:

| Value  | Records            | Yes | No | \(H(S_v)\) |
|--------|--------------------|-----|----|------------|
| Young  | 5 (1, 2, 8, 9, 11) | 2   | 3  | \(-(\frac{2}{5}\log_2\frac{2}{5}+\frac{3}{5}\log_2\frac{3}{5})=0.971\) |
| Middle | 4 (3, 7, 12, 13)   | 4   | 0  | \(0\) (pure) |
| Senior | 5 (4, 5, 6, 10, 14)| 3   | 2  | \(-(\frac{3}{5}\log_2\frac{3}{5}+\frac{2}{5}\log_2\frac{2}{5})=0.971\) |

Weighted entropy: \[H(S, Age) = \frac{5}{14}(0.971) + \frac{4}{14}(0) + \frac{5}{14}(0.971)\] \[= 0.347 + 0 + 0.347 = 0.694\]

\[\boxed{IG(Age) = 0.940 - 0.694 = \mathbf{0.246}}\]

1.1 : Step 1 — IG for Attribute Income

| Value  | Records                  | Yes | No | \(H(S_v)\) |
|--------|--------------------------|-----|----|------------|
| High   | 4 (1, 2, 3, 13)          | 2   | 2  | \(-(\frac{2}{4}\log_2\frac{2}{4}+\frac{2}{4}\log_2\frac{2}{4})=1.0\) |
| Medium | 6 (4, 8, 10, 11, 12, 14) | 4   | 2  | \(-(\frac{4}{6}\log_2\frac{4}{6}+\frac{2}{6}\log_2\frac{2}{6})=0.918\) |
| Low    | 4 (5, 6, 7, 9)           | 3   | 1  | \(-(\frac{3}{4}\log_2\frac{3}{4}+\frac{1}{4}\log_2\frac{1}{4})=0.811\) |

Weighted entropy: \[H(S, Income) = \frac{4}{14}(1.0) + \frac{6}{14}(0.918) + \frac{4}{14}(0.811)\] \[= 0.286 + 0.393 + 0.232 = 0.911\]

\[\boxed{IG(Income) = 0.940 - 0.911 = \mathbf{0.029}}\]

1.1 : Step 1 — IG for Attribute Student

| Value | Records                  | Yes | No | \(H(S_v)\) |
|-------|--------------------------|-----|----|------------|
| No    | 7 (1, 2, 3, 4, 8, 12, 14)| 3   | 4  | \(-(\frac{3}{7}\log_2\frac{3}{7}+\frac{4}{7}\log_2\frac{4}{7})=0.985\) |
| Yes   | 7 (5, 6, 7, 9, 10, 11, 13)| 6  | 1  | \(-(\frac{6}{7}\log_2\frac{6}{7}+\frac{1}{7}\log_2\frac{1}{7})=0.592\) |

Weighted entropy: \[H(S, Student) = \frac{7}{14}(0.985) + \frac{7}{14}(0.592)\] \[= 0.493 + 0.296 = 0.789\]

\[\boxed{IG(Student) = 0.940 - 0.789 = \mathbf{0.151}}\]

1.1 : Step 1 — IG for Attribute Credit

| Value     | Records                     | Yes | No | \(H(S_v)\) |
|-----------|-----------------------------|-----|----|------------|
| Excellent | 8 (1, 2, 3, 4, 5, 9, 10, 13)| 6   | 2  | \(-(\frac{6}{8}\log_2\frac{6}{8}+\frac{2}{8}\log_2\frac{2}{8})=0.811\) |
| Fair      | 6 (6, 7, 8, 11, 12, 14)     | 3   | 3  | \(-(\frac{3}{6}\log_2\frac{3}{6}+\frac{3}{6}\log_2\frac{3}{6})=1.0\) |

Weighted entropy: \[H(S, Credit) = \frac{8}{14}(0.811) + \frac{6}{14}(1.0)\] \[= 0.463 + 0.429 = 0.892\]

\[\boxed{IG(Credit) = 0.940 - 0.892 = \mathbf{0.048}}\]

1.1 : Step 1 — Root Node Selection

Summary of all Information Gains at the root:

| Attribute | \(IG\) |                      |
|-----------|--------|----------------------|
| Age       | 0.246  | Maximum → Root node  |
| Student   | 0.151  |                      |
| Credit    | 0.048  |                      |
| Income    | 0.029  |                      |

graph TD
    A["🌳 Age  ← ROOT (IG = 0.246)"]
    A -- Young --> B["? (5 records)"]
    A -- Middle --> C["? (4 records)"]
    A -- Senior --> D["? (5 records)"]
    style A fill:#f9f,stroke:#333,stroke-width:3px

Now we recurse on each of the 3 branches independently.

1.1 : Step 2 — Branch Middle (Records 3, 7, 12, 13)

| #  | Age    | Income | Student | Credit    | Buys |
|----|--------|--------|---------|-----------|------|
| 3  | Middle | High   | No      | Excellent | Yes  |
| 7  | Middle | Low    | Yes     | Fair      | Yes  |
| 12 | Middle | Medium | No      | Fair      | Yes  |
| 13 | Middle | High   | Yes     | Excellent | Yes  |

All 4 records → Yes. \[H(S_{Middle}) = 0 \quad \text{(pure node)}\]

🍃 Leaf: YES — No further splitting needed.

1.1 : Step 2 — Branch Young — Entropy

Young subset (Records 1, 2, 8, 9, 11): 2 Yes / 3 No

\[H(S_{Young}) = -\!\left(\tfrac{2}{5}\log_2\tfrac{2}{5}+\tfrac{3}{5}\log_2\tfrac{3}{5}\right) = 0.971\]

Not pure → we must test the remaining attributes: Income, Student, Credit.

| #  | Income | Student | Credit    | Buys |
|----|--------|---------|-----------|------|
| 1  | High   | No      | Excellent | No   |
| 2  | High   | No      | Excellent | No   |
| 8  | Medium | No      | Fair      | No   |
| 9  | Low    | Yes     | Excellent | Yes  |
| 11 | Medium | Yes     | Fair      | Yes  |

1.1 : Step 2 — Branch Young — IG(Income)

Test Income on the 5 Young records:

| Value  | Records | Yes | No | \(H(S_v)\) |
|--------|---------|-----|----|------------|
| High   | 1, 2    | 0   | 2  | \(0\) (pure No) |
| Medium | 8, 11   | 1   | 1  | \(1.0\) (max disorder) |
| Low    | 9       | 1   | 0  | \(0\) (pure Yes) |

\[H(S_{Young}, Income) = \frac{2}{5}(0)+\frac{2}{5}(1.0)+\frac{1}{5}(0) = 0.400\]

\[\boxed{IG(Income \mid Young) = 0.971 - 0.400 = 0.571}\]

1.1 : Step 2 — Branch Young — IG(Student)

Test Student on the 5 Young records:

| Value | Records | Yes | No | \(H(S_v)\) |
|-------|---------|-----|----|------------|
| No    | 1, 2, 8 | 0   | 3  | \(0\) (pure No) |
| Yes   | 9, 11   | 2   | 0  | \(0\) (pure Yes) |

\[H(S_{Young}, Student) = \frac{3}{5}(0)+\frac{2}{5}(0) = 0\]

\[\boxed{IG(Student \mid Young) = 0.971 - 0 = \mathbf{0.971}}\]

→ Perfect split! Both subsets are pure.

1.1 : Step 2 — Branch Young — IG(Credit)

Test Credit on the 5 Young records:

| Value     | Records | Yes | No | \(H(S_v)\) |
|-----------|---------|-----|----|------------|
| Excellent | 1, 2, 9 | 1   | 2  | \(-(\frac{1}{3}\log_2\frac{1}{3}+\frac{2}{3}\log_2\frac{2}{3})=0.918\) |
| Fair      | 8, 11   | 1   | 1  | \(1.0\) (max disorder) |

\[H(S_{Young}, Credit) = \frac{3}{5}(0.918)+\frac{2}{5}(1.0) = 0.951\]

\[\boxed{IG(Credit \mid Young) = 0.971 - 0.951 = 0.020}\]

1.1 : Step 2 — Branch Young — Best Split

| Attribute | \(IG\) |         |
|-----------|--------|---------|
| Student   | 0.971  | Maximum |
| Income    | 0.571  |         |
| Credit    | 0.020  |         |

Student splits the Young branch perfectly!

Student = No → Records 1, 2, 8 → all No, \(H=0\) → 🍃 Leaf: NO

Student = Yes → Records 9, 11 → all Yes, \(H=0\) → 🍃 Leaf: YES

graph TD
    A["Age = Young"] --> B{Student?}
    B -- No --> C(["🍃 NO (3/3)"])
    B -- Yes --> D(["🍃 YES (2/2)"])
    style C fill:#f99,stroke:#333
    style D fill:#9f9,stroke:#333

1.1 : Step 2 — Branch Senior — Entropy

Senior subset (Records 4, 5, 6, 10, 14): 3 Yes / 2 No

\[H(S_{Senior}) = -\!\left(\tfrac{3}{5}\log_2\tfrac{3}{5}+\tfrac{2}{5}\log_2\tfrac{2}{5}\right) = 0.971\]

Not pure → we test the remaining attributes: Income, Student, Credit.

| #  | Income | Student | Credit    | Buys |
|----|--------|---------|-----------|------|
| 4  | Medium | No      | Excellent | Yes  |
| 5  | Low    | Yes     | Excellent | Yes  |
| 6  | Low    | Yes     | Fair      | No   |
| 10 | Medium | Yes     | Excellent | Yes  |
| 14 | Medium | No      | Fair      | No   |

1.1 : Step 2 — Branch Senior — IG(Income)

Test Income on the 5 Senior records:

| Value  | Records   | Yes | No | \(H(S_v)\) |
|--------|-----------|-----|----|------------|
| Medium | 4, 10, 14 | 2   | 1  | \(-(\frac{2}{3}\log_2\frac{2}{3}+\frac{1}{3}\log_2\frac{1}{3})=0.918\) |
| Low    | 5, 6      | 1   | 1  | \(1.0\) (max disorder) |

\[H(S_{Senior}, Income) = \frac{3}{5}(0.918)+\frac{2}{5}(1.0) = 0.951\]

\[\boxed{IG(Income \mid Senior) = 0.971 - 0.951 = 0.020}\]

1.1 : Step 2 — Branch Senior — IG(Student)

Test Student on the 5 Senior records:

| Value | Records  | Yes | No | \(H(S_v)\) |
|-------|----------|-----|----|------------|
| No    | 4, 14    | 1   | 1  | \(1.0\) (max disorder) |
| Yes   | 5, 6, 10 | 2   | 1  | \(-(\frac{2}{3}\log_2\frac{2}{3}+\frac{1}{3}\log_2\frac{1}{3})=0.918\) |

\[H(S_{Senior}, Student) = \frac{2}{5}(1.0)+\frac{3}{5}(0.918) = 0.951\]

\[\boxed{IG(Student \mid Senior) = 0.971 - 0.951 = 0.020}\]

1.1 : Step 2 — Branch Senior — IG(Credit)

Test Credit on the 5 Senior records:

| Value     | Records  | Yes | No | \(H(S_v)\) |
|-----------|----------|-----|----|------------|
| Excellent | 4, 5, 10 | 3   | 0  | \(0\) (pure Yes) |
| Fair      | 6, 14    | 0   | 2  | \(0\) (pure No) |

\[H(S_{Senior}, Credit) = \frac{3}{5}(0)+\frac{2}{5}(0) = 0\]

\[\boxed{IG(Credit \mid Senior) = 0.971 - 0 = \mathbf{0.971}}\]

→ Perfect split! Both subsets are pure.

1.1 : Step 2 — Branch Senior — Best Split

| Attribute | \(IG\) |         |
|-----------|--------|---------|
| Credit    | 0.971  | Maximum |
| Income    | 0.020  |         |
| Student   | 0.020  |         |

Credit splits the Senior branch perfectly!

Credit = Excellent → Records 4, 5, 10 → all Yes, \(H=0\) → 🍃 Leaf: YES

Credit = Fair → Records 6, 14 → all No, \(H=0\) → 🍃 Leaf: NO

graph TD
    A["Age = Senior"] --> B{Credit?}
    B -- Excellent --> C(["🍃 YES (3/3)"])
    B -- Fair --> D(["🍃 NO (2/2)"])
    style C fill:#9f9,stroke:#333
    style D fill:#f99,stroke:#333

1.1 : Final Decision Tree

Assembling all branches:

graph TD
    ROOT["🌳 Age?"]
    ROOT -- Young --> SY{Student?}
    ROOT -- Middle --> LM(["🍃 YES<br/>4/4"])
    ROOT -- Senior --> SC{Credit?}
    SY -- No --> LN(["🍃 NO<br/>3/3"])
    SY -- Yes --> LY(["🍃 YES<br/>2/2"])
    SC -- Excellent --> LE(["🍃 YES<br/>3/3"])
    SC -- Fair --> LF(["🍃 NO<br/>2/2"])
    style ROOT fill:#f9f,stroke:#333,stroke-width:3px
    style LM fill:#9f9,stroke:#333,stroke-width:2px
    style LY fill:#9f9,stroke:#333,stroke-width:2px
    style LE fill:#9f9,stroke:#333,stroke-width:2px
    style LN fill:#f99,stroke:#333,stroke-width:2px
    style LF fill:#f99,stroke:#333,stroke-width:2px

1.1 : Tree Summary & Verification

All 14 records correctly classified:

| Branch | Path | Records | Prediction | Correct? |
|--------|------|---------|------------|----------|
| Middle | Age=Middle | 3, 7, 12, 13 | YES | ✅ 4/4 |
| Young + Student=No | Age=Young, Student=No | 1, 2, 8 | NO | ✅ 3/3 |
| Young + Student=Yes | Age=Young, Student=Yes | 9, 11 | YES | ✅ 2/2 |
| Senior + Credit=Excellent | Age=Senior, Credit=Excellent | 4, 5, 10 | YES | ✅ 3/3 |
| Senior + Credit=Fair | Age=Senior, Credit=Fair | 6, 14 | NO | ✅ 2/2 |

Tip

Tree depth = 2. Only 3 of the 4 attributes appear in the tree (Age, Student, Credit). Income was never selected: at every node, some other attribute offered a higher information gain.
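As a cross-check, the whole construction above can be reproduced with a short recursive ID3 sketch in plain Python (function and variable names are ours; the data is the exercise's 14 records):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def id3(rows, labels, attrs):
    """Recursive ID3: returns a class label (leaf) or {attribute: {value: subtree}}."""
    if len(set(labels)) == 1:                      # pure node -> leaf
        return labels[0]
    if not attrs:                                  # no attributes left -> majority class
        return Counter(labels).most_common(1)[0][0]
    def gain(a):                                   # IG(S, a) on the current subset
        n = len(labels)
        return entropy(labels) - sum(
            len(sub) / n * entropy(sub)
            for v in set(r[a] for r in rows)
            for sub in [[lab for r, lab in zip(rows, labels) if r[a] == v]])
    best = max(attrs, key=gain)                    # attribute with maximal IG
    tree = {}
    for v in set(r[best] for r in rows):           # one branch per value, recurse
        keep = [i for i, r in enumerate(rows) if r[best] == v]
        tree[v] = id3([rows[i] for i in keep], [labels[i] for i in keep],
                      [a for a in attrs if a != best])
    return {best: tree}

data = [  # records 1..14: (Age, Income, Student, Credit, Buys)
    ("Young","High","No","Excellent","No"),    ("Young","High","No","Excellent","No"),
    ("Middle","High","No","Excellent","Yes"),  ("Senior","Medium","No","Excellent","Yes"),
    ("Senior","Low","Yes","Excellent","Yes"),  ("Senior","Low","Yes","Fair","No"),
    ("Middle","Low","Yes","Fair","Yes"),       ("Young","Medium","No","Fair","No"),
    ("Young","Low","Yes","Excellent","Yes"),   ("Senior","Medium","Yes","Excellent","Yes"),
    ("Young","Medium","Yes","Fair","Yes"),     ("Middle","Medium","No","Fair","Yes"),
    ("Middle","High","Yes","Excellent","Yes"), ("Senior","Medium","No","Fair","No"),
]
rows = [{"Age": a, "Income": i, "Student": s, "Credit": c} for a, i, s, c, _ in data]
labels = [d[-1] for d in data]

tree = id3(rows, labels, ["Age", "Income", "Student", "Credit"])
print(tree)  # nested dict: Age at the root, Student under Young, Credit under Senior
```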

Part 2 : Logistic Regression

Exercise 2.1 : Problem Statement

Admission model:

Sigmoid Function:

\[P(y=1 \mid x) = \sigma(z) = \frac{1}{1 + e^{-z}}\]

where \(z = 0.05 \times \text{score} - 3\)

Questions:

  1. Probability for score = 60?
  2. Probability for score = 80?
  3. Score required for \(P(y=1) = 0.5\)?

Why Logistic Regression?

The sigmoid maps any real value to \((0, 1)\), making it ideal for binary classification. The decision boundary is the score where \(\sigma(z) = 0.5\), i.e. \(z = 0\).

2.1 : Sigmoid Function — Visualization

2.1 : Step-by-step Calculations

1. Score = 60: \[z = 0.05 \times 60 - 3 = 0\]

\[P(y=1 \mid x=60) = \frac{1}{1 + e^{0}} = \frac{1}{2} = \boxed{0.5}\]

→ Exactly on the decision boundary.

2. Score = 80: \[z = 0.05 \times 80 - 3 = 1\]

\[P(y=1 \mid x=80) = \frac{1}{1 + e^{-1}} = \frac{1}{1 + 0.368} \approx \boxed{0.731}\]

→ 73% chance of admission.

3. Score for \(P = 0.5\):

\(\sigma(z) = 0.5 \Leftrightarrow z = 0 \Leftrightarrow 0.05 \times \text{score} - 3 = 0\)

\[\text{score} = \frac{3}{0.05} = \boxed{60}\]
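The three answers can be checked numerically in a few lines of Python (`p_admit` is our name for the exercise's model):

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z to (0, 1)."""
    return 1 / (1 + math.exp(-z))

def p_admit(score):
    """P(y=1 | score) with the exercise's model z = 0.05 * score - 3."""
    return sigmoid(0.05 * score - 3)

print(round(p_admit(60), 3))  # 0.5  (on the decision boundary)
print(round(p_admit(80), 3))  # 0.731
print(3 / 0.05)               # 60.0 -> score where P(y=1) = 0.5
```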

Part 3 : k-Nearest Neighbors (k-NN)

Exercise 3.1 : Problem Statement

Points in 2D space:

  • A(2, 3) [Red], B(3, 4) [Red]
  • C(5, 6) [Blue], D(5, 4) [Blue], E(7, 8) [Blue]

Target point: P(4, 5).

What is its class for \(k = 3\)?

k-NN Algorithm

  1. Compute the distance from P to every training point.
  2. Keep the k nearest neighbors.
  3. Assign the majority class among those neighbors.
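The three steps, applied to the exercise data, can be sketched in plain Python (note that three points tie at distance \(\sqrt{2}\), so the sort below breaks ties alphabetically by point name):

```python
from math import dist          # Euclidean distance (Python 3.8+)
from collections import Counter

points = {"A": ((2, 3), "Red"),  "B": ((3, 4), "Red"),
          "C": ((5, 6), "Blue"), "D": ((5, 4), "Blue"), "E": ((7, 8), "Blue")}
P = (4, 5)

# Step 1: distance from P to every training point
distances = sorted((dist(P, xy), name, cls) for name, (xy, cls) in points.items())
# Step 2: keep the k = 3 nearest neighbors
nearest = distances[:3]
# Step 3: majority vote among those neighbors
vote = Counter(cls for _, _, cls in nearest).most_common(1)[0][0]

print(nearest)  # B, C, D, each at distance sqrt(2) ~ 1.414
print(vote)     # Blue
```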

3.1 : Graphical Representation

3.1 : Formula & Calculations

Euclidean Distance:

\[d(P, Q) = \sqrt{(x_P - x_Q)^2 + (y_P - y_Q)^2}\]

Distances from P(4, 5):

\[d(P, A) = \sqrt{(4-2)^2 + (5-3)^2} = \sqrt{4+4} = \sqrt{8} \approx \boxed{2.828}\]

\[d(P, B) = \sqrt{(4-3)^2 + (5-4)^2} = \sqrt{1+1} = \sqrt{2} \approx \boxed{1.414}\]

\[d(P, C) = \sqrt{(4-5)^2 + (5-6)^2} = \sqrt{1+1} = \sqrt{2} \approx \boxed{1.414}\]

\[d(P, D) = \sqrt{(4-5)^2 + (5-4)^2} = \sqrt{1+1} = \sqrt{2} \approx \boxed{1.414}\]

\[d(P, E) = \sqrt{(4-7)^2 + (5-8)^2} = \sqrt{9+9} = \sqrt{18} \approx \boxed{4.243}\]

3.1 : Final Decision

Sorted by ascending distance:

| Rank | Point  | Distance | Class |
|------|--------|----------|-------|
| 1    | B(3,4) | 1.414    | Red   |
| 2    | C(5,6) | 1.414    | Blue  |
| 3    | D(5,4) | 1.414    | Blue  |
| 4    | A(2,3) | 2.828    | Red   |
| 5    | E(7,8) | 4.243    | Blue  |

Note: B, C, and D all lie at distance \(\sqrt{2} \approx 1.414\), so ranks 1 to 3 are interchangeable; in any case these three are the 3 nearest neighbors.

For k = 3, neighbors are B, C, D.

Vote count:

  • B → Red (1 vote)
  • C → Blue (1 vote)
  • D → Blue (1 vote)

Result: 2 Blue vs 1 Red.

\[\boxed{\text{P is classified as Blue}}\]

Part 4 : Naive Bayes

Exercise 4.1 : Problem Statement

Predict whether an email is Spam or Ham.

The email contains: “free” and “money”.

| Email | “free” | “money” | Class |
|-------|--------|---------|-------|
| 1     | Yes    | No      | Spam  |
| 2     | Yes    | Yes     | Spam  |
| 3     | No     | No      | Ham   |
| 4     | No     | No      | Ham   |
| 5     | Yes    | Yes     | Spam  |

4.1 : Formula & Key Terms

Bayes’ Theorem:

\[P(C \mid X) = \frac{P(X \mid C) \cdot P(C)}{P(X)}\]

For Naive Bayes classification (independence assumption):

\[P(C \mid x_1, x_2, \dots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)\]

Terminology:

  • \(P(C)\): Prior probability (global class frequency)
  • \(P(X \mid C)\): Likelihood (feature frequency per class)
  • \(P(C \mid X)\): Posterior probability (our prediction target)

Naïve Assumption

Features \(x_1, \dots, x_n\) are assumed conditionally independent given the class. This simplifies computation enormously, and works surprisingly well in practice (e.g., spam filtering).

4.1 : Step-by-step Calculations

Step 1 — Prior Probabilities

\[P(Spam) = \frac{3}{5} = 0.6 \qquad P(Ham) = \frac{2}{5} = 0.4\]

Step 2 — Likelihoods for {“free”, “money”}

For Spam (3 emails):

\[P(\text{free} \mid Spam) = \frac{3}{3} = 1.0 \qquad P(\text{money} \mid Spam) = \frac{2}{3} \approx 0.667\]

For Ham (2 emails):

\[P(\text{free} \mid Ham) = \frac{0}{2} = 0 \qquad P(\text{money} \mid Ham) = \frac{0}{2} = 0\]

4.1 : Final Computation

Step 3 — Posterior Probabilities

\[P(Spam \mid \text{free, money}) \propto P(Spam) \times P(\text{free} \mid Spam) \times P(\text{money} \mid Spam)\]

\[= 0.6 \times 1.0 \times 0.667 = \boxed{0.4}\]

\[P(Ham \mid \text{free, money}) \propto P(Ham) \times P(\text{free} \mid Ham) \times P(\text{money} \mid Ham)\]

\[= 0.4 \times 0 \times 0 = \boxed{0}\]

Verdict: The email is classified as SPAM

\[\frac{P(Spam \mid X)}{P(Spam \mid X) + P(Ham \mid X)} = \frac{0.4}{0.4 + 0} = 100\%\]

📌 What does this formula measure?

This is a normalization step — since Naive Bayes computes scores that are only proportional to probabilities (\(\propto\)), we divide each class score by the sum of all scores to obtain a true probability between 0 and 1.

  • \(P(Spam \mid X) \propto 0.4\) → score for Spam
  • \(P(Ham \mid X) \propto 0\) → score for Ham, wiped out by a zero likelihood

A single zero likelihood eliminates the class entirely. Here Ham scores 0 because neither “free” nor “money” ever appeared in Ham emails → one missing word is enough to discard the whole class.

⚠️ Zero Probability Problem → Laplace Smoothing

To avoid this, add 1 to every count before computing likelihoods: \[P(x_i \mid C) = \frac{\text{count}(x_i, C) + 1}{\text{count}(C) + |V|}\] where \(|V|\) is the vocabulary size. This ensures no probability is ever exactly 0.
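A sketch of the effect of smoothing on this exercise. Since each feature here is binary (word present or absent), the denominator below uses 2 outcomes per feature rather than a vocabulary size \(|V|\): a Bernoulli-style variant of the formula above, chosen for illustration.

```python
# Training set from the exercise: (free, money, class)
emails = [(1, 0, "Spam"), (1, 1, "Spam"), (0, 0, "Ham"), (0, 0, "Ham"), (1, 1, "Spam")]

def scores(free, money, alpha=0):
    """Unnormalized P(C) * P(free|C) * P(money|C); alpha=1 gives add-one smoothing.
    Each binary feature has 2 outcomes, hence '2 * alpha' in the denominator."""
    out = {}
    for c in ("Spam", "Ham"):
        rows = [e for e in emails if e[2] == c]
        n = len(rows)
        p_free  = (sum(e[0] for e in rows) + alpha) / (n + 2 * alpha)
        p_money = (sum(e[1] for e in rows) + alpha) / (n + 2 * alpha)
        p_f = p_free if free else 1 - p_free
        p_m = p_money if money else 1 - p_money
        out[c] = (n / len(emails)) * p_f * p_m
    return out

raw = scores(1, 1)           # no smoothing: Spam ~ 0.4, Ham exactly 0
sm  = scores(1, 1, alpha=1)  # smoothed: Ham is small but no longer zero
total = sum(sm.values())
print(raw)
print({c: round(v / total, 3) for c, v in sm.items()})  # normalized posteriors
```

With smoothing, the verdict is still Spam, but Ham keeps a small nonzero posterior instead of being wiped out by a single unseen word.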

Lab Session : Python Experiments

💻 k-NN with scikit-learn
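A possible snippet for this slide, reproducing Exercise 3.1 (assumes scikit-learn is installed):

```python
from sklearn.neighbors import KNeighborsClassifier

X = [[2, 3], [3, 4], [5, 6], [5, 4], [7, 8]]   # points A, B, C, D, E
y = ["Red", "Red", "Blue", "Blue", "Blue"]

knn = KNeighborsClassifier(n_neighbors=3)       # k = 3, Euclidean metric by default
knn.fit(X, y)

print(knn.predict([[4, 5]]))                    # ['Blue']
d, idx = knn.kneighbors([[4, 5]])               # the 3 nearest (B, C, D, order may vary)
print(d.round(3), idx)                          # all three at distance ~1.414
```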

💻 Logistic Regression
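A possible snippet for this slide. The admission scores below are made-up training data (not from Exercise 2.1), so the fitted coefficients will not match \(z = 0.05 \times \text{score} - 3\) exactly; the point is the scikit-learn workflow:

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical data: admission flips around a score of ~60
scores = [[30], [40], [45], [50], [55], [58], [62], [65], [70], [80], [90]]
admit  = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

model = LogisticRegression(max_iter=1000)
model.fit(scores, admit)

# predict_proba returns [P(y=0), P(y=1)] per row
print(model.predict_proba([[80]])[0, 1])   # high probability of admission
print(model.predict([[40], [80]]))         # rejected at 40, admitted at 80
```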

👋 Conclusion

  • Decision Trees — Maximize information gain at each split
    • Criterion: Entropy & Gain (ID3 algorithm)
    • Strength: Interpretable, visual, no feature scaling needed
  • Logistic Regression — Simple probabilistic baseline
    • Sigmoid: \(\sigma(z) = \frac{1}{1+e^{-z}}\)
    • Strength: Calibrated probabilities, fast training
  • k-NN — Geometric, instance-based learning
    • Always normalize features before use
    • Sensitive to the choice of \(k\) and to outliers
  • Naive Bayes — Probabilistic classifier, great for text
    • Independence assumption (often violated, yet effective)
    • Very fast; use Laplace smoothing to handle zero likelihoods

🎯 Key Takeaways

Essential Formulas

  1. Entropy: \(H(S) = -\sum_{i=1}^{c} p_i \log_2(p_i)\)

  2. Information Gain: \(IG(S,A) = H(S) - \sum_{v} \frac{|S_v|}{|S|} H(S_v)\)

  3. Sigmoid: \(\sigma(z) = \frac{1}{1 + e^{-z}}\)

  4. Euclidean Distance: \(d(P,Q) = \sqrt{\sum_{i=1}^{n}(p_i - q_i)^2}\)

  5. Bayes: \(P(C \mid X) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)\)

❓ Questions?

Thank you for your attention!

📧 Contact: abdallah.khemais@example.com

📚 Resources: GitHub/TD-Classification