Detailed Correction - TD1

Fundamentals of Classification

Abdallah Khemais

Part 1 : Decision Trees

Exercise 1.1 : Problem Statement

Goal: Predict whether a customer buys a computer.

| #  | Age    | Income | Student | Credit    | Buys |
|----|--------|--------|---------|-----------|------|
| 1  | Young  | High   | No      | Excellent | No   |
| 2  | Young  | High   | No      | Excellent | No   |
| 3  | Middle | High   | No      | Excellent | Yes  |
| 4  | Senior | Medium | No      | Excellent | Yes  |
| 5  | Senior | Low    | Yes     | Excellent | Yes  |
| 6  | Senior | Low    | Yes     | Fair      | No   |
| 7  | Middle | Low    | Yes     | Fair      | Yes  |
| 8  | Young  | Medium | No      | Fair      | No   |
| 9  | Young  | Low    | Yes     | Excellent | Yes  |
| 10 | Senior | Medium | Yes     | Excellent | Yes  |
| 11 | Young  | Medium | Yes     | Fair      | Yes  |
| 12 | Middle | Medium | No      | Fair      | Yes  |
| 13 | Middle | High   | Yes     | Excellent | Yes  |
| 14 | Senior | Medium | No      | Fair      | No   |

Attributes to test: Age, Income, Student, Credit
Target: Buys (Yes / No) — 9 Yes, 5 No

1.1 : Algorithm & Definitions

ID3 Algorithm — How it works

Starting from the full dataset \(S\), the algorithm recursively builds the tree:

  1. Stop condition (leaf): All examples same class → leaf with that class. No attributes left → leaf with majority class.
  2. Attribute selection: Compute \(IG(S, A)\) for every remaining attribute \(A\), pick \(A^* = \arg\max_A\, IG(S, A)\).
  3. Branching: For each value \(v\) of \(A^*\), create a branch for \(S_v\) and recurse without \(A^*\).

Key Formulas

Entropy — measures disorder: \[H(S) = -\sum_{i=1}^{c} p_i \log_2(p_i)\]

Information Gain — entropy reduction after split: \[IG(S, A) = H(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} H(S_v)\]
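Both formulas translate directly into a few lines of Python (the helper names `entropy` and `info_gain` are ours, not part of the exercise):

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy H(S) of a list of class labels, in bits."""
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(attr, labels):
    """IG(S, A) = H(S) - sum over values v of |S_v|/|S| * H(S_v)."""
    n = len(labels)
    weighted = 0.0
    for v in set(attr):
        sub = [lab for a, lab in zip(attr, labels) if a == v]
        weighted += len(sub) / n * entropy(sub)
    return entropy(labels) - weighted

# Records 1..14 from the exercise
buys = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
        "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]
age  = ["Young", "Young", "Middle", "Senior", "Senior", "Senior", "Middle",
        "Young", "Young", "Senior", "Young", "Middle", "Middle", "Senior"]

print(round(entropy(buys), 3))       # 0.94
print(round(info_gain(age, buys), 3))  # 0.247 (0.246 when using rounded intermediates)
```

With these two helpers, every entropy and gain in the steps that follow can be reproduced.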

1.1 : Step 0 — Initial Entropy \(H(S)\)

Full dataset: 14 records — 9 Yes, 5 No

\[p_{yes} = \frac{9}{14} \approx 0.643 \qquad p_{no} = \frac{5}{14} \approx 0.357\]

\[H(S) = -(0.643 \times \log_2 0.643 + 0.357 \times \log_2 0.357)\]

\[= -(0.643 \times (-0.637) + 0.357 \times (-1.485))\]

\[= -(-0.410 - 0.530)\]

\[\boxed{H(S) = 0.940}\]

This is our baseline. Every IG is computed relative to this value.

1.1 : Step 1 — IG for Attribute Age

Subsets from the 14 records:

| Value  | Records            | Yes | No | \(H(S_v)\) |
|--------|--------------------|-----|----|------------|
| Young  | 5 (1, 2, 8, 9, 11) | 2   | 3  | \(-(\frac{2}{5}\log_2\frac{2}{5}+\frac{3}{5}\log_2\frac{3}{5})=0.971\) |
| Middle | 4 (3, 7, 12, 13)   | 4   | 0  | \(0\) (pure) |
| Senior | 5 (4, 5, 6, 10, 14)| 3   | 2  | \(-(\frac{3}{5}\log_2\frac{3}{5}+\frac{2}{5}\log_2\frac{2}{5})=0.971\) |

Weighted entropy: \[H(S, Age) = \frac{5}{14}(0.971) + \frac{4}{14}(0) + \frac{5}{14}(0.971)\] \[= 0.347 + 0 + 0.347 = 0.694\]

\[\boxed{IG(Age) = 0.940 - 0.694 = \mathbf{0.246}}\]

1.1 : Step 1 — IG for Attribute Income

| Value  | Records                  | Yes | No | \(H(S_v)\) |
|--------|--------------------------|-----|----|------------|
| High   | 4 (1, 2, 3, 13)          | 2   | 2  | \(-(\frac{2}{4}\log_2\frac{2}{4}+\frac{2}{4}\log_2\frac{2}{4})=1.0\) |
| Medium | 6 (4, 8, 10, 11, 12, 14) | 4   | 2  | \(-(\frac{4}{6}\log_2\frac{4}{6}+\frac{2}{6}\log_2\frac{2}{6})=0.918\) |
| Low    | 4 (5, 6, 7, 9)           | 3   | 1  | \(-(\frac{3}{4}\log_2\frac{3}{4}+\frac{1}{4}\log_2\frac{1}{4})=0.811\) |

Weighted entropy: \[H(S, Income) = \frac{4}{14}(1.0) + \frac{6}{14}(0.918) + \frac{4}{14}(0.811)\] \[= 0.286 + 0.393 + 0.232 = 0.911\]

\[\boxed{IG(Income) = 0.940 - 0.911 = \mathbf{0.029}}\]

1.1 : Step 1 — IG for Attribute Student

| Value | Records                  | Yes | No | \(H(S_v)\) |
|-------|--------------------------|-----|----|------------|
| No    | 7 (1, 2, 3, 4, 8, 12, 14)| 3   | 4  | \(-(\frac{3}{7}\log_2\frac{3}{7}+\frac{4}{7}\log_2\frac{4}{7})=0.985\) |
| Yes   | 7 (5, 6, 7, 9, 10, 11, 13)| 6  | 1  | \(-(\frac{6}{7}\log_2\frac{6}{7}+\frac{1}{7}\log_2\frac{1}{7})=0.592\) |

Weighted entropy: \[H(S, Student) = \frac{7}{14}(0.985) + \frac{7}{14}(0.592)\] \[= 0.493 + 0.296 = 0.789\]

\[\boxed{IG(Student) = 0.940 - 0.789 = \mathbf{0.151}}\]

1.1 : Step 1 — IG for Attribute Credit

| Value     | Records                     | Yes | No | \(H(S_v)\) |
|-----------|-----------------------------|-----|----|------------|
| Excellent | 8 (1, 2, 3, 4, 5, 9, 10, 13)| 6   | 2  | \(-(\frac{6}{8}\log_2\frac{6}{8}+\frac{2}{8}\log_2\frac{2}{8})=0.811\) |
| Fair      | 6 (6, 7, 8, 11, 12, 14)     | 3   | 3  | \(-(\frac{3}{6}\log_2\frac{3}{6}+\frac{3}{6}\log_2\frac{3}{6})=1.0\) |

Weighted entropy: \[H(S, Credit) = \frac{8}{14}(0.811) + \frac{6}{14}(1.0)\] \[= 0.463 + 0.429 = 0.892\]

\[\boxed{IG(Credit) = 0.940 - 0.892 = \mathbf{0.048}}\]

1.1 : Step 1 — Root Node Selection

Summary of all Information Gains at the root:

| Attribute | \(IG\) |                      |
|-----------|--------|----------------------|
| Age       | 0.246  | Maximum → Root node  |
| Student   | 0.151  |                      |
| Credit    | 0.048  |                      |
| Income    | 0.029  |                      |

graph TD
    A["🌳 Age  ← ROOT (IG = 0.246)"]
    A -- Young --> B["? (5 records)"]
    A -- Middle --> C["? (4 records)"]
    A -- Senior --> D["? (5 records)"]
    style A fill:#f9f,stroke:#333,stroke-width:3px

Now we recurse on each of the 3 branches independently.

1.1 : Step 2 — Branch Middle (Records 3, 7, 12, 13)

| #  | Age    | Income | Student | Credit    | Buys |
|----|--------|--------|---------|-----------|------|
| 3  | Middle | High   | No      | Excellent | Yes  |
| 7  | Middle | Low    | Yes     | Fair      | Yes  |
| 12 | Middle | Medium | No      | Fair      | Yes  |
| 13 | Middle | High   | Yes     | Excellent | Yes  |

All 4 records → Yes. \[H(S_{Middle}) = 0 \quad \text{(pure node)}\]

🍃 Leaf: YES — No further splitting needed.

1.1 : Step 2 — Branch Young — Entropy

Young subset (Records 1, 2, 8, 9, 11): 2 Yes / 3 No

\[H(S_{Young}) = -\!\left(\tfrac{2}{5}\log_2\tfrac{2}{5}+\tfrac{3}{5}\log_2\tfrac{3}{5}\right) = 0.971\]

Not pure → we must test the remaining attributes: Income, Student, Credit.

| #  | Income | Student | Credit    | Buys |
|----|--------|---------|-----------|------|
| 1  | High   | No      | Excellent | No   |
| 2  | High   | No      | Excellent | No   |
| 8  | Medium | No      | Fair      | No   |
| 9  | Low    | Yes     | Excellent | Yes  |
| 11 | Medium | Yes     | Fair      | Yes  |

1.1 : Step 2 — Branch Young — IG(Income)

Test Income on the 5 Young records:

| Value  | Records | Yes | No | \(H(S_v)\) |
|--------|---------|-----|----|------------|
| High   | 1, 2    | 0   | 2  | \(0\) (pure No) |
| Medium | 8, 11   | 1   | 1  | \(1.0\) (max disorder) |
| Low    | 9       | 1   | 0  | \(0\) (pure Yes) |

\[H(S_{Young}, Income) = \frac{2}{5}(0)+\frac{2}{5}(1.0)+\frac{1}{5}(0) = 0.400\]

\[\boxed{IG(Income \mid Young) = 0.971 - 0.400 = 0.571}\]

1.1 : Step 2 — Branch Young — IG(Student)

Test Student on the 5 Young records:

| Value | Records | Yes | No | \(H(S_v)\) |
|-------|---------|-----|----|------------|
| No    | 1, 2, 8 | 0   | 3  | \(0\) (pure No) |
| Yes   | 9, 11   | 2   | 0  | \(0\) (pure Yes) |

\[H(S_{Young}, Student) = \frac{3}{5}(0)+\frac{2}{5}(0) = 0\]

\[\boxed{IG(Student \mid Young) = 0.971 - 0 = \mathbf{0.971}}\]

→ Perfect split! Both subsets are pure.

1.1 : Step 2 — Branch Young — IG(Credit)

Test Credit on the 5 Young records:

| Value     | Records | Yes | No | \(H(S_v)\) |
|-----------|---------|-----|----|------------|
| Excellent | 1, 2, 9 | 1   | 2  | \(-(\frac{1}{3}\log_2\frac{1}{3}+\frac{2}{3}\log_2\frac{2}{3})=0.918\) |
| Fair      | 8, 11   | 1   | 1  | \(1.0\) (max disorder) |

\[H(S_{Young}, Credit) = \frac{3}{5}(0.918)+\frac{2}{5}(1.0) = 0.951\]

\[\boxed{IG(Credit \mid Young) = 0.971 - 0.951 = 0.020}\]

1.1 : Step 2 — Branch Young — Best Split

| Attribute | \(IG\) |         |
|-----------|--------|---------|
| Student   | 0.971  | Maximum |
| Income    | 0.571  |         |
| Credit    | 0.020  |         |

Student splits the Young branch perfectly!

Student = No → Records 1, 2, 8 → all No, \(H=0\) → 🍃 Leaf: NO

Student = Yes → Records 9, 11 → all Yes, \(H=0\) → 🍃 Leaf: YES

graph TD
    A["Age = Young"] --> B{Student?}
    B -- No --> C(["🍃 NO (3/3)"])
    B -- Yes --> D(["🍃 YES (2/2)"])
    style C fill:#f99,stroke:#333
    style D fill:#9f9,stroke:#333

1.1 : Step 2 — Branch Senior — Entropy

Senior subset (Records 4, 5, 6, 10, 14): 3 Yes / 2 No

\[H(S_{Senior}) = -\!\left(\tfrac{3}{5}\log_2\tfrac{3}{5}+\tfrac{2}{5}\log_2\tfrac{2}{5}\right) = 0.971\]

Not pure → we test the remaining attributes: Income, Student, Credit.

| #  | Income | Student | Credit    | Buys |
|----|--------|---------|-----------|------|
| 4  | Medium | No      | Excellent | Yes  |
| 5  | Low    | Yes     | Excellent | Yes  |
| 6  | Low    | Yes     | Fair      | No   |
| 10 | Medium | Yes     | Excellent | Yes  |
| 14 | Medium | No      | Fair      | No   |

1.1 : Step 2 — Branch Senior — IG(Income)

Test Income on the 5 Senior records:

| Value  | Records   | Yes | No | \(H(S_v)\) |
|--------|-----------|-----|----|------------|
| Medium | 4, 10, 14 | 2   | 1  | \(-(\frac{2}{3}\log_2\frac{2}{3}+\frac{1}{3}\log_2\frac{1}{3})=0.918\) |
| Low    | 5, 6      | 1   | 1  | \(1.0\) (max disorder) |

\[H(S_{Senior}, Income) = \frac{3}{5}(0.918)+\frac{2}{5}(1.0) = 0.951\]

\[\boxed{IG(Income \mid Senior) = 0.971 - 0.951 = 0.020}\]

1.1 : Step 2 — Branch Senior — IG(Student)

Test Student on the 5 Senior records:

| Value | Records  | Yes | No | \(H(S_v)\) |
|-------|----------|-----|----|------------|
| No    | 4, 14    | 1   | 1  | \(1.0\) (max disorder) |
| Yes   | 5, 6, 10 | 2   | 1  | \(-(\frac{2}{3}\log_2\frac{2}{3}+\frac{1}{3}\log_2\frac{1}{3})=0.918\) |

\[H(S_{Senior}, Student) = \frac{2}{5}(1.0)+\frac{3}{5}(0.918) = 0.951\]

\[\boxed{IG(Student \mid Senior) = 0.971 - 0.951 = 0.020}\]

1.1 : Step 2 — Branch Senior — IG(Credit)

Test Credit on the 5 Senior records:

| Value     | Records  | Yes | No | \(H(S_v)\) |
|-----------|----------|-----|----|------------|
| Excellent | 4, 5, 10 | 3   | 0  | \(0\) (pure Yes) |
| Fair      | 6, 14    | 0   | 2  | \(0\) (pure No) |

\[H(S_{Senior}, Credit) = \frac{3}{5}(0)+\frac{2}{5}(0) = 0\]

\[\boxed{IG(Credit \mid Senior) = 0.971 - 0 = \mathbf{0.971}}\]

→ Perfect split! Both subsets are pure.

1.1 : Step 2 — Branch Senior — Best Split

| Attribute | \(IG\) |         |
|-----------|--------|---------|
| Credit    | 0.971  | Maximum |
| Income    | 0.020  |         |
| Student   | 0.020  |         |

Credit splits the Senior branch perfectly!

Credit = Excellent → Records 4, 5, 10 → all Yes, \(H=0\) → 🍃 Leaf: YES

Credit = Fair → Records 6, 14 → all No, \(H=0\) → 🍃 Leaf: NO

graph TD
    A["Age = Senior"] --> B{Credit?}
    B -- Excellent --> C(["🍃 YES (3/3)"])
    B -- Fair --> D(["🍃 NO (2/2)"])
    style C fill:#9f9,stroke:#333
    style D fill:#f99,stroke:#333

1.1 : Final Decision Tree

Assembling all branches:

graph TD
    ROOT["🌳 Age?"]
    ROOT -- Young --> SY{Student?}
    ROOT -- Middle --> LM(["🍃 YES<br/>4/4"])
    ROOT -- Senior --> SC{Credit?}
    SY -- No --> LN(["🍃 NO<br/>3/3"])
    SY -- Yes --> LY(["🍃 YES<br/>2/2"])
    SC -- Excellent --> LE(["🍃 YES<br/>3/3"])
    SC -- Fair --> LF(["🍃 NO<br/>2/2"])
    style ROOT fill:#f9f,stroke:#333,stroke-width:3px
    style LM fill:#9f9,stroke:#333,stroke-width:2px
    style LY fill:#9f9,stroke:#333,stroke-width:2px
    style LE fill:#9f9,stroke:#333,stroke-width:2px
    style LN fill:#f99,stroke:#333,stroke-width:2px
    style LF fill:#f99,stroke:#333,stroke-width:2px

1.1 : Tree Summary & Verification

All 14 records correctly classified:

| Branch | Path | Records | Prediction | Correct? |
|--------|------|---------|------------|----------|
| Middle | Age=Middle | 3, 7, 12, 13 | YES | ✅ 4/4 |
| Young + Student=No | Age=Young, Student=No | 1, 2, 8 | NO | ✅ 3/3 |
| Young + Student=Yes | Age=Young, Student=Yes | 9, 11 | YES | ✅ 2/2 |
| Senior + Credit=Excellent | Age=Senior, Credit=Excellent | 4, 5, 10 | YES | ✅ 3/3 |
| Senior + Credit=Fair | Age=Senior, Credit=Fair | 6, 14 | NO | ✅ 2/2 |

Tip

Tree depth = 2. Only 3 of the 4 attributes appear in the tree (Age, Student, Credit). Income was never selected: at every node, some other attribute offered a higher information gain.
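As a cross-check, the whole construction above can be reproduced with a short recursive ID3 sketch in plain Python (function and variable names are ours; the data is the exercise's 14 records):

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def id3(rows, labels, attrs):
    """Recursive ID3: returns a class label (leaf) or {attribute: {value: subtree}}."""
    if len(set(labels)) == 1:                      # pure node -> leaf
        return labels[0]
    if not attrs:                                  # no attributes left -> majority class
        return Counter(labels).most_common(1)[0][0]
    def gain(a):                                   # IG(S, a) on the current subset
        n = len(labels)
        return entropy(labels) - sum(
            len(sub) / n * entropy(sub)
            for v in set(r[a] for r in rows)
            for sub in [[lab for r, lab in zip(rows, labels) if r[a] == v]])
    best = max(attrs, key=gain)                    # attribute with maximal IG
    tree = {}
    for v in set(r[best] for r in rows):           # one branch per value, recurse
        keep = [i for i, r in enumerate(rows) if r[best] == v]
        tree[v] = id3([rows[i] for i in keep], [labels[i] for i in keep],
                      [a for a in attrs if a != best])
    return {best: tree}

data = [  # records 1..14: (Age, Income, Student, Credit, Buys)
    ("Young","High","No","Excellent","No"),    ("Young","High","No","Excellent","No"),
    ("Middle","High","No","Excellent","Yes"),  ("Senior","Medium","No","Excellent","Yes"),
    ("Senior","Low","Yes","Excellent","Yes"),  ("Senior","Low","Yes","Fair","No"),
    ("Middle","Low","Yes","Fair","Yes"),       ("Young","Medium","No","Fair","No"),
    ("Young","Low","Yes","Excellent","Yes"),   ("Senior","Medium","Yes","Excellent","Yes"),
    ("Young","Medium","Yes","Fair","Yes"),     ("Middle","Medium","No","Fair","Yes"),
    ("Middle","High","Yes","Excellent","Yes"), ("Senior","Medium","No","Fair","No"),
]
rows = [{"Age": a, "Income": i, "Student": s, "Credit": c} for a, i, s, c, _ in data]
labels = [d[-1] for d in data]

tree = id3(rows, labels, ["Age", "Income", "Student", "Credit"])
print(tree)  # nested dict: Age at the root, Student under Young, Credit under Senior
```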

Part 2 : Logistic Regression

Exercise 2.1 : Problem Statement

Admission model:

Sigmoid Function:

\[P(y=1 \mid x) = \sigma(z) = \frac{1}{1 + e^{-z}}\]

where \(z = 0.05 \times \text{score} - 3\)

Questions:

  1. Probability for score = 60?
  2. Probability for score = 80?
  3. Score required for \(P(y=1) = 0.5\)?

Why Logistic Regression?

The sigmoid maps any real value to \((0, 1)\), making it ideal for binary classification. The decision boundary is the score where \(\sigma(z) = 0.5\), i.e. \(z = 0\).

2.1 : Sigmoid Function — Visualization

2.1 : Step-by-step Calculations

1. Score = 60: \[z = 0.05 \times 60 - 3 = 0\]

\[P(y=1 \mid x=60) = \frac{1}{1 + e^{0}} = \frac{1}{2} = \boxed{0.5}\]

→ Exactly on the decision boundary.

2. Score = 80: \[z = 0.05 \times 80 - 3 = 1\]

\[P(y=1 \mid x=80) = \frac{1}{1 + e^{-1}} = \frac{1}{1 + 0.368} \approx \boxed{0.731}\]

→ 73% chance of admission.

3. Score for \(P = 0.5\):

\(\sigma(z) = 0.5 \Leftrightarrow z = 0 \Leftrightarrow 0.05 \times \text{score} - 3 = 0\)

\[\text{score} = \frac{3}{0.05} = \boxed{60}\]
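The three answers can be checked numerically in a few lines of Python (`p_admit` is our name for the exercise's model):

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z to (0, 1)."""
    return 1 / (1 + math.exp(-z))

def p_admit(score):
    """P(y=1 | score) with the exercise's model z = 0.05 * score - 3."""
    return sigmoid(0.05 * score - 3)

print(round(p_admit(60), 3))  # 0.5  (on the decision boundary)
print(round(p_admit(80), 3))  # 0.731
print(3 / 0.05)               # 60.0 -> score where P(y=1) = 0.5
```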

Part 3 : k-Nearest Neighbors (k-NN)

Exercise 3.1 : Problem Statement

Points in 2D space:

  • A(2, 3) [Red], B(3, 4) [Red]
  • C(5, 6) [Blue], D(5, 4) [Blue], E(7, 8) [Blue]

Target point: P(4, 5).

What is its class for \(k = 3\)?

k-NN Algorithm

  1. Compute the distance from P to every training point.
  2. Keep the k nearest neighbors.
  3. Assign the majority class among those neighbors.
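The three steps, applied to the exercise data, can be sketched in plain Python (note that three points tie at distance \(\sqrt{2}\), so the sort below breaks ties alphabetically by point name):

```python
from math import dist          # Euclidean distance (Python 3.8+)
from collections import Counter

points = {"A": ((2, 3), "Red"),  "B": ((3, 4), "Red"),
          "C": ((5, 6), "Blue"), "D": ((5, 4), "Blue"), "E": ((7, 8), "Blue")}
P = (4, 5)

# Step 1: distance from P to every training point
distances = sorted((dist(P, xy), name, cls) for name, (xy, cls) in points.items())
# Step 2: keep the k = 3 nearest neighbors
nearest = distances[:3]
# Step 3: majority vote among those neighbors
vote = Counter(cls for _, _, cls in nearest).most_common(1)[0][0]

print(nearest)  # B, C, D, each at distance sqrt(2) ~ 1.414
print(vote)     # Blue
```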

3.1 : Graphical Representation

3.1 : Formula & Calculations

Euclidean Distance:

\[d(P, Q) = \sqrt{(x_P - x_Q)^2 + (y_P - y_Q)^2}\]

Distances from P(4, 5):

\[d(P, A) = \sqrt{(4-2)^2 + (5-3)^2} = \sqrt{4+4} = \sqrt{8} \approx \boxed{2.828}\]

\[d(P, B) = \sqrt{(4-3)^2 + (5-4)^2} = \sqrt{1+1} = \sqrt{2} \approx \boxed{1.414}\]

\[d(P, C) = \sqrt{(4-5)^2 + (5-6)^2} = \sqrt{1+1} = \sqrt{2} \approx \boxed{1.414}\]

\[d(P, D) = \sqrt{(4-5)^2 + (5-4)^2} = \sqrt{1+1} = \sqrt{2} \approx \boxed{1.414}\]

\[d(P, E) = \sqrt{(4-7)^2 + (5-8)^2} = \sqrt{9+9} = \sqrt{18} \approx \boxed{4.243}\]

3.1 : Final Decision

Sorted by ascending distance:

| Rank | Point  | Distance | Class |
|------|--------|----------|-------|
| 1    | B(3,4) | 1.414    | Red   |
| 2    | C(5,6) | 1.414    | Blue  |
| 3    | D(5,4) | 1.414    | Blue  |
| 4    | A(2,3) | 2.828    | Red   |
| 5    | E(7,8) | 4.243    | Blue  |

Note: B, C, and D all lie at distance \(\sqrt{2} \approx 1.414\), so ranks 1 to 3 are interchangeable; in any case these three are the 3 nearest neighbors.

For k = 3, neighbors are B, C, D.

Vote count:

  • B → Red (1 vote)
  • C → Blue (1 vote)
  • D → Blue (1 vote)

Result: 2 Blue vs 1 Red.

\[\boxed{\text{P is classified as Blue}}\]

Part 4 : Naive Bayes

Exercise 4.1 : Problem Statement

Predict whether an email is Spam or Ham.

The email contains: “free” and “money”.

| Email | “free” | “money” | Class |
|-------|--------|---------|-------|
| 1     | Yes    | No      | Spam  |
| 2     | Yes    | Yes     | Spam  |
| 3     | No     | No      | Ham   |
| 4     | No     | No      | Ham   |
| 5     | Yes    | Yes     | Spam  |

4.1 : Formula & Key Terms

Bayes’ Theorem:

\[P(C \mid X) = \frac{P(X \mid C) \cdot P(C)}{P(X)}\]

For Naive Bayes classification (independence assumption):

\[P(C \mid x_1, x_2, \dots, x_n) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)\]

Terminology:

  • \(P(C)\): Prior probability (global class frequency)
  • \(P(X \mid C)\): Likelihood (feature frequency per class)
  • \(P(C \mid X)\): Posterior probability (our prediction target)

Naïve Assumption

Features \(x_1, \dots, x_n\) are assumed conditionally independent given the class. This simplifies computation enormously, and works surprisingly well in practice (e.g., spam filtering).

4.1 : Step-by-step Calculations

Step 1 — Prior Probabilities

\[P(Spam) = \frac{3}{5} = 0.6 \qquad P(Ham) = \frac{2}{5} = 0.4\]

Step 2 — Likelihoods for {“free”, “money”}

For Spam (3 emails):

\[P(\text{free} \mid Spam) = \frac{3}{3} = 1.0 \qquad P(\text{money} \mid Spam) = \frac{2}{3} \approx 0.667\]

For Ham (2 emails):

\[P(\text{free} \mid Ham) = \frac{0}{2} = 0 \qquad P(\text{money} \mid Ham) = \frac{0}{2} = 0\]

4.1 : Final Computation

Step 3 — Posterior Probabilities

\[P(Spam \mid \text{free, money}) \propto P(Spam) \times P(\text{free} \mid Spam) \times P(\text{money} \mid Spam)\]

\[= 0.6 \times 1.0 \times 0.667 = \boxed{0.4}\]

\[P(Ham \mid \text{free, money}) \propto P(Ham) \times P(\text{free} \mid Ham) \times P(\text{money} \mid Ham)\]

\[= 0.4 \times 0 \times 0 = \boxed{0}\]

Verdict: The email is classified as SPAM

\[\frac{P(Spam \mid X)}{P(Spam \mid X) + P(Ham \mid X)} = \frac{0.4}{0.4 + 0} = 100\%\]

📌 What does this formula measure?

This is a normalization step — since Naive Bayes computes scores that are only proportional to probabilities (\(\propto\)), we divide each class score by the sum of all scores to obtain a true probability between 0 and 1.

  • \(P(Spam \mid X) \propto 0.4\) → score for Spam
  • \(P(Ham \mid X) \propto 0\) → score for Ham, wiped out by a zero likelihood

A single zero likelihood eliminates the class entirely. Here Ham scores 0 because neither “free” nor “money” ever appeared in Ham emails → one missing word is enough to discard the whole class.

⚠️ Zero Probability Problem → Laplace Smoothing

To avoid this, add 1 to every count before computing likelihoods: \[P(x_i \mid C) = \frac{\text{count}(x_i, C) + 1}{\text{count}(C) + |V|}\] where \(|V|\) is the vocabulary size. This ensures no probability is ever exactly 0.
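A sketch of the effect of smoothing on this exercise. Since each feature here is binary (word present or absent), the denominator below uses 2 outcomes per feature rather than a vocabulary size \(|V|\): a Bernoulli-style variant of the formula above, chosen for illustration.

```python
# Training set from the exercise: (free, money, class)
emails = [(1, 0, "Spam"), (1, 1, "Spam"), (0, 0, "Ham"), (0, 0, "Ham"), (1, 1, "Spam")]

def scores(free, money, alpha=0):
    """Unnormalized P(C) * P(free|C) * P(money|C); alpha=1 gives add-one smoothing.
    Each binary feature has 2 outcomes, hence '2 * alpha' in the denominator."""
    out = {}
    for c in ("Spam", "Ham"):
        rows = [e for e in emails if e[2] == c]
        n = len(rows)
        p_free  = (sum(e[0] for e in rows) + alpha) / (n + 2 * alpha)
        p_money = (sum(e[1] for e in rows) + alpha) / (n + 2 * alpha)
        p_f = p_free if free else 1 - p_free
        p_m = p_money if money else 1 - p_money
        out[c] = (n / len(emails)) * p_f * p_m
    return out

raw = scores(1, 1)           # no smoothing: Spam ~ 0.4, Ham exactly 0
sm  = scores(1, 1, alpha=1)  # smoothed: Ham is small but no longer zero
total = sum(sm.values())
print(raw)
print({c: round(v / total, 3) for c, v in sm.items()})  # normalized posteriors
```

With smoothing, the verdict is still Spam, but Ham keeps a small nonzero posterior instead of being wiped out by a single unseen word.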

Lab Session : Python Experiments

💻 k-NN with scikit-learn
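A possible snippet for this slide, reproducing Exercise 3.1 (assumes scikit-learn is installed):

```python
from sklearn.neighbors import KNeighborsClassifier

X = [[2, 3], [3, 4], [5, 6], [5, 4], [7, 8]]   # points A, B, C, D, E
y = ["Red", "Red", "Blue", "Blue", "Blue"]

knn = KNeighborsClassifier(n_neighbors=3)       # k = 3, Euclidean metric by default
knn.fit(X, y)

print(knn.predict([[4, 5]]))                    # ['Blue']
d, idx = knn.kneighbors([[4, 5]])               # the 3 nearest (B, C, D, order may vary)
print(d.round(3), idx)                          # all three at distance ~1.414
```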

💻 Logistic Regression
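A possible snippet for this slide. The admission scores below are made-up training data (not from Exercise 2.1), so the fitted coefficients will not match \(z = 0.05 \times \text{score} - 3\) exactly; the point is the scikit-learn workflow:

```python
from sklearn.linear_model import LogisticRegression

# Hypothetical data: admission flips around a score of ~60
scores = [[30], [40], [45], [50], [55], [58], [62], [65], [70], [80], [90]]
admit  = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

model = LogisticRegression(max_iter=1000)
model.fit(scores, admit)

# predict_proba returns [P(y=0), P(y=1)] per row
print(model.predict_proba([[80]])[0, 1])   # high probability of admission
print(model.predict([[40], [80]]))         # rejected at 40, admitted at 80
```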

👋 Conclusion

  • Decision Trees — Maximize information gain at each split
    • Criterion: Entropy & Gain (ID3 algorithm)
    • Strength: Interpretable, visual, no feature scaling needed
  • Logistic Regression — Simple probabilistic baseline
    • Sigmoid: \(\sigma(z) = \frac{1}{1+e^{-z}}\)
    • Strength: Calibrated probabilities, fast training
  • k-NN — Geometric, instance-based learning
    • Always normalize features before use
    • Sensitive to the choice of \(k\) and to outliers
  • Naive Bayes — Probabilistic classifier, great for text
    • Independence assumption (often violated, yet effective)
    • Very fast; use Laplace smoothing to handle zero likelihoods

🎯 Key Takeaways

Essential Formulas

  1. Entropy: \(H(S) = -\sum_{i=1}^{c} p_i \log_2(p_i)\)

  2. Information Gain: \(IG(S,A) = H(S) - \sum_{v} \frac{|S_v|}{|S|} H(S_v)\)

  3. Sigmoid: \(\sigma(z) = \frac{1}{1 + e^{-z}}\)

  4. Euclidean Distance: \(d(P,Q) = \sqrt{\sum_{i=1}^{n}(p_i - q_i)^2}\)

  5. Bayes: \(P(C \mid X) \propto P(C) \prod_{i=1}^{n} P(x_i \mid C)\)

❓ Questions?

Thank you for your attention!

📧 Contact: abdallah.khemais@example.com

📚 Resources: GitHub/TD-Classification