首页 关于我 项目 博客 图谱 简历 联系 English
返回列表

2026年4月15日

3152 Lecture 6

Decision Tree

FIT3152Class NoteEnglish

Lecture Slide: FIT3152 Lecture 06.pdf

Previous: 3152 Lecture 5 - Clustering
Next: 3152 Lecture 7 - Naive Bayes Classification and Evaluate performance

[!NOTE] Other Unit In FIT2086, we also mentioned this at 2086 Lecture 9 - Trees and Nearest Neighbour Methods#Decision Trees But this is the extended one.

Entropy

Entropy (信息熵) measure the chaos in a system (in Thermodynamics). In Information Theory, it represent the uncertainty in a random variable. In the other word, it measures the information included in an event.

The higher the entropy, the more information is included. A real world example, Chinese has higher entropy comparing to English per character.

In Decision Tree, entropy is used to measure how mixed the target classes are. If all examples belong to the same class, entropy is 0. If the classes are evenly mixed, entropy is high.

The entropy can be calculated as below for event:

Entropy(S)=i=1nP(Ci)log2P(Ci)S=the dataset / collection of examplesn=the number of classesCi=class iP(Ci)=the probability of class Ci\begin{aligned} Entropy(S)&=-\sum_{i=1}^{n}P(C_i)\log_2P(C_i)\\ S &= \text{the dataset / collection of examples} \\ n &= \text{the number of classes} \\ C_i &= \text{class } i \\ P(C_i) &= \text{the probability of class } C_i \end{aligned}

For example, if class is Yes / No:

Entropy(S)=P(Yes)log2P(Yes)P(No)log2P(No)Entropy(S)=-P(Yes)\log_2P(Yes)-P(No)\log_2P(No)

Gain

Gain measures that how much the entropy will decrease after split comparing to before

Gain(S,A)=Entropy(S)vValues(A)SvSEntropy(Sv)S=the original dataset / collection of examplesA=the attribute used for splittingValues(A)=all possible values of attribute Av=one possible value of attribute ASv=the subset of S where A=vSv=the number of examples in subset SvS=the total number of examples in SEntropy(S)=entropy before splittingEntropy(Sv)=entropy of subset Sv after splitting\begin{aligned} Gain(S,A)&=Entropy(S)-\sum_{v \in Values(A)}\frac{|S_v|}{|S|} Entropy(S_v) \\[6pt] S &= \text{the original dataset / collection of examples} \\ A &= \text{the attribute used for splitting} \\ Values(A) &= \text{all possible values of attribute } A \\ v &= \text{one possible value of attribute } A \\ S_v &= \text{the subset of } S \text{ where } A=v \\ |S_v| &= \text{the number of examples in subset } S_v \\ |S| &= \text{the total number of examples in } S \\ Entropy(S) &= \text{entropy before splitting} \\ Entropy(S_v) &= \text{entropy of subset } S_v \text{ after splitting} \end{aligned}

In natural language, Gain = before split entropy - split weighted average entropy\text{Gain = before split entropy - split weighted average entropy}.

The higher the Gain is, the more effective the split is, because more uncertainty or impurity is removed after the split. We want to minimise the entropy left after each split, so the tree can separate the classes more efficiently with fewer splits.

Decision Tree

Decision Tree is a classification model that splits the dataset based on predictor variables. Each internal node represents a test on one predictor, each branch represents one possible value of that predictor, and each leaf node represents the predicted class. Decision Tree.png

ID3

ID3 is a specific decision tree algorithm. It uses entropy and information gain to decide which attribute should be used for splitting.

The basic idea is that a good split should make the child groups more pure, it will increase the homogeneity (同质性). In other words, after splitting, each group should contain examples that mostly belong to the same class, so the entropy after split should be lower.

ID3 is a greedy algorithm, once the split attribute is been made, it can not be change in the rest of the tree. For each step ID3 only focus on the maximum gain in current step. By choosing the local maximum gain doesn’t mean it is the global best option.

The process of building a tree with ID3:

  1. Calculate the entropy of the current dataset.
  2. For each predictor attribute, split the dataset by its possible values.
  3. Calculate the weighted average entropy after the split.
  4. Calculate the information gain for each attribute.
  5. Choose the attribute with the highest gain as the split node.
  6. Repeat the same process for each branch that is still impure.

The root node is the first attribute chosen by ID3. It should be the attribute with the highest information gain, because it removes the most uncertainty from the original dataset.

After choosing the root, each value of that attribute becomes a branch. If all examples in a branch belong to the same class, then this branch becomes a leaf node. If the branch still contains mixed classes, then ID3 will calculate gain again using the remaining attributes and continue splitting.

The splitting stops when the data in a node is pure, which means entropy is 0, or when there are no useful attributes left to split.

So the overall goal of ID3 is to choose the attribute that gives the purest class breakdown at each step.

反向链接