信息增益计算和决策树生长过程
给定训练集S,下面以信息增益作为最佳划分的标准,演示信息增益的计算和决策树生长的过程:
根节点
(1)以“Outlook”被选做划分属性
总共有14条数据,打球9条,不打球的5条
根据Outlook进行划分:
- Sunny中有5条数据,其中打球2条,不打球3条
- Overcast中有4条数据,其中打球4条,不打球0条
- Rain中有5条数据,其中打球3条,不打球2条
如下图所示:
计算信息增益:
E(Outlook)=−914log2914−514log2514E(SSunny)=−25log225−35log235E(SOvercast)=−44log244−04log204E(SRain)=−35log235−25log225Gain(Outlook)=E(Outlook)−[514E(SSunny)+414E(SOvercast)+514E(SRain)]=0.246E(Outlook)=-\frac{9}{14}log_2\frac{9}{14}-\frac{5}{14}log_2\frac{5}{14} \\E(S_{Sunny})=-\frac{2}{5}log_2\frac{2}{5}-\frac{3}{5}log_2\frac{3}{5} \\E(S_{Overcast})=-\frac{4}{4}log_2\frac{4}{4}-\frac{0}{4}log_2\frac{0}{4} \\E(S_{Rain})=-\frac{3}{5}log_2\frac{3}{5}-\frac{2}{5}log_2\frac{2}{5} \\Gain(Outlook)=E(Outlook)-[\frac{5}{14}E(S_{Sunny})+\frac{4}{14}E(S_{Overcast})+\frac{5}{14}E(S_{Rain})] =0.246 E(Outlook)=−149log2149−145log2145E(SSunny)=−52log252−53log253E(SOvercast)=−44log244−40log240E(SRain)=−53log253−52log252Gain(Outlook)=E(Outlook)−[145E(SSunny)+144E(SOvercast)+145E(SRain)]=0.246
可见,用属性“Outlook”划分样本集S的信息增益为:
Gain(S,Outlook)=0.246
(2)以“Temperature”作为划分属性
根据Temperature进行划分:
- Hot中有4条数据,其中打球2条,不打球2条
- Mild中有6条数据,其中打球4条,不打球2条
- Cool中有4条数据,其中打球3条,不打球1条
如下图所示:
计算信息增益:
E(Temperature)=−914log2914−514log2514E(SHot)=−24log224−24log224E(SMild)=−46log246−26log226E(SRain)=−34log234−14log214Gain(Temperature)=E(Temperature)−[414E(SHot)+614E(SMild)+414E(SCool)]=0.029E(Temperature)=-\frac{9}{14}log_2\frac{9}{14}-\frac{5}{14}log_2\frac{5}{14} \\E(S_{Hot})=-\frac{2}{4}log_2\frac{2}{4}-\frac{2}{4}log_2\frac{2}{4} \\E(S_{Mild})=-\frac{4}{6}log_2\frac{4}{6}-\frac{2}{6}log_2\frac{2}{6} \\E(S_{Rain})=-\frac{3}{4}log_2\frac{3}{4}-\frac{1}{4}log_2\frac{1}{4} \\Gain(Temperature)=E(Temperature)-[\frac{4}{14}E(S_{Hot})+\frac{6}{14}E(S_{Mild})+\frac{4}{14}E(S_{Cool})] =0.029 E(Temperature)=−149log2149−145log2145E(SHot)=−42log242−42log242E(SMild)=−64log264−62log262E(SRain)=−43log243−41log241Gain(Temperature)=E(Temperature)−[144E(SHot)+146E(SMild)+144E(SCool)]=0.029
(3)以“Humidity”作为划分属性
根据Humidity进行划分:
- High中有7条数据,其中打球3条,不打球4条
- Normal中有7条数据,其中打球6条,不打球1条
如下图所示:
计算信息增益:
E(Humidity)=−914log2914−514log2514E(SHigh)=−37log237−47log247E(SNormal)=−67log267−17log217Gain(Humidity)=E(Humidity)−[714E(SHigh)+714E(SNormal)]=0.151E(Humidity)=-\frac{9}{14}log_2\frac{9}{14}-\frac{5}{14}log_2\frac{5}{14} \\E(S_{High})=-\frac{3}{7}log_2\frac{3}{7}-\frac{4}{7}log_2\frac{4}{7} \\E(S_{Normal})=-\frac{6}{7}log_2\frac{6}{7}-\frac{1}{7}log_2\frac{1}{7} \\Gain(Humidity)=E(Humidity)-[\frac{7}{14}E(S_{High})+\frac{7}{14}E(S_{Normal})] =0.151 E(Humidity)=−149log2149−145log2145E(SHigh)=−73log273−74log274E(SNormal)=−76log276−71log271Gain(Humidity)=E(Humidity)−[147E(SHigh)+147E(SNormal)]=0.151
(4)以“Wind”作为划分属性
根据Wind进行划分:
- Weak中有8条数据,其中打球6条,不打球2条
- Strong中有6条数据,其中打球3条,不打球3条
如下图所示:
计算信息增益:
E(Wind)=−914log2914−514log2514E(SWeak)=−68log268−28log228E(SStrong)=−36log236−36log236Gain(Wind)=E(Wind)−[814E(SWeak)+614E(SStrong)]=0.048E(Wind)=-\frac{9}{14}log_2\frac{9}{14}-\frac{5}{14}log_2\frac{5}{14} \\E(S_{Weak})=-\frac{6}{8}log_2\frac{6}{8}-\frac{2}{8}log_2\frac{2}{8} \\E(S_{Strong})=-\frac{3}{6}log_2\frac{3}{6}-\frac{3}{6}log_2\frac{3}{6} \\Gain(Wind)=E(Wind)-[\frac{8}{14}E(S_{Weak})+\frac{6}{14}E(S_{Strong})] =0.048 E(Wind)=−149log2149−145log2145E(SWeak)=−86log286−82log282E(SStrong)=−63log263−63log263Gain(Wind)=E(Wind)−[148E(SWeak)+146E(SStrong)]=0.048
(5)根节点选择
比较四个以不同属性划分的信息增益:
- Gain(S,Outlook)=0.246
- Gain(S,Temperature)=0.029
- Gain(S,Humidity)=0.151
- Gain(S,Wind)=0.048
所以,对于当前节点,用“Outlook”划分样本集S的信息增益最大,被选为划分属性。
左儿子节点(Sunny)
(1)以“Temperature”作为划分属性
总共有5条数据,打球2条,不打球的3条
根据Temperature进行划分:
- Hot中有2条数据,其中打球0条,不打球2条
- Mild中有2条数据,其中打球1条,不打球1条
- Cool中有1条数据,其中打球1条,不打球0条
如下图所示:
计算信息增益:
E(Temperature)=−25log225−35log235E(SHot)=−02log202−22log222E(SMild)=−12log212−12log212E(SRain)=−11log211−01log201Gain(Temperature)=E(Temperature)−[25E(SHot)+25E(SMild)+15E(SCool)]=0.5710E(Temperature)=-\frac{2}{5}log_2\frac{2}{5}-\frac{3}{5}log_2\frac{3}{5} \\E(S_{Hot})=-\frac{0}{2}log_2\frac{0}{2}-\frac{2}{2}log_2\frac{2}{2} \\E(S_{Mild})=-\frac{1}{2}log_2\frac{1}{2}-\frac{1}{2}log_2\frac{1}{2} \\E(S_{Rain})=-\frac{1}{1}log_2\frac{1}{1}-\frac{0}{1}log_2\frac{0}{1} \\Gain(Temperature)=E(Temperature)-[\frac{2}{5}E(S_{Hot})+\frac{2}{5}E(S_{Mild})+\frac{1}{5}E(S_{Cool})] =0.5710 E(Temperature)=−52log252−53log253E(SHot)=−20log220−22log222E(SMild)=−21log221−21log221E(SRain)=−11log211−10log210Gain(Temperature)=E(Temperature)−[52E(SHot)+52E(SMild)+51E(SCool)]=0.5710
(2)以“Humidity”作为划分属性
总共有5条数据,打球2条,不打球的3条
根据Humidity进行划分:
- High中有3条数据,其中打球0条,不打球3条
- Normal中有2条数据,其中打球2条,不打球0条
如下图所示:
计算信息增益:
E(Humidity)=−25log225−35log235E(SHigh)=−03log203−33log233E(SNormal)=−22log222−02log202Gain(Humidity)=E(Humidity)−35E(SHigh)+25E(SNormal)=0.9710E(Humidity)=-\frac{2}{5}log_2\frac{2}{5}-\frac{3}{5}log_2\frac{3}{5} \\E(S_{High})=-\frac{0}{3}log_2\frac{0}{3}-\frac{3}{3}log_2\frac{3}{3} \\E(S_{Normal})=-\frac{2}{2}log_2\frac{2}{2}-\frac{0}{2}log_2\frac{0}{2} \\Gain(Humidity)=E(Humidity)-\frac{3}{5}E(S_{High})+\frac{2}{5}E(S_{Normal}) =0.9710 E(Humidity)=−52log252−53log253E(SHigh)=−30log230−33log233E(SNormal)=−22log222−20log220Gain(Humidity)=E(Humidity)−53E(SHigh)+52E(SNormal)=0.9710
(3)以“Wind”作为划分属性
总共有5条数据,打球2条,不打球的3条
根据Wind进行划分:
- Weak中有3条数据,其中打球1条,不打球2条
- Strong中有2条数据,其中打球1条,不打球1条
如下图所示:
计算信息增益:
E(Wind)=−25log225−35log235E(SWeak)=−13log213−23log223E(SStrong)=−12log212−12log212Gain(Wind)=E(Wind)−[35E(SWeak)+25E(SStrong)]=0.019973E(Wind)=-\frac{2}{5}log_2\frac{2}{5}-\frac{3}{5}log_2\frac{3}{5} \\E(S_{Weak})=-\frac{1}{3}log_2\frac{1}{3}-\frac{2}{3}log_2\frac{2}{3} \\E(S_{Strong})=-\frac{1}{2}log_2\frac{1}{2}-\frac{1}{2}log_2\frac{1}{2} \\Gain(Wind)=E(Wind)-[\frac{3}{5}E(S_{Weak})+\frac{2}{5}E(S_{Strong})] =0.019973 E(Wind)=−52log252−53log253E(SWeak)=−31log231−32log232E(SStrong)=−21log221−21log221Gain(Wind)=E(Wind)−[53E(SWeak)+52E(SStrong)]=0.019973
(4)左儿子节点选择
比较四个以不同属性划分的信息增益:
- Gain(S,Temperature)=0.5710
- Gain(S,Humidity)=0.9710
- Gain(S,Wind)=0.019973
所以,对于当前节点,用“Humidity”划分样本集S的信息增益最大,被选为划分属性。
右儿子节点(Rain)
(1)以“Temperature”作为划分属性
总共有5条数据,打球3条,不打球的2条
根据Temperature进行划分:
- Mild中有3条数据,其中打球2条,不打球1条
- Cool中有2条数据,其中打球1条,不打球1条
如下图所示:
计算信息增益:
E(Temperature)=−35log235−25log225E(SMild)=−23log223−13log213E(SRain)=−12log212−12log212Gain(Temperature)=E(Temperature)−[35E(SMild)+25E(SCool)]=0.019973E(Temperature)=-\frac{3}{5}log_2\frac{3}{5}-\frac{2}{5}log_2\frac{2}{5} \\E(S_{Mild})=-\frac{2}{3}log_2\frac{2}{3}-\frac{1}{3}log_2\frac{1}{3} \\E(S_{Rain})=-\frac{1}{2}log_2\frac{1}{2}-\frac{1}{2}log_2\frac{1}{2} \\Gain(Temperature)=E(Temperature)-[\frac{3}{5}E(S_{Mild})+\frac{2}{5}E(S_{Cool})] = 0.019973 E(Temperature)=−53log253−52log252E(SMild)=−32log232−31log231E(SRain)=−21log221−21log221Gain(Temperature)=E(Temperature)−[53E(SMild)+52E(SCool)]=0.019973
(2)以“Humidity”作为划分属性
总共有5条数据,打球3条,不打球的2条
根据Humidity进行划分:
- High中有2条数据,其中打球1条,不打球1条
- Normal中有3条数据,其中打球2条,不打球1条
如下图所示:
计算信息增益:
E(Humidity)=−35log235−25log225E(SHigh)=−12log212−12log212E(SNormal)=−23log223−13log213Gain(Humidity)=E(Humidity)−[25E(SHigh)+35E(SNormal)]=0.019973E(Humidity)=-\frac{3}{5}log_2\frac{3}{5}-\frac{2}{5}log_2\frac{2}{5} \\E(S_{High})=-\frac{1}{2}log_2\frac{1}{2}-\frac{1}{2}log_2\frac{1}{2} \\E(S_{Normal})=-\frac{2}{3}log_2\frac{2}{3}-\frac{1}{3}log_2\frac{1}{3} \\Gain(Humidity)=E(Humidity)-[\frac{2}{5}E(S_{High})+\frac{3}{5}E(S_{Normal})] = 0.019973 E(Humidity)=−53log253−52log252E(SHigh)=−21log221−21log221E(SNormal)=−32log232−31log231Gain(Humidity)=E(Humidity)−[52E(SHigh)+53E(SNormal)]=0.019973
(3)以“Wind”作为划分属性
总共有5条数据,打球3条,不打球的2条
根据Wind进行划分:
- Weak中有3条数据,其中打球3条,不打球0条
- Strong中有2条数据,其中打球0条,不打球2条
如下图所示:
计算信息增益:
E(Wind)=−35log235−25log225E(SWeak)=−33log233−03log203E(SStrong)=−02log202−22log222Gain(Wind)=E(Wind)−[35E(SWeak)+25E(SStrong)]=0.9710E(Wind)=-\frac{3}{5}log_2\frac{3}{5}-\frac{2}{5}log_2\frac{2}{5} \\E(S_{Weak})=-\frac{3}{3}log_2\frac{3}{3}-\frac{0}{3}log_2\frac{0}{3} \\E(S_{Strong})=-\frac{0}{2}log_2\frac{0}{2}-\frac{2}{2}log_2\frac{2}{2} \\Gain(Wind)=E(Wind)-[\frac{3}{5}E(S_{Weak})+\frac{2}{5}E(S_{Strong})] =0.9710 E(Wind)=−53log253−52log252E(SWeak)=−33log233−30log230E(SStrong)=−20log220−22log222Gain(Wind)=E(Wind)−[53E(SWeak)+52E(SStrong)]=0.9710
(4)右儿子节点选择
比较四个以不同属性划分的信息增益:
- Gain(S,Temperature)= 0.019973
- Gain(S,Humidity)=0.019973
- Gain(S,Wind)=0.9710
所以,对于当前节点,用“Wind”划分样本集S的信息增益最大,被选为划分属性。
决策树
}E(S_{Weak})+\frac{2}{5}E(S_{Strong})]
=0.9710
$$
(4)右儿子节点选择
比较四个以不同属性划分的信息增益:
- Gain(S,Temperature)= 0.019973
- Gain(S,Humidity)=0.019973
- Gain(S,Wind)=0.9710
所以,对于当前节点,用“Wind”划分样本集S的信息增益最大,被选为划分属性。
[外链图片转存中…(img-7QjJzuWa-1665499737384)]