机器学习笔记之玻尔兹曼机——基于平均场推断梯度求解[续]
引言
基于玻尔兹曼机(三)梯度求解(基于平均场理论的变分推断)的思路继续求解Λ3\Lambda_3Λ3的梯度。
Λ3\Lambda_3Λ3梯度求解
对Λ3\Lambda_3Λ3进行化简。其就是一个熵的形式:
Λ3=−∑h(i)Q(h(i)∣v(i);ϕ)logQ(h(i)∣v(i);ϕ)\Lambda_3 = - \sum_{h^{(i)}} \mathcal Q(h^{(i)} \mid v^{(i)};\phi) \log \mathcal Q(h^{(i)} \mid v^{(i)};\phi)Λ3=−h(i)∑Q(h(i)∣v(i);ϕ)logQ(h(i)∣v(i);ϕ)
由于平均场假设作为条件,将上式进行展开:
这里暂时先将负号带回去了~
Λ3=∑h(i){∏j=1PQ(hj(i)∣v(i);ϕ)log∏j=1P[1Q(hj(i)∣v(i);ϕ)]}=∑h(i){∏j=1PQ(hj(i)∣v(i);ϕ)∑j=1Plog[1Q(hj(i)∣v(i);ϕ)]}\begin{aligned} \Lambda_3 & = \sum_{h^{(i)}} \left\{\prod_{j=1}^{\mathcal P} \mathcal Q(h_j^{(i)} \mid v^{(i)};\phi) \log \prod_{j=1}^{\mathcal P} \left[\frac{1}{\mathcal Q(h_j^{(i)} \mid v^{(i)};\phi)}\right]\right\} \\ & = \sum_{h^{(i)}} \left\{\prod_{j=1}^{\mathcal P} \mathcal Q(h_j^{(i)} \mid v^{(i)};\phi) \sum_{j=1}^{\mathcal P}\log \left[\frac{1}{\mathcal Q(h_j^{(i)} \mid v^{(i)};\phi)}\right]\right\} \end{aligned}Λ3=h(i)∑{j=1∏PQ(hj(i)∣v(i);ϕ)logj=1∏P[Q(hj(i)∣v(i);ϕ)1]}=h(i)∑{j=1∏PQ(hj(i)∣v(i);ϕ)j=1∑Plog[Q(hj(i)∣v(i);ϕ)1]}
继续将∑h(i)=∑h1(i),h2(i),⋯,hP(i)\sum_{h^{(i)}} = \sum_{h_1^{(i)},h_2^{(i)},\cdots,h_{\mathcal P}^{(i)}}∑h(i)=∑h1(i),h2(i),⋯,hP(i)展开,并与相关项进行归纳:
Λ3=∑j=1P∑hj(i)Q(hj(i)∣v(i);ϕ)⋅log[1Q(hj(i)∣v(i);ϕ)]\Lambda_3 = \sum_{j=1}^{\mathcal P} \sum_{h_j^{(i)}} \mathcal Q(h_j^{(i)} \mid v^{(i)};\phi) \cdot \log \left[\frac{1}{\mathcal Q(h_j^{(i)} \mid v^{(i)};\phi)}\right]Λ3=j=1∑Phj(i)∑Q(hj(i)∣v(i);ϕ)⋅log[Q(hj(i)∣v(i);ϕ)1]
继续观察∑hj(i)Q(hj(i)∣v(i);ϕ)⋅log[1Q(hj(i)∣v(i);ϕ)]\sum_{h_j^{(i)}} \mathcal Q(h_j^{(i)} \mid v^{(i)};\phi) \cdot \log \left[\frac{1}{\mathcal Q(h_j^{(i)} \mid v^{(i)};\phi)}\right]∑hj(i)Q(hj(i)∣v(i);ϕ)⋅log[Q(hj(i)∣v(i);ϕ)1],基于hj(i)h_j^{(i)}hj(i)的伯努利分布,该式可表示为如下形式:
Q(hj(i)=1∣v(i);ϕ)⋅log[1Q(hj(i)=1∣v(i);ϕ)]+Q(hj(i)=0∣v(i);ϕ)⋅log[1Q(hj(i)=0∣v(i);ϕ)]=ϕj⋅log[1ϕj]+(1−ϕj)log[11−ϕj]\begin{aligned} & \quad \mathcal Q(h_j^{(i)}=1 \mid v^{(i)};\phi) \cdot log \left[\frac{1}{\mathcal Q(h_j^{(i)}=1 \mid v^{(i)};\phi)}\right] + \mathcal Q(h_j^{(i)}=0 \mid v^{(i)};\phi) \cdot log \left[\frac{1}{\mathcal Q(h_j^{(i)}=0 \mid v^{(i)};\phi)}\right] \\ & = \phi_j \cdot \log \left[\frac{1}{\phi_j}\right] + (1 - \phi_j) \log \left[\frac{1}{1 - \phi_j}\right] \end{aligned}Q(hj(i)=1∣v(i);ϕ)⋅log[Q(hj(i)=1∣v(i);ϕ)1]+Q(hj(i)=0∣v(i);ϕ)⋅log[Q(hj(i)=0∣v(i);ϕ)1]=ϕj⋅log[ϕj1]+(1−ϕj)log[1−ϕj1]
至此,Λ3\Lambda_3Λ3可表示为:
Λ3=∑j=1P{ϕj⋅log[1ϕj]+(1−ϕj)log[11−ϕj]}\Lambda_3 = \sum_{j=1}^{\mathcal P}\left\{\phi_j \cdot \log \left[\frac{1}{\phi_j}\right] + (1 - \phi_j) \log \left[\frac{1}{1 - \phi_j}\right]\right\}Λ3=j=1∑P{ϕj⋅log[ϕj1]+(1−ϕj)log[1−ϕj1]}
求解最优参数ϕ^j\hat {\phi}_jϕ^j
至此,Λ1,Λ2,Λ3\Lambda_1,\Lambda_2,\Lambda_3Λ1,Λ2,Λ3全部化简完毕,将这三项分别对ϕj\phi_jϕj求偏导。具体结果如下:
{∂∂ϕj[∑i=1D∑l=1Pϕl⋅vi(i)⋅Wil]=∑i=1Dvi(i)⋅Wij∂∂ϕj[∑j=1P∑l≠jPϕi⋅ϕl⋅Jil]=∑l≠jPϕl⋅Jil∂∂ϕj[∑j=1P{ϕj⋅log[1ϕj]+(1−ϕj)log[11−ϕj]}]=−logϕj1−ϕj\begin{cases} \frac{\partial}{\partial \phi_j} \left[\sum_{i=1}^{\mathcal D}\sum_{l=1}^{\mathcal P} \phi_l \cdot v_i^{(i)} \cdot \mathcal W_{il}\right] = \sum_{i=1}^{\mathcal D} v_i^{(i)} \cdot \mathcal W_{ij} \\ \frac{\partial}{\partial \phi_j} \left[\sum_{j=1}^{\mathcal P}\sum_{l\neq j}^{\mathcal P} \phi_i \cdot \phi_l \cdot \mathcal J_{il}\right] = \sum_{l \neq j}^{\mathcal P} \phi_l \cdot \mathcal J_{il} \\ \frac{\partial}{\partial \phi_j} [\sum_{j=1}^{\mathcal P}\left\{\phi_j \cdot \log \left[\frac{1}{\phi_j}\right] + (1 - \phi_j) \log \left[\frac{1}{1 - \phi_j}\right]\right\}] = -\log \frac{\phi_j}{1 - \phi_j} \end{cases}⎩⎨⎧∂ϕj∂[∑i=1D∑l=1Pϕl⋅vi(i)⋅Wil]=∑i=1Dvi(i)⋅Wij∂ϕj∂[∑j=1P∑l=jPϕi⋅ϕl⋅Jil]=∑l=jPϕl⋅Jil∂ϕj∂[∑j=1P{ϕj⋅log[ϕj1]+(1−ϕj)log[1−ϕj1]}]=−log1−ϕjϕj
令三项之和为0,求解ϕj\phi_jϕj:
∂∂ϕj[Λ1+Λ2+Λ3]≜0⇔∑i=1Dvi(i)⋅Wij+∑l≠jPϕl⋅Jil−logϕj1−ϕj=0\frac{\partial}{\partial \phi_j} [\Lambda_1 + \Lambda_2 + \Lambda_3] \triangleq 0 \Leftrightarrow \sum_{i=1}^{\mathcal D} v_i^{(i)} \cdot \mathcal W_{ij} + \sum_{l \neq j}^{\mathcal P} \phi_l \cdot \mathcal J_{il} - \log \frac{\phi_j}{1 - \phi_j} = 0 ∂ϕj∂[Λ1+Λ2+Λ3]≜0⇔i=1∑Dvi(i)⋅Wij+l=j∑Pϕl⋅Jil−log1−ϕjϕj=0
因而有:
ϕj[1+exp(∑i=1Dvi(i)⋅Wij+∑l≠jPϕl⋅Jil)]=exp(∑i=1Dvi(i)⋅Wij+∑l≠jPϕl⋅Jil)⇒ϕj=11+exp(∑i=1Dvi(i)⋅Wij+∑l≠jPϕl⋅Jil)=Sigmoid[∑i=1Dvi(i)⋅Wij+∑l≠jPϕl⋅Jil]\phi_j \left[1 + \exp \left(\sum_{i=1}^{\mathcal D} v_i^{(i)} \cdot \mathcal W_{ij} + \sum_{l \neq j}^{\mathcal P} \phi_l \cdot \mathcal J_{il}\right)\right] = \exp \left(\sum_{i=1}^{\mathcal D} v_i^{(i)} \cdot \mathcal W_{ij} + \sum_{l \neq j}^{\mathcal P} \phi_l \cdot \mathcal J_{il}\right) \\ \begin{aligned} \Rightarrow \phi_j & = \frac{1}{1 + \exp \left(\sum_{i=1}^{\mathcal D} v_i^{(i)} \cdot \mathcal W_{ij} + \sum_{l \neq j}^{\mathcal P} \phi_l \cdot \mathcal J_{il}\right)} \\ & = \text{Sigmoid} \left[\sum_{i=1}^{\mathcal D} v_i^{(i)} \cdot \mathcal W_{ij} + \sum_{l \neq j}^{\mathcal P} \phi_l \cdot \mathcal J_{il}\right] \end{aligned} ϕj1+expi=1∑Dvi(i)⋅Wij+l=j∑Pϕl⋅Jil=expi=1∑Dvi(i)⋅Wij+l=j∑Pϕl⋅Jil⇒ϕj=1+exp(∑i=1Dvi(i)⋅Wij+∑l=jPϕl⋅Jil)1=Sigmoidi=1∑Dvi(i)⋅Wij+l=j∑Pϕl⋅Jil
很明显,这是一个迭代方程——用非ϕj\phi_jϕj的其他结果的线性运算∑l≠jP\sum_{l \neq j}^{\mathcal P}∑l=jP对ϕj\phi_jϕj进行表示。并且它是一个不动点方程。它的具体求解方式是:在初始状态下给定随机结果,固定住ϕj\phi_jϕj之外的其他项,求解当前迭代步骤下ϕj\phi_jϕj的最优解;以此类推,直到所有的ϕj(j=1,2,⋯,P)\phi_{j}(j=1,2,\cdots,\mathcal P)ϕj(j=1,2,⋯,P)均固定一遍,此时一次迭代结束,继续迭代下去,基于不动点方程的性质,ϕj(j=1,2,⋯,P)\phi_j(j=1,2,\cdots,\mathcal P)ϕj(j=1,2,⋯,P)均会收敛至某一结果,至此ϕ={ϕ1,ϕ2,⋯,ϕP}\phi = \{\phi_1,\phi_2,\cdots,\phi_{\mathcal P}\}ϕ={ϕ1,ϕ2,⋯,ϕP}可以被近似出来,最终求解Q(h(i)∣v(i);ϕ)\mathcal Q(h^{(i)} \mid v^{(i)};\phi)Q(h(i)∣v(i);ϕ)的分布结果,并最终替代Pmodel(h(i)∣v(i);θ)\mathcal P_{model}(h^{(i)} \mid v^{(i)};\theta)Pmodel(h(i)∣v(i);θ)对模型参数的梯度进行描述。
这实际上就是‘坐标上升思想’。
返回要求解的模型参数梯度:
{∇W=η(EPdata[v(i)(h(i))T]−EPmodel[v(i)(h(i))T])Pdata⇒Pdata(v(i)∈V)⋅Pmodel(h(i)∣v(i))Pmodel=Pmodel(v(i),h(i))\begin{cases} \nabla_{\mathcal W} = \eta \left(\mathbb E_{\mathcal P_{data}} \left[v^{(i)}(h^{(i)})^T\right] - \mathbb E_{\mathcal P_{model}} \left[v^{(i)}(h^{(i)})^T\right]\right)\\ \mathcal P_{data} \Rightarrow \mathcal P_{data}(v^{(i)} \in \mathcal V) \cdot \mathcal P_{model}(h^{(i)} \mid v^{(i)}) \\ \mathcal P_{model} = \mathcal P_{model}(v^{(i)},h^{(i)}) \end{cases}⎩⎨⎧∇W=η(EPdata[v(i)(h(i))T]−EPmodel[v(i)(h(i))T])Pdata⇒Pdata(v(i)∈V)⋅Pmodel(h(i)∣v(i))Pmodel=Pmodel(v(i),h(i))
此时关于隐变量的后验概率Pmodel\mathcal P_{model}Pmodel并不需要使用MCMC采样方法进行求解,通过变分推断的方式也可以进行求解。
关于负相的处理方式,依然需要使用采样方法,随着技术的迭代,采样方式也得到了更新,例如对比散度方法,同样可以加快采样速度。
至此,关于玻尔兹曼机的介绍到此结束。
相关参考:
(系列二十八)玻尔兹曼机7-平均场推断3