The PAC Learning Framework
Definition 2.1 (Generalization error)
Given a hypothesis $h \in H$, a target concept $c \in C$, and an underlying distribution $D$, the generalization error of $h$ is
$$R(h)=P_{x\sim D}[h(x)\ne c(x)] = E_{x\sim D}\left[1_{h(x)\ne c(x)}\right]$$
Since both $D$ and $c$ are unknown, $R(h)$ cannot be computed directly.
Definition 2.2 (Empirical error)
Given $h\in H$, $c \in C$, and a sample $S=(x_1, \cdots, x_m)$, the empirical error (or empirical risk) is defined as
$$\hat{R}(h)=\frac{1}{m}\sum_{i=1}^{m}1_{h(x_i)\ne c(x_i)}$$
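As a concrete illustration, the empirical error is simply the fraction of sample points on which $h$ disagrees with $c$. A minimal sketch (the threshold concepts and the sample below are invented for the example):

```python
# Sketch: empirical error of a hypothesis h against a target concept c.
# h, c, and the sample are hypothetical stand-ins for illustration.

def empirical_error(h, c, sample):
    """R_hat(h) = (1/m) * sum of 1[h(x_i) != c(x_i)]."""
    m = len(sample)
    return sum(1 for x in sample if h(x) != c(x)) / m

c = lambda x: x >= 0.5   # unknown target concept
h = lambda x: x >= 0.6   # learner's hypothesis
sample = [0.1, 0.55, 0.58, 0.7, 0.9]

print(empirical_error(h, c, sample))  # h errs on 0.55 and 0.58 -> 2/5 = 0.4
```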
Note that the expectation of the empirical error over i.i.d. samples equals the generalization error:
$$E_{S\sim D^m}[\hat{R}(h)] = \frac{1}{m}\sum_{i=1}^{m}E_{S\sim D^m}\left[1_{h(x_i)\ne c(x_i)}\right] = \frac{1}{m}\sum_{i=1}^{m}E_{x\sim D}\left[1_{h(x)\ne c(x)}\right] = R(h)$$
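This identity can be sanity-checked numerically: averaging $\hat{R}(h)$ over many independent samples drawn from $D$ should approach $R(h)$. A sketch under assumed choices ($D$ uniform on $[0,1)$, hypothetical threshold concepts):

```python
import random

random.seed(0)

c = lambda x: x >= 0.5   # hypothetical target concept
h = lambda x: x >= 0.6   # hypothesis; disagrees with c exactly on [0.5, 0.6)

def empirical_error(h, c, sample):
    # R_hat(h): fraction of sample points where h and c disagree
    return sum(h(x) != c(x) for x in sample) / len(sample)

# Under D = Uniform[0, 1), the true error R(h) is the mass of [0.5, 0.6), i.e. 0.1.
m, trials = 50, 20000
avg = sum(
    empirical_error(h, c, [random.random() for _ in range(m)])
    for _ in range(trials)
) / trials
print(abs(avg - 0.1) < 0.005)  # average empirical error is close to R(h)
```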
Definition 2.3 (PAC-learning)
A concept class $C$ is PAC-learnable if there exists an algorithm $A$ such that for every distribution $D$, every concept $c\in C$, and all $\epsilon>0$, $\delta>0$, once the sample size $m$ is large enough (polynomial in $1/\epsilon$ and $1/\delta$),
$$P_{S\sim D^m}[R(h_S)\le \epsilon] \ge 1-\delta$$
Such an $A$ is called a PAC-learning algorithm.
Theorem 2.1 (Learning bound, finite hypothesis set, consistent case)
Let $H$ be a finite set of functions from $\mathcal{X}$ to $\mathcal{Y}$, and let $A$ be a learning algorithm that, given an i.i.d. sample $S$, returns a consistent hypothesis $h_S$, i.e. $\hat{R}(h_S)=0$. Then for all $\epsilon, \delta>0$, the inequality
$$P_{S\sim D^m}[R(h_S)\le \epsilon] \ge 1-\delta$$
holds whenever
$$m\ge\frac{1}{\epsilon}\left(\log|H|+\log\frac{1}{\delta}\right)$$
Equivalently, this yields the generalization bound
$$R(h_S)\le\frac{1}{m}\left(\log|H|+\log\frac{1}{\delta}\right)$$
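Plugging in numbers (the values of $|H|$, $\epsilon$, and $\delta$ below are arbitrary) shows how the sample complexity and generalization bounds behave for a consistent learner:

```python
import math

def sample_complexity(H_size, eps, delta):
    """Smallest integer m with m >= (1/eps) * (log|H| + log(1/delta))."""
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / eps)

def generalization_bound(H_size, m, delta):
    """R(h_S) <= (1/m) * (log|H| + log(1/delta)) for a consistent h_S."""
    return (math.log(H_size) + math.log(1 / delta)) / m

# e.g. |H| = 2**20 hypotheses, accuracy eps = 0.05, confidence 1 - delta = 0.99
m = sample_complexity(2**20, eps=0.05, delta=0.01)
print(m)  # → 370
print(generalization_bound(2**20, m, delta=0.01) <= 0.05)  # bound achieved at this m
```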
Proof: via a uniform convergence bound, we bound the probability that there exists a consistent hypothesis whose generalization error exceeds $\epsilon$; subtracting from 1 then bounds the probability that every consistent hypothesis has generalization error at most $\epsilon$.
$$\begin{aligned}
P[\exists h\in H:\ \hat{R}(h)=0 \wedge R(h)>\epsilon]
&= P\big[(\hat{R}(h_1)=0\wedge R(h_1)>\epsilon)\cup\cdots\cup(\hat{R}(h_{|H|})=0\wedge R(h_{|H|})>\epsilon)\big]\\
&\le\sum_{h\in H}P[\hat{R}(h)=0\wedge R(h)>\epsilon]\\
&\le |H|\,P[\hat{R}(h)=0\mid R(h)>\epsilon]\\
&\le|H|(1-\epsilon)^m\\
&\le|H|e^{-\epsilon m}
\end{aligned}$$
Setting the right-hand side equal to $\delta$ and solving for $m$:
$$\delta = |H|e^{-\epsilon m} \implies \log\delta=\log|H|-\epsilon m \implies m=\frac{1}{\epsilon}\left(\log|H|+\log\frac{1}{\delta}\right)$$
This is the sample complexity bound; solving the same equation for $\epsilon$ instead gives the generalization bound
$$R(h_S)\le\frac{1}{m}\left(\log|H|+\log\frac{1}{\delta}\right)$$
C.1 Generalization bound for a fixed hypothesis
By Hoeffding's inequality,
$$P[|\hat{R}(h)-R(h)|\ge \epsilon] \le 2e^{-2m\epsilon^2}$$
Setting $2e^{-2m\epsilon^2}=\delta$ and solving gives $\epsilon = \sqrt{\frac{1}{2m}\log\frac{2}{\delta}}$.
Hence, with probability at least $1-\delta$,
$$R(h)\le\hat{R}(h)+\sqrt{\frac{1}{2m}\log\frac{2}{\delta}}$$
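Numerically, this gives a confidence radius of $\sqrt{\log(2/\delta)/(2m)}$ around the empirical error of a single fixed hypothesis; a quick sketch with arbitrary values of $m$ and $\delta$:

```python
import math

def hoeffding_radius(m, delta):
    """epsilon such that |R_hat(h) - R(h)| < epsilon with prob. >= 1 - delta."""
    return math.sqrt(math.log(2 / delta) / (2 * m))

# With m = 1000 samples and delta = 0.05:
eps = hoeffding_radius(1000, 0.05)
print(round(eps, 4))  # → 0.0429
```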
Theorem 2.2 (Learning bound, finite hypothesis set, inconsistent case)
Let $H$ be a finite hypothesis set. For any $\delta>0$, with probability at least $1-\delta$,
$$\forall h \in H,\quad R(h)\le\hat{R}(h)+\sqrt{\frac{\log|H|+\log\frac{2}{\delta}}{2m}}$$
Proof: by the union bound and Hoeffding's inequality,
$$\begin{aligned}
P[\exists h\in H:\ |\hat{R}(h)-R(h)|\ge\epsilon]
&=P\big[(|\hat{R}(h_1)-R(h_1)|\ge\epsilon) \cup\cdots \cup(|\hat{R}(h_{|H|})-R(h_{|H|})|\ge\epsilon)\big]\\
&\le\sum_{h\in H}P[|\hat{R}(h)-R(h)|\ge\epsilon]\\
&\le2|H|e^{-2m\epsilon^2}
\end{aligned}$$
Setting $2|H|e^{-2m\epsilon^2}=\delta$ and solving for $\epsilon$:
$$\epsilon=\sqrt{\frac{\log\frac{2|H|}{\delta}}{2m}}=\sqrt{\frac{\log|H|+\log\frac{2}{\delta}}{2m}}$$
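The finite-$H$ bound differs from the single-hypothesis Hoeffding radius only by the extra $\log|H|$ term contributed by the union bound; a sketch comparing the two (the numbers are arbitrary):

```python
import math

def bound_fixed(m, delta):
    # single fixed hypothesis: Hoeffding radius sqrt(log(2/delta) / (2m))
    return math.sqrt(math.log(2 / delta) / (2 * m))

def bound_finite_H(H_size, m, delta):
    # uniform over a finite H: extra log|H| from the union bound
    return math.sqrt((math.log(H_size) + math.log(2 / delta)) / (2 * m))

m, delta = 1000, 0.05
print(bound_fixed(m, delta) < bound_finite_H(1000, m, delta))  # union bound is looser
```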
This shows that the generalization error is bounded above by the empirical error plus a complexity term. Among models with similar empirical error, it therefore pays to reduce model complexity (a smaller $|H|$), which is Occam's razor.