Reinforcement Learning with Code 【Chapter 9. Policy Gradient Methods】


Reinforcement Learning with Code

This note records how the author began to learn RL. Both theoretical understanding and code practice are presented. Many materials are referenced, such as ZhaoShiyu's Mathematical Foundation of Reinforcement Learning.

Contents

  • Reinforcement Learning with Code
    • Chapter 9. Policy Gradient Methods
      • 9.1 Basic idea of policy gradient
      • 9.2 Metrics to define optimal policies
      • 9.3 Gradients of the metrics
      • 9.4 Policy gradient by Monte Carlo estimation: REINFORCE
    • Reference

Chapter 9. Policy Gradient Methods

The idea of function approximation can be applied to represent not only state/action values but also policies. Up to now in this book, policies have been represented by tables: the action probabilities of all states are stored in a table $\pi(a|s)$, each entry of which is indexed by a state and an action. In this chapter, we show that policies can be represented by parameterized functions denoted as $\pi(a|s,\theta)$, where $\theta\in\mathbb{R}^m$ is a parameter vector. The function representation is also sometimes written as $\textcolor{blue}{\pi(a,s,\theta)}$, $\textcolor{blue}{\pi_\theta(a|s)}$, or $\textcolor{blue}{\pi_\theta(a,s)}$.

When policies are represented as functions, optimal policies can be found by optimizing certain scalar metrics. Methods of this kind are called policy gradient methods.

9.1 Basic idea of policy gradient

How to define optimal policies? When represented as a table, a policy $\pi$ is defined as optimal if it maximizes every state value. When represented by a function, a policy $\pi$ is fully determined by $\theta$ together with the function structure. The policy is defined as optimal if it maximizes certain scalar metrics, which we introduce in the next section.

How to update policies? When represented as a table, a policy $\pi$ can be updated by directly changing the entries in the table. However, when represented by a parameterized function, a policy $\pi$ can no longer be updated in this way. Instead, it can only be improved by updating the parameter $\theta$: we use gradient-based methods that optimize some scalar metric with respect to $\theta$.
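
To make the parameterization concrete, here is a minimal sketch (my own illustration, not code from the book) of a discrete-action policy $\pi(a|s,\theta)$ implemented as a small softmax network in PyTorch; the state dimension, number of actions, and hidden size are placeholder values.

```python
import torch
import torch.nn as nn

class SoftmaxPolicy(nn.Module):
    """pi(a|s, theta): maps a state to a probability distribution over actions."""
    def __init__(self, state_dim, num_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state):                 # state: (batch, state_dim)
        logits = self.net(state)
        return torch.softmax(logits, dim=-1)  # each row sums to 1

# placeholder dimensions, e.g. a 4-dimensional state and 2 actions
policy = SoftmaxPolicy(state_dim=4, num_actions=2)
probs = policy(torch.zeros(1, 4))             # pi(.|s, theta) for one dummy state
```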

9.2 Metrics to define optimal policies

The first metric is the average state value, or simply the average value. Let

$$v_\pi = [\cdots, v_\pi(s), \cdots]^T \in \mathbb{R}^{|\mathcal{S}|}, \qquad d_\pi = [\cdots, d_\pi(s), \cdots]^T \in \mathbb{R}^{|\mathcal{S}|}$$

be the vector of state values and a probability distribution over states, respectively. Here, $d_\pi(s)\ge 0$ is the weight for state $s$ and satisfies $\sum_s d_\pi(s)=1$. The metric of average value is defined as

$$\begin{aligned} \textcolor{red}{\bar{v}_\pi} & \textcolor{red}{\triangleq d_\pi^T v_\pi} \\ & \textcolor{red}{= \sum_s d_\pi(s)v_\pi(s)} \\ & \textcolor{red}{= \mathbb{E}[v_\pi(S)]} \end{aligned}$$

where $S \sim d_\pi$. As its name suggests, $\bar{v}_\pi$ is simply a weighted average of the state values. The distribution $d_\pi$ can be chosen as the stationary distribution under $\pi$, which satisfies

$$d^T_\pi P_\pi = d^T_\pi$$

where $P_\pi$ is the state transition probability matrix.
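
As a side note (not from the book), when $P_\pi$ is known, the stationary distribution can be computed numerically, for example by power iteration on $d^T_\pi P_\pi = d^T_\pi$. A minimal NumPy sketch; the two-state chain is made up for illustration:

```python
import numpy as np

def stationary_distribution(P_pi, iters=10_000, tol=1e-12):
    """Solve d^T P_pi = d^T by power iteration (P_pi is row-stochastic)."""
    n = P_pi.shape[0]
    d = np.full(n, 1.0 / n)            # start from the uniform distribution
    for _ in range(iters):
        d_next = d @ P_pi              # apply the transition matrix once
        if np.max(np.abs(d_next - d)) < tol:
            break
        d = d_next
    return d / d.sum()

P_pi = np.array([[0.9, 0.1],           # toy transition matrix under some policy pi
                 [0.2, 0.8]])
print(stationary_distribution(P_pi))   # approximately [0.667, 0.333]
```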

The second metric is the average one-step reward, or simply the average reward. Let

$$r_\pi = [\cdots, r_\pi(s),\cdots]^T \in \mathbb{R}^{|\mathcal{S}|}$$

be the vector of one-step immediate rewards. Here

$$r_\pi(s) = \sum_a \pi(a|s)\, r(s,a)$$

is the average of the one-step immediate reward that can be obtained starting from state $s$, and $r(s,a)=\mathbb{E}[R|s,a]=\sum_r r\, p(r|s,a)$ is the average of the one-step immediate reward that can be obtained after taking action $a$ at state $s$. Then the metric is defined as

$$\begin{aligned} \textcolor{red}{\bar{r}_\pi} & \textcolor{red}{\triangleq d_\pi^T r_\pi} \\ & \textcolor{red}{= \sum_s d_\pi(s)\sum_a \pi(a|s) \sum_r r\, p(r|s,a)} \\ & \textcolor{red}{= \sum_s d_\pi(s)\sum_a \pi(a|s)\, r(s,a)} \\ & \textcolor{red}{= \sum_s d_\pi(s)\, r_\pi(s)} \\ & \textcolor{red}{= \mathbb{E}[r_\pi(S)]} \end{aligned}$$

where $S\sim d_\pi$. As its name suggests, $\bar{r}_\pi$ is simply a weighted average of the one-step immediate rewards.

The third metric is the state value of a specific starting state, $v_\pi(s_0)$. For some tasks, the agent can only start from a specific state $s_0$. In this case, we only care about the long-term return starting from $s_0$. This metric can also be viewed as a weighted average of the state values:

$$\textcolor{red}{v_\pi(s_0) = \sum_{s\in\mathcal{S}} d_0(s)\, v_\pi(s)}$$

where $d_0(s_0)=1$ and $d_0(s)=0$ for $s\ne s_0$.

We aim to search for the value of the parameter $\theta$ that maximizes these metrics.

9.3 Gradients of the metrics

Theorem 9.1 (Policy gradient theorem). The gradient of the average-reward metric $\bar{r}_\pi$ is

$$\textcolor{blue}{\nabla_\theta \bar{r}_\pi(\theta) \simeq \sum_s d_\pi(s)\sum_a \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a)}$$

where $\nabla_\theta \pi$ is the gradient of $\pi$ with respect to $\theta$. Here $\simeq$ refers to either strict or approximate equality. In particular, it is a strict equality in the undiscounted case where $\gamma=1$ and an approximate equality in the discounted case where $0<\gamma<1$. The approximation in the discounted case is more accurate when $\gamma$ is closer to $1$. Moreover, the equation has a more compact and useful form expressed in terms of an expectation:

$$\textcolor{red}{\nabla_\theta \bar{r}_\pi(\theta) \simeq \mathbb{E}[\nabla_\theta \ln \pi(A|S,\theta)\, q_\pi(S,A)]}$$

where $\ln$ is the natural logarithm, $S\sim d_\pi$, and $A\sim \pi(\cdot|S,\theta)$.

Why are the two equations above equivalent? Here is the derivation.

$$\begin{aligned} \nabla_\theta \bar{r}_\pi(\theta) & \simeq \sum_s d_\pi(s)\sum_a \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a) \\ & = \mathbb{E}\Big[ \sum_a \nabla_\theta \pi(a|S,\theta)\, q_\pi(S,a) \Big] \end{aligned}$$

where $S \sim d_\pi$. Furthermore, consider the function $\ln\pi$, where $\ln$ is the natural logarithm. We have

$$\begin{aligned} \nabla_\theta \ln \pi (a|s,\theta) & = \frac{\nabla_\theta \pi(a|s,\theta)}{\pi(a|s,\theta)} \\ \Longrightarrow\quad \nabla_\theta \pi(a|s,\theta) &= \pi(a|s,\theta)\, \nabla_\theta \ln \pi (a|s,\theta) \end{aligned}$$

Substituting this back into the expectation gives

$$\begin{aligned} \nabla_\theta \bar{r}_\pi(\theta) & = \mathbb{E}\Big[ \sum_a \nabla_\theta \pi(a|S,\theta)\, q_\pi(S,a) \Big] \\ & = \mathbb{E}\Big[ \sum_a \pi(a|S,\theta)\, \nabla_\theta \ln \pi (a|S,\theta)\, q_\pi(S,a) \Big] \\ & = \mathbb{E}\big[ \nabla_\theta \ln \pi (A|S,\theta)\, q_\pi(S,A) \big] \end{aligned}$$

where $A \sim \pi(\cdot|S,\theta)$.
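
In code, $\nabla_\theta \ln\pi(a|s,\theta)$ is exactly what automatic differentiation returns when we backpropagate through the log-probability of a sampled action. A small PyTorch sketch, reusing the hypothetical `SoftmaxPolicy` instance from Section 9.1 (the all-zero state is just a placeholder):

```python
import torch

state = torch.zeros(1, 4)                            # dummy state, matching state_dim=4
probs = policy(state)                                # pi(.|S, theta)
dist = torch.distributions.Categorical(probs=probs)
action = dist.sample()                               # A ~ pi(.|S, theta)

log_prob = dist.log_prob(action)                     # ln pi(A|S, theta)
log_prob.sum().backward()                            # p.grad now holds grad_theta ln pi(A|S, theta)
grad_log_pi = [p.grad for p in policy.parameters()]
```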

Next we show that the two metrics, the average one-step reward $\bar{r}_\pi$ and the average state value $\bar{v}_\pi$, are equivalent. When the discount rate $\gamma\in[0,1)$ is given,

$$\textcolor{blue}{\bar{r}_\pi = (1-\gamma)\bar{v}_\pi}$$

Proof: note that $\bar{v}_\pi(\theta)=d^T_\pi v_\pi$ and $\bar{r}_\pi=d^T_\pi r_\pi$, where $v_\pi$ and $r_\pi$ satisfy the Bellman equation $v_\pi=r_\pi + \gamma P_\pi v_\pi$. Left-multiplying both sides of the Bellman equation by $d_\pi^T$ gives

$$\bar{v}_\pi = \bar{r}_\pi + \gamma d^T_\pi P_\pi v_\pi = \bar{r}_\pi + \gamma d^T_\pi v_\pi = \bar{r}_\pi + \gamma \bar{v}_\pi$$

which implies $\bar{r}_\pi = (1-\gamma)\bar{v}_\pi$.
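
The identity is also easy to check numerically. A short NumPy sketch with a randomly generated $P_\pi$ and $r_\pi$ (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 5, 0.9

# random row-stochastic transition matrix and reward vector under some policy pi
P_pi = rng.random((n, n)); P_pi /= P_pi.sum(axis=1, keepdims=True)
r_pi = rng.random(n)

v_pi = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)  # Bellman: v = r + gamma P v

# stationary distribution: left eigenvector of P_pi for eigenvalue 1
w, V = np.linalg.eig(P_pi.T)
d_pi = np.real(V[:, np.argmin(np.abs(w - 1))])
d_pi /= d_pi.sum()

print(np.isclose(d_pi @ r_pi, (1 - gamma) * (d_pi @ v_pi)))  # True
```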

Theorem 9.2 (Gradient of $v_\pi(s_0)$ in the discounted case). In the discounted case where $\gamma \in [0,1)$, the gradient of $v_\pi(s_0)$ is

$$\nabla_\theta v_\pi(s_0) = \mathbb{E}[\nabla_\theta \ln \pi(A|S, \theta)\, q_\pi(S,A)]$$

where $S \sim \rho_\pi$ and $A \sim \pi(\cdot|S,\theta)$. Here, the state distribution $\rho_\pi$ is

$$\rho_\pi(s) = \Pr_\pi (s \mid s_0) = \sum_{k=0}^{\infty} \gamma^k \Pr (s_0\to s, k, \pi) = \big[(I_n - \gamma P_\pi)^{-1}\big]_{s_0,s}$$

which is the discounted total probability of transitioning from $s_0$ to $s$ under policy $\pi$.

Theorem 9.3 (Gradients of $\bar{v}_\pi$ and $\bar{r}_\pi$ in the discounted case). In the discounted case where $\gamma \in [0,1)$, the gradients of $\bar{v}_\pi$ and $\bar{r}_\pi$ are, respectively,

$$\begin{aligned} \nabla_\theta \bar{v}_\pi & \approx \frac{1}{1-\gamma} \sum_s d_\pi(s) \sum_a \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a) \\ \nabla_\theta \bar{r}_\pi & \approx \sum_s d_\pi(s) \sum_a \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a) \end{aligned}$$

where the approximations are more accurate when $\gamma$ is closer to $1$.

9.4 Policy gradient by Monte Carlo estimation: REINFORCE

Consider $J(\theta) = \bar{r}_\pi(\theta)$ or $J(\theta) = v_\pi(s_0)$. The gradient-ascent algorithm for maximizing $J(\theta)$ is

$$\begin{aligned} \theta_{t+1} & = \theta_t + \alpha \nabla_\theta J(\theta) \\ & = \theta_t + \alpha\, \mathbb{E}[\nabla_\theta \ln\pi(A|S,\theta_t)\, q_\pi(S,A)] \end{aligned}$$

where $\alpha>0$ is a constant learning rate. Since the expectation on the right-hand side is unknown, we can replace it with a sample (the idea of stochastic gradient ascent). Then we have

$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln\pi(a_t|s_t,\theta_t)\, q_\pi(s_t,a_t)$$

However, this update cannot be implemented directly because the true action value $q_\pi(s_t,a_t)$ is unknown. Hence, we use an estimate $q_t(s_t,a_t)$ in its place:

$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln\pi(a_t|s_t,\theta_t)\, q_t(s_t,a_t)$$

If $q_\pi(s_t,a_t)$ is approximated by Monte Carlo estimation,

$$\begin{aligned} q_\pi(s_t,a_t) & \triangleq \mathbb{E}[G_t \mid S_t=s_t, A_t=a_t] \\ & \textcolor{blue}{\approx \frac{1}{n} \sum_{i=1}^n g^{(i)}(s_t,a_t)} \end{aligned}$$

With stochastic approximation, we do not need to collect $n$ episodes starting from $(s_t,a_t)$ to approximate $q_\pi(s_t,a_t)$; a single discounted return starting from $(s_t,a_t)$ suffices:

$$q_\pi(s_t,a_t) \approx q_t(s_t,a_t) = \sum_{k=t+1}^T \gamma^{k-t-1}r_k$$
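
In code, these returns can be computed with a single backward pass over the rewards of one episode. A small helper sketch (my own; it assumes `rewards[t]` stores $r_{t+1}$, the reward received after taking $a_t$ in $s_t$):

```python
def discounted_returns(rewards, gamma):
    """q_t(s_t, a_t) = sum_{k=t+1}^{T} gamma^{k-t-1} r_k, computed backwards in time."""
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return list(reversed(returns))

print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))  # [2.62, 1.8, 2.0]
```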

The algorithm is called REINFORCE.

Pseudocode: (figure omitted)
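
Putting the pieces together, here is a minimal PyTorch sketch of REINFORCE. It reuses the `SoftmaxPolicy` class and the `discounted_returns` helper sketched above, assumes a classic Gym-style environment (`env.reset()` returns a state, `env.step(a)` returns `(next_state, reward, done, info)`), and for simplicity sums the per-step updates of one episode into a single gradient step. It is an illustration of the algorithm, not the book's reference implementation.

```python
import torch

def reinforce(env, policy, episodes=1000, gamma=0.99, alpha=1e-2):
    optimizer = torch.optim.SGD(policy.parameters(), lr=alpha)
    for _ in range(episodes):
        # 1) generate one episode {(s_t, a_t, r_{t+1})} following pi(.|., theta_t)
        log_probs, rewards = [], []
        state, done = env.reset(), False
        while not done:
            probs = policy(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
            dist = torch.distributions.Categorical(probs=probs)
            action = dist.sample()
            log_probs.append(dist.log_prob(action))     # ln pi(a_t|s_t, theta)
            state, reward, done, _ = env.step(action.item())
            rewards.append(reward)

        # 2) Monte Carlo estimate q_t(s_t, a_t): discounted return from step t
        q_t = torch.tensor(discounted_returns(rewards, gamma))

        # 3) gradient ascent on J(theta), implemented as descent on the negative objective
        loss = -(torch.cat(log_probs) * q_t).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return policy
```

A typical call would look like `reinforce(env, SoftmaxPolicy(state_dim, num_actions))` for an environment with a discrete action space.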

Reference

Zhao Shiyu's course, Mathematical Foundation of Reinforcement Learning.
