Reinforcement Learning with Code 【Chapter 9. Policy Gradient Methods】


Reinforcement Learning with Code

This note records how the author began to learn RL. Both theoretical understanding and code practice are presented. Many materials are referenced, such as ZhaoShiyu's Mathematical Foundation of Reinforcement Learning.

Contents

  • Reinforcement Learning with Code
    • Chapter 9. Policy Gradient Methods
      • 9.1 Basic idea of policy gradient
      • 9.2 Metrics to define optimal policies
      • 9.3 Gradients of the metrics
      • 9.4 Policy gradient by Monte Carlo estimation: REINFORCE
    • Reference

Chapter 9. Policy Gradient Methods

The idea of function approximation can be applied to represent not only state/action values but also policies. Up to now in this book, policies have been represented by tables: the action probabilities of all states are stored in a table $\pi(a|s)$, each entry of which is indexed by a state and an action. In this chapter, we show that policies can be represented by parameterized functions denoted as $\pi(a|s,\theta)$, where $\theta\in\mathbb{R}^m$ is a parameter vector. The function representation is also sometimes written as $\textcolor{blue}{\pi(a,s,\theta)}$, $\textcolor{blue}{\pi_\theta(a|s)}$, or $\textcolor{blue}{\pi_\theta(a,s)}$.

When policies are represented as functions, optimal policies can be found by optimizing certain scalar metrics. Methods of this kind are called policy gradient methods.

9.1 Basic idea of policy gradient

How to define optimal policies? When represented as a table, a policy $\pi$ is defined as optimal if it maximizes every state value. When represented by a function, a policy $\pi$ is fully determined by $\theta$ together with the function structure. The policy is defined as optimal if it maximizes certain scalar metrics, which we introduce in the next section.

How to update policies? When represented as a table, a policy $\pi$ can be updated by directly changing the entries in the table. However, when represented by a parameterized function, a policy $\pi$ can no longer be updated in this way. Instead, it can only be improved by updating the parameter $\theta$: we use gradient-based methods that optimize some scalar metric with respect to $\theta$.
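
To make the parameterization concrete, here is a minimal sketch (my own illustration, not code from the book) of a discrete-action policy $\pi(a|s,\theta)$ implemented as a small softmax network in PyTorch; the state dimension, number of actions, and hidden size are placeholder values.

```python
import torch
import torch.nn as nn

class SoftmaxPolicy(nn.Module):
    """pi(a|s, theta): maps a state to a probability distribution over actions."""
    def __init__(self, state_dim, num_actions, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_actions),
        )

    def forward(self, state):                 # state: (batch, state_dim)
        logits = self.net(state)
        return torch.softmax(logits, dim=-1)  # each row sums to 1

# placeholder dimensions, e.g. a 4-dimensional state and 2 actions
policy = SoftmaxPolicy(state_dim=4, num_actions=2)
probs = policy(torch.zeros(1, 4))             # pi(.|s, theta) for one dummy state
```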

9.2 Metrics to define optimal policies

The first metric is the average state value, or simply the average value. Let

$$v_\pi = [\cdots, v_\pi(s), \cdots]^T \in \mathbb{R}^{|\mathcal{S}|}, \qquad d_\pi = [\cdots, d_\pi(s), \cdots]^T \in \mathbb{R}^{|\mathcal{S}|}$$

be the vector of state values and a probability distribution over states, respectively. Here, $d_\pi(s)\ge 0$ is the weight for state $s$ and satisfies $\sum_s d_\pi(s)=1$. The metric of average value is defined as

$$\begin{aligned} \textcolor{red}{\bar{v}_\pi} & \textcolor{red}{\triangleq d_\pi^T v_\pi} \\ & \textcolor{red}{= \sum_s d_\pi(s)v_\pi(s)} \\ & \textcolor{red}{= \mathbb{E}[v_\pi(S)]} \end{aligned}$$

where $S \sim d_\pi$. As its name suggests, $\bar{v}_\pi$ is simply a weighted average of the state values. The distribution $d_\pi$ can be chosen as the stationary distribution under $\pi$, which satisfies

$$d^T_\pi P_\pi = d^T_\pi$$

where $P_\pi$ is the state transition probability matrix.
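
As a side note (not from the book), when $P_\pi$ is known, the stationary distribution can be computed numerically, for example by power iteration on $d^T_\pi P_\pi = d^T_\pi$. A minimal NumPy sketch; the two-state chain is made up for illustration:

```python
import numpy as np

def stationary_distribution(P_pi, iters=10_000, tol=1e-12):
    """Solve d^T P_pi = d^T by power iteration (P_pi is row-stochastic)."""
    n = P_pi.shape[0]
    d = np.full(n, 1.0 / n)            # start from the uniform distribution
    for _ in range(iters):
        d_next = d @ P_pi              # apply the transition matrix once
        if np.max(np.abs(d_next - d)) < tol:
            break
        d = d_next
    return d / d.sum()

P_pi = np.array([[0.9, 0.1],           # toy transition matrix under some policy pi
                 [0.2, 0.8]])
print(stationary_distribution(P_pi))   # approximately [0.667, 0.333]
```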

The second metric is the average one-step reward, or simply the average reward. Let

$$r_\pi = [\cdots, r_\pi(s),\cdots]^T \in \mathbb{R}^{|\mathcal{S}|}$$

be the vector of one-step immediate rewards. Here

$$r_\pi(s) = \sum_a \pi(a|s)\, r(s,a)$$

is the average of the one-step immediate reward that can be obtained starting from state $s$, and $r(s,a)=\mathbb{E}[R|s,a]=\sum_r r\, p(r|s,a)$ is the average of the one-step immediate reward that can be obtained after taking action $a$ at state $s$. Then the metric is defined as

$$\begin{aligned} \textcolor{red}{\bar{r}_\pi} & \textcolor{red}{\triangleq d_\pi^T r_\pi} \\ & \textcolor{red}{= \sum_s d_\pi(s)\sum_a \pi(a|s) \sum_r r\, p(r|s,a)} \\ & \textcolor{red}{= \sum_s d_\pi(s)\sum_a \pi(a|s)\, r(s,a)} \\ & \textcolor{red}{= \sum_s d_\pi(s)\, r_\pi(s)} \\ & \textcolor{red}{= \mathbb{E}[r_\pi(S)]} \end{aligned}$$

where $S\sim d_\pi$. As its name suggests, $\bar{r}_\pi$ is simply a weighted average of the one-step immediate rewards.

The third metric is the state value of a specific starting state, $v_\pi(s_0)$. For some tasks, the agent can only start from a specific state $s_0$. In this case, we only care about the long-term return starting from $s_0$. This metric can also be viewed as a weighted average of the state values:

$$\textcolor{red}{v_\pi(s_0) = \sum_{s\in\mathcal{S}} d_0(s)\, v_\pi(s)}$$

where $d_0(s_0)=1$ and $d_0(s)=0$ for $s\ne s_0$.

We aim to search for the value of the parameter $\theta$ that maximizes these metrics.

9.3 Gradients of the metrics

Theorem 9.1 (Policy gradient theorem). The gradient of the average-reward metric $\bar{r}_\pi$ is

$$\textcolor{blue}{\nabla_\theta \bar{r}_\pi(\theta) \simeq \sum_s d_\pi(s)\sum_a \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a)}$$

where $\nabla_\theta \pi$ is the gradient of $\pi$ with respect to $\theta$. Here $\simeq$ refers to either strict or approximate equality. In particular, it is a strict equality in the undiscounted case where $\gamma=1$ and an approximate equality in the discounted case where $0<\gamma<1$. The approximation in the discounted case is more accurate when $\gamma$ is closer to $1$. Moreover, the equation has a more compact and useful form expressed in terms of an expectation:

$$\textcolor{red}{\nabla_\theta \bar{r}_\pi(\theta) \simeq \mathbb{E}[\nabla_\theta \ln \pi(A|S,\theta)\, q_\pi(S,A)]}$$

where $\ln$ is the natural logarithm, $S\sim d_\pi$, and $A\sim \pi(\cdot|S,\theta)$.

Why are the two equations above equivalent? Here is the derivation.

$$\begin{aligned} \nabla_\theta \bar{r}_\pi(\theta) & \simeq \sum_s d_\pi(s)\sum_a \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a) \\ & = \mathbb{E}\Big[ \sum_a \nabla_\theta \pi(a|S,\theta)\, q_\pi(S,a) \Big] \end{aligned}$$

where $S \sim d_\pi$. Furthermore, consider the function $\ln\pi$, where $\ln$ is the natural logarithm. We have

$$\begin{aligned} \nabla_\theta \ln \pi (a|s,\theta) & = \frac{\nabla_\theta \pi(a|s,\theta)}{\pi(a|s,\theta)} \\ \Longrightarrow\quad \nabla_\theta \pi(a|s,\theta) &= \pi(a|s,\theta)\, \nabla_\theta \ln \pi (a|s,\theta) \end{aligned}$$

Substituting this back into the expectation gives

$$\begin{aligned} \nabla_\theta \bar{r}_\pi(\theta) & = \mathbb{E}\Big[ \sum_a \nabla_\theta \pi(a|S,\theta)\, q_\pi(S,a) \Big] \\ & = \mathbb{E}\Big[ \sum_a \pi(a|S,\theta)\, \nabla_\theta \ln \pi (a|S,\theta)\, q_\pi(S,a) \Big] \\ & = \mathbb{E}\big[ \nabla_\theta \ln \pi (A|S,\theta)\, q_\pi(S,A) \big] \end{aligned}$$

where $A \sim \pi(\cdot|S,\theta)$.
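
In code, $\nabla_\theta \ln\pi(a|s,\theta)$ is exactly what automatic differentiation returns when we backpropagate through the log-probability of a sampled action. A small PyTorch sketch, reusing the hypothetical `SoftmaxPolicy` instance from Section 9.1 (the all-zero state is just a placeholder):

```python
import torch

state = torch.zeros(1, 4)                            # dummy state, matching state_dim=4
probs = policy(state)                                # pi(.|S, theta)
dist = torch.distributions.Categorical(probs=probs)
action = dist.sample()                               # A ~ pi(.|S, theta)

log_prob = dist.log_prob(action)                     # ln pi(A|S, theta)
log_prob.sum().backward()                            # p.grad now holds grad_theta ln pi(A|S, theta)
grad_log_pi = [p.grad for p in policy.parameters()]
```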

Next we show that the two metrics, the average one-step reward $\bar{r}_\pi$ and the average state value $\bar{v}_\pi$, are equivalent. When the discount rate $\gamma\in[0,1)$ is given,

$$\textcolor{blue}{\bar{r}_\pi = (1-\gamma)\bar{v}_\pi}$$

Proof: note that $\bar{v}_\pi(\theta)=d^T_\pi v_\pi$ and $\bar{r}_\pi=d^T_\pi r_\pi$, where $v_\pi$ and $r_\pi$ satisfy the Bellman equation $v_\pi=r_\pi + \gamma P_\pi v_\pi$. Left-multiplying both sides of the Bellman equation by $d_\pi^T$ gives

$$\bar{v}_\pi = \bar{r}_\pi + \gamma d^T_\pi P_\pi v_\pi = \bar{r}_\pi + \gamma d^T_\pi v_\pi = \bar{r}_\pi + \gamma \bar{v}_\pi$$

which implies $\bar{r}_\pi = (1-\gamma)\bar{v}_\pi$.
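
The identity is also easy to check numerically. A short NumPy sketch with a randomly generated $P_\pi$ and $r_\pi$ (purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 5, 0.9

# random row-stochastic transition matrix and reward vector under some policy pi
P_pi = rng.random((n, n)); P_pi /= P_pi.sum(axis=1, keepdims=True)
r_pi = rng.random(n)

v_pi = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)  # Bellman: v = r + gamma P v

# stationary distribution: left eigenvector of P_pi for eigenvalue 1
w, V = np.linalg.eig(P_pi.T)
d_pi = np.real(V[:, np.argmin(np.abs(w - 1))])
d_pi /= d_pi.sum()

print(np.isclose(d_pi @ r_pi, (1 - gamma) * (d_pi @ v_pi)))  # True
```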

Theorem 9.2 (Gradient of $v_\pi(s_0)$ in the discounted case). In the discounted case where $\gamma \in [0,1)$, the gradient of $v_\pi(s_0)$ is

$$\nabla_\theta v_\pi(s_0) = \mathbb{E}[\nabla_\theta \ln \pi(A|S, \theta)\, q_\pi(S,A)]$$

where $S \sim \rho_\pi$ and $A \sim \pi(\cdot|S,\theta)$. Here, the state distribution $\rho_\pi$ is

$$\rho_\pi(s) = \Pr_\pi (s \mid s_0) = \sum_{k=0}^{\infty} \gamma^k \Pr (s_0\to s, k, \pi) = \big[(I_n - \gamma P_\pi)^{-1}\big]_{s_0,s}$$

which is the discounted total probability of transitioning from $s_0$ to $s$ under policy $\pi$.

Theorem 9.3 (Gradients of $\bar{v}_\pi$ and $\bar{r}_\pi$ in the discounted case). In the discounted case where $\gamma \in [0,1)$, the gradients of $\bar{v}_\pi$ and $\bar{r}_\pi$ are, respectively,

$$\begin{aligned} \nabla_\theta \bar{v}_\pi & \approx \frac{1}{1-\gamma} \sum_s d_\pi(s) \sum_a \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a) \\ \nabla_\theta \bar{r}_\pi & \approx \sum_s d_\pi(s) \sum_a \nabla_\theta \pi(a|s,\theta)\, q_\pi(s,a) \end{aligned}$$

where the approximations are more accurate when $\gamma$ is closer to $1$.

9.4 Policy gradient by Monte Carlo estimation: REINFORCE

Consider $J(\theta) = \bar{r}_\pi(\theta)$ or $J(\theta) = v_\pi(s_0)$. The gradient-ascent algorithm for maximizing $J(\theta)$ is

$$\begin{aligned} \theta_{t+1} & = \theta_t + \alpha \nabla_\theta J(\theta) \\ & = \theta_t + \alpha\, \mathbb{E}[\nabla_\theta \ln\pi(A|S,\theta_t)\, q_\pi(S,A)] \end{aligned}$$

where $\alpha>0$ is a constant learning rate. Since the expectation on the right-hand side is unknown, we can replace it with a sample (the idea of stochastic gradient ascent). Then we have

$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln\pi(a_t|s_t,\theta_t)\, q_\pi(s_t,a_t)$$

However, this update cannot be implemented directly because the true action value $q_\pi(s_t,a_t)$ is unknown. Hence, we use an estimate $q_t(s_t,a_t)$ in its place:

$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta \ln\pi(a_t|s_t,\theta_t)\, q_t(s_t,a_t)$$

If $q_\pi(s_t,a_t)$ is approximated by Monte Carlo estimation,

$$\begin{aligned} q_\pi(s_t,a_t) & \triangleq \mathbb{E}[G_t \mid S_t=s_t, A_t=a_t] \\ & \textcolor{blue}{\approx \frac{1}{n} \sum_{i=1}^n g^{(i)}(s_t,a_t)} \end{aligned}$$

With stochastic approximation, we do not need to collect $n$ episodes starting from $(s_t,a_t)$ to approximate $q_\pi(s_t,a_t)$; a single discounted return starting from $(s_t,a_t)$ suffices:

$$q_\pi(s_t,a_t) \approx q_t(s_t,a_t) = \sum_{k=t+1}^T \gamma^{k-t-1}r_k$$
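
In code, these returns can be computed with a single backward pass over the rewards of one episode. A small helper sketch (my own; it assumes `rewards[t]` stores $r_{t+1}$, the reward received after taking $a_t$ in $s_t$):

```python
def discounted_returns(rewards, gamma):
    """q_t(s_t, a_t) = sum_{k=t+1}^{T} gamma^{k-t-1} r_k, computed backwards in time."""
    G, returns = 0.0, []
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return list(reversed(returns))

print(discounted_returns([1.0, 0.0, 2.0], gamma=0.9))  # [2.62, 1.8, 2.0]
```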

The algorithm is called REINFORCE.

Pseudocode: (figure omitted)
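
Putting the pieces together, here is a minimal PyTorch sketch of REINFORCE. It reuses the `SoftmaxPolicy` class and the `discounted_returns` helper sketched above, assumes a classic Gym-style environment (`env.reset()` returns a state, `env.step(a)` returns `(next_state, reward, done, info)`), and for simplicity sums the per-step updates of one episode into a single gradient step. It is an illustration of the algorithm, not the book's reference implementation.

```python
import torch

def reinforce(env, policy, episodes=1000, gamma=0.99, alpha=1e-2):
    optimizer = torch.optim.SGD(policy.parameters(), lr=alpha)
    for _ in range(episodes):
        # 1) generate one episode {(s_t, a_t, r_{t+1})} following pi(.|., theta_t)
        log_probs, rewards = [], []
        state, done = env.reset(), False
        while not done:
            probs = policy(torch.as_tensor(state, dtype=torch.float32).unsqueeze(0))
            dist = torch.distributions.Categorical(probs=probs)
            action = dist.sample()
            log_probs.append(dist.log_prob(action))     # ln pi(a_t|s_t, theta)
            state, reward, done, _ = env.step(action.item())
            rewards.append(reward)

        # 2) Monte Carlo estimate q_t(s_t, a_t): discounted return from step t
        q_t = torch.tensor(discounted_returns(rewards, gamma))

        # 3) gradient ascent on J(theta), implemented as descent on the negative objective
        loss = -(torch.cat(log_probs) * q_t).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return policy
```

A typical call would look like `reinforce(env, SoftmaxPolicy(state_dim, num_actions))` for an environment with a discrete action space.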

Reference

Zhao Shiyu's course, Mathematical Foundation of Reinforcement Learning.
