Reinforcement Learning with Code 【Code 1. Tabular Q-learning】


This note records how the author began to learn RL. Both theoretical understanding and code practice are presented. Many materials are referenced, such as Zhao Shiyu's Mathematical Foundation of Reinforcement Learning.
The code follows Mofan's reinforcement learning course.

Table of Contents

  • Reinforcement Learning with Code 【Code 1. Tabular Q-learning】
    • 1.1 Problem and result
    • 1.2 Environment
    • 1.3 Tabular Q-learning Algorithm
    • 1.4 Run the main script
    • Reference

1.1 Problem and result

Consider the problem in which a little mouse (denoted by the red block) wants to reach the cheese (denoted by the yellow circle) while avoiding the traps (denoted by the black blocks), as the figure shows.

[Figure: the 4×4 maze environment with the red agent, black traps, and yellow cheese]

This chapter aims to implement the tabular Q-learning algorithm to solve this problem.
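
For orientation, the object placement built by the environment code in Section 1.2 forms the following 4×4 grid, indexed as (column, row): the mouse starts at (1,1), the traps sit at (3,2) and (2,3), and the cheese is at (3,3).

+---+---+---+---+
| M |   |   |   |   row 1    M: mouse start (1,1)
+---+---+---+---+
|   |   | T |   |   row 2    T: trap (3,2)
+---+---+---+---+
|   | T | C |   |   row 3    T: trap (2,3), C: cheese (3,3)
+---+---+---+---+
|   |   |   |   |   row 4
+---+---+---+---+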

1.2 Environment

We use Python's tkinter package to build the environment that the agent interacts with.

import numpy as np
import time
import sys
import tkinter as tk
# if sys.version_info.major == 2:  # check whether the Python version is Python 2
#     import Tkinter as tk
# else:
#     import tkinter as tk

UNIT = 40   # pixels
MAZE_H = 4  # grid height
MAZE_W = 4  # grid width

class Maze(tk.Tk, object):
    def __init__(self):
        super(Maze, self).__init__()
        # Action Space
        self.action_space = ['up', 'down', 'right', 'left']  # action space
        self.n_actions = len(self.action_space)
        # build the GUI
        self.title('Maze env')
        self.geometry('{0}x{1}'.format(MAZE_W * UNIT, MAZE_H * UNIT))   # window size "width x height"
        self._build_maze()

    def _build_maze(self):
        self.canvas = tk.Canvas(self, bg='white',
                                height=MAZE_H * UNIT,
                                width=MAZE_W * UNIT)     # create the background canvas
        # create grids
        for c in range(UNIT, MAZE_W * UNIT, UNIT):  # draw column separators
            x0, y0, x1, y1 = c, 0, c, MAZE_H * UNIT
            self.canvas.create_line(x0, y0, x1, y1)
        for r in range(UNIT, MAZE_H * UNIT, UNIT):  # draw row separators
            x0, y0, x1, y1 = 0, r, MAZE_W * UNIT, r
            self.canvas.create_line(x0, y0, x1, y1)
        # create origin: the center of the first grid
        origin = np.array([UNIT/2, UNIT/2])
        # hell (trap)
        hell1_center = origin + np.array([UNIT * 2, UNIT])
        self.hell1 = self.canvas.create_rectangle(
            hell1_center[0] - (UNIT/2 - 5), hell1_center[1] - (UNIT/2 - 5),
            hell1_center[0] + (UNIT/2 - 5), hell1_center[1] + (UNIT/2 - 5),
            fill='black')
        # hell (trap)
        hell2_center = origin + np.array([UNIT, UNIT * 2])
        self.hell2 = self.canvas.create_rectangle(
            hell2_center[0] - (UNIT/2 - 5), hell2_center[1] - (UNIT/2 - 5),
            hell2_center[0] + (UNIT/2 - 5), hell2_center[1] + (UNIT/2 - 5),
            fill='black')
        # create oval: draw the terminal circle (the cheese)
        oval_center = origin + np.array([UNIT*2, UNIT*2])
        self.oval = self.canvas.create_oval(
            oval_center[0] - (UNIT/2 - 5), oval_center[1] - (UNIT/2 - 5),
            oval_center[0] + (UNIT/2 - 5), oval_center[1] + (UNIT/2 - 5),
            fill='yellow')
        # create red rect: draw the agent as a red block, initially in the top-left grid
        self.rect = self.canvas.create_rectangle(
            origin[0] - (UNIT/2 - 5), origin[1] - (UNIT/2 - 5),
            origin[0] + (UNIT/2 - 5), origin[1] + (UNIT/2 - 5),
            fill='red')
        # pack all: display the canvas
        self.canvas.pack()

    def get_state(self, rect):
        # convert the coordinate observation to a state tuple
        # use the uniformed center as the state, such as
        # |(1,1)|(2,1)|(3,1)|...
        # |(1,2)|(2,2)|(3,2)|...
        # |(1,3)|(2,3)|(3,3)|...
        # |....
        x0, y0, x1, y1 = self.canvas.coords(rect)
        x_center = (x0 + x1) / 2
        y_center = (y0 + y1) / 2
        state = ((x_center - (UNIT/2)) / UNIT + 1, (y_center - (UNIT/2)) / UNIT + 1)
        return state

    def reset(self):
        self.update()
        self.after(500)  # delay 500ms
        self.canvas.delete(self.rect)   # delete the original rectangle
        origin = np.array([UNIT/2, UNIT/2])
        self.rect = self.canvas.create_rectangle(
            origin[0] - (UNIT/2 - 5), origin[1] - (UNIT/2 - 5),
            origin[0] + (UNIT/2 - 5), origin[1] + (UNIT/2 - 5),
            fill='red')
        # return observation
        return self.get_state(self.rect)

    def step(self, action):
        # one interaction between the agent and the environment
        s = self.get_state(self.rect)   # get the agent's grid state
        base_action = np.array([0, 0])
        if action == self.action_space[0]:   # up
            if s[1] > 1:
                base_action[1] -= UNIT
        elif action == self.action_space[1]:   # down
            if s[1] < MAZE_H:
                base_action[1] += UNIT
        elif action == self.action_space[2]:   # right
            if s[0] < MAZE_W:
                base_action[0] += UNIT
        elif action == self.action_space[3]:   # left
            if s[0] > 1:
                base_action[0] -= UNIT
        self.canvas.move(self.rect, base_action[0], base_action[1])  # move agent
        s_ = self.get_state(self.rect)  # next state
        # reward function
        if s_ == self.get_state(self.oval):     # reach the terminal (cheese)
            reward = 1
            done = True
            s_ = 'terminal'
        elif s_ in [self.get_state(self.hell1), self.get_state(self.hell2)]:  # reach a trap
            reward = -1
            done = True
            s_ = 'terminal'
        else:
            reward = 0
            done = False
        return s_, reward, done

    def render(self):  # display function
        self.after(300)  # delay 300ms
        self.update()    # update the window

if __name__ == '__main__':
    def test():
        for t in range(10):
            s = env.reset()
            print(s)
            while True:
                env.render()
                a = 'right'
                s, r, done = env.step(a)
                print(s)
                if done:
                    break
    env = Maze()
    env.after(100, test)      # call the function test after a 100ms delay
    env.mainloop()

An important part of this environment is the design of the reward function, which is as follows:

$$
\text{reward} = \left\{
\begin{aligned}
& 1, && \text{if reach the cheese} \\
& -1, && \text{if reach the trap} \\
& 0, && \text{otherwise}
\end{aligned}
\right.
$$
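
As a minimal sketch, the same piecewise reward can be written on its own as below; goal_state and trap_states are hypothetical stand-ins for the state comparisons that Maze.step performs in the environment code above.

def reward_fn(s_, goal_state, trap_states):
    # +1 for the cheese, -1 for a trap, 0 for every other transition
    if s_ == goal_state:
        return 1, True        # episode ends
    if s_ in trap_states:
        return -1, True       # episode ends
    return 0, False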

We need to explain some functions of the class Maze.

  • First, the function _build_maze draws the initial maze layout: the grid lines, the two traps, the cheese, and the agent. Each block is identified by the canvas coordinates of its rectangle.
  • Second, the function get_state converts the canvas coordinates of a block into a numerical grid representation such as $(1,1), (1,2), \cdots$ (see the sketch after this list).
  • Third, the function reset resets the environment, which means placing the mouse back in the original grid.
  • Then, the function step lets the agent interact with the environment for one step and returns the next state, the reward, and the done flag.
  • Finally, the function render updates the window.
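
To make the coordinate-to-grid conversion in get_state concrete, here is a standalone sketch of the same arithmetic (the values mirror the environment code above; the helper name coords_to_state is only for illustration).

UNIT = 40  # pixels per grid cell, as in the environment code

def coords_to_state(x0, y0, x1, y1):
    # take the rectangle's center and map it to 1-based (column, row) grid indices
    x_center = (x0 + x1) / 2
    y_center = (y0 + y1) / 2
    return ((x_center - UNIT/2) / UNIT + 1,
            (y_center - UNIT/2) / UNIT + 1)

# The agent's initial rectangle spans (5, 5) to (35, 35); its center (20, 20)
# maps to grid state (1.0, 1.0), i.e. the top-left cell.
print(coords_to_state(5, 5, 35, 35))   # (1.0, 1.0)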

1.3 Tabular Q-learning Algorithm

import numpy as np
import pandas as pd

class QLearningTable():
    def __init__(self, actions, learning_rate=0.01, reward_decay=0.9, e_greedy=0.9):
        self.actions = actions   # action list
        self.lr = learning_rate
        self.gamma = reward_decay
        self.epsilon = e_greedy  # epsilon-greedy update policy
        self.q_table = pd.DataFrame(columns=self.actions, dtype=np.float64)

    def check_state_exist(self, state):
        if state not in self.q_table.index:
            # append new state to the Q-table, using the coordinate as the observation
            # self.q_table = self.q_table.append(       # DataFrame.append is invalid in recent pandas
            #     pd.Series(
            #         [0]*len(self.actions),
            #         index=self.q_table.columns,
            #         name=state,
            #     )
            # )
            self.q_table = pd.concat([
                self.q_table,
                pd.DataFrame(
                    data=np.zeros((1, len(self.actions))),
                    columns=self.q_table.columns,
                    index=[state])
            ])

    def choose_action(self, observation):
        self.check_state_exist(observation)
        # action selection: epsilon-greedy algorithm
        if np.random.uniform() < self.epsilon:
            state_action = self.q_table.loc[observation, :]
            # some actions may have the same value; randomly choose one among them
            # state_action == np.max(state_action) generates a boolean mask
            # choose the best action
            action = np.random.choice(state_action[state_action == np.max(state_action)].index)
        else:
            # choose a random action
            action = np.random.choice(self.actions)
        return action

    def learn(self, s, a, r, s_):
        self.check_state_exist(s_)
        q_predict = self.q_table.loc[s, a]
        if s_ != 'terminal':
            q_target = r + self.gamma * self.q_table.loc[s_, :].max()  # next state is not terminal
        else:
            q_target = r  # next state is terminal
        self.q_table.loc[s, a] += self.lr * (q_target - q_predict)  # update

We store the Q-table as a pandas DataFrame. The explanations of the functions are as follows.

  • First, the function check_state_exist checks whether a state already exists in the Q-table; if not, we append it. This is because a state is only added to the Q-table once it has been visited.
  • Second, the function choose_action follows the $\epsilon$-greedy algorithm:

$$
\pi(a|s) = \left\{
\begin{aligned}
& 1 - \frac{\epsilon}{|\mathcal{A}(s)|}\big(|\mathcal{A}(s)|-1\big), && \text{for the greedy action} \\
& \frac{\epsilon}{|\mathcal{A}(s)|}, && \text{for the other } |\mathcal{A}(s)|-1 \text{ actions}
\end{aligned}
\right.
$$
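
As a concrete check: with $|\mathcal{A}(s)| = 4$ actions and an exploration rate of $\epsilon = 0.1$, the greedy action is chosen with probability $1 - \frac{0.1}{4}\times 3 = 0.925$ and each of the other three actions with probability $\frac{0.1}{4} = 0.025$. Note that in the code, e_greedy = 0.9 is the probability of entering the greedy branch; since the random branch still picks the greedy action with probability $1/4$, this matches the formula with $\epsilon = 1 - 0.9 = 0.1$ (indeed $0.9 + 0.1 \times \frac{1}{4} = 0.925$).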

  • Third, the function learn updates the Q-value as the Q-learning algorithm prescribes:

$$
\text{Q-learning}: \left\{
\begin{aligned}
q_{t+1}(s_t,a_t) &= q_t(s_t,a_t) - \alpha_t(s_t,a_t) \Big[ q_t(s_t,a_t) - \big( r_{t+1} + \gamma \max_{a\in\mathcal{A}(s_{t+1})} q_t(s_{t+1},a) \big) \Big] \\
q_{t+1}(s,a) &= q_t(s,a), \quad \text{for all } (s,a) \ne (s_t,a_t)
\end{aligned}
\right.
$$
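
As a worked example of one update with the default hyperparameters ($\alpha = 0.01$, $\gamma = 0.9$): suppose $q_t(s_t,a_t) = 0$ and the transition reaches the cheese, so $r_{t+1} = 1$ and the next state is terminal (the max term is dropped, exactly as the learn function does). The TD target is $1$, and

$$
q_{t+1}(s_t,a_t) = 0 - 0.01 \times (0 - 1) = 0.01,
$$

which is the same $0.01$ increment that self.lr * (q_target - q_predict) produces in the code.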

1.4 Run the main script

Save the environment code as maze_env_custom.py and the agent as RL_brain.py, then run this main script to put everything together.

from maze_env_custom import Maze
from RL_brain import QLearningTable

MAX_EPISODE = 10

def update():
    for episode in range(MAX_EPISODE):
        # initial observation: the grid-state tuple returned by env.reset(), e.g. (1.0, 1.0)
        observation = env.reset()
        while True:
            # refresh the environment
            env.render()
            # RL chooses an action based on the observation ['up', 'down', 'right', 'left']
            action = RL.choose_action(str(observation))
            # RL takes the action and gets the next observation and reward
            observation_, reward, done = env.step(action)
            # RL learns from this transition
            RL.learn(str(observation), action, reward, str(observation_))
            # swap observation
            observation = observation_
            # break the while loop at the end of this episode
            if done:
                break
        # show q_table
        print(RL.q_table)
        print('\n')
    # end of game
    print('game over')
    env.destroy()

if __name__ == "__main__":
    env = Maze()
    RL = QLearningTable(env.action_space)
    env.after(100, update)
    env.mainloop()

Reference

Zhao Shiyu's course
Mofan's Reinforcement Learning course
