DQN Framework
- The agent interacts with the environment to produce transitions (next state, reward, and termination flag), which are stored in a replay buffer.
- Sample a mini-batch from the buffer, compute the loss, and optimize the model.
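The store-and-sample loop above can be sketched with a minimal replay buffer (pure Python; the capacity and batch size here are illustrative assumptions, not values from the original code):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity FIFO buffer of (state, action, reward, next_state, done) transitions."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform random mini-batch, returned as column tuples
        batch = random.sample(self.buffer, batch_size)
        return tuple(zip(*batch))

    def __len__(self):
        return len(self.buffer)

# store some placeholder transitions, then sample once the buffer is warm
buf = ReplayBuffer(capacity=1000)
for t in range(100):
    buf.push([0.0] * 4, t % 2, 1.0, [0.0] * 4, False)
states, actions, rewards, next_states, dones = buf.sample(32)
print(len(buf), len(states))  # → 100 32
```

Uniform sampling breaks the temporal correlation between consecutive transitions, which is the main reason DQN uses a replay buffer at all.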
Application
1.1 Cartpole Introduction
- action space: left or right
- state space:
  - position of the cart on the track
  - angle of the pole with the vertical
  - cart velocity
  - rate of change of the pole angle
- tips
  - the episode reward cap of CartPole-v0 is 200, while that of CartPole-v1 is 500.
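Given the discrete two-action space above, DQN typically selects actions epsilon-greedily over the network's Q-values. A minimal sketch (the Q-values and epsilon below are illustrative placeholders, not outputs of a trained network):

```python
import random

def epsilon_greedy(q_values, epsilon):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.randrange(len(q_values))  # explore: left or right uniformly
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit

# with epsilon = 0 the choice is always greedy
print(epsilon_greedy([0.2, 0.7], epsilon=0.0))  # → 1 (push right)
```

In practice epsilon is usually annealed from near 1.0 to a small value over training, so early episodes explore and later ones exploit.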
1.2 Code
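As a sketch of the loss computation described in the framework above, the mini-batch TD target in DQN is y = r + gamma * max_a Q(s', a), with the bootstrap term zeroed at terminal states. A NumPy illustration (the Q-values are placeholder numbers, not outputs of a real network):

```python
import numpy as np

def td_targets(rewards, next_q_values, dones, gamma=0.99):
    """y = r + gamma * max_a Q(s', a), zeroing the bootstrap term when done."""
    max_next_q = next_q_values.max(axis=1)        # max over actions per next state
    return rewards + gamma * max_next_q * (1.0 - dones)

rewards = np.array([1.0, 1.0])
next_q = np.array([[0.5, 2.0],   # Q(s', left), Q(s', right)
                   [3.0, 1.0]])
dones = np.array([0.0, 1.0])     # second transition ends the episode
print(td_targets(rewards, next_q, dones))  # first: 1 + 0.99*2 = 2.98; second: 1.0
```

The loss is then the mean squared (or Huber) error between these targets and the Q-values the network predicts for the sampled actions; in the standard algorithm `next_q_values` comes from a periodically updated target network.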
1.3 Result
- episode reward
- mean reward
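The mean-reward curve is typically a running average over recent episode rewards; a minimal sketch (the window size is an illustrative assumption):

```python
from collections import deque

def running_mean(episode_rewards, window=100):
    """Mean of the last `window` episode rewards, computed after each episode."""
    recent = deque(maxlen=window)
    means = []
    for r in episode_rewards:
        recent.append(r)
        means.append(sum(recent) / len(recent))
    return means

print(running_mean([10, 20, 30], window=2))  # → [10.0, 15.0, 25.0]
```

Smoothing this way makes it easier to see whether training is approaching the episode reward cap (200 for CartPole-v0, 500 for CartPole-v1) despite the noisy per-episode rewards.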