Consider a finite MDP $\mathcal{M}=\{\mathcal{S},\mathcal{A},\rho,P,R,\gamma\}$ and an offline dataset $D_N$ consisting of tuples $(s_i, a_i, r_i)$ collected by some behavior policy $\beta$. The expected reward is $r(s,a)=\mathbb{E}_{r|s,a}[r]$, and the $Q$-value of any policy $\pi$ is defined as
$$Q^{\pi}(s, a) := \mathbb{E}_{P,\, \pi \,\mid\, s_0 = s,\, a_0 = a}\left[\sum_{t=0}^{\infty} \gamma^{t} r\left(s_t, a_t\right)\right]$$
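The expectation above can be approximated by rolling out $\pi$ from $(s_0, a_0)$ and averaging discounted returns. Below is a minimal Monte-Carlo sketch, assuming a hypothetical two-state, two-action MDP (the transition table, rewards, and policy are made up for illustration; the text's $\mathcal{M}$ is abstract):

```python
import random

GAMMA = 0.9  # discount factor gamma

# Hypothetical dynamics P[(s, a)] -> list of (next_state, probability)
P = {
    (0, 0): [(0, 0.7), (1, 0.3)],
    (0, 1): [(1, 1.0)],
    (1, 0): [(0, 0.4), (1, 0.6)],
    (1, 1): [(0, 1.0)],
}
# Expected rewards r(s, a) = E[r | s, a]
r = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): 0.5, (1, 1): 0.0}

def pi(s):
    """A fixed stochastic policy pi(a | s): prefers action 1 in state 0."""
    return 1 if (s == 0 and random.random() < 0.8) else random.randint(0, 1)

def step(s, a):
    """Sample the next state s' ~ P(. | s, a)."""
    states = [ns for ns, _ in P[(s, a)]]
    probs = [p for _, p in P[(s, a)]]
    return random.choices(states, weights=probs)[0]

def q_mc(s0, a0, episodes=2000, horizon=200):
    """Monte-Carlo estimate of Q^pi(s0, a0) = E[sum_t gamma^t r(s_t, a_t)],
    truncating the infinite sum at a finite horizon."""
    total = 0.0
    for _ in range(episodes):
        s, a, ret, disc = s0, a0, 0.0, 1.0
        for _ in range(horizon):
            ret += disc * r[(s, a)]
            disc *= GAMMA
            s = step(s, a)
            a = pi(s)
        total += ret
    return total / episodes

print(q_mc(0, 1))
```

Since rewards here lie in $[0, 1]$, any estimate is bounded by $1/(1-\gamma) = 10$, which gives a quick sanity check on the output.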
The goal is to maximize the expected return of the learned policy:

$$J(\pi) := \mathbb{E}_{\rho,\, P,\, \pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r\left(s_t, a_t\right)\right] = \mathbb{E}_{s \sim \rho,\; a \sim \pi \mid s}\left[Q^{\pi}(s, a)\right]$$

Access to the environment is permitted only for tuning a small set of (< 10) hyperparameters.
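In the offline setting, quantities such as $r(s,a)=\mathbb{E}_{r|s,a}[r]$ must be estimated from $D_N$ alone. A minimal sketch, assuming a hypothetical logged dataset of $(s_i, a_i, r_i)$ tuples (the values below are made up for illustration), is the per-pair sample mean:

```python
from collections import defaultdict

# Hypothetical offline dataset D_N of (s_i, a_i, r_i) tuples, as if logged
# by some behavior policy beta.
D_N = [(0, 1, 1.2), (0, 1, 0.8), (1, 0, 0.4), (0, 0, 0.1), (1, 0, 0.6)]

def empirical_reward(dataset):
    """Sample-mean estimate of r(s, a) = E[r | s, a] for each (s, a)
    that appears in the dataset; pairs never visited by beta stay unknown."""
    sums, counts = defaultdict(float), defaultdict(int)
    for s, a, rew in dataset:
        sums[(s, a)] += rew
        counts[(s, a)] += 1
    return {sa: sums[sa] / counts[sa] for sa in sums}

print(empirical_reward(D_N))  # → {(0, 1): 1.0, (1, 0): 0.5, (0, 0): 0.1}
```

Note that the estimate only covers state-action pairs visited by $\beta$ — exactly the coverage limitation that makes the offline objective $J(\pi)$ hard to optimize for actions outside the data.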