Maxton's Blog


RL Study Notes: Actor-Critic Algorithm

Overview of RL Actor-Critic frameworks, covering derivations and updates for QAC, A2C, importance sampling, and DPG.

Actor-critic#

  • Actor is responsible for the policy update, deciding which action to take in a given state.
  • Critic is responsible for policy evaluation or value estimation, used to judge the quality of the policy selected by the Actor.

QAC (Q-Actor-Critic)#

$$
\begin{aligned}
&\text{Critic (value update):} \\
&w_{t+1} = w_t + \alpha_w \left[ r_{t+1} + \gamma q(s_{t+1}, a_{t+1}, w_t) - q(s_t, a_t, w_t) \right] \nabla_w q(s_t, a_t, w_t) \\
&\text{Actor (policy update):} \\
&\theta_{t+1} = \theta_t + \alpha_\theta \nabla_\theta \ln \pi(a_t | s_t, \theta_t)\, q(s_t, a_t, w_{t+1})
\end{aligned}
$$
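As a concrete sketch, the two updates can be written out for a minimal assumed setup: a tabular critic $q(s, a, w) = w[s, a]$ (so $\nabla_w q$ is a one-hot indicator) and a per-state softmax actor. None of these names come from the notes; they are illustration only.

```python
import numpy as np

# Minimal tabular QAC sketch (assumed toy setup): q(s, a, w) = w[s, a],
# so grad_w q is a one-hot indicator; the actor is a softmax over theta[s, :].
n_states, n_actions = 4, 2
gamma, alpha_w, alpha_theta = 0.9, 0.1, 0.01
w = np.zeros((n_states, n_actions))          # critic parameters
theta = np.zeros((n_states, n_actions))      # actor parameters

def policy(s):
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def qac_step(s, a, r, s_next, a_next):
    """One QAC update on a transition (s, a, r, s', a')."""
    # Critic: SARSA-style TD update of q.
    td_target = r + gamma * w[s_next, a_next]
    w[s, a] += alpha_w * (td_target - w[s, a])
    # Actor: for a softmax, grad_theta[s] ln pi(a|s) = one_hot(a) - pi(.|s);
    # note the actor uses the freshly updated critic value w_{t+1}.
    grad_ln_pi = -policy(s)
    grad_ln_pi[a] += 1.0
    theta[s] += alpha_theta * grad_ln_pi * w[s, a]

qac_step(0, 1, 1.0, 1, 0)   # action 1 in state 0 earned reward, so its logit rises
```

Updating the critic before the actor matters: the actor's formula above evaluates $q$ at $w_{t+1}$, not $w_t$.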

A2C (Advantage Actor-Critic)#

  • Introduces a baseline to reduce variance.
$$
\begin{aligned}
\nabla_{\theta} J(\theta) &= \mathbb{E}_{S \sim \eta, A \sim \pi} \left[ \nabla_{\theta} \ln \pi(A|S, \theta_t)\, q_{\pi}(S, A) \right] \\
&= \mathbb{E}_{S \sim \eta, A \sim \pi} \left[ \nabla_{\theta} \ln \pi(A|S, \theta_t) \left( q_{\pi}(S, A) - b(S) \right) \right]
\end{aligned}
$$

For the above equality to hold, the baseline must satisfy:

$$
\mathbb{E}_{S \sim \eta, A \sim \pi} \left[ \nabla_{\theta} \ln \pi(A|S, \theta_t)\, b(S) \right] = 0
$$

The specific derivation is as follows:

$$
\begin{aligned}
\mathbb{E}_{S \sim \eta, A \sim \pi} \left[ \nabla_{\theta} \ln \pi(A|S, \theta_t)\, b(S) \right]
&= \sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \pi(a|s, \theta_t) \nabla_{\theta} \ln \pi(a|s, \theta_t)\, b(s) \\
&= \sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \nabla_{\theta} \pi(a|s, \theta_t)\, b(s) \\
&= \sum_{s \in \mathcal{S}} \eta(s)\, b(s) \nabla_{\theta} \sum_{a \in \mathcal{A}} \pi(a|s, \theta_t) \\
&= \sum_{s \in \mathcal{S}} \eta(s)\, b(s) \nabla_{\theta} 1 = 0
\end{aligned}
$$
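The key step — that the $\pi$-weighted sum of $\nabla_\theta \ln \pi$ vanishes — can be checked numerically. The sketch below assumes a softmax policy over arbitrary logits for a single state; it is an illustration, not part of the derivation.

```python
import numpy as np

# Numerical check that E_{A~pi}[grad ln pi(A|s)] = 0 for a softmax policy,
# which is why any state-dependent baseline b(s) leaves the gradient unbiased.
rng = np.random.default_rng(1)
logits = rng.normal(size=5)              # arbitrary logits for one state
pi = np.exp(logits - logits.max())
pi /= pi.sum()

expectation = np.zeros_like(pi)
for a in range(len(pi)):
    grad_ln_pi = -pi.copy()              # grad_logits ln pi(a|.) = one_hot(a) - pi
    grad_ln_pi[a] += 1.0
    expectation += pi[a] * grad_ln_pi    # sum_a pi(a|s) * grad ln pi(a|s)

assert np.allclose(expectation, 0.0)     # multiplying by any b(s) stays zero
```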
  • The baseline function is primarily used to control variance, so the goal is to find the baseline that minimizes it. The optimal baseline is:

$$
b^*(s) = \frac{\mathbb{E}_{A \sim \pi} \left[ \|\nabla_{\theta} \ln \pi(A|s, \theta_t)\|^2\, q(s, A) \right]}{\mathbb{E}_{A \sim \pi} \left[ \|\nabla_{\theta} \ln \pi(A|s, \theta_t)\|^2 \right]}
$$

  • Since this form is too complex to compute, in practice the weight term $\mathbb{E}_{A \sim \pi} [\|\nabla_{\theta} \ln \pi(A|s, \theta_t)\|^2]$ is usually dropped, giving the approximation:

$$
b(s) = \mathbb{E}_{A \sim \pi} [q(s, A)] = v_{\pi}(s)
$$

When $b(s) = v_\pi(s)$:

  • The gradient ascent algorithm is:
$$
\begin{aligned}
\theta_{t+1} &= \theta_t + \alpha \mathbb{E} \left[ \nabla_\theta \ln \pi(A|S, \theta_t) \left[ q_\pi(S, A) - v_\pi(S) \right] \right] \\
&\doteq \theta_t + \alpha \mathbb{E} \left[ \nabla_\theta \ln \pi(A|S, \theta_t)\, \delta_\pi(S, A) \right]
\end{aligned}
$$

Where:

$$
\delta_\pi(S, A) \doteq q_\pi(S, A) - v_\pi(S)
$$

This term is called the advantage function. By definition, $v_\pi(S)$ is the expected value of $q_\pi(S, A)$ over all actions in state $S$. If the $q$-value of a particular action is greater than this average $v$, that action possesses an "advantage".
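A tiny numerical illustration of this (hypothetical values): under a uniform policy over two actions,

```python
import numpy as np

# Toy illustration of the advantage (hypothetical numbers): under a uniform
# policy, v(s) is the mean of the q-values, so actions above the mean have
# positive advantage and actions below it have negative advantage.
q_s = np.array([1.0, 3.0])            # q(s, a) for two actions
pi_s = np.array([0.5, 0.5])           # uniform policy
v_s = float(np.sum(pi_s * q_s))       # v(s) = E_{A~pi}[q(s, A)] = 2.0
advantage = q_s - v_s                 # [-1.0, 1.0]
assert v_s == 2.0
assert advantage[0] < 0 < advantage[1]
```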

  • The stochastic version of this algorithm is:
$$
\begin{aligned}
\theta_{t+1} &= \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t, \theta_t) \left[ q_t(s_t, a_t) - v_t(s_t) \right] \\
&= \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t, \theta_t)\, \delta_t(s_t, a_t)
\end{aligned}
$$

Furthermore, this algorithm can be rewritten as:

$$
\begin{aligned}
\theta_{t+1} &= \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t, \theta_t)\, \delta_t(s_t, a_t) \\
&= \theta_t + \alpha \frac{\nabla_\theta \pi(a_t|s_t, \theta_t)}{\pi(a_t|s_t, \theta_t)} \delta_t(s_t, a_t) \\
&= \theta_t + \underbrace{\alpha \left( \frac{\delta_t(s_t, a_t)}{\pi(a_t|s_t, \theta_t)} \right)}_{\text{step size}} \nabla_\theta \pi(a_t|s_t, \theta_t)
\end{aligned}
$$

  • The update step size is proportional to the relative value $\delta_t$ rather than the absolute value $q_t$, which is logically more reasonable.
  • It still balances exploration and exploitation well.

Approximation via TD error:

$$
\delta_t = q_t(s_t, a_t) - v_t(s_t) \approx r_{t+1} + \gamma v_t(s_{t+1}) - v_t(s_t)
$$

  • This approximation is reasonable because:

$$
\mathbb{E} \left[ q_\pi(S, A) - v_\pi(S) \mid S = s_t, A = a_t \right] = \mathbb{E} \left[ R_{t+1} + \gamma v_\pi(S_{t+1}) - v_\pi(S_t) \mid S = s_t, A = a_t \right]
$$

  • Advantage: only one neural network is needed to approximate $v_\pi(s)$, eliminating the need to maintain two separate networks for $q_\pi(s, a)$ and $v_\pi(s)$.
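Putting the pieces together, one A2C step with the TD-error advantage can be sketched in tabular form. This is an assumed toy setup (a state-value table instead of a network, softmax actor), not the notes' own implementation.

```python
import numpy as np

# Minimal tabular A2C step (assumed setup): the critic is a state-value table
# v[s], and the TD error delta serves as the advantage estimate -- only one
# value function is maintained, as noted above.
n_states, n_actions = 4, 2
gamma, alpha_w, alpha_theta = 0.9, 0.1, 0.01
v = np.zeros(n_states)                    # critic: state values
theta = np.zeros((n_states, n_actions))   # actor: softmax logits

def a2c_step(s, a, r, s_next):
    delta = r + gamma * v[s_next] - v[s]  # TD error = advantage estimate
    v[s] += alpha_w * delta               # critic update
    pi = np.exp(theta[s] - theta[s].max())
    pi /= pi.sum()
    grad_ln_pi = -pi                      # softmax: grad ln pi = one_hot - pi
    grad_ln_pi[a] += 1.0
    theta[s] += alpha_theta * delta * grad_ln_pi
    return delta

delta = a2c_step(0, 1, 1.0, 1)            # positive delta: reinforce action 1
```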

Importance Sampling and Off-policy#

Importance sampling technique#

Note that:

$$
\mathbb{E}_{X \sim p_0}[X] = \sum_x p_0(x)\, x = \sum_x p_1(x) \underbrace{\frac{p_0(x)}{p_1(x)} x}_{f(x)} = \mathbb{E}_{X \sim p_1}[f(X)]
$$

  • Therefore, we can estimate $\mathbb{E}_{X \sim p_0}[X]$ by estimating $\mathbb{E}_{X \sim p_1}[f(X)]$.
  • The estimation method is as follows:

Let:

$$
\bar{f} \doteq \frac{1}{n} \sum_{i=1}^n f(x_i), \quad \text{where } x_i \sim p_1
$$

Then:

$$
\mathbb{E}_{X \sim p_1}[\bar{f}] = \mathbb{E}_{X \sim p_1}[f(X)], \qquad \text{var}_{X \sim p_1}[\bar{f}] = \frac{1}{n} \text{var}_{X \sim p_1}[f(X)]
$$

So $\bar{f}$ is a good approximation of $\mathbb{E}_{X \sim p_0}[X]$:

$$
\mathbb{E}_{X \sim p_0}[X] \approx \bar{f} = \frac{1}{n} \sum_{i=1}^n f(x_i) = \frac{1}{n} \sum_{i=1}^n \frac{p_0(x_i)}{p_1(x_i)} x_i
$$

  • The ratio $\frac{p_0(x_i)}{p_1(x_i)}$ is called the importance weight.
  • If $p_1(x_i) = p_0(x_i)$, the importance weight is 1 and $\bar{f}$ degenerates to the standard arithmetic mean.
  • If $p_0(x_i) > p_1(x_i)$, sample $x_i$ appears more frequently under $p_0$ than under $p_1$; an importance weight greater than 1 increases that sample's contribution to the estimate.
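A quick Monte Carlo check of the identity above, with two hypothetical discrete distributions chosen for illustration:

```python
import numpy as np

# Monte Carlo check of importance sampling (hypothetical distributions):
# estimate E_{X~p0}[X] from samples drawn from p1, reweighted by p0/p1.
rng = np.random.default_rng(0)
support = np.array([0.0, 1.0, 2.0])
p0 = np.array([0.2, 0.5, 0.3])        # target distribution
p1 = np.array([0.5, 0.3, 0.2])        # sampling (behavior) distribution

n = 100_000
idx = rng.choice(len(support), size=n, p=p1)
weights = p0[idx] / p1[idx]           # importance weights
estimate = float(np.mean(weights * support[idx]))

true_mean = float(np.sum(p0 * support))   # = 1.1
assert abs(estimate - true_mean) < 0.02   # weighted mean recovers E_{p0}[X]
```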

For off-policy learning, where data are generated by a behavior policy $\beta$, the objective function is defined as:

$$
J(\theta) = \sum_{s \in \mathcal{S}} d_\beta(s)\, v_\pi(s) = \mathbb{E}_{S \sim d_\beta} [v_\pi(S)]
$$

Its gradient is:

$$
\nabla_\theta J(\theta) = \mathbb{E}_{S \sim \rho, A \sim \beta} \left[ \frac{\pi(A | S, \theta)}{\beta(A | S)} \nabla_\theta \ln \pi(A | S, \theta)\, q_\pi(S, A) \right]
$$

The off-policy gradient is likewise invariant to a baseline function $b(s)$:

$$
\nabla_{\theta} J(\theta) = \mathbb{E}_{S \sim \rho, A \sim \beta} \left[ \frac{\pi(A|S, \theta)}{\beta(A|S)} \nabla_{\theta} \ln \pi(A|S, \theta) \left( q_{\pi}(S, A) - b(S) \right) \right]
$$

  • To reduce the estimation variance, we again choose the baseline $b(S) = v_{\pi}(S)$, obtaining:

$$
\nabla_{\theta} J(\theta) = \mathbb{E} \left[ \frac{\pi(A|S, \theta)}{\beta(A|S)} \nabla_{\theta} \ln \pi(A|S, \theta) \left( q_{\pi}(S, A) - v_{\pi}(S) \right) \right]
$$

The corresponding stochastic gradient ascent algorithm is:

$$
\theta_{t+1} = \theta_t + \alpha_{\theta} \frac{\pi(a_t|s_t, \theta_t)}{\beta(a_t|s_t)} \nabla_{\theta} \ln \pi(a_t|s_t, \theta_t) \left( q_t(s_t, a_t) - v_t(s_t) \right)
$$

Similar to the on-policy case:

$$
q_t(s_t, a_t) - v_t(s_t) \approx r_{t+1} + \gamma v_t(s_{t+1}) - v_t(s_t) \doteq \delta_t(s_t, a_t)
$$

The final form of the algorithm becomes:

$$
\theta_{t+1} = \theta_t + \alpha_{\theta} \frac{\pi(a_t|s_t, \theta_t)}{\beta(a_t|s_t)} \nabla_{\theta} \ln \pi(a_t|s_t, \theta_t)\, \delta_t(s_t, a_t)
$$

Rewritten to highlight the step size relationship:

$$
\theta_{t+1} = \theta_t + \alpha_{\theta} \left( \frac{\delta_t(s_t, a_t)}{\beta(a_t|s_t)} \right) \nabla_{\theta} \pi(a_t|s_t, \theta_t)
$$
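This final form corresponds to a single off-policy actor update like the sketch below (assumed minimal setup: a softmax actor for one state, with $\delta_t$ supplied by a critic as in the on-policy case).

```python
import numpy as np

# One off-policy actor update (assumed toy setup): the sample (a, delta) comes
# from a behavior policy beta, so the update is reweighted by pi(a|s)/beta(a|s).
alpha_theta = 0.01

def off_policy_actor_step(theta_s, a, delta, beta_prob):
    """Return updated actor logits for one state s."""
    pi = np.exp(theta_s - theta_s.max())
    pi /= pi.sum()
    rho = pi[a] / beta_prob               # importance weight pi(a|s)/beta(a|s)
    grad_ln_pi = -pi                      # softmax: grad ln pi = one_hot - pi
    grad_ln_pi[a] += 1.0
    return theta_s + alpha_theta * rho * delta * grad_ln_pi

theta_s = off_policy_actor_step(np.zeros(2), a=1, delta=1.0, beta_prob=0.5)
```

With zero logits, $\pi(a|s) = 0.5 = \beta(a|s)$, so the importance weight is 1 and the step matches the on-policy A2C update.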

Deterministic Actor-Critic (DPG)#

Evolution of policy representation:

  • Previously, the general policy was written $\pi(a|s, \theta) \in [0, 1]$, which is usually stochastic.
  • Now a deterministic policy is introduced, denoted:

$$
a = \mu(s, \theta) \doteq \mu(s)
$$

  • $\mu$ is a direct mapping from the state space $\mathcal{S}$ to the action space $\mathcal{A}$.
  • In practice, $\mu$ is often parameterized by a neural network whose input is $s$, whose output is directly the action $a$, and whose parameters are $\theta$.

The gradient of the objective function is:

$$
\begin{aligned}
\nabla_{\theta} J(\theta) &= \sum_{s \in \mathcal{S}} \rho_{\mu}(s) \nabla_{\theta} \mu(s) \left( \nabla_{a} q_{\mu}(s, a) \right) \big|_{a=\mu(s)} \\
&= \mathbb{E}_{S \sim \rho_{\mu}} \left[ \nabla_{\theta} \mu(S) \left( \nabla_{a} q_{\mu}(S, a) \right) \big|_{a=\mu(S)} \right]
\end{aligned}
$$

Based on the deterministic policy gradient, the gradient ascent algorithm to maximize J(θ)J(\theta) is:

$$
\theta_{t+1} = \theta_t + \alpha_\theta \mathbb{E}_{S \sim \rho_\mu} \left[ \nabla_\theta \mu(S) \left( \nabla_a q_\mu(S, a) \right) \big|_{a=\mu(S)} \right]
$$

The corresponding single-step stochastic gradient ascent update is:

$$
\theta_{t+1} = \theta_t + \alpha_\theta \nabla_\theta \mu(s_t) \left( \nabla_a q_\mu(s_t, a) \right) \big|_{a=\mu(s_t)}
$$

The overall architecture update logic is as follows:

TD error:

$$
\delta_t = r_{t+1} + \gamma q(s_{t+1}, \mu(s_{t+1}, \theta_t), w_t) - q(s_t, a_t, w_t)
$$

Critic (value update):

$$
w_{t+1} = w_t + \alpha_w \delta_t \nabla_w q(s_t, a_t, w_t)
$$

Actor (policy update):

$$
\theta_{t+1} = \theta_t + \alpha_\theta \nabla_\theta \mu(s_t, \theta_t) \left( \nabla_a q(s_t, a, w_{t+1}) \right) \big|_{a=\mu(s_t)}
$$

  • This is an off-policy implementation: the data-collection (behavior) policy $\beta$ is usually different from the target policy $\mu$ being optimized.
  • To ensure exploration, the behavior policy $\beta$ is typically set to the target policy plus noise, i.e. $\beta = \mu + \text{noise}$.
  • Regarding the function approximation choice for $q(s, a, w)$:
    • Linear function: $q(s, a, w) = \phi^T(s, a) w$, where $\phi(s, a)$ is a manually designed feature vector.
    • Neural network: when deep neural networks approximate both value and policy, this evolves into the Deep Deterministic Policy Gradient (DDPG) algorithm.
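The whole loop above can be sketched in one dimension. This is an assumed toy setup, not the notes' own implementation: a linear policy $\mu(s) = \theta s$ and a hand-crafted linear-in-$w$ critic with features $\phi(s,a) = [a,\, as,\, a^2]$, so both $\nabla_w q$ and $\nabla_a q$ are available in closed form.

```python
import numpy as np

# Toy 1-D DPG step (assumed setup): deterministic policy mu(s) = theta * s,
# critic q(s, a, w) = w0*a + w1*a*s + w2*a^2, i.e. phi(s, a) = [a, a*s, a^2],
# so grad_w q = phi and grad_a q is closed-form.
gamma, alpha_w, alpha_theta, noise_std = 0.9, 0.05, 0.01, 0.1
rng = np.random.default_rng(0)
theta = 0.0                       # actor parameter
w = np.zeros(3)                   # critic parameters

def q(s, a):
    return w[0] * a + w[1] * a * s + w[2] * a * a

def dpg_step(s, r, s_next):
    global theta
    a = theta * s + noise_std * rng.normal()          # behavior: mu + noise
    a_next = theta * s_next                           # target action from mu
    delta = r + gamma * q(s_next, a_next) - q(s, a)   # TD error
    w[:] += alpha_w * delta * np.array([a, a * s, a * a])   # critic: grad_w q
    grad_a_q = w[0] + w[1] * s + 2.0 * w[2] * (theta * s)   # at a = mu(s)
    theta += alpha_theta * s * grad_a_q               # actor: grad_theta mu = s
    return delta

delta = dpg_step(s=1.0, r=1.0, s_next=0.5)   # with w = 0, first TD error is just r
```

Note the off-policy structure: the executed action includes exploration noise, while the TD target and the actor gradient are both evaluated at the noise-free target action $\mu(s)$.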