## Actor-critic

The Actor is responsible for the policy update: it decides which action to take in a given state.
The Critic is responsible for policy evaluation (value estimation): it judges the quality of the policy chosen by the Actor.
### QAC (Q-Actor-Critic)
$$
\begin{aligned}
&\text{Critic (value update):} \\
&w_{t+1} = w_t + \alpha_w \left[ r_{t+1} + \gamma q(s_{t+1}, a_{t+1}, w_t) - q(s_t, a_t, w_t) \right] \nabla_w q(s_t, a_t, w_t) \\
&\text{Actor (policy update):} \\
&\theta_{t+1} = \theta_t + \alpha_\theta \nabla_\theta \ln \pi(a_t | s_t, \theta_t) \, q(s_t, a_t, w_{t+1})
\end{aligned}
$$
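The two updates can be sketched in code. This is a minimal illustrative setup, not a full algorithm: a linear critic $q(s,a,w)=\phi^T(s,a)w$ with one-hot features and a per-state softmax policy for the actor; all names (`phi`, `qac_step`, the tiny 2-state MDP) are assumptions for the example.

```python
import numpy as np

n_states, n_actions = 2, 2

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def phi(s, a):
    """One-hot feature vector for the (s, a) pair."""
    f = np.zeros(n_states * n_actions)
    f[s * n_actions + a] = 1.0
    return f

def qac_step(w, theta, s, a, r, s_next, a_next,
             gamma=0.9, alpha_w=0.1, alpha_theta=0.1):
    # Critic (value update): Sarsa-style TD step on q, using the
    # on-policy next action a_{t+1}.
    td_error = r + gamma * phi(s_next, a_next) @ w - phi(s, a) @ w
    w = w + alpha_w * td_error * phi(s, a)
    # Actor (policy update): grad of ln pi for a softmax policy over
    # per-(s, a) logits theta[s], scaled by q(s_t, a_t, w_{t+1}).
    pi = softmax(theta[s])
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0          # d ln pi(a|s) / d theta[s]
    theta = theta.copy()
    theta[s] = theta[s] + alpha_theta * grad_log_pi * (phi(s, a) @ w)
    return w, theta

w = np.zeros(n_states * n_actions)
theta = np.zeros((n_states, n_actions))
w, theta = qac_step(w, theta, s=0, a=1, r=1.0, s_next=1, a_next=0)
```

Note that the actor uses the freshly updated critic weights $w_{t+1}$, matching the equations above.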
### A2C (Advantage Actor-Critic)

A baseline is introduced to reduce the variance of the gradient estimate.
$$
\begin{aligned}
\nabla_{\theta} J(\theta) &= \mathbb{E}_{S \sim \eta, A \sim \pi} \left[ \nabla_{\theta} \ln \pi(A|S, \theta_t) q_{\pi}(S, A) \right] \\
&= \mathbb{E}_{S \sim \eta, A \sim \pi} \left[ \nabla_{\theta} \ln \pi(A|S, \theta_t) (q_{\pi}(S, A) - b(S)) \right]
\end{aligned}
$$
For the equality to hold, the baseline must satisfy:
$$
\mathbb{E}_{S \sim \eta, A \sim \pi} \left[ \nabla_{\theta} \ln \pi(A|S, \theta_t) b(S) \right] = 0
$$
The specific derivation is as follows:
$$
\begin{aligned}
\mathbb{E}_{S \sim \eta, A \sim \pi} \left[ \nabla_{\theta} \ln \pi(A|S, \theta_t) b(S) \right] &= \sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \pi(a|s, \theta_t) \nabla_{\theta} \ln \pi(a|s, \theta_t) b(s) \\
&= \sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \nabla_{\theta} \pi(a|s, \theta_t) b(s) \\
&= \sum_{s \in \mathcal{S}} \eta(s) b(s) \nabla_{\theta} \sum_{a \in \mathcal{A}} \pi(a|s, \theta_t) \\
&= \sum_{s \in \mathcal{S}} \eta(s) b(s) \nabla_{\theta} 1 = 0
\end{aligned}
$$
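This identity can be checked numerically: the score function $\nabla_\theta \ln \pi$ has zero mean under $\pi$, so any state-dependent baseline drops out of the expectation. A small sketch with a softmax policy over three actions (the logits and baseline value below are arbitrary):

```python
import numpy as np

# Softmax policy over 3 actions in a single state; theta are the logits.
theta = np.array([0.3, -1.2, 2.0])
pi = np.exp(theta - theta.max())
pi /= pi.sum()

b = 7.5  # arbitrary baseline value b(s)

# E_{A~pi}[ grad_theta ln pi(A|s) * b(s) ], computed exactly over actions.
expectation = np.zeros(3)
for a in range(3):
    grad_log_pi = -pi.copy()
    grad_log_pi[a] += 1.0      # gradient of ln pi(a|s) w.r.t. the logits
    expectation += pi[a] * grad_log_pi * b

# The expectation is identically zero, whatever value b(s) takes.
```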
The baseline function is primarily used to control variance, so the goal is to find an optimal baseline function that minimizes variance. The optimal baseline is:
$$
b^*(s) = \frac{\mathbb{E}_{A \sim \pi} [\|\nabla_{\theta} \ln \pi(A|s, \theta_t)\|^2 q(s, A)]}{\mathbb{E}_{A \sim \pi} [\|\nabla_{\theta} \ln \pi(A|s, \theta_t)\|^2]}
$$
Since this form is too complex to compute, in practice the weight term $\mathbb{E}_{A \sim \pi} [\|\nabla_{\theta} \ln \pi(A|s, \theta_t)\|^2]$ is usually dropped, giving the approximation:
$$
b(s) = \mathbb{E}_{A \sim \pi} [q(s, A)] = v_{\pi}(s)
$$
When $b(s) = v_\pi(s)$, the gradient ascent algorithm is:
$$
\begin{aligned}
\theta_{t+1} &= \theta_t + \alpha \mathbb{E} \left[ \nabla_\theta \ln \pi(A|S, \theta_t) [q_\pi(S, A) - v_\pi(S)] \right] \\
&\doteq \theta_t + \alpha \mathbb{E} \left[ \nabla_\theta \ln \pi(A|S, \theta_t) \delta_\pi(S, A) \right]
\end{aligned}
$$
Where:
$$
\delta_\pi(S, A) \doteq q_\pi(S, A) - v_\pi(S)
$$
This term is called the **advantage function**. By the definition of $v_{\pi}(S)$, it is compared against the expected value over all actions in state $S$: if the $q$ value of an action exceeds the average $v$, that action possesses an "advantage".
The stochastic version of this algorithm is:
$$
\begin{aligned}
\theta_{t+1} &= \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t, \theta_t) [q_t(s_t, a_t) - v_t(s_t)] \\
&= \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t, \theta_t) \delta_t(s_t, a_t)
\end{aligned}
$$
Furthermore, this algorithm can be rewritten as:
$$
\begin{aligned}
\theta_{t+1} &= \theta_t + \alpha \nabla_\theta \ln \pi(a_t|s_t, \theta_t) \delta_t(s_t, a_t) \\
&= \theta_t + \alpha \frac{\nabla_\theta \pi(a_t|s_t, \theta_t)}{\pi(a_t|s_t, \theta_t)} \delta_t(s_t, a_t) \\
&= \theta_t + \underbrace{\alpha \left( \frac{\delta_t(s_t, a_t)}{\pi(a_t|s_t, \theta_t)} \right)}_{\text{step size}} \nabla_\theta \pi(a_t|s_t, \theta_t)
\end{aligned}
$$
The update step size is proportional to the relative value $\delta_t$ rather than the absolute value $q_t$, which is more reasonable. It also balances exploitation and exploration: a larger $\delta_t$ pushes harder toward advantageous actions, while a smaller $\pi(a_t|s_t, \theta_t)$ enlarges the step for rarely chosen actions.
Approximation via TD error:
$$
\delta_t = q_t(s_t, a_t) - v_t(s_t) \approx r_{t+1} + \gamma v_t(s_{t+1}) - v_t(s_t)
$$
This approximation is reasonable because:
$$
\mathbb{E}[q_\pi(S, A) - v_\pi(S) \mid S = s_t, A = a_t] = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) - v_\pi(S_t) \mid S = s_t, A = a_t]
$$
**Advantage**: only one neural network is needed to approximate $v_\pi(s)$, eliminating the need to maintain two separate networks for $q_\pi(s, a)$ and $v_\pi(s)$.
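Putting the TD-error approximation together, one A2C step can be sketched as follows. This is an illustrative sketch with a one-hot linear state-value function in place of the neural network and a softmax policy; all names (`a2c_step`, the tiny 3-state setup) are assumptions for the example.

```python
import numpy as np

n_states, n_actions = 3, 2

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def a2c_step(w, theta, s, a, r, s_next,
             gamma=0.9, alpha_w=0.1, alpha_theta=0.1):
    # TD error doubles as the advantage estimate:
    # delta = r + gamma * v(s') - v(s). Only v is approximated.
    delta = r + gamma * w[s_next] - w[s]
    # Critic: update the single value function v(s, w) = w[s].
    w = w.copy()
    w[s] += alpha_w * delta
    # Actor: step along grad ln pi, scaled by the advantage delta.
    pi = softmax(theta[s])
    grad_log_pi = -pi
    grad_log_pi[a] += 1.0
    theta = theta.copy()
    theta[s] += alpha_theta * delta * grad_log_pi
    return w, theta, delta

w = np.zeros(n_states)
theta = np.zeros((n_states, n_actions))
w, theta, delta = a2c_step(w, theta, s=0, a=1, r=1.0, s_next=2)
```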
## Importance Sampling and Off-policy
### Importance sampling technique
Note that:
$$
\mathbb{E}_{X \sim p_0}[X] = \sum_x p_0(x)\,x = \sum_x p_1(x) \underbrace{\frac{p_0(x)}{p_1(x)} x}_{f(x)} = \mathbb{E}_{X \sim p_1}[f(X)]
$$
Therefore, we can estimate $\mathbb{E}_{X \sim p_0}[X]$ by estimating $\mathbb{E}_{X \sim p_1}[f(X)]$.
The estimation method is as follows:
Let:
$$
\bar{f} \doteq \frac{1}{n} \sum_{i=1}^n f(x_i), \quad \text{where } x_i \sim p_1
$$
Then:
$$
\mathbb{E}_{X \sim p_1}[\bar{f}] = \mathbb{E}_{X \sim p_1}[f(X)], \qquad \operatorname{var}_{X \sim p_1}[\bar{f}] = \frac{1}{n} \operatorname{var}_{X \sim p_1}[f(X)]
$$
So $\bar{f}$ is a good approximation of $\mathbb{E}_{X \sim p_0}[X]$:
$$
\mathbb{E}_{X \sim p_0}[X] \approx \bar{f} = \frac{1}{n} \sum_{i=1}^n f(x_i) = \frac{1}{n} \sum_{i=1}^n \frac{p_0(x_i)}{p_1(x_i)} x_i
$$
The ratio $\frac{p_0(x_i)}{p_1(x_i)}$ is called the **importance weight**.
If $p_1(x_i) = p_0(x_i)$, the importance weight is 1 and $\bar{f}$ reduces to the ordinary arithmetic mean.
If $p_0(x_i) > p_1(x_i)$, the sample $x_i$ appears more frequently under $p_0$ than under $p_1$; an importance weight greater than 1 amplifies that sample's contribution to the expectation.
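A quick numeric check of the estimator: estimating the mean of $X \sim p_0$ from samples drawn under $p_1$ (the distributions below are arbitrary examples):

```python
import numpy as np

rng = np.random.default_rng(0)

values = np.array([1.0, 2.0, 3.0])
p0 = np.array([0.7, 0.2, 0.1])    # target distribution
p1 = np.array([1/3, 1/3, 1/3])    # sampling (behavior) distribution

# Draw from p1 and reweight each sample by p0(x)/p1(x).
idx = rng.choice(3, size=200_000, p=p1)
x = values[idx]
weights = p0[idx] / p1[idx]
estimate = np.mean(weights * x)

true_mean = np.sum(p0 * values)   # 0.7*1 + 0.2*2 + 0.1*3 = 1.4
```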
For off-policy learning, where samples are generated by a behavior policy $\beta$, the objective function is defined as:
$$
J(\theta) = \sum_{s \in \mathcal{S}} d_\beta(s) v_\pi(s) = \mathbb{E}_{S \sim d_\beta} [v_\pi(S)]
$$
Its gradient is:
$$
\nabla_\theta J(\theta) = \mathbb{E}_{S \sim \rho, A \sim \beta} \left[ \frac{\pi(A | S, \theta)}{\beta(A | S)} \nabla_\theta \ln \pi(A | S, \theta) q_\pi(S, A) \right]
$$
The off-policy gradient is likewise invariant to a baseline function $b(s)$:
$$
\nabla_{\theta} J(\theta) = \mathbb{E}_{S \sim \rho, A \sim \beta} \left[ \frac{\pi(A|S, \theta)}{\beta(A|S)} \nabla_{\theta} \ln \pi(A|S, \theta) (q_{\pi}(S, A) - b(S)) \right]
$$
To reduce the estimation variance, we again choose the baseline $b(S) = v_{\pi}(S)$, obtaining:
∇ θ J ( θ ) = E [ π ( A ∣ S , θ ) β ( A ∣ S ) ∇ θ ln π ( A ∣ S , θ ) ( q π ( S , A ) − v π ( S ) ) ] \nabla_{\theta} J(\theta) = \mathbb{E} \left[ \frac{\pi(A|S, \theta)}{\beta(A|S)} \nabla_{\theta} \ln \pi(A|S, \theta) (q_{\pi}(S, A) - v_{\pi}(S)) \right] ∇ θ J ( θ ) = E [ β ( A ∣ S ) π ( A ∣ S , θ ) ∇ θ ln π ( A ∣ S , θ ) ( q π ( S , A ) − v π ( S )) ]
The corresponding stochastic gradient ascent algorithm is:
$$
\theta_{t+1} = \theta_t + \alpha_{\theta} \frac{\pi(a_t|s_t, \theta_t)}{\beta(a_t|s_t)} \nabla_{\theta} \ln \pi(a_t|s_t, \theta_t) (q_t(s_t, a_t) - v_t(s_t))
$$
Similar to the on-policy case:
$$
q_t(s_t, a_t) - v_t(s_t) \approx r_{t+1} + \gamma v_t(s_{t+1}) - v_t(s_t) \doteq \delta_t(s_t, a_t)
$$
The final form of the algorithm becomes:
$$
\theta_{t+1} = \theta_t + \alpha_{\theta} \frac{\pi(a_t|s_t, \theta_t)}{\beta(a_t|s_t)} \nabla_{\theta} \ln \pi(a_t|s_t, \theta_t) \delta_t(s_t, a_t)
$$
Rewritten to highlight the step size relationship:
$$
\theta_{t+1} = \theta_t + \alpha_{\theta} \left( \frac{\delta_t(s_t, a_t)}{\beta(a_t|s_t)} \right) \nabla_{\theta} \pi(a_t|s_t, \theta_t)
$$
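Relative to the on-policy A2C actor step, the only change is the extra importance weight $\pi/\beta$. A sketch of the actor update alone, assuming a softmax target policy and a known behavior-policy probability (`off_policy_actor_step` and the toy dimensions are illustrative):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def off_policy_actor_step(theta, s, a, delta, beta_prob, alpha_theta=0.1):
    """One actor update with importance weight pi(a|s,theta)/beta(a|s)."""
    pi = softmax(theta[s])
    rho = pi[a] / beta_prob          # importance weight pi/beta
    grad_log_pi = -pi.copy()
    grad_log_pi[a] += 1.0
    theta = theta.copy()
    theta[s] += alpha_theta * rho * delta * grad_log_pi
    return theta

theta = np.zeros((1, 2))
# Action drawn from a uniform behavior policy beta (beta_prob = 0.5);
# delta would come from the critic's TD error.
theta = off_policy_actor_step(theta, s=0, a=0, delta=2.0, beta_prob=0.5)
```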
## Deterministic Actor-Critic (DPG)
Evolution of policy representation:
Previously, the general policy was written $\pi(a|s, \theta) \in [0, 1]$, which is usually stochastic.
Now, a deterministic policy is introduced, denoted as:
$$
a = \mu(s, \theta) \doteq \mu(s)
$$
$\mu$ is a direct mapping from the state space $\mathcal{S}$ to the action space $\mathcal{A}$.
In practice, $\mu$ is often parameterized by a neural network: the input is $s$, the output is directly the action $a$, and the parameters are $\theta$.
The gradient of the objective function is:
$$
\begin{aligned}
\nabla_{\theta} J(\theta) &= \sum_{s \in \mathcal{S}} \rho_{\mu}(s) \nabla_{\theta} \mu(s) \left.\left(\nabla_{a} q_{\mu}(s, a)\right)\right|_{a=\mu(s)} \\
&= \mathbb{E}_{S \sim \rho_{\mu}}\left[\left.\nabla_{\theta} \mu(S)\left(\nabla_{a} q_{\mu}(S, a)\right)\right|_{a=\mu(S)}\right]
\end{aligned}
$$
Based on the deterministic policy gradient, the gradient ascent algorithm that maximizes $J(\theta)$ is:
$$
\theta_{t+1} = \theta_t + \alpha_\theta \mathbb{E}_{S \sim \rho_\mu} \left[ \left. \nabla_\theta \mu(S) \left( \nabla_a q_\mu(S, a) \right) \right|_{a=\mu(S)} \right]
$$
The corresponding single-step stochastic gradient ascent update is:
$$
\theta_{t+1} = \theta_t + \alpha_\theta \left. \nabla_\theta \mu(s_t) \left( \nabla_a q_\mu(s_t, a) \right) \right|_{a=\mu(s_t)}
$$
The overall architecture update logic is as follows:
**TD error**:

$$
\delta_t = r_{t+1} + \gamma q(s_{t+1}, \mu(s_{t+1}, \theta_t), w_t) - q(s_t, a_t, w_t)
$$

**Critic (value update)**:

$$
w_{t+1} = w_t + \alpha_w \delta_t \nabla_w q(s_t, a_t, w_t)
$$

**Actor (policy update)**:

$$
\theta_{t+1} = \theta_t + \alpha_\theta \left. \nabla_\theta \mu(s_t, \theta_t) \left( \nabla_a q(s_t, a, w_{t+1}) \right) \right|_{a=\mu(s_t)}
$$
This is an off-policy implementation: the data-collection (behavior) policy $\beta$ is usually different from the target policy $\mu$ being optimized. To ensure exploration, $\beta$ is typically the target policy plus noise, i.e., $\beta = \mu + \text{noise}$.
Regarding the function-approximation choice for $q(s, a, w)$:

- **Linear function**: $q(s, a, w) = \phi^T(s, a) w$, where $\phi(s, a)$ is a manually designed feature vector.
- **Neural network**: when deep neural networks are used to approximate both the value and the policy, the method evolves into the Deep Deterministic Policy Gradient (DDPG) algorithm.
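One deterministic actor-critic step can be sketched with scalar linear parameterizations standing in for the networks. Everything here is an illustrative assumption: a 1-D state and action, $\mu(s,\theta)=\theta s$, and $q(s,a,w)=w_0 s + w_1 a + w_2 a s$, chosen so the gradients are easy to write by hand.

```python
import numpy as np

def mu(s, theta):
    return theta * s                 # deterministic policy mu(s, theta)

def q(s, a, w):
    return w[0] * s + w[1] * a + w[2] * a * s

def grad_q_a(s, a, w):
    return w[1] + w[2] * s           # partial derivative dq/da

def dpg_step(w, theta, s, a, r, s_next,
             gamma=0.9, alpha_w=0.05, alpha_theta=0.05):
    # TD error uses the TARGET policy's action mu(s') at the next state,
    # while a_t itself came from the behavior policy (off-policy).
    delta = r + gamma * q(s_next, mu(s_next, theta), w) - q(s, a, w)
    # Critic update on q.
    grad_q_w = np.array([s, a, a * s])
    w = w + alpha_w * delta * grad_q_w
    # Actor update: chain rule grad_theta mu(s) * grad_a q(s, a, w)|_{a=mu(s)}.
    theta = theta + alpha_theta * s * grad_q_a(s, mu(s, theta), w)
    return w, theta

w = np.zeros(3)
theta = 0.0
# Behavior action = mu(s) + exploration noise, as in beta = mu + noise.
s, noise = 1.0, 0.3
a = mu(s, theta) + noise
w, theta = dpg_step(w, theta, s=s, a=a, r=1.0, s_next=0.5)
```

Replacing the linear forms with neural networks (plus target networks and a replay buffer) is what yields DDPG.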