Maxton's Blog


Policy Gradient Methods

In policy gradient methods, the policy is represented as a parameterized function.

$$\pi(a|s, \theta)$$

where $\theta \in \mathbb{R}^m$ is the parameter vector.

  • This function can be a neural network, with the state $s$ as input and the probabilities of taking each action as output, parameterized by $\theta$.
  • When the state space is large, a tabular representation is inefficient in terms of storage and generalization. Function approximation can effectively solve this problem.
  • The function representation is also commonly written as $\pi(a, s, \theta)$, $\pi_\theta(a|s)$, or $\pi_\theta(a, s)$.

Basic Idea

Define an objective function (e.g., $J(\theta)$) to evaluate the performance of the policy.

Use gradient ascent to update the parameters and find the optimal policy:

$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t)$$
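As a sanity check, this update rule can be sketched on a toy one-dimensional objective (a hypothetical $J(\theta) = -(\theta - 3)^2$, not an RL objective; `alpha` is the learning rate):

```python
# Gradient ascent on a toy objective (hypothetical, not an RL objective):
# J(theta) = -(theta - 3)^2, so grad J(theta) = -2 * (theta - 3).

def grad_J(theta):
    return -2.0 * (theta - 3.0)

theta = 0.0
alpha = 0.1   # learning rate
for _ in range(100):
    theta = theta + alpha * grad_J(theta)  # theta_{t+1} = theta_t + alpha * grad J(theta_t)

print(theta)  # approaches the maximizer theta = 3
```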

Objective Functions (Metrics)

1. Weighted Average of State Values

$$\bar{v}_{\pi} = \sum_{s \in \mathcal{S}} d(s) v_{\pi}(s)$$

  • $\bar{v}_{\pi}$ is a weighted average.
  • $d(s) \ge 0$ is the weight of state $s$, which can be understood as the probability distribution of state occurrences.
  • Expressed in expectation form: $\bar{v}_{\pi} = \mathbb{E}[v_{\pi}(S)]$.

Its vector form is:

$$\bar{v}_{\pi} = d^T v_{\pi}$$

where $v_{\pi} \in \mathbb{R}^{|\mathcal{S}|}$ and $d \in \mathbb{R}^{|\mathcal{S}|}$.
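The vector form is just a dot product; a minimal sketch with made-up numbers for a hypothetical 3-state case:

```python
import numpy as np

# v_bar = d^T v_pi for a hypothetical 3-state case (numbers made up).
v_pi = np.array([1.0, 4.0, 7.0])   # state values v_pi(s)
d = np.array([0.5, 0.3, 0.2])      # weights d(s), a probability distribution

v_bar = d @ v_pi                   # weighted average of state values
print(v_bar)                       # 0.5*1 + 0.3*4 + 0.2*7 = 3.1
```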

In episodic tasks, the objective function can be defined as the expected return starting from the initial state:

$$\begin{aligned} J(\theta) &= \mathbb{E} \left[ \sum_{t=0}^{\infty} \gamma^t R_{t+1} \right] \\ &= \sum_{s \in \mathcal{S}} d_0(s) v_\pi(s) \end{aligned}$$

Regarding the setting of the weight $d(s)$:

  • Independent of the policy $\pi$: in episodic tasks, $d$ is usually set to the initial state distribution $d_0$. For example, if all states are considered equally important, $d_0(s) = 1/|\mathcal{S}|$; if only a specific initial state $s_0$ is of interest, $d_0(s_0) = 1$.
  • Dependent on the policy $\pi$: in continuing tasks, $d$ depends on $\pi$, and the stationary distribution $d_\pi$ is typically chosen. The stationary distribution satisfies $d_{\pi}^T P_{\pi} = d_{\pi}^T$, where $P_{\pi}$ is the state transition matrix.
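One simple way to obtain the stationary distribution numerically is power iteration on $d^T \leftarrow d^T P_\pi$; a sketch for a hypothetical 2-state transition matrix:

```python
import numpy as np

# Power iteration for the stationary distribution: d^T <- d^T P_pi.
# P is a hypothetical 2-state transition matrix (rows sum to 1).
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])

d = np.array([1.0, 0.0])           # start from any initial distribution
for _ in range(1000):
    d = d @ P                      # the fixed point satisfies d^T P = d^T

print(d)                           # stationary distribution, here [5/6, 1/6]
```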

2. Average One-Step Reward (Average Reward)

$$\bar{r}_\pi = \sum_{s \in \mathcal{S}} d_\pi(s) r_\pi(s) = \mathbb{E}[r_\pi(S)]$$

where $S \sim d_\pi$. The expected immediate reward in state $s$ is:

$$r_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s) r(s, a)$$

  • The weight $d_\pi$ is the stationary distribution.
  • $\bar{r}_\pi$ is the weighted average of the one-step immediate rewards.
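Putting the two formulas together, the average reward for a small, made-up 2-state, 2-action MDP can be computed as:

```python
import numpy as np

# Average reward for a hypothetical 2-state, 2-action MDP (numbers made up).
pi = np.array([[0.7, 0.3],        # pi(a|s): row = state, column = action
               [0.4, 0.6]])
r = np.array([[1.0, 0.0],         # r(s, a): expected immediate reward
              [0.0, 2.0]])
d_pi = np.array([0.5, 0.5])       # stationary distribution (assumed given)

r_pi = (pi * r).sum(axis=1)       # r_pi(s) = sum_a pi(a|s) r(s, a)
r_bar = d_pi @ r_pi               # r_bar = sum_s d_pi(s) r_pi(s)
print(r_pi, r_bar)                # [0.7, 1.2] and 0.95
```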

The average reward can also be defined as the limit of the long-run average of rewards:

$$\lim_{n \to \infty} \frac{1}{n} \mathbb{E} \left[ \sum_{k=1}^{n} R_{t+k} \mid S_t = s_0 \right] = \sum_{s \in \mathcal{S}} d_\pi(s) r_\pi(s) = \bar{r}_\pi$$

Here, the influence of the initial state $s_0$ is eliminated in the limit, making the two definitions equivalent.

Metrics Comparison:

  • The above metrics all depend on the policy $\pi$, so they are essentially functions of the parameter $\theta$.
  • Intuitively, $\bar{r}_\pi$ focuses more on immediate rewards (myopic), while $\bar{v}_\pi$ cares more about long-term returns.
  • In the discounted case with a discount factor $\gamma$, there is a mathematical relationship between the two: $\bar{r}_\pi = (1 - \gamma) \bar{v}_\pi$.
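This relationship can be checked numerically. With $v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$ and $\bar{v}_\pi$ weighted by the stationary distribution $d_\pi$, the identity holds exactly; a sketch with hypothetical numbers:

```python
import numpy as np

# Check r_bar = (1 - gamma) * v_bar, with v_bar weighted by d_pi.
gamma = 0.9
P = np.array([[0.9, 0.1],
              [0.5, 0.5]])                 # hypothetical P_pi
r_pi = np.array([1.0, 2.0])                # hypothetical r_pi(s)
d = np.array([5 / 6, 1 / 6])               # stationary: d^T P = d^T

v_pi = np.linalg.solve(np.eye(2) - gamma * P, r_pi)  # Bellman: v = r + gamma P v
v_bar = d @ v_pi
r_bar = d @ r_pi
print(r_bar, (1 - gamma) * v_bar)          # the two values coincide
```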

Policy Gradient Theorem (Gradients of the Metrics)

Whether the objective function is $\bar{v}_{\pi}$ or $\bar{r}_{\pi}$, its gradient can be uniformly expressed in the following form (up to a constant of proportionality):

$$\nabla_{\theta} J(\theta) \propto \sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \nabla_{\theta} \pi(a|s, \theta) q_{\pi}(s, a)$$

where $\eta$ is the distribution weight of the states.

Derivation of Gradients and Transformation to Expectation

1. Core Objective: transform the complex "double summation over states and actions" into a mathematical expectation that can be approximated by sampling data.

2. Log-Derivative Trick: by the chain rule, taking the natural logarithm of the policy function and computing the gradient yields:

$$\nabla_{\theta} \ln \pi(a|s, \theta) = \frac{1}{\pi(a|s, \theta)} \nabla_{\theta} \pi(a|s, \theta)$$

Rearranging the terms gives:

$$\nabla_{\theta} \pi(a|s, \theta) = \pi(a|s, \theta) \nabla_{\theta} \ln \pi(a|s, \theta)$$

This step deliberately introduces the probability factor $\pi(a|s, \theta)$, which is the prerequisite for rewriting the formula as an expectation.
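The identity $\nabla_\theta \pi = \pi \nabla_\theta \ln \pi$ can be verified by finite differences for a softmax policy (a hypothetical single-state parameterization where $\theta$ holds one preference per action):

```python
import numpy as np

# Finite-difference check of grad pi = pi * grad ln pi for a softmax policy
# (hypothetical: theta holds one preference per action, single state).
def pi_fn(theta):
    e = np.exp(theta - theta.max())
    return e / e.sum()

theta = np.array([0.2, -1.0, 0.5])
a, eps = 0, 1e-6                           # action whose probability we differentiate
grad_pi = np.zeros_like(theta)
grad_log_pi = np.zeros_like(theta)
for i in range(theta.size):
    step = np.zeros_like(theta)
    step[i] = eps
    grad_pi[i] = (pi_fn(theta + step)[a] - pi_fn(theta - step)[a]) / (2 * eps)
    grad_log_pi[i] = (np.log(pi_fn(theta + step)[a])
                      - np.log(pi_fn(theta - step)[a])) / (2 * eps)

print(np.allclose(grad_pi, pi_fn(theta)[a] * grad_log_pi))  # True
```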

3. Formula Substitution: substituting the above result into the gradient formula:

$$\nabla_{\theta} J(\theta) \propto \sum_{s \in \mathcal{S}} \eta(s) \sum_{a \in \mathcal{A}} \pi(a|s, \theta) \nabla_{\theta} \ln \pi(a|s, \theta) q_{\pi}(s, a)$$

4. Transformation to Mathematical Expectation: the equation above contains a two-layer probability-weighted summation (the outer layer over the state distribution $\eta(s)$, the inner layer over the action probability $\pi$), which is exactly the expectation over the random variables $S \sim \eta$ and $A \sim \pi(\cdot|S, \theta)$:

$$\nabla_{\theta} J(\theta) = \mathbb{E}[\nabla_{\theta} \ln \pi(A|S, \theta) q_{\pi}(S, A)]$$

5. Practical Significance: after this transformation, the algorithm no longer needs to enumerate all states and actions in the environment. The agent only needs to explore according to the current policy; the collected trajectory data $(S, A)$ then naturally follows the required probability distribution, providing the theoretical basis for single-step sampling approximation.

If sampling is used to approximate the gradient, the single-step update direction is:

$$\nabla_{\theta} J \approx \nabla_{\theta} \ln \pi(a|s, \theta) q_{\pi}(s, a)$$
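The equivalence between the double sum and the sampled expectation can be checked on a tiny, made-up 2-state, 2-action example with a per-state softmax policy: averaging $\nabla_\theta \ln \pi(A|S, \theta)\, q_\pi(S, A)$ over samples $(S, A)$ approaches the exact double sum.

```python
import numpy as np

# Double sum vs. sampled expectation on a made-up 2-state, 2-action example.
rng = np.random.default_rng(0)
theta = np.array([[0.5, -0.5],             # preferences h(s, a) = theta[s, a]
                  [0.0, 1.0]])
eta = np.array([0.6, 0.4])                 # state distribution eta(s)
q = np.array([[1.0, 2.0],                  # q_pi(s, a), assumed known here
              [0.5, 3.0]])

def pi_fn(s):
    e = np.exp(theta[s] - theta[s].max())
    return e / e.sum()

def grad_log_pi(s, a):                     # d ln pi(a|s) / d theta for softmax
    g = np.zeros_like(theta)
    g[s] = -pi_fn(s)
    g[s, a] += 1.0
    return g

# Exact: sum_s eta(s) sum_a pi(a|s) grad ln pi(a|s, theta) q(s, a)
exact = sum(eta[s] * pi_fn(s)[a] * q[s, a] * grad_log_pi(s, a)
            for s in range(2) for a in range(2))

# Sampled: draw (S, A) from eta and pi, average grad ln pi(A|S) q(S, A)
n = 50_000
est = np.zeros_like(theta)
for _ in range(n):
    s = rng.choice(2, p=eta)
    a = rng.choice(2, p=pi_fn(s))
    est += q[s, a] * grad_log_pi(s, a)
est /= n

print(np.abs(exact - est).max())           # small: sampling recovers the sum
```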

Parameterization of the Policy Function (Softmax)

To ensure the probability properties $\pi(a|s, \theta) > 0$ and that the probabilities sum to 1, the Softmax function is commonly used to map real-valued preferences into probabilities.

For any vector $x = [x_1, \dots, x_n]^T$:

$$z_i = \frac{e^{x_i}}{\sum_{j=1}^n e^{x_j}}$$

where $z_i \in (0, 1)$ and $\sum_{i=1}^n z_i = 1$.

Applied to the policy function:

$$\pi(a|s, \theta) = \frac{e^{h(s,a,\theta)}}{\sum_{a' \in \mathcal{A}} e^{h(s,a',\theta)}}$$

where $h(s, a, \theta)$ is the action preference function, which can be parameterized by a neural network.
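A common implementation detail (not specific to this chapter): subtracting the maximum preference before exponentiating leaves the probabilities unchanged but avoids overflow. A minimal sketch:

```python
import numpy as np

# Numerically stable softmax over action preferences h(s, a, theta).
def softmax_policy(h):
    z = h - h.max()                # shifting by the max does not change the result
    e = np.exp(z)
    return e / e.sum()

h = np.array([2.0, 1.0, -1.0])     # hypothetical preferences for 3 actions
p = softmax_policy(h)
print(p, p.sum())                  # probabilities in (0, 1) summing to 1
```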

REINFORCE and Policy Optimization Algorithms

Using the true gradient to maximize the objective function:

$$\begin{aligned} \theta_{t+1} &= \theta_t + \alpha \nabla_{\theta} J(\theta_t) \\ &= \theta_t + \alpha \mathbb{E}[\nabla_{\theta} \ln \pi(A|S, \theta_t) q_{\pi}(S, A)] \end{aligned}$$

In practical applications, stochastic gradients are used for the update:

$$\theta_{t+1} = \theta_t + \alpha \nabla_{\theta} \ln \pi(a_t|s_t, \theta_t) q_{\pi}(s_t, a_t)$$

Since the true action-value function $q_\pi$ is unknown, it needs to be approximately estimated:

  • REINFORCE Algorithm: uses the Monte Carlo method, taking the full trajectory return $G_t$ as an unbiased estimate of $q_\pi(s_t, a_t)$ for gradient updates.
  • Actor-Critic Algorithm: combines with Temporal Difference (TD) algorithms to train a value function network that approximately estimates $q_\pi(s_t, a_t)$.
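A minimal REINFORCE sketch, assuming a hypothetical 2-armed bandit (a single-state MDP, so the state is dropped) where the sampled reward plays the role of the return $G_t$:

```python
import numpy as np

# REINFORCE on a hypothetical 2-armed bandit (single state, so s is dropped).
# The sampled reward plays the role of the Monte Carlo return G_t.
rng = np.random.default_rng(1)
theta = np.zeros(2)                        # softmax preferences, one per action
true_mean = np.array([0.2, 1.0])           # arm 1 pays more on average (made up)
alpha = 0.05

def pi_fn():
    e = np.exp(theta - theta.max())
    return e / e.sum()

for _ in range(5000):
    p = pi_fn()
    a = rng.choice(2, p=p)
    G = true_mean[a] + rng.normal(0.0, 0.1)   # sampled return G_t
    grad_log = -p
    grad_log[a] += 1.0                        # grad ln pi(a) for softmax
    theta += alpha * G * grad_log             # stochastic gradient ascent

print(pi_fn())  # probability mass concentrates on the better arm
```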

Combining the reverse expansion of the log-derivative formula $\nabla_{\theta} \ln \pi = \frac{\nabla_{\theta} \pi}{\pi}$, the parameter update rule can be intuitively rewritten as:

$$\begin{aligned} \theta_{t+1} &= \theta_t + \alpha \nabla_{\theta} \ln \pi(a_t | s_t, \theta_t) q_t(s_t, a_t) \\ &= \theta_t + \alpha \underbrace{\left( \frac{q_t(s_t, a_t)}{\pi(a_t | s_t, \theta_t)} \right)}_{\beta_t} \nabla_{\theta} \pi(a_t | s_t, \theta_t) \end{aligned}$$

That is:

$$\theta_{t+1} = \theta_t + \alpha \beta_t \nabla_{\theta} \pi(a_t | s_t, \theta_t)$$

The coefficient $\beta_t$ balances exploitation and exploration: it is proportional to the value estimate $q_t$ (encouraging high-return actions) and inversely proportional to the action probability $\pi$ (giving larger update steps to rarely taken actions, thereby encouraging exploration).
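A quick numeric illustration of $\beta_t$ with hypothetical numbers: for the same value estimate, a rarely chosen action receives a much larger coefficient, hence a bigger update step.

```python
# beta_t = q_t / pi(a_t|s_t): same value estimate, different action probabilities
# (hypothetical numbers).
q_t = 1.0
beta_common = q_t / 0.9   # a frequently chosen action: small coefficient
beta_rare = q_t / 0.1     # a rarely chosen action: much larger coefficient

print(beta_common, beta_rare)
```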

RL Study Notes: Policy Gradient Methods
https://en.maxtonniu.com/blog/rl_chapter09
Author Maxton Niu
Published at February 22, 2026