RL Study Notes: Policy Gradient Methods
Core concepts of RL policy gradient methods: objective functions, the log-derivative trick, theorem derivation, and the REINFORCE algorithm.
Policy Gradient Methods#
In policy gradient methods, the policy is represented as a parameterized function $\pi(a \mid s, \theta)$, where $\theta \in \mathbb{R}^m$ is the parameter vector.
- This function can be a neural network that takes the state as input and outputs the probability of taking each action, parameterized by $\theta$.
- When the state space is large, tabular representation is inefficient in terms of storage and generalization. Function approximation can effectively solve this problem.
- The function is also commonly written as $\pi(a \mid s, \theta)$, $\pi_\theta(a \mid s)$, or $\pi_\theta(a, s)$.
Basic Idea#
Define an objective function (e.g., $J(\theta)$) to evaluate the performance of the policy.
Use gradient ascent to update the parameters and find the optimal policy:

$$\theta_{t+1} = \theta_t + \alpha \nabla_\theta J(\theta_t)$$
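As a toy sketch of the gradient-ascent idea, here is the update applied to a made-up scalar objective $J(\theta) = -(\theta - 3)^2$ (not an RL metric, just a concave function with a known maximizer):

```python
import numpy as np

# Hypothetical concave objective J(theta) = -(theta - 3)^2, maximized at theta = 3.
def grad_J(theta):
    return -2.0 * (theta - 3.0)

theta, alpha = 0.0, 0.1
for _ in range(200):
    theta += alpha * grad_J(theta)  # gradient ASCENT: step along +gradient

print(round(theta, 4))  # converges toward 3.0
```

Note the `+` in the update: because we maximize $J(\theta)$, we ascend the gradient rather than descend it.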
Objective Functions (Metrics)#
1. Weighted Average of State Values#
$$\bar{v}_\pi = \sum_{s \in \mathcal{S}} d(s)\, v_\pi(s)$$

- $\bar{v}_\pi$ is a weighted average of the state values.
- $d(s) \ge 0$ is the weight of state $s$, with $\sum_{s} d(s) = 1$, so it can be understood as a probability distribution over states.
- Expressed in expectation form: $\bar{v}_\pi = \mathbb{E}_{S \sim d}[v_\pi(S)]$.

Its vector form is:

$$\bar{v}_\pi = d^T v_\pi$$

where $d = [\ldots, d(s), \ldots]^T$ and $v_\pi = [\ldots, v_\pi(s), \ldots]^T$.
In episodic tasks, the objective function can be defined as the expected discounted return starting from the initial state:

$$J(\theta) = \mathbb{E}\left[\sum_{t=0}^{\infty} \gamma^t R_{t+1}\right]$$
Regarding the setting of the weight $d$:

- Independent of policy $\pi$: in episodic tasks, $d$ is usually set to the initial state distribution $d_0$. For example, if all states are considered equally important, $d_0(s) = 1/|\mathcal{S}|$; if only a specific initial state $s_0$ is of interest, $d_0(s_0) = 1$ and $d_0(s) = 0$ otherwise.
- Dependent on policy $\pi$: in continuing tasks, the stationary distribution $d_\pi$ is typically chosen. It satisfies $d_\pi^T P_\pi = d_\pi^T$, where $P_\pi$ is the state transition probability matrix under $\pi$.
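The stationary distribution can be computed by power iteration: repeatedly apply $d^T \leftarrow d^T P_\pi$ until the distribution stops changing. A minimal sketch with a made-up 3-state transition matrix:

```python
import numpy as np

# Toy 3-state transition matrix P_pi under a fixed policy (rows sum to 1); values are made up.
P = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3],
              [0.2, 0.2, 0.6]])

# Power iteration: repeatedly push the distribution through P until it stops changing.
d = np.ones(3) / 3
for _ in range(1000):
    d = d @ P

print(np.allclose(d @ P, d))   # d satisfies the stationarity condition d^T P = d^T
print(round(d.sum(), 6))       # still a probability distribution
```

Equivalently, $d_\pi$ is the left eigenvector of $P_\pi$ with eigenvalue 1, normalized to sum to 1.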
2. Average One-Step Reward (Average Reward)#
$$\bar{r}_\pi = \sum_{s \in \mathcal{S}} d_\pi(s)\, r_\pi(s) = \mathbb{E}_{S \sim d_\pi}[r_\pi(S)]$$

where $S \sim d_\pi$. The expected immediate reward in state $s$ is:

$$r_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a \mid s, \theta)\, r(s, a), \quad r(s, a) = \mathbb{E}[R \mid s, a]$$

- The weight $d_\pi$ is the stationary distribution.
- $\bar{r}_\pi$ is the weighted average of the one-step immediate rewards.
The average reward can also be defined as the limit form of long-term rewards:

$$\bar{r}_\pi = \lim_{n \to \infty} \frac{1}{n}\, \mathbb{E}\left[\sum_{k=1}^{n} R_{t+k}\right]$$

Here, the influence of the starting state is washed out in the limit, making the two definitions equivalent.
Metrics Comparison:
- The above metrics all depend on the policy , so they are essentially functions of the parameter .
- Intuitively, focuses more on immediate rewards (myopic), while cares more about long-term returns.
- In the discounted case with a discount factor , there is a mathematical relationship between the two: .
Policy Gradient Theorem (Gradients of the Metrics)#
Whether the objective function is or , its gradient can be uniformly expressed in the following form (proportional to):
Where is the distribution weight of the states.
Derivation of Gradients and Transformation to Expectation#
1. Core Objective: To transform the complex “double summation over states and actions” form into a “mathematical expectation” form that allows for approximate calculation through data sampling.
2. Log-Derivative Trick: According to the calculus rule, taking the natural logarithm of the policy function and computing the gradient yields:
Rearranging the terms gives:
This step ingeniously constructs the probability term out of nowhere, which is a prerequisite for transforming the formula into an expectation.
3. Formula Substitution: Substituting the above result into the gradient formula:
4. Transformation to Mathematical Expectation: The equation above contains a two-layer probability-weighted summation (the outer layer based on the state distribution , and the inner layer based on the action probability ), which is equivalent to the mathematical expectation over the random variables and :
5. Practical Significance: After completing this mathematical transformation, the algorithm no longer needs to iterate through all states and actions in the environment. The agent only needs to explore the environment according to the current policy, and the collected trajectory data will naturally follow this expected probability distribution, thus providing a theoretical basis for single-step sampling approximation.
If sampling is used to approximate the gradient, the single-step update direction is:
Parameterization of the Policy Function (Softmax)#
To ensure the probability properties and the sum of probabilities equals 1, the Softmax function is commonly used to map real-valued preferences into probabilities.
For any vector :
Where and .
Applied to the policy function:
Where is the action preference function, which can be parameterized by a neural network.
REINFORCE and Policy Optimization Algorithms#
Using the true gradient to maximize the objective function:
In practical applications, stochastic gradients are used for the update:
Since the true action-value function is unknown, it needs to be approximately estimated:
- REINFORCE Algorithm: Uses the Monte Carlo method, taking the full trajectory return as an unbiased estimate of for gradient updates.
- Actor-Critic Algorithm: Combines with Temporal Difference (TD) algorithms to train a value function network to approximately estimate .
Combining the reverse expansion of the log-derivative formula , the parameter update rule can be intuitively rewritten as:
That is:
The coefficient here balances exploration and exploitation very well: it is proportional to the return (encouraging high-return actions) and inversely proportional to the action probability (giving larger update steps to rare actions, thereby encouraging exploration).
DOCS
-
CTF WP
-
WEB
-
Reinforcement Learning
- RL Study Notes: The Bellman Equation
- RL Study Notes: Basic Concepts
- RL Study Notes: Value Iteration and Policy Iteration
- RL Study Notes: Bellman Optimality Equation
- RL Study Notes: Monte Carlo Methods
- RL Study Notes: SA and SGD
- RL Study Notes: Policy Gradient Methods
- RL Study Notes: Temporal-Difference Learning
- RL Study Notes: Actor-Critic Algorithm
- RL Study Notes: Value Function Approximation
-
Miscellaneous