RL Study Notes: Bellman Optimality Equation
Derives Bellman Optimality and fixed-point properties. Analyzes Value Iteration (contraction mapping) and how models/rewards determine the optimal policy.
Bellman Optimality Equation#
Definition#
For all states $s \in \mathcal{S}$, if the state value function of a policy $\pi^*$ is not less than the state value function of any other policy $\pi$, that is:

$$v_{\pi^*}(s) \ge v_{\pi}(s), \quad \forall s \in \mathcal{S},\ \forall \pi$$

Then policy $\pi^*$ is called the optimal policy. The state value corresponding to the optimal policy is called the optimal state value function, denoted as $v^*(s)$.
Derivation of the Optimality Equation#
The optimal state value function satisfies the Bellman Optimality Equation. The core idea is: the optimal value equals the expected return obtained by taking the optimal action in the current state.
Scalar Form#
Written as a maximization over the policy $\pi$, for every state $s$ we have:

$$v(s) = \max_{\pi} \sum_{a} \pi(a|s) \left( \sum_{r} p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v(s') \right) = \max_{\pi} \sum_{a} \pi(a|s)\, q(s,a)$$

Since a weighted average cannot exceed the maximum value:

$$\max_{\pi} \sum_{a} \pi(a|s)\, q(s,a) = \max_{a \in \mathcal{A}(s)} q(s,a)$$

Equality holds when the policy assigns all of its probability to the action that maximizes $q(s,a)$. This implies that the optimal policy is deterministic:

$$\pi^*(a|s) = \begin{cases} 1, & a = a^*(s) = \arg\max_{a} q(s,a) \\ 0, & a \ne a^*(s) \end{cases}$$
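A minimal numeric check of the argument above, assuming a single state with three made-up action values; the arrays `q`, `pi`, and `greedy` are illustrative inventions, not from the source. It shows that a stochastic policy's weighted average of $q(s,a)$ never exceeds $\max_a q(s,a)$, and that the deterministic greedy policy attains the bound.

```python
import numpy as np

# Hypothetical action values q(s, a) for a single state with three actions
# (numbers are made up purely for illustration).
q = np.array([1.0, 3.0, 2.0])

# Any stochastic policy pi(.|s) is a probability vector over actions; its
# weighted average of q(s, .) can never exceed max_a q(s, a).
rng = np.random.default_rng(0)
for _ in range(5):
    pi = rng.dirichlet(np.ones(3))   # a random stochastic policy
    assert pi @ q <= q.max() + 1e-12

# Equality is attained by the deterministic greedy policy that puts all
# probability on a*(s) = argmax_a q(s, a).
greedy = np.zeros(3)
greedy[q.argmax()] = 1.0
print(greedy @ q, q.max())           # both print 3.0
```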
Vector Form#
We treat solving for $v$ as applying an operator. Defining the Bellman optimality operator $f$, the Bellman Optimality Equation is a fixed-point equation:

$$v = f(v)$$

Specifically expanded as:

$$f(v) = \max_{\pi} \left( r_{\pi} + \gamma P_{\pi} v \right)$$
Where:
- $r_{\pi}$ is the average immediate reward vector under policy $\pi$, defined as $[r_{\pi}]_s = \sum_{a} \pi(a|s) \sum_{r} p(r|s,a)\, r$
- $P_{\pi}$ is the state transition matrix under policy $\pi$, defined as $[P_{\pi}]_{s,s'} = \sum_{a} \pi(a|s)\, p(s'|s,a)$
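The sketch below builds $r_{\pi}$ and $P_{\pi}$ for a tiny 2-state, 2-action MDP and applies the operator $f$ once; all transition probabilities, rewards, and the policy `pi` are illustrative assumptions, not values from the source.

```python
import numpy as np

# A made-up 2-state, 2-action MDP (all numbers are illustrative assumptions).
# P[s, a, s'] = p(s'|s,a);  R[s, a] = sum_r p(r|s,a) * r  (expected reward).
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [0.5, 2.0]])
gamma = 0.9

# An arbitrary stochastic policy pi[s, a] = pi(a|s).
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])

# r_pi[s]     = sum_a pi(a|s) * sum_r p(r|s,a) * r   (reward vector under pi)
# P_pi[s, s'] = sum_a pi(a|s) * p(s'|s,a)            (transition matrix under pi)
r_pi = (pi * R).sum(axis=1)
P_pi = np.einsum("sa,sat->st", pi, P)

# The backup r_pi + gamma * P_pi v for this particular pi ...
v = np.zeros(2)
backup_pi = r_pi + gamma * P_pi @ v

# ... is never above f(v) = max_pi (r_pi + gamma * P_pi v), which decomposes
# state-wise into a maximization over actions.
f_v = (R + gamma * (P @ v)).max(axis=1)
print(backup_pi, f_v)   # f_v dominates backup_pi elementwise
```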
Contraction Mapping and Fixed Points#
The Bellman optimality operator $f$ is a contraction mapping when $\gamma \in (0, 1)$, i.e., $\|f(v_1) - f(v_2)\|_{\infty} \le \gamma \|v_1 - v_2\|_{\infty}$ for any $v_1, v_2$. By the Contraction Mapping Theorem, this implies:

- Existence: There exists a unique fixed point $v^*$ satisfying $v^* = f(v^*)$.
- Convergence: For any initial value $v_0$, the iterative sequence $v_{k+1} = f(v_k)$ converges to $v^*$.
  - That is, $v_k \to v^*$ as $k \to \infty$.
  - The convergence is geometric (exponentially fast), controlled by the discount factor: $\|v_k - v^*\|_{\infty} \le \gamma^k \|v_0 - v^*\|_{\infty}$.
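The following sketch illustrates the geometric convergence claim on a toy MDP whose numbers are invented for illustration: it approximates the fixed point $v^*$ by iterating $f$ many times, then checks that the $\infty$-norm error of a fresh iterate shrinks by at least a factor of $\gamma$ per step.

```python
import numpy as np

# Toy 2-state, 2-action MDP (illustrative numbers only).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.6, 0.4]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
gamma = 0.8

def f(v):
    """Bellman optimality operator: f(v)(s) = max_a [ r(s,a) + gamma * sum_s' p(s'|s,a) v(s') ]."""
    return (R + gamma * (P @ v)).max(axis=1)

# Approximate the unique fixed point v* by iterating far past convergence.
v_star = np.zeros(2)
for _ in range(2000):
    v_star = f(v_star)

# Geometric convergence: the sup-norm error shrinks by at least a factor gamma per step.
v = np.array([10.0, -10.0])                 # arbitrary initial guess v_0
err = np.linalg.norm(v - v_star, np.inf)
for k in range(1, 11):
    v = f(v)
    new_err = np.linalg.norm(v - v_star, np.inf)
    assert new_err <= gamma * err + 1e-9    # contraction in the infinity norm
    err = new_err
    print(k, err)
```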
The Essence of Value Iteration#
The Value Iteration algorithm exploits the fixed-point property above, with the iterative formula:

$$v_{k+1} = f(v_k) = \max_{\pi} \left( r_{\pi} + \gamma P_{\pi} v_k \right), \quad k = 0, 1, 2, \ldots$$
This step actually contains two implicit processes:
- Implicit Policy Improvement: Based on the current value estimate $v_k$, find the greedy policy, i.e., select the action that currently appears to have the highest value:

  $$\pi_{k+1} = \arg\max_{\pi} \left( r_{\pi} + \gamma P_{\pi} v_k \right)$$

- Value Update (a one-step Policy Evaluation): Assuming the greedy policy is followed, calculate its one-step expected return as the new value estimate:

  $$v_{k+1} = r_{\pi_{k+1}} + \gamma P_{\pi_{k+1}} v_k$$
Summary: Value Iteration essentially “seizes” the currently best action in every round, calculates its value, and then “seizes” the best action again in the next round based on the new value.
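A compact Value Iteration sketch on a made-up 3-state, 2-action MDP, with the two implicit steps written out explicitly (greedy policy extraction, then the one-step value update); the arrays `P` and `R`, the discount factor, and the tolerance are assumptions chosen for illustration.

```python
import numpy as np

# Made-up 3-state, 2-action MDP (transitions kept deterministic for readability).
# P[s, a, s'] = p(s'|s,a);  R[s, a] = expected immediate reward.
P = np.array([[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
              [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]],
              [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]])
R = np.array([[0.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
gamma, tol = 0.9, 1e-8

v = np.zeros(3)
while True:
    q = R + gamma * (P @ v)     # q_k(s, a) induced by the current estimate v_k
    pi = q.argmax(axis=1)       # implicit policy improvement: greedy policy pi_{k+1}
    v_new = q.max(axis=1)       # value update: one-step return of the greedy action
    if np.abs(v_new - v).max() < tol:
        break
    v = v_new

print("optimal state values:", v_new)
print("greedy (optimal) policy:", pi)
```

Each pass through the loop is one application of $v_{k+1} = f(v_k)$; the loop stops once successive estimates agree to within the (assumed) tolerance, at which point the extracted greedy policy is optimal.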
Determinants of the Optimal Policy#
The optimal policy is determined by the following formula:

$$a^*(s) = \arg\max_{a} \left( \sum_{r} p(r|s,a)\, r + \gamma \sum_{s'} p(s'|s,a)\, v^*(s') \right)$$
Key Factors#
- System Dynamics: the state transition model $p(s'|s,a)$ and the reward model $p(r|s,a)$. These are the physical laws of the environment and are generally immutable.
- Discount Factor $\gamma$ (illustrated in the sketch after this list):
  - $\gamma \to 0$: The agent becomes "myopic," focusing only on immediate rewards.
  - $\gamma \to 1$: The agent becomes "far-sighted," valuing long-term cumulative returns.
- Reward Function $r$:
  - The relative numerical values of the rewards are more important than their absolute values.
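To illustrate the effect of $\gamma$, the toy chain below (states, transitions, rewards, and the helper `greedy_policy` are all invented for this example) offers an immediate reward of 1 or a reward of 10 delayed by one step; a small $\gamma$ makes the optimal action the immediate one, while a large $\gamma$ favors the delayed one.

```python
import numpy as np

# Illustrative 3-state chain. In state 0 the agent chooses:
#   action 0: immediate reward 1, then jump to the absorbing terminal state 2;
#   action 1: reward 0 now, move to state 1, and collect reward 10 one step later.
P = np.array([[[0.0, 0.0, 1.0], [0.0, 1.0, 0.0]],
              [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],
              [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]])
R = np.array([[1.0, 0.0],
              [10.0, 10.0],
              [0.0, 0.0]])

def greedy_policy(gamma, iters=1000):
    """Run value iteration, then return the greedy (optimal) action per state."""
    v = np.zeros(3)
    for _ in range(iters):
        v = (R + gamma * (P @ v)).max(axis=1)
    return (R + gamma * (P @ v)).argmax(axis=1)

print(greedy_policy(gamma=0.05)[0])   # 0: myopic agent takes the immediate reward
print(greedy_policy(gamma=0.9)[0])    # 1: far-sighted agent waits for the larger reward
```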
Affine Transformation of the Reward Function#
If an affine transformation is applied to every reward:

$$r \to r' = \alpha r + \beta$$

where $\alpha > 0$ and $\beta$ is a constant.

- Impact on Value Function: The new optimal value function is an affine function of the original one, $v' = \alpha v^* + \dfrac{\beta}{1-\gamma} \mathbf{1}$, where $\mathbf{1}$ is the all-ones vector.
- Impact on Policy: The optimal policy $\pi^*$ remains unchanged.
This indicates that as long as the partial order relations and relative proportions between rewards are preserved, the specific numerical magnitude does not alter the optimal behavior pattern.
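A quick numerical check of both claims on a made-up 2-state MDP (the arrays `P` and `R`, the helper `solve`, and the choice $\alpha = 2$, $\beta = 5$ are arbitrary illustrations): after the transform $r' = \alpha r + \beta$, the greedy optimal policy is identical and the new optimal values satisfy $v' = \alpha v^* + \beta/(1-\gamma)$.

```python
import numpy as np

# Toy 2-state, 2-action MDP (numbers, alpha, and beta are arbitrary choices).
P = np.array([[[0.7, 0.3], [0.2, 0.8]],
              [[0.4, 0.6], [0.9, 0.1]]])
R = np.array([[1.0, -1.0],
              [0.0, 2.0]])
gamma = 0.9

def solve(rewards):
    """Value iteration; returns (optimal state values, greedy optimal policy)."""
    v = np.zeros(P.shape[0])
    for _ in range(5000):
        v = (rewards + gamma * (P @ v)).max(axis=1)
    return v, (rewards + gamma * (P @ v)).argmax(axis=1)

alpha, beta = 2.0, 5.0                       # any alpha > 0 and constant beta
v_star, pi_star = solve(R)
v_prime, pi_prime = solve(alpha * R + beta)  # affine transform r' = alpha * r + beta

print(np.array_equal(pi_star, pi_prime))                          # True: policy unchanged
print(np.allclose(v_prime, alpha * v_star + beta / (1 - gamma)))  # True: v' = alpha*v* + beta/(1-gamma)
```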