# RL Study Notes: The Bellman Equation
A detailed overview of State Value and Action Value definitions, including the derivation of the Bellman Expectation Equation and its matrix representation.
## 1. Basic Definitions
The interaction process in Reinforcement Learning can be described as follows:
### Core Variables

- $t$: Discrete time steps.
- $S_t$: The state at time $t$.
- $A_t$: The action taken at state $S_t$.
- $R_{t+1}$: The immediate reward received after taking $A_t$.
- $S_{t+1}$: The new state (Next State) transitioned to after taking $A_t$.

Note: $S_t, A_t, R_{t+1}, S_{t+1}$ are all Random Variables. This means every step of the interaction is governed by a probability distribution; therefore, we can compute their expectations.
### Trajectory and Return

The time-series trajectory formed by the interaction process is as follows:

$$S_t \xrightarrow{A_t} R_{t+1}, S_{t+1} \xrightarrow{A_{t+1}} R_{t+2}, S_{t+2} \xrightarrow{A_{t+2}} \cdots$$

Discounted Return is defined as the cumulative discounted reward starting from time $t$:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

where $\gamma \in [0, 1)$ is the discount factor.
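As a minimal numeric sketch, the discounted return can be computed directly from a reward sequence. The reward values and $\gamma$ below are made-up examples, not from the text:

```python
gamma = 0.5
rewards = [1.0, 2.0, 3.0]   # hypothetical R_{t+1}, R_{t+2}, R_{t+3}; zero afterwards

# G_t = R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...
G_t = sum(gamma**k * r for k, r in enumerate(rewards))

# The return also satisfies the recursion G_t = R_{t+1} + gamma * G_{t+1},
# which is exactly what the Bellman equation later exploits.
G_t1 = sum(gamma**k * r for k, r in enumerate(rewards[1:]))

print(G_t)                        # 1 + 0.5*2 + 0.25*3 = 2.75
print(rewards[0] + gamma * G_t1)  # 2.75, the same value via the recursion
```

The recursive identity checked in the last line is the seed of the derivation in Section 3.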
## 2. State Value

### Definition

The State-Value Function (or simply State Value) $v_\pi(s)$ is the mathematical expectation of the return $G_t$:

$$v_\pi(s) \triangleq \mathbb{E}[G_t \mid S_t = s]$$

- It is a function of the state $s$.
- Its value depends on the current policy $\pi$.
- It represents the “value” of being in that state. A higher value implies better prospects starting from that state under the given policy.
### Core Distinction

Return ($G_t$) vs. State Value ($v_\pi(s)$)

- Return is the realized cumulative reward along a single trajectory; it is a random variable.
- State Value is the mathematical expectation (statistical mean) of the Return across all possible trajectories (under a specific policy $\pi$).
- They are numerically equal only when both the policy and the environment are fully deterministic (i.e., there is only one possible trajectory).
## 3. Derivation of the Bellman Equation

The Bellman equation describes the recursive relationship between the value of the current state and the values of future states:

$$v_\pi(s) = \mathbb{E}[R_{t+1} + \gamma G_{t+1} \mid S_t = s] = \underbrace{\mathbb{E}[R_{t+1} \mid S_t = s]}_{\text{immediate rewards}} + \gamma\, \underbrace{\mathbb{E}[G_{t+1} \mid S_t = s]}_{\text{future rewards}}$$

The expanded general form:

$$v_\pi(s) = \sum_{a} \pi(a \mid s) \left[ \sum_{r} p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s') \right], \quad \forall s \in \mathcal{S}$$
### Part 1: Mean of Immediate Rewards

$$\mathbb{E}[R_{t+1} \mid S_t = s] = \sum_{a} \pi(a \mid s)\, \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] = \sum_{a} \pi(a \mid s) \sum_{r} p(r \mid s, a)\, r$$

Here, the Law of Total Expectation from probability theory is applied:

- $\pi(a \mid s)$: Weight (the probability of taking action $a$).
- $\sum_{r} p(r \mid s, a)\, r$: Conditional Expectation (the average reward under that action).
- $\sum_{a} \pi(a \mid s) \sum_{r} p(r \mid s, a)\, r$: Weighted Sum.

Interpretation: if the policy is deterministic, the summation contains only one non-zero term. For a general stochastic policy, however, we must iterate over all possible actions and weight them accordingly.
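The weighted sum above can be checked numerically. The policy probabilities and per-action expected rewards below are hypothetical:

```python
# Sketch of E[R_{t+1} | s] = sum_a pi(a|s) * E[R_{t+1} | s, a]
# for a hypothetical state with two available actions.

pi = {"a1": 0.7, "a2": 0.3}            # policy pi(a | s): weights
expected_r = {"a1": 1.0, "a2": -1.0}   # E[R_{t+1} | s, a]: conditional expectations

mean_reward = sum(pi[a] * expected_r[a] for a in pi)  # weighted sum
print(mean_reward)  # ≈ 0.7*1.0 + 0.3*(-1.0) = 0.4
```

Setting `pi = {"a1": 1.0, "a2": 0.0}` collapses the sum to a single non-zero term, matching the deterministic-policy remark above.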
### Part 2: Mean of Future Rewards

$$\mathbb{E}[G_{t+1} \mid S_t = s] = \sum_{s'} \mathbb{E}[G_{t+1} \mid S_{t+1} = s']\, p(s' \mid s) = \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a)\, v_\pi(s')$$

Here the Markov property is used: once $S_{t+1} = s'$ is known, the earlier state $S_t = s$ carries no additional information, so $\mathbb{E}[G_{t+1} \mid S_{t+1} = s'] = v_\pi(s')$.

Essence: calculating “the average value of the future, viewed from the current state.” This derivation decomposes the expectation of future returns into the product of three core elements:

- Policy $\pi(a \mid s)$: how we make choices.
- Dynamics $p(s' \mid s, a)$: how the environment transitions between states.
- Next State Value $v_\pi(s')$: how good the future is.
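The three-factor product can be sketched numerically. All probabilities and values below are hypothetical example numbers:

```python
# Sketch of E[G_{t+1} | s] = sum_a pi(a|s) * sum_{s'} p(s'|s,a) * v_pi(s'):
# policy, dynamics, and next-state values combine into one weighted sum.

pi = {"a1": 0.5, "a2": 0.5}                 # policy pi(a | s)
p = {"a1": {"s1": 1.0, "s2": 0.0},          # dynamics p(s' | s, a)
     "a2": {"s1": 0.2, "s2": 0.8}}
v_next = {"s1": 10.0, "s2": 2.0}            # next-state values v_pi(s')

future = sum(pi[a] * p[a][sp] * v_next[sp] for a in pi for sp in v_next)
print(future)  # 0.5*1.0*10 + 0.5*(0.2*10 + 0.8*2) = 6.8
```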
## 4. Matrix and Vector Forms

To facilitate computation, we define the following two auxiliary terms:

- Average Immediate Reward Vector $r_\pi$:

$$r_\pi(s) \triangleq \sum_{a} \pi(a \mid s) \sum_{r} p(r \mid s, a)\, r$$

Meaning: fuses action probabilities with the rewards generated by those actions to give the comprehensive expected immediate reward for the current state $s$.

- State Transition Matrix $P_\pi$:

$$[P_\pi]_{ss'} \triangleq p_\pi(s' \mid s) = \sum_{a} \pi(a \mid s)\, p(s' \mid s, a)$$

Meaning: marginalizes out the specific action choice and directly describes the statistical law of flowing from state $s$ to $s'$ under the current policy $\pi$.

We thus obtain the matrix form of the Bellman Equation (Bellman Expectation Equation):

$$v_\pi = r_\pi + \gamma P_\pi v_\pi$$
### Matrix Expansion Example

Assuming there are 4 states:

$$\begin{bmatrix} v_\pi(s_1) \\ v_\pi(s_2) \\ v_\pi(s_3) \\ v_\pi(s_4) \end{bmatrix} = \begin{bmatrix} r_\pi(s_1) \\ r_\pi(s_2) \\ r_\pi(s_3) \\ r_\pi(s_4) \end{bmatrix} + \gamma \begin{bmatrix} p_\pi(s_1|s_1) & p_\pi(s_2|s_1) & p_\pi(s_3|s_1) & p_\pi(s_4|s_1) \\ p_\pi(s_1|s_2) & p_\pi(s_2|s_2) & p_\pi(s_3|s_2) & p_\pi(s_4|s_2) \\ p_\pi(s_1|s_3) & p_\pi(s_2|s_3) & p_\pi(s_3|s_3) & p_\pi(s_4|s_3) \\ p_\pi(s_1|s_4) & p_\pi(s_2|s_4) & p_\pi(s_3|s_4) & p_\pi(s_4|s_4) \end{bmatrix} \begin{bmatrix} v_\pi(s_1) \\ v_\pi(s_2) \\ v_\pi(s_3) \\ v_\pi(s_4) \end{bmatrix}$$
### Solution Methods

1. Closed-Form Solution. Can be obtained directly via matrix inversion:

$$v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$$

Drawback: when the state space is large, the computational cost of matrix inversion becomes prohibitive.

2. Iterative Solution. This is the basis of Policy Evaluation:

$$v_{k+1} = r_\pi + \gamma P_\pi v_k, \quad k = 0, 1, 2, \ldots$$

Conclusion: as $k \to \infty$, the sequence $\{v_k\}$ converges to the true value $v_\pi$.
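A minimal sketch of the iterative solution, assuming a made-up 2-state chain whose fixed point is small enough to verify by hand:

```python
# Iterative policy evaluation v_{k+1} = r_pi + gamma * P_pi v_k
# for a hypothetical 2-state Markov chain under a fixed policy.

gamma = 0.9
r_pi = [1.0, 0.0]          # average immediate reward vector r_pi(s)
P_pi = [[0.5, 0.5],        # P_pi[s][s'] = p_pi(s' | s)
        [0.0, 1.0]]        # s2 is absorbing with zero reward

v = [0.0, 0.0]             # v_0: arbitrary initial guess
for _ in range(1000):
    v = [r_pi[s] + gamma * sum(P_pi[s][sp] * v[sp] for sp in range(2))
         for s in range(2)]

# Hand check of the fixed point v = r_pi + gamma * P_pi v:
#   v(s2) = 0 + 0.9 * v(s2)                =>  v(s2) = 0
#   v(s1) = 1 + 0.9 * 0.5 * v(s1)          =>  v(s1) = 1 / 0.55 ≈ 1.818
print(v)
```

For a chain this small the closed form $(I - \gamma P_\pi)^{-1} r_\pi$ is trivial, but the loop illustrates why iteration scales where inversion does not.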
## 5. Action Value

### Definition and Comparison

- State Value ($v_\pi(s)$): the average Return the agent obtains starting from a state.
- Action Value ($q_\pi(s, a)$): the average Return the agent obtains starting from a state, taking a specific action first, and then following policy $\pi$.

Definition formula:

$$q_\pi(s, a) \triangleq \mathbb{E}[G_t \mid S_t = s, A_t = a]$$

It depends on two elements: the current State-Action Pair $(s, a)$ and the subsequent policy $\pi$.
### Relationship

1. State Value is the expectation of Action Value:

$$v_\pi(s) = \sum_{a} \pi(a \mid s)\, q_\pi(s, a)$$

2. Expansion of Action Value. Expanding $q_\pi(s, a)$ into the sum of the immediate reward and the discounted next-state value:

$$q_\pi(s, a) = \sum_{r} p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s')$$

Combining the two relations above reconfirms the recursive structure of the Bellman Equation:

$$v_\pi(s) = \sum_{a} \pi(a \mid s) \left[ \sum_{r} p(r \mid s, a)\, r + \gamma \sum_{s'} p(s' \mid s, a)\, v_\pi(s') \right]$$

Summary: if you know all State Values, you can derive all Action Values, and vice versa.
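Both directions of the $v$–$q$ relationship can be sketched in a few lines. The rewards, dynamics, and next-state values below are arbitrary illustration numbers (not a self-consistent fixed point):

```python
# Sketch of the two relationships for one hypothetical state with two actions:
#   q from r and v: one-step lookahead
#   v from q: expectation over the policy

gamma = 0.9
pi = {"a1": 0.6, "a2": 0.4}             # pi(a | s)
r = {"a1": 1.0, "a2": 0.0}              # expected immediate reward per action
p = {"a1": {"s1": 1.0, "s2": 0.0},      # p(s' | s, a)
     "a2": {"s1": 0.0, "s2": 1.0}}
v = {"s1": 5.0, "s2": 3.0}              # assumed next-state values v_pi(s')

# q_pi(s, a) = E[R | s, a] + gamma * sum_{s'} p(s'|s,a) * v_pi(s')
q = {a: r[a] + gamma * sum(p[a][sp] * v[sp] for sp in v) for a in pi}

# v_pi(s) = sum_a pi(a|s) * q_pi(s, a)
v_s = sum(pi[a] * q[a] for a in pi)
print(q)    # q(s, a1) = 1 + 0.9*5 = 5.5;  q(s, a2) = 0 + 0.9*3 = 2.7
print(v_s)  # 0.6*5.5 + 0.4*2.7 = 4.38
```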