# The Bellman Equation
## 1. Basic Definitions
The interaction process in Reinforcement Learning can be described as follows:
$$S_{t} \xrightarrow{A_{t}} R_{t+1},\ S_{t+1}$$
### Core Variables
- $t, t+1$: discrete time steps.
- $S_{t}$: the state at time $t$.
- $A_{t}$: the action taken in state $S_{t}$.
- $R_{t+1}$: the immediate reward received after taking $A_{t}$.
- $S_{t+1}$: the next state reached after taking $A_{t}$.
**Note**: $S_{t}, A_{t}, R_{t+1}$ are all **random variables**. This means every step of the interaction is governed by a probability distribution; therefore, we can compute their expectations.
### Trajectory and Return
The time-series trajectory formed by the interaction process is as follows:
$$S_{t} \xrightarrow{A_{t}} R_{t+1}, S_{t+1} \xrightarrow{A_{t+1}} R_{t+2}, S_{t+2} \xrightarrow{A_{t+2}} R_{t+3}, \dots$$
The **discounted return** is defined as the cumulative discounted reward starting from time $t$:
$$G_{t} = R_{t+1} + \gamma R_{t+2} + \gamma^{2}R_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$
where $\gamma \in [0, 1]$ is the discount factor ($\gamma < 1$ guarantees that the infinite sum converges for bounded rewards).
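For concreteness, here is a minimal Python sketch (the `discounted_return` helper and the reward numbers are made up for illustration) that accumulates a finite reward sequence into $G_t$:

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G_t = sum_k gamma^k * R_{t+k+1} for a finite reward sequence."""
    g = 0.0
    for k, r in enumerate(rewards):
        g += (gamma ** k) * r
    return g

# Rewards observed after time t (made-up numbers).
print(discounted_return([1.0, 0.0, 0.0, 1.0], gamma=0.9))  # 1 + 0.9**3 * 1 = 1.729
```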
## 2. State Value
### Definition
$v_{\pi}(s)$ is called the **state-value function**, or simply the state value. It is the mathematical expectation of the return $G_t$:
$$v_{\pi}(s) = \mathbb{E}[G_{t} \mid S_{t}=s]$$
- It is a function of the state $s$.
- Its value depends on the current policy $\pi$.
- It represents the "value" of being in that state: a higher value implies better prospects starting from that state under the given policy.
### Core Distinction
**Return ($G_t$) vs. State Value ($v_{\pi}(s)$)**

- Return is the realized cumulative reward along a single trajectory; it is a random variable.
- State value is the mathematical expectation (statistical mean) of the return over all possible trajectories (under a specific policy $\pi$).
- They are numerically equal only when both the policy and the environment are fully deterministic (i.e., there is exactly one possible trajectory). The Monte Carlo sketch below estimates the expectation by averaging many sampled returns.
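To make the distinction concrete, here is a small Monte Carlo sketch on a made-up two-state MDP (all transition rules, rewards, and the policy are assumptions for illustration): each sampled return is random, while the average over many trajectories approximates $v_\pi(s)$.

```python
import random

# A tiny made-up MDP used only for illustration:
# state "A": action "left"  -> terminal, reward 0
#            action "right" -> state "B", reward 1
# state "B": action "go"    -> terminal, reward 2
GAMMA = 0.9
POLICY = {"A": [("left", 0.5), ("right", 0.5)], "B": [("go", 1.0)]}

def step(state, action):
    """Return (reward, next_state); next_state=None means the episode ends."""
    if state == "A":
        return (0.0, None) if action == "left" else (1.0, "B")
    return (2.0, None)

def sample_return(state):
    """Sample one trajectory from `state` and compute its discounted return."""
    g, discount = 0.0, 1.0
    while state is not None:
        actions, probs = zip(*POLICY[state])
        action = random.choices(actions, weights=probs)[0]
        reward, state = step(state, action)
        g += discount * reward
        discount *= GAMMA
    return g

# Each sampled return is random; their mean estimates v_pi("A").
returns = [sample_return("A") for _ in range(100_000)]
print(sum(returns) / len(returns))  # close to 0.5*0 + 0.5*(1 + 0.9*2) = 1.4
```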
## 3. Derivation of the Bellman Equation
The Bellman equation describes the recursive relationship between the value of the current state and the value of future states:
$$
\begin{aligned}
v_{\pi}(s) &= \mathbb{E}[R_{t+1} + \gamma G_{t+1} \mid S_{t}=s] \\
&= \underbrace{\mathbb{E}[R_{t+1} \mid S_{t}=s]}_{\text{Expectation of Immediate Reward}} + \gamma \underbrace{\mathbb{E}[G_{t+1} \mid S_{t}=s]}_{\text{Expectation of Future Reward}}
\end{aligned}
$$
Expanding the two expectation terms (derived as Part 1 and Part 2 below) gives the general elementwise form:
$$
v_{\pi}(s) = \sum_{a \in \mathcal{A}}\pi(a|s) \left[ \sum_{r \in \mathcal{R}}p(r|s,a)\,r + \gamma \sum_{s' \in \mathcal{S}}p(s'|s,a)\,v_{\pi}(s') \right], \quad \text{for all } s \in \mathcal{S}
$$
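As a sketch of how the nested sums are evaluated, the following Python function computes the right-hand side for a single state, assuming a hypothetical dictionary-based tabular model (`policy`, `reward_probs`, `trans_probs` and their contents are made-up names and numbers):

```python
# Elementwise evaluation of the right-hand side of the Bellman equation for one state.
def bellman_rhs(s, v, policy, reward_probs, trans_probs, gamma=0.9):
    total = 0.0
    for a, pi_a in policy[s].items():                                   # sum over actions a
        exp_r = sum(p * r for r, p in reward_probs[(s, a)].items())     # sum_r p(r|s,a) * r
        exp_v = sum(p * v[s2] for s2, p in trans_probs[(s, a)].items()) # sum_s' p(s'|s,a) * v(s')
        total += pi_a * (exp_r + gamma * exp_v)                         # weight by pi(a|s)
    return total

# Tiny made-up check: one state "s" with two actions.
policy = {"s": {"stay": 0.5, "leave": 0.5}}
reward_probs = {("s", "stay"): {1.0: 1.0}, ("s", "leave"): {0.0: 1.0}}  # reward -> probability
trans_probs = {("s", "stay"): {"s": 1.0}, ("s", "leave"): {"t": 1.0}}   # next state -> probability
v = {"s": 5.0, "t": 0.0}
print(bellman_rhs("s", v, policy, reward_probs, trans_probs))  # 0.5*(1 + 0.9*5) + 0.5*0 = 2.75
```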
### Part 1: Mean of Immediate Rewards

$$
\begin{aligned}
\mathbb{E}[R_{t+1} \mid S_t = s] &= \sum_{a \in \mathcal{A}} \pi(a|s)\, \mathbb{E}[R_{t+1} \mid S_t = s, A_t = a] \\
&= \sum_{a \in \mathcal{A}} \pi(a|s) \sum_{r \in \mathcal{R}} p(r|s, a)\,r
\end{aligned}
$$
Here, the Law of Total Expectation from probability theory is applied:
- $\pi(a|s)$: the weight (the probability of taking each action).
- $\mathbb{E}[R \mid s, a]$: the conditional expectation (the average reward under that action).
- $\sum$: the weighted sum.
**Interpretation**: If the policy is deterministic, the summation contains only one non-zero term. However, for a general stochastic policy, we must iterate over all possible actions and weight them accordingly.
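A tiny numeric sketch of this weighted sum, with made-up probabilities and rewards:

```python
# Law of total expectation for the immediate reward at one state (made-up numbers):
pi = {"a1": 0.5, "a2": 0.5}      # pi(a|s): weights
exp_r = {"a1": 1.0, "a2": 3.0}   # E[R_{t+1} | s, a]: conditional expectations
print(sum(pi[a] * exp_r[a] for a in pi))  # weighted sum: 0.5*1 + 0.5*3 = 2.0
```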
### Part 2: Mean of Future Rewards
$$
\begin{aligned}
\mathbb{E}[G_{t+1} \mid S_t = s] &= \sum_{s' \in \mathcal{S}} \mathbb{E}[G_{t+1} \mid S_t = s, S_{t+1} = s']\, p(s'|s) \\
&= \sum_{s' \in \mathcal{S}} \mathbb{E}[G_{t+1} \mid S_{t+1} = s']\, p(s'|s) \quad \text{(Markov property)} \\
&= \sum_{s' \in \mathcal{S}} v_\pi(s')\, p(s'|s) \\
&= \sum_{s' \in \mathcal{S}} v_\pi(s') \sum_{a \in \mathcal{A}} p(s'|s, a)\,\pi(a|s)
\end{aligned}
$$
**Essence**: calculating “the average value of the future, viewed from the current state.”
This derivation decomposes the expectation of future returns into the product of three core elements:
- Policy $\pi(a|s)$: how we make choices.
- Dynamics $p(s'|s,a)$: how the environment transitions between states (see the numeric sketch after this list).
- Next-state value $v_\pi(s')$: how good the future is.
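The last line of the derivation marginalizes the dynamics over the policy, $p(s'|s) = \sum_a \pi(a|s)\,p(s'|s,a)$. A small numeric sketch with made-up numbers for a single state $s$:

```python
# p_pi(s'|s) = sum_a pi(a|s) * p(s'|s,a), with made-up numbers for one state s:
pi = {"a1": 0.3, "a2": 0.7}                      # pi(a|s)
p = {("a1", "s1"): 1.0, ("a1", "s2"): 0.0,       # p(s'|s,a)
     ("a2", "s1"): 0.2, ("a2", "s2"): 0.8}
p_pi = {s2: sum(pi[a] * p[(a, s2)] for a in pi) for s2 in ("s1", "s2")}
print(p_pi)  # {'s1': 0.44, 's2': 0.56}
```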
To facilitate computation, we can define the following two auxiliary terms:
**Average immediate reward vector** $r_{\pi}$:

$$r_{\pi}(s) \doteq \sum_{a \in \mathcal{A}} \pi(a|s) \sum_{r \in \mathcal{R}} p(r|s, a)\,r$$

Meaning: it combines the action probabilities with the expected rewards of those actions, giving the overall expected immediate reward in state $s$ under the policy.
**State transition matrix** $P_{\pi}$:

$$p_{\pi}(s'|s) \doteq \sum_{a \in \mathcal{A}} \pi(a|s)\,p(s'|s, a)$$

Meaning: it averages out the specific action choices and directly describes the probability of moving from state $s$ to state $s'$ under policy $\pi$.
We thus obtain the matrix form of the Bellman equation (the Bellman expectation equation):

$$v_{\pi} = r_{\pi} + \gamma P_{\pi} v_{\pi}$$
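For intuition, here is a minimal NumPy sketch (the 3-state, 2-action model, its rewards, and the policy are all made up) that assembles $r_\pi$ and $P_\pi$ from a tabular model:

```python
import numpy as np

# Made-up tabular model with 3 states and 2 actions (indices 0..2 and 0..1).
n_s, n_a = 3, 2
P = np.zeros((n_s, n_a, n_s))   # P[s, a, s'] = p(s'|s, a)
R = np.zeros((n_s, n_a))        # R[s, a]     = E[R_{t+1} | s, a]
P[0, 0, 1] = 1.0; R[0, 0] = 0.0
P[0, 1, 2] = 1.0; R[0, 1] = 1.0
P[1, :, 2] = 1.0; R[1, :] = 2.0
P[2, :, 2] = 1.0                # state 2 is absorbing with reward 0
pi = np.array([[0.5, 0.5],      # pi(a|s), one row per state
               [1.0, 0.0],
               [1.0, 0.0]])

r_pi = (pi * R).sum(axis=1)            # r_pi(s)    = sum_a pi(a|s) E[R|s,a]
P_pi = np.einsum("sa,sax->sx", pi, P)  # p_pi(s'|s) = sum_a pi(a|s) p(s'|s,a)
print(r_pi)   # [0.5, 2.0, 0.0]
print(P_pi)   # rows sum to 1
```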
### Matrix Expansion Example
Assuming there are 4 states:
$$
\underbrace{\begin{bmatrix} v_\pi(s_1) \\ v_\pi(s_2) \\ v_\pi(s_3) \\ v_\pi(s_4) \end{bmatrix}}_{v_\pi}
=
\underbrace{\begin{bmatrix} r_\pi(s_1) \\ r_\pi(s_2) \\ r_\pi(s_3) \\ r_\pi(s_4) \end{bmatrix}}_{r_\pi}
+ \gamma
\underbrace{\begin{bmatrix}
p_\pi(s_1|s_1) & p_\pi(s_2|s_1) & p_\pi(s_3|s_1) & p_\pi(s_4|s_1) \\
p_\pi(s_1|s_2) & p_\pi(s_2|s_2) & p_\pi(s_3|s_2) & p_\pi(s_4|s_2) \\
p_\pi(s_1|s_3) & p_\pi(s_2|s_3) & p_\pi(s_3|s_3) & p_\pi(s_4|s_3) \\
p_\pi(s_1|s_4) & p_\pi(s_2|s_4) & p_\pi(s_3|s_4) & p_\pi(s_4|s_4)
\end{bmatrix}}_{P_\pi}
\underbrace{\begin{bmatrix} v_\pi(s_1) \\ v_\pi(s_2) \\ v_\pi(s_3) \\ v_\pi(s_4) \end{bmatrix}}_{v_\pi}
$$
### Solution Methods
**1. Closed-Form Solution**

The equation can be solved directly via matrix inversion:
$$v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$$
Drawback: when the state space is large, the cost of matrix inversion (on the order of $|\mathcal{S}|^3$ for a dense matrix) is too high to be practical.
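A minimal NumPy sketch of the closed-form solution, reusing the made-up $r_\pi$ and $P_\pi$ produced by the earlier construction sketch (in practice `np.linalg.solve` is preferable to forming the inverse explicitly):

```python
import numpy as np

gamma = 0.9
r_pi = np.array([0.5, 2.0, 0.0])            # from the construction sketch above
P_pi = np.array([[0.0, 0.5, 0.5],
                 [0.0, 0.0, 1.0],
                 [0.0, 0.0, 1.0]])

# v_pi = (I - gamma * P_pi)^{-1} r_pi; solve() avoids forming the inverse explicitly.
v_pi = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)
print(v_pi)  # [1.4, 2.0, 0.0]
```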
**2. Iterative Solution**
This is the basis of **policy evaluation**:
$$v_{k+1} = r_\pi + \gamma P_\pi v_k, \quad k = 0, 1, 2, \dots$$
Conclusion: as $k \to \infty$, the sequence $\{v_k\}$ converges to the true value $v_{\pi}$, regardless of the initial guess $v_0$ (for $\gamma < 1$ the update is a contraction mapping).
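A minimal sketch of the iterative solution on the same made-up $r_\pi$ and $P_\pi$; it stops once successive iterates are numerically indistinguishable:

```python
import numpy as np

gamma = 0.9
r_pi = np.array([0.5, 2.0, 0.0])
P_pi = np.array([[0.0, 0.5, 0.5],
                 [0.0, 0.0, 1.0],
                 [0.0, 0.0, 1.0]])

v = np.zeros(3)                        # arbitrary initial guess v_0
for k in range(1000):
    v_new = r_pi + gamma * P_pi @ v    # v_{k+1} = r_pi + gamma * P_pi * v_k
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new
print(k, v_new)  # converges to the same v_pi as the closed-form solution
```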
## 4. Action Value
### Definition and Comparison
- **State value** ($v_{\pi}$): the average return the agent obtains starting from a state.
- **Action value** ($q_{\pi}$): the average return the agent obtains starting from a state, taking a specific action first, and then following policy $\pi$.
The defining formula:
$$q_\pi(s, a) \doteq \mathbb{E}[G_t \mid S_t = s, A_t = a]$$
It depends on two elements: the current state-action pair $(s, a)$ and the subsequent policy $\pi$.
### Relationship
**1. State value is the expectation of the action value**

$$v_\pi(s) = \sum_{a \in \mathcal{A}} \pi(a|s)\, q_\pi(s, a)$$
**2. Expansion of the action value**

Expanding $q_\pi(s, a)$ into the sum of the expected immediate reward and the discounted next-state value:

$$q_{\pi}(s,a) = \sum_{r \in \mathcal{R}} p(r|s,a)\,r + \gamma \sum_{s' \in \mathcal{S}} p(s'|s,a)\,v_{\pi}(s')$$
Combining the above two points reconfirms the recursive structure of the Bellman Equation:
$$
v_{\pi}(s) = \sum_{a} \pi(a|s) \underbrace{\left[ \sum_{r} p(r|s,a)\,r + \gamma \sum_{s'} p(s'|s,a)\,v_{\pi}(s') \right]}_{q_{\pi}(s,a)}
$$
**Summary**: given the environment model, knowing all state values lets you compute all action values (via the expansion above), and knowing all action values lets you compute all state values (via the policy-weighted average).
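A short NumPy sketch of both directions, on the same made-up 3-state, 2-action model used in the earlier sketches (the state values are taken from the earlier solution):

```python
import numpy as np

gamma = 0.9
# Same made-up 3-state, 2-action model as in the earlier sketches.
n_s, n_a = 3, 2
P = np.zeros((n_s, n_a, n_s)); R = np.zeros((n_s, n_a))
P[0, 0, 1] = 1.0
P[0, 1, 2] = 1.0; R[0, 1] = 1.0
P[1, :, 2] = 1.0; R[1, :] = 2.0
P[2, :, 2] = 1.0
pi = np.array([[0.5, 0.5], [1.0, 0.0], [1.0, 0.0]])

v = np.array([1.4, 2.0, 0.0])   # v_pi from the earlier solution
q = R + gamma * P @ v           # q_pi(s,a) = E[R|s,a] + gamma * sum_s' p(s'|s,a) v_pi(s')
v_back = (pi * q).sum(axis=1)   # v_pi(s)   = sum_a pi(a|s) q_pi(s,a)
print(q)        # one action value per (s, a) pair
print(v_back)   # recovers [1.4, 2.0, 0.0]
```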