RL Study Notes: Temporal-Difference Learning
A summary of core Temporal-Difference learning concepts, comparing TD with MC, and detailing mechanisms for Sarsa, n-step Sarsa, and Q-learning.
Temporal-Difference Learning#
TD learning of state values#
- This algorithm is used to estimate the state value $v_\pi(s)$ under a given policy $\pi$.
- This section primarily introduces the foundational TD algorithm for estimating state values (commonly referred to as TD(0)).
Required data/experience:
- $(s_0, r_1, s_1, r_2, s_2, \ldots)$ or, equivalently, the set of one-step samples $\{(s_t, r_{t+1}, s_{t+1})\}_t$. These data are generated by interacting with the environment according to the given policy $\pi$.
TD learning algorithm update formulas:
$$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t)\Big[v_t(s_t) - \big(r_{t+1} + \gamma v_t(s_{t+1})\big)\Big] \tag{1}$$
$$v_{t+1}(s) = v_t(s), \quad \forall s \neq s_t \tag{2}$$
where $t = 0, 1, 2, \ldots$. Here, $v_t(s_t)$ is the current estimate of the state value $v_\pi(s_t)$; $\alpha_t(s_t)$ is the learning rate at time step $t$ for state $s_t$.
- At time step $t$, only the value of the currently visited state $s_t$ is updated, while the values of unvisited states $s \neq s_t$ remain unchanged.
- When the context is clear, the retention step in equation (2) is usually omitted.
Analyzing the core update equation:
- TD target: $\bar v_t \doteq r_{t+1} + \gamma v_t(s_{t+1})$
- TD error: $\delta_t \doteq v_t(s_t) - \bar v_t = v_t(s_t) - \big(r_{t+1} + \gamma v_t(s_{t+1})\big)$
The core idea of the algorithm is to continuously move the current estimate $v_t(s_t)$ towards the target value $\bar v_t$. The derivation is as follows:
$$v_{t+1}(s_t) - \bar v_t = \big[v_t(s_t) - \bar v_t\big] - \alpha_t(s_t)\big[v_t(s_t) - \bar v_t\big] = \big(1 - \alpha_t(s_t)\big)\big[v_t(s_t) - \bar v_t\big]$$
Taking absolute values gives $|v_{t+1}(s_t) - \bar v_t| = |1 - \alpha_t(s_t)|\,|v_t(s_t) - \bar v_t|$. When the learning rate satisfies $0 < \alpha_t(s_t) < 1$, it is clear that $|1 - \alpha_t(s_t)| < 1$, so $|v_{t+1}(s_t) - \bar v_t| < |v_t(s_t) - \bar v_t|$. This means the distance between the estimate and the target is gradually compressed, driving the estimate to converge towards the target.
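As a quick numeric illustration of this contraction (a toy sketch; the target value and learning rate here are arbitrary choices, not anything from the text):

```python
# Toy illustration of the contraction: each update multiplies the distance
# to a fixed target by (1 - alpha). All numbers are arbitrary.
alpha = 0.1
v, target = 0.0, 4.0
gaps = []
for _ in range(5):
    v = v - alpha * (v - target)   # the TD-style update toward the target
    gaps.append(abs(v - target))

# every gap is (1 - alpha) = 0.9 times the previous one
ratios = [g2 / g1 for g1, g2 in zip(gaps, gaps[1:])]
```
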
Regarding the TD error:
- It is the difference between the value estimates at two consecutive time steps.
- It reflects the deficiency between the current estimate $v_t$ and the true value $v_\pi$. To prove this, define the error under the true value:
$$\delta_{\pi,t} \doteq v_\pi(s_t) - \big(r_{t+1} + \gamma v_\pi(s_{t+1})\big)$$
Taking the expectation based on the definition of $v_\pi$:
$$\mathbb{E}\big[\delta_{\pi,t} \mid S_t = s_t\big] = v_\pi(s_t) - \mathbb{E}\big[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s_t\big] = 0$$
As seen from the above equation, the algorithm's goal is to satisfy the Bellman equation in an expected sense.
- If $v_t = v_\pi$, the expectation of $\delta_t$ should be zero.
- Conversely, if $\mathbb{E}[\delta_t]$ is not zero, it indicates that $v_t$ has not yet converged to $v_\pi$.
- The TD error can essentially be viewed as innovation, representing the feedback obtained from the latest experience that can be used to correct current knowledge.
The above process is mainly used for Policy Evaluation. The advantage of the TD algorithm is that it does not require a complete model of the environment (such as transition probabilities) and can directly approximate the Bellman equation through sampled data.
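The update rule above can be sketched in code. This is a minimal tabular TD(0) policy-evaluation loop on a hypothetical two-state chain (state 0 steps to state 1 with reward 0; state 1 terminates with reward 1), so with $\gamma = 0.9$ the true values are $v(1) = 1$ and $v(0) = 0.9$; the environment and constants are illustrative assumptions:

```python
# Minimal tabular TD(0) sketch on a hypothetical two-state chain:
# state 0 -> state 1 (reward 0), state 1 -> terminal (reward 1).
def step(s):
    """Return (reward, next_state, done) for this toy environment."""
    if s == 0:
        return 0.0, 1, False
    return 1.0, None, True

GAMMA, ALPHA = 0.9, 0.1
v = {0: 0.0, 1: 0.0}
for _ in range(2000):
    s, done = 0, False
    while not done:
        r, s_next, done = step(s)
        target = r if done else r + GAMMA * v[s_next]  # TD target
        v[s] -= ALPHA * (v[s] - target)                # update only v(s_t)
        s = s_next
# v converges toward the true values: v(1) = 1, v(0) = 0.9 * v(1) = 0.9
```

Note that no transition probabilities are used anywhere: the estimate is driven purely by sampled `(s, r, s')` data.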
Bellman expectation equation#
The state value of policy $\pi$ is defined as:
$$v_\pi(s) = \mathbb{E}[G_t \mid S_t = s]$$
where $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$ is the discounted return. Since:
$$G_t = R_{t+1} + \gamma G_{t+1}, \qquad \mathbb{E}[G_{t+1} \mid S_t = s] = \mathbb{E}[v_\pi(S_{t+1}) \mid S_t = s]$$
where $S_{t+1}$ is the next state, we can rewrite the equation as the Bellman expectation equation:
$$v_\pi(s) = \mathbb{E}\big[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s\big]$$
We can solve this equation using the concept of stochastic approximation (Robbins-Monro algorithm). Define:
$$g\big(v(s)\big) \doteq v(s) - \mathbb{E}\big[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s\big]$$
so that solving the Bellman equation amounts to finding the root of $g(v(s)) = 0$. Since in practice we can only obtain samples $r$ and $s'$ for $R_{t+1}$ and $S_{t+1}$, our noisy observation is:
$$\tilde g\big(v(s)\big) = v(s) - \big(r + \gamma v_\pi(s')\big)$$
It can be decomposed into the sum of the true function value and noise:
$$\tilde g\big(v(s)\big) = g\big(v(s)\big) + \underbrace{\Big(\mathbb{E}\big[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s\big] - \big(r + \gamma v_\pi(s')\big)\Big)}_{\eta}$$
Applying the Robbins-Monro update rule:
$$v_{k+1}(s) = v_k(s) - \alpha_k\, \tilde g\big(v_k(s)\big) = v_k(s) - \alpha_k\Big[v_k(s) - \big(r_k + \gamma v_\pi(s'_k)\big)\Big]$$
Compared to the theoretical equation, the practically applied TD algorithm makes two core approximations (modifications):
- Replacing repeated i.i.d. samples $\{(s, r, s')\}$ of a fixed state $s$ with actual trajectory samples $\{(s_t, r_{t+1}, s_{t+1})\}$. That is, instead of repeatedly sampling large amounts of data at a specific state $s$, a single-step update is performed whenever $s_t$ is visited, while unvisited states remain unchanged.
- Replacing the unknown true value $v_\pi(s_{t+1})$ with the current estimate $v_t(s_{t+1})$, thereby introducing the Bootstrapping mechanism.
TD vs MC#
| Feature | TD / Sarsa learning | MC learning |
|---|---|---|
| Online/Offline | Online: TD learning is online. It can update state/action values immediately after executing a single-step transition and receiving a reward. | Offline: MC learning is offline. It must wait until the end of an episode to update using the full trajectory return. |
| Task Type | Continuing tasks: Thanks to its single-step update feature, it can seamlessly handle both episodic and infinitely long continuing tasks. | Episodic tasks: Because it requires waiting for the terminal state to calculate the total return, it can only handle episodic tasks with clear termination states. |
| Bootstrapping | Bootstrapping: Updates rely on the current estimate of the next state’s value (updating estimates using estimates), thus requiring an initial guess. | Non-bootstrapping: Directly uses the actual sampled cumulative return to estimate values, without relying on existing estimates of other states. |
| Variance | Low estimation variance: Single-step updates involve fewer random variables. For example, Sarsa only introduces the single-step randomness of $R_{t+1}, S_{t+1}, A_{t+1}$. | High estimation variance: To estimate the value, it needs to accumulate rewards across the entire trajectory $R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$. Assuming an episode length of $L$, the return involves on the order of $L$ random variables, so long sequences introduce significant randomness and variance. |
TD learning of action values: Sarsa#
Sarsa aims to estimate the action-value function $q_\pi(s, a)$ for a given policy $\pi$. The algorithm learns through the following sequence of experiences:
$$(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})$$
For the state-action pair $(s_t, a_t)$ visited at time $t$, the update formula is:
$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})\big)\Big]$$
For other unvisited state-action pairs $(s, a) \neq (s_t, a_t)$, their values remain unchanged:
$$q_{t+1}(s, a) = q_t(s, a)$$
Where:
- $q_t(s_t, a_t)$: The current estimate of the action value $q_\pi(s_t, a_t)$.
- $\alpha_t(s_t, a_t)$: The learning rate, usually decaying over time to ensure convergence.
- $r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})$: TD target. This reflects the “On-policy” nature of Sarsa: it strictly uses the actual next action $a_{t+1}$ taken by the current policy to calculate the target and update the current value.
Strictly speaking, the single-step formula above belongs to the policy evaluation part of Sarsa. In actual control problems, the Sarsa algorithm typically refers to the complete control loop that combines this evaluation formula with a policy improvement mechanism (such as $\epsilon$-greedy).
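A hedged sketch of this complete control loop (the evaluation update plus $\epsilon$-greedy improvement), on a hypothetical 1-D chain with states 0..3 and actions $\{-1, +1\}$, where reaching state 3 yields reward 1 and ends the episode; all constants are illustrative:

```python
import random

# Sarsa control loop sketch on a toy 1-D chain (states 0..3, goal at 3).
N, GAMMA, ALPHA, EPS = 4, 0.9, 0.1, 0.1
q = {(s, a): 0.0 for s in range(N) for a in (-1, 1)}

def eps_greedy(s):
    """Behavior policy = policy being improved: epsilon-greedy w.r.t. q."""
    if random.random() < EPS:
        return random.choice((-1, 1))
    return max((-1, 1), key=lambda a: q[(s, a)])

random.seed(0)
for _ in range(500):
    s, a = 0, eps_greedy(0)
    while True:
        s_next = min(max(s + a, 0), N - 1)   # walk on the chain, clamped
        done = s_next == N - 1
        r = 1.0 if done else 0.0
        a_next = eps_greedy(s_next)          # the actually-executed next action
        target = r if done else r + GAMMA * q[(s_next, a_next)]  # on-policy target
        q[(s, a)] -= ALPHA * (q[(s, a)] - target)                # Sarsa update
        if done:
            break
        s, a = s_next, a_next
```

After training, the greedy action at every interior state should be $+1$ (toward the goal), illustrating that evaluation and improvement interleave within one loop.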
Expected Sarsa#
Expected Sarsa improves upon the target calculation method of standard Sarsa:
$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma\, \mathbb{E}[q_t(s_{t+1}, A)]\big)\Big]$$
Where the expectation term expands to:
$$\mathbb{E}[q_t(s_{t+1}, A)] = \sum_a \pi_t(a \mid s_{t+1})\, q_t(s_{t+1}, a) \doteq v_t(s_{t+1})$$
That is, the expected value of all possible action values in the next state $s_{t+1}$ under the current policy $\pi_t$.
- Change in TD target: Sarsa relies on a specifically sampled action $a_{t+1}$, whereas Expected Sarsa takes the expectation over all possible next actions weighted by their policy probabilities.
- Variance reduction: Although it requires calculating the values of all candidate actions at each time step (slightly increasing computation), it eliminates the randomness introduced by sampling the next action $a_{t+1}$. The random variables involved in the target value are reduced from $\{R_{t+1}, S_{t+1}, A_{t+1}\}$ to $\{R_{t+1}, S_{t+1}\}$, thereby effectively reducing estimation variance.
The mathematical expectation equation it essentially attempts to solve is:
$$q_\pi(s, a) = \mathbb{E}\Big[R_{t+1} + \gamma\, \mathbb{E}_{A_{t+1} \sim \pi}\big[q_\pi(S_{t+1}, A_{t+1})\big] \,\Big|\, S_t = s, A_t = a\Big]$$
Namely:
$$q_\pi(s, a) = \mathbb{E}\big[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a\big]$$
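The difference between the two targets can be sketched in a few lines; the action values and policy probabilities below are made-up numbers purely for illustration:

```python
# Expected Sarsa target vs. sampled Sarsa target for one transition.
# q_next and pi_next are hypothetical values at the next state s_{t+1}.
GAMMA = 0.9
r = 1.0
q_next = {"left": 0.5, "right": 2.0}
pi_next = {"left": 0.25, "right": 0.75}   # current policy pi_t(a | s_{t+1})

# Expected Sarsa: average over all next actions, weighted by the policy
expected_target = r + GAMMA * sum(pi_next[a] * q_next[a] for a in q_next)

# Sarsa would instead use one sampled next action, e.g. a_{t+1} = "left"
sampled_target = r + GAMMA * q_next["left"]
```

The expected target is deterministic given $(r, s_{t+1})$, which is exactly why the $A_{t+1}$ randomness drops out.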
n-step Sarsa#
n-step Sarsa is a generalized method; both single-step Sarsa and full-horizon MC can be seen as its extreme cases. The core difference lies in how many steps the truncated return looks ahead:
$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n q_\pi(S_{t+n}, A_{t+n})$$
The theoretical expectations they attempt to fit:
- Sarsa (1-step):
$$q_\pi(s, a) = \mathbb{E}\big[G_t^{(1)} \mid s, a\big] = \mathbb{E}\big[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid s, a\big]$$
- MC learning ($n \to \infty$):
$$q_\pi(s, a) = \mathbb{E}\big[G_t^{(\infty)} \mid s, a\big] = \mathbb{E}\big[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid s, a\big]$$
- n-step Sarsa:
$$q_\pi(s, a) = \mathbb{E}\big[G_t^{(n)} \mid s, a\big] = \mathbb{E}\big[R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^n q_\pi(S_{t+n}, A_{t+n}) \mid s, a\big]$$
$n$-step Sarsa update algorithm:
The theoretical update formula is:
$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^n q_t(s_{t+n}, a_{t+n})\big)\Big]$$
- In practical execution, because the environment only advances one step at time $t$, the complete data sequence $(r_{t+1}, \ldots, r_{t+n}, s_{t+n}, a_{t+n})$ for the next $n$ steps is not yet available.
- Therefore, the actual $n$-step update needs to be delayed by $n$ time steps. At time $t+n$, after collecting the required data, the algorithm goes back to update the historical state-action pair $(s_t, a_t)$:
$$q_{t+n}(s_t, a_t) = q_{t+n-1}(s_t, a_t) - \alpha_{t+n-1}(s_t, a_t)\Big[q_{t+n-1}(s_t, a_t) - \big(r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^n q_{t+n-1}(s_{t+n}, a_{t+n})\big)\Big]$$
Abbreviated as:
$$q_{t+n}(s_t, a_t) = q_{t+n-1}(s_t, a_t) - \alpha_{t+n-1}(s_t, a_t)\big[q_{t+n-1}(s_t, a_t) - g_t^{(n)}\big]$$
where $g_t^{(n)}$ denotes the sampled $n$-step return in the brackets above.
(Note: The subscript $t+n$ represents the passage of time steps, while $n$ represents the horizon of how far ahead to look.)
- $n$-step Sarsa is an algorithm that balances between Sarsa and MC:
- When $n$ is large, more true reward terms are introduced, making it closer to the MC method. The advantage is smaller bias and lower susceptibility to erroneous initial estimates, but at the cost of accumulating more randomness, leading to higher variance.
- When $n$ is small, it relies more on bootstrapping, making it closer to Sarsa. The variance is lower here, but if the initial estimation error is large, the strong reliance on existing values can easily introduce significant bias.
- Like single-step methods, $n$-step Sarsa can also be embedded into the policy iteration framework, combining with policy improvement steps to find the optimal policy.
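The delayed $n$-step target computation can be sketched as a small helper; the function name and interface are illustrative assumptions:

```python
# Sketch of the delayed n-step Sarsa target: at time t+n we look back and
# combine the n buffered rewards with the bootstrapped tail value.
def n_step_target(rewards, q_tail, gamma, n):
    """rewards = [r_{t+1}, ..., r_{t+n}], q_tail = q(s_{t+n}, a_{t+n})."""
    assert len(rewards) == n
    g = sum(gamma ** k * r for k, r in enumerate(rewards))
    return g + gamma ** n * q_tail

# e.g. n = 2, gamma = 0.9, rewards r_{t+1}=0, r_{t+2}=1, q(s_{t+2}, a_{t+2}) = 2:
# target = 0 + 0.9 * 1 + 0.81 * 2 = 2.52
target = n_step_target([0.0, 1.0], 2.0, 0.9, 2)
```

With `n = 1` this reduces to the Sarsa target; letting the reward list cover the whole episode (and dropping the tail) recovers the MC return.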
Q-learning#
Q-learning is the representative off-policy TD control algorithm; its evaluation directly targets the optimal action value $q^*(s, a)$:
$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\Big[q_t(s_t, a_t) - \big(r_{t+1} + \gamma \max_{a} q_t(s_{t+1}, a)\big)\Big]$$
Mathematically, it attempts to solve the Bellman Optimality Equation (BOE):
$$q(s, a) = \mathbb{E}\big[R_{t+1} + \gamma \max_{a'} q(S_{t+1}, a') \,\big|\, S_t = s, A_t = a\big]$$
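A minimal sketch of this update in a control loop, on a hypothetical 1-D chain (states 0..3, actions $\{-1, +1\}$, reward 1 on reaching state 3); the environment and constants are illustrative assumptions:

```python
import random

# Q-learning sketch on a toy 1-D chain: explore with epsilon-greedy,
# but bootstrap the target from the greedy max over next actions.
N, GAMMA, ALPHA, EPS = 4, 0.9, 0.1, 0.1
q = {(s, a): 0.0 for s in range(N) for a in (-1, 1)}

random.seed(0)
for _ in range(500):
    s = 0
    while True:
        # behavior policy: epsilon-greedy exploration
        if random.random() < EPS:
            a = random.choice((-1, 1))
        else:
            a = max((-1, 1), key=lambda act: q[(s, act)])
        s_next = min(max(s + a, 0), N - 1)
        done = s_next == N - 1
        r = 1.0 if done else 0.0
        # target policy: greedy max over next actions (off-policy)
        target = r if done else r + GAMMA * max(q[(s_next, b)] for b in (-1, 1))
        q[(s, a)] -= ALPHA * (q[(s, a)] - target)
        if done:
            break
        s = s_next
```

Note the asymmetry: the executed action comes from the exploratory policy, while the target always uses `max` over the next state's values.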
on-policy vs off-policy#
In reinforcement learning control problems, we often deal with two types of policies:
- Behavior policy: The policy the agent actually uses to interact with the environment and generate experience samples (typically exploratory, e.g., $\epsilon$-greedy).
- Target policy: The policy the algorithm actually evaluates, attempts to optimize, and ultimately converges to (typically a fully greedy policy).
Based on whether these two policies are identical, algorithms can be classified as:
- On-policy: The target policy is exactly the same as the behavior policy. The rules for generating samples and the rules for evaluation share the same logic.
- Off-policy: The target policy is separated from the behavior policy. The agent explores the world using one policy while using the explored data to silently optimize another (usually superior) policy in the background.
Core advantage of Off-policy: The algorithm can reuse experience data generated by any other policy (even human expert historical actions or random walks) for learning, resulting in higher data efficiency.
How to determine if an algorithm is On-policy or Off-policy? The core lies in examining the source mechanism of the next action used when calculating the TD Target:
- Sarsa is On-policy: Its target $r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})$ relies on the action $a_{t+1}$ actually executed in the next step, which is determined by the current behavior policy.
- Q-learning is Off-policy: Regardless of what action is actually taken next to explore the environment, when calculating the target value it always assumes the greedy optimal action under the current estimate will be taken ($\max_a q_t(s_{t+1}, a)$).
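This criterion can be made concrete by computing both targets for the same transition; the numbers below are purely illustrative:

```python
# Contrast of the two TD targets for one and the same transition.
GAMMA = 0.9
r = 0.5
q_next = {"a1": 1.0, "a2": 3.0}   # hypothetical values at s_{t+1}
a_taken = "a1"                    # action the behavior policy actually executed

sarsa_target = r + GAMMA * q_next[a_taken]            # on-policy: uses a_{t+1}
q_learning_target = r + GAMMA * max(q_next.values())  # off-policy: uses max_a
```

If the behavior policy happens to pick a suboptimal exploratory action, the two targets differ; that difference is exactly the on/off-policy distinction.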
Unified Form and Summary of TD Algorithms#
Value updates can generally be unified into the following form:
$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)\big[q_t(s_t, a_t) - \bar q_t\big]$$
where $\bar q_t$ is the corresponding TD target for each algorithm:
| Algorithm | Expression of TD target ($\bar q_t$) |
|---|---|
| Sarsa | $r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})$ |
| $n$-step Sarsa | $r_{t+1} + \gamma r_{t+2} + \cdots + \gamma^n q_t(s_{t+n}, a_{t+n})$ |
| Expected Sarsa | $r_{t+1} + \gamma \sum_a \pi_t(a \mid s_{t+1})\, q_t(s_{t+1}, a)$ |
| Q-learning | $r_{t+1} + \gamma \max_a q_t(s_{t+1}, a)$ |
| Monte Carlo | $r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots$ |
(Note: When the learning rate $\alpha_t = 1$, the MC form can be directly written as $q_{t+1}(s_t, a_t) = \bar q_t$, i.e., directly replacing the estimate with the single-episode return. Standard MC usually maintains a running mean, which is equivalent to a gradually decaying $\alpha$.)
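The unified form suggests a single update helper with pluggable targets; a small sketch (helper name and numbers are illustrative assumptions):

```python
# The shared update rule: q_{t+1} = q_t - alpha * (q_t - TD_target).
def td_update(q_sa, target, alpha):
    """One unified TD step; only the target differs between algorithms."""
    return q_sa - alpha * (q_sa - target)

# Plug in, e.g., the Q-learning target for a transition with reward 1
# and hypothetical next-state values {0.2, 0.8}:
GAMMA = 0.9
r, q_next = 1.0, {"u": 0.2, "d": 0.8}
new_q = td_update(0.5, r + GAMMA * max(q_next.values()), alpha=0.1)
# the same td_update would accept a Sarsa, Expected Sarsa, or MC target
```
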
Summary of the underlying equations theoretically solved by each algorithm:
| Algorithm | Equation aimed to solve |
|---|---|
| Sarsa | Bellman Expectation: $q_\pi(s, a) = \mathbb{E}\big[R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a\big]$ |
| n-step Sarsa | Bellman Expectation: $q_\pi(s, a) = \mathbb{E}\big[R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^n q_\pi(S_{t+n}, A_{t+n}) \mid S_t = s, A_t = a\big]$ |
| Expected Sarsa | Bellman Expectation: $q_\pi(s, a) = \mathbb{E}\big[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a\big]$ |
| Q-learning | Bellman Optimality: $q(s, a) = \mathbb{E}\big[R_{t+1} + \gamma \max_{a'} q(S_{t+1}, a') \mid S_t = s, A_t = a\big]$ |
| Monte Carlo | Bellman Expectation: $q_\pi(s, a) = \mathbb{E}\big[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s, A_t = a\big]$ |