This section introduces the foundational TD algorithm for estimating the state values of a given policy π, commonly referred to as TD(0).
Required data/experience:
(s0,r1,s1,…,st,rt+1,st+1,…) or {(st,rt+1,st+1)}t. These data are generated by interacting with the environment according to a given policy π.
When the learning rate satisfies αt(st)∈(0,1), we have 0<1−αt(st)<1, so each update contracts the distance between the current estimate and the TD target, driving the estimate toward the target.
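As a concrete illustration, here is a minimal TD(0) policy-evaluation sketch on an invented two-state chain (the chain, rewards, and all names are purely illustrative):

```python
def td0_evaluate(episodes, gamma=0.9, alpha=0.1):
    """TD(0): move v(s_t) toward the target r_{t+1} + gamma * v(s_{t+1})."""
    v = {}
    for episode in episodes:
        for s, r, s_next in episode:        # experience tuples (s_t, r_{t+1}, s_{t+1})
            v.setdefault(s, 0.0)
            v.setdefault(s_next, 0.0)
            target = r + gamma * v[s_next]  # TD target (bootstraps on v itself)
            v[s] -= alpha * (v[s] - target) # contract estimate toward the target
    return v

# Toy deterministic chain: A --r=0--> B --r=1--> T (terminal, value 0)
episodes = [[("A", 0.0, "B"), ("B", 1.0, "T")]] * 2000
v = td0_evaluate(episodes)
# v("B") approaches 1.0 and v("A") approaches gamma * v("B") = 0.9
```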
Regarding the TD error:
δt=vt(st)−[rt+1+γvt(st+1)]
It is the difference between value estimates at two consecutive time steps.
It reflects the discrepancy between the current estimate vt and the true value vπ. To see this, define the error under the true value:
δπ,t≐vπ(st)−[rt+1+γvπ(st+1)]
Taking the conditional expectation and using the definition of vπ(st):
E[δπ,t∣St=st]=vπ(st)−E[Rt+1+γvπ(St+1)∣St=st]=0
As seen from the above equation, the algorithm’s goal is to satisfy the Bellman equation in an expected sense.
If vt=vπ, the expectation of δt should be zero.
Conversely, if δt is not zero, it indicates that vt has not yet converged to vπ.
The TD error can essentially be viewed as innovation, representing the feedback obtained from the latest experience (st,rt+1,st+1) that can be used to correct current knowledge.
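The "zero in expectation" property can be checked numerically with a toy one-step MRP whose true values are known (all numbers here are invented for the example): individual TD errors are far from zero, but they average out.

```python
import random

random.seed(0)
gamma = 0.9
# Toy MRP: from A the episode ends in terminal T with reward 0 or 2,
# each with probability 1/2, so the true value is v_pi(A) = 1, v_pi(T) = 0.
v_pi = {"A": 1.0, "T": 0.0}
deltas = [v_pi["A"] - (random.choice([0.0, 2.0]) + gamma * v_pi["T"])
          for _ in range(100_000)]
mean_delta = sum(deltas) / len(deltas)
# each sampled delta is +1 or -1, yet mean_delta is close to 0:
# with the true values plugged in, the TD error vanishes in expectation
```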
The above process is mainly used for Policy Evaluation. The advantage of the TD algorithm is that it does not require a complete model of the environment (such as transition probabilities) and can directly approximate the Bellman equation through sampled data.
Compared to the theoretical equation, the practically applied TD algorithm makes two core approximations (modifications):
Replacing {s,r,s′} with actual trajectory samples {st,rt+1,st+1}. That is, instead of repeatedly sampling large amounts of data at a specific state s, a single-step update is performed when st is visited, while unvisited states remain unchanged.
Replacing the unknown true value vπ(sk′) with the current estimate vk(sk′), thereby introducing the Bootstrapping mechanism.
Update Timing
Online: TD learning is online. It can update state/action values immediately after executing a single-step transition and receiving a reward.
Offline: MC learning is offline. It must wait until the end of an episode to update using the full trajectory return.
Task Type
TD: Thanks to its single-step update, it seamlessly handles both episodic tasks and infinitely long continuing tasks.
MC: Because it must wait for a terminal state to compute the total return, it can only handle episodic tasks with a clear termination.
Bootstrapping
TD bootstraps: updates rely on the current estimate of the next state’s value (updating estimates using estimates), so an initial guess is required.
MC does not bootstrap: it estimates values directly from the actual sampled cumulative return, without relying on existing estimates of other states.
Variance
TD has low estimation variance: a single-step update involves few random variables; for example, Sarsa only introduces the one-step randomness of Rt+1, St+1, At+1.
MC has high estimation variance: to estimate a value it must accumulate rewards across the entire trajectory Rt+1+γRt+2+…; for an episode of length L, the long sequence introduces substantial randomness and hence variance.
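The variance gap can be seen numerically with a small experiment (all numbers invented): compare the spread of a full-return target, which sums many random rewards, against a one-step target whose bootstrap term is held fixed.

```python
import random
import statistics

random.seed(0)
gamma, L, n_samples = 1.0, 50, 2000

def sample_rewards():
    """One episode of L i.i.d. noisy rewards."""
    return [random.gauss(0.0, 1.0) for _ in range(L)]

# MC-style target: the full return accumulates the randomness of all L rewards
mc_targets = [sum(gamma**k * r for k, r in enumerate(sample_rewards()))
              for _ in range(n_samples)]
# TD-style target: only r_{t+1} is random (the bootstrap value is a fixed estimate)
td_targets = [sample_rewards()[0] for _ in range(n_samples)]

var_td = statistics.variance(td_targets)   # about 1
var_mc = statistics.variance(mc_targets)   # about L = 50
```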
For all other state-action pairs (s,a)≠(st,at) that are not visited at time t, the values remain unchanged:
qt+1(s,a)=qt(s,a)
Where:
qt(st,at): The current estimate of the action value qπ(st,at).
αt(st,at): The learning rate, usually decaying over time to ensure convergence.
rt+1+γqt(st+1,at+1): TD target. This reflects the “On-policy” nature of Sarsa, strictly using the actual next action at+1 taken by the current policy π to calculate the target and update the current value.
Strictly speaking, the single-step formula above belongs to the policy evaluation part of Sarsa. In actual control problems, the Sarsa algorithm typically refers to the complete control loop that combines this evaluation formula with a policy improvement mechanism (such as ϵ-greedy).
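Putting the evaluation update and ϵ-greedy improvement together, here is a minimal Sarsa control loop on an invented deterministic corridor (the environment, constants, and random tie-breaking are all illustrative choices, not a prescribed implementation):

```python
import random
from collections import defaultdict

random.seed(1)
ACTIONS = (-1, +1)                  # hypothetical corridor actions: left / right
GOAL, EPS, ALPHA, GAMMA = 4, 0.1, 0.1, 0.9

def step(s, a):
    """Deterministic corridor 0..4: reward 1 only on reaching the goal."""
    s2 = min(max(s + a, 0), GOAL)
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

def epsilon_greedy(q, s):
    if random.random() < EPS:
        return random.choice(ACTIONS)
    # random tie-breaking so the untrained agent does not get stuck
    return max(ACTIONS, key=lambda a: (q[(s, a)], random.random()))

q = defaultdict(float)
for _ in range(500):                # control loop: evaluate + improve
    s = 0
    a = epsilon_greedy(q, s)
    for _ in range(200):            # cap episode length
        s2, r, done = step(s, a)
        a2 = epsilon_greedy(q, s2)
        # on-policy evaluation: the target uses a2, the action actually taken next
        q[(s, a)] += ALPHA * (r + GAMMA * q[(s2, a2)] - q[(s, a)])
        s, a = s2, a2
        if done:
            break
# the greedy policy learned by Sarsa moves right toward the goal from every state
```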
That is, the expected value of all possible action values in the next state under the current policy πt.
Change in TD target: Sarsa relies on a specifically sampled action at+1, whereas Expected Sarsa takes the expectation over all possible next actions weighted by their policy probabilities.
Variance reduction: Although it requires calculating the values of all candidate actions at each time step (slightly increasing computation), it eliminates the randomness introduced by selecting the next action At+1. The random variables involved in the target value are reduced from {st,at,rt+1,st+1,at+1} to {st,at,rt+1,st+1}, thereby effectively reducing estimation variance.
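A sketch of computing the Expected Sarsa target (the q-values and policy probabilities below are hypothetical, chosen only for the worked arithmetic):

```python
def expected_sarsa_target(q, s_next, r, policy_probs, gamma=0.9):
    """TD target that averages over all next actions under pi_t(.|s_next)
    instead of sampling a single a_{t+1}."""
    expected_q = sum(p * q[(s_next, a)] for a, p in policy_probs.items())
    return r + gamma * expected_q

q = {("s1", "left"): 2.0, ("s1", "right"): 4.0}
pi = {"left": 0.25, "right": 0.75}
target = expected_sarsa_target(q, "s1", r=1.0, policy_probs=pi)
# 1.0 + 0.9 * (0.25 * 2.0 + 0.75 * 4.0) = 1.0 + 0.9 * 3.5 = 4.15
```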
The mathematical expectation equation it essentially attempts to solve is:
n-step Sarsa is a generalization: both one-step Sarsa and full-episode MC are its extreme cases. The core difference lies in how many steps of real rewards the truncated return includes before bootstrapping:
In practical execution, because the environment only advances one step at time t, the complete data sequence (rt+1,…,rt+n,st+n,at+n) for the next n steps is not yet available.
Therefore, the actual n-step update needs to be delayed by n time steps. At time t+n, after collecting the required data, it goes back to update the historical state-action pair (st,at):
(Note: The subscript t represents the passage of time steps, while n represents the horizon of how far ahead to look.)
n-step Sarsa is an algorithm that balances between Sarsa and MC:
When n is large, more true reward terms are introduced, making it closer to the MC method. The advantage is smaller bias and lower susceptibility to erroneous initial estimates, but at the cost of accumulating more randomness, leading to higher variance.
When n is small, it relies more on bootstrapping, making it closer to Sarsa. The variance is lower here, but if the initial estimation error is large, the strong reliance on existing q values can easily introduce significant bias.
Like single-step methods, n-step Sarsa can also be embedded into the policy iteration framework, combining with policy improvement steps to find the optimal policy.
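The truncated n-step target can be sketched as follows (the rewards and bootstrap value are made up for the worked example); the docstring notes the n-step delay discussed above:

```python
def n_step_target(rewards, q_tail, gamma, n):
    """Truncated return r_{t+1} + gamma*r_{t+2} + ... + gamma^(n-1)*r_{t+n}
    plus the bootstrap term gamma^n * q_t(s_{t+n}, a_{t+n}).

    In practice this target can only be formed at time t+n, once the n rewards
    and the pair (s_{t+n}, a_{t+n}) have been observed, so the update to
    (s_t, a_t) is delayed by n steps."""
    assert len(rewards) == n
    g = sum(gamma**k * r for k, r in enumerate(rewards))
    return g + gamma**n * q_tail

# n = 3, gamma = 0.5: 1 + 0.5*0 + 0.25*2 + 0.125*8 = 2.5
target = n_step_target([1.0, 0.0, 2.0], q_tail=8.0, gamma=0.5, n=3)
```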
In reinforcement learning control problems, we often deal with two types of policies:
Behavior policy: The policy the agent actually uses to interact with the environment and generate experience samples (typically exploratory, e.g., ϵ-greedy).
Target policy: The policy the algorithm actually evaluates, attempts to optimize, and ultimately converges to (typically a fully greedy policy).
Based on whether these two policies are identical, algorithms can be classified as:
On-policy: The target policy is exactly the same as the behavior policy. The rules for generating samples and the rules for evaluation share the same logic.
Off-policy: The target policy is separated from the behavior policy. The agent explores the world using one policy while using the explored data to silently optimize another (usually superior) policy in the background.
Core advantage of Off-policy: The algorithm can reuse experience data generated by any other policy (even human expert historical actions or random walks) for learning, resulting in higher data efficiency.
How to determine if an algorithm is On-policy or Off-policy?
The core lies in examining the source mechanism of the next action at+1 used when calculating the TD Target:
Sarsa is On-policy: Its target relies on the action at+1 actually executed in the next step, which is determined by the current behavior policy.
Q-learning is Off-policy: Regardless of what action is actually taken next to explore the environment, when calculating the target value, it always aggressively assumes the optimal action under the current estimate will be taken (maxa).
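The distinction shows up directly in how each target is computed; a minimal sketch with invented q-values:

```python
def sarsa_target(q, r, s_next, a_next, gamma=0.9):
    # on-policy: uses a_next, the action actually sampled by the behavior policy
    return r + gamma * q[(s_next, a_next)]

def q_learning_target(q, r, s_next, actions, gamma=0.9):
    # off-policy: always assumes the greedy action, whatever is actually executed
    return r + gamma * max(q[(s_next, a)] for a in actions)

q = {("s1", "a"): 1.0, ("s1", "b"): 3.0}
# suppose the exploratory behavior policy happened to pick the worse action "a"
t_sarsa = sarsa_target(q, 0.0, "s1", "a")                # 0.9 * 1.0 = 0.9
t_qlearn = q_learning_target(q, 0.0, "s1", ["a", "b"])   # 0.9 * 3.0 = 2.7
```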
where qˉt is the corresponding TD target for each algorithm:
Sarsa: qˉt=rt+1+γqt(st+1,at+1)
n-step Sarsa: qˉt=rt+1+γrt+2+⋯+γ^(n−1)·rt+n+γ^n·qt(st+n,at+n)
Expected Sarsa: qˉt=rt+1+γ∑a πt(a∣st+1)qt(st+1,a)
Q-learning: qˉt=rt+1+γ·maxa qt(st+1,a)
Monte Carlo: qˉt=rt+1+γrt+2+γ^2·rt+3+…
(Note: When the learning rate αt=1, the MC form can be directly written as qt+1(st,at)=qˉt, i.e., directly replacing the estimate with the single-episode return. Standard MC usually maintains a mean, which is equivalent to a gradually decaying α.)
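All the rows above share one update rule and differ only in the target they plug in; a minimal sketch of that shared form (names are illustrative):

```python
def unified_td_update(q, s, a, td_target, alpha):
    """Shared form: q_{t+1}(s,a) = q_t(s,a) - alpha * (q_t(s,a) - td_target).

    Each algorithm in the table supplies its own td_target to this rule."""
    q[(s, a)] -= alpha * (q[(s, a)] - td_target)
    return q

q = {("s0", "a0"): 0.0}
unified_td_update(q, "s0", "a0", td_target=1.0, alpha=0.5)
# the estimate moves halfway toward the target: q[("s0", "a0")] == 0.5
```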
Summary of the underlying equations theoretically solved by each algorithm: