Maxton's Blog


RL Study Notes: Temporal-Difference Learning

A summary of core Temporal-Difference learning concepts, comparing TD with MC, and detailing mechanisms for Sarsa, n-step Sarsa, and Q-learning.

Temporal-Difference Learning#

TD learning of state values#

  • This algorithm is used to estimate the state value under a given policy $\pi$.
  • This section primarily introduces the foundational TD algorithm for estimating state values (commonly referred to as TD(0)).

Required data/experience:

  • $(s_0, r_1, s_1, \dots, s_t, r_{t+1}, s_{t+1}, \dots)$ or $\{(s_t, r_{t+1}, s_{t+1})\}_t$. These data are generated by interacting with the environment according to a given policy $\pi$.

TD learning algorithm update formulas:

$$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t) \big[ v_t(s_t) - [r_{t+1} + \gamma v_t(s_{t+1})] \big], \quad (1)$$

$$v_{t+1}(s) = v_t(s), \quad \forall s \neq s_t, \quad (2)$$

where $t = 0, 1, 2, \dots$. Here, $v_t(s_t)$ is the current estimate of the state value $v_\pi(s_t)$, and $\alpha_t(s_t)$ is the learning rate at time step $t$ for state $s_t$.

  • At time step $t$, only the value of the currently visited state $s_t$ is updated, while the values of unvisited states $s \neq s_t$ remain unchanged.
  • When the context is clear, the retention step in equation (2) is usually omitted.
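As a concrete illustration, the update rule in equations (1)–(2) can be sketched in a few lines of Python (a minimal tabular sketch; the state names and numbers are made up for illustration):

```python
# Tabular TD(0): update only the visited state, leave all others unchanged.

def td0_update(v, s, r, s_next, alpha, gamma):
    """Apply equation (1) to state s; all other entries of v stay untouched."""
    td_target = r + gamma * v[s_next]
    td_error = v[s] - td_target          # delta_t in the notes
    v[s] = v[s] - alpha * td_error
    return v

# A made-up experience tuple (s_t, r_{t+1}, s_{t+1}) on a two-state toy problem.
v = {"s0": 0.0, "s1": 10.0}
v = td0_update(v, s="s0", r=1.0, s_next="s1", alpha=0.1, gamma=0.9)

print(v["s0"])  # moved 10% of the way towards the target 1 + 0.9 * 10 = 10
print(v["s1"])  # unchanged, as required by equation (2)
```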

Analyzing the core update equation:

$$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t) \big[ v_t(s_t) - [r_{t+1} + \gamma v_t(s_{t+1})] \big]$$

  • TD target: $\bar{v}_t \doteq r_{t+1} + \gamma v_t(s_{t+1})$
  • TD error: $\delta_t \doteq v_t(s_t) - [r_{t+1} + \gamma v_t(s_{t+1})] = v_t(s_t) - \bar{v}_t$

The core idea of the algorithm is to continuously move the current estimate $v_t(s_t)$ towards the target value $\bar{v}_t$. The derivation is as follows:

$$\begin{aligned} & v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t) [v_t(s_t) - \bar{v}_t] \\ \implies & v_{t+1}(s_t) - \bar{v}_t = v_t(s_t) - \bar{v}_t - \alpha_t(s_t) [v_t(s_t) - \bar{v}_t] \\ \implies & v_{t+1}(s_t) - \bar{v}_t = [1 - \alpha_t(s_t)] [v_t(s_t) - \bar{v}_t] \\ \implies & |v_{t+1}(s_t) - \bar{v}_t| = |1 - \alpha_t(s_t)| \, |v_t(s_t) - \bar{v}_t| \end{aligned}$$

When the learning rate $\alpha_t(s_t) \in (0, 1)$, it is clear that $0 < 1 - \alpha_t(s_t) < 1$. This means the distance between the estimate and the target shrinks at every update, driving the estimate to converge towards the target.
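This contraction is easy to verify numerically (a quick sanity check with made-up numbers; the target is held fixed, as in the derivation above):

```python
# Each application of the update shrinks |v - target| by the factor (1 - alpha).
alpha, gamma = 0.2, 0.9
v_s, v_next, r = 5.0, 3.0, 1.0
target = r + gamma * v_next          # fixed TD target, \bar{v}_t in the notes

dist_before = abs(v_s - target)
v_s = v_s - alpha * (v_s - target)   # one TD update towards the target
dist_after = abs(v_s - target)

print(dist_after / dist_before)      # equals 1 - alpha = 0.8 (up to floating point)
```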

Regarding the TD error:

$$\delta_t = v_t(s_t) - [r_{t+1} + \gamma v_t(s_{t+1})]$$

  • It is the difference between value estimates at two consecutive time steps.
  • It reflects the discrepancy between the current estimate $v_t$ and the true value $v_\pi$. To see this, define the error under the true value:

$$\delta_{\pi,t} \doteq v_\pi(s_t) - [r_{t+1} + \gamma v_\pi(s_{t+1})]$$

Taking the expectation based on the definition of $v_\pi(s_t)$:

$$\mathbb{E}[\delta_{\pi,t} \mid S_t = s_t] = v_\pi(s_t) - \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s_t] = 0$$

As seen from the above equation, the algorithm’s goal is to satisfy the Bellman equation in an expected sense.

  • If $v_t = v_\pi$, the expectation of $\delta_t$ should be zero.
  • Conversely, if the expectation of $\delta_t$ is not zero, it indicates that $v_t$ has not yet converged to $v_\pi$.
  • The TD error can essentially be viewed as an innovation: the feedback obtained from the latest experience $(s_t, r_{t+1}, s_{t+1})$ that can be used to correct current knowledge.

The above process is mainly used for Policy Evaluation. The advantage of the TD algorithm is that it does not require a complete model of the environment (such as transition probabilities) and can directly approximate the Bellman equation through sampled data.

Bellman expectation equation#

The state value of policy $\pi$ is defined as:

$$v_\pi(s) = \mathbb{E}[R + \gamma G \mid S = s], \quad s \in \mathcal{S}$$

where $G$ is the discounted return accumulated from the next state onward. Since:

$$\mathbb{E}[G \mid S = s] = \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a) v_\pi(s') = \mathbb{E}[v_\pi(S') \mid S = s]$$

where $S'$ is the next state, we can rewrite the equation as the Bellman expectation equation:

$$v_\pi(s) = \mathbb{E}[R + \gamma v_\pi(S') \mid S = s], \quad s \in \mathcal{S}.$$

We can solve this equation using the concept of stochastic approximation (Robbins-Monro algorithm):

$$g(v(s)) = v(s) - \mathbb{E}[R + \gamma v_\pi(S') \mid s] = 0$$

Since in practice we can only obtain samples $r$ and $s'$ of $R$ and $S'$, our noisy observation is:

$$\tilde{g}(v(s)) = v(s) - [r + \gamma v_\pi(s')]$$

It can be decomposed into the sum of the true function value $g(v(s))$ and an observation noise $\eta$:

$$\tilde{g}(v(s)) = \underbrace{\left(v(s) - \mathbb{E}[R + \gamma v_\pi(S') \mid s]\right)}_{g(v(s))} + \underbrace{\left(\mathbb{E}[R + \gamma v_\pi(S') \mid s] - [r + \gamma v_\pi(s')]\right)}_{\eta}.$$

Applying the update rule:

$$\begin{aligned} v_{k+1}(s) &= v_k(s) - \alpha_k \tilde{g}(v_k(s)) \\ &= v_k(s) - \alpha_k \left( v_k(s) - \left[ r_k + \gamma v_\pi(s'_k) \right] \right), \quad k = 1, 2, 3, \dots \end{aligned}$$

Compared to the theoretical equation, the practically applied TD algorithm makes two core approximations (modifications):

  • Replacing $\{s, r, s'\}$ with actual trajectory samples $\{s_t, r_{t+1}, s_{t+1}\}$. That is, instead of repeatedly sampling large amounts of data at a specific state $s$, a single-step update is performed when $s_t$ is visited, while unvisited states remain unchanged.
  • Replacing the unknown true value $v_\pi(s'_k)$ with the current estimate $v_k(s'_k)$, thereby introducing the bootstrapping mechanism.
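The Robbins-Monro scheme itself is easiest to see on a simpler, unrelated toy problem: estimating a mean $\mathbb{E}[X]$ as the root of $g(w) = w - \mathbb{E}[X]$, using only noisy observations $\tilde{g}(w) = w - x$ (all numbers below are made up for illustration):

```python
import random

# Robbins-Monro: find the root of g(w) = w - E[X] from noisy samples x of X.
random.seed(0)

w = 0.0
for k in range(1, 20001):
    x = random.gauss(5.0, 2.0)     # noisy observation; E[X] = 5
    alpha_k = 1.0 / k              # decaying step size (RM conditions)
    w = w - alpha_k * (w - x)      # w_{k+1} = w_k - alpha_k * g_tilde(w_k)

print(w)  # converges towards E[X] = 5
```

With $\alpha_k = 1/k$ this iteration is exactly the running sample mean, which makes the connection between stochastic approximation and incremental averaging explicit.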

TD vs MC#

| Feature | TD / Sarsa learning | MC learning |
| --- | --- | --- |
| Online/Offline | Online: TD learning can update state/action values immediately after executing a single-step transition and receiving a reward. | Offline: MC learning must wait until the end of an episode to update using the full trajectory return. |
| Task Type | Continuing tasks: thanks to its single-step updates, TD can seamlessly handle both episodic and infinitely long continuing tasks. | Episodic tasks: because it must wait for the terminal state to calculate the total return, MC can only handle episodic tasks with clear termination states. |
| Bootstrapping | Bootstrapping: updates rely on the current estimate of the next state's value (updating estimates using estimates), thus requiring an initial guess. | Non-bootstrapping: directly uses the actual sampled cumulative return to estimate values, without relying on existing estimates of other states. |
| Variance | Low estimation variance: single-step updates involve fewer random variables. For example, Sarsa only introduces the single-step randomness of $R_{t+1}, S_{t+1}, A_{t+1}$. | High estimation variance: to estimate a value, it must accumulate rewards across the entire trajectory $R_{t+1} + \gamma R_{t+2} + \dots$; long episodes (length $L$) introduce significant randomness and variance. |
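The variance row can be checked empirically. A small simulation (everything here is a made-up toy setting): episodes of length 10 with i.i.d. unit-variance rewards and $\gamma = 1$, comparing the spread of the full MC return against the one-step TD target (bootstrapping with the true value $v_\pi = 0$):

```python
import random
random.seed(42)

# Toy setting: 10-step episodes, rewards ~ N(0, 1), gamma = 1, v_pi = 0 everywhere.
# MC target: sum of all 10 rewards.  TD target: r_1 + gamma * v_pi = r_1.
mc_targets, td_targets = [], []
for _ in range(10000):
    rewards = [random.gauss(0.0, 1.0) for _ in range(10)]
    mc_targets.append(sum(rewards))   # accumulates 10 random variables
    td_targets.append(rewards[0])     # involves only the single-step randomness

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

print(variance(mc_targets))  # close to 10: one unit of variance per step
print(variance(td_targets))  # close to 1: only the first step's randomness
```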

TD learning of action values: Sarsa#

Sarsa aims to estimate the action-value function $q_\pi$ for a given policy $\pi$. The algorithm learns through the following sequence of experiences:

$$\{(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})\}_t$$

For the state-action pair $(s_t, a_t)$ visited at time $t$, the update formula is:

$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \left[ q_t(s_t, a_t) - (r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})) \right]$$

For other unvisited state-action pairs $(s, a) \neq (s_t, a_t)$, their values remain unchanged:

$$q_{t+1}(s, a) = q_t(s, a)$$

Where:

  • $q_t(s_t, a_t)$: the current estimate of the action value $q_\pi(s_t, a_t)$.
  • $\alpha_t(s_t, a_t)$: the learning rate, usually decaying over time to ensure convergence.
  • $r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})$: the TD target. This reflects the on-policy nature of Sarsa: it strictly uses the actual next action $a_{t+1}$ taken by the current policy $\pi$ to compute the target and update the current value.

Strictly speaking, the single-step formula above belongs to the policy evaluation part of Sarsa. In actual control problems, the Sarsa algorithm typically refers to the complete control loop that combines this evaluation formula with a policy improvement mechanism (such as $\epsilon$-greedy).
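Putting evaluation and $\epsilon$-greedy improvement together, such a control loop might look like the following sketch (the corridor environment, constants, and episode count are all made up for illustration):

```python
import random
random.seed(0)

# Sarsa control on a made-up 1-D corridor: states 0..3, state 3 is terminal.
# Actions: 0 = left, 1 = right. Reward -1 per step, so shorter paths are better.
N_STATES, ACTIONS = 4, (0, 1)
GAMMA, ALPHA, EPS = 0.9, 0.1, 0.1
q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(s, a):
    s_next = min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)
    return s_next, -1.0, s_next == N_STATES - 1

def eps_greedy(s):
    if random.random() < EPS:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[(s, a)])

for _ in range(500):                           # episodes
    s, a = 0, eps_greedy(0)
    done = False
    while not done:
        s_next, r, done = step(s, a)
        a_next = eps_greedy(s_next)            # the action actually taken next
        target = r + GAMMA * (0.0 if done else q[(s_next, a_next)])  # q = 0 at terminal
        q[(s, a)] -= ALPHA * (q[(s, a)] - target)   # on-policy Sarsa update
        s, a = s_next, a_next

# The greedy action in every non-terminal state should be "right" (action 1).
print([max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)])
```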

Expected Sarsa#

Expected Sarsa improves upon the target calculation method of standard Sarsa:

$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \left[ q_t(s_t, a_t) - \left(r_{t+1} + \gamma \mathbb{E}[q_t(s_{t+1}, A)]\right) \right]$$

$$q_{t+1}(s, a) = q_t(s, a), \quad \forall (s, a) \neq (s_t, a_t)$$

Where the expectation term expands to:

$$\mathbb{E}[q_t(s_{t+1}, A)] = \sum_a \pi_t(a \mid s_{t+1}) q_t(s_{t+1}, a) \doteq v_t(s_{t+1})$$

That is, the expected value over all possible actions in the next state under the current policy $\pi_t$.

  • Change in TD target: Sarsa relies on a specifically sampled action $a_{t+1}$, whereas Expected Sarsa takes the expectation over all possible next actions weighted by their policy probabilities.
  • Variance reduction: although it requires calculating the values of all candidate actions at each time step (slightly increasing computation), it eliminates the randomness introduced by sampling the next action $A_{t+1}$. The random variables involved in the target are reduced from $\{s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}\}$ to $\{s_t, a_t, r_{t+1}, s_{t+1}\}$, thereby effectively reducing estimation variance.

The mathematical expectation equation it essentially attempts to solve is:

$$q_\pi(s, a) = \mathbb{E} \left[ R_{t+1} + \gamma \, \mathbb{E}_{A_{t+1} \sim \pi(\cdot \mid S_{t+1})} [q_\pi(S_{t+1}, A_{t+1})] \,\middle|\, S_t = s, A_t = a \right]$$

Namely:

$$q_\pi(s, a) = \mathbb{E} \left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \,\middle|\, S_t = s, A_t = a \right]$$
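The expected target is just a probability-weighted average of next-state action values; a tiny sketch with made-up numbers:

```python
# Expected Sarsa target: expectation over next actions under the current policy.
GAMMA = 0.9
q_next = {"left": 2.0, "right": 4.0}       # q_t(s_{t+1}, a) for each action
pi_next = {"left": 0.25, "right": 0.75}    # pi_t(a | s_{t+1})
r = 1.0

expected_v = sum(pi_next[a] * q_next[a] for a in q_next)   # = v_t(s_{t+1})
target = r + GAMMA * expected_v

print(expected_v)  # 0.25 * 2 + 0.75 * 4 = 3.5
print(target)      # r + gamma * expected_v
```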

n-step Sarsa#

n-step Sarsa is a generalization; both single-step Sarsa and full-return MC can be seen as its extreme cases. The core difference lies in how many steps of real reward the truncated return keeps before bootstrapping:

$$\begin{aligned} \text{Sarsa (1-step)} \longleftarrow \quad G_t^{(1)} &= R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \\ G_t^{(2)} &= R_{t+1} + \gamma R_{t+2} + \gamma^2 q_\pi(S_{t+2}, A_{t+2}) \\ &\vdots \\ n\text{-step Sarsa} \longleftarrow \quad G_t^{(n)} &= R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^n q_\pi(S_{t+n}, A_{t+n}) \\ &\vdots \\ \text{MC} \longleftarrow \quad G_t^{(\infty)} &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \end{aligned}$$

The theoretical expectations they attempt to fit:

  • Sarsa (1-step):

    $q_{\pi}(s, a) = \mathbb{E}[G_t^{(1)} \mid s, a] = \mathbb{E}[R_{t+1} + \gamma q_{\pi}(S_{t+1}, A_{t+1}) \mid s, a]$
  • MC learning:

    $q_{\pi}(s, a) = \mathbb{E}[G_t^{(\infty)} \mid s, a] = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid s, a]$
  • n-step Sarsa:

    $q_{\pi}(s, a) = \mathbb{E}[G_t^{(n)} \mid s, a] = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \dots + \gamma^n q_{\pi}(S_{t+n}, A_{t+n}) \mid s, a]$

$n$-step Sarsa update algorithm:

The theoretical update formula is:

$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \left[ q_t(s_t, a_t) - [r_{t+1} + \gamma r_{t+2} + \dots + \gamma^n q_t(s_{t+n}, a_{t+n})] \right]$$

  • In practical execution, because the environment only advances one step at time $t$, the complete data sequence $(r_{t+1}, \dots, r_{t+n}, s_{t+n}, a_{t+n})$ for the next $n$ steps is not yet available.
  • Therefore, the actual $n$-step update needs to be delayed by $n$ time steps. At time $t+n$, after collecting the required data, it goes back to update the historical state-action pair $(s_t, a_t)$:

$$q_{t+n}(s_t, a_t) = q_{t+n-1}(s_t, a_t) - \alpha_{t+n-1}(s_t, a_t) \left[ q_{t+n-1}(s_t, a_t) - \left[ r_{t+1} + \dots + \gamma^n q_{t+n-1}(s_{t+n}, a_{t+n}) \right] \right]$$

Abbreviated as:

$$q_{t+n}(s_t, a_t) = q_{t+n-1}(s_t, a_t) - \alpha_{t+n-1}(s_t, a_t) \left[ q_{t+n-1}(s_t, a_t) - \mathrm{Target}_{n\text{-step}} \right]$$

(Note: the subscript $t$ tracks the passage of time steps, while $n$ is the horizon of how far ahead to look.)

  • $n$-step Sarsa is an algorithm that interpolates between Sarsa and MC:
    • When $n$ is large, more true reward terms are used, making it closer to the MC method. The advantage is smaller bias and lower susceptibility to erroneous initial estimates, at the cost of accumulating more randomness and hence higher variance.
    • When $n$ is small, it relies more on bootstrapping, making it closer to Sarsa. The variance is lower, but if the initial estimation error is large, the strong reliance on existing $q$ values can introduce significant bias.
  • Like single-step methods, $n$-step Sarsa can also be embedded into the policy iteration framework, combining with policy improvement steps to find the optimal policy.
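The delayed $n$-step target and the update it feeds can be sketched as follows (the trajectory fragment and constants are made up for illustration):

```python
# n-step Sarsa target, computed at time t+n once the needed data has arrived.
GAMMA, N = 0.9, 3
rewards = [1.0, 0.0, 2.0]                 # r_{t+1}, ..., r_{t+n}
q_boot = 5.0                              # bootstrap term q_{t+n-1}(s_{t+n}, a_{t+n})

target = sum(GAMMA**k * rewards[k] for k in range(N)) + GAMMA**N * q_boot
print(target)  # 1 + 0.81 * 2 + 0.729 * 5 = 6.265

# The update then corrects the *historical* pair (s_t, a_t), n steps late.
alpha, q_old = 0.1, 4.0
q_new = q_old - alpha * (q_old - target)
print(q_new)
```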

Q-learning#

Q-learning is the representative off-policy TD control algorithm; its evaluation directly targets the optimal state-action value:

$$\begin{aligned} q_{t+1}(s_t, a_t) &= q_t(s_t, a_t) - \alpha_t(s_t, a_t) \left[ q_t(s_t, a_t) - \left[ r_{t+1} + \gamma \max_{a \in \mathcal{A}} q_t(s_{t+1}, a) \right] \right], \\ q_{t+1}(s, a) &= q_t(s, a), \quad \forall (s, a) \neq (s_t, a_t), \end{aligned}$$

Mathematically, it attempts to solve the Bellman Optimality Equation (BOE):

$$q_*(s, a) = \mathbb{E} \left[ R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \,\middle|\, S_t = s, A_t = a \right], \quad \forall s, a.$$
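A minimal Q-learning sketch on a made-up corridor environment. The behavior policy here is uniformly random, yet the greedy (target) policy is still recovered, which is exactly the off-policy property discussed next:

```python
import random
random.seed(0)

# Q-learning on a made-up 1-D corridor: states 0..3, state 3 is terminal,
# reward -1 per step. Behavior policy: uniformly random (pure exploration).
N_STATES, ACTIONS = 4, (0, 1)
GAMMA, ALPHA = 0.9, 0.1
q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(s, a):
    s_next = min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)
    return s_next, -1.0, s_next == N_STATES - 1

for _ in range(2000):
    s, done = 0, False
    while not done:
        a = random.choice(ACTIONS)                 # behavior policy, not greedy
        s_next, r, done = step(s, a)
        best = 0.0 if done else max(q[(s_next, b)] for b in ACTIONS)
        q[(s, a)] -= ALPHA * (q[(s, a)] - (r + GAMMA * best))   # max in the target
        s = s_next

greedy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)]
print(greedy)  # the learned greedy policy should move right everywhere
```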

on-policy vs off-policy#

In reinforcement learning control problems, we often deal with two types of policies:

  • Behavior policy: the policy the agent actually uses to interact with the environment and generate experience samples (typically exploratory, e.g., $\epsilon$-greedy).
  • Target policy: The policy the algorithm actually evaluates, attempts to optimize, and ultimately converges to (typically a fully greedy policy).

Based on whether these two policies are identical, algorithms can be classified as:

  • On-policy: The target policy is exactly the same as the behavior policy. The rules for generating samples and the rules for evaluation share the same logic.
  • Off-policy: The target policy is separated from the behavior policy. The agent explores the world using one policy while using the explored data to silently optimize another (usually superior) policy in the background.

Core advantage of Off-policy: The algorithm can reuse experience data generated by any other policy (even human expert historical actions or random walks) for learning, resulting in higher data efficiency.

How to determine whether an algorithm is on-policy or off-policy? The key lies in examining where the next action $a_{t+1}$ used in the TD target comes from:

  • Sarsa is on-policy: its target relies on the action $a_{t+1}$ actually executed in the next step, which is determined by the current behavior policy.
  • Q-learning is off-policy: regardless of what action is actually taken next to explore the environment, when calculating the target value it always assumes the optimal action under the current estimate will be taken ($\max_a$).

Unified Form and Summary of TD Algorithms#

Value updates can generally be unified into the following form:

$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)[q_t(s_t, a_t) - \bar{q}_t],$$

where $\bar{q}_t$ is the corresponding TD target for each algorithm:

| Algorithm | TD target ($\bar{q}_t$) |
| --- | --- |
| Sarsa | $\bar{q}_t = r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})$ |
| $n$-step Sarsa | $\bar{q}_t = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^n q_t(s_{t+n}, a_{t+n})$ |
| Expected Sarsa | $\bar{q}_t = r_{t+1} + \gamma \sum_a \pi_t(a \mid s_{t+1}) q_t(s_{t+1}, a)$ |
| Q-learning | $\bar{q}_t = r_{t+1} + \gamma \max_a q_t(s_{t+1}, a)$ |
| Monte Carlo | $\bar{q}_t = r_{t+1} + \gamma r_{t+2} + \dots$ |

(Note: when the learning rate $\alpha_t = 1$, the MC form can be written directly as $q_{t+1}(s_t, a_t) = \bar{q}_t$, i.e., replacing the estimate with the single-episode return. Standard MC usually maintains a running mean, which is equivalent to a gradually decaying $\alpha$.)
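The unified form suggests one generic update routine with a pluggable target; a small sketch (the transition numbers and policy below are made up, and only a sampled next action "left" is assumed for the Sarsa target):

```python
# All the tabular TD algorithms above share one generic update; only the
# target differs. q maps (state, action) -> value; numbers are illustrative.
GAMMA = 0.9

def td_update(q, s, a, target, alpha):
    q[(s, a)] -= alpha * (q[(s, a)] - target)   # the unified update rule

# One transition (s0, right) -> (r, s1), with next-state values and policy:
q = {("s1", "left"): 2.0, ("s1", "right"): 4.0, ("s0", "right"): 0.0}
pi = {"left": 0.25, "right": 0.75}              # current policy at s1
r, s_next = 1.0, "s1"

sarsa_target = r + GAMMA * q[(s_next, "left")]                      # sampled a'
exp_sarsa_target = r + GAMMA * sum(pi[a] * q[(s_next, a)] for a in pi)
q_learning_target = r + GAMMA * max(q[(s_next, a)] for a in pi)

td_update(q, "s0", "right", q_learning_target, alpha=0.5)
print(sarsa_target, exp_sarsa_target, q_learning_target)
```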

Summary of the underlying equations theoretically solved by each algorithm:

| Algorithm | Equation aimed to solve |
| --- | --- |
| Sarsa | Bellman expectation: $q_\pi(s, a) = \mathbb{E} [R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$ |
| $n$-step Sarsa | Bellman expectation: $q_\pi(s, a) = \mathbb{E} [R_{t+1} + \gamma R_{t+2} + \dots + \gamma^n q_\pi(S_{t+n}, A_{t+n}) \mid S_t = s, A_t = a]$ |
| Expected Sarsa | Bellman expectation: $q_\pi(s, a) = \mathbb{E} [R_{t+1} + \gamma \, \mathbb{E}_{A_{t+1}} [q_\pi(S_{t+1}, A_{t+1})] \mid S_t = s, A_t = a]$ |
| Q-learning | Bellman optimality: $q_*(s, a) = \mathbb{E} [R_{t+1} + \gamma \max_a q_*(S_{t+1}, a) \mid S_t = s, A_t = a]$ |
| Monte Carlo | Bellman expectation: $q_\pi(s, a) = \mathbb{E} [R_{t+1} + \gamma R_{t+2} + \dots \mid S_t = s, A_t = a]$ |