Maxton's Blog


RL Study Notes: Temporal-Difference Learning

A summary of core Temporal-Difference learning concepts, comparing TD with MC, and detailing mechanisms for Sarsa, n-step Sarsa, and Q-learning.

Temporal-Difference Learning#

TD learning of state values#

  • This algorithm is used to estimate the state value under a given policy $\pi$.
  • This section primarily introduces the foundational TD algorithm for estimating state values (commonly referred to as TD(0)).

Required data/experience:

  • $(s_0, r_1, s_1, \dots, s_t, r_{t+1}, s_{t+1}, \dots)$ or $\{(s_t, r_{t+1}, s_{t+1})\}_t$. These data are generated by interacting with the environment according to a given policy $\pi$.

TD learning algorithm update formulas:

$$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t) \big[ v_t(s_t) - [r_{t+1} + \gamma v_t(s_{t+1})] \big], \quad (1)$$

$$v_{t+1}(s) = v_t(s), \quad \forall s \neq s_t, \quad (2)$$

where $t = 0, 1, 2, \dots$. Here, $v_t(s_t)$ is the current estimate of the state value $v_\pi(s_t)$, and $\alpha_t(s_t)$ is the learning rate at time step $t$ for state $s_t$.

  • At time step $t$, only the value of the currently visited state $s_t$ is updated, while the values of unvisited states $s \neq s_t$ remain unchanged.
  • When the context is clear, the retention step in equation (2) is usually omitted.
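As a concrete illustration, the update rule in equations (1)–(2) can be sketched in a few lines of Python (a minimal tabular sketch; the state names and numbers are made up for illustration):

```python
# Tabular TD(0): update only the visited state, leave all others unchanged.

def td0_update(v, s, r, s_next, alpha, gamma):
    """Apply equation (1) to state s; all other entries of v stay untouched."""
    td_target = r + gamma * v[s_next]
    td_error = v[s] - td_target          # delta_t in the notes
    v[s] = v[s] - alpha * td_error
    return v

# A made-up experience tuple (s_t, r_{t+1}, s_{t+1}) on a two-state toy problem.
v = {"s0": 0.0, "s1": 10.0}
v = td0_update(v, s="s0", r=1.0, s_next="s1", alpha=0.1, gamma=0.9)

print(v["s0"])  # moved 10% of the way towards the target 1 + 0.9 * 10 = 10
print(v["s1"])  # unchanged, as required by equation (2)
```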

Analyzing the core update equation:

$$v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t) \big[ v_t(s_t) - [r_{t+1} + \gamma v_t(s_{t+1})] \big]$$

  • TD target: $\bar{v}_t \doteq r_{t+1} + \gamma v_t(s_{t+1})$
  • TD error: $\delta_t \doteq v_t(s_t) - [r_{t+1} + \gamma v_t(s_{t+1})] = v_t(s_t) - \bar{v}_t$

The core idea of the algorithm is to continuously move the current estimate $v_t(s_t)$ towards the target value $\bar{v}_t$. The derivation is as follows:

$$\begin{aligned} & v_{t+1}(s_t) = v_t(s_t) - \alpha_t(s_t) [v_t(s_t) - \bar{v}_t] \\ \implies & v_{t+1}(s_t) - \bar{v}_t = v_t(s_t) - \bar{v}_t - \alpha_t(s_t) [v_t(s_t) - \bar{v}_t] \\ \implies & v_{t+1}(s_t) - \bar{v}_t = [1 - \alpha_t(s_t)] [v_t(s_t) - \bar{v}_t] \\ \implies & |v_{t+1}(s_t) - \bar{v}_t| = |1 - \alpha_t(s_t)| \, |v_t(s_t) - \bar{v}_t| \end{aligned}$$

When the learning rate $\alpha_t(s_t) \in (0, 1)$, it is clear that $0 < 1 - \alpha_t(s_t) < 1$. This means the distance between the estimate and the target shrinks at every update, driving the estimate to converge towards the target.
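This contraction is easy to verify numerically (a quick sanity check with made-up numbers; the target is held fixed, as in the derivation above):

```python
# Each application of the update shrinks |v - target| by the factor (1 - alpha).
alpha, gamma = 0.2, 0.9
v_s, v_next, r = 5.0, 3.0, 1.0
target = r + gamma * v_next          # fixed TD target, \bar{v}_t in the notes

dist_before = abs(v_s - target)
v_s = v_s - alpha * (v_s - target)   # one TD update towards the target
dist_after = abs(v_s - target)

print(dist_after / dist_before)      # equals 1 - alpha = 0.8 (up to floating point)
```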

Regarding the TD error:

$$\delta_t = v_t(s_t) - [r_{t+1} + \gamma v_t(s_{t+1})]$$

  • It is the difference between value estimates at two consecutive time steps.
  • It reflects the discrepancy between the current estimate $v_t$ and the true value $v_\pi$. To see this, define the error under the true value:

$$\delta_{\pi,t} \doteq v_\pi(s_t) - [r_{t+1} + \gamma v_\pi(s_{t+1})]$$

Taking the expectation based on the definition of $v_\pi(s_t)$:

$$\mathbb{E}[\delta_{\pi,t} \mid S_t = s_t] = v_\pi(s_t) - \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s_t] = 0$$

As seen from the above equation, the algorithm’s goal is to satisfy the Bellman equation in an expected sense.

  • If $v_t = v_\pi$, the expectation of $\delta_t$ should be zero.
  • Conversely, if the expectation of $\delta_t$ is not zero, it indicates that $v_t$ has not yet converged to $v_\pi$.
  • The TD error can essentially be viewed as an innovation: the feedback obtained from the latest experience $(s_t, r_{t+1}, s_{t+1})$ that can be used to correct current knowledge.

The above process is mainly used for Policy Evaluation. The advantage of the TD algorithm is that it does not require a complete model of the environment (such as transition probabilities) and can directly approximate the Bellman equation through sampled data.

Bellman expectation equation#

The state value of policy $\pi$ is defined as:

$$v_\pi(s) = \mathbb{E}[R + \gamma G \mid S = s], \quad s \in \mathcal{S}$$

where $G$ is the discounted return accumulated from the next state onward. Since:

$$\mathbb{E}[G \mid S = s] = \sum_{a} \pi(a \mid s) \sum_{s'} p(s' \mid s, a) v_\pi(s') = \mathbb{E}[v_\pi(S') \mid S = s]$$

where $S'$ is the next state, we can rewrite the equation as the Bellman expectation equation:

$$v_\pi(s) = \mathbb{E}[R + \gamma v_\pi(S') \mid S = s], \quad s \in \mathcal{S}.$$

We can solve this equation using the concept of stochastic approximation (Robbins-Monro algorithm):

$$g(v(s)) = v(s) - \mathbb{E}[R + \gamma v_\pi(S') \mid s] = 0$$

Since in practice we can only obtain samples $r$ and $s'$ of $R$ and $S'$, our noisy observation is:

$$\tilde{g}(v(s)) = v(s) - [r + \gamma v_\pi(s')]$$

It can be decomposed into the sum of the true function value $g(v(s))$ and an observation noise $\eta$:

$$\tilde{g}(v(s)) = \underbrace{\left(v(s) - \mathbb{E}[R + \gamma v_\pi(S') \mid s]\right)}_{g(v(s))} + \underbrace{\left(\mathbb{E}[R + \gamma v_\pi(S') \mid s] - [r + \gamma v_\pi(s')]\right)}_{\eta}.$$

Applying the update rule:

$$\begin{aligned} v_{k+1}(s) &= v_k(s) - \alpha_k \tilde{g}(v_k(s)) \\ &= v_k(s) - \alpha_k \left( v_k(s) - \left[ r_k + \gamma v_\pi(s'_k) \right] \right), \quad k = 1, 2, 3, \dots \end{aligned}$$

Compared to the theoretical equation, the practically applied TD algorithm makes two core approximations (modifications):

  • Replacing $\{s, r, s'\}$ with actual trajectory samples $\{s_t, r_{t+1}, s_{t+1}\}$. That is, instead of repeatedly sampling large amounts of data at a specific state $s$, a single-step update is performed when $s_t$ is visited, while unvisited states remain unchanged.
  • Replacing the unknown true value $v_\pi(s'_k)$ with the current estimate $v_k(s'_k)$, thereby introducing the bootstrapping mechanism.
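The Robbins-Monro scheme itself is easiest to see on a simpler, unrelated toy problem: estimating a mean $\mathbb{E}[X]$ as the root of $g(w) = w - \mathbb{E}[X]$, using only noisy observations $\tilde{g}(w) = w - x$ (all numbers below are made up for illustration):

```python
import random

# Robbins-Monro: find the root of g(w) = w - E[X] from noisy samples x of X.
random.seed(0)

w = 0.0
for k in range(1, 20001):
    x = random.gauss(5.0, 2.0)     # noisy observation; E[X] = 5
    alpha_k = 1.0 / k              # decaying step size (RM conditions)
    w = w - alpha_k * (w - x)      # w_{k+1} = w_k - alpha_k * g_tilde(w_k)

print(w)  # converges towards E[X] = 5
```

With $\alpha_k = 1/k$ this iteration is exactly the running sample mean, which makes the connection between stochastic approximation and incremental averaging explicit.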

TD vs MC#

| Feature | TD / Sarsa learning | MC learning |
| --- | --- | --- |
| Online/Offline | Online: TD learning can update state/action values immediately after executing a single-step transition and receiving a reward. | Offline: MC learning must wait until the end of an episode to update using the full trajectory return. |
| Task Type | Continuing tasks: thanks to its single-step updates, TD can seamlessly handle both episodic and infinitely long continuing tasks. | Episodic tasks: because it must wait for the terminal state to calculate the total return, MC can only handle episodic tasks with clear termination states. |
| Bootstrapping | Bootstrapping: updates rely on the current estimate of the next state's value (updating estimates using estimates), thus requiring an initial guess. | Non-bootstrapping: directly uses the actual sampled cumulative return to estimate values, without relying on existing estimates of other states. |
| Variance | Low estimation variance: single-step updates involve fewer random variables. For example, Sarsa only introduces the single-step randomness of $R_{t+1}, S_{t+1}, A_{t+1}$. | High estimation variance: to estimate a value, it must accumulate rewards across the entire trajectory $R_{t+1} + \gamma R_{t+2} + \dots$; long episodes (length $L$) introduce significant randomness and variance. |
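The variance row can be checked empirically. A small simulation (everything here is a made-up toy setting): episodes of length 10 with i.i.d. unit-variance rewards and $\gamma = 1$, comparing the spread of the full MC return against the one-step TD target (bootstrapping with the true value $v_\pi = 0$):

```python
import random
random.seed(42)

# Toy setting: 10-step episodes, rewards ~ N(0, 1), gamma = 1, v_pi = 0 everywhere.
# MC target: sum of all 10 rewards.  TD target: r_1 + gamma * v_pi = r_1.
mc_targets, td_targets = [], []
for _ in range(10000):
    rewards = [random.gauss(0.0, 1.0) for _ in range(10)]
    mc_targets.append(sum(rewards))   # accumulates 10 random variables
    td_targets.append(rewards[0])     # involves only the single-step randomness

def variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

print(variance(mc_targets))  # close to 10: one unit of variance per step
print(variance(td_targets))  # close to 1: only the first step's randomness
```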

TD learning of action values: Sarsa#

Sarsa aims to estimate the action-value function $q_\pi$ for a given policy $\pi$. The algorithm learns through the following sequence of experiences:

$$\{(s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})\}_t$$

For the state-action pair $(s_t, a_t)$ visited at time $t$, the update formula is:

$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \left[ q_t(s_t, a_t) - (r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})) \right]$$

For other unvisited state-action pairs $(s, a) \neq (s_t, a_t)$, their values remain unchanged:

$$q_{t+1}(s, a) = q_t(s, a)$$

Where:

  • $q_t(s_t, a_t)$: the current estimate of the action value $q_\pi(s_t, a_t)$.
  • $\alpha_t(s_t, a_t)$: the learning rate, usually decaying over time to ensure convergence.
  • $r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})$: the TD target. This reflects the on-policy nature of Sarsa: it strictly uses the actual next action $a_{t+1}$ taken by the current policy $\pi$ to compute the target and update the current value.

Strictly speaking, the single-step formula above belongs to the policy evaluation part of Sarsa. In actual control problems, the Sarsa algorithm typically refers to the complete control loop that combines this evaluation formula with a policy improvement mechanism (such as $\epsilon$-greedy).
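Putting evaluation and $\epsilon$-greedy improvement together, such a control loop might look like the following sketch (the corridor environment, constants, and episode count are all made up for illustration):

```python
import random
random.seed(0)

# Sarsa control on a made-up 1-D corridor: states 0..3, state 3 is terminal.
# Actions: 0 = left, 1 = right. Reward -1 per step, so shorter paths are better.
N_STATES, ACTIONS = 4, (0, 1)
GAMMA, ALPHA, EPS = 0.9, 0.1, 0.1
q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(s, a):
    s_next = min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)
    return s_next, -1.0, s_next == N_STATES - 1

def eps_greedy(s):
    if random.random() < EPS:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: q[(s, a)])

for _ in range(500):                           # episodes
    s, a = 0, eps_greedy(0)
    done = False
    while not done:
        s_next, r, done = step(s, a)
        a_next = eps_greedy(s_next)            # the action actually taken next
        target = r + GAMMA * (0.0 if done else q[(s_next, a_next)])  # q = 0 at terminal
        q[(s, a)] -= ALPHA * (q[(s, a)] - target)   # on-policy Sarsa update
        s, a = s_next, a_next

# The greedy action in every non-terminal state should be "right" (action 1).
print([max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)])
```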

Expected Sarsa#

Expected Sarsa improves upon the target calculation method of standard Sarsa:

$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \left[ q_t(s_t, a_t) - \left(r_{t+1} + \gamma \mathbb{E}[q_t(s_{t+1}, A)]\right) \right]$$

$$q_{t+1}(s, a) = q_t(s, a), \quad \forall (s, a) \neq (s_t, a_t)$$

Where the expectation term expands to:

$$\mathbb{E}[q_t(s_{t+1}, A)] = \sum_a \pi_t(a \mid s_{t+1}) q_t(s_{t+1}, a) \doteq v_t(s_{t+1})$$

That is, the expected value over all possible actions in the next state under the current policy $\pi_t$.

  • Change in TD target: Sarsa relies on a specifically sampled action $a_{t+1}$, whereas Expected Sarsa takes the expectation over all possible next actions weighted by their policy probabilities.
  • Variance reduction: although it requires calculating the values of all candidate actions at each time step (slightly increasing computation), it eliminates the randomness introduced by sampling the next action $A_{t+1}$. The random variables involved in the target are reduced from $\{s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}\}$ to $\{s_t, a_t, r_{t+1}, s_{t+1}\}$, thereby effectively reducing estimation variance.

The mathematical expectation equation it essentially attempts to solve is:

$$q_\pi(s, a) = \mathbb{E} \left[ R_{t+1} + \gamma \, \mathbb{E}_{A_{t+1} \sim \pi(\cdot \mid S_{t+1})} [q_\pi(S_{t+1}, A_{t+1})] \,\middle|\, S_t = s, A_t = a \right]$$

Namely:

$$q_\pi(s, a) = \mathbb{E} \left[ R_{t+1} + \gamma v_\pi(S_{t+1}) \,\middle|\, S_t = s, A_t = a \right]$$
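The expected target is just a probability-weighted average of next-state action values; a tiny sketch with made-up numbers:

```python
# Expected Sarsa target: expectation over next actions under the current policy.
GAMMA = 0.9
q_next = {"left": 2.0, "right": 4.0}       # q_t(s_{t+1}, a) for each action
pi_next = {"left": 0.25, "right": 0.75}    # pi_t(a | s_{t+1})
r = 1.0

expected_v = sum(pi_next[a] * q_next[a] for a in q_next)   # = v_t(s_{t+1})
target = r + GAMMA * expected_v

print(expected_v)  # 0.25 * 2 + 0.75 * 4 = 3.5
print(target)      # r + gamma * expected_v
```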

n-step Sarsa#

n-step Sarsa is a generalization; both single-step Sarsa and full-return MC can be seen as its extreme cases. The core difference lies in how many steps of real reward the truncated return keeps before bootstrapping:

$$\begin{aligned} \text{Sarsa (1-step)} \longleftarrow \quad G_t^{(1)} &= R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \\ G_t^{(2)} &= R_{t+1} + \gamma R_{t+2} + \gamma^2 q_\pi(S_{t+2}, A_{t+2}) \\ &\vdots \\ n\text{-step Sarsa} \longleftarrow \quad G_t^{(n)} &= R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^n q_\pi(S_{t+n}, A_{t+n}) \\ &\vdots \\ \text{MC} \longleftarrow \quad G_t^{(\infty)} &= R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \end{aligned}$$

The theoretical expectations they attempt to fit:

  • Sarsa (1-step):

    $q_{\pi}(s, a) = \mathbb{E}[G_t^{(1)} \mid s, a] = \mathbb{E}[R_{t+1} + \gamma q_{\pi}(S_{t+1}, A_{t+1}) \mid s, a]$
  • MC learning:

    $q_{\pi}(s, a) = \mathbb{E}[G_t^{(\infty)} \mid s, a] = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots \mid s, a]$
  • n-step Sarsa:

    $q_{\pi}(s, a) = \mathbb{E}[G_t^{(n)} \mid s, a] = \mathbb{E}[R_{t+1} + \gamma R_{t+2} + \dots + \gamma^n q_{\pi}(S_{t+n}, A_{t+n}) \mid s, a]$

$n$-step Sarsa update algorithm:

The theoretical update formula is:

$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t) \left[ q_t(s_t, a_t) - [r_{t+1} + \gamma r_{t+2} + \dots + \gamma^n q_t(s_{t+n}, a_{t+n})] \right]$$

  • In practical execution, because the environment only advances one step at time $t$, the complete data sequence $(r_{t+1}, \dots, r_{t+n}, s_{t+n}, a_{t+n})$ for the next $n$ steps is not yet available.
  • Therefore, the actual $n$-step update needs to be delayed by $n$ time steps. At time $t+n$, after collecting the required data, it goes back to update the historical state-action pair $(s_t, a_t)$:

$$q_{t+n}(s_t, a_t) = q_{t+n-1}(s_t, a_t) - \alpha_{t+n-1}(s_t, a_t) \left[ q_{t+n-1}(s_t, a_t) - \left[ r_{t+1} + \dots + \gamma^n q_{t+n-1}(s_{t+n}, a_{t+n}) \right] \right]$$

Abbreviated as:

$$q_{t+n}(s_t, a_t) = q_{t+n-1}(s_t, a_t) - \alpha_{t+n-1}(s_t, a_t) \left[ q_{t+n-1}(s_t, a_t) - \mathrm{Target}_{n\text{-step}} \right]$$

(Note: the subscript $t$ tracks the passage of time steps, while $n$ is the horizon of how far ahead to look.)

  • $n$-step Sarsa is an algorithm that interpolates between Sarsa and MC:
    • When $n$ is large, more true reward terms are used, making it closer to the MC method. The advantage is smaller bias and lower susceptibility to erroneous initial estimates, at the cost of accumulating more randomness and hence higher variance.
    • When $n$ is small, it relies more on bootstrapping, making it closer to Sarsa. The variance is lower, but if the initial estimation error is large, the strong reliance on existing $q$ values can introduce significant bias.
  • Like single-step methods, $n$-step Sarsa can also be embedded into the policy iteration framework, combining with policy improvement steps to find the optimal policy.
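The delayed $n$-step target and the update it feeds can be sketched as follows (the trajectory fragment and constants are made up for illustration):

```python
# n-step Sarsa target, computed at time t+n once the needed data has arrived.
GAMMA, N = 0.9, 3
rewards = [1.0, 0.0, 2.0]                 # r_{t+1}, ..., r_{t+n}
q_boot = 5.0                              # bootstrap term q_{t+n-1}(s_{t+n}, a_{t+n})

target = sum(GAMMA**k * rewards[k] for k in range(N)) + GAMMA**N * q_boot
print(target)  # 1 + 0.81 * 2 + 0.729 * 5 = 6.265

# The update then corrects the *historical* pair (s_t, a_t), n steps late.
alpha, q_old = 0.1, 4.0
q_new = q_old - alpha * (q_old - target)
print(q_new)
```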

Q-learning#

Q-learning is the representative off-policy TD control algorithm; its evaluation directly targets the optimal state-action value:

$$\begin{aligned} q_{t+1}(s_t, a_t) &= q_t(s_t, a_t) - \alpha_t(s_t, a_t) \left[ q_t(s_t, a_t) - \left[ r_{t+1} + \gamma \max_{a \in \mathcal{A}} q_t(s_{t+1}, a) \right] \right], \\ q_{t+1}(s, a) &= q_t(s, a), \quad \forall (s, a) \neq (s_t, a_t), \end{aligned}$$

Mathematically, it attempts to solve the Bellman Optimality Equation (BOE):

$$q_*(s, a) = \mathbb{E} \left[ R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') \,\middle|\, S_t = s, A_t = a \right], \quad \forall s, a.$$
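A minimal Q-learning sketch on a made-up corridor environment. The behavior policy here is uniformly random, yet the greedy (target) policy is still recovered, which is exactly the off-policy property discussed next:

```python
import random
random.seed(0)

# Q-learning on a made-up 1-D corridor: states 0..3, state 3 is terminal,
# reward -1 per step. Behavior policy: uniformly random (pure exploration).
N_STATES, ACTIONS = 4, (0, 1)
GAMMA, ALPHA = 0.9, 0.1
q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(s, a):
    s_next = min(max(s + (1 if a == 1 else -1), 0), N_STATES - 1)
    return s_next, -1.0, s_next == N_STATES - 1

for _ in range(2000):
    s, done = 0, False
    while not done:
        a = random.choice(ACTIONS)                 # behavior policy, not greedy
        s_next, r, done = step(s, a)
        best = 0.0 if done else max(q[(s_next, b)] for b in ACTIONS)
        q[(s, a)] -= ALPHA * (q[(s, a)] - (r + GAMMA * best))   # max in the target
        s = s_next

greedy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)]
print(greedy)  # the learned greedy policy should move right everywhere
```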

on-policy vs off-policy#

In reinforcement learning control problems, we often deal with two types of policies:

  • Behavior policy: the policy the agent actually uses to interact with the environment and generate experience samples (typically exploratory, e.g., $\epsilon$-greedy).
  • Target policy: The policy the algorithm actually evaluates, attempts to optimize, and ultimately converges to (typically a fully greedy policy).

Based on whether these two policies are identical, algorithms can be classified as:

  • On-policy: The target policy is exactly the same as the behavior policy. The rules for generating samples and the rules for evaluation share the same logic.
  • Off-policy: The target policy is separated from the behavior policy. The agent explores the world using one policy while using the explored data to silently optimize another (usually superior) policy in the background.

Core advantage of Off-policy: The algorithm can reuse experience data generated by any other policy (even human expert historical actions or random walks) for learning, resulting in higher data efficiency.

How to determine whether an algorithm is on-policy or off-policy? The key lies in examining where the next action $a_{t+1}$ used in the TD target comes from:

  • Sarsa is on-policy: its target relies on the action $a_{t+1}$ actually executed in the next step, which is determined by the current behavior policy.
  • Q-learning is off-policy: regardless of what action is actually taken next to explore the environment, when calculating the target value it always assumes the optimal action under the current estimate will be taken ($\max_a$).

Unified Form and Summary of TD Algorithms#

Value updates can generally be unified into the following form:

$$q_{t+1}(s_t, a_t) = q_t(s_t, a_t) - \alpha_t(s_t, a_t)[q_t(s_t, a_t) - \bar{q}_t],$$

where $\bar{q}_t$ is the corresponding TD target for each algorithm:

| Algorithm | TD target ($\bar{q}_t$) |
| --- | --- |
| Sarsa | $\bar{q}_t = r_{t+1} + \gamma q_t(s_{t+1}, a_{t+1})$ |
| $n$-step Sarsa | $\bar{q}_t = r_{t+1} + \gamma r_{t+2} + \dots + \gamma^n q_t(s_{t+n}, a_{t+n})$ |
| Expected Sarsa | $\bar{q}_t = r_{t+1} + \gamma \sum_a \pi_t(a \mid s_{t+1}) q_t(s_{t+1}, a)$ |
| Q-learning | $\bar{q}_t = r_{t+1} + \gamma \max_a q_t(s_{t+1}, a)$ |
| Monte Carlo | $\bar{q}_t = r_{t+1} + \gamma r_{t+2} + \dots$ |

(Note: when the learning rate $\alpha_t = 1$, the MC form can be written directly as $q_{t+1}(s_t, a_t) = \bar{q}_t$, i.e., replacing the estimate with the single-episode return. Standard MC usually maintains a running mean, which is equivalent to a gradually decaying $\alpha$.)
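The unified form suggests one generic update routine with a pluggable target; a small sketch (the transition numbers and policy below are made up, and only a sampled next action "left" is assumed for the Sarsa target):

```python
# All the tabular TD algorithms above share one generic update; only the
# target differs. q maps (state, action) -> value; numbers are illustrative.
GAMMA = 0.9

def td_update(q, s, a, target, alpha):
    q[(s, a)] -= alpha * (q[(s, a)] - target)   # the unified update rule

# One transition (s0, right) -> (r, s1), with next-state values and policy:
q = {("s1", "left"): 2.0, ("s1", "right"): 4.0, ("s0", "right"): 0.0}
pi = {"left": 0.25, "right": 0.75}              # current policy at s1
r, s_next = 1.0, "s1"

sarsa_target = r + GAMMA * q[(s_next, "left")]                      # sampled a'
exp_sarsa_target = r + GAMMA * sum(pi[a] * q[(s_next, a)] for a in pi)
q_learning_target = r + GAMMA * max(q[(s_next, a)] for a in pi)

td_update(q, "s0", "right", q_learning_target, alpha=0.5)
print(sarsa_target, exp_sarsa_target, q_learning_target)
```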

Summary of the underlying equations theoretically solved by each algorithm:

| Algorithm | Equation aimed to solve |
| --- | --- |
| Sarsa | Bellman expectation: $q_\pi(s, a) = \mathbb{E} [R_{t+1} + \gamma q_\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a]$ |
| $n$-step Sarsa | Bellman expectation: $q_\pi(s, a) = \mathbb{E} [R_{t+1} + \gamma R_{t+2} + \dots + \gamma^n q_\pi(S_{t+n}, A_{t+n}) \mid S_t = s, A_t = a]$ |
| Expected Sarsa | Bellman expectation: $q_\pi(s, a) = \mathbb{E} [R_{t+1} + \gamma \, \mathbb{E}_{A_{t+1}} [q_\pi(S_{t+1}, A_{t+1})] \mid S_t = s, A_t = a]$ |
| Q-learning | Bellman optimality: $q_*(s, a) = \mathbb{E} [R_{t+1} + \gamma \max_a q_*(S_{t+1}, a) \mid S_t = s, A_t = a]$ |
| Monte Carlo | Bellman expectation: $q_\pi(s, a) = \mathbb{E} [R_{t+1} + \gamma R_{t+2} + \dots \mid S_t = s, A_t = a]$ |