As the dimensions of $w$ and $\phi(s)$ increase, the numerical fit can potentially become more accurate.
Although $\hat{v}(s,w)$ is non-linear with respect to the state $s$, it remains linear with respect to the parameter $w$; the non-linear features are encapsulated in the mapping $\phi(s)$.
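As a minimal sketch of this idea, the snippet below uses an assumed polynomial feature map $\phi(s) = [1, s, s^2]^T$ (the specific features are illustrative, not from the text): the value estimate is non-linear in $s$ but linear in $w$.

```python
import numpy as np

# Hypothetical polynomial feature mapping for a scalar state s.
# v_hat is non-linear in s, but linear in the parameter vector w.
def phi(s):
    return np.array([1.0, s, s**2])

def v_hat(s, w):
    # Linear in w: v_hat(s, w) = phi(s)^T w
    return phi(s) @ w

w = np.array([0.5, -1.0, 2.0])
print(v_hat(3.0, w))  # 0.5 - 3.0 + 18.0 = 15.5
```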
The uniform distribution treats all states equally. However, in actual reinforcement learning, some states are visited more frequently and are more critical, so this distribution is often unsuitable.
The stationary distribution describes the long-run behavior of a Markov process. Here, $\{d_\pi(s)\}_{s\in S}$ denotes the set of state probabilities, satisfying $d_\pi(s) \ge 0$ and $\sum_{s\in S} d_\pi(s) = 1$.
$$J(w) = \sum_{s\in S} d_\pi(s)\big(v_\pi(s) - \hat{v}(s,w)\big)^2$$
$d_\pi(s)$ represents the stationary probability of being in state $s$ under policy $\pi$. Weighting by the stationary distribution yields smaller fitting errors on frequently visited states.
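A toy computation of this weighted objective, with an assumed three-state distribution and made-up true/estimated values (all numbers here are illustrative):

```python
import numpy as np

# J(w) = sum_s d_pi(s) * (v_pi(s) - v_hat(s, w))^2
d = np.array([0.5, 0.3, 0.2])       # assumed stationary distribution
v_true = np.array([1.0, 2.0, 3.0])  # assumed true values v_pi(s)
v_est = np.array([1.1, 1.8, 3.5])   # current approximations v_hat(s, w)

# Frequently visited states (large d_pi) contribute more to the objective.
J = np.sum(d * (v_true - v_est) ** 2)
print(J)  # 0.5*0.01 + 0.3*0.04 + 0.2*0.25 = 0.067
```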
The stationary distribution satisfies the following formula:
$$d_\pi^T = d_\pi^T P_\pi$$
where $P_\pi$ is the state transition matrix appearing in the Bellman equation.
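One way to find $d_\pi$ numerically is power iteration: start from any distribution and repeatedly apply $d^T \leftarrow d^T P_\pi$ until it stops changing. The transition matrix below is an invented three-state example.

```python
import numpy as np

# A hypothetical 3-state transition matrix P_pi under a fixed policy.
P = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
])

# Power iteration: repeatedly apply d^T <- d^T P until convergence.
d = np.full(3, 1.0 / 3.0)  # start from the uniform distribution
for _ in range(1000):
    d = d @ P

print(d)                      # approximate stationary distribution
print(np.allclose(d, d @ P))  # d^T = d^T P holds at the fixed point
```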
In practice, Stochastic Gradient Descent (SGD) is commonly used:
$$w_{t+1} = w_t + \alpha_t \big(v_\pi(s_t) - \hat{v}(s_t, w_t)\big)\nabla_w \hat{v}(s_t, w_t)$$
where $s_t$ is a sample of $S$. For brevity, the constant $2$ is absorbed into the learning rate $\alpha_t$. Since the true value $v_\pi(s_t)$ is unknown, it must be replaced with an estimate:
Monte Carlo (MC) based: use the discounted return $g_t$ observed in an episode to approximate $v_\pi(s_t)$.
$$w_{t+1} = w_t + \alpha_t \big(g_t - \hat{v}(s_t, w_t)\big)\nabla_w \hat{v}(s_t, w_t)$$
Temporal Difference (TD) based: the target value $r_{t+1} + \gamma\hat{v}(s_{t+1}, w_t)$ is treated as an approximation of $v_\pi(s_t)$.
$$w_{t+1} = w_t + \alpha_t \big(r_{t+1} + \gamma\hat{v}(s_{t+1}, w_t) - \hat{v}(s_t, w_t)\big)\nabla_w \hat{v}(s_t, w_t)$$
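Both update rules can be sketched for a linear approximator $\hat{v}(s,w) = \phi(s)^T w$. The feature map and the learning-rate/discount defaults below are illustrative assumptions, not values from the text.

```python
import numpy as np

# Toy feature map for a scalar state; an assumption for illustration.
def phi(s):
    return np.array([1.0, float(s)])

def mc_update(w, episode, alpha=0.1, gamma=0.9):
    """MC: fit toward the observed discounted return g_t.
    episode is a list of (state, reward) pairs from one trajectory."""
    g = 0.0
    for s, r in reversed(episode):  # accumulate g_t backwards
        g = r + gamma * g
        # For a linear v_hat, grad_w v_hat(s, w) = phi(s).
        w = w + alpha * (g - phi(s) @ w) * phi(s)
    return w

def td0_update(w, s, r, s_next, alpha=0.1, gamma=0.9):
    """TD(0): fit toward the bootstrapped target r + gamma * v_hat(s', w)."""
    target = r + gamma * (phi(s_next) @ w)
    return w + alpha * (target - phi(s) @ w) * phi(s)

w_mc = mc_update(np.zeros(2), [(0, 1.0), (1, 0.0), (2, 5.0)])
w_td = td0_update(np.zeros(2), s=0, r=1.0, s_next=1)
```

The key practical difference: MC must wait for a full episode before updating, while TD(0) updates after every single transition.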
In RL linear approximation, $\hat{v}(s,w)$ is a scalar (the predicted state value), and $w$ is a vector (the weight parameters).
Deconstructing the Linear Expression
For column vectors $\phi(s) = [\phi_1, \dots, \phi_n]^T$ and $w = [w_1, \dots, w_n]^T$, the inner product is:
$$\hat{v}(s,w) = \sum_{i=1}^{n} \phi_i w_i = \phi(s)^T w$$
Differentiating with Respect to a Vector
The essence of $\nabla_w \hat{v}(s,w)$ is taking the partial derivative of the scalar function with respect to each component of the vector $w$. For the linear form, $\partial \hat{v}/\partial w_i = \phi_i$, so the gradient is simply the feature vector: $\nabla_w \hat{v}(s,w) = \phi(s)$.
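This can be verified numerically: perturbing each component $w_i$ by a small $\epsilon$ and measuring the change in $\phi(s)^T w$ recovers $\phi_i$. The vectors below are arbitrary illustrative values.

```python
import numpy as np

# For v_hat(s, w) = phi(s)^T w, each partial dv_hat/dw_i equals phi_i,
# so grad_w v_hat(s, w) is just the feature vector phi(s).
phi_s = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, 0.5, 0.5])

# Finite-difference check of the analytic gradient.
eps = 1e-6
grad = np.array([
    (phi_s @ (w + eps * np.eye(3)[i]) - phi_s @ w) / eps
    for i in range(3)
])
print(np.allclose(grad, phi_s, atol=1e-4))  # True
```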
Motivation: Sequential data collected in reinforcement learning has strong correlations. Using it directly for training can easily lead to network instability.
Mechanism: Store the interaction data generated between the agent and the environment as tuples $(s, a, r, s')$ in a Replay Buffer $\mathcal{B}$.
Sampling: During training, draw a random mini-batch from the buffer. Sampling usually follows a uniform distribution, which breaks the temporal correlations between consecutive samples and significantly improves data utilization.
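The store-then-sample mechanism above can be sketched as a small class; the capacity and batch size are arbitrary illustrative choices.

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal replay buffer sketch for (s, a, r, s') tuples."""

    def __init__(self, capacity=10000):
        # deque with maxlen: the oldest tuples drop off automatically.
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform random sampling breaks temporal correlations.
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer()
for t in range(100):
    buf.push(t, 0, 1.0, t + 1)  # toy transitions
batch = buf.sample(32)
```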