RL Study Notes: Value Function Approximation
Summary of value function approximation in RL, covering linear/non-linear forms, state distributions, gradient methods, DQN, and experience replay.
Value Function Approximation#
Linear Function Form#
$$\hat{v}(s, w) = \phi(s)^T w$$

Where:
- $w$ is the parameter vector.
- $\phi(s)$ is the feature vector of state $s$.
- $\hat{v}(s, w)$ is linear with respect to $w$.
Non-linear Function Form#
$$\hat{v}(s, w) = \phi(s)^T w, \qquad \phi(s) = [1, s, s^2, \ldots, s^n]^T$$

In this case:
- The dimensions of $\phi(s)$ and $w$ increase, potentially making the numerical fit more accurate.
- Although $\hat{v}(s, w)$ is non-linear with respect to the state $s$, it remains linear with respect to the parameter $w$. The non-linearity is encapsulated in the feature mapping $\phi(s)$.
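A minimal sketch of this idea, using a polynomial feature map as the illustrative example (the specific features and parameter values here are assumptions, not from the notes):

```python
import numpy as np

def poly_features(s, order=3):
    """Hypothetical polynomial feature map phi(s) = [1, s, s^2, ..., s^order]^T."""
    return np.array([s ** i for i in range(order + 1)])

w = np.array([0.5, -1.0, 2.0, 0.1])   # made-up parameter vector
phi = poly_features(2.0)              # [1., 2., 4., 8.]
v_hat = phi @ w                       # non-linear in s, but linear in w
```

The estimate `v_hat` changes non-linearly as `s` varies, yet for a fixed state it is a plain dot product in `w`, which is what makes the gradient computations below simple.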
State Value Estimation#
Objective Function:

$$J(w) = \mathbb{E}\left[\left(v_\pi(S) - \hat{v}(S, w)\right)^2\right]$$

- The core objective is to find the optimal parameter $w$ that minimizes this objective function.
- $S$ is a random variable. Its probability distribution is usually taken to be one of the following two types:
Uniform Distribution#
- The uniform distribution treats all states equally, assigning each of the $|\mathcal{S}|$ states probability $1/|\mathcal{S}|$. However, in actual reinforcement learning some states are visited more frequently and are more critical, so this distribution is often unsuitable.
Stationary Distribution#
The stationary distribution describes the long-run behavior of a Markov process. Here, $d_\pi = [d_\pi(s_1), \ldots, d_\pi(s_n)]^T$ denotes the vector of state probabilities, satisfying $d_\pi(s) \geq 0$ and $\sum_{s \in \mathcal{S}} d_\pi(s) = 1$.
- $d_\pi(s)$ is the stationary probability of being in state $s$ under policy $\pi$. Using the stationary distribution yields smaller fitting errors on frequently visited states.
- The stationary distribution satisfies the following formula:

$$d_\pi^T = d_\pi^T P_\pi$$

Where $P_\pi$ is the state transition matrix from the Bellman equation.
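The fixed-point equation above can be solved numerically by power iteration, repeatedly applying $d^T \leftarrow d^T P_\pi$. A small sketch (the 3-state transition matrix is made up for illustration):

```python
import numpy as np

# Hypothetical transition matrix P_pi under a fixed policy (rows sum to 1).
P = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.2, 0.2, 0.6],
])

d = np.full(3, 1.0 / 3.0)   # start from the uniform distribution
for _ in range(1000):
    d = d @ P               # d^T <- d^T P_pi

# d now approximates the stationary distribution: d^T P_pi = d^T
```

The converged `d` can then be used as the sampling weights in the objective $J(w)$.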
Optimization Methods#
Update the parameters using gradient descent:

$$w_{k+1} = w_k - \alpha_k \nabla_w J(w_k)$$

The true gradient is derived as follows:

$$\nabla_w J(w) = \nabla_w \mathbb{E}\left[\left(v_\pi(S) - \hat{v}(S, w)\right)^2\right] = -2\,\mathbb{E}\left[\left(v_\pi(S) - \hat{v}(S, w)\right) \nabla_w \hat{v}(S, w)\right]$$

In practice, Stochastic Gradient Descent (SGD) is commonly used:

$$w_{t+1} = w_t + \alpha_t \left(v_\pi(s_t) - \hat{v}(s_t, w_t)\right) \nabla_w \hat{v}(s_t, w_t)$$

Where $s_t$ is a sample of $S$. For brevity, the constant $2$ is absorbed into the learning rate $\alpha_t$. Since the true $v_\pi(s_t)$ is unknown, it must be replaced with an estimate:
- Monte Carlo (MC) based: use the discounted return $g_t$ of the episode to approximate $v_\pi(s_t)$:

$$w_{t+1} = w_t + \alpha_t \left(g_t - \hat{v}(s_t, w_t)\right) \nabla_w \hat{v}(s_t, w_t)$$

- Temporal Difference (TD) based: the TD target $r_{t+1} + \gamma \hat{v}(s_{t+1}, w_t)$ is treated as an approximation of $v_\pi(s_t)$:

$$w_{t+1} = w_t + \alpha_t \left[r_{t+1} + \gamma \hat{v}(s_{t+1}, w_t) - \hat{v}(s_t, w_t)\right] \nabla_w \hat{v}(s_t, w_t)$$
TD-Linear Algorithm#
In the linear case $\hat{v}(s, w) = \phi(s)^T w$, the gradient is:

$$\nabla_w \hat{v}(s, w) = \phi(s)$$

Substituting this gradient into the TD algorithm gives:

$$w_{t+1} = w_t + \alpha_t \left[r_{t+1} + \gamma \phi(s_{t+1})^T w_t - \phi(s_t)^T w_t\right] \phi(s_t)$$

This is TD learning with linear function approximation, referred to as TD-Linear for short.
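The update above can be sketched in a few lines of numpy. The environment here is a made-up toy chain (uniform random transitions over 5 states, reward 1 for landing in the last state), and the one-hot features are chosen for simplicity:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, gamma, alpha = 5, 0.9, 0.1

def phi(s):
    # One-hot features for illustration; any feature map works.
    e = np.zeros(n_states)
    e[s] = 1.0
    return e

w = np.zeros(n_states)
s = 0
for _ in range(5000):
    s_next = rng.integers(n_states)              # toy random-walk transition
    r = 1.0 if s_next == n_states - 1 else 0.0
    td_error = r + gamma * phi(s_next) @ w - phi(s) @ w
    w += alpha * td_error * phi(s)               # TD-Linear update
    s = s_next
```

For this toy chain every state has the same true value $0.2 / (1 - 0.9) = 2$, and the learned `w` fluctuates around that value.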
Derivative Analysis of Linear Approximation#
In RL linear approximation, $\hat{v}(s, w)$ is a scalar (the predicted state value) and $w$ is a vector (the weight parameters).

Deconstructing the linear expression: for column vectors $\phi(s)$ and $w$, the inner product is:

$$\hat{v}(s, w) = \phi(s)^T w = \sum_{i=1}^{n} \phi_i(s)\, w_i$$

Deriving with respect to a vector: the essence of $\nabla_w \hat{v}(s, w)$ is taking the partial derivative of the scalar function with respect to each component of $w$:

$$\frac{\partial \hat{v}(s, w)}{\partial w_i} = \phi_i(s)$$

Putting it together, we get $\nabla_w \hat{v}(s, w) = \phi(s)$.
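This identity is easy to verify numerically with central differences (the vectors below are arbitrary illustrative values):

```python
import numpy as np

phi = np.array([1.0, 2.0, 3.0])
w = np.array([0.5, -0.5, 1.0])
v_hat = lambda w: phi @ w              # scalar-valued function of w

# Central-difference gradient: perturb each component of w in turn.
eps = 1e-6
num_grad = np.array([(v_hat(w + eps * e) - v_hat(w - eps * e)) / (2 * eps)
                     for e in np.eye(3)])
# num_grad matches phi, confirming grad_w (phi^T w) = phi
```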
Tabular Representation#
The tabular method is a special case of linear function approximation. Suppose the feature vector of state $s$ is the one-hot vector:

$$\phi(s) = e_s = [0, \ldots, 0, 1, 0, \ldots, 0]^T$$

At this time:

$$\hat{v}(s, w) = e_s^T w = w(s)$$

That is, $e_s^T$ extracts the component of the vector $w$ corresponding to state $s$.
Action Value Function Approximation#
Sarsa with Function Approximation#

$$w_{t+1} = w_t + \alpha_t \left[r_{t+1} + \gamma \hat{q}(s_{t+1}, a_{t+1}, w_t) - \hat{q}(s_t, a_t, w_t)\right] \nabla_w \hat{q}(s_t, a_t, w_t)$$
Q-learning with Function Approximation#

$$w_{t+1} = w_t + \alpha_t \left[r_{t+1} + \gamma \max_{a \in \mathcal{A}(s_{t+1})} \hat{q}(s_{t+1}, a, w_t) - \hat{q}(s_t, a_t, w_t)\right] \nabla_w \hat{q}(s_t, a_t, w_t)$$
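A toy sketch of Q-learning with linear function approximation, using an $\epsilon$-greedy behavior policy (the environment, with transitions independent of actions and a reward for reaching the last state, is made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n_states, n_actions, gamma, alpha, eps = 4, 2, 0.9, 0.1, 0.1

def phi(s, a):
    # One-hot state-action features; any feature map works here.
    e = np.zeros(n_states * n_actions)
    e[s * n_actions + a] = 1.0
    return e

w = np.zeros(n_states * n_actions)
q = lambda s, a: phi(s, a) @ w

s = 0
for _ in range(3000):
    # epsilon-greedy action selection from the current q estimate
    a = (rng.integers(n_actions) if rng.random() < eps
         else int(np.argmax([q(s, b) for b in range(n_actions)])))
    s_next = rng.integers(n_states)          # toy random transition
    r = 1.0 if s_next == n_states - 1 else 0.0
    target = r + gamma * max(q(s_next, b) for b in range(n_actions))
    w += alpha * (target - q(s, a)) * phi(s, a)   # Q-learning FA update
    s = s_next
```

The only difference from the Sarsa update is the `max` over next actions in the target, mirroring the equations above.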
Deep Q-Network (DQN)#
DQN uses a neural network to approximate the non-linear Q function.

Loss Function:

$$J(w) = \mathbb{E}\left[\left(R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w) - \hat{q}(S, A, w)\right)^2\right]$$

This is essentially minimizing the Bellman optimality error. Define the target value as:

$$y \doteq R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w)$$
To ensure training stability and prevent the target value from constantly shifting with network updates, DQN introduces a dual-network architecture:
- Main Network: $\hat{q}(s, a, w)$, responsible for evaluating the current action; its parameters are updated at every step.
- Target Network: $\hat{q}(s, a, w_T)$, providing a stable target value $y_T = R + \gamma \max_{a} \hat{q}(S', a, w_T)$.

With the target network introduced, the loss function becomes:

$$J(w) = \mathbb{E}\left[\left(R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w_T) - \hat{q}(S, A, w)\right)^2\right]$$

Treating $w_T$ as a constant during this computation (i.e., excluded from the gradient), gradient descent only updates $w$:

$$w \leftarrow w + \alpha \left[R + \gamma \max_{a \in \mathcal{A}(S')} \hat{q}(S', a, w_T) - \hat{q}(S, A, w)\right] \nabla_w \hat{q}(S, A, w)$$

Note: The parameters $w$ of the main network are periodically copied to the target network $w_T$.
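The main/target interplay can be sketched in numpy. To keep it short, a linear Q stands in for the neural network, and the fake transitions, dimensions, and sync interval are all illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, n_actions, gamma, alpha = 8, 3, 0.99, 0.01
sync_every = 100                    # target-network update period (assumed)

# Linear Q: q(s, a) = (s @ w)[a]; w plays the role of the network weights.
w = rng.normal(size=(n_features, n_actions))    # main network
w_T = w.copy()                                  # target network

for step in range(1, 1001):
    s = rng.normal(size=n_features)             # fake transition (S, A, R, S')
    a = int(rng.integers(n_actions))
    r = float(rng.normal())
    s2 = rng.normal(size=n_features)

    y = r + gamma * np.max(s2 @ w_T)            # target uses frozen w_T
    td = y - (s @ w)[a]
    w[:, a] += alpha * td * s                   # gradient step on w only

    if step % sync_every == 0:
        w_T = w.copy()                          # periodic hard copy w -> w_T
```

Because `w_T` is held fixed between syncs, the regression target stays stable while `w` is updated.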
Experience Replay#
- Motivation: Sequential data collected in reinforcement learning has strong correlations. Using it directly for training can easily lead to network instability.
- Mechanism: Store the interaction data generated between the agent and the environment as tuples $(s, a, r, s')$ in a Replay Buffer $\mathcal{B}$.
- Sampling: During training, draw a random mini-batch of samples from the buffer. Sampling usually follows a uniform distribution, which breaks the temporal correlations between data points and significantly improves data utilization.
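A minimal replay buffer needs only the Python standard library (the capacity and batch size below are arbitrary):

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (s, a, r, s') tuples and samples uniformly at random,
    breaking the temporal correlations of sequential experience."""
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)   # oldest tuples are evicted

    def push(self, s, a, r, s_next):
        self.buffer.append((s, a, r, s_next))

    def sample(self, batch_size):
        # Uniform sampling without replacement within one mini-batch.
        return random.sample(self.buffer, batch_size)

buf = ReplayBuffer(capacity=100)
for t in range(200):                 # buffer keeps only the newest 100
    buf.push(t, 0, 0.0, t + 1)
batch = buf.sample(32)
```

Each training step draws a fresh mini-batch from `buf` instead of consuming transitions in the order they were collected.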