RL Study Notes: SA and SGD

A review of Stochastic Approximation and the Robbins-Monro algorithm, detailing the evolution, convergence properties, and sampling differences of SGD.

Stochastic Approximation Theory and Stochastic Gradient Descent

Mean Estimation

First, consider the problem of estimating the mean of a random variable $X$ via the sample average:

$$w_{k+1} \doteq \frac{1}{k} \sum_{i=1}^{k} x_{i}, \quad k=1,2,\dots$$

For the mean at the previous step, we have:

$$w_{k} = \frac{1}{k-1} \sum_{i=1}^{k-1} x_{i}, \quad k=2,3,\dots$$

Through algebraic transformation, the full batch computation can be converted into a recursive form:

$$w_{k+1} = \frac{1}{k} \sum_{i=1}^{k} x_{i} = \frac{1}{k}\left(\sum_{i=1}^{k-1} x_{i}+x_{k}\right) = \frac{1}{k}\left((k-1) w_{k}+x_{k}\right) = w_{k}-\frac{1}{k}\left(w_{k}-x_{k}\right)$$

That is:

$$w_{k+1} = w_k - \frac{1}{k}(w_k - x_k)$$
  • This is an iterative algorithm that does not require storing and recalculating all historical data.
  • In the early stages of iteration, the estimate may be inaccurate due to insufficient sample size; however, as the sample size $k$ increases, the result gradually approaches the true mean.

From this, we can derive a more general iterative equation:

$$w_{k+1} = w_{k} - \alpha_{k}(w_{k} - x_{k})$$
  • When the step size sequence $\{\alpha_k\}$ satisfies certain conditions, the estimate $w_k$ will still converge to the expectation $\mathbb{E}[X]$.
  • This form can be viewed as a special case of the Stochastic Approximation (SA) algorithm, and it is also the foundational form of the Stochastic Gradient Descent (SGD) algorithm.
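
As a quick sanity check, here is a minimal sketch of the recursive update with $\alpha_k = 1/k$; the Gaussian distribution, its parameters, and the seed are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
samples = rng.normal(loc=5.0, scale=2.0, size=10_000)  # E[X] = 5

w = 0.0  # initial estimate; overwritten by x_1 on the first update
for k, x_k in enumerate(samples, start=1):
    w = w - (1.0 / k) * (w - x_k)  # w_{k+1} = w_k - (1/k)(w_k - x_k)

print(w, samples.mean())  # the two agree up to float error, both near 5
```

Because the first update (with $\alpha_1 = 1$) overwrites the initial value with $x_1$, the recursion reproduces the running sample mean exactly, regardless of initialization.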

Robbins-Monro (RM) Algorithm

Stochastic Approximation (SA)

  • SA is a broad class of algorithms that rely on stochastic iteration to find the roots of equations or solve optimization problems.
  • The core advantage of SA is its black-box solving capability: it does not require knowledge of the analytical expression or global properties of the target equation, relying only on noisy observational data to update parameters.

RM Algorithm Definition

A necessary condition for a function $J(w)$ to attain an extremum is that its gradient is zero, i.e., $g(w) \doteq \nabla_{w} J(w) = 0$. Similarly, an equation of the form $g(w) = c$ can be turned into a root-finding problem by moving $c$ to the left-hand side. SGD is exactly a special case of the RM algorithm.

The RM algorithm solves $g(w) = 0$ through the following iteration:

$$w_{k+1} = w_k - a_k \tilde{g}(w_k, \eta_k), \quad k = 1, 2, 3, \dots$$

Where:

  • $w_k$ is the estimate of the root at the $k$-th iteration.
  • $\tilde{g}(w_k, \eta_k) = g(w_k) + \eta_k$ is the $k$-th observation, corrupted by random noise $\eta_k$.
  • $a_k$ is a positive coefficient controlling the step size.
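
A minimal sketch of this iteration on a toy root-finding problem; the linear $g$, the noise scale, and the schedule $a_k = 1/k$ are illustrative assumptions (the algorithm only ever sees the noisy observations):

```python
import numpy as np

rng = np.random.default_rng(1)

def g(w: float) -> float:
    """Treated as a black box by the algorithm; its root is w* = 2."""
    return 2.0 * w - 4.0

w = 0.0  # initial estimate w_1
for k in range(1, 10_001):
    g_tilde = g(w) + rng.normal(scale=0.5)  # noisy observation g(w_k) + eta_k
    w = w - (1.0 / k) * g_tilde             # RM update with a_k = 1/k

print(w)  # approaches the root w* = 2
```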

Convergence Theorem and Condition Analysis

If the following conditions hold:

(a) $0 < c_1 \leq \nabla_w g(w) \leq c_2$ for all $w$;
(b) $\sum_{k=1}^{\infty} a_k = \infty$ and $\sum_{k=1}^{\infty} a_k^2 < \infty$;
(c) $\mathbb{E}[\eta_k \mid \mathcal{H}_k] = 0$ and $\mathbb{E}[\eta_k^2 \mid \mathcal{H}_k] < \infty$;

where $\mathcal{H}_k = \{w_k, w_{k-1}, \dots\}$ represents the history up to time $k$, then the sequence $w_k$ converges almost surely (with probability 1) to the root $w^*$ satisfying $g(w^*) = 0$.

  • Condition (a): Requires the derivative (or gradient) of $g(w)$ to have a strictly positive lower bound and a finite upper bound. This ensures the function is sufficiently smooth and its slope neither vanishes nor blows up; since the slope is positive everywhere, $g$ is strictly increasing and the root $w^*$ is unique, so the update direction points stably and continuously towards it.
  • Condition (b): The classic step-size (learning-rate) constraint. $\sum a_k = \infty$ ensures the total step size is infinite, so the algorithm can cover any initial distance to the target point; $\sum a_k^2 < \infty$ ensures the step size decays fast enough that the algorithm can ultimately converge rather than oscillating perpetually around the root.
  • Condition (c): Constraints on the random noise. A conditional expectation of zero given the history (forming a martingale difference sequence) means the noisy observations are unbiased; a finite conditional variance limits the fluctuation of each step, preventing the algorithm from diverging.
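
As a concrete check of condition (b), the classic schedule $a_k = 1/k$ qualifies: the harmonic series diverges while the series of its squares converges.

$$\sum_{k=1}^{\infty} \frac{1}{k} = \infty, \qquad \sum_{k=1}^{\infty} \frac{1}{k^2} = \frac{\pi^2}{6} < \infty$$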

Stochastic Gradient Descent (SGD)

SGD is primarily used to solve expected risk minimization problems:

$$\min_{w} \quad J(w) = \mathbb{E}[f(w, X)]$$
  • $w$ is the parameter to be optimized.
  • $X$ is a random variable, and the expectation is taken with respect to the distribution of $X$.
  • The function $f(\cdot)$ outputs a scalar; both the parameter $w$ and the input $X$ can be scalars or vectors.

Gradient Descent Evolution

1. Gradient Descent (GD)

$$w_{k+1} = w_k - \alpha_k \nabla_w \mathbb{E}[f(w_k, X)] = w_k - \alpha_k \mathbb{E}[\nabla_w f(w_k, X)]$$

2. Batch Gradient Descent (BGD)

Approximating the mathematical expectation through the empirical mean of a finite sample:

$$\mathbb{E}[\nabla_w f(w_k, X)] \approx \frac{1}{n} \sum_{i=1}^n \nabla_w f(w_k, x_i)$$
$$w_{k+1} = w_k - \alpha_k \frac{1}{n} \sum_{i=1}^n \nabla_w f(w_k, x_i)$$

3. Stochastic Gradient Descent (SGD)

$$w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k)$$
  • Compared to GD: Replaces the true gradient $\mathbb{E}[\nabla_w f(w_k, X)]$ with the single-sample stochastic gradient $\nabla_w f(w_k, x_k)$.
  • Compared to BGD: Equivalent to setting the batch size to $1$.
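
Connecting back to the opening example: taking $f(w, x) = \frac{1}{2}(w - x)^2$, whose expected risk $J(w) = \mathbb{E}[f(w, X)]$ is minimized at $w^* = \mathbb{E}[X]$, gives $\nabla_w f(w_k, x_k) = w_k - x_k$, so the SGD update is exactly the mean-estimation iteration:

$$w_{k+1} = w_k - \alpha_k (w_k - x_k)$$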

According to RM algorithm theory, if the Hessian matrix is positive definite and bounded ($0 < c_1 \le \nabla^2_w f(w, X) \le c_2$), the learning rate satisfies the Robbins-Monro step-size conditions, and the sample sequence is independent and identically distributed (i.i.d.), then SGD will also converge almost surely (with probability 1) to the optimal solution.

SGD Relative Error and Randomness

Introduce the relative error $\delta_k$ to measure the degree to which the stochastic gradient deviates from the true gradient:

$$\delta_k \doteq \frac{|\nabla_w f(w_k, x_k) - \mathbb{E}[\nabla_w f(w_k, X)]|}{|\mathbb{E}[\nabla_w f(w_k, X)]|}$$

At the optimal solution $w^*$, $\mathbb{E}[\nabla_w f(w^*, X)] = 0$ holds. Subtracting this zero term in the denominator and applying the Mean Value Theorem (with $\tilde{w}_k$ lying between $w_k$ and $w^*$):

$$\delta_k = \frac{|\nabla_w f(w_k, x_k) - \mathbb{E}[\nabla_w f(w_k, X)]|}{|\mathbb{E}[\nabla_w f(w_k, X)] - \mathbb{E}[\nabla_w f(w^*, X)]|} = \frac{|\nabla_w f(w_k, x_k) - \mathbb{E}[\nabla_w f(w_k, X)]|}{|\mathbb{E}[\nabla_w^2 f(\tilde{w}_k, X)(w_k - w^*)]|}$$

Due to the strong convexity assumption $\nabla_w^2 f \ge c > 0$, we can bound the denominator:

$$\begin{aligned} |\mathbb{E}[\nabla_w^2 f(\tilde{w}_k, X)(w_k - w^*)]| &= |\mathbb{E}[\nabla_w^2 f(\tilde{w}_k, X)] \cdot (w_k - w^*)| \\ &= |\mathbb{E}[\nabla_w^2 f(\tilde{w}_k, X)]| \cdot |w_k - w^*| \ge c\,|w_k - w^*| \end{aligned}$$

Thus, we obtain the upper bound of the relative error:

$$\delta_k \leq \frac{\left|\overbrace{\nabla_w f(w_k, x_k)}^{\text{stochastic gradient}} - \overbrace{\mathbb{E}[\nabla_w f(w_k, X)]}^{\text{true gradient}}\right|}{\underbrace{c\,|w_k - w^*|}_{\text{distance to optimal solution}}}$$

This inequality reveals an important convergence pattern of SGD:

  • The upper bound on the relative error $\delta_k$ is inversely proportional to the distance $|w_k - w^*|$ to the optimal solution.
  • When $w_k$ is far from the optimal solution, the denominator is large and $\delta_k$ is small. The update direction of SGD is then very close to the true gradient, exhibiting a descent trajectory highly similar to GD.
  • As $w_k$ gradually approaches the optimal solution $w^*$, the denominator tends to zero and the relative error $\delta_k$ increases significantly. This means that in the neighborhood of the optimal solution, noise interference becomes dominant, causing the algorithm to exhibit strong random oscillation in the late stages of convergence (which is also why SGD requires learning rate decay).
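
A small numerical illustration of this effect, again using $f(w, x) = \frac{1}{2}(w - x)^2$; the distribution, sample count, and the two probe points are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
mu = 5.0  # E[X], the optimal solution w* for f(w, x) = (w - x)^2 / 2

def mean_relative_error(w: float, n_draws: int = 100_000) -> float:
    """Average relative error of the stochastic gradient at a fixed w."""
    x = rng.normal(loc=mu, scale=2.0, size=n_draws)
    stochastic = w - x   # per-sample stochastic gradients grad f(w, x_k)
    true_grad = w - mu   # true gradient E[grad f(w, X)]
    return float(np.mean(np.abs(stochastic - true_grad)) / abs(true_grad))

print(mean_relative_error(w=50.0))  # far from w*: error is tiny, SGD ~ GD
print(mean_relative_error(w=5.1))   # near w*: error explodes, noise dominates
```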

Sampling Comparison: BGD, MBGD, and SGD

These random sampling strategies can be viewed as truncating the full-batch sum to a random subset of its terms at each step.

Assume we want to minimize $J(w) = \mathbb{E}[f(w, X)]$ and have a set of random samples $\{x_i\}_{i=1}^n$ of $X$. The iterative formulas for the three algorithms are compared as follows:

$$w_{k+1} = w_k - \alpha_k \frac{1}{n} \sum_{i=1}^n \nabla_w f(w_k, x_i) \quad \text{(BGD)}$$
$$w_{k+1} = w_k - \alpha_k \frac{1}{m} \sum_{j \in \mathcal{I}_k} \nabla_w f(w_k, x_j) \quad \text{(MBGD)}$$
$$w_{k+1} = w_k - \alpha_k \nabla_w f(w_k, x_k) \quad \text{(SGD)}$$
  • BGD: Computes the gradient over all $n$ samples at each iteration. When $n$ is sufficiently large, the update direction closely approximates the true expected gradient.
  • MBGD: Draws a subset $\mathcal{I}_k$ of size $m$ from the global samples at each iteration, obtained through $m$ independent and identically distributed (i.i.d.) draws.
  • SGD: Randomly draws a single sample $x_k$ at time step $k$ during each iteration.

Core Difference Note: Even when $m = n$, MBGD is not equivalent to BGD. This is because the $m$ random samples in MBGD are typically drawn with replacement (potentially drawing duplicate samples), whereas BGD strictly iterates over all $n$ distinct global samples.
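
A compact sketch of the three updates on a fixed sample set; the quadratic loss, the data, and the hyperparameters are illustrative assumptions, and MBGD samples with replacement, matching the note above:

```python
import numpy as np

rng = np.random.default_rng(3)
data = rng.normal(loc=5.0, scale=2.0, size=1_000)  # samples of X; w* ≈ mean

def grad(w: float, x: np.ndarray) -> float:
    """Gradient of f(w, x) = (w - x)^2 / 2, averaged over the given samples."""
    return float(np.mean(w - x))

w_bgd = w_mbgd = w_sgd = 0.0
m = 32  # mini-batch size
for k in range(1, 2_001):
    alpha = 1.0 / k
    w_bgd -= alpha * grad(w_bgd, data)                      # all n samples
    batch = rng.choice(data, size=m, replace=True)          # m i.i.d. draws
    w_mbgd -= alpha * grad(w_mbgd, batch)                   # mini-batch
    w_sgd -= alpha * grad(w_sgd, rng.choice(data, size=1))  # single sample

print(w_bgd, w_mbgd, w_sgd)  # all three approach the sample mean of data
```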