# Why control a continuous approximation of stochastic gradient descent instead of just the SGD?

In "Stochastic modified equations and adaptive stochastic gradient algorithms" (Li et al., 2015), the authors approximate stochastic gradient descent (SGD), as in

$$x_{k+1} = x_k - \eta u_k \nabla f_{\gamma_k}(x_k),$$

by an SDE, a so-called stochastic modified equation (SME),

$$d X_t = -u_t \nabla f(X_t)dt + u_t \sqrt{\eta \Sigma(X_t)}dW_t,$$

in the sense of weak approximation of order 1 by viewing the SGD update as the application of the Euler scheme to the SDE.

Here

• $$\eta > 0$$ is the maximum learning rate
• $$T > 0$$
• $$u : [0,T] \to [0,1]$$ controls the learning rate
• $$(\gamma_k)_k$$ are i.i.d. random variables with values in $$\{1,\dots, n\}$$ and $$\mathbb P(\gamma_k = i) = 1/n$$, representing, e.g., random samples drawn from a training set
• $$f(x) = \frac 1 n \sum_{i=1}^n f_i(x)$$ is the objective
• $$(W_t)_t$$ is Brownian motion independent of the $$\gamma_k$$

and

$$\Sigma(x) := \frac 1 n \sum_{i=1}^n (\nabla f(x) - \nabla f_i(x))(\nabla f(x) - \nabla f_i(x))^T.$$

Of course, many conditions are left implicit here, such as that the $$f_i$$ are continuously differentiable, or that the coefficients of the SME satisfy growth conditions guaranteeing the existence of a unique solution.
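To spell out the moment matching behind the "Euler scheme" viewpoint (my own expansion, using the definitions above): one Euler–Maruyama step of size $$\eta$$ for the SME is

$$X_{k+1} = X_k - \eta\, u_{\eta k} \nabla f(X_k) + u_{\eta k}\sqrt{\eta\, \Sigma(X_k)}\,\Delta W_k, \qquad \Delta W_k \sim \mathcal N(0, \eta I),$$

whose noise term has mean $$0$$ and covariance $$\eta^2 u_{\eta k}^2 \Sigma(X_k)$$. Rewriting the SGD update around the full gradient,

$$x_{k+1} = x_k - \eta u_k \nabla f(x_k) - \eta u_k\big(\nabla f_{\gamma_k}(x_k) - \nabla f(x_k)\big),$$

the noise term here also has mean $$0$$ and, by the definition of $$\Sigma$$, covariance $$\eta^2 u_k^2 \Sigma(x_k)$$; matching these first two moments is what gives the weak order-1 approximation.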

Now, the goal is to solve a control problem for the SME, such as

$$\min_{u} \mathbb E(f(X_T)),$$

and use the solution to control the learning rate of the original SGD. What I wonder is

Why can't we just control the original SGD directly? What can we gain by passing to this continuous problem?

Among other things, I think this is important because the difference between the actual iterate $$x_k$$ and its continuous counterpart ($$|X_{\eta k} - x_k|$$) can be quite large, since the approximation is only weak. I think this prevents us from letting $$u$$ depend on $$X_t$$ as well, because then the solution could not be meaningfully used for the original SGD.
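To make the weak-approximation point concrete, here is a minimal numerical sketch (my own toy example, not from the paper) comparing $$\mathbb E(f(X_T))$$ under SGD and under the Euler discretization of the SME, for $$f_i(x) = \frac 1 2 (x - a_i)^2$$, a choice for which $$\Sigma(x)$$ is the constant $$\operatorname{Var}(a)$$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem (my own choice): f_i(x) = 0.5 * (x - a_i)^2,
# so f(x) = 0.5 * mean((x - a_i)^2) and grad f_i(x) = x - a_i.
n = 50
a = rng.normal(0.0, 1.0, size=n)

eta = 0.05           # maximum learning rate
T = 1.0
steps = int(T / eta)
u = lambda t: 1.0    # constant learning-rate schedule, for simplicity

def grad_f(x):       # full-batch gradient
    return x - a.mean()

# Gradient-noise covariance; for this model it is the constant Var(a):
# Sigma(x) = (1/n) * sum_i (grad_f(x) - grad_f_i(x))^2
Sigma = a.var()

M = 2000   # number of independent runs, to compare the two laws
x0 = 2.0

# SGD: x_{k+1} = x_k - eta * u_k * grad f_{gamma_k}(x_k)
x = np.full(M, x0)
for k in range(steps):
    gammas = rng.integers(0, n, size=M)
    x = x - eta * u(k * eta) * (x - a[gammas])

# Euler-Maruyama for the SME, step size eta:
# X_{k+1} = X_k - eta * u * grad f(X_k) + u * sqrt(eta * Sigma) * dW
X = np.full(M, x0)
for k in range(steps):
    dW = rng.normal(0.0, np.sqrt(eta), size=M)
    X = X - eta * u(k * eta) * grad_f(X) + u(k * eta) * np.sqrt(eta * Sigma) * dW

f = lambda z: 0.5 * np.mean((z[:, None] - a[None, :]) ** 2, axis=1)
print(np.mean(f(x)), np.mean(f(X)))   # E[f] under SGD vs. under the SME
```

With these settings the two expectations agree closely, even though individual trajectories $$x_k$$ and $$X_{\eta k}$$ are driven by entirely different noise and do not track each other pathwise, which is exactly the weak-vs-strong distinction above.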

Introducing an SDE to approximate a random process is very natural as soon as the "jumps" (here, the learning rate) are small, and it has been a standard procedure since Langevin (1908). The first advantage is that we obtain a "universal model" that doesn't depend on the particularities of the discrete model: the only random object is a Brownian motion, which is well understood. Moreover, the SDE is usually simpler than the original model, and sometimes a complete explicit solution can even be written down. In any case, because of this "universality", a lot of machinery and theorems become available (stability, invariant measures, mixing times, ...).

For example, for the usual SDE $$dX_t = -\nabla f(X_t)dt + \sigma dW_t$$, as $$T \to \infty$$, $$X_T$$ converges in law to the Boltzmann distribution with density $$\rho(x) = \frac 1 Z \exp(-\beta f(x))$$, where $$Z$$ is just a normalizing constant and $$\beta = \frac{2}{\sigma^2}$$. To carry on with this example in your problem: in the case where $$u$$ is constant and $$f$$ has a local minimum $$x_0$$, I would guess that $$X_T$$ converges to a Gaussian centered at $$x_0$$ with covariance matrix obtained from the Hessian of $$f$$ at $$x_0$$ (namely $$(\beta \nabla^2 f(x_0))^{-1}$$, by a Laplace approximation). And you can then easily calculate $$\mathbb E(f(X_T))$$.
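A quick numerical sanity check of this stationary-law claim, for a quadratic $$f$$ (a toy choice of mine, not from the answer); note that with this normalization of the noise the inverse temperature comes out as $$\beta = 2/\sigma^2$$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Sanity check of the stationary law of dX = -f'(X) dt + sigma dW with the
# quadratic choice f(x) = h * x**2 / 2. The stationary density is
# proportional to exp(-2 f(x) / sigma**2), i.e. a Gaussian centered at the
# minimum x0 = 0 with variance sigma**2 / (2 h).
h, sigma = 1.0, 1.0
dt, T, M = 0.01, 10.0, 5000    # Euler step, horizon, number of particles

X = np.zeros(M)                # all particles start at the minimum
for _ in range(int(T / dt)):
    X = X - h * X * dt + sigma * np.sqrt(dt) * rng.normal(size=M)

print(X.mean(), X.var())       # mean near 0, variance near sigma**2/(2*h) = 0.5
```

After a horizon much longer than the relaxation time $$1/h$$, the empirical variance matches $$\sigma^2/(2h) = 1/(\beta h)$$, the Hessian-based covariance mentioned above, up to discretization and Monte Carlo error.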
Deriving the SDE from the discrete model, with an estimate on $$|X_{\eta k} - x_k|$$, is another story and can be very difficult, though I think in your case it should be OK.
Finally, an SDE is completely fine if its coefficients depend on $$X_t$$, and I don't think there is then any restriction on having $$u$$ depend on $$X_t$$.
• What does SDP stand for? – So do you think it makes sense to find $$u(t, X_t)$$ and apply it to SGD with the update $$x_{k+1} = x_k - \eta u(\eta k, x_k) \nabla f_{\gamma_k}(x_k)$$ (I think)? I was afraid this wouldn't really work, and that this is why they haven't tried it. – Stefan Perko Mar 30 at 8:41