Scale learning rates by the ratio of accumulated updates to accumulated gradients; see [1] and the Notes for further description.

Parameters:
loss_or_grads : symbolic expression or list of expressions

A scalar loss expression, or a list of gradient expressions

params : list of shared variables

The variables to generate update expressions for

learning_rate : float or symbolic scalar

The learning rate controlling the size of update steps

rho : float or symbolic scalar

Squared gradient moving average decay factor

epsilon : float or symbolic scalar

Small value added for numerical stability

Returns:
OrderedDict

A dictionary mapping each parameter to its update expression

Notes

rho should be between 0 and 1. A value of rho close to 1 decays the moving average slowly, and a value close to 0 decays it quickly.
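To illustrate the effect of rho, here is a minimal sketch in plain Python. It is not part of the library, and the helper name `moving_average` is hypothetical. It applies the moving average of squared gradients to a constant gradient and compares how quickly the average approaches g**2 for two values of rho.

```python
def moving_average(rho, grads, r0=0.0):
    """Return the history of r_t = rho * r_{t-1} + (1 - rho) * g**2."""
    r = r0
    history = []
    for g in grads:
        r = rho * r + (1.0 - rho) * g ** 2
        history.append(r)
    return history


grads = [1.0] * 10  # constant gradient, so g**2 == 1.0 at every step
slow = moving_average(0.95, grads)  # rho close to 1: old values persist
fast = moving_average(0.10, grads)  # rho close to 0: old values are forgotten

print(slow[-1])  # still well below 1.0 after 10 steps
print(fast[-1])  # very close to 1.0 after 10 steps
```

With rho=0.95 the average has covered only part of the distance to 1.0 after ten steps, while with rho=0.10 it has effectively reached it.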

rho=0.95 and epsilon=1e-6 are suggested in the paper and reported to work for multiple datasets (MNIST, speech).

In the paper, no learning rate is considered (so learning_rate=1.0). It is probably best to keep it at this value. epsilon is important for the very first update (so the numerator does not become 0).

Using the step size eta and a decay factor rho the learning rate is calculated as:

$\begin{split}r_t &= \rho r_{t-1} + (1-\rho) g^2\\ \eta_t &= \eta \frac{\sqrt{s_{t-1} + \epsilon}}{\sqrt{r_t + \epsilon}}\\ s_t &= \rho s_{t-1} + (1-\rho) (\eta_t g)^2\end{split}$
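The three equations above can be sketched for a single scalar parameter as follows. This is an illustrative plain-Python version, not the library's symbolic implementation. The function name `adadelta_step` and the choice to apply the update as `param - eta_t * grad` are assumptions made for the example.

```python
import math


def adadelta_step(param, grad, r, s, learning_rate=1.0, rho=0.95, epsilon=1e-6):
    """One Adadelta step for a scalar parameter; returns (param, r, s)."""
    # r_t: moving average of squared gradients
    r = rho * r + (1.0 - rho) * grad ** 2
    # eta_t: step size scaled by accumulated updates over accumulated gradients
    eta_t = learning_rate * math.sqrt(s + epsilon) / math.sqrt(r + epsilon)
    # s_t: moving average of squared updates
    s = rho * s + (1.0 - rho) * (eta_t * grad) ** 2
    # Assumed update rule for this sketch: step against the gradient
    return param - eta_t * grad, r, s


# Minimize f(x) = x**2, whose gradient is 2 * x, starting from x = 1.0.
x, r, s = 1.0, 0.0, 0.0
for _ in range(100):
    x, r, s = adadelta_step(x, 2.0 * x, r, s)
print(x)  # closer to 0 than the starting value 1.0

# With epsilon = 0 the numerator sqrt(s_0) is 0, so the first step does nothing.
y, r0, s0 = adadelta_step(1.0, 2.0, 0.0, 0.0, epsilon=0.0)
print(y)  # 1.0
```

The second call shows why epsilon matters for the very first update: with both accumulators at zero and epsilon=0, eta_t is zero, so the parameter and s stay where they are.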

The optimizer can be called without both loss_or_grads and params; in that case a partial function is returned.
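The partial-function behavior can be pictured with the following sketch. It uses a hypothetical `toy_optimizer` with a placeholder gradient-descent rule on Python floats. It only illustrates the calling pattern and is not the library's implementation; the error raised when only one argument is given is likewise an assumption.

```python
from functools import partial


def toy_optimizer(loss_or_grads=None, params=None, learning_rate=1.0):
    """Hypothetical optimizer that mimics the 'return a partial' behavior."""
    if loss_or_grads is None and params is None:
        # Neither argument given: return a partial with hyperparameters bound.
        return partial(toy_optimizer, learning_rate=learning_rate)
    if loss_or_grads is None or params is None:
        # Assumed behavior for this sketch: reject a half-specified call.
        raise ValueError("Provide both loss_or_grads and params")
    # Placeholder rule: plain gradient descent on a dict of Python floats.
    return {
        name: value - learning_rate * grad
        for (name, value), grad in zip(params.items(), loss_or_grads)
    }


configured = toy_optimizer(learning_rate=0.5)  # no loss or params yet
updates = configured([2.0], {"a": 1.0})        # supply them later
print(updates)  # {'a': 0.0}
```

This pattern lets the hyperparameters be fixed once and the resulting callable be passed to code that supplies the loss and parameters later.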

References

[1]

Examples

>>> import pytensor
>>> a = pytensor.shared(1.)
>>> b = a * 2