Scale learning rates by dividing by the square root of the accumulated squared gradients. See [1] for further description.

Parameters
loss_or_grads: symbolic expression or list of expressions

A scalar loss expression, or a list of gradient expressions

params: list of shared variables

The variables to generate update expressions for

learning_rate: float or symbolic scalar

The learning rate controlling the size of update steps

epsilon: float or symbolic scalar

Small value added for numerical stability

Returns
OrderedDict

A dictionary mapping each parameter to its update expression

Notes

Using step size $\eta$, Adagrad computes the update for feature $i$ at time step $t$ as:

$\eta_{t,i} = \frac{\eta}{\sqrt{\sum^t_{t^\prime} g^2_{t^\prime,i} + \epsilon}} g_{t,i}$

As such, the effective learning rate is monotonically decreasing.

Epsilon is not included in the typical formula; see [2].
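The accumulation rule above can be sketched in plain NumPy (an illustrative sketch of the math, not the library's symbolic implementation; the function name `adagrad_step` and the test problem are made up for this example):

```python
import numpy as np

def adagrad_step(param, grad, acc, eta=0.1, eps=1e-6):
    # Accumulate the squared gradient per feature, then divide the step
    # by the square root of the running sum (plus eps for stability).
    acc = acc + grad ** 2
    param = param - eta / np.sqrt(acc + eps) * grad
    return param, acc

# Minimize f(x) = x^2 (gradient 2x); the denominator grows every step,
# so the effective learning rate shrinks monotonically.
x, acc = np.array([5.0]), np.zeros(1)
for _ in range(100):
    x, acc = adagrad_step(x, 2 * x, acc)
```

After these iterations `x` has moved toward the minimum at 0, with progressively smaller steps.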

The optimizer can be called without both loss_or_grads and params; in that case a partial function is returned.
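The partial-function behavior described above can be sketched as follows (a hypothetical stand-in, not the library source; the stubbed return value only illustrates the calling pattern):

```python
from functools import partial

def adagrad(loss_or_grads=None, params=None, *, learning_rate=1.0, epsilon=1e-6):
    # If neither loss_or_grads nor params is given, return a partial
    # function that remembers the hyperparameters for a later call.
    if loss_or_grads is None and params is None:
        return partial(adagrad, learning_rate=learning_rate, epsilon=epsilon)
    # The real optimizer would build and return an OrderedDict of
    # parameter updates here; a stub stands in for it in this sketch.
    return {"learning_rate": learning_rate, "epsilon": epsilon}

opt = adagrad(learning_rate=0.01)  # no loss/params -> partial function
updates = opt("loss_expr", ["param"])  # apply it later
```

This lets a configured optimizer be passed around and applied to different loss/parameter pairs.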

References

1

Duchi, J., Hazan, E., & Singer, Y. (2011): Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12:2121-2159.

2

Examples

>>> import pytensor
>>> a = pytensor.shared(1.)
>>> b = a * 2
>>> updates = adagrad(b, [a], learning_rate=0.01)