pymc.adagrad#
- pymc.adagrad(loss_or_grads=None, params=None, learning_rate=1.0, epsilon=1e-06)[source]#
Adagrad updates.
Scale learning rates by dividing by the square root of accumulated squared gradients. See [1] for further description.
- Parameters:
- loss_or_grads: symbolic expression or list of expressions
A scalar loss expression, or a list of gradient expressions
- params: list of shared variables
The variables to generate update expressions for
- learning_rate: float or symbolic scalar
The learning rate controlling the size of update steps
- epsilon: float or symbolic scalar
Small value added for numerical stability
- Returns:
OrderedDict
A dictionary mapping each parameter to its update expression
Notes
Using step size \(\eta\), Adagrad calculates the learning rate for feature i at time step t as:
\[\eta_{t,i} = \frac{\eta}{\sqrt{\sum^t_{t^\prime} g^2_{t^\prime,i} + \epsilon}} g_{t,i}\]
As such, the learning rate is monotonically decreasing.
Epsilon is not included in the typical formula; see [2].
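The decay described by this formula can be illustrated with a small NumPy sketch (purely illustrative; the optimizer itself returns symbolic PyTensor update expressions):
>>> import numpy as np
>>> eta, eps = 0.1, 1e-6
>>> g = np.array([1.0, 0.5])                # the same gradient at every step, for illustration
>>> accum_1 = g ** 2                        # accumulated squared gradients after step 1
>>> accum_2 = accum_1 + g ** 2              # accumulated squared gradients after step 2
>>> rate_1 = eta / np.sqrt(accum_1 + eps)   # per-feature learning rates at step 1
>>> rate_2 = eta / np.sqrt(accum_2 + eps)   # per-feature learning rates at step 2
>>> bool(np.all(rate_2 < rate_1))           # rates decrease as gradients accumulate
True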
The optimizer can be called without loss_or_grads and params; in that case a partial function is returned that accepts them later (see Examples).
References
[1] Duchi, J., Hazan, E., & Singer, Y. (2011): Adaptive subgradient methods for online learning and stochastic optimization. JMLR, 12:2121-2159.
[2] Chris Dyer: Notes on AdaGrad. http://www.ark.cs.cmu.edu/cdyer/adagrad.pdf
Examples
>>> a = pytensor.shared(1.0)
>>> b = a * 2
>>> updates = adagrad(b, [a], learning_rate=0.01)
>>> isinstance(updates, dict)
True
>>> optimizer = adagrad(learning_rate=0.01)
>>> callable(optimizer)
True
>>> updates = optimizer(b, [a])
>>> isinstance(updates, dict)
True
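The returned update dictionary is typically compiled into a step function with pytensor.function (a minimal sketch, reusing the toy loss from above):
>>> import pytensor
>>> a = pytensor.shared(1.0)
>>> b = a * 2
>>> updates = adagrad(b, [a], learning_rate=0.01)
>>> step = pytensor.function([], b, updates=updates)
>>> _ = step()  # each call applies one Adagrad update to the shared variable `a`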