pymc.nesterov_momentum#

pymc.nesterov_momentum(loss_or_grads=None, params=None, learning_rate=0.001, momentum=0.9)[source]#

Stochastic Gradient Descent (SGD) updates with Nesterov momentum

Generates update expressions of the form:

  • velocity := momentum * velocity - learning_rate * gradient

  • param := param + momentum * velocity - learning_rate * gradient
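where velocity in the second expression refers to its freshly updated value from the first. As a rough illustration, a minimal NumPy sketch of a single such step (all names and the toy gradient are hypothetical, for illustration only, not library code):

>>> import numpy as np
>>> param, velocity = np.array([1.0]), np.zeros(1)
>>> learning_rate, momentum = 0.001, 0.9
>>> gradient = 2 * param  # toy gradient of loss = param**2
>>> velocity = momentum * velocity - learning_rate * gradient
>>> param = param + momentum * velocity - learning_rate * gradient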

Parameters
loss_or_grads: symbolic expression or list of expressions

A scalar loss expression, or a list of gradient expressions

params: list of shared variables

The variables to generate update expressions for

learning_rate: float or symbolic scalar

The learning rate controlling the size of update steps

momentum: float or symbolic scalar, optional

The amount of momentum to apply. Higher momentum results in smoothing over more update steps. Defaults to 0.9.

Returns
OrderedDict

A dictionary mapping each parameter to its update expression

See also

apply_nesterov_momentum

Function applying momentum to updates
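If apply_nesterov_momentum follows the Lasagne-style signature apply_nesterov_momentum(updates, params=None, momentum=0.9), and sgd and apply_nesterov_momentum are importable from pymc in the same way as nesterov_momentum, composing them might look roughly like this (an assumption about the API, not a documented recipe; a and b as in the Examples below):

>>> from pymc import sgd, apply_nesterov_momentum
>>> updates = sgd(b, [a], learning_rate=.01)
>>> updates = apply_nesterov_momentum(updates, momentum=.9)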

Notes

Higher momentum also results in larger update steps. To counter that, you can optionally scale your learning rate by 1 - momentum.
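For instance, a hedged illustration of that scaling (base_lr is a hypothetical name; a and b as in the Examples below):

>>> base_lr = 0.1
>>> updates = nesterov_momentum(b, [a], learning_rate=base_lr * (1 - 0.9), momentum=0.9)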

The classic formulation of Nesterov momentum (or Nesterov accelerated gradient) requires the gradient to be evaluated at the predicted next position in parameter space. Here, we use the formulation described at https://github.com/lisa-lab/pylearn2/pull/136#issuecomment-10381617, which allows the gradient to be evaluated at the current parameters.
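For contrast, a minimal sketch of the classic formulation (illustrative names only; grad_at stands in for a gradient evaluated at the given point), in which the gradient is taken at the predicted next position rather than at the current parameters:

>>> lookahead = param + momentum * velocity
>>> velocity = momentum * velocity - learning_rate * grad_at(lookahead)
>>> param = param + velocity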

The optimizer can be called without both loss_or_grads and params; in that case a partial function is returned that accepts them later (see Examples).

Examples

>>> import aesara
>>> from pymc import nesterov_momentum
>>> a = aesara.shared(1.)
>>> b = a*2
>>> updates = nesterov_momentum(b, [a], learning_rate=.01)
>>> isinstance(updates, dict)
True
>>> optimizer = nesterov_momentum(learning_rate=.01)
>>> callable(optimizer)
True
>>> updates = optimizer(b, [a])
>>> isinstance(updates, dict)
True