pymc.nesterov_momentum(loss_or_grads=None, params=None, learning_rate=0.001, momentum=0.9)

Stochastic Gradient Descent (SGD) updates with Nesterov momentum

Generates update expressions of the form:

  • velocity := momentum * velocity - learning_rate * gradient

  • param := param + momentum * velocity - learning_rate * gradient
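As a rough illustration, the two expressions above can be written out on plain Python floats. This is only a sketch: `nesterov_step` and the toy loss f(x) = x**2 are hypothetical names introduced here, and the real function builds symbolic update expressions for shared variables rather than updating numbers directly.

```python
def nesterov_step(param, velocity, grad, learning_rate=0.001, momentum=0.9):
    """Apply one update following the two expressions above (plain floats)."""
    # velocity := momentum * velocity - learning_rate * gradient
    velocity = momentum * velocity - learning_rate * grad
    # param := param + momentum * velocity - learning_rate * gradient
    # (uses the velocity that was just updated)
    param = param + momentum * velocity - learning_rate * grad
    return param, velocity


# Hypothetical toy loss f(x) = x**2, whose gradient is 2 * x
param, velocity = 1.0, 0.0
for _ in range(200):
    param, velocity = nesterov_step(param, velocity, 2.0 * param, learning_rate=0.1)
```

After the loop, `param` should be very close to the minimum of the toy loss at 0.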

Parameters

loss_or_grads: symbolic expression or list of expressions

A scalar loss expression, or a list of gradient expressions

params: list of shared variables

The variables to generate update expressions for

learning_rate: float or symbolic scalar

The learning rate controlling the size of update steps

momentum: float or symbolic scalar, optional

The amount of momentum to apply. Higher momentum results in smoothing over more update steps. Defaults to 0.9.


Returns

A dictionary mapping each parameter to its update expression

See also


Function applying momentum to updates


Notes

Higher momentum also results in larger update steps. To counter that, you can optionally scale your learning rate by 1 - momentum.
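To see why this scaling helps, consider a constant gradient: the velocity then settles at -learning_rate * gradient / (1 - momentum), so each step moves the parameter by roughly learning_rate / (1 - momentum) times the gradient. The sketch below checks this numerically on plain floats; `steady_state_step` is a hypothetical helper introduced only for this illustration and is not part of the library.

```python
def steady_state_step(learning_rate, momentum, grad=1.0, n_steps=500):
    """Per-iteration parameter change after many steps with a constant gradient."""
    velocity = 0.0
    step = 0.0
    for _ in range(n_steps):
        # Same update expressions as in the description of this function
        velocity = momentum * velocity - learning_rate * grad
        step = momentum * velocity - learning_rate * grad
    return step


base_lr, momentum = 0.01, 0.9

# Unscaled: the step approaches -base_lr / (1 - momentum) * grad
unscaled = steady_state_step(base_lr, momentum)

# Scaled by (1 - momentum): the step approaches -base_lr * grad
scaled = steady_state_step(base_lr * (1 - momentum), momentum)
```

With momentum = 0.9, the unscaled step is about ten times the plain gradient step, and scaling the learning rate by 1 - momentum brings it back to about the plain step size.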

The classic formulation of Nesterov momentum (or Nesterov accelerated gradient) requires the gradient to be evaluated at the predicted next position in parameter space. Here, we use the formulation described at lisa-lab/pylearn2#136, which allows the gradient to be evaluated at the current parameters.
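One way to read this reformulation is as a change of variables: the stored parameters play the role of the classic look-ahead point, parameters + momentum * velocity. The sketch below is a numerical check of that reading on a hypothetical toy loss f(x) = x**2, using plain floats. It is an illustration under that assumption, not a reproduction of the linked discussion.

```python
def grad(x):
    # Gradient of the hypothetical toy loss f(x) = x**2
    return 2.0 * x


learning_rate, momentum = 0.1, 0.9

theta, v_classic = 1.0, 0.0  # classic form: parameters and velocity
phi, v_reform = 1.0, 0.0     # reformulated form: stores the look-ahead point

max_gap = 0.0
for _ in range(50):
    # Classic: gradient evaluated at the predicted next position
    v_classic = momentum * v_classic - learning_rate * grad(theta + momentum * v_classic)
    theta = theta + v_classic

    # Reformulated: gradient evaluated at the currently stored parameters
    g = grad(phi)
    v_reform = momentum * v_reform - learning_rate * g
    phi = phi + momentum * v_reform - learning_rate * g

    # The stored parameters should track the classic look-ahead point
    max_gap = max(max_gap, abs(phi - (theta + momentum * v_classic)))
```

Up to floating-point rounding, `max_gap` stays at zero and both velocities agree, which is consistent with the two formulations describing the same trajectory in different variables.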

The optimizer can also be called with both loss_or_grads and params omitted. In that case, a partial function is returned, which can be called later with those two arguments.


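Examples

>>> import pytensor
>>> from pymc import nesterov_momentum
>>> a = pytensor.shared(1.)
>>> b = a*2
>>> updates = nesterov_momentum(b, [a], learning_rate=.01)
>>> isinstance(updates, dict)
True
>>> # Without loss_or_grads and params, a partial function is returned
>>> optimizer = nesterov_momentum(learning_rate=.01)
>>> callable(optimizer)
True
>>> updates = optimizer(b, [a])
>>> isinstance(updates, dict)
True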