alf.optimizers#

alf.optimizers.adam_tf#

class AdamTF(params=[{'params': []}], lr=0.001, betas=(0.9, 0.999), eps=1e-07, weight_decay=0, amsgrad=False)[source]#

Bases: torch.optim.optimizer.Optimizer

Implementation of Adam algorithm following Tensorflow’s convention.

This class should not be used directly, as it will be wrapped for clipping gradients. Use the wrapped optimizer AdamTF in alf/optimizers/optimizers.py instead.

Parameters

params (iterable) – iterable of parameters to optimize or dicts defining parameter groups.
lr (float, optional) – learning rate (default: 1e-3).
betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999)).
eps (float, optional) – term added to the denominator to improve numerical stability which corresponds to the epsilon_hat in the Adam paper (default: 1e-7).
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0). This argument can be parameter specific, which means that if Parameter.opt_args["weight_decay"] is not None, it will be used instead.
amsgrad (boolean, optional) – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False).
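
The following is a minimal scalar sketch of what the TensorFlow convention typically implies for eps: the bias correction is folded into the step size and eps is added to the uncorrected sqrt(v) (the "epsilon hat" form). The function name adam_tf_step and the toy objective are hypothetical and only illustrate the formula documented for TensorFlow's Adam; this is not ALF's implementation, and weight decay and AMSGrad are omitted.

```python
import math


def adam_tf_step(param, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999), eps=1e-7):
    """One scalar Adam step in the "epsilon hat" formulation (illustrative only)."""
    beta1, beta2 = betas
    # Exponential moving averages of the gradient and of its square.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction is folded into the step size instead of into m and v.
    lr_t = lr * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    # eps is added to the *uncorrected* sqrt(v); this is epsilon_hat.
    param = param - lr_t * m / (math.sqrt(v) + eps)
    return param, m, v


# Toy usage: a few steps on f(x) = x^2, whose gradient is 2x.
param, m, v = 1.0, 0.0, 0.0
for t in range(1, 4):
    grad = 2.0 * param
    param, m, v = adam_tf_step(param, grad, m, v, t)
```

With this formulation the very first step has magnitude close to lr regardless of the gradient scale, as long as eps is small relative to sqrt(v).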


reset_state()[source]#

Resets all state of the AdamTF optimizer, including the exponential moving averages of the gradients and of the squared gradients.

step(closure=None)[source]#

Performs a single optimization step.

Parameters: closure (callable, optional) – A closure that reevaluates the model and returns the loss.

alf.optimizers.adamw#

class AdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, amsgrad=False)[source]#

Bases: torch.optim.optimizer.Optimizer

AdamW optimizer.

The current implementation of AdamW in pytorch 1.8 has a bug introduced during refactoring (https://github.com/pytorch/pytorch/pull/50411). This method is copied from the latest fix (#52944), which has been merged into pytorch repo after the release of pytorch 1.8.1 (https://github.com/pytorch/pytorch/pull/52944).

TODO: remove this after upgrading to pytorch versions including fix (#52944).

The original Adam algorithm was proposed in Adam: A Method for Stochastic Optimization. The AdamW variant was proposed in Decoupled Weight Decay Regularization.

Parameters

params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) – learning rate (default: 1e-3)
betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
weight_decay (float, optional) – weight decay coefficient (default: 1e-2)
amsgrad (boolean, optional) – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False)

References

Adam: A Method for Stochastic Optimization: https://arxiv.org/abs/1412.6980

step(closure=None)[source]#

Performs a single optimization step.

Parameters: closure (callable, optional) – A closure that reevaluates the model and returns the loss.

alf.optimizers.nero_plus#

NeroPlus optimizer.

class NeroPlus(params=[{'params': []}], lr=0.01, betas=(0.9, 0.999), eps=1e-07, normalizing_grad_by_norm=False, max_norm=1, weight_decay=0, l2_regularization=0, fixed_norm=True, zero_mean=True)[source]#

Bases: torch.optim.optimizer.Optimizer

NeroPlus Optimizer

This is an enhanced version of the Nero optimizer described in the following paper:

Yang Liu et al. Learning by Turning: Neural Architecture Aware Optimisation

The essence of this optimizer is to keep the norm of each parameter vector fixed and its mean at zero during the optimization process. A parameter vector is defined as the part of a parameter responsible for one dimension of the output. For example, FC(m, n) has two parameters: its weight of shape [m, n] and its bias of shape [n]. Its weight has m parameter vectors, each of which is subject to the norm and mean constraints. For the bias, one element is responsible for one output dimension, so it is not subject to the norm and zero-mean constraints. Since the range of the output of a model should not be constrained, you should set opt_config for the output layers as dict(fixed_norm=False, max_norm=math.inf, zero_mean=False), or use a large finite max_norm or weight_decay to introduce some regularization.

For a parameter p with two or more dimensions, its parameter vectors are assumed to be p[0], p[1], …, p[-1]. This is correct for many ALF layers (e.g. FC, Conv2D, TransformerBlock), but not all ALF layers follow this rule; ParallelFC and other parallel layers are examples. You should not use NeroPlus if your model contains such layers.

By default, 1D parameters are not subject to the constraint (i.e. max_norm=math.inf, fixed_norm=False, zero_mean=False). If the constraints are desired, they can be specified using opt_args attribute of Parameter.
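
As a rough illustration of the constraints described above, the sketch below applies them to each row of a small 2D weight stored as nested lists. The helper constrain_vector and its eps argument are hypothetical, and clipping to max_norm when fixed_norm=False is an assumption based on the "upper bound" wording; NeroPlus itself may implement the projection differently.

```python
import math


def constrain_vector(vec, max_norm=1.0, fixed_norm=True, zero_mean=True, eps=1e-7):
    """Apply the zero-mean and norm constraints to one parameter vector."""
    if zero_mean:
        mean = sum(vec) / len(vec)
        vec = [x - mean for x in vec]
    norm = math.sqrt(sum(x * x for x in vec))
    if fixed_norm:
        # Rescale so that the norm is (approximately) exactly max_norm.
        scale = max_norm / (norm + eps)
    else:
        # Only shrink vectors whose norm exceeds max_norm.
        scale = min(1.0, max_norm / (norm + eps))
    return [x * scale for x in vec]


# The parameter vectors of a 2D parameter p are taken to be p[0], p[1], ...
weight = [[3.0, 1.0, 2.0], [0.5, -0.5, 0.3]]
constrained = [constrain_vector(row) for row in weight]
```

Passing fixed_norm=False, max_norm=math.inf, zero_mean=False leaves a vector unchanged, which corresponds to the unconstrained setting suggested for output layers.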

The main enhancements compared to the original Nero optimizer include:

Option for ADAM like update (normalizing_grad_by_norm=False)
Upper bound constraint of weight norm (fixed_norm=False)
Weight decay and L2 regularization.

To use this optimizer, you should first call NeroPlus.initialize() to normalize the parameters of your model for the given constraints before actually using your model for training.

Parameters

params – iterable of parameters to optimize or dicts defining parameter groups.
lr (float) – learning rate
betas (Tuple[float]) – coefficients between 0 and 1. They are used for computing running averages of gradient and its squared norm or elementwise square. betas[0] can be zero, in which case no running average will be performed. betas[1] must be greater than 0.
eps (float) – term added to the denominator to improve numerical stability which corresponds to the epsilon_hat in the Adam paper.
normalizing_grad_by_norm – whether to normalize the gradient by the running average of its squared norm (True) or of its elementwise square (False). Note that the original Nero optimizer uses True for this. However, we found that the Adam-like behavior is better.
max_norm (float) – maximal norm of each parameter vector. A parameter vector is part of a parameter responsible for one output dimension.
weight_decay (float) – weight decay. This is the same as the weight decay of AdamW, which is implemented as subtracting lr * weight_decay * w from the parameter.
l2_regularization (float) – L2 penalty. This is the same as the weight decay of Adam, which is implemented as adding weight_decay * w to the gradient.
fixed_norm (bool) – whether to fix the norm of the parameter vector. If True, the norm will be fixed at max_norm.
zero_mean (bool) – whether to enforce that the mean of a parameter vector is zero.

lr, weight_decay, l2_regularization, fixed_norm, max_norm and zero_mean can be set individually for each parameter using the opt_args attribute of Parameter. opt_args should be a dictionary. Additionally, lr_scale can be used to scale the global learning rate for a specific parameter.
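
The difference between weight_decay and l2_regularization only shows up when the gradient is rescaled adaptively. The scalar sketch below uses grad / (|grad| + eps) as a crude stand-in for that rescaling; this normalization, the function name and the numbers are assumptions for illustration and are not the NeroPlus update rule.

```python
def normalized_step(w, grad, lr, weight_decay=0.0, l2_regularization=0.0, eps=1e-7):
    """One scalar update with a sign-like normalized gradient (illustrative only)."""
    # L2 penalty: added to the gradient, so it passes through the normalization.
    grad = grad + l2_regularization * w
    # Decoupled weight decay: shrinks the weight directly, bypassing normalization.
    w = w - lr * weight_decay * w
    return w - lr * grad / (abs(grad) + eps)


w_decay = normalized_step(2.0, 0.5, lr=0.1, weight_decay=0.1)
w_l2 = normalized_step(2.0, 0.5, lr=0.1, l2_regularization=0.1)
```

Here the L2 term is absorbed by the normalization (the step stays close to lr), while the decoupled decay shrinks the weight by an extra lr * weight_decay * w.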

add_param_group(param_group)[source]#

Add a param group to the Optimizer's param_groups.

This can be useful when fine tuning a pre-trained network as frozen layers can be made trainable and added to the Optimizer as training progresses.

Parameters

param_group (dict) – specifies what Tensors should be optimized along with group-specific optimization options.

static initialize(model, max_norm=1, fixed_norm=True, zero_mean=True)[source]#

step(closure=None)[source]#

Performs a single optimization step.

Parameters: closure (callable, optional) – A closure that reevaluates the model and returns the loss.

alf.optimizers.optimizers#

Adam#: alias of alf.optimizers.optimizers.Adam_

AdamTF#: alias of alf.optimizers.optimizers.AdamTF_

AdamW#: alias of alf.optimizers.optimizers.AdamW_

NeroPlus#: alias of alf.optimizers.optimizers.NeroPlus_

SGD#: alias of alf.optimizers.optimizers.SGD_

wrap_optimizer(cls)[source]#

A helper function to construct torch optimizers with params as [{'params': []}]. After construction, new parameter groups can be added by using the add_param_group() method.

This wrapper also clips gradients first before calling step().

alf.optimizers.traj_optimizers#

class CEMOptimizer(solution_dim, population_size, cost_func, upper_bound, lower_bound, elite_size=50, max_iter_num=5, epsilon=0.01, tau=0.9, min_var=1e-05)[source]#

Bases: alf.optimizers.traj_optimizers.TrajOptimizer

Creates a CEM Optimizer

This module optimizes a given cost function via the Cross-Entropy Method, which iterates between evaluating a population of samples generated from a probability distribution and updating the distribution based on the evaluation, so that better samples are generated in the next iteration. In practice, a multi-dimensional Gaussian distribution with a diagonal covariance matrix is used.

Parameters

solution_dim (int) – the dimensionality of the problem space
population_size (int) – the number of candidate solutions to be sampled at every iteration
cost_func (Callable) – the cost function to be minimized. It takes as input: 1) init observation 2) action_sequence with the shape of [batch_size, population_size, solution_dim]
upper_bound (float|Tensor) – upper bounds for elements in solution
lower_bound (float|Tensor) – lower bounds for elements in solution
elite_size (int) – the number of elites selected in each round. Elites represent the group of the top-elite_size members from the population based on their cost values. They are used to update the mean and variance of the Gaussian population generation distribution.
max_iter_num (int|Tensor) – the maximum number of CEM iterations
epsilon (float) – a minimum variance threshold. If the maximum variance of the population falls below it, the CEM iteration will stop.
tau (float) – a value in (0, 1) for softly updating the population mean and variance:

mean = (1 - tau) * mean + tau * new_mean
var = (1 - tau) * var + tau * new_var

min_var (float) – minimum value of the variance for the Gaussian distribution to sample from
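
The loop described by these parameters can be sketched in plain Python as follows. This is a simplified, unbatched illustration: the function cem_minimize, its cost function signature (one candidate in, one scalar cost out, no observation) and the seed argument are hypothetical and do not match the CEMOptimizer interface.

```python
import random


def cem_minimize(cost_func, solution_dim, lower, upper, population_size=200,
                 elite_size=20, max_iter_num=20, epsilon=0.01, tau=0.9,
                 min_var=1e-5, seed=0):
    """Minimize cost_func over [lower, upper]^solution_dim with a diagonal Gaussian."""
    rng = random.Random(seed)
    mean = [0.5 * (upper + lower)] * solution_dim
    var = [0.5 * (upper - lower)] * solution_dim
    for _ in range(max_iter_num):
        if max(var) < epsilon:
            break  # the population has collapsed enough; stop early
        # Sample candidates, flooring the variance at min_var and clipping to bounds.
        population = [
            [min(upper, max(lower, rng.gauss(mu, max(v, min_var) ** 0.5)))
             for mu, v in zip(mean, var)]
            for _candidate in range(population_size)
        ]
        # Elites are the elite_size candidates with the lowest cost.
        elites = sorted(population, key=cost_func)[:elite_size]
        for d in range(solution_dim):
            column = [elite[d] for elite in elites]
            new_mean = sum(column) / elite_size
            new_var = sum((x - new_mean) ** 2 for x in column) / elite_size
            # Soft update of the sampling distribution with rate tau.
            mean[d] = (1 - tau) * mean[d] + tau * new_mean
            var[d] = (1 - tau) * var[d] + tau * new_var
    return mean


def cost(x):
    # Toy quadratic cost with its minimum at (0.3, 0.3).
    return sum((xi - 0.3) ** 2 for xi in x)


solution = cem_minimize(cost, solution_dim=2, lower=-1.0, upper=1.0)
```

The returned mean should end up close to the minimizer of the toy cost; a smaller tau makes the distribution move more slowly but also more smoothly.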

obtain_solution(observation, init_mean=None, init_var=None)[source]#

Minimize the cost function provided by using the CEM method.

Parameters

observation (Tensor) – the initial observation for cost calculation
init_mean (None|Tensor) – initial mean of the population. If None, the mean is initialized to have value as 0.5 * (self._upper_bound + self._lower_bound).
init_var (None|Tensor) – initial variance of the population. If None, the variance is initialized to have value as 0.5 * (upper_bound - lower_bound).

class RandomOptimizer(solution_dim, population_size, cost_func, upper_bound, lower_bound)[source]#

Bases: alf.optimizers.traj_optimizers.TrajOptimizer

Random Trajectory Optimizer

This module conducts trajectory optimization via random shooting, i.e., generating a random population for each sample in the batch and selecting the candidates with the lowest cost as the solution.

Parameters

solution_dim (int) – The dimensionality of the problem space
population_size (int) – The number of candidate solutions to be sampled at every iteration
cost_func (Callable) – the cost function to be minimized. It takes as input: 1) init observation 2) action_sequence with the shape of [batch_size, population_size, solution_dim], and returns a cost Tensor of shape [batch_size, population_size]
upper_bound (float|Tensor) – upper bounds for elements in solution
lower_bound (float|Tensor) – lower bounds for elements in solution

obtain_solution(observation)[source]#

Minimize the cost function provided.

Parameters: observation (Tensor) – the initial observation for cost calculation
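
A minimal, unbatched sketch of random shooting is given below. Uniform sampling within the bounds, the function name random_shooting and the observation-free cost function are assumptions made for illustration; the documentation above does not specify the sampling distribution used by RandomOptimizer.

```python
import random


def random_shooting(cost_func, solution_dim, lower, upper, population_size=500, seed=0):
    """Sample candidates uniformly within the bounds and return the cheapest one."""
    rng = random.Random(seed)
    population = [
        [rng.uniform(lower, upper) for _dim in range(solution_dim)]
        for _candidate in range(population_size)
    ]
    return min(population, key=cost_func)


def shooting_cost(x):
    # Toy quadratic cost with its minimum at (0.3, 0.3).
    return sum((xi - 0.3) ** 2 for xi in x)


best = random_shooting(shooting_cost, solution_dim=2, lower=-1.0, upper=1.0)
```

Unlike CEM, the sampling distribution is never refined, so the quality of the solution depends only on the population size.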

class TrajOptimizer(*args, **kwargs)[source]#

Bases: object

Trajectory Optimizer Module

This module generates an optimized solution by minimizing a given cost function set through set_cost.

obtain_solution(*args, **kwargs)[source]#

reset()[source]#

alf.optimizers.trusted_updater#

TrustedUpdater.

class TrustedUpdater(parameters)[source]#

Bases: object

Adjust variables based on the change calculated by change_f()

The motivation is that if some quantity changes too much after an SGD update, the SGD step might be too big. We want to shrink that step so that the quantity concerned does not change too much. We can also monitor multiple quantities to make sure none of them has a sudden big jump.

It adjusts variables provided at __init__ if the change calculated by change_f is too big:

```
change = change_f()
if change > max_change:
    var <= old_var + 0.9 * (max_change/change) * (var - old_var)
```

The above procedure is repeated until change is not bigger than max_change. Note that change and max_change can be nests of scalars. In this case, the inequality is understood as if any one of the changes is greater than its corresponding max_change.
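
The repeated adjustment can be sketched for a single scalar variable as follows. The function shrink_until_trusted is a hypothetical stand-in that takes the variable explicitly; the real TrustedUpdater operates on the parameters given at construction and supports nested changes.

```python
def shrink_until_trusted(var, old_var, change_f, max_change):
    """Pull var back toward old_var until change_f(var) <= max_change."""
    change = change_f(var)
    initial_change = change
    steps = 0
    while change > max_change:
        # Shrink the step slightly more than proportionally (factor 0.9).
        var = old_var + 0.9 * (max_change / change) * (var - old_var)
        change = change_f(var)
        steps += 1
    return var, initial_change, steps


# Toy usage: the monitored quantity is the distance moved from old_var = 1.0.
new_var, initial_change, steps = shrink_until_trusted(
    var=3.0, old_var=1.0, change_f=lambda v: abs(v - 1.0), max_change=0.5)
```

When the monitored change is linear in the step, as in this toy case, a single adjustment is enough; a nonlinear change_f may need several iterations.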

Create a TrustedUpdater instance.

Parameters: parameters (list[Parameter]) – parameters to be monitored.

adjust_step(change_f, max_change)[source]#

Adjust parameters based on the change calculated by change_f.

This function will copy the new values of the variables to a backup to be used for the next call of adjust_step.

Parameters

change_f (Callable) – a function that calculates a (nested) change based on the current variables.
max_change (float) – (nested) max change allowed.

Returns

the initial change before variables are adjusted
the number of steps to adjust variables (0 for no adjustment)

alf.optimizers.utils#

class GradientNoiseScaleEstimator(batch_size_ratio=0.1, update_rate=0.001, gradient_norm_clip=None, mode='alternative', name='GNSEstimator')[source]#

Bases: torch.nn.modules.module.Module

Implement the simple Gradient Noise Scale estimator as detailed in Appendix A, “An Empirical Model of Large-Batch Training”, McCandlish et al., arXiv, 2018.

The simplified GNS is defined as:

\[B_{simple} = \frac{tr(\Sigma(\theta))}{|G(\theta)|^2},\]

where \(\Sigma\) is the per-sample covariance matrix defined as

\[\Sigma(\theta) = cov_{x\sim p} (\nabla_{\theta} L_x(\theta)),\]

and \(G(\theta)\) is the true gradient given the entire data distribution.

Generally, GNS indicates the noise-to-signal value of SGD. The authors suggest that we should choose a batch size close to GNS in order to average out the noise in the gradient. In other words, GNS is positively correlated to the current gradient descent difficulty. We would expect a high GNS for a difficult learning task, especially when different training samples generate opposite gradient directions.

Note

You can turn on this estimator in TrainerConfig. However, this will increase the back-propagation overhead.

Note that the expectation of the estimated GNS is independent of the batch size in theory, but does depend on the learning rate. A good practice of using this estimator given a learning rate is to make sure that:

1. the learning rate is reasonable (if it is too large, the GNS is unstable),
2. the batch size is large enough (smaller variance), and
3. the batch data can represent samples from the true data distribution.

For example, if your batch is too large but the replay buffer is too small, then the estimate won’t make sense (consider increasing the initial_collect_steps).

We also provide an alternative way of estimating GNS. Given the gradients of two sampled batches \(G_{est1}\) and \(G_{est2}\), we have

\[\begin{split}\begin{array}{l} \alpha\triangleq \mathbb{E}[<G_{est1}\circ G_{est2}>] = |G|^2 \\ \beta\triangleq\mathbb{E}[\frac{|G_{est1}|^2 + |G_{est2}|^2}{2}] = \frac{1}{B}tr[\Sigma] + |G|^2 \\ \end{array}\end{split}\]

Then we can maintain a moving average of \(\bar{\alpha}\) and \(\bar{\beta}\), and use \((\frac{\bar{\beta}}{\bar{\alpha}}-1)B\) as the estimated GNS.
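
The alternative estimate can be checked numerically on synthetic data. The sketch below draws per-sample gradients as a fixed true gradient plus independent Gaussian noise, so that tr(Σ) and |G|² are known in advance. The function estimate_gns, the Gaussian noise model and the use of plain averages in place of moving averages are assumptions made for illustration; this is not the GradientNoiseScaleEstimator code.

```python
import random


def estimate_gns(true_grad, noise_std, batch_size, num_trials=10000, seed=0):
    """Estimate (beta / alpha - 1) * B from pairs of independent batch gradients."""
    rng = random.Random(seed)

    def batch_gradient():
        # Mean of batch_size per-sample gradients: true gradient plus Gaussian noise.
        return [
            sum(g + rng.gauss(0.0, noise_std) for _sample in range(batch_size)) / batch_size
            for g in true_grad
        ]

    alpha_sum = 0.0
    beta_sum = 0.0
    for _trial in range(num_trials):
        g1, g2 = batch_gradient(), batch_gradient()
        # alpha: inner product of the two batch gradients, with expectation |G|^2.
        alpha_sum += sum(a * b for a, b in zip(g1, g2))
        # beta: mean squared norm, with expectation tr(Sigma) / B + |G|^2.
        beta_sum += 0.5 * (sum(a * a for a in g1) + sum(b * b for b in g2))
    alpha_bar = alpha_sum / num_trials
    beta_bar = beta_sum / num_trials
    return (beta_bar / alpha_bar - 1.0) * batch_size


# Theoretical value: tr(Sigma) / |G|^2 = (4 dims * 2.0^2) / 1.0 = 16.
gns = estimate_gns(true_grad=[1.0, 0.0, 0.0, 0.0], noise_std=2.0, batch_size=8)
```

The estimate fluctuates around the theoretical value, and the fluctuation shrinks as more trials are averaged, which mirrors the bias/variance trade-off controlled by update_rate.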

Parameters

batch_size_ratio (float) – the portion of a batch to be used as a “smaller” batch. In theory, another smaller batch should be sampled independently from the data distribution. However, for simplicity, this estimator samples the smaller batch from a batch and uses the remaining as the larger batch. So this ratio should be small (<0.5). If the ratio is too small, the calculated smaller batch size will be clipped at 1.
update_rate (float) – the update rate for computing moving averages of the quantities needed by GNS. Generally, a smaller value (slower update) makes the estimated GNS more biased (because quantities at different training steps are averaged), while a larger value (quicker update) gives it a higher variance.
gradient_norm_clip (Optional[float]) – a clipping value for global gradient norm. If None, no clipping is performed. Usually, a clipping value is required for a stable GNS estimate. Depending on how stable the GNS is estimated, this value could also suggest a clipping norm for the optimizer.
mode (str) – either “paper” or “alternative”. “paper” uses the calculation in the paper. “alternative” is the default mode as its calculation is easier to understand.
name (str) –

forward(loss, tensors)[source]#

Given a loss tensor and a nest of tensors, return the estimated GNS.

Parameters

loss (Tensor) – a loss tensor before taking the mean. Each entry of the tensor represents an individual loss on a single training sample. Ideally, the samples used for computing these losses should be sampled with replacement independently. The loss can have a shape of either [T,B] or [B]. The estimate will be more stable if B is large and the batch could represent samples from the data distribution well.
tensors (Union[Tensor, List[ForwardRef], Tuple[()], Tuple[ForwardRef, …], Dict[str, ForwardRef]]) – a nest of tensors whose gradients are considered

Returns

gns – the estimated gradient noise scale (a scalar). A smaller value means more effective grad steps.

training: bool#

get_opt_arg(p, argname, default=None)[source]#

Get parameter specific optimizer arguments.

Parameters

p (Parameter) – the parameter
argname (str) – name of the argument
default (Optional[Any]) – the default value

Returns

The parameter specific value if it is found, otherwise default
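
A lookup with this behavior could look like the sketch below. It assumes that opt_args is an optional dictionary attribute on the parameter and that a stored None counts as "not found"; the function name get_opt_arg_sketch and the SimpleNamespace stand-ins for Parameter are hypothetical, and the actual ALF implementation may differ.

```python
from types import SimpleNamespace


def get_opt_arg_sketch(p, argname, default=None):
    """Return p.opt_args[argname] if present and not None, otherwise default."""
    opt_args = getattr(p, "opt_args", None)
    if opt_args is None:
        return default
    value = opt_args.get(argname)
    return default if value is None else value


# Stand-ins for Parameter objects with and without per-parameter options.
with_args = SimpleNamespace(opt_args={"weight_decay": 0.01})
without_args = SimpleNamespace()

per_param_decay = get_opt_arg_sketch(with_args, "weight_decay", default=0.0)
fallback_decay = get_opt_arg_sketch(without_args, "weight_decay", default=0.0)
```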