alf.optimizers#
alf.optimizers.adam_tf#
- class AdamTF(params=[{'params': []}], lr=0.001, betas=(0.9, 0.999), eps=1e-07, weight_decay=0, amsgrad=False)[source]#
Bases:
torch.optim.optimizer.Optimizer
Implementation of the Adam algorithm following TensorFlow’s convention.
This class should not be used directly, as it will be wrapped for gradient clipping. Use the wrapped optimizer AdamTF in alf/optimizers/optimizers.py instead.
- Parameters
params (iterable) – iterable of parameters to optimize or dicts defining parameter groups.
lr (float, optional) – learning rate (default: 1e-3).
betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999)).
eps (float, optional) – term added to the denominator to improve numerical stability which corresponds to the epsilon_hat in the Adam paper (default: 1e-7).
weight_decay (float, optional) – weight decay (L2 penalty) (default: 0). This argument can be parameter-specific, which means that if Parameter.opt_args["weight_decay"] is not None, it will be used instead.
amsgrad (boolean, optional) – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False).
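As a rough illustration of the epsilon_hat convention mentioned for eps, the sketch below performs Adam steps on a single scalar in plain Python. It assumes the formulation given in the Adam paper in which the bias correction is folded into the step size and eps is added to the uncorrected square root of the second moment. The function name adam_tf_step is hypothetical, and this is not ALF's code.

```python
import math


def adam_tf_step(param, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999), eps=1e-7):
    """One Adam step on a scalar using the epsilon_hat formulation.

    ``t`` is the 1-based step count. Returns the updated (param, m, v).
    """
    beta1, beta2 = betas
    # Running averages of the gradient and its square.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    # Bias correction is folded into the step size.
    lr_t = lr * math.sqrt(1 - beta2 ** t) / (1 - beta1 ** t)
    # eps (epsilon_hat) is added to the uncorrected sqrt(v).
    param = param - lr_t * m / (math.sqrt(v) + eps)
    return param, m, v


# Minimize f(x) = x^2 (gradient 2x) starting from x = 1.0.
x, m, v = 1.0, 0.0, 0.0
x, m, v = adam_tf_step(x, 2 * x, m, v, t=1, lr=0.01)
first_x = x  # the first step has a size of roughly lr
for t in range(2, 2001):
    x, m, v = adam_tf_step(x, 2 * x, m, v, t, lr=0.01)
```

In this formulation the first update has a magnitude close to lr regardless of the gradient scale, which is a convenient sanity check.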
alf.optimizers.adamw#
- class AdamW(params, lr=0.001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0.01, amsgrad=False)[source]#
Bases:
torch.optim.optimizer.Optimizer
AdamW optimizer.
The current implementation of AdamW in pytorch 1.8 has a bug introduced during refactoring (https://github.com/pytorch/pytorch/pull/50411). This method is copied from the latest fix (#52944), which has been merged into pytorch repo after the release of pytorch 1.8.1 (https://github.com/pytorch/pytorch/pull/52944).
TODO: remove this after upgrading to pytorch versions including fix (#52944).
The original Adam algorithm was proposed in Adam: A Method for Stochastic Optimization. The AdamW variant was proposed in Decoupled Weight Decay Regularization.
- Parameters
params (iterable) – iterable of parameters to optimize or dicts defining parameter groups
lr (float, optional) – learning rate (default: 1e-3)
betas (Tuple[float, float], optional) – coefficients used for computing running averages of gradient and its square (default: (0.9, 0.999))
eps (float, optional) – term added to the denominator to improve numerical stability (default: 1e-8)
weight_decay (float, optional) – weight decay coefficient (default: 1e-2)
amsgrad (boolean, optional) – whether to use the AMSGrad variant of this algorithm from the paper On the Convergence of Adam and Beyond (default: False)
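The practical difference between decoupled weight decay and an L2 penalty only shows up with an adaptive update, so the sketch below applies both to one scalar weight with a zero loss gradient. It is a simplified single-step illustration, not the code of this class; the function name adam_like_step and the decoupled flag are hypothetical.

```python
import math


def adam_like_step(w, grad, m, v, t, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
                   weight_decay=1e-2, decoupled=True):
    """One Adam step on a scalar with either decoupled decay or an L2 penalty."""
    beta1, beta2 = betas
    if decoupled:
        # AdamW style: shrink the weight directly, outside the adaptive update.
        w = w * (1 - lr * weight_decay)
    else:
        # L2 style: the penalty gradient passes through the adaptive update.
        grad = grad + weight_decay * w
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v


# Zero loss gradient, so only the regularization moves the weight.
w_decoupled, _, _ = adam_like_step(10.0, 0.0, 0.0, 0.0, t=1, lr=0.1, decoupled=True)
w_l2, _, _ = adam_like_step(10.0, 0.0, 0.0, 0.0, t=1, lr=0.1, decoupled=False)
```

With decoupled decay the weight shrinks by the factor lr * weight_decay, while the L2 penalty is rescaled by the adaptive normalization and produces a step of roughly lr.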
alf.optimizers.nero_plus#
NeroPlus optimizer.
- class NeroPlus(params=[{'params': []}], lr=0.01, betas=(0.9, 0.999), eps=1e-07, normalizing_grad_by_norm=False, max_norm=1, weight_decay=0, l2_regularization=0, fixed_norm=True, zero_mean=True)[source]#
Bases:
torch.optim.optimizer.Optimizer
NeroPlus Optimizer
This is an enhanced version of the Nero optimizer described in the following paper:
Yang Liu et al. Learning by Turning: Neural Architecture Aware Optimisation
The essence of this optimizer is to keep the norm of each parameter vector fixed and its mean at zero during the optimization process. A parameter vector is defined as the part of a parameter responsible for one dimension of the output. For example, FC(m, n) has two parameters: its weight of shape [m, n] and its bias of shape [n]. Its weight has m parameter vectors. Each of these m vectors is subject to the norm and mean constraints. For the bias, one element is responsible for one output dimension, so it is not subject to the norm and zero-mean constraints. Since the range of the output of a model should not be constrained, you should set opt_config for the output layers as dict(fixed_norm=False, max_norm=math.inf, zero_mean=False), or use a large finite max_norm or weight_decay to introduce some regularization.
For a parameter p with two or more dimensions, its parameter vectors are assumed to be p[0], p[1], …, p[-1]. This is correct for many ALF layers (e.g. FC, Conv2D, TransformerBlock), but not all ALF layers follow this rule. ParallelFC and other parallel layers are such examples, so you should not use NeroPlus if your model contains such layers.
By default, 1D parameters are not subject to the constraints (i.e. max_norm=math.inf, fixed_norm=False, zero_mean=False). If the constraints are desired, they can be specified using the opt_args attribute of Parameter.
The main enhancements compared to the original Nero optimizer include:
Option for Adam-like update (normalizing_grad_by_norm=False)
Upper bound constraint on the weight norm (fixed_norm=False)
Weight decay and L2 regularization.
To use this optimizer, you should first use NeroPlus.initialize() to normalize the parameters of your model for the given constraints before actually using your model for training.
- Parameters
params – iterable of parameters to optimize or dicts defining parameter groups.
lr (float) – learning rate
betas (Tuple[float]) – coefficients between 0 and 1. They are used for computing running averages of the gradient and its squared norm or elementwise square. betas[0] can be zero, in which case no running average will be performed. betas[1] must be greater than 0.
eps (float) – term added to the denominator to improve numerical stability, which corresponds to the epsilon_hat in the Adam paper.
normalizing_grad_by_norm – whether to normalize the gradient by the running average of its squared norm (True) or of its elementwise square (False). Note that the original Nero optimizer uses True for this. However, we found that the Adam-like behavior is better.
max_norm (float) – maximal norm of each parameter vector. A parameter vector is the part of a parameter responsible for one output dimension.
weight_decay (float) – weight decay. This is the same as the weight decay of AdamW, which is implemented as subtracting lr * weight_decay * w from the parameter.
l2_regularization (float) – L2 penalty. This is the same as the weight decay of Adam, which is implemented as adding weight_decay * w to the gradient.
fixed_norm (bool) – whether to fix the norm of the parameter vector. If True, the norm will be fixed at max_norm.
zero_mean (bool) – whether to enforce that the mean of a parameter vector is zero.
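The norm and zero-mean constraints described for NeroPlus can be pictured as a projection applied to each parameter vector. The sketch below shows that projection on plain Python lists, assuming a finite max_norm whenever fixed_norm is True. The function name project_parameter_vector is hypothetical, and how and when ALF applies the constraints internally is not stated here.

```python
import math


def project_parameter_vector(vec, max_norm=1.0, fixed_norm=True, zero_mean=True):
    """Project one parameter vector (a list of floats) onto the constraints."""
    if zero_mean:
        mean = sum(vec) / len(vec)
        vec = [x - mean for x in vec]
    norm = math.sqrt(sum(x * x for x in vec))
    # fixed_norm rescales every vector to max_norm; otherwise max_norm only
    # acts as an upper bound. fixed_norm=True assumes a finite max_norm.
    if norm > 0 and (fixed_norm or norm > max_norm):
        scale = max_norm / norm
        vec = [x * scale for x in vec]
    return vec


# A weight with two parameter vectors (rows), following the p[0], p[1], ...
# convention described in the text.
weight = [[3.0, 1.0], [0.5, 0.1]]
fixed = [project_parameter_vector(row) for row in weight]
bounded = [project_parameter_vector(row, fixed_norm=False) for row in weight]
```

With fixed_norm=True both rows end up with norm max_norm, while with fixed_norm=False the second row keeps its smaller norm because it is already below the bound.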
lr, weight_decay, l2_regularization, fixed_norm, max_norm and zero_mean can be set individually for each parameter using the opt_args attribute of Parameter. opt_args should be a dictionary. Additionally, lr_scale can be used to scale the global learning rate for a specific parameter.
- add_param_group(param_group)[source]#
Add a param group to the Optimizer’s param_groups.
This can be useful when fine-tuning a pre-trained network, as frozen layers can be made trainable and added to the Optimizer as training progresses.
- Parameters
param_group (dict) – specifies what Tensors should be optimized along with group-specific optimization options.
alf.optimizers.optimizers#
- Adam#
alias of
alf.optimizers.optimizers.Adam_
- AdamTF#
alias of
alf.optimizers.optimizers.AdamTF_
- AdamW#
alias of
alf.optimizers.optimizers.AdamW_
- NeroPlus#
alias of
alf.optimizers.optimizers.NeroPlus_
- SGD#
alias of
alf.optimizers.optimizers.SGD_
alf.optimizers.traj_optimizers#
- class CEMOptimizer(solution_dim, population_size, cost_func, upper_bound, lower_bound, elite_size=50, max_iter_num=5, epsilon=0.01, tau=0.9, min_var=1e-05)[source]#
Bases:
alf.optimizers.traj_optimizers.TrajOptimizer
Creates a CEM Optimizer
This module optimizes a given cost function via the Cross-Entropy Method, which iterates between evaluating a population of samples generated from a probability distribution and updating the distribution based on the evaluation, so that better samples are generated in the next iteration. In practice, a multi-dimensional Gaussian distribution with a diagonal covariance matrix is used.
- Parameters
solution_dim (int) – the dimensionality of the problem space
population_size (int) – the number of candidate solutions to be sampled at every iteration
cost_func (Callable) – the cost function to be minimized. It takes as input: 1) the initial observation and 2) action_sequence with the shape [batch_size, population_size, solution_dim]
upper_bound (float|Tensor) – upper bounds for elements in solution
lower_bound (float|Tensor) – lower bounds for elements in solution
elite_size (int) – the number of elites selected in each round. Elites represent the group of the top-elite_size members from the population based on their cost values. They are used to update the mean and variance of the Gaussian population generation distribution.
max_iter_num (int|Tensor) – the maximum number of CEM iterations
epsilon (float) – a minimum variance threshold. If the maximum variance of the population falls below it, the CEM iteration will stop.
tau (float) – a value in (0, 1) for softly updating the population mean and variance:
mean = (1 - tau) * mean + tau * new_mean
var = (1 - tau) * var + tau * new_var
min_var (float) – minimum value of the variance for the Gaussian distribution to sample from
- obtain_solution(observation, init_mean=None, init_var=None)[source]#
Minimize the cost function provided by using the CEM method.
- Parameters
observation (Tensor) – the initial observation for cost calculation
init_mean (None|Tensor) – initial mean of the population. If None, the mean is initialized to 0.5 * (upper_bound + lower_bound).
init_var (None|Tensor) – initial variance of the population. If None, the variance is initialized to 0.5 * (upper_bound - lower_bound).
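The CEM loop described above can be sketched in plain Python for a single, unbatched problem. This is only an illustration under simplifying assumptions: the cost function takes one candidate solution instead of an observation plus a batched action sequence, the bounds are scalars, and the final mean is returned as the solution. The function name cem_minimize is hypothetical.

```python
import random


def cem_minimize(cost_func, solution_dim, population_size, lower_bound,
                 upper_bound, elite_size=50, max_iter_num=5, epsilon=0.01,
                 tau=0.9, min_var=1e-5, seed=0):
    """Minimize cost_func with a diagonal-Gaussian Cross-Entropy Method."""
    rng = random.Random(seed)
    # Default initialization as described for init_mean and init_var.
    mean = [0.5 * (upper_bound + lower_bound)] * solution_dim
    var = [0.5 * (upper_bound - lower_bound)] * solution_dim
    for _iteration in range(max_iter_num):
        population = []
        for _member in range(population_size):
            sample = []
            for mu, s2 in zip(mean, var):
                # min_var keeps the sampling distribution from collapsing.
                std = max(s2, min_var) ** 0.5
                value = rng.gauss(mu, std)
                sample.append(min(upper_bound, max(lower_bound, value)))
            population.append(sample)
        # Elites are the lowest-cost members of the population.
        elites = sorted(population, key=cost_func)[:elite_size]
        for d in range(solution_dim):
            values = [e[d] for e in elites]
            new_mean = sum(values) / len(values)
            new_var = sum((x - new_mean) ** 2 for x in values) / len(values)
            # Soft update controlled by tau.
            mean[d] = (1 - tau) * mean[d] + tau * new_mean
            var[d] = (1 - tau) * var[d] + tau * new_var
        # Stop once the largest variance falls below the threshold.
        if max(var) < epsilon:
            break
    return mean


def quadratic_cost(solution):
    """Toy cost with its minimum at 0.3 in every dimension."""
    return sum((x - 0.3) ** 2 for x in solution)


solution = cem_minimize(quadratic_cost, solution_dim=2, population_size=200,
                        lower_bound=-1.0, upper_bound=1.0, elite_size=20,
                        max_iter_num=10)
```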
- class RandomOptimizer(solution_dim, population_size, cost_func, upper_bound, lower_bound)[source]#
Bases:
alf.optimizers.traj_optimizers.TrajOptimizer
Random Trajectory Optimizer
This module conducts trajectory optimization via random-shooting-based optimization, i.e., generating a random population for each sample in the batch and selecting those with the lowest cost as the solution.
- Parameters
solution_dim (int) – The dimensionality of the problem space
population_size (int) – The number of candidate solutions to be sampled at every iteration
cost_func (Callable) – the cost function to be minimized. It takes as input: 1) the initial observation and 2) action_sequence with the shape [batch_size, population_size, solution_dim], and returns a cost Tensor of shape [batch_size, population_size]
upper_bound (float|Tensor) – upper bounds for elements in solution
lower_bound (float|Tensor) – lower bounds for elements in solution
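Random shooting, as described for RandomOptimizer, amounts to sampling a population and keeping the cheapest member. The sketch below shows this for one unbatched problem. It assumes uniform sampling within scalar bounds, which the text above does not specify, and the function name random_shooting is hypothetical.

```python
import random


def random_shooting(cost_func, solution_dim, population_size, lower_bound,
                    upper_bound, seed=0):
    """Return the lowest-cost member of a randomly sampled population."""
    rng = random.Random(seed)
    # Assumption: candidates are drawn uniformly within the bounds.
    population = [
        [rng.uniform(lower_bound, upper_bound) for _ in range(solution_dim)]
        for _ in range(population_size)
    ]
    best = min(population, key=cost_func)
    return best, cost_func(best)


def quadratic_cost(solution):
    """Toy cost with its minimum at 0.3 in every dimension."""
    return sum((x - 0.3) ** 2 for x in solution)


best, best_cost = random_shooting(quadratic_cost, solution_dim=2,
                                  population_size=500, lower_bound=-1.0,
                                  upper_bound=1.0)
```

Unlike CEM, there is no iterative refinement, so the quality of the result depends only on the population size.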
alf.optimizers.trusted_updater#
TrustedUpdater.
- class TrustedUpdater(parameters)[source]#
Bases:
object
Adjust variables based on the change calculated by change_f()
The motivation is that if some quantity changes too much after an SGD update, the SGD step might be too big. We want to shrink that step so that the quantity concerned does not change too much. We can also monitor multiple quantities to make sure none of them has a sudden big jump.
It adjusts the variables provided at __init__ if the change calculated by change_f is too big:
```
change = change_f()
if change > max_change:
    var <= old_var + 0.9 * (max_change/change) * (var - old_var)
```
The above procedure is repeated until change is not bigger than max_change. Note that change and max_change can be nests of scalars. In this case, the inequality is understood as holding if any one of the changes is greater than its corresponding max_change.
Create a TrustedUpdater instance.
- Parameters
parameters (list[Parameter]) – parameters to be monitored.
- adjust_step(change_f, max_change)[source]#
Adjust parameters based on the change calculated by change_f
This function will copy the new values of the variables to a backup to be used for the next call of adjust_step.
- Parameters
change_f (Callable) – a function that calculates a (nested) change based on the current variables.
max_change (float) – (nested) max change allowed.
- Returns
the initial change before variables are adjusted
the number of steps taken to adjust variables (0 for no adjustment)
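The step-shrinking procedure of TrustedUpdater can be sketched with plain Python lists in place of parameters. The sketch assumes a single scalar change rather than a nest, and the free function adjust_step with its var and old_var arguments is a hypothetical stand-in for the method, not ALF's implementation.

```python
def adjust_step(var, old_var, change_f, max_change):
    """Shrink the step from old_var to var until change_f() <= max_change.

    ``var`` and ``old_var`` are lists of floats; ``var`` is modified in place.
    Returns (initial_change, number_of_adjustment_steps).
    """
    initial_change = change = change_f()
    steps = 0
    while change > max_change:
        # Move var back toward old_var by the ratio used in the text.
        scale = 0.9 * max_change / change
        for i in range(len(var)):
            var[i] = old_var[i] + scale * (var[i] - old_var[i])
        change = change_f()
        steps += 1
    # Back up the accepted values for the next call.
    old_var[:] = var
    return initial_change, steps


# Toy example: the monitored quantity is the distance moved from 0.0.
old_values = [0.0]
values = [1.0]  # result of an update that moved too far
initial_change, steps = adjust_step(
    values, old_values, change_f=lambda: abs(values[0]), max_change=0.5)
```

Here the initial change of 1.0 exceeds max_change, so a single adjustment scales the step by 0.9 * 0.5 / 1.0 and leaves the variable at 0.45.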
alf.optimizers.utils#
- class GradientNoiseScaleEstimator(batch_size_ratio=0.1, update_rate=0.001, gradient_norm_clip=None, mode='alternative', name='GNSEstimator')[source]#
Bases:
torch.nn.modules.module.Module
Implement the simple Gradient Noise Scale estimator as detailed in Appendix A, “An Empirical Model of Large-Batch Training”, McCandlish et al., arXiv, 2018.
The simplified GNS is defined as:
\[B_{simple} = \frac{tr(\Sigma(\theta))}{|G(\theta)|^2},\]
where \(\Sigma\) is the per-sample covariance matrix defined as
\[\Sigma(\theta) = cov_{x\sim p} (\nabla_{\theta} L_x(\theta)),\]
and \(G(\theta)\) is the true gradient given the entire data distribution.
Generally, GNS indicates the noise-to-signal value of SGD. The authors suggest that we should choose a batch size close to GNS in order to average out the noise in the gradient. In other words, GNS is positively correlated to the current gradient descent difficulty. We would expect a high GNS for a difficult learning task, especially when different training samples generate opposite gradient directions.
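To make the definition of the simplified GNS concrete, the sketch below evaluates it directly from a small set of per-sample gradients. It is a naive plug-in illustration that uses the sample mean as the gradient and the population variance for the covariance trace. It is not the estimator implemented by this class, and the function name simple_gns is hypothetical.

```python
def simple_gns(per_sample_grads):
    """Compute tr(Sigma) / |G|^2 from a list of per-sample gradient vectors."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    mean_grad = [sum(g[d] for g in per_sample_grads) / n for d in range(dim)]
    # The trace of the covariance matrix is the sum of per-dimension variances.
    trace_cov = sum(
        sum((g[d] - mean_grad[d]) ** 2 for g in per_sample_grads) / n
        for d in range(dim)
    )
    grad_sq_norm = sum(x * x for x in mean_grad)
    return trace_cov / grad_sq_norm


aligned = simple_gns([[1.0, 1.0], [1.0, 1.0]])    # identical gradients
opposed = simple_gns([[1.0, 1.0], [-1.0, 1.0]])   # partly opposite gradients
```

Identical per-sample gradients give a GNS of zero, while samples that pull in opposite directions raise it.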
Note
You can turn on this estimator in TrainerConfig. However, this will increase the back-propagation overhead.
Note that the expectation of the estimated GNS is independent of the batch size in theory, but does depend on the learning rate. A good practice when using this estimator with a given learning rate is to make sure that:
1. the learning rate is reasonable. If it’s too large, then the GNS is unstable.
2. the batch size is large enough (smaller variance), and
3. the batch data can represent samples from the true data distribution.
For example, if your batch is too large but the replay buffer is too small, then the estimate won’t make sense (consider increasing the initial_collect_steps).
We also provide an alternative way of estimating GNS. Given the gradients of two sampled batches \(G_{est1}\) and \(G_{est2}\), we have
\[\begin{split}\begin{array}{l} \alpha\triangleq \mathbb{E}[<G_{est1}\circ G_{est2}>] = |G|^2 \\ \beta\triangleq\mathbb{E}[\frac{|G_{est1}|^2 + |G_{est2}|^2}{2}] = \frac{1}{B}tr[\Sigma] + |G|^2 \\ \end{array}\end{split}\]
Then we can maintain moving averages \(\bar{\alpha}\) and \(\bar{\beta}\), and use \((\frac{\bar{\beta}}{\bar{\alpha}}-1)B\) as the estimated GNS.
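The alternative estimate can be sketched with plain Python lists standing in for the two batch gradients. The sketch assumes both gradients come from batches of the same size B, starts both moving averages at zero, and does not guard against a zero or negative average of alpha. The class name AlternativeGNS is hypothetical and this is not ALF's implementation.

```python
class AlternativeGNS:
    """Moving-average estimate of (beta / alpha - 1) * B."""

    def __init__(self, batch_size, update_rate=0.001):
        self.batch_size = batch_size
        self.update_rate = update_rate
        self.alpha_avg = 0.0
        self.beta_avg = 0.0

    def update(self, g1, g2):
        """Update with two batch gradients and return the current estimate."""
        # alpha: inner product of the two gradient estimates.
        alpha = sum(a * b for a, b in zip(g1, g2))
        # beta: mean of the two squared gradient norms.
        beta = 0.5 * (sum(a * a for a in g1) + sum(b * b for b in g2))
        r = self.update_rate
        self.alpha_avg = (1 - r) * self.alpha_avg + r * alpha
        self.beta_avg = (1 - r) * self.beta_avg + r * beta
        return (self.beta_avg / self.alpha_avg - 1) * self.batch_size


estimator = AlternativeGNS(batch_size=4)
gns = estimator.update([1.0, 1.0], [1.0, 0.0])
```

In this toy call alpha is 1.0 and beta is 1.5, so the estimate is (1.5 - 1) * 4.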
- Parameters
batch_size_ratio (float) – the portion of a batch to be used as a “smaller” batch. In theory, another smaller batch should be sampled independently from the data distribution. However, for simplicity, this estimator samples the smaller batch from a batch and uses the remainder as the larger batch. So this ratio should be small (<0.5). If the ratio is too small, the calculated smaller batch size will be clipped at 1.
update_rate (float) – the update rate for computing moving averages of the quantities needed by GNS. Generally, a smaller value (slower update) makes the estimated GNS more biased (because quantities at different training steps are averaged), while a larger value (quicker update) gives it a higher variance.
gradient_norm_clip (Optional[float]) – a clipping value for the global gradient norm. If None, no clipping is performed. Usually, a clipping value is required for a stable GNS estimate. Depending on how stable the GNS estimate is, this value could also suggest a clipping norm for the optimizer.
mode (str) – either “paper” or “alternative”. “paper” uses the calculation in the paper. “alternative” is the default mode as its calculation is easier to understand.
name (str) –
- forward(loss, tensors)[source]#
Given a loss tensor and a nest of tensors, return the estimated GNS.
- Parameters
loss (Tensor) – a loss tensor before taking the mean. Each entry of the tensor represents an individual loss on a single training sample. Ideally, the samples used for computing these losses should be sampled with replacement independently. The loss can have a shape of either [T,B] or [B]. The estimate will be more stable if B is large and the batch represents samples from the data distribution well.
tensors (Union[Tensor, List[ForwardRef], Tuple[()], Tuple[ForwardRef, …], Dict[str, ForwardRef]]) – a nest of tensors whose gradients are considered
- Returns
gns – the estimated gradient noise scale (a scalar). A smaller value means more effective grad steps.
- training: bool#