alf.algorithms#
alf.algorithms.actor_critic_algorithm#
Actor critic algorithm.
- class ActorCriticAlgorithm(observation_spec, action_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), reward_weights=None, actor_network_ctor=<class 'alf.networks.actor_distribution_networks.ActorDistributionNetwork'>, value_network_ctor=<class 'alf.networks.value_networks.ValueNetwork'>, epsilon_greedy=None, env=None, config=None, loss=None, loss_class=<class 'alf.algorithms.actor_critic_loss.ActorCriticLoss'>, optimizer=None, checkpoint=None, debug_summaries=False, name='ActorCriticAlgorithm')[source]#
Bases:
alf.algorithms.on_policy_algorithm.OnPolicyAlgorithm
Actor critic algorithm.
- Parameters
observation_spec (nested TensorSpec) – representing the observations.
action_spec (nested BoundedTensorSpec) – representing the actions.
reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).
reward_weights (None|list[float]) – this is only used when the reward is multidimensional. In that case, the weighted sum of the v values is used for training the actor if reward_weights is not None. Otherwise, the sum of the v values is used.
env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. env only needs to be provided to the root Algorithm.
epsilon_greedy (float) – a floating value in [0,1], representing the chance of action sampling instead of taking argmax. This can help prevent a dead loop in some deterministic environments like Breakout. Only used for evaluation. If None, its value is taken from config.epsilon_greedy and then alf.get_config_value(TrainerConfig.epsilon_greedy).
config (TrainerConfig) – config for training. config only needs to be provided to the algorithm which performs train_iter() by itself.
actor_network_ctor (Callable) – Function to construct the actor network. actor_network_ctor needs to accept input_tensor_spec and action_spec as its arguments and return an actor network. The constructed network will be called with forward(observation, state).
value_network_ctor (None | Callable) – Function to construct the value network. value_network_ctor needs to accept input_tensor_spec as its argument and return a value network. The constructed network will be called with forward(observation, state) and returns a value tensor for each observation, given the observation and network state. Note that if the algorithm is constructed for evaluation or deployment only, value_network_ctor can be set to None and the value network will not be constructed at all.
loss (None|ActorCriticLoss) – an object for calculating loss. If None, a default loss of class loss_class will be used.
loss_class (type) – the class of the loss. The signature of its constructor: loss_class(debug_summaries)
optimizer (torch.optim.Optimizer) – The optimizer for training
checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.
debug_summaries (bool) – True if debug summaries should be created.
name (str) – Name of this algorithm.
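The epsilon_greedy behaviour described in the parameter list can be sketched in plain Python. This is an illustration only, not ALF's implementation: the helper name `epsilon_greedy_action` and the use of a list of probabilities to stand for a discrete action distribution are assumptions made for the example.

```python
import random


def epsilon_greedy_action(probs, epsilon, rng=random):
    """Pick an action index from a discrete distribution (hypothetical helper).

    With probability ``epsilon`` the action is sampled from ``probs``;
    otherwise the most likely action (argmax) is returned.
    """
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must be in [0, 1]")
    if rng.random() < epsilon:
        # Sample an action according to the distribution.
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]
    # Greedy choice: index of the largest probability.
    return max(range(len(probs)), key=lambda i: probs[i])
```

With `epsilon=0.0` the choice is always the argmax, and with `epsilon=1.0` it is always a sample, which is what breaks repeated identical action sequences in a deterministic environment.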
- convert_train_state_to_predict_state(state)[source]#
Convert RNN state for train_step() to RNN state for predict_step().
- training: bool#
- class ActorCriticInfo(step_type, discount, reward, action, log_prob, action_distribution, value, reward_weights)#
Bases:
tuple
Create new instance of ActorCriticInfo(step_type, discount, reward, action, log_prob, action_distribution, value, reward_weights)
- action#
Alias for field number 3
- action_distribution#
Alias for field number 5
- discount#
Alias for field number 1
- log_prob#
Alias for field number 4
- reward#
Alias for field number 2
- reward_weights#
Alias for field number 7
- step_type#
Alias for field number 0
- value#
Alias for field number 6
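The field-number aliases listed for ActorCriticInfo are the standard behaviour of a Python named tuple. The sketch below builds a local stand-in with the same field order to show that positional and named access agree; the values are placeholders, and the stand-in is defined here rather than imported from ALF.

```python
from collections import namedtuple

# Local stand-in with the documented field order (not imported from ALF).
ActorCriticInfo = namedtuple("ActorCriticInfo", [
    "step_type", "discount", "reward", "action", "log_prob",
    "action_distribution", "value", "reward_weights"
])

# Placeholder values, chosen only to make the fields distinguishable.
info = ActorCriticInfo(
    step_type=0,
    discount=1.0,
    reward=0.5,
    action=2,
    log_prob=-0.7,
    action_distribution=None,
    value=1.2,
    reward_weights=None)

# "Alias for field number 3" means info[3] and info.action are the same item.
action_by_index = info[3]
action_by_name = info.action
```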
alf.algorithms.actor_critic_loss#
- class ActorCriticLoss(gamma=0.99, td_error_loss_fn=<function element_wise_squared_loss>, use_gae=False, td_lambda=0.95, use_td_lambda_return=True, normalize_advantages=False, advantage_clip=None, entropy_regularization=None, td_loss_weight=1.0, debug_summaries=False, name='ActorCriticLoss')[source]#
Bases:
alf.algorithms.algorithm.Loss
An actor-critic loss equal to
(policy_gradient_loss + td_loss_weight * td_loss - entropy_regularization * entropy)
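The combination above can be written out element by element. The sketch below is a plain-Python illustration of that formula only, under the assumption that the three terms have already been computed per sample; the function name `combine_actor_critic_loss` is hypothetical and a None coefficient is treated as "no entropy term".

```python
def combine_actor_critic_loss(policy_gradient_loss,
                              td_loss,
                              entropy,
                              td_loss_weight=1.0,
                              entropy_regularization=None):
    """Combine per-sample loss terms (hypothetical helper).

    Each argument is a list of floats of equal length, one entry per sample.
    """
    # Assumption: None disables the entropy term entirely.
    coef = 0.0 if entropy_regularization is None else entropy_regularization
    return [
        pg + td_loss_weight * td - coef * h
        for pg, td, h in zip(policy_gradient_loss, td_loss, entropy)
    ]
```

The entropy term enters with a minus sign, so a larger entropy lowers the loss and the policy is pushed towards more exploratory distributions.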
- Parameters
gamma (float|list[float]) – A discount factor for future rewards. For multi-dim reward, this can also be a list of discounts, each discount applies to a reward dim.
td_error_loss_fn (Callable) – A function for computing the TD errors loss. This function takes as input the target and the estimated Q values and returns the loss for each element of the batch.
use_gae (bool) – If True, uses generalized advantage estimation for computing per-timestep advantage. Else, just subtracts value predictions from empirical return.
use_td_lambda_return (bool) – Only effective if use_gae is True. If True, uses td_lambda_return for training the value function (td_lambda_return = gae_advantage + value_predictions).
td_lambda (float) – Lambda parameter for TD-lambda computation.
normalize_advantages (bool) – If True, normalize advantages to zero mean and unit variance within the batch for calculating the policy gradient. This is commonly used for PPO.
advantage_clip (float) – If set, clip advantages to \([-x, x]\), where \(x\) is advantage_clip.
entropy_regularization (float) – Coefficient for entropy regularization loss term.
td_loss_weight (float) – the weight for the loss of td error.
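To make use_gae and use_td_lambda_return concrete, here is a minimal single-trajectory sketch of generalized advantage estimation in plain Python. It is not ALF's implementation. The indexing convention is an assumption of this example: `rewards[t]` is the reward received after acting at step t, `values` has one extra bootstrap entry at the end, and episode boundaries are ignored. The function names are hypothetical.

```python
def gae_advantages(rewards, values, gamma=0.99, td_lambda=0.95):
    """Generalized advantage estimation for one trajectory (sketch).

    rewards: list of T floats.
    values: list of T + 1 floats (last entry is the bootstrap value).
    """
    num_steps = len(rewards)
    if len(values) != num_steps + 1:
        raise ValueError("values must have one more entry than rewards")
    advantages = [0.0] * num_steps
    running = 0.0
    # Accumulate discounted TD errors backwards in time.
    for t in reversed(range(num_steps)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        running = delta + gamma * td_lambda * running
        advantages[t] = running
    return advantages


def td_lambda_returns(advantages, values):
    """td_lambda_return = gae_advantage + value_predictions."""
    return [a + v for a, v in zip(advantages, values[:len(advantages)])]
```

Setting `td_lambda=0` reduces each advantage to the one-step TD error, while `td_lambda=1` gives the discounted empirical return minus the value prediction.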
- calc_loss(info)[source]#
Calculate the loss at each step for each sample.
- Parameters
info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).
- Returns
loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).
- Return type
- forward(info)[source]#
Calculate actor critic loss. The first dimension of all the tensors is the time dimension and the second dimension is the batch dimension.
- Parameters
info (namedtuple) – information for calculating loss. All tensors are time-major. It should contain the following fields: reward, step_type, discount, action, action_distribution, value.
- Returns
with extra being ActorCriticLossInfo.
- Return type
- property gamma#
- training: bool#
alf.algorithms.agent#
Agent for integrating multiple algorithms.
- class Agent(observation_spec, action_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), env=None, config=None, rl_algorithm_cls=<class 'alf.algorithms.actor_critic_algorithm.ActorCriticAlgorithm'>, reward_weight_algorithm_cls=None, representation_learner_cls=None, representation_use_rl_state=False, goal_generator=None, intrinsic_reward_module=None, intrinsic_reward_coef=1.0, extrinsic_reward_coef=1.0, enforce_entropy_target=False, entropy_target_cls=None, optimizer=None, debug_summaries=False, name='AgentAlgorithm')[source]#
Bases:
alf.algorithms.rl_algorithm.RLAlgorithm
Agent is a master algorithm that integrates different algorithms together.
- Parameters
observation_spec (nested TensorSpec) – representing the observations.
action_spec (nested BoundedTensorSpec) – representing the actions.
reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).
env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. Running multiple environments in parallel is crucial to on-policy algorithms as it increases the diversity of data and decreases temporal correlation. env only needs to be provided to the root Algorithm.
config (TrainerConfig) – config for training. config only needs to be provided to the algorithm which performs train_iter() by itself.
rl_algorithm_cls (type) – The algorithm class for learning the policy. It will be called as rl_algorithm_cls(observation_spec=?, action_spec=?, reward_spec=?, config=?, debug_summaries=?).
reward_weight_algorithm_cls (type) – The algorithm class for adjusting reward weights when multi-dim rewards are used. If provided, the default reward_weights of rl_algorithm will be overwritten by this algorithm.
representation_learner_cls (type) – The algorithm class for learning the representation. If provided, the constructed learner will calculate the representation from the original observation as the observation for downstream algorithms such as rl_algorithm. Similar to rl_algorithm_cls, it will be called as rl_algorithm_cls(observation_spec=?, action_spec=?, reward_spec=?, config=?, debug_summaries=?).
representation_use_rl_state – When set to True, the representation learner will receive the (previous) state from the RL algorithm as input instead of its own state for rollout_step() and predict_step(). This is particularly useful for algorithms such as the MuZero representation learner, whose reanalyze component requires access to the RL algorithm’s state.
intrinsic_reward_module (Algorithm) – an algorithm whose output is a scalar intrinsic reward.
goal_generator (Algorithm) – an algorithm which outputs a tuple of goal vector and a reward. The reward can be () if no reward is given.
intrinsic_reward_coef (float) – Coefficient for intrinsic reward.
extrinsic_reward_coef (float) – Coefficient for extrinsic reward.
enforce_entropy_target (bool) – If True, use (Nested)EntropyTargetAlgorithm to dynamically adjust entropy regularization so that entropy is not smaller than the entropy_target supplied for constructing (Nested)EntropyTargetAlgorithm. If this is enabled, make sure you don’t use entropy_regularization for loss (see ActorCriticLoss or PPOLoss). In order to use this, the AlgStep.info from rl_algorithm_cls.train_step() and rl_algorithm_cls.rollout_step() needs to contain action_distribution.
entropy_target_cls (type) – If provided, will be used to dynamically adjust entropy regularization.
optimizer (optimizer) – The optimizer for training.
debug_summaries (bool) – True if debug summaries should be created.
name (str) – Name of this algorithm.
- after_train_iter(experience, info)[source]#
Call after_train_iter() of the RL algorithm and goal generator, respectively.
- after_update(experience, train_info)[source]#
Call after_update() of the RL algorithm and goal generator, respectively.
- preprocess_experience(root_inputs, rollout_info, batch_info)[source]#
Add intrinsic rewards to extrinsic rewards if there is an intrinsic reward module. Also call preprocess_experience() of the rl algorithm.
- set_path(path)[source]#
Set the path from the root algorithm to this algorithm.
See AlgorithmInterface.path for description about path. This function is called by the trainer before training starts. It needs to be implemented if the algorithm contains some other sub-algorithms.
If an algorithm does not have any sub-algorithm or its sub-algorithm does not need to access the root replay buffer directly, it does not need to implement this function.
- summarize_rollout(experience)[source]#
First call RLAlgorithm.summarize_rollout() to summarize basic rollout statistics. If the rl algorithm has overridden this function, then also call its customized version.
- train_step(time_step, state, rollout_info)[source]#
Perform one step of training computation.
It is called to calculate output for every time step for a batch of experience from replay buffer. It also needs to generate necessary information for calc_loss().
- Parameters
inputs (nested Tensor) – inputs for train.
state (nested Tensor) – consistent with train_state_spec.
rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match train_state_spec.
info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.
- Return type
- train_step_offline(time_step, state, rollout_info, pre_train)[source]#
Perform one step of offline training computation.
It is called to calculate output for every time step for a batch of experience from offline replay buffer. It also needs to generate necessary information for calc_loss_offline(). By default, this function calls train_step as its default implementation.
- Parameters
inputs (nested Tensor) – inputs for train.
state (nested Tensor) – consistent with train_state_spec.
rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.
pre_train (bool) – whether in pre_training phase. This flag can be used for algorithms that need to implement different training procedures at different phases.
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match train_state_spec.
info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.
- Return type
- training: bool#
- class AgentInfo(rl, irm, goal_generator, entropy_target, repr, rw, rewards)#
Bases:
tuple
Create new instance of AgentInfo(rl, irm, goal_generator, entropy_target, repr, rw, rewards)
- entropy_target#
Alias for field number 3
- goal_generator#
Alias for field number 2
- irm#
Alias for field number 1
- repr#
Alias for field number 4
- rewards#
Alias for field number 6
- rl#
Alias for field number 0
- rw#
Alias for field number 5
alf.algorithms.agent_helpers#
Some helper functions for constructing an Agent instance.
- class AgentHelper(state_ctor)[source]#
Bases:
object
Create three state specs given the state creator.
- static accumulate_algorithm_rewards(rewards, weights, names, summary_prefix, summarize_fn)[source]#
Sum a list of rewards by their weights. Also summarize the rewards statistics given their names.
- Parameters
rewards (list[Tensor]) – a list of rewards tensors
weights (list[float]) – a list of floating numbers
names (list[str]) – a list of reward names
summary_prefix (str) – a string prefix for summary
summarize_fn (Callable) – a summarize function that accepts a name and a reward.
- Returns
A single reward after accumulation.
- Return type
Tensor
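The weighted accumulation described for accumulate_algorithm_rewards can be sketched without tensors. This is a simplified illustration, not the ALF code: rewards are plain lists of floats instead of Tensors, the function name `accumulate_rewards` is hypothetical, and both the `prefix/name` summary tag and the choice to summarize the unweighted reward are assumptions of the example.

```python
def accumulate_rewards(rewards, weights, names, summary_prefix, summarize_fn):
    """Weighted sum of several reward vectors (sketch).

    rewards: list of equal-length lists of floats, one list per reward source.
    weights: one float per reward source.
    names: one name per reward source, used only for summaries.
    """
    total = None
    for reward, weight, name in zip(rewards, weights, names):
        # Assumed tag format; the real summary naming may differ.
        summarize_fn(f"{summary_prefix}/{name}", reward)
        weighted = [weight * r for r in reward]
        if total is None:
            total = weighted
        else:
            total = [a + b for a, b in zip(total, weighted)]
    return total
```

For example, an extrinsic reward with weight 1.0 and an intrinsic reward with weight 0.5 are merged into a single reward vector that downstream training consumes.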
- accumulate_loss_info(algorithms, train_info, offline=False, pre_train=False)[source]#
Given an overall Agent training info that contains various training infos for different algorithms, compute the accumulated loss info for updating parameters.
- Parameters
algorithms (list[Algorithm]) – the list of algorithms whose loss infos are to be accumulated.
experience (Experience) – experience used for gradient update.
train_info (nested Tensor) – information collected for training algorithms. It is batched from each AlgStep.info returned by train_step() or rollout_step().
offline (bool) – whether the accumulation is done for offline RL part or the online RL part.
pre_train (bool) – whether in pre_training phase. This flag can be used for algorithms that need to implement different training procedures at different phases.
- Returns
the accumulated loss info.
- Return type
- after_train_iter(algorithms, root_inputs, rollout_info=None)[source]#
For each provided algorithm, call its after_train_iter() to do things after the agent finishes one training iteration (i.e., train_iter()).
- Parameters
- after_update(algorithms, root_inputs, train_info)[source]#
For each provided algorithm, call its after_update() to do things after the agent completes one gradient update (i.e. update_with_gradient()).
- Parameters
algorithms (list[Algorithm]) – the list of algorithms whose after_update is to be called.
root_inputs (TimeStep) – experience used for the gradient update.
train_info (AgentInfo) – information collected for training algorithms. It is batched from each AlgStep.info returned by train_step() or rollout_step().
- register_algorithm(alg, alg_field)[source]#
Collect state specs from algorithms. For code conciseness, we collect all three state specs even though some of them will not be used during unroll or train.
This function also registers alg with alg_field.
- Parameters
alg (Algorithm) – a child algorithm in the agent.
alg_field (str) – the corresponding algorithm field in an AgentState or AgentInfo.
alf.algorithms.algorithm#
Algorithm base class.
- class Algorithm(train_state_spec=(), rollout_state_spec=None, predict_state_spec=None, is_on_policy=None, optimizer=None, checkpoint=None, config=None, debug_summaries=False, name='Algorithm')[source]#
Bases:
alf.algorithms.algorithm_interface.AlgorithmInterface
Base implementation for AlgorithmInterface.
Each algorithm can have a default optimizer. By default, the parameters and/or modules under an algorithm are optimized by the default optimizer. One can also specify an optimizer for a set of parameters and/or modules using add_optimizer. You can find out which parameter is handled by which optimizer using get_optimizer_info().
A requirement for this optimizer structure to work is that there is no algorithm which is a submodule of a non-algorithm module. Currently, this is not checked by the framework. It’s up to the user to make sure this is true.
- Parameters
train_state_spec (nested TensorSpec) – for the network state of train_step().
rollout_state_spec (nested TensorSpec) – for the network state of rollout_step(). If None, it’s assumed to be the same as train_state_spec.
predict_state_spec (nested TensorSpec) – for the network state of predict_step(). If None, it’s assumed to be the same as rollout_state_spec.
is_on_policy (None|bool) –
optimizer (None|Optimizer) – The default optimizer for training. See comments above for detail.
checkpoint (None|str) – a string in the format of “prefix@path”, where
- “prefix” is the prefix to the contents in the checkpoint to be loaded. It can be a multi-step path denoted by “A.B.C”. If the checkpoint comes from a previous ALF training session, the standard prefix starts with “alg” (e.g. “alg._sub_alg1”). If prefix is omitted, the effect is the same as providing “alg”, which will load the full ‘alg’ part of the checkpoint.
- “path” is the full path to the checkpoint file saved by ALF, e.g. “/path_to_experiment/train/algorithm/ckpt-100”.
Therefore, an example value for checkpoint is “alg._sub_alg1@/path_to_experiment/train/algorithm/ckpt-100”.
config (TrainerConfig) – config for training. config only needs to be provided to the algorithm which performs a training iteration by itself.
debug_summaries (bool) – True if debug summaries should be created.
name (str) – name of this algorithm.
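The “prefix@path” format of the checkpoint argument can be illustrated with a small parser. The helper `split_checkpoint_spec` is hypothetical and is not part of ALF; it only mirrors the rules stated in the parameter description, including the default prefix “alg” when the prefix is omitted.

```python
def split_checkpoint_spec(checkpoint):
    """Split a "prefix@path" string into (prefix, path) (hypothetical helper).

    If the prefix (or the whole "prefix@" part) is missing, the prefix
    defaults to "alg", as described for the checkpoint argument.
    """
    prefix, separator, path = checkpoint.partition("@")
    if not separator:
        # No "@" present: the whole string is the checkpoint path.
        return "alg", checkpoint
    return (prefix or "alg"), path


# The multi-step prefix "A.B.C" can then be split into its steps.
prefix, path = split_checkpoint_spec(
    "alg._sub_alg1@/path_to_experiment/train/algorithm/ckpt-100")
steps = prefix.split(".")
```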
- activate_ddp(rank)[source]#
Prepare the Algorithm with DistributedDataParallel wrapper
Note that Algorithm does not need to remember the rank of the device.
- Parameters
rank (int) – DDP wrapper needs to know on which GPU device this module’s parameters and buffers are supposed to be.
- add_optimizer(optimizer, modules_and_params)[source]#
Add an optimizer.
Note that the modules and params contained in modules_and_params should still be the attributes of the algorithm (i.e., they can be retrieved in self.children() or self.parameters()).
- Parameters
optimizer (Optimizer) – optimizer
modules_and_params (list of Module or Parameter) – The modules and parameters to be optimized by optimizer.
- calc_loss(info)[source]#
Calculate the loss at each step for each sample.
- Parameters
info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).
- Returns
loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).
- Return type
- calc_loss_offline(info_offline, pre_train=False)[source]#
Calculate the hybrid loss at each step for each sample. By default, this function calls calc_loss as its default implementation.
- Parameters
info_offline (nest) – information collected for training from the offline training branch. It is returned by train_step_offline() (hybrid off-policy training).
pre_train (bool) – whether in pre_training phase. This flag can be used for algorithms that need to implement different training procedures at different phases.
- Returns
loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).
- Return type
- compute_paras_statistics()[source]#
Compute some simple statistics of the algorithm’s parameters.
This function uses L1, L2, mean, std as the statistics.
- Returns
a dict of 1D numpy arrays, each containing simple parameter statistics, which can be used as a proxy for checking the consistency between two parameter sets. The keys are parameter names of the module.
- Return type
Dict[np.ndarray]
- convert_train_state_to_predict_state(state)[source]#
Convert RNN state for train_step() to RNN state for predict_step().
- property default_optimizer#
Get the default optimizer for this algorithm.
- property experience_spec#
Spec for experience.
- property force_params_visible_to_parent: bool#
Whether the already optimizer-handled parameters are seen by the parent algorithm.
Normally, when the parameters of this algorithm are handled by its optimizer, _setup_optimizers_ will prevent the parent algorithm’s optimizer from seeing and, more importantly, handling them. Setting this value to true will force the parameters to be seen and handled by the parent algorithm, even if they are already handled by this algorithm.
Note that parameters ignored by _trainable_attributes_to_ignore() will stay invisible to the parent algorithm.
It is by default False, and can be changed with the following setter.
- Return type
bool
- forward(*input)[source]#
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- get_optimizer_info()[source]#
Return the optimizer info for all the modules in a string.
TODO: for a subalgorithm that’s an ignored attribute, its optimizer info won’t be obtained.
- Returns
the json string of the information about all the optimizers.
- Return type
str
- get_param_name(param)[source]#
Get the name of the parameter.
- Returns
the name if the parameter can be found; otherwise None.
- Return type
string
- get_unoptimized_parameter_info()[source]#
Return the information about the parameters not being optimized.
Note: the difference between this and the parameters contained in the optimizer ‘None’ from get_optimizer_info() is that get_optimizer_info() does not traverse all the parameters (e.g., parameters in list, tuple, dict, or set).
- Returns
path of all parameters not being optimized
- Return type
str
- property has_offline#
Whether offline data is available for RL algorithms. Always returns False for non-RL algorithms.
- load_state_dict(state_dict, strict=True, skip_preloded=True)[source]#
Load state dictionary for the algorithm.
- Parameters
state_dict (dict) – a dict containing parameters and persistent buffers.
strict (bool, optional) – whether to strictly enforce that the keys in state_dict match the keys returned by this module’s torch.nn.Module.state_dict function. If strict=True, will keep lists of missing and unexpected keys; if strict=False, missing/unexpected keys will be omitted. (Default: True)
skip_preloded (bool) – whether to skip the modules that support pre-loading and have been pre-loaded. Currently only Algorithm and its derivatives support pre-loading. (Default: True)
- Returns
missing_keys: a list of str containing the missing keys.
unexpected_keys: a list of str containing the unexpected keys.
- Return type
namedtuple
- property name#
The name of this algorithm.
- need_full_rollout_state()[source]#
Whether AlgStep.state from rollout_step should be full.
If True, it means that rollout_step() should return the complete state for train_step().
- observe_for_metrics(time_step)[source]#
Observe a time step for recording environment metrics.
- Parameters
time_step (TimeStep) – the current time step during unroll().
- observe_for_replay(exp)[source]#
Record an experience in a replay buffer.
- Parameters
exp (nested Tensor) – The shape is \([B, \ldots]\), where \(B\) is the batch size of the batched environment.
- property on_policy#
Whether training is on-policy.
For on-policy training, train_step() will not be called, and the info passed to calc_loss() is the info collected from rollout_step().
For off-policy training, train_step() will be called with the experience from the replay buffer, and the info passed to calc_loss() is the info collected from train_step().
An algorithm can override this to indicate whether it is an on-policy or off-policy algorithm. If an algorithm does not override this, it needs to support both on-policy and off-policy training, which means that rollout_step() and train_step() need to have the correct behavior for on-policy and off-policy training. It can check whether it is on-policy training by calling this function.
- Returns
True if on-policy training, False if off-policy training, None if not set.
- Return type
bool | None
- optimizers(recurse=True, include_ignored_attributes=False)[source]#
Get all the optimizers used by this algorithm.
- Parameters
recurse (bool) – If True, including all the sub-algorithms
include_ignored_attributes (bool) – If True, still include all child attributes without ignoring any.
- Returns
list of Optimizer objects.
- Return type
list
- property path#
Path from the root algorithm to this algorithm.
Currently, path is useful when an algorithm needs to directly access the data about itself in the replay buffer. Two types of data about an algorithm are stored in the replay buffer: one is rollout_info, which is the AlgStep.info returned by rollout_step(); the other is state, which is the state argument used to call rollout_step(). The data in the replay buffer is organized as Experience, which includes rollout_info and state.
Given an experience structure, the input state to rollout_step() can be retrieved by:
nest.get_field(experience.state, self.path)
The info from rollout_step() can be retrieved by:
nest.get_field(experience.rollout_info, self.path)
- Returns
path from the root algorithm to this algorithm
- Return type
str
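To show how a path selects one algorithm's data out of a nested experience, the sketch below uses a small stand-in for the field lookup. It does not call ALF: `get_field` here is a hypothetical local helper, the dot-separated path format is an assumption, and the nested dictionaries and the names `rl` and `actor` are made up for the example.

```python
from collections import namedtuple


def get_field(nested, path):
    """Follow a dot-separated path through nested dicts or namedtuples.

    Hypothetical stand-in for a nest field lookup; an empty path returns
    the nested structure itself (the root algorithm's data).
    """
    node = nested
    for key in (path.split(".") if path else []):
        node = node[key] if isinstance(node, dict) else getattr(node, key)
    return node


# Stand-in experience holding per-algorithm state and rollout info.
Experience = namedtuple("Experience", ["state", "rollout_info"])
experience = Experience(
    state={"rl": {"actor": "actor_rnn_state"}},
    rollout_info={"rl": {"actor": "actor_rollout_info"}})

# A sub-algorithm whose path is "rl.actor" retrieves only its own entries.
own_state = get_field(experience.state, "rl.actor")
own_info = get_field(experience.rollout_info, "rl.actor")
```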
- property pre_loaded#
A property indicating whether a checkpoint for the current instance has been pre-loaded, by specifying checkpoint_prefix@checkpoint_path where checkpoint_prefix@ is optional.
- property predict_state_spec#
Returns the RNN state spec for predict_step().
- property processed_experience_spec#
Spec for processed experience.
- Returns
Spec for the experience returned by preprocess_experience().
- Return type
- property rollout_state_spec#
Returns the RNN state spec for rollout_step().
- set_on_policy(is_on_policy)[source]#
Set whether this algorithm is on-policy or not.
- Parameters
is_on_policy (bool) –
- set_path(path)[source]#
Set the path from the root algorithm to this algorithm.
See AlgorithmInterface.path for description about path. This function is called by the trainer before training starts. It needs to be implemented if the algorithm contains some other sub-algorithms.
If an algorithm does not have any sub-algorithm or its sub-algorithm does not need to access the root replay buffer directly, it does not need to implement this function.
- set_replay_buffer(num_envs, max_length, prioritized_sampling=False)[source]#
Set the parameters for the replay buffer.
- Parameters
num_envs (int) – the total number of environments from all batched environments.
max_length (int) – the maximum number of steps the replay buffer stores for each environment.
prioritized_sampling (bool) – Use prioritized sampling if this is True.
- state_dict(destination=None, prefix='', visited=None)[source]#
Get state dictionary recursively, including both model state and optimizers’ state (if any). It can handle a number of special cases:
graph with cycle: save all the states and avoid infinite loop
parameter sharing: save only one copy of the shared module/param
optimizers: save the optimizers for all the (sub-)algorithms
- Parameters
destination (OrderedDict) – the destination for storing the state.
prefix (str) – a string to be added before the name of the items (modules, params, algorithms etc) as the key used in the state dictionary.
visited (set) – a set keeping track of the visited objects.
- Returns
the dictionary including both model state and optimizers’ state (if any).
- Return type
OrderedDict
- summarize_train(experience, train_info, loss_info, params)[source]#
Generate summaries for training & loss info after each gradient update. The default implementation of this function only summarizes params (with grads) and the loss. An algorithm can override this for additional summaries. See RLAlgorithm.summarize_train() for an example.
- Parameters
experience (nested Tensor) – samples used for the most recent update_with_gradient(). By default it’s not summarized.
train_info (nested Tensor) – AlgStep.info returned by either rollout_step() (on-policy training) or train_step() (off-policy training). By default it’s not summarized.
loss_info (LossInfo) – loss
params (list[Parameter]|None) – list of parameters with gradients
- train_from_replay_buffer(**kwargs)#
This function can be called by any algorithm that has its own replay buffer configured.
- Parameters
update_global_counter (bool) – controls whether this function changes the global counter for summary. If there are multiple algorithms, then only the parent algorithm should change this quantity and child algorithms should disable the flag. When it’s True, it will affect the counter only if config.update_counter_every_mini_batch=True.
- train_from_unroll(experience, train_info)[source]#
Train given the info collected from unroll(). This function can be called by any child algorithm that doesn’t have the unroll logic but has a different training logic from its parent (e.g., off-policy).
- Parameters
experience (Experience) – collected during unroll().
train_info (nest) – AlgStep.info returned by rollout_step().
- Returns
number of steps that have been trained
- Return type
int
- property train_info_spec#
The spec for the AlgStep.info returned from train_step().
- property train_state_spec#
Returns the RNN state spec for train_step().
- train_step_offline(inputs, state, rollout_info, pre_train=False)[source]#
Perform one step of offline training computation.
It is called to calculate output for every time step for a batch of experience from offline replay buffer. It also needs to generate necessary information for calc_loss_offline(). By default, this function calls train_step as its default implementation.
- Parameters
inputs (nested Tensor) – inputs for train.
state (nested Tensor) – consistent with train_state_spec.
rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.
pre_train (bool) – whether in pre_training phase. This flag can be used for algorithms that need to implement different training procedures at different phases.
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match train_state_spec.
info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.
- Return type
- training: bool#
- transform_experience(experience)[source]#
Transform an Experience structure.
This is used on the experience data retrieved from replay buffer.
- Parameters
experience (Experience) – the experience retrieved from replay buffer. Note that experience.batch_info, experience.replay_buffer need to be set.
- Returns
transformed experience
- Return type
- transform_timestep(time_step, state)[source]#
Transform time_step.
transform_timestep is called for every raw time_step obtained from the environment before it is passed to predict_step and rollout_step. For off-policy algorithms, the replay buffer stores the raw time_step, so experiences retrieved from the replay buffer are transformed by transform_timestep in OffPolicyAlgorithm before being passed to _update().
The transformation should be stateless. By default, only the observation is transformed.
- Parameters
time_step (TimeStep or Experience) – time step
state (nested Tensor) – state of the transformer(s)
- Returns
transformed time step
- Return type
- update_with_gradient(loss_info, valid_masks=None, weight=1.0, batch_info=None)[source]#
Complete one iteration of training.
Update parameters using the gradient with respect to loss_info.
- Parameters
loss_info (LossInfo) – loss with shape \((T, B)\) (except for loss_info.scalar_loss)
valid_masks (Tensor) – masks indicating which samples are valid. (shape=(T, B), dtype=torch.float32)
weight (float) – weight for this batch. The loss will be multiplied by this weight before calculating the gradient.
batch_info (BatchInfo) – information about this batch returned by ReplayBuffer.get_batch()
- Returns
loss_info (LossInfo): loss information.
params (list[(name, Parameter)]): list of parameters being updated.
- Return type
tuple
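To make the roles of valid_masks and weight concrete, here is a minimal sketch that reduces a \((T, B)\) loss to a scalar using plain Python lists. The helper aggregate_loss is hypothetical, and the reduction rule (mean over valid entries) is an assumption made for illustration; the reduction actually used by update_with_gradient() may differ.

```python
def aggregate_loss(loss, valid_masks=None, weight=1.0):
    """Reduce a (T, B) loss to a scalar.

    Assumption: invalid entries are zeroed out and the sum is divided by
    the number of valid entries. This is an illustrative rule only.
    """
    total, count = 0.0, 0.0
    for t, row in enumerate(loss):
        for b, value in enumerate(row):
            mask = 1.0 if valid_masks is None else valid_masks[t][b]
            total += value * mask
            count += mask
    mean = total / count if count > 0 else 0.0
    # The batch weight scales the loss before any gradient would be taken.
    return weight * mean
```

With a mask that drops the last sample of the last step, only the three remaining entries contribute to the mean.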
- property use_rollout_state#
If True, during off-policy training the RNN states will be taken from the replay buffer; otherwise they will be set to 0.
When this is True, the train_state_spec of an algorithm should always be a subset of the rollout_state_spec.
- class Loss(loss_weight=1.0, name='LossAlg')[source]#
Bases:
alf.algorithms.algorithm.Algorithm
Algorithm that uses its input as loss.
It can be subclassed to customize calc_loss().
Each algorithm can have a default optimizer. By default, the parameters and/or modules under an algorithm are optimized by the default optimizer. One can also specify an optimizer for a set of parameters and/or modules using add_optimizer. You can find out which parameter is handled by which optimizer using get_optimizer_info().
A requirement for this optimizer structure to work is that no algorithm is a submodule of a non-algorithm module. Currently, this is not checked by the framework; it is up to the user to make sure this holds.
- Parameters
train_state_spec (nested TensorSpec) – for the network state of train_step().
rollout_state_spec (nested TensorSpec) – for the network state of rollout_step(). If None, it’s assumed to be the same as train_state_spec.
predict_state_spec (nested TensorSpec) – for the network state of predict_step(). If None, it’s assumed to be the same as rollout_state_spec.
is_on_policy (None|bool) –
optimizer (None|Optimizer) – The default optimizer for training. See comments above for detail.
checkpoint (None|str) – a string in the format of “prefix@path”, where
- “prefix” is the prefix to the contents in the checkpoint to be loaded. It can be a multi-step path denoted by “A.B.C”. If the checkpoint comes from a previous ALF training session, the standard prefix starts with “alg” (e.g. “alg._sub_alg1”). If the prefix is omitted, the effect is the same as providing “alg”, which loads the full ‘alg’ part of the checkpoint.
- “path” is the full path to the checkpoint file saved by ALF, e.g. “/path_to_experiment/train/algorithm/ckpt-100”.
Therefore, an example value for checkpoint is “alg._sub_alg1@/path_to_experiment/train/algorithm/ckpt-100”.
config (TrainerConfig) – config for training. config only needs to be provided to the algorithm which performs a training iteration by itself.
debug_summaries (bool) – True if debug summaries should be created.
name (str) – name of this algorithm.
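The “prefix@path” convention for checkpoint can be illustrated with a small stand-alone parser. The function split_checkpoint_spec below is a hypothetical helper, not part of ALF, and treating both "path" and "@path" as an omitted prefix is an assumption; only the default prefix “alg” comes from the description above.

```python
def split_checkpoint_spec(spec):
    """Split a "prefix@path" string into (prefix, path).

    Hypothetical helper. When no prefix is given, "alg" is used, mirroring
    the documented default.
    """
    if "@" in spec:
        # Split on the first "@" only, so the path itself may contain "@".
        prefix, path = spec.split("@", 1)
    else:
        prefix, path = "", spec
    return (prefix or "alg"), path
```

For the example value in the parameter description, this yields the prefix alg._sub_alg1 and the path /path_to_experiment/train/algorithm/ckpt-100.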
- calc_loss(info)[source]#
Calculate the loss at each step for each sample.
- Parameters
info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).
- Returns
loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).
- Return type
- predict_step(inputs, state=None)[source]#
Predict for one step of inputs.
- Parameters
inputs (nested Tensor) – inputs for prediction.
state (nested Tensor) – network state (for RNN).
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match predict_state_spec.
info (nest): information for analyzing the agent. In particular, if an element of the info is alf.summary.render.Image, it will be rendered during play. See alf/summary/render.py for detail.
- Return type
- rollout_step(inputs, state=None)[source]#
Rollout for one step of inputs.
It is called to calculate output for every environment step. For on-policy training, it also needs to generate the information required by calc_loss(). For off-policy training, it needs to generate the information required by train_step().
- Parameters
inputs (nested Tensor) – inputs for prediction.
state (nested Tensor) – network state (for RNN).
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match rollout_state_spec.
info (nested Tensor): For on-policy training, it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored into the replay buffer and later retrieved for train_step() as rollout_info.
- Return type
- train_step(inputs, state=None, rollout_info=None)[source]#
Perform one step of training computation.
It is called to calculate output for every time step for a batch of experience from the replay buffer. It also needs to generate the information required by calc_loss().
- Parameters
inputs (nested Tensor) – inputs for train.
state (nested Tensor) – consistent with train_state_spec.
rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from the replay buffer.
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match train_state_spec.
info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate the loss or override update_with_gradient() to do customized training.
- Return type
- training: bool#
alf.algorithms.algorithm_interface#
- class AlgorithmInterface[source]#
Bases:
torch.nn.modules.module.Module
The interface for algorithm.
It is a generic interface for reinforcement learning (RL) and non-RL algorithms. The key interface functions are:
1. predict_step(): one step of computation of action for evaluation.
2. rollout_step(): one step of computation for rollout. It is used for collecting experiences during training. Different from predict_step, rollout_step may include additional computations for training. An algorithm could immediately use the collected experiences to update parameters after one rollout (multiple rollout steps) is performed; or it can put these collected experiences into a replay buffer.
3. train_step(): only used by algorithms that put experiences into replay buffers. The training data are sampled from the replay buffer filled by rollout_step().
4. train_from_unroll(): perform a training iteration from the unrolled result.
5. train_from_replay_buffer(): perform a training iteration from a replay buffer.
6. update_with_gradient(): do one gradient update based on the loss. It is used by the default train_from_unroll() and train_from_replay_buffer() implementations. You can override to implement your own update_with_gradient().
7. calc_loss(): calculate loss based on the info collected from rollout_step() or train_step(). It is used by the default implementations of train_from_unroll() and train_from_replay_buffer(). If you want to use these two functions, you need to implement calc_loss().
8. after_update(): called by train_iter() after every call to update_with_gradient(), mainly for some postprocessing steps such as copying a training model to a target model in SAC or DQN.
9. after_train_iter(): called by train_iter() after every call to train_from_unroll() (on-policy training iter) or train_from_replay_buffer() (off-policy training iter). It’s mainly for training additional modules that have their own training logic (e.g., on/off-policy, replay buffers, etc). Other things might also be possible as long as they should be done once every training iteration.
For algorithms that have additional offline training flows, those flows can be implemented using the following additional interface functions:
10. train_step_offline(): only used by algorithms that have offline training flows. The training data are sampled from a replay buffer that is loaded from an offline replay buffer checkpoint.
11. calc_loss_offline(): calculates the loss based on the info collected from train_step_offline().
The offline training flows can be invoked by specifying a valid path to a replay buffer for TrainerConfig.offline_buffer_dir.
Note
A non-RL algorithm will not directly interact with an environment. The interaction loop will always be driven by an RLAlgorithm that outputs actions and gets rewards. So a non-RL algorithm is always attached to an RLAlgorithm and cannot change the timing of (when to launch) a training iteration. However, it can have its own logic of a training iteration (e.g., train_from_unroll() and train_from_replay_buffer()) which can be triggered by a parent RLAlgorithm inside its after_train_iter().
Initializes internal Module state, shared by both nn.Module and ScriptModule.
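The division of labor between rollout_step(), train_step() and calc_loss() described above can be sketched with a toy class. Everything here is hypothetical: ToyAlgorithm and this train_iter are stand-ins written for illustration, AlgStep is modeled as a plain namedtuple with the three documented fields, and a Python list plays the role of the replay buffer.

```python
from collections import namedtuple

# Stand-in for ALF's AlgStep; only the three documented fields are modeled.
AlgStep = namedtuple("AlgStep", ["output", "state", "info"])


class ToyAlgorithm:
    """Hypothetical algorithm showing which step feeds calc_loss()."""

    def __init__(self, on_policy):
        self.on_policy = on_policy

    def rollout_step(self, inputs, state):
        return AlgStep(output=inputs, state=state, info=("rollout", inputs))

    def train_step(self, inputs, state, rollout_info):
        return AlgStep(output=inputs, state=state, info=("train", inputs))

    def calc_loss(self, info):
        # Report where each info came from instead of computing a real loss.
        return [source for source, _ in info]


def train_iter(alg, observations):
    """One simplified iteration: unroll, then pick the training path."""
    rollout = [alg.rollout_step(obs, state=None) for obs in observations]
    if alg.on_policy:
        # On-policy: info from rollout_step() goes straight to calc_loss().
        info = [step.info for step in rollout]
    else:
        # Off-policy: a list stands in for the replay buffer; train_step()
        # re-processes each stored input together with its rollout_info.
        replay_buffer = [(obs, step.info)
                         for obs, step in zip(observations, rollout)]
        info = [alg.train_step(obs, None, r_info).info
                for obs, r_info in replay_buffer]
    return alg.calc_loss(info)
```

Calling train_iter(ToyAlgorithm(on_policy=True), [1, 2]) reports that calc_loss() saw info from rollout_step(), while the off-policy variant reports info from train_step().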
- after_train_iter(root_inputs, rollout_info)[source]#
Do things after completing one training iteration (i.e. train_iter(), which consists of one or multiple gradient updates). This function can be used for training additional modules that have their own training logic (e.g., on/off-policy, replay buffers, etc). These modules should be added to _trainable_attributes_to_ignore in the parent algorithm.
Other things might also be possible as long as they should be done once every training iteration.
This function serves the same purpose as after_update if there is always only one gradient update in each training iteration. Otherwise it is called less frequently than after_update.
- Parameters
root_inputs (nest|None) – temporally batched inputs for the rollout_step() of the root algorithm collected during unroll(). When no data is available from rollout_step() (e.g. in an offline pre-training phase where the online interaction has not started yet), root_inputs will be None.
rollout_info (nest|None) – information collected from rollout_step() for this algorithm during unroll(). When no data is available from rollout_step() (e.g. in an offline pre-training phase where the online interaction has not started yet), rollout_info will be None.
- after_update(root_inputs, info)[source]#
Do things after completing one gradient update (i.e. update_with_gradient()). This function can be used for post-processing following one minibatch update, such as copying a training model to a target model in SAC, DQN, etc.
- Parameters
root_inputs (nest) – temporally batched inputs for the rollout_step() of the root algorithm collected during unroll().
info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() for on-policy training or train_step() for off-policy training.
- calc_loss(info)[source]#
Calculate the loss for one mini-batch.
- Parameters
info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training). The shape of the tensors in info is (T, B, ...), where T is the mini-batch length and B is the mini-batch size.
- Returns
loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).
- Return type
- calc_loss_offline(info, pre_train=False)[source]#
Calculate the loss for one mini-batch.
- Parameters
info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training). The shape of the tensors in info is (T, B, ...), where T is the mini-batch length and B is the mini-batch size.
pre_train (bool) – whether in the pre-training phase. This flag can be used by algorithms that need different training procedures in different phases.
- Returns
loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).
- Return type
- property on_policy#
Whether this is on-policy training.
For on-policy training, train_step() will not be called, and the info passed to calc_loss() is the info collected from rollout_step().
For off-policy training, train_step() will be called with the experience from the replay buffer, and the info passed to calc_loss() is the info collected from train_step.
An algorithm can override this to indicate whether it is an on-policy or off-policy algorithm. If an algorithm does not override this, it needs to support both on-policy and off-policy training, which means that rollout_step() and train_step() need to have the correct behavior for on-policy and off-policy training. It can check whether it is on-policy training by calling this function.
- Returns
True if on-policy training, False if off-policy training, None if not set.
- Return type
bool | None
- property path#
Path from the root algorithm to this algorithm.
Currently, path is useful when an algorithm needs to directly access the data about itself in the replay buffer. Two types of data about an algorithm are stored in the replay buffer: one is rollout_info, which is the AlgStep.info returned by rollout_step(); the other is state, which is the state argument used to call rollout_step(). The data in the replay buffer is organized as Experience, which includes rollout_info and state.
Given an experience structure, the input state to rollout_step() can be retrieved by:
nest.get_field(experience.state, self.path)
The info from rollout_step() can be retrieved by:
nest.get_field(experience.rollout_info, self.path)
- Returns
path from the root algorithm to this algorithm
- Return type
str
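The path-based lookup above can be mimicked without ALF. In the sketch below, get_field is a simplified stand-in written for illustration (it is not the library's nest.get_field), Experience is a two-field namedtuple, and the sub-algorithm name _sub_alg1 and the stored values are made up.

```python
from collections import namedtuple


def get_field(nested, path):
    """Follow a dotted path through nested namedtuples or dicts.

    Simplified stand-in for the lookup described above; an empty path
    returns the structure itself.
    """
    node = nested
    for name in (path.split(".") if path else []):
        node = node[name] if isinstance(node, dict) else getattr(node, name)
    return node


Experience = namedtuple("Experience", ["state", "rollout_info"])
experience = Experience(
    state={"_sub_alg1": {"rnn": 0.5}},
    rollout_info={"_sub_alg1": {"value": 1.25}},
)

# A sub-algorithm whose path is "_sub_alg1" picks out only its own data.
sub_state = get_field(experience.state, "_sub_alg1")
sub_info = get_field(experience.rollout_info, "_sub_alg1.value")
```

The root algorithm would use an empty path and therefore see the whole state and rollout_info structures.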
- predict_step(inputs, state)[source]#
Predict for one step of inputs.
- Parameters
inputs (nested Tensor) – inputs for prediction.
state (nested Tensor) – network state (for RNN).
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match predict_state_spec.
info (nest): information for analyzing the agent. In particular, if an element of the info is alf.summary.render.Image, it will be rendered during play. See alf/summary/render.py for detail.
- Return type
- preprocess_experience(root_inputs, rollout_info, batch_info)[source]#
This function is called on the experiences obtained from a replay buffer. An example usage of this function is to calculate advantages and returns in PPOAlgorithm.
The shapes of tensors in experience are assumed to be \((B, T, ...)\).
- Parameters
root_inputs (nest) – input for rollout_step() of the root algorithm. This is from the replay buffer. Note that this is not the same as the input of rollout_step() of self unless self is the root algorithm.
rollout_info (nested Tensor) – AlgStep.info from rollout_step() for this algorithm.
batch_info (BatchInfo) – information about this batch of data
- Returns
processed root_inputs
processed rollout_info
- Return type
tuple
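As a rough illustration of the kind of preprocessing mentioned above, the following sketch computes discounted returns for rewards laid out as \((B, T)\). It is not PPOAlgorithm's implementation: discounted_returns is a hypothetical helper, plain lists replace tensors, and episode boundaries and value bootstrapping are ignored by assumption.

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute returns for a batch of reward sequences shaped (B, T).

    Each return is the reward at that step plus gamma times the return
    of the following step, accumulated backward in time.
    """
    returns = []
    for sequence in rewards:
        acc, out = 0.0, []
        for r in reversed(sequence):
            acc = r + gamma * acc
            out.append(acc)
        returns.append(out[::-1])  # restore chronological order
    return returns
```

For a single sequence of three unit rewards and gamma=0.5, the returns are 1.75, 1.5 and 1.0.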
- rollout_step(inputs, state)[source]#
Rollout for one step of inputs.
It is called to calculate output for every environment step. For on-policy training, it also needs to generate the information required by calc_loss(). For off-policy training, it needs to generate the information required by train_step().
- Parameters
inputs (nested Tensor) – inputs for prediction.
state (nested Tensor) – network state (for RNN).
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match rollout_state_spec.
info (nested Tensor): For on-policy training, it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored into the replay buffer and later retrieved for train_step() as rollout_info.
- Return type
- set_on_policy(is_on_policy)[source]#
Set whether this algorithm is on-policy or not.
- Parameters
is_on_policy (bool) –
- set_path(path)[source]#
Set the path from the root algorithm to this algorithm.
See AlgorithmInterface.path for a description of path. This function is called by the trainer before training starts. It needs to be implemented if the algorithm contains sub-algorithms.
If an algorithm does not have any sub-algorithm, or its sub-algorithms do not need to access the root replay buffer directly, it does not need to implement this function.
- train_from_replay_buffer(update_global_counter=False)[source]#
This function can be called by any algorithm that has its own replay buffer configured.
- Parameters
update_global_counter (bool) – controls whether this function changes the global counter for summary. If there are multiple algorithms, then only the parent algorithm should change this quantity and child algorithms should disable the flag. When it’s True, it will affect the counter only if config.update_counter_every_mini_batch=True.
- train_from_unroll(experience, train_info)[source]#
Train given the info collected from unroll(). This function can be called by any child algorithm that doesn’t have the unroll logic but whose training logic differs from its parent's.
- Parameters
experience (Experience) – collected during unroll().
train_info (nest) – AlgStep.info returned by rollout_step().
- Returns
number of steps that have been trained
- Return type
int
- train_iter()[source]#
Perform one iteration of training.
Users may choose to implement their own train_iter().
- Returns
number of samples being trained on (including duplicates).
- Return type
int
- train_step(inputs, state, rollout_info)[source]#
Perform one step of training computation.
It is called to calculate output for every time step for a batch of experience from the replay buffer. It also needs to generate the information required by calc_loss().
- Parameters
inputs (nested Tensor) – inputs for train.
state (nested Tensor) – consistent with train_state_spec.
rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from the replay buffer.
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match train_state_spec.
info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate the loss or override update_with_gradient() to do customized training.
- Return type
- train_step_offline(inputs, state, rollout_info, pre_train=False)[source]#
Perform one step of offline training computation.
It is called to calculate output for every time step for a batch of experience from the offline replay buffer. It also needs to generate the information required by calc_loss_offline().
- Parameters
inputs (nested Tensor) – inputs for train.
state (nested Tensor) – consistent with train_state_spec.
rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from the replay buffer.
pre_train (bool) – whether in the pre-training phase. This flag can be used by algorithms that need different training procedures in different phases.
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match train_state_spec.
info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate the loss or override update_with_gradient() to do customized training.
- Return type
- training: bool#
alf.algorithms.async_unroller#
- class AsyncUnroller(algorithm, config)[source]#
Bases:
object
A helper class for unrolling asynchronously.
The asynchronous unroll is performed in a different process. The unroll results are transmitted to the main process through a Queue. The main process should call gather_unroll_results() to retrieve the unroll results. Since the unroll process has its own copy of the algorithm parameters, the main process needs to call update_parameter() periodically to update the parameters for the unroll process. Once the main process finishes, it should call close() to release the resources.
The following settings in TrainerConfig are related to the functionality of AsyncUnroller: unroll_length, async_unroll, max_unroll_length, unroll_queue_size, unroll_step_interval. See algorithms.config.py for their documentation.
TODO: redirect the log and summary to the training process. Currently, all the logs are written to a different log file and summary during rollout_step() is not enabled.
- Parameters
algorithm – the root RL algorithm
unroll_queue_size – the size of the queue for transmitting the unroll results to the main process
root_dir – directory for saving summary and checkpoints
conf_file – config file name
- gather_unroll_results(unroll_length, max_unroll_length)[source]#
Gather the unroll results:
- Parameters
unroll_length (int) – the desired unroll length. If it is 0, any length up to max_unroll_length is possible (including zero length), depending on how much data is in the queue.
max_unroll_length (int) – maximal length of unroll results. This is only used if unroll_length is 0.
- Return type
List[UnrollResult]
- Returns
A list of UnrollResult
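The interplay of unroll_length and max_unroll_length can be sketched with the standard library queue. The function gather below is hypothetical and is not AsyncUnroller's code; in particular, blocking until exactly unroll_length items arrive is an assumption, since the description only calls it the desired length.

```python
import queue


def gather(results_queue, unroll_length, max_unroll_length):
    """Collect items following the documented length semantics.

    Hypothetical helper: with unroll_length > 0 it blocks until exactly
    that many items arrive; with unroll_length == 0 it drains whatever is
    already queued, up to max_unroll_length (possibly nothing).
    """
    gathered = []
    if unroll_length > 0:
        while len(gathered) < unroll_length:
            gathered.append(results_queue.get())  # blocks while empty
        return gathered
    while len(gathered) < max_unroll_length:
        try:
            gathered.append(results_queue.get_nowait())
        except queue.Empty:
            break
    return gathered
```

With unroll_length set to 0 the training process never waits on the unroll process; it simply trains on however many steps are available, which may be none.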
- update_parameter(algorithm)[source]#
Update the model parameters for unroll.
- Parameters
algorithm (RLAlgorithm) – the root RL algorithm
- class UnrollJob(type, step_metrics, global_counter, state_dict)#
Bases:
tuple
Create new instance of UnrollJob(type, step_metrics, global_counter, state_dict)
- global_counter#
Alias for field number 2
- state_dict#
Alias for field number 3
- step_metrics#
Alias for field number 1
- type#
Alias for field number 0
- class UnrollResult(time_step, policy_step, policy_state, env_step_time, step_time)#
Bases:
tuple
Create new instance of UnrollResult(time_step, policy_step, policy_state, env_step_time, step_time)
- env_step_time#
Alias for field number 3
- policy_state#
Alias for field number 2
- policy_step#
Alias for field number 1
- step_time#
Alias for field number 4
- time_step#
Alias for field number 0
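The “Alias for field number N” entries mean that these classes are named tuples, so each field can be read by name or by position. A minimal sketch, assuming only the field order shown in the UnrollResult signature above (the values are placeholders):

```python
from collections import namedtuple

# Field order copied from the documented signature; values are placeholders.
UnrollResult = namedtuple(
    "UnrollResult",
    ["time_step", "policy_step", "policy_state", "env_step_time", "step_time"])

result = UnrollResult("ts", "ps", None, 0.01, 0.03)

# Access by name and by index refers to the same element.
same_step_time = result.step_time == result[4]
```

Name-based access is usually preferable, because it keeps working if fields are later reordered.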
alf.algorithms.bc_algorithm#
Behavior Cloning (BC) Algorithm.
- class BcAlgorithm(observation_spec, action_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), actor_network_cls=<class 'alf.networks.actor_networks.ActorNetwork'>, actor_optimizer=None, env=None, config=None, checkpoint=None, debug_summaries=False, epsilon_greedy=None, name='BcAlgorithm')[source]#
Bases:
alf.algorithms.off_policy_algorithm.OffPolicyAlgorithm
Behavior cloning algorithm. Behavior cloning is an offline approach to learning a policy \(\pi_{\theta}(a|s)\), which is a function that maps an input observation \(s\) to an action \(a\). The parameters \(\theta\) of this policy are learned by using the expert actions as supervision for training, e.g., by maximizing the probability of the expert actions on the training data \(D\): \(\max_{\theta} E_{(s,a)\sim D}\log \pi_{\theta}(a|s)\)
Reference:
Pomerleau ALVINN: An Autonomous Land Vehicle in a Neural Network, NeurIPS 1988.
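The objective \(E_{(s,a)\sim D}\log \pi_{\theta}(a|s)\) can be evaluated by hand for a tiny tabular policy. The sketch below is only an illustration of that expression, not BcAlgorithm's loss: the softmax policy over discrete actions, the dictionary of logits and both function names are assumptions made for the example.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def bc_objective(policy_logits, dataset):
    """Average log pi(a|s) over (state, action) pairs; higher is better.

    policy_logits maps each state to a list of per-action logits, which
    plays the role of the parameters theta in this toy setting.
    """
    total = 0.0
    for state, action in dataset:
        total += log_softmax(policy_logits[state])[action]
    return total / len(dataset)
```

A policy that puts more probability on the expert's action scores higher than a uniform one, which is the direction the maximization pushes \(\theta\).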
- Parameters
observation_spec (nested TensorSpec) – representing the observations.
action_spec (nested BoundedTensorSpec) – representing the actions; can be a mixture of discrete and continuous actions. The number of continuous actions can be arbitrary while only one discrete action is allowed currently. If it’s a mixture, then it must be a tuple/list (discrete_action_spec, continuous_action_spec).
reward_spec (Callable) – a rank-1 or rank-0 tensor spec representing the reward(s). Present for interface compatibility; not actually used in BcAlgorithm.
actor_network_cls (Callable) – is used to construct the actor network. The constructed actor network is a deterministic network and will be used to generate continuous actions.
actor_optimizer (torch.optim.optimizer) – The optimizer for actor.
env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. env only needs to be provided to the root algorithm.
config (TrainerConfig) – config for training. It only needs to be provided to the algorithm which performs train_iter() by itself.
checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.
debug_summaries (bool) – True if debug summaries should be created.
epsilon_greedy (float) – a floating value in [0,1], representing the chance of action sampling instead of taking argmax. This can help prevent a dead loop in some deterministic environment like Breakout. Only used for evaluation. If None, its value is taken from config.epsilon_greedy and then alf.get_config_value(TrainerConfig.epsilon_greedy).
name (str) – The name of this algorithm.
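The meaning of epsilon_greedy (the chance of sampling an action instead of taking the argmax) can be shown for a discrete action distribution. This is a minimal sketch, not the library's evaluation code; epsilon_greedy_action and its rng argument are hypothetical names introduced for the example.

```python
import random


def epsilon_greedy_action(probs, epsilon, rng=random):
    """Sample from probs with probability epsilon, else take the argmax."""
    if rng.random() < epsilon:
        # Exploratory branch: draw an action index according to probs.
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]
    # Greedy branch: index of the most probable action.
    return max(range(len(probs)), key=probs.__getitem__)
```

With epsilon=0.0 the choice is fully deterministic, which is where a deterministic environment can get stuck in a loop; a small positive epsilon occasionally breaks the cycle.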
- calc_loss_offline(info, pre_train=False)[source]#
Calculate the hybrid loss at each step for each sample. By default, this function calls calc_loss.
- Parameters
info_offline (nest) – information collected for training from the offline training branch. It is returned by train_step_offline() (hybrid off-policy training).
pre_train (bool) – whether in the pre-training phase. This flag can be used by algorithms that need different training procedures in different phases.
- Returns
loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).
- Return type
- predict_step(inputs, state)[source]#
Predict for one step of observation.
This is only used for evaluation, so it only needs to perform the computations for generating the action distribution.
- Parameters
time_step (TimeStep) – Current observation and other inputs for computing action.
state (nested Tensor) – should be consistent with predict_state_spec
- Returns
output (nested Tensor): should be consistent with action_spec.
state (nested Tensor): should be consistent with predict_state_spec.
- Return type
- train_step_offline(inputs, state, rollout_info, pre_train=False)[source]#
Perform one step of offline training computation.
It is called to calculate output for every time step for a batch of experience from the offline replay buffer. It also needs to generate the information required by calc_loss_offline(). By default, this function calls train_step.
- Parameters
inputs (nested Tensor) – inputs for train.
state (nested Tensor) – consistent with train_state_spec.
rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from the replay buffer.
pre_train (bool) – whether in the pre-training phase. This flag can be used by algorithms that need different training procedures in different phases.
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match train_state_spec.
info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate the loss or override update_with_gradient() to do customized training.
- Return type
- training: bool#
- class BcInfo(actor)#
Bases:
tuple
Create new instance of BcInfo(actor,)
- actor#
Alias for field number 0
- BcLossInfo#
alias of
alf.algorithms.bc_algorithm.LossInfo
alf.algorithms.causal_bc_algorithm#
Causal Behavior Cloning Algorithm.
- class BcInfo(actor, discriminator, target)#
Bases:
tuple
Create new instance of BcInfo(actor, discriminator, target)
- actor#
Alias for field number 0
- discriminator#
Alias for field number 1
- target#
Alias for field number 2
- BcLossInfo#
alias of
alf.algorithms.causal_bc_algorithm.LossInfo
- class BcState(actor)#
Bases:
tuple
Create new instance of BcState(actor,)
- actor#
Alias for field number 0
- class CausalBcAlgorithm(observation_spec, action_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), actor_network_cls=<class 'alf.networks.actor_networks.ActorNetwork'>, discriminator_network_cls=<class 'alf.networks.encoding_networks.EncodingNetwork'>, actor_optimizer=None, discriminator_optimizer=None, f_norm_penalty_weight=0.001, bc_regulatization_weight=0.05, env=None, config=None, checkpoint=None, debug_summaries=False, epsilon_greedy=None, name='CausalBcAlgorithm')[source]#
Bases:
alf.algorithms.off_policy_algorithm.OffPolicyAlgorithm
Causal behavior cloning algorithm. This is an implementation of the ResiduIL algorithm proposed in the following paper:
Swamy et al. Causal Imitation Learning under Temporally Correlated Noise, ICML 2022
- Parameters
observation_spec (nested TensorSpec) – representing the observations.
action_spec (nested BoundedTensorSpec) – representing the actions; can be a mixture of discrete and continuous actions. The number of continuous actions can be arbitrary while only one discrete action is allowed currently. If it’s a mixture, then it must be a tuple/list (discrete_action_spec, continuous_action_spec).
reward_spec (Callable) – a rank-1 or rank-0 tensor spec representing the reward(s). Present for interface compatibility; not actually used in CausalBcAlgorithm.
actor_network_cls (Callable) – is used to construct the actor network. The constructed actor network is a deterministic network and will be used to generate continuous actions.
discriminator_network_cls (Callable) – is used to construct the discriminator network. The discriminator is trained adversarially against the policy to help learn a robust policy. It takes the observation from the previous time step and generates the Lagrange multiplier for the current step.
actor_optimizer (torch.optim.optimizer) – The optimizer for actor.
discriminator_optimizer (torch.optim.optimizer) – the optimizer for discriminator.
f_norm_penalty_weight (float) – penalty weight for the output of the discriminator.
bc_regulatization_weight (float) – weight for the regularization term based on the squared prediction error.
env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. env only needs to be provided to the root algorithm.
config (TrainerConfig) – config for training. It only needs to be provided to the algorithm which performs train_iter() by itself.
checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.
debug_summaries (bool) – True if debug summaries should be created.
epsilon_greedy (float) – a floating value in [0,1], representing the chance of action sampling instead of taking argmax. This can help prevent a dead loop in some deterministic environment like Breakout. Only used for evaluation. If None, its value is taken from config.epsilon_greedy and then alf.get_config_value(TrainerConfig.epsilon_greedy).
name (str) – The name of this algorithm.
- calc_loss_offline(info, pre_train=False)[source]#
Calculate the hybrid loss at each step for each sample. By default, this function calls calc_loss.
- Parameters
info_offline (nest) – information collected for training from the offline training branch. It is returned by train_step_offline() (hybrid off-policy training).
pre_train (bool) – whether in the pre-training phase. This flag can be used by algorithms that need different training procedures in different phases.
- Returns
loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).
- Return type
- predict_step(inputs, state)[source]#
Predict for one step of observation.
This is only used for evaluation, so it only needs to perform the computations required for generating the action distribution.
- Parameters
time_step (TimeStep) – Current observation and other inputs for computing action.
state (nested Tensor) – should be consistent with predict_state_spec
- Returns
output (nested Tensor): should be consistent with action_spec.
state (nested Tensor): should be consistent with predict_state_spec.
- Return type
- train_step_offline(inputs, state, rollout_info, pre_train=False)[source]#
Perform one step of offline training computation.
It is called to calculate output for every time step for a batch of experience from the offline replay buffer. It also needs to generate necessary information for calc_loss_offline(). By default, this function calls train_step.
- Parameters
inputs (nested Tensor) – inputs for train.
state (nested Tensor) – consistent with train_state_spec.
rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from the replay buffer.
pre_train (bool) – whether training is in the pre-training phase. This flag can be used for algorithms that need to implement different training procedures at different phases.
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match train_state_spec.
info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.
- Return type
- training: bool#
alf.algorithms.config#
- class TrainerConfig(root_dir, conf_file='', ml_type='rl', algorithm_ctor=None, data_transformer_ctor=None, random_seed=None, num_iterations=1000, num_env_steps=0, unroll_length=8, unroll_with_grad=False, async_unroll=False, max_unroll_length=0, unroll_queue_size=200, unroll_step_interval=0, unroll_parameter_update_period=10, use_rollout_state=False, temporally_independent_train_step=None, num_checkpoints=10, confirm_checkpoint_upon_crash=True, no_thread_env_for_conf=False, evaluate=False, num_evals=None, eval_interval=10, epsilon_greedy=0.0, eval_uncertainty=False, num_eval_episodes=10, num_eval_environments=1, async_eval=True, ddp_paras_check_interval=0, num_summaries=None, summary_interval=50, summarize_first_interval=True, update_counter_every_mini_batch=False, summaries_flush_secs=1, summary_max_queue=10, metric_min_buffer_size=10, debug_summaries=False, profiling=False, enable_amp=False, code_snapshots=None, summarize_grads_and_vars=False, summarize_gradient_noise_scale=False, summarize_action_distributions=False, summarize_output=False, initial_collect_steps=0, num_updates_per_train_iter=4, mini_batch_length=None, mini_batch_size=None, whole_replay_buffer_training=True, replay_buffer_length=1024, priority_replay=False, priority_replay_alpha=0.7, priority_replay_beta=0.4, priority_replay_eps=1e-06, offline_buffer_dir=None, offline_buffer_length=None, rl_train_after_update_steps=0, rl_train_every_update_steps=1, empty_cache=False, normalize_importance_weights_by_max=False, clear_replay_buffer=True)[source]#
Bases: object
Configuration for training.
- Parameters
root_dir (str) – directory for saving summary and checkpoints
ml_type (str) – type of learning task, one of [‘rl’, ‘sl’]
algorithm_ctor (Callable) – callable that creates an OffPolicyAlgorithm or OnPolicyAlgorithm instance.
data_transformer_ctor (Callable|list[Callable]) – Function(s) for creating data transformer(s). Each of them will be called as data_transformer_ctor(observation_spec) to create a data transformer. Available transformers are in algorithms.data_transformer. The data transformer constructed by this can be accessed as TrainerConfig.data_transformer. Important Note: HindsightExperienceTransformer, FrameStacker or any data transformer that needs to access the replay buffer for additional data must come before all other data transformers. The reason is the following: in off-policy training, the replay buffer stores raw input that has not been processed by any data transformer. If, say, ObservationNormalizer is applied before hindsight, then data retrieved by replay will be normalized whereas hindsight data directly pulled from the replay buffer will not be normalized. The data will then be mismatched, causing training to suffer and potentially fail.
random_seed (None|int) – random seed; a random seed is used if None.
num_iterations (int) – For RL trainer, indicates the number of update iterations (ignored if 0). Note that for off-policy algorithms, if initial_collect_steps>0, then the first initial_collect_steps//(unroll_length*num_envs) iterations won’t perform any training. For SL trainer, indicates the number of training epochs. If both num_iterations and num_env_steps are set, num_iterations must be big enough to consume that many environment steps. After num_env_steps environment steps are generated, the training will not interact with environments anymore, which means that it will only train on the replay buffer.
num_env_steps (int) – number of environment steps (ignored if 0). The total number of FRAMES will be (num_env_steps*frame_skip) for calculating sample efficiency. See alf/environments/wrappers.py for the definition of FrameSkip.
unroll_length (float) – number of time steps each environment proceeds per iteration. The total number of time steps from all environments per iteration can be computed as: num_envs * env_batch_size * unroll_length. If unroll_length is not an integer, the actual unroll_length being used will fluctuate between floor(unroll_length) and ceil(unroll_length) and the expectation will be equal to unroll_length.
unroll_with_grad (bool) – a bool flag indicating whether we require grad during unroll(). This flag is only used by OffPolicyAlgorithm, where unrolling with grads is usually unnecessary and turned off to save memory. However, when there is an on-policy sub-algorithm, we can enable this flag for its training. OnPolicyAlgorithm always unrolls with grads and this flag doesn’t apply to it.
async_unroll (bool) – whether to unroll asynchronously. If True, unroll will be performed in parallel with training.
max_unroll_length (int) – the maximal length of unroll results for each iteration. If the time for one step of training is less than the time for unrolling max_unroll_length steps, the length of the unroll results will be less than max_unroll_length. Only used if async_unroll is True and unroll_length==0.
unroll_queue_size (int) – the size of the queue for transmitting unroll results from the unroll process to the main process. Only used if async_unroll is True. If the queue is full, the unroll process will wait for the main process to retrieve unroll results from the queue before performing more unrolls.
unroll_step_interval (float) – if not zero, the time interval in seconds between two consecutive environment steps. Only used if async_unroll is True. This is useful if the interaction with the environment happens in real time (e.g. real world robot or real time simulation) and you want a fixed interaction frequency with the environment. Note that this will not have any effect if the environment step and rollout step together take longer than unroll_step_interval.
unroll_parameter_update_period (int) – update the parameters for the asynchronous unroll every so many iterations. Only used if async_unroll is True.
use_rollout_state (bool) – If True, during off-policy training the RNN states will be taken from the replay buffer; otherwise they will be set to 0. In the case of True, the train_state_spec of an algorithm should always be a subset of the rollout_state_spec.
temporally_independent_train_step (bool|None) – If True, the train_step is called with all the experiences in one batch instead of being called sequentially with mini_batch_length batches. Only used by OffPolicyAlgorithm. In general, this option can only be used if the algorithm has no state. For an Algorithm with state (e.g. SarsaAlgorithm not using RNN), if there is no need to recompute state at train_step, this option can also be used. If None, its value is inferred based on whether the algorithm has RNN state (True if there is no RNN state, False otherwise).
num_checkpoints (int) – how many checkpoints to save for the training
confirm_checkpoint_upon_crash (bool) – whether to prompt for confirmation before checkpointing after a crash.
no_thread_env_for_conf (bool) – whether to skip creating an unwrapped env for the purpose of showing operative configurations. If True, no ThreadEnvironment will ever be created, regardless of the value of TrainerConfig.evaluate. If False, a ThreadEnvironment will be created if TrainerConfig.evaluate is True or the training env is a ParallelAlfEnvironment instance. For an env that consumes lots of resources, this flag can be set to True to save resources if no evaluation is needed. The decision of creating an unwrapped env won’t affect training; it’s used to correctly display inoperative configurations in subprocesses.
evaluate (bool) – whether to evaluate during training.
num_evals (int) – how many evaluations are needed throughout the training. If not None, an automatically calculated eval_interval will replace config.eval_interval.
eval_interval (int) – evaluate every so many iterations.
epsilon_greedy (float) – a floating value in [0,1], representing the chance of action sampling instead of taking argmax. This can help prevent a dead loop in some deterministic environment like Breakout. Only used for evaluation.
eval_uncertainty (bool) – whether to evaluate uncertainty after training.
num_eval_episodes (int) – number of episodes for one evaluation.
num_eval_environments (int) – the number of environments for evaluation.
async_eval (bool) – whether to do evaluation asynchronously in a different process. Note that this may use more memory.
ddp_paras_check_interval (int) – if >0, then every so many iterations the trainer will perform a consistency check of the model parameters across different worker processes, if multi-gpu training is used.
num_summaries (int) – how many summary calls are needed throughout the training. If not None, an automatically calculated summary_interval will replace config.summary_interval. Note that this number doesn’t include the summary steps of the first interval if summarize_first_interval=True. In this case, the actual number of summaries will be roughly this number plus the calculated summary interval.
summary_interval (int) – write summary every so many training steps
summarize_first_interval (bool) – whether to summarize every step of the first interval (default True). It might be better to turn this off for an easier post-processing of the curve.
update_counter_every_mini_batch (bool) – whether to update the counter for every mini batch. The summary_interval is based on this counter. Typically, this should be False. Set to True if you want a summary for every mini batch for debugging purposes. Only used by OffPolicyAlgorithm.
summaries_flush_secs (int) – flush summary to disk every so many seconds
summary_max_queue (int) – flush to disk every so many summaries
metric_min_buffer_size (int) – a minimal size of the buffer used to construct some average episodic metrics used in RLAlgorithm.
debug_summaries (bool) – A bool to gather debug summaries.
profiling (bool) – If True, use cProfile to profile the training. The profile result will be written to root_dir/py_train.INFO.
enable_amp – whether to use automatic mixed precision for training. This can make the training faster if the algorithm is GPU intensive. However, the result may be different (most likely due to random fluctuation).
code_snapshots (list[str]) – an optional list of code files to write to tensorboard text. Note: the code file path should be relative to “<ALF_ROOT>/alf”, e.g., “algorithms/agent.py”. This can be useful for tracking code changes when running a job.
summarize_grads_and_vars (bool) – If True, gradient and network variable summaries will be written during training.
summarize_gradient_noise_scale (bool) – whether to summarize gradient noise scale. See alf.optimizers.utils.py for details.
summarize_output (bool) – If True, summarize output of certain networks.
initial_collect_steps (int) – if positive, the number of steps each single environment takes before the first update is performed. Only used by OffPolicyAlgorithm.
num_updates_per_train_iter (int) – number of optimization steps for one iteration. Only used by OffPolicyAlgorithm.
mini_batch_size (int) – number of sequences for each minibatch. If None, it’s set to the replayer’s batch_size. Only used by OffPolicyAlgorithm.
mini_batch_length (int) – the length of the sequence for each sample in the minibatch. Only used by OffPolicyAlgorithm.
whole_replay_buffer_training (bool) – whether to use all data in the replay buffer to perform one update. Only used by OffPolicyAlgorithm.
clear_replay_buffer (bool) – whether the replay buffer is wiped clean after all of its data has been used to perform one update. Only used by OffPolicyAlgorithm.
replay_buffer_length (int) – the maximum number of steps the replay buffer stores for each environment. Only used by OffPolicyAlgorithm.
priority_replay (bool) – Use prioritized sampling if this is True.
priority_replay_alpha (float|Scheduler) – The priority from LossInfo is raised to this power as an argument for ReplayBuffer.update_priority(). Note that the effect of ReplayBuffer.initial_priority may change with different values of priority_replay_alpha. Hence you may need to adjust ReplayBuffer.initial_priority accordingly.
priority_replay_beta (float|Scheduler) – weight the loss of each sample by importance_weight**(-priority_replay_beta), where importance_weight is from the BatchInfo returned by ReplayBuffer.get_batch(). This is only useful if prioritized_sampling is enabled for ReplayBuffer.
priority_replay_eps (float) – minimum priority for priority replay.
offline_buffer_dir (str|[str]) – path to the offline replay buffer checkpoint to be loaded. If a list of strings is provided, each represents the directory of one replay buffer checkpoint.
offline_buffer_length (int) – the maximum length that will be loaded from each replay buffer checkpoint. Therefore the total buffer length is offline_buffer_length * len(offline_buffer_dir). If None, all the samples from all the provided replay buffer checkpoints will be loaded.
rl_train_after_update_steps (int) – only used in the hybrid training mode. It is used as a starting criterion for the normal (non-offline) part of the RL training, which only starts after this many update steps (according to global_counter).
rl_train_every_update_steps (int) – only used in the hybrid training mode. It is used to control the update frequency of the normal (non-offline) part of the RL training (according to global_counter). Through this flag, we can have more fine-grained control over the update frequencies of online and offline RL training (currently assumes the training frequency of offline RL is always higher than or equal to that of the online RL part). For example, we can set rl_train_every_update_steps = 2 to have a train config that executes online RL training at half the frequency of the offline RL training.
empty_cache (bool) – empty GPU memory cache at the start of every iteration to reduce GPU memory usage. This option may slightly slow down the overall speed.
normalize_importance_weights_by_max (bool) – if True, normalize the importance weights by their max to prevent instability caused by large importance weights.
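Two of the arithmetic rules in the parameter list above can be sketched in plain Python. The first function draws an integer unroll length that fluctuates between floor(unroll_length) and ceil(unroll_length) with the stated expectation. The second lists the update steps at which the online part of hybrid training would run, assuming the gate is a simple threshold plus a modulo test on the update counter. Both function names are hypothetical, and the modulo gate in particular is an assumption about the implementation, not something the text confirms.

```python
import math
import random


def sample_unroll_length(unroll_length, rng=random):
    """Draw an integer unroll length whose expectation is ``unroll_length``.

    Hypothetical helper, not ALF code. The ceiling is chosen with
    probability equal to the fractional part, the floor otherwise.
    """
    low = math.floor(unroll_length)
    frac = unroll_length - low
    return low + 1 if rng.random() < frac else low


def online_rl_update_steps(num_steps, rl_train_after_update_steps,
                           rl_train_every_update_steps):
    """List the update steps on which online RL training would run.

    Assumed gate (not confirmed by the documentation): the step counter
    has reached ``rl_train_after_update_steps`` and is a multiple of
    ``rl_train_every_update_steps``.
    """
    return [
        step for step in range(num_steps)
        if step >= rl_train_after_update_steps
        and step % rl_train_every_update_steps == 0
    ]
```

For instance, `online_rl_update_steps(8, 3, 2)` gives `[4, 6]` under this assumed gate: online training starts once three update steps have passed and then runs on every second step.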
alf.algorithms.containers#
- class AlgorithmContainer(algs, train_state_spec, rollout_state_spec, predict_state_spec, is_on_policy, debug_summaries, name)[source]#
Bases: alf.algorithms.algorithm.Algorithm
Algorithm that contains several sub-algorithms.
It provides sensible implementations of several interface functions of Algorithm.
- Parameters
algs (dict[Algorithm]) – a dictionary of algorithms.
train_state_spec (nested TensorSpec) – for the network state of train_step().
rollout_state_spec (nested TensorSpec) – for the network state of rollout_step(). If None, it’s assumed to be the same as train_state_spec.
predict_state_spec (nested TensorSpec) – for the network state of predict_step(). If None, it’s assumed to be the same as rollout_state_spec.
is_on_policy (None|bool) – whether the algorithm is on-policy or not. If None, the on-policiness will be decided based on the on-policiness of each sub-algorithm.
debug_summaries (bool) – True if debug summaries should be created.
name (str) – name of this algorithm.
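The fallback chain for the state specs described above can be written out as a short sketch. It only illustrates the documented defaults; `resolve_state_specs` is a hypothetical name, and the specs are treated as opaque Python objects rather than real TensorSpec nests.

```python
def resolve_state_specs(train_state_spec,
                        rollout_state_spec=None,
                        predict_state_spec=None):
    """Apply the documented defaults for missing state specs.

    Hypothetical helper, not ALF code. A missing rollout spec falls back
    to the train spec, and a missing predict spec falls back to the
    (already resolved) rollout spec.
    """
    if rollout_state_spec is None:
        rollout_state_spec = train_state_spec
    if predict_state_spec is None:
        predict_state_spec = rollout_state_spec
    return train_state_spec, rollout_state_spec, predict_state_spec
```

Because the predict spec is resolved after the rollout spec, passing only a train spec makes all three identical.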
- preprocess_experience(root_inputs, rollout_info, batch_info)[source]#
Call the preprocess_experience of each sub-algorithm.
- training: bool#
- class EchoAlg(alg, echo_spec, name='EchoAlg')[source]#
Bases: alf.algorithms.algorithm.Algorithm
Echo Algorithm.
Echo algorithm uses part of the output of alg of current step as part of the input of alg for the next step. It assumes that the input of alg is a dict with two keys: ‘input’ and ‘echo’, and the output of alg is a dict with two keys: ‘output’ and ‘echo’. The ‘echo’ output of current step will be the ‘echo’ input of the next step. ‘input’ of alg’s input is from the input of EchoAlg and ‘output’ of alg’s output is the output of EchoAlg.
- Parameters
alg (Algorithm) – the module for performing the actual computation
echo_spec (nested TensorSpec) – describe the data format of echo.
name (str) –
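The echo wiring described above can be mimicked with a toy loop in plain Python. This is a sketch of the data flow only: `EchoLoop` and `running_sum` are hypothetical names, the wrapped step is an ordinary function rather than an Algorithm, and the echo is kept as an attribute here for brevity, which need not match how EchoAlg carries it between steps.

```python
class EchoLoop:
    """Toy stand-in for the echo wiring; not ALF's EchoAlg."""

    def __init__(self, step_fn, initial_echo):
        self._step_fn = step_fn
        self._echo = initial_echo

    def step(self, inputs):
        # The wrapped step sees a dict with the keys 'input' and 'echo' ...
        result = self._step_fn({'input': inputs, 'echo': self._echo})
        # ... and returns a dict with the keys 'output' and 'echo'.
        self._echo = result['echo']  # fed back on the next step
        return result['output']


def running_sum(x):
    """Example step: echoes the accumulated total to the next step."""
    total = x['input'] + x['echo']
    return {'output': total, 'echo': total}


loop = EchoLoop(running_sum, initial_echo=0)
outputs = [loop.step(v) for v in [1, 2, 3]]  # outputs == [1, 3, 6]
```

The caller only ever supplies ‘input’ and only ever receives ‘output’; the ‘echo’ channel stays internal to the loop.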
- after_train_iter(root_inputs, rollout_info)[source]#
Do things after completing one training iteration (i.e. train_iter() that consists of one or multiple gradient updates). This function can be used for training additional modules that have their own training logic (e.g., on/off-policy, replay buffers, etc). These modules should be added to _trainable_attributes_to_ignore in the parent algorithm.
Other things might also be possible as long as they should be done once every training iteration.
This function serves the same purpose as after_update if there is always only one gradient update in each training iteration. Otherwise it is called less frequently than after_update.
- Parameters
root_inputs (nest|None) – temporally batched inputs for the rollout_step() of the root algorithm collected during unroll(). In the case where no data is available from the rollout_step() (e.g. in an offline pre-training phase where the online interaction has not started yet), root_inputs will be None.
rollout_info (nest|None) – information collected from rollout_step() for this algorithm during unroll(). In the case where no data is available from the rollout_step() (e.g. in an offline pre-training phase where the online interaction has not started yet), rollout_info will be None.
- after_update(root_inputs, info)[source]#
Do things after completing one gradient update (i.e. update_with_gradient()). This function can be used for post-processing following one minibatch update, such as copying a training model to a target model in SAC, DQN, etc.
- Parameters
root_inputs (nest) – temporally batched inputs for the rollout_step() of the root algorithm collected during unroll().
info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() for on-policy training or train_step() for off-policy training.
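As an illustration of the kind of post-update step mentioned above (copying a training model to a target model), here is a minimal sketch of a soft target update over plain dictionaries of floats. The name `soft_update` and the mixing rate `tau` are hypothetical; this is not ALF's own target-update utility, and real models would hold tensors rather than floats.

```python
def soft_update(target, source, tau):
    """Move each target parameter a fraction ``tau`` toward the source.

    Hypothetical helper, not ALF code. ``tau=1.0`` is a hard copy;
    smaller values give a slowly moving target.
    """
    for name, value in source.items():
        target[name] = (1.0 - tau) * target[name] + tau * value
    return target
```

Calling such a function once per gradient update is the usage pattern that after_update() is described as supporting.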
- calc_loss(info)[source]#
Calculate the loss at each step for each sample.
- Parameters
info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).
- Returns
loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).
- Return type
- predict_step(inputs, state)[source]#
Predict for one step of inputs.
- Parameters
inputs (nested Tensor) – inputs for prediction.
state (nested Tensor) – network state (for RNN).
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match predict_state_spec.
info (nest): information for analyzing the agent. In particular, if an element of the info is alf.summary.render.Image, it will be rendered during play. See alf/summary/render.py for details.
- Return type
- preprocess_experience(root_inputs, rollout_info, batch_info)[source]#
This function is called on the experiences obtained from a replay buffer. An example usage of this function is to calculate advantages and returns in PPOAlgorithm.
The shapes of tensors in experience are assumed to be \((B, T, ...)\).
- Parameters
root_inputs (nest) – input for rollout_step() of the root algorithm. This is from the replay buffer. Note that this is not the same as the input of rollout_step() of self unless self is the root algorithm.
rollout_info (nested Tensor) – AlgStep.info from rollout_step() for this algorithm.
batch_info (BatchInfo) – information about this batch of data
- Returns
processed root_inputs
processed rollout_info
- Return type
tuple
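To give a feel for the kind of preprocessing mentioned above (computing returns over experience laid out as \((B, T)\)), the following sketch computes plain discounted returns over nested Python lists. It is a simplified stand-in: `discounted_returns` is a hypothetical name, episode boundaries and value bootstrapping are ignored, and it is not the computation PPOAlgorithm actually performs.

```python
def discounted_returns(rewards, discount):
    """Compute discounted returns for a batch of reward sequences.

    Hypothetical helper, not ALF code. ``rewards`` is a list of B
    sequences, each of length T; the result has the same (B, T) layout.
    """
    returns = []
    for seq in rewards:
        acc = 0.0
        out = [0.0] * len(seq)
        # Walk backwards in time so each step reuses the next step's return.
        for t in reversed(range(len(seq))):
            acc = seq[t] + discount * acc
            out[t] = acc
        returns.append(out)
    return returns
```

The batch dimension comes first and time second, matching the \((B, T, ...)\) layout stated for this function's inputs.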
- rollout_step(inputs, state)[source]#
Rollout for one step of inputs.
It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for calc_loss(). For off-policy training, it needs to generate necessary information for train_step().
- Parameters
inputs (nested Tensor) – inputs for prediction.
state (nested Tensor) – network state (for RNN).
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match rollout_state_spec.
info (nested Tensor): For on-policy training it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored into the replay buffer and retrieved for train_step() as rollout_info.
- Return type
- set_on_policy(is_on_policy)[source]#
Set whether this algorithm is on-policy or not.
- Parameters
is_on_policy (bool) –
- set_path(path)[source]#
Set the path from the root algorithm to this algorithm.
See AlgorithmInterface.path for a description of path. This function is called by the trainer before training starts. It needs to be implemented if the algorithm contains some other sub-algorithms.
If an algorithm does not have any sub-algorithm or its sub-algorithm does not need to access the root replay buffer directly, it does not need to implement this function.
- train_step(inputs, state, rollout_info)[source]#
Perform one step of training computation.
It is called to calculate output for every time step for a batch of experience from the replay buffer. It also needs to generate necessary information for calc_loss().
- Parameters
inputs (nested Tensor) – inputs for train.
state (nested Tensor) – consistent with train_state_spec.
rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from the replay buffer.
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match train_state_spec.
info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.
- Return type
- training: bool#
- class RLAlgWrapper(observation_spec, action_spec, algorithm, env=None, reward_spec=TensorSpec(shape=(), dtype=torch.float32), config=None, optimizer=None, debug_summaries=False, name='RLAlgWrapper')[source]#
Bases: alf.algorithms.rl_algorithm.RLAlgorithm
Wrap an Algorithm instance as an RLAlgorithm instance so that it can be used for RLTrainer.
- Parameters
observation_spec (nested TensorSpec) – representing the observations.
action_spec (nested BoundedTensorSpec) – representing the actions.
algorithm (Algorithm) – algorithm to be wrapped. It should take TimeStep as input and its output will be used as action.
reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).
env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. Running multiple environments in parallel is crucial to on-policy algorithms as it increases the diversity of data and decreases temporal correlation. env only needs to be provided to the root Algorithm.
config (TrainerConfig) – config for training. config only needs to be provided to the algorithm which performs a training iteration by itself.
optimizer (torch.optim.Optimizer) – The default optimizer for training.
debug_summaries (bool) – If True, debug summaries will be created.
name (str) – Name of this algorithm.
- after_train_iter(root_inputs, rollout_info)[source]#
Do things after completing one training iteration (i.e. train_iter() that consists of one or multiple gradient updates). This function can be used for training additional modules that have their own training logic (e.g., on/off-policy, replay buffers, etc). These modules should be added to _trainable_attributes_to_ignore in the parent algorithm.
Other things might also be possible as long as they should be done once every training iteration.
This function serves the same purpose as after_update if there is always only one gradient update in each training iteration. Otherwise it is called less frequently than after_update.
- Parameters
root_inputs (nest|None) – temporally batched inputs for the rollout_step() of the root algorithm collected during unroll(). In the case where no data is available from the rollout_step() (e.g. in an offline pre-training phase where the online interaction has not started yet), root_inputs will be None.
rollout_info (nest|None) – information collected from rollout_step() for this algorithm during unroll(). In the case where no data is available from the rollout_step() (e.g. in an offline pre-training phase where the online interaction has not started yet), rollout_info will be None.
- after_update(root_inputs, info)[source]#
Do things after completing one gradient update (i.e. update_with_gradient()). This function can be used for post-processing following one minibatch update, such as copying a training model to a target model in SAC, DQN, etc.
- Parameters
root_inputs (nest) – temporally batched inputs for the rollout_step() of the root algorithm collected during unroll().
info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() for on-policy training or train_step() for off-policy training.
- calc_loss(info)[source]#
Calculate the loss at each step for each sample.
- Parameters
info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).
- Returns
loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).
- Return type
- predict_step(inputs, state)[source]#
Predict for one step of observation.
This is only used for evaluation, so it only needs to perform the computations required for generating the action distribution.
- Parameters
time_step (TimeStep) – Current observation and other inputs for computing action.
state (nested Tensor) – should be consistent with predict_state_spec
- Returns
output (nested Tensor): should be consistent with action_spec.
state (nested Tensor): should be consistent with predict_state_spec.
- Return type
- preprocess_experience(root_inputs, rollout_info, batch_info)[source]#
This function is called on the experiences obtained from a replay buffer. An example usage of this function is to calculate advantages and returns in PPOAlgorithm.
The shapes of tensors in experience are assumed to be \((B, T, ...)\).
- Parameters
root_inputs (nest) – input for rollout_step() of the root algorithm. This is from the replay buffer. Note that this is not the same as the input of rollout_step() of self unless self is the root algorithm.
rollout_info (nested Tensor) – AlgStep.info from rollout_step() for this algorithm.
batch_info (BatchInfo) – information about this batch of data
- Returns
processed root_inputs
processed rollout_info
- Return type
tuple
- rollout_step(inputs, state)[source]#
Rollout for one step of inputs.
It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for calc_loss(). For off-policy training, it needs to generate necessary information for train_step().
- Parameters
inputs (nested Tensor) – inputs for prediction.
state (nested Tensor) – network state (for RNN).
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match rollout_state_spec.
info (nested Tensor): For on-policy training it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored into the replay buffer and retrieved for train_step() as rollout_info.
- Return type
- set_on_policy(is_on_policy)[source]#
Set whether this algorithm is on-policy or not.
- Parameters
is_on_policy (bool) –
- set_path(path)[source]#
Set the path from the root algorithm to this algorithm.
See AlgorithmInterface.path for a description of path. This function is called by the trainer before training starts. It needs to be implemented if the algorithm contains some other sub-algorithms.
If an algorithm does not have any sub-algorithm or its sub-algorithm does not need to access the root replay buffer directly, it does not need to implement this function.
- train_step(inputs, state, rollout_info)[source]#
Perform one step of training computation.
It is called to calculate output for every time step for a batch of experience from the replay buffer. It also needs to generate necessary information for calc_loss().
- Parameters
inputs (nested Tensor) – inputs for train.
state (nested Tensor) – consistent with train_state_spec.
rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from the replay buffer.
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match train_state_spec.
info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.
- Return type
- training: bool#
- SequentialAlg(*modules, output='', is_on_policy=None, name='SequentialAlg', **named_modules)[source]#
Compose Algorithms and Networks sequentially as a new Algorithm.
All the modules provided through modules and named_modules are calculated sequentially in the same order as they appear in the call to SequentialAlg. By default, each module takes the output of the previous module as its input (or the input to the SequentialAlg if it is the first module), and the output of the last module is the output of the SequentialAlg. Note that the output of a module has a different meaning depending on the type of the module:
- Algorithm: the AlgStep.output field from predict_step, rollout_step or train_step
- Network: the first element of the tuple returned from forward()
- torch.nn.Module or Callable: the return value of the Callable.
In addition to using the output of the previous module as input, SequentialAlg also allows using other output, state or info from previous modules as the input to a module. To do this, one can pass a tuple of (nested_str, module) instead of module as an argument to SequentialAlg. With this, the inputs to the module will be obtained using get_nested_field(named_results, nested_str), where named_results is a dictionary containing the inputs to SequentialAlg and all the results calculated by previous modules. More specifically, named_results['input'] is the inputs to this algorithm. named_results['a'] is the output of the module named ‘a’. named_results['info']['a'] is the info output of the algorithm named ‘a’. And named_results['state']['a'] is the state output of the algorithm/network named ‘a’.
Example 1:
The following constructs an algorithm which predicts the future of its input:

    predictor = EncodingNetwork(...)
    alg = SequentialAlg(
        predicted=predictor,
        delayed=networks.Delay(),
        # compare the previous step's prediction with the current input
        error=(('delayed', 'input'), lambda xy: (xy[0] - xy[1]) ** 2),
        loss=Loss(),
        output='predicted',
    )
It is equivalent to the following:
    class PredictAlgorithm(Algorithm):
        def __init__(self, predictor):
            super().__init__(train_state_spec=(
                predictor.state_spec, predictor.input_tensor_spec))
            self._predictor = predictor
            self._loss = Loss()

        def rollout_step(self, inputs, state):
            return self._step(inputs, state)

        def train_step(self, inputs, state, rollout_info):
            return self._step(inputs, state)

        def _step(self, inputs, state):
            predictor_state, delayed = state
            predicted, predictor_state = self._predictor(inputs, predictor_state)
            error = (delayed - inputs) ** 2
            loss_step = self._loss.rollout_step(error)
            return AlgStep(
                output=predicted,
                state=(predictor_state, predicted),
                info=loss_step.info)

        def calc_loss(self, info):
            return self._loss.calc_loss(info)

    alg = PredictAlgorithm(predictor)
Example 2:
The following example constructs an actor-critic algorithm:
    value_net = ValueNetwork(...)
    actor_net = ActorDistributionNetwork(...)
    alg = SequentialAlg(
        is_on_policy=True,
        value=('input.observation', value_net),
        action_dist=('input.observation', actor_net),
        action=dist_utils.sample_action_distribution,
        loss=(ActorCriticInfo(
            reward='input.reward',
            step_type='input.step_type',
            discount='input.discount',
            action_distribution='action_dist',
            action='action',
            value='value'), ActorCriticLoss()),
        output='action')
It is equivalent to the following:
    class ACAlgorithm(Algorithm):
        def __init__(self, value_net, actor_net):
            super().__init__(
                train_state_spec=(value_net.state_spec, actor_net.state_spec),
                is_on_policy=True)
            self._value_net = value_net
            self._actor_net = actor_net
            self._loss = ActorCriticLoss()

        def rollout_step(self, inputs, state):
            value, value_state = self._value_net(inputs.observation, state[0])
            action_dist, actor_state = self._actor_net(inputs.observation, state[1])
            action = dist_utils.sample_action_distribution(action_dist)
            loss_step = self._loss.rollout_step(ActorCriticInfo(
                reward=inputs.reward,
                step_type=inputs.step_type,
                discount=inputs.discount,
                action_distribution=action_dist,
                action=action,
                value=value))
            return AlgStep(
                output=action,
                state=(value_state, actor_state),
                info=loss_step.info)

        def calc_loss(self, info):
            return self._loss.calc_loss(info)

    alg = ACAlgorithm(value_net, actor_net)
- Parameters
modules (Callable | Algorithm | (nested str, Callable) | (nested str, Algorithm)) – The Callable can be a torch.nn.Module, alf.nn.Network or plain Callable. Optionally, their inputs can be specified by the first element of the tuple. If input is not provided, it is assumed to be the result of the previous module (or input to this Sequential for the first module). If input is provided, it should be a nested str. It will be used to retrieve results from the dictionary of the current named_results. For modules specified by modules, because no named_modules has been invoked, named_outputs is {'input': input}.
named_modules (Callable | Algorithm | (nested str, Callable) | (nested str, Algorithm)) – The Callable can be a torch.nn.Module, alf.nn.Network or plain Callable. Optionally, their inputs can be specified by the first element of the tuple. If input is not provided, it is assumed to be the result of the previous module (or input to this Sequential for the first module). If input is provided, it should be a nested str. It will be used to retrieve results from the dictionary of the current named_results. named_results is updated once the result of a named module is calculated.
output (nested str) – if not provided, the result from the last module will be used as output. Otherwise, it will be used to retrieve results from named_results after the results of all modules have been calculated.
is_on_policy (bool) – whether this supports on-policy or off-policy training. If it is None, it should support both on-policy and off-policy training.
name (str) – name of this algorithm
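The input-routing rules described for SequentialAlg can be sketched in plain Python. This is only an illustration under simplifying assumptions: `run_sequential` and `resolve` are hypothetical helpers, `get_nested_field` is re-implemented here for dotted paths into dicts only, modules are plain callables, and the `'info'` and `'state'` entries of named_results are omitted.

```python
def get_nested_field(named_results, nested_str):
    """Simplified lookup: follow a dotted path such as 'input.reward'."""
    value = named_results
    for key in nested_str.split('.'):
        value = value[key]
    return value


def resolve(named_results, spec):
    """Resolve a nested str, or a tuple of nested strs, into values."""
    if isinstance(spec, tuple):
        return tuple(resolve(named_results, s) for s in spec)
    return get_nested_field(named_results, spec)


def run_sequential(inputs, output='', **named_modules):
    """Hypothetical helper: run plain callables in keyword order."""
    named_results = {'input': inputs}
    previous = inputs
    for name, module in named_modules.items():
        if isinstance(module, tuple):
            # (nested_str, module): fetch the input from named_results.
            spec, fn = module
            previous = fn(resolve(named_results, spec))
        else:
            # Default: feed the previous module's output.
            previous = module(previous)
        named_results[name] = previous
    return resolve(named_results, output) if output else previous


result = run_sequential(
    {'observation': 3.0, 'reward': 1.0},
    doubled=('input.observation', lambda x: 2 * x),
    shifted=lambda x: x + 1,
    error=(('shifted', 'input.reward'), lambda ab: (ab[0] - ab[1]) ** 2),
    output='error',
)
```

Here `doubled` reads a field of the input, `shifted` consumes the previous output by default, and `error` combines a named result with an input field, mirroring the `(nested_str, module)` form.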
alf.algorithms.data_transformer#
Data transformers for transforming data from environment or replay buffer.
- class DataTransformer(transformed_observation_spec, state_spec)[source]#
Bases: torch.nn.modules.module.Module
Base class for data transformers.
DataTransformer is used for transforming raw data from the environment before passing it to the actual algorithms.
Most data transformers can subclass SimpleDataTransformer, which provides a simpler interface.
- Parameters
transformed_observation_spec (nested TensorSpec) – describing the transformed observation
state_spec (nested TensorSpec) – describing the state of the transformer when it is used to transform TimeStep
- property stack_size#
The number of frames being stacked as one observation.
- property state_spec#
Get the state spec of this transformer.
- training: bool#
- transform_experience(experience)[source]#
Transform an Experience structure.
This is used on the experience data retrieved from replay buffer.
- Parameters
experience (Experience) – the experience retrieved from replay buffer. Note that experience.batch_info, experience.replay_buffer need to be set.
- Returns
transformed experience
- Return type
- transform_timestep(timestep, state)[source]#
Transform a TimeStep structure.
This is used during unroll or predict.
- Parameters
timestep (TimeStep) – the TimeStep to be transformed
state (nested Tensor) – the state of the transformer running over the timestep sequence. It should be the state returned from the previous call to transform_timestep. For the initial call to transform_timestep, a zero state following the state_spec can be used.
- Returns
transformed TimeStep
state of the transformer
- Return type
tuple
- property transformed_observation_spec#
Get the transformed observation_spec.
- class FrameStackState(steps, prev_frames)#
Bases: tuple
Create new instance of FrameStackState(steps, prev_frames)
- prev_frames#
Alias for field number 1
- steps#
Alias for field number 0
- class FrameStacker(observation_spec, stack_size=4, stack_axis=0, fields=None)[source]#
Bases: alf.algorithms.data_transformer.DataTransformer
Create a FrameStacker object.
- Parameters
observation_spec (nested TensorSpec) – describing the observation in timestep
stack_size (int) – the number of frames to stack
stack_axis (int) – the dimension along which to stack the observation.
fields (list[str]) – fields to be stacked. A field str is a multi-level path denoted by “A.B.C”. If None, then the non-nested observation is stacked.
- property stack_size#
Get stack_size.
- training: bool#
- transform_experience(experience)[source]#
Transform an Experience structure.
This is used on the experience data retrieved from replay buffer.
- Parameters
experience (Experience) – the experience retrieved from replay buffer. Note that experience.batch_info, experience.replay_buffer need to be set.
- Returns
transformed experience
- Return type
- transform_timestep(time_step, state)[source]#
Transform a TimeStep structure.
This is used during unroll or predict.
- Parameters
time_step (TimeStep) – the TimeStep to be transformed
state (nested Tensor) – the state of the transformer running over the timestep sequence. It should be the state returned from the previous call to transform_timestep. For the initial call to transform_timestep, a zero state following the state_spec can be used.
- Returns
transformed TimeStep
state of the transformer
- Return type
tuple
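How a frame stacker threads its state through successive transform_timestep calls can be sketched as follows. This is a toy version working on Python tuples rather than tensors. The namedtuple reuses the documented field names, while `initial_state` and `stack_frame` are hypothetical helpers. Filling the missing history with copies of the first frame at the start of an episode is an assumption, not something the documentation above states.

```python
from collections import namedtuple

# Mirrors the field names of FrameStackState documented above.
FrameStackState = namedtuple('FrameStackState', ['steps', 'prev_frames'])


def initial_state(stack_size):
    # No frame has been seen yet; prev_frames holds stack_size - 1 slots.
    return FrameStackState(steps=0, prev_frames=(None,) * (stack_size - 1))


def stack_frame(frame, state, is_first):
    """Return (stacked_frames, new_state) for one incoming frame.

    Assumption: at the first step of an episode the missing history is
    filled with copies of the current frame.
    """
    if is_first or state.steps == 0:
        prev = (frame,) * len(state.prev_frames)
        steps = 0
    else:
        prev = state.prev_frames
        steps = state.steps
    stacked = prev + (frame,)
    # Drop the oldest frame so the state always keeps stack_size - 1 frames.
    new_state = FrameStackState(steps=steps + 1, prev_frames=stacked[1:])
    return stacked, new_state


state = initial_state(stack_size=3)
outputs = []
for t, frame in enumerate(['f0', 'f1', 'f2', 'f3']):
    stacked, state = stack_frame(frame, state, is_first=(t == 0))
    outputs.append(stacked)
```

The state returned by one call is passed to the next, which is the same contract described for transform_timestep.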
- class FunctionalRewardTransformer(func, observation_spec=())[source]#
Bases: alf.algorithms.data_transformer.RewardTransformer
Transform reward according to a provided function.
Can be used as a reward shaping function passed to an algorithm (e.g. ActorCriticAlgorithm).
- Parameters
func (Callable) – the transformation function to be applied to the reward. It takes reward as input and outputs a transformed reward.
observation_spec (nested TensorSpec) – describing the observation in timestep
- forward(reward)[source]#
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- training: bool#
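As an illustration of the kind of function that could be passed as func, the sketch below builds a per-dimension scaling function. It is a stand-in that operates on plain Python lists; in ALF the reward would be a tensor, and `make_per_dimension_scaler` is a hypothetical name introduced here.

```python
def make_per_dimension_scaler(scales):
    """Build a reward function that scales each reward dimension separately."""

    def func(reward):
        # reward is treated as a flat sequence with one entry per dimension.
        return [r * s for r, s in zip(reward, scales)]

    return func


reward_fn = make_per_dimension_scaler([1.0, 0.5])
shaped = reward_fn([2.0, 30.0])
```

A function of this shape takes a reward and returns a transformed reward, which is the contract described for func.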
- class HindsightExperienceTransformer(observation_spec, her_proportion=0.8, achieved_goal_field='time_step.observation.achieved_goal', desired_goal_field='time_step.observation.desired_goal', reward_fn=<function l2_dist_close_reward_fn>)[source]#
Bases: alf.algorithms.data_transformer.DataTransformer
Randomly transform her_proportion of batch_size trajectories with hindsight relabel.
This transformer assumes that the input observation is a dict of at least two fields: 1) an achieved_goal field, indicating the current state of the environment, and 2) a desired_goal field, indicating the desired state of the environment. The achieved_goal from a future timestep will be used to relabel the desired_goal of the current timestep. The exact field names can be provided via arguments to the class __init__.
To use this class, add it to any existing data transformers, e.g. use this config if ObservationNormalizer is an existing data transformer:
    ReplayBuffer.keep_episodic_info=True
    HindsightExperienceTransformer.her_proportion=0.8
    TrainerConfig.data_transformer_ctor=[@HindsightExperienceTransformer, @ObservationNormalizer]
See unit test for more details on behavior.
- Parameters
her_proportion (float) – proportion of hindsight relabeled experience.
achieved_goal_field (str) – path to the achieved_goal field in the exp nest.
desired_goal_field (str) – path to the desired_goal field in the exp nest.
reward_fn (Callable) – function to recompute reward based on achieved_goal and desired_goal. The default gives reward 0 when the L2 distance is less than 0.05 and -1 otherwise, the same as is done in suite_robotics environments.
- training: bool#
- transform_experience(experience)[source]#
Hindsight relabel experience.
Note: The environments where the samples are from are ordered in the returned batch.
- Parameters
experience (Experience) – experience sampled from replay buffer with batch_info and batch_info.replay_buffer both populated.
- Returns
the relabeled experience, with batch_info potentially changed.
- Return type
- transform_timestep(timestep, state)[source]#
Transform a TimeStep structure.
This is used during unroll or predict.
- Parameters
timestep (TimeStep) – the TimeStep to be transformed
state (nested Tensor) – the state of the transformer running over the timestep sequence. It should be the state returned from the previous call to transform_timestep. For the initial call to transform_timestep, a zero state following the state_spec can be used.
- Returns
transformed TimeStep
state of the transformer
- Return type
tuple
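The relabeling idea behind HindsightExperienceTransformer can be sketched on a single toy trajectory. This is not ALF's implementation: `relabel_trajectory` and `close_reward` are hypothetical helpers, goals are scalars, and the sampling scheme is an assumption (the whole trajectory is relabeled with probability her_proportion, and each step draws its new goal uniformly from later steps).

```python
import random


def relabel_trajectory(achieved_goals, desired_goals, reward_fn,
                       her_proportion=0.8, rng=None):
    """Hindsight-relabel one trajectory given per-step goal lists.

    Assumption: with probability her_proportion every step but the last
    takes its new desired goal from a uniformly chosen later step.
    """
    rng = rng or random.Random()
    num_steps = len(achieved_goals)
    new_goals = list(desired_goals)
    if rng.random() < her_proportion:
        for t in range(num_steps - 1):
            future = rng.randint(t + 1, num_steps - 1)
            new_goals[t] = achieved_goals[future]
    # Rewards are recomputed against the (possibly) relabeled goals.
    rewards = [reward_fn(a, g) for a, g in zip(achieved_goals, new_goals)]
    return new_goals, rewards


def close_reward(achieved, goal, threshold=0.05):
    # Scalar stand-in for a -1/0 "close enough" reward.
    return 0.0 if abs(achieved - goal) < threshold else -1.0


achieved = [0.0, 0.3, 0.6, 0.9]
desired = [5.0, 5.0, 5.0, 5.0]
goals, rewards = relabel_trajectory(
    achieved, desired, close_reward, her_proportion=1.0,
    rng=random.Random(0))
kept_goals, _ = relabel_trajectory(
    achieved, desired, close_reward, her_proportion=0.0,
    rng=random.Random(0))
```

The key point is that the reward has to be recomputed after relabeling, which is why the transformer takes a reward_fn.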
- class IdentityDataTransformer(observation_spec=None)[source]#
Bases: alf.algorithms.data_transformer.SimpleDataTransformer
A data transformer that keeps the data unchanged.
- Parameters
observation_spec (nested TensorSpec) – describing the observation. This should be provided when the transformed_observation_spec property needs to be accessed.
- training: bool#
- class ImageScaleTransformer(observation_spec, min=-1.0, max=1.0, fields=None)[source]#
Bases: alf.algorithms.data_transformer.SimpleDataTransformer
Scale image to min and max (0->min, 255->max).
- Parameters
observation_spec (nested TensorSpec) – describing the observation in timestep
fields (list[str]) – the fields to be applied with the transformation. If None, then observation must be a Tensor with dtype uint8. A field str can be a multi-step path denoted by “A.B.C”.
min (float) – normalize minimum to this value
max (float) – normalize maximum to this value
- training: bool#
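The mapping 0->min, 255->max is a linear rescaling. A minimal sketch on a list of pixel values follows; `scale_image` is a hypothetical helper, and the parameter names avoid shadowing Python's built-in min and max.

```python
def scale_image(pixels, min_value=-1.0, max_value=1.0):
    """Linearly map uint8 pixel values so 0 -> min_value, 255 -> max_value."""
    span = max_value - min_value
    return [min_value + span * (p / 255.0) for p in pixels]


scaled = scale_image([0, 255, 51])
```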
- class ObservationNormalizer(observation_spec, fields=None, clipping=0.0, window_size=10000, update_rate=0.0001, speed=8.0, zero_mean=True, update_mode='replay', mode='adaptive')[source]#
Bases: alf.algorithms.data_transformer.SimpleDataTransformer
Create an observation normalizer with optional value clipping to be used as the data_transformer of an algorithm. It will be called before both rollout_step() and train_step().
The normalizer by default doesn’t automatically update the mean and std. Instead, it will check when self.forward() is called, whether an algorithm is unrolling or training. It only updates the mean and std during unroll. This is the suggested way of using an observation normalizer (i.e., update the stats when encountering new data for the first time). This same strategy has been used by OpenAI’s baselines for training their Robotics environments.
- Parameters
observation_spec (nested TensorSpec) – describing the observation in timestep
fields (None|list[str]) – If None, normalize all fields. Otherwise, only normalize the specified fields. Each string in fields is a multi-step path denoted by “A.B.C”.
clipping (float) – a floating value for clipping the normalized observation into [-clipping, clipping]. Only valid if it’s greater than 0.
window_size (int) – the window size of WindowNormalizer.
update_rate (float) – the update rate of EMNormalizer.
speed (float) – the speed of updating for AdaptiveNormalizer.
zero_mean (bool) – whether to make the normalized value be zero-mean
update_mode (str) – update stats during specified mode in [“replay”, “rollout”, “pretrain”].
mode (str) – a value in [“adaptive”, “window”, “em”] indicates which normalizer to use.
- training: bool#
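The interplay of update_rate, clipping, zero_mean and update_mode can be sketched with a scalar exponential-moving-average normalizer. Everything here is an assumption made for illustration: `RunningNormalizer` and `transform` are hypothetical names, and the update rule is a generic moving average, not necessarily what EMNormalizer does.

```python
import math


class RunningNormalizer:
    """Exponential-moving-average normalizer for scalar observations."""

    def __init__(self, update_rate=1e-4, clipping=0.0, zero_mean=True,
                 eps=1e-8):
        self.update_rate = update_rate
        self.clipping = clipping
        self.zero_mean = zero_mean
        self.eps = eps
        self.mean = 0.0
        self.second_moment = 1.0

    def update(self, x):
        # Move the running moments a small step towards the new sample.
        r = self.update_rate
        self.mean += r * (x - self.mean)
        self.second_moment += r * (x * x - self.second_moment)

    def normalize(self, x):
        var = max(self.second_moment - self.mean ** 2, 0.0)
        std = math.sqrt(var + self.eps)
        y = (x - self.mean) / std if self.zero_mean else x / std
        if self.clipping > 0:  # clipping is only active when positive
            y = max(-self.clipping, min(self.clipping, y))
        return y


def transform(normalizer, x, current_mode, update_mode='replay'):
    # Statistics are refreshed only in the configured mode.
    if current_mode == update_mode:
        normalizer.update(x)
    return normalizer.normalize(x)


frozen = RunningNormalizer(update_rate=0.5)
transform(frozen, 2.0, current_mode='rollout')   # stats stay untouched
updated = RunningNormalizer(update_rate=0.5)
transform(updated, 2.0, current_mode='replay')   # stats move towards 2.0
clipped = RunningNormalizer(clipping=5.0).normalize(1000.0)
```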
- class RewardClipping(observation_spec=(), minmax=(-1, 1))[source]#
Bases: alf.algorithms.data_transformer.RewardTransformer
Clamp immediate rewards to the range \([min, max]\).
Can be used as a reward shaping function passed to an algorithm (e.g. ActorCriticAlgorithm).
Note that if the reward is multi-dimensional, the clipping is applied to all the dimensions. If per-dimension operation is desired, FunctionalRewardTransformer can be used.
- Parameters
observation_spec (nested TensorSpec) – describing the observation in timestep
minmax (tuple[float]) – the (min, max) range to clip rewards to
- forward(reward)[source]#
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- training: bool#
- class RewardNormalizer(observation_spec=(), normalizer=None, update_max_calls=0, clip_value=-1.0, update_mode='replay')[source]#
Bases: alf.algorithms.data_transformer.RewardTransformer
Transform reward to be zero-mean and unit-variance.
- Parameters
observation_spec (nested TensorSpec) – describing the observation in timestep
normalizer (Normalizer) – the normalizer to be used to normalize the reward. If None, will use AdaptiveNormalizer according to env reward spec.
update_max_calls (int) – If >0, then the normalizer’s statistics will only be updated during the first update_max_calls calls of _transform().
clip_value (float) – if > 0, will clip the normalized reward within [-clip_value, clip_value]. Do not clip if clip_value < 0.
update_mode (str) – update stats during either “replay” or “rollout”.
- property clip_value#
- forward(reward)[source]#
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- property normalizer#
- training: bool#
- class RewardScaling(scale, observation_spec=())[source]#
Bases: alf.algorithms.data_transformer.RewardTransformer
Scale immediate rewards by a factor of scale.
Can be used as a reward shaping function passed to an algorithm (e.g. ActorCriticAlgorithm).
Note that if the reward is multi-dimensional, the scaling is applied to all the dimensions. If per-dimension operation is desired, FunctionalRewardTransformer can be used.
- Parameters
scale (float) – scale factor
observation_spec (nested TensorSpec) – describing the observation in timestep
- forward(reward)[source]#
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- training: bool#
- class RewardShifting(bias, observation_spec=())[source]#
Bases: alf.algorithms.data_transformer.RewardTransformer
Shift immediate rewards by a displacement of bias.
Note that if the reward is multi-dimensional, the shifting is applied to all the dimensions. If per-dimension operation is desired, FunctionalRewardTransformer can be used.
- Parameters
bias (float) – displacement amount
observation_spec (nested TensorSpec) – describing the observation in timestep
- forward(reward)[source]#
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- training: bool#
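The three element-wise reward transformers above (clipping, scaling, shifting) apply the same operation to every reward dimension. A plain-Python sketch of that behavior follows; the function names are hypothetical and rewards are lists rather than tensors.

```python
def clip_reward(reward, minmax=(-1, 1)):
    """Clamp every reward dimension to [min, max]."""
    low, high = minmax
    return [max(low, min(high, r)) for r in reward]


def scale_reward(reward, scale):
    """Multiply every reward dimension by the same factor."""
    return [r * scale for r in reward]


def shift_reward(reward, bias):
    """Add the same displacement to every reward dimension."""
    return [r + bias for r in reward]


raw = [-3.0, 0.5, 7.0]
clipped = clip_reward(raw)
scaled = scale_reward(raw, 2.0)
shifted = shift_reward(raw, 1.0)
```

When different dimensions need different treatment, a custom function (as accepted by FunctionalRewardTransformer) is the documented alternative.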
- class RewardTransformer(observation_spec)[source]#
Bases: alf.algorithms.data_transformer.SimpleDataTransformer
Base class for transforming reward.
- Parameters
observation_spec (nested TensorSpec) – describing the observation in timestep
- training: bool#
- class SequentialDataTransformer(data_transformer_ctors, observation_spec)[source]#
Bases: alf.algorithms.data_transformer.DataTransformer
A data transformer consisting of a sequence of data transformers.
- Parameters
data_transformer_ctors (list[Callable]) – Functions for creating data transformers. Each of them will be called as data_transformer_ctors[i](observation_spec) to create a data transformer.
observation_spec (nested TensorSpec) – describing the raw observation in timestep. It is the observation passed to the first data transformer.
- property stack_size#
The number of frames being stacked as one observation.
- training: bool#
- transform_experience(experience)[source]#
Transform an Experience structure.
This is used on the experience data retrieved from replay buffer.
- Parameters
experience (Experience) – the experience retrieved from replay buffer. Note that experience.batch_info, experience.replay_buffer need to be set.
- Returns
transformed experience
- Return type
- transform_timestep(timestep, state)[source]#
Transform a TimeStep structure.
This is used during unroll or predict.
- Parameters
timestep (TimeStep) – the TimeStep to be transformed
state (nested Tensor) – the state of the transformer running over the timestep sequence. It should be the state returned from the previous call to transform_timestep. For the initial call to transform_timestep, a zero state following the state_spec can be used.
- Returns
transformed TimeStep
state of the transformer
- Return type
tuple
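Chaining transformers while threading one state per transformer can be sketched as below. The classes are toy stand-ins (a "timestep" is just a tuple of numbers), and `Chain`, `RunningCount`, `Doubler` and `initial_state` are hypothetical names. Passing each transformer's transformed spec to the next constructor is an assumption about how the chain is built.

```python
class RunningCount:
    """Toy stateful transformer: appends how many steps it has seen."""

    def __init__(self, observation_spec):
        self.transformed_observation_spec = observation_spec + ('count',)

    def initial_state(self):
        return 0

    def transform_timestep(self, timestep, state):
        state = state + 1
        return timestep + (state,), state


class Doubler:
    """Toy stateless transformer: doubles every entry."""

    def __init__(self, observation_spec):
        self.transformed_observation_spec = observation_spec

    def initial_state(self):
        return ()

    def transform_timestep(self, timestep, state):
        return tuple(2 * x for x in timestep), state


class Chain:
    """Apply transformers in order, keeping one state per transformer."""

    def __init__(self, ctors, observation_spec):
        self.transformers = []
        for ctor in ctors:
            transformer = ctor(observation_spec)
            # Assumption: the next transformer sees the transformed spec.
            observation_spec = transformer.transformed_observation_spec
            self.transformers.append(transformer)
        self.transformed_observation_spec = observation_spec

    def initial_state(self):
        return [t.initial_state() for t in self.transformers]

    def transform_timestep(self, timestep, state):
        new_state = []
        for transformer, s in zip(self.transformers, state):
            timestep, s = transformer.transform_timestep(timestep, s)
            new_state.append(s)
        return timestep, new_state


chain = Chain([RunningCount, Doubler], ('obs',))
state = chain.initial_state()
out1, state = chain.transform_timestep((1.5,), state)
out2, state = chain.transform_timestep((2.0,), state)
```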
- class SimpleDataTransformer(transformed_observation_spec)[source]#
Bases: alf.algorithms.data_transformer.DataTransformer
Base class for simple data transformers.
For simple data transformers, there is no state for transform_timestep and transform_experience. And transform_experience uses the same function _transform to do the transformation of the time_step field of the experience.
- Parameters
transformed_observation_spec (nested TensorSpec) – describing the transformed observation
state_spec (nested TensorSpec) – describing the state of the transformer when it is used to transform TimeStep
- training: bool#
- transform_experience(experience)[source]#
Transform Experience.
For Experience, the shapes are [B, T, …]
- Parameters
experience (Experience) – data to be transformed
- Returns
transformed Experience
- class UntransformedTimeStep(observation_spec=None)[source]#
Bases: alf.algorithms.data_transformer.SimpleDataTransformer
Put the time step itself to its field “untransformed”. Note that this data transformer must be applied first, before any other data transformer.
- Parameters
observation_spec (nested TensorSpec) – describing the observation. This should be provided when the transformed_observation_spec property needs to be accessed.
- training: bool#
- create_data_transformer(data_transformer_ctor, observation_spec, device=None)[source]#
Create a data transformer.
- Parameters
data_transformer_ctor (Callable|list[Callable]) – Function(s) for creating data transformer(s). Each of them will be called as data_transformer_ctor(observation_spec) to create a data transformer. Available transformers are in algorithms.data_transformer.
observation_spec (nested TensorSpec) – the spec of the raw observation.
device (Optional[str]) – If not None, the data transformer(s) will be created on the specified device.
- Returns
DataTransformer
- l2_dist_close_reward_fn(achieved_goal, goal, threshold=0.05)[source]#
Give a -1/0 reward based on how close the achieved state is to the goal state.
- Parameters
achieved_goal (Tensor) – achieved state, of shape [batch_size, batch_length, ...]
goal (Tensor) – goal state, of shape [batch_size, batch_length, ...]
threshold (float) – L2 distance threshold for the reward.
- Returns
Tensor for -1/0 reward of shape [batch_size, batch_length].
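The thresholded reward can be sketched with nested lists standing in for tensors of shape [batch_size, batch_length, goal_dim]. `l2_close_reward` is a hypothetical re-implementation for illustration. The strict "less than" comparison follows the description of the default reward_fn given for HindsightExperienceTransformer.

```python
import math


def l2_close_reward(achieved_goal, goal, threshold=0.05):
    """Return 0.0 where the L2 distance is below threshold, else -1.0.

    Inputs are nested lists shaped [batch_size, batch_length, goal_dim];
    the result is shaped [batch_size, batch_length].
    """
    rewards = []
    for achieved_row, goal_row in zip(achieved_goal, goal):
        row = []
        for a, g in zip(achieved_row, goal_row):
            dist = math.sqrt(sum((ai - gi) ** 2 for ai, gi in zip(a, g)))
            row.append(0.0 if dist < threshold else -1.0)
        rewards.append(row)
    return rewards


achieved = [[[0.0, 0.0], [0.0, 0.0]]]    # batch_size=1, batch_length=2
target = [[[0.03, 0.03], [0.04, 0.04]]]
reward = l2_close_reward(achieved, target)
```

The first step is about 0.042 away from its goal and the second about 0.057, so only the first falls under the default threshold.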
alf.algorithms.ddpg_algorithm#
Deep Deterministic Policy Gradient (DDPG).
- class DdpgActorState(actor, critics)#
Bases: tuple
Create new instance of DdpgActorState(actor, critics)
- actor#
Alias for field number 0
- critics#
Alias for field number 1
- class DdpgAlgorithm(observation_spec, action_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), actor_network_ctor=<class 'alf.networks.actor_networks.ActorNetwork'>, critic_network_ctor=<class 'alf.networks.critic_networks.CriticNetwork'>, reward_weights=None, epsilon_greedy=None, calculate_priority=False, env=None, config=None, ou_stddev=0.2, ou_damping=0.15, critic_loss_ctor=None, num_critic_replicas=1, target_update_tau=0.05, target_update_period=1, rollout_random_action=0.0, dqda_clipping=None, action_l2=0, actor_optimizer=None, critic_optimizer=None, checkpoint=None, debug_summaries=False, name='DdpgAlgorithm')[source]#
Bases: alf.algorithms.off_policy_algorithm.OffPolicyAlgorithm
Deep Deterministic Policy Gradient (DDPG).
Reference: Lillicrap et al. “Continuous control with deep reinforcement learning” https://arxiv.org/abs/1509.02971
- Parameters
observation_spec (nested TensorSpec) – representing the observations.
action_spec (nested BoundedTensorSpec) – representing the actions.
reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).
actor_network_ctor (Callable) – Function to construct the actor network. actor_network_ctor needs to accept input_tensor_spec and action_spec as its arguments and return an actor network. The constructed network will be called with forward(observation, state).
critic_network_ctor (Callable) – Function to construct the critic network. critic_network_ctor needs to accept input_tensor_spec which is a tuple of (observation_spec, action_spec). The constructed network will be called with forward((observation, action), state).
reward_weights (list[float]) – this is only used when the reward is multidimensional. In that case, the weighted sum of the q values is used for training the actor.
epsilon_greedy (float) – a floating value in [0,1], representing the chance of action sampling instead of taking argmax. This can help prevent a dead loop in some deterministic environment like Breakout. Only used for evaluation. If None, its value is taken from config.epsilon_greedy and then alf.get_config_value(TrainerConfig.epsilon_greedy).
calculate_priority (bool) – whether to calculate priority. This is only useful if priority replay is enabled.
num_critic_replicas (int) – number of critics to be used. Default is 1.
env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. env only needs to be provided to the root algorithm.
config (TrainerConfig) – config for training. config only needs to be provided to the algorithm which performs train_iter() by itself.
ou_stddev (float) – Standard deviation for the Ornstein-Uhlenbeck (OU) noise added in the default collect policy.
ou_damping (float) – Damping factor for the OU noise added in the default collect policy.
critic_loss_ctor (None|OneStepTDLoss|MultiStepLoss) – a critic loss constructor. If None, a default OneStepTDLoss will be used.
target_update_tau (float) – Factor for soft update of the target networks.
target_update_period (int) – Period for soft update of the target networks.
rollout_random_action (float) – the probability of taking a uniform random action during a rollout_step(). 0 means always directly taking actions with OU noise added, and 1 means always sampling uniformly random actions. A bigger value results in more exploration during rollout.
dqda_clipping (float) – when computing the actor loss, clips the gradient dqda element-wise between [-dqda_clipping, dqda_clipping]. Does not perform clipping if dqda_clipping == 0.
action_l2 (float) – weight of squared action l2-norm on actor loss.
actor_optimizer (torch.optim.optimizer) – The optimizer for actor.
critic_optimizer (torch.optim.optimizer) – The optimizer for critic.
checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.
debug_summaries (bool) – True if debug summaries should be created.
name (str) – The name of this algorithm.
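How ou_stddev, ou_damping and rollout_random_action could interact during rollout is sketched below for a scalar action. This is not ALF's code: `OUNoise` and `rollout_action` are hypothetical, the update rule x <- (1 - damping) * x + N(0, stddev) is one common discrete form of an OU process and is assumed here, and clamping the noisy action to the action bounds is also an assumption.

```python
import random


class OUNoise:
    """Temporally correlated noise (assumed discrete OU update)."""

    def __init__(self, stddev=0.2, damping=0.15, rng=None):
        self.stddev = stddev
        self.damping = damping
        self.rng = rng or random.Random()
        self.x = 0.0

    def sample(self):
        # Decay the previous value, then add fresh Gaussian noise.
        self.x = ((1.0 - self.damping) * self.x
                  + self.rng.gauss(0.0, self.stddev))
        return self.x


def rollout_action(policy_action, noise, low, high,
                   rollout_random_action=0.0, rng=None):
    """Pick a uniform random action with the given probability,
    otherwise add OU noise to the policy action and clamp it."""
    rng = rng or random
    if rng.random() < rollout_random_action:
        return rng.uniform(low, high)
    noisy = policy_action + noise.sample()
    return max(low, min(high, noisy))


quiet = OUNoise(stddev=0.0)
deterministic = rollout_action(0.3, quiet, -1.0, 1.0)
uniform = rollout_action(0.3, quiet, -1.0, 1.0, rollout_random_action=1.0)

decaying = OUNoise(stddev=0.0, damping=0.15)
decaying.x = 1.0
decayed = decaying.sample()
```

With zero standard deviation the noise only decays, which makes the role of the damping factor easy to see in isolation.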
- after_update(root_inputs, info)[source]#
Do things after completing one gradient update (i.e. update_with_gradient()). This function can be used for post-processing following one minibatch update, such as copying a training model to a target model in SAC, DQN, etc.
- Parameters
root_inputs (nest) – temporally batched inputs for the rollout_step() of the root algorithm collected during unroll().
info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() for on-policy training or train_step() for off-policy training.
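For DDPG, a typical use of this hook is the soft target update controlled by target_update_tau and target_update_period. A sketch on flat lists of parameters follows; `soft_update` and `maybe_update_target` are hypothetical helpers, and the interpolation target <- (1 - tau) * target + tau * source is the standard soft-update rule, assumed rather than taken from the text above.

```python
def soft_update(target, source, tau):
    """Move each target parameter a fraction tau towards the source."""
    return [(1.0 - tau) * t + tau * s for t, s in zip(target, source)]


def maybe_update_target(step, target, source, tau=0.05, period=1):
    """Apply the soft update only every `period` gradient updates."""
    if step % period == 0:
        return soft_update(target, source, tau)
    return target


target = [0.0, 1.0]
source = [1.0, 1.0]
half = soft_update(target, source, tau=0.5)
skipped = maybe_update_target(1, target, source, tau=0.5, period=2)
applied = maybe_update_target(2, target, source, tau=0.5, period=2)
```

A tau of 1 copies the source outright, while a small tau makes the target track the trained network slowly.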
- calc_loss(info)[source]#
Calculate the loss at each step for each sample.
- Parameters
info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).
- Returns
loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).
- Return type
- predict_step(inputs, state)[source]#
Predict for one step of observation.
This is only used for evaluation, so it only needs to perform the computations for generating the action distribution.
- Parameters
inputs (TimeStep) – Current observation and other inputs for computing action.
state (nested Tensor) – should be consistent with predict_state_spec
- Returns
output (nested Tensor): should be consistent with action_spec.
state (nested Tensor): should be consistent with predict_state_spec.
- Return type
- rollout_step(time_step, state=None)[source]#
Rollout for one step of inputs.
It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for calc_loss(). For off-policy training, it needs to generate necessary information for train_step().
- Parameters
time_step (nested Tensor) – inputs for prediction.
state (nested Tensor) – network state (for RNN).
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match rollout_state_spec.
info (nested Tensor): For on-policy training it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored into the replay buffer and retrieved for train_step() as rollout_info.
- Return type
- train_step(inputs, state, rollout_info)[source]#
Perform one step of training computation.
It is called to calculate output for every time step for a batch of experience from replay buffer. It also needs to generate necessary information for calc_loss().
- Parameters
inputs (nested Tensor) – inputs for train.
state (nested Tensor) – consistent with train_state_spec.
rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match train_state_spec.
info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.
- Return type
- training: bool#
- class DdpgCriticInfo(q_values, target_q_values)#
Bases: tuple
Create new instance of DdpgCriticInfo(q_values, target_q_values)
- q_values#
Alias for field number 0
- target_q_values#
Alias for field number 1
- class DdpgCriticState(critics, target_actor, target_critics)#
Bases: tuple
Create new instance of DdpgCriticState(critics, target_actor, target_critics)
- critics#
Alias for field number 0
- target_actor#
Alias for field number 1
- target_critics#
Alias for field number 2
- class DdpgInfo(reward, step_type, discount, action, action_distribution, actor_loss, critic, discounted_return)#
Bases: tuple
Create new instance of DdpgInfo(reward, step_type, discount, action, action_distribution, actor_loss, critic, discounted_return)
- action#
Alias for field number 3
- action_distribution#
Alias for field number 4
- actor_loss#
Alias for field number 5
- critic#
Alias for field number 6
- discount#
Alias for field number 2
- discounted_return#
Alias for field number 7
- reward#
Alias for field number 0
- step_type#
Alias for field number 1
alf.algorithms.decoding_algorithm#
Decoding algorithm.
- class DecodingAlgorithm(decoder, loss=MSELoss(), loss_weight=1.0, name='DecodingAlgorithm')[source]#
Bases: alf.algorithms.algorithm.Algorithm
Generic decoding algorithm.
- Parameters
decoder (Network) – network for decoding target from input.
loss (Callable) – loss function with signature loss(y_pred, y_true). Note that it should not reduce to a scalar. It should at least keep the batch dimension in the returned loss.
loss_weight (float) – weight for the loss.
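The requirement that the loss keeps the batch dimension can be illustrated with a per-sample mean squared error. This sketch uses nested lists shaped [batch, features]; `per_sample_mse` is a hypothetical function, not part of ALF.

```python
def per_sample_mse(y_pred, y_true):
    """Mean squared error per sample; the batch dimension is preserved."""
    return [
        sum((p - t) ** 2 for p, t in zip(pred_row, true_row)) / len(pred_row)
        for pred_row, true_row in zip(y_pred, y_true)
    ]


losses = per_sample_mse([[1.0, 2.0], [0.0, 0.0]],
                        [[1.0, 4.0], [3.0, 4.0]])
loss_weight = 0.5
weighted = [loss_weight * value for value in losses]
```

The result has one entry per sample instead of a single scalar, which is the shape contract described for loss.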
- train_step(inputs, state=(), rollout_info=None)[source]#
Train one step.
- Parameters
inputs (tuple) – tuple of (input, target)
state (nested Tensor) – network state for decoder
- Returns
output: decoding result
state: rnn state from decoder
info: loss of decoding
- Return type
- training: bool#
alf.algorithms.diayn_algorithm#
- class DIAYNAlgorithm(skill_spec, encoding_net, reward_adapt_speed=8.0, observation_spec=None, hidden_size=(), hidden_activation=<built-in method relu_ of type object>, name='DIAYNAlgorithm')[source]#
Bases:
alf.algorithms.algorithm.AlgorithmDiversity is All You Need Module
This module learns a set of skill-conditional policies in an unsupervised way. See Eysenbach et al “Diversity is All You Need: Learning Diverse Skills without a Reward Function” for more details.
Create a DIAYNAlgorithm.
- Parameters
skill_spec (TensorSpec) – supports both discrete and continuous skills. In the discrete case, the algorithm will predict 1-of-K skills using the cross entropy loss; in the continuous case, the algorithm will predict the skill vector itself using the mean square error loss.
encoding_net (EncodingNetwork) – network for encoding observation into a latent feature.
reward_adapt_speed (float) – how fast to adapt the reward normalizer. rouphly speaking, the statistics for the normalization is calculated mostly based on the most recent T/speed samples, where T is the total number of samples.
observation_spec (TensorSpec) – If not None, this spec is to be used by a observation normalizer to normalize incoming observations. In some cases, the normalized observation can be easier for training the discriminator.
hidden_size (tuple[int]) – a tuple of hidden layer sizes used by the discriminator.
hidden_activation (torch.nn.functional) – activation for the hidden layers.
name (str) – module’s name
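To make the effect of reward_adapt_speed concrete, here is a minimal sketch of a running mean whose update rate is speed / t, so that roughly the most recent T/speed samples dominate. The class name AdaptiveMean and the exact rate schedule are assumptions made for illustration; they are not taken from ALF's normalizer.

```python
class AdaptiveMean:
    """Running mean with a decaying update rate (hypothetical helper)."""

    def __init__(self, speed=8.0):
        self.speed = speed
        self.count = 0
        self.mean = 0.0

    def update(self, x):
        self.count += 1
        # Assumed schedule: rate = speed / t, capped at 1.
        # speed=1 recovers the plain average over all samples.
        rate = min(1.0, self.speed / self.count)
        self.mean += rate * (x - self.mean)
        return self.mean


slow = AdaptiveMean(speed=1.0)
fast = AdaptiveMean(speed=8.0)
for reward in [0.0] * 100 + [10.0] * 20:
    slow.update(reward)
    fast.update(reward)
# The faster averager tracks the recent jump in rewards more closely.
print(slow.mean, fast.mean)
```

With speed=1 the statistic weights all T samples equally; larger values shorten the effective window.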
- calc_loss(info)[source]#
Calculate the loss at each step for each sample.
- Parameters
info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).
- Returns
- loss at each time step for each sample in the
batch. The shapes of the tensors in loss info should be \((T, B)\).
- Return type
- rollout_step(inputs, state)[source]#
Rollout for one step of inputs.
It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for calc_loss(). For off-policy training, it needs to generate necessary information for train_step().
- Parameters
inputs (nested Tensor) – inputs for prediction.
state (nested Tensor) – network state (for RNN).
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match rollout_state_spec.
info (nested Tensor): For on-policy training it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored into the replay buffer and retrieved for train_step() as rollout_info.
- Return type
- train_step(inputs, state, rollout_info=None)[source]#
Perform one step of training computation.
It is called to calculate output for every time step for a batch of experience from replay buffer. It also needs to generate necessary information for calc_loss().
- Parameters
inputs (nested Tensor) – inputs for train.
state (nested Tensor) – consistent with train_state_spec.
rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match train_state_spec.
info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.
- Return type
- training: bool#
alf.algorithms.dqn_algorithm#
DQN Algorithm.
- class DqnAlgorithm(observation_spec, action_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), q_network_cls=<class 'alf.networks.q_networks.QNetwork'>, q_optimizer=None, rollout_epsilon_greedy=0.1, target_net_target_action=True, num_critic_replicas=2, env=None, config=None, critic_loss_ctor=None, checkpoint=None, debug_summaries=False, name='DqnAlgorithm')[source]#
Bases:
alf.algorithms.sac_algorithm.SacAlgorithm
DQN/DDQN algorithm:
Mnih et al "Playing Atari with Deep Reinforcement Learning", arXiv:1312.5602
Hasselt et al "Deep Reinforcement Learning with Double Q-learning", arXiv:1509.06461
The difference with DQN is that a minimum is taken from the two critics, similar to TD3, instead of choosing the maximum action using the Q network and evaluating the action value using the target Q network.
The implementation is based on the SAC algorithm.
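The minimum over two critics can be illustrated with a small, framework-free sketch. It takes the per-action minimum across critics and then the greedy value; the function name and the exact order of the min and max operations are assumptions for illustration and may differ from what DqnAlgorithm does internally.

```python
def min_critic_target(reward, discount, next_q_per_critic):
    """Compute a one-step TD target from several critics (hypothetical helper).

    next_q_per_critic: one list of per-action Q-values for each critic,
    all evaluated at the next state.
    """
    # Pessimistic estimate: elementwise minimum over critics for each action.
    min_q = [min(qs) for qs in zip(*next_q_per_critic)]
    # Greedy value under the pessimistic estimate.
    return reward + discount * max(min_q)


target = min_critic_target(
    reward=1.0,
    discount=0.9,
    next_q_per_critic=[[1.0, 3.0], [2.0, 2.5]],
)
print(target)
```

Taking the minimum counteracts the overestimation that a single max over noisy Q-values tends to produce.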
- Parameters
observation_spec (nested TensorSpec) – representing the observations.
action_spec (BoundedTensorSpec) – Only one discrete action allowed.
reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).
q_network_cls – is used to construct QNetwork for estimating Q(s,a) given that the action is discrete. Its output spec must be consistent with the discrete action in action_spec.
q_optimizer (Optional[Optimizer]) – A custom optimizer for the q network. Uses the enclosing algorithm’s optimizer if None.
rollout_epsilon_greedy (Union[float, Scheduler]) – epsilon greedy policy for rollout. Together with the following two parameters, the SAC algorithm can be converted to a DQN or DDQN algorithm when e.g. rollout_epsilon_greedy=0.3, max_target_action=True, and use_entropy_reward=False.
target_net_target_action (bool) – when True, uses the target critic network to get the target action (similar to DDPG). When False, uses the critic network to get the target action (similar to DDQN/SAC).
num_critic_replicas (int) – number of critics to be used. Default is 2.
env (Optional[AlfEnvironment]) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. env only needs to be provided to the root algorithm.
config (Optional[TrainerConfig]) – config for training. It only needs to be provided to the algorithm which performs train_iter() by itself.
critic_loss_ctor (Optional[Callable[…, TDLoss]]) – a critic loss constructor. If None, a default OneStepTDLoss will be used.
checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.
debug_summaries (bool) – True if debug summaries should be created.
name (str) – The name of this algorithm.
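As a rough illustration of what an epsilon-greedy rollout policy does with discrete Q-values, the sketch below picks a uniformly random action with probability epsilon and the greedy action otherwise. The helper name and the uniform exploration distribution are assumptions; ALF's own sampling logic is not reproduced here.

```python
import random


def epsilon_greedy_action(q_values, epsilon, rng=random):
    """Pick an action index from a list of Q-values (hypothetical helper)."""
    if rng.random() < epsilon:
        # Explore: uniform random action (assumed exploration distribution).
        return rng.randrange(len(q_values))
    # Exploit: index of the largest Q-value.
    return max(range(len(q_values)), key=q_values.__getitem__)


rng = random.Random(0)
actions = [epsilon_greedy_action([0.1, 0.9, 0.3], 0.1, rng) for _ in range(10)]
print(actions)
```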
- calc_loss(info)[source]#
Calculate the loss at each step for each sample.
- Parameters
info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).
- Returns
- loss at each time step for each sample in the
batch. The shapes of the tensors in loss info should be \((T, B)\).
- Return type
- rollout_step(inputs, state)[source]#
rollout_step() basically predicts actions like what is done by predict_step(). Additionally, if states are to be stored in a replay buffer, then this function also calls _critic_networks and _target_critic_networks to maintain their states.
- training: bool#
alf.algorithms.dynamic_action_repeat_agent#
- class ActionRepeatState(rl, action, steps, k, rl_discount, rl_reward, sample_rewards, repr)#
Bases:
tuple
Create new instance of ActionRepeatState(rl, action, steps, k, rl_discount, rl_reward, sample_rewards, repr)
- action#
Alias for field number 1
- k#
Alias for field number 3
- repr#
Alias for field number 7
- rl#
Alias for field number 0
- rl_discount#
Alias for field number 4
- rl_reward#
Alias for field number 5
- sample_rewards#
Alias for field number 6
- steps#
Alias for field number 2
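The field aliases listed above follow the usual named-tuple convention, where each attribute name maps to a fixed position. The sketch below rebuilds an equivalent tuple with collections.namedtuple purely for illustration; the field values are placeholders, not values ALF would produce.

```python
from collections import namedtuple

# Stand-in with the same field order as documented above.
ActionRepeatState = namedtuple(
    "ActionRepeatState",
    ["rl", "action", "steps", "k", "rl_discount", "rl_reward",
     "sample_rewards", "repr"],
)

# Placeholder values chosen only to show positional access.
state = ActionRepeatState(
    rl=(), action=0.5, steps=2, k=3, rl_discount=0.99,
    rl_reward=1.0, sample_rewards=(), repr=(),
)

# Attribute access and index access refer to the same slot.
print(state.action == state[1], state.k == state[3])
```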
- class DynamicActionRepeatAgent(observation_spec, action_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), env=None, config=None, K=5, rl_algorithm_cls=<class 'alf.algorithms.sac_algorithm.SacAlgorithm'>, representation_learner_cls=None, reward_normalizer_ctor=None, gamma=0.99, optimizer=None, debug_summaries=False, name='DynamicActionRepeatAgent')[source]#
Bases:
alf.algorithms.off_policy_algorithm.OffPolicyAlgorithm
Create an agent which learns a variable action repetition duration. At each decision step, the agent outputs both the action to repeat and the number of steps to repeat. These two quantities together constitute the action of the agent. We use SAC with mixed action type for training.
The core idea is similar to Learning to Repeat: Fine Grained Action Repetition for Deep Reinforcement Learning.
- Parameters
observation_spec (nested TensorSpec) – representing the observations.
action_spec (nested BoundedTensorSpec) – representing the actions; can only be continuous actions for now.
reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).
env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. env only needs to be provided to the root algorithm.
config (TrainerConfig) – config for training. config only needs to be provided to the algorithm which performs a training iteration by itself.
K (int) – the maximum number of times an action can be repeated.
rl_algorithm_cls (Callable) – creates an RL algorithm to be augmented by this dynamic action repeating ability.
representation_learner_cls (type) – The algorithm class for learning the representation. If provided, the constructed learner will calculate the representation from the original observation as the observation for downstream algorithms such as rl_algorithm. We assume that the representation is trained by rl_algorithm.
reward_normalizer_ctor (Callable) – if not None, it must be RewardNormalizer and environment rewards will be normalized for training.
gamma (float) – the reward discount to be applied when accumulating k steps’ rewards for a repeated action. Note that this value should be equal to the gamma used by the critic loss for target values.
optimizer (None|Optimizer) – The default optimizer for training. See comments above for detail.
debug_summaries (bool) – True if debug summaries should be created.
name (str) – name of this agent.
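The role of gamma when an action is repeated for k steps can be sketched as follows: the k rewards are folded into one discounted sum, and the discount seen by the underlying RL algorithm becomes gamma to the power k. The function name is hypothetical and the sketch ignores episode termination in the middle of a repeat.

```python
def accumulate_repeated_reward(rewards, gamma=0.99):
    """Fold the rewards of one repeated action into a single transition.

    Returns (discounted reward sum, effective discount gamma ** k).
    Hypothetical helper; early termination is not handled.
    """
    total = 0.0
    discount = 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total, discount


total, discount = accumulate_repeated_reward([1.0, 1.0, 1.0], gamma=0.5)
print(total, discount)
```

This is why gamma should match the discount used by the critic loss: otherwise the accumulated reward and the bootstrapped target would be discounted inconsistently.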
- observe_for_replay(exp)[source]#
Record an experience in a replay buffer.
- Parameters
exp (nested Tensor) – The shape is \([B, \ldots]\), where \(B\) is the batch size of the batched environment.
- predict_step(time_step, state)[source]#
Predict for one step of observation.
This is only used for evaluation, so it only needs to perform the computations for generating the action distribution.
- Parameters
time_step (TimeStep) – Current observation and other inputs for computing action.
state (nested Tensor) – should be consistent with predict_state_spec
- Returns
output (nested Tensor): should be consistent with action_spec.
state (nested Tensor): should be consistent with predict_state_spec.
- Return type
- preprocess_experience(root_inputs, rollout_info, batch_info)[source]#
Normalize training rewards if a reward normalizer is provided. Shape of rl_exp is [B, T, ...]. The statistics of the normalizer are updated by random sample rewards.
- rollout_step(time_step, state)[source]#
Rollout for one step of inputs.
It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for calc_loss(). For off-policy training, it needs to generate necessary information for train_step().
- Parameters
inputs (nested Tensor) – inputs for prediction.
state (nested Tensor) – network state (for RNN).
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match rollout_state_spec.
info (nested Tensor): For on-policy training it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored into the replay buffer and retrieved for train_step() as rollout_info.
- Return type
- summarize_train(experience, train_info, loss_info, params)[source]#
Overwrite the function because the training action spec is different from the rollout action spec.
- train_step(inputs, state, rollout_info)[source]#
Train the underlying RL algorithm self._rl. Because in self.rollout_step() the replay buffer only stores info related to self._rl, here we can directly call self._rl.train_step().
- Parameters
rl_exp (Experience) – experiences that have been transformed to be learned by self._rl.
state (ActionRepeatState) –
- training: bool#
alf.algorithms.dynamics_learning_algorithm#
- class DeterministicDynamicsAlgorithm(action_spec, feature_spec, hidden_size=256, num_replicas=1, dynamics_network_ctor=None, name='DeterministicDynamicsAlgorithm')[source]#
Bases:
alf.algorithms.dynamics_learning_algorithm.DynamicsLearningAlgorithm
Deterministic Dynamics Learning Module
This module tries to learn the dynamics of the environment with a deterministic model.
Create a DeterministicDynamicsAlgorithm.
- Parameters
hidden_size (int|tuple) – size of hidden layer(s)
num_replicas (int) – number of network replicas to be used in the ensemble for dynamics learning
dynamics_network_ctor (Optional[Callable[[Any, Any], DynamicsNetwork]]) – Used to construct a network for predicting the change of the next feature based on the previous feature and action. It should accept input with spec of the format [feature_spec, encoded_action_spec] and output a tensor of the shape feature_spec. For discrete action case, encoded_action is a one-hot representation of the action. For continuous action, encoded action is the original action.
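The two ideas in the dynamics_network_ctor description, one-hot action encoding and predicting the change of the feature rather than the feature itself, can be sketched without any framework. All names below (one_hot, predict_next_feature, toy_dynamics) are hypothetical, and adding the predicted change to the previous feature is an assumed reading of "predicting the change of the next feature".

```python
def one_hot(action, num_actions):
    """Encode a discrete action index as a one-hot list."""
    return [1.0 if i == action else 0.0 for i in range(num_actions)]


def predict_next_feature(feature, encoded_action, dynamics_fn):
    """Apply a change-predicting dynamics function to the previous feature."""
    delta = dynamics_fn(feature, encoded_action)
    # Assumption: the network output is added to the previous feature.
    return [f + d for f, d in zip(feature, delta)]


def toy_dynamics(feature, encoded_action):
    """Stand-in for a learned network: shift every feature by the action index."""
    shift = float(encoded_action.index(1.0))
    return [shift] * len(feature)


next_feature = predict_next_feature([1.0, 2.0], one_hot(2, 4), toy_dynamics)
print(next_feature)
```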
- predict_step(time_step, state)[source]#
- Predict the next observation given the current time_step.
The next step is predicted using the prev_action from time_step and the feature from state.
- Parameters
time_step (TimeStep) – time step structure. The prev_action from time_step will be used for predicting feature of the next step. It should be a Tensor of the shape [B, …], or [B, n, …] when n > 1, where n denotes the number of dynamics network replicas. When the input tensor has the shape of [B, …] and n > 1, it will be first expanded to [B, n, …] to match the number of dynamics network replicas.
state (DynamicsState) – state for dynamics learning with the following fields:
- feature (Tensor): features of the previous observation of the shape [B, …], or [B, n, …] when n > 1. When state.feature has the shape of [B, …] and n > 1, it will be first expanded to [B, n, …] to match the number of dynamics network replicas. It is used for predicting the feature of the next step together with time_step.prev_action.
- network: the input state of the dynamics network
- Returns
- outputs (Tensor): predicted feature of the next step, of the
shape [B, …], or [B, n, …] when n > 1.
- state (DynamicsState): with the following fields
- feature (Tensor): [B, …] (or [B, n, …] when n > 1) shape tensor representing the predicted feature of the next step
network: the updated state of the dynamics network
info: empty tuple ()
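The [B, …] to [B, n, …] expansion mentioned for both prev_action and state.feature amounts to repeating each sample once per replica. The sketch below shows this with nested lists; the helper name is hypothetical and a tensor library would do the same with a broadcasted expand.

```python
def expand_to_replicas(batch, n):
    """Turn a [B, ...] nested list into [B, n, ...] by repeating each sample.

    Hypothetical helper; with n <= 1 the batch is returned unchanged.
    """
    if n <= 1:
        return batch
    return [[list(sample) for _ in range(n)] for sample in batch]


features = [[1.0, 2.0], [3.0, 4.0]]      # shape [B=2, 2]
expanded = expand_to_replicas(features, 3)  # shape [B=2, n=3, 2]
print(len(expanded), len(expanded[0]), len(expanded[0][0]))
```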
- Return type
- train_step(time_step, state)[source]#
- Parameters
time_step (TimeStep) – time step structure. The prev_action from time_step will be used for predicting feature of the next step. It should be a Tensor of the shape [B, …] or [B, n, …] when n > 1, where n denotes the number of dynamics network replicas. When the input tensor has the shape of [B, …] and n > 1, it will be first expanded to [B, n, …] to match the number of dynamics network replicas.
state (DynamicsState) – state for dynamics learning with the following fields:
- feature (Tensor): features of the previous observation of the shape [B, …] or [B, n, …] when n > 1. When state.feature has the shape of [B, …] and n > 1, it will be first expanded to [B, n, …] to match the number of dynamics network replicas. It is used for predicting the feature of the next step together with time_step.prev_action.
- network: the input state of the dynamics network
- Returns
outputs: empty tuple ()
state (DynamicsState): with the following fields
- feature (Tensor): [B, …] (or [B, n, …] when n > 1)
shape tensor representing the predicted feature of the next step
network: the updated state of the dynamics network
- info (DynamicsInfo): with the following fields being updated:
loss (LossInfo):
- Return type
- training: bool#
- update_state(time_step, state)[source]#
- Update the state based on TimeStep data. This function is
mainly used during rollout together with a planner. This function is necessary as we need to update the feature in DynamicsState with those of the current observation, after each step of rollout.
- Parameters
time_step (TimeStep) – input data for dynamics learning
state (DynamicsState) – state for DeterministicDynamicsAlgorithm (previous observation)
- Returns
updated dynamics state
- Return type
state (DynamicsState)
- class DynamicsInfo(loss, dist)#
Bases:
tuple
Create new instance of DynamicsInfo(loss, dist)
- dist#
Alias for field number 1
- loss#
Alias for field number 0
- class DynamicsLearningAlgorithm(train_state_spec, action_spec, feature_spec, hidden_size=256, num_replicas=1, dynamics_network=None, checkpoint=None, name='DynamicsLearningAlgorithm')[source]#
Bases:
alf.algorithms.algorithm.Algorithm
Base Dynamics Learning Module
This module learns the dynamics of the environment with a deterministic model.
Create a DynamicsLearningAlgorithm.
- Parameters
hidden_size (int|tuple) – size of hidden layer(s)
dynamics_network (Network) – network for predicting the change of the next feature based on the previous feature and action. It should accept input with spec of the format [feature_spec, encoded_action_spec] and output a tensor of the shape feature_spec. For discrete action case, encoded_action is a one-hot representation of the action. For continuous action, encoded action is the original action.
checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.
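The “prefix@path” checkpoint format can be split at the first "@". The sketch below is only an illustration of the string format; the helper name, the example prefix and the example file path are all hypothetical, and ALF's own loading code is not shown.

```python
def split_checkpoint_spec(checkpoint):
    """Split a 'prefix@path' string into (prefix, path). Hypothetical helper."""
    prefix, sep, path = checkpoint.partition("@")
    if not sep:
        raise ValueError("expected a string of the form 'prefix@path'")
    return prefix, path


# Both the prefix and the file path here are made-up examples.
prefix, path = split_checkpoint_spec("alg._sub_alg@/tmp/ckpt-100")
print(prefix, path)
```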
- calc_loss(info)[source]#
Calculate the loss at each step for each sample.
- Parameters
info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).
- Returns
- loss at each time step for each sample in the
batch. The shapes of the tensors in loss info should be \((T, B)\).
- Return type
- get_state_specs()[source]#
Get the state specs of the current module. This function is mainly used for constructing the nested state specs by the upper-level module.
- property num_replicas#
- predict_step(time_step, state)[source]#
- Predict the current observation using
time_step.prev_action and the feature of the previous observation from
state.
- Parameters
time_step (TimeStep) – input data for dynamics learning
state (DynamicsState) – state for dynamics learning
- Returns
output:
state (DynamicsState):
info (DynamicsInfo):
- Return type
- train_step(time_step, state)[source]#
- Parameters
time_step (TimeStep) – input data for dynamics learning
state (DynamicsState) – state for dynamics learning (previous observation)
- Returns
output:
state (DynamicsState): state for training
info (DynamicsInfo):
- Return type
- training: bool#
- update_state(time_step, state)[source]#
- Update the state based on TimeStep data. This function is
mainly used during rollout together with a planner.
- Parameters
time_step (TimeStep) – input data for dynamics learning
state (DynamicsState) – state for DynamicsLearningAlgorithm (previous observation)
- Returns
updated dynamics state
- Return type
state (DynamicsState)
- class DynamicsState(feature, network)#
Bases:
tuple
Create new instance of DynamicsState(feature, network)
- feature#
Alias for field number 0
- network#
Alias for field number 1
- class StochasticDynamicsAlgorithm(action_spec, feature_spec, hidden_size=256, num_replicas=1, dynamics_network_ctor=None, name='StochasticDynamicsAlgorithm')[source]#
Bases:
alf.algorithms.dynamics_learning_algorithm.DeterministicDynamicsAlgorithm
Stochastic Dynamics Learning Module
This module learns the dynamics of environment with a stochastic model.
Create a StochasticDynamicsAlgorithm.
- Parameters
hidden_size (int|tuple) – size of hidden layer(s)
num_replicas (int) – number of network replicas to be used in the ensemble for dynamics learning
dynamics_network_ctor (Optional[Callable[[Any, Any], DynamicsNetwork]]) – used to construct network for predicting next feature based on the previous feature and action. It should accept input with spec [feature_spec, encoded_action_spec] and output a tensor of shape feature_spec. For discrete action, encoded_action is a one-hot representation of the action. For continuous action, encoded action is the original action.
- predict_step(time_step, state)[source]#
- Predict the next observation given the current time_step.
The next step is predicted using the prev_action from time_step and the feature from state.
- Parameters
time_step (TimeStep) – time step structure. The prev_action from time_step will be used for predicting feature of the next step. It should be a Tensor of the shape [B, …], or [B, n, …] when n > 1, where n denotes the number of dynamics network replicas. When the input tensor has the shape of [B, …] and n > 1, it will be first expanded to [B, n, …] to match the number of dynamics network replicas.
state (DynamicsState) – state for dynamics learning with the following fields:
- feature (Tensor): features of the previous observation of the shape [B, …], or [B, n, …] when n > 1. When state.feature has the shape of [B, …] and n > 1, it will be first expanded to [B, n, …] to match the number of dynamics network replicas. It is used for predicting the feature of the next step together with time_step.prev_action.
- network: the input state of the dynamics network
- Returns
- outputs (Tensor): predicted feature of the next step, of the
shape [B, …], or [B, n, …] when n > 1.
- state (DynamicsState): with the following fields
- feature (Tensor): [B, …] (or [B, n, …] when n > 1) shape tensor representing the predicted feature of the next step
network: the updated state of the dynamics network
- info (DynamicsInfo): with the following fields being updated:
- dist (td.Distribution): the predictive distribution which
can be used for further calculation or summarization.
- Return type
- train_step(time_step, state)[source]#
- Parameters
time_step (TimeStep) – time step structure. The prev_action from time_step will be used for predicting feature of the next step. It should be a Tensor of the shape [B, …] or [B, n, …] when n > 1, where n denotes the number of dynamics network replicas. When the input tensor has the shape of [B, …] and n > 1, it will be first expanded to [B, n, …] to match the number of dynamics network replicas.
state (DynamicsState) – state for dynamics learning with the following fields:
- feature (Tensor): features of the previous observation of the shape [B, …] or [B, n, …] when n > 1. When state.feature has the shape of [B, …] and n > 1, it will be first expanded to [B, n, …] to match the number of dynamics network replicas. It is used for predicting the feature of the next step together with time_step.prev_action.
- network: the input state of the dynamics network
- Returns
outputs: empty tuple ()
state (DynamicsState): with the following fields
- feature (Tensor): [B, …] (or [B, n, …] when n > 1)
shape tensor representing the predicted feature of the next step
network: the updated state of the dynamics network
- info (DynamicsInfo): with the following fields being updated:
loss (LossInfo):
- dist (td.Distribution): the predictive distribution which
can be used for further calculation or summarization.
- Return type
- training: bool#
alf.algorithms.encoding_algorithm#
Encoding algorithm.
- class EncodingAlgorithm(observation_spec, action_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), encoder_cls=<class 'alf.networks.encoding_networks.EncodingNetwork'>, time_step_as_input=False, output_fields=None, loss_fields=None, loss_weights=None, optimizer=None, config=None, checkpoint=None, debug_summaries=False, name='EncodingAlgorithm')[source]#
Bases:
alf.algorithms.algorithm.Algorithm
Basic encoding algorithm.
It uses the provided encoding network to compute the representation. It also supports the training of the encoding network by using some of its output as losses.
- Parameters
observation_spec (nested TensorSpec) – representing the observations.
action_spec (nested BoundedTensorSpec) – not used
encoder_cls (type) – The class or function to create the encoder. It can be called using encoder_cls(input_tensor_spec).
time_step_as_input (bool) – If True, use the whole TimeStep structure as the input to the encoder instead of the observation.
output_fields (None | nested str) – if None, all the output from the encoder will be used as the output. Otherwise, output_fields will be used to retrieve the fields from the encoder output.
loss_fields (None | nested str) – there is no loss if this is None. Otherwise, loss_fields will be used to retrieve fields from encoder output and use them as loss. Note that those fields must be scalar.
loss_weights (None | nested str) – if provided, must have the same structure as loss_fields and will be used as weights for the corresponding loss values.
config (Optional[TrainerConfig]) – The trainer config. Present as representation learner interface to be used with Agent.
optimizer (torch.optim.Optimizer) – if provided, will be used to optimize the parameters of encoder.
checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.
debug_summaries (bool) – True if debug summaries should be created.
name (str) – Name of this algorithm.
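The interplay of output_fields, loss_fields and loss_weights can be sketched on a flat dictionary standing in for the encoder output. This is a simplification: ALF works with nested structures, and both the helper names and the choice to sum the weighted losses are assumptions for illustration.

```python
def select_fields(encoder_output, fields=None):
    """Return the whole output, or only the named fields if given."""
    if fields is None:
        return encoder_output
    return {name: encoder_output[name] for name in fields}


def total_loss(encoder_output, loss_fields=None, loss_weights=None):
    """Weighted sum of the scalar loss fields; None if no loss is configured."""
    if loss_fields is None:
        return None
    weights = loss_weights or {name: 1.0 for name in loss_fields}
    # Assumption: weighted losses are combined by summation.
    return sum(weights[name] * encoder_output[name] for name in loss_fields)


# Made-up encoder output with one representation and two scalar losses.
output = {"latent": [0.1, 0.2], "recon_loss": 2.0, "kl_loss": 0.5}
print(select_fields(output, ["latent"]))
print(total_loss(output, ["recon_loss", "kl_loss"],
                 {"recon_loss": 1.0, "kl_loss": 0.1}))
```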
- property output_spec#
- training: bool#
alf.algorithms.entropy_target_algorithm#
An algorithm for adjusting entropy regularization strength.
- class EntropyTargetAlgorithm(action_spec, initial_alpha=0.1, skip_free_stage=False, max_entropy=None, target_entropy=None, very_slow_update_rate=0.001, slow_update_rate=0.01, fast_update_rate=0.6931471805599453, min_alpha=0.0001, average_window=2, debug_summaries=False, name='EntropyTargetAlgorithm')[source]#
Bases:
alf.algorithms.algorithm.Algorithm
Algorithm for adjusting entropy regularization.
It tries to adjust the entropy regularization (i.e. alpha) so that the entropy is not smaller than target_entropy.
The algorithm has three stages:
init stage. This is an optional stage. If the initial entropy is already below max_entropy, then this stage is skipped. Otherwise, the alpha will be slowly decreased so that the entropy will land at max_entropy to trigger the next free_stage. Basically, this stage lets the user choose an arbitrarily large initial alpha without considering every specific case.
free stage. During this stage, the alpha is not changed. It transitions to adjust_stage once entropy drops below target_entropy.
adjust stage. During this stage, log_alpha is adjusted using this formula:
((below + 0.5 * above) * decreasing - (above + 0.5 * below) * increasing) * update_rate
Note that log_alpha will always be decreased if entropy is increasing, even when the entropy is below the target entropy. This is to prevent log_alpha from overshooting to too large a value. For the same reason, log_alpha is always increased if entropy is decreasing, even when the entropy is above the target entropy. update_rate is initialized to fast_update_rate and is reduced by a factor of 0.9 whenever the entropy crosses target_entropy. update_rate is reset to fast_update_rate if entropy drops too far below target_entropy (i.e., below fast_stage_thresh in the code, which is half of target_entropy if it is positive, and twice target_entropy if it is negative).
EntropyTargetAlgorithm can be used to approximately reproduce the learning of temperature in Soft Actor-Critic Algorithms and Applications. To do so, you need to use the same target_entropy, set skip_free_stage to True, and set slow_update_rate and fast_update_rate to 4 times the learning rate for temperature.
- Parameters
action_spec (nested BoundedTensorSpec) – representing the actions.
initial_alpha (float) – initial value for alpha; make sure that it’s large enough for initial meaningful exploration
skip_free_stage (bool) – If True, directly goes to the adjust stage.
max_entropy (float|None) – the upper bound of the total entropy. If it is None, min(initial_entropy * 0.8, initial_entropy / 0.8) is used. initial_entropy is estimated from the first average_window steps. 0.8 is to ensure that we can get a policy less random than the initial policy before starting the free stage.
target_entropy (float|None) – the lower bound of the total entropy. If it is None, a default value proportional to the action dimension is used. This value should be less than or equal to max_entropy.
very_slow_update_rate (float) – a tiny update rate for log_alpha; used in stage 0.
slow_update_rate (float) – minimal update rate for log_alpha; used in stage 2.
fast_update_rate (float) – maximum update rate for log_alpha; used in stage 2.
min_alpha (float) – the minimal value of alpha. If <=0, \(e^{-100}\) is used.
average_window (int) – window size for averaging past entropies.
debug_summaries (bool) – True if debug summaries should be created.
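The adjust-stage formula and the fast_stage_thresh rule described for this class can be written out as a small sketch. It assumes that "increasing" and "decreasing" compare the current entropy with the previous one and that the result is added to log_alpha; the function names are hypothetical and the stage bookkeeping is omitted.

```python
def log_alpha_delta(entropy, prev_entropy, target_entropy, update_rate):
    """Change applied to log_alpha in the adjust stage (hypothetical helper)."""
    below = float(entropy < target_entropy)
    above = float(entropy > target_entropy)
    # Assumption: the trend is judged against the previous entropy value.
    decreasing = float(entropy < prev_entropy)
    increasing = float(entropy > prev_entropy)
    return ((below + 0.5 * above) * decreasing
            - (above + 0.5 * below) * increasing) * update_rate


def fast_stage_thresh(target_entropy):
    """Half of a positive target entropy, twice a negative one."""
    if target_entropy > 0:
        return 0.5 * target_entropy
    return 2.0 * target_entropy


# Below target and still falling: full-strength increase of log_alpha.
print(log_alpha_delta(0.5, 1.0, target_entropy=1.0, update_rate=0.1))
# Below target but already rising: half-strength decrease to avoid overshoot.
print(log_alpha_delta(0.5, 0.2, target_entropy=1.0, update_rate=0.1))
```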
- adjust_alpha(entropy)[source]#
Adjust alpha according to the current entropy.
- Parameters
entropy (scalar Tensor) – the current entropy.
- Returns
adjusted entropy regularization
- calc_loss(info, valid_mask=None)[source]#
Calculate loss.
- Parameters
info (EntropyTargetInfo) – for computing loss.
valid_mask (tensor) – valid mask to be applied on time steps.
- Returns
- Return type
- predict_step(distribution_and_step_type, state)[source]#
Predict for one step of inputs.
- Parameters
inputs (nested Tensor) – inputs for prediction.
state (nested Tensor) – network state (for RNN).
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match predict_state_spec.
info (nest): information for analyzing the agent. In particular, if an element of the info is alf.summary.render.Image, it will be rendered during play. See alf/summary/render.py for detail.
- Return type
- rollout_step(distribution_and_step_type, state=None)[source]#
Rollout step.
- Parameters
distribution (nested Distribution) – action distribution from the policy.
step_type (StepType) – the step type for the distributions.
on_policy_training (bool) – If False, this step does nothing.
- Returns
info field is LossInfo, other fields are empty. All fields are empty if on_policy_training=False.
- Return type
- training: bool#
- class EntropyTargetInfo(step_type, loss)#
Bases:
tuple
Create new instance of EntropyTargetInfo(step_type, loss)
- loss#
Alias for field number 1
- step_type#
Alias for field number 0
- class EntropyTargetLossInfo(neg_entropy)#
Bases:
tuple
Create new instance of EntropyTargetLossInfo(neg_entropy,)
- neg_entropy#
Alias for field number 0
- class NestedEntropyTargetAlgorithm(action_spec, initial_alpha=0.1, skip_free_stage=False, max_entropy=None, target_entropy=None, very_slow_update_rate=0.001, slow_update_rate=0.01, fast_update_rate=0.6931471805599453, min_alpha=0.0001, average_window=2, debug_summaries=False, name='EntropyTargetAlgorithm')[source]#
Bases:
alf.algorithms.algorithm.Algorithm
Algorithm for adjusting entropy regularization.
Similar to EntropyTargetAlgorithm, NestedEntropyTargetAlgorithm adjusts the entropy regularization for each action in a nested action so that the entropy for each action in the nest is not smaller than the corresponding target_entropy. It uses EntropyTargetAlgorithm to do the actual work. See EntropyTargetAlgorithm for how it works.
- Parameters
action_spec (nested BoundedTensorSpec) – representing the actions.
initial_alpha (float) – initial value for alpha; make sure that it’s large enough for initial meaningful exploration
skip_free_stage (bool) – If True, directly goes to the adjust stage.
max_entropy (Nested[float|None]) – the upper bound of the entropy for each corresponding action in action_spec. If it is None, min(initial_entropy * 0.8, initial_entropy / 0.8) is used. initial_entropy is estimated from the first average_window steps. 0.8 is to ensure that we can get a policy less random than the initial policy before starting the free stage. If target_entropy is nested and:
If max_entropy is None: the max entropy of each distribution in action_spec is calculated using the estimated initial entropy for that distribution.
If max_entropy is nested: it should have the same structure as action_spec and each element indicates the max entropy for the corresponding distribution in action_spec.
If max_entropy is a float: it is the max entropy for each of the distributions in action_spec.
target_entropy (Nested[float|None]) – the lower bound of the entropy for each corresponding action in action_spec. If it is None, a default value proportional to the action dimension is used. This value should be less than or equal to max_entropy. If action_spec is nested, target_entropy can also be a nest with the same structure and each element indicates the target entropy for the corresponding distribution in action_spec.
very_slow_update_rate (float) – a tiny update rate for log_alpha; used in stage 0.
slow_update_rate (float) – minimal update rate for log_alpha; used in stage 2.
fast_update_rate (float) – maximum update rate for log_alpha; used in stage 2.
min_alpha (float) – the minimal value of alpha. If <=0, \(e^{-100}\) is used.
average_window (int) – window size for averaging past entropies.
debug_summaries (bool) – True if debug summaries should be created.
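The default max_entropy rule quoted above can be checked with a few lines of plain Python. This is only a sketch of the formula as written in the parameter description; default_max_entropy is a hypothetical helper name and not part of ALF. Taking the minimum of the two expressions yields a bound below the initial entropy whether that entropy is positive or negative (as can happen for continuous distributions).

```python
def default_max_entropy(initial_entropy: float) -> float:
    """Hypothetical helper mirroring min(initial_entropy * 0.8, initial_entropy / 0.8)."""
    return min(initial_entropy * 0.8, initial_entropy / 0.8)


# Positive initial entropy: the product 2.0 * 0.8 is the smaller value.
positive_bound = default_max_entropy(2.0)
# Negative initial entropy: the quotient -2.0 / 0.8 is the smaller value.
negative_bound = default_max_entropy(-2.0)
```

In both cases the resulting bound lies below the estimated initial entropy.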
- calc_loss(info, valid_mask=None)[source]#
Calculate the loss at each step for each sample.
- Parameters
info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).
- Returns
- loss at each time step for each sample in the
batch. The shapes of the tensors in loss info should be \((T, B)\).
- Return type
- predict_step(distribution_and_step_type, state=None)[source]#
Predict for one step of inputs.
- Parameters
inputs (nested Tensor) – inputs for prediction.
state (nested Tensor) – network state (for RNN).
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match predict_state_spec.
info (nest): information for analyzing the agent. In particular, if an element of the info is alf.summary.render.Image, it will be rendered during play. See alf/summary/render.py for detail.
- Return type
- rollout_step(distribution_and_step_type, state=None)[source]#
Rollout for one step of inputs.
It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for calc_loss(). For off-policy training, it needs to generate necessary information for train_step().
- Parameters
inputs (nested Tensor) – inputs for prediction.
state (nested Tensor) – network state (for RNN).
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match rollout_state_spec.
info (nested Tensor): For on-policy training it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored into the replay buffer and retrieved for train_step() as rollout_info.
- Return type
- train_step(distribution_and_step_type, state=None, rollout_info=None)[source]#
Perform one step of training computation.
It is called to calculate output for every time step for a batch of experience from replay buffer. It also needs to generate necessary information for calc_loss().
- Parameters
inputs (nested Tensor) – inputs for train.
state (nested Tensor) – consistent with train_state_spec.
rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match train_state_spec.
info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.
- Return type
- training: bool#
- class SGDEntropyTargetAlgorithm(action_spec, initial_alpha=0.1, target_entropy=None, window_size=1, optimizer=None, debug_summaries=False, name='SGDEntropyTargetAlgorithm')[source]#
Bases:
alf.algorithms.algorithm.Algorithm
Adjusts the entropy weight using SGD according to a target, similarly to SAC.
- Parameters
action_spec (TensorSpec) – nested tensor spec for the action
initial_alpha (float) – initial value for alpha; make sure that it’s large enough for initial meaningful exploration
target_entropy (Union[Callable[[],float],float,None]) – the target of the total entropy. If it is None, a default value proportional to the action dimension is used.
window_size (int) – window size for averaging past entropies.
optimizer (Optional[Optimizer]) – the optimizer for adjusting the weight
debug_summaries (bool) – whether to turn on debugging info
name (str) – name of the class
- calc_loss(info)[source]#
Calculate the losses for training. It will compute two losses, one for training the entropy weight, and the other for maximizing the entropy of the action distribution.
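The two losses described above can be illustrated with a SAC-style sketch in PyTorch. This is an assumption about the general technique, not ALF's actual implementation: the names entropy_losses, log_alpha and target_entropy are chosen here for illustration, and averaging over a window of past entropies is omitted.

```python
import torch


def entropy_losses(log_alpha, entropy, target_entropy):
    """Illustrative SAC-style losses (hypothetical helper, not ALF's API)."""
    # Loss for the entropy weight: the entropy is detached so only log_alpha is trained.
    alpha_loss = log_alpha * (entropy.detach() - target_entropy)
    # Loss for the policy: alpha is detached so only the entropy is pushed up.
    entropy_loss = -log_alpha.exp().detach() * entropy
    return alpha_loss, entropy_loss


log_alpha = torch.nn.Parameter(torch.log(torch.tensor(0.1)))
optimizer = torch.optim.SGD([log_alpha], lr=0.1)

entropy = torch.tensor([0.2, 0.4])  # measured entropies, below the target of 1.0
alpha_loss, entropy_loss = entropy_losses(log_alpha, entropy, target_entropy=1.0)

optimizer.zero_grad()
alpha_loss.mean().backward()
optimizer.step()

# Because the entropy is below the target, the SGD step increases alpha.
alpha = log_alpha.exp().item()
```

When the measured entropy exceeds the target, the sign of the gradient flips and the same update decreases alpha.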
- predict_step(distribution_and_step_type)[source]#
Predict for one step of inputs.
- Parameters
inputs (nested Tensor) – inputs for prediction.
state (nested Tensor) – network state (for RNN).
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match predict_state_spec.
info (nest): information for analyzing the agent. In particular, if an element of the info is alf.summary.render.Image, it will be rendered during play. See alf/summary/render.py for detail.
- Return type
- rollout_step(distribution_and_step_type)[source]#
- Parameters
distribution_and_step_type (nested Distribution) – action distribution from the policy, and the step type for the distributions.
- Returns
info is EntropyTargetInfo and info.loss is LossInfo; other fields are empty. All fields are empty for off-policy training.
- Return type
- train_step(distribution_and_step_type)[source]#
- Parameters
distribution_and_step_type (nested Distribution) – action distribution from the policy, and the step type for the distributions.
- Returns
info is EntropyTargetInfo and info.loss is LossInfo; other fields are empty.
- Return type
- training: bool#
alf.algorithms.functional_particle_vi_algorithm#
ParticleVI algorithm on parameterized functions.
- class FuncParVIAlgorithm(data_creator=None, data_creator_outlier=None, input_tensor_spec=None, output_dim=None, param_net=None, conv_layer_params=None, fc_layer_params=None, use_conv_bias=False, use_conv_ln=False, use_fc_bias=True, use_fc_ln=False, activation=<built-in method relu_ of type object>, last_activation=<function identity>, last_use_bias=True, last_use_ln=False, num_particles=10, entropy_regularization=1.0, loss_type='classification', voting='soft', par_vi='svgd', function_vi=False, function_bs=None, function_extra_bs_ratio=0.1, function_extra_bs_sampler='uniform', function_extra_bs_std=1.0, critic_hidden_layers=(100, 100), critic_iter_num=2, critic_l2_weight=10.0, critic_use_bn=True, num_train_classes=10, optimizer=None, critic_optimizer=None, logging_network=False, logging_training=False, logging_evaluate=False, config=None, debug_summaries=False, name='FuncParVIAlgorithm')[source]#
Bases:
alf.algorithms.particle_vi_algorithm.ParVIAlgorithm
Functional ParVI Algorithm
Functional ParVI algorithm maintains a set of functional particles, where each particle is a neural network. All particles are updated using particle-based VI approaches.
There are two ways of treating a neural network as a particle:
All the weights of the neural network as a particle.
Outputs of the neural network for an input mini-batch as a particle.
- Parameters
data_creator (Callable) – called as data_creator() to get a tuple of (train_dataloader, test_dataloader)
data_creator_outlier (Callable) – called as data_creator() to get a tuple of (outlier_train_dataloader, outlier_test_dataloader)
input_tensor_spec (nested TensorSpec) – the (nested) tensor spec of the input. If nested, then preprocessing_combiner must not be None. It must be provided if data_creator is not provided.
output_dim (int) – dimension of the output of the generated network. It must be provided if data_creator is not provided.
param_net (ParamNetwork) – input parametric network.
conv_layer_params (tuple[tuple]) – a tuple of tuples where each tuple takes a format (filters, kernel_size, strides, padding, pooling_kernel), where padding and pooling_kernel are optional.
fc_layer_params (tuple[tuple]) – a tuple of tuples where each tuple takes a format (FC layer sizes, use_bias), where use_bias is optional.
use_conv_bias (bool|None) – whether to use bias for conv layers. If None, will use not use_bn for conv layers.
use_conv_ln (bool) – whether to use layer normalization for conv layers.
use_fc_bias (bool) – whether to use bias for fc layers.
use_fc_ln (bool) – whether to use layer normalization for fc layers.
activation (Callable) – activation used for all the layers but the last layer.
last_activation (Callable) – activation function of the additional layer specified by last_layer_param. Note that if last_layer_param is not None, last_activation has to be specified explicitly.
last_use_bias (bool) – whether to use bias for the last layer
last_use_ln (bool) – whether to use normalization for the last layer.
num_particles (int) – number of sampling particles
entropy_regularization (float) – weight of the repulsive term in par_vi.
function_vi (bool) – whether to use function value based par_vi, currently supported by [svgd2, svgd3, gfsf].
function_bs (int) – mini batch size for par_vi training. Needed for critic initialization when function_vi is True.
function_extra_bs_ratio (float) – ratio of extra sampled batch size w.r.t. the function_bs.
function_extra_bs_sampler (str) – type of sampling method for extra training batch, types are [uniform, normal].
function_extra_bs_std (float) – std of the normal distribution for sampling extra training batch when using normal sampler.
critic_hidden_layers (tuple) – sizes of hidden layers of the critic, used for minmax.
critic_l2_weight (float) – weight of L2 regularization in training the critic, used for minmax.
critic_iter_num (int) – number of critic updates for each generator train_step, used for minmax.
critic_use_bn (bool) – whether to use batch norm for each layer of the critic, used for minmax.
critic_optimizer (torch.optim.Optimizer) – Optimizer for training the critic, used for minmax.
loss_type (str) – loglikelihood type for the generated functions, types are [classification, regression]
voting (str) – types of voting results from sampled functions, types are [soft, hard]
par_vi (str) – types of particle-based methods for variational inference, types are [svgd, gfsf, minmax]
svgd: empirical expectation of SVGD is evaluated by reusing the same batch of particles.
gfsf: Wasserstein gradient flow with smoothed functions. It involves a kernel matrix inversion, so it is computationally more expensive, but in some cases the convergence seems faster than svgd approaches.
function_vi – whether to use function value based par_vi.
num_train_classes (int) – number of classes in training set.
optimizer (torch.optim.Optimizer) – The optimizer for training.
logging_network (bool) – whether to log the architectures of networks.
logging_training (bool) – whether to log loss and acc during training.
logging_evaluate (bool) – whether to log loss and acc of evaluate.
config (TrainerConfig) – configuration for training
name (str) –
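For orientation, the svgd option listed above (reusing the same batch of particles for the empirical expectation) corresponds to the textbook SVGD update, sketched below in PyTorch. This is a minimal illustration under stated assumptions, not ALF's implementation: svgd_direction, log_prob_fn and the fixed RBF bandwidth are hypothetical, and the entropy_regularization weighting of the repulsive term is left out.

```python
import torch


def svgd_direction(particles, log_prob_fn, bandwidth=1.0):
    """Textbook SVGD direction with kernel k(x, y) = exp(-||x - y||^2 / bandwidth).

    particles: tensor of shape [N, D]; log_prob_fn maps [N, D] to [N].
    All names here are illustrative and not part of ALF.
    """
    x = particles.detach().requires_grad_(True)
    # Score of the target density at every particle, shape [N, D].
    score = torch.autograd.grad(log_prob_fn(x).sum(), x)[0]
    x = x.detach()
    diff = x.unsqueeze(1) - x.unsqueeze(0)  # diff[i, j] = x_i - x_j
    kernel = torch.exp(-(diff ** 2).sum(-1) / bandwidth)  # [N, N], symmetric
    # Driving term: sum_j k(x_j, x_i) * score_j.
    attraction = kernel @ score
    # Repulsive term: sum_j grad_{x_j} k(x_j, x_i)
    #   = (2 / bandwidth) * sum_j k(x_i, x_j) * (x_i - x_j).
    repulsion = (2.0 / bandwidth) * (kernel.unsqueeze(-1) * diff).sum(dim=1)
    return (attraction + repulsion) / x.shape[0]


# Two nearby particles under a flat (constant) log-density: only repulsion acts,
# so the particles are pushed apart.
particles = torch.tensor([[-0.1], [0.1]])
phi = svgd_direction(particles, lambda x: 0.0 * x.sum(-1))
```

Each particle would then be moved along phi by a small step size; the repulsive term is what keeps the particles from collapsing onto a single mode.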
- eval_uncertainty()[source]#
Function to evaluate the epistemic uncertainty of the ensemble. This method computes the following metrics:
AUROC (AUC) evaluates the separability of model predictions with respect to the training data and a prespecified outlier dataset. AUC is computed with respect to the entropy in the averaged softmax probabilities, as well as the sum of the variance of the softmax probabilities over the ensemble.
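The two uncertainty scores named above (entropy of the averaged softmax probabilities, and the summed variance of the softmax probabilities over the ensemble) can be sketched as follows. This is an illustration only: uncertainty_scores and the [num_particles, batch_size, num_classes] layout are assumptions, and the AUROC computation itself is not shown.

```python
import torch


def uncertainty_scores(logits):
    """logits: [num_particles, batch_size, num_classes] (assumed layout)."""
    probs = torch.softmax(logits, dim=-1)
    mean_probs = probs.mean(dim=0)  # average over the ensemble
    # Entropy of the averaged softmax probabilities, one value per input.
    entropy = -(mean_probs * torch.log(mean_probs + 1e-8)).sum(dim=-1)
    # Variance over the ensemble, summed over classes, one value per input.
    variance = probs.var(dim=0, unbiased=False).sum(dim=-1)
    return entropy, variance


# Two particles and two inputs: the particles agree on the first input
# and disagree on the second.
agree = torch.tensor([[[5.0, -5.0]], [[5.0, -5.0]]])
disagree = torch.tensor([[[5.0, -5.0]], [[-5.0, 5.0]]])
logits = torch.cat([agree, disagree], dim=1)  # shape [2, 2, 2]
entropy, variance = uncertainty_scores(logits)
```

Both scores are higher for the input on which the particles disagree, which is the property an outlier-detection AUROC relies on.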
- predict_step(inputs, params=None, state=None)[source]#
Predict ensemble outputs for inputs using the hypernetwork model.
- Parameters
inputs (Tensor) – inputs to the ensemble of networks.
params (Tensor) – parameters of the ensemble of networks, if None, use self.particles.
state (None) – not used.
- Returns
- output (Tensor): predictions with shape
[batch_size, self._param_net._output_spec.shape[0]]
state (None): not used
- Return type
- set_data_loader(train_loader, test_loader=None, outlier_data_loaders=None, entropy_regularization=None)[source]#
Set data loader for training and testing.
- Parameters
train_loader (torch.utils.data.DataLoader) – training data loader
test_loader (torch.utils.data.DataLoader) – testing data loader
outlier_data_loaders (tuple[torch.utils.data.DataLoader]) – (trainloader, testloader) for outlier datasets
entropy_regularization (float) – weight of particle VI repulsive term.
- summarize_train(loss_info, params, cum_loss=None, avg_acc=None)[source]#
Generate summaries for training & loss info after each gradient update. The default implementation of this function only summarizes params (with grads) and the loss. An algorithm can override this for additional summaries. See RLAlgorithm.summarize_train() for an example.
- Parameters
experience (nested Tensor) – samples used for the most recent update_with_gradient(). By default it’s not summarized.
train_info (nested Tensor) – AlgStep.info returned by either rollout_step() (on-policy training) or train_step() (off-policy training). By default it’s not summarized.
loss_info (LossInfo) – loss
params (list[Parameter]) – list of parameters with gradients
- train_iter(state=None)[source]#
Perform one epoch (iteration) of training.
- Parameters
state (None) – not used
- Returns
mini_batch number
- train_step(inputs, entropy_regularization=None, loss_mask=None, state=None)[source]#
Perform one batch of training computation.
- Parameters
inputs (nested Tensor) – input training data.
entropy_regularization (float) – weight of the repulsive term in par_vi. If None, use self._entropy_regularization.
loss_mask (Tensor) – mask indicating which samples are valid for loss propagation.
state (None) – not used
- Returns
output (Tensor): shape is [batch_size, dim]
state: not used
info (LossInfo): loss
- Return type
- training: bool#
alf.algorithms.generator#
A generic generator.
- class CriticAlgorithm(input_tensor_spec, output_dim=None, hidden_layers=(3, 3), activation=<built-in method relu_ of type object>, net=None, use_relu_mlp=False, use_bn=True, optimizer=None, name='CriticAlgorithm')[source]#
Bases:
alf.algorithms.algorithm.Algorithm
Wrap a critic network as an Algorithm for flexible gradient updates called by the Generator when par_vi is ‘minmax’.
Create a CriticAlgorithm.
- Parameters
input_tensor_spec (TensorSpec) – spec of inputs.
output_dim (int) – dimension of output, default value is input_dim.
hidden_layers (tuple) – size of hidden layers.
activation (Callable) – activation used for all critic layers.
net (Network) – network for predicting outputs from inputs. If None, a default one with hidden_layers will be created
use_relu_mlp (bool) – whether to use ReluMLP as the default net constructor. Diagonals of the Jacobian can be explicitly computed for ReluMLP.
use_bn (bool) – whether to use batch norm for each critic layer.
optimizer (torch.optim.Optimizer) – (optional) optimizer for training.
name (str) – name of this CriticAlgorithm.
- predict_step(inputs, state=None, requires_jac_diag=False)[source]#
Predict for one step of inputs.
- Parameters
inputs (Tensor) – inputs for prediction.
state – not used.
requires_jac_diag (bool) – whether to output the diagonals of the Jacobian.
- Returns
- output (Tensor): predictions or (predictions, diag_jacobian)
if requires_jac_diag is True.
state: not used.
- Return type
- training: bool#
- class Generator(output_dim, noise_dim=32, input_tensor_spec=None, hidden_layers=(256, ), net=None, net_moving_average_rate=None, entropy_regularization=0.0, mi_weight=None, mi_estimator_cls=<class 'alf.algorithms.mi_estimator.MIEstimator'>, par_vi=None, use_kernel_averager=False, functional_gradient=False, init_lambda=1.0, lambda_trainable=False, block_inverse_mvp=False, direct_jac_inverse=False, inverse_mvp_solve_iters=1, inverse_mvp_hidden_size=100, inverse_mvp_hidden_layers=1, critic_input_dim=None, critic_hidden_layers=(100, 100), critic_l2_weight=10.0, critic_iter_num=2, critic_relu_mlp=False, critic_use_bn=True, minmax_resample=True, critic_optimizer=None, inverse_mvp_optimizer=None, optimizer=None, lambda_optimizer=None, name='Generator')[source]#
Bases:
alf.algorithms.algorithm.Algorithm
Generator generates outputs given inputs (can be None) by transforming a random noise and input using net:
outputs = net([noise, input]) if input is not None else net(noise)
The generator is trained to minimize the following objective:
\(E(loss\_func(net([noise, input]))) - entropy\_regularization \cdot H(P)\)
where P is the (conditional) distribution of outputs given the inputs implied by net and H(P) is the (conditional) entropy of P.
If the loss is the (unnormalized) negative log probability of some distribution Q and the entropy_regularization is 1, this objective is equivalent to minimizing \(KL(P||Q)\).
It uses two different ways to optimize net depending on entropy_regularization:
entropy_regularization = 0: the minimization is achieved by simply minimizing loss_func(net([noise, inputs]))
entropy_regularization > 0: the minimization is achieved using amortized particle-based variational inference (ParVI). In particular, four ParVI methods are implemented:
amortized Stein Variational Gradient Descent (SVGD):
Feng et al. “Learning to Draw Samples with Amortized Stein Variational Gradient Descent” https://arxiv.org/pdf/1707.06626.pdf
amortized Wasserstein ParVI with Smooth Functions (GFSF):
Liu, Chang, et al. “Understanding and accelerating particle-based variational inference.” International Conference on Machine Learning. 2019.
amortized Fisher Neural Sampler with Hutchinson’s estimator (MINMAX):
Hu et al. “Stein Neural Sampler.” https://arxiv.org/abs/1810.03545, 2018.
generative particle-based variational inference (GPVI). If functional_gradient is set to True, then GPVI is used.
Ratzlaff, Bai, et al. “Generative Particle Variational Inference via Estimation of Functional Gradients.” International Conference on Machine Learning. 2021.
It also supports an additional optional objective of maximizing the mutual information between [noise, inputs] and outputs by using mi_estimator to prevent mode collapse. This might be useful for entropy_regularization = 0 as suggested in section 5.1 of the following paper:
Hjelm et al. “Learning Deep Representations by Mutual Information Estimation and Maximization” https://arxiv.org/pdf/1808.06670.pdf
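The KL equivalence stated above can be verified in one line. As a sketch, assume Q is normalized as \(Q(x) = \exp(-loss\_func(x)) / Z\), where the constant \(Z\) is introduced here only for the derivation and does not appear in the source:
\(KL(P||Q) = E_P[\log P(x)] - E_P[\log Q(x)] = -H(P) + E_P[loss\_func(x)] + \log Z\)
With entropy_regularization equal to 1, the objective \(E(loss\_func) - H(P)\) therefore differs from \(KL(P||Q)\) only by \(\log Z\), which does not depend on net, so both have the same minimizer.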
Create a Generator.
- Parameters
output_dim (int) – dimension of output
noise_dim (int) – dimension of noise
input_tensor_spec (nested TensorSpec) – spec of inputs. If there is no inputs, this should be None.
hidden_layers (tuple) – sizes of hidden layers.
net (Network) – network for generating outputs from [noise, inputs] or noise (if inputs is None). If None, a default one with hidden_layers will be created
net_moving_average_rate (float) – If provided, use a moving average version of net to do prediction. This has been shown to be effective for GAN training (arXiv:1907.02544, arXiv:1812.04948).
entropy_regularization (float) – weight of entropy regularization.
mi_weight (float) – weight of mutual information loss.
mi_estimator_cls (type) – the class of mutual information estimator for maximizing the mutual information between [noise, inputs] and [outputs, inputs].
par_vi (string) – ParVI methods, options are [svgd, svgd2, svgd3, gfsf, minmax]:
svgd: empirical expectation of SVGD is evaluated by a single resampled particle. The main benefit of this choice is that it supports the conditional case, while all other options do not.
svgd2: empirical expectation of SVGD is evaluated by splitting half of the sampled batch. It is a trade-off between computational efficiency and convergence speed.
svgd3: empirical expectation of SVGD is evaluated by resampled particles of the same batch size. It has better convergence but involves resampling, so it is computationally less efficient than svgd2.
gfsf: Wasserstein gradient flow with smoothed functions. It involves a kernel matrix inversion, so it is computationally the most expensive, but in some cases the convergence seems faster than svgd approaches.
minmax: Fisher Neural Sampler, optimal descent direction of the Stein discrepancy is solved by an inner optimization procedure in the space of L2 neural networks.
use_kernel_averager (bool) – whether or not to use a running average of the kernel bandwidth for ParVI methods.
functional_gradient (bool) – whether or not to optimize the generator with GPVI. When True, the Jacobian of the generator function needs to be square (and therefore invertible). When the generator is not square, we ensure this by sampling an input noise vector of the same size as the output, and only forwarding the first noise_dim components. We then add the full noise vector to the output, multiplied by the fullrank_diag_weight.
init_lambda (float) – weight on direct input-output link added to the generator output. Only used for GPVI and GPVI_Plus when forcing full rank Jacobian.
lambda_trainable (bool) – whether to train lambda.
block_inverse_mvp (bool) – whether to use the more efficient block form for inverse_mvp when functional_gradient is True. This option is recommended only when noise_dim < output_dim, as it is equivalent to the default form when noise_dim is equal to output_dim.
inverse_mvp_solve_iters (int) – number of iterations of inverse_mvp network training per single iteration of generator training.
inverse_mvp_hidden_size (int) – width of hidden layers in inverse_mvp network.
inverse_mvp_hidden_layers (int) – number of hidden layers in inverse_mvp network.
critic_input_dim (int) – dimension of critic input, used for minmax.
critic_hidden_layers (tuple) – sizes of hidden layers of the critic, used for minmax.
critic_l2_weight (float) – weight of L2 regularization in training the critic, used for minmax.
critic_iter_num (int) – number of critic updates for each generator train_step, used for minmax.
critic_relu_mlp (bool) – whether to use ReluMLP as the critic constructor, used for minmax.
critic_use_bn (bool) – whether to use batch norm for each layer of the critic, used for minmax.
minmax_resample (bool) – whether to resample the generator for each critic update, used for minmax.
critic_optimizer (torch.optim.Optimizer) – Optimizer for training the critic, used for minmax.
inverse_mvp_optimizer (torch.optim.Optimizer) – Optimizer for training the inverse_mvp network, used when functional_gradient is True.
optimizer (torch.optim.Optimizer) – (optional) optimizer for training
lambda_optimizer (torch.optim.Optimizer) – Optimizer for training the lambda, used for GPVI and GPVI_Plus when lambda_trainable is True.
name (str) – name of this generator
- after_update(training_info)[source]#
Do things after completing one gradient update (i.e. update_with_gradient()). This function can be used for post-processing following one minibatch update, such as copying a training model to a target model in SAC, DQN, etc.
- Parameters
root_inputs (nest) – temporally batched inputs for the rollout_step() of the root algorithm collected during unroll().
info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() for on-policy training or train_step() for off-policy training.
- property noise_dim#
- predict_step(inputs=None, noise=None, batch_size=None, training=False, state=None)[source]#
Generate outputs given inputs.
- Parameters
inputs (nested Tensor) – if None, the outputs are generated only from noise.
noise (Tensor) – input to the generator.
batch_size (int) – batch_size. Must be provided if inputs is None. It is ignored if inputs is not None.
training (bool) – whether to train the generator.
state – not used
- Returns
output (Tensor): predictions with shape [batch_size, output_dim]
state: not used.
- Return type
- train_step(inputs, loss_func, batch_size=None, transform_func=None, entropy_regularization=None, state=None)[source]#
- Parameters
inputs (nested Tensor) – if None, the outputs are generated only from noise.
loss_func (Callable) – loss_func([outputs, inputs]) (loss_func(outputs) if inputs is None) returns a Tensor or namedtuple of tensors with field loss, which is a Tensor of shape [batch_size], a loss term for optimizing the generator.
batch_size (int) – batch_size. Must be provided if inputs is None. It is ignored if inputs is not None.
transform_func (Callable) – transform function on generator’s outputs. Used in function value based par_vi (currently supported by [svgd2, svgd3, gfsf]) for evaluating the network(s) parameterized by the generator’s outputs (given by self._predict) on the training batch (predefined with transform_func). It can be called in two ways:
transform_func(params): params is a tensor of parameters for a network, of shape [D] or [B, D]
B: batch size
D: length of network parameters
In this case, transform_func first samples additional data besides the predefined training batch and then evaluates the network(s) parameterized by params on the training batch plus additional sampled data.
transform_func((params, extra_samples)): params is the same as in the above case and extra_samples is the tensor of additional sampled data. In this case, transform_func evaluates the network(s) parameterized by params on the predefined training batch plus extra_samples.
It returns three tensors:
outputs: outputs of network parameterized by params evaluated on the predefined training batch.
density_outputs: outputs of network parameterized by params evaluated on additional sampled data.
extra_samples: additional sampled data, same as input extra_samples if called as transform_func((params, extra_samples))
entropy_regularization (float) – weight of entropy regularization.
state – not used
- Returns
output (Tensor): predictions with shape [batch_size, output_dim]
info (LossInfo): loss
- Return type
- training: bool#
- class GeneratorLossInfo(generator, mi_estimator, inverse_mvp)#
Bases:
tuple
Create new instance of GeneratorLossInfo(generator, mi_estimator, inverse_mvp)
- generator#
Alias for field number 0
- inverse_mvp#
Alias for field number 2
- mi_estimator#
Alias for field number 1
- class InverseMVPAlgorithm(input_dim, output_dim, hidden_size=100, num_hidden_layers=1, activation=<built-in method relu_ of type object>, optimizer=None, name='InverseMVPAlgorithm')[source]#
Bases:
alf.algorithms.algorithm.Algorithm
InverseMVP network Algorithm
Maintain an encoding network that takes (z, vec) as input and predicts a matrix-vector product (mvp) of the form \(y=J^{-1}(z)*vec\), where \(J^{-1}(z)\) is the inverse of the Jacobian matrix of some function \(f(z)\), and vec is a vector. This network is used in GPVI in computing the functional_gradient of the generator, where \(J^{-1}\) is the inverse of the Jacobian of the generator function w.r.t. input noise \(z'\), and vec is the gradient of the kernel \(\nabla_{z'}k(z', z)\).
Training of this network is done outside of the algorithm, where the network is trained to predict \(y\) that minimizes the objective \(||Jy - vec||^2\).
Create an InverseMVPAlgorithm.
- Parameters
input_dim (int) – dimension of input z
output_dim (int) – output dimension, i.e., dimension of the mvp
hidden_size (int) – width of hidden layers
num_hidden_layers (int) – number of hidden layers after
activation (Callable) – activation used for all hidden layers.
optimizer (torch.optim.Optimizer) – (optional) optimizer for training.
name (str) – name of this Algorithm.
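To make the training target concrete, the sketch below computes the quantity the network is meant to approximate, \(y = J^{-1}vec\), directly with autograd and a linear solve. It is an illustration under stated assumptions: the generator function here is a hypothetical invertible toy map, and ALF trains a network to predict \(y\) rather than solving the system exactly.

```python
import torch


def generator(z):
    # Hypothetical toy map standing in for the generator function f(z);
    # its Jacobian is diagonal with strictly positive entries, hence invertible.
    return z + 0.1 * torch.tanh(z)


z = torch.tensor([0.5, -1.0, 2.0])
vec = torch.tensor([1.0, 2.0, 3.0])

# Jacobian of the generator at z, shape [3, 3].
jac = torch.autograd.functional.jacobian(generator, z)
# For an invertible J, the minimizer of ||J y - vec||^2 is y = J^{-1} vec.
y = torch.linalg.solve(jac, vec)
# Value of the objective at that y; it should be numerically zero.
residual = torch.sum((jac @ y - vec) ** 2).item()
```

A learned predictor is useful when forming and inverting the full Jacobian for every sample would be too expensive.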
- predict_step(inputs, state=None)[source]#
Predict for one step of inputs.
- Parameters
inputs (tuple of Tensors) – inputs (z, vec) for prediction.
z (Tensor): of size [N2, K] or [N2, D], representing \(z'\), where K is self._z_dim and D is self._vec_dim.
vec (Tensor): of size [N2, D] or [N2, N, D], representing \(\nabla_{z'}k(z', z)\) in GPVI.
state – not used.
- Returns
AlgStep:
output (tuple of Tensors): predictions of InverseMVP network and the z_inputs, which is [:, :K] of z.
state: not used.
- training: bool#
alf.algorithms.goal_generator#
- class GoalInfo(goal, loss)#
Bases:
tuple
Create new instance of GoalInfo(goal, loss)
- goal#
Alias for field number 0
- loss#
Alias for field number 1
- class GoalState(goal)#
Bases:
tuple
Create new instance of GoalState(goal,)
- goal#
Alias for field number 0
- class RandomCategoricalGoalGenerator(observation_spec, num_of_goals, name='RandomCategoricalGoalGenerator')[source]#
Bases:
alf.algorithms.rl_algorithm.RLAlgorithm
Random Goal Generation Module.
This module generates a random categorical goal for the agent at the beginning of every episode.
- Parameters
observation_spec (nested TensorSpec) – representing the observations.
num_of_goals (int) – total number of goals the agent can sample from.
name (str) – name of the algorithm.
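The behavior described above (a new goal at the start of each episode, the previous goal otherwise) can be sketched as a batched update. This is only an illustration: next_goal, the is_first mask and the one-hot encoding of the goal are assumptions made here, not details confirmed by the source.

```python
import torch
import torch.nn.functional as F


def next_goal(prev_goal, is_first, num_of_goals):
    """Hypothetical helper: resample the goal only where an episode starts.

    prev_goal: [B, num_of_goals] one-hot goals carried in the state (assumed encoding).
    is_first: [B] boolean mask marking the first step of an episode.
    """
    sampled = torch.randint(num_of_goals, (prev_goal.shape[0],))
    new_goal = F.one_hot(sampled, num_of_goals).to(prev_goal.dtype)
    return torch.where(is_first.unsqueeze(-1), new_goal, prev_goal)


prev_goal = F.one_hot(torch.tensor([0, 1]), 3).float()
is_first = torch.tensor([True, False])  # only the first environment starts an episode
goal = next_goal(prev_goal, is_first, num_of_goals=3)
```

The second environment keeps its previous goal, while the first one receives a freshly sampled goal.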
- calc_loss(info)[source]#
Calculate the loss at each step for each sample.
- Parameters
info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).
- Returns
- loss at each time step for each sample in the
batch. The shapes of the tensors in loss info should be \((T, B)\).
- Return type
- predict_step(inputs, state)[source]#
Predict for one step of observation.
This is only used for evaluation, so it only needs to perform computations for generating the action distribution.
- Parameters
time_step (TimeStep) – Current observation and other inputs for computing action.
state (nested Tensor) – should be consistent with predict_state_spec
- Returns
output (nested Tensor): should be consistent with action_spec.
state (nested Tensor): should be consistent with predict_state_spec.
- Return type
- rollout_step(inputs, state)[source]#
Rollout for one step of inputs.
It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for calc_loss(). For off-policy training, it needs to generate necessary information for train_step().
- Parameters
inputs (nested Tensor) – inputs for prediction.
state (nested Tensor) – network state (for RNN).
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match rollout_state_spec.
info (nested Tensor): For on-policy training it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored into the replay buffer and retrieved for train_step() as rollout_info.
- Return type
- train_step(inputs, state, rollout_info)[source]#
For off-policy training, the current output goal should be taken from the goal in rollout_info (historical goals generated during rollout).
Note that we cannot take the goal from state and pass it down because the first state might be a zero vector. We also cannot resample the goal online because that might be inconsistent with the sampled experience trajectory.
- training: bool#
alf.algorithms.handcrafted_algorithm#
Handcrafted Algorithm.
- class HandcraftedAlgorithm(observation_spec, action_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), env=None, config=None, debug_summaries=False, name='Handcrafted')[source]#
Bases:
alf.algorithms.off_policy_algorithm.OffPolicyAlgorithm
A base class for algorithms with handcrafted computational logic. Note that a concrete algorithm should subclass from this and implement the computational logic in _policy_func. See SimpleCarlaAlgorithm for an example.
- Parameters
observation_spec (nested TensorSpec) – representing the observations.
action_spec (nested BoundedTensorSpec) – representing the actions.
reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).
env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. env only needs to be provided to the root algorithm.
config (TrainerConfig) – config for training. It only needs to be provided to the algorithm which performs train_iter() by itself.
debug_summaries (bool) – True if debug summaries should be created.
name (str) – The name of this algorithm.
- calc_loss(info)[source]#
Calculate the loss at each step for each sample.
- Parameters
info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).
- Returns
- loss at each time step for each sample in the
batch. The shapes of the tensors in loss info should be \((T, B)\).
- Return type
- predict_step(inputs, state)[source]#
Predict for one step of observation.
This only used for evaluation. So it only need to perform computations for generating action distribution.
- Parameters
time_step (TimeStep) – Current observation and other inputs for computing action.
state (nested Tensor) – should be consistent with predict_state_spec
- Returns
output (nested Tensor): should be consistent with
action_spec.
state (nested Tensor): should be consistent with
predict_state_spec.
- Return type
- rollout_step(inputs, state)[source]#
Rollout for one step of inputs.
It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for
calc_loss(). For off-policy training, it needs to generate necessary information for train_step().
- Parameters
inputs (nested Tensor) – inputs for prediction.
state (nested Tensor) – network state (for RNN).
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match
rollout_state_spec.
info (nested Tensor): For on-policy training it will be temporally batched and passed as
info for calc_loss(). For off-policy training, it will be stored in the replay buffer and retrieved for train_step() as rollout_info.
- Return type
- train_step(inputs, state, rollout_info)[source]#
Perform one step of training computation.
It is called to calculate output for every time step for a batch of experience from replay buffer. It also needs to generate necessary information for
calc_loss().
- Parameters
inputs (nested Tensor) – inputs for train.
state (nested Tensor) – consistent with
train_state_spec.
rollout_info (nested Tensor) – info from
rollout_step(). It is retrieved from replay buffer.
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match
train_state_spec.
info (nested Tensor): information for training. It will be temporally batched and passed as
info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.
- Return type
- training: bool#
- class SimpleCarlaAlgorithm(observation_spec, action_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), distance_to_decelerate=50.0, distance_to_stop=1.0, env=None, config=None, debug_summaries=False, name='SimpleCarlaAlgorithm')[source]#
Bases:
alf.algorithms.handcrafted_algorithm.HandcraftedAlgorithm
A simple controller for the Carla environment.
- Parameters
observation_spec (nested TensorSpec) – representing the observations.
action_spec (nested BoundedTensorSpec) – representing the actions.
reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).
distance_to_decelerate (float|int) – the distance to the goal, in meters, from which to start decreasing the speed
distance_to_stop (float|int) – the distance to the goal, in meters, from which to start coming to a stop
env (Environment) – The environment to interact with.
env is a batched environment, which means that it runs multiple simulations simultaneously. env only needs to be provided to the root algorithm.
config (TrainerConfig) – config for training. It only needs to be provided to the algorithm which performs
train_iter() by itself.
debug_summaries (bool) – True if debug summaries should be created.
name (str) – The name of this algorithm.
- training: bool#
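The two distance parameters of SimpleCarlaAlgorithm suggest a piecewise speed schedule. The following is a minimal sketch of such a schedule and not the actual controller: the function name target_speed, the cruise_speed argument, and the linear ramp between the two distances are all assumptions made for illustration.

```python
def target_speed(distance_to_goal, cruise_speed=10.0,
                 distance_to_decelerate=50.0, distance_to_stop=1.0):
    """Hypothetical speed schedule (an illustration, not ALF code).

    Full speed far from the goal, a linear ramp-down inside
    ``distance_to_decelerate``, and zero inside ``distance_to_stop``.
    """
    if distance_to_goal <= distance_to_stop:
        return 0.0
    if distance_to_goal >= distance_to_decelerate:
        return cruise_speed
    # Fraction of the deceleration zone that still lies ahead of the vehicle.
    frac = (distance_to_goal - distance_to_stop) / (
        distance_to_decelerate - distance_to_stop)
    return cruise_speed * frac
```

In a real subclass, logic of this kind would live in _policy_func and be mapped onto the throttle and brake entries of the action spec.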
alf.algorithms.hypernetwork_algorithm#
HyperNetwork algorithm.
- class HyperNetwork(data_creator=None, data_creator_outlier=None, input_tensor_spec=None, output_dim=None, conv_layer_params=None, fc_layer_params=None, activation=<built-in method relu_ of type object>, last_activation=<function identity>, last_use_bias=True, last_use_ln=False, noise_dim=32, hidden_layers=(64, 64), use_conv_bias=False, use_conv_ln=False, use_fc_bias=True, use_fc_ln=False, generator_use_fc_bn=False, num_particles=10, entropy_regularization=1.0, critic_hidden_layers=(100, 100), critic_iter_num=2, critic_l2_weight=10.0, functional_gradient=False, init_lambda=1.0, lambda_trainable=False, block_inverse_mvp=False, direct_jac_inverse=False, inverse_mvp_solve_iters=1, inverse_mvp_hidden_size=100, inverse_mvp_hidden_layers=1, function_vi=False, function_bs=None, function_extra_bs_ratio=0.1, function_extra_bs_sampler='uniform', function_extra_bs_std=1.0, loss_type='classification', voting='soft', par_vi='svgd', num_train_classes=10, critic_optimizer=None, inverse_mvp_optimizer=None, optimizer=None, lambda_optimizer=None, logging_network=False, logging_training=False, logging_evaluate=False, config=None, name='HyperNetwork')[source]#
Bases:
alf.algorithms.algorithm.Algorithm
HyperNetwork algorithm maintains a generator that generates a set of parameters for a predefined neural network from a random noise input. It is based on the following work:
https://github.com/neale/HyperGAN
Ratzlaff and Fuxin. “HyperGAN: A Generative Model for Diverse, Performant Neural Networks.” International Conference on Machine Learning. 2019.
Major differences versus the original paper are:
A single generator that generates parameters for all network layers.
Remove the mixer and the discriminator.
The generator may be trained with generative particle-based variational inference (ParVI) method. Please refer to generator.py for details.
- Parameters
data_creator (Callable) – called as
data_creator() to get a tuple of (train_dataloader, test_dataloader)
data_creator_outlier (Callable) – called as
data_creator() to get a tuple of (outlier_train_dataloader, outlier_test_dataloader)
input_tensor_spec (nested TensorSpec) – the (nested) tensor spec of the input. If nested, then
preprocessing_combiner must not be None. It must be provided if data_creator is not provided.
output_dim (int) – dimension of the output of the generated network. It must be provided if
data_creator is not provided.
conv_layer_params (tuple[tuple]) – a tuple of tuples where each tuple takes a format
(filters, kernel_size, strides, padding, pooling_kernel), where padding and pooling_kernel are optional.
fc_layer_params (tuple[tuple]) – a tuple of tuples where each tuple takes a format
(FC layer sizes, use_bias), where use_bias is optional.
activation (nn.functional) – activation used for all the layers but the last layer.
last_activation (nn.functional) – activation function of the last layer.
last_use_bias (bool) – whether to use bias for the last layer
last_use_ln (bool) – whether to use layer normalization for the additional layer.
noise_dim (int) – dimension of noise
hidden_layers (tuple) – size of hidden layers.
use_conv_bias (bool) – whether to use bias for conv layers.
use_conv_ln (bool) – whether to use layer normalization for conv layers.
use_fc_bias (bool) – whether to use bias for fc layers.
use_fc_ln (bool) – whether to use layer normalization for fc layers.
generator_use_fc_bn (bool) – whether to use batch normalization for generator fc layers.
num_particles (int) – number of sampling particles
entropy_regularization (float) – weight for par_vi repulsive term. If
None and data_creator is provided, will be set as the ratio between the batch_size and the total size of the trainset.
critic_optimizer (torch.optim.Optimizer) – the optimizer for training critic.
critic_hidden_layers (tuple) – sizes of critic hidden layers.
critic_iter_num (int) – number of minmax optimization iterations to train critic
critic_l2_weight (float) – L2 penalty on critic to ensure boundedness
functional_gradient (bool) – whether or not to use GPVI.
log_lambda (float) – logarithm of the weight on “extra” dimensions when forcing full rank Jacobian
block_inverse_mvp (bool) – whether to use the more efficient block form for inverse_mvp when
functional_gradient is True. This option only makes sense when noise_dim < output_dim.
inverse_mvp_solve_iters (int) – number of iterations to train inverse_mvp network each training iteration of generator.
inverse_mvp_hidden_size (int) – width of hidden layers of inverse_mvp network.
inverse_mvp_hidden_layers (int) – number of hidden layers in inverse_mvp network.
function_vi (bool) – whether to use function value based par_vi, currently supported by [
svgd2, svgd3, gfsf].
function_bs (int) – mini batch size for par_vi training. Needed for critic initialization when function_vi is True.
function_extra_bs_ratio (float) – ratio of extra sampled batch size w.r.t. the function_bs.
function_extra_bs_sampler (str) – type of sampling method for extra training batch, types are [
uniform, normal].
function_extra_bs_std (float) – std of the normal distribution for sampling extra training batch when using normal sampler.
loss_type (str) – loglikelihood type for the generated functions, types are [
classification, regression]
voting (str) – types of voting results from sampled functions, types are [
soft, hard]
par_vi (str) –
types of particle-based methods for variational inference, types are [
svgd, svgd2, svgd3, gfsf, minmax],
svgd: same as
svgd3.
svgd2: empirical expectation of SVGD is evaluated by splitting half of the sampled batch. It is a trade-off between computational efficiency and convergence speed.
svgd3: empirical expectation of SVGD is evaluated by resampled particles of the same batch size. It has better convergence but involves resampling, so it is computationally less efficient than svgd2.
gfsf: Wasserstein gradient flow with smoothed functions. It involves a kernel matrix inversion, so it is computationally the most expensive, but in some cases the convergence seems faster than svgd approaches.
minmax: Fisher Neural Sampler, optimal descent direction of the Stein discrepancy is solved by an inner optimization procedure in the space of L2 neural networks.
num_train_classes (int) – number of classes in training set.
critic_optimizer – The optimizer for training critic network
optimizer (torch.optim.Optimizer) – The optimizer for training generator.
logging_network (bool) – whether to log the architectures of networks.
logging_training (bool) – whether to log loss and acc during training.
logging_evaluate (bool) – whether to log loss and acc of evaluate.
config (TrainerConfig) – configuration for training
name (str) –
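The par_vi options above are variants of particle-based variational inference. As a rough illustration of the underlying idea, here is a minimal SVGD sketch for scalar particles with a fixed-bandwidth RBF kernel. It is not ALF's generator code: the function name svgd_step, the fixed bandwidth, and the standard-normal target are assumptions chosen to keep the example small.

```python
import math


def svgd_step(particles, grad_log_p, step_size=0.1, bandwidth=1.0):
    """One SVGD update for scalar particles (illustrative sketch only)."""
    n = len(particles)
    updated = []
    for x_i in particles:
        phi = 0.0
        for x_j in particles:
            k = math.exp(-((x_j - x_i) ** 2) / (2.0 * bandwidth ** 2))
            # Driving term: kernel-weighted score, pulls x_i toward high density.
            drive = k * grad_log_p(x_j)
            # Repulsive term: gradient of k w.r.t. x_j, pushes x_i away from x_j.
            repulse = -(x_j - x_i) / bandwidth ** 2 * k
            phi += drive + repulse
        updated.append(x_i + step_size * phi / n)
    return updated


# Target: standard normal, whose score is d/dx log p(x) = -x.
particles = [3.0 + 0.2 * i for i in range(10)]
for _ in range(1000):
    particles = svgd_step(particles, grad_log_p=lambda x: -x)
mean = sum(particles) / len(particles)
spread = max(particles) - min(particles)
```

The repulsive term is what entropy_regularization weights: without it, all particles would collapse onto the mode of the target.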
- eval_uncertainty(num_particles=None)[source]#
Function to evaluate the epistemic uncertainty of a sampled ensemble. This method computes the following metrics:
AUROC (AUC): AUC is computed with respect to the entropy in the averaged softmax probabilities, as well as the sum of the variance of the softmax probabilities over the ensemble.
- Parameters
num_particles (int) – number of sampled particles. If None, then self.num_particles is used.
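The two uncertainty scores mentioned for eval_uncertainty can be sketched in plain Python. This is an illustration under assumptions, not the library routine: the helper name ensemble_uncertainty and the nested-list input layout are hypothetical, and the AUROC computation that consumes these scores is omitted.

```python
import math


def ensemble_uncertainty(probs):
    """probs[m][c]: softmax probability of class c from ensemble member m.

    Returns (entropy of the averaged probabilities,
             sum over classes of the variance across members).
    """
    n_members = len(probs)
    n_classes = len(probs[0])
    mean = [sum(p[c] for p in probs) / n_members for c in range(n_classes)]
    # Entropy of the ensemble-averaged prediction.
    entropy = -sum(p * math.log(p) for p in mean if p > 0.0)
    # Variance of each class probability across members, summed over classes.
    variance_sum = sum(
        sum((p[c] - mean[c]) ** 2 for p in probs) / n_members
        for c in range(n_classes))
    return entropy, variance_sum


agree = ensemble_uncertainty([[0.9, 0.1], [0.9, 0.1]])
disagree = ensemble_uncertainty([[0.9, 0.1], [0.1, 0.9]])
```

Members that disagree raise both scores, which is why either one can serve as an outlier detector when ranked against in-distribution inputs.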
- evaluate(num_particles=None)[source]#
Evaluate on a randomly drawn ensemble.
- Parameters
num_particles (int) – number of sampled particles. If None, then self.num_particles is used.
- property num_particles#
number of sampled particles.
- predict_step(inputs, params=None, num_particles=None, state=None)[source]#
Predict ensemble outputs for inputs using the hypernetwork model.
- Parameters
inputs (Tensor) – inputs to the ensemble of networks.
params (Tensor) – parameters of the ensemble of networks, if None, will resample.
num_particles (int) – size of sampled ensemble. Default is None.
state (None) – not used.
- Returns
- output (Tensor): shape is
[batch_size, self._param_net._output_spec.shape[0]]
state (None): not used
- Return type
- sample_parameters(noise=None, num_particles=None, training=True)[source]#
Sample parameters for an ensemble of networks.
- Parameters
noise (Tensor) – input noise to self._generator. Default is None.
num_particles (int) – number of sampled particles. Default is None. If both noise and num_particles are None, num_particles provided to the constructor will be used as batch_size for self._generator.
training (bool) – whether or not training self._generator
- Returns
AlgStep.output from predict_step of self._generator
- set_data_loader(train_loader, test_loader=None, outlier_data_loaders=None, entropy_regularization=None)[source]#
Set data loaders for training and testing.
- Parameters
train_loader (torch.utils.data.DataLoader) – training data loader
test_loader (torch.utils.data.DataLoader) – testing data loader
outlier_data_loaders (tuple[torch.utils.data.DataLoader]) – (trainloader, testloader) for outlier datasets
entropy_regularization (float) – weight for par_vi repulsive term. If None, then self._entropy_regarization is used.
- set_num_particles(num_particles)[source]#
Set the number of particles to sample through one forward pass of the hypernetwork.
- summarize_train(loss_info, params, cum_loss=None, avg_acc=None, inverse_mvp_loss=None)[source]#
Generate summaries for training & loss info after each gradient update. The default implementation of this function only summarizes params (with grads) and the loss. An algorithm can override this for additional summaries. See
RLAlgorithm.summarize_train() for an example.
- Parameters
experience (nested Tensor) – samples used for the most recent
update_with_gradient(). By default it’s not summarized.
train_info (nested Tensor) –
AlgStep.info returned by either rollout_step() (on-policy training) or train_step() (off-policy training). By default it’s not summarized.
loss_info (LossInfo) – loss.
params (list[Parameter]) – list of parameters with gradients.
cum_loss (float) – cumulative training loss of epoch.
avg_acc (float) – average accuracy across batches in epoch.
inverse_mvp_loss (float) – cumulative training loss of InverseMVPNet
- train_iter(num_particles=None, state=None)[source]#
Perform one epoch (iteration) of training.
- Parameters
num_particles (int) – number of sampled particles. Default is None.
state (None) – not used
- Returns
mini_batch number
- train_step(inputs, num_particles=None, entropy_regularization=None, state=None)[source]#
Perform one batch of training computation.
- Parameters
inputs (nested Tensor) – input training data.
num_particles (int) – number of sampled particles. Default is None, in which case self._num_particles will be used for batch_size of self._generator.
entropy_regularization (float) – weight for par_vi repulsive term. If None, then self._entropy_regarization is used.
state (None) – not used
- Returns
train_step of self._generator
- training: bool#
alf.algorithms.icm_algorithm#
- class ICMAlgorithm(action_spec, observation_spec=None, hidden_size=256, reward_adapt_speed=8.0, encoding_net=None, forward_net=None, inverse_net=None, activation=<built-in method relu_ of type object>, optimizer=None, name='ICMAlgorithm')[source]#
Bases:
alf.algorithms.algorithm.Algorithm
Intrinsic Curiosity Module
This module generates the intrinsic reward based on the prediction error of the observation.
See Pathak et al., “Curiosity-driven Exploration by Self-supervised Prediction”
Create an ICMAlgorithm.
- Args
action_spec (nested TensorSpec): agent’s action spec
observation_spec (nested TensorSpec): agent’s observation spec. If not None, then a normalizer will be used to normalize the observation.
hidden_size (int or tuple[int]): size of hidden layer(s)
reward_adapt_speed (float): how fast to adapt the reward normalizer. Roughly speaking, the statistics for the normalization are calculated mostly based on the most recent T/speed samples, where T is the total number of samples.
encoding_net (Network): network for encoding observation into a latent feature. Its input is the same as the input of this algorithm.
forward_net (Network): network for predicting next feature based on previous feature and action. It should accept input with spec [feature_spec, encoded_action_spec] and output a tensor of shape feature_spec. For discrete action, encoded_action is a one-hot representation of the action. For continuous action, encoded action is the same as the original action.
inverse_net (Network): network for predicting previous action given the previous feature and current feature. It should accept input with spec [feature_spec, feature_spec] and output tensor of shape (num_actions,).
activation (torch.nn.functional): activation used for constructing any of the forward net and inverse net, if not provided.
optimizer (torch.optim.Optimizer): The optimizer for training
name (str):
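As a rough illustration of the forward-model idea described above, the sketch below encodes a discrete action as one-hot and scores a transition by the squared error between the predicted and the actual next feature. The helper names, the 0.5 scaling, and the absence of reward normalization are assumptions for this example. They are not taken from ICMAlgorithm itself.

```python
def one_hot(action, num_actions):
    """Encode a discrete action index as a one-hot list."""
    return [1.0 if i == action else 0.0 for i in range(num_actions)]


def forward_error_reward(predicted_feature, next_feature):
    """Hypothetical intrinsic reward: half the squared L2 prediction error."""
    return 0.5 * sum(
        (p - f) ** 2 for p, f in zip(predicted_feature, next_feature))


encoded = one_hot(2, 4)
surprising = forward_error_reward([1.0, 2.0], [1.0, 4.0])
familiar = forward_error_reward([1.0, 2.0], [1.0, 2.0])
```

Transitions that the forward model predicts poorly receive a larger reward, which is what drives exploration toward unfamiliar states.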
- calc_loss(info)[source]#
Calculate the loss at each step for each sample.
- Parameters
info (nest) – information collected for training. It is batched from each
AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).
- Returns
- loss at each time step for each sample in the
batch. The shapes of the tensors in loss info should be \((T, B)\).
- Return type
- predict_step(inputs, state)[source]#
Predict for one step of inputs.
- Parameters
inputs (nested Tensor) – inputs for prediction.
state (nested Tensor) – network state (for RNN).
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match
predict_state_spec.
info (nest): information for analyzing the agent. In particular, if an element of the info is
alf.summary.render.Image, it will be rendered during play. See alf/summary/render.py for detail.
- Return type
- rollout_step(inputs, state)[source]#
Rollout for one step of inputs.
It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for
calc_loss(). For off-policy training, it needs to generate necessary information for train_step().
- Parameters
inputs (nested Tensor) – inputs for prediction.
state (nested Tensor) – network state (for RNN).
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match
rollout_state_spec.
info (nested Tensor): For on-policy training it will be temporally batched and passed as
info for calc_loss(). For off-policy training, it will be stored in the replay buffer and retrieved for train_step() as rollout_info.
- Return type
- train_step(inputs, state, rollout_info=None)[source]#
Perform one step of training computation.
It is called to calculate output for every time step for a batch of experience from replay buffer. It also needs to generate necessary information for
calc_loss().
- Parameters
inputs (nested Tensor) – inputs for train.
state (nested Tensor) – consistent with
train_state_spec.
rollout_info (nested Tensor) – info from
rollout_step(). It is retrieved from replay buffer.
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match
train_state_spec.
info (nested Tensor): information for training. It will be temporally batched and passed as
info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.
- Return type
- training: bool#
alf.algorithms.iql_algorithm#
Implicit Q-Learning Algorithm.
- class IqlActionState(actor_network, critic)#
Bases:
tuple
Create new instance of IqlActionState(actor_network, critic)
- actor_network#
Alias for field number 0
- critic#
Alias for field number 1
- class IqlActorInfo(actor_loss)#
Bases:
tuple
Create new instance of IqlActorInfo(actor_loss,)
- actor_loss#
Alias for field number 0
- class IqlAlgorithm(observation_spec, action_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), actor_network_cls=<class 'alf.networks.actor_distribution_networks.ActorDistributionNetwork'>, critic_network_cls=<class 'alf.networks.critic_networks.CriticNetwork'>, v_network_cls=<class 'alf.networks.value_networks.ValueNetwork'>, reward_weights=None, epsilon_greedy=None, calculate_priority=False, num_critic_replicas=2, env=None, config=None, critic_loss_ctor=None, target_update_tau=0.05, target_update_period=1, temperature=1.0, actor_optimizer=None, critic_optimizer=None, value_optimizer=None, expectile=0.8, max_exp_advantage=100, checkpoint=None, debug_summaries=False, name='IqlAlgorithm')[source]#
Bases:
alf.algorithms.off_policy_algorithm.OffPolicyAlgorithm
Implicit Q-learning algorithm (IQL).
IQL is an offline reinforcement learning method. Instead of constraining the critic network or policy to avoid the value function extrapolation issue, IQL learns using only in-sample data, thus avoiding the issues that arise when querying the critic network with out-of-distribution actions, a problem commonly faced in offline RL.
Reference:
Kostrikov, et al. "Offline Reinforcement Learning with Implicit Q-Learning", arXiv:2110.06169
- Parameters
observation_spec (nested TensorSpec) – representing the observations.
action_spec (BoundedTensorSpec) – representing the actions. Only continuous action is supported currently.
reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).
actor_network_cls (Callable) – is used to construct the actor network. The constructed actor network will be called to sample continuous actions. All of its output specs must be continuous. Discrete actor network is not supported.
critic_network_cls (Callable) – is used to construct critic network.
v_network_cls (Callable) – is used to construct a value network for estimating the expectile of q values.
reward_weights (None|list[float]) – this is only used when the reward is multidimensional. In that case, the weighted sum of the q values is used for training the actor if reward_weights is not None. Otherwise, the sum of the q values is used.
epsilon_greedy (float) – a floating value in [0,1], representing the chance of action sampling instead of taking argmax. This can help prevent a dead loop in some deterministic environment like Breakout. Only used for evaluation. If None, its value is taken from
config.epsilon_greedy and then alf.get_config_value(TrainerConfig.epsilon_greedy).
calculate_priority (bool) – whether to calculate priority. This is only useful if priority replay is enabled.
num_critic_replicas (int) – number of critics to be used. Default is 2. This is only applied for critic networks. The value network is not replicated.
env (Environment) – The environment to interact with.
env is a batched environment, which means that it runs multiple simulations simultaneously. env only needs to be provided to the root algorithm.
config (TrainerConfig) – config for training. It only needs to be provided to the algorithm which performs
train_iter() by itself.
critic_loss_ctor (None|OneStepTDLoss|MultiStepLoss) – a critic loss constructor. If
None, a default OneStepTDLoss will be used.
target_update_tau (float) – Factor for soft update of the target networks.
target_update_period (int) – Period for soft update of the target networks.
temperature (float) – the hyper-parameter for scaling the advantages. It corresponds to 1/beta in Eqn.(7) of the paper.
actor_optimizer (torch.optim.optimizer) – The optimizer for actor.
critic_optimizer (torch.optim.optimizer) – The optimizer for critic.
value_optimizer (torch.optim.optimizer) – The optimizer for value network.
expectile (float) – the expectile value for value learning.
max_exp_advantage (float) – clamp the exponentiated advantages with this value before being applied to weight the actor loss.
checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to
Algorithm for more details.
debug_summaries (bool) – True if debug summaries should be created.
name (str) – The name of this algorithm.
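The expectile, temperature and max_exp_advantage parameters correspond to two small formulas from the IQL paper. The sketch below shows them for scalar inputs; the function names are hypothetical and the code is an illustration of the formulas, not the loss computation inside IqlAlgorithm.

```python
import math


def expectile_loss(diff, expectile=0.8):
    """Asymmetric squared loss |tau - 1(diff < 0)| * diff**2, diff = q - v."""
    weight = expectile if diff > 0 else 1.0 - expectile
    return weight * diff ** 2


def advantage_weight(advantage, temperature=1.0, max_exp_advantage=100.0):
    """exp(advantage / temperature), clamped from above."""
    # Cap the exponent first so that math.exp cannot overflow.
    exponent = min(advantage / temperature, math.log(max_exp_advantage))
    return min(math.exp(exponent), max_exp_advantage)
```

With expectile above 0.5, underestimating q (diff > 0) is penalized more than overestimating it, so the value network is pushed toward an upper expectile of the in-sample q values.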
- after_update(root_inputs, info)[source]#
Do things after completing one gradient update (i.e.
update_with_gradient()). This function can be used for post-processing following one minibatch update, such as copying a training model to a target model in SAC, DQN, etc.
- Parameters
root_inputs (nest) – temporally batched inputs for the
rollout_step() of the root algorithm collected during unroll().
info (nest) – information collected for training. It is batched from each
AlgStep.info returned by rollout_step() for on-policy training or train_step() for off-policy training.
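A typical use of after_update is a soft (Polyak) update of target networks, governed here by target_update_tau. The following sketch shows the arithmetic on plain lists of floats. The function name soft_update and the list representation are assumptions, and how target_update_period gates the update is not shown.

```python
def soft_update(target_params, source_params, tau=0.05):
    """Return target <- (1 - tau) * target + tau * source, elementwise."""
    return [(1.0 - tau) * t + tau * s
            for t, s in zip(target_params, source_params)]


target = [0.0, 2.0]
source = [1.0, 2.0]
once = soft_update(target, source)
# Repeated application moves the target toward the source.
many = target
for _ in range(200):
    many = soft_update(many, source)
```

A small tau makes the target network lag the trained network, which stabilizes the bootstrapped critic targets.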
- calc_loss(info)[source]#
Calculate the loss at each step for each sample.
- Parameters
info (nest) – information collected for training. It is batched from each
AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).
- Returns
- loss at each time step for each sample in the
batch. The shapes of the tensors in loss info should be \((T, B)\).
- Return type
- predict_step(inputs, state)[source]#
Predict for one step of observation.
This is only used for evaluation, so it only needs to perform the computations for generating the action distribution.
- Parameters
time_step (TimeStep) – Current observation and other inputs for computing action.
state (nested Tensor) – should be consistent with predict_state_spec
- Returns
output (nested Tensor): should be consistent with
action_spec.
state (nested Tensor): should be consistent with
predict_state_spec.
- Return type
- rollout_step(inputs, state)[source]#
rollout_step() basically predicts actions like what is done by predict_step(). Additionally, if states are to be stored in a replay buffer, then this function also calls _critic_networks and _target_critic_networks to maintain their states.
- train_step(inputs, state, rollout_info)[source]#
Perform one step of training computation.
It is called to calculate output for every time step for a batch of experience from replay buffer. It also needs to generate necessary information for
calc_loss().
- Parameters
inputs (nested Tensor) – inputs for train.
state (nested Tensor) – consistent with
train_state_spec.
rollout_info (nested Tensor) – info from
rollout_step(). It is retrieved from replay buffer.
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match
train_state_spec.
info (nested Tensor): information for training. It will be temporally batched and passed as
info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.
- Return type
- training: bool#
- class IqlCriticInfo(critics, target_value, value)#
Bases:
tuple
Create new instance of IqlCriticInfo(critics, target_value, value)
- critics#
Alias for field number 0
- target_value#
Alias for field number 1
- value#
Alias for field number 2
- class IqlCriticState(critics, target_critics)#
Bases:
tuple
Create new instance of IqlCriticState(critics, target_critics)
- critics#
Alias for field number 0
- target_critics#
Alias for field number 1
- class IqlInfo(reward, step_type, discount, action, action_distribution, actor, critic)#
Bases:
tuple
Create new instance of IqlInfo(reward, step_type, discount, action, action_distribution, actor, critic)
- action#
Alias for field number 3
- action_distribution#
Alias for field number 4
- actor#
Alias for field number 5
- critic#
Alias for field number 6
- discount#
Alias for field number 2
- reward#
Alias for field number 0
- step_type#
Alias for field number 1
alf.algorithms.lagrangian_reward_weight_algorithm#
LagrangianRewardWeightAlgorithm.
- class LagInfo(rollout_reward)#
Bases:
tuple
Create new instance of LagInfo(rollout_reward,)
- rollout_reward#
Alias for field number 0
- class LagrangianPredRewardWeightAlgorithm(reward_spec, reward_thresholds, optimizer, init_weights=1.0, max_weight=None, reward_weight_normalization=True, pred_rewards_averager_ctor=functools.partial(<class 'alf.utils.averager.EMAverager'>, update_rate=0.0001), debug_summaries=False, name='LagrangianPredRewardWeightAlgorithm')[source]#
Bases:
alf.algorithms.lagrangian_reward_weight_algorithm.LagrangianRewardWeightAlgorithm
Similar to
LagrangianRewardWeightAlgorithm, except that the rewards used to compare with the thresholds are collected by prediction steps instead of by rollout steps. For harsh target constraints, it is important to remove the rollout stochasticity; otherwise the agent’s constraint satisfaction ability will usually be under-estimated.
Because prediction output is not directly passed to training, in order to use the rewards from prediction to train the weights, here we use an
Averager to maintain the reward statistics. Inside every after_train_iter we perform a gradient step by querying the current averager value.
Note
This algorithm asserts
TrainerConfig.evaluate=True.
- Parameters
reward_spec (TensorSpec) – a rank-1 tensor spec representing multi-dim rewards.
reward_thresholds (list[float|None]) – a list of floating numbers, each representing a desired minimum reward threshold in expectation. If any entry is None, then the corresponding reward weight won’t be tuned; either its init value or its normalized init value (if
reward_weight_normalization=True) will be used.
optimizer (optimizer) – optimizer for learning the reward weights.
init_weights (float|list[float]) – the initial reward weights.
max_weight (float) – the reward weights will be clipped up to this value
reward_weight_normalization (bool) – whether to project the weights to a simplex (sum-to-one normalization)
pred_rewards_averager_ctor (Callable) – callable for creating an averager to maintain a moving average of prediction rewards. If None,
EMAverager with an update rate of 1e-4 will be used.
debug_summaries (bool) –
name (str) –
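To make the role of the averager concrete, here is a minimal exponential moving average with an update_rate parameter. It is a stand-in written for this page: the class name SimpleEMA and the choice to initialize with the first sample are assumptions, and ALF's EMAverager may differ (for example in how it corrects the initial bias).

```python
class SimpleEMA:
    """Hypothetical exponential moving average (not ALF's EMAverager)."""

    def __init__(self, update_rate=1e-4):
        self.update_rate = update_rate
        self.value = None

    def update(self, x):
        if self.value is None:
            # Assumption: start from the first observed sample.
            self.value = float(x)
        else:
            self.value += self.update_rate * (x - self.value)
        return self.value


ema = SimpleEMA(update_rate=0.5)
first = ema.update(1.0)
second = ema.update(3.0)
```

A small update rate such as 1e-4 averages over many prediction steps, so the weight update sees a smooth estimate of the evaluation reward instead of a single noisy episode.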
- predict_step(inputs, state=None)[source]#
Predict for one step of inputs.
- Parameters
inputs (nested Tensor) – inputs for prediction.
state (nested Tensor) – network state (for RNN).
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match
predict_state_spec.
info (nest): information for analyzing the agent. In particular, if an element of the info is
alf.summary.render.Image, it will be rendered during play. See alf/summary/render.py for detail.
- Return type
- training: bool#
- class LagrangianRewardWeightAlgorithm(reward_spec, reward_thresholds, optimizer, init_weights=1.0, max_weight=None, reward_weight_normalization=True, lambda_transform=<built-in function softplus>, debug_summaries=False, name='LagrangianRewardWeightAlgorithm')[source]#
Bases:
alf.algorithms.algorithm.Algorithm
An algorithm that adjusts reward weights according to untransformed rollout rewards. The adjustment is expected to be performed after every training iteration.
Generally speaking, for each reward dimension, the algorithm compares an individual reward per step to an average expected threshold, and if the reward is greater than the threshold (requirement satisfied) then it decreases the reward weight; otherwise it increases the weight.
Note
This algorithm doesn’t put a constraint on per-step basis since it only learns a single, state-independent weight for each reward dim. Also, a reward is always assumed to be the higher the better.
- Parameters
reward_spec (TensorSpec) – a rank-1 tensor spec representing multi-dim rewards.
reward_thresholds (list[float|None]) – a list of floating numbers, each representing a desired minimum reward threshold in expectation. If any entry is None, then the corresponding reward weight won’t be tuned; either its init value or its normalized init value (if
reward_weight_normalization=True) will be used.
optimizer (optimizer) – optimizer for learning the reward weights.
init_weights (float|list[float]) – the initial reward weights.
max_weight (float) – the reward weights will be clipped up to this value
reward_weight_normalization (bool) – whether to project the weights to a simplex (sum-to-one normalization)
lambda_transform (Callable) – the transform function to make sure all lambdas (reward weights) are positive. Currently only supports
F.softplus and torch.exp.
debug_summaries (bool) –
name (str) –
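The update rule described above (lower the weight when the reward exceeds its threshold, raise it otherwise) can be sketched as one gradient step on a softplus-transformed parameter. This is a scalar illustration under assumptions: the loss form weight * (reward - threshold), the plain gradient step, and the function names are hypothetical, and clipping to max_weight and simplex normalization are left out.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x)); keeps the weight positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def update_raw_weight(raw, reward, threshold, lr=0.1):
    """One gradient-descent step on softplus(raw) * (reward - threshold)."""
    # d softplus(raw) / d raw is the logistic sigmoid of raw.
    sigmoid = 1.0 / (1.0 + math.exp(-raw))
    grad = sigmoid * (reward - threshold)
    return raw - lr * grad


raw0 = 0.0
satisfied = update_raw_weight(raw0, reward=1.0, threshold=0.5)
violated = update_raw_weight(raw0, reward=0.2, threshold=0.5)
```

Because the update only depends on the sign and size of reward - threshold, a constraint that is comfortably met gradually loses weight in favor of the ones still being violated.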
- predict_step(inputs, state=None)[source]#
Predict for one step of inputs.
- Parameters
inputs (nested Tensor) – inputs for prediction.
state (nested Tensor) – network state (for RNN).
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match
predict_state_spec.
info (nest): information for analyzing the agent. In particular, if an element of the info is
alf.summary.render.Image, it will be rendered during play. See alf/summary/render.py for detail.
- Return type
- property reward_weights#
Return the detached reward weights. These weights are expected not to be changed by external code.
- rollout_step(inputs, state=None)[source]#
Rollout for one step of inputs.
It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for calc_loss(). For off-policy training, it needs to generate necessary information for train_step().
- Parameters
inputs (nested Tensor) – inputs for prediction.
state (nested Tensor) – network state (for RNN).
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match rollout_state_spec.
info (nested Tensor): For on-policy training, it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored in the replay buffer and retrieved for train_step() as rollout_info.
- Return type
- training: bool#
alf.algorithms.mbrl_algorithm#
Model-based RL Algorithm.
- class LatentMbrlAlgorithm(observation_spec, action_spec, planner_module_ctor, reward_spec=TensorSpec(shape=(), dtype=torch.float32), env=None, config=None, planner_optimizer=None, debug_summaries=False, name='LatentMbrlAlgorithm')[source]#
Bases: alf.algorithms.mbrl_algorithm.MbrlAlgorithm
Model-based RL algorithm in a latent space.
Create a LatentMbrlAlgorithm. The LatentMbrlAlgorithm takes as input a planner module for making decisions on actions based on the latent representation of the current observation as well as a latent dynamics model.
The latent representation as well as the latent dynamics are provided by a latent predictive representation module, which is an instance of PredictiveRepresentationLearner. It is set through the set_latent_predictive_representation_module() function. The latent predictive representation module should have a function predict_multi_step for performing multi-step imagined rollout. Currently it is assumed that the training of the latent representation module is outside of the LatentMbrlAlgorithm, although the LatentMbrlAlgorithm can also contribute to its training by using the latent representation in loss calculation.
- Parameters
observation_spec (nested TensorSpec) – representing the observations.
action_spec (BoundedTensorSpec) – representing the actions.
planner_module_ctor (Callable[[Any, Any], PlanAlgorithm]) – used to construct the module for generating planned actions based on the specified reward function and dynamics function.
reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).
env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. env only needs to be provided to the root Algorithm.
config (TrainerConfig) – config for training. config only needs to be provided to the algorithm which performs train_iter() by itself.
debug_summaries (bool) – True if debug summaries should be created.
name (str) – The name of this algorithm.
- calc_loss(training_info)[source]#
Calculate the loss at each step for each sample.
- Parameters
info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).
- Returns
loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).
- Return type
- train_step(exp, state, rollout_info=None)[source]#
Perform one step of training computation.
It is called to calculate output for every time step for a batch of experience from the replay buffer. It also needs to generate necessary information for calc_loss().
- Parameters
inputs (nested Tensor) – inputs for training.
state (nested Tensor) – consistent with train_state_spec.
rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from the replay buffer.
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match train_state_spec.
info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.
- Return type
- training: bool#
- class MbrlAlgorithm(observation_spec, action_spec, reward_module, planner_module_ctor, feature_spec=None, dynamics_module_ctor=None, reward_spec=TensorSpec(shape=(), dtype=torch.float32), particles_per_replica=1, epsilon_greedy=None, env=None, config=None, dynamics_optimizer=None, reward_optimizer=None, planner_optimizer=None, checkpoint=None, debug_summaries=False, name='MbrlAlgorithm')[source]#
Bases: alf.algorithms.off_policy_algorithm.OffPolicyAlgorithm
Model-based RL algorithm.
Create an MbrlAlgorithm. The MbrlAlgorithm takes as input the following set of modules for making decisions on actions based on the current observation: 1) a learnable/fixed dynamics module; 2) a learnable/fixed reward module; 3) a learnable/fixed planner module.
- Parameters
action_spec (BoundedTensorSpec) – representing the actions.
dynamics_module_ctor (Optional[Callable[[Any, Any], DynamicsLearningAlgorithm]]) – used to construct the module for learning to predict the next feature based on the previous feature and action. It should accept input with spec [feature_spec, encoded_action_spec] and output a tensor of shape feature_spec. For discrete actions, encoded_action is a one-hot representation of the action. For continuous actions, encoded_action is the same as the original action.
reward_module (RewardEstimationAlgorithm) – module for calculating the reward, i.e., evaluating the reward for a (s, a) pair.
planner_module_ctor – used to construct the module for generating planned actions based on the specified reward function and dynamics function.
reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).
particles_per_replica (int) – number of particles for each replica
epsilon_greedy (float) – a floating value in [0,1], representing the chance of action sampling instead of taking argmax. This can help prevent a dead loop in some deterministic environments like Breakout. Only used for evaluation. If None, its value is taken from config.epsilon_greedy and then alf.get_config_value(TrainerConfig.epsilon_greedy).
env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. env only needs to be provided to the root Algorithm.
config (TrainerConfig) – config for training. config only needs to be provided to the algorithm which performs train_iter() by itself.
checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.
debug_summaries (bool) – True if debug summaries should be created.
name (str) – The name of this algorithm.
- after_update(root_inputs, training_info)[source]#
Do things after completing one gradient update (i.e. update_with_gradient()). This function can be used for post-processing following one minibatch update, such as copying a training model to a target model in SAC, DQN, etc.
- Parameters
root_inputs (nest) – temporally batched inputs for the rollout_step() of the root algorithm collected during unroll().
info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() for on-policy training or train_step() for off-policy training.
- calc_loss(training_info)[source]#
Calculate the loss at each step for each sample.
- Parameters
info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).
- Returns
loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).
- Return type
- predict_step(time_step, state)[source]#
Predict for one step of observation.
This is only used for evaluation, so it only needs to perform the computations for generating the action distribution.
- Parameters
time_step (TimeStep) – Current observation and other inputs for computing action.
state (nested Tensor) – should be consistent with predict_state_spec
- Returns
output (nested Tensor): should be consistent with action_spec.
state (nested Tensor): should be consistent with predict_state_spec.
- Return type
- rollout_step(time_step, state)[source]#
Rollout for one step of inputs.
It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for calc_loss(). For off-policy training, it needs to generate necessary information for train_step().
- Parameters
inputs (nested Tensor) – inputs for prediction.
state (nested Tensor) – network state (for RNN).
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match rollout_state_spec.
info (nested Tensor): For on-policy training, it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored in the replay buffer and retrieved for train_step() as rollout_info.
- Return type
- train_step(inputs, state, rollout_info=None)[source]#
Perform one step of training computation.
It is called to calculate output for every time step for a batch of experience from the replay buffer. It also needs to generate necessary information for calc_loss().
- Parameters
inputs (nested Tensor) – inputs for training.
state (nested Tensor) – consistent with train_state_spec.
rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from the replay buffer.
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match train_state_spec.
info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.
- Return type
- training: bool#
alf.algorithms.mcts_algorithm#
Monte-Carlo Tree Search.
- class MCTSAlgorithm(observation_spec, action_spec, num_simulations, root_dirichlet_alpha, root_exploration_fraction, pb_c_init, pb_c_base, discount, is_two_player_game, visit_softmax_temperature_fn, model=None, keep_model_pred_state=False, predict_action_sampler=MultinomialSampler(), rollout_action_sampler=MultinomialSampler(), learn_policy_temperature=1.0, reward_spec=TensorSpec(shape=(), dtype=torch.float32), expand_all_children=False, expand_all_root_children=False, known_value_bounds=None, value_min_max_delta=1e-30, ucb_break_tie_eps=0.0, ucb_parent_visit_count_minus_one=False, unexpanded_value_score=0.5, act_with_exploration_policy=False, search_with_exploration_policy=False, learn_with_exploration_policy=False, exploration_policy_type='rkl', max_unroll_length=1000000, num_parallel_sims=1, checkpoint=None, debug_summaries=False, name='MCTSAlgorithm')[source]#
Bases: alf.algorithms.off_policy_algorithm.OffPolicyAlgorithm
Monte-Carlo Tree Search algorithm.
The code largely follows the pseudocode of Schrittwieser et al. Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. The pseudocode can be downloaded from https://arxiv.org/src/1911.08265v2/anc/pseudocode.py
There are several differences:
In this implementation, all values and rewards are for player 0. It seems that the values and rewards in the pseudocode can be either for player 0 or player 1 depending on whose turn it is. This makes reasoning about the logic of the code more difficult and error-prone, and there indeed seems to be a bug in the pseudocode related to this. More concretely, in the pseudocode, line 524 suggests that the value_sum is relative to a changing player; line 528 suggests that all the rewards along a path are relative to the same player; while line 499 combines the reward and value without considering the player.
When calculating the UCB score, the pseudocode normalizes the value before adding it to the reward. We normalize after summing reward and value.
When calculating the UCB score, if the visit count of the node is 0, the value component of the score is 0 in the pseudocode. We use 0.5 instead so that it is not always the lowest score (or highest for player 1) no matter what the outcomes of its siblings are.
The pseudocode initializes the visit count of the root to 0. We initialize it to 1 instead so that the prior is not neglected in the first select_child(). This is consistent with how the visit_count of other nodes is initialized. When other nodes are expanded, the immediately subsequent backup() will set their initial visit_count to 1.
We add a game_over field to ModelOutput to indicate the game is over so that we won’t keep expanding over that branch.
We add support for using a stochastic policy instead of UCB to do the search/learn/act. This can be enabled by setting act_with_exploration_policy, search_with_exploration_policy, learn_with_exploration_policy to True. See Grill et al. Monte-Carlo tree search as regularized policy optimization for reference.
In addition to the original MuZero paper, we also implemented the methods described in the following two papers:
1. Grill et al. Monte-Carlo tree search as regularized policy optimization
It can be enabled by setting (act/learn/search)_with_exploration_policy.
2. Hubert et al. Learning and Planning in Complex Action Spaces
It is enabled when SimpleMCTSModel.num_sampled_actions is set.
The time spent on tree search is directly related to how many times the tree is expanded. To make it faster, we also support expanding multiple leaves simultaneously. In order to do this, we maintain num_parallel_sims best children for each node in the tree and use them to construct k=num_parallel_sims paths. Note that the k best children may contain duplicates, which is desired because we want to expand the most promising path more often. Depending on the value of search_with_exploration_policy, this process is slightly different:
search_with_exploration_policy=True. The k best children of each node are simply chosen by independently sampling the exploration policy k times. When constructing the search paths, the i-th search path is based on the i-th best child of each node.
search_with_exploration_policy=False. The best child is the same as in the case k=1. The second best child is found by assuming the visit counts of the best child and the parent are increased by 1 and applying the UCB criterion again. This is repeated k times to get k best children. Note that this is different from directly selecting the best k children based on the original UCB scores. The reason for not doing that is that if the highest score is much bigger than the second highest score, we want both paths to select the same child. During the process of traversing from the root to construct k search paths, if several (let’s say k’) paths are exactly the same so far, we will use the best k’ children of the last node of these k’ paths to extend the paths, so that the k’ children (which may contain duplicates) selected to extend these k’ paths are the most promising according to the UCB scores.
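The selection of k best children for search_with_exploration_policy=False can be sketched as follows. This is a minimal illustration rather than the library's code: the function names are hypothetical, the score follows the pUCT form of the MuZero pseudocode, the child values are assumed to be already normalized and are held fixed between picks, and the default pb_c_base=19652 is taken from the MuZero pseudocode rather than from this documentation.

```python
import math


def ucb_score(parent_visits, child_visits, prior, value,
              pb_c_init=1.25, pb_c_base=19652):
    """pUCT score in the style of the MuZero pseudocode (value assumed normalized)."""
    pb_c = math.log((parent_visits + pb_c_base + 1) / pb_c_base) + pb_c_init
    pb_c *= math.sqrt(parent_visits) / (child_visits + 1)
    return pb_c * prior + value


def k_best_children(parent_visits, child_visits, priors, values, k):
    """Pick k children (duplicates allowed) by repeatedly applying UCB.

    After each pick, the visit counts of the chosen child and of the parent
    are treated as if they had been increased by 1 before the next pick.
    """
    visits = list(child_visits)  # work on a copy
    chosen = []
    for _ in range(k):
        scores = [
            ucb_score(parent_visits, n, p, v)
            for n, p, v in zip(visits, priors, values)
        ]
        best = max(range(len(scores)), key=scores.__getitem__)
        chosen.append(best)
        visits[best] += 1
        parent_visits += 1
    return chosen


# One child clearly dominates: both picks go to it.
dominant = k_best_children(3, [1, 1], [0.9, 0.1], [0.9, 0.1], k=2)
# Two identical children: the second pick moves to the other child.
tied = k_best_children(1, [0, 0], [0.5, 0.5], [0.5, 0.5], k=2)
```

The first call shows why this differs from taking the top-k of the original scores: when one score is much higher, the same child is selected twice.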
- Parameters
observation_spec (nested TensorSpec) – if the observation is a dictionary, MCTSAlgorithm will use the following three fields if they are contained in the dictionary:
valid_action_mask: a bool Tensor to indicate which actions are allowed. It will be used to mask out invalid actions. If not provided, all possible actions are considered.
steps: int32 Tensor to indicate the number of steps since the beginning of the game. If not provided, an internal counter will be used. However, this internal count will not be correct if the algorithm is used to play against a human, because it is then not used to generate all the moves of both players.
to_play: int8 Tensor whose elements are 0 or 1 to indicate who is the player to take the action. If not provided, steps % 2 will be used as to_play.
action_spec (nested BoundedTensorSpec) – representing the actions.
num_simulations (int) – the number of simulations per search (calls to model)
root_dirichlet_alpha (float) – alpha of dirichlet prior for exploration
root_exploration_fraction (float) – noise generated by the dirichlet distribution is combined with the action distribution from the model to be used as the action prior for the children of the root.
pb_c_init (float) – c1 of the pUCT rule in Appendix B, equation (2)
pb_c_base (float) – c2 of the pUCT rule in Appendix B, equation (2)
discount (float) – reward discount factor
is_two_player_game (bool) – whether this is a two player (zero-sum) game
model (Optional[MCTSModel]) – the model used by the algorithm. If not provided in the constructor, it should be specified using set_model before predict_step or rollout_step is used.
keep_model_pred_state (bool) – whether to keep ModelOutput.state.pred_state returned from model.initial_predict as part of the state of this algorithm. If so, the previous pred_state will be used to call initial_predict.
visit_softmax_temperature_fn (Callable) – function for calculating the softmax temperature for sampling action based on the visit_counts of the children of the root. \(P(a) \propto \exp(visit\_count/t)\). This function is called as visit_softmax_temperature_fn(steps), where steps is a vector representing the number of steps in the games. It is expected to return a float vector of the same shape as steps.
predict_action_sampler – available choices include CategoricalSeedSampler, EpsilonGreedySampler, MultinomialSampler
rollout_action_sampler – available choices include CategoricalSeedSampler, EpsilonGreedySampler, MultinomialSampler
learn_policy_temperature (float) – transform the policy p found by MCTS by \(p^{1/learn\_policy\_temperature} / Z\) as policy target for model learning, where Z is a normalization factor so that the resulting probabilities sum to one.
reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).
expand_all_children (bool) – If True, when a new leaf is selected, immediately expand all its children. With this option, the visit count does not truly reflect the quality of a node. Hence it should be used with (act/learn)_with_exploration_policy=True
expand_all_root_children (bool) – whether to expand all root children before search. This is described in Appendix A of “Learning and Planning in Complex Action Spaces”. However, our implementation is different from the paper’s. The paper initializes Q(s, a) for root s for all the actions being sampled. We expand all sampled actions for root s. With this option, the visit count does not truly reflect the quality of a node. Hence it should be used with (act/learn)_with_exploration_policy=True.
known_value_bounds (tuple|None) – known bound of the values.
value_min_max_delta (float) – when normalizing the value using the min and max values, (max-min).clamp(min=value_min_max_delta) is used as the denominator.
ucb_break_tie_eps (float) – add a random number in the range of [0, ucb_break_tie_eps) to the UCB score to choose randomly among actions with close UCB scores. It is used only if at least one of act/search/learn_with_exploration_policy is False.
ucb_parent_visit_count_minus_one (bool) – This option effectively chooses the first child of a parent uniformly, which can increase exploration.
unexpanded_value_score (float|str) – The value score for an unexpanded child. If ‘max’/‘min’/‘mean’, will use the maximum/minimum/mean of the value scores of the expanded siblings. If ‘mean_with_parent’, will use the mean of the value scores of the expanded siblings and its parent (this is used in ELF OpenGo and EfficientZero). If ‘none’, when the exploration policy is used, will keep the policy for the unexpanded children the same as the prior; when exploration is not used, ‘none’ behaves the same as ‘min’.
act_with_exploration_policy (bool) – If True, a policy calculated using reverse KL divergence will be used to generate actions.
search_with_exploration_policy (bool) – If True, a policy calculated using reverse KL divergence will be used for tree search.
learn_with_exploration_policy (bool) – If True, a policy calculated using reverse KL divergence will be used for learning.
exploration_policy_type (str) – Type of exploration policy. Must be one of (‘rkl’, ‘kl’)
max_unroll_length (int) – maximal allowed unroll steps when building the search tree. If expand_all_children is False, the maximal allowed tree depth will be max_unroll_length. Otherwise, the maximal allowed tree depth will be max_unroll_length-1
num_parallel_sims (int) – the number of leaves to expand at a time for one tree. num_simulations must be divisible by num_parallel_sims.
checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.
name (str) – the name of the algorithm.
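Two of the formulas in the parameter list can be made concrete with a small sketch. The function names below are hypothetical and the code works on plain Python lists for a single search tree. The numerator `value - min_value` of the normalization is an assumption, since the text above only specifies the denominator.

```python
def sharpen_policy(policy, temperature):
    """Return p^(1/temperature) / Z, where Z makes the result sum to one."""
    powered = [p ** (1.0 / temperature) for p in policy]
    z = sum(powered)
    return [p / z for p in powered]


def normalize_value(value, min_value, max_value, min_max_delta=1e-30):
    """Min-max normalize a value, clamping the denominator from below."""
    denom = max(max_value - min_value, min_max_delta)
    return (value - min_value) / denom


# A temperature below 1 concentrates the target on the most likely action.
target = sharpen_policy([0.5, 0.3, 0.2], temperature=0.5)
normalized = normalize_value(2.0, min_value=1.0, max_value=5.0)
```

The clamp only matters when all observed values coincide, in which case it prevents a division by zero.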
- property discount#
- predict_step(time_step, state)[source]#
Predict for one step of observation.
This is only used for evaluation, so it only needs to perform the computations for generating the action distribution.
- Parameters
time_step (TimeStep) – Current observation and other inputs for computing action.
state (nested Tensor) – should be consistent with predict_state_spec
- Returns
output (nested Tensor): should be consistent with action_spec.
state (nested Tensor): should be consistent with predict_state_spec.
- Return type
- rollout_step(time_step, state)[source]#
Rollout for one step of inputs.
It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for calc_loss(). For off-policy training, it needs to generate necessary information for train_step().
- Parameters
inputs (nested Tensor) – inputs for prediction.
state (nested Tensor) – network state (for RNN).
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match rollout_state_spec.
info (nested Tensor): For on-policy training, it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored in the replay buffer and retrieved for train_step() as rollout_info.
- Return type
- training: bool#
- class MCTSInfo(candidate_actions, value, candidate_action_policy)#
Bases: tuple
Create new instance of MCTSInfo(candidate_actions, value, candidate_action_policy)
- candidate_action_policy#
Alias for field number 2
- candidate_actions#
Alias for field number 0
- value#
Alias for field number 1
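The "Alias for field number" entries mean that each field can be read either by name or by position. The snippet below illustrates this with the standard library's `collections.namedtuple` as a stand-in; the field order is copied from the signature above, while the values are placeholders invented for the example.

```python
from collections import namedtuple

# Stand-in with the same field order as the documented MCTSInfo.
MCTSInfo = namedtuple(
    "MCTSInfo", ["candidate_actions", "value", "candidate_action_policy"])

info = MCTSInfo(
    candidate_actions=[0, 2], value=0.3, candidate_action_policy=[0.6, 0.4])

# Name-based and index-based access refer to the same field.
same_field = info.candidate_action_policy is info[2]
```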
- class MCTSState(steps, pred_state, action_sampler_state, next_predicted_reward)#
Bases: tuple
Create new instance of MCTSState(steps, pred_state, action_sampler_state, next_predicted_reward)
- action_sampler_state#
Alias for field number 2
- next_predicted_reward#
Alias for field number 3
- pred_state#
Alias for field number 1
- steps#
Alias for field number 0
- class VisitSoftmaxTemperatureByMoves(move_temperature_pairs=[(29, 1.0), (10000, 0.0001)])[source]#
Bases: object
Scheduling the temperature by move.
- Parameters
move_temperature_pairs (list[tuple]) – each (moves, temperature) pair indicates using this temperature until so many moves have been played in the current game. The moves should be in ascending order. Note that num_moves used to calculate the temperature starts from 0.
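A scalar sketch of such a schedule is given below. It is hypothetical: the function name is invented, the documented class is called with a vector of steps rather than a single integer, and whether the boundary move itself still uses the earlier temperature (`<=` here) is an assumption.

```python
def temperature_by_moves(num_moves,
                         move_temperature_pairs=((29, 1.0), (10000, 0.0001))):
    """Return the temperature of the first pair whose move limit is not exceeded."""
    for moves, temperature in move_temperature_pairs:
        if num_moves <= moves:  # boundary handling is an assumption
            return temperature
    # Past the last limit: keep using the final temperature (assumed).
    return move_temperature_pairs[-1][1]


early = temperature_by_moves(0)    # opening moves are sampled at temperature 1.0
late = temperature_by_moves(100)   # later moves are nearly greedy
```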
- class VisitSoftmaxTemperatureByProgress(progress_temperature_pairs=[(0.5, 1.0), (0.75, 0.5), (1, 0.25)])[source]#
Bases: object
Scheduling the temperature by training progress.
- Parameters
progress_temperature_pairs (list[tuple]) – each (progress, temperature) pair indicates using this temperature until this training progress. Note that progress should be in ascending order.
- calculate_exploration_policy(value, prior, c, tol=1e-06)[source]#
Calculate exploration policy.
The policy is based on Grill et al. Monte-Carlo tree search as regularized policy optimization
Notation:
q: prior policy
p: sampling probability
v: value
The exploration policy is found by minimizing the following:
\[p = \arg\min_p \left[ -E_p(v) + c KL(q\|p) \right]\]
which leads to the following solution:
\[p_i = c\frac{q_i}{\alpha - v_i}\]
where \(\alpha \ge \max_i(v_i)\) is such that \(\sum_i p_i = 1\).
To make the solving numerically more stable and efficient, we reparameterize the problem as follows:
\[\begin{split}\begin{array}{ll} & v^* = \max_i v_i \\ & \alpha = v^* + c \beta \\ & u_i = \frac{v_i - v^*}{c} \\ & p_i = \frac{q_i}{\beta - u_i} \\ \end{array}\end{split}\]
With this reparameterization, we need to find \(\beta>0\) s.t.
\[\sum_i \frac{q_i}{\beta - u_i} = 1\]
We use Newton’s method to update \(\beta\) iteratively:
\[\beta \leftarrow \beta - \frac{f(\beta)}{f'(\beta)} = \beta + \frac{\sum_i \frac{q_i}{\beta - u_i} - 1}{\sum_i \frac{q_i}{(\beta - u_i)^2}}\]
where \(f(\beta) = \sum_i \frac{q_i}{\beta - u_i} - 1\) and \(f'(\beta)\) is the derivative of \(f(\beta)\). Since \(f(\beta)\) is convex, starting the iteration with a \(\beta\) s.t. \(f(\beta) > 0\) guarantees convergence. In practice, we find that about 10 iterations can reach a tolerance of 1e-6. Newton’s method is much faster than binary search.
- Parameters
value (Tensor) – [N, K] Tensor
prior (Tensor) – [N, K] Tensor
c (Tensor) – [N, 1] Tensor
tol (float) – Desired accuracy. The result satisfies \(|\sum_i p_i - 1| \le tol\)
- Returns
Tensor: [N, K], the exploration policy
int: the number of iterations
- Return type
tuple
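The Newton iteration for \(\beta\) described above can be sketched for a single row of values in plain Python. This is an illustration under assumptions, not the library's batched tensor implementation: the function name, the `max_iters` safeguard, and the starting point \(\beta_0 = \max_i (q_i + u_i)\) are choices made for this example, and the prior is assumed to be strictly positive. With that start, one term of the sum equals 1 and the others are positive, so \(f(\beta_0) \ge 0\).

```python
def exploration_policy(values, prior, c, tol=1e-6, max_iters=100):
    """Solve sum_i q_i / (beta - u_i) = 1 for beta with Newton's method.

    Returns (policy, number_of_iterations) for one row of values.
    Assumes every prior entry is strictly positive.
    """
    v_max = max(values)
    u = [(v - v_max) / c for v in values]  # u_i <= 0 and max_i u_i = 0
    # Starting point with f(beta) >= 0 (an assumption of this sketch).
    beta = max(q + ui for q, ui in zip(prior, u))
    iterations = 0
    while iterations < max_iters:
        f = sum(q / (beta - ui) for q, ui in zip(prior, u)) - 1.0
        if abs(f) <= tol:
            break
        # f'(beta) = -sum_i q_i / (beta - u_i)^2, so beta - f / f' becomes:
        beta += f / sum(q / (beta - ui) ** 2 for q, ui in zip(prior, u))
        iterations += 1
    policy = [q / (beta - ui) for q, ui in zip(prior, u)]
    return policy, iterations


policy, num_iters = exploration_policy([1.0, 0.0], [0.5, 0.5], c=1.0)
```

For this example the equation reduces to \(\beta^2 = 0.5\), so the first probability converges to \(0.5 / \sqrt{0.5} = \sqrt{0.5}\), and the higher-valued action receives more probability than its prior.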
- calculate_kl_exploration_policy(value, prior, c)[source]#
Calculate exploration policy.
This is similar to calculate_exploration_policy, but using \(KL(p\|q)\) instead of \(KL(q\|p)\) for regularization.
Notation:
q: prior policy
p: sampling probability
v: value
The exploration policy is found by minimizing the following:
\[p = \arg\min_p \left[ -E_p(v) + c KL(p\|q) \right]\]
which leads to the following solution:
\[p_i = \frac{q_i \exp(v_i/c)}{Z}\]
where \(Z\) is such that \(\sum_i p_i = 1\).
- Parameters
value (Tensor) – [N, K] Tensor
prior (Tensor) – [N, K] Tensor
c (Tensor) – [N, 1] Tensor
- Returns
Tensor: [N, K], the exploration policy
int: always 0 (to conform with the signature of calculate_exploration_policy)
- Return type
tuple
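Because this variant has a closed-form solution, it needs no iteration. A minimal single-row sketch in plain Python follows; the function name is hypothetical, and subtracting the maximum value before exponentiating is a numerical-stability choice made here that cancels in \(Z\) and leaves the result unchanged.

```python
import math


def kl_exploration_policy(values, prior, c):
    """Return p_i = q_i * exp(v_i / c) / Z for one row of values."""
    v_max = max(values)
    # exp((v_i - v_max) / c) differs from exp(v_i / c) by a common factor
    # that cancels when dividing by Z, but it cannot overflow.
    weights = [q * math.exp((v - v_max) / c) for q, v in zip(prior, values)]
    z = sum(weights)
    return [w / z for w in weights]


kl_policy = kl_exploration_policy([1.0, 0.0], [0.5, 0.5], c=1.0)
```

With a uniform prior this is a softmax of the values with temperature c, so the example gives the first action probability \(e / (e + 1)\).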
- create_atari_mcts(observation_spec, action_spec)[source]#
Helper function for creating MCTSAlgorithm for Atari games.
- create_board_game_mcts(observation_spec, action_spec, dirichlet_alpha, pb_c_init=1.25, num_simulations=800, debug_summaries=False)[source]#
Helper function for creating MCTSAlgorithm for board games.
alf.algorithms.mcts_models#
- class MCTSModel(num_unroll_steps, representation_net, dynamics_net, prediction_net, train_reward_function, train_game_over_function, train_repr_prediction=False, train_policy=True, predict_reward_sum=False, value_loss_weight=1.0, reward_loss_weight=1.0, policy_loss_weight=1.0, game_over_loss_weight=1.0, repr_prediction_loss_weight=1.0, initial_alpha=0.0, reward_loss=SquareLoss(), value_loss=SquareLoss(), repr_loss=MeanSquaredLoss(batch_dims=2), target_entropy=None, alpha_adjust_rate=0.001, initial_loss_weight=1, predict_initial_reward=True, reset_reward_sum_period=0, apply_beyond_episode_end_mask=False, apply_partial_trajectory_mask=False, debug_summaries=False, name='MCTSModel')[source]#
Bases: torch.nn.modules.module.Module
The interface for the model used by MCTSAlgorithm.
- Parameters
representation_net (Network) – the network for generating initial latent representation from observation. It is called as representation_net(observation).
dynamics_net (Network) – the network for generating the next latent representation given the current latent representation and action. It is called as dynamics_net((current_latent_representation, action))
prediction_net (Network) – the network for predicting value, reward and action. It is called as prediction_net(dyn_state, pred_state) and outputs a tuple of four Tensors:
value_pred: the prediction for value. The way it is interpreted depends on value_loss.
reward_pred (Optional): the prediction for reward. The way it is interpreted depends on reward_loss.
action_distribution: The distribution of the actions of the predicted policy.
game_over_logit (Optional): The predicted logits for game over.
train_reward_function (bool) – whether to predict reward
train_game_over_function (bool) – whether to predict game over
train_repr_prediction (bool) – whether to train to predict future latent representation.
train_policy (bool) – whether to train a policy. Note that training policy is REQUIRED when the model is used in MCTS algorithm.
predict_reward_sum (bool) – If True, the loss for reward is between the predicted reward and the sum of actual reward over unroll steps. If False, the loss for reward is the mean square error between the predicted reward and the actual reward.
value_loss_weight (float) – the weight for value prediction loss.
reward_loss_weight (float) – the weight for reward prediction loss
policy_loss_weight (float) – the weight for policy prediction loss
repr_prediction_loss_weight (float) – the weight for the loss of predicting latent representation.
initial_alpha (float) – initial value for the weight of entropy regularization
reward_loss (ScalarPredictionLoss) – the loss function for reward prediction.
value_loss (ScalarPredictionLoss) – the loss function for value prediction.
repr_loss (Callable) – the loss function for representation learning. It is called as repr_loss(predicted_representation, target_representation), where the shapes of the two tensors are [B, num_unroll_steps+1, …]. It should return a loss with the shape [B, num_unroll_steps+1]. Note that repr_loss can have its own parameters.
target_entropy (float) – if provided, will adjust alpha automatically so that the entropy is not smaller than this.
alpha_adjust_rate (float) – the speed to adjust alpha
initial_loss_weight (Optional[float]) – the weight for the loss at the initial step of the trajectory. If not provided, 1 / num_unroll_steps will be used.
predict_initial_reward (bool) – whether to predict the reward at the initial step.
reset_reward_sum_period (int) – reset the reward sum every so many steps. Do not reset the reward sum if this is 0.
apply_beyond_episode_end_mask (bool) – If True, the steps after the end of an episode are ignored for the representation prediction loss.
apply_partial_trajectory_mask (bool) – If True, the steps after an unfinished episode (due to TimeLimit or an ongoing episode) are ignored for all the losses.
- calc_loss(model_output, target)[source]#
Calculate the loss.
The shapes of the tensors in model_output are [B, unroll_steps+1, …].
- Returns
the shapes of the tensors are [B]
- Return type
LossInfo
- calc_repr_prediction_loss(repr, target_repr)[source]#
Calculate the loss given the predicted representation and target representation.
- initial_predict(latent, pred_state=())[source]#
Make predictions based on an initial latent representation.
Note that we specialize for initial prediction (in addition to recurrent prediction made in recurrent_inference()) because some stateful initializations need to be completed.
- Parameters
latent (Tensor) – A batch of initial representations (i.e. directly derived from a raw observation).
pred_state – prediction state. If provided, it should be ModelOutput.state.pred_state returned from initial_predict at the previous step
- Return type
- Returns
A ModelOutput object produced by the prediction network.
- initial_representation(observation)[source]#
Compute the initial latent representation given the observation.
- Parameters
observation – A tensor or tensor nest representing a batch of observations.
- Return type
Tensor
- Returns
The latent representation generated by the representation net.
- property pred_state_spec: Union[alf.tensor_specs.TensorSpec, List[NestedTensorSpec], Tuple[()], Tuple[NestedTensorSpec, ...], Dict[str, NestedTensorSpec]]#
Returns the spec of the prediction_net.
- Return type
Union[TensorSpec,List[ForwardRef],Tuple[()],Tuple[ForwardRef, …],Dict[str,ForwardRef]]
- prediction_model(dyn_state, pred_state)[source]#
Calculate the prediction given the latent state of the dynamics model and the state of the prediction model.
- Returns
the following fields need to be provided:
value_pred:
reward_pred: provide if need to predict reward
game_over: provide if need to predict game over
actions: provide if actions are sampled
action_probs
state (ModelState): dyn_state, pred_state
action_distribution:
game_over_logit: provide if need to predict game over
- Return type
- recurrent_inference(state, action)[source]#
Generate prediction given state and action.
- Parameters
state (Tensor) – the latent state of the model. The state should be from a previous call of initial_inference or recurrent_inference.
action (Tensor) – the imagined action
- Returns
the prediction
- Return type
- property repr_spec: alf.tensor_specs.TensorSpec#
Returns the spec of the representation.
Used by the downstream RL algorithms as their observation spec.
- Return type
- training: bool#
- class ModelOutput(value, reward, game_over, actions, action_probs, state, action_distribution, game_over_logit, value_pred, reward_pred)#
Bases:
tuple
Create new instance of ModelOutput(value, reward, game_over, actions, action_probs, state, action_distribution, game_over_logit, value_pred, reward_pred)
- action_distribution#
Alias for field number 6
- action_probs#
Alias for field number 4
- actions#
Alias for field number 3
- game_over#
Alias for field number 2
- game_over_logit#
Alias for field number 7
- reward#
Alias for field number 1
- reward_pred#
Alias for field number 9
- state#
Alias for field number 5
- value#
Alias for field number 0
- value_pred#
Alias for field number 8
- class ModelState(state, pred_state, step, prev_reward_sum)#
Bases:
tuple
Create new instance of ModelState(state, pred_state, step, prev_reward_sum)
- pred_state#
Alias for field number 1
- prev_reward_sum#
Alias for field number 3
- state#
Alias for field number 0
- step#
Alias for field number 2
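The "Alias for field number N" entries mean that each attribute is a positional slot of a tuple. The sketch below illustrates this with the standard-library `collections.namedtuple` as a stand-in; it is not ALF's own `ModelState` class, which may be built with a different helper, and the field values are made up.

```python
from collections import namedtuple

# Stand-in with the same field order as the documented ModelState.
ModelState = namedtuple(
    "ModelState", ["state", "pred_state", "step", "prev_reward_sum"])

s = ModelState(state="dyn", pred_state=(), step=3, prev_reward_sum=0.5)

# Map each field name to its positional index ("field number").
field_index = {name: i for i, name in enumerate(ModelState._fields)}

# Tuples are immutable, so _replace returns a new instance.
s_next = s._replace(step=s.step + 1)
```

Attribute access (`s.step`) and positional access (`s[2]`) read the same slot, which is what the alias entries document.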
- class ModelTarget(is_partial_trajectory, beyond_episode_end, reward, action, action_policy, game_over, value, observation)#
Bases:
tuple
Create new instance of ModelTarget(is_partial_trajectory, beyond_episode_end, reward, action, action_policy, game_over, value, observation)
- action#
Alias for field number 3
- action_policy#
Alias for field number 4
- beyond_episode_end#
Alias for field number 1
- game_over#
Alias for field number 5
- is_partial_trajectory#
Alias for field number 0
- observation#
Alias for field number 7
- reward#
Alias for field number 2
- value#
Alias for field number 6
- class SimpleMCTSModel(observation_spec, action_spec, num_unroll_steps, num_sampled_actions=None, encoding_net_ctor=<function create_simple_encoding_net>, dynamics_net_ctor=<function create_simple_dynamics_net>, prediction_net_ctor=<function create_simple_prediction_net>, game_over_logit_thresh=1.0, initial_alpha=0.0, target_entropy=None, alpha_adjust_rate=0.001, train_reward_function=True, train_game_over_function=True, train_policy=True, train_repr_prediction=False, debug_summaries=False, name='SimpleMCTSModel')[source]#
Bases:
alf.algorithms.mcts_models.MCTSModel
- Parameters
observation_spec (TensorSpec) – representing the observations.
action_spec (BoundedTensorSpec) – representing the actions.
num_sampled_actions (int) – the number of actions sampled from the action distribution. For continuous actions or multi-dimensional discrete actions, this many actions will be sampled from the action distribution. For 1-dimensional (scalar) discrete actions, the num_sampled_actions actions with the largest probability will be chosen.
dynamics_net_ctor (Callable) – Called as dynamics_net_ctor((observation_spec, action_spec)) to create the dynamics net. The created net should take a tuple of (observation, action) as input and output the next observation.
prediction_net_ctor (Callable) – Called as prediction_net_ctor(observation_spec, action_spec) to create the prediction net. The created net should take the latent_state as input and output the prediction for (value, reward, action_distribution, game_over_logit).
game_over_logit_thresh (float) – the state is treated as game over if the game-over logit is greater than this threshold.
initial_alpha (float) – initial value for the weight of entropy regularization
target_entropy (float) – if provided, will adjust alpha automatically so that the entropy is not smaller than this.
alpha_adjust_rate (float) – the speed to adjust alpha
train_reward_function (bool) – whether to predict reward
train_game_over_function (bool) – whether to predict game over
train_repr_prediction (bool) – whether to train to predict future latent representations. This implements the self-supervised consistency loss described in Ye et al., Mastering Atari Games with Limited Data. The loss is -cosine(prediction_net(projection_net(x)), projection_net(y)), where x is the representation calculated by dynamics_net and y is the representation calculated by representation_net from the corresponding future observations.
train_policy (bool) – whether to train a policy. Note that training the policy is REQUIRED when the model is used in the MCTS algorithm.
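To make the consistency loss concrete, here is a minimal pure-Python sketch of a negative cosine similarity between two vectors. The helper name `neg_cosine` and the toy vectors are assumptions for illustration; in the model the two arguments would be the outputs of prediction_net(projection_net(x)) and projection_net(y), and the actual implementation operates on batched tensors.

```python
import math

def neg_cosine(pred, target, eps=1e-8):
    """Return -cos(pred, target) for two equal-length vectors."""
    dot = sum(p * t for p, t in zip(pred, target))
    norm_p = math.sqrt(sum(p * p for p in pred))
    norm_t = math.sqrt(sum(t * t for t in target))
    # eps guards against division by zero for all-zero vectors.
    return -dot / max(norm_p * norm_t, eps)

aligned = neg_cosine([1.0, 2.0], [2.0, 4.0])      # same direction
orthogonal = neg_cosine([1.0, 0.0], [0.0, 3.0])   # unrelated directions
opposite = neg_cosine([1.0, 2.0], [-1.0, -2.0])   # opposite directions
```

The loss is lowest (close to -1) when the predicted and target representations point in the same direction, regardless of their magnitudes.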
- prediction_model(dyn_state, pred_state)[source]#
Calculate the prediction given the latent state of the dynamics model and the state of the prediction model.
- Returns
the following fields need to be provided:
- value_pred:
- reward_pred: provide if need to predict reward
- game_over: provide if need to predict game over
- actions: provide if actions are sampled
- action_probs
- state (ModelState): dyn_state, pred_state
- action_distribution:
- game_over_logit: provide if need to predict game over
- Return type
- property repr_spec#
Returns the spec of the representation.
Used by the downstream RL algorithms as their observation spec.
- training: bool#
- class SimplePredictionNet(observation_spec, action_spec, trunk_net_ctor, num_quantiles=1, discrete_projection_net_ctor=<class 'alf.networks.projection_networks.CategoricalProjectionNetwork'>, continuous_projection_net_ctor=<class 'alf.networks.projection_networks.StableNormalProjectionNetwork'>, initial_game_over_bias=0.0)[source]#
Bases:
alf.networks.network.Network
- Parameters
observation_spec (TensorSpec) – describing the observation.
action_spec (BoundedTensorSpec) – describing the action.
trunk_net_ctor (Callable) – called as trunk_net_ctor(input_tensor_spec=observation_spec) to create a network which takes the observation as input and outputs a hidden representation, which is used as input for predicting value, reward, action_distribution and game_over_logit.
initial_game_over_bias (float) – initial bias for predicting the logit of game_over. Suggested value: log(game_over_prob/(1 - game_over_prob)).
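The suggested bias is the logit (inverse sigmoid) of the expected game-over probability. A small sketch, where the prior `game_over_prob = 0.01` is a hypothetical value chosen only for illustration:

```python
import math

def logit(p):
    """Inverse of the sigmoid: log(p / (1 - p))."""
    return math.log(p / (1.0 - p))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

game_over_prob = 0.01          # hypothetical prior probability of game over
bias = logit(game_over_prob)   # candidate value for initial_game_over_bias
recovered = sigmoid(bias)      # applying the sigmoid recovers the prior
```

With such a bias, a freshly initialized network whose other outputs are near zero would predict a game-over probability close to the chosen prior.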
- forward(input, state=())[source]#
Predict (value, reward, action_distribution, game_over_logit)
- Parameters
input (Tensor) – observation
state – not used.
- Returns
(value, reward, action_distribution, game_over_logit), ()
- Return type
A tuple of
- training: bool#
alf.algorithms.mdq_algorithm#
Multi-Dimensional Q-Learning Algorithm.
- class MdqAlgorithm(observation_spec, action_spec, critic_network, reward_spec=TensorSpec(shape=(), dtype=torch.float32), epsilon_greedy=None, env=None, config=None, critic_loss_ctor=None, target_entropy=<function calc_default_target_entropy_quantized>, initial_log_alpha=0.0, target_update_tau=0.05, target_update_period=1, distill_noise=0.01, critic_optimizer=None, alpha_optimizer=None, debug_summaries=False, name='MdqAlgorithm')[source]#
Bases:
alf.algorithms.off_policy_algorithm.OffPolicyAlgorithm
Multi-Dimensional Q-Learning Algorithm.
- Parameters
observation_spec (nested TensorSpec) – representing the observations.
action_spec (nested BoundedTensorSpec) – representing the actions.
critic_network (MdqCriticNetwork) – an instance of MdqCriticNetwork
reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).
epsilon_greedy (float) – a floating value in [0,1], representing the chance of action sampling instead of taking argmax. This can help prevent a dead loop in some deterministic environments like Breakout. Only used for evaluation. If None, its value is taken from config.epsilon_greedy and then alf.get_config_value(TrainerConfig.epsilon_greedy).
env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. env only needs to be provided to the root algorithm.
config (TrainerConfig) – config for training. It only needs to be provided to the algorithm which performs train_iter() by itself.
critic_loss_ctor (None|OneStepTDLoss|MultiStepLoss) – a critic loss constructor. If None, a default OneStepTDLoss will be used.
initial_log_alpha (float) – initial value for variable log_alpha.
target_entropy (float|Callable) – If a floating value, it’s the target average policy entropy, for updating alpha. If a callable function, then it will be called on the action spec to calculate a target entropy. Note that in the MDQ algorithm, as the continuous action is represented by a discrete distribution for each action dimension, calc_default_target_entropy_quantized is used to compute the target entropy by default.
target_update_tau (float) – Factor for soft update of the target networks.
target_update_period (int) – Period for soft update of the target networks.
distill_noise (float) – the std of random Gaussian noise added to the action used for distillation.
critic_optimizer (torch.optim.Optimizer) – The optimizer for critic.
alpha_optimizer (torch.optim.Optimizer) – The optimizer for alpha.
debug_summaries (bool) – True if debug summaries should be created.
name (str) – The name of this algorithm.
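As an illustration of how target_update_tau and target_update_period typically interact, the sketch below applies the common Polyak-style soft update to plain lists of floats. The update rule `target = (1 - tau) * target + tau * source` is the usual convention and is assumed here; the helper `soft_update` is hypothetical and not part of ALF.

```python
def soft_update(target, source, tau):
    """Move each target parameter a fraction tau toward the source."""
    return [(1.0 - tau) * t + tau * s for t, s in zip(target, source)]

target = [0.0, 0.0]    # toy target-network parameters
source = [1.0, -2.0]   # toy online-network parameters
tau, period = 0.05, 1  # the defaults listed in the signature above

for step in range(1, 4):
    # Only update every `period` gradient steps.
    if step % period == 0:
        target = soft_update(target, source, tau)
```

After n updates with a fixed source, the target has covered a fraction 1 - (1 - tau)^n of the distance to the source, so a small tau yields a slowly moving target.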
- after_update(root_inputs, info)[source]#
Do things after completing one gradient update (i.e. update_with_gradient()). This function can be used for post-processing following one minibatch update, such as copying a training model to a target model in SAC, DQN, etc.
- Parameters
root_inputs (nest) – temporally batched inputs for the rollout_step() of the root algorithm collected during unroll().
info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() for on-policy training or train_step() for off-policy training.
- calc_loss(info)[source]#
Calculate the loss at each step for each sample.
- Parameters
info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).
- Returns
loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).
- Return type
- predict_step(time_step, state)[source]#
Predict for one step of observation.
This is only used for evaluation, so it only needs to perform the computations for generating the action distribution.
- Parameters
time_step (TimeStep) – Current observation and other inputs for computing action.
state (nested Tensor) – should be consistent with predict_state_spec
- Returns
output (nested Tensor): should be consistent with action_spec.
state (nested Tensor): should be consistent with predict_state_spec.
- Return type
- rollout_step(time_step, state)[source]#
Rollout for one step of inputs.
It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for calc_loss(). For off-policy training, it needs to generate necessary information for train_step().
- Parameters
inputs (nested Tensor) – inputs for prediction.
state (nested Tensor) – network state (for RNN).
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match rollout_state_spec.
info (nested Tensor): For on-policy training it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored into the replay buffer and retrieved for train_step() as rollout_info.
- Return type
- train_step(inputs, state, rollout_info)[source]#
Perform one step of training computation.
It is called to calculate output for every time step for a batch of experience from the replay buffer. It also needs to generate necessary information for calc_loss().
- Parameters
inputs (nested Tensor) – inputs for train.
state (nested Tensor) – consistent with train_state_spec.
rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from the replay buffer.
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match train_state_spec.
info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.
- Return type
- training: bool#
- class MdqAlphaInfo(alpha_loss, neg_entropy)#
Bases:
tuple
Create new instance of MdqAlphaInfo(alpha_loss, neg_entropy)
- alpha_loss#
Alias for field number 0
- neg_entropy#
Alias for field number 1
- class MdqCriticInfo(critic_free_form, target_critic_free_form, critic_adv_form, distill_target, kl_wrt_prior)#
Bases:
tuple
Create new instance of MdqCriticInfo(critic_free_form, target_critic_free_form, critic_adv_form, distill_target, kl_wrt_prior)
- critic_adv_form#
Alias for field number 2
- critic_free_form#
Alias for field number 0
- distill_target#
Alias for field number 3
- kl_wrt_prior#
Alias for field number 4
- target_critic_free_form#
Alias for field number 1
- class MdqCriticState(critic, target_critic)#
Bases:
tuple
Create new instance of MdqCriticState(critic, target_critic)
- critic#
Alias for field number 0
- target_critic#
Alias for field number 1
- class MdqInfo(reward, step_type, discount, action, critic, alpha)#
Bases:
tuple
Create new instance of MdqInfo(reward, step_type, discount, action, critic, alpha)
- action#
Alias for field number 3
- alpha#
Alias for field number 5
- critic#
Alias for field number 4
- discount#
Alias for field number 2
- reward#
Alias for field number 0
- step_type#
Alias for field number 1
alf.algorithms.merlin_algorithm#
Implementation of the MERLIN algorithm. See class MerlinAlgorithm for details.
- class MBPLossInfo(decoder, vae)#
Bases:
tuple
Create new instance of MBPLossInfo(decoder, vae)
- decoder#
Alias for field number 0
- vae#
Alias for field number 1
- class MBPState(latent_vector, mem_readout, rnn_state, memory)#
Bases:
tuple
Create new instance of MBPState(latent_vector, mem_readout, rnn_state, memory)
- latent_vector#
Alias for field number 0
- mem_readout#
Alias for field number 1
- memory#
Alias for field number 3
- rnn_state#
Alias for field number 2
- class MemoryBasedActor(observation_spec, action_spec, memory, reward_spec=TensorSpec(shape=(), dtype=torch.float32), epsilon_greedy=None, num_read_keys=1, lstm_size=(256, 256), latent_dim=200, loss=None, loss_class=<class 'alf.algorithms.actor_critic_loss.ActorCriticLoss'>, loss_weight=1.0, debug_summaries=False, name='mba')[source]#
Bases:
alf.algorithms.on_policy_algorithm.OnPolicyAlgorithm
The policy module for MERLIN model.
- Parameters
observation_spec (nested TensorSpec) – representing the observations.
action_spec (nested BoundedTensorSpec) – representing the actions.
memory (MemoryWithUsage) – the memory module from MemoryBasedPredictor
reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).
epsilon_greedy (float) – a floating value in [0,1], representing the chance of action sampling instead of taking argmax. This can help prevent a dead loop in some deterministic environments like Breakout. Only used for evaluation. If None, its value is taken from alf.get_config_value(TrainerConfig.epsilon_greedy).
num_read_keys (int) – number of keys for reading memory.
latent_dim (int) – the dimension of the hidden representation of VAE.
lstm_size (list[int]) – size of lstm layers
loss (None|ActorCriticLoss) – an object for calculating the loss for reinforcement learning. If None, a default ActorCriticLoss will be used.
loss_class (type) – the class of the loss. The signature of its constructor: loss_class(debug_summaries)
name (str) – name of the algorithm.
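The epsilon_greedy parameter describes a choice between sampling from the action distribution and taking its argmax. The sketch below shows that rule for a single discrete action distribution; the function name and the toy probabilities are assumptions for illustration, and ALF's own implementation works on batched distributions.

```python
import random

def epsilon_greedy_action(probs, epsilon, rng):
    """With probability epsilon sample from probs, otherwise take argmax."""
    if rng.random() < epsilon:
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return max(range(len(probs)), key=probs.__getitem__)

rng = random.Random(0)
probs = [0.1, 0.7, 0.2]  # toy action distribution

greedy = epsilon_greedy_action(probs, 0.0, rng)   # always the argmax
sampled = epsilon_greedy_action(probs, 1.0, rng)  # always a random sample
```

With epsilon = 0 evaluation is fully deterministic; a small positive epsilon injects occasional sampled actions, which is what helps break dead loops in deterministic environments.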
- predict_step(time_step, state)[source]#
Predict for one step of observation.
This is only used for evaluation, so it only needs to perform the computations for generating the action distribution.
- Parameters
time_step (TimeStep) – Current observation and other inputs for computing action.
state (nested Tensor) – should be consistent with predict_state_spec
- Returns
output (nested Tensor): should be consistent with action_spec.
state (nested Tensor): should be consistent with predict_state_spec.
- Return type
- rollout_step(time_step, state)[source]#
Train one step.
- Parameters
time_step (TimeStep) – time_step.observation should be the latent vector.
state (nested Tensor) – state of the model
- training: bool#
- class MemoryBasedPredictor(action_spec, encoders, decoders, num_read_keys=3, lstm_size=(256, 256), latent_dim=200, memory_size=1350, loss_weight=1.0, name='mbp')[source]#
Bases:
alf.algorithms.algorithm.Algorithm
The Memory Based Predictor.
It’s described in: Wayne et al., “Unsupervised Predictive Memory in a Goal-Directed Agent”, arXiv:1803.10760
- Parameters
action_spec (nested BoundedTensorSpec) – representing the actions.
encoders (nested Network) – the nest should match observation_spec
decoders (nested Algorithm) – the nest should match observation_spec
num_read_keys (int) – number of keys for reading memory.
lstm_size (list[int]) – size of lstm layers for MBP and MBA
latent_dim (int) – the dimension of the hidden representation of VAE.
memory_size (int) – number of memory slots
loss_weight (float) – weight for the loss
name (str) – name of the algorithm.
- property memory#
Return the external memory of this module.
- predict_step(inputs, state)[source]#
Predict one step.
- Parameters
inputs (tuple) – a tuple of (observation, action).
state (nested Tensor) – RNN state
- Returns
output: latent vector
state: next state
info: empty tuple
- Return type
- train_step(inputs, state)[source]#
Train one step.
- Parameters
inputs (tuple) – a tuple of (observation, action).
- Returns
output: latent vector
state: next state
info (LossInfo): loss
- Return type
- training: bool#
- class MerlinAlgorithm(observation_spec, action_spec, encoders, decoders, reward_spec=TensorSpec(shape=(), dtype=torch.float32), env=None, config=None, latent_dim=200, lstm_size=(256, 256), memory_size=1350, rl_loss=None, optimizer=None, debug_summaries=False, name='Merlin')[source]#
Bases:
alf.algorithms.on_policy_algorithm.OnPolicyAlgorithm
MERLIN model.
This implements the MERLIN model described in Wayne et al “Unsupervised Predictive Memory in a Goal-Directed Agent” arXiv:1803.10760
Current differences:
No action encoding and decoding
No retroactive memory update
No prediction of state-action value
Value prediction does not use action distribution as feature.
No q-value prediction
Image encoding and decoding use batch-norm, which the paper did not use.
- Parameters
action_spec (nested BoundedTensorSpec) – representing the actions.
encoders (nested Network) – the nest should match observation_spec
decoders (nested Algorithm) – the nest should match observation_spec
reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).
env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. Running multiple environments in parallel is crucial to on-policy algorithms as it increases the diversity of data and decreases temporal correlation. env only needs to be provided to the root Algorithm.
config (TrainerConfig) – config for training. config only needs to be provided to the algorithm which performs train_iter() by itself.
latent_dim (int) – the dimension of the hidden representation of VAE.
lstm_size (list[int]) – size of lstm layers for MBP and MBA
memory_size (int) – number of memory slots
rl_loss (None|ActorCriticLoss) – an object for calculating the loss for reinforcement learning. If None, a default ActorCriticLoss will be used.
optimizer (torch.optim.Optimizer) – The optimizer for training.
debug_summaries – True if debug summaries should be created.
name (str) – name of the algorithm.
- predict_step(time_step, state)[source]#
Predict for one step of observation.
This is only used for evaluation, so it only needs to perform the computations for generating the action distribution.
- Parameters
time_step (TimeStep) – Current observation and other inputs for computing action.
state (nested Tensor) – should be consistent with predict_state_spec
- Returns
output (nested Tensor): should be consistent with action_spec.
state (nested Tensor): should be consistent with predict_state_spec.
- Return type
- training: bool#
- class MerlinInfo(mbp_info, mba_info)#
Bases:
tuple
Create new instance of MerlinInfo(mbp_info, mba_info)
- mba_info#
Alias for field number 1
- mbp_info#
Alias for field number 0
- class MerlinLossInfo(mba, mbp)#
Bases:
tuple
Create new instance of MerlinLossInfo(mba, mbp)
- mba#
Alias for field number 0
- mbp#
Alias for field number 1
- class MerlinState(mbp_state, mba_state)#
Bases:
tuple
Create new instance of MerlinState(mbp_state, mba_state)
- mba_state#
Alias for field number 1
- mbp_state#
Alias for field number 0
- class ResnetDecodingNetwork(input_tensor_spec, output_tensor_spec=TensorSpec(shape=(3, 64, 64), dtype=torch.float32), name='ResnetDecodingNetwork')[source]#
Bases:
alf.networks.network.Network
Image decoding network using ResNet bottleneck blocks.
This is not a generic network, it implements ImageDecoder described in 2.2.1 of “Unsupervised Predictive Memory in a Goal-Directed Agent”
- Parameters
input_tensor_spec (TensorSpec) – input latent spec.
output_tensor_spec (TensorSpec) – desired output shape. Height and width need to be divisible by 8.
- forward(observation, state=())[source]#
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- training: bool#
- class ResnetEncodingNetwork(input_tensor_spec, output_size=500, output_activation=<built-in method tanh of type object>, use_fc_bn=False, norm_layer=None, name='ResnetEncodingNetwork')[source]#
Bases:
alf.networks.network.Network
Image encoding network using ResNet bottleneck blocks.
This is not a generic network, it implements ImageEncoder described in 2.1.1 of “Unsupervised Predictive Memory in a Goal-Directed Agent”
- Parameters
input_tensor_spec (nested TensorSpec) – input observations spec.
output_size (int) – dimension of the encoding result
output_activation (Callable) – activation for the output
use_fc_bn (bool) – whether to use batch normalization for the final FC layer.
norm_layer (nn.Module|None) – optional additional layer for normalization.
- forward(observation, state=())[source]#
Defines the computation performed at every call.
Should be overridden by all subclasses.
Note
Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.
- training: bool#
alf.algorithms.mi_estimator#
Mutual Information Estimator.
- class MIEstimator(x_spec, y_spec, model=None, fc_layers=(256), sampler='buffer', buffer_size=65536, optimizer=None, estimator_type='DV', averager=None, name='MIEstimator')[source]#
Bases:
alf.algorithms.algorithm.Algorithm
Mutual Information Estimator.
Implements several mutual information estimators from:
Belghazi et al., Mutual Information Neural Estimation
Hjelm et al., Learning Deep Representations by Mutual Information Estimation and Maximization
Currently, 4 types of estimators are implemented, which are based on the following variational lower bounds:
DV: \(\sup_T E_P(T) - \log E_Q(\exp(T))\)
KLD: \(\sup_T E_P(T) - E_Q(\exp(T)) + 1\)
JSD: \(\sup_T -E_P(\mathrm{softplus}(-T)) - E_Q(\mathrm{softplus}(T)) + \log(4)\)
ML: \(\sup_q E_P(\log(q(y|x)) - \log(P(y)))\)
where P is the joint distribution of X and Y, and Q is the product marginal distribution of P. Both DV and KLD are lower bounds for \(KLD(P||Q)=MI(X, Y)\). However, JSD is not a lower bound for mutual information, it is a lower bound for \(JSD(P||Q)\), which is closely correlated with MI as pointed out in Hjelm et al.
For ML, \(P(y)\) is the marginal distribution of y, and it needs to be provided. The current implementation uses a normal distribution with diagonal variance for \(q(y|x)\), so it only supports continuous y. If \(P(y|x)\) can be reasonably approximated as a diagonal normal distribution and \(P(y)\) is known, then ‘ML’ may give a better estimate of the mutual information.
Assuming the function class of T is rich enough to represent any function, for KLD and JSD, T will converge to \(\log(\frac{P}{Q})\) and hence \(E_P(T)\) can also be used as an estimator of \(KLD(P||Q)=MI(X,Y)\). For DV, \(T\) will converge to \(\log(\frac{P}{Q}) + c\), where \(c=\log E_Q(\exp(T))\).
Among DV, KLD and JSD, DV and KLD seem to give a better estimate of PMI than JSD. However, JSD might be numerically more stable than DV and KLD because it uses softplus instead of exp, and DV is more stable than KLD because of the logarithm.
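The DV, KLD and JSD bounds can be checked by hand on a small discrete example. The sketch below uses a made-up 2x2 joint distribution and plugs in the optimal critic \(T=\log(P/Q)\) directly instead of training a network, so it only illustrates the formulas, not the estimator's training procedure. All names are hypothetical.

```python
import math

# Hypothetical joint distribution P(x, y) over two binary variables.
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
px = {x: sum(p for (a, _), p in joint.items() if a == x) for x in (0, 1)}
py = {y: sum(p for (_, b), p in joint.items() if b == y) for y in (0, 1)}
# Q is the product of the marginals.
prod = {(x, y): px[x] * py[y] for (x, y) in joint}

def softplus(t):
    return math.log1p(math.exp(t))

def expect(dist, f):
    """Expectation of f under a discrete distribution."""
    return sum(p * f(k) for k, p in dist.items())

# Optimal critic for KLD and JSD: T = log(P / Q).
T = {k: math.log(joint[k] / prod[k]) for k in joint}

mi = expect(joint, lambda k: T[k])  # KL(P || Q) = MI(X, Y)
dv = mi - math.log(expect(prod, lambda k: math.exp(T[k])))
kld = mi - expect(prod, lambda k: math.exp(T[k])) + 1
jsd = (-expect(joint, lambda k: softplus(-T[k]))
       - expect(prod, lambda k: softplus(T[k])) + math.log(4))

# DV is unchanged when a constant c is added to T.
c = 0.7
dv_shifted = (expect(joint, lambda k: T[k] + c)
              - math.log(expect(prod, lambda k: math.exp(T[k] + c))))

# Reference for the JSD bound: KL(P || M) + KL(Q || M) with M = (P + Q) / 2.
mix = {k: 0.5 * (joint[k] + prod[k]) for k in joint}
jsd_ref = (expect(joint, lambda k: math.log(joint[k] / mix[k]))
           + expect(prod, lambda k: math.log(prod[k] / mix[k])))
```

At the optimal critic, DV and KLD both equal the mutual information, while the JSD expression equals KL(P||M) + KL(Q||M), which is twice the Jensen-Shannon divergence under the usual 1/2-weighted definition.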
Several strategies are implemented in order to estimate \(E_Q(\cdot)\):
‘buffer’: store \(y\) to a buffer and randomly retrieve samples from the buffer.
‘double_buffer’: store both \(x\) and \(y\) to buffers and randomly retrieve samples from the two buffers.
‘shuffle’: randomly shuffle batch \(y\)
‘shift’: shift batch \(y\) by one sample, i.e. torch.cat([y[-1:, ...], y[0:-1, ...]], dim=0)
direct sampling: You can also provide the marginal distribution of \(y\) to train_step(). In this case, sampler is ignored and samples of \(y\) for estimating \(E_Q(.)\) are sampled from y_distribution.
If you need the gradient of \(y\), you should use sampler ‘shift’ or ‘shuffle’.
Among these, ‘buffer’ and ‘shift’ seem to perform better and ‘shuffle’ performs worst. ‘buffer’ incurs additional storage cost. ‘shift’ has the assumption that y samples from one batch are independent. If the additional memory is not a concern, we recommend ‘buffer’ sampler so that there is no need to worry about the assumption of independence.
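The ‘shift’ and ‘shuffle’ samplers can be pictured on plain Python lists standing in for a batch. This is a sketch only: the helper names are hypothetical, and the real samplers operate on tensors along the batch dimension.

```python
import random

def shift(batch):
    # List analogue of torch.cat([y[-1:, ...], y[0:-1, ...]], dim=0).
    return batch[-1:] + batch[:-1]

def shuffle(batch, rng):
    # Return a randomly permuted copy; the input batch is left untouched.
    out = list(batch)
    rng.shuffle(out)
    return out

x = ["x0", "x1", "x2", "x3"]
y = ["y0", "y1", "y2", "y3"]

# Each x_i is paired with y_{i-1}, so no pair comes from the same sample.
shifted_pairs = list(zip(x, shift(y)))
shuffled_y = shuffle(y, random.Random(0))
```

Shifting guarantees that every pair is mismatched, which approximates sampling from the product of marginals as long as samples within a batch are independent. A random shuffle may leave some elements in place, so some pairs may still come from the joint distribution.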
MIEstimator can also be used to estimate conditional mutual information \(MI(X,Y|Z)\) using KLD, JSD or ML. In this case, you should let x represent \(X\) and \(Z\), and y represent \(Y\). And when calling train_step(), you need to provide y_distribution, which is the distribution \(P(Y|z)\). Note that DV cannot be used for estimating conditional mutual information. See mi_estimator_test.py for an example.
- Parameters
x_spec (nested TensorSpec) – spec of x
y_spec (nested TensorSpec) – spec of y
model (Network) – can be called as model([x, y]) and return a Tensor with shape=[batch_size, 1]. If None, a default MLP with fc_layers will be created.
fc_layers (tuple[int]) – size of hidden layers. Only used if model is None.
sampler (str) – type of sampler used to get samples from marginal distribution, should be one of ['buffer', 'double_buffer', 'shuffle', 'shift'].
buffer_size (int) – capacity of buffer for storing y for sampler ‘buffer’ and ‘double_buffer’.
optimizer (torch.optim.Optimizer) – optimizer
estimator_type (str) – one of ‘DV’, ‘KLD’, ‘JSD’ or ‘ML’
averager (EMAverager) – averager used to maintain a moving average of \(exp(T)\). Only used for ‘DV’ estimator. If None, a ScalarAdaptiveAverager will be created.
name (str) – name of this estimator
- calc_pmi(x, y, y_distribution=None)[source]#
Return estimated pointwise mutual information.
The pointwise mutual information is defined as:
\[\log \frac{P(x|y)}{P(x)} = \log \frac{P(y|x)}{P(y)}\]
- Parameters
x (Tensor) – x
y (Tensor) – y
y_distribution (DiagMultivariateNormal) – needs to be provided for ‘ML’ estimator.
- Returns
pointwise mutual information between x and y.
- Return type
Tensor
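The two forms of the pointwise mutual information in the definition above are equal by Bayes' rule. A small numeric check on a made-up discrete joint distribution (all names and numbers are hypothetical; calc_pmi itself returns a learned estimate rather than this exact value):

```python
import math

# Hypothetical joint distribution P(x, y).
joint = {("a", "u"): 0.3, ("a", "v"): 0.2, ("b", "u"): 0.1, ("b", "v"): 0.4}

# Marginals P(x) and P(y).
px, py = {}, {}
for (x, y), p in joint.items():
    px[x] = px.get(x, 0.0) + p
    py[y] = py.get(y, 0.0) + p

def pmi_via_x_given_y(x, y):
    # log P(x|y) / P(x)
    return math.log((joint[(x, y)] / py[y]) / px[x])

def pmi_via_y_given_x(x, y):
    # log P(y|x) / P(y)
    return math.log((joint[(x, y)] / px[x]) / py[y])

# Averaging PMI under the joint distribution gives the mutual information.
mi = sum(p * pmi_via_x_given_y(x, y) for (x, y), p in joint.items())
```

Individual PMI values can be negative (the pair co-occurs less often than under independence), while their expectation under the joint, the mutual information, is non-negative.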
- train_step(inputs, y_distribution=None, state=None)[source]#
Perform training on one batch of inputs.
- Parameters
inputs (tuple(nested Tensor, nested Tensor)) – tuple of x and y
y_distribution (nested td.Distribution) – distribution for the marginal distribution of y. If None, will use the sampling method sampler provided at constructor to generate the samples for the marginal distribution of \(Y\).
state – not used
- Returns
outputs (Tensor): shape is [batch_size], its mean is the estimated MI for estimator ‘ML’, ‘DV’ and ‘KLD’, and Jensen-Shannon divergence for estimator ‘JSD’
state: not used
info (LossInfo): info.loss is the loss
- Return type
- training: bool#
alf.algorithms.monet_algorithm#
- class MoNetAlgorithm(n_slots, slot_size, input_tensor_spec, attention_unet_cls=<class 'alf.algorithms.monet_algorithm.MoNetUNet'>, encoder_cls=<class 'alf.networks.encoding_networks.EncodingNetwork'>, decoder_cls=<function SpatialBroadcastDecodingNetwork>, recurrent_attention=True, beta=0.0, gamma=0.0, name='MoNetAlgorithm')[source]#
Bases:
alf.algorithms.algorithm.Algorithm
Implement the MoNet algorithm in the paper:
Burgess et al. 2019, MONet: Unsupervised Scene Decomposition and Representation
The algorithm can be thought of as one kind of VAEs except that it’s expected to produce object-centric posterior latent embeddings.
We follow the exact form of image reconstruction loss in the paper. For each pixel, the mask values are the component weights of a GMM, and the predicted pixel values are the means of the GMM (log of weighted probs). Another implementation https://github.com/stelzner/monet uses an upper bound of this loss, where the mask values are weights of the mean square errors between a pixel and its predicted values (weighted log probs).
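The difference between the two reconstruction losses can be seen on a single scalar pixel. The sketch below compares the mixture negative log-likelihood ("log of weighted probs") with the mask-weighted negative log-likelihood ("weighted log probs"); by Jensen's inequality the second is an upper bound of the first. The pixel value, masks, means and sigmas are made-up numbers, and the helper names are hypothetical.

```python
import math

def normal_pdf(x, mean, sigma):
    z = (x - mean) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def mixture_nll(x, masks, means, sigmas):
    # -log sum_k m_k N(x; mu_k, sigma_k): log of weighted probs.
    return -math.log(sum(m * normal_pdf(x, mu, s)
                         for m, mu, s in zip(masks, means, sigmas)))

def weighted_nll(x, masks, means, sigmas):
    # -sum_k m_k log N(x; mu_k, sigma_k): weighted log probs.
    return -sum(m * math.log(normal_pdf(x, mu, s))
                for m, mu, s in zip(masks, means, sigmas))

pixel = 0.6
masks = [0.7, 0.2, 0.1]      # per-slot mask values for this pixel (sum to 1)
means = [0.55, 0.1, 0.9]     # per-slot predicted pixel values
sigmas = [0.09, 0.11, 0.11]  # per-slot fixed sigmas (hypothetical values)

exact = mixture_nll(pixel, masks, means, sigmas)
bound = weighted_nll(pixel, masks, means, sigmas)
```

The two losses coincide when the mask is one-hot. Otherwise the weighted form penalizes every slot with a non-zero mask for mispredicting the pixel, even if another slot already explains it well.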
We also support generating attention masks all at once, which could speed up the attention process if the number of slots is large. However, we do observe that the recurrent process usually gives better performance than this one-time process.
Each slot has a different pre-assigned fixed sigma for its Gaussian model. The sigmas are automatically generated. The unequal sigmas are crucial for breaking symmetry when generating attention masks for the slots.
- Parameters
n_slots (int) – number of slots (or objects) pre-defined. Note that background is also counted as an “object”.
slot_size (int) – the dimension of each slot embedding.
input_tensor_spec (Union[TensorSpec,List[ForwardRef],Tuple[()],Tuple[ForwardRef, …],Dict[str,ForwardRef]]) – the spec of input images
attention_unet_cls (Callable) – creates the attention UNet that generates masks for the slots. Depending on the value of recurrent_attention, this UNet’s input and output channels might change. The user doesn’t need to specify the input and output specs for this UNet, as it is automatically handled by the algorithm.
If recurrent_attention==True, this UNet receives RGB+attention_scope and outputs attention logits for the current iteration. Input shape: [B,C+1,H,W]; output shape: [B,2,H,W].
Otherwise it receives RGB and outputs n_slots channels (all attention logits). Input shape: [B,C,H,W]; output shape: [B,n_slots,H,W].
In either case, the UNet’s output should be non-activated.
encoder_cls (Callable) – creates the posterior encoder of MoNet. Note that this encoder operates on each individual slot independently, and thus it’s invariant to the slot order. For each slot, the encoder accepts a concatenation of the image and an attention mask for the slot, in a shape of [B,C+1,H,W]. The encoder outputs a non-activated vector of shape [B,2*slot_size], representing the mean and log variance of the slot Gaussian posterior.
decoder_cls (Callable) – creates the decoder of MoNet. The decoder also operates on each individual slot independently, and it should reconstruct both the image (the part masked by the attention; 3 channels) and the attention mask input to the encoder (1 channel). The output should be non-activated. Input shape: [B,slot_size]; output shape: [B,C+1,H,W].
recurrent_attention (bool) – if True, recurrently generates attention masks where each iteration conditions on the scope as the remaining attention; otherwise all attention masks are generated at once.
beta (float) – weight for the VAE KLD term; sometimes this KLD can be ignored.
gamma (float) – weight for the KLD between generated attention masks and the reconstructed masks. A positive value might help make the masks more regular and compact.
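The recurrent attention process can be sketched for a single pixel as a stick-breaking scheme, following the description in the MONet paper: each step claims a fraction of the remaining scope, and the last slot takes what is left. This is a hedged illustration with made-up attention probabilities and a hypothetical helper name; the actual implementation works on full image tensors and may compute the same quantities in log space.

```python
def recurrent_masks(alphas):
    """Turn per-step attention probabilities into slot masks for one pixel.

    alphas holds one probability in [0, 1] per step, for all slots but
    the last one.
    """
    scope = 1.0   # attention not yet claimed by any slot
    masks = []
    for a in alphas:
        masks.append(scope * a)  # this slot claims a fraction of the scope
        scope *= 1.0 - a         # the rest is passed on as the new scope
    masks.append(scope)          # the last slot takes whatever remains
    return masks

masks = recurrent_masks([0.5, 0.5, 0.2])  # four slots
```

By construction the masks are non-negative and sum to one at every pixel, so they can serve directly as mixture weights.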
- calc_loss(info)[source]#
Calculate the loss at each step for each sample.
- Parameters
info (nest) – information collected for training. It is batched from each
AlgStep.inforeturned byrollout_step()(on-policy training) ortrain_step()(off-policy training).- Returns
- loss at each time step for each sample in the
batch. The shapes of the tensors in loss info should be \((T, B)\).
- Return type
- train_step(inputs, state=())[source]#
Run a training step of MoNet.
- Parameters
inputs (
Tensor) – the input image- Returns
- output (VAEOutput): contains the rsampled posterior z and the mode of the posterior distribution z_mode.
state: empty
- info (MoNetInfo):
loss: the overall loss
kld: KL divergence between posterior and prior (before beta)
rec_loss: image reconstruction loss
mask_rec_loss: mask reconstruction loss (before gamma)
full_rec: the fully reconstructed image from all slots (shape [B,C,H,W])
mask: the attention masks output by the attention network (note: not the reconstructed ones; shape [B,slots,H,W])
z_dist: the posterior distribution
- Return type
- training: bool#
- class MoNetInfo(kld, rec_loss, mask_rec_loss, full_rec, mask, z_dist)#
Bases:
tupleCreate new instance of MoNetInfo(kld, rec_loss, mask_rec_loss, full_rec, mask, z_dist)
- full_rec#
Alias for field number 3
- kld#
Alias for field number 0
- mask#
Alias for field number 4
- mask_rec_loss#
Alias for field number 2
- rec_loss#
Alias for field number 1
- z_dist#
Alias for field number 5
- class MoNetUNet(input_tensor_spec, filters, nonskip_fc_layers, output_channels, name='MoNetUNet')[source]#
Bases:
alf.networks.network.NetworkImplement the UNet architecture used by MoNet. See Appendix B.2 of the MoNet paper https://arxiv.org/abs/1901.11390 for details.
The architecture is slightly different from the one in the paper, where for the downsampling path, we don’t downsample for the first block but always downsample for the other blocks. For an illustration,
                 (img) 16                 16 (output)
            (3x3 conv) |       skip       | (3x3 conv + 1x1 conv)
                       16     ---->      16
(3x3 conv + maxpool 2) |       skip       | (3x3 conv + upsample 2)
                       8      ----->      8
(3x3 conv + maxpool 2) |       skip       | (3x3 conv + upsample 2)
                       4      ----->      4
                        \                /
                               MLP
- Parameters
input_tensor_spec (
Union[TensorSpec,List[ForwardRef],Tuple[()],Tuple[ForwardRef, …],Dict[str,ForwardRef]]) – spec of the input imagefilters (
Tuple[int]) – a tuple of output channels along the downsampling path, each for a conv layer. The upsampling path uses a reversed tuple.nonskip_fc_layers (
Tuple[int]) – a tuple of fc layer sizes for the bottleneck connection (nonskip) of the UNet.output_channels (
int) – final output channels. The output features are non-activated.
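To make the downsampling rule concrete, here is a small sketch of the spatial size after each block on the downsampling path. It assumes one block per entry in `filters`, that max-pooling halves the size exactly, and that the input size is divisible accordingly; the helper name `unet_spatial_sizes` is hypothetical and not part of ALF.

```python
def unet_spatial_sizes(input_size, num_blocks):
    """Spatial size after each downsampling-path block.

    The first block keeps the resolution; every later block halves it
    (maxpool 2), mirroring the illustration above.
    """
    sizes = []
    size = input_size
    for i in range(num_blocks):
        if i > 0:
            size //= 2  # maxpool 2 in every block except the first
        sizes.append(size)
    return sizes

print(unet_spatial_sizes(16, 3))
```

For a 16x16 input and three blocks this reproduces the 16, 8, 4 sequence in the illustration; the upsampling path would visit the same sizes in reverse.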
- forward(inputs, state=())[source]#
Do a forward step of the UNet.
- Parameters
inputs (
Tensor) – the input image of shape[B,C,H,W]whereCcan be any value.- Returns
- output: an output image of the shape [B,K,H,W], where K is output_channels. The output image is non-activated.
state: empty
- Return type
tuple
- training: bool#
alf.algorithms.muzero_algorithm#
MuZero algorithm.
- class MuzeroAlgorithm(observation_spec, action_spec, discount, reward_spec=TensorSpec(shape=(), dtype=torch.float32), representation_learner_ctor=<class 'alf.algorithms.muzero_representation_learner.MuzeroRepresentationImpl'>, mcts_algorithm_ctor=<class 'alf.algorithms.mcts_algorithm.MCTSAlgorithm'>, reward_transformer=None, config=None, enable_amp=True, checkpoint=None, debug_summaries=False, name='MuZero')[source]#
Bases:
alf.algorithms.off_policy_algorithm.OffPolicyAlgorithmMuZero algorithm. MuZero is described in the paper: Schrittwieser et al. Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model.
This is a wrapper that combines two sub-algorithm components:
1. A MuZero-style representation learner. The representation learner employs an MCTSModel to train a translation from a raw observation to its latent representation. The model is also used to predict the reward, values, policy, etc., which will be used in the MCTS algorithm.
2. An MCTS-based policy algorithm. It performs tree search using the model provided by the representation learner to give the final policy on each predict and rollout step.
NOTE: Currently, the MCTS-based policy algorithm is assumed to NOT have any learnable parameters. This means that training will only update the parameters of the underlying model in the representation learner, and training-related hooks such as train_step() and preprocess_experience() will delegate directly to their counterparts in the representation learner. This behavior can be changed if needed in the future.
- Parameters
observation_spec (TensorSpec) – representing the observations.
action_spec (BoundedTensorSpec) – representing the actions.
representation_learner_ctor (
Callable[…,MuzeroRepresentationImpl]) – It will be called to construct a MuZero-style representation learner. It is expected to be called asrepresentation_learner_ctor(observation_spec=?, action_spec=?, reward_spec=?, discount=?, reward_transformer=?, enable_amp=?, config=?, debug_summaries=?, name=?).mcts_algorithm_ctor (
Callable[…,MCTSAlgorithm]) – will be called as mcts_algorithm_ctor(observation_spec=?, action_spec=?, discount=?, debug_summaries=?, name=?) to construct an MCTSAlgorithm instance. The constructed MCTS algorithm is assumed to have no learnable parameters. It also relies on the model from the representation learner to run MCTS.
reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).
reward_transformer (Callable|None) – if provided, will be used to transform reward.
config (
Optional[TrainerConfig]) – The trainer config that will eventually be assigned toself._config.enable_amp (
bool) – whether to use automatic mixed precision for inference. This usually makes the algorithm run faster. However, the result may be different (most likely due to random fluctuation). Note that rollout_step is exempt from using AMP.
checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to
Algorithmfor more details.debug_summaries (bool) –
name (str) –
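As an illustration of the “prefix@path” checkpoint format, the sketch below splits such a string at the first `@`. It only demonstrates the string layout and is not how ALF parses the argument; the helper name `split_checkpoint_spec`, the error handling, and the example prefix and path are all hypothetical.

```python
def split_checkpoint_spec(spec):
    """Split a 'prefix@path' string into its two parts.

    Assumption: the first '@' separates the prefix from the file path.
    """
    prefix, sep, path = spec.partition("@")
    if not sep:
        raise ValueError("expected a string of the form 'prefix@path'")
    return prefix, path

# Hypothetical values, for illustration only.
prefix, path = split_checkpoint_spec("alg._rl_algorithm@/tmp/ckpt/ckpt-100")
print(prefix, path)
```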
- after_update(root_inputs, info)[source]#
Do things after completing one gradient update (i.e.
update_with_gradient()). This function can be used for post-processings following one minibatch update, such as copy a training model to a target model in SAC, DQN, etc.- Parameters
root_inputs (nest) – temporally batched inputs for the
rollout_step()of the root algorithm collected duringunroll().info (nest) – information collected for training. It is batched from each
AlgStep.inforeturned byrollout_step()for on-policy training ortrain_step()for off-policy training.
- calc_loss(info)[source]#
Calculate the loss at each step for each sample.
- Parameters
info (nest) – information collected for training. It is batched from each
AlgStep.inforeturned byrollout_step()(on-policy training) ortrain_step()(off-policy training).- Returns
- loss at each time step for each sample in the
batch. The shapes of the tensors in loss info should be \((T, B)\).
- Return type
- predict_step(time_step, state)[source]#
Predict for one step of observation.
This is only used for evaluation, so it only needs to perform the computations for generating the action distribution.
- Parameters
time_step (TimeStep) – Current observation and other inputs for computing action.
state (nested Tensor) – should be consistent with predict_state_spec
- Returns
output (nested Tensor): should be consistent with
action_spec.state (nested Tensor): should be consistent with
predict_state_spec.
- Return type
- preprocess_experience(root_inputs, rollout_info, batch_info)[source]#
This function is called on the experiences obtained from a replay buffer. An example usage of this function is to calculate advantages and returns in
PPOAlgorithm.The shapes of tensors in experience are assumed to be \((B, T, ...)\).
- Parameters
root_inputs (nest) – input for rollout_step() of the root algorithm. This is from replay buffer. Note this is not same as the input of rollout_step() of self unless self is the root algorithm.
rollout_info (nested Tensor) –
AlgStep.infofrom rollout_step() for this algorithm.batch_info (BatchInfo) – information about this batch of data
- Returns
processed root_inputs
processed rollout_info
- Return type
tuple
- rollout_step(time_step, state)[source]#
Rollout for one step of inputs.
It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for
calc_loss(). For off-policy training, it needs to generate necessary information fortrain_step().- Parameters
inputs (nested Tensor) – inputs for prediction.
state (nested Tensor) – network state (for RNN).
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match
rollout_state_spec.info (nested Tensor): For on-policy training it will be temporally batched and passed as
info for calc_loss(). For off-policy training, it will be stored in the replay buffer and later retrieved for train_step() as rollout_info.
- Return type
- set_path(path)[source]#
Set the path from the root algorithm to this algorithm.
See
AlgorithmInterface.pathfor description about path. This function is called by the trainer before training starts. It needs to be implemented if the algorithm contains some other sub-algorithms.If an algorithm does not have any sub-algorithm or its sub-algorithm does not need to access the root replay buffer directly, it does not implement this function.
- train_step(exp, state, rollout_info)[source]#
Perform one step of training computation.
It is called to calculate output for every time step for a batch of experience from replay buffer. It also needs to generate necessary information for
calc_loss().- Parameters
inputs (nested Tensor) – inputs for train.
state (nested Tensor) – consistent with
train_state_spec.rollout_info (nested Tensor) – info from
rollout_step(). It is retrieved from replay buffer.
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match
train_state_spec.
info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.
- Return type
- training: bool#
alf.algorithms.muzero_representation_learner#
MuZero algorithm.
- class LinearTdStepFunc(max_bootstrap_age, min_td_steps=1)[source]#
Bases:
object
Linearly decrease td steps from max_td_steps to min_td_steps based on the age of a sample.
If the age of a sample is more than max_bootstrap_age, its td steps will be min_td_steps. This is the “dynamic horizon” trick described in the paper Mastering Atari Games with Limited Data.
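A scalar sketch of the linear schedule may help. It is written for plain floats rather than tensors and is not the ALF implementation: the function name `linear_td_steps` is hypothetical, `max_td_steps` is assumed to be supplied by the caller at call time, and rounding up with `ceil` is an assumption.

```python
import math

def linear_td_steps(age, max_td_steps, max_bootstrap_age, min_td_steps=1):
    """Interpolate td steps linearly from max_td_steps (age 0) down to
    min_td_steps (age >= max_bootstrap_age)."""
    if age >= max_bootstrap_age:
        return min_td_steps
    frac = age / max_bootstrap_age
    steps = max_td_steps - (max_td_steps - min_td_steps) * frac
    # Assumption: round up and never go below min_td_steps.
    return max(min_td_steps, math.ceil(steps))

for age in (0.0, 0.25, 0.6):
    print(age, linear_td_steps(age, max_td_steps=5, max_bootstrap_age=0.5))
```

Fresh samples bootstrap over the full horizon, while samples older than `max_bootstrap_age` fall back to the shortest horizon, where stale off-policy rewards contribute the least.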
- class MuzeroInfo(action, value, target, loss)#
Bases:
tupleCreate new instance of MuzeroInfo(action, value, target, loss)
- action#
Alias for field number 0
- loss#
Alias for field number 3
- target#
Alias for field number 2
- value#
Alias for field number 1
- class MuzeroRepresentationImpl(observation_spec, action_spec, model_ctor, num_unroll_steps, td_steps, discount, reward_spec=TensorSpec(shape=(), dtype=torch.float32), recurrent_gradient_scaling_factor=0.5, reward_transformer=None, calculate_priority=None, train_reward_function=True, train_game_over_function=True, train_repr_prediction=False, train_policy=True, reanalyze_algorithm_ctor=None, reanalyze_ratio=0.0, reanalyze_td_steps=5, reanalyze_td_steps_func=None, reanalyze_batch_size=None, full_reanalyze=False, priority_func="lambda loss_info: loss_info.extra['value'].sqrt().sum(dim=0)", data_transformer_ctor=None, data_augmenter=None, target_update_tau=1.0, target_update_period=1000, config=None, enable_amp=True, random_action_after_episode_end=False, optimizer=None, checkpoint=None, debug_summaries=False, name='MuzeroRepresentationImpl')[source]#
Bases:
alf.algorithms.off_policy_algorithm.OffPolicyAlgorithmMuZero-style Representation Learner.
MuZero is described in the paper: Schrittwieser et al. Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model.
The pseudocode can be downloaded from https://arxiv.org/src/1911.08265v2/anc/pseudocode.py
This representation learner trains the underlying MCTSModel to
1) Most importantly, produce a latent representation from an observation
2) Predict the next latent representation given the current latent + an action
3) Predict various targets (e.g. reward, value)
Among the above, 1) can be used as the representation in combination with another RL algorithm; 2) and 3) can be used in policy improvements that require a predictive model (e.g. Monte Carlo Tree Search).
The model is trained with supervision on target prediction in 2) and 3). Some of the targets may be computed with the reanalyze component. Please refer to the original MuZero paper and the following paper for details.
Online and Offline Reinforcement Learning by Planning with a Learned Model.
- Parameters
observation_spec (TensorSpec) – representing the observations.
action_spec (BoundedTensorSpec) – representing the actions.
model_ctor (Callable) – will be called as
model_ctor(observation_spec=?, action_spec=?, debug_summaries=?)to construct the model. The model should follow the interfacealf.algorithms.mcts_models.MCTSModel.num_unroll_steps (
int) – steps for unrolling the model during training.td_steps (
int) – the number of steps to bootstrap into the future when calculating the discounted return. -1 means to bootstrap to the end of the game. This can only be used for environments whose rewards are zero except for the last step, as the current implementation only uses the reward at the last step to calculate the return.
reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).
recurrent_gradient_scaling_factor (float) – the gradient going through model.recurrent_inference is scaled by this factor. This is suggested in Appendix G of the MuZero paper.
reward_transformer (Callable|None) – if provided, will be used to transform reward.
calculate_priority (bool) – whether to calculate priority. If not provided, will be same as
TrainerConfig.priority_replay. This is only useful if priority replay is enabled.
train_reward_function (bool) – whether to train the reward function. If False, reward should only be given at the last step of an episode.
train_game_over_function (bool) – whether to train the game-over function.
train_repr_prediction (bool) – whether to train to predict future latent representation.
train_policy (bool) – whether to train a policy. Note that training policy is REQUIRED when the model is used in MCTS algorithm.
reanalyze_algorithm_ctor (Callable) – will be called as
reanalyze_algorithm_ctor(observation_spec=?, action_spec=?, discount=?, debug_summaries=?, name=?) to construct an Algorithm instance for reanalyze. It can also optionally accept an additional argument ‘model’. If so, a model constructed using model_ctor will be passed to the constructor.
reanalyze_ratio (float) – a float in [0., 1.], the fraction of data retrieved from the replay buffer to reanalyze. Reanalyzing means using a recent model to calculate the value and policy targets.
reanalyze_td_steps (int) – the n for the n-step return for reanalyzing.
reanalyze_td_steps_func (Callable) – If provided, will be called as reanalyze_td_steps_func(sample_age, reanalyze_td_steps, current_max_age) to calculate the td_steps in reanalyze. sample_age is a Tensor whose elements are between 0 and 1 indicating the age of each sample. The age of the latest sample is 0. The age of the sample collected at the beginning of the training is current_max_age.
reanalyze_batch_size (int|None) – reanalyzing all the data for one training iteration at once may use too much memory. If so, provide a number here so that the data is reanalyzed in several batches.
full_reanalyze (bool) –
if False, during reanalyze only the first
num_unroll_steps+1steps are calculated using MCTS, and the nextreanalyze_td_stepsare calculated from the model directly. If True, all are calculated using MCTS.priority_func (
Union[Callable,str]) – the function for calculating priority. If it is a str, eval(priority_func) will be called first to convert it to a Callable. It is called as priority_func(loss_info), where loss_info is the temporally stacked LossInfo structure returned from MCTSModel.calc_loss().
data_transformer_ctor (None|Callable|list[Callable]) – if provided, will be used to construct the data transformer. Otherwise, the one provided in config will be used.
data_augmenter (
Optional[Callable]) – If provided, will be called to perform data augmentation asdata_augmenter(observation)for training observations, where the shape of observation is [B, T, …] iftrain_repr_predictionis False, and [B, T*(R+1), …] iftrain_repr_predictionis True. B is mini-batch size, T is mini-batch length and R isnum_unroll_steps.target_update_tau (float) – Factor for soft update of the target networks used for reanalyzing.
target_update_period (int) – Period for soft update of the target networks used for reanalyzing.
config (
Optional[TrainerConfig]) – The trainer config that will eventually be assigned toself._config.enable_amp (
bool) – whether to use automatic mixed precision for inference. This usually makes the algorithm run faster. However, the result may be different (most likely due to random fluctuation).
random_action_after_episode_end – If False, the actions used to predict future states after the end of an episode will be the same as the last action. If True, they will be uniformly sampled.
optimizer (
Optional[Optimizer]) – the optimizer for independently training the representation.checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to
Algorithmfor more details.debug_summaries (bool) –
name (str) –
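The interplay of `target_update_tau` and `target_update_period` can be sketched with plain lists standing in for network parameters. This is an illustration under the common convention target ← (1 − tau)·target + tau·source, which is assumed here rather than taken from the ALF source; the helper names are hypothetical.

```python
def soft_update(target, source, tau):
    """Blend source parameters into target parameters with factor tau."""
    return [(1.0 - tau) * t + tau * s for t, s in zip(target, source)]

def maybe_update_target(step, period, target, source, tau):
    """Apply the soft update only every `period` gradient updates."""
    if step % period == 0:
        return soft_update(target, source, tau)
    return target

target = [0.0, 2.0]
source = [1.0, 4.0]
# With tau=1.0 the "soft" update is a full copy of the source parameters.
print(maybe_update_target(1000, 1000, target, source, tau=1.0))
# Off-period steps leave the target untouched.
print(maybe_update_target(1001, 1000, target, source, tau=1.0))
```

Under this convention, the defaults shown in the signature (`target_update_tau=1.0`, `target_update_period=1000`) would amount to copying the model into the target once every 1000 updates.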
- after_update(root_inputs, info)[source]#
Do things after completing one gradient update (i.e.
update_with_gradient()). This function can be used for post-processings following one minibatch update, such as copy a training model to a target model in SAC, DQN, etc.- Parameters
root_inputs (nest) – temporally batched inputs for the
rollout_step()of the root algorithm collected duringunroll().info (nest) – information collected for training. It is batched from each
AlgStep.inforeturned byrollout_step()for on-policy training ortrain_step()for off-policy training.
- calc_loss(info)[source]#
Calculate the loss at each step for each sample.
- Parameters
info (nest) – information collected for training. It is batched from each
AlgStep.inforeturned byrollout_step()(on-policy training) ortrain_step()(off-policy training).- Returns
- loss at each time step for each sample in the
batch. The shapes of the tensors in loss info should be \((T, B)\).
- Return type
- property model#
- predict_step(time_step, state)[source]#
Predict for one step of observation.
This is only used for evaluation, so it only needs to perform the computations for generating the action distribution.
- Parameters
time_step (TimeStep) – Current observation and other inputs for computing action.
state (nested Tensor) – should be consistent with predict_state_spec
- Returns
output (nested Tensor): should be consistent with
action_spec.state (nested Tensor): should be consistent with
predict_state_spec.
- Return type
- preprocess_experience(root_inputs, rollout_info, batch_info)[source]#
Fill rollout_info with MuzeroInfo.
In particular, the training targets for representation learning are computed here with reanalyze and/or bootstrapping.
Note that the shape of experience is [B, T, …], where B is the batch size and T is the mini-batch length.
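For intuition about the bootstrapped value target, the sketch below computes an n-step discounted return for a single trajectory in plain Python. It is not the ALF code: it assumes `rewards[k]` is the reward received on arriving at step `k`, it truncates the horizon at the end of the stored sequence, and it ignores episode boundaries and reanalyze. The function name is hypothetical; `td_steps` and `discount` mirror the constructor arguments.

```python
def n_step_return(rewards, values, t, td_steps, discount):
    """Discounted n-step return for step t, bootstrapped from values.

    G_t = sum_{i=1..n} discount^(i-1) * rewards[t+i] + discount^n * values[t+n]
    where n is td_steps, truncated so that t+n stays inside the sequence.
    """
    end = min(t + td_steps, len(values) - 1)
    ret = 0.0
    for k in range(t + 1, end + 1):
        ret += discount ** (k - t - 1) * rewards[k]
    ret += discount ** (end - t) * values[end]  # bootstrap from the value estimate
    return ret

rewards = [0.0, 1.0, 1.0, 1.0]
values = [0.0, 0.0, 0.0, 10.0]
print(n_step_return(rewards, values, t=0, td_steps=3, discount=0.5))
```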
- rollout_step(time_step, state)[source]#
Rollout for one step of inputs.
It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for
calc_loss(). For off-policy training, it needs to generate necessary information fortrain_step().- Parameters
inputs (nested Tensor) – inputs for prediction.
state (nested Tensor) – network state (for RNN).
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match
rollout_state_spec.info (nested Tensor): For on-policy training it will be temporally batched and passed as
info for calc_loss(). For off-policy training, it will be stored in the replay buffer and later retrieved for train_step() as rollout_info.
- Return type
- train_step(exp, state, rollout_info)[source]#
Perform one step of training computation.
It is called to calculate output for every time step for a batch of experience from replay buffer. It also needs to generate necessary information for
calc_loss().- Parameters
inputs (nested Tensor) – inputs for train.
state (nested Tensor) – consistent with
train_state_spec.rollout_info (nested Tensor) – info from
rollout_step(). It is retrieved from replay buffer.
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match
train_state_spec.
info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.
- Return type
- training: bool#
- class MuzeroRepresentationLearner(observation_spec, action_spec, config, training_options=None, reward_spec=TensorSpec(shape=(), dtype=torch.float32), impl_cls=<class 'alf.algorithms.muzero_representation_learner.MuzeroRepresentationImpl'>, debug_summaries=False, name='MuZeroRepresentationLearner')[source]#
Bases:
alf.algorithms.off_policy_algorithm.OffPolicyAlgorithm
Learn representation following the MuZero style.
This is a thin wrapper over the MuzeroRepresentationImpl, so as to make it possible to work in combination with an RL algorithm (within
Agent).Construct a MuzeroRepresentationLearner.
- Parameters
observation_spec (TensorSpec) – representing the observations.
action_spec (BoundedTensorSpec) – representing the actions.
config (
TrainerConfig) – The trainer config, usually passed down fromAgent.training_options (
Optional[MuzeroRepresentationTrainingOptions]) – The representation learner trains its underlying model independently of the RL algorithm, and therefore needs a separate set of parameters for the training options. See MuzeroRepresentationTrainingOptions for details. If not set, training will not happen.
reward_spec – a rank-1 or rank-0 tensor spec representing the reward(s). Will be passed down to the underlying wrapped
MuzeroRepresentationImpl.impl_cls (
Callable[…,MuzeroRepresentationImpl]) – a callable to construct the underlyingMuzeroRepresentationImpl. It will be called asimpl_cls( observation_spec=?, action_spec=?, reward_spec=?, config=?, debug_summaries=?).debug_summaries (
bool) –name (
str) –
- after_train_iter(experience, info)[source]#
Do things after completing one training iteration (i.e.
train_iter()that consists of one or multiple gradient updates). This function can be used for training additional modules that have their own training logic (e.g., on/off-policy, replay buffers, etc). These modules should be added to_trainable_attributes_to_ignorein the parent algorithm.Other things might also be possible as long as they should be done once every training iteration.
This function serves the same purpose as after_update if there is always exactly one gradient update in each training iteration. Otherwise it is called less frequently than after_update.
- Parameters
root_inputs (nest|None) – temporally batched inputs for the
rollout_step() of the root algorithm collected during unroll(). In the case where no data is available from rollout_step() (e.g. in an offline pre-training phase where the online interaction has not started yet), root_inputs will be None.
rollout_info (nest|None) – information collected from rollout_step() for this algorithm during unroll(). In the case where no data is available from rollout_step() (e.g. in an offline pre-training phase where the online interaction has not started yet), rollout_info will be None.
- after_update(root_inputs, info)[source]#
Do things after completing one gradient update (i.e.
update_with_gradient()). This function can be used for post-processings following one minibatch update, such as copy a training model to a target model in SAC, DQN, etc.- Parameters
root_inputs (nest) – temporally batched inputs for the
rollout_step()of the root algorithm collected duringunroll().info (nest) – information collected for training. It is batched from each
AlgStep.inforeturned byrollout_step()for on-policy training ortrain_step()for off-policy training.
- calc_loss(info)[source]#
Calculate the loss at each step for each sample.
- Parameters
info (nest) – information collected for training. It is batched from each
AlgStep.inforeturned byrollout_step()(on-policy training) ortrain_step()(off-policy training).- Returns
- loss at each time step for each sample in the
batch. The shapes of the tensors in loss info should be \((T, B)\).
- Return type
- property output_spec#
Access the spec of the produced representation.
This will be used as the observation spec for the subsequent RL algorithm.
- predict_step(time_step, state)[source]#
Predict for one step of observation.
This is only used for evaluation, so it only needs to perform the computations for generating the action distribution.
- Parameters
time_step (TimeStep) – Current observation and other inputs for computing action.
state (nested Tensor) – should be consistent with predict_state_spec
- Returns
output (nested Tensor): should be consistent with
action_spec.state (nested Tensor): should be consistent with
predict_state_spec.
- Return type
- preprocess_experience(root_inputs, rollout_info, batch_info)[source]#
This function is called on the experiences obtained from a replay buffer. An example usage of this function is to calculate advantages and returns in
PPOAlgorithm.The shapes of tensors in experience are assumed to be \((B, T, ...)\).
- Parameters
root_inputs (nest) – input for rollout_step() of the root algorithm. This is from replay buffer. Note this is not same as the input of rollout_step() of self unless self is the root algorithm.
rollout_info (nested Tensor) –
AlgStep.infofrom rollout_step() for this algorithm.batch_info (BatchInfo) – information about this batch of data
- Returns
processed root_inputs
processed rollout_info
- Return type
tuple
- rollout_step(time_step, state)[source]#
Rollout for one step of inputs.
It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for
calc_loss(). For off-policy training, it needs to generate necessary information fortrain_step().- Parameters
inputs (nested Tensor) – inputs for prediction.
state (nested Tensor) – network state (for RNN).
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match
rollout_state_spec.info (nested Tensor): For on-policy training it will be temporally batched and passed as
info for calc_loss(). For off-policy training, it will be stored in the replay buffer and later retrieved for train_step() as rollout_info.
- Return type
- train_step(exp, state, rollout_info)[source]#
Perform one step of training computation.
It is called to calculate output for every time step for a batch of experience from replay buffer. It also needs to generate necessary information for
calc_loss().- Parameters
inputs (nested Tensor) – inputs for train.
state (nested Tensor) – consistent with
train_state_spec.rollout_info (nested Tensor) – info from
rollout_step(). It is retrieved from replay buffer.
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match
train_state_spec.
info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.
- Return type
- training: bool#
- class MuzeroRepresentationTrainingOptions(interval: int = 1, mini_batch_length: int = 1, mini_batch_size: int = 256, num_updates_per_train_iter: int = 10, replay_buffer_length: int = 100000, initial_collect_steps: int = 2000, priority_replay: bool = True, priority_replay_alpha: float = 1.2, priority_replay_beta: float = 0.0)[source]#
Bases:
tupleThe options for training the Muzero Representation.
When used together with an RL algorithm, the representation training does not necessarily share the training options with the RL algorithm. Therefore, we use this class to hold the training options private to the Muzero representation learner.
Create new instance of MuzeroRepresentationTrainingOptions(interval, mini_batch_length, mini_batch_size, num_updates_per_train_iter, replay_buffer_length, initial_collect_steps, priority_replay, priority_replay_alpha, priority_replay_beta)
- initial_collect_steps: int#
Alias for field number 5
- interval: int#
Alias for field number 0
- mini_batch_length: int#
Alias for field number 1
- mini_batch_size: int#
Alias for field number 2
- num_updates_per_train_iter: int#
Alias for field number 3
- priority_replay: bool#
Alias for field number 6
- priority_replay_alpha: float#
Alias for field number 7
- priority_replay_beta: float#
Alias for field number 8
- replay_buffer_length: int#
Alias for field number 4
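Because the options class is a named tuple, fields can be read by name or by position, and a modified copy can be made with `_replace`. The sketch below uses a local stand-in that copies the field names and defaults from the signature above, so it runs without ALF installed; the class name `TrainingOptions` is hypothetical and is not the ALF class itself.

```python
from typing import NamedTuple

class TrainingOptions(NamedTuple):
    """Local stand-in mirroring the documented fields and defaults."""
    interval: int = 1
    mini_batch_length: int = 1
    mini_batch_size: int = 256
    num_updates_per_train_iter: int = 10
    replay_buffer_length: int = 100000
    initial_collect_steps: int = 2000
    priority_replay: bool = True
    priority_replay_alpha: float = 1.2
    priority_replay_beta: float = 0.0

opts = TrainingOptions(mini_batch_size=128)
# Field number 4 is replay_buffer_length, matching the alias list above.
print(opts.replay_buffer_length, opts[4])
# Named tuples are immutable; _replace returns a modified copy.
no_priority = opts._replace(priority_replay=False)
print(no_priority.priority_replay, opts.priority_replay)
```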
alf.algorithms.oac_algorithm#
Optimistic Actor Critic algorithm.
- class OacAlgorithm(observation_spec, action_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), actor_network_cls=<class 'alf.networks.actor_distribution_networks.ActorDistributionNetwork'>, critic_network_cls=<class 'alf.networks.critic_networks.CriticNetwork'>, q_network_cls=<class 'alf.networks.q_networks.QNetwork'>, epsilon_greedy=None, use_entropy_reward=True, calculate_priority=False, num_critic_replicas=2, env=None, config=None, critic_loss_ctor=None, target_entropy=None, prior_actor_ctor=None, target_kld_per_dim=3.0, initial_log_alpha=0.0, explore=True, explore_delta=6.8, beta_ub=4.6, max_log_alpha=None, target_update_tau=0.05, target_update_period=1, dqda_clipping=None, actor_optimizer=None, critic_optimizer=None, alpha_optimizer=None, checkpoint=None, debug_summaries=False, name='OacAlgorithm')[source]#
Bases:
alf.algorithms.sac_algorithm.SacAlgorithmOptimistic Actor Critic algorithm, described in:
Ciosek et al "Better Exploration with Optimistic Actor-Critic", arXiv:1910.12807
Refer to SacAlgorithm for Args besides the following.
- Parameters
explore (bool) – default is True for OAC algorithm, where only continuous action space is supported. When ‘explore’ is False, OAC is the same as SAC.
explore_delta (float) – parameter controlling how optimistically the mean of the target policy is shifted to obtain the mean of the exploration policy.
beta_ub (float) – parameter for computing the upper bound of the Q value: \(Q_{ub}(s,a) = \mu_Q(s,a) + \beta_{ub} \sigma_Q(s,a)\)
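The upper bound is simple to compute for a single state-action pair. The sketch below assumes two critics and follows the OAC paper in taking the mean of the two Q estimates as the mean and half their absolute difference as the standard deviation; how ALF aggregates more than two critic replicas is not covered here, and the function name is hypothetical.

```python
def q_upper_bound(q1, q2, beta_ub=4.6):
    """Optimistic Q estimate from two critic outputs for one (s, a) pair."""
    mu = 0.5 * (q1 + q2)          # mean of the two critics
    sigma = 0.5 * abs(q1 - q2)    # their spread (std for two samples)
    return mu + beta_ub * sigma

print(q_upper_bound(1.0, 3.0, beta_ub=2.0))
```

A larger `beta_ub` places more weight on critic disagreement, so the exploration policy is pulled toward actions whose value is uncertain.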
- rollout_step(inputs, state)[source]#
Same as SacAlgorithm.rollout_step except that explore is set to be self._explore when calling _predict_action.
- training: bool#
alf.algorithms.off_policy_algorithm#
Base class for off policy algorithms.
- class OffPolicyAlgorithm(observation_spec, action_spec, train_state_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), predict_state_spec=None, rollout_state_spec=None, is_on_policy=None, reward_weights=None, env=None, config=None, optimizer=None, checkpoint=None, is_eval=False, overwrite_policy_output=False, debug_summaries=False, name='RLAlgorithm')[source]#
Bases:
alf.algorithms.rl_algorithm.RLAlgorithm
OffPolicyAlgorithm implements the basic off-policy training pipeline. The user needs to implement rollout_step() and train_step().
- rollout_step() is called to generate actions at every environment step.
- train_step() is called to generate necessary information for training.
The following pseudocode illustrates how OffPolicyAlgorithm is used:
    # (1) collect stage
    for _ in range(steps_per_collection):
        # collect experience and store to replay buffer
        policy_step = rollout_step(time_step, policy_step.state)
        experience = make_experience(time_step, policy_step)
        store experience to replay buffer
        action = sample action from policy_step.action
        time_step = env.step(action)
    # (2) train stage
    for _ in range(training_steps_per_collection):
        # sample experiences and perform training
        experiences = sample batch from replay_buffer
        batched_train_info = []
        for experience in experiences:
            policy_step = train_step(experience, state)
            add policy_step.info to batched_train_info
        loss = calc_loss(experiences, batched_train_info)
        update_with_gradient(loss)
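The collect/train structure of the pseudocode can be made concrete with a self-contained toy. This is only a sketch: the environment, the single-parameter model, and the functions rollout_step, env_step and train_step below are hypothetical stand-ins written in plain Python, not the ALF implementations.

```python
import random
from collections import deque

random.seed(0)
replay_buffer = deque(maxlen=1000)  # stand-in for a real replay buffer
weight = 0.0                        # single trainable parameter of a toy model


def rollout_step(observation):
    """Toy policy: pick a random action and keep no state."""
    return random.choice([-1.0, 1.0])


def env_step(action):
    """Toy environment: constant observation, reward equal to 2 * action."""
    return 0.0, 2.0 * action


def train_step(experience):
    """Predict the reward from the action; return squared error and gradient."""
    _, action, reward = experience
    error = weight * action - reward
    return error ** 2, 2.0 * error * action


observation = 0.0
for _ in range(20):
    # (1) collect stage: act, then store the experience in the replay buffer
    for _ in range(8):
        action = rollout_step(observation)
        next_observation, reward = env_step(action)
        replay_buffer.append((observation, action, reward))
        observation = next_observation
    # (2) train stage: sample stored experience and apply a gradient update
    for _ in range(4):
        batch = random.sample(list(replay_buffer), 8)
        grads = [train_step(exp)[1] for exp in batch]
        weight -= 0.1 * sum(grads) / len(grads)

print(round(weight, 3))
```

The point of the toy is the separation of stages: data gathered by rollout_step() is consumed later by train_step(), possibly many updates after it was collected.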
- Parameters
observation_spec (nested TensorSpec) – representing the observations.
action_spec (nested BoundedTensorSpec) – representing the actions.
train_state_spec (nested TensorSpec) – for the network state of train_step().
reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).
rollout_state_spec (nested TensorSpec) – for the network state of rollout_step(). If None, it’s assumed to be the same as train_state_spec.
predict_state_spec (nested TensorSpec) – for the network state of predict_step(). If None, it’s assumed to be the same as rollout_state_spec.
is_on_policy (None|bool) – whether the algorithm is on-policy or not.
reward_weights (None|list[float]) – this is only used when the reward is multidimensional. If not None, the weighted sum of rewards is the reward for training. Otherwise, the sum of rewards is used.
env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. Running multiple environments in parallel is crucial to on-policy algorithms as it increases the diversity of data and decreases temporal correlation. env only needs to be provided to the root Algorithm.
config (TrainerConfig) – config for training. config only needs to be provided to the algorithm which performs a training iteration by itself.
optimizer (torch.optim.Optimizer) – The default optimizer for training.
checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.
is_eval (bool) – True if this algorithm is used for evaluation only, during deployment. In this case, the algorithm does not need to create certain components such as value_network for ActorCriticAlgorithm or critic_networks for SacAlgorithm.
overwrite_policy_output (bool) – if True, overwrite the policy output with next_step.prev_action. This option can be used in some cases such as data collection.
debug_summaries (bool) – If True, debug summaries will be created.
name (str) – Name of this algorithm.
- property on_policy#
Whether this is on-policy training.
For on-policy training, train_step() will not be called, and the info passed to calc_loss() is the info collected from rollout_step().
For off-policy training, train_step() will be called with the experience from the replay buffer, and the info passed to calc_loss() is the info collected from train_step().
An algorithm can override this to indicate whether it is an on-policy or off-policy algorithm. If an algorithm does not override this, it needs to support both on-policy and off-policy training, which means that rollout_step() and train_step() need to have the correct behavior for on-policy and off-policy training. It can check whether it is on-policy training by calling this function.
- Returns
True if on-policy training, False if off-policy training, None if not set.
- Return type
bool | None
- training: bool#
alf.algorithms.on_policy_algorithm#
Base class for on-policy RL algorithms.
- class OnPolicyAlgorithm(observation_spec, action_spec, train_state_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), predict_state_spec=None, rollout_state_spec=None, is_on_policy=None, reward_weights=None, env=None, config=None, optimizer=None, checkpoint=None, is_eval=False, overwrite_policy_output=False, debug_summaries=False, name='RLAlgorithm')[source]#
Bases:
alf.algorithms.off_policy_algorithm.OffPolicyAlgorithm
OnPolicyAlgorithm implements the basic on-policy training procedure.
The user needs to implement rollout_step() and calc_loss().
rollout_step() is called to generate actions for every environment step. It also needs to generate necessary information for training.
update_with_gradient() is called every unroll_length steps (specified in config.TrainerConfig). All the training information collected by every rollout_step() call is batched and provided as arguments for calc_loss().
The following pseudocode illustrates how OnPolicyAlgorithm can be used:
    for _ in range(unroll_length):
        policy_step = rollout_step(time_step, policy_step.state)
        collect information from time_step into experience
        collect information from policy_step.info into train_info
        time_step = env.step(policy_step.output)
    loss = calc_loss(experience, train_info)
    update_with_gradient(loss)
- Parameters
observation_spec (nested TensorSpec) – representing the observations.
action_spec (nested BoundedTensorSpec) – representing the actions.
train_state_spec (nested TensorSpec) – for the network state of train_step().
reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).
rollout_state_spec (nested TensorSpec) – for the network state of rollout_step(). If None, it’s assumed to be the same as train_state_spec.
predict_state_spec (nested TensorSpec) – for the network state of predict_step(). If None, it’s assumed to be the same as rollout_state_spec.
is_on_policy (None|bool) – whether the algorithm is on-policy or not.
reward_weights (None|list[float]) – this is only used when the reward is multidimensional. If not None, the weighted sum of rewards is the reward for training. Otherwise, the sum of rewards is used.
env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. Running multiple environments in parallel is crucial to on-policy algorithms as it increases the diversity of data and decreases temporal correlation. env only needs to be provided to the root Algorithm.
config (TrainerConfig) – config for training. config only needs to be provided to the algorithm which performs a training iteration by itself.
optimizer (torch.optim.Optimizer) – The default optimizer for training.
checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.
is_eval (bool) – True if this algorithm is used for evaluation only, during deployment. In this case, the algorithm does not need to create certain components such as value_network for ActorCriticAlgorithm or critic_networks for SacAlgorithm.
overwrite_policy_output (bool) – if True, overwrite the policy output with next_step.prev_action. This option can be used in some cases such as data collection.
debug_summaries (bool) – If True, debug summaries will be created.
name (str) – Name of this algorithm.
- property on_policy#
Whether this is on-policy training.
For on-policy training, train_step() will not be called, and the info passed to calc_loss() is the info collected from rollout_step().
For off-policy training, train_step() will be called with the experience from the replay buffer, and the info passed to calc_loss() is the info collected from train_step().
An algorithm can override this to indicate whether it is an on-policy or off-policy algorithm. If an algorithm does not override this, it needs to support both on-policy and off-policy training, which means that rollout_step() and train_step() need to have the correct behavior for on-policy and off-policy training. It can check whether it is on-policy training by calling this function.
- Returns
True if on-policy training, False if off-policy training, None if not set.
- Return type
bool | None
- train_step(inputs, state, rollout_info)[source]#
Perform one step of training computation.
It is called to calculate the output for every time step for a batch of experience from the replay buffer. It also needs to generate necessary information for calc_loss().
- Parameters
inputs (nested Tensor) – inputs for train.
state (nested Tensor) – consistent with train_state_spec.
rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from the replay buffer.
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match train_state_spec.
info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.
- Return type
- training: bool#
alf.algorithms.one_step_loss#
- class OneStepTDLoss(gamma=0.99, td_error_loss_fn=<function element_wise_squared_loss>, debug_summaries=False, name='OneStepTDLoss')[source]#
Bases:
alf.algorithms.td_loss.TDLoss
- Parameters
gamma (Union[float, List[float]]) – A discount factor for future rewards. For multi-dim reward, this can also be a list of discounts, each discount applies to a reward dim.
td_error_loss_fn (Callable) – A function for computing the TD errors loss. This function takes as input the target and the estimated Q values and returns the loss for each element of the batch.
debug_summaries (bool) – True if debug summaries should be created.
name (str) – The name of this loss.
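As an illustration of how gamma and td_error_loss_fn interact, here is a minimal one-step TD sketch in plain Python. The time indexing (the reward and discount at index t + 1 belong to the transition from t to t + 1) and the function name one_step_td_loss are assumptions made for this example, not the ALF implementation.

```python
def one_step_td_loss(rewards, discounts, values, target_values, gamma=0.99):
    """Per-step squared one-step TD error for a single trajectory.

    rewards[t + 1] and discounts[t + 1] are assumed to describe the transition
    from step t to step t + 1. Returns a list of length T - 1.
    """
    losses = []
    for t in range(len(values) - 1):
        # Bootstrapped target: immediate reward plus discounted next value.
        target = rewards[t + 1] + gamma * discounts[t + 1] * target_values[t + 1]
        # Element-wise squared loss between target and current estimate.
        losses.append((target - values[t]) ** 2)
    return losses


values = [1.0, 2.0, 3.0]
losses = one_step_td_loss(
    rewards=[0.0, 1.0, 1.0],
    discounts=[1.0, 1.0, 0.0],  # a discount of 0 marks the end of the episode
    values=values,
    target_values=values,
    gamma=0.5,
)
print(losses)
```

A zero discount at an episode boundary removes the bootstrap term, so the target there is just the final reward.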
- training: bool#
- class OneStepTDQRLoss(num_quantiles=50, gamma=0.99, td_error_loss_fn=<function huber_function>, sum_over_quantiles=False, debug_summaries=False, name='OneStepTDQRLoss')[source]#
Bases:
alf.algorithms.td_loss.TDQRLoss
One step temporal difference quantile regression loss.
- Parameters
num_quantiles (int) – the number of quantiles.
gamma (Union[float, List[float]]) – A discount factor for future rewards. For multi-dim reward, this can also be a list of discounts, each discount applies to a reward dim.
td_error_loss_fn (Callable) – A function for computing the TD errors loss. This function takes as input the target and the estimated Q values and returns the loss for each element of the batch.
sum_over_quantiles (bool) – If True, the quantile regression loss will be summed along the quantile dimension. Otherwise, it will be averaged along the quantile dimension instead. Default is False.
debug_summaries (bool) – True if debug summaries should be created.
name (str) – The name of this loss.
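The effect of sum_over_quantiles can be seen in a small sketch of the standard quantile Huber loss from the QR-DQN literature. It assumes quantile midpoints \((i + 0.5)/N\) and a Huber threshold of 1; the function names are hypothetical and the ALF implementation may differ in details.

```python
def huber(x, delta=1.0):
    """Huber function: quadratic near zero, linear in the tails."""
    a = abs(x)
    return 0.5 * x * x if a <= delta else delta * (a - 0.5 * delta)


def quantile_huber_loss(pred_quantiles, target_quantiles, sum_over_quantiles=False):
    """Quantile regression loss between predicted and target quantile values."""
    n = len(pred_quantiles)
    per_quantile = []
    for i, pred in enumerate(pred_quantiles):
        tau = (i + 0.5) / n  # quantile midpoint (an assumption)
        total = 0.0
        for target in target_quantiles:
            diff = target - pred
            # Asymmetric weight: under- and over-estimation are penalized
            # differently depending on the quantile level tau.
            weight = abs(tau - (1.0 if diff < 0 else 0.0))
            total += weight * huber(diff)
        per_quantile.append(total / len(target_quantiles))
    if sum_over_quantiles:
        return sum(per_quantile)
    return sum(per_quantile) / n


print(quantile_huber_loss([0.0, 0.0], [1.0]))
print(quantile_huber_loss([0.0, 0.0], [1.0], sum_over_quantiles=True))
```

Summing instead of averaging scales the loss by the number of quantiles, which in practice changes the effective learning rate of the quantile head.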
- training: bool#
alf.algorithms.particle_vi_algorithm#
A generic generator.
- class ParVIAlgorithm(particle_dim, num_particles=10, entropy_regularization=1.0, par_vi='gfsf', critic_input_dim=None, critic_hidden_layers=(100, 100), critic_l2_weight=10.0, critic_iter_num=2, critic_use_bn=True, critic_optimizer=None, optimizer=None, debug_summaries=False, name='ParVIAlgorithm')[source]#
Bases:
alf.algorithms.algorithm.Algorithm
ParVIAlgorithm maintains a set of particles that keep chasing some target distribution. Two particle-based variational inference (par_vi) methods are implemented:
Stein Variational Gradient Descent (SVGD):
Liu, Qiang, and Dilin Wang. “Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm.” NIPS. 2016.
Wasserstein Particle-based VI with Smooth Functions (GFSF):
Liu, Chang, et al. “Understanding and accelerating particle-based variational inference.” International Conference on Machine Learning. 2019.
Create a ParVIAlgorithm.
- Parameters
particle_dim (int) – dimension of the particles.
num_particles (int) – number of particles.
entropy_regularization (float) – weight of the repulsive term in par_vi.
par_vi (string) – par_vi methods, options are [svgd, gfsf, None]:
svgd: empirical expectation of SVGD is evaluated by reusing the same batch of particles.
gfsf: Wasserstein gradient flow with smoothed functions. It involves a kernel matrix inversion, so it is computationally more expensive, but in some cases the convergence seems faster than svgd approaches.
critic_input_dim (int) – dimension of critic input, used for minmax.
critic_hidden_layers (tuple) – sizes of hidden layers of the critic, used for minmax.
critic_l2_weight (float) – weight of L2 regularization in training the critic, used for minmax.
critic_iter_num (int) – number of critic updates for each generator train_step, used for minmax.
critic_use_bn (bool) – whether to use batch norm for each layer of the critic, used for minmax.
critic_optimizer (torch.optim.Optimizer) – Optimizer for training the critic, used for minmax.
optimizer (torch.optim.Optimizer) – (optional) optimizer for training
name (str) – name of this generator
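To show what the svgd option and the entropy_regularization weight do, the following is a one-dimensional SVGD sketch in plain Python. It assumes an RBF kernel with a fixed bandwidth and a standard-normal-shaped target centered at 3; svgd_step and its arguments are hypothetical names, not the ALF API.

```python
import math


def svgd_step(particles, grad_log_p, step_size=0.1, bandwidth=1.0,
              entropy_regularization=1.0):
    """One SVGD update for scalar particles with an RBF kernel."""
    n = len(particles)
    updated = []
    for xi in particles:
        phi = 0.0
        for xj in particles:
            k = math.exp(-((xj - xi) ** 2) / bandwidth)
            # Driving term: pulls particles toward high-density regions.
            drive = k * grad_log_p(xj)
            # Repulsive term: kernel gradient w.r.t. xj, keeps particles apart.
            repulse = -2.0 * (xj - xi) / bandwidth * k
            phi += drive + entropy_regularization * repulse
        updated.append(xi + step_size * phi / n)
    return updated


# Target density N(3, 1), whose score function is 3 - x.
particles = [-1.0, -0.5, 0.0, 0.5, 1.0]
for _ in range(500):
    particles = svgd_step(particles, lambda x: 3.0 - x)
mean = sum(particles) / len(particles)
print(round(mean, 2), [round(p, 2) for p in particles])
```

Setting entropy_regularization to 0 in this sketch removes the repulsive term, and all particles then collapse toward the mode of the target.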
- property num_particles#
- property particles#
- predict_step(state=None)[source]#
Generate outputs given inputs.
- Parameters
state – not used
- Returns
output (Tensor): shape is [num_particles, output_dim]
state: not used
- Return type
- train_step(loss_func, transform_func=None, entropy_regularization=None, loss_mask=None, state=None)[source]#
- Parameters
loss_func (Callable) – loss_func(loss_inputs) returns a Tensor or namedtuple of tensors with field loss, which is a Tensor of shape [num_particles], a loss term for optimizing the generator.
transform_func (Callable) –
transform function on particles. Used in function value based par_vi, where each particle represents parameters of a neural network function. It is called as transform_func(particles), which returns the following:
outputs: outputs of the network parameterized by particles, evaluated on a predefined training batch.
extra_outputs: outputs of the network parameterized by particles, evaluated on additional sampled data.
entropy_regularization (float) – weight of the repulsive term in par_vi. If None, use self._entropy_regularization.
loss_mask (Tensor) – mask indicating which samples are valid for loss propagation.
state – not used
- Returns
output (Tensor): shape is [num_particles, dim]
state: not used
info (LossInfo): loss
- Return type
- training: bool#
alf.algorithms.planning_algorithm#
- class CEMPlanAlgorithm(feature_spec, action_spec, population_size, planning_horizon, reward_spec=TensorSpec(shape=(), dtype=torch.float32), elite_size=50, max_iter_num=5, epsilon=0.01, tau=0.9, scalar_var=None, upper_bound=None, lower_bound=None, name='CEMPlanAlgorithm')[source]#
Bases:
alf.algorithms.planning_algorithm.RandomShootingAlgorithm
CEM-based planning method.
This method uses the Cross-Entropy Method (CEM) to optimize an action trajectory by minimizing a given cost function. The optimized action trajectory is termed a ‘plan’, which can be used by other components such as an MPC-based controller. This has been used by some MBRL works such as Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models.
To speed things up, when possible, the plan obtained at the previous time step is used to initialize the mean of the plan distribution at the current time step, after proper shifting and padding.
Create a CEMPlanAlgorithm.
- Parameters
population_size (int) – the size of the population for optimization
planning_horizon (int) – planning horizon in terms of time steps
reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).
elite_size (int) – the number of elites selected in each round
max_iter_num (int|Tensor) – the maximum number of CEM iterations
epsilon (float) – a minimum variance threshold. If the variance of the population falls below it, the CEM iteration will stop.
tau (float) –
a value in (0, 1) for softly updating the population mean and variance:
    mean = (1 - tau) * mean + tau * new_mean
    var = (1 - tau) * var + tau * new_var
scalar_var (None|float) – the value that will be used to construct the initial diagonal covariance matrix of the multi-dimensional Gaussian used by the CEM optimizer. If value is None, 0.5 * (upper_bound - lower_bound) is used.
upper_bound (int) – upper bound for elements in solution; action_spec.maximum will be used if not specified
lower_bound (int) – lower bound for elements in solution; action_spec.minimum will be used if not specified
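The roles of elite_size, epsilon and tau can be seen in a compact CEM sketch over a short action sequence. This is a plain-Python illustration with independent per-step Gaussians and a hypothetical cost function; cem_plan is not the ALF implementation, and only the soft update and the default initial variance follow the parameter descriptions above.

```python
import random


def cem_plan(cost_fn, horizon, population_size=100, elite_size=10,
             max_iter_num=5, epsilon=0.01, tau=0.9,
             lower_bound=-1.0, upper_bound=1.0, seed=0):
    """Minimize cost_fn over action sequences with the Cross-Entropy Method."""
    rng = random.Random(seed)
    mean = [0.5 * (lower_bound + upper_bound)] * horizon
    # Initial variance 0.5 * (upper_bound - lower_bound), as for scalar_var=None.
    var = [0.5 * (upper_bound - lower_bound)] * horizon
    for _ in range(max_iter_num):
        # Sample a population of candidate plans and clip them to the bounds.
        population = [
            [min(upper_bound, max(lower_bound, rng.gauss(m, v ** 0.5)))
             for m, v in zip(mean, var)]
            for _ in range(population_size)
        ]
        # Keep the lowest-cost candidates as elites.
        elites = sorted(population, key=cost_fn)[:elite_size]
        new_mean = [sum(e[t] for e in elites) / elite_size
                    for t in range(horizon)]
        new_var = [sum((e[t] - new_mean[t]) ** 2 for e in elites) / elite_size
                   for t in range(horizon)]
        # Soft update of the sampling distribution.
        mean = [(1 - tau) * m + tau * nm for m, nm in zip(mean, new_mean)]
        var = [(1 - tau) * v + tau * nv for v, nv in zip(var, new_var)]
        if max(var) < epsilon:  # stop once the population has collapsed
            break
    return mean


plan = cem_plan(lambda seq: sum((a - 0.3) ** 2 for a in seq), horizon=3,
                population_size=200, elite_size=20, max_iter_num=20)
print([round(a, 2) for a in plan])
```

A tau close to 1 trusts the newest elites almost entirely, while a smaller tau smooths the distribution across iterations at the cost of slower convergence.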
- predict_plan(time_step, state, epislon_greedy)[source]#
Compute the plan based on the provided observation and action.
- Parameters
time_step (TimeStep) – input data for next step prediction
state (PlannerState) – input planner state
- Returns
planned action for the given inputs
- Return type
action
- training: bool#
- class PlanAlgorithm(feature_spec, action_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), planning_horizon=25, upper_bound=None, lower_bound=None, name='PlanningAlgorithm')[source]#
Bases:
alf.algorithms.off_policy_algorithm.OffPolicyAlgorithm
Planning Module
This module plans for actions based on the initial observation and the specified reward and dynamics functions.
Create a PlanningAlgorithm.
- Parameters
reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).
planning_horizon (int) – planning horizon in terms of time steps
upper_bound (int) – upper bound for elements in solution; action_spec.maximum will be used if not specified
lower_bound (int) – lower bound for elements in solution; action_spec.minimum will be used if not specified
particles_per_replica (int) – number of particles used for each replica
- predict_plan(time_step, state, epsilon_greedy)[source]#
Compute the plan based on the provided observation and action.
- Parameters
time_step (TimeStep) – input data for next step prediction
state (PlannerState) – input planner state
- Returns
planned action for the given inputs
- Return type
action
- set_action_sequence_cost_func(action_seq_cost_func)[source]#
Set a function for evaluating the action sequences for planning.
- Parameters
action_seq_cost_func (Callable) – cost function to be used for planning. action_seq_cost_func takes the initial observation and action sequences of the shape [B, population, unroll_steps, action_dim] as input, and returns the accumulated cost along the unrolled trajectory, with the shape of [B, population].
- train_step(time_step, state, rollout_info=None)[source]#
- Parameters
time_step (TimeStep) – input data for dynamics learning
state (PlannerState) – input planner state
- Returns
output: empty tuple ()
state (PlannerState): updated planner state
info (PlannerInfo):
- Return type
- training: bool#
- class PlannerInfo(planner)#
Bases:
tuple
Create new instance of PlannerInfo(planner,)
- planner#
Alias for field number 0
- class PlannerState(prev_plan)#
Bases:
tuple
Create new instance of PlannerState(prev_plan,)
- prev_plan#
Alias for field number 0
- class RandomShootingAlgorithm(feature_spec, action_spec, population_size, reward_spec=TensorSpec(shape=(), dtype=torch.float32), planning_horizon=25, upper_bound=None, lower_bound=None, name='RandomShootingAlgorithm')[source]#
Bases:
alf.algorithms.planning_algorithm.PlanAlgorithm
Random Shooting-based planning method.
This method uses a Random Shooting approach to optimize an action trajectory by minimizing a given cost function. The optimized action trajectory is termed a ‘plan’, which can be used by other components such as an MPC-based controller. It has been used in Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning.
Create a RandomShootingAlgorithm.
- Parameters
population_size (int) – the size of the population for random shooting
reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).
planning_horizon (int) – planning horizon in terms of time steps
upper_bound (int) – upper bound for elements in solution; action_spec.maximum will be used if not specified
lower_bound (int) – lower bound for elements in solution; action_spec.minimum will be used if not specified
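Random shooting with the parameters above reduces to sampling and picking the best candidate, as this plain-Python sketch shows. It assumes uniform sampling within the bounds and a hypothetical cost function over a scalar action per step; random_shooting_plan is an illustrative name, not the ALF API.

```python
import random


def random_shooting_plan(cost_fn, horizon, population_size=500,
                         lower_bound=-1.0, upper_bound=1.0, seed=0):
    """Return the lowest-cost action sequence among random candidates."""
    rng = random.Random(seed)
    # Sample population_size action sequences uniformly within the bounds.
    candidates = [
        [rng.uniform(lower_bound, upper_bound) for _ in range(horizon)]
        for _ in range(population_size)
    ]
    # The plan is the candidate with the smallest accumulated cost.
    return min(candidates, key=cost_fn)


def cost(seq):
    """Hypothetical cost: squared distance of each action from 0.5."""
    return sum((a - 0.5) ** 2 for a in seq)


best_plan = random_shooting_plan(cost, horizon=2)
print([round(a, 2) for a in best_plan], round(cost(best_plan), 4))
```

Unlike CEM, the sampling distribution is never refined, so the quality of the plan depends directly on population_size and degrades quickly as the planning horizon grows.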
- after_update(root_inputs, info)[source]#
Do things after completing one gradient update (i.e. update_with_gradient()). This function can be used for post-processing following one minibatch update, such as copying a training model to a target model in SAC, DQN, etc.
- Parameters
root_inputs (nest) – temporally batched inputs for the rollout_step() of the root algorithm collected during unroll().
info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() for on-policy training or train_step() for off-policy training.
- predict_plan(time_step, state, epsilon_greedy)[source]#
Compute the plan based on the provided observation and action.
- Parameters
time_step (TimeStep) – input data for next step prediction
state (PlannerState) – input planner state
- Returns
planned action for the given inputs
- Return type
action
- training: bool#
alf.algorithms.ppg_algorithm#
Phasic Policy Gradient Algorithm.
- class PPGAlgorithm(observation_spec, action_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), env=None, config=None, aux_options=PPGAuxOptions(enabled=True, interval=32, mini_batch_length=None, mini_batch_size=8, num_updates_per_train_iter=6), encoding_network_ctor=<class 'alf.networks.encoding_networks.EncodingNetwork'>, policy_optimizer=None, aux_optimizer=None, epsilon_greedy=None, checkpoint=None, debug_summaries=False, name='PPGAlgorithm')[source]#
Bases:
alf.algorithms.off_policy_algorithm.OffPolicyAlgorithm
PPG Algorithm.
Implementation of the paper: https://arxiv.org/abs/2009.04416
PPG can be viewed as a variant of PPO, with two differences:
It uses a special network structure (DisjointPolicyValueNetwork) that has an extra auxiliary value head in addition to the policy head and value head. In the current implementation, the auxiliary value head also tries to estimate the value function, similar to the (actual) value head.
It does PPO updates in normal iterations. However, after every specified number of iterations, it will perform auxiliary phase updates based on auxiliary phase losses (different from the PPO loss, see algorithms/ppg/ppg_aux_phase_loss.py for details). Auxiliary phase updates do not require new rollouts. Instead, they are performed on all of the experience collected since the last auxiliary phase update.
- Parameters
observation_spec (nested TensorSpec) – representing the observations.
action_spec (nested BoundedTensorSpec) – representing the actions.
reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).
env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. env only needs to be provided to the root Algorithm. NOTE: env will default to None if PPGAlgorithm is run via Agent.
config (TrainerConfig) – config for training. config only needs to be provided to the algorithm which performs train_iter() by itself.
aux_options – Options that control the auxiliary phase training.
encoding_network_ctor (Callable[[TensorSpec], Network]) – Function to construct the encoding network from an input tensor spec. The constructed network will be called with forward(observation, state).
policy_optimizer (torch.optim.Optimizer) – The optimizer for training the policy phase of PPG.
aux_optimizer (torch.optim.Optimizer) – The optimizer for training the auxiliary phase of PPG.
epsilon_greedy (float) – a floating value in [0,1], representing the chance of action sampling instead of taking argmax. This can help prevent a dead loop in some deterministic environment like Breakout. Only used for evaluation. If None, its value is taken from config.epsilon_greedy and then alf.get_config_value(TrainerConfig.epsilon_greedy). It is used in predict_step() during evaluation.
checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.
debug_summaries (bool) – True if debug summaries should be created.
name (str) – Name of this algorithm.
- after_train_iter(experience, info)[source]#
Run auxiliary update if conditions are met
PPG requires running an auxiliary update after a certain number of policy update iterations. This is currently checked and performed in the after_train_iter() hook.
- calc_loss(info)[source]#
Calculate the loss at each step for each sample.
- Parameters
info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).
- Returns
loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).
- Return type
- predict_step(inputs, state)[source]#
Predict for one step of observation.
This is only used for evaluation, so it only needs to perform the computations for generating the action distribution.
- Parameters
time_step (TimeStep) – Current observation and other inputs for computing action.
state (nested Tensor) – should be consistent with predict_state_spec
- Returns
output (nested Tensor): should be consistent with action_spec.
state (nested Tensor): should be consistent with predict_state_spec.
- Return type
- rollout_step(inputs, state)[source]#
Rollout step for PPG algorithm
Besides running the network prediction, it does one extra thing to store the experience in the auxiliary replay buffer so that it can be consumed by the auxiliary phase updates.
- Return type
- train_step(inputs, state, plain_rollout_info)[source]#
Perform one step of training computation.
It is called to calculate the output for every time step for a batch of experience from the replay buffer. It also needs to generate necessary information for calc_loss().
- Parameters
inputs (nested Tensor) – inputs for train.
state (nested Tensor) – consistent with train_state_spec.
rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from the replay buffer.
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match train_state_spec.
info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.
- Return type
- training: bool#
alf.algorithms.ppo_algorithm#
PPO algorithm.
- class PPOAlgorithm(observation_spec, action_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), reward_weights=None, actor_network_ctor=<class 'alf.networks.actor_distribution_networks.ActorDistributionNetwork'>, value_network_ctor=<class 'alf.networks.value_networks.ValueNetwork'>, epsilon_greedy=None, env=None, config=None, loss=None, loss_class=<class 'alf.algorithms.actor_critic_loss.ActorCriticLoss'>, optimizer=None, checkpoint=None, debug_summaries=False, name='ActorCriticAlgorithm')[source]#
Bases:
alf.algorithms.actor_critic_algorithm.ActorCriticAlgorithm
PPO Algorithm. Implements the simplified surrogate loss in equation (9) of “Proximal Policy Optimization Algorithms” https://arxiv.org/abs/1707.06347
It works with ppo_loss.PPOLoss. It should have the same behavior as baselines.ppo2.
- Parameters
observation_spec (nested TensorSpec) – representing the observations.
action_spec (nested BoundedTensorSpec) – representing the actions.
reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).
reward_weights (None|list[float]) – this is only used when the reward is multidimensional. In that case, the weighted sum of the v values is used for training the actor if reward_weights is not None. Otherwise, the sum of the v values is used.
env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. env only needs to be provided to the root Algorithm.
epsilon_greedy (float) – a floating value in [0,1], representing the chance of action sampling instead of taking argmax. This can help prevent a dead loop in some deterministic environment like Breakout. Only used for evaluation. If None, its value is taken from config.epsilon_greedy and then alf.get_config_value(TrainerConfig.epsilon_greedy).
config (TrainerConfig) – config for training. config only needs to be provided to the algorithm which performs train_iter() by itself.
actor_network_ctor (Callable) – Function to construct the actor network. actor_network_ctor needs to accept input_tensor_spec and action_spec as its arguments and return an actor network. The constructed network will be called with forward(observation, state).
value_network_ctor (None | Callable) – Function to construct the value network. value_network_ctor needs to accept input_tensor_spec as its argument and return a value network. The constructed network will be called with forward(observation, state) and returns a value tensor for each observation given the observation and network state. Note that if the algorithm is constructed for evaluation or deployment only, value_network_ctor can be set to None and the value network will not be constructed at all.
loss (None|ActorCriticLoss) – an object for calculating loss. If None, a default loss of class loss_class will be used.
loss_class (type) – the class of the loss. The signature of its constructor: loss_class(debug_summaries)
optimizer (torch.optim.Optimizer) – The optimizer for training
checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.
debug_summaries (bool) – True if debug summaries should be created.
name (str) – Name of this algorithm.
- property on_policy#
Whether this is on-policy training.
For on-policy training, train_step() will not be called, and the info passed to calc_loss() is the info collected from rollout_step().
For off-policy training, train_step() will be called with the experience from the replay buffer, and the info passed to calc_loss() is the info collected from train_step().
An algorithm can override this to indicate whether it is an on-policy or off-policy algorithm. If an algorithm does not override this, it needs to support both on-policy and off-policy training, which means that rollout_step() and train_step() need to have the correct behavior for on-policy and off-policy training. It can check whether it is on-policy training by calling this function.
- Returns
True if on-policy training, False if off-policy training, None if not set.
- Return type
bool | None
- preprocess_experience(root_inputs, rollout_info, batch_info)[source]#
Compute advantages and put them into exp.rollout_info.
- train_step(inputs, state, rollout_info)[source]#
Perform one step of training computation.
It is called to calculate the output for every time step for a batch of experience from the replay buffer. It also needs to generate necessary information for calc_loss().
- Parameters
inputs (nested Tensor) – inputs for train.
state (nested Tensor) – consistent with train_state_spec.
rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from the replay buffer.
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match train_state_spec.
info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.
- Return type
- training: bool#
- class PPOInfo(step_type, discount, reward, action, rollout_log_prob, rollout_action_distribution, returns, advantages, action_distribution, value, reward_weights)#
Bases: tuple
Create new instance of PPOInfo(step_type, discount, reward, action, rollout_log_prob, rollout_action_distribution, returns, advantages, action_distribution, value, reward_weights)
- action#
Alias for field number 3
- action_distribution#
Alias for field number 8
- advantages#
Alias for field number 7
- discount#
Alias for field number 1
- returns#
Alias for field number 6
- reward#
Alias for field number 2
- reward_weights#
Alias for field number 10
- rollout_action_distribution#
Alias for field number 5
- rollout_log_prob#
Alias for field number 4
- step_type#
Alias for field number 0
- value#
Alias for field number 9
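The field numbers listed above follow directly from the order of the fields in the PPOInfo signature. The sketch below rebuilds an equivalent named tuple with the standard library to show how names and positions relate. It is an illustration only, not ALF's own definition, and the placeholder values are made up.

```python
from collections import namedtuple

# Field order copied from the PPOInfo signature documented above.
# This is a stand-in for illustration, not the class defined by ALF.
PPOInfo = namedtuple("PPOInfo", [
    "step_type", "discount", "reward", "action", "rollout_log_prob",
    "rollout_action_distribution", "returns", "advantages",
    "action_distribution", "value", "reward_weights",
])

# Placeholder values: each field simply stores its own position.
info = PPOInfo(*range(len(PPOInfo._fields)))

# Map each field name to its position ("Alias for field number N").
field_index = {name: i for i, name in enumerate(PPOInfo._fields)}

# Named tuples are immutable; _replace returns a modified copy.
updated = info._replace(advantages=0.5)
```

Access by name (`info.advantages`) and by position (`info[7]`) return the same element.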
alf.algorithms.ppo_loss#
Loss for PPO algorithm.
- class PPOLoss(gamma=0.99, td_error_loss_fn=<function element_wise_squared_loss>, td_lambda=0.95, normalize_advantages=True, compute_advantages_internally=False, advantage_clip=None, entropy_regularization=None, td_loss_weight=1.0, importance_ratio_clipping=0.2, log_prob_clipping=0.0, check_numerics=False, debug_summaries=False, name='PPOLoss')[source]#
Bases: alf.algorithms.actor_critic_loss.ActorCriticLoss
PPO loss.
Implements the simplified surrogate loss in equation (9) of Proximal Policy Optimization Algorithms.
The total loss equals
(policy_gradient_loss # (L^{CLIP} in equation (9))
+ td_loss_weight * td_loss # (L^{VF} in equation (9))
- entropy_regularization * entropy)
This loss works with PPOAlgorithm. The advantages and returns are pre-computed by PPOAlgorithm.preprocess(). One known difference from baselines.ppo2 is that the value estimate is not clipped here, while baselines.ppo2 also clips the value if it deviates from the returns too much.
- Parameters
gamma (float|list[float]) – A discount factor for future rewards. For multi-dim reward, this can also be a list of discounts, each discount applies to a reward dim.
td_error_loss_fn (Callable) – A function for computing the TD errors loss. This function takes as input the target and the estimated Q values and returns the loss for each element of the batch.
td_lambda (float) – Lambda parameter for TD-lambda computation.
normalize_advantages (bool) – If True, normalize advantages to zero mean and unit variance within the batch for calculating the policy gradient.
compute_advantages_internally (bool) – Normally PPOLoss does not compute the advantage and expects the info to carry the already-computed advantage. If this flag is set to True, PPOLoss will instead compute the advantage internally without depending on the input info, because loading a very large amount of experience into GPU memory to compute advantages may not always be possible.
advantage_clip (float) – If set, clip advantages to \([-x, x]\)
entropy_regularization (float) – Coefficient for entropy regularization loss term.
td_loss_weight (float) – the weight for the loss of td error.
importance_ratio_clipping (float) – Epsilon in clipped, surrogate PPO objective. See the cited paper for more detail.
log_prob_clipping (float) – If >0, clip log probs to the range (-log_prob_clipping, log_prob_clipping) to prevent inf/NaN values.
check_numerics (bool) – If true, check for NaN/Inf values. For debugging only.
name (str) –
- training: bool#
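The total loss described for PPOLoss can be illustrated on plain Python lists. The sketch below is not ALF's implementation: the function names, the list-based inputs, the 0.0 default for entropy_regularization, and the small epsilon used in normalization are all assumptions made for this example. The formula is the clipped surrogate objective combined with the TD and entropy terms as written in the class description.

```python
import math


def normalize_advantages(advantages, eps=1e-8):
    """Shift and scale advantages to zero mean and unit variance (eps is an assumption)."""
    mean = sum(advantages) / len(advantages)
    var = sum((a - mean) ** 2 for a in advantages) / len(advantages)
    return [(a - mean) / (math.sqrt(var) + eps) for a in advantages]


def ppo_total_loss(log_prob, rollout_log_prob, advantages, td_loss, entropy,
                   importance_ratio_clipping=0.2, td_loss_weight=1.0,
                   entropy_regularization=0.0):
    """Per-sample loss: policy_gradient_loss + td_loss_weight * td_loss
    - entropy_regularization * entropy."""
    eps = importance_ratio_clipping
    losses = []
    for lp, old_lp, adv, td, ent in zip(log_prob, rollout_log_prob,
                                        advantages, td_loss, entropy):
        # Importance ratio between the current policy and the rollout policy.
        ratio = math.exp(lp - old_lp)
        clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
        # Negated L^{CLIP}: the pessimistic (minimum) surrogate is maximized,
        # so its negative is the loss.
        pg_loss = -min(ratio * adv, clipped_ratio * adv)
        losses.append(pg_loss + td_loss_weight * td - entropy_regularization * ent)
    return losses
```

With a positive advantage, a ratio above 1 + importance_ratio_clipping contributes only the clipped value; with a negative advantage, the unclipped (more pessimistic) term is kept.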
alf.algorithms.predictive_representation_learner#
PredictiveRepresentationLearner.
- class PredictiveRepresentationLearner(observation_spec, action_spec, num_unroll_steps, decoder_ctor, encoding_net_ctor, dynamics_net_ctor, reward_spec=TensorSpec(shape=(), dtype=torch.float32), config=None, postprocessor=None, encoding_optimizer=None, dynamics_optimizer=None, postprocessor_optimizer=None, checkpoint=None, debug_summaries=False, name='PredictiveRepresentationLearner')[source]#
Bases: alf.algorithms.algorithm.Algorithm
Learn representation based on the prediction of future values.
PredictiveRepresentationLearner contains 3 Modules:
encoding_net: it is a Network that encodes the raw observation to a latent vector.
dynamics_net: it is a Network that generates the future latent states from the current latent state.
decoder: it is an Algorithm that decodes the target values from the latent state and calculates the loss.
- Parameters
observation_spec (nested TensorSpec) – describing the observation.
action_spec (nested BoundedTensorSpec) – describing the action.
num_unroll_steps (int) – the number of future steps to predict. num_unroll_steps of 0 means no future prediction and hence dynamics_net_ctor is ignored.
decoder_ctor (Callable|[Callable]) – each individual constructor is called as decoder_ctor(observation) to construct the decoder algorithm. It should follow the Algorithm interface. In addition to the interface of Algorithm, it should also implement a member function get_target_fields(), which returns a nest of the names of target fields. See SimpleDecoder for an example of a decoder.
encoding_net_ctor (Callable) – called as encoding_net_ctor(observation_spec) to construct the encoding Network. The network takes the raw observation as input and outputs the latent representation. encoding_net can be an RNN.
dynamics_net_ctor (Callable) – called as dynamics_net_ctor(action_spec) to construct the dynamics Network. It must be an RNN. The constructed network takes the action as input and outputs the future latent representation. If the state_spec of the dynamics net is exactly the same as the state_spec of the encoding net, the current state of the encoding net will be used as the initial state of the dynamics net. Otherwise, a linear projection will be used to convert the current latent representation to the initial state for the dynamics net.
reward_spec – NOT USED. Only present as representation learner interface to be used with Agent.
config (Optional[TrainerConfig]) – The trainer config. Present as representation learner interface to be used with Agent.
postprocessor (None|Callable) – If provided, will be called as postprocessor(latent) to get the actual representation, where latent is the output from encoding_net.
encoding_optimizer (Optimizer|None) – if provided, will be used to optimize the parameter for the encoding net.
dynamics_optimizer (Optimizer|None) – if provided, will be used to optimize the parameter for the dynamics net.
postprocessor_optimizer (Optimizer|None) – if provided, will be used to optimize the parameter for the postprocessor.
checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.
debug_summaries (bool) – whether to generate debug summaries
name (str) – name of this instance.
- get_decoder(target_field)[source]#
Get the decoder which predicts the target specified by target_field.
- Parameters
target_field – the name of the prediction quantity corresponding to the decoder
- Returns
decoder (Algorithm)
- property output_spec#
- predict_multi_step(init_latent, actions, target_field=None, state=None)[source]#
Perform multi-step predictions based on the initial latent representation and action sequences.
- Parameters
init_latent (Tensor) – the latent representation for the initial step of the prediction
actions (Tensor) – [B, unroll_steps, action_dim]
target_field (None|str|[str]) – the name or a list of names of the quantities to be predicted. It is used for selecting the corresponding decoder. If None, all the available decoders will be used for generating predictions.
state –
- Returns
predicted target of shape [B, unroll_steps + 1, d], where d is the dimension of the predicted target. The return is a list of Tensors when there are multiple targets to be predicted.
- Return type
prediction (Tensor|[Tensor])
- predict_step(inputs, state)[source]#
Predict for one step of inputs.
- Parameters
inputs (nested Tensor) – inputs for prediction.
state (nested Tensor) – network state (for RNN).
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match predict_state_spec.
info (nest): information for analyzing the agent. In particular, if an element of the info is alf.summary.render.Image, it will be rendered during play. See alf/summary/render.py for detail.
- Return type
- preprocess_experience(root_inputs, rollout_info, batch_info)[source]#
Fill experience.rollout_info with PredictiveRepresentationLearnerInfo
Note that the shape of experience is [B, T, …].
The target is a Tensor (or a nest of Tensors) when there is only one decoder. When there are multiple decoders, the target is a list, and each of its elements is a Tensor (or a nest of Tensors), which is used as the target for the corresponding decoder.
- rollout_step(inputs, state)[source]#
Rollout for one step of inputs.
It is called to calculate the output for every environment step. For on-policy training, it also needs to generate the information needed by calc_loss(). For off-policy training, it needs to generate the information needed by train_step().
- Parameters
inputs (nested Tensor) – inputs for prediction.
state (nested Tensor) – network state (for RNN).
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match rollout_state_spec.
info (nested Tensor): For on-policy training, it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored in the replay buffer and later retrieved for train_step() as rollout_info.
- Return type
- train_step(root_inputs, state, rollout_info)[source]#
Perform one step of training computation.
It is called to calculate the output for every time step for a batch of experience from the replay buffer. It also needs to generate the information needed by calc_loss().
- Parameters
inputs (nested Tensor) – inputs for train.
state (nested Tensor) – consistent with train_state_spec.
rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from the replay buffer.
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match train_state_spec.
info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate the loss or override update_with_gradient() to do customized training.
- Return type
- training: bool#
- class PredictiveRepresentationLearnerInfo(action, mask, target)#
Bases: tuple
Create new instance of PredictiveRepresentationLearnerInfo(action, mask, target)
- action#
Alias for field number 0
- mask#
Alias for field number 1
- target#
Alias for field number 2
- class SimpleDecoder(input_tensor_spec, target_field, decoder_net_ctor, loss_ctor=functools.partial(<class 'torch.nn.modules.loss.SmoothL1Loss'>, reduction='none'), loss_weight=1.0, summarize_each_dimension=False, optimizer=None, normalize_target=False, append_target_field_to_name=True, debug_summaries=False, name='SimpleDecoder')[source]#
Bases: alf.algorithms.algorithm.Algorithm
A simple decoder with elementwise loss between the target and the predicted value.
It is used to predict the target value from the given representation. Its loss can be used to train the representation.
- Parameters
input_tensor_spec (TensorSpec) – describing the input tensor.
target_field (str) – name of the field in the experience to be used as the decoding target.
decoder_net_ctor (Callable) – called as decoder_net_ctor(input_tensor_spec=input_tensor_spec) to construct an instance of Network for decoding. The network should take the latent representation as input and output the predicted value of the target.
loss_ctor (Callable) – loss function with signature loss(y_pred, y_true). Note that it should not reduce to a scalar. It should at least keep the batch dimension in the returned loss.
loss_weight (float) – weight for the loss.
optimizer (Optimizer|None) – if provided, it will be used to optimize the parameter of decoder_net
normalize_target (bool) – whether to normalize target. Note that the effect of this is to change the loss. The predicted value itself is not normalized.
append_target_field_to_name (bool) – whether to append the target field to the name of the decoder. If True, the actual name used will be name.target_field
debug_summaries (bool) – whether to generate debug summaries
name (str) – name of this instance
- calc_loss(target, predicted, mask=None)[source]#
Calculate the loss between target and predicted.
- Parameters
target (Tensor) – target to be predicted. Its shape is [T, B, …]
predicted (Tensor) – predicted target. Its shape is [T, B, …]
mask (bool Tensor) – indicating which target should be predicted. Its shape is [T, B].
- Returns
LossInfo
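To make the shapes in calc_loss() concrete, the sketch below computes an elementwise, unreduced loss over [T, B] nested lists and zeroes the entries excluded by the mask. It is a simplified illustration, not SimpleDecoder's code: the function names are hypothetical, scalar targets are assumed, and multiplying by the mask is one plausible way to apply it. The smooth L1 formula matches the default (beta = 1.0) of torch.nn.SmoothL1Loss, which is the default loss_ctor above.

```python
def smooth_l1(pred, target, beta=1.0):
    """Elementwise smooth L1: quadratic near zero, linear for larger errors."""
    diff = abs(pred - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


def masked_elementwise_loss(target, predicted, mask=None, loss_weight=1.0):
    """Return an unreduced [T][B] loss; masked-out entries contribute 0.

    target, predicted: nested lists of shape [T][B] holding scalars.
    mask: optional nested list of bools of shape [T][B].
    """
    loss = []
    for t, (target_row, pred_row) in enumerate(zip(target, predicted)):
        row = []
        for b, (y_true, y_pred) in enumerate(zip(target_row, pred_row)):
            keep = 1.0 if mask is None or mask[t][b] else 0.0
            row.append(loss_weight * keep * smooth_l1(y_pred, y_true))
        loss.append(row)
    return loss
```

Keeping the [T, B] shape in the result mirrors the requirement that the loss function must not reduce to a scalar.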
- predict_step(repr, state=())[source]#
Predict for one step of inputs.
- Parameters
inputs (nested Tensor) – inputs for prediction.
state (nested Tensor) – network state (for RNN).
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match predict_state_spec.
info (nest): information for analyzing the agent. In particular, if an element of the info is alf.summary.render.Image, it will be rendered during play. See alf/summary/render.py for detail.
- Return type
- train_step(repr, state=())[source]#
Perform one step of training computation.
It is called to calculate the output for every time step for a batch of experience from the replay buffer. It also needs to generate the information needed by calc_loss().
- Parameters
inputs (nested Tensor) – inputs for train.
state (nested Tensor) – consistent with train_state_spec.
rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from the replay buffer.
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match train_state_spec.
info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate the loss or override update_with_gradient() to do customized training.
- Return type
- training: bool#
alf.algorithms.prior_actor#
Prior action policies for KL regularized RL.
- class SameActionPriorActor(observation_spec, action_spec, same_action_noise=0.1, same_action_prob=0.9, debug_summaries=False, name='SameActionPriorActor')[source]#
Bases: alf.algorithms.algorithm.Algorithm
SameActionPriorActor can be used as a prior for KLD regularized RL algorithms. It encodes the prior intuition that the next action should be the same as the previous action most of the time. More specifically, the distribution for each action dimension is a mixture of two components:
a flat TruncatedNormal with loc equal to the median of the action range and scale equal to the action range.
a sharp TruncatedNormal with loc equal to the previous action and scale equal to the action range multiplied by same_action_noise.
The mixture weight depends on step_type:
If the step_type is FIRST, the mixture weight is [1.0, 0]
Otherwise the mixture weight is [1-same_action_prob, same_action_prob]
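The mixture described above can be written out for a single action dimension. The sketch below only computes the mixture weights and the (loc, scale) of the two components. The function names are hypothetical, scalar bounds are assumed, and "median of the action range" is read here as the midpoint between the lower and upper bounds.

```python
def mixture_weights(is_first_step, same_action_prob=0.9):
    """Weights for [flat component, sharp component]."""
    if is_first_step:
        # No previous action exists on the first step, so only the flat prior is used.
        return [1.0, 0.0]
    return [1.0 - same_action_prob, same_action_prob]


def component_params(low, high, prev_action, same_action_noise=0.1):
    """(loc, scale) of the flat and the sharp TruncatedNormal components."""
    action_range = high - low
    # Assumption: the "median" of the range is taken as its midpoint.
    flat = {"loc": (low + high) / 2.0, "scale": action_range}
    sharp = {"loc": prev_action, "scale": action_range * same_action_noise}
    return flat, sharp
```

For an action bounded in [-1, 1] with the default settings, the flat component has scale 2 and the sharp component has scale 0.2 around the previous action.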
- Parameters
observation_spec (nested TensorSpec) – representing the observations.
action_spec (nested BoundedTensorSpec) – representing the actions.
same_action_noise (float) – the noise added to the previous action if the new action is the same as the previous action.
same_action_prob (float) – the probability that the next action is same as the previous action.
debug_summaries (bool) – True if debug summaries should be created.
name (str) – The name of this algorithm.
- rollout_step(inputs, state)[source]#
Rollout for one step of inputs.
It is called to calculate the output for every environment step. For on-policy training, it also needs to generate the information needed by calc_loss(). For off-policy training, it needs to generate the information needed by train_step().
- Parameters
inputs (nested Tensor) – inputs for prediction.
state (nested Tensor) – network state (for RNN).
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match rollout_state_spec.
info (nested Tensor): For on-policy training, it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored in the replay buffer and later retrieved for train_step() as rollout_info.
- Return type
- train_step(inputs, state, unroll_info=())[source]#
Perform one step of training computation.
It is called to calculate the output for every time step for a batch of experience from the replay buffer. It also needs to generate the information needed by calc_loss().
- Parameters
inputs (nested Tensor) – inputs for train.
state (nested Tensor) – consistent with train_state_spec.
rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from the replay buffer.
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match train_state_spec.
info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate the loss or override update_with_gradient() to do customized training.
- Return type
- training: bool#
- class TruncatedNormal(loc, scale, low, high, validate_args=None)[source]#
Bases: torch.distributions.distribution.Distribution
Normal distribution truncated to the range between low and high.
Currently, only log_prob() is implemented.
- Parameters
loc (Tensor) – mean of the untruncated Normal
scale (Tensor) – standard deviation of the untruncated Normal
low (Tensor) – lower bound of the truncation range
high (Tensor) – upper bound of the truncation range
- log_prob(value)[source]#
Log-probability of value.
- Parameters
value (Tensor) – the samples whose log_prob is to be calculated
- Returns
log probability of value
- class UniformPriorActor(observation_spec, action_spec, debug_summaries=False, name='UniformPriorActor')[source]#
Bases: alf.algorithms.algorithm.Algorithm
UniformPriorActor can be used as a prior for KLD regularized RL algorithms. It generates a prior distribution for the next action using limited information, which can be used as the prior distribution in KLD.
The action distribution is always a uniform distribution defined by the valid range of the action specified in action_spec.
- Parameters
observation_spec (nested TensorSpec) – representing the observations.
action_spec (nested BoundedTensorSpec) – representing the actions.
debug_summaries (bool) – True if debug summaries should be created.
name (str) – The name of this algorithm.
- predict_step(inputs, state)[source]#
Predict for one step of inputs.
- Parameters
inputs (nested Tensor) – inputs for prediction.
state (nested Tensor) – network state (for RNN).
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match predict_state_spec.
info (nest): information for analyzing the agent. In particular, if an element of the info is alf.summary.render.Image, it will be rendered during play. See alf/summary/render.py for detail.
- Return type
- rollout_step(inputs, state)[source]#
Rollout for one step of inputs.
It is called to calculate the output for every environment step. For on-policy training, it also needs to generate the information needed by calc_loss(). For off-policy training, it needs to generate the information needed by train_step().
- Parameters
inputs (nested Tensor) – inputs for prediction.
state (nested Tensor) – network state (for RNN).
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match rollout_state_spec.
info (nested Tensor): For on-policy training, it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored in the replay buffer and later retrieved for train_step() as rollout_info.
- Return type
- train_step(inputs, state, rollout_info=None)[source]#
Perform one step of training computation.
It is called to calculate the output for every time step for a batch of experience from the replay buffer. It also needs to generate the information needed by calc_loss().
- Parameters
inputs (nested Tensor) – inputs for train.
state (nested Tensor) – consistent with train_state_spec.
rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from the replay buffer.
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match train_state_spec.
info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate the loss or override update_with_gradient() to do customized training.
- Return type
- training: bool#
alf.algorithms.qrsac_algorithm#
Quantile Regression Soft Actor Critic Algorithm.
- class QrsacAlgorithm(observation_spec, action_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), actor_network_cls=<class 'alf.networks.actor_distribution_networks.ActorDistributionNetwork'>, critic_network_cls=<class 'alf.networks.critic_networks.CriticNetwork'>, epsilon_greedy=None, use_entropy_reward=False, normalize_entropy_reward=False, calculate_priority=False, num_critic_replicas=2, min_critic_by_critic_mean=False, env=None, config=None, critic_loss_ctor=None, target_entropy=None, prior_actor_ctor=None, target_kld_per_dim=3.0, initial_log_alpha=0.0, max_log_alpha=None, target_update_tau=0.05, target_update_period=1, dqda_clipping=None, actor_optimizer=None, critic_optimizer=None, alpha_optimizer=None, checkpoint=None, debug_summaries=False, reproduce_locomotion=False, name='QrsacAlgorithm')[source]#
Bases: alf.algorithms.sac_algorithm.SacAlgorithm
Quantile regression actor critic algorithm.
A SAC variant that applies the following quantile regression based distributional RL approach to model the critic function:
Dabney et al "Distributional Reinforcement Learning with Quantile Regression", arXiv:1710.10044
Currently, only continuous action space is supported.
Refer to SacAlgorithm for the arguments other than the following. Arguments used for discrete and mixed actions are omitted.
- Parameters
min_critic_by_critic_mean (bool) – If True, compute the min quantile distribution of critic replicas by choosing the one with the lowest distribution mean. Otherwise, compute the min quantile by taking the minimum value across all critic replicas for each quantile value.
checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.
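The two behaviors selected by min_critic_by_critic_mean can be contrasted on plain lists. In the sketch below, quantiles[i][j] is the j-th quantile value predicted by critic replica i for a single sample. The function names and the list layout are assumptions made for this illustration, not QrsacAlgorithm's code.

```python
def min_by_critic_mean(quantiles):
    """Pick the whole quantile distribution of the replica with the lowest mean."""
    return list(min(quantiles, key=lambda q: sum(q) / len(q)))


def min_per_quantile(quantiles):
    """Take the minimum across replicas independently for each quantile index."""
    return [min(values) for values in zip(*quantiles)]
```

The first option always returns a distribution produced by a single replica, while the second can mix values from different replicas.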
- training: bool#
alf.algorithms.reward_learning_algorithm#
- class FixedRewardFunction(reward_func, name='FixedRewardFunction')[source]#
Bases: alf.algorithms.reward_learning_algorithm.RewardEstimationAlgorithm
Fixed Reward Estimation Module with hand-crafted computational rules.
- Parameters
reward_func (Callable) – a function for computing reward. It takes as input:
observation (Tensor of shape [batch_size, observation_dim])
action (Tensor of shape [batch_size, num_actions])
and returns a reward Tensor of shape [batch_size].
- compute_reward(obs, action, state)[source]#
Compute reward based on the current observation and action.
- Parameters
obs (Tensor) – observation
action (Tensor) – action
state – state for reward calculation
- Returns
computed reward for the given input
state: updated state, currently simply passing the input state
- Return type
reward (Tensor)
- train_step(time_step, state=(), rollout_info=None)[source]#
- Parameters
time_step (TimeStep) – input data for dynamics learning
state – state for reward learning
- Returns
AlgStep
- training: bool#
- class RewardEstimationAlgorithm(name='RewardEstimationAlgorithm')[source]#
Bases: alf.algorithms.algorithm.Algorithm
Reward Estimation Module
This module is responsible for computing/predicting rewards
Create a RewardEstimationAlgorithm.
- compute_reward(obs, action, state)[source]#
Compute reward based on the provided observation and action.
- Parameters
obs (Tensor) – observation
action (Tensor) – action
state –
- Returns
computed reward for the given input
- Return type
reward (Tensor)
- train_step(time_step, state, rollout_info=None)[source]#
- Parameters
time_step (TimeStep) – input data for dynamics learning
state (Tensor) – state for dynamics learning (previous observation)
- Returns
AlgStep
- training: bool#
alf.algorithms.rl_algorithm#
Base class for RL algorithms.
- class RLAlgorithm(observation_spec, action_spec, train_state_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), predict_state_spec=None, rollout_state_spec=None, is_on_policy=None, reward_weights=None, env=None, config=None, optimizer=None, checkpoint=None, is_eval=False, overwrite_policy_output=False, debug_summaries=False, name='RLAlgorithm')[source]#
Bases: alf.algorithms.algorithm.Algorithm
Abstract base class for RL Algorithms.
RLAlgorithm provides basic functions and a generic interface for RL algorithms.
The key interface functions are:
predict_step(): one step of computation of action for evaluation.
rollout_step(): one step of computation for rollout. It is used for collecting experiences during training. Different from predict_step(), rollout_step() may include additional computations for training. For on-policy algorithms (e.g., AC, PPO, etc), the collected experiences will be immediately used to update parameters after one rollout (multiple rollout steps) is performed; for off-policy algorithms (e.g., SAC, DDPG, etc), these collected experiences will be put into a replay buffer.
train_step(): only used for off-policy training. The training data are sampled from the replay buffer filled by rollout_step().
train_iter(): perform one iteration of training (rollout [and train]). train_iter() is called num_iterations times by Trainer. We provide a default implementation. Users can choose to implement their own train_iter().
update_with_gradient(): do one gradient update based on the loss. It is used by the default train_iter() implementation. You can override it to implement your own update_with_gradient().
calc_loss(): calculate the loss based on the experience and the train_info collected from rollout_step() or train_step(). It is used by the default implementation of train_iter(). If you want to use the default train_iter(), you need to implement calc_loss().
after_update(): called by train_iter() after every call to update_with_gradient(), mainly for some postprocessing steps such as copying a training model to a target model in SAC or DQN.
after_train_iter(): called by train_iter() after every call to train_from_unroll() (on-policy training iter) or train_from_replay_buffer() (off-policy training iter). It’s mainly for training additional modules that have their own training logic (e.g., on/off-policy, replay buffers, etc). Other things might also be possible as long as they should be done once every training iteration.
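The order in which the interface functions above are called can be sketched with a toy class. ToyAlgorithm below is a hypothetical stand-in written for this illustration. It does not subclass or reproduce RLAlgorithm, and its "loss" is an arbitrary placeholder. It only records the sequence of calls for the on-policy and off-policy cases as described in the list.

```python
import random


class ToyAlgorithm:
    """Hypothetical stand-in that mimics the call order only; not an ALF class."""

    def __init__(self, on_policy):
        self.on_policy = on_policy
        self.replay_buffer = []
        self.calls = []

    def rollout_step(self, obs):
        self.calls.append("rollout_step")
        return {"obs": obs}

    def train_step(self, rollout_info):
        # Only reached in the off-policy branch.
        self.calls.append("train_step")
        return {"obs": rollout_info["obs"]}

    def calc_loss(self, infos):
        self.calls.append("calc_loss")
        # Placeholder loss: mean squared observation.
        return sum(i["obs"] ** 2 for i in infos) / len(infos)

    def update_with_gradient(self, loss):
        self.calls.append("update_with_gradient")

    def after_update(self):
        self.calls.append("after_update")

    def train_iter(self, observations, batch_size=2):
        rollout_infos = [self.rollout_step(o) for o in observations]
        if self.on_policy:
            # On-policy: the rollout info goes straight to calc_loss().
            train_infos = rollout_infos
        else:
            # Off-policy: store, sample from the buffer, then run train_step().
            self.replay_buffer.extend(rollout_infos)
            batch = random.sample(self.replay_buffer, batch_size)
            train_infos = [self.train_step(i) for i in batch]
        loss = self.calc_loss(train_infos)
        self.update_with_gradient(loss)
        self.after_update()
        return len(train_infos)
```

In the on-policy case train_step() never appears in the recorded calls, which matches the description of train_step() as off-policy only.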
- Parameters
observation_spec (nested TensorSpec) – representing the observations.
action_spec (nested BoundedTensorSpec) – representing the actions.
train_state_spec (nested TensorSpec) – for the network state of train_step().
reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).
rollout_state_spec (nested TensorSpec) – for the network state of rollout_step(). If None, it’s assumed to be the same as train_state_spec.
predict_state_spec (nested TensorSpec) – for the network state of predict_step(). If None, it’s assumed to be the same as rollout_state_spec.
is_on_policy (None|bool) – whether the algorithm is on-policy or not.
reward_weights (None|list[float]) – this is only used when the reward is multidimensional. If not None, the weighted sum of rewards is the reward for training. Otherwise, the sum of rewards is used.
env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. Running multiple environments in parallel is crucial to on-policy algorithms as it increases the diversity of data and decreases temporal correlation. env only needs to be provided to the root Algorithm.
config (TrainerConfig) – config for training. config only needs to be provided to the algorithm which performs a training iteration by itself.
optimizer (torch.optim.Optimizer) – The default optimizer for training.
checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.
is_eval (bool) – True if this algorithm is used for evaluation only, during deployment. In this case, the algorithm does not need to create certain components such as value_network for ActorCriticAlgorithm or critic_networks for SacAlgorithm.
overwrite_policy_output (bool) – if True, overwrite the policy output with next_step.prev_action. This option can be used in some cases such as data collection.
debug_summaries (bool) – If True, debug summaries will be created.
name (str) – Name of this algorithm.
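The effect of reward_weights on a multidimensional reward, as described in the parameter list above, amounts to a weighted sum (or a plain sum when no weights are given). A minimal sketch for a single reward vector follows. The function name and the list-based input are assumptions for illustration and do not reflect how RLAlgorithm stores or applies the weights internally.

```python
def scalarize_reward(reward, reward_weights=None):
    """Collapse a multidimensional reward into the scalar used for training."""
    if reward_weights is None:
        # No weights given: use the plain sum of the reward dimensions.
        return sum(reward)
    if len(reward_weights) != len(reward):
        raise ValueError("reward_weights must have one entry per reward dimension")
    return sum(w * r for w, r in zip(reward_weights, reward))
```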
- property action_spec#
Return the action spec.
- get_metrics()[source]#
Returns the metrics monitored by this driver.
- Returns
- Return type
list[StepMetric]
- get_step_metrics()[source]#
Get the step metrics that are used for generating summaries against.
- Returns
step metrics EnvironmentSteps and NumberOfEpisodes.
- Return type
list[StepMetric]
- has_multidim_reward()[source]#
Check if the algorithm uses multi-dim reward or not.
- Returns
True if the reward has multiple dims.
- Return type
bool
- load_offline_replay_buffer(untransformed_observation_spec)[source]#
Load replay buffer from a replay buffer checkpoint. It will construct a replay buffer (self._offline_replay_buffer) holding the data loaded from the checkpoint, which can be used for model training, e.g. in the hybrid training pipeline or in other ways.
- Parameters
untransformed_observation_spec (nested TensorSpec) – spec that describes the structure of the untransformed observations.
- property observation_spec#
Return the observation spec.
- predict_step(inputs, state)[source]#
Predict for one step of observation.
This is only used for evaluation, so it only needs to perform the computations for generating the action distribution.
- Parameters
time_step (TimeStep) – Current observation and other inputs for computing action.
state (nested Tensor) – should be consistent with predict_state_spec
- Returns
output (nested Tensor): should be consistent with action_spec.
state (nested Tensor): should be consistent with predict_state_spec.
- Return type
- property reward_weights#
Return the current reward weights.
- property rollout_info_spec#
The spec for the AlgStep.info returned from rollout_step().
- set_reward_weights(reward_weights)[source]#
Update reward weights; this function can be called at any step during training. Once called, the updated reward weights are expected to be used by the algorithm in the next.
- Parameters
reward_weights (Tensor) – a tensor that is compatible with self._reward_spec.
- summarize_metrics()[source]#
Generate summaries for metrics AverageEpisodeLength, AverageReturn, etc.
- summarize_rollout(experience, custom_summary=None)[source]#
Generate summaries for rollout.
- Parameters
experience (Experience) – experience collected from rollout_step().
custom_summary (Optional[Callable[[Experience],None]]) – when specified, it is a function that will be called every time this summarize_rollout hook is called. This provides a convenient way for the user to extend summarize_rollout from ALF configs.
- summarize_train(experience, train_info, loss_info, params)[source]#
Generate summaries for training & loss info after each gradient update.
For on-policy algorithms, experience.rollout_info is empty, while for off-policy algorithms, it is available. However, the statistics in both train_info and experience.rollout_info are for the data sampled from the replay buffer. They store the up-to-date model outputs and the historical model outputs (on the past rollout data), respectively. They do not represent the model outputs on the current ongoing rollout.
- Parameters
experience (Experience) – experiences collected from the most recent unroll() or from a replay buffer. It also has been used for the most recent update_with_gradient().
train_info (nested Tensor) – AlgStep.info returned by either rollout_step() (on-policy training) or train_step() (off-policy training).
loss_info (LossInfo) – loss
params (list[Parameter]) – list of parameters with gradients
- train_iter()[source]#
Perform one iteration of training.
Users may choose to implement their own train_iter().
- Returns
the number of samples being trained on (including duplicates).
- Return type
int
- training: bool#
- unroll(**kwargs)#
- adjust_replay_buffer_length(config, num_earliest_frames_ignored=0)[source]#
Adjust the replay buffer length for whole replay buffer training.
Normally we just respect the replay buffer length set in the config. However, for a specific case where the user asks to do “whole replay buffer training”, we need to adjust the user provided length to achieve desired behavior.
- Parameters
config (TrainerConfig) – The trainer config of the training session.
num_earliest_frames_ignored (int) – ignore this many of the earliest frames in the buffer when sampling or gathering. This is typically required when FrameStacker is used. See ReplayBuffer for details.
- Return type
int
- Returns
An integer representing the adjusted replay buffer length.
alf.algorithms.rnd_algorithm#
- class RNDAlgorithm(target_net, predictor_net, encoder_net=None, reward_adapt_speed=None, observation_adapt_speed=None, observation_spec=None, optimizer=None, clip_value=- 1.0, keep_stacked_frames=1, name='RNDAlgorithm')[source]#
Bases: alf.algorithms.algorithm.Algorithm
Exploration by Random Network Distillation, Burda et al. 2019.
This module generates the intrinsic reward based on the prediction errors of randomly generated state embeddings.
Suppose we have a fixed randomly initialized target network g: s -> e_t and a trainable predictor network h: s -> e_p, then the intrinsic reward is
r = |e_t - e_p|^2
The reward is expected to be higher for novel states.
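The reward definition above can be sketched in a few lines of plain Python. This is a toy illustration, not ALF's implementation: random linear maps stand in for the target and predictor networks, and all function names here are hypothetical.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Random linear map standing in for an embedding network (assumption)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(in_dim)]
            for _ in range(out_dim)]


def embed(weights, s):
    """Apply the linear map to state s."""
    return [sum(w * x for w, x in zip(row, s)) for row in weights]


def intrinsic_reward(target_w, predictor_w, s):
    """r = |e_t - e_p|^2, the squared prediction error of the embedding."""
    e_t = embed(target_w, s)
    e_p = embed(predictor_w, s)
    return sum((a - b) ** 2 for a, b in zip(e_t, e_p))


def train_predictor(target_w, predictor_w, s, lr=0.05):
    """One gradient step on r with respect to the predictor weights only.

    The target weights are never modified, mirroring the fixed target net.
    """
    e_t = embed(target_w, s)
    e_p = embed(predictor_w, s)
    for i, row in enumerate(predictor_w):
        err = e_p[i] - e_t[i]
        for j in range(len(row)):
            row[j] -= lr * 2.0 * err * s[j]
```

Repeatedly calling `train_predictor` on the same state shrinks its intrinsic reward, while states the predictor has not been trained on keep a large prediction error.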
- Parameters
encoder_net (EncodingNetwork) – a shared network that encodes observation to embeddings before being input to target_net or predictor_net; its parameters are not trainable.
target_net (EncodingNetwork) – the random fixed network that generates target state embeddings to be fitted.
predictor_net (EncodingNetwork) – the trainable network that predicts target embeddings. If fully trained given enough data, predictor_net will become target_net eventually.
reward_adapt_speed (float) – speed for adaptively normalizing intrinsic rewards; if None, no normalizer is used.
observation_adapt_speed (float) – speed for adaptively normalizing observations. Only useful if observation_spec is not None.
observation_spec (TensorSpec) – the observation tensor spec; used for creating an adaptive observation normalizer.
optimizer (torch.optim.Optimizer) – The optimizer for training
clip_value (float) – if positive, the rewards will be clipped to [-clip_value, clip_value]; only used for reward normalization.
keep_stacked_frames (int) – a non-negative integer indicating how many stacked frames we want to keep as the observation. If >0, we only keep the last so many frames for RND to make predictions on, as suggested by the original paper Burda et al. 2019. For Atari games, this argument is usually 1 (with frame_stacking==4). If it’s 0, the observation is unchanged. For other games, the user is responsible for setting this value correctly depending on how many channels an observation has at each time step.
name (str) –
- calc_loss(info)[source]#
Calculate the loss at each step for each sample.
- Parameters
info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).
- Returns
loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).
- Return type
- predict_step(inputs, state)[source]#
Predict for one step of inputs.
- Parameters
inputs (nested Tensor) – inputs for prediction.
state (nested Tensor) – network state (for RNN).
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match predict_state_spec.
info (nest): information for analyzing the agent. In particular, if an element of the info is alf.summary.render.Image, it will be rendered during play. See alf/summary/render.py for details.
- Return type
- rollout_step(inputs, state)[source]#
Rollout for one step of inputs.
It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for calc_loss(). For off-policy training, it needs to generate necessary information for train_step().
- Parameters
inputs (nested Tensor) – inputs for prediction.
state (nested Tensor) – network state (for RNN).
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match rollout_state_spec.
info (nested Tensor): For on-policy training, it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored in the replay buffer and retrieved for train_step() as rollout_info.
- Return type
- train_step(inputs, state, rollout_info=None)[source]#
Perform one step of training computation.
It is called to calculate output for every time step for a batch of experience from the replay buffer. It also needs to generate necessary information for calc_loss().
- Parameters
inputs (nested Tensor) – inputs for train.
state (nested Tensor) – consistent with train_state_spec.
rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from the replay buffer.
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match train_state_spec.
info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.
- Return type
- training: bool#
alf.algorithms.sac_algorithm#
Soft Actor Critic Algorithm.
- class SacActionState(actor_network, critic)#
Bases: tuple
Create new instance of SacActionState(actor_network, critic)
- actor_network#
Alias for field number 0
- critic#
Alias for field number 1
- class SacActorInfo(actor_loss, neg_entropy)#
Bases: tuple
Create new instance of SacActorInfo(actor_loss, neg_entropy)
- actor_loss#
Alias for field number 0
- neg_entropy#
Alias for field number 1
- class SacAlgorithm(observation_spec, action_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), actor_network_cls=<class 'alf.networks.actor_distribution_networks.ActorDistributionNetwork'>, critic_network_cls=<class 'alf.networks.critic_networks.CriticNetwork'>, q_network_cls=<class 'alf.networks.q_networks.QNetwork'>, reward_weights=None, epsilon_greedy=None, use_entropy_reward=True, normalize_entropy_reward=False, calculate_priority=False, num_critic_replicas=2, env=None, config=None, critic_loss_ctor=None, target_entropy=None, prior_actor_ctor=None, target_kld_per_dim=3.0, initial_log_alpha=0.0, max_log_alpha=None, target_update_tau=0.05, target_update_period=1, dqda_clipping=None, actor_optimizer=None, critic_optimizer=None, alpha_optimizer=None, checkpoint=None, debug_summaries=False, reproduce_locomotion=False, name='SacAlgorithm')[source]#
Bases: alf.algorithms.off_policy_algorithm.OffPolicyAlgorithm
Soft Actor Critic algorithm, described in:
Haarnoja et al "Soft Actor-Critic Algorithms and Applications", arXiv:1812.05905v2
There are 3 differences from tf_agents.agents.sac.sac_agent:
1. To reduce computation, here we sample actions only once for calculating actor, critic, and alpha loss while tf_agents.agents.sac.sac_agent samples actions for each loss. This difference has little influence on the training performance.
2. We calculate losses for every sampled step. \((s_t, a_t), (s_{t+1}, a_{t+1})\) in a sampled transition are used to calculate actor, critic and alpha loss while tf_agents.agents.sac.sac_agent only uses \((s_t, a_t)\) and the critic loss for \(s_{t+1}\) is 0. Handle this carefully: it is equivalent to applying a coefficient of 0.5 on the critic loss.
3. We mask out StepType.LAST steps when calculating losses but tf_agents.agents.sac.sac_agent does not. We believe the correct implementation should mask out LAST steps. This may lead to different performance on the same tasks.
In addition to continuous actions addressed by the original paper, this algorithm also supports discrete actions and a mixture of discrete and continuous actions. The networks for computing Q values \(Q(s,a)\) and sampling actions can be divided into 3 cases according to action types:
Discrete only: a QNetwork is used for estimating Q values. There will be no actor network to learn because actions can be directly sampled from the Q values: \(p(a|s) \propto \exp(\frac{Q(s,a)}{\alpha})\).
Continuous only: a CriticNetwork is used for estimating Q values. An ActorDistributionNetwork for sampling actions will be learned according to Q values.
Mixed: a QNetwork is used for estimating Q values. The input of this particular QNetwork (dubbed “Universal Q Network”) is augmented with all continuous actions as (observation, continuous_action), while the output heads correspond to discrete actions. So a Q value \(Q(s, a_{cont}, a_{disc}=k)\) is estimated by the \(k\)-th output head of the network given \(a_{cont}\) as the augmented input to \(s\). Still, only an ActorDistributionNetwork is needed for first sampling continuous actions, and then a discrete action is sampled from Q values conditioned on the continuous actions. See alf/docs/notes/sac_with_hybrid_action_types.rst for training details.
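For the discrete-only case, sampling from \(p(a|s) \propto \exp(Q(s,a)/\alpha)\) amounts to a softmax over Q values with temperature \(\alpha\). The following is a minimal sketch in plain Python. The function names are hypothetical and this is not how ALF implements it internally.

```python
import math
import random


def boltzmann_probs(q_values, alpha):
    """Return p(a|s) proportional to exp(Q(s,a) / alpha)."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    scaled = [q / alpha for q in q_values]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def sample_action(q_values, alpha, rng=random):
    """Draw one discrete action index from the Boltzmann distribution."""
    probs = boltzmann_probs(q_values, alpha)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]
```

A smaller `alpha` concentrates probability on the highest-valued action, while a larger `alpha` moves the distribution toward uniform.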
In addition to the entropy regularization described in the SAC paper, we also support KL-Divergence regularization if a prior actor is provided. In this case, the training objective is:
\(E_\pi\left(\sum_t \gamma^t\left(r_t - \alpha D_{\rm KL}(\pi(\cdot|s_t)||\pi^0(\cdot|s_t))\right)\right)\)
where \(\pi^0\) is the prior actor.
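To make the objective concrete, the quantity inside the expectation can be evaluated for a single sampled trajectory as below. This is only a sketch: the per-step KL values are assumed to be given, and the function name is hypothetical.

```python
def kl_regularized_return(rewards, klds, alpha, gamma):
    """Compute sum_t gamma^t * (r_t - alpha * KL_t) for one trajectory.

    rewards[t] is r_t and klds[t] is the KL divergence between the policy
    and the prior actor at s_t (both assumed to be precomputed).
    """
    if len(rewards) != len(klds):
        raise ValueError("rewards and klds must have the same length")
    total = 0.0
    discount = 1.0
    for r, kld in zip(rewards, klds):
        total += discount * (r - alpha * kld)
        discount *= gamma
    return total
```

Averaging this value over many trajectories sampled from the policy gives a Monte Carlo estimate of the objective.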
- Parameters
observation_spec (nested TensorSpec) – representing the observations.
action_spec (nested BoundedTensorSpec) – representing the actions; can be a mixture of discrete and continuous actions. The number of continuous actions can be arbitrary while only one discrete action is allowed currently. If it’s a mixture, then it must be a tuple/list (discrete_action_spec, continuous_action_spec).
reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).
actor_network_cls (Callable) – is used to construct the actor network. The constructed actor network will be called to sample continuous actions. All of its output specs must be continuous. Note that we don’t need a discrete actor network because a discrete action can simply be sampled from the Q values.
critic_network_cls (None or Callable) – is used to construct the critic network for estimating Q(s,a) given that the action is continuous. Note that if the algorithm is constructed for evaluation or deployment only, critic_network_cls can be set to None and the network will not be constructed at all.
q_network_cls (Callable) – is used to construct the QNetwork for estimating Q(s,a) given that the action is discrete. Its output spec must be consistent with the discrete action in action_spec.
reward_weights (None|list[float]) – this is only used when the reward is multidimensional. In that case, the weighted sum of the q values is used for training the actor if reward_weights is not None. Otherwise, the sum of the q values is used.
epsilon_greedy (float) – a floating value in [0,1], representing the chance of action sampling instead of taking argmax. This can help prevent a dead loop in some deterministic environment like Breakout. Only used for evaluation. If None, its value is taken from config.epsilon_greedy and then alf.get_config_value(TrainerConfig.epsilon_greedy).
use_entropy_reward (bool) – whether to include entropy as reward.
normalize_entropy_reward (bool) – if True, normalize entropy reward to reduce bias in episodic cases. Only used if use_entropy_reward==True.
calculate_priority (bool) – whether to calculate priority. This is only useful if priority replay is enabled.
num_critic_replicas (int) – number of critics to be used. Default is 2.
env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. env only needs to be provided to the root algorithm.
config (TrainerConfig) – config for training. It only needs to be provided to the algorithm which performs train_iter() by itself.
critic_loss_ctor (None|OneStepTDLoss|MultiStepLoss) – a critic loss constructor. If None, a default OneStepTDLoss will be used.
initial_log_alpha (float) – initial value for variable log_alpha.
max_log_alpha (float|None) – if not None, log_alpha will be capped at this value.
target_entropy (float|Callable|None) – If a floating value, it’s the target average policy entropy, for updating alpha. If a callable function, then it will be called on the action spec to calculate a target entropy. If None, a default entropy will be calculated. For the mixed action type, discrete action and continuous action will have separate alphas and target entropies, so this argument can be a 2-element list/tuple, where the first is for discrete action and the second for continuous action.
prior_actor_ctor (Callable) – If provided, it will be called using prior_actor_ctor(observation_spec, action_spec, debug_summaries=debug_summaries) to construct a prior actor. The output of the prior actor is the distribution of the next action. Two prior actors are implemented: alf.algorithms.prior_actor.SameActionPriorActor and alf.algorithms.prior_actor.UniformPriorActor.
target_kld_per_dim (float) – alpha is dynamically adjusted so that the KLD is about target_kld_per_dim * dim.
target_update_tau (float) – Factor for soft update of the target networks.
target_update_period (int) – Period for soft update of the target networks.
dqda_clipping (float) – when computing the actor loss, clips the gradient dqda element-wise between [-dqda_clipping, dqda_clipping]. Will not perform clipping if dqda_clipping == 0.
actor_optimizer (torch.optim.Optimizer) – The optimizer for actor.
critic_optimizer (torch.optim.Optimizer) – The optimizer for critic.
alpha_optimizer (torch.optim.Optimizer) – The optimizer for alpha.
debug_summaries (bool) – True if debug summaries should be created.
checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.
reproduce_locomotion (bool) – if True, some slight tweaks are added to the original SAC to roughly reproduce its reported results on MuJoCo locomotion tasks. These include uniform action sampling in the beginning and different masks for actor and critic losses.
name (str) – The name of this algorithm.
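The interplay of target_update_tau and target_update_period can be sketched as follows. This assumes the usual Polyak-averaging convention for a soft update, with parameters represented as flat lists of floats. The function names are hypothetical and ALF's own target updater may differ in detail.

```python
def soft_update(target, source, tau):
    """In place: target <- (1 - tau) * target + tau * source."""
    for i, (t, s) in enumerate(zip(target, source)):
        target[i] = (1.0 - tau) * t + tau * s


def maybe_update_target(step, target, source, tau=0.05, period=1):
    """Apply the soft update once every `period` gradient steps.

    Returns True if an update was applied at this step.
    """
    if step % period == 0:
        soft_update(target, source, tau)
        return True
    return False
```

With `tau=1.0` the update degenerates to a hard copy of the source parameters, and a larger `period` makes the target networks change less often.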
- after_update(root_inputs, info)[source]#
Do things after completing one gradient update (i.e. update_with_gradient()). This function can be used for post-processing following one minibatch update, such as copying a training model to a target model in SAC, DQN, etc.
- Parameters
root_inputs (nest) – temporally batched inputs for the rollout_step() of the root algorithm collected during unroll().
info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() for on-policy training or train_step() for off-policy training.
- calc_loss(info)[source]#
Calculate the loss at each step for each sample.
- Parameters
info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).
- Returns
loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).
- Return type
- predict_step(inputs, state)[source]#
Predict for one step of observation.
This is only used for evaluation, so it only needs to perform the computations required to generate the action distribution.
- Parameters
time_step (TimeStep) – Current observation and other inputs for computing action.
state (nested Tensor) – should be consistent with predict_state_spec
- Returns
output (nested Tensor): should be consistent with action_spec.
state (nested Tensor): should be consistent with predict_state_spec.
- Return type
- rollout_step(inputs, state)[source]#
rollout_step() basically predicts actions like what is done by predict_step(). Additionally, if states are to be stored in a replay buffer, then this function also calls _critic_networks and _target_critic_networks to maintain their states.
- train_step(inputs, state, rollout_info)[source]#
Perform one step of training computation.
It is called to calculate output for every time step for a batch of experience from the replay buffer. It also needs to generate necessary information for calc_loss().
- Parameters
inputs (nested Tensor) – inputs for train.
state (nested Tensor) – consistent with train_state_spec.
rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from the replay buffer.
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match train_state_spec.
info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.
- Return type
- training: bool#
- class SacCriticInfo(critics, target_critic)#
Bases: tuple
Create new instance of SacCriticInfo(critics, target_critic)
- critics#
Alias for field number 0
- target_critic#
Alias for field number 1
- class SacCriticState(critics, target_critics)#
Bases: tuple
Create new instance of SacCriticState(critics, target_critics)
- critics#
Alias for field number 0
- target_critics#
Alias for field number 1
- class SacInfo(reward, step_type, discount, action, action_distribution, actor, critic, alpha, log_pi, discounted_return)#
Bases: tuple
Create new instance of SacInfo(reward, step_type, discount, action, action_distribution, actor, critic, alpha, log_pi, discounted_return)
- action#
Alias for field number 3
- action_distribution#
Alias for field number 4
- actor#
Alias for field number 5
- alpha#
Alias for field number 7
- critic#
Alias for field number 6
- discount#
Alias for field number 2
- discounted_return#
Alias for field number 9
- log_pi#
Alias for field number 8
- reward#
Alias for field number 0
- step_type#
Alias for field number 1
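The "Alias for field number" entries above follow from the positional order of the constructor arguments. A stand-in built with the standard library's `collections.namedtuple` shows the mapping. The name `SacInfoSketch` is hypothetical, and the real `SacInfo` is defined by ALF and may carry defaults or extra behavior.

```python
from collections import namedtuple

# Field order copied from the SacInfo signature documented above.
SacInfoSketch = namedtuple("SacInfoSketch", [
    "reward", "step_type", "discount", "action", "action_distribution",
    "actor", "critic", "alpha", "log_pi", "discounted_return",
])

# Fill each field with its own position so that name and index line up.
info = SacInfoSketch(*range(10))
```

Accessing `info.alpha` and `info[7]` returns the same element, which is what "Alias for field number 7" means.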
alf.algorithms.sarsa_algorithm#
SARSA Algorithm.
- class SarsaAlgorithm(observation_spec, action_spec, actor_network_ctor, critic_network_ctor, reward_spec=TensorSpec(shape=(), dtype=torch.float32), num_critic_replicas=2, env=None, config=None, critic_loss_cls=<class 'alf.algorithms.one_step_loss.OneStepTDLoss'>, target_entropy=None, epsilon_greedy=None, use_entropy_reward=False, calculate_priority=False, initial_alpha=1.0, ou_stddev=0.2, ou_damping=0.15, actor_optimizer=None, critic_optimizer=None, alpha_optimizer=None, target_update_tau=0.05, target_update_period=10, use_smoothed_actor=False, dqda_clipping=0.0, on_policy=False, checkpoint=None, debug_summaries=False, name='SarsaAlgorithm')[source]#
Bases: alf.algorithms.rl_algorithm.RLAlgorithm
SARSA Algorithm.
SARSA updates the Q function using the following loss:
\[||Q(s_t,a_t) - \text{nograd}(r_t + \gamma * Q(s_{t+1}, a_{t+1}))||^2\]
See https://en.wikipedia.org/wiki/State-action-reward-state-action
Currently, this is only implemented for continuous action problems. The policy is derived in a DDPG/SAC manner by maximizing \(Q(a(s_t), s_t)\), where \(a(s_t)\) is the action.
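For a single scalar transition, the loss above reduces to a squared one-step TD error. The sketch below is illustrative only: it uses plain floats, the function name is hypothetical, and terminal-step handling is omitted.

```python
def sarsa_td_loss(q_t, reward, gamma, q_next):
    """Squared error between Q(s_t, a_t) and the one-step SARSA target.

    q_next is Q(s_{t+1}, a_{t+1}) for the action actually taken next.
    The target plays the role of nograd(...): it is treated as a
    constant, so only q_t would receive gradient in a real
    implementation.
    """
    target = reward + gamma * q_next
    return (q_t - target) ** 2
```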
- Parameters
action_spec (nested BoundedTensorSpec) – representing the actions.
observation_spec (nested TensorSpec) – spec for observation.
actor_network_ctor (Callable) – Function to construct the actor network. actor_network_ctor needs to accept input_tensor_spec and action_spec as its arguments and return an actor network. The constructed network will be called with forward(observation, state).
critic_network_ctor (Callable) – Function to construct the critic network. critic_network_ctor needs to accept input_tensor_spec which is a tuple of (observation_spec, action_spec). The constructed network will be called with forward((observation, action), state).
reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).
num_critic_replicas (int) – number of critics to be used. Default is 2.
env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. Running multiple environments in parallel is crucial to on-policy algorithms as it increases the diversity of data and decreases temporal correlation. env only needs to be provided to the root Algorithm.
config (TrainerConfig) – config for training. config only needs to be provided to the algorithm which performs train_iter() by itself.
initial_alpha (float|None) – If provided, will add -alpha*entropy to the loss to encourage diverse action.
target_entropy (float|Callable|None) – If a floating value, it’s the target average policy entropy, for updating alpha. If a callable function, then it will be called on the action spec to calculate a target entropy. If None, a default entropy will be calculated.
epsilon_greedy (float) – a floating value in [0,1], representing the chance of action sampling instead of taking argmax. This can help prevent a dead loop in some deterministic environment like Breakout. Only used for evaluation. If None, its value is taken from config.epsilon_greedy and then alf.get_config_value(TrainerConfig.epsilon_greedy).
use_entropy_reward (bool) – If True, will use alpha*entropy as additional reward.
calculate_priority (bool) – whether to calculate priority. This is only useful if priority replay is enabled.
ou_stddev (float) – Only used for DDPG. Standard deviation for the Ornstein-Uhlenbeck (OU) noise added in the default collect policy.
ou_damping (float) – Only used for DDPG. Damping factor for the OU noise added in the default collect policy.
target_update_tau (float) – Factor for soft update of the target networks.
target_update_period (int) – Period for soft update of the target networks.
use_smoothed_actor (bool) – use a smoothed version of actor for predict and rollout. This option can be used if on_policy is False.
dqda_clipping (float) – when computing the actor loss, clips the gradient dqda element-wise between [-dqda_clipping, dqda_clipping]. Does not perform clipping if dqda_clipping == 0.
actor_optimizer (torch.optim.Optimizer) – The optimizer for actor.
critic_optimizer (torch.optim.Optimizer) – The optimizer for critic networks.
alpha_optimizer (torch.optim.Optimizer) – The optimizer for alpha. Only used if initial_alpha is not None.
on_policy (bool) – whether it is used as an on-policy algorithm.
checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.
debug_summaries (bool) – True if debug summaries should be created.
name (str) – The name of this algorithm.
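The effect of dqda_clipping can be illustrated with a small element-wise clip. This is a sketch on a plain list of gradient components with a hypothetical function name. Treating `None` the same as 0 (no clipping) is an assumption made here for convenience.

```python
def clip_dqda(dqda, dqda_clipping):
    """Clip each component of dQ/da to [-dqda_clipping, dqda_clipping].

    A value of 0 disables clipping, as described for the parameter above.
    None is also treated as "no clipping" here (assumption).
    """
    if not dqda_clipping:
        return list(dqda)
    return [max(-dqda_clipping, min(dqda_clipping, g)) for g in dqda]
```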
- after_update(root_inputs, info)[source]#
Do things after completing one gradient update (i.e. update_with_gradient()). This function can be used for post-processing following one minibatch update, such as copying a training model to a target model in SAC, DQN, etc.
- Parameters
root_inputs (nest) – temporally batched inputs for the rollout_step() of the root algorithm collected during unroll().
info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() for on-policy training or train_step() for off-policy training.
- calc_loss(info)[source]#
Calculate the loss at each step for each sample.
- Parameters
info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).
- Returns
loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).
- Return type
- convert_train_state_to_predict_state(state)[source]#
Convert RNN state for train_step() to RNN state for predict_step().
- predict_step(inputs, state)[source]#
Predict for one step of observation.
This is only used for evaluation, so it only needs to perform the computations required to generate the action distribution.
- Parameters
time_step (TimeStep) – Current observation and other inputs for computing action.
state (nested Tensor) – should be consistent with predict_state_spec
- Returns
output (nested Tensor): should be consistent with action_spec.
state (nested Tensor): should be consistent with predict_state_spec.
- Return type
- rollout_step(inputs, state)[source]#
Rollout for one step of inputs.
It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for calc_loss(). For off-policy training, it needs to generate necessary information for train_step().
- Parameters
inputs (nested Tensor) – inputs for prediction.
state (nested Tensor) – network state (for RNN).
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match rollout_state_spec.
info (nested Tensor): For on-policy training, it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored in the replay buffer and retrieved for train_step() as rollout_info.
- Return type
- train_step(inputs, state, rollout_info)[source]#
Perform one step of training computation.
It is called to calculate output for every time step for a batch of experience from the replay buffer. It also needs to generate necessary information for calc_loss().
- Parameters
inputs (nested Tensor) – inputs for train.
state (nested Tensor) – consistent with train_state_spec.
rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from the replay buffer.
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match train_state_spec.
info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.
- Return type
- training: bool#
- class SarsaInfo(reward, step_type, discount, action_distribution, actor_loss, critics, target_critics, neg_entropy)#
Bases: tuple
Create new instance of SarsaInfo(reward, step_type, discount, action_distribution, actor_loss, critics, target_critics, neg_entropy)
- action_distribution#
Alias for field number 3
- actor_loss#
Alias for field number 4
- critics#
Alias for field number 5
- discount#
Alias for field number 2
- neg_entropy#
Alias for field number 7
- reward#
Alias for field number 0
- step_type#
Alias for field number 1
- target_critics#
Alias for field number 6
- class SarsaLossInfo(actor, critic, alpha, neg_entropy)#
Bases: tuple
Create new instance of SarsaLossInfo(actor, critic, alpha, neg_entropy)
- actor#
Alias for field number 0
- alpha#
Alias for field number 2
- critic#
Alias for field number 1
- neg_entropy#
Alias for field number 3
- class SarsaState(prev_observation, prev_step_type, actor, critics, target_critics, noise)#
Bases: tuple
Create new instance of SarsaState(prev_observation, prev_step_type, actor, critics, target_critics, noise)
- actor#
Alias for field number 2
- critics#
Alias for field number 3
- noise#
Alias for field number 5
- prev_observation#
Alias for field number 0
- prev_step_type#
Alias for field number 1
- target_critics#
Alias for field number 4
alf.algorithms.taac_algorithm#
- class ActPredOutput(dists, b, actor_a, taus, q_values2)#
Bases: tuple
Create new instance of ActPredOutput(dists, b, actor_a, taus, q_values2)
- actor_a#
Alias for field number 2
- b#
Alias for field number 1
- dists#
Alias for field number 0
- q_values2#
Alias for field number 4
- taus#
Alias for field number 3
- class Distributions(beta_dist, b1_a_dist)#
Bases: tuple
Create new instance of Distributions(beta_dist, b1_a_dist)
- b1_a_dist#
Alias for field number 1
- beta_dist#
Alias for field number 0
- Mode#
alias of
alf.algorithms.taac_algorithm.AlgorithmMode
- class TAACTDLoss(gamma=0.99, td_error_loss_fn=<function element_wise_squared_loss>, debug_summaries=False, name='TAACTDLoss')[source]#
Bases: torch.nn.modules.module.Module
This TD loss implements the compare-through multi-step Q operator \(\mathcal{T}^{\pi^{\text{ta}}}\) proposed in the TAAC paper. For a sampled trajectory, it compares the beta action \(\tilde{b}_n\) sampled from the current policy with the historical rollout beta action \(b_n\) step by step, and uses the minimum \(n\) that has \(\tilde{b}_n\lor b_n=1\) as the target step for bootstrapping.
- Parameters
gamma (float|list[float]) – A discount factor for future rewards. For multi-dim reward, this can also be a list of discounts, each discount applies to a reward dim.
td_error_loss_fn (Callable) – A function for computing the TD errors loss. This function takes as input the target and the estimated Q values and returns the loss for each element of the batch.
debug_summaries (bool) – True if debug summaries should be created.
name (str) – The name of this loss.
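The rule for picking the bootstrapping step in the class description can be sketched for a single trajectory as below. This is an illustration of the stated rule only, not the batched tensor implementation in ALF. The function name, the 1-based step index, and the `None` fallback when no step qualifies are all assumptions.

```python
def bootstrap_step(b_rollout, b_current):
    """Return the smallest n with b_current[n] OR b_rollout[n] equal to 1.

    Both arguments hold 0/1 beta actions for steps n = 1, 2, ... of one
    trajectory. Returns None if no step qualifies; how to handle that
    case is left to the caller (assumption).
    """
    for n, (b, b_tilde) in enumerate(zip(b_rollout, b_current), start=1):
        if b == 1 or b_tilde == 1:
            return n
    return None
```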
- forward(info, value, target_value)[source]#
Calculate the TD loss. The first dimension of all the tensors is the time dimension and the second dimension is the batch dimension.
- Parameters
info (TaacInfo) – TaacInfo collected from train_step().
value (torch.Tensor) – the tensor for the value at each time step. The loss is between this and the calculated return.
target_value (torch.Tensor) – the tensor for the value at each time step. This is used to calculate return.
- Returns
TD loss with the extra field same as the loss.
- Return type
- property gamma#
Return the \(\gamma\) value for discounting future rewards.
- Returns
a rank-0 or rank-1 (multi-dim reward) floating tensor.
- Return type
Tensor
- training: bool#
- class TaacActorInfo(actor_loss, b1_a_entropy, beta_entropy, adv, value_loss)#
Bases: tuple
Create new instance of TaacActorInfo(actor_loss, b1_a_entropy, beta_entropy, adv, value_loss)
- actor_loss#
Alias for field number 0
- adv#
Alias for field number 3
- b1_a_entropy#
Alias for field number 1
- beta_entropy#
Alias for field number 2
- value_loss#
Alias for field number 4
- class TaacAlgorithm(name='TaacAlgorithm', *args, **kwargs)[source]#
Bases: alf.algorithms.taac_algorithm.TaacAlgorithmBase
Model temporal abstraction by action repetition. See
“TAAC: Temporally Abstract Actor-Critic for Continuous Control”, Yu et al., arXiv 2021.
for algorithm details.
See TaacAlgorithmBase for argument description.
- training: bool#
- class TaacAlgorithmBase(observation_spec, action_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), actor_network_cls=<class 'alf.networks.actor_distribution_networks.ActorDistributionNetwork'>, critic_network_cls=<class 'alf.networks.critic_networks.CriticNetwork'>, actor_observation_processors=Detach(), reward_weights=None, num_critic_replicas=2, epsilon_greedy=None, env=None, config=None, target_update_tau=0.05, target_update_period=1, critic_loss_ctor=None, actor_optimizer=None, critic_optimizer=None, alpha_optimizer=None, initial_alpha=1.0, debug_summaries=False, randomize_first_state_tau=False, b1_advantage_clipping=None, max_repeat_steps=None, target_entropy=None, checkpoint=None, name='TaacAlgorithmBase')[source]#
Bases: alf.algorithms.off_policy_algorithm.OffPolicyAlgorithm
Temporally abstract actor-critic algorithm.
In a nutshell, for inference TAAC adds a second stage that chooses between a candidate trajectory \(\hat{\tau}\) output by an SAC actor and the previous trajectory \(\tau^-\). For policy evaluation, TAAC uses a compare-through Q operator for TD backup by re-using state-action sequences that have shared actions between rollout and training. For policy improvement, the new actor gradient is approximated by multiplying the \(\frac{\partial Q}{\partial a}\) term in the original SAC actor gradient by a scaling factor, where the scaling factor is the optimal probability of choosing \(\hat{\tau}\) in the second stage.
Different sub-algorithms implement different forms of the ‘trajectory’ concept, for example, it can be a constant function representing the same action, or a quadratic function.
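The second-stage choice can be illustrated with a small sketch. It assumes that the switching probability \(\beta_1\) is a two-way Boltzmann (softmax) choice over the two Q values with temperature alpha, which reduces to a sigmoid of the advantage \(Q(s,\hat{\tau}) - Q(s,\tau^-)\). The function names, the scalar inputs, and the exact form of the probability are assumptions for illustration, not ALF's implementation.

```python
import math
import random


def switch_probability(q_new, q_prev, alpha, adv_clip=None):
    """Probability beta_1 of switching to the candidate trajectory.

    Assumed form: sigmoid((Q(s, tau_hat) - Q(s, tau_prev)) / alpha), with an
    optional [min_adv, max_adv] clip on the advantage.
    """
    adv = q_new - q_prev
    if adv_clip is not None:
        min_adv, max_adv = adv_clip
        adv = max(min_adv, min(adv, max_adv))
    x = adv / alpha
    # Numerically stable sigmoid.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def choose_trajectory(candidate, previous, q_new, q_prev, alpha, rng=random):
    """Sample b ~ Bernoulli(beta_1); b=1 takes the candidate, b=0 repeats."""
    beta_1 = switch_probability(q_new, q_prev, alpha)
    b = 1 if rng.random() < beta_1 else 0
    return (candidate if b == 1 else previous), b, beta_1
```

With equal Q values the choice is a coin flip; a larger alpha pushes the choice toward a coin flip even when the Q values differ.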
- Parameters
observation_spec (nested TensorSpec) – representing the observations.
action_spec (BoundedTensorSpec) – representing the continuous action.
reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).
actor_network_cls (Callable) – used to construct the actor network. The constructed actor network will be called to sample continuous actions.
critic_network_cls (Callable) – used to construct the critic network for estimating Q(s,a) given that the action is continuous.
actor_observation_processors (Nest) – a nest of observation processors applied to the inputs of the actor network. Note that any configured input_preprocessors of actor_network_cls will be overwritten by a tuple of this one and a preprocessor of the prev action, for modeling \(\pi(a|s,a^-)\).
reward_weights (None|list[float]) – this is only used when the reward is multidimensional. In that case, the weighted sum of the q values is used for training the actor if reward_weights is not None. Otherwise, the sum of the q values is used.
num_critic_replicas (int) – number of critics to be used. Default is 2.
epsilon_greedy (float) – a floating value in [0,1], representing the chance of action sampling instead of taking argmax. This can help prevent a dead loop in some deterministic environments like Breakout. Only used for evaluation. If None, its value is taken from config.epsilon_greedy and then alf.get_config_value(TrainerConfig.epsilon_greedy).
env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. env only needs to be provided to the root algorithm.
config (TrainerConfig) – config for training. It only needs to be provided to the algorithm which performs train_iter() by itself.
target_update_tau (float) – Factor for soft update of the target networks.
target_update_period (int) – Period for soft update of the target networks.
critic_loss_ctor (None|OneStepTDLoss|MultiStepLoss) – a critic loss constructor. If None, a default TAACTDLoss will be used.
actor_optimizer (torch.optim.Optimizer) – The optimizer for actor.
critic_optimizer (torch.optim.Optimizer) – The optimizer for critic.
alpha_optimizer (torch.optim.Optimizer) – The optimizer for alpha.
initial_alpha (float) – the initial entropy weight for both policies.
debug_summaries (bool) – True if debug summaries should be created.
randomize_first_state_tau (bool) – whether to randomize state.tau at the beginning of an episode during rollout and training. Potentially this helps exploration. This was turned off in Yu et al. 2021.
b1_advantage_clipping (None|tuple[float]) – option for clipping the advantage (defined as \(Q(s,\hat{\tau}) - Q(s,\tau^-)\)) when computing \(\beta_1\). If not None, it should be a pair of numbers [min_adv, max_adv].
max_repeat_steps (None|int) – the max number of steps to repeat during rollout and evaluation. This value doesn’t impact the switch during training.
target_entropy (Callable|tuple[Callable]|None) – If a callable function, then it will be called on the action spec to calculate a target entropy. If None, a default entropy will be calculated. To set separate entropy targets for the two stage policies, this argument can be a tuple of two callables.
checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.
name (str) – name of the algorithm
- after_update(root_inputs, info)[source]#
Do things after completing one gradient update (i.e. update_with_gradient()). This function can be used for post-processing following one minibatch update, such as copying a training model to a target model in SAC, DQN, etc.
- Parameters
root_inputs (nest) – temporally batched inputs for the rollout_step() of the root algorithm collected during unroll().
info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() for on-policy training or train_step() for off-policy training.
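As an illustration of the kind of post-update work mentioned here, the sketch below shows a generic soft (Polyak) target update controlled by a factor such as target_update_tau. It operates on plain dictionaries of floats; the helper name and data layout are hypothetical and are not ALF's target updater.

```python
def soft_update(target_params, source_params, tau):
    """Move each target parameter a fraction ``tau`` toward its source value.

    tau=1.0 copies the source exactly; tau=0.0 leaves the target unchanged.
    """
    return {
        name: (1.0 - tau) * target_params[name] + tau * source_params[name]
        for name in target_params
    }


# One soft update with tau=0.05 moves the target 5% of the way.
target = soft_update({"w": 0.0, "b": 2.0}, {"w": 1.0, "b": 2.0}, tau=0.05)
```

A small tau makes the target network change slowly, which is the usual reason for using a soft update instead of a hard copy.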
- calc_loss(info)[source]#
Calculate the loss at each step for each sample.
- Parameters
info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).
- Returns
loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).
- Return type
- predict_step(inputs, state)[source]#
Predict for one step of observation.
This is only used for evaluation, so it only needs to perform the computations for generating the action distribution.
- Parameters
time_step (TimeStep) – Current observation and other inputs for computing action.
state (nested Tensor) – should be consistent with predict_state_spec.
- Returns
output (nested Tensor): should be consistent with action_spec.
state (nested Tensor): should be consistent with predict_state_spec.
- Return type
- rollout_step(inputs, state)[source]#
Rollout for one step of inputs.
It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for calc_loss(). For off-policy training, it needs to generate necessary information for train_step().
- Parameters
inputs (nested Tensor) – inputs for prediction.
state (nested Tensor) – network state (for RNN).
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match rollout_state_spec.
info (nested Tensor): For on-policy training it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored in the replay buffer and later retrieved for train_step() as rollout_info.
- Return type
- summarize_rollout(experience)[source]#
Generate summaries for rollout.
- Parameters
experience – experience collected from rollout_step().
custom_summary – when specified, it is a function that will be called every time this summarize_rollout hook is called. This provides a convenient way for the user to extend summarize_rollout from ALF configs.
- train_step(inputs, state, rollout_info)[source]#
Perform one step of training computation.
It is called to calculate output for every time step for a batch of experience from the replay buffer. It also needs to generate necessary information for calc_loss().
- Parameters
inputs (nested Tensor) – inputs for train.
state (nested Tensor) – consistent with train_state_spec.
rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from the replay buffer.
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match train_state_spec.
info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.
- Return type
- training: bool#
- class TaacCriticInfo(critics, target_critic, value_loss)#
Bases: tuple
Create new instance of TaacCriticInfo(critics, target_critic, value_loss)
- critics#
Alias for field number 0
- target_critic#
Alias for field number 1
- value_loss#
Alias for field number 2
- class TaacInfo(reward, step_type, tau, prev_tau, discount, action_distribution, rollout_b, b, actor, critic, alpha, repeats)#
Bases: tuple
Create new instance of TaacInfo(reward, step_type, tau, prev_tau, discount, action_distribution, rollout_b, b, actor, critic, alpha, repeats)
- action_distribution#
Alias for field number 5
- actor#
Alias for field number 8
- alpha#
Alias for field number 10
- b#
Alias for field number 7
- critic#
Alias for field number 9
- discount#
Alias for field number 4
- prev_tau#
Alias for field number 3
- repeats#
Alias for field number 11
- reward#
Alias for field number 0
- rollout_b#
Alias for field number 6
- step_type#
Alias for field number 1
- tau#
Alias for field number 2
- class TaacLAlgorithm(name='TaacLAlgorithm', inverse_mode=True, *args, **kwargs)[source]#
Bases: alf.algorithms.taac_algorithm.TaacAlgorithmBase
TaacL: Piecewise linear trajectory policy for continuous control.
For a linear trajectory, let \(a\) be the action and \(v\) the first derivative. Its dynamics are:
\[\begin{split}\begin{array}{ll} v_{t+1} &\leftarrow v_t\\ a_{t+1} &\leftarrow v_{t+1} + a_t\\ \end{array}\end{split}\]
TaacL’s trajectory is piecewise linear. At each step the policy decides whether to repeat the previous linear trajectory or generate a new one. Importantly, to generate a new one the policy doesn’t directly generate the entire set of two parameters \((a,v)\) because this would result in bad exploration in the action space. Instead,
\[\begin{split}\begin{array}{ll} a_{t+1} &\sim \pi\\ v_{t+1} &\leftarrow a_{t+1} - a_t\\ \end{array}\end{split}\]
For \(a\in[0,1]\) and \(v\in[0,1]\), the actual dynamics is \(a_{t+1}\leftarrow \max(\min(a_t+2v_{t+1},1),-1)\).
See TaacAlgorithmBase for the description of other arguments.
- Parameters
inverse_mode (bool) – this argument decides how the new trajectory is computed when b=1. If it’s False, then the new action is treated as the new first derivative v; otherwise the new action is treated as the new action a, and v is inversely inferred.
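The recurrences for the piecewise linear trajectory can be sketched in plain Python. This is an illustrative sketch only: `linear_traj_step` is a hypothetical helper, it works on scalars, and it follows the idealized recurrences while omitting the scaling and clipping of the actual dynamics.

```python
def linear_traj_step(a, v, b, policy_output=None, inverse_mode=True):
    """Advance a linear trajectory (a, v) by one step.

    b=0 repeats the previous trajectory; b=1 starts a new one from
    ``policy_output``. Scaling and clipping are intentionally left out.
    """
    if b == 0:
        v_next = v                    # keep the slope
        a_next = a + v_next           # follow the line
    elif inverse_mode:
        a_next = policy_output        # the output is the new action
        v_next = a_next - a           # the slope is inferred inversely
    else:
        v_next = policy_output        # the output is the new slope
        a_next = a + v_next
    return a_next, v_next


# Start a new trajectory, then repeat it twice.
a, v = linear_traj_step(0.0, 0.0, b=1, policy_output=0.25)
history = [a]
for _ in range(2):
    a, v = linear_traj_step(a, v, b=0)
    history.append(a)
```

Repeating the trajectory keeps extrapolating along the same slope, so a single decision of the policy shapes several consecutive actions.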
- training: bool#
- class TaacLossInfo(actor, critic, alpha)#
Bases: tuple
Create new instance of TaacLossInfo(actor, critic, alpha)
- actor#
Alias for field number 0
- alpha#
Alias for field number 2
- critic#
Alias for field number 1
- class TaacQAlgorithm(name='TaacQAlgorithm', inverse_mode=True, *args, **kwargs)[source]#
Bases: alf.algorithms.taac_algorithm.TaacLAlgorithm
TaacQ: Piecewise quadratic trajectory policy for continuous control.
For a quadratic trajectory, let \(a\) be the action, \(u\) be the second derivative, and \(v\) be the first derivative. Its dynamics are:
\[\begin{split}\begin{array}{ll} u_{t+1} &\leftarrow u_t\\ v_{t+1} &\leftarrow u_{t+1} + v_t\\ a_{t+1} &\leftarrow v_{t+1} + a_t\\ \end{array}\end{split}\]
TaacQ’s trajectory is piecewise quadratic. At each step the policy decides whether to repeat the previous quadratic trajectory or generate a new one. Importantly, to generate a new one the policy doesn’t directly generate the entire set of three parameters \((a,v,u)\) because this would result in bad exploration in the action space. Instead,
\[\begin{split}\begin{array}{ll} a_{t+1} &\sim \pi\\ v_{t+1} &\leftarrow a_{t+1} - a_t\\ u_{t+1} &\leftarrow v_{t+1}\\ \end{array}\end{split}\]
where the last two steps assume resetting \(v_t\) to zero.
For \(a\in[0,1]\), \(v\in[0,1]\), and \(u\in[0,1]\), the actual dynamics is \(v_{t+1}\leftarrow \max(\min(v_t+2u_{t+1},1),-1)\) and \(a_{t+1}\leftarrow \max(\min(a_t+2v_{t+1},1),-1)\).
See TaacAlgorithmBase for the description of other arguments.
- Parameters
inverse_mode (bool) – this argument decides how the new trajectory is computed when b=1. If it’s False, then the new action is treated as the new second derivative u; otherwise the new action is treated as the new action a, and u is inversely inferred. In either case, the current v is first set to 0, and then a new v is computed.
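The quadratic case can be sketched the same way. As before, `quadratic_traj_step` is a hypothetical scalar helper that follows the idealized recurrences, resets the current v to zero when a new trajectory starts, and leaves out the scaling and clipping of the actual dynamics.

```python
def quadratic_traj_step(a, v, u, b, policy_output=None, inverse_mode=True):
    """Advance a quadratic trajectory (a, v, u) by one step.

    b=0 repeats the previous trajectory; b=1 starts a new one from
    ``policy_output`` after resetting the current v to zero.
    """
    if b == 0:
        u_next = u
        v_next = u_next + v
        a_next = v_next + a
    elif inverse_mode:
        a_next = policy_output        # the output is the new action
        v_next = a_next - a
        u_next = v_next               # v_t was reset to 0, so u = v_next - 0
    else:
        u_next = policy_output        # the output is the new second derivative
        v_next = u_next               # v_t was reset to 0
        a_next = v_next + a
    return a_next, v_next, u_next


# Start a new trajectory, then repeat it twice.
state = quadratic_traj_step(0.0, 0.0, 0.0, b=1, policy_output=0.125)
actions = [state[0]]
for _ in range(2):
    state = quadratic_traj_step(*state, b=0)
    actions.append(state[0])
```

While repeating, the slope v grows by u each step, so the actions follow a parabola instead of a straight line.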
- training: bool#
alf.algorithms.td_loss#
- class TDLoss(gamma=0.99, td_error_loss_fn=<function element_wise_squared_loss>, td_lambda=0.95, normalize_target=False, debug_summaries=False, name='TDLoss')[source]#
Bases: torch.nn.modules.module.Module
Temporal difference loss.
Let \(G_{t:T}\) be the bootstrapped return from t to T:
\[G_{t:T} = \sum_{i=t+1}^T \gamma^{i-t-1}R_i + \gamma^{T-t} V(s_T)\]
If td_lambda = 1, the target for step t is \(G_{t:T}\).
If td_lambda = 0, the target for step t is \(G_{t:t+1}\).
If 0 < td_lambda < 1, the target for step t is the \(\lambda\)-return:
\[G_t^\lambda = (1 - \lambda) \sum_{i=t+1}^{T-1} \lambda^{i-t-1}G_{t:i} + \lambda^{T-t-1} G_{t:T}\]
There is a simple relationship between the \(\lambda\)-return and the generalized advantage estimation \(\hat{A}^{GAE}_t\):
\[G_t^\lambda = \hat{A}^{GAE}_t + V(s_t)\]
where the generalized advantage estimation is defined as:
\[\hat{A}^{GAE}_t = \sum_{i=t}^{T-1}(\gamma\lambda)^{i-t}(R_{i+1} + \gamma V(s_{i+1}) - V(s_i))\]
References:
Schulman et al. High-Dimensional Continuous Control Using Generalized Advantage Estimation
Sutton et al. Reinforcement Learning: An Introduction, Chapter 12, 2018
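The relationship between the \(\lambda\)-return and GAE can be checked numerically with a short pure-Python sketch. It uses the standard definitions, in which the first reward after step t is undiscounted (exponent \(i-t-1\)) and the \(\lambda\) weights sum to one. The helper names are illustrative, and episode boundaries (step_type, discount) are ignored for brevity.

```python
def n_step_return(rewards, values, t, k, gamma):
    """G_{t:k}; rewards[i] is R_i (index 0 is unused), values[i] is V(s_i)."""
    ret = sum(gamma ** (i - t - 1) * rewards[i] for i in range(t + 1, k + 1))
    return ret + gamma ** (k - t) * values[k]


def lambda_return(rewards, values, t, gamma, lam):
    """Lambda-weighted mixture of the n-step returns starting at step t."""
    T = len(values) - 1
    mixed = sum(
        lam ** (i - t - 1) * n_step_return(rewards, values, t, i, gamma)
        for i in range(t + 1, T))
    tail = lam ** (T - t - 1) * n_step_return(rewards, values, t, T, gamma)
    return (1 - lam) * mixed + tail


def gae(rewards, values, t, gamma, lam):
    """Generalized advantage estimate as a discounted sum of TD errors."""
    T = len(values) - 1
    return sum(
        (gamma * lam) ** (i - t)
        * (rewards[i + 1] + gamma * values[i + 1] - values[i])
        for i in range(t, T))


rewards = [0.0, 1.0, 0.5, -0.2, 2.0]   # R_1..R_4, index 0 unused
values = [0.3, 0.1, -0.4, 0.8, 0.6]    # V(s_0)..V(s_4)
gap = max(
    abs(lambda_return(rewards, values, t, 0.9, 0.7)
        - (gae(rewards, values, t, 0.9, 0.7) + values[t]))
    for t in range(4))
```

The two limiting cases follow from the same function: lam=1 gives the full bootstrapped return and lam=0 gives the one-step target.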
- Parameters
gamma (Union[float, List[float]]) – A discount factor for future rewards. For multi-dim reward, this can also be a list of discounts, each discount applies to a reward dim.
td_error_loss_fn (Callable) – A function for computing the TD errors loss. This function takes as input the target and the estimated Q values and returns the loss for each element of the batch.
td_lambda (float) – Lambda parameter for TD-lambda computation.
normalize_target (bool) – whether to normalize target. Note that the effect of this is to change the loss. The critic value itself is not normalized.
debug_summaries (bool) – True if debug summaries should be created.
name (str) – The name of this loss.
- compute_td_target(info, target_value)[source]#
Calculate the td target.
The first dimension of all the tensors is the time dimension and the second dimension is the batch dimension.
- Parameters
info (namedtuple) – experience collected from unroll() or a replay buffer. All tensors are time-major. info should contain the following fields:
- reward:
- step_type:
- discount:
target_value (torch.Tensor) – the time-major tensor for the value at each time step. This is used to calculate return. target_value can be same as value.
- Returns
td_target
- forward(info, value, target_value)[source]#
Calculate the loss.
The first dimension of all the tensors is the time dimension and the second dimension is the batch dimension.
- Parameters
info (namedtuple) – experience collected from unroll() or a replay buffer. All tensors are time-major. info should contain the following fields:
- reward:
- step_type:
- discount:
value (Tensor) – the time-major tensor for the value at each time step. The loss is between this and the calculated return.
target_value (Tensor) – the time-major tensor for the value at each time step. This is used to calculate return. target_value can be same as value.
- Returns
with the extra field same as loss.
- Return type
- property gamma#
Return the \(\gamma\) value for discounting future rewards.
- Returns
a rank-0 or rank-1 (multi-dim reward) floating tensor.
- Return type
Tensor
- training: bool#
- class TDQRLoss(num_quantiles=50, gamma=0.99, td_error_loss_fn=<function huber_function>, td_lambda=1.0, sum_over_quantiles=False, debug_summaries=False, name='TDQRLoss')[source]#
Bases: alf.algorithms.td_loss.TDLoss
Temporal difference quantile regression loss. Compared to TDLoss, GAE support has not been implemented.
- Parameters
num_quantiles (int) – the number of quantiles.
gamma (Union[float, List[float]]) – A discount factor for future rewards. For multi-dim reward, this can also be a list of discounts, each discount applies to a reward dim.
td_error_loss_fn (Callable) – A function for computing the TD errors loss. This function takes as input the target and the estimated Q values and returns the loss for each element of the batch.
td_lambda (float) – Lambda parameter for TD-lambda computation. Currently only supports 1 and 0.
sum_over_quantiles (bool) – If True, the quantile regression loss will be summed along the quantile dimension. Otherwise, it will be averaged along the quantile dimension instead. Default is False.
debug_summaries (bool) – True if debug summaries should be created.
name (str) – The name of this loss.
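For readers unfamiliar with quantile regression losses, the sketch below shows the commonly used quantile Huber loss for a single sample, with quantile levels placed at bin midpoints. It is an assumption that this matches the general shape of what TDQRLoss computes; the function names, the midpoint levels, and the averaging over target quantiles are illustrative and not taken from ALF.

```python
def huber(u, kappa=1.0):
    """Huber function: quadratic near zero, linear in the tails."""
    a = abs(u)
    return 0.5 * u * u if a <= kappa else kappa * (a - 0.5 * kappa)


def quantile_huber_loss(value_quantiles, target_quantiles,
                        sum_over_quantiles=False, kappa=1.0):
    """Quantile regression loss between predicted and target quantiles."""
    n = len(value_quantiles)
    per_quantile = []
    for i, v in enumerate(value_quantiles):
        tau = (2 * i + 1) / (2.0 * n)      # midpoint of the i-th quantile bin
        total = 0.0
        for z in target_quantiles:
            u = z - v                      # TD error for this quantile pair
            # Asymmetric weight: under- and over-estimation cost differently.
            weight = abs(tau - (1.0 if u < 0 else 0.0))
            total += weight * huber(u, kappa)
        per_quantile.append(total / len(target_quantiles))
    if sum_over_quantiles:
        return sum(per_quantile)
    return sum(per_quantile) / n
```

Summing over quantiles instead of averaging only rescales the loss by the number of quantiles, which matters mainly when it is combined with other loss terms.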
- forward(info, value, target_value)[source]#
Calculate the loss.
The first dimension of all the tensors is the time dimension and the second dimension is the batch dimension.
- Parameters
info (namedtuple) – experience collected from unroll() or a replay buffer. All tensors are time-major. info should contain the following fields:
- reward:
- step_type:
- discount:
value (Tensor) – the time-major tensor for the value at each time step. The loss is between this and the calculated return.
target_value (Tensor) – the time-major tensor for the value at each time step. This is used to calculate return. target_value can be same as value.
- Returns
with the extra field same as loss.
- Return type
- training: bool#
alf.algorithms.trac_algorithm#
Trust-region actor-critic algorithm.
- class TracAlgorithm(observation_spec, action_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), env=None, config=None, ac_algorithm_cls=<class 'alf.algorithms.actor_critic_algorithm.ActorCriticAlgorithm'>, action_dist_clip_per_dim=0.01, checkpoint=None, debug_summaries=False, name='TracAlgorithm')[source]#
Bases: alf.algorithms.rl_algorithm.RLAlgorithm
Trust-region actor-critic. It compares the action distributions after the SGD update with the action distributions from the previous model. If the average distance is too big, the new parameters are shrunk as:
w_new' = w_old + 0.9 * distance_clip / distance * (w_new - w_old)
If the distribution is Categorical, the distance is \(||logits_1 - logits_2||^2\), and if the distribution is Deterministic, it is \(||loc_1 - loc_2||^2\); otherwise it is \(KL(d1||d2) + KL(d2||d1)\). The reason for using \(||logits_1 - logits_2||^2\) for categorical distributions is that KL can be small even if there are large differences in logits when the entropy is small. This means that KL cannot fully capture how large the change is.
- Parameters
action_spec (nested BoundedTensorSpec) – representing the actions.
ac_algorithm_cls (type) – Actor Critic Algorithm cls.
action_dist_clip_per_dim (float) – action dist clip per dimension
checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.
debug_summaries (bool) – True if debug summaries should be created.
name (str) – Name of this algorithm.
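The shrinking rule in the class description can be sketched on a flat list of parameters. The sketch assumes that "too big" means the measured distance exceeds distance_clip; `shrink_update` and its arguments are hypothetical names for illustration, not ALF's code.

```python
def shrink_update(w_old, w_new, distance, distance_clip, factor=0.9):
    """Pull new parameters back toward the old ones when the step is too large.

    Implements w_new' = w_old + factor * distance_clip / distance * (w_new - w_old)
    and leaves the update untouched when the distance is within the clip.
    """
    if distance <= distance_clip:
        return list(w_new)
    scale = factor * distance_clip / distance
    return [old + scale * (new - old) for old, new in zip(w_old, w_new)]


# The distance is twice the clip, so only 0.9 * 0.5 = 45% of the step is kept.
shrunk = shrink_update([0.0, 1.0], [1.0, 3.0], distance=0.02, distance_clip=0.01)
```

The 0.9 factor leaves a margin, so the shrunk update is intended to land inside the trust region instead of exactly on its boundary.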
- calc_loss(info)[source]#
Calculate the loss at each step for each sample.
- Parameters
info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).
- Returns
loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).
- Return type
- predict_step(time_step, state)[source]#
Predict for one step of observation.
This is only used for evaluation, so it only needs to perform the computations for generating the action distribution.
- Parameters
time_step (TimeStep) – Current observation and other inputs for computing action.
state (nested Tensor) – should be consistent with predict_state_spec.
- Returns
output (nested Tensor): should be consistent with action_spec.
state (nested Tensor): should be consistent with predict_state_spec.
- Return type
- preprocess_experience(root_inputs, rollout_info, batch_info)[source]#
This function is called on the experiences obtained from a replay buffer. An example usage of this function is to calculate advantages and returns in PPOAlgorithm.
The shapes of tensors in experience are assumed to be \((B, T, ...)\).
- Parameters
root_inputs (nest) – input for rollout_step() of the root algorithm. This is from the replay buffer. Note this is not the same as the input of rollout_step() of self unless self is the root algorithm.
rollout_info (nested Tensor) – AlgStep.info from rollout_step() for this algorithm.
batch_info (BatchInfo) – information about this batch of data.
- Returns
processed root_inputs
processed rollout_info
- Return type
tuple
- train_step(exp, state, rollout_info)[source]#
Perform one step of training computation.
It is called to calculate output for every time step for a batch of experience from the replay buffer. It also needs to generate necessary information for calc_loss().
- Parameters
inputs (nested Tensor) – inputs for train.
state (nested Tensor) – consistent with train_state_spec.
rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from the replay buffer.
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match train_state_spec.
info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.
- Return type
- training: bool#
- class TracExperience(observation, step_type, state, action_param, prev_action)#
Bases: tuple
Create new instance of TracExperience(observation, step_type, state, action_param, prev_action)
- action_param#
Alias for field number 3
- observation#
Alias for field number 0
- prev_action#
Alias for field number 4
- state#
Alias for field number 2
- step_type#
Alias for field number 1
- class TracInfo(action_distribution, observation, state, ac, prev_action)#
Bases: tuple
Create new instance of TracInfo(action_distribution, observation, state, ac, prev_action)
- ac#
Alias for field number 3
- action_distribution#
Alias for field number 0
- observation#
Alias for field number 1
- prev_action#
Alias for field number 4
- state#
Alias for field number 2
alf.algorithms.vae#
Variational auto encoder.
- class DiscreteVAE(z_spec, input_tensor_spec=None, z_network_cls=<class 'alf.networks.encoding_networks.EncodingNetwork'>, prior_input_tensor_spec=None, prior_z_network_cls=None, mode='st', gumbel_temp_scheduler=1.0, beta=1.0, target_kld_per_categorical=None, beta_optimizer=None, name='DiscreteVAE')[source]#
Bases: alf.algorithms.vae.VariationalAutoEncoder
VAE with a discrete posterior distribution. The latent z might be a single categorical variable or a vector of categoricals. Because the re-parameterization trick can no longer be applied to the discrete distribution, we instead use the straight-through (ST) gradient estimator to train the encoder.
Bengio et al., "Estimating or Propagating Gradients Through Stochastic Neurons for Conditional Computation", 2013.
In short, we can re-parameterize the one-hot latent embedding \(z\) as
\[\hat{z} = z + z_{prob} - SG(z_{prob})\]
where \(SG\) denotes the stop-gradient operator. Because \(z\) is a sampled discrete variable, it has no gradient. So the parameter gradient is
\[\frac{\partial L}{\partial \hat{z}}\frac{\partial \hat{z}}{\partial \theta} = \frac{\partial L}{\partial \hat{z}}\frac{\partial z_{prob}}{\partial \theta}\]
Alternatively, we provide the option of the ST Gumbel-Softmax gradient estimator:
Jang et al., "Categorical Reparameterization with Gumbel-Softmax", 2017.
It applies the above ST trick to the Gumbel-softmax distribution, which uses the Gumbel trick to reparameterize the categorical sampling process. The paper claims that the ST Gumbel-softmax gradient estimator has a lower variance than the plain ST estimator.
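The plain ST re-parameterization can be sketched in a few lines of PyTorch, with detach() playing the role of the stop-gradient operator. The function name and tensor shapes are assumptions for illustration; this is not the DiscreteVAE implementation, and it does not cover the 'st-gumbel' mode.

```python
import torch


def straight_through_one_hot(logits):
    """Sample a one-hot z and return z_hat = z + z_prob - SG(z_prob).

    The forward value of z_hat equals the sampled one-hot z, while its
    gradient with respect to ``logits`` is that of the softmax probabilities.
    """
    probs = torch.softmax(logits, dim=-1)
    index = torch.distributions.Categorical(probs=probs).sample()
    z = torch.nn.functional.one_hot(
        index, num_classes=logits.shape[-1]).to(probs.dtype)
    z_hat = z + probs - probs.detach()
    return z_hat, z, probs
```

A loss computed from z_hat therefore backpropagates into the encoder as if the probabilities had been used, even though the forward pass sees a discrete sample.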
- Parameters
z_spec (BoundedTensorSpec) – a tensor spec for the discrete posterior. It has to be rank-one, representing a vector of discrete variables. The value bounds of each variable must be identical and the lower bound has to be 0.
input_tensor_spec (Union[TensorSpec,List[ForwardRef],Tuple[()],Tuple[ForwardRef, …],Dict[str,ForwardRef]]) – the input spec.
z_network_cls (Callable) – an encoding network to encode input data into a vector of logits. If prior_z_network_cls is None, this network must handle input with spec input_tensor_spec. If prior_z_network_cls is not None, this network must handle input with spec (prior_input_tensor_spec, input_tensor_spec, prior_z_network.output_spec).
prior_input_tensor_spec (Union[TensorSpec,List[ForwardRef],Tuple[()],Tuple[ForwardRef, …],Dict[str,ForwardRef]]) – the input spec for prior_z_network.
prior_z_network_cls (Callable) – an encoding network that outputs a vector of logits representing the prior z distribution given the prior input.
mode (str) – either ‘st’ or ‘st-gumbel’.
gumbel_temp_scheduler (Scheduler) – the temperature scheduler for gumbel-softmax. Only used when mode=='st-gumbel'.
beta (float) – the weight for KL-divergence.
target_kld_per_categorical (float) – if not None, then this will be used as the target KLD per Categorical to automatically tune beta.
beta_optimizer (Optimizer) – if not None, will be used to train beta.
name (str) –
- property output_spec#
Because the output is a floating one-hot vector, the shape is rank-two.
- training: bool#
- class VAEInfo(kld, z_std, loss, beta_loss, beta)#
Bases: tuple
Create new instance of VAEInfo(kld, z_std, loss, beta_loss, beta)
- beta#
Alias for field number 4
- beta_loss#
Alias for field number 3
- kld#
Alias for field number 0
- loss#
Alias for field number 2
- z_std#
Alias for field number 1
- class VAEOutput(z, z_mode, z_std)#
Bases: tuple
Create new instance of VAEOutput(z, z_mode, z_std)
- z#
Alias for field number 0
- z_mode#
Alias for field number 1
- z_std#
Alias for field number 2
- class VariationalAutoEncoder(z_dim, input_tensor_spec=None, preprocess_network=None, z_prior_network=None, beta=1.0, target_kld_per_dim=None, beta_optimizer=None, checkpoint=None, name='VariationalAutoEncoder')[source]#
Bases: alf.algorithms.algorithm.Algorithm
VariationalAutoEncoder encodes data into a diagonal multivariate Gaussian, performs sampling with the reparametrization trick, and returns the KL divergence between posterior and prior.
Mathematically:
\(\log p(x) \geq E_z \log P(x|z) - \beta KL(q(z|x) || prior(z))\)
The train_step() method returns the sampled z and the KLD. It is up to the user of this class to use the returned z to decode and compute the reconstruction loss, and to combine it with the KL loss returned here to optimize the whole network.
See vae_test.py for example usages to train vanilla vae, conditional vae and vae with prior network on the mnist dataset.
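The two ingredients named above, reparameterized sampling and the KL term, can be written out for a diagonal Gaussian posterior. The sketch assumes a standard normal prior, so it does not cover the z_prior_network case, and it works on plain Python lists. The function names are illustrative, not part of ALF.

```python
import math
import random


def sample_z(z_mean, z_log_var, rng=random):
    """Reparameterization trick: z = mean + std * eps, with eps ~ N(0, 1)."""
    return [
        m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(z_mean, z_log_var)
    ]


def kl_to_standard_normal(z_mean, z_log_var):
    """KL(N(mean, diag(exp(log_var))) || N(0, I)), summed over dimensions."""
    return 0.5 * sum(
        m * m + math.exp(lv) - lv - 1.0
        for m, lv in zip(z_mean, z_log_var))


# A posterior equal to the prior has zero KL divergence.
kld = kl_to_standard_normal([0.0, 0.0], [0.0, 0.0])
```

Because the noise eps is sampled independently of the parameters, gradients of a downstream loss can flow through the mean and log variance.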
- Parameters
z_dim (int) – dimension of latent vector z, namely, the dimension for generating z_mean and z_log_var.
input_tensor_spec (Union[TensorSpec,List[ForwardRef],Tuple[()],Tuple[ForwardRef, …],Dict[str,ForwardRef]]) – the input spec which can be a nest. If preprocess_network is None, then it must be provided.
preprocess_network (EncodingNetwork) – an encoding network to preprocess input data before projecting it into (mean, log_var). If z_prior_network is None, this network must handle input with spec input_tensor_spec. If z_prior_network is not None, this network must handle input with spec (z_prior_network.input_tensor_spec, input_tensor_spec, z_prior_network.output_spec). If this is None, an MLP of hidden sizes (z_dim*2, z_dim*2) will be used.
z_prior_network (EncodingNetwork) – an encoding network that outputs the concatenation of a prior mean and prior log var given the prior input. The network shouldn’t activate its output.
beta (float) – the weight for KL-divergence.
target_kld_per_dim (float) – if not None, then this will be used as the target KLD per dim to automatically tune beta.
beta_optimizer (Optimizer) – if not None, will be used to train beta.
checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.
name (str) –
- train_step(inputs, state=())[source]#
- Parameters
inputs (nested Tensor) – data to be encoded. If there is a prior network, then inputs is a tuple of (prior_input, new_obs).
state (Tensor) – empty tuple ()
- Returns
output (VAEOutput):
state: empty tuple ()
info (VAEInfo):
- Return type
- training: bool#
alf.algorithms.vq_vae#
Vector Quantized Variational AutoEncoder Algorithm.
- class Vqvae(input_tensor_spec, num_embeddings, embedding_dim, encoder_ctor=<class 'alf.networks.encoding_networks.EncodingNetwork'>, decoder_ctor=<class 'alf.networks.encoding_networks.EncodingNetwork'>, optimizer=None, commitment_loss_weight=1.0, checkpoint=None, debug_summaries=False, name='Vqvae')[source]#
Bases: alf.algorithms.algorithm.Algorithm
Vector Quantized Variational AutoEncoder (VQVAE) algorithm, described in:
A van den Oord et al. “Neural Discrete Representation Learning”, NeurIPS 2017.
VQVAE differs from the standard VAE mainly in the following aspects:
Discrete latent is used, instead of continuous latent as in standard VAE.
Standard VAE uses Gaussian prior and posterior. VQVAE can be viewed as using a deterministic form of posterior, which is a categorical distribution with one-hot samples computed by nearest neighbor matching (Eq.1 of the paper). By using a uniform prior, the KL divergence is constant.
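The nearest neighbor matching that defines the deterministic posterior can be sketched on plain Python lists. `quantize` is a hypothetical helper for a single latent vector, using squared Euclidean distance; it is not the Vqvae implementation and leaves out the gradient handling and the commitment loss.

```python
def quantize(latent, codebook):
    """Return (index, one_hot, embedding) of the nearest codebook entry."""

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Deterministic choice: the index with the smallest squared distance.
    index = min(range(len(codebook)), key=lambda k: sq_dist(latent, codebook[k]))
    one_hot = [1.0 if k == index else 0.0 for k in range(len(codebook))]
    return index, one_hot, codebook[index]


# A codebook with num_embeddings=3 and embedding_dim=2.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
index, one_hot, embedding = quantize([0.9, 0.8], codebook)
```

Since the choice is an argmin, the same encoder output always maps to the same code, which is why the posterior is described as deterministic.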
- Parameters
input_tensor_spec (TensorSpec) – the tensor spec of the input.
num_embeddings (int) – the number of embeddings (size of codebook)
embedding_dim (int) – the dimensionality of embedding vectors
encoder_ctor (Callable) – called as encoder_ctor(observation_spec) to construct the encoding Network. The network takes the raw observation as input and outputs the latent representation.
decoder_ctor (Callable) – called as decoder_ctor(latent_spec) to construct the decoder.
optimizer (Optimizer|None) – if provided, it will be used to optimize the parameters of encoder_net, decoder_net and the embedding vectors.
commitment_loss_weight (float) – the weight for commitment loss.
checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.
- predict_step(inputs, state=())[source]#
Predict for one step of inputs.
- Parameters
inputs (nested Tensor) – inputs for prediction.
state (nested Tensor) – network state (for RNN).
- Returns
output (nested Tensor): prediction result.
state (nested Tensor): should match predict_state_spec.
info (nest): information for analyzing the agent. In particular, if an element of the info is alf.summary.render.Image, it will be rendered during play. See alf/summary/render.py for detail.
- Return type
- train_step(inputs, state=())[source]#
- Parameters
inputs (tensor) – with the shape the same as input_tensor_spec
- training: bool#