alf.algorithms#

alf.algorithms.actor_critic_algorithm#

Actor critic algorithm.

class ActorCriticAlgorithm(observation_spec, action_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), reward_weights=None, actor_network_ctor=<class 'alf.networks.actor_distribution_networks.ActorDistributionNetwork'>, value_network_ctor=<class 'alf.networks.value_networks.ValueNetwork'>, epsilon_greedy=None, env=None, config=None, loss=None, loss_class=<class 'alf.algorithms.actor_critic_loss.ActorCriticLoss'>, optimizer=None, checkpoint=None, debug_summaries=False, name='ActorCriticAlgorithm')[source]#

Bases: alf.algorithms.on_policy_algorithm.OnPolicyAlgorithm

Actor critic algorithm.

Parameters
  • observation_spec (nested TensorSpec) – representing the observations.

  • action_spec (nested BoundedTensorSpec) – representing the actions.

  • reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).

  • reward_weights (None|list[float]) – this is only used when the reward is multidimensional. In that case, the weighted sum of the v values is used for training the actor if reward_weights is not None. Otherwise, the sum of the v values is used.

  • env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. env only needs to be provided to the root Algorithm.

  • epsilon_greedy (float) – a floating value in [0,1], representing the chance of action sampling instead of taking argmax. This can help prevent a dead loop in some deterministic environments such as Breakout. Only used for evaluation. If None, its value is taken from config.epsilon_greedy and then alf.get_config_value(TrainerConfig.epsilon_greedy).

  • config (TrainerConfig) – config for training. config only needs to be provided to the algorithm which performs train_iter() by itself.

  • actor_network_ctor (Callable) – Function to construct the actor network. actor_network_ctor needs to accept input_tensor_spec and action_spec as its arguments and return an actor network. The constructed network will be called with forward(observation, state).

  • value_network_ctor (None | Callable) – Function to construct the value network. value_network_ctor needs to accept input_tensor_spec as its argument and return a value network. The constructed network will be called with forward(observation, state) and returns a value tensor for each observation given the observation and network state. Note that if the algorithm is constructed for evaluation or deployment only, value_network_ctor can be set to None and the value network will not be constructed at all.

  • loss (None|ActorCriticLoss) – an object for calculating loss. If None, a default loss of class loss_class will be used.

  • loss_class (type) – the class of the loss. The signature of its constructor: loss_class(debug_summaries)

  • optimizer (torch.optim.Optimizer) – The optimizer for training

  • checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.

  • debug_summaries (bool) – True if debug summaries should be created.

  • name (str) – Name of this algorithm.
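As a rough illustration of the epsilon_greedy parameter described above, the following plain-Python sketch samples an action with probability epsilon and takes the argmax otherwise. It works on a list of categorical probabilities rather than ALF distributions, and the helper name epsilon_greedy_action is hypothetical; it is not ALF's implementation.

```python
import random


def epsilon_greedy_action(probs, epsilon, rng=random):
    """Pick an action index from a categorical distribution.

    With probability ``epsilon`` the action is sampled from ``probs``;
    otherwise the most likely action (argmax) is returned.
    Hypothetical helper for illustration only.
    """
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must be in [0, 1]")
    if rng.random() < epsilon:
        # Sample an index according to the given probabilities.
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]
    # Greedy choice: index of the largest probability.
    return max(range(len(probs)), key=lambda i: probs[i])


probs = [0.1, 0.7, 0.2]
greedy = epsilon_greedy_action(probs, epsilon=0.0)  # always the argmax
```

With epsilon=0.0 the choice is fully deterministic, which is the situation in which a deterministic environment can get stuck in a loop.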

calc_loss(info)[source]#

Calculate loss.

convert_train_state_to_predict_state(state)[source]#

Convert RNN state for train_step() to RNN state for predict_step().

predict_step(inputs, state)[source]#

Predict for one step.

rollout_step(inputs, state)[source]#

Rollout for one step.

training: bool#
class ActorCriticInfo(step_type, discount, reward, action, log_prob, action_distribution, value, reward_weights)#

Bases: tuple

Create new instance of ActorCriticInfo(step_type, discount, reward, action, log_prob, action_distribution, value, reward_weights)

action#

Alias for field number 3

action_distribution#

Alias for field number 5

discount#

Alias for field number 1

log_prob#

Alias for field number 4

reward#

Alias for field number 2

reward_weights#

Alias for field number 7

step_type#

Alias for field number 0

value#

Alias for field number 6
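The "Alias for field number N" entries above follow from ActorCriticInfo being a named tuple. The sketch below builds a stand-in with the same field order using collections.namedtuple (it does not import ALF, and the scalar values are placeholders for tensors) to show that access by name and by position refer to the same element.

```python
from collections import namedtuple

# Stand-in with the same field order as the documented ActorCriticInfo.
ActorCriticInfo = namedtuple(
    "ActorCriticInfo",
    ["step_type", "discount", "reward", "action", "log_prob",
     "action_distribution", "value", "reward_weights"])

# Placeholder scalars are used here where real code would hold tensors.
info = ActorCriticInfo(
    step_type=0, discount=1.0, reward=0.5, action=2, log_prob=-0.3,
    action_distribution=None, value=1.2, reward_weights=None)

# Access by name and by position are equivalent.
assert info.action == info[3]
assert info.value == info[6]

# _replace returns a new tuple; the original is left unchanged.
updated = info._replace(reward=1.0)
```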

class ActorCriticState(actor, value)#

Bases: tuple

Create new instance of ActorCriticState(actor, value)

actor#

Alias for field number 0

value#

Alias for field number 1

alf.algorithms.actor_critic_loss#

class ActorCriticLoss(gamma=0.99, td_error_loss_fn=<function element_wise_squared_loss>, use_gae=False, td_lambda=0.95, use_td_lambda_return=True, normalize_advantages=False, advantage_clip=None, entropy_regularization=None, td_loss_weight=1.0, debug_summaries=False, name='ActorCriticLoss')[source]#

Bases: alf.algorithms.algorithm.Loss

The actor-critic loss is computed as:

(policy_gradient_loss
+ td_loss_weight * td_loss
- entropy_regularization * entropy)
Parameters
  • gamma (float|list[float]) – A discount factor for future rewards. For multi-dim reward, this can also be a list of discounts, each discount applies to a reward dim.

  • td_error_loss_fn (Callable) – A function for computing the TD errors loss. This function takes as input the target and the estimated Q values and returns the loss for each element of the batch.

  • use_gae (bool) – If True, uses generalized advantage estimation for computing per-timestep advantage. Else, just subtracts value predictions from empirical return.

  • use_td_lambda_return (bool) – Only effective if use_gae is True. If True, uses td_lambda_return for training value function. (td_lambda_return = gae_advantage + value_predictions).

  • td_lambda (float) – Lambda parameter for TD-lambda computation.

  • normalize_advantages (bool) – If True, normalize advantage to zero mean and unit variance within batch for calculating policy gradient. This is commonly used for PPO.

  • advantage_clip (float) – If set, clip advantages to \([-x, x]\)

  • entropy_regularization (float) – Coefficient for entropy regularization loss term.

  • td_loss_weight (float) – the weight for the loss of td error.
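To make the loss formula and the relation td_lambda_return = gae_advantage + value_predictions more concrete, here is a minimal plain-Python sketch. It uses the textbook form of generalized advantage estimation on a single trajectory with an explicit bootstrap value, ignores episode boundaries and per-step discounts, and is not ALF's implementation; the function names gae_advantages and actor_critic_loss are hypothetical.

```python
def gae_advantages(rewards, values, bootstrap_value,
                   gamma=0.99, td_lambda=0.95):
    """Textbook generalized advantage estimation for one trajectory.

    Assumptions of this sketch: rewards[t] is the reward received after
    acting at step t, values[t] is the value prediction at step t, and
    bootstrap_value is the prediction for the state after the last step.
    """
    next_values = list(values[1:]) + [bootstrap_value]
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error at step t.
        delta = rewards[t] + gamma * next_values[t] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * td_lambda * running
        advantages[t] = running
    return advantages


def actor_critic_loss(pg_loss, td_loss, entropy,
                      td_loss_weight=1.0, entropy_regularization=None):
    """Combine the three terms as in the documented formula."""
    loss = pg_loss + td_loss_weight * td_loss
    if entropy_regularization is not None:
        loss -= entropy_regularization * entropy
    return loss


rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.4, 0.3]
adv = gae_advantages(rewards, values, bootstrap_value=0.0,
                     gamma=1.0, td_lambda=1.0)
# td_lambda_return = gae_advantage + value_predictions
td_lambda_returns = [a + v for a, v in zip(adv, values)]
```

With gamma and td_lambda both set to 1 and a zero bootstrap value, the TD-lambda return reduces to the undiscounted sum of future rewards, which makes the result easy to check by hand.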

calc_loss(info)[source]#

Calculate the loss at each step for each sample.

Parameters

info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).

Returns

loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).

Return type

LossInfo

forward(info)[source]#

Calculate actor critic loss. The first dimension of all the tensors is the time dimension and the second dimension is the batch dimension.

Parameters

info (namedtuple) – information for calculating loss. All tensors are time-major. It should contain the following fields:

  • reward

  • step_type

  • discount

  • action

  • action_distribution

  • value

Returns

a LossInfo whose extra field is an ActorCriticLossInfo.

Return type

LossInfo

property gamma#
training: bool#
class ActorCriticLossInfo(pg_loss, td_loss, neg_entropy)#

Bases: tuple

Create new instance of ActorCriticLossInfo(pg_loss, td_loss, neg_entropy)

neg_entropy#

Alias for field number 2

pg_loss#

Alias for field number 0

td_loss#

Alias for field number 1

alf.algorithms.agent#

Agent for integrating multiple algorithms.

class Agent(observation_spec, action_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), env=None, config=None, rl_algorithm_cls=<class 'alf.algorithms.actor_critic_algorithm.ActorCriticAlgorithm'>, reward_weight_algorithm_cls=None, representation_learner_cls=None, representation_use_rl_state=False, goal_generator=None, intrinsic_reward_module=None, intrinsic_reward_coef=1.0, extrinsic_reward_coef=1.0, enforce_entropy_target=False, entropy_target_cls=None, optimizer=None, debug_summaries=False, name='AgentAlgorithm')[source]#

Bases: alf.algorithms.rl_algorithm.RLAlgorithm

Agent is a master algorithm that integrates different algorithms together.

Parameters
  • observation_spec (nested TensorSpec) – representing the observations.

  • action_spec (nested BoundedTensorSpec) – representing the actions.

  • reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).

  • env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. Running multiple environments in parallel is crucial to on-policy algorithms as it increases the diversity of data and decreases temporal correlation. env only needs to be provided to the root Algorithm.

  • config (TrainerConfig) – config for training. config only needs to be provided to the algorithm which performs train_iter() by itself.

  • rl_algorithm_cls (type) – The algorithm class for learning the policy. It will be called as rl_algorithm_cls(observation_spec=?, action_spec=?, reward_spec=?, config=?, debug_summaries=?).

  • reward_weight_algorithm_cls (type) – The algorithm class for adjusting reward weights when multi-dim rewards are used. If provided, the default reward_weights of rl_algorithm will be overwritten by this algorithm.

  • representation_learner_cls (type) – The algorithm class for learning the representation. If provided, the constructed learner will calculate the representation from the original observation as the observation for downstream algorithms such as rl_algorithm. Similar to rl_algorithm_cls, it will be called as representation_learner_cls(observation_spec=?, action_spec=?, reward_spec=?, config=?, debug_summaries=?).

  • representation_use_rl_state (bool) – When set to True, the representation learner will receive the (previous) state from the RL algorithm as input instead of its own state for rollout_step() and predict_step(). This is particularly useful for algorithms such as the MuZero representation learner, whose reanalyze component requires access to the RL algorithm’s state.

  • intrinsic_reward_module (Algorithm) – an algorithm whose output is a scalar intrinsic reward.

  • goal_generator (Algorithm) – an algorithm which outputs a tuple of goal vector and a reward. The reward can be () if no reward is given.

  • intrinsic_reward_coef (float) – Coefficient for intrinsic reward.

  • extrinsic_reward_coef (float) – Coefficient for extrinsic reward.

  • enforce_entropy_target (bool) – If True, use (Nested)EntropyTargetAlgorithm to dynamically adjust entropy regularization so that entropy is not smaller than the entropy_target supplied for constructing (Nested)EntropyTargetAlgorithm. If this is enabled, make sure you don’t use entropy_regularization for loss (see ActorCriticLoss or PPOLoss). In order to use this, the AlgStep.info from rl_algorithm_cls.train_step() and rl_algorithm_cls.rollout_step() needs to contain action_distribution.

  • entropy_target_cls (type) – If provided, will be used to dynamically adjust entropy regularization.

  • optimizer (optimizer) – The optimizer for training.

  • debug_summaries (bool) – True if debug summaries should be created.

  • name (str) – Name of this algorithm.

after_train_iter(experience, info)[source]#

Call after_train_iter() of the RL algorithm and goal generator, respectively.

after_update(experience, train_info)[source]#

Call after_update() of the RL algorithm and goal generator, respectively.

calc_loss(info)[source]#

Calculate loss.

calc_loss_offline(info, pre_train)[source]#

Calculate loss for the offline RL branch.

predict_step(time_step, state)[source]#

Predict for one step.

preprocess_experience(root_inputs, rollout_info, batch_info)[source]#

Add intrinsic rewards to extrinsic rewards if there is an intrinsic reward module. Also call preprocess_experience() of the rl algorithm.

rollout_step(time_step, state)[source]#

Rollout for one step.

set_path(path)[source]#

Set the path from the root algorithm to this algorithm.

See AlgorithmInterface.path for description about path. This function is called by the trainer before training starts. It needs to be implemented if the algorithm contains some other sub-algorithms.

If an algorithm does not have any sub-algorithm or its sub-algorithm does not need to access the root replay buffer directly, it does not implement this function.

summarize_rollout(experience)[source]#

First call RLAlgorithm.summarize_rollout() to summarize basic rollout statistics. If the rl algorithm has overridden this function, then also call its customized version.

train_step(time_step, state, rollout_info)[source]#

Perform one step of training computation.

It is called to calculate output for every time step for a batch of experience from replay buffer. It also needs to generate necessary information for calc_loss().

Parameters
  • inputs (nested Tensor) – inputs for train.

  • state (nested Tensor) – consistent with train_state_spec.

  • rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match train_state_spec.

  • info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.

Return type

AlgStep

train_step_offline(time_step, state, rollout_info, pre_train)[source]#

Perform one step of offline training computation.

It is called to calculate output for every time step for a batch of experience from offline replay buffer. It also needs to generate necessary information for calc_loss_offline(). By default, this function calls train_step as its default implementation.

Parameters
  • inputs (nested Tensor) – inputs for train.

  • state (nested Tensor) – consistent with train_state_spec.

  • rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.

  • pre_train (bool) – whether in pre_training phase. This flag can be used for algorithms that need to implement different training procedures at different phases.

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match train_state_spec.

  • info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.

Return type

AlgStep

training: bool#
class AgentInfo(rl, irm, goal_generator, entropy_target, repr, rw, rewards)#

Bases: tuple

Create new instance of AgentInfo(rl, irm, goal_generator, entropy_target, repr, rw, rewards)

entropy_target#

Alias for field number 3

goal_generator#

Alias for field number 2

irm#

Alias for field number 1

repr#

Alias for field number 4

rewards#

Alias for field number 6

rl#

Alias for field number 0

rw#

Alias for field number 5

class AgentState(rl, irm, goal_generator, repr, rw)#

Bases: tuple

Create new instance of AgentState(rl, irm, goal_generator, repr, rw)

goal_generator#

Alias for field number 2

irm#

Alias for field number 1

repr#

Alias for field number 3

rl#

Alias for field number 0

rw#

Alias for field number 4

alf.algorithms.agent_helpers#

Some helper functions for constructing an Agent instance.

class AgentHelper(state_ctor)[source]#

Bases: object

Create three state specs given the state creator.

static accumulate_algorithm_rewards(rewards, weights, names, summary_prefix, summarize_fn)[source]#

Sum a list of rewards by their weights. Also summarize the rewards statistics given their names.

Parameters
  • rewards (list[Tensor]) – a list of rewards tensors

  • weights (list[float]) – a list of floating numbers

  • names (list[str]) – a list of reward names

  • summary_prefix (str) – a string prefix for summary

  • summarize_fn (Callable) – a summarize function that accepts a name and a reward.

Returns

A single reward after accumulation.

Return type

Tensor
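The weighted accumulation described for accumulate_algorithm_rewards can be sketched in plain Python as follows. This is a simplified stand-in, not ALF's code: it uses lists of floats instead of tensors, omits the summary_prefix argument, and the function name accumulate_rewards is hypothetical.

```python
def accumulate_rewards(rewards, weights, names, summarize_fn=None):
    """Return the element-wise weighted sum of several reward lists.

    Each entry of ``rewards`` is a list of per-sample rewards; all
    entries are assumed to have the same length.
    """
    if not (len(rewards) == len(weights) == len(names)):
        raise ValueError("rewards, weights and names must have equal length")
    total = [0.0] * len(rewards[0])
    for reward, weight, name in zip(rewards, weights, names):
        if summarize_fn is not None:
            # Let the caller record statistics for this named reward.
            summarize_fn(name, reward)
        total = [acc + weight * r for acc, r in zip(total, reward)]
    return total


seen = []
combined = accumulate_rewards(
    rewards=[[1.0, 2.0], [0.5, 0.5]],
    weights=[1.0, 2.0],
    names=["extrinsic", "intrinsic"],
    summarize_fn=lambda name, reward: seen.append(name))
```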

accumulate_loss_info(algorithms, train_info, offline=False, pre_train=False)[source]#

Given an overall Agent training info that contains various training infos for different algorithms, compute the accumulated loss info for updating parameters.

Parameters
  • algorithms (list[Algorithm]) – the list of algorithms whose loss infos are to be accumulated.

  • experience (Experience) – experience used for gradient update.

  • train_info (nested Tensor) – information collected for training algorithms. It is batched from each AlgStep.info returned by train_step() or rollout_step().

  • offline (bool) – whether the accumulation is done for offline RL part or the online RL part.

  • pre_train (bool) – whether in pre_training phase. This flag can be used for algorithms that need to implement different training procedures at different phases.

Returns

the accumulated loss info.

Return type

LossInfo

after_train_iter(algorithms, root_inputs, rollout_info=None)[source]#

For each provided algorithm, call its after_train_iter() to do things after the agent finishes one training iteration (i.e., train_iter()).

Parameters
  • algorithms (list[Algorithm]) – the list of algorithms whose after_train_iter is to be called.

  • root_inputs (TimeStep) – experience collected from rollout_step().

  • rollout_info (AgentInfo) – information collected for training algorithms. It is batched from each AlgStep.info returned by rollout_step().

after_update(algorithms, root_inputs, train_info)[source]#

For each provided algorithm, call its after_update() to do things after the agent completes one gradient update (i.e. update_with_gradient()).

Parameters
  • algorithms (list[Algorithm]) – the list of algorithms whose after_update is to be called.

  • root_inputs (TimeStep) – experience used for the gradient update.

  • train_info (AgentInfo) – information collected for training algorithms. It is batched from each AlgStep.info returned by train_step() or rollout_step().

register_algorithm(alg, alg_field)[source]#

Collect state specs from algorithms. For code conciseness, we collect all three state specs even though some of them will not be used during unroll or train.

This function also registers alg with alg_field.

Parameters
  • alg (Algorithm) – a child algorithm in the agent.

  • alg_field (str) – the corresponding algorithm field in an AgentState or AgentInfo.

set_path(path)[source]#

Set the path for the sub-algorithms.

state_specs()[source]#

Return the state specs collected from child algorithms.

alf.algorithms.algorithm#

Algorithm base class.

class Algorithm(train_state_spec=(), rollout_state_spec=None, predict_state_spec=None, is_on_policy=None, optimizer=None, checkpoint=None, config=None, debug_summaries=False, name='Algorithm')[source]#

Bases: alf.algorithms.algorithm_interface.AlgorithmInterface

Base implementation for AlgorithmInterface.

Each algorithm can have a default optimizer. By default, the parameters and/or modules under an algorithm are optimized by the default optimizer. One can also specify an optimizer for a set of parameters and/or modules using add_optimizer. You can find out which parameter is handled by which optimizer using get_optimizer_info().

A requirement for this optimizer structure to work is that there is no algorithm which is a submodule of a non-algorithm module. Currently, this is not checked by the framework. It’s up to the user to make sure this is true.

Parameters
  • train_state_spec (nested TensorSpec) – for the network state of train_step().

  • rollout_state_spec (nested TensorSpec) – for the network state of rollout_step(). If None, it’s assumed to be the same as train_state_spec.

  • predict_state_spec (nested TensorSpec) – for the network state of predict_step(). If None, it’s assumed to be the same as rollout_state_spec.

  • is_on_policy (None|bool) –

  • optimizer (None|Optimizer) – The default optimizer for training. See comments above for detail.

  • checkpoint (None|str) – a string in the format of “prefix@path”, where:

      - “prefix” is the prefix to the contents in the checkpoint to be loaded. It can be a multi-step path denoted by “A.B.C”. If the checkpoint comes from a previous ALF training session, the standard prefix starts with “alg” (e.g. “alg._sub_alg1”). If prefix is omitted, the effect is the same as providing “alg”, which will load the full ‘alg’ part of the checkpoint.

      - “path” is the full path to the checkpoint file saved by ALF, e.g. “/path_to_experiment/train/algorithm/ckpt-100”.

    Therefore, an example value for checkpoint is “alg._sub_alg1@/path_to_experiment/train/algorithm/ckpt-100”.

  • config (TrainerConfig) – config for training. config only needs to be provided to the algorithm which performs a training iteration by itself.

  • debug_summaries (bool) – True if debug summaries should be created.

  • name (str) – name of this algorithm.
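The “prefix@path” convention for the checkpoint argument can be illustrated with a small parsing sketch. The helper split_checkpoint_spec is hypothetical and is not part of ALF; it assumes the string is split at the first “@” and that a missing prefix falls back to “alg”, as the description above states.

```python
def split_checkpoint_spec(checkpoint, default_prefix="alg"):
    """Split a 'prefix@path' string into (prefix, path).

    Hypothetical helper: splits at the first '@' and falls back to
    ``default_prefix`` when the prefix part is missing or empty.
    """
    if "@" in checkpoint:
        prefix, path = checkpoint.split("@", 1)
        prefix = prefix or default_prefix
    else:
        prefix, path = default_prefix, checkpoint
    return prefix, path


prefix, path = split_checkpoint_spec(
    "alg._sub_alg1@/path_to_experiment/train/algorithm/ckpt-100")
# A multi-step prefix such as "A.B.C" can be split into its steps.
steps = prefix.split(".")
```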

activate_ddp(rank)[source]#

Prepare the Algorithm with DistributedDataParallel wrapper

Note that Algorithm does not need to remember the rank of the device.

Parameters

rank (int) – DDP wrapper needs to know on which GPU device this module’s parameters and buffers are supposed to be.

add_optimizer(optimizer, modules_and_params)[source]#

Add an optimizer.

Note that the modules and params contained in modules_and_params should still be the attributes of the algorithm (i.e., they can be retrieved in self.children() or self.parameters()).

Parameters
  • optimizer (Optimizer) – optimizer

  • modules_and_params (list of Module or Parameter) – The modules and parameters to be optimized by optimizer.

calc_loss(info)[source]#

Calculate the loss at each step for each sample.

Parameters

info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).

Returns

loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).

Return type

LossInfo

calc_loss_offline(info_offline, pre_train=False)[source]#

Calculate the hybrid loss at each step for each sample. By default, this function calls calc_loss as its default implementation.

Parameters
  • info_offline (nest) – information collected for training from the offline training branch. It is returned by train_step_offline() (hybrid off-policy training).

  • pre_train (bool) – whether in pre_training phase. This flag can be used for algorithms that need to implement different training procedures at different phases.

Returns

loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).

Return type

LossInfo

compute_paras_statistics()[source]#

Compute some simple statistics of the algorithm’s parameters.

This function uses L1, L2, mean, std as the statistics.

Returns

a dict of 1D numpy arrays, each containing simple parameter statistics, which can be used as a proxy for checking the consistency between two parameter sets. The keys are parameter names of the module.

Return type

Dict[np.ndarray]
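As a hedged illustration of the four statistics named above (L1, L2, mean, std), the sketch below computes them for flat lists of floats keyed by parameter name. It uses plain Python instead of numpy, the helper name param_statistics is hypothetical, and the use of the population standard deviation is an assumption of this sketch rather than something the documentation specifies.

```python
import math


def param_statistics(values):
    """Return [L1 norm, L2 norm, mean, std] for a flat list of floats."""
    n = len(values)
    l1 = sum(abs(v) for v in values)
    l2 = math.sqrt(sum(v * v for v in values))
    mean = sum(values) / n
    # Population standard deviation (an assumption of this sketch).
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [l1, l2, mean, std]


# Parameter names map to flattened parameter values.
params = {"layer.weight": [3.0, -4.0], "layer.bias": [1.0, 1.0]}
stats = {name: param_statistics(v) for name, v in params.items()}
```

Comparing such dictionaries from two models is a cheap way to spot parameter sets that differ, although equal statistics do not prove the parameters are identical.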

convert_train_state_to_predict_state(state)[source]#

Convert RNN state for train_step() to RNN state for predict_step().

property default_optimizer#

Get the default optimizer for this algorithm.

property experience_spec#

Spec for experience.

property force_params_visible_to_parent: bool#

Whether the already optimizer-handled parameters are seen by the parent algorithm.

Normally, when the parameters of this algorithm are handled by its optimizer, _setup_optimizers_ will prevent the parent algorithm’s optimizer from seeing and, more importantly, handling them. Setting this value to True will force the parameters to be seen and handled by the parent algorithm, even if they are already handled by this algorithm.

Note that parameters ignored by _trainable_attributes_to_ignore() will stay invisible to the parent algorithm.

It is by default False, and can be changed with the following setter.

Return type

bool

forward(*input)[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

get_initial_predict_state(batch_size)[source]#
get_initial_rollout_state(batch_size)[source]#
get_initial_train_state(batch_size)[source]#
get_initial_transform_state(batch_size)[source]#
get_optimizer_info()[source]#

Return the optimizer info for all the modules in a string.

TODO: for a subalgorithm that’s an ignored attribute, its optimizer info won’t be obtained.

Returns

the json string of the information about all the optimizers.

Return type

str

get_param_name(param)[source]#

Get the name of the parameter.

Returns

the name if the parameter can be found; otherwise None.

Return type

string

get_unoptimized_parameter_info()[source]#

Return the information about the parameters not being optimized.

Note: this differs from the parameters listed under the optimizer ‘None’ in get_optimizer_info() in that get_optimizer_info() does not traverse all the parameters (e.g., parameters in list, tuple, dict, or set).

Returns

path of all parameters not being optimized

Return type

str

property has_offline#

Whether offline data is available for RL algorithms. Always returns False for non-RL algorithms.

is_rl()[source]#

Always returns False for non-RL algorithms.

load_state_dict(state_dict, strict=True, skip_preloded=True)[source]#

Load state dictionary for the algorithm.

Parameters
  • state_dict (dict) – a dict containing parameters and persistent buffers.

  • strict (bool, optional) – whether to strictly enforce that the keys in state_dict match the keys returned by this module’s torch.nn.Module.state_dict function. If strict=True, will keep lists of missing and unexpected keys; if strict=False, missing/unexpected keys will be omitted. (Default: True)

  • skip_preloded (bool) – whether to skip the modules that support pre-loading and have been pre-loaded. Currently only Algorithm and its derivatives support pre-loading. (Default: True)

Returns

  • missing_keys: a list of str containing the missing keys.

  • unexpected_keys: a list of str containing the unexpected keys.

Return type

namedtuple
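The meaning of the two returned lists can be sketched without PyTorch as a comparison between the keys a module expects and the keys present in a given state dict. The names IncompatibleKeys and compare_keys are hypothetical stand-ins for illustration; this sketch does not reproduce the strict or skip_preloded handling.

```python
from collections import namedtuple

# Hypothetical stand-in for the namedtuple returned by load_state_dict.
IncompatibleKeys = namedtuple(
    "IncompatibleKeys", ["missing_keys", "unexpected_keys"])


def compare_keys(expected_keys, state_dict):
    """Report keys missing from, and unexpected in, ``state_dict``."""
    expected = list(expected_keys)
    expected_set = set(expected)
    # Keys the module expects but the state dict does not provide.
    missing = [k for k in expected if k not in state_dict]
    # Keys the state dict provides but the module does not expect.
    unexpected = [k for k in state_dict if k not in expected_set]
    return IncompatibleKeys(missing, unexpected)


result = compare_keys(
    expected_keys=["actor.weight", "actor.bias", "value.weight"],
    state_dict={"actor.weight": 1, "value.weight": 2, "old.param": 3})
```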

property name#

The name of this algorithm.

need_full_rollout_state()[source]#

Whether AlgStep.state from rollout_step should be full.

If True, it means that rollout_step() should return the complete state for train_step().

observe_for_metrics(time_step)[source]#

Observe a time step for recording environment metrics.

Parameters

time_step (TimeStep) – the current time step during unroll().

observe_for_replay(exp)[source]#

Record an experience in a replay buffer.

Parameters

exp (nested Tensor) – The shape is \([B, \ldots]\), where \(B\) is the batch size of the batched environment.

property on_policy#

Whether training is on-policy.

For on-policy training, train_step() will not be called. And info passed to calc_loss() is info collected from rollout_step().

For off-policy training, train_step() will be called with the experience from replay buffer. And info passed to calc_loss() is info collected from train_step.

An algorithm can override this to indicate whether it is an on-policy or off-policy algorithm. If an algorithm does not override this, it needs to support both on-policy and off-policy training, which means that rollout_step() and train_step() need to have the correct behavior for on-policy and off-policy training. It can check whether it is on-policy training by calling this function.

Returns

True if on-policy training, False if off-policy training, None if not set.

Return type

bool | None

optimizers(recurse=True, include_ignored_attributes=False)[source]#

Get all the optimizers used by this algorithm.

Parameters
  • recurse (bool) – If True, including all the sub-algorithms

  • include_ignored_attributes (bool) – If True, still include all child attributes without ignoring any.

Returns

list of Optimizer objects.

Return type

list

property path#

Path from the root algorithm to this algorithm.

Currently, path is useful when an algorithm needs to directly access the data about itself in the replay buffer. Two types of data about an algorithm are stored in the replay buffer: one is rollout_info, which is AlgStep.info returned by rollout_step(); the other is state, which is the state argument used to call rollout_step(). The data in the replay buffer is organized as Experience, which includes rollout_info and state.

Given an experience structure, the input state to rollout_step() can be retrieved by:

nest.get_field(experience.state, self.path)

The info from rollout_step() can be retrieved by:

nest.get_field(experience.rollout_info, self.path)
Returns

path from the root algorithm to this algorithm

Return type

str
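The lookups by path shown above can be mimicked with a small stand-in. The sketch below assumes that a path is a dot-separated chain of field names and that an empty path refers to the root; the get_field function here is a hypothetical re-implementation over named tuples, not ALF's nest.get_field, and MiniAgentInfo and MiniExperience are reduced stand-ins for the real structures.

```python
from collections import namedtuple


def get_field(nested, path):
    """Follow a dot-separated path of field names.

    Hypothetical stand-in: an empty path returns ``nested`` itself.
    """
    value = nested
    if path:
        for name in path.split("."):
            value = getattr(value, name)
    return value


# Reduced stand-ins for the real info and experience structures.
MiniAgentInfo = namedtuple("MiniAgentInfo", ["rl", "irm"])
MiniExperience = namedtuple("MiniExperience", ["rollout_info", "state"])

exp = MiniExperience(
    rollout_info=MiniAgentInfo(rl="rl_info", irm="irm_info"),
    state=MiniAgentInfo(rl="rl_state", irm=()))

# A sub-algorithm registered under the field "rl" would use path "rl".
rl_rollout_info = get_field(exp.rollout_info, "rl")
rl_state = get_field(exp.state, "rl")
```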

property pre_loaded#

A property indicating whether a checkpoint for the current instance has been pre-loaded, by specifying checkpoint_prefix@checkpoint_path where checkpoint_prefix@ is optional.

property predict_state_spec#

Returns the RNN state spec for predict_step().

property processed_experience_spec#

Spec for processed experience.

Returns

Spec for the experience returned by preprocess_experience().

Return type

TensorSpec

property rollout_state_spec#

Returns the RNN state spec for rollout_step().

set_on_policy(is_on_policy)[source]#

Set whether this algorithm is on-policy or not.

Parameters

is_on_policy (bool) –

set_path(path)[source]#

Set the path from the root algorithm to this algorithm.

See AlgorithmInterface.path for description about path. This function is called by the trainer before training starts. It needs to be implemented if the algorithm contains some other sub-algorithms.

If an algorithm does not have any sub-algorithm or its sub-algorithm does not need to access the root replay buffer directly, it does not implement this function.

set_replay_buffer(num_envs, max_length, prioritized_sampling=False)[source]#

Set the parameters for the replay buffer.

Parameters
  • num_envs (int) – the total number of environments from all batched environments.

  • max_length (int) – the maximum number of steps the replay buffer stores for each environment.

  • prioritized_sampling (bool) – Use prioritized sampling if this is True.

state_dict(destination=None, prefix='', visited=None)[source]#

Get state dictionary recursively, including both model state and optimizers’ state (if any). It can handle a number of special cases:

  • graph with cycle: save all the states and avoid infinite loop

  • parameter sharing: save only one copy of the shared module/param

  • optimizers: save the optimizers for all the (sub-)algorithms

Parameters
  • destination (OrderedDict) – the destination for storing the state.

  • prefix (str) – a string to be added before the name of the items (modules, params, algorithms etc) as the key used in the state dictionary.

  • visited (set) – a set keeping track of the visited objects.

Returns

the dictionary including both model state and optimizers’ state (if any).

Return type

OrderedDict
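The role of the visited argument in handling cycles and shared modules can be sketched with a toy object graph. This is an illustrative simplification, not ALF's implementation: Node and collect_state are hypothetical, a plain dict replaces OrderedDict, and optimizer state is left out.

```python
class Node:
    """Toy module with named parameters and named children."""

    def __init__(self, params=None):
        self.params = dict(params or {})
        self.children = {}


def collect_state(node, destination=None, prefix="", visited=None):
    """Recursively gather parameters, visiting each object only once."""
    if destination is None:
        destination = {}
    if visited is None:
        visited = set()
    if id(node) in visited:
        # Already saved through another route: covers both cycles
        # and shared modules.
        return destination
    visited.add(id(node))
    for key, value in node.params.items():
        destination[prefix + key] = value
    for child_name, child in node.children.items():
        collect_state(child, destination, prefix + child_name + ".", visited)
    return destination


root = Node({"w": 1})
shared = Node({"b": 2})
root.children["first"] = shared
root.children["second"] = shared   # parameter sharing
shared.children["back"] = root     # cycle back to the root
state = collect_state(root)
```

The shared node is stored once, under the first name it was reached by, and the cycle back to the root does not cause infinite recursion.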

summarize_train(experience, train_info, loss_info, params)[source]#

Generate summaries for training & loss info after each gradient update. The default implementation of this function only summarizes params (with grads) and the loss. An algorithm can override this for additional summaries. See RLAlgorithm.summarize_train() for an example.

Parameters
  • experience (nested Tensor) – samples used for the most recent update_with_gradient(). By default it’s not summarized.

  • train_info (nested Tensor) – AlgStep.info returned by either rollout_step() (on-policy training) or train_step() (off-policy training). By default it’s not summarized.

  • loss_info (LossInfo) – loss

  • params (list[Parameter]|None) – list of parameters with gradients

train_from_replay_buffer(**kwargs)#

This function can be called by any algorithm that has its own replay buffer configured.

Parameters

update_global_counter (bool) – controls whether this function changes the global counter for summary. If there are multiple algorithms, then only the parent algorithm should change this quantity and child algorithms should disable the flag. When it’s True, it will affect the counter only if config.update_counter_every_mini_batch=True.

train_from_unroll(experience, train_info)[source]#

Train given the info collected from unroll(). This function can be called by any child algorithm that doesn’t have the unroll logic but has a different training logic from its parent (e.g., off-policy).

Parameters
  • experience (Experience) – collected during unroll().

  • train_info (nest) – AlgStep.info returned by rollout_step().

Returns

number of steps that have been trained

Return type

int

property train_info_spec#

The spec for the AlgStep.info returned from train_step().

property train_state_spec#

Returns the RNN state spec for train_step().

train_step_offline(inputs, state, rollout_info, pre_train=False)[source]#

Perform one step of offline training computation.

It is called to calculate the output for every time step of a batch of experience from the offline replay buffer. It also needs to generate the information necessary for calc_loss_offline(). By default, this function calls train_step().

Parameters
  • inputs (nested Tensor) – inputs for train.

  • state (nested Tensor) – consistent with train_state_spec.

  • rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.

  • pre_train (bool) – whether in pre_training phase. This flag can be used for algorithms that need to implement different training procedures at different phases.

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match train_state_spec.

  • info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.

Return type

AlgStep

training: bool#

transform_experience(experience)[source]#

Transform an Experience structure.

This is used on the experience data retrieved from replay buffer.

Parameters

experience (Experience) – the experience retrieved from replay buffer. Note that experience.batch_info, experience.replay_buffer need to be set.

Returns

transformed experience

Return type

Experience

transform_timestep(time_step, state)[source]#

Transform time_step.

transform_timestep is called on every raw time_step obtained from the environment before it is passed to predict_step and rollout_step. For off-policy algorithms, the replay buffer stores raw time_step, so when experiences are retrieved from the replay buffer, they are transformed by transform_timestep in OffPolicyAlgorithm before being passed to _update().

The transformation should be stateless. By default, only observation is transformed.

Parameters
  • time_step (TimeStep or Experience) – time step

  • state (nested Tensor) – state of the transformer(s)

Returns

transformed time step

Return type

TimeStep or Experience
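
A stateless transformation that touches only the observation might look like the sketch below. The reduced TimeStep tuple, the scale_observation() name, and the division by 255 are all assumptions made for illustration; ALF's own TimeStep has more fields and its transformers are configured elsewhere.

```python
from collections import namedtuple

# Hypothetical, reduced TimeStep with only two fields.
TimeStep = namedtuple('TimeStep', ['observation', 'reward'])


def scale_observation(time_step, state, scale=255.0):
    """Return a copy of `time_step` whose observation is divided by `scale`.

    `state` is accepted to mirror the signature above but is not used,
    because the transformation is stateless.
    """
    obs = [x / scale for x in time_step.observation]
    return time_step._replace(observation=obs)


raw = TimeStep(observation=[0, 255], reward=1.0)
out = scale_observation(raw, state=())
print(out)
```

The raw time step is left untouched, which matches the description that the replay buffer keeps raw data and the transformation is re-applied on retrieval.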

update_with_gradient(loss_info, valid_masks=None, weight=1.0, batch_info=None)[source]#

Complete one iteration of training.

Update parameters using the gradient with respect to loss_info.

Parameters
  • loss_info (LossInfo) – loss with shape \((T, B)\) (except for loss_info.scalar_loss)

  • valid_masks (Tensor) – masks indicating which samples are valid. (shape=(T, B), dtype=torch.float32)

  • weight (float) – weight for this batch. Loss will be multiplied with this weight before calculating gradient.

  • batch_info (BatchInfo) – information about this batch returned by ReplayBuffer.get_batch()

Returns

  • loss_info (LossInfo): loss information.

  • params (list[(name, Parameter)]): list of parameters being updated.

Return type

tuple
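
How valid_masks and weight could enter the scalar loss is sketched below in plain Python. This is an assumption for illustration (a masked mean followed by scaling); the reduction actually used by update_with_gradient() is not specified in this section and may differ.

```python
def masked_weighted_loss(loss, valid_masks=None, weight=1.0):
    """Average a (T, B) loss over valid entries, then scale by `weight`.

    `loss` and `valid_masks` are nested lists of shape (T, B).
    """
    total, count = 0.0, 0.0
    for t, row in enumerate(loss):
        for b, value in enumerate(row):
            m = 1.0 if valid_masks is None else valid_masks[t][b]
            total += m * value
            count += m
    mean = total / count if count > 0 else 0.0
    return weight * mean


loss = [[1.0, 2.0], [3.0, 4.0]]    # T=2, B=2
masks = [[1.0, 0.0], [1.0, 1.0]]   # the entry with loss 2.0 is invalid
scalar = masked_weighted_loss(loss, masks, weight=0.5)
unmasked = masked_weighted_loss(loss)
print(scalar, unmasked)
```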

property use_rollout_state#

If True, when off-policy training, the RNN states will be taken from the replay buffer; otherwise they will be set to 0.

In the case of True, the train_state_spec of an algorithm should always be a subset of the rollout_state_spec.

class Loss(loss_weight=1.0, name='LossAlg')[source]#

Bases: alf.algorithms.algorithm.Algorithm

Algorithm that uses its input as loss.

It can be subclassed to customize calc_loss().

Each algorithm can have a default optimizer. By default, the parameters and/or modules under an algorithm are optimized by the default optimizer. One can also specify an optimizer for a set of parameters and/or modules using add_optimizer. You can find out which parameter is handled by which optimizer using get_optimizer_info().

A requirement for this optimizer structure to work is that there is no algorithm which is a submodule of a non-algorithm module. Currently, this is not checked by the framework. It’s up to the user to make sure this is true.

Parameters
  • train_state_spec (nested TensorSpec) – for the network state of train_step().

  • rollout_state_spec (nested TensorSpec) – for the network state of rollout_step(). If None, it’s assumed to be the same as train_state_spec.

  • predict_state_spec (nested TensorSpec) – for the network state of predict_step(). If None, it’s assumed to be the same as rollout_state_spec.

  • is_on_policy (None|bool) –

  • optimizer (None|Optimizer) – The default optimizer for training. See comments above for detail.

  • checkpoint (None|str) – a string in the format of “prefix@path”, where “prefix” is the prefix to the contents in the checkpoint to be loaded and “path” is the full path to the checkpoint file saved by ALF, e.g. “/path_to_experiment/train/algorithm/ckpt-100”. The prefix can be a multi-step path denoted by “A.B.C”. If the checkpoint comes from a previous ALF training session, the standard prefix starts with “alg” (e.g. “alg._sub_alg1”). If the prefix is omitted, the effect is the same as providing “alg”, which loads the full “alg” part of the checkpoint. An example value for checkpoint is “alg._sub_alg1@/path_to_experiment/train/algorithm/ckpt-100”.

  • config (TrainerConfig) – config for training. config only needs to be provided to the algorithm which performs a training iteration by itself.

  • debug_summaries (bool) – True if debug summaries should be created.

  • name (str) – name of this algorithm.
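
The “prefix@path” convention for checkpoint can be made concrete with a small parser. The parse_checkpoint() helper below is hypothetical and not part of ALF; in particular, treating a string without “@” as a bare path is an assumption of this sketch.

```python
def parse_checkpoint(spec):
    """Split "prefix@path" into (prefix, path); the default prefix is "alg"."""
    prefix, sep, path = spec.partition('@')
    if not sep:
        # Assumption: no '@' means the whole string is the path.
        prefix, path = '', spec
    return (prefix or 'alg'), path


full = parse_checkpoint(
    'alg._sub_alg1@/path_to_experiment/train/algorithm/ckpt-100')
no_prefix = parse_checkpoint('@/path_to_experiment/train/algorithm/ckpt-100')
print(full, no_prefix)
```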

calc_loss(info)[source]#

Calculate the loss at each step for each sample.

Parameters

info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).

Returns

loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).

Return type

LossInfo

predict_step(inputs, state=None)[source]#

Predict for one step of inputs.

Parameters
  • inputs (nested Tensor) – inputs for prediction.

  • state (nested Tensor) – network state (for RNN).

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match predict_state_spec.

  • info (nest): information for analyzing the agent. In particular, if an element of the info is alf.summary.render.Image, it will be rendered during play. See alf/summary/render.py for detail.

Return type

AlgStep

rollout_step(inputs, state=None)[source]#

Rollout for one step of inputs.

It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for calc_loss(). For off-policy training, it needs to generate necessary information for train_step().

Parameters
  • inputs (nested Tensor) – inputs for prediction.

  • state (nested Tensor) – network state (for RNN).

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match rollout_state_spec.

  • info (nested Tensor): For on-policy training, it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored into the replay buffer and later retrieved for train_step() as rollout_info.

Return type

AlgStep

train_step(inputs, state=None, rollout_info=None)[source]#

Perform one step of training computation.

It is called to calculate output for every time step for a batch of experience from replay buffer. It also needs to generate necessary information for calc_loss().

Parameters
  • inputs (nested Tensor) – inputs for train.

  • state (nested Tensor) – consistent with train_state_spec.

  • rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match train_state_spec.

  • info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.

Return type

AlgStep

training: bool#

alf.algorithms.algorithm_interface#

class AlgorithmInterface[source]#

Bases: torch.nn.modules.module.Module

The interface for algorithm.

It is a generic interface for reinforcement learning (RL) and non-RL algorithms. The key interface functions are:

  1. predict_step(): one step of computation of action for evaluation.

  2. rollout_step(): one step of computation for rollout. It is used for collecting experiences during training. Unlike predict_step, rollout_step may include additional computations for training. An algorithm can immediately use the collected experiences to update parameters after one rollout (multiple rollout steps) is performed, or it can put these collected experiences into a replay buffer.

  3. train_step(): only used by algorithms that put experiences into replay buffers. The training data are sampled from the replay buffer filled by rollout_step().

  4. train_from_unroll(): perform a training iteration from the unrolled result.

  5. train_from_replay_buffer(): perform a training iteration from a replay buffer.

  6. update_with_gradient(): do one gradient update based on the loss. It is used by the default train_from_unroll() and train_from_replay_buffer() implementations. You can override to implement your own update_with_gradient().

  7. calc_loss(): calculate loss based on the info collected from rollout_step() or train_step(). It is used by the default implementations of train_from_unroll() and train_from_replay_buffer(). If you want to use these two functions, you need to implement calc_loss().

  8. after_update(): called by train_iter() after every call to update_with_gradient(), mainly for some postprocessing steps such as copying a training model to a target model in SAC or DQN.

  9. after_train_iter(): called by train_iter() after every call to train_from_unroll() (on-policy training iter) or train_from_replay_buffer (off-policy training iter). It’s mainly for training additional modules that have their own training logic (e.g., on/off-policy, replay buffers, etc). Other things might also be possible as long as they should be done once every training iteration.

Algorithms with additional offline training flows can implement them using the following additional interface functions:

  10. train_step_offline(): only used by algorithms that have offline training flows. The training data are sampled from a replay buffer that is loaded from an offline replay buffer checkpoint.

  11. calc_loss_offline(): calculates the loss based on the info collected from train_step_offline().

The offline training flows can be invoked by specifying a valid path to a replay buffer for TrainerConfig.offline_buffer_dir.
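
The order in which the interface functions cooperate during one on-policy training iteration can be sketched with a toy class. ToyAlgorithm below is hypothetical and heavily simplified (no networks, gradients, or replay buffer); it is not ALF's train_iter(), only an illustration of the call sequence described in the list above.

```python
class ToyAlgorithm:
    """Hypothetical minimal algorithm that records the order of calls."""

    def __init__(self):
        self.calls = []

    def rollout_step(self, inputs, state):
        self.calls.append('rollout_step')
        # Mimics AlgStep as a plain (output, state, info) tuple.
        return inputs, state, {'loss': float(inputs)}

    def calc_loss(self, info):
        self.calls.append('calc_loss')
        return sum(step['loss'] for step in info) / len(info)

    def update_with_gradient(self, loss):
        self.calls.append('update_with_gradient')

    def after_update(self):
        self.calls.append('after_update')

    def after_train_iter(self):
        self.calls.append('after_train_iter')

    def train_iter(self, observations):
        # Unroll: one rollout_step() per environment step.
        state, infos = None, []
        for obs in observations:
            _, state, info = self.rollout_step(obs, state)
            infos.append(info)
        # Train from the unroll: loss from rollout info, one gradient update.
        loss = self.calc_loss(infos)
        self.update_with_gradient(loss)
        self.after_update()
        self.after_train_iter()
        return len(observations)


alg = ToyAlgorithm()
steps = alg.train_iter([1, 2, 3])
print(steps, alg.calls)
```

An off-policy flow would differ by storing the rollout info in a replay buffer and calling train_step() on sampled batches before calc_loss().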

Note

A non-RL algorithm will not directly interact with an environment. The interaction loop will always be driven by an RLAlgorithm that outputs actions and gets rewards. So a non-RL algorithm is always attached to an RLAlgorithm and cannot change the timing of (when to launch) a training iteration. However, it can have its own logic of a training iteration (e.g., train_from_unroll() and train_from_replay_buffer()) which can be triggered by a parent RLAlgorithm inside its after_train_iter().

Initializes internal Module state, shared by both nn.Module and ScriptModule.

after_train_iter(root_inputs, rollout_info)[source]#

Do things after completing one training iteration (i.e. train_iter() that consists of one or multiple gradient updates). This function can be used for training additional modules that have their own training logic (e.g., on/off-policy, replay buffers, etc). These modules should be added to _trainable_attributes_to_ignore in the parent algorithm.

Other things might also be possible as long as they should be done once every training iteration.

This function serves the same purpose as after_update if there is always only one gradient update in each training iteration. Otherwise it is called less frequently than after_update.

Parameters
  • root_inputs (nest|None) – temporally batched inputs for the rollout_step() of the root algorithm collected during unroll(). In the case where no data is available from the rollout_step() (e.g. in an offline pre-training phase where the online interaction is not started yet) root_inputs will be None.

  • rollout_info (nest|None) – information collected from rollout_step() for this algorithm during unroll(). In the case where no data is available from the rollout_step() (e.g. in an offline pre-training phase where the online interaction is not started yet) rollout_info will be None.

after_update(root_inputs, info)[source]#

Do things after completing one gradient update (i.e. update_with_gradient()). This function can be used for post-processing following one minibatch update, such as copying a training model to a target model in SAC, DQN, etc.

Parameters
  • root_inputs (nest) – temporally batched inputs for the rollout_step() of the root algorithm collected during unroll().

  • info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() for on-policy training or train_step() for off-policy training.

calc_loss(info)[source]#

Calculate the loss for one mini-batch.

Parameters

info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training). The shape of the tensors in info is (T, B, ...), where T is the mini-batch length and B is the mini-batch size.

Returns

loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).

Return type

LossInfo
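
The \((T, B)\) shape convention can be illustrated with nested lists. The sketch below is an assumption-laden toy: the keys 'pred' and 'target' are hypothetical, the loss is a plain squared error, and a bare nested list is returned instead of a LossInfo.

```python
def calc_loss(info):
    """Return a (T, B) nested list of squared errors.

    `info` is a dict of time-major (T, B) nested lists with the hypothetical
    keys 'pred' and 'target'.
    """
    return [[(p - t) ** 2 for p, t in zip(pred_row, target_row)]
            for pred_row, target_row in zip(info['pred'], info['target'])]


info = {
    'pred': [[1.0, 2.0, 3.0], [0.0, 0.0, 0.0]],     # T=2, B=3
    'target': [[1.0, 0.0, 1.0], [1.0, 0.0, 2.0]],
}
loss = calc_loss(info)
print(loss)
```

The outer index is the time step and the inner index is the sample, so no reduction over either dimension happens inside calc_loss() in this sketch.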

calc_loss_offline(info, pre_train=False)[source]#

Calculate the loss for one mini-batch.

Parameters
  • info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training). The shape of the tensors in info is (T, B, ...), where T is the mini-batch length and B is the mini-batch size.

  • pre_train (bool) – whether in pre_training phase. This flag can be used for algorithms that need to implement different training procedures at different phases.

Returns

loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).

Return type

LossInfo

property on_policy#

Whether this is on-policy training.

For on-policy training, train_step() will not be called. And info passed to calc_loss() is info collected from rollout_step().

For off-policy training, train_step() will be called with the experience from replay buffer. And info passed to calc_loss() is info collected from train_step.

An algorithm can override this to indicate whether it is an on-policy or off-policy algorithm. If an algorithm does not override this, it needs to support both on-policy and off-policy training, which means that rollout_step() and train_step() need to have the correct behavior for on-policy and off-policy training. It can check whether it is on-policy training by calling this function.

Returns

True if on-policy training, False if off-policy training,

None if not set.

Return type

bool | None

property path#

Path from the root algorithm to this algorithm.

Currently, path is useful when an algorithm needs to directly access the data about itself in the replay buffer. Two types of data about an algorithm are stored in the replay buffer: one is rollout_info, which is the AlgStep.info returned by rollout_step(); the other is state, which is the state argument used to call rollout_step(). The data in the replay buffer is organized as Experience, which includes rollout_info and state.

Given an experience structure, the input state to rollout_step() can be retrieved by:

nest.get_field(experience.state, self.path)

The info from rollout_step() can be retrieved by:

nest.get_field(experience.rollout_info, self.path)

Returns

path from the root algorithm to this algorithm

Return type

str
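
To make the role of path concrete, the sketch below re-implements a path lookup on nested namedtuples and dicts. The get_field() function here is a hypothetical stand-in written for this example and is not ALF's nest.get_field; the dot-separated path format and the field names are assumptions.

```python
from collections import namedtuple


def get_field(nested, path):
    """Follow a dot-separated `path` through namedtuples and dicts."""
    value = nested
    names = path.split('.') if path else []  # empty path: the root itself
    for name in names:
        value = value[name] if isinstance(value, dict) else getattr(value, name)
    return value


# Hypothetical nested structures mirroring an algorithm hierarchy.
AgentInfo = namedtuple('AgentInfo', ['rl', 'curiosity'])
Experience = namedtuple('Experience', ['rollout_info', 'state'])

exp = Experience(
    rollout_info=AgentInfo(rl={'value': 0.5}, curiosity=0.1),
    state=AgentInfo(rl=(), curiosity=()))

value = get_field(exp.rollout_info, 'rl.value')
print(value)
```

A sub-algorithm whose path is 'curiosity' would read its own rollout info with get_field(exp.rollout_info, 'curiosity') in this toy layout.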

predict_step(inputs, state)[source]#

Predict for one step of inputs.

Parameters
  • inputs (nested Tensor) – inputs for prediction.

  • state (nested Tensor) – network state (for RNN).

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match predict_state_spec.

  • info (nest): information for analyzing the agent. In particular, if an element of the info is alf.summary.render.Image, it will be rendered during play. See alf/summary/render.py for detail.

Return type

AlgStep

preprocess_experience(root_inputs, rollout_info, batch_info)[source]#

This function is called on the experiences obtained from a replay buffer. An example usage of this function is to calculate advantages and returns in PPOAlgorithm.

The shapes of tensors in experience are assumed to be \((B, T, ...)\).

Parameters
  • root_inputs (nest) – input for rollout_step() of the root algorithm. This is from the replay buffer. Note that this is not the same as the input of rollout_step() of self unless self is the root algorithm.

  • rollout_info (nested Tensor) – AlgStep.info from rollout_step() for this algorithm.

  • batch_info (BatchInfo) – information about this batch of data

Returns

  • processed root_inputs

  • processed rollout_info

Return type

tuple
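
As an illustration of the kind of computation preprocess_experience() is meant for, the sketch below computes discounted returns over batch-major \((B, T)\) rewards. It is a simplified assumption: it ignores episode boundaries and value bootstrapping, and it is not the procedure PPOAlgorithm uses.

```python
def discounted_returns(rewards, discount=0.99):
    """Compute returns for `rewards` of shape (B, T), scanning back in time."""
    returns = []
    for seq in rewards:
        acc, out = 0.0, []
        for r in reversed(seq):
            acc = r + discount * acc
            out.append(acc)
        returns.append(out[::-1])
    return returns


rewards = [[1.0, 1.0, 1.0], [0.0, 0.0, 2.0]]   # B=2, T=3
rets = discounted_returns(rewards, discount=0.5)
print(rets)
```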

rollout_step(inputs, state)[source]#

Rollout for one step of inputs.

It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for calc_loss(). For off-policy training, it needs to generate necessary information for train_step().

Parameters
  • inputs (nested Tensor) – inputs for prediction.

  • state (nested Tensor) – network state (for RNN).

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match rollout_state_spec.

  • info (nested Tensor): For on-policy training, it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored into the replay buffer and later retrieved for train_step() as rollout_info.

Return type

AlgStep

set_on_policy(is_on_policy)[source]#

Set whether this algorithm is on-policy or not.

Parameters

is_on_policy (bool) –

set_path(path)[source]#

Set the path from the root algorithm to this algorithm.

See AlgorithmInterface.path for description about path. This function is called by the trainer before training starts. It needs to be implemented if the algorithm contains some other sub-algorithms.

If an algorithm does not have any sub-algorithm or its sub-algorithm does not need to access the root replay buffer directly, it does not implement this function.

train_from_replay_buffer(update_global_counter=False)[source]#

This function can be called by any algorithm that has its own replay buffer configured.

Parameters

update_global_counter (bool) – controls whether this function changes the global counter for summary. If there are multiple algorithms, then only the parent algorithm should change this quantity and child algorithms should disable the flag. When it’s True, it will affect the counter only if config.update_counter_every_mini_batch=True.

train_from_unroll(experience, train_info)[source]#

Train given the info collected from unroll(). This function can be called by any child algorithm that doesn’t have the unroll logic but has a different training logic from its parent.

Parameters
  • experience (Experience) – collected during unroll().

  • train_info (nest) – AlgStep.info returned by rollout_step().

Returns

number of steps that have been trained

Return type

int

train_iter()[source]#

Perform one iteration of training.

Users may choose to implement their own train_iter().

Returns

  • number of samples being trained on (including duplicates).

Return type

int

train_step(inputs, state, rollout_info)[source]#

Perform one step of training computation.

It is called to calculate output for every time step for a batch of experience from replay buffer. It also needs to generate necessary information for calc_loss().

Parameters
  • inputs (nested Tensor) – inputs for train.

  • state (nested Tensor) – consistent with train_state_spec.

  • rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match train_state_spec.

  • info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.

Return type

AlgStep

train_step_offline(inputs, state, rollout_info, pre_train=False)[source]#

Perform one step of offline training computation.

It is called to calculate output for every time step for a batch of experience from offline replay buffer. It also needs to generate necessary information for calc_loss_offline().

Parameters
  • inputs (nested Tensor) – inputs for train.

  • state (nested Tensor) – consistent with train_state_spec.

  • rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.

  • pre_train (bool) – whether in pre_training phase. This flag can be used for algorithms that need to implement different training procedures at different phases.

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match train_state_spec.

  • info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.

Return type

AlgStep

training: bool#

alf.algorithms.async_unroller#

class AsyncUnroller(algorithm, config)[source]#

Bases: object

A helper class for unrolling asynchronously.

The asynchronous unroll is performed in a different process. The unroll results are transmitted to the main process through a Queue. The main process should call gather_unroll_results() to retrieve the unroll results. Since the unroll process has its own algorithm parameters, the main process needs to call update_parameter() periodically to update the parameters for the unroll process. Once the main process finishes, it should call close() to release the resources.

The following settings in TrainerConfig are related to the functionality of AsyncUnroller: unroll_length, async_unroll, max_unroll_length, unroll_queue_size, unroll_step_interval. See algorithms.config.py for their documentation.

TODO: redirect the log and summary to the training process. Currently, all the logs are written to a different log file and summary during rollout_step() is not enabled.

Parameters
  • algorithm – the root RL algorithm

  • unroll_queue_size – the size of the queue for transmitting the unroll results to the main process

  • root_dir – directory for saving summary and checkpoints

  • conf_file – config file name

close()[source]#

Close the unroller and release resources.

gather_unroll_results(unroll_length, max_unroll_length)[source]#

Gather the unroll results.

Parameters
  • unroll_length (int) – the desired unroll length. If it is 0, any length up to max_unroll_length is possible (including zero length) depending on how much data is in the queue.

  • max_unroll_length (int) – maximal length of unroll results. This is only used if unroll_length is 0.

Return type

List[UnrollResult]

Returns

A list of UnrollResult
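
The interplay of unroll_length and max_unroll_length can be sketched with a standard-library queue. The gather() function below is hypothetical and uses queue.Queue within one process rather than an inter-process queue; blocking until exactly unroll_length results arrive is an assumption of this sketch.

```python
import queue


def gather(q, unroll_length, max_unroll_length):
    """Collect results from `q` following the length rules described above."""
    results = []
    if unroll_length == 0:
        # Take whatever is already queued, up to max_unroll_length.
        while len(results) < max_unroll_length:
            try:
                results.append(q.get_nowait())
            except queue.Empty:
                break
    else:
        # Assumption: block until exactly unroll_length results have arrived.
        while len(results) < unroll_length:
            results.append(q.get())
    return results


q = queue.Queue()
for i in range(5):
    q.put(i)

first = gather(q, unroll_length=2, max_unroll_length=0)
rest = gather(q, unroll_length=0, max_unroll_length=10)
empty = gather(q, unroll_length=0, max_unroll_length=10)
print(first, rest, empty)
```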

get_queue_size()[source]#

Return type

int

update_parameter(algorithm)[source]#

Update the model parameters for unroll.

Parameters

algorithm (RLAlgorithm) – the root RL algorithm

class UnrollJob(type, step_metrics, global_counter, state_dict)#

Bases: tuple

Create new instance of UnrollJob(type, step_metrics, global_counter, state_dict)

global_counter#

Alias for field number 2

state_dict#

Alias for field number 3

step_metrics#

Alias for field number 1

type#

Alias for field number 0

class UnrollResult(time_step, policy_step, policy_state, env_step_time, step_time)#

Bases: tuple

Create new instance of UnrollResult(time_step, policy_step, policy_state, env_step_time, step_time)

env_step_time#

Alias for field number 3

policy_state#

Alias for field number 2

policy_step#

Alias for field number 1

step_time#

Alias for field number 4

time_step#

Alias for field number 0
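
The “Alias for field number N” entries are the docstrings Python generates for named-tuple fields, so each field can be read either by name or by position. The sketch below rebuilds a tuple with the same field names using collections.namedtuple; whether ALF uses this exact factory or its own wrapper is an assumption, and the field values are placeholders.

```python
from collections import namedtuple

# Same field names and order as documented above; values are placeholders.
UnrollResult = namedtuple(
    'UnrollResult',
    ['time_step', 'policy_step', 'policy_state', 'env_step_time', 'step_time'])

result = UnrollResult('ts', 'ps', 'state', 0.01, 0.02)
print(result.env_step_time, result[3])
```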

alf.algorithms.bc_algorithm#

Behavior Cloning (BC) Algorithm.

class BcAlgorithm(observation_spec, action_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), actor_network_cls=<class 'alf.networks.actor_networks.ActorNetwork'>, actor_optimizer=None, env=None, config=None, checkpoint=None, debug_summaries=False, epsilon_greedy=None, name='BcAlgorithm')[source]#

Bases: alf.algorithms.off_policy_algorithm.OffPolicyAlgorithm

Behavior cloning algorithm. Behavior cloning is an offline approach to learn a policy \(\pi_{\theta}(a|s)\), which is a function that maps an input observation \(s\) to an action \(a\). The parameters (\(\theta\)) of this policy are learned by using the expert actions as supervision for training, e.g., by maximizing the log-probability of the expert actions on the training data \(D\): \(\max_{\theta} \mathbb{E}_{(s,a)\sim D}\log \pi_{\theta}(a|s)\)

Reference:

Pomerleau. ALVINN: An Autonomous Land Vehicle in a Neural Network, NeurIPS 1988.

Parameters
  • observation_spec (nested TensorSpec) – representing the observations.

  • action_spec (nested BoundedTensorSpec) – representing the actions; can be a mixture of discrete and continuous actions. The number of continuous actions can be arbitrary while only one discrete action is allowed currently. If it’s a mixture, then it must be a tuple/list (discrete_action_spec, continuous_action_spec).

  • reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s). Kept for interface compatibility; not actually used in BcAlgorithm.

  • actor_network_cls (Callable) – is used to construct the actor network. The constructed actor network is a deterministic network and will be used to generate continuous actions.

  • actor_optimizer (torch.optim.optimizer) – The optimizer for actor.

  • env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. env only needs to be provided to the root algorithm.

  • config (TrainerConfig) – config for training. It only needs to be provided to the algorithm which performs train_iter() by itself.

  • checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.

  • debug_summaries (bool) – True if debug summaries should be created.

  • epsilon_greedy (float) – a floating value in [0,1], representing the chance of action sampling instead of taking argmax. This can help prevent a dead loop in some deterministic environment like Breakout. Only used for evaluation. If None, its value is taken from config.epsilon_greedy and then alf.get_config_value(TrainerConfig.epsilon_greedy).

  • name (str) – The name of this algorithm.
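
The maximum-likelihood objective stated in the class description can be evaluated numerically on a toy problem. The sketch below assumes a softmax policy over two discrete actions purely for illustration; since BcAlgorithm's actor is described as a deterministic network for continuous actions, its actual loss is likely different, and bc_loss() and both policies here are hypothetical.

```python
import math


def bc_loss(logits_fn, dataset):
    """Mean negative log-likelihood of expert actions under a softmax policy."""
    total = 0.0
    for s, a in dataset:
        logits = logits_fn(s)
        log_z = math.log(sum(math.exp(l) for l in logits))
        total += -(logits[a] - log_z)   # -log pi(a|s)
    return total / len(dataset)


def uniform_policy(s):
    return [0.0, 0.0]


def expert_like_policy(s):
    # Puts more mass on the action the expert took in state s.
    return [2.0, 0.0] if s == 0 else [0.0, 2.0]


data = [(0, 0), (1, 1)]   # (state, expert action) pairs
loss_uniform = bc_loss(uniform_policy, data)
loss_expert = bc_loss(expert_like_policy, data)
print(loss_uniform, loss_expert)
```

Minimizing this loss is equivalent to maximizing the expected log-probability of the expert actions, so the policy that agrees with the expert gets the lower value.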

calc_loss_offline(info, pre_train=False)[source]#

Calculate the hybrid loss at each step for each sample. By default, this function calls calc_loss().

Parameters
  • info (nest) – information collected for training from the offline training branch. It is returned by train_step_offline() (hybrid off-policy training).

  • pre_train (bool) – whether in pre_training phase. This flag can be used for algorithms that need to implement different training procedures at different phases.

Returns

loss at each time step for each sample in the

batch. The shapes of the tensors in loss info should be \((T, B)\).

Return type

LossInfo

predict_step(inputs, state)[source]#

Predict for one step of observation.

This is only used for evaluation, so it only needs to perform the computations for generating the action distribution.

Parameters
  • inputs (TimeStep) – Current observation and other inputs for computing action.

  • state (nested Tensor) – should be consistent with predict_state_spec

Returns

  • output (nested Tensor): should be consistent with action_spec.

  • state (nested Tensor): should be consistent with predict_state_spec.

Return type

AlgStep

train_step_offline(inputs, state, rollout_info, pre_train=False)[source]#

Perform one step of offline training computation.

It is called to calculate the output for every time step of a batch of experience from the offline replay buffer. It also needs to generate the information necessary for calc_loss_offline(). By default, this function calls train_step().

Parameters
  • inputs (nested Tensor) – inputs for train.

  • state (nested Tensor) – consistent with train_state_spec.

  • rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.

  • pre_train (bool) – whether in pre_training phase. This flag can be used for algorithms that need to implement different training procedures at different phases.

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match train_state_spec.

  • info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.

Return type

AlgStep

training: bool#

class BcInfo(actor)#

Bases: tuple

Create new instance of BcInfo(actor,)

actor#

Alias for field number 0

BcLossInfo#

alias of alf.algorithms.bc_algorithm.LossInfo

class BcState(actor)#

Bases: tuple

Create new instance of BcState(actor,)

actor#

Alias for field number 0

alf.algorithms.causal_bc_algorithm#

Causal Behavior Cloning Algorithm.

class BcInfo(actor, discriminator, target)#

Bases: tuple

Create new instance of BcInfo(actor, discriminator, target)

actor#

Alias for field number 0

discriminator#

Alias for field number 1

target#

Alias for field number 2

BcLossInfo#

alias of alf.algorithms.causal_bc_algorithm.LossInfo

class BcState(actor)#

Bases: tuple

Create new instance of BcState(actor,)

actor#

Alias for field number 0

class CausalBcAlgorithm(observation_spec, action_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), actor_network_cls=<class 'alf.networks.actor_networks.ActorNetwork'>, discriminator_network_cls=<class 'alf.networks.encoding_networks.EncodingNetwork'>, actor_optimizer=None, discriminator_optimizer=None, f_norm_penalty_weight=0.001, bc_regulatization_weight=0.05, env=None, config=None, checkpoint=None, debug_summaries=False, epsilon_greedy=None, name='CausalBcAlgorithm')[source]#

Bases: alf.algorithms.off_policy_algorithm.OffPolicyAlgorithm

Causal behavior cloning algorithm. This is an implementation of the ResiduIL algorithm proposed in the following paper:

Swamy et al. Causal Imitation Learning under Temporally Correlated Noise,
ICML 2022
Parameters
  • observation_spec (nested TensorSpec) – representing the observations.

  • action_spec (nested BoundedTensorSpec) – representing the actions; can be a mixture of discrete and continuous actions. The number of continuous actions can be arbitrary while only one discrete action is allowed currently. If it’s a mixture, then it must be a tuple/list (discrete_action_spec, continuous_action_spec).

  • reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s). Kept for interface compatibility; not actually used in CausalBcAlgorithm.

  • actor_network_cls (Callable) – is used to construct the actor network. The constructed actor network is a deterministic network and will be used to generate continuous actions.

  • discriminator_network_cls (Callable) – is used to construct the discriminator network. The discriminator is trained adversarially against the policy, to help with the learning of a robust policy. It takes the observation from the previous time step to generate the Lagrange multiplier for the current step.

  • actor_optimizer (torch.optim.Optimizer) – The optimizer for the actor.

  • discriminator_optimizer (torch.optim.Optimizer) – The optimizer for the discriminator.

  • f_norm_penalty_weight (float) – penalty weight for the output of the discriminator.

  • bc_regulatization_weight (float) – weight for the squared prediction error based regularization term.

  • env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. env only needs to be provided to the root algorithm.

  • config (TrainerConfig) – config for training. It only needs to be provided to the algorithm which performs train_iter() by itself.

  • checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.

  • debug_summaries (bool) – True if debug summaries should be created.

  • epsilon_greedy (float) – a floating value in [0,1], representing the chance of action sampling instead of taking argmax. This can help prevent a dead loop in some deterministic environment like Breakout. Only used for evaluation. If None, its value is taken from config.epsilon_greedy and then alf.get_config_value(TrainerConfig.epsilon_greedy).

  • name (str) – The name of this algorithm.
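
The checkpoint argument above is a single string of the form “prefix@path”. A minimal sketch of how such a string could be split, assuming the first "@" is the separator; split_checkpoint_spec is a hypothetical helper, the example prefix and path are made up, and ALF's own parsing may differ:

```python
def split_checkpoint_spec(checkpoint):
    """Split a "prefix@path" string into (prefix, path).

    Hypothetical helper for illustration only; not part of ALF.
    """
    prefix, sep, path = checkpoint.partition("@")
    if not sep or not path:
        raise ValueError(f'expected "prefix@path", got {checkpoint!r}')
    return prefix, path


# Illustrative values only.
prefix, path = split_checkpoint_spec("alg._sub_alg@/tmp/train/ckpt-100")
```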

calc_loss_offline(info, pre_train=False)[source]#

Calculate the hybrid loss at each step for each sample. By default, this function calls calc_loss.

Parameters
  • info_offline (nest) – information collected for training from the offline training branch. It is returned by train_step_offline() (hybrid off-policy training).

  • pre_train (bool) – whether in pre_training phase. This flag can be used for algorithms that need to implement different training procedures at different phases.

Returns

loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).

Return type

LossInfo
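
To make the \((T, B)\) layout concrete, the sketch below stores a per-step, per-sample loss as T rows of B values and reduces it to a scalar. Nested lists stand in for tensors, and the plain mean is an assumption; how ALF reduces the loss is not stated here.

```python
def mean_loss(loss_tb):
    """Average a loss laid out as T rows (time steps) of B values (samples)."""
    num_steps = len(loss_tb)
    batch_size = len(loss_tb[0])
    if any(len(row) != batch_size for row in loss_tb):
        raise ValueError("every time step must have the same batch size")
    return sum(sum(row) for row in loss_tb) / (num_steps * batch_size)


# T = 3 time steps, B = 2 samples.
loss = [[0.5, 1.5],
        [1.0, 1.0],
        [0.0, 2.0]]
scalar = mean_loss(loss)
```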

predict_step(inputs, state)[source]#

Predict for one step of observation.

This is only used for evaluation, so it only needs to perform the computations for generating the action distribution.

Parameters
  • time_step (TimeStep) – Current observation and other inputs for computing action.

  • state (nested Tensor) – should be consistent with predict_state_spec

Returns

  • output (nested Tensor): should be consistent with action_spec.

  • state (nested Tensor): should be consistent with predict_state_spec.

Return type

AlgStep
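
Since predict_step() is the evaluation path, it is where epsilon_greedy takes effect: with probability epsilon an action is sampled, otherwise the argmax is taken. A minimal sketch for a single discrete action distribution; epsilon_greedy_action is a hypothetical helper, not ALF's implementation:

```python
import random


def epsilon_greedy_action(probs, epsilon, rng=random):
    """With probability epsilon sample from probs, otherwise take the argmax.

    probs is a list of action probabilities; rng needs random() and choices().
    """
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must be in [0, 1]")
    if rng.random() < epsilon:
        # Sampling branch: draw an action index according to the distribution.
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]
    # Greedy branch: index of the most probable action.
    return max(range(len(probs)), key=probs.__getitem__)


greedy = epsilon_greedy_action([0.1, 0.7, 0.2], epsilon=0.0)
```

With epsilon set to 0 the policy is fully deterministic, which is the case where a deterministic environment can trap the agent in a loop.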

residuIL_loss(targets, predictions, pred_residuals)[source]#
train_step_offline(inputs, state, rollout_info, pre_train=False)[source]#

Perform one step of offline training computation.

It is called to calculate output for every time step for a batch of experience from offline replay buffer. It also needs to generate necessary information for calc_loss_offline(). By default, this function calls train_step.

Parameters
  • inputs (nested Tensor) – inputs for train.

  • state (nested Tensor) – consistent with train_state_spec.

  • rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.

  • pre_train (bool) – whether in pre_training phase. This flag can be used for algorithms that need to implement different training procedures at different phases.

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match train_state_spec.

  • info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.

Return type

AlgStep

training: bool#

alf.algorithms.config#

class TrainerConfig(root_dir, conf_file='', ml_type='rl', algorithm_ctor=None, data_transformer_ctor=None, random_seed=None, num_iterations=1000, num_env_steps=0, unroll_length=8, unroll_with_grad=False, async_unroll=False, max_unroll_length=0, unroll_queue_size=200, unroll_step_interval=0, unroll_parameter_update_period=10, use_rollout_state=False, temporally_independent_train_step=None, num_checkpoints=10, confirm_checkpoint_upon_crash=True, no_thread_env_for_conf=False, evaluate=False, num_evals=None, eval_interval=10, epsilon_greedy=0.0, eval_uncertainty=False, num_eval_episodes=10, num_eval_environments=1, async_eval=True, ddp_paras_check_interval=0, num_summaries=None, summary_interval=50, summarize_first_interval=True, update_counter_every_mini_batch=False, summaries_flush_secs=1, summary_max_queue=10, metric_min_buffer_size=10, debug_summaries=False, profiling=False, enable_amp=False, code_snapshots=None, summarize_grads_and_vars=False, summarize_gradient_noise_scale=False, summarize_action_distributions=False, summarize_output=False, initial_collect_steps=0, num_updates_per_train_iter=4, mini_batch_length=None, mini_batch_size=None, whole_replay_buffer_training=True, replay_buffer_length=1024, priority_replay=False, priority_replay_alpha=0.7, priority_replay_beta=0.4, priority_replay_eps=1e-06, offline_buffer_dir=None, offline_buffer_length=None, rl_train_after_update_steps=0, rl_train_every_update_steps=1, empty_cache=False, normalize_importance_weights_by_max=False, clear_replay_buffer=True)[source]#

Bases: object

Configuration for training.

Parameters
  • root_dir (str) – directory for saving summary and checkpoints

  • ml_type (str) – type of learning task, one of [‘rl’, ‘sl’]

  • algorithm_ctor (Callable) – callable that creates an OffPolicyAlgorithm or OnPolicyAlgorithm instance

  • data_transformer_ctor (Callable|list[Callable]) – Function(s) for creating data transformer(s). Each of them will be called as data_transformer_ctor(observation_spec) to create a data transformer. Available transformers are in algorithms.data_transformer. The data transformer constructed by this can be accessed as TrainerConfig.data_transformer. Important Note: HindsightExperienceTransformer, FrameStacker or any data transformer that needs to access the replay buffer for additional data needs to come before all other data transformers. The reason is the following: in off-policy training, the replay buffer stores raw input without any data transformer applied. If, say, ObservationNormalizer is applied before hindsight, then data retrieved by replay will be normalized whereas hindsight data pulled directly from the replay buffer will not be normalized. The data will then be mismatched, causing training to suffer and potentially fail.

  • random_seed (None|int) – random seed, a random seed is used if None

  • num_iterations (int) – For RL trainer, indicates number of update iterations (ignored if 0). Note that for off-policy algorithms, if initial_collect_steps>0, then the first initial_collect_steps//(unroll_length*num_envs) iterations won’t perform any training. For SL trainer, indicates the number of training epochs. If both num_iterations and num_env_steps are set, num_iterations must be big enough to consume that many environment steps. After num_env_steps environment steps are generated, the training will no longer interact with environments, which means that it will only train on the replay buffer.

  • num_env_steps (int) – number of environment steps (ignored if 0). The total number of FRAMES will be (num_env_steps*frame_skip) for calculating sample efficiency. See alf/environments/wrappers.py for the definition of FrameSkip.

  • unroll_length (float) – number of time steps each environment proceeds per iteration. The total number of time steps from all environments per iteration can be computed as: num_envs * env_batch_size * unroll_length. If unroll_length is not an integer, the actual unroll_length being used will fluctuate between floor(unroll_length) and ceil(unroll_length) and the expectation will be equal to unroll_length.

  • unroll_with_grad (bool) – a bool flag indicating whether we require grad during unroll(). This flag is only used by OffPolicyAlgorithm where unrolling with grads is usually unnecessary and turned off for saving memory. However, when there is an on-policy sub-algorithm, we can enable this flag for its training. OnPolicyAlgorithm always unrolls with grads and this flag doesn’t apply to it.

  • async_unroll (bool) – whether to unroll asynchronously. If True, unroll will be performed in parallel with training.

  • max_unroll_length (int) – the maximal length of unroll results for each iteration. If the time for one step of training is less than the time for unrolling max_unroll_length steps, the length of the unroll results will be less than max_unroll_length. Only used if async_unroll is True and unroll_length==0.

  • unroll_queue_size (int) – the size of the queue for transmitting unroll results from the unroll process to the main process. Only used if async_unroll is True. If the queue is full, the unroll process will wait for the main process to retrieve unroll results from the queue before performing more unrolls.

  • unroll_step_interval (float) – if not zero, the time interval in seconds between two consecutive environment steps. Only used if async_unroll is True. This is useful if the interaction with the environment happens in real time (e.g. real world robot or real time simulation) and you want a fixed interaction frequency with the environment. Note that this will not have any effect if the environment step and rollout step together take longer than unroll_step_interval.

  • unroll_parameter_update_period (int) – update the parameters for the asynchronous unroll every so many iterations. Only used if async_unroll is True.

  • use_rollout_state (bool) – If True, when off-policy training, the RNN states will be taken from the replay buffer; otherwise they will be set to 0. In the case of True, the train_state_spec of an algorithm should always be a subset of the rollout_state_spec.

  • temporally_independent_train_step (bool|None) – If True, the train_step is called with all the experiences in one batch instead of being called sequentially with mini_batch_length batches. Only used by OffPolicyAlgorithm. In general, this option can only be used if the algorithm has no state. For Algorithm with state (e.g. SarsaAlgorithm not using RNN), if there is no need to recompute state at train_step, this option can also be used. If None, its value is inferred based on whether the algorithm has RNN state (True if there is no RNN state, False otherwise).

  • num_checkpoints (int) – how many checkpoints to save for the training

  • confirm_checkpoint_upon_crash (bool) – whether to prompt for confirmation before saving a checkpoint after a crash.

  • no_thread_env_for_conf (bool) – not to create an unwrapped env for the purpose of showing operative configurations. If True, no ThreadEnvironment will ever be created, regardless of the value of TrainerConfig.evaluate. If False, a ThreadEnvironment will be created if TrainerConfig.evaluate or the training env is a ParallelAlfEnvironment instance. For an env that consumes lots of resources, this flag can be set to True if no evaluation is needed to save resources. The decision of creating an unwrapped env won’t affect training; it’s used to correctly display inoperative configurations in subprocesses.

  • evaluate (bool) – A bool to evaluate when training

  • num_evals (int) – how many evaluations are needed throughout the training. If not None, an automatically calculated eval_interval will replace config.eval_interval.

  • eval_interval (int) – evaluate every so many iterations

  • epsilon_greedy (float) – a floating value in [0,1], representing the chance of action sampling instead of taking argmax. This can help prevent a dead loop in some deterministic environment like Breakout. Only used for evaluation.

  • eval_uncertainty (bool) – whether to evaluate uncertainty after training.

  • num_eval_episodes (int) – number of episodes for one evaluation.

  • num_eval_environments (int) – the number of environments for evaluation.

  • async_eval (bool) – whether to do evaluation asynchronously in a different process. Note that this may use more memory.

  • ddp_paras_check_interval (int) – if >0, then every so many iterations the trainer will perform a consistency check of the model parameters across different worker processes, if multi-gpu training is used.

  • num_summaries (int) – how many summary calls are needed throughout the training. If not None, an automatically calculated summary_interval will replace config.summary_interval. Note that this number doesn’t include the summary steps of the first interval if summarize_first_interval=True. In this case, the actual number of summaries will be roughly this number plus the calculated summary interval.

  • summary_interval (int) – write summary every so many training steps

  • summarize_first_interval (bool) – whether to summarize every step of the first interval (default True). It might be better to turn this off for an easier post-processing of the curve.

  • update_counter_every_mini_batch (bool) – whether to update counter for every mini batch. The summary_interval is based on this counter. Typically, this should be False. Set to True if you want to have summary for every mini batch for the purpose of debugging. Only used by OffPolicyAlgorithm.

  • summaries_flush_secs (int) – flush summary to disk every so many seconds

  • summary_max_queue (int) – flush to disk every so many summaries

  • metric_min_buffer_size (int) – a minimal size of the buffer used to construct some average episodic metrics used in RLAlgorithm.

  • debug_summaries (bool) – A bool to gather debug summaries.

  • profiling (bool) – If True, use cProfile to profile the training. The profile result will be written to root_dir/py_train.INFO.

  • enable_amp (bool) – whether to use automatic mixed precision for training. This can make the training faster if the algorithm is GPU intensive. However, the result may be different (most likely due to random fluctuation).

  • code_snapshots (list[str]) – an optional list of code files to write to tensorboard text. Note: the code file path should be relative to “<ALF_ROOT>/alf”, e.g., “algorithms/agent.py”. This can be useful for tracking code changes when running a job.

  • summarize_grads_and_vars (bool) – If True, gradient and network variable summaries will be written during training.

  • summarize_gradient_noise_scale (bool) – whether to summarize the gradient noise scale. See alf.optimizers.utils.py for details.

  • summarize_output (bool) – If True, summarize output of certain networks.

  • initial_collect_steps (int) – if positive, the number of steps each environment takes before the first update is performed. Only used by OffPolicyAlgorithm.

  • num_updates_per_train_iter (int) – number of optimization steps for one iteration. Only used by OffPolicyAlgorithm.

  • mini_batch_size (int) – number of sequences for each minibatch. If None, it’s set to the replayer’s batch_size. Only used by OffPolicyAlgorithm.

  • mini_batch_length (int) – the length of the sequence for each sample in the minibatch. Only used by OffPolicyAlgorithm.

  • whole_replay_buffer_training (bool) – whether to use all data in the replay buffer to perform one update. Only used by OffPolicyAlgorithm.

  • clear_replay_buffer (bool) – whether to use all data in the replay buffer to perform one update and then wipe the buffer clean. Only used by OffPolicyAlgorithm.

  • replay_buffer_length (int) – the maximum number of steps the replay buffer stores for each environment. Only used by OffPolicyAlgorithm.

  • priority_replay (bool) – Use prioritized sampling if this is True.

  • priority_replay_alpha (float|Scheduler) – The priority from LossInfo is raised to this power and passed as an argument to ReplayBuffer.update_priority(). Note that the effect of ReplayBuffer.initial_priority may change with different values of priority_replay_alpha. Hence you may need to adjust ReplayBuffer.initial_priority accordingly.

  • priority_replay_beta (float|Scheduler) – weight the loss of each sample by importance_weight**(-priority_replay_beta), where importance_weight is from the BatchInfo returned by ReplayBuffer.get_batch(). This is only useful if prioritized_sampling is enabled for ReplayBuffer.

  • priority_replay_eps (float) – minimum priority for priority replay.

  • offline_buffer_dir (str|[str]) – path to the offline replay buffer checkpoint to be loaded. If a list of strings is provided, each represents the directory of one replay buffer checkpoint.

  • offline_buffer_length (int) – the maximum length that will be loaded from each replay buffer checkpoint. Therefore the total buffer length is offline_buffer_length * len(offline_buffer_dir). If None, all the samples from all the provided replay buffer checkpoints will be loaded.

  • rl_train_after_update_steps (int) – only used in the hybrid training mode. It is used as a starting criterion for the normal (non-offline) part of the RL training, which only starts after this many update steps (according to global_counter).

  • rl_train_every_update_steps (int) – only used in the hybrid training mode. It is used to control the update frequency of the normal (non-offline) part of the RL training (according to global_counter). Through this flag, we can have a more fine grained control over the update frequencies of online and offline RL training (currently assumes the training frequency of offline RL is always higher than or equal to that of the online RL part). For example, we can set rl_train_every_update_steps = 2 to have a train config that executes online RL training at half the frequency of the offline RL training.

  • empty_cache (bool) – empty GPU memory cache at the start of every iteration to reduce GPU memory usage. This option may slightly slow down the overall speed.

  • normalize_importance_weights_by_max (bool) – if True, normalize the importance weights by their maximum to prevent instability caused by large importance weights.
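
Three of the options above reduce to small pieces of arithmetic. The helpers below are a hypothetical sketch, not ALF code: the function names are made up, and the exact scheduling condition for rl_train_after_update_steps and rl_train_every_update_steps is an assumption based only on the descriptions.

```python
import math
import random


def sample_unroll_length(unroll_length, rng=random):
    """Pick floor or ceil of a fractional unroll_length so the mean matches it."""
    lower = math.floor(unroll_length)
    fraction = unroll_length - lower
    return lower + (1 if rng.random() < fraction else 0)


def priority_loss_weight(importance_weight, priority_replay_beta):
    """Per-sample loss weight: importance_weight ** (-priority_replay_beta)."""
    return importance_weight ** (-priority_replay_beta)


def should_train_online(global_counter, after_steps=0, every_steps=1):
    """Assumed schedule: start after `after_steps`, then run every `every_steps`."""
    return global_counter >= after_steps and global_counter % every_steps == 0


# every_steps=2 runs the online part on every other update step.
schedule = [should_train_online(c, every_steps=2) for c in range(6)]
```

An integer unroll_length has a zero fractional part, so sample_unroll_length always returns it unchanged.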

alf.algorithms.containers#

class AlgorithmContainer(algs, train_state_spec, rollout_state_spec, predict_state_spec, is_on_policy, debug_summaries, name)[source]#

Bases: alf.algorithms.algorithm.Algorithm

Algorithm that contains several sub-algorithms.

It provides sensible implementation of several interface functions of Algorithm.

Parameters
  • algs (dict[Algorithm]) – a dictionary of algorithms.

  • train_state_spec (nested TensorSpec) – for the network state of train_step().

  • rollout_state_spec (nested TensorSpec) – for the network state of rollout_step(). If None, it’s assumed to be the same as train_state_spec.

  • predict_state_spec (nested TensorSpec) – for the network state of predict_step(). If None, it’s assumed to be the same as rollout_state_spec.

  • is_on_policy (None|bool) – whether the algorithm is on-policy or not. If None, the on-policiness will be decided based on the on-policiness of each sub-algorithm.

  • debug_summaries (bool) – True if debug summaries should be created.

  • name (str) – name of this algorithm.

after_train_iter(root_inputs, rollout_info)[source]#

Call after_train_iter of each sub-algorithm.

after_update(root_inputs, info)[source]#

Call after_update of each sub-algorithm.

calc_loss(info)[source]#

Call calc_loss of each sub-algorithm and accumulate the loss.
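
The accumulation can be pictured as an element-wise sum over the sub-algorithms' \((T, B)\) losses. In this sketch nested lists stand in for tensors, ConstantLossAlg is a hypothetical sub-algorithm, and routing info by sub-algorithm name is an assumption; the real method works with LossInfo structures.

```python
class ConstantLossAlg:
    """Hypothetical sub-algorithm that returns a fixed (T, B) loss."""

    def __init__(self, value):
        self.value = value

    def calc_loss(self, info):
        num_steps, batch_size = info["shape"]
        return [[self.value] * batch_size for _ in range(num_steps)]


def accumulate_losses(algs, info):
    """Sum the per-step, per-sample losses of every sub-algorithm."""
    total = None
    for name, alg in algs.items():
        loss = alg.calc_loss(info[name])
        if total is None:
            total = [row[:] for row in loss]  # copy so inputs stay untouched
        else:
            total = [[a + b for a, b in zip(row_t, row_l)]
                     for row_t, row_l in zip(total, loss)]
    return total


algs = {"a": ConstantLossAlg(1.0), "b": ConstantLossAlg(2.5)}
info = {"a": {"shape": (2, 3)}, "b": {"shape": (2, 3)}}
total_loss = accumulate_losses(algs, info)
```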

preprocess_experience(root_inputs, rollout_info, batch_info)[source]#

Call the preprocess_experience of each sub-algorithm.

set_on_policy(is_on_policy)[source]#

Call set_on_policy of each sub-algorithm.

set_path(path)[source]#

Set the path for each sub-algorithm.

training: bool#
class EchoAlg(alg, echo_spec, name='EchoAlg')[source]#

Bases: alf.algorithms.algorithm.Algorithm

Echo Algorithm.

Echo algorithm uses part of the output of alg of current step as part of the input of alg for the next step. It assumes that the input of alg is a dict with two keys: ‘input’ and ‘echo’, and the output of alg is a dict with two keys: ‘output’ and ‘echo’. The ‘echo’ output of current step will be the ‘echo’ input of the next step. ‘input’ of alg’s input is from the input of EchoAlg and ‘output’ of alg’s output is the output of EchoAlg.

Parameters
  • alg (Algorithm) – the module for performing the actual computation

  • echo_spec (nested TensorSpec) – describe the data format of echo.

  • name (str) –
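
The echo mechanism described above can be sketched in a few lines. EchoSketch and running_sum_alg are hypothetical stand-ins; in particular, keeping the echo in an attribute is a simplification, and the real EchoAlg may carry it in the algorithm state instead.

```python
def running_sum_alg(inputs):
    """Hypothetical inner alg: adds the new input to the echoed total."""
    total = inputs["echo"] + inputs["input"]
    return {"output": total, "echo": total}


class EchoSketch:
    """Feeds the 'echo' output of one step back as the 'echo' input of the next."""

    def __init__(self, alg, initial_echo):
        self._alg = alg
        self._echo = initial_echo

    def step(self, x):
        result = self._alg({"input": x, "echo": self._echo})
        self._echo = result["echo"]  # becomes the 'echo' input of the next step
        return result["output"]


echo = EchoSketch(running_sum_alg, initial_echo=0)
outputs = [echo.step(x) for x in [1, 2, 3]]
```

Here the echoed value is a running total, so each output depends on every earlier input even though the inner function itself is stateless.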

after_train_iter(root_inputs, rollout_info)[source]#

Do things after completing one training iteration (i.e. train_iter() that consists of one or multiple gradient updates). This function can be used for training additional modules that have their own training logic (e.g., on/off-policy, replay buffers, etc). These modules should be added to _trainable_attributes_to_ignore in the parent algorithm.

Other things might also be possible as long as they should be done once every training iteration.

This function will serve the same purpose as after_update if there is always only one gradient update in each training iteration. Otherwise it’s less frequently called than after_update.

Parameters
  • root_inputs (nest|None) – temporally batched inputs for the rollout_step() of the root algorithm collected during unroll(). In the case where no data is available from the rollout_step() (e.g. in an offline pre-training phase where the online interaction has not started yet) root_inputs will be None.

  • rollout_info (nest|None) – information collected from rollout_step() for this algorithm during unroll(). In the case where no data is available from the rollout_step() (e.g. in an offline pre-training phase where the online interaction has not started yet) rollout_info will be None.

after_update(root_inputs, info)[source]#

Do things after completing one gradient update (i.e. update_with_gradient()). This function can be used for post-processing following one minibatch update, such as copying a training model to a target model in SAC, DQN, etc.

Parameters
  • root_inputs (nest) – temporally batched inputs for the rollout_step() of the root algorithm collected during unroll().

  • info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() for on-policy training or train_step() for off-policy training.
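
As an illustration of the target-model use case, the sketch below performs a soft update inside after_update(). TinyAlg and soft_update are hypothetical, plain lists stand in for network parameters, and the value of tau is arbitrary.

```python
def soft_update(target, source, tau):
    """In place: target <- (1 - tau) * target + tau * source."""
    for i, (t, s) in enumerate(zip(target, source)):
        target[i] = (1.0 - tau) * t + tau * s


class TinyAlg:
    """Hypothetical algorithm with a trained model and a lagging target copy."""

    def __init__(self):
        self.weights = [1.0, 2.0]
        self.target_weights = [0.0, 0.0]

    def after_update(self, root_inputs, info):
        # Called once per gradient update; moves the target toward the model.
        soft_update(self.target_weights, self.weights, tau=0.5)


alg = TinyAlg()
alg.after_update(root_inputs=None, info=None)
after_one = list(alg.target_weights)
alg.after_update(root_inputs=None, info=None)
after_two = list(alg.target_weights)
```

Setting tau to 1.0 turns the soft update into a plain copy of the training model.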

calc_loss(info)[source]#

Calculate the loss at each step for each sample.

Parameters

info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).

Returns

loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).

Return type

LossInfo

predict_step(inputs, state)[source]#

Predict for one step of inputs.

Parameters
  • inputs (nested Tensor) – inputs for prediction.

  • state (nested Tensor) – network state (for RNN).

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match predict_state_spec.

  • info (nest): information for analyzing the agent. In particular, if an element of the info is alf.summary.render.Image, it will be rendered during play. See alf/summary/render.py for details.

Return type

AlgStep

preprocess_experience(root_inputs, rollout_info, batch_info)[source]#

This function is called on the experiences obtained from a replay buffer. An example usage of this function is to calculate advantages and returns in PPOAlgorithm.

The shapes of tensors in experience are assumed to be \((B, T, ...)\).

Parameters
  • root_inputs (nest) – input for rollout_step() of the root algorithm. This is from the replay buffer. Note that this is not the same as the input of rollout_step() of self unless self is the root algorithm.

  • rollout_info (nested Tensor) – AlgStep.info from rollout_step() for this algorithm.

  • batch_info (BatchInfo) – information about this batch of data

Returns

  • processed root_inputs

  • processed rollout_info

Return type

tuple
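
As a rough picture of the kind of work preprocess_experience() can do, the sketch below computes discounted returns over rewards laid out as \((B, T)\). It is a simplification under stated assumptions: nested lists replace tensors, episode boundaries and value bootstrapping are ignored, and it is not PPOAlgorithm's actual advantage computation.

```python
def discounted_returns(rewards_bt, discount):
    """Compute returns for rewards shaped (B, T): one row per sample."""
    returns = []
    for rewards in rewards_bt:
        row = [0.0] * len(rewards)
        running = 0.0
        # Walk backwards in time so each return reuses the next step's return.
        for t in reversed(range(len(rewards))):
            running = rewards[t] + discount * running
            row[t] = running
        returns.append(row)
    return returns


rewards = [[1.0, 1.0, 1.0],
           [0.0, 0.0, 2.0]]
returns = discounted_returns(rewards, discount=0.5)
```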

rollout_step(inputs, state)[source]#

Rollout for one step of inputs.

It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for calc_loss(). For off-policy training, it needs to generate necessary information for train_step().

Parameters
  • inputs (nested Tensor) – inputs for prediction.

  • state (nested Tensor) – network state (for RNN).

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match rollout_state_spec.

  • info (nested Tensor): For on-policy training it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored into the replay buffer and retrieved for train_step() as rollout_info.

Return type

AlgStep
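
The (output, state, info) contract can be illustrated with a toy unroll loop. AlgStep here is a simplified namedtuple stand-in with the three documented fields, and CounterAlg and unroll are hypothetical; the real unroll() also handles batching, environments and gradients.

```python
from collections import namedtuple

# Simplified stand-in for ALF's AlgStep, with the three documented fields.
AlgStep = namedtuple("AlgStep", ["output", "state", "info"])


class CounterAlg:
    """Hypothetical algorithm whose state counts the steps taken so far."""

    def rollout_step(self, inputs, state):
        new_state = state + 1
        return AlgStep(output=inputs * 2, state=new_state,
                       info={"step": new_state})


def unroll(alg, observations, initial_state):
    """Call rollout_step once per observation, threading state through."""
    state, outputs, infos = initial_state, [], []
    for obs in observations:
        step = alg.rollout_step(obs, state)
        state = step.state
        outputs.append(step.output)
        infos.append(step.info)  # later batched over time for training
    return outputs, infos, state


outputs, infos, final_state = unroll(CounterAlg(), [1, 2, 3], initial_state=0)
```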

set_on_policy(is_on_policy)[source]#

Set whether this algorithm is on-policy or not.

Parameters

is_on_policy (bool) –

set_path(path)[source]#

Set the path from the root algorithm to this algorithm.

See AlgorithmInterface.path for description about path. This function is called by the trainer before training starts. It needs to be implemented if the algorithm contains some other sub-algorithms.

If an algorithm does not have any sub-algorithm or its sub-algorithm does not need to access the root replay buffer directly, it does not need to implement this function.

train_step(inputs, state, rollout_info)[source]#

Perform one step of training computation.

It is called to calculate output for every time step for a batch of experience from replay buffer. It also needs to generate necessary information for calc_loss().

Parameters
  • inputs (nested Tensor) – inputs for train.

  • state (nested Tensor) – consistent with train_state_spec.

  • rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match train_state_spec.

  • info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.

Return type

AlgStep

training: bool#
class RLAlgWrapper(observation_spec, action_spec, algorithm, env=None, reward_spec=TensorSpec(shape=(), dtype=torch.float32), config=None, optimizer=None, debug_summaries=False, name='RLAlgWrapper')[source]#

Bases: alf.algorithms.rl_algorithm.RLAlgorithm

Wrap an Algorithm instance as an RLAlgorithm instance so that it can be used for RLTrainer.

Parameters
  • observation_spec (nested TensorSpec) – representing the observations.

  • action_spec (nested BoundedTensorSpec) – representing the actions.

  • algorithm (Algorithm) – algorithm to be wrapped. It should take TimeStep as input and its output will be used as action.

  • reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).

  • env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. Running multiple environments in parallel is crucial to on-policy algorithms as it increases the diversity of data and decreases temporal correlation. env only needs to be provided to the root Algorithm.

  • config (TrainerConfig) – config for training. config only needs to be provided to the algorithm which performs a training iteration by itself.

  • optimizer (torch.optim.Optimizer) – The default optimizer for training.

  • debug_summaries (bool) – If True, debug summaries will be created.

  • name (str) – Name of this algorithm.
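
The wrapping idea, where the wrapped algorithm receives the TimeStep and its output is used as the action, can be sketched as follows. TimeStep and AlgStep are reduced namedtuple stand-ins, and ThresholdAlg and WrapperSketch are hypothetical; the real RLAlgWrapper also handles training, specs and summaries.

```python
from collections import namedtuple

# Reduced stand-ins; the real ALF structures carry more fields.
TimeStep = namedtuple("TimeStep", ["observation", "reward"])
AlgStep = namedtuple("AlgStep", ["output", "state", "info"])


class ThresholdAlg:
    """Hypothetical wrapped algorithm: outputs 1 for positive observations."""

    def predict_step(self, inputs, state):
        action = 1 if inputs.observation > 0 else 0
        return AlgStep(output=action, state=state, info=())


class WrapperSketch:
    """Forwards the TimeStep to the wrapped algorithm; its output is the action."""

    def __init__(self, algorithm):
        self._algorithm = algorithm

    def predict_step(self, time_step, state):
        return self._algorithm.predict_step(time_step, state)


wrapper = WrapperSketch(ThresholdAlg())
step = wrapper.predict_step(TimeStep(observation=0.3, reward=0.0), state=())
```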

after_train_iter(root_inputs, rollout_info)[source]#

Do things after completing one training iteration (i.e. train_iter() that consists of one or multiple gradient updates). This function can be used for training additional modules that have their own training logic (e.g., on/off-policy, replay buffers, etc). These modules should be added to _trainable_attributes_to_ignore in the parent algorithm.

Other things might also be possible as long as they should be done once every training iteration.

This function will serve the same purpose as after_update if there is always only one gradient update in each training iteration. Otherwise it’s less frequently called than after_update.

Parameters
  • root_inputs (nest|None) – temporally batched inputs for the rollout_step() of the root algorithm collected during unroll(). In the case where no data is available from the rollout_step() (e.g. in an offline pre-training phase where the online interaction has not started yet) root_inputs will be None.

  • rollout_info (nest|None) – information collected from rollout_step() for this algorithm during unroll(). In the case where no data is available from the rollout_step() (e.g. in an offline pre-training phase where the online interaction has not started yet) rollout_info will be None.

after_update(root_inputs, info)[source]#

Do things after completing one gradient update (i.e. update_with_gradient()). This function can be used for post-processing following one minibatch update, such as copying a training model to a target model in SAC, DQN, etc.

Parameters
  • root_inputs (nest) – temporally batched inputs for the rollout_step() of the root algorithm collected during unroll().

  • info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() for on-policy training or train_step() for off-policy training.

calc_loss(info)[source]#

Calculate the loss at each step for each sample.

Parameters

info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).

Returns

loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).

Return type

LossInfo

predict_step(inputs, state)[source]#

Predict for one step of observation.

This is only used for evaluation, so it only needs to perform the computations for generating the action distribution.

Parameters
  • time_step (TimeStep) – Current observation and other inputs for computing action.

  • state (nested Tensor) – should be consistent with predict_state_spec

Returns

  • output (nested Tensor): should be consistent with action_spec.

  • state (nested Tensor): should be consistent with predict_state_spec.

Return type

AlgStep

preprocess_experience(root_inputs, rollout_info, batch_info)[source]#

This function is called on the experiences obtained from a replay buffer. An example usage of this function is to calculate advantages and returns in PPOAlgorithm.

The shapes of tensors in experience are assumed to be \((B, T, ...)\).

Parameters
  • root_inputs (nest) – input for rollout_step() of the root algorithm. This is from the replay buffer. Note that this is not the same as the input of rollout_step() of self unless self is the root algorithm.

  • rollout_info (nested Tensor) – AlgStep.info from rollout_step() for this algorithm.

  • batch_info (BatchInfo) – information about this batch of data

Returns

  • processed root_inputs

  • processed rollout_info

Return type

tuple

rollout_step(inputs, state)[source]#

Rollout for one step of inputs.

It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for calc_loss(). For off-policy training, it needs to generate necessary information for train_step().

Parameters
  • inputs (nested Tensor) – inputs for prediction.

  • state (nested Tensor) – network state (for RNN).

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match rollout_state_spec.

  • info (nested Tensor): For on-policy training it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored in the replay buffer and later retrieved for train_step() as rollout_info.

Return type

AlgStep

set_on_policy(is_on_policy)[source]#

Set whether this algorithm is on-policy or not.

Parameters

is_on_policy (bool) –

set_path(path)[source]#

Set the path from the root algorithm to this algorithm.

See AlgorithmInterface.path for description about path. This function is called by the trainer before training starts. It needs to be implemented if the algorithm contains some other sub-algorithms.

If an algorithm does not have any sub-algorithm, or its sub-algorithms do not need to access the root replay buffer directly, it does not need to implement this function.

train_step(inputs, state, rollout_info)[source]#

Perform one step of training computation.

It is called to calculate output for every time step for a batch of experience from replay buffer. It also needs to generate necessary information for calc_loss().

Parameters
  • inputs (nested Tensor) – inputs for train.

  • state (nested Tensor) – consistent with train_state_spec.

  • rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match train_state_spec.

  • info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.

Return type

AlgStep

training: bool#
SequentialAlg(*modules, output='', is_on_policy=None, name='SequentialAlg', **named_modules)[source]#

Compose Algorithms and Networks sequentially as a new Algorithm.

All the modules provided through modules and named_modules are calculated sequentially in the same order as they appear in the call to SequentialAlg. By default, each module takes the output of the previous module as its input (or the input to the SequentialAlg if it is the first module), and the output of the last module is the output of the SequentialAlg. Note that the meaning of the output of a module depends on the type of the module:

  • Algorithm: AlgStep.output field from predict_step, rollout_step or train_step

  • Network: the first element of the tuple returned from forward()

  • torch.nn.Module or Callable: the return value of the Callable.

In addition to using the output of the previous module as input, SequentialAlg also allows using other outputs, states or info from previous modules as the input to a module. To do this, one can pass a tuple of (nested_str, module) instead of module as an argument to SequentialAlg. With this, the inputs to the module will be obtained using get_nested_field(named_results, nested_str), where named_results is a dictionary containing the inputs to SequentialAlg and all the results calculated by previous modules. More specifically, named_results['input'] is the inputs to this algorithm, named_results['a'] is the output of the module named ‘a’, named_results['info']['a'] is the info output of the algorithm named ‘a’, and named_results['state']['a'] is the state output of the algorithm/network named ‘a’.

Example 1:

The following constructs an algorithm which predicts the future of its input:

predictor = EncodingNetwork(...)

alg = SequentialAlg(
    predicted=predictor,
    delayed=networks.Delay(),
    error=(('delayed', 'input'), lambda xy: (xy[0] - xy[1]) ** 2),
    loss=Loss(),
    output='predicted',
)

It is equivalent to the following:

class PredictAlgorithm(Algorithm):
    def __init__(self, predictor):
        super().__init__(train_state_spec=(
            predictor.state_spec,
            predictor.input_tensor_spec))
        self._predictor = predictor
        self._loss = Loss()

    def rollout_step(self, inputs, state):
        return self._step(inputs, state)

    def train_step(self, inputs, state, rollout_info):
        return self._step(inputs, state)

    def _step(self, inputs, state):
        predictor_state, delayed = state
        predicted, predictor_state = self._predictor(inputs, predictor_state)
        error = (delayed - inputs) ** 2
        loss_step = self._loss.rollout_step(error)
        return AlgStep(
            output=predicted,
            state=(predictor_state, predicted),
            info=loss_step.info)

    def calc_loss(self, info):
        return self._loss.calc_loss(info)

alg = PredictAlgorithm(predictor)

Example 2:

The following example constructs an actor-critic algorithm:

value_net = ValueNetwork(...)
actor_net = ActorDistributionNetwork(...)

alg = SequentialAlg(
    is_on_policy=True,
    value=('input.observation', value_net),
    action_dist=('input.observation', actor_net),
    action=dist_utils.sample_action_distribution,
    loss=(ActorCriticInfo(
        reward='input.reward',
        step_type='input.step_type',
        discount='input.discount',
        action_distribution='action_dist',
        action='action',
        value='value'), ActorCriticLoss()),
    output='action')

It is equivalent to the following:

class ACAlgorithm(Algorithm):
    def __init__(self, value_net, actor_net):
        super().__init__(
            train_state_spec=(value_net.state_spec, actor_net.state_spec),
            is_on_policy=True)
        self._value_net = value_net
        self._actor_net = actor_net
        self._loss = ActorCriticLoss()

    def rollout_step(self, inputs, state):
        value, value_state = self._value_net(inputs.observation, state[0])
        action_dist, actor_state = self._actor_net(inputs.observation, state[1])
        action = dist_utils.sample_action_distribution(action_dist)
        loss_step = self._loss.rollout_step(ActorCriticInfo(
            reward=inputs.reward,
            step_type=inputs.step_type,
            discount=inputs.discount,
            action_distribution=action_dist,
            action=action,
            value=value))
        return AlgStep(
            output=action,
            state=(value_state, actor_state),
            info=loss_step.info)

    def calc_loss(self, info):
        return self._loss.calc_loss(info)

alg = ACAlgorithm(value_net, actor_net)

Parameters
  • modules (Callable | Algorithm | (nested str, Callable) | (nested str, Algorithm)) – The Callable can be a torch.nn.Module, alf.nn.Network or plain Callable. Optionally, their inputs can be specified by the first element of the tuple. If input is not provided, it is assumed to be the result of the previous module (or input to this Sequential for the first module). If input is provided, it should be a nested str. It will be used to retrieve results from the dictionary of the current named_results. For modules specified by modules, because no named module has been invoked yet, named_results is {'input': input}.

  • named_modules (Callable | Algorithm | (nested str, Callable) | (nested str, Algorithm)) – The Callable can be a torch.nn.Module, alf.nn.Network or plain Callable. Optionally, their inputs can be specified by the first element of the tuple. If input is not provided, it is assumed to be the result of the previous module (or input to this Sequential for the first module). If input is provided, it should be a nested str. It will be used to retrieve results from the dictionary of the current named_results. named_results is updated once the result of a named module is calculated.

  • output (nested str) – if not provided, the result from the last module will be used as output. Otherwise, it will be used to retrieve results from named_results after the results of all modules have been calculated.

  • is_on_policy (bool) – whether this supports on-policy or off-policy training. If None, it should support both on-policy and off-policy training.

  • name (str) – name of this algorithm
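
The input routing described for SequentialAlg can be illustrated with a small pure-Python sketch. This is not ALF's implementation: run_sequential and get_field are hypothetical helpers that only mimic how named_results accumulates module outputs and how a nested str (or a tuple of them) selects a module's input. State and info handling are omitted, and plain dicts stand in for ALF's nests.

```python
def get_field(named_results, spec):
    """Resolve a nested str spec (hypothetical stand-in for get_nested_field)."""
    if isinstance(spec, tuple):
        return tuple(get_field(named_results, s) for s in spec)
    value = named_results
    for key in spec.split('.'):
        value = value[key]
    return value


def run_sequential(inputs, output='', **named_modules):
    """Run the modules in order, recording each output under its name."""
    named_results = {'input': inputs}
    result = inputs
    for name, module in named_modules.items():
        if isinstance(module, tuple):
            # (nested_str, module): look the input up in named_results.
            spec, module = module
            module_input = get_field(named_results, spec)
        else:
            # Default: feed the output of the previous module.
            module_input = result
        result = module(module_input)
        named_results[name] = result
    # An empty output spec means "use the result of the last module".
    return get_field(named_results, output) if output else result


out = run_sequential(
    3.0,
    doubled=lambda x: 2 * x,
    error=(('doubled', 'input'), lambda xy: (xy[0] - xy[1]) ** 2),
    output='error')
```

Here the error module receives the tuple (named_results['doubled'], named_results['input']), mirroring the ('delayed', 'input') input spec used in Example 1.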

alf.algorithms.data_transformer#

Data transformers for transforming data from environment or replay buffer.

class DataTransformer(transformed_observation_spec, state_spec)[source]#

Bases: torch.nn.modules.module.Module

Base class for data transformers.

DataTransformer is used for transforming raw data from environment before passing to actual algorithms.

Most data transformers can subclass from SimpleDataTransformer, which provides a simpler interface.

Parameters
  • transformed_observation_spec (nested TensorSpec) – describing the transformed observation

  • state_spec (nested TensorSpec) – describing the state of the transformer when it is used to transform TimeStep

property stack_size#

The number of frames being stacked as one observation.

property state_spec#

Get the state spec of this transformer.

training: bool#
transform_experience(experience)[source]#

Transform an Experience structure.

This is used on the experience data retrieved from replay buffer.

Parameters

experience (Experience) – the experience retrieved from replay buffer. Note that experience.batch_info, experience.replay_buffer need to be set.

Returns

transformed experience

Return type

Experience

transform_timestep(timestep, state)[source]#

Transform a TimeStep structure.

This is used during unroll or predict.

Parameters
  • timestep (TimeStep) – the TimeStep to be transformed

  • state (nested Tensor) – the state of the transformer running over the timestep sequence. It should be the state returned from the previous call to transform_timestep. For the initial call to transform_timestep, a zero state following the state_spec can be used.

Returns

  • transformed TimeStep

  • state of the transformer

Return type

tuple

property transformed_observation_spec#

Get the transformed observation_spec.

class FrameStackState(steps, prev_frames)#

Bases: tuple

Create new instance of FrameStackState(steps, prev_frames)

prev_frames#

Alias for field number 1

steps#

Alias for field number 0

class FrameStacker(observation_spec, stack_size=4, stack_axis=0, fields=None)[source]#

Bases: alf.algorithms.data_transformer.DataTransformer

Create a FrameStacker object.

Parameters
  • observation_spec (nested TensorSpec) – describing the observation in timestep

  • stack_size (int) – the number of frames to stack

  • stack_axis (int) – the dimension to stack the observation.

  • fields (list[str]) – fields to be stacked. A field str is a multi-level path denoted by “A.B.C”. If None, then non-nested observation is stacked.

property stack_size#

Get stack_size.

training: bool#
transform_experience(experience)[source]#

Transform an Experience structure.

This is used on the experience data retrieved from replay buffer.

Parameters

experience (Experience) – the experience retrieved from replay buffer. Note that experience.batch_info, experience.replay_buffer need to be set.

Returns

transformed experience

Return type

Experience

transform_timestep(time_step, state)[source]#

Transform a TimeStep structure.

This is used during unroll or predict.

Parameters
  • timestep (TimeStep) – the TimeStep to be transformed

  • state (nested Tensor) – the state of the transformer running over the timestep sequence. It should be the state returned from the previous call to transform_timestep. For the initial call to transform_timestep, a zero state following the state_spec can be used.

Returns

  • transformed TimeStep

  • state of the transformer

Return type

tuple
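
The way the transformer state is threaded through successive transform_timestep calls can be sketched in plain Python. This is a simplified illustration, not ALF's FrameStacker: stack_frames is a hypothetical helper, frames are arbitrary Python objects rather than tensors, and padding the history with copies of the first frame at the start of a sequence is an assumption.

```python
from collections import namedtuple

# Mirrors the documented FrameStackState fields; the logic below is a sketch.
FrameStackState = namedtuple('FrameStackState', ['steps', 'prev_frames'])


def stack_frames(frame, state, stack_size=4):
    """Return (stacked observation, new state) for one incoming frame."""
    if state is None or state.steps == 0:
        # Assumption: at the start of a sequence the history is padded
        # with copies of the first frame.
        prev_frames = [frame] * (stack_size - 1)
        steps = 0
    else:
        prev_frames = list(state.prev_frames)
        steps = state.steps
    stacked = prev_frames + [frame]
    # Keep only the most recent stack_size - 1 frames for the next call.
    new_state = FrameStackState(steps=steps + 1, prev_frames=stacked[1:])
    return stacked, new_state


state = None
for frame in ['f0', 'f1', 'f2']:
    stacked, state = stack_frames(frame, state, stack_size=3)
```

Each call consumes the state returned by the previous call, which is the contract the state parameter above describes.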

class FunctionalRewardTransformer(func, observation_spec=())[source]#

Bases: alf.algorithms.data_transformer.RewardTransformer

Transform reward according to a provided function.

Can be used as a reward shaping function passed to an algorithm (e.g. ActorCriticAlgorithm).

Parameters
  • func (Callable) – the transformation function to be applied to the reward. It takes reward as input and outputs a transformed reward.

  • observation_spec (nested TensorSpec) – describing the observation in timestep

forward(reward)[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool#
class HindsightExperienceTransformer(observation_spec, her_proportion=0.8, achieved_goal_field='time_step.observation.achieved_goal', desired_goal_field='time_step.observation.desired_goal', reward_fn=<function l2_dist_close_reward_fn>)[source]#

Bases: alf.algorithms.data_transformer.DataTransformer

Randomly transform her_proportion of batch_size trajectories with hindsight relabel.

This transformer assumes that input observation is a dict of at least two fields: 1) an achieved_goal field, indicating the current state of the environment, and 2) a desired_goal field, indicating the desired state of the environment. The achieved_goal from a future timestep will be used to relabel the desired_goal of the current timestep. The exact field names can be provided via arguments to the class __init__.

To use this class, add it to any existing data transformers, e.g. use this config if ObservationNormalizer is an existing data transformer:

ReplayBuffer.keep_episodic_info=True
HindsightExperienceTransformer.her_proportion=0.8
TrainerConfig.data_transformer_ctor=[@HindsightExperienceTransformer, @ObservationNormalizer]

See unit test for more details on behavior.

Parameters
  • her_proportion (float) – proportion of hindsight relabeled experience.

  • achieved_goal_field (str) – path to the achieved_goal field in the exp nest.

  • desired_goal_field (str) – path to the desired_goal field in the exp nest.

  • reward_fn (Callable) – function to recompute reward based on achieved_goal and desired_goal. The default gives reward 0 when the L2 distance is less than 0.05 and -1 otherwise, same as is done in suite_robotics environments.
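
The relabeling idea can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the transformer's actual implementation: relabel, hindsight_batch and close are hypothetical helpers, goals are scalars, trajectories are lists of dicts, the first trajectories of the batch are relabeled rather than a random subset, and one future goal is used for the whole trajectory.

```python
import random


def relabel(trajectory, future_index, reward_fn):
    """Relabel every step with the goal achieved at future_index."""
    new_goal = trajectory[future_index]['achieved_goal']
    return [
        dict(step,
             desired_goal=new_goal,
             reward=reward_fn(step['achieved_goal'], new_goal))
        for step in trajectory
    ]


def hindsight_batch(batch, reward_fn, her_proportion=0.8, rng=None):
    """Relabel about her_proportion of the trajectories in batch."""
    rng = rng or random.Random(0)
    num_her = round(her_proportion * len(batch))
    relabeled = []
    for i, trajectory in enumerate(batch):
        if i < num_her:
            # Assumption: the substitute goal comes from a uniformly
            # chosen step after the first one.
            future_index = rng.randrange(1, len(trajectory))
            trajectory = relabel(trajectory, future_index, reward_fn)
        relabeled.append(trajectory)
    return relabeled


def close(achieved, goal):
    """Toy reward: 0 when within 0.05 of the goal, -1 otherwise."""
    return 0.0 if abs(achieved - goal) < 0.05 else -1.0


trajectory = [
    {'achieved_goal': 0.0, 'desired_goal': 1.0, 'reward': -1.0},
    {'achieved_goal': 0.5, 'desired_goal': 1.0, 'reward': -1.0},
]
batch = hindsight_batch([trajectory] * 5, close)
```

With her_proportion=0.8 and five trajectories, four are relabeled and one keeps its original desired_goal and reward.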

training: bool#
transform_experience(experience)[source]#

Hindsight relabel experience.

Note: the environments that the samples come from are ordered in the returned batch.

Parameters

experience (Experience) – experience sampled from replay buffer with batch_info and batch_info.replay_buffer both populated.

Returns

the relabeled experience, with batch_info potentially changed.

Return type

Experience

transform_timestep(timestep, state)[source]#

Transform a TimeStep structure.

This is used during unroll or predict.

Parameters
  • timestep (TimeStep) – the TimeStep to be transformed

  • state (nested Tensor) – the state of the transformer running over the timestep sequence. It should be the state returned from the previous call to transform_timestep. For the initial call to transform_timestep, a zero state following the state_spec can be used.

Returns

  • transformed TimeStep

  • state of the transformer

Return type

tuple

class IdentityDataTransformer(observation_spec=None)[source]#

Bases: alf.algorithms.data_transformer.SimpleDataTransformer

A data transformer that keeps the data unchanged.

Parameters

observation_spec (nested TensorSpec) – describing the observation. This should be provided when the transformed_observation_spec property needs to be accessed.

training: bool#
class ImageScaleTransformer(observation_spec, min=-1.0, max=1.0, fields=None)[source]#

Bases: alf.algorithms.data_transformer.SimpleDataTransformer

Scale image to min and max (0->min, 255->max).

Parameters
  • observation_spec (nested TensorSpec) – describing the observation in timestep

  • fields (list[str]) – the fields to be applied with the transformation. If None, then observation must be a Tensor with dtype uint8. A field str can be a multi-step path denoted by “A.B.C”.

  • min (float) – normalize minimum to this value

  • max (float) – normalize maximum to this value
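
The mapping "0->min, 255->max" corresponds to a linear rescaling of uint8 pixel values. A minimal sketch, assuming a plain list of integers instead of a tensor; scale_image, low and high are hypothetical names standing in for the transformer and its min and max arguments:

```python
def scale_image(pixels, low=-1.0, high=1.0):
    """Map uint8 pixel values linearly so that 0 -> low and 255 -> high."""
    return [low + (high - low) * p / 255.0 for p in pixels]


scaled = scale_image([0, 51, 255])
```

With the default range, a pixel value of 0 maps to -1.0 and 255 maps to 1.0; intermediate values fall proportionally in between.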

training: bool#
class ObservationNormalizer(observation_spec, fields=None, clipping=0.0, window_size=10000, update_rate=0.0001, speed=8.0, zero_mean=True, update_mode='replay', mode='adaptive')[source]#

Bases: alf.algorithms.data_transformer.SimpleDataTransformer

Create an observation normalizer with optional value clipping to be used as the data_transformer of an algorithm. It will be called before both rollout_step() and train_step().

The normalizer by default doesn’t automatically update the mean and std. Instead, it will check when self.forward() is called, whether an algorithm is unrolling or training. It only updates the mean and std during unroll. This is the suggested way of using an observation normalizer (i.e., update the stats when encountering new data for the first time). This same strategy has been used by OpenAI’s baselines for training their Robotics environments.

Parameters
  • observation_spec (nested TensorSpec) – describing the observation in timestep

  • fields (None|list[str]) – If None, normalize all fields. Otherwise, only normalize the specified fields. Each string in fields is a multi-step path denoted by “A.B.C”.

  • clipping (float) – a floating value for clipping the normalized observation into [-clipping, clipping]. Only valid if it’s greater than 0.

  • window_size (int) – the window size of WindowNormalizer.

  • update_rate (float) – the update rate of EMNormalizer.

  • speed (float) – the speed of updating for AdaptiveNormalizer.

  • zero_mean (bool) – whether to make the normalized value be zero-mean

  • update_mode (str) – update stats during specified mode in [“replay”, “rollout”, “pretrain”].

  • mode (str) – a value in [“adaptive”, “window”, “em”] indicates which normalizer to use.
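
To make the roles of update_rate, zero_mean and clipping concrete, here is a sketch of an exponential-moving-average normalizer for scalar observations. It is an illustration under assumptions, not ALF's EMNormalizer: EMNormalizerSketch, its eps stabilizer and the initial moment values are hypothetical, and the real normalizers operate on nested tensors.

```python
import math


class EMNormalizerSketch:
    """Exponential-moving-average normalizer for scalar observations."""

    def __init__(self, update_rate=1e-4, clipping=0.0, zero_mean=True,
                 eps=1e-8):
        self.update_rate = update_rate
        self.clipping = clipping
        self.zero_mean = zero_mean
        self.eps = eps  # hypothetical stabilizer, not a documented argument
        self.mean = 0.0
        self.second_moment = 1.0

    def update(self, x):
        """Move the running moments a fraction update_rate toward x."""
        self.mean += self.update_rate * (x - self.mean)
        self.second_moment += self.update_rate * (x * x - self.second_moment)

    def normalize(self, x):
        """Scale x by the running std, optionally centering and clipping."""
        variance = max(self.second_moment - self.mean ** 2, 0.0)
        std = math.sqrt(variance + self.eps)
        centered = x - self.mean if self.zero_mean else x
        y = centered / std
        if self.clipping > 0:
            # Clipping is only active for a positive threshold.
            y = max(-self.clipping, min(self.clipping, y))
        return y


normalizer = EMNormalizerSketch(update_rate=0.5, clipping=5.0)
for observation in [2.0, 4.0, 6.0]:
    # Statistics would only be updated in the phase chosen by update_mode.
    normalizer.update(observation)
normalized = normalizer.normalize(100.0)
```

Separating update from normalize reflects the idea that statistics are refreshed only in the configured phase, while normalization is applied in every phase.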

training: bool#
class RewardClipping(observation_spec=(), minmax=(-1, 1))[source]#

Bases: alf.algorithms.data_transformer.RewardTransformer

Clamp immediate rewards to the range \([min, max]\).

Can be used as a reward shaping function passed to an algorithm (e.g. ActorCriticAlgorithm).

Note that if the reward is multi-dimensional, the clipping is applied to all the dimensions. If per-dimension operation is desired, FunctionalRewardTransformer can be used.

Parameters
  • observation_spec (nested TensorSpec) – describing the observation in timestep

  • minmax (tuple[float]) – the (min, max) range that rewards are clipped to
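
The clamping behaviour, and what a per-dimension variant might look like, can be sketched on plain Python numbers. clip_reward and clip_per_dim are hypothetical helpers for illustration; the actual transformers operate on reward tensors.

```python
def clip_reward(reward, minmax=(-1, 1)):
    """Clamp a scalar reward, or every dimension of a reward vector."""
    low, high = minmax
    if isinstance(reward, (list, tuple)):
        return [max(low, min(high, r)) for r in reward]
    return max(low, min(high, reward))


def clip_per_dim(reward, ranges):
    """Clamp each reward dimension to its own (low, high) range."""
    return [max(low, min(high, r)) for r, (low, high) in zip(reward, ranges)]


same_range = clip_reward([2, -5, 0.5])
own_ranges = clip_per_dim([2, -5], [(-1, 1), (-10, 10)])
```

A callable shaped like clip_per_dim (reward in, transformed reward out) is the kind of function that the func argument of FunctionalRewardTransformer is documented to accept, once adapted to tensors.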

forward(reward)[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool#
class RewardNormalizer(observation_spec=(), normalizer=None, update_max_calls=0, clip_value=-1.0, update_mode='replay')[source]#

Bases: alf.algorithms.data_transformer.RewardTransformer

Transform reward to be zero-mean and unit-variance.

Parameters
  • observation_spec (nested TensorSpec) – describing the observation in timestep

  • normalizer (Normalizer) – the normalizer to be used to normalize the reward. If None, will use AdaptiveNormalizer according to env reward spec.

  • update_max_calls (int) – If >0, then the normalizer’s statistics will only be updated during the first update_max_calls calls of _transform().

  • clip_value (float) – if > 0, will clip the normalized reward within [-clip_value, clip_value]. Do not clip if clip_value < 0

  • update_mode (str) – update stats during either “replay” or “rollout”.

property clip_value#
forward(reward)[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

property normalizer#
training: bool#
class RewardScaling(scale, observation_spec=())[source]#

Bases: alf.algorithms.data_transformer.RewardTransformer

Scale immediate rewards by a factor of scale.

Can be used as a reward shaping function passed to an algorithm (e.g. ActorCriticAlgorithm).

Note that if the reward is multi-dimensional, the scaling is applied to all the dimensions. If per-dimension operation is desired, FunctionalRewardTransformer can be used.

Parameters
  • scale (float) – scale factor

  • observation_spec (nested TensorSpec) – describing the observation in timestep

forward(reward)[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool#
class RewardShifting(bias, observation_spec=())[source]#

Bases: alf.algorithms.data_transformer.RewardTransformer

Shift immediate rewards by a displacement of bias.

Note that if the reward is multi-dimensional, the shifting is applied to all the dimensions. If per-dimension operation is desired, FunctionalRewardTransformer can be used.

Parameters
  • bias (float) – displacement amount

  • observation_spec (nested TensorSpec) – describing the observation in timestep

forward(reward)[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool#
class RewardTransformer(observation_spec)[source]#

Bases: alf.algorithms.data_transformer.SimpleDataTransformer

Base class for transforming reward.

Parameters

observation_spec (nested TensorSpec) – describing the observation in timestep

training: bool#
class SequentialDataTransformer(data_transformer_ctors, observation_spec)[source]#

Bases: alf.algorithms.data_transformer.DataTransformer

A data transformer consisting of a sequence of data transformers.

Parameters
  • data_transformer_ctors (list[Callable]) – Functions for creating data transformers. Each of them will be called as data_transformer_ctors[i](observation_spec) to create a data transformer.

  • observation_spec (nested TensorSpec) – describing the raw observation in timestep. It is the observation passed to the first data transformer.

members()[source]#
property stack_size#

The number of frames being stacked as one observation.

training: bool#
transform_experience(experience)[source]#

Transform an Experience structure.

This is used on the experience data retrieved from replay buffer.

Parameters

experience (Experience) – the experience retrieved from replay buffer. Note that experience.batch_info, experience.replay_buffer need to be set.

Returns

transformed experience

Return type

Experience

transform_timestep(timestep, state)[source]#

Transform a TimeStep structure.

This is used during unroll or predict.

Parameters
  • timestep (TimeStep) – the TimeStep to be transformed

  • state (nested Tensor) – the state of the transformer running over the timestep sequence. It should be the state returned from the previous call to transform_timestep. For the initial call to transform_timestep, a zero state following the state_spec can be used.

Returns

  • transformed TimeStep

  • state of the transformer

Return type

tuple
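
Chaining transformers means feeding each transformer the output of the previous one while keeping one state per member. The sketch below illustrates this with toy transformers on plain numbers; AddOne, RunningSum and transform_sequentially are hypothetical names, and representing the combined state as a list of member states is an assumption.

```python
class AddOne:
    """Stateless toy transformer."""

    def transform_timestep(self, timestep, state):
        return timestep + 1, state


class RunningSum:
    """Toy transformer whose state is the sum of the inputs seen so far."""

    def transform_timestep(self, timestep, state):
        state = state + timestep
        return state, state


def transform_sequentially(transformers, timestep, states):
    """Apply each transformer in order, threading its own state."""
    new_states = []
    for transformer, state in zip(transformers, states):
        timestep, state = transformer.transform_timestep(timestep, state)
        new_states.append(state)
    return timestep, new_states


transformers = [AddOne(), RunningSum()]
states = [(), 0]  # initial (zero) state for each member
outputs = []
for raw in [1, 2, 3]:
    out, states = transform_sequentially(transformers, raw, states)
    outputs.append(out)
```

The order of the members matters: RunningSum accumulates the values already incremented by AddOne.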

class SimpleDataTransformer(transformed_observation_spec)[source]#

Bases: alf.algorithms.data_transformer.DataTransformer

Base class for simple data transformers.

For simple data transformers, there is no state for transform_timestep and transform_experience, and transform_experience uses the same function _transform to do the transformation of the time_step field of the experience.

Parameters
  • transformed_observation_spec (nested TensorSpec) – describing the transformed observation

  • state_spec (nested TensorSpec) – describing the state of the transformer when it is used to transform TimeStep

training: bool#
transform_experience(experience)[source]#

Transform Experience.

For Experience, the shapes are [B, T, …]

Parameters

experience (Experience) – data to be transformed

Returns

transformed Experience

transform_timestep(timestep, state)[source]#

Transform TimeStep. Note that for TimeStep, the shapes are [B, …].

Parameters

timestep (TimeStep) – data to be transformed

Returns

transformed TimeStep

class UntransformedTimeStep(observation_spec=None)[source]#

Bases: alf.algorithms.data_transformer.SimpleDataTransformer

Put the time step itself to its field “untransformed”. Note that this data transformer must be applied first, before any other data transformer.

Parameters

observation_spec (nested TensorSpec) – describing the observation. This should be provided when the transformed_observation_spec property needs to be accessed.

training: bool#
create_data_transformer(data_transformer_ctor, observation_spec, device=None)[source]#

Create a data transformer.

Parameters
  • data_transformer_ctor (Callable|list[Callable]) – Function(s) for creating data transformer(s). Each of them will be called as data_transformer_ctor(observation_spec) to create a data transformer. Available transformers are in algorithms.data_transformer.

  • observation_spec (nested TensorSpec) – the spec of the raw observation.

  • device (Optional[str]) – If not None, the data transformer(s) will be created on the specified device.

Returns

DataTransformer

l2_dist_close_reward_fn(achieved_goal, goal, threshold=0.05)[source]#

Gives a reward of -1 or 0 based on how close the achieved state is to the goal state.

Parameters
  • achieved_goal (Tensor) – achieved state, of shape [batch_size, batch_length, ...]

  • goal (Tensor) – goal state, of shape [batch_size, batch_length, ...]

  • threshold (float) – L2 distance threshold for the reward.

Returns

Tensor for -1/0 reward of shape [batch_size, batch_length].
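
The shape reduction from [batch_size, batch_length, ...] to [batch_size, batch_length] can be sketched with nested lists. This is a plain-Python illustration, not the tensor implementation: l2_close_reward is a hypothetical name, and the strict "less than" comparison follows the description of the default reward_fn given for HindsightExperienceTransformer.

```python
import math


def l2_close_reward(achieved_goal, goal, threshold=0.05):
    """Nested lists [batch_size, batch_length, dim] -> [batch_size, batch_length]."""
    return [
        [0.0 if math.dist(a, g) < threshold else -1.0
         for a, g in zip(achieved_seq, goal_seq)]
        for achieved_seq, goal_seq in zip(achieved_goal, goal)
    ]


achieved = [[[0.0, 0.0], [1.0, 1.0]]]  # batch_size=1, batch_length=2
desired = [[[0.0, 0.01], [0.0, 0.0]]]
rewards = l2_close_reward(achieved, desired)
```

The first step is within the threshold of its goal and gets reward 0, while the second is far from its goal and gets reward -1.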

alf.algorithms.ddpg_algorithm#

Deep Deterministic Policy Gradient (DDPG).

class DdpgActorState(actor, critics)#

Bases: tuple

Create new instance of DdpgActorState(actor, critics)

actor#

Alias for field number 0

critics#

Alias for field number 1

class DdpgAlgorithm(observation_spec, action_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), actor_network_ctor=<class 'alf.networks.actor_networks.ActorNetwork'>, critic_network_ctor=<class 'alf.networks.critic_networks.CriticNetwork'>, reward_weights=None, epsilon_greedy=None, calculate_priority=False, env=None, config=None, ou_stddev=0.2, ou_damping=0.15, critic_loss_ctor=None, num_critic_replicas=1, target_update_tau=0.05, target_update_period=1, rollout_random_action=0.0, dqda_clipping=None, action_l2=0, actor_optimizer=None, critic_optimizer=None, checkpoint=None, debug_summaries=False, name='DdpgAlgorithm')[source]#

Bases: alf.algorithms.off_policy_algorithm.OffPolicyAlgorithm

Deep Deterministic Policy Gradient (DDPG).

Reference: Lillicrap et al., “Continuous control with deep reinforcement learning”, https://arxiv.org/abs/1509.02971

Parameters
  • observation_spec (nested TensorSpec) – representing the observations.

  • action_spec (nested BoundedTensorSpec) – representing the actions.

  • reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).

  • actor_network_ctor (Callable) – Function to construct the actor network. actor_network_ctor needs to accept input_tensor_spec and action_spec as its arguments and return an actor network. The constructed network will be called with forward(observation, state).

  • critic_network_ctor (Callable) – Function to construct the critic network. critic_network_ctor needs to accept input_tensor_spec which is a tuple of (observation_spec, action_spec). The constructed network will be called with forward((observation, action), state).

  • reward_weights (list[float]) – this is only used when the reward is multidimensional. In that case, the weighted sum of the q values is used for training the actor.

  • epsilon_greedy (float) – a floating value in [0,1], representing the chance of action sampling instead of taking argmax. This can help prevent a dead loop in some deterministic environment like Breakout. Only used for evaluation. If None, its value is taken from config.epsilon_greedy and then alf.get_config_value(TrainerConfig.epsilon_greedy).

  • calculate_priority (bool) – whether to calculate priority. This is only useful if priority replay is enabled.

  • num_critic_replicas (int) – number of critics to be used. Default is 1.

  • env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. env only needs to be provided to the root algorithm.

  • config (TrainerConfig) – config for training. config only needs to be provided to the algorithm which performs train_iter() by itself.

  • ou_stddev (float) – Standard deviation for the Ornstein-Uhlenbeck (OU) noise added in the default collect policy.

  • ou_damping (float) – Damping factor for the OU noise added in the default collect policy.

  • critic_loss_ctor (None|OneStepTDLoss|MultiStepLoss) – a critic loss constructor. If None, a default OneStepTDLoss will be used.

  • target_update_tau (float) – Factor for soft update of the target networks.

  • target_update_period (int) – Period for soft update of the target networks.

  • rollout_random_action (float) – the probability of taking a uniform random action during a rollout_step(). 0 means always directly taking actions added with OU noises and 1 means always sample uniformly random actions. A bigger value results in more exploration during rollout.

  • dqda_clipping (float) – when computing the actor loss, clips the gradient dqda element-wise between [-dqda_clipping, dqda_clipping]. Does not perform clipping if dqda_clipping == 0.

  • action_l2 (float) – weight of squared action l2-norm on actor loss.

  • actor_optimizer (torch.optim.Optimizer) – The optimizer for actor.

  • critic_optimizer (torch.optim.Optimizer) – The optimizer for critic.

  • checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.

  • debug_summaries (bool) – True if debug summaries should be created.

  • name (str) – The name of this algorithm.
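
Three of the hyperparameters above have simple numerical interpretations, sketched here on plain Python floats. These are textbook forms under stated assumptions, not ALF's code: soft_update, ou_step and clip_dqda are hypothetical helpers, the discretised Ornstein-Uhlenbeck update is an assumed form, and treating None or 0 as "no clipping" follows the parameter description.

```python
import random


def soft_update(target, source, tau=0.05):
    """Move each target parameter a fraction tau toward its source value."""
    return [(1.0 - tau) * t + tau * s for t, s in zip(target, source)]


def ou_step(x, stddev=0.2, damping=0.15, rng=random):
    """One step of a discretised OU process (assumed form): decay plus noise."""
    return (1.0 - damping) * x + stddev * rng.gauss(0.0, 1.0)


def clip_dqda(dqda, dqda_clipping=None):
    """Clip dq/da element-wise; None or 0 disables clipping."""
    if not dqda_clipping:
        return list(dqda)
    return [max(-dqda_clipping, min(dqda_clipping, g)) for g in dqda]


target_params = soft_update([0.0, 1.0], [1.0, 1.0], tau=0.05)
noise = 0.0
for _ in range(3):
    noise = ou_step(noise, rng=random.Random(0))  # temporally correlated noise
clipped = clip_dqda([3.0, -0.5], dqda_clipping=1.0)
```

A small target_update_tau makes the target networks trail the trained networks slowly, and a larger ou_damping makes the exploration noise forget its previous value faster.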

after_update(root_inputs, info)[source]#

Do things after completing one gradient update (i.e. update_with_gradient()). This function can be used for post-processing following one minibatch update, such as copying a training model to a target model in SAC, DQN, etc.

Parameters
  • root_inputs (nest) – temporally batched inputs for the rollout_step() of the root algorithm collected during unroll().

  • info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() for on-policy training or train_step() for off-policy training.

calc_loss(info)[source]#

Calculate the loss at each step for each sample.

Parameters

info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).

Returns

loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).

Return type

LossInfo

predict_step(inputs, state)[source]#

Predict for one step of observation.

This is only used for evaluation, so it only needs to perform the computations required for generating the action distribution.

Parameters
  • time_step (TimeStep) – Current observation and other inputs for computing action.

  • state (nested Tensor) – should be consistent with predict_state_spec

Returns

  • output (nested Tensor): should be consistent with action_spec.

  • state (nested Tensor): should be consistent with predict_state_spec.

Return type

AlgStep

rollout_step(time_step, state=None)[source]#

Rollout for one step of inputs.

It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for calc_loss(). For off-policy training, it needs to generate necessary information for train_step().

Parameters
  • inputs (nested Tensor) – inputs for prediction.

  • state (nested Tensor) – network state (for RNN).

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match rollout_state_spec.

  • info (nested Tensor): For on-policy training it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored in the replay buffer and retrieved for train_step() as rollout_info.

Return type

AlgStep
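
To make the three returned fields concrete, here is a minimal sketch that uses a standard-library namedtuple as a stand-in for AlgStep. The toy_rollout_step function and its running-sum "policy" are hypothetical and only show how output, state, and info flow from one environment step to the next.

```python
from collections import namedtuple

# Hypothetical stand-in for ALF's AlgStep, with only the fields named above.
AlgStep = namedtuple("AlgStep", ["output", "state", "info"])


def toy_rollout_step(inputs, state):
    """A toy step whose output is the running sum of the observations."""
    new_state = state + inputs
    # In real training, info is what later reaches calc_loss() or train_step().
    info = {"observation": inputs}
    return AlgStep(output=new_state, state=new_state, info=info)


state = 0.0
infos = []
for obs in [1.0, 2.0, 3.0]:
    step = toy_rollout_step(obs, state)
    state = step.state        # carried to the next call
    infos.append(step.info)   # collected for training
```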

train_step(inputs, state, rollout_info)[source]#

Perform one step of training computation.

It is called to calculate output for every time step for a batch of experience from replay buffer. It also needs to generate necessary information for calc_loss().

Parameters
  • inputs (nested Tensor) – inputs for train.

  • state (nested Tensor) – consistent with train_state_spec.

  • rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match train_state_spec.

  • info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate the loss or override update_with_gradient() to do customized training.

Return type

AlgStep

training: bool#
class DdpgCriticInfo(q_values, target_q_values)#

Bases: tuple

Create new instance of DdpgCriticInfo(q_values, target_q_values)

q_values#

Alias for field number 0

target_q_values#

Alias for field number 1
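
The "Alias for field number N" entries mean that each named field is also reachable by position. The sketch below re-creates DdpgCriticInfo with the standard library's collections.namedtuple purely for illustration. ALF may build the class with its own helper, and the values are made up.

```python
from collections import namedtuple

# Re-created with the standard library for illustration only.
DdpgCriticInfo = namedtuple("DdpgCriticInfo", ["q_values", "target_q_values"])

info = DdpgCriticInfo(q_values=[0.1, 0.2], target_q_values=[0.0, 0.3])

first = info[0]                  # same object as info.q_values
second = info.target_q_values    # same object as info[1]
```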

class DdpgCriticState(critics, target_actor, target_critics)#

Bases: tuple

Create new instance of DdpgCriticState(critics, target_actor, target_critics)

critics#

Alias for field number 0

target_actor#

Alias for field number 1

target_critics#

Alias for field number 2

class DdpgInfo(reward, step_type, discount, action, action_distribution, actor_loss, critic, discounted_return)#

Bases: tuple

Create new instance of DdpgInfo(reward, step_type, discount, action, action_distribution, actor_loss, critic, discounted_return)

action#

Alias for field number 3

action_distribution#

Alias for field number 4

actor_loss#

Alias for field number 5

critic#

Alias for field number 6

discount#

Alias for field number 2

discounted_return#

Alias for field number 7

reward#

Alias for field number 0

step_type#

Alias for field number 1

class DdpgLossInfo(actor, critic)#

Bases: tuple

Create new instance of DdpgLossInfo(actor, critic)

actor#

Alias for field number 0

critic#

Alias for field number 1

class DdpgState(actor, critics)#

Bases: tuple

Create new instance of DdpgState(actor, critics)

actor#

Alias for field number 0

critics#

Alias for field number 1

alf.algorithms.decoding_algorithm#

Decoding algorithm.

class DecodingAlgorithm(decoder, loss=MSELoss(), loss_weight=1.0, name='DecodingAlgorithm')[source]#

Bases: alf.algorithms.algorithm.Algorithm

Generic decoding algorithm.

Parameters
  • decoder (Network) – network for decoding target from input.

  • loss (Callable) – loss function with signature loss(y_pred, y_true). Note that it should not reduce to a scalar. It should at least keep the batch dimension in the returned loss.

  • loss_weight (float) – weight for the loss.
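
The requirement that loss keep the batch dimension can be illustrated with a plain-Python loss of the form loss(y_pred, y_true). The per_sample_mse function is a hypothetical example, not part of ALF. It averages over the feature dimension only and returns one value per sample.

```python
def per_sample_mse(y_pred, y_true):
    """Mean squared error over features, keeping one loss per batch element."""
    return [
        sum((p - t) ** 2 for p, t in zip(pred_row, true_row)) / len(pred_row)
        for pred_row, true_row in zip(y_pred, y_true)
    ]


# A batch of B = 2 samples with 2 features each (made-up numbers).
y_pred = [[1.0, 2.0], [0.0, 0.0]]
y_true = [[1.0, 0.0], [3.0, 4.0]]

losses = per_sample_mse(y_pred, y_true)      # length B, not a scalar
loss_weight = 1.0
weighted = [loss_weight * l for l in losses]
```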

train_step(inputs, state=(), rollout_info=None)[source]#

Train one step.

Parameters
  • inputs (tuple) – tuple of (input, target)

  • state (nested Tensor) – network state for decoder

Returns

  • output: decoding result

  • state: rnn state from decoder

  • info: loss of decoding

Return type

AlgStep

training: bool#

alf.algorithms.diayn_algorithm#

class DIAYNAlgorithm(skill_spec, encoding_net, reward_adapt_speed=8.0, observation_spec=None, hidden_size=(), hidden_activation=<built-in method relu_ of type object>, name='DIAYNAlgorithm')[source]#

Bases: alf.algorithms.algorithm.Algorithm

Diversity is All You Need Module

This module learns a set of skill-conditional policies in an unsupervised way. See Eysenbach et al “Diversity is All You Need: Learning Diverse Skills without a Reward Function” for more details.

Create a DIAYNAlgorithm.

Parameters
  • skill_spec (TensorSpec) – supports both discrete and continuous skills. In the discrete case, the algorithm will predict 1-of-K skills using the cross entropy loss; in the continuous case, the algorithm will predict the skill vector itself using the mean square error loss.

  • encoding_net (EncodingNetwork) – network for encoding observation into a latent feature.

  • reward_adapt_speed (float) – how fast to adapt the reward normalizer. Roughly speaking, the statistics for the normalization are calculated mostly based on the most recent T/speed samples, where T is the total number of samples.

  • observation_spec (TensorSpec) – If not None, this spec is to be used by an observation normalizer to normalize incoming observations. In some cases, the normalized observation can be easier for training the discriminator.

  • hidden_size (tuple[int]) – a tuple of hidden layer sizes used by the discriminator.

  • hidden_activation (torch.nn.functional) – activation for the hidden layers.

  • name (str) – module’s name
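
The two discriminator losses mentioned for skill_spec can be sketched in plain Python as follows. The function names and inputs are hypothetical. The snippet only shows the cross entropy for a 1-of-K skill and the mean square error for a continuous skill vector, not how ALF wires them into the discriminator.

```python
import math


def cross_entropy(logits, skill_index):
    """Cross entropy of a 1-of-K skill given unnormalized logits."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[skill_index]


def mean_square_error(pred, skill):
    """MSE between a predicted and an actual continuous skill vector."""
    return sum((p - s) ** 2 for p, s in zip(pred, skill)) / len(skill)


# Discrete case: K = 4 skills and uninformative logits.
discrete_loss = cross_entropy([0.0, 0.0, 0.0, 0.0], skill_index=2)

# Continuous case: a 2-dimensional skill vector.
continuous_loss = mean_square_error([0.5, -0.5], [0.0, 0.0])
```

With uniform logits the discrete loss equals log K, which is the value a discriminator that cannot tell the skills apart would incur.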

calc_loss(info)[source]#

Calculate the loss at each step for each sample.

Parameters

info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).

Returns

loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).

Return type

LossInfo

rollout_step(inputs, state)[source]#

Rollout for one step of inputs.

It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for calc_loss(). For off-policy training, it needs to generate necessary information for train_step().

Parameters
  • inputs (nested Tensor) – inputs for prediction.

  • state (nested Tensor) – network state (for RNN).

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match rollout_state_spec.

  • info (nested Tensor): For on-policy training it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored in the replay buffer and retrieved for train_step() as rollout_info.

Return type

AlgStep

train_step(inputs, state, rollout_info=None)[source]#

Perform one step of training computation.

It is called to calculate output for every time step for a batch of experience from replay buffer. It also needs to generate necessary information for calc_loss().

Parameters
  • inputs (nested Tensor) – inputs for train.

  • state (nested Tensor) – consistent with train_state_spec.

  • rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match train_state_spec.

  • info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate the loss or override update_with_gradient() to do customized training.

Return type

AlgStep

training: bool#
class DIAYNInfo(loss)#

Bases: tuple

Create new instance of DIAYNInfo(loss,)

loss#

Alias for field number 0

create_discrete_skill_spec(num_of_skills)[source]#

alf.algorithms.dqn_algorithm#

DQN Algorithm.

class DqnAlgorithm(observation_spec, action_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), q_network_cls=<class 'alf.networks.q_networks.QNetwork'>, q_optimizer=None, rollout_epsilon_greedy=0.1, target_net_target_action=True, num_critic_replicas=2, env=None, config=None, critic_loss_ctor=None, checkpoint=None, debug_summaries=False, name='DqnAlgorithm')[source]#

Bases: alf.algorithms.sac_algorithm.SacAlgorithm

DQN/DDQN algorithm:

Mnih et al "Playing Atari with Deep Reinforcement Learning", arXiv:1312.5602
Hasselt et al "Deep Reinforcement Learning with Double Q-learning", arXiv:1509.06461

The difference with DQN is that a minimum is taken from the two critics, similar to TD3, instead of choosing the maximum action using the Q network and evaluating the action value using the target Q network.

The implementation is based on the SAC algorithm.
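
The contrast drawn above can be sketched with two small functions: one evaluates the online network's greedy action with the target network (Double DQN), the other takes an element-wise minimum over two critics before acting greedily. This is only an illustration of the two ideas. The function names, the order of the min and max, and the numbers are assumptions, not a transcription of ALF's implementation.

```python
def ddqn_target_value(online_q_next, target_q_next):
    """Double DQN: choose the action with the online net, evaluate it with the target net."""
    best = max(range(len(online_q_next)), key=lambda a: online_q_next[a])
    return target_q_next[best]


def min_critic_target_value(q_next_1, q_next_2):
    """TD3-style: element-wise minimum of two critics, then the greedy value."""
    return max(min(a, b) for a, b in zip(q_next_1, q_next_2))


def td_target(reward, discount, next_value):
    """One-step bootstrapped target."""
    return reward + discount * next_value


# Hypothetical Q-values for 3 discrete actions at the next state.
q_a = [1.0, 3.0, 2.0]
q_b = [0.5, 1.5, 4.0]

v_ddqn = ddqn_target_value(online_q_next=q_a, target_q_next=q_b)
v_min = min_critic_target_value(q_a, q_b)
y = td_target(reward=1.0, discount=0.9, next_value=v_min)
```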

Parameters
  • observation_spec (nested TensorSpec) – representing the observations.

  • action_spec (BoundedTensorSpec) – Only one discrete action allowed.

  • reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).

  • q_network_cls – used to construct a QNetwork for estimating Q(s,a) given that the action is discrete. Its output spec must be consistent with the discrete action in action_spec.

  • q_optimizer (Optional[Optimizer]) – A custom optimizer for the q network. Uses the enclosing algorithm’s optimizer if None.

  • rollout_epsilon_greedy (Union[float, Scheduler]) – epsilon greedy policy for rollout. Together with the following two parameters, the SAC algorithm can be converted to a DQN or DDQN algorithm when e.g. rollout_epsilon_greedy=0.3, max_target_action=True, and use_entropy_reward=False.

  • target_net_target_action (bool) – when True uses target critic network to get target action (similar to DDPG). When False, uses critic network to get target action (similar to DDQN/SAC).

  • num_critic_replicas (int) – number of critics to be used. Default is 2.

  • env (Optional[AlfEnvironment]) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. env only needs to be provided to the root algorithm.

  • config (Optional[TrainerConfig]) – config for training. It only needs to be provided to the algorithm which performs train_iter() by itself.

  • critic_loss_ctor (Optional[Callable[…, TDLoss]]) – a critic loss constructor. If None, a default OneStepTDLoss will be used.

  • checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.

  • debug_summaries (bool) – True if debug summaries should be created.

  • name (str) – The name of this algorithm.
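
For readers unfamiliar with epsilon-greedy rollout, the textbook form is sketched below: with probability epsilon a uniformly random action is taken, otherwise the action with the largest Q-value. The helper epsilon_greedy_action is hypothetical, and ALF's rollout may instead sample from its action distribution. Treat this as an assumption-laden illustration of what a value such as rollout_epsilon_greedy=0.1 controls.

```python
import random


def epsilon_greedy_action(q_values, epsilon, rng):
    """With probability epsilon pick a uniformly random action, else the greedy one."""
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])


rng = random.Random(0)
q = [0.1, 0.9, 0.3]

# epsilon = 0 is purely greedy; epsilon = 1 is purely random.
greedy = [epsilon_greedy_action(q, 0.0, rng) for _ in range(5)]
explore = [epsilon_greedy_action(q, 1.0, rng) for _ in range(5)]
```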

calc_loss(info)[source]#

Calculate the loss at each step for each sample.

Parameters

info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).

Returns

loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).

Return type

LossInfo

rollout_step(inputs, state)[source]#

rollout_step() basically predicts actions like what is done by predict_step(). Additionally, if states are to be stored in a replay buffer, then this function also calls _critic_networks and _target_critic_networks to maintain their states.

training: bool#

alf.algorithms.dynamic_action_repeat_agent#

class ActionRepeatState(rl, action, steps, k, rl_discount, rl_reward, sample_rewards, repr)#

Bases: tuple

Create new instance of ActionRepeatState(rl, action, steps, k, rl_discount, rl_reward, sample_rewards, repr)

action#

Alias for field number 1

k#

Alias for field number 3

repr#

Alias for field number 7

rl#

Alias for field number 0

rl_discount#

Alias for field number 4

rl_reward#

Alias for field number 5

sample_rewards#

Alias for field number 6

steps#

Alias for field number 2

class DynamicActionRepeatAgent(observation_spec, action_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), env=None, config=None, K=5, rl_algorithm_cls=<class 'alf.algorithms.sac_algorithm.SacAlgorithm'>, representation_learner_cls=None, reward_normalizer_ctor=None, gamma=0.99, optimizer=None, debug_summaries=False, name='DynamicActionRepeatAgent')[source]#

Bases: alf.algorithms.off_policy_algorithm.OffPolicyAlgorithm

Create an agent which learns a variable action repetition duration. At each decision step, the agent outputs both the action to repeat and the number of steps to repeat. These two quantities together constitute the action of the agent. We use SAC with mixed action type for training.

The core idea is similar to Learning to Repeat: Fine Grained Action Repetition for Deep Reinforcement Learning.

Parameters
  • observation_spec (nested TensorSpec) – representing the observations.

  • action_spec (nested BoundedTensorSpec) – representing the actions; can only be continuous actions for now.

  • reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).

  • env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. env only needs to be provided to the root algorithm.

  • config (TrainerConfig) – config for training. config only needs to be provided to the algorithm which performs a training iteration by itself.

  • K (int) – the maximal repeating times for an action.

  • rl_algorithm_cls (Callable) – creates an RL algorithm to be augmented by this dynamic action repeating ability.

  • representation_learner_cls (type) – The algorithm class for learning the representation. If provided, the constructed learner will calculate the representation from the original observation as the observation for downstream algorithms such as rl_algorithm. We assume that the representation is trained by rl_algorithm.

  • reward_normalizer_ctor (Callable) – if not None, it must be RewardNormalizer and environment rewards will be normalized for training.

  • gamma (float) – the reward discount to be applied when accumulating k steps’ rewards for a repeated action. Note that this value should be equal to the gamma used by the critic loss for target values.

  • optimizer (None|Optimizer) – The default optimizer for training. See comments above for detail.

  • debug_summaries (bool) – True if debug summaries should be created.

  • name (str) – name of this agent.
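
The role of gamma when an action is repeated can be sketched as follows: the rewards of the k repeated steps are folded into a single discounted reward. The helper name is hypothetical, and returning gamma^k as the discount for the bootstrapped value is an assumption made for this sketch.

```python
def accumulate_repeated_reward(rewards, gamma):
    """Return (sum_i gamma^i * r_i, gamma^k) for k = len(rewards) repeated steps."""
    total, discount = 0.0, 1.0
    for r in rewards:
        total += discount * r
        discount *= gamma
    return total, discount


# An action repeated for k = 3 environment steps (made-up rewards).
rl_reward, rl_discount = accumulate_repeated_reward([1.0, 1.0, 1.0], gamma=0.5)
```

If the critic loss used a different gamma for its target values, the accumulated reward and the bootstrapped value would be discounted inconsistently, which is why the two are required to match.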

after_update(root_inputs, info)[source]#

Call self._rl.after_update().

calc_loss(info)[source]#

Calculate the loss for training self._rl.

observe_for_replay(exp)[source]#

Record an experience in a replay buffer.

Parameters

exp (nested Tensor) – The shape is \([B, \ldots]\), where \(B\) is the batch size of the batched environment.

predict_step(time_step, state)[source]#

Predict for one step of observation.

This is only used for evaluation, so it only needs to perform the computations required for generating the action distribution.

Parameters
  • time_step (TimeStep) – Current observation and other inputs for computing action.

  • state (nested Tensor) – should be consistent with predict_state_spec

Returns

  • output (nested Tensor): should be consistent with action_spec.

  • state (nested Tensor): should be consistent with predict_state_spec.

Return type

AlgStep

preprocess_experience(root_inputs, rollout_info, batch_info)[source]#

Normalize training rewards if a reward normalizer is provided. Shape of rl_exp is [B, T, ...]. The statistics of the normalizer are updated using randomly sampled rewards.

rollout_step(time_step, state)[source]#

Rollout for one step of inputs.

It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for calc_loss(). For off-policy training, it needs to generate necessary information for train_step().

Parameters
  • inputs (nested Tensor) – inputs for prediction.

  • state (nested Tensor) – network state (for RNN).

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match rollout_state_spec.

  • info (nested Tensor): For on-policy training it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored in the replay buffer and retrieved for train_step() as rollout_info.

Return type

AlgStep

summarize_train(experience, train_info, loss_info, params)[source]#

Overwrite the function because the training action spec is different from the rollout action spec.

train_step(inputs, state, rollout_info)[source]#

Train the underlying RL algorithm self._rl. Because in self.rollout_step() the replay buffer only stores info related to self._rl, here we can directly call self._rl.train_step().

training: bool#

alf.algorithms.dynamics_learning_algorithm#

class DeterministicDynamicsAlgorithm(action_spec, feature_spec, hidden_size=256, num_replicas=1, dynamics_network_ctor=None, name='DeterministicDynamicsAlgorithm')[source]#

Bases: alf.algorithms.dynamics_learning_algorithm.DynamicsLearningAlgorithm

Deterministic Dynamics Learning Module

This module tries to learn the dynamics of the environment with a deterministic model.

Create a DeterministicDynamicsAlgorithm.

Parameters
  • hidden_size (int|tuple) – size of hidden layer(s)

  • num_replicas (int) – number of network replicas to be used in the ensemble for dynamics learning

  • dynamics_network_ctor (Optional[Callable[[Any, Any], DynamicsNetwork]]) – Used to construct a network for predicting the change of the next feature based on the previous feature and action. It should accept input with spec of the format [feature_spec, encoded_action_spec] and output a tensor of the shape feature_spec. For discrete action case, encoded_action is a one-hot representation of the action. For continuous action, encoded action is the original action.
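
The action encoding described for dynamics_network_ctor can be illustrated with plain lists: a discrete action becomes a one-hot vector, and a continuous action is passed through unchanged. The helpers encode_action and dynamics_input are hypothetical names, and the list-of-two layout only mirrors the [feature_spec, encoded_action_spec] ordering given above.

```python
def encode_action(action, num_actions=None):
    """One-hot for a discrete action index; identity for a continuous action vector."""
    if num_actions is None:
        return list(action)
    return [1.0 if i == action else 0.0 for i in range(num_actions)]


def dynamics_input(feature, encoded_action):
    """Pair the inputs in the [feature, encoded_action] order described above."""
    return [list(feature), list(encoded_action)]


discrete = encode_action(2, num_actions=4)
continuous = encode_action([0.3, -0.7])
net_input = dynamics_input([0.1, 0.2, 0.3], discrete)
```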

predict_step(time_step, state)[source]#
Predict the next observation given the current time_step.

The next step is predicted using the prev_action from time_step and the feature from state.

Parameters
  • time_step (TimeStep) – time step structure. The prev_action from time_step will be used for predicting feature of the next step. It should be a Tensor of the shape [B, …], or [B, n, …] when n > 1, where n denotes the number of dynamics network replicas. When the input tensor has the shape of [B, …] and n > 1, it will be first expanded to [B, n, …] to match the number of dynamics network replicas.

  • state (DynamicsState) – state for dynamics learning with the following fields:

    • feature (Tensor): features of the previous observation, of the shape [B, …], or [B, n, …] when n > 1. When state.feature has the shape of [B, …] and n > 1, it will be first expanded to [B, n, …] to match the number of dynamics network replicas. It is used for predicting the feature of the next step together with time_step.prev_action.

    • network: the input state of the dynamics network

Returns

outputs (Tensor): predicted feature of the next step, of the shape [B, …], or [B, n, …] when n > 1.

state (DynamicsState): with the following fields

  • feature (Tensor): tensor of shape [B, …] (or [B, n, …] when n > 1) representing the predicted feature of the next step

  • network: the updated state of the dynamics network

info: empty tuple ()

Return type

AlgStep
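
The expansion from [B, …] to [B, n, …] described for prev_action and state.feature can be pictured with nested lists. The helper expand_for_replicas is hypothetical, and whether ALF copies or merely broadcasts the data is not stated here. The sketch only shows the resulting shape.

```python
def expand_for_replicas(batch, n):
    """Expand a [B, ...] nested list to [B, n, ...] by repeating each sample n times."""
    if n <= 1:
        return batch  # a single replica keeps the [B, ...] layout
    return [[list(sample) for _ in range(n)] for sample in batch]


features = [[1.0, 2.0], [3.0, 4.0]]            # shape [B=2, 2]
expanded = expand_for_replicas(features, n=3)  # shape [B=2, n=3, 2]
shape = (len(expanded), len(expanded[0]), len(expanded[0][0]))
```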

train_step(time_step, state)[source]#
Parameters
  • time_step (TimeStep) – time step structure. The prev_action from time_step will be used for predicting feature of the next step. It should be a Tensor of the shape [B, …] or [B, n, …] when n > 1, where n denotes the number of dynamics network replicas. When the input tensor has the shape of [B, …] and n > 1, it will be first expanded to [B, n, …] to match the number of dynamics network replicas.

  • state (DynamicsState) – state for dynamics learning with the following fields:

    • feature (Tensor): features of the previous observation, of the shape [B, …] or [B, n, …] when n > 1. When state.feature has the shape of [B, …] and n > 1, it will be first expanded to [B, n, …] to match the number of dynamics network replicas. It is used for predicting the feature of the next step together with time_step.prev_action.

    • network: the input state of the dynamics network

Returns

outputs: empty tuple ()

state (DynamicsState): with the following fields

  • feature (Tensor): tensor of shape [B, …] (or [B, n, …] when n > 1) representing the predicted feature of the next step

  • network: the updated state of the dynamics network

info (DynamicsInfo): with the following fields being updated:
  • loss (LossInfo):

Return type

AlgStep

training: bool#
update_state(time_step, state)[source]#
Update the state based on TimeStep data. This function is mainly used during rollout together with a planner. This function is necessary as we need to update the feature in DynamicsState with those of the current observation, after each step of rollout.

Parameters
  • time_step (TimeStep) – input data for dynamics learning

  • state (DynamicsState) – state for DeterministicDynamicsAlgorithm (previous observation)

Returns

updated dynamics state

Return type

state (DynamicsState)

class DynamicsInfo(loss, dist)#

Bases: tuple

Create new instance of DynamicsInfo(loss, dist)

dist#

Alias for field number 1

loss#

Alias for field number 0

class DynamicsLearningAlgorithm(train_state_spec, action_spec, feature_spec, hidden_size=256, num_replicas=1, dynamics_network=None, checkpoint=None, name='DynamicsLearningAlgorithm')[source]#

Bases: alf.algorithms.algorithm.Algorithm

Base Dynamics Learning Module

This module learns the dynamics of the environment with a deterministic model.

Create a DynamicsLearningAlgorithm.

Parameters
  • hidden_size (int|tuple) – size of hidden layer(s)

  • dynamics_network (Network) – network for predicting the change of the next feature based on the previous feature and action. It should accept input with spec of the format [feature_spec, encoded_action_spec] and output a tensor of the shape feature_spec. For discrete action case, encoded_action is a one-hot representation of the action. For continuous action, encoded action is the original action.

  • checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.
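
The “prefix@path” format of checkpoint can be split with ordinary string handling, as in the sketch below. The function split_checkpoint_spec is hypothetical, the example prefix and file path are made up, and how ALF itself parses or validates the string is not documented on this page.

```python
def split_checkpoint_spec(checkpoint):
    """Split 'prefix@path' at the first '@' into (prefix, path)."""
    prefix, sep, path = checkpoint.partition("@")
    if not sep:
        raise ValueError("expected a string of the form 'prefix@path'")
    return prefix, path


# Both the prefix and the file path below are made-up examples.
prefix, path = split_checkpoint_spec("alg._sub_alg@/tmp/ckpt/latest")
```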

calc_loss(info)[source]#

Calculate the loss at each step for each sample.

Parameters

info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).

Returns

loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).

Return type

LossInfo

get_state_specs()[source]#

Get the state specs of the current module. This function is mainly used for constructing the nested state specs by the upper-level module.

property num_replicas#
predict_step(time_step, state)[source]#
Predict the current observation using time_step.prev_action and the feature of the previous observation from state.

Parameters
  • time_step (TimeStep) – input data for dynamics learning

  • state (DynamicsState) – state for dynamics learning

Returns

output:

state (DynamicsState):

info (DynamicsInfo):

Return type

AlgStep

train_step(time_step, state)[source]#
Parameters
  • time_step (TimeStep) – input data for dynamics learning

  • state (DynamicsState) – state for dynamics learning (previous observation)

Returns

output:

state (DynamicsState): state for training

info (DynamicsInfo):

Return type

AlgStep

training: bool#
update_state(time_step, state)[source]#
Update the state based on TimeStep data. This function is mainly used during rollout together with a planner.

Parameters
  • time_step (TimeStep) – input data for dynamics learning

  • state (DynamicsState) – state for DynamicsLearningAlgorithm (previous observation)

Returns

updated dynamics state

Return type

state (DynamicsState)

class DynamicsState(feature, network)#

Bases: tuple

Create new instance of DynamicsState(feature, network)

feature#

Alias for field number 0

network#

Alias for field number 1

class StochasticDynamicsAlgorithm(action_spec, feature_spec, hidden_size=256, num_replicas=1, dynamics_network_ctor=None, name='StochasticDynamicsAlgorithm')[source]#

Bases: alf.algorithms.dynamics_learning_algorithm.DeterministicDynamicsAlgorithm

Stochastic Dynamics Learning Module

This module learns the dynamics of environment with a stochastic model.

Create a StochasticDynamicsAlgorithm.

Parameters
  • hidden_size (int|tuple) – size of hidden layer(s)

  • num_replicas (int) – number of network replicas to be used in the ensemble for dynamics learning

  • dynamics_network_ctor (Optional[Callable[[Any, Any], DynamicsNetwork]]) – used to construct a network for predicting the next feature based on the previous feature and action. It should accept input with spec [feature_spec, encoded_action_spec] and output a tensor of shape feature_spec. For discrete action, encoded_action is a one-hot representation of the action. For continuous action, encoded action is the original action.

predict_step(time_step, state)[source]#
Predict the next observation given the current time_step.

The next step is predicted using the prev_action from time_step and the feature from state.

Parameters
  • time_step (TimeStep) – time step structure. The prev_action from time_step will be used for predicting feature of the next step. It should be a Tensor of the shape [B, …], or [B, n, …] when n > 1, where n denotes the number of dynamics network replicas. When the input tensor has the shape of [B, …] and n > 1, it will be first expanded to [B, n, …] to match the number of dynamics network replicas.

  • state (DynamicsState) – state for dynamics learning with the following fields:

    • feature (Tensor): features of the previous observation, of the shape [B, …], or [B, n, …] when n > 1. When state.feature has the shape of [B, …] and n > 1, it will be first expanded to [B, n, …] to match the number of dynamics network replicas. It is used for predicting the feature of the next step together with time_step.prev_action.

    • network: the input state of the dynamics network

Returns

outputs (Tensor): predicted feature of the next step, of the shape [B, …], or [B, n, …] when n > 1.

state (DynamicsState): with the following fields

  • feature (Tensor): tensor of shape [B, …] (or [B, n, …] when n > 1) representing the predicted feature of the next step

  • network: the updated state of the dynamics network

info (DynamicsInfo): with the following fields being updated:

  • dist (td.Distribution): the predictive distribution which can be used for further calculation or summarization.

Return type

AlgStep

train_step(time_step, state)[source]#
Parameters
  • time_step (TimeStep) – time step structure. The prev_action from time_step will be used for predicting feature of the next step. It should be a Tensor of the shape [B, …] or [B, n, …] when n > 1, where n denotes the number of dynamics network replicas. When the input tensor has the shape of [B, …] and n > 1, it will be first expanded to [B, n, …] to match the number of dynamics network replicas.

  • state (DynamicsState) – state for dynamics learning with the following fields:

    • feature (Tensor): features of the previous observation, of the shape [B, …] or [B, n, …] when n > 1. When state.feature has the shape of [B, …] and n > 1, it will be first expanded to [B, n, …] to match the number of dynamics network replicas. It is used for predicting the feature of the next step together with time_step.prev_action.

    • network: the input state of the dynamics network

Returns

outputs: empty tuple ()

state (DynamicsState): with the following fields

  • feature (Tensor): tensor of shape [B, …] (or [B, n, …] when n > 1) representing the predicted feature of the next step

  • network: the updated state of the dynamics network

info (DynamicsInfo): with the following fields being updated:

  • loss (LossInfo):

  • dist (td.Distribution): the predictive distribution which can be used for further calculation or summarization.

Return type

AlgStep

training: bool#

alf.algorithms.encoding_algorithm#

Encoding algorithm.

class EncodingAlgorithm(observation_spec, action_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), encoder_cls=<class 'alf.networks.encoding_networks.EncodingNetwork'>, time_step_as_input=False, output_fields=None, loss_fields=None, loss_weights=None, optimizer=None, config=None, checkpoint=None, debug_summaries=False, name='EncodingAlgorithm')[source]#

Bases: alf.algorithms.algorithm.Algorithm

Basic encoding algorithm.

It uses the provided encoding network to compute the representation. It also supports the training of the encoding network by using some of its output as losses.

Parameters
  • observation_spec (nested TensorSpec) – representing the observations.

  • action_spec (nested BoundedTensorSpec) – not used

  • encoder_cls (type) – The class or function to create the encoder. It can be called using encoder_cls(input_tensor_spec).

  • time_step_as_input (bool) – If True, use the whole TimeStep structure as the input to the encoder instead of the observation.

  • output_fields (None | nested str) – if None, all the output from the encoder will be used as the output. Otherwise, output_fields will be used to retrieve the fields from the encoder output.

  • loss_fields (None | nested str) – there is no loss if this is None. Otherwise, loss_fields will be used to retrieve fields from encoder output and use them as loss. Note that those fields must be scalar.

  • loss_weights (None | nested str) – if provided, must have the same structure as loss_fields and will be used as weights for the corresponding loss values.

  • config (Optional[TrainerConfig]) – The trainer config. Present as representation learner interface to be used with Agent.

  • optimizer (torch.optim.Optimizer) – if provided, will be used to optimize the parameters of encoder.

  • checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.

  • debug_summaries (bool) – True if debug summaries should be created.

  • name (str) – Name of this algorithm.
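The relationship between loss_fields and loss_weights can be illustrated with a minimal sketch. It does not use ALF: plain dicts stand in for the nested encoder output, dotted paths stand in for field names, and get_field and total_loss are hypothetical helpers introduced only for this illustration.

```python
def get_field(nest, path):
    """Follow a dotted path such as 'aux.recon_loss' through nested dicts."""
    for key in path.split("."):
        nest = nest[key]
    return nest


def total_loss(encoder_output, loss_fields, loss_weights=None):
    """Weighted sum of the scalar fields named by ``loss_fields``.

    Returns None when ``loss_fields`` is None, i.e. there is no loss.
    """
    if loss_fields is None:
        return None
    if loss_weights is None:
        loss_weights = [1.0] * len(loss_fields)
    return sum(w * get_field(encoder_output, f)
               for f, w in zip(loss_fields, loss_weights))


# Hypothetical encoder output: a representation plus two scalar losses.
encoder_output = {
    "latent": [0.1, -0.3],
    "aux": {"recon_loss": 2.0, "kl_loss": 0.5},
}
loss = total_loss(encoder_output,
                  loss_fields=["aux.recon_loss", "aux.kl_loss"],
                  loss_weights=[1.0, 0.1])
```

In this sketch loss_weights mirrors the structure of loss_fields one to one, which is the constraint the parameter description states.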

property output_spec#
predict_step(inputs, state)[source]#

override predict_step

Parameters
  • inputs (TimeStep) – time step structure

  • state (nested Tensor) – network state for encoder

Returns

  • output: encoding result

  • state: rnn state from encoder

Return type

AlgStep

rollout_step(inputs, state)[source]#

override rollout_step

Parameters
  • inputs (TimeStep) – time step structure

  • state (nested Tensor) – network state for encoder

Returns

  • output: encoding result

  • state: rnn state from encoder

  • info: LossInfo

Return type

AlgStep

train_step(inputs, state, rollout_info=None)[source]#

override train_step

Parameters
  • inputs (TimeStep) – time step structure

  • state (nested Tensor) – network state for encoder

  • rollout_info – not used

Returns

  • output: encoding result

  • state: rnn state from encoder

  • info: LossInfo

Return type

AlgStep

training: bool#

alf.algorithms.entropy_target_algorithm#

An algorithm for adjusting entropy regularization strength.

class EntropyTargetAlgorithm(action_spec, initial_alpha=0.1, skip_free_stage=False, max_entropy=None, target_entropy=None, very_slow_update_rate=0.001, slow_update_rate=0.01, fast_update_rate=0.6931471805599453, min_alpha=0.0001, average_window=2, debug_summaries=False, name='EntropyTargetAlgorithm')[source]#

Bases: alf.algorithms.algorithm.Algorithm

Algorithm for adjusting entropy regularization.

It tries to adjust the entropy regularization (i.e. alpha) so that the entropy is not smaller than target_entropy.

The algorithm has three stages:

  1. init stage. This is an optional stage. If the initial entropy is already below max_entropy, then this stage is skipped. Otherwise, alpha is slowly decreased so that the entropy lands at max_entropy, which triggers the next stage (the free stage). In effect, this stage lets the user choose an arbitrarily large initial alpha without considering every specific case.

  2. free stage. During this stage, the alpha is not changed. It transitions to adjust_stage once entropy drops below target_entropy.

  3. adjust stage. During this stage, log_alpha is adjusted using this formula:

    ((below + 0.5 * above) * decreasing - (above + 0.5 * below) * increasing) * update_rate
    

    Note that log_alpha is always decreased if the entropy is increasing, even when the entropy is below the target entropy. This prevents log_alpha from overshooting to too large a value. For the same reason, log_alpha is always increased if the entropy is decreasing, even when the entropy is above the target entropy. update_rate is initialized to fast_update_rate and is reduced by a factor of 0.9 whenever the entropy crosses target_entropy. update_rate is reset to fast_update_rate if the entropy drops too far below target_entropy (i.e., below fast_stage_thresh in the code, which is half of target_entropy if it is positive, and twice target_entropy if it is negative).
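The adjust-stage rule can be sketched in plain Python. This is an illustration under assumptions rather than the ALF implementation: below, above, decreasing and increasing are treated as 0/1 indicators computed against the target and the previous entropy value, the result is added to log_alpha, and the function names, the order of the checks in next_update_rate and the floor at slow_update_rate are choices made for this sketch.

```python
def log_alpha_delta(entropy, prev_entropy, target_entropy, update_rate):
    """Change added to log_alpha for one adjust-stage update."""
    below = 1.0 if entropy < target_entropy else 0.0
    above = 1.0 - below
    decreasing = 1.0 if entropy < prev_entropy else 0.0
    increasing = 1.0 if entropy > prev_entropy else 0.0
    return ((below + 0.5 * above) * decreasing
            - (above + 0.5 * below) * increasing) * update_rate


def fast_stage_thresh(target_entropy):
    """Half of a positive target, twice a negative one."""
    if target_entropy > 0:
        return 0.5 * target_entropy
    return 2.0 * target_entropy


def next_update_rate(update_rate, crossed, entropy, target_entropy,
                     slow_update_rate, fast_update_rate):
    """Reset to the fast rate far below target; decay by 0.9 on a crossing."""
    if entropy < fast_stage_thresh(target_entropy):
        return fast_update_rate
    if crossed:
        # Assumed floor: slow_update_rate is documented as the minimal rate.
        return max(slow_update_rate, update_rate * 0.9)
    return update_rate
```

With a target of 1.0 and an update rate of 0.1, an entropy falling from 1.0 to 0.5 gives a full-size increase of log_alpha, while an entropy rising from 0.3 to 0.5 (still below target) gives a half-size decrease.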

EntropyTargetAlgorithm can be used to approximately reproduce the learning of temperature in Soft Actor-Critic Algorithms and Applications. To do so, you need to use the same target_entropy, set skip_free_stage to True, and set slow_update_rate and fast_update_rate to 4 times the learning rate for temperature.
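The SAC-like setting described above amounts to a small set of keyword arguments. The helper below is hypothetical and only collects them. The argument names come from the class signature, while the target entropy and learning rate values are placeholders.

```python
def sac_like_kwargs(target_entropy, temperature_lr):
    """Keyword arguments that approximate SAC temperature learning."""
    return dict(
        target_entropy=target_entropy,
        skip_free_stage=True,
        slow_update_rate=4 * temperature_lr,
        fast_update_rate=4 * temperature_lr,
    )


# Placeholder values; these would be passed on as
# EntropyTargetAlgorithm(action_spec, **kwargs).
kwargs = sac_like_kwargs(target_entropy=-3.0, temperature_lr=3e-4)
```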

Parameters
  • action_spec (nested BoundedTensorSpec) – representing the actions.

  • initial_alpha (float) – initial value for alpha; make sure that it’s large enough for initial meaningful exploration

  • skip_free_stage (bool) – If True, directly goes to the adjust stage.

  • max_entropy (float|None) – the upper bound of the total entropy. If it is None, min(initial_entropy * 0.8, initial_entropy / 0.8) is used. initial_entropy is estimated from the first average_window steps. The factor 0.8 ensures that the policy becomes less random than the initial policy before the free stage starts.

  • target_entropy (float|None) – the lower bound of the total entropy. If it is None, a default value proportional to the action dimension is used. This value should be less than or equal to max_entropy.

  • very_slow_update_rate (float) – a tiny update rate for log_alpha; used in stage 0.

  • slow_update_rate (float) – minimal update rate for log_alpha; used in stage 2.

  • fast_update_rate (float) – maximum update rate for log_alpha; used in stage 2.

  • min_alpha (float) – the minimal value of alpha. If <=0, \(e^{-100}\) is used.

  • average_window (int) – window size for averaging past entropies.

  • debug_summaries (bool) – True if debug summaries should be created.
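A short worked example of the default max_entropy expression shows why it takes the minimum of two terms: the entropy of a continuous action distribution can be negative, and the minimum lies below the initial entropy for either sign. The function name is hypothetical.

```python
def default_max_entropy(initial_entropy):
    """Default upper bound used when max_entropy is None."""
    return min(initial_entropy * 0.8, initial_entropy / 0.8)


positive_case = default_max_entropy(2.0)   # min(1.6, 2.5)
negative_case = default_max_entropy(-2.0)  # min(-1.6, -2.5)
```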

adjust_alpha(entropy)[source]#

Adjust alpha according to the current entropy.

Parameters

entropy (scalar Tensor) – the current entropy.

Returns

adjusted entropy regularization

calc_loss(info, valid_mask=None)[source]#

Calculate loss.

Parameters
  • info (EntropyTargetInfo) – for computing loss.

  • valid_mask (tensor) – valid mask to be applied on time steps.

Returns

Return type

LossInfo

predict_step(distribution_and_step_type, state)[source]#

Predict for one step of inputs.

Parameters
  • inputs (nested Tensor) – inputs for prediction.

  • state (nested Tensor) – network state (for RNN).

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match predict_state_spec.

  • info (nest): information for analyzing the agent. In particular, if an element of the info is alf.summary.render.Image, it will be rendered during play. See alf/summary/render.py for details.

Return type

AlgStep

rollout_step(distribution_and_step_type, state=None)[source]#

Rollout step.

Parameters
  • distribution (nested Distribution) – action distribution from the policy.

  • step_type (StepType) – the step type for the distributions.

  • on_policy_training (bool) – If False, this step does nothing.

Returns

info field is LossInfo, other fields are empty. All fields are empty if on_policy_training=False.

Return type

AlgStep

train_step(distribution_and_step_type, state=None, rollout_info=None)[source]#

Train step.

Parameters
  • distribution (nested Distribution) – action distribution from the policy.

  • step_type (StepType) – the step type for the distributions.

Returns

info field is LossInfo, other fields are empty.

Return type

AlgStep

training: bool#
class EntropyTargetInfo(step_type, loss)#

Bases: tuple

Create new instance of EntropyTargetInfo(step_type, loss)

loss#

Alias for field number 1

step_type#

Alias for field number 0

class EntropyTargetLossInfo(neg_entropy)#

Bases: tuple

Create new instance of EntropyTargetLossInfo(neg_entropy,)

neg_entropy#

Alias for field number 0
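Both info classes appear to be named tuples, which is what the "Bases: tuple" and "Alias for field number" entries indicate. The stand-ins below are rebuilt with collections.namedtuple purely to show the field order. They are not imported from ALF, and the field values are placeholders.

```python
from collections import namedtuple

# Stand-ins that mirror the documented field order.
EntropyTargetInfo = namedtuple("EntropyTargetInfo", ["step_type", "loss"])
EntropyTargetLossInfo = namedtuple("EntropyTargetLossInfo", ["neg_entropy"])

info = EntropyTargetInfo(step_type=0, loss=0.25)
loss_info = EntropyTargetLossInfo(neg_entropy=-1.5)

# Fields can be read by name or by the documented position.
step_type_by_index = info[0]
loss_by_index = info[1]
```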

class NestedEntropyTargetAlgorithm(action_spec, initial_alpha=0.1, skip_free_stage=False, max_entropy=None, target_entropy=None, very_slow_update_rate=0.001, slow_update_rate=0.01, fast_update_rate=0.6931471805599453, min_alpha=0.0001, average_window=2, debug_summaries=False, name='EntropyTargetAlgorithm')[source]#

Bases: alf.algorithms.algorithm.Algorithm

Algorithm for adjusting entropy regularization.

Similar to EntropyTargetAlgorithm, NestedEntropyTargetAlgorithm adjusts the entropy regularization for each action in a nested action so that the entropy for each action in the nest is not smaller than the corresponding target_entropy. It uses EntropyTargetAlgorithm to do the actual work. See EntropyTargetAlgorithm for how it works.

Parameters
  • action_spec (nested BoundedTensorSpec) – representing the actions.

  • initial_alpha (float) – initial value for alpha; make sure that it’s large enough for initial meaningful exploration

  • skip_free_stage (bool) – If True, directly goes to the adjust stage.

  • max_entropy (Nested[float|None]) –

    the upper bound of the entropy for each corresponding action in action_spec. If it is None, min(initial_entropy * 0.8, initial_entropy / 0.8) is used. initial_entropy is estimated from the first average_window steps. The factor 0.8 ensures that the policy becomes less random than the initial policy before the free stage starts. If target_entropy is nested and:

    • If max_entropy is None: the max entropy of each distribution in action_spec is calculated using the estimated initial entropy for that distribution.

    • If max_entropy is nested: it should have the same structure as action_spec and each element indicates the max entropy for the corresponding distribution in action_spec.

    • If max_entropy is a float: it is the max entropy for each of the distributions in action_spec.

  • target_entropy (Nested[float|None]) – the lower bound of the entropy for each corresponding action in action_spec. If it is None, a default value proportional to the action dimension is used. This value should be less than or equal to max_entropy. If action_spec is nested, target_entropy can also be a nest with the same structure and each element indicates the target entropy for the corresponding distribution in action_spec.

  • very_slow_update_rate (float) – a tiny update rate for log_alpha; used in stage 0.

  • slow_update_rate (float) – minimal update rate for log_alpha; used in stage 2.

  • fast_update_rate (float) – maximum update rate for log_alpha; used in stage 2.

  • min_alpha (float) – the minimal value of alpha. If <=0, \(e^{-100}\) is used.

  • average_window (int) – window size for averaging past entropies.

  • debug_summaries (bool) – True if debug summaries should be created.
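The three accepted forms of max_entropy (None, a float, or a nest) can be pictured as a broadcast over the action nest. The sketch below assumes a flat dict as the nest and uses placeholder strings for the specs. broadcast_to_nest is a hypothetical helper, not an ALF function.

```python
def broadcast_to_nest(value, action_spec):
    """Return one value per action: keep a matching nest, repeat a scalar or None."""
    if isinstance(value, dict):
        if value.keys() != action_spec.keys():
            raise ValueError("nest structure must match action_spec")
        return dict(value)
    return {name: value for name in action_spec}


# Placeholder specs; only the nest structure matters for this sketch.
action_spec = {"move": "BoundedTensorSpec(...)", "gripper": "BoundedTensorSpec(...)"}

shared = broadcast_to_nest(1.5, action_spec)
estimated = broadcast_to_nest(None, action_spec)  # None: estimate per distribution
per_action = broadcast_to_nest({"move": 2.0, "gripper": 0.5}, action_spec)
```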

calc_loss(info, valid_mask=None)[source]#

Calculate the loss at each step for each sample.

Parameters

info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).

Returns

loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).

Return type

LossInfo

predict_step(distribution_and_step_type, state=None)[source]#

Predict for one step of inputs.

Parameters
  • inputs (nested Tensor) – inputs for prediction.

  • state (nested Tensor) – network state (for RNN).

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match predict_state_spec.

  • info (nest): information for analyzing the agent. In particular, if an element of the info is alf.summary.render.Image, it will be rendered during play. See alf/summary/render.py for details.

Return type

AlgStep

rollout_step(distribution_and_step_type, state=None)[source]#

Rollout for one step of inputs.

It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for calc_loss(). For off-policy training, it needs to generate necessary information for train_step().

Parameters
  • inputs (nested Tensor) – inputs for prediction.

  • state (nested Tensor) – network state (for RNN).

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match rollout_state_spec.

  • info (nested Tensor): For on-policy training it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored in the replay buffer and retrieved for train_step() as rollout_info.

Return type

AlgStep

train_step(distribution_and_step_type, state=None, rollout_info=None)[source]#

Perform one step of training computation.

It is called to calculate output for every time step for a batch of experience from replay buffer. It also needs to generate necessary information for calc_loss().

Parameters
  • inputs (nested Tensor) – inputs for train.

  • state (nested Tensor) – consistent with train_state_spec.

  • rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match train_state_spec.

  • info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.

Return type

AlgStep

training: bool#
class SGDEntropyTargetAlgorithm(action_spec, initial_alpha=0.1, target_entropy=None, window_size=1, optimizer=None, debug_summaries=False, name='SGDEntropyTargetAlgorithm')[source]#

Bases: alf.algorithms.algorithm.Algorithm

Adjusting the entropy weight using SGD according to a target, similar to the way of SAC.

Parameters
  • action_spec (TensorSpec) – nested tensor spec for the action

  • initial_alpha (float) – initial value for alpha; make sure that it’s large enough for initial meaningful exploration

  • target_entropy (Union[Callable[[], float], float, None]) – the target of the total entropy. If it is None, a default value proportional to the action dimension is used.

  • window_size (int) – window size for averaging past entropies.

  • optimizer (Optional[Optimizer]) – the optimizer for adjusting the weight

  • debug_summaries (bool) – whether to turn on debugging info

  • name (str) – name of the class

calc_loss(info)[source]#

Calculate the losses for training. It will compute two losses, one for training the entropy weight, and the other for maximizing the entropy of the action distribution.
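A scalar sketch of what two such losses could look like, written in the style commonly used for SAC temperature learning. The exact expressions used by SGDEntropyTargetAlgorithm are not stated here, so the formulas, function names and plain-float arithmetic below are assumptions made for illustration.

```python
import math


def sac_style_losses(log_alpha, entropy, target_entropy):
    """Return (weight loss, entropy loss) for scalar inputs."""
    alpha = math.exp(log_alpha)
    # Loss for the entropy weight: its gradient with respect to log_alpha
    # is (entropy - target_entropy), treating the entropy as a constant.
    alpha_loss = log_alpha * (entropy - target_entropy)
    # Loss for the policy: minimizing it maximizes the entropy, scaled by alpha.
    entropy_loss = -alpha * entropy
    return alpha_loss, entropy_loss


def sgd_step(log_alpha, entropy, target_entropy, lr):
    """One plain gradient-descent step on log_alpha for alpha_loss."""
    grad = entropy - target_entropy
    return log_alpha - lr * grad


raised_log_alpha = sgd_step(0.0, entropy=0.5, target_entropy=1.0, lr=0.1)
lowered_log_alpha = sgd_step(0.0, entropy=2.0, target_entropy=1.0, lr=0.1)
```

Under these assumptions the weight grows when the entropy is below the target and shrinks when it is above.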

predict_step(distribution_and_step_type)[source]#

Predict for one step of inputs.

Parameters
  • inputs (nested Tensor) – inputs for prediction.

  • state (nested Tensor) – network state (for RNN).

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match predict_state_spec.

  • info (nest): information for analyzing the agent. In particular, if an element of the info is alf.summary.render.Image, it will be rendered during play. See alf/summary/render.py for details.

Return type

AlgStep

rollout_step(distribution_and_step_type)[source]#
Parameters

distribution_and_step_type (nested Distribution) – action distribution from the policy, and the step type for the distributions.

Returns

info is EntropyTargetInfo and info.loss is LossInfo; other fields are empty. All fields are empty for off-policy training.

Return type

AlgStep

train_step(distribution_and_step_type)[source]#
Parameters

distribution_and_step_type (nested Distribution) – action distribution from the policy, and the step type for the distributions.

Returns

info is EntropyTargetInfo and info.loss is LossInfo; other fields are empty.

Return type

AlgStep

training: bool#

alf.algorithms.functional_particle_vi_algorithm#

ParticleVI algorithm on parameterized functions.

class FuncParVIAlgorithm(data_creator=None, data_creator_outlier=None, input_tensor_spec=None, output_dim=None, param_net=None, conv_layer_params=None, fc_layer_params=None, use_conv_bias=False, use_conv_ln=False, use_fc_bias=True, use_fc_ln=False, activation=<built-in method relu_ of type object>, last_activation=<function identity>, last_use_bias=True, last_use_ln=False, num_particles=10, entropy_regularization=1.0, loss_type='classification', voting='soft', par_vi='svgd', function_vi=False, function_bs=None, function_extra_bs_ratio=0.1, function_extra_bs_sampler='uniform', function_extra_bs_std=1.0, critic_hidden_layers=(100, 100), critic_iter_num=2, critic_l2_weight=10.0, critic_use_bn=True, num_train_classes=10, optimizer=None, critic_optimizer=None, logging_network=False, logging_training=False, logging_evaluate=False, config=None, debug_summaries=False, name='FuncParVIAlgorithm')[source]#

Bases: alf.algorithms.particle_vi_algorithm.ParVIAlgorithm

Functional ParVI Algorithm

Functional ParVI algorithm maintains a set of functional particles, where each particle is a neural network. All particles are updated using particle-based VI approaches.

There are two ways of treating a neural network as a particle:

  • All the weights of the neural network as a particle.

  • Outputs of the neural network for an input mini-batch as a particle.
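The two views of a particle can be made concrete with a toy single-layer network. This sketch uses plain Python lists and hypothetical helper names. It only shows which vector is treated as the particle in each case.

```python
def weight_particle(weights):
    """Parameter-space view: all weights flattened into one vector."""
    return [w for row in weights for w in row]


def forward(weights, x):
    """Toy linear layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def function_particle(weights, batch):
    """Function-space view: the outputs on a mini-batch, flattened."""
    return [y for x in batch for y in forward(weights, x)]


weights = [[1.0, 2.0, 0.0], [0.0, -1.0, 1.0]]  # 2 outputs, 3 inputs
batch = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]

param_view = weight_particle(weights)
func_view = function_particle(weights, batch)
```

The parameter view has one entry per weight (6 here), while the function view has batch size times output size entries (4 here), so its dimension depends on the mini-batch rather than on the network size.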

Parameters
  • data_creator (Callable) – called as data_creator() to get a tuple of (train_dataloader, test_dataloader)

  • data_creator_outlier (Callable) – called as data_creator_outlier() to get a tuple of (outlier_train_dataloader, outlier_test_dataloader)

  • input_tensor_spec (nested TensorSpec) – the (nested) tensor spec of the input. If nested, then preprocessing_combiner must not be None. It must be provided if data_creator is not provided.

  • output_dim (int) – dimension of the output of the generated network. It must be provided if data_creator is not provided.

  • param_net (ParamNetwork) – input parametric network.

  • conv_layer_params (tuple[tuple]) – a tuple of tuples where each tuple takes a format (filters, kernel_size, strides, padding, pooling_kernel), where padding and pooling_kernel are optional.

  • fc_layer_params (tuple[tuple]) – a tuple of tuples where each tuple takes a format (FC layer sizes, use_bias), where use_bias is optional.

  • use_conv_bias (bool|None) – whether to use bias for conv layers. If None, the value of not use_bn is used for conv layers.

  • use_conv_ln (bool) – whether to use layer normalization for conv layers.

  • use_fc_bias (bool) – whether to use bias for fc layers.

  • use_fc_ln (bool) – whether to use layer normalization for fc layers.

  • activation (Callable) – activation used for all the layers but the last layer.

  • last_activation (Callable) – activation function of the additional layer specified by last_layer_param. Note that if last_layer_param is not None, last_activation has to be specified explicitly.

  • last_use_bias (bool) – whether to use bias for the last layer.

  • last_use_ln (bool) – whether to use normalization for the last layer.

  • num_particles (int) – number of sampling particles

  • entropy_regularization (float) – weight of the repulsive term in par_vi.

  • function_vi (bool) – whether to use function value based par_vi, currently supported by [svgd2, svgd3, gfsf].

  • function_bs (int) – mini batch size for par_vi training. Needed for critic initialization when function_vi is True.

  • function_extra_bs_ratio (float) – ratio of extra sampled batch size w.r.t. the function_bs.

  • function_extra_bs_sampler (str) – type of sampling method for extra training batch, types are [uniform, normal].

  • function_extra_bs_std (float) – std of the normal distribution for sampling extra training batch when using normal sampler.

  • critic_hidden_layers (tuple) – sizes of hidden layers of the critic, used for minmax.

  • critic_l2_weight (float) – weight of L2 regularization in training the critic, used for minmax.

  • critic_iter_num (int) – number of critic updates for each generator train_step, used for minmax.

  • critic_use_bn (bool) – whether to use batch norm for each layer of the critic, used for minmax.

  • critic_optimizer (torch.optim.Optimizer) – Optimizer for training the critic, used for minmax.

  • loss_type (str) – loglikelihood type for the generated functions, types are [classification, regression]

  • voting (str) – types of voting results from sampled functions, types are [soft, hard]

  • par_vi (str) –

    types of particle-based methods for variational inference, types are [svgd, gfsf, minmax]

    • svgd: empirical expectation of SVGD is evaluated by reusing the same batch of particles.

    • gfsf: Wasserstein gradient flow with smoothed functions. It involves a kernel matrix inversion, so it is computationally more expensive, but in some cases the convergence seems faster than svgd approaches.

  • function_vi – whether to use function value based par_vi.

  • num_train_classes (int) – number of classes in training set.

  • optimizer (torch.optim.Optimizer) – The optimizer for training.

  • logging_network (bool) – whether to log the architectures of networks.

  • logging_training (bool) – whether to log loss and acc during training.

  • logging_evaluate (bool) – whether to log loss and acc during evaluation.

  • config (TrainerConfig) – configuration for training

  • name (str) –
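As an illustration of the conv_layer_params format, the sketch below unpacks tuples in which padding and pooling_kernel are optional. The helper name and the fallback values (0 and None) are assumptions for this example. The defaults actually applied by the library are not stated here.

```python
def parse_conv_layer(spec):
    """Unpack (filters, kernel_size, strides[, padding[, pooling_kernel]])."""
    filters, kernel_size, strides = spec[:3]
    padding = spec[3] if len(spec) > 3 else 0            # assumed fallback
    pooling_kernel = spec[4] if len(spec) > 4 else None  # assumed fallback
    return dict(filters=filters, kernel_size=kernel_size, strides=strides,
                padding=padding, pooling_kernel=pooling_kernel)


# Three layers that use the 3-, 4- and 5-element forms respectively.
conv_layer_params = ((32, 8, 4), (64, 4, 2, 1), (64, 3, 1, 1, 2))
layers = [parse_conv_layer(spec) for spec in conv_layer_params]
```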

eval_uncertainty()[source]#

Function to evaluate the epistemic uncertainty of the ensemble. This method computes the following metrics:

  • AUROC (AUC) evaluates the separability of model predictions with respect to the training data and a prespecified outlier dataset. AUC is computed with respect to the entropy in the averaged softmax probabilities, as well as the sum of the variance of the softmax probabilities over the ensemble.
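The two uncertainty scores and the AUROC can be sketched without any library. This is not the ALF implementation. It assumes each ensemble member outputs a list of softmax probabilities for one input, that a higher score means more uncertain, and that AUROC is computed by direct pairwise comparison. All function names are hypothetical.

```python
import math


def mean_probs(ensemble_probs):
    """Average the class probabilities over ensemble members."""
    n = len(ensemble_probs)
    return [sum(p[c] for p in ensemble_probs) / n
            for c in range(len(ensemble_probs[0]))]


def predictive_entropy(ensemble_probs):
    """Entropy of the averaged softmax probabilities."""
    return -sum(p * math.log(p) for p in mean_probs(ensemble_probs) if p > 0)


def summed_variance(ensemble_probs):
    """Sum over classes of the variance across ensemble members."""
    mean = mean_probs(ensemble_probs)
    n = len(ensemble_probs)
    return sum(sum((p[c] - mean[c]) ** 2 for p in ensemble_probs) / n
               for c in range(len(mean)))


def auroc(inlier_scores, outlier_scores):
    """Probability that a random outlier scores above a random inlier."""
    wins = 0.0
    for o in outlier_scores:
        for i in inlier_scores:
            wins += 1.0 if o > i else 0.5 if o == i else 0.0
    return wins / (len(inlier_scores) * len(outlier_scores))


agree = [[0.9, 0.1], [0.9, 0.1]]     # members agree: low uncertainty
disagree = [[0.9, 0.1], [0.1, 0.9]]  # members disagree: high uncertainty
```

An input on which the members disagree gets a higher score under both measures, which is what lets the AUROC separate training data from outliers.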

evaluate()[source]#

Evaluation of the ParVI ensemble on a test dataset.
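For classification, the ensemble prediction depends on the voting option. The sketch below uses the usual meaning of the two terms, which is an assumption here since the options are only named: soft voting averages the members' probabilities and hard voting takes a majority over the members' predicted classes. The function names are hypothetical.

```python
from collections import Counter


def soft_vote(ensemble_probs):
    """Class with the highest probability after averaging over members."""
    n = len(ensemble_probs)
    num_classes = len(ensemble_probs[0])
    mean = [sum(p[c] for p in ensemble_probs) / n for c in range(num_classes)]
    return max(range(num_classes), key=mean.__getitem__)


def hard_vote(ensemble_probs):
    """Most common per-member prediction."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in ensemble_probs]
    return Counter(votes).most_common(1)[0][0]


# Two mildly confident members and one very confident dissenter.
probs = [[0.6, 0.4], [0.6, 0.4], [0.0, 1.0]]
soft_choice = soft_vote(probs)
hard_choice = hard_vote(probs)
```

On this input the two schemes disagree: the averaged probabilities favor class 1, while the majority of individual predictions favors class 0.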

predict_step(inputs, params=None, state=None)[source]#

Predict ensemble outputs for inputs using the hypernetwork model.

Parameters
  • inputs (Tensor) – inputs to the ensemble of networks.

  • params (Tensor) – parameters of the ensemble of networks, if None, use self.particles.

  • state (None) – not used.

Returns

  • output (Tensor): predictions with shape

    [batch_size, self._param_net._output_spec.shape[0]]

  • state (None): not used

Return type

AlgStep

set_data_loader(train_loader, test_loader=None, outlier_data_loaders=None, entropy_regularization=None)[source]#

Set data loader for training and testing.

Parameters
  • train_loader (torch.utils.data.DataLoader) – training data loader

  • test_loader (torch.utils.data.DataLoader) – testing data loader

  • outlier_data_loaders (tuple[torch.utils.data.DataLoader]) – (trainloader, testloader) for outlier datasets

  • entropy_regularization (float) – weight of particle VI repulsive term.

summarize_train(loss_info, params, cum_loss=None, avg_acc=None)[source]#

Generate summaries for training & loss info after each gradient update. The default implementation of this function only summarizes params (with grads) and the loss. An algorithm can override this for additional summaries. See RLAlgorithm.summarize_train() for an example.

Parameters
  • experience (nested Tensor) – samples used for the most recent update_with_gradient(). By default it’s not summarized.

  • train_info (nested Tensor) – AlgStep.info returned by either rollout_step() (on-policy training) or train_step() (off-policy training). By default it’s not summarized.

  • loss_info (LossInfo) – loss

  • params (list[Parameter]) – list of parameters with gradients

train_iter(state=None)[source]#

Perform one epoch (iteration) of training.

Parameters

state (None) – not used

Returns

mini_batch number

train_step(inputs, entropy_regularization=None, loss_mask=None, state=None)[source]#

Perform one batch of training computation.

Parameters
  • inputs (nested Tensor) – input training data.

  • entropy_regularization (float) – weight of the repulsive term in par_vi. If None, use self._entropy_regularization.

  • loss_mask (Tensor) – mask indicating which samples are valid for loss propagation.

  • state (None) – not used

Returns

  • output(Tensor): shape is [batch_size, dim]

  • state: not used

  • info (LossInfo): loss

Return type

AlgStep

training: bool#

alf.algorithms.generator#

A generic generator.

class CriticAlgorithm(input_tensor_spec, output_dim=None, hidden_layers=(3, 3), activation=<built-in method relu_ of type object>, net=None, use_relu_mlp=False, use_bn=True, optimizer=None, name='CriticAlgorithm')[source]#

Bases: alf.algorithms.algorithm.Algorithm

Wrap a critic network as an Algorithm for flexible gradient updates called by the Generator when par_vi is ‘minmax’.

Create a CriticAlgorithm.

Parameters
  • input_tensor_spec (TensorSpec) – spec of inputs.

  • output_dim (int) – dimension of output, default value is input_dim.

  • hidden_layers (tuple) – size of hidden layers.

  • activation (Callable) – activation used for all critic layers.

  • net (Network) – network for predicting outputs from inputs. If None, a default one with hidden_layers will be created

  • use_relu_mlp (bool) – whether to use ReluMLP as the default net constructor. Diagonals of the Jacobian can be explicitly computed for ReluMLP.

  • use_bn (bool) – whether to use batch norm for each critic layer.

  • optimizer (torch.optim.Optimizer) – (optional) optimizer for training.

  • name (str) – name of this CriticAlgorithm.

predict_step(inputs, state=None, requires_jac_diag=False)[source]#

Predict for one step of inputs.

Parameters
  • inputs (Tensor) – inputs for prediction.

  • state – not used.

  • requires_jac_diag (bool) – whether to output the diagonals of the Jacobian.

Returns

  • output (Tensor): predictions, or (predictions, diag_jacobian) if requires_jac_diag is True.

  • state: not used.

Return type

AlgStep

reset_net_parameters()[source]#
training: bool#
class Generator(output_dim, noise_dim=32, input_tensor_spec=None, hidden_layers=(256, ), net=None, net_moving_average_rate=None, entropy_regularization=0.0, mi_weight=None, mi_estimator_cls=<class 'alf.algorithms.mi_estimator.MIEstimator'>, par_vi=None, use_kernel_averager=False, functional_gradient=False, init_lambda=1.0, lambda_trainable=False, block_inverse_mvp=False, direct_jac_inverse=False, inverse_mvp_solve_iters=1, inverse_mvp_hidden_size=100, inverse_mvp_hidden_layers=1, critic_input_dim=None, critic_hidden_layers=(100, 100), critic_l2_weight=10.0, critic_iter_num=2, critic_relu_mlp=False, critic_use_bn=True, minmax_resample=True, critic_optimizer=None, inverse_mvp_optimizer=None, optimizer=None, lambda_optimizer=None, name='Generator')[source]#

Bases: alf.algorithms.algorithm.Algorithm

Generator generates outputs given inputs (can be None) by transforming a random noise and input using net:

outputs = net([noise, input]) if input is not None
          else net(noise)

The generator is trained to minimize the following objective:

\(E(loss\_func(net([noise, input]))) - entropy\_regularization \cdot H(P)\)

where P is the (conditional) distribution of outputs given the inputs implied by net and H(P) is the (conditional) entropy of P.

If the loss is the (unnormalized) negative log probability of some distribution Q and the entropy_regularization is 1, this objective is equivalent to minimizing \(KL(P||Q)\).
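The equivalence can be checked in a few lines. The notation is introduced here for illustration only: \(p\) is the density of P, \(\tilde{q}\) is the unnormalized density of Q, and \(Z = \int \tilde{q}(x) dx\) is its normalizer, so that \(q = \tilde{q} / Z\). With the loss equal to \(-\log \tilde{q}(x)\) and entropy_regularization equal to 1:

```latex
\begin{aligned}
\mathbb{E}_{x \sim P}\left[-\log \tilde{q}(x)\right] - H(P)
  &= \mathbb{E}_{x \sim P}\left[\log p(x) - \log \tilde{q}(x)\right] \\
  &= \mathbb{E}_{x \sim P}\left[\log \frac{p(x)}{q(x)}\right] - \log Z \\
  &= KL(P||Q) - \log Z
\end{aligned}
```

Since \(\log Z\) does not depend on net, the objective and \(KL(P||Q)\) have the same minimizer.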

It uses two different ways to optimize net depending on entropy_regularization:

  • entropy_regularization = 0: the minimization is achieved by simply minimizing loss_func(net([noise, inputs]))

  • entropy_regularization > 0: the minimization is achieved using amortized particle-based variational inference (ParVI), in particular, four ParVI methods are implemented:

    1. amortized Stein Variational Gradient Descent (SVGD):

      Feng et al “Learning to Draw Samples with Amortized Stein Variational Gradient Descent” https://arxiv.org/pdf/1707.06626.pdf

    2. amortized Wasserstein ParVI with Smooth Functions (GFSF):

      Liu, Chang, et al. “Understanding and accelerating particle-based variational inference.” International Conference on Machine Learning. 2019.

    3. amortized Fisher Neural Sampler with Hutchinson’s estimator (MINMAX):

      Hu et al. “Stein Neural Sampler.” https://arxiv.org/abs/1810.03545, 2018.

    4. generative particle-based variational inference (GPVI). GPVI is used if functional_gradient is set to True:

      Ratzlaff, Bai, et al. “Generative Particle Variational Inference via Estimation of Functional Gradients.” International Conference on Machine Learning. 2021.
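To give a feel for the particle-based updates listed above, here is a minimal sketch of the plain (non-amortized) SVGD direction for one-dimensional particles. It is not the Generator's amortized version, in which the update is used to train net instead of moving particles directly. The RBF kernel with a fixed bandwidth, the function names, and the use of entropy_regularization as a multiplier on the repulsive term are assumptions made for this example.

```python
import math


def svgd_direction(particles, score, bandwidth=1.0, entropy_regularization=1.0):
    """SVGD update direction for each 1-D particle.

    ``score(x)`` is the gradient of the log target density at x.
    """
    n = len(particles)
    directions = []
    for x_i in particles:
        total = 0.0
        for x_j in particles:
            k = math.exp(-(x_j - x_i) ** 2 / bandwidth)
            grad_k = -2.0 * (x_j - x_i) / bandwidth * k  # d k(x_j, x_i) / d x_j
            # Driving term pulls toward high density; grad_k repels particles.
            total += k * score(x_j) + entropy_regularization * grad_k
        directions.append(total / n)
    return directions


def svgd_step(particles, score, step_size=0.1, **kwargs):
    """Move every particle one step along its SVGD direction."""
    phi = svgd_direction(particles, score, **kwargs)
    return [x + step_size * d for x, d in zip(particles, phi)]


# Target: standard normal, whose score is -x.
particles = svgd_step([3.0, 3.0], score=lambda x: -x)
```

With a flat target (zero score), only the repulsive term remains and neighboring particles are pushed apart, which is the mechanism that keeps the particle set from collapsing to a single mode.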

It also supports an additional optional objective of maximizing the mutual information between [noise, inputs] and outputs by using mi_estimator to prevent mode collapse. This might be useful for entropy_regularization = 0 as suggested in section 5.1 of the following paper:

Hjelm et al Learning Deep Representations by Mutual Information Estimation and Maximization <https://arxiv.org/pdf/1808.06670.pdf>

Create a Generator.

Parameters
  • output_dim (int) – dimension of output

  • noise_dim (int) – dimension of noise

  • input_tensor_spec (nested TensorSpec) – spec of inputs. If there is no inputs, this should be None.

  • hidden_layers (tuple) – sizes of hidden layers.

  • net (Network) – network for generating outputs from [noise, inputs] or noise (if inputs is None). If None, a default one with hidden_layers will be created

  • net_moving_average_rate (float) – If provided, use a moving average version of net to do prediction. This has been shown to be effective for GAN training (arXiv:1907.02544, arXiv:1812.04948).

  • entropy_regularization (float) – weight of entropy regularization.

  • mi_weight (float) – weight of mutual information loss.

  • mi_estimator_cls (type) – the class of mutual information estimator for maximizing the mutual information between [noise, inputs] and [outputs, inputs].

  • par_vi (string) –

    ParVI methods, options are [svgd, svgd2, svgd3, gfsf, minmax],

    • svgd: empirical expectation of SVGD is evaluated by a single resampled particle. The main benefit of this choice is it supports conditional case, while all other options do not.

    • svgd2: empirical expectation of SVGD is evaluated by splitting half of the sampled batch. It is a trade-off between computational efficiency and convergence speed.

    • svgd3: empirical expectation of SVGD is evaluated by resampled particles of the same batch size. It has better convergence but involves resampling, so it is computationally less efficient than svgd2.

    • gfsf: Wasserstein gradient flow with smoothed functions. It involves a kernel matrix inversion, so it is computationally the most expensive, but in some cases the convergence seems faster than svgd approaches.

    • minmax: Fisher Neural Sampler, optimal descent direction of the Stein discrepancy is solved by an inner optimization procedure in the space of L2 neural networks.

  • use_kernel_averager (bool) – whether or not to use a running average of the kernel bandwidth for ParVI methods.

  • functional_gradient (bool) – whether or not to optimize the generator with GPVI. When True, the dimension of the jacobian of the generator function needs to be square – therefore invertible. When the generator is not square, we ensure this by sampling an input noise vector of the same size as the output, and only forwarding the first noise_dim components. We then add the full noise vector to the output, multiplied by the fullrank_diag_weight.

  • init_lambda (float) – weight on direct input-output link added to the generator output. Only used for GPVI and GPVI_Plus when forcing full rank Jacobian.

  • lambda_trainable (bool) – whether to train lambda.

  • block_inverse_mvp (bool) – whether to use the more efficient block form for inverse_mvp when functional_gradient is True. This option is recommended only when noise_dim < output_dim, as it is equivalent to the default form when noise_dim is equal to output_dim.

  • inverse_mvp_solve_iters (int) – number of iterations of inverse_mvp network training per single iteration of generator training.

  • inverse_mvp_hidden_size (int) – width of hidden layers in inverse_mvp network.

  • inverse_mvp_hidden_layers (int) – number of hidden layers in inverse_mvp network.

  • critic_input_dim (int) – dimension of critic input, used for minmax.

  • critic_hidden_layers (tuple) – sizes of hidden layers of the critic, used for minmax.

  • critic_l2_weight (float) – weight of L2 regularization in training the critic, used for minmax.

  • critic_iter_num (int) – number of critic updates for each generator train_step, used for minmax.

  • critic_relu_mlp (bool) – whether to use ReluMLP as the critic constructor, used for minmax.

  • critic_use_bn (bool) – whether to use batch norm for each layer of the critic, used for minmax.

  • minmax_resample (bool) – whether to resample the generator for each critic update, used for minmax.

  • critic_optimizer (torch.optim.Optimizer) – Optimizer for training the critic, used for minmax.

  • inverse_mvp_optimizer (torch.optim.Optimizer) – Optimizer for training the inverse_mvp network, used when functional_gradient is True.

  • optimizer (torch.optim.Optimizer) – (optional) optimizer for training

  • lambda_optimizer (torch.optim.Optimizer) – Optimizer for training the lambda, used for GPVI and GPVI_Plus when lambda_trainable is True.

  • name (str) – name of this generator
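
The full-rank construction described under functional_gradient can be sketched in plain Python. This is an illustration under stated assumptions, not ALF's implementation: toy_generator, full_rank_forward and the weight argument (standing in for the weight on the direct input-output link) are hypothetical names.

```python
import random


def toy_generator(z_head, output_dim):
    """Hypothetical generator mapping noise_dim inputs to output_dim outputs."""
    s = sum(z_head)
    return [s * (i + 1) for i in range(output_dim)]


def full_rank_forward(noise_dim, output_dim, weight, rng):
    # Sample a noise vector as long as the output, so the Jacobian is square.
    z = [rng.gauss(0.0, 1.0) for _ in range(output_dim)]
    # Only the first noise_dim components are passed through the generator.
    g = toy_generator(z[:noise_dim], output_dim)
    # Add the full noise vector, scaled by the weight, to the generator output.
    out = [g_i + weight * z_i for g_i, z_i in zip(g, z)]
    return z, out


z, out = full_rank_forward(noise_dim=2, output_dim=4, weight=1.0,
                           rng=random.Random(0))
```

With weight set to 0 the direct link vanishes and the output reduces to the plain generator output.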

after_update(training_info)[source]#

Do things after completing one gradient update (i.e. update_with_gradient()). This function can be used for post-processing following one minibatch update, such as copying a training model to a target model in SAC, DQN, etc.

Parameters
  • root_inputs (nest) – temporally batched inputs for the rollout_step() of the root algorithm collected during unroll().

  • info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() for on-policy training or train_step() for off-policy training.

get_lambda(training=False)[source]#
property noise_dim#
predict_step(inputs=None, noise=None, batch_size=None, training=False, state=None)[source]#

Generate outputs given inputs.

Parameters
  • inputs (nested Tensor) – if None, the outputs are generated only from noise.

  • noise (Tensor) – input to the generator.

  • batch_size (int) – batch_size. Must be provided if inputs is None. It is ignored if inputs is not None.

  • training (bool) – whether to train the generator.

  • state – not used

Returns

  • output (Tensor): predictions with shape [batch_size, output_dim]

  • state: not used.

Return type

AlgStep
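
The batch_size rule above (required when inputs is None, ignored otherwise) can be summarized with a small helper. resolve_batch_size is a hypothetical name used only for illustration, and the sketch assumes inputs is a simple sequence whose length is the batch dimension.

```python
def resolve_batch_size(inputs, batch_size):
    """Return the batch size a generator call would use (hypothetical helper)."""
    if inputs is None:
        if batch_size is None:
            raise ValueError("batch_size must be provided when inputs is None")
        return batch_size
    # inputs is given: its leading dimension wins and batch_size is ignored.
    return len(inputs)
```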

train_step(inputs, loss_func, batch_size=None, transform_func=None, entropy_regularization=None, state=None)[source]#
Parameters
  • inputs (nested Tensor) – if None, the outputs are generated only from noise.

  • loss_func (Callable) – loss_func([outputs, inputs]) (loss_func(outputs) if inputs is None) returns a Tensor or a namedtuple of tensors with field loss, which is a Tensor of shape [batch_size] giving the loss term for optimizing the generator.

  • batch_size (int) – batch_size. Must be provided if inputs is None. It is ignored if inputs is not None.

  • transform_func (Callable) –

    transform function on generator’s outputs. Used in function value based par_vi (currently supported by [svgd2, svgd3, gfsf]) for evaluating the network(s) parameterized by the generator’s outputs (given by self._predict) on the training batch (predefined with transform_func). It can be called in two ways

    • transform_func(params): params is a tensor of parameters for a network, of shape [D] or [B, D]

      • B: batch size

      • D: length of network parameters

      In this case, transform_func first samples additional data besides the predefined training batch and then evaluates the network(s) parameterized by params on the training batch plus the additional sampled data.

    • transform_func((params, extra_samples)): params is the same as above case and extra_samples is the tensor of additional sampled data. In this case, transform_func evaluates the network(s) parameterized by params on predefined training batch plus extra_samples.

    It returns three tensors:

    • outputs: outputs of network parameterized by params evaluated on predefined training batch.

    • density_outputs: outputs of network parameterized by params evaluated on additional sampled data.

    • extra_samples: additional sampled data, same as input extra_samples if called as transform_func((params, extra_samples))

  • entropy_regularization (float) – weight of entropy regularization.

  • state – not used

Returns

  • output (Tensor): predictions with shape [batch_size, output_dim]

  • info (LossInfo): loss

Return type

AlgStep
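
The two calling conventions of transform_func described above can be sketched as follows. This is a toy illustration, not ALF code: make_transform_func, sample_extra and the linear "network" are assumptions made for the example.

```python
def make_transform_func(train_batch, sample_extra):
    """Build a transform_func over a fixed training batch (hypothetical helper)."""

    def transform_func(arg):
        if isinstance(arg, tuple):
            # Called as transform_func((params, extra_samples)):
            # reuse the extra samples supplied by the caller.
            params, extra_samples = arg
        else:
            # Called as transform_func(params): draw the extra samples here.
            params, extra_samples = arg, sample_extra()
        # Toy "network": the linear function w * x + b parameterized by params.
        w, b = params
        outputs = [w * x + b for x in train_batch]
        density_outputs = [w * x + b for x in extra_samples]
        return outputs, density_outputs, extra_samples

    return transform_func


tf = make_transform_func([1.0, 2.0], lambda: [10.0])
outputs, density_outputs, extra_samples = tf([2.0, 1.0])
```

In both forms the function returns the same triple, so downstream code does not need to know which form was used.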

training: bool#
class GeneratorLossInfo(generator, mi_estimator, inverse_mvp)#

Bases: tuple

Create new instance of GeneratorLossInfo(generator, mi_estimator, inverse_mvp)

generator#

Alias for field number 0

inverse_mvp#

Alias for field number 2

mi_estimator#

Alias for field number 1
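
The "Alias for field number" entries are what a named tuple exposes for each field. A minimal sketch, assuming GeneratorLossInfo behaves like a standard collections.namedtuple (the values below are made up):

```python
from collections import namedtuple

# Field order matches the documented field numbers 0, 1 and 2.
GeneratorLossInfo = namedtuple(
    "GeneratorLossInfo", ["generator", "mi_estimator", "inverse_mvp"])

info = GeneratorLossInfo(generator=0.5, mi_estimator=None, inverse_mvp=0.1)
# Each field can be read by name or by position.
by_name = info.inverse_mvp
by_index = info[2]
```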

class InverseMVPAlgorithm(input_dim, output_dim, hidden_size=100, num_hidden_layers=1, activation=<built-in method relu_ of type object>, optimizer=None, name='InverseMVPAlgorithm')[source]#

Bases: alf.algorithms.algorithm.Algorithm

InverseMVP network Algorithm

Maintain an encoding network that takes (z, vec) as input and predicts a matrix-vector product (mvp) of the form \(y=J^{-1}(z)*vec\), where \(J^{-1}(z)\) is the inverse of the Jacobian matrix of some function \(f(z)\), and vec is a vector. This network is used in GPVI in computing the functional_gradient of the generator, where \(J^{-1}\) is the inverse of the Jacobian of the generator function w.r.t. input noise \(z'\), and vec is the gradient of the kernel \(\nabla_{z'}k(z', z)\).

Training of this network is done outside of the algorithm, where the network is trained to predict \(y\) that minimizes the objective \(||Jy - vec||^2\).
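
To see why minimizing the squared residual between \(Jy\) and vec recovers the inverse matrix-vector product, the sketch below optimizes \(y\) directly by gradient descent for a fixed 2x2 Jacobian. It is a simplification: the algorithm trains a network to predict \(y\), whereas this example solves for a single \(y\); solve_inverse_mvp and the helper names are hypothetical.

```python
def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def transpose(m):
    return [list(col) for col in zip(*m)]


def solve_inverse_mvp(jac, vec, lr=0.1, steps=500):
    """Minimize ||J y - vec||^2 over y by plain gradient descent."""
    y = [0.0] * len(vec)
    jac_t = transpose(jac)
    for _ in range(steps):
        residual = [a - b for a, b in zip(matvec(jac, y), vec)]
        # Gradient of the squared residual is 2 * J^T (J y - vec).
        grad = [2.0 * g for g in matvec(jac_t, residual)]
        y = [yi - lr * gi for yi, gi in zip(y, grad)]
    return y


J = [[2.0, 0.0], [1.0, 1.0]]
vec = [2.0, 3.0]
y = solve_inverse_mvp(J, vec)  # converges towards J^{-1} vec = [1, 2]
```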

Create an InverseMVPAlgorithm.

Parameters
  • input_dim (int) – dimension of input z

  • output_dim (int) – output dimension, i.e., dimension of the mvp

  • hidden_size (int) – width of hidden layers

  • num_hidden_layers (int) – number of hidden layers

  • activation (Callable) – activation used for all hidden layers.

  • optimizer (torch.optim.Optimizer) – (optional) optimizer for training.

  • name (str) – name of this Algorithm.

predict_step(inputs, state=None)[source]#

Predict for one step of inputs.

Parameters
  • inputs (tuple of Tensors) – inputs (z, vec) for prediction.

    • z (Tensor): of size [N2, K] or [N2, D], representing \(z'\), where K is self._z_dim and D is self._vec_dim.

    • vec (Tensor): of size [N2, D] or [N2, N, D], representing \(\nabla_{z'}k(z', z)\) in GPVI.

  • state – not used.

Returns

  • output (tuple of Tensors): predictions of InverseMVP network and the z_inputs, which is [:, :K] of z.

  • state: not used.

Return type

AlgStep

training: bool#

alf.algorithms.goal_generator#

class GoalInfo(goal, loss)#

Bases: tuple

Create new instance of GoalInfo(goal, loss)

goal#

Alias for field number 0

loss#

Alias for field number 1

class GoalState(goal)#

Bases: tuple

Create new instance of GoalState(goal,)

goal#

Alias for field number 0

class RandomCategoricalGoalGenerator(observation_spec, num_of_goals, name='RandomCategoricalGoalGenerator')[source]#

Bases: alf.algorithms.rl_algorithm.RLAlgorithm

Random Goal Generation Module.

This module generates a random categorical goal for the agent at the beginning of every episode.

Parameters
  • observation_spec (nested TensorSpec) – representing the observations.

  • num_of_goals (int) – total number of goals the agent can sample from.

  • name (str) – name of the algorithm.
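
The behavior described above, drawing a new categorical goal when an episode starts and keeping it otherwise, can be sketched in plain Python. one_hot and next_goal are hypothetical helpers, and the one-hot encoding follows the goal vectors mentioned for train_step(); the real module operates on batched tensors.

```python
import random


def one_hot(index, num_of_goals):
    return [1.0 if i == index else 0.0 for i in range(num_of_goals)]


def next_goal(prev_goal, is_first_step, num_of_goals, rng):
    """Resample the goal on the first step of an episode, else keep it."""
    if is_first_step:
        return one_hot(rng.randrange(num_of_goals), num_of_goals)
    return prev_goal


rng = random.Random(0)
goal = next_goal(None, True, 4, rng)    # episode start: sample a goal
same = next_goal(goal, False, 4, rng)   # mid-episode: goal is unchanged
```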

calc_loss(info)[source]#

Calculate the loss at each step for each sample.

Parameters

info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).

Returns

loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).

Return type

LossInfo

predict_step(inputs, state)[source]#

Predict for one step of observation.

This is only used for evaluation, so it only needs to perform the computations for generating the action distribution.

Parameters
  • time_step (TimeStep) – Current observation and other inputs for computing action.

  • state (nested Tensor) – should be consistent with predict_state_spec

Returns

  • output (nested Tensor): should be consistent with action_spec.

  • state (nested Tensor): should be consistent with predict_state_spec.

Return type

AlgStep

rollout_step(inputs, state)[source]#

Rollout for one step of inputs.

It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for calc_loss(). For off-policy training, it needs to generate necessary information for train_step().

Parameters
  • inputs (nested Tensor) – inputs for prediction.

  • state (nested Tensor) – network state (for RNN).

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match rollout_state_spec.

  • info (nested Tensor): For on-policy training it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored in the replay buffer and retrieved for train_step() as rollout_info.

Return type

AlgStep

train_step(inputs, state, rollout_info)[source]#

For off-policy training, the current output goal should be taken from the goal in rollout_info (historical goals generated during rollout).

Note that we cannot take the goal from state and pass it down because the first state might be a zero vector. And we also cannot resample the goal online because that might be inconsistent with the sampled experience trajectory.

Parameters
  • inputs (TimeStep) – the experience data.

  • state (nested Tensor) –

  • rollout_info (GoalInfo) –

Returns

  • output (Tensor): one-hot goal vectors

  • state (nested Tensor):

  • info (GoalInfo): for training.

Return type

AlgStep

training: bool#

alf.algorithms.handcrafted_algorithm#

Handcrafted Algorithm.

class HandcraftedAlgorithm(observation_spec, action_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), env=None, config=None, debug_summaries=False, name='Handcrafted')[source]#

Bases: alf.algorithms.off_policy_algorithm.OffPolicyAlgorithm

A base class for algorithms with handcrafted computational logic. Note that a concrete algorithm should subclass this and implement the computational logic in _policy_func. See SimpleCarlaAlgorithm for an example.

Parameters
  • observation_spec (nested TensorSpec) – representing the observations.

  • action_spec (nested BoundedTensorSpec) – representing the actions.

  • reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).

  • env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. env only needs to be provided to the root algorithm.

  • config (TrainerConfig) – config for training. It only needs to be provided to the algorithm which performs train_iter() by itself.

  • debug_summaries (bool) – True if debug summaries should be created.

  • name (str) – The name of this algorithm.

calc_loss(info)[source]#

Calculate the loss at each step for each sample.

Parameters

info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).

Returns

loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).

Return type

LossInfo

predict_step(inputs, state)[source]#

Predict for one step of observation.

This is only used for evaluation, so it only needs to perform the computations for generating the action distribution.

Parameters
  • time_step (TimeStep) – Current observation and other inputs for computing action.

  • state (nested Tensor) – should be consistent with predict_state_spec

Returns

  • output (nested Tensor): should be consistent with action_spec.

  • state (nested Tensor): should be consistent with predict_state_spec.

Return type

AlgStep

rollout_step(inputs, state)[source]#

Rollout for one step of inputs.

It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for calc_loss(). For off-policy training, it needs to generate necessary information for train_step().

Parameters
  • inputs (nested Tensor) – inputs for prediction.

  • state (nested Tensor) – network state (for RNN).

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match rollout_state_spec.

  • info (nested Tensor): For on-policy training it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored in the replay buffer and retrieved for train_step() as rollout_info.

Return type

AlgStep

train_step(inputs, state, rollout_info)[source]#

Perform one step of training computation.

It is called to calculate output for every time step for a batch of experience from replay buffer. It also needs to generate necessary information for calc_loss().

Parameters
  • inputs (nested Tensor) – inputs for train.

  • state (nested Tensor) – consistent with train_state_spec.

  • rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match train_state_spec.

  • info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.

Return type

AlgStep

training: bool#
class SimpleCarlaAlgorithm(observation_spec, action_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), distance_to_decelerate=50.0, distance_to_stop=1.0, env=None, config=None, debug_summaries=False, name='SimpleCarlaAlgorithm')[source]#

Bases: alf.algorithms.handcrafted_algorithm.HandcraftedAlgorithm

A simple controller for Carla environment.

Parameters
  • observation_spec (nested TensorSpec) – representing the observations.

  • action_spec (nested BoundedTensorSpec) – representing the actions.

  • reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).

  • distance_to_decelerate (float|int) – the distance in meters to the goal from which to start decreasing the speed

  • distance_to_stop (float|int) – the distance in meters to the goal from which to start coming to a stop

  • env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. env only needs to be provided to the root algorithm.

  • config (TrainerConfig) – config for training. It only needs to be provided to the algorithm which performs train_iter() by itself.

  • debug_summaries (bool) – True if debug summaries should be created.

  • name (str) – The name of this algorithm.
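
One plausible reading of distance_to_decelerate and distance_to_stop is a piecewise speed profile. The sketch below is an assumption for illustration only and is not taken from the controller's source: target_speed, max_speed and the linear ramp are hypothetical.

```python
def target_speed(distance, max_speed,
                 distance_to_decelerate=50.0, distance_to_stop=1.0):
    """Hypothetical speed profile as a function of distance to goal (meters)."""
    if distance <= distance_to_stop:
        # Close enough to the goal: come to a stop.
        return 0.0
    if distance >= distance_to_decelerate:
        # Far from the goal: keep full speed.
        return max_speed
    # In between: ramp the speed down linearly (an assumed shape).
    frac = (distance - distance_to_stop) / (distance_to_decelerate - distance_to_stop)
    return max_speed * frac
```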

training: bool#

alf.algorithms.hypernetwork_algorithm#

HyperNetwork algorithm.

class HyperNetwork(data_creator=None, data_creator_outlier=None, input_tensor_spec=None, output_dim=None, conv_layer_params=None, fc_layer_params=None, activation=<built-in method relu_ of type object>, last_activation=<function identity>, last_use_bias=True, last_use_ln=False, noise_dim=32, hidden_layers=(64, 64), use_conv_bias=False, use_conv_ln=False, use_fc_bias=True, use_fc_ln=False, generator_use_fc_bn=False, num_particles=10, entropy_regularization=1.0, critic_hidden_layers=(100, 100), critic_iter_num=2, critic_l2_weight=10.0, functional_gradient=False, init_lambda=1.0, lambda_trainable=False, block_inverse_mvp=False, direct_jac_inverse=False, inverse_mvp_solve_iters=1, inverse_mvp_hidden_size=100, inverse_mvp_hidden_layers=1, function_vi=False, function_bs=None, function_extra_bs_ratio=0.1, function_extra_bs_sampler='uniform', function_extra_bs_std=1.0, loss_type='classification', voting='soft', par_vi='svgd', num_train_classes=10, critic_optimizer=None, inverse_mvp_optimizer=None, optimizer=None, lambda_optimizer=None, logging_network=False, logging_training=False, logging_evaluate=False, config=None, name='HyperNetwork')[source]#

Bases: alf.algorithms.algorithm.Algorithm

HyperNetwork algorithm maintains a generator that generates a set of parameters for a predefined neural network from a random noise input. It is based on the following work:

https://github.com/neale/HyperGAN

Ratzlaff and Fuxin. “HyperGAN: A Generative Model for Diverse, Performant Neural Networks.” International Conference on Machine Learning. 2019.

Major differences versus the original paper are:

  • A single generator that generates parameters for all network layers.

  • Remove the mixer and the discriminator.

  • The generator may be trained with generative particle-based variational inference (ParVI) method. Please refer to generator.py for details.

Parameters
  • data_creator (Callable) – called as data_creator() to get a tuple of (train_dataloader, test_dataloader)

  • data_creator_outlier (Callable) – called as data_creator_outlier() to get a tuple of (outlier_train_dataloader, outlier_test_dataloader)

  • input_tensor_spec (nested TensorSpec) – the (nested) tensor spec of the input. If nested, then preprocessing_combiner must not be None. It must be provided if data_creator is not provided.

  • output_dim (int) – dimension of the output of the generated network. It must be provided if data_creator is not provided.

  • conv_layer_params (tuple[tuple]) – a tuple of tuples where each tuple takes a format (filters, kernel_size, strides, padding, pooling_kernel), where padding and pooling_kernel are optional.

  • fc_layer_params (tuple[tuple]) – a tuple of tuples where each tuple takes a format (FC layer sizes, use_bias), where use_bias is optional.

  • activation (nn.functional) – activation used for all the layers but the last layer.

  • last_activation (nn.functional) – activation function of the last layer.

  • last_use_bias (bool) – whether to use bias for the last layer

  • last_use_ln (bool) – whether to use layer normalization for the additional layer.

  • noise_dim (int) – dimension of noise

  • hidden_layers (tuple) – size of hidden layers.

  • use_conv_bias (bool) – whether to use bias for conv layers.

  • use_conv_ln (bool) – whether to use layer normalization for conv layers.

  • use_fc_bias (bool) – whether to use bias for fc layers.

  • use_fc_ln (bool) – whether to use layer normalization for fc layers.

  • generator_use_fc_bn (bool) – whether to use batch normalization for generator fc layers.

  • num_particles (int) – number of sampling particles

  • entropy_regularization (float) – weight for par_vi repulsive term. If None and data_creator is provided, will be set as the ratio between the batch_size and the total size of the trainset.

  • critic_optimizer (torch.optim.Optimizer) – the optimizer for training critic.

  • critic_hidden_layers (tuple) – sizes of critic hidden layers.

  • critic_iter_num (int) – number of minmax optimization iterations to train critic

  • critic_l2_weight (float) – L2 penalty on critic to ensure boundedness

  • functional_gradient (bool) – whether or not to use GPVI.

  • log_lambda (float) – logarithm of the weight on “extra” dimensions when forcing full rank Jacobian

  • block_inverse_mvp (bool) – whether to use the more efficient block form for inverse_mvp when functional_gradient is True. This option only makes sense when noise_dim < output_dim.

  • inverse_mvp_solve_iters (int) – number of iterations to train inverse_mvp network each training iteration of generator.

  • inverse_mvp_hidden_size (int) – width of hidden layers of inverse_mvp network.

  • inverse_mvp_hidden_layers (int) – number of hidden layers in inverse_mvp network.

  • function_vi (bool) – whether to use function value based par_vi, currently supported by [svgd2, svgd3, gfsf].

  • function_bs (int) – mini batch size for par_vi training. Needed for critic initialization when function_vi is True.

  • function_extra_bs_ratio (float) – ratio of extra sampled batch size w.r.t. the function_bs.

  • function_extra_bs_sampler (str) – type of sampling method for extra training batch, types are [uniform, normal].

  • function_extra_bs_std (float) – std of the normal distribution for sampling extra training batch when using normal sampler.

  • loss_type (str) – loglikelihood type for the generated functions, types are [classification, regression]

  • voting (str) – types of voting results from sampled functions, types are [soft, hard]

  • par_vi (str) –

    types of particle-based methods for variational inference, types are [svgd, svgd2, svgd3, gfsf, minmax],

    • svgd: same as svgd3.

    • svgd2: empirical expectation of SVGD is evaluated by splitting half of the sampled batch. It is a trade-off between computational efficiency and convergence speed.

    • svgd3: empirical expectation of SVGD is evaluated by resampled particles of the same batch size. It has better convergence but involves resampling, so it is computationally less efficient than svgd2.

    • gfsf: Wasserstein gradient flow with smoothed functions. It involves a kernel matrix inversion, so it is computationally the most expensive, but in some cases the convergence seems faster than with the svgd approaches.

    • minmax: Fisher Neural Sampler, optimal descent direction of the Stein discrepancy is solved by an inner optimization procedure in the space of L2 neural networks.

  • num_train_classes (int) – number of classes in training set.

  • critic_optimizer – The optimizer for training critic network

  • optimizer (torch.optim.Optimizer) – The optimizer for training generator.

  • logging_network (bool) – whether to log the architectures of networks.

  • logging_training (bool) – whether to log loss and acc during training.

  • logging_evaluate (bool) – whether to log loss and acc of evaluate.

  • config (TrainerConfig) – configuration for training

  • name (str) –
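
The difference between the soft and hard options of voting can be illustrated with the usual ensemble definitions: soft voting averages the class probabilities of the sampled functions, hard voting takes a majority over their individual predictions. This is a sketch of those common definitions, assumed rather than taken from the implementation; soft_vote and hard_vote are hypothetical names.

```python
from collections import Counter


def soft_vote(probs):
    """probs: class probabilities per particle, shape [particles][classes]."""
    num_classes = len(probs[0])
    mean = [sum(p[c] for p in probs) / len(probs) for c in range(num_classes)]
    return max(range(num_classes), key=lambda c: mean[c])


def hard_vote(probs):
    # Each particle votes for its own most likely class.
    votes = [max(range(len(p)), key=lambda c: p[c]) for p in probs]
    return Counter(votes).most_common(1)[0][0]


# One confident particle for class 0, two mildly leaning towards class 1.
probs = [[0.9, 0.1], [0.4, 0.6], [0.45, 0.55]]
soft = soft_vote(probs)  # averaged probabilities favor class 0
hard = hard_vote(probs)  # two of three particles vote for class 1
```

The example is chosen so that the two schemes disagree, which shows that soft voting takes each particle's confidence into account while hard voting does not.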

eval_uncertainty(num_particles=None)[source]#

Function to evaluate the epistemic uncertainty of a sampled ensemble. This method computes the following metrics:

  • AUROC (AUC): AUC is computed with respect to the entropy in the averaged softmax probabilities, as well as the sum of the variance of the softmax probabilities over the ensemble.

Parameters

num_particles (int) – number of sampled particles. If None, then self.num_particles is used.
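
The two uncertainty scores named above, the entropy of the averaged softmax probabilities and the summed variance of the softmax probabilities over the ensemble, can be sketched in plain Python for a single input. The function names are hypothetical, the variance is the population variance (divided by the number of particles), and the AUROC computation itself is omitted.

```python
import math


def mean_probs(probs):
    n = len(probs)
    return [sum(p[c] for p in probs) / n for c in range(len(probs[0]))]


def predictive_entropy(probs):
    """Entropy of the class probabilities averaged over the ensemble."""
    return -sum(m * math.log(m) for m in mean_probs(probs) if m > 0.0)


def total_variance(probs):
    """Sum over classes of the variance of the probabilities across particles."""
    mean = mean_probs(probs)
    n = len(probs)
    return sum(
        sum((p[c] - mean[c]) ** 2 for p in probs) / n
        for c in range(len(mean)))


agree = [[0.9, 0.1], [0.9, 0.1]]     # particles agree: low uncertainty
disagree = [[0.9, 0.1], [0.1, 0.9]]  # particles disagree: high uncertainty
```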

evaluate(num_particles=None)[source]#

Evaluate on a randomly drawn ensemble.

Parameters

num_particles (int) – number of sampled particles. If None, then self.num_particles is used.

property num_particles#

number of sampled particles.

predict_step(inputs, params=None, num_particles=None, state=None)[source]#

Predict ensemble outputs for inputs using the hypernetwork model.

Parameters
  • inputs (Tensor) – inputs to the ensemble of networks.

  • params (Tensor) – parameters of the ensemble of networks, if None, will resample.

  • num_particles (int) – size of sampled ensemble. Default is None.

  • state (None) – not used.

Returns

  • output (Tensor): shape is

    [batch_size, self._param_net._output_spec.shape[0]]

  • state (None): not used

Return type

AlgStep

sample_parameters(noise=None, num_particles=None, training=True)[source]#

Sample parameters for an ensemble of networks.

Parameters
  • noise (Tensor) – input noise to self._generator. Default is None.

  • num_particles (int) – number of sampled particles. Default is None. If both noise and num_particles are None, num_particles provided to the constructor will be used as batch_size for self._generator.

  • training (bool) – whether or not training self._generator

Returns

AlgStep.output from predict_step of self._generator

set_data_loader(train_loader, test_loader=None, outlier_data_loaders=None, entropy_regularization=None)[source]#

Set data loader for training and testing.

Parameters
  • train_loader (torch.utils.data.DataLoader) – training data loader

  • test_loader (torch.utils.data.DataLoader) – testing data loader

  • outlier_data_loaders (tuple[torch.utils.data.DataLoader]) – (trainloader, testloader) for outlier datasets

  • entropy_regularization (float) – weight for par_vi repulsive term. If None, then self._entropy_regarization is used.

set_num_particles(num_particles)[source]#

Set the number of particles to sample through one forward pass of the hypernetwork.

summarize_train(loss_info, params, cum_loss=None, avg_acc=None, inverse_mvp_loss=None)[source]#

Generate summaries for training & loss info after each gradient update. The default implementation of this function only summarizes params (with grads) and the loss. An algorithm can override this for additional summaries. See RLAlgorithm.summarize_train() for an example.

Parameters
  • experience (nested Tensor) – samples used for the most recent update_with_gradient(). By default it’s not summarized.

  • train_info (nested Tensor) – AlgStep.info returned by either rollout_step() (on-policy training) or train_step() (off-policy training). By default it’s not summarized.

  • loss_info (LossInfo) – loss.

  • params (list[Parameter]) – list of parameters with gradients.

  • cum_loss (float) – cumulative training loss of epoch.

  • avg_acc (float) – average accuracy across batches in epoch.

  • inverse_mvp_loss (float) – cumulative training loss of InverseMVPNet

train_iter(num_particles=None, state=None)[source]#

Perform one epoch (iteration) of training.

Parameters
  • num_particles (int) – number of sampled particles. Default is None.

  • state (None) – not used

Returns

mini_batch number

train_step(inputs, num_particles=None, entropy_regularization=None, state=None)[source]#

Perform one batch of training computation.

Parameters
  • inputs (nested Tensor) – input training data.

  • num_particles (int) – number of sampled particles. Default is None, in which case self._num_particles will be used for batch_size of self._generator.

  • entropy_regularization (float) – weight for par_vi repulsive term. If None, then self._entropy_regarization is used.

  • state (None) – not used

Returns

train_step of self._generator

training: bool#

alf.algorithms.icm_algorithm#

class ICMAlgorithm(action_spec, observation_spec=None, hidden_size=256, reward_adapt_speed=8.0, encoding_net=None, forward_net=None, inverse_net=None, activation=<built-in method relu_ of type object>, optimizer=None, name='ICMAlgorithm')[source]#

Bases: alf.algorithms.algorithm.Algorithm

Intrinsic Curiosity Module

This module generates the intrinsic reward based on the prediction error of the observation.

See Pathak et al “Curiosity-driven Exploration by Self-supervised Prediction”

Create an ICMAlgorithm.

Parameters
  • action_spec (nested TensorSpec) – agent’s action spec

  • observation_spec (nested TensorSpec) – agent’s observation spec. If not None, then a normalizer will be used to normalize the observation.

  • hidden_size (int or tuple[int]) – size of hidden layer(s)

  • reward_adapt_speed (float) – how fast to adapt the reward normalizer. Roughly speaking, the statistics for the normalization are calculated mostly based on the most recent T/speed samples, where T is the total number of samples.

  • encoding_net (Network) – network for encoding observation into a latent feature. Its input is same as the input of this algorithm.

  • forward_net (Network) – network for predicting next feature based on previous feature and action. It should accept input with spec [feature_spec, encoded_action_spec] and output a tensor of shape feature_spec. For discrete action, encoded_action is a one-hot representation of the action. For continuous action, encoded action is same as the original action.

  • inverse_net (Network) – network for predicting previous action given the previous feature and current feature. It should accept input with spec [feature_spec, feature_spec] and output tensor of shape (num_actions,).

  • activation (torch.nn.functional) – activation used for constructing any of the forward net and inverse net, if not provided.

  • optimizer (torch.optim.Optimizer) – The optimizer for training

  • name (str) –
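
The idea of an intrinsic reward based on prediction error can be sketched as the forward-model error used in Pathak et al.: half the squared distance between the predicted and the actual next feature. This is an illustration of that formula only; forward_prediction_error is a hypothetical name, and any scaling or reward normalization applied by the algorithm is not shown.

```python
def forward_prediction_error(predicted_feature, next_feature):
    """0.5 * squared L2 distance between predicted and actual next feature."""
    return 0.5 * sum(
        (p - t) ** 2 for p, t in zip(predicted_feature, next_feature))


# A poorly predicted transition yields a larger intrinsic reward.
surprising = forward_prediction_error([1.0, 2.0], [1.0, 0.0])
familiar = forward_prediction_error([1.0, 2.0], [1.0, 2.0])
```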

calc_loss(info)[source]#

Calculate the loss at each step for each sample.

Parameters

info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).

Returns

loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).

Return type

LossInfo

predict_step(inputs, state)[source]#

Predict for one step of inputs.

Parameters
  • inputs (nested Tensor) – inputs for prediction.

  • state (nested Tensor) – network state (for RNN).

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match predict_state_spec.

  • info (nest): information for analyzing the agent. In particular, if an element of the info is alf.summary.render.Image, it will be rendered during play. See alf/summary/render.py for detail.

Return type

AlgStep

rollout_step(inputs, state)[source]#

Rollout for one step of inputs.

It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for calc_loss(). For off-policy training, it needs to generate necessary information for train_step().

Parameters
  • inputs (nested Tensor) – inputs for prediction.

  • state (nested Tensor) – network state (for RNN).

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match rollout_state_spec.

  • info (nested Tensor): For on-policy training it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored in the replay buffer and retrieved for train_step() as rollout_info.

Return type

AlgStep

train_step(inputs, state, rollout_info=None)[source]#

Perform one step of training computation.

It is called to calculate output for every time step for a batch of experience from replay buffer. It also needs to generate necessary information for calc_loss().

Parameters
  • inputs (nested Tensor) – inputs for train.

  • state (nested Tensor) – consistent with train_state_spec.

  • rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match train_state_spec.

  • info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.

Return type

AlgStep

training: bool#
class ICMInfo(step_type, forward_loss, inverse_loss)#

Bases: tuple

Create new instance of ICMInfo(step_type, forward_loss, inverse_loss)

forward_loss#

Alias for field number 1

inverse_loss#

Alias for field number 2

step_type#

Alias for field number 0

alf.algorithms.iql_algorithm#

Implicit Q-Learning Algorithm.

class IqlActionState(actor_network, critic)#

Bases: tuple

Create new instance of IqlActionState(actor_network, critic)

actor_network#

Alias for field number 0

critic#

Alias for field number 1

class IqlActorInfo(actor_loss)#

Bases: tuple

Create new instance of IqlActorInfo(actor_loss,)

actor_loss#

Alias for field number 0

class IqlAlgorithm(observation_spec, action_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), actor_network_cls=<class 'alf.networks.actor_distribution_networks.ActorDistributionNetwork'>, critic_network_cls=<class 'alf.networks.critic_networks.CriticNetwork'>, v_network_cls=<class 'alf.networks.value_networks.ValueNetwork'>, reward_weights=None, epsilon_greedy=None, calculate_priority=False, num_critic_replicas=2, env=None, config=None, critic_loss_ctor=None, target_update_tau=0.05, target_update_period=1, temperature=1.0, actor_optimizer=None, critic_optimizer=None, value_optimizer=None, expectile=0.8, max_exp_advantage=100, checkpoint=None, debug_summaries=False, name='IqlAlgorithm')[source]#

Bases: alf.algorithms.off_policy_algorithm.OffPolicyAlgorithm

Implicit q-learning algorithm (IQL).

IQL is an offline reinforcement learning method. The idea is that instead of constraining the critic network or policy to avoid the value function extrapolation issue, IQL conducts learning using only in-sample data, thus avoiding the issues that arise when querying the critic network with out-of-distribution actions, a problem commonly faced in offline RL.

Reference:

Kostrikov, et al. "Offline Reinforcement Learning with Implicit Q-Learning",
arXiv:2110.06169
Parameters
  • observation_spec (nested TensorSpec) – representing the observations.

  • action_spec (BoundedTensorSpec) – representing the actions. Only continuous action is supported currently.

  • reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).

  • actor_network_cls (Callable) – is used to construct the actor network. The constructed actor network will be called to sample continuous actions. All of its output specs must be continuous. Discrete actor network is not supported.

  • critic_network_cls (Callable) – is used to construct critic network.

  • v_network_cls (Callable) – is used to construct a value network for estimating the expectile of q values.

  • reward_weights (None|list[float]) – this is only used when the reward is multidimensional. In that case, the weighted sum of the q values is used for training the actor if reward_weights is not None. Otherwise, the sum of the q values is used.

  • epsilon_greedy (float) – a floating value in [0,1], representing the chance of action sampling instead of taking argmax. This can help prevent a dead loop in some deterministic environment like Breakout. Only used for evaluation. If None, its value is taken from config.epsilon_greedy and then alf.get_config_value(TrainerConfig.epsilon_greedy).

  • calculate_priority (bool) – whether to calculate priority. This is only useful if priority replay is enabled.

  • num_critic_replicas (int) – number of critics to be used. Default is 2. This is only applied for critic networks. The value network is not replicated.

  • env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. env only needs to be provided to the root algorithm.

  • config (TrainerConfig) – config for training. It only needs to be provided to the algorithm which performs train_iter() by itself.

  • critic_loss_ctor (None|OneStepTDLoss|MultiStepLoss) – a critic loss constructor. If None, a default OneStepTDLoss will be used.

  • target_update_tau (float) – Factor for soft update of the target networks.

  • target_update_period (int) – Period for soft update of the target networks.

  • temperature (float) – the hyper-parameter for scaling the advantages. It corresponds to 1/beta in Eqn.(7) of the paper.

  • actor_optimizer (torch.optim.optimizer) – The optimizer for actor.

  • critic_optimizer (torch.optim.optimizer) – The optimizer for critic.

  • value_optimizer (torch.optim.optimizer) – The optimizer for value network.

  • expectile (float) – the expectile value for value learning.

  • max_exp_advantage (float) – clamp the exponentiated advantages with this value before being applied to weight the actor loss.

  • checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.

  • debug_summaries (bool) – True if debug summaries should be created.

  • name (str) – The name of this algorithm.
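The roles of expectile, temperature and max_exp_advantage can be illustrated with a minimal scalar sketch. It follows the expectile regression loss and the advantage-weighted actor weight as described in the referenced paper, not ALF's actual implementation; the function names expectile_loss and actor_weight are hypothetical.

```python
import math


def expectile_loss(diff, expectile=0.8):
    """Asymmetric squared error |expectile - 1(diff < 0)| * diff**2.

    ``diff`` stands for q - v on one in-sample (state, action) pair.
    """
    weight = expectile if diff > 0 else 1.0 - expectile
    return weight * diff * diff


def actor_weight(q, v, temperature=1.0, max_exp_advantage=100.0):
    """exp((q - v) / temperature), clamped from above by max_exp_advantage.

    The exponent is clamped before exponentiation, which gives the same
    result as clamping afterwards but avoids floating-point overflow.
    """
    exponent = min((q - v) / temperature, math.log(max_exp_advantage))
    return math.exp(exponent)
```

With expectile above 0.5, positive errors (q above v) are penalized more than negative ones, which pushes v towards an upper expectile of the in-sample q values.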

after_update(root_inputs, info)[source]#

Do things after completing one gradient update (i.e. update_with_gradient()). This function can be used for post-processing following one minibatch update, such as copying a training model to a target model in SAC, DQN, etc.

Parameters
  • root_inputs (nest) – temporally batched inputs for the rollout_step() of the root algorithm collected during unroll().

  • info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() for on-policy training or train_step() for off-policy training.
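A typical post-update step for algorithms with target networks is a periodic soft update. The sketch below assumes the common Polyak-averaging convention for target_update_tau and target_update_period; the helper names and the use of plain float lists in place of network parameters are illustrative only, and ALF's own updater may differ in detail.

```python
def soft_update(target_params, source_params, tau=0.05):
    """Move each target parameter a fraction ``tau`` towards its source."""
    return [(1.0 - tau) * t + tau * s
            for t, s in zip(target_params, source_params)]


def maybe_update(step, target_params, source_params, tau=0.05, period=1):
    """Apply the soft update only every ``period`` gradient updates."""
    if step % period == 0:
        return soft_update(target_params, source_params, tau)
    return target_params
```

With tau=1.0 the update degenerates to a hard copy of the source parameters.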

calc_loss(info)[source]#

Calculate the loss at each step for each sample.

Parameters

info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).

Returns

loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).

Return type

LossInfo

predict_step(inputs, state)[source]#

Predict for one step of observation.

This is only used for evaluation, so it only needs to perform the computations for generating the action distribution.

Parameters
  • inputs (TimeStep) – Current observation and other inputs for computing action.

  • state (nested Tensor) – should be consistent with predict_state_spec

Returns

  • output (nested Tensor): should be consistent with action_spec.

  • state (nested Tensor): should be consistent with predict_state_spec.

Return type

AlgStep

rollout_step(inputs, state)[source]#

rollout_step() basically predicts actions in the same way as predict_step(). Additionally, if states are to be stored in a replay buffer, this function also calls _critic_networks and _target_critic_networks to maintain their states.

train_step(inputs, state, rollout_info)[source]#

Perform one step of training computation.

It is called to calculate output for every time step for a batch of experience from replay buffer. It also needs to generate necessary information for calc_loss().

Parameters
  • inputs (nested Tensor) – inputs for train.

  • state (nested Tensor) – consistent with train_state_spec.

  • rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match train_state_spec.

  • info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate the loss or override update_with_gradient() to do customized training.

Return type

AlgStep

training: bool#
class IqlCriticInfo(critics, target_value, value)#

Bases: tuple

Create new instance of IqlCriticInfo(critics, target_value, value)

critics#

Alias for field number 0

target_value#

Alias for field number 1

value#

Alias for field number 2

class IqlCriticState(critics, target_critics)#

Bases: tuple

Create new instance of IqlCriticState(critics, target_critics)

critics#

Alias for field number 0

target_critics#

Alias for field number 1

class IqlInfo(reward, step_type, discount, action, action_distribution, actor, critic)#

Bases: tuple

Create new instance of IqlInfo(reward, step_type, discount, action, action_distribution, actor, critic)

action#

Alias for field number 3

action_distribution#

Alias for field number 4

actor#

Alias for field number 5

critic#

Alias for field number 6

discount#

Alias for field number 2

reward#

Alias for field number 0

step_type#

Alias for field number 1

class IqlLossInfo(actor, critic)#

Bases: tuple

Create new instance of IqlLossInfo(actor, critic)

actor#

Alias for field number 0

critic#

Alias for field number 1
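The "Alias for field number N" entries mean that each field is reachable both by name and by position. A small sketch using the standard-library collections.namedtuple as a stand-in (ALF may define these types through its own helper, which is not shown in this section):

```python
from collections import namedtuple

# Stand-in with the same field order as documented above.
IqlLossInfo = namedtuple("IqlLossInfo", ["actor", "critic"])

info = IqlLossInfo(actor=0.1, critic=0.2)

# Field 0 is ``actor`` and field 1 is ``critic``.
actor_loss, critic_loss = info

# Tuples are immutable; _replace returns a modified copy.
updated = info._replace(critic=0.5)
```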

class IqlState(action, actor, critic)#

Bases: tuple

Create new instance of IqlState(action, actor, critic)

action#

Alias for field number 0

actor#

Alias for field number 1

critic#

Alias for field number 2

alf.algorithms.lagrangian_reward_weight_algorithm#

LagrangianRewardWeightAlgorithm.

class LagInfo(rollout_reward)#

Bases: tuple

Create new instance of LagInfo(rollout_reward,)

rollout_reward#

Alias for field number 0

class LagrangianPredRewardWeightAlgorithm(reward_spec, reward_thresholds, optimizer, init_weights=1.0, max_weight=None, reward_weight_normalization=True, pred_rewards_averager_ctor=functools.partial(<class 'alf.utils.averager.EMAverager'>, update_rate=0.0001), debug_summaries=False, name='LagrangianPredRewardWeightAlgorithm')[source]#

Bases: alf.algorithms.lagrangian_reward_weight_algorithm.LagrangianRewardWeightAlgorithm

Similar to LagrangianRewardWeightAlgorithm, except that the rewards compared with the thresholds are collected by prediction steps instead of by rollout steps. For harsh target constraints, it is important to remove the rollout stochasticity; otherwise the agent’s constraint satisfaction ability will usually be underestimated.

Because prediction output is not directly passed to training, in order to use the rewards from prediction to train the weights, here we use an Averager to maintain the reward statistics. Inside every after_train_iter we perform a gradient step by querying the current averager value.

Note

This algorithm asserts TrainerConfig.evaluate=True.

Parameters
  • reward_spec (TensorSpec) – a rank-1 tensor spec representing multi-dim rewards.

  • reward_thresholds (list[float|None]) – a list of floating numbers, each representing a desired minimum reward threshold in expectation. If any entry is None, then the corresponding reward weight won’t be tuned; either its init value or its normalized init value (if reward_weight_normalization=True) will be used.

  • optimizer (optimizer) – optimizer for learning the reward weights.

  • init_weights (float|list[float]) – the initial reward weights.

  • max_weight (float) – the reward weights will be clipped up to this value

  • reward_weight_normalization (bool) – whether to project the weights onto a simplex (sum-to-one normalization)

  • pred_rewards_averager_ctor (Callable) – callable for creating an averager to maintain a moving average of prediction rewards. If None, EMAverager with an update rate of 1e-4 will be used.

  • debug_summaries (bool) –

  • name (str) –
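The moving average of prediction rewards can be pictured with a plain exponential moving average. This is a simplified sketch, not ALF's EMAverager (which may, for example, apply bias correction); the class name SimpleEMA is hypothetical.

```python
class SimpleEMA:
    """Exponential moving average with a fixed update rate."""

    def __init__(self, update_rate=1e-4):
        self.update_rate = update_rate
        self.value = 0.0
        self.initialized = False

    def update(self, x):
        if not self.initialized:
            # Seed the average with the first observation.
            self.value = x
            self.initialized = True
        else:
            # Move the average a small step towards the new sample.
            self.value += self.update_rate * (x - self.value)
        return self.value
```

A small update rate such as 1e-4 makes the average change slowly, so the value queried at each training iteration reflects many prediction steps rather than a single noisy reward.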

predict_step(inputs, state=None)[source]#

Predict for one step of inputs.

Parameters
  • inputs (nested Tensor) – inputs for prediction.

  • state (nested Tensor) – network state (for RNN).

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match predict_state_spec.

  • info (nest): information for analyzing the agent. In particular, if an element of the info is alf.summary.render.Image, it will be rendered during play. See alf/summary/render.py for detail.

Return type

AlgStep

training: bool#
class LagrangianRewardWeightAlgorithm(reward_spec, reward_thresholds, optimizer, init_weights=1.0, max_weight=None, reward_weight_normalization=True, lambda_transform=<built-in function softplus>, debug_summaries=False, name='LagrangianRewardWeightAlgorithm')[source]#

Bases: alf.algorithms.algorithm.Algorithm

An algorithm that adjusts reward weights according to untransformed rollout rewards. The adjustment is expected to be performed after every training iteration.

Generally speaking, for each reward dimension, the algorithm compares an individual reward per step to an average expected threshold, and if the reward is greater than the threshold (requirement satisfied) then it decreases the reward weight; otherwise it increases the weight.

Note

This algorithm doesn’t put a constraint on per-step basis since it only learns a single, state-independent weight for each reward dim. Also, a reward is always assumed to be the higher the better.

Parameters
  • reward_spec (TensorSpec) – a rank-1 tensor spec representing multi-dim rewards.

  • reward_thresholds (list[float|None]) – a list of floating numbers, each representing a desired minimum reward threshold in expectation. If any entry is None, then the corresponding reward weight won’t be tuned; either its init value or its normalized init value (if reward_weight_normalization=True) will be used.

  • optimizer (optimizer) – optimizer for learning the reward weights.

  • init_weights (float|list[float]) – the initial reward weights.

  • max_weight (float) – the reward weights will be clipped up to this value

  • reward_weight_normalization (bool) – whether to project the weights onto a simplex (sum-to-one normalization)

  • lambda_transform (Callable) – the transform function to make sure all lambdas (reward weights) are positive. Currently only support F.softplus and torch.exp.

  • debug_summaries (bool) –

  • name (str) –
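The adjustment rule described above can be sketched with scalars. This is an illustrative simplification: ALF takes the step with the supplied optimizer on its own loss, which this section does not show. The plain gradient-style step, the learning rate lr, the order of clipping and normalization, and all function names here are assumptions.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x)); always positive."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def update_raw_lambdas(raw, rewards, thresholds, lr=0.1):
    """Lower a raw lambda when its reward exceeds the threshold, else raise it."""
    new_raw = []
    for r, reward, threshold in zip(raw, rewards, thresholds):
        if threshold is None:
            new_raw.append(r)  # this weight is not tuned
        else:
            new_raw.append(r - lr * (reward - threshold))
    return new_raw


def reward_weights(raw, max_weight=None, normalize=True):
    """Transform raw lambdas to positive weights, clip, then normalize."""
    weights = [softplus(r) for r in raw]
    if max_weight is not None:
        weights = [min(w, max_weight) for w in weights]
    if normalize:
        total = sum(weights)
        weights = [w / total for w in weights]
    return weights
```

In this sketch, a reward dimension that already meets its threshold gradually loses weight relative to dimensions that do not.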

after_train_iter(root_inputs, train_info)[source]#

Perform one gradient step of updating lambdas.

predict_step(inputs, state=None)[source]#

Predict for one step of inputs.

Parameters
  • inputs (nested Tensor) – inputs for prediction.

  • state (nested Tensor) – network state (for RNN).

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match predict_state_spec.

  • info (nest): information for analyzing the agent. In particular, if an element of the info is alf.summary.render.Image, it will be rendered during play. See alf/summary/render.py for detail.

Return type

AlgStep

property reward_weights#

Return the detached reward weights. These weights are expected not to be changed by external code.

rollout_step(inputs, state=None)[source]#

Rollout for one step of inputs.

It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for calc_loss(). For off-policy training, it needs to generate necessary information for train_step().

Parameters
  • inputs (nested Tensor) – inputs for prediction.

  • state (nested Tensor) – network state (for RNN).

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match rollout_state_spec.

  • info (nested Tensor): For on-policy training, it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored into the replay buffer and retrieved for train_step() as rollout_info.

Return type

AlgStep

training: bool#

alf.algorithms.mbrl_algorithm#

Model-based RL Algorithm.

class LatentMbrlAlgorithm(observation_spec, action_spec, planner_module_ctor, reward_spec=TensorSpec(shape=(), dtype=torch.float32), env=None, config=None, planner_optimizer=None, debug_summaries=False, name='LatentMbrlAlgorithm')[source]#

Bases: alf.algorithms.mbrl_algorithm.MbrlAlgorithm

Model-based RL algorithm in a latent space.

Create a LatentMbrlAlgorithm. The LatentMbrlAlgorithm takes as input a planner module for making decisions on actions based on the latent representation of the current observation as well as a latent dynamics model.

The latent representation and the latent dynamics are provided by a latent predictive representation module, which is an instance of PredictiveRepresentationLearner. It is set through the set_latent_predictive_representation_module() function. The latent predictive representation module should have a function predict_multi_step for performing multi-step imagined rollout. Currently it is assumed that the latent representation module is trained outside of the LatentMbrlAlgorithm, although the LatentMbrlAlgorithm can also contribute to its training by using the latent representation in loss calculation.

Parameters
  • observation_spec (nested TensorSpec) – representing the observations.

  • action_spec (BoundedTensorSpec) – representing the actions.

  • planner_module_ctor (Callable[[Any, Any], PlanAlgorithm]) – used to construct the module for generating the planned action based on the specified reward function and dynamics function

  • reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).

  • env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. env only needs to be provided to the root Algorithm.

  • config (TrainerConfig) – config for training. config only needs to be provided to the algorithm which performs train_iter() by itself.

  • debug_summaries (bool) – True if debug summaries should be created.

  • name (str) – The name of this algorithm.

calc_loss(training_info)[source]#

Calculate the loss at each step for each sample.

Parameters

training_info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).

Returns

loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).

Return type

LossInfo

set_latent_predictive_representation_module(latent_pred_rep_module)[source]#
train_step(exp, state, rollout_info=None)[source]#

Perform one step of training computation.

It is called to calculate output for every time step for a batch of experience from replay buffer. It also needs to generate necessary information for calc_loss().

Parameters
  • exp (nested Tensor) – inputs for train.

  • state (nested Tensor) – consistent with train_state_spec.

  • rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match train_state_spec.

  • info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate the loss or override update_with_gradient() to do customized training.

Return type

AlgStep

training: bool#
class MbrlAlgorithm(observation_spec, action_spec, reward_module, planner_module_ctor, feature_spec=None, dynamics_module_ctor=None, reward_spec=TensorSpec(shape=(), dtype=torch.float32), particles_per_replica=1, epsilon_greedy=None, env=None, config=None, dynamics_optimizer=None, reward_optimizer=None, planner_optimizer=None, checkpoint=None, debug_summaries=False, name='MbrlAlgorithm')[source]#

Bases: alf.algorithms.off_policy_algorithm.OffPolicyAlgorithm

Model-based RL algorithm

Create an MbrlAlgorithm. The MbrlAlgorithm takes as input the following set of modules for making decisions on actions based on the current observation: (1) a learnable/fixed dynamics module, (2) a learnable/fixed reward module, and (3) a learnable/fixed planner module.

Parameters
  • action_spec (BoundedTensorSpec) – representing the actions.

  • dynamics_module_ctor (Optional[Callable[[Any, Any], DynamicsLearningAlgorithm]]) – used to construct the module for learning to predict the next feature based on the previous feature and action. It should accept input with spec [feature_spec, encoded_action_spec] and output a tensor of shape feature_spec. For discrete actions, encoded_action is a one-hot representation of the action. For continuous actions, encoded_action is the same as the original action.

  • reward_module (RewardEstimationAlgorithm) – module for calculating the reward, i.e., evaluating the reward for a (s, a) pair

  • planner_module_ctor – used to construct the module for generating the planned action based on the specified reward function and dynamics function

  • reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).

  • particles_per_replica (int) – number of particles for each replica

  • epsilon_greedy (float) – a floating value in [0,1], representing the chance of action sampling instead of taking argmax. This can help prevent a dead loop in some deterministic environment like Breakout. Only used for evaluation. If None, its value is taken from config.epsilon_greedy and then alf.get_config_value(TrainerConfig.epsilon_greedy).

  • env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. env only needs to be provided to the root Algorithm.

  • config (TrainerConfig) – config for training. config only needs to be provided to the algorithm which performs train_iter() by itself.

  • checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.

  • debug_summaries (bool) – True if debug summaries should be created.

  • name (str) – The name of this algorithm.
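How the dynamics, reward and planner modules fit together can be shown with a toy example. Everything here is hypothetical and unrelated to ALF's classes: a 1-D state, a hand-written dynamics function, a reward that prefers states near zero, and an exhaustive planner over short action sequences.

```python
import itertools


def dynamics(state, action):
    """Toy dynamics module: the next state is the state plus the action."""
    return state + action


def reward(state, action):
    """Toy reward module: prefer next states close to zero."""
    next_state = dynamics(state, action)
    return -(next_state ** 2)


def plan(state, actions=(-1.0, 0.0, 1.0), horizon=2):
    """Toy planner: score every action sequence with the two modules above
    and return the first action of the best sequence."""
    best_return, best_first = float("-inf"), None
    for seq in itertools.product(actions, repeat=horizon):
        s, total = state, 0.0
        for a in seq:
            total += reward(s, a)
            s = dynamics(s, a)
        if total > best_return:
            best_return, best_first = total, seq[0]
    return best_first
```

Only the first planned action is returned, in the style of model-predictive control; the plan would be recomputed at the next step.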

after_update(root_inputs, training_info)[source]#

Do things after completing one gradient update (i.e. update_with_gradient()). This function can be used for post-processing following one minibatch update, such as copying a training model to a target model in SAC, DQN, etc.

Parameters
  • root_inputs (nest) – temporally batched inputs for the rollout_step() of the root algorithm collected during unroll().

  • training_info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() for on-policy training or train_step() for off-policy training.

calc_loss(training_info)[source]#

Calculate the loss at each step for each sample.

Parameters

training_info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).

Returns

loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).

Return type

LossInfo

predict_step(time_step, state)[source]#

Predict for one step of observation.

This is only used for evaluation, so it only needs to perform the computations for generating the action distribution.

Parameters
  • time_step (TimeStep) – Current observation and other inputs for computing action.

  • state (nested Tensor) – should be consistent with predict_state_spec

Returns

  • output (nested Tensor): should be consistent with action_spec.

  • state (nested Tensor): should be consistent with predict_state_spec.

Return type

AlgStep

rollout_step(time_step, state)[source]#

Rollout for one step of inputs.

It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for calc_loss(). For off-policy training, it needs to generate necessary information for train_step().

Parameters
  • inputs (nested Tensor) – inputs for prediction.

  • state (nested Tensor) – network state (for RNN).

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match rollout_state_spec.

  • info (nested Tensor): For on-policy training, it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored into the replay buffer and retrieved for train_step() as rollout_info.

Return type

AlgStep

train_step(inputs, state, rollout_info=None)[source]#

Perform one step of training computation.

It is called to calculate output for every time step for a batch of experience from replay buffer. It also needs to generate necessary information for calc_loss().

Parameters
  • inputs (nested Tensor) – inputs for train.

  • state (nested Tensor) – consistent with train_state_spec.

  • rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match train_state_spec.

  • info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate the loss or override update_with_gradient() to do customized training.

Return type

AlgStep

training: bool#
class MbrlInfo(dynamics, reward, planner)#

Bases: tuple

Create new instance of MbrlInfo(dynamics, reward, planner)

dynamics#

Alias for field number 0

planner#

Alias for field number 2

reward#

Alias for field number 1

class MbrlState(dynamics, reward, planner)#

Bases: tuple

Create new instance of MbrlState(dynamics, reward, planner)

dynamics#

Alias for field number 0

planner#

Alias for field number 2

reward#

Alias for field number 1

alf.algorithms.mcts_algorithm#

Monte-Carlo Tree Search.

class MCTSAlgorithm(observation_spec, action_spec, num_simulations, root_dirichlet_alpha, root_exploration_fraction, pb_c_init, pb_c_base, discount, is_two_player_game, visit_softmax_temperature_fn, model=None, keep_model_pred_state=False, predict_action_sampler=MultinomialSampler(), rollout_action_sampler=MultinomialSampler(), learn_policy_temperature=1.0, reward_spec=TensorSpec(shape=(), dtype=torch.float32), expand_all_children=False, expand_all_root_children=False, known_value_bounds=None, value_min_max_delta=1e-30, ucb_break_tie_eps=0.0, ucb_parent_visit_count_minus_one=False, unexpanded_value_score=0.5, act_with_exploration_policy=False, search_with_exploration_policy=False, learn_with_exploration_policy=False, exploration_policy_type='rkl', max_unroll_length=1000000, num_parallel_sims=1, checkpoint=None, debug_summaries=False, name='MCTSAlgorithm')[source]#

Bases: alf.algorithms.off_policy_algorithm.OffPolicyAlgorithm

Monte-Carlo Tree Search algorithm.

The code largely follows the pseudocode of Schrittwieser et al. Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. The pseudocode can be downloaded from https://arxiv.org/src/1911.08265v2/anc/pseudocode.py

There are several differences:

  1. In this implementation, all values and rewards are for player 0. In the pseudocode, the values and rewards appear to be for either player 0 or player 1, depending on whose turn it is. This makes reasoning about the logic of the code more difficult and error-prone, and there indeed seems to be a related bug in the pseudocode. More concretely, in the pseudocode, line 524 suggests that the value_sum is relative to a changing player; line 528 suggests that all the rewards along a path are relative to the same player; while line 499 combines the reward and value without considering the player.

  2. When calculating the UCB score, the pseudocode normalizes the value before adding it to the reward. We normalize after summing the reward and the value.

  3. When calculating the UCB score, if the visit count of the node is 0, the value component of the score is 0 in the pseudocode. We use 0.5 instead so that it is not always the lowest score (or highest for player 1) no matter what the outcomes of its siblings are.

  4. The pseudocode initializes the visit count of the root to 0. We initialize it to 1 instead so that the prior is not neglected in the first select_child(). This is consistent with how the visit_count of other nodes is initialized. When other nodes are expanded, the immediately subsequent backup() will set their initial visit_count to 1.

  5. We add a game_over field to ModelOutput to indicate the game is over so that we won’t keep expanding over that branch.

  6. We add support for using a stochastic policy instead of UCB to do the search/learn/act. This can be enabled by setting act_with_exploration_policy, search_with_exploration_policy, and learn_with_exploration_policy to True. See Grill et al. Monte-Carlo tree search as regularized policy optimization for reference.
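Differences 2 and 3 above can be made concrete with a single-player sketch of a pUCT-style score. It is modeled on the MuZero pseudocode rather than on ALF's code. The default values of pb_c_init and pb_c_base are taken from that pseudocode and are assumptions here, as are the function name and the min_q/max_q arguments, which stand for the value bounds tracked during search.

```python
import math


def ucb_score(parent_visits, child_visits, child_prior, child_reward,
              child_value, min_q, max_q, pb_c_init=1.25, pb_c_base=19652.0,
              discount=1.0, unexpanded_value_score=0.5,
              value_min_max_delta=1e-30):
    # Exploration term: grows with the prior and shrinks as the child
    # accumulates visits.
    pb_c = math.log((parent_visits + pb_c_base + 1) / pb_c_base) + pb_c_init
    pb_c *= math.sqrt(parent_visits) / (child_visits + 1)
    prior_score = pb_c * child_prior

    if child_visits == 0:
        # Difference 3: an unvisited child gets a fixed value score.
        value_score = unexpanded_value_score
    else:
        # Difference 2: normalize after summing reward and discounted value.
        q = child_reward + discount * child_value
        value_score = (q - min_q) / max(max_q - min_q, value_min_max_delta)
    return prior_score + value_score
```

The two-player sign handling from difference 1 is deliberately left out of this sketch.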

In addition to the original MuZero paper, we also implemented the methods described in the following two papers:

1. Grill et al. Monte-Carlo tree search as regularized policy optimization

It can be enabled by setting (act/learn/search)_with_exploration_policy

2. Hubert et al. Learning and Planning in Complex Action Spaces

It is enabled when SimpleMCTSModel.num_sampled_actions is set.

The time spent on tree search is directly related to how many times the tree is expanded. To make it faster, we also support expanding multiple leaves simultaneously. In order to do this, we maintain num_parallel_sims best children for each node in the tree and use them to construct k=num_parallel_sims paths. Note that the k best children may have duplicates, which is desired because we want to expand the most promising path more often. Depending on the value of search_with_exploration_policy, this process is slightly different:

  • search_with_exploration_policy=True. The k best children of each node are simply chosen by independently sampling the exploration policy k times. When constructing the search paths, the i-th search path is based on the i-th best child of each node.

  • search_with_exploration_policy=False. The best child is the same as in the case k=1. The second best child is found by assuming the visit counts of the best child and the parent are increased by 1 and applying the UCB criterion again. This is repeated k times to get k best children. Note that this is different from directly selecting the best k children based on the original UCB scores. The reason for not doing that is that if the highest score is much bigger than the second highest score, we want both paths to select the same child. During the process of traversing from the root to construct k search paths, if several (let’s say k’) paths are exactly the same so far, we will use the best k’ children of the last node of these k’ paths to extend the paths, so that the k’ children (which may contain duplicates) selected to extend these k’ paths are the most promising according to the UCB scores.
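The repeated-UCB selection for search_with_exploration_policy=False can be sketched for a single node as follows. The scoring formula is a simplified stand-in for the full UCB criterion, and the function name and the constant c are hypothetical.

```python
import math


def select_k_children(priors, values, visits, parent_visits, k, c=1.25):
    """Pick k children (duplicates allowed) by re-applying a UCB-style
    criterion, pretending each pick has already been visited once more."""
    visits = list(visits)  # work on a copy; the real counts are untouched
    chosen = []
    for _ in range(k):
        scores = [
            v + c * p * math.sqrt(parent_visits) / (n + 1)
            for p, v, n in zip(priors, values, visits)
        ]
        best = max(range(len(scores)), key=scores.__getitem__)
        chosen.append(best)
        visits[best] += 1     # assume the chosen child gets one more visit
        parent_visits += 1    # and so does the parent
    return chosen
```

When one child's score dominates, it is selected repeatedly, whereas taking the top k of the original scores would always return k distinct children.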

Parameters
  • observation_spec (nested TensorSpec) –

    if the observation is a dictionary, MCTSAlgorithm will use the following three fields if they are contained in the dictionary:

    1. valid_action_mask: a bool Tensor to indicate which actions are allowed. It will be used to mask out invalid actions. If not provided, all possible actions are considered.

    2. steps: int32 Tensor to indicate the number of steps since the beginning of the game. If not provided, an internal counter will be used. However, this internal count will not be correct if the algorithm is used to play against a human because it is not used to generate all the moves of both players.

    3. to_play: int8 Tensor whose elements are 0 or 1 to indicate who is the player to take the action. If not provided, steps % 2 will be used as to_play.

  • action_spec (nested BoundedTensorSpec) – representing the actions.

  • num_simulations (int) – the number of simulations per search (calls to model)

  • root_dirichlet_alpha (float) – alpha of dirichlet prior for exploration

  • root_exploration_fraction (float) – noise generated by the dirichlet distribution is combined with the action distribution from the model to be used as the action prior for the children of the root.

  • pb_c_init (float) – c1 of the pUCT rule in Appendix B, equation (2)

  • pb_c_base (float) – c2 of the pUCT rule in Appendix B, equation (2)

  • discount (float) – reward discount factor

  • is_two_player_game (bool) – whether this is a two player (zero-sum) game

  • model (Optional[MCTSModel]) – the model used by the algorithm. If not provided in the constructor, it should be specified using set_model before predict_step or rollout_step is used.

  • keep_model_pred_state (bool) – whether to keep ModelOutput.state.pred_state returned from model.initial_predict as part of the state of this algorithm. If so previous pred_state will be used to call initial_predict.

  • visit_softmax_temperature_fn (Callable) – function for calculating the softmax temperature for sampling action based on the visit_counts of the children of the root. \(P(a) \propto \exp(visit\_count/t)\). This function is called as visit_softmax_temperature_fn(steps), where steps is a vector representing the number of steps in the games. And it is expected to return a float vector of the same shape as steps.

  • predict_action_sampler – available choices include CategoricalSeedSampler, EpsilonGreedySampler, MultinomialSampler

  • rollout_action_sampler – available choices include CategoricalSeedSampler, EpsilonGreedySampler, MultinomialSampler

  • learn_policy_temperature (float) – transform the policy p found by MCTS by \(p^{1/learn\_policy\_temperature} / Z\) as policy target for model learning, where Z is a normalization factor so that the resulting probabilities sum to one.

  • reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).

  • expand_all_children (bool) – If True, when a new leaf is selected, immediately expand all its children. With this option, the visit count does not truly reflect the quality of a node. Hence it should be used with (act/learn)_with_exploration_policy=True

  • expand_all_root_children (bool) – whether to expand all root children before search. This is described in Appendix A of “Learning and Planning in Complex Action Spaces”. However, our implementation is different from the paper’s. The paper initializes Q(s, a) for root s for all the sampled actions, whereas we expand all sampled actions for root s. With this option, the visit count does not truly reflect the quality of a node. Hence it should be used with (act/learn)_with_exploration_policy=True.

  • known_value_bounds (tuple|None) – known bound of the values.

  • value_min_max_delta (float) – when normalizing the value using the min and max values, (max-min).clamp(min=value_min_max_delta) is used as the denominator.

  • ucb_break_tie_eps (float) – add a random number in the range of [0, ucb_break_tie_eps) to the UCB score to choose actions with close UCB score randomly. It is used only if at least one of act/search/learn_with_exploration_policy is False.

  • ucb_parent_visit_count_minus_one (bool) – This option effectively chooses the first child of a parent uniformly, which can increase exploration.

  • unexpanded_value_score (float|str) – The value score for an unexpanded child. If ‘max’/‘min’/‘mean’, will use the maximum/minimum/mean of the value scores of the expanded siblings. If ‘mean_with_parent’, will use the mean of the value scores of the expanded siblings and its parent (this is used in ELF OpenGo and EfficientZero). If ‘none’, when exploration policy is used, will keep the policy for the unexpanded children the same as the prior; when exploration policy is not used, ‘none’ behaves the same as ‘min’.

  • act_with_exploration_policy (bool) – If True, a policy calculated using reverse KL divergence will be used to generate actions.

  • search_with_exploration_policy (bool) – If True, a policy calculated using reverse KL divergence will be used for tree search.

  • learn_with_exploration_policy (bool) – If True, a policy calculated using reverse KL divergence will be used for learning.

  • exploration_policy_type (str) – Type of exploration policy. Must be one of (‘rkl’, ‘kl’)

  • max_unroll_length (int) – maximal allowed unroll steps when building the search tree. If expand_all_children is False, the maximal allowed tree depth will be max_unroll_length. Otherwise, the maximal allowed tree depth will be max_unroll_length-1

  • num_parallel_sims (int) – the number of leaves to expand at a time for one tree. num_simulations must be divisible by num_parallel_sims.

  • checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.

  • name (str) – the name of the algorithm.
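Two of the parameters above describe small formulas that an example may make concrete. The following is a minimal pure-Python sketch, not ALF's implementation: sharpen_policy and normalize_value are hypothetical helper names, plain lists and floats stand in for tensors, and the numerator (value - min) in normalize_value is an assumption, since the description above only specifies the denominator.

```python
def sharpen_policy(policy, learn_policy_temperature):
    """Return p^(1/T) / Z, where Z makes the result sum to one."""
    powered = [p ** (1.0 / learn_policy_temperature) for p in policy]
    z = sum(powered)
    return [x / z for x in powered]


def normalize_value(value, value_min, value_max, value_min_max_delta):
    """Min-max normalize a value, clamping the denominator from below.

    The clamp mirrors (max-min).clamp(min=value_min_max_delta); the
    numerator (value - value_min) is an assumption of this sketch.
    """
    denom = max(value_max - value_min, value_min_max_delta)
    return (value - value_min) / denom


# A temperature below 1 sharpens the search policy; 1 leaves it unchanged.
target = sharpen_policy([0.5, 0.3, 0.2], learn_policy_temperature=0.5)
```

The clamp keeps the normalization well defined early in a search, when the observed minimum and maximum values may coincide.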

property discount#
predict_step(time_step, state)[source]#

Predict for one step of observation.

This is only used for evaluation, so it only needs to perform the computations for generating the action distribution.

Parameters
  • time_step (TimeStep) – Current observation and other inputs for computing action.

  • state (nested Tensor) – should be consistent with predict_state_spec

Returns

  • output (nested Tensor): should be consistent with action_spec.

  • state (nested Tensor): should be consistent with predict_state_spec.

Return type

AlgStep

rollout_step(time_step, state)[source]#

Rollout for one step of inputs.

It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for calc_loss(). For off-policy training, it needs to generate necessary information for train_step().

Parameters
  • inputs (nested Tensor) – inputs for prediction.

  • state (nested Tensor) – network state (for RNN).

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match rollout_state_spec.

  • info (nested Tensor): For on-policy training it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored in the replay buffer and retrieved for train_step() as rollout_info.

Return type

AlgStep

set_model(model)[source]#

Set the model used by the algorithm.

training: bool#
class MCTSInfo(candidate_actions, value, candidate_action_policy)#

Bases: tuple

Create new instance of MCTSInfo(candidate_actions, value, candidate_action_policy)

candidate_action_policy#

Alias for field number 2

candidate_actions#

Alias for field number 0

value#

Alias for field number 1
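The "Alias for field number" entries reflect that MCTSInfo is a named tuple, so each field can be read either by name or by position. The sketch below builds a stand-in with the same field order using collections.namedtuple; MCTSInfoSketch is a hypothetical name (it is not imported from ALF) and the strings are placeholders for tensors.

```python
from collections import namedtuple

# Stand-in with the same field order as the documented MCTSInfo.
MCTSInfoSketch = namedtuple(
    "MCTSInfoSketch",
    ["candidate_actions", "value", "candidate_action_policy"])

info = MCTSInfoSketch(
    candidate_actions="actions",
    value="value",
    candidate_action_policy="policy")

# Field access by name and by index refer to the same element.
first_field = info[0]
same_field = info.candidate_actions
```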

class MCTSState(steps, pred_state, action_sampler_state, next_predicted_reward)#

Bases: tuple

Create new instance of MCTSState(steps, pred_state, action_sampler_state, next_predicted_reward)

action_sampler_state#

Alias for field number 2

next_predicted_reward#

Alias for field number 3

pred_state#

Alias for field number 1

steps#

Alias for field number 0

class VisitSoftmaxTemperatureByMoves(move_temperature_pairs=[(29, 1.0), (10000, 0.0001)])[source]#

Bases: object

Scheduling the temperature by move.

Parameters

move_temperature_pairs (list[tuple]) – each (moves, temperature) pair indicates using this temperature until so many moves have been played in the current game. The moves should be in ascending order. Note that num_moves used to calculate the temperature starts from 0.
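As an illustration of how such a schedule might be looked up, here is a minimal scalar sketch. It is not the class's implementation: temperature_by_moves is a hypothetical helper, the boundary is assumed to be inclusive (num_moves <= moves), and the last temperature is assumed to apply once every threshold has been passed.

```python
def temperature_by_moves(num_moves, move_temperature_pairs):
    """Return the temperature for a game that has played num_moves moves."""
    for moves, temperature in move_temperature_pairs:
        # Assumption: the pair applies while num_moves <= moves.
        if num_moves <= moves:
            return temperature
    # Assumption: beyond the last threshold, keep the last temperature.
    return move_temperature_pairs[-1][1]


# The default schedule from the signature above.
default_pairs = [(29, 1.0), (10000, 0.0001)]
opening_temperature = temperature_by_moves(0, default_pairs)
late_temperature = temperature_by_moves(30, default_pairs)
```

With the default pairs, early moves are sampled at temperature 1.0 and later moves become nearly greedy with respect to the visit counts.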

class VisitSoftmaxTemperatureByProgress(progress_temperature_pairs=[(0.5, 1.0), (0.75, 0.5), (1, 0.25)])[source]#

Bases: object

Scheduling the temperature by training progress.

Parameters

progress_temperature_pairs (list[tuple]) – each (progress, temperature) pair indicates using this temperature until this training progress. Note that progress should be in ascending order.

calculate_exploration_policy(value, prior, c, tol=1e-06)[source]#

Calculate exploration policy.

The policy is based on Grill et al. Monte-Carlo tree search as regularized policy optimization

Notation:

q: prior policy

p: sampling probability

v: value

The exploration policy is found by minimizing the following:

\[p = \arg\min_p \left[ -E_p(v) + c KL(q\|p) \right]\]

which leads to the following solution:

\[p_i = c\frac{q_i}{\alpha - v_i}\]

where \(\alpha \ge \max_i(v_i)\) is such that \(\sum_i p_i = 1\)

To make the solving numerically more stable and efficient, we reparameterize the problem to the following:

\[\begin{split}\begin{array}{ll} & v^* = \max_i v_i \\ & \alpha = v^* + c \beta \\ & u_i = \frac{v_i - v^*}{c} \\ & p_i = \frac{q_i}{\beta - u_i} \\ \end{array}\end{split}\]

With this reparametrization, we need to find \(\beta>0\) s.t.

\[\sum_i \frac{q_i}{\beta - u_i} = 1\]

We use Newton’s method to update \(\beta\) iteratively:

\[\beta \leftarrow \beta - \frac{f(\beta)}{f'(\beta)} = \beta + \frac{\sum_i \frac{q_i}{\beta - u_i} - 1}{\sum_i \frac{q_i}{(\beta - u_i)^2}}\]

where \(f(\beta) = \sum_i \frac{q_i}{\beta - u_i} - 1\) and \(f'(\beta)\) is the derivative of \(f(\beta)\). Since \(f(\beta)\) is convex, starting the iteration with a \(\beta\) s.t. \(f(\beta) > 0\) guarantees convergence. In practice, we find that about 10 iterations can reach a tolerance of 1e-6. Newton’s method is much faster than binary search.

Parameters
  • value (Tensor) – [N, K] Tensor

  • prior (Tensor) – [N, K] Tensor

  • c (Tensor) – [N, 1] Tensor

  • tol (float) – Desired accuracy. The result satisfies \(|\sum_i p_i - 1| \le tol\)

Returns

  • Tensor: [N, K], the exploration policy

  • int: the number of iterations

Return type

tuple
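The Newton iteration described above can be sketched for a single row in pure Python. This is an illustrative reimplementation under stated assumptions, not the library function: exploration_policy_row is a hypothetical name, lists of length K replace the [N, K] tensors, c is a positive float, the prior is assumed strictly positive and normalized, and the starting point \(\beta_0 = \max_i(u_i + q_i)\) and the max_iters cap are choices of this sketch.

```python
def exploration_policy_row(value, prior, c, tol=1e-6, max_iters=100):
    """Solve sum_i q_i / (beta - u_i) = 1 for beta by Newton's method.

    Returns (policy, iterations), with policy_i = q_i / (beta - u_i).
    """
    v_star = max(value)
    u = [(v - v_star) / c for v in value]  # u_i <= 0 and max_i u_i = 0
    # Starting point: for the maximizing index the term q_i / (beta - u_i)
    # equals 1 and all other terms are positive, so f(beta) >= 0.
    beta = max(u_i + q_i for u_i, q_i in zip(u, prior))
    iterations = 0
    while iterations < max_iters:
        total = sum(q_i / (beta - u_i) for u_i, q_i in zip(u, prior))
        if abs(total - 1.0) <= tol:
            break
        slope = sum(q_i / (beta - u_i) ** 2 for u_i, q_i in zip(u, prior))
        beta += (total - 1.0) / slope  # beta - f(beta) / f'(beta)
        iterations += 1
    policy = [q_i / (beta - u_i) for u_i, q_i in zip(u, prior)]
    return policy, iterations


policy, num_iterations = exploration_policy_row(
    value=[1.0, 0.5, 0.0], prior=[0.5, 0.3, 0.2], c=0.5)
```

Because \(f\) is convex and decreasing for \(\beta > 0\), each Newton step from a point with \(f(\beta) \ge 0\) stays on the same side of the root, so the denominators \(\beta - u_i\) remain positive throughout.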

calculate_kl_exploration_policy(value, prior, c)[source]#

Calculate exploration policy.

This is similar to calculate_exploration_policy, but using \(KL(p\|q)\) instead of \(KL(q\|p)\) for regularization.

Notation:

q: prior policy

p: sampling probability

v: value

The exploration policy is found by minimizing the following:

\[p = \arg\min_p \left[ -E_p(v) + c KL(p\|q) \right]\]

which leads to the following solution:

\[p_i = \frac{q_i \exp(v_i/c)}{Z}\]

where \(Z\) is such that \(\sum_i p_i = 1\)

Parameters
  • value (Tensor) – [N, K] Tensor

  • prior (Tensor) – [N, K] Tensor

  • c (Tensor) – [N, 1] Tensor

Returns

  • Tensor: [N, K], the exploration policy

  • int: always 0 (to conform with the signature of calculate_exploration_policy)

Return type

tuple
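Since this variant has a closed-form solution, a single-row sketch is short. The code below is an illustration rather than the library function: kl_exploration_policy_row is a hypothetical name, lists replace the [N, K] tensors, c is a positive float, and subtracting the maximum value before exponentiating is a numerical-stability choice of this sketch (it cancels in the normalization).

```python
import math


def kl_exploration_policy_row(value, prior, c):
    """Return (policy, 0) with policy_i = q_i * exp(v_i / c) / Z."""
    v_star = max(value)
    # Shifting by v_star leaves the normalized result unchanged.
    weights = [q_i * math.exp((v_i - v_star) / c)
               for v_i, q_i in zip(value, prior)]
    z = sum(weights)
    return [w / z for w in weights], 0


policy, num_iterations = kl_exploration_policy_row(
    value=[1.0, 0.5, 0.0], prior=[0.5, 0.3, 0.2], c=0.5)
```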

create_atari_mcts(observation_spec, action_spec)[source]#

Helper function for creating MCTSAlgorithm for atari games.

create_board_game_mcts(observation_spec, action_spec, dirichlet_alpha, pb_c_init=1.25, num_simulations=800, debug_summaries=False)[source]#

Helper function for creating MCTSAlgorithm for board games.

create_chess_mcts(observation_spec, action_spec, debug_summaries)[source]#
create_control_mcts(observation_spec, action_spec, num_simulations=50, debug_summaries=False)[source]#

Helper function for creating MCTSAlgorithm for control tasks.

create_go_mcts(observation_spec, action_spec, debug_summaries)[source]#
create_shogi_mcts(observation_spec, action_spec, debug_summaries)[source]#

alf.algorithms.mcts_models#

class MCTSModel(num_unroll_steps, representation_net, dynamics_net, prediction_net, train_reward_function, train_game_over_function, train_repr_prediction=False, train_policy=True, predict_reward_sum=False, value_loss_weight=1.0, reward_loss_weight=1.0, policy_loss_weight=1.0, game_over_loss_weight=1.0, repr_prediction_loss_weight=1.0, initial_alpha=0.0, reward_loss=SquareLoss(), value_loss=SquareLoss(), repr_loss=MeanSquaredLoss(batch_dims=2), target_entropy=None, alpha_adjust_rate=0.001, initial_loss_weight=1, predict_initial_reward=True, reset_reward_sum_period=0, apply_beyond_episode_end_mask=False, apply_partial_trajectory_mask=False, debug_summaries=False, name='MCTSModel')[source]#

Bases: torch.nn.modules.module.Module

The interface for the model used by MCTSAlgorithm.

Parameters
  • representation_net (Network) – the network for generating initial latent representation from observation. It is called as representation_net(observation).

  • dynamics_net (Network) – the network for generating the next latent representation given the current latent representation and action. It is called as dynamics_net((current_latent_representation, action))

  • prediction_net (Network) –

    the network for predicting value, reward and action. It is called as prediction_net(dyn_state, pred_state) and outputs a tuple of four Tensors:

    • value_pred: the prediction for value. The way it is interpreted depends on value_loss.

    • reward_pred (Optional): the prediction for reward. The way it is interpreted depends on reward_loss.

    • action_distribution: The distribution of the actions of the predicted policy.

    • game_over_logit (Optional): The predicted logits for game over.

  • train_reward_function (bool) – whether to predict reward

  • train_game_over_function (bool) – whether to predict game over

  • train_repr_prediction (bool) – whether to train to predict future latent representation.

  • train_policy (bool) – whether to train a policy. Note that training policy is REQUIRED when the model is used in MCTS algorithm.

  • predict_reward_sum (bool) – If True, the loss for reward is between the predicted reward and the sum of actual reward over unroll steps. If False, the loss for reward is the mean square error between the predicted reward and the actual reward.

  • value_loss_weight (float) – the weight for value prediction loss.

  • reward_loss_weight (float) – the weight for reward prediction loss

  • policy_loss_weight (float) – the weight for policy prediction loss

  • repr_prediction_loss_weight (float) – the weight for the loss of predicting latent representation.

  • initial_alpha (float) – initial value for the weight of entropy regularization

  • reward_loss (ScalarPredictionLoss) – the loss function for reward prediction.

  • value_loss (ScalarPredictionLoss) – the loss function for value prediction.

  • repr_loss (Callable) – the loss function for representation learning. It is called as repr_loss(predicted_representation, target_representation), where the shapes of the two tensors are [B, num_unroll_steps+1, …]. It should return a loss with the shape [B, num_unroll_steps+1]. Note that repr_loss can have its own parameters.

  • target_entropy (float) – if provided, will adjust alpha automatically so that the entropy is not smaller than this.

  • alpha_adjust_rate (float) – the speed to adjust alpha

  • initial_loss_weight (Optional[float]) – the weight for the loss at the initial step of the trajectory. If not provided, 1 / num_unroll_steps will be used.

  • predict_initial_reward (bool) – whether to predict the reward at the initial step.

  • reset_reward_sum_period (int) – reset the reward sum every so many steps. Do not reset the reward sum if this is 0.

  • apply_beyond_episode_end_mask (bool) – If True, the steps after the end of an episode are ignored for the representation prediction loss.

  • apply_partial_trajectory_mask (bool) – If True, the steps after an unfinished episode (due to TimeLimit or an ongoing episode) are ignored for all the losses.

calc_loss(model_output, target)[source]#

Calculate the loss.

The shapes of the tensors in model_output are [B, unroll_steps+1, …].

Returns

the shapes of the tensors are [B]

Return type

LossInfo

calc_repr_prediction_loss(repr, target_repr)[source]#

Calculate the loss given the predicted representation and target representation.

initial_inference(observation)[source]#
Return type

ModelOutput

initial_predict(latent, pred_state=())[source]#

Make predictions based on an initial latent representation.

Note that we specialize for initial prediction (in addition to recurrent prediction made in recurrent_inference()) because some stateful initializations need to be completed.

Parameters
  • latent (Tensor) – A batch of initial representation (i.e. directly derived from a raw observation).

  • pred_state – prediction state. If provided, it should be ModelOutput.state.pred_state returned from initial_predict at the previous step

Return type

ModelOutput

Returns

A ModelOutput object produced by the prediction network.

initial_representation(observation)[source]#

Compute the initial latent representation given the observation.

Parameters

observation – A tensor or tensor nest representing a batch of observations.

Return type

Tensor

Returns

The latent representation generated by the representation net.

property pred_state_spec: Union[alf.tensor_specs.TensorSpec, List[NestedTensorSpec], Tuple[()], Tuple[NestedTensorSpec, ...], Dict[str, NestedTensorSpec]]#

Returns the spec of the prediction_net.

Return type

Union[TensorSpec, List[ForwardRef], Tuple[()], Tuple[ForwardRef, …], Dict[str, ForwardRef]]

prediction_model(dyn_state, pred_state)[source]#

Calculate the prediction given the latent state of the dynamics model and the state of the prediction model.

Returns

the following fields need to be provided:

  • value_pred

  • reward_pred: provide if need to predict reward

  • game_over: provide if need to predict game over

  • actions: provide if actions are sampled

  • action_probs

  • state (ModelState): dyn_state, pred_state

  • action_distribution

  • game_over_logit: provide if need to predict game over

Return type

ModelOutput

recurrent_inference(state, action)[source]#

Generate prediction given state and action.

Parameters
  • state (Tensor) – the latent state of the model. The state should be from previous call of initial_inference or recurrent_inference.

  • action (Tensor) – the imagined action

Returns

the prediction

Return type

ModelOutput

property repr_spec: alf.tensor_specs.TensorSpec#

Returns the spec of the representation.

Used by the downstream RL algorithms as their observation spec.

Return type

TensorSpec

training: bool#
class ModelOutput(value, reward, game_over, actions, action_probs, state, action_distribution, game_over_logit, value_pred, reward_pred)#

Bases: tuple

Create new instance of ModelOutput(value, reward, game_over, actions, action_probs, state, action_distribution, game_over_logit, value_pred, reward_pred)

action_distribution#

Alias for field number 6

action_probs#

Alias for field number 4

actions#

Alias for field number 3

game_over#

Alias for field number 2

game_over_logit#

Alias for field number 7

reward#

Alias for field number 1

reward_pred#

Alias for field number 9

state#

Alias for field number 5

value#

Alias for field number 0

value_pred#

Alias for field number 8

class ModelState(state, pred_state, step, prev_reward_sum)#

Bases: tuple

Create new instance of ModelState(state, pred_state, step, prev_reward_sum)

pred_state#

Alias for field number 1

prev_reward_sum#

Alias for field number 3

state#

Alias for field number 0

step#

Alias for field number 2

class ModelTarget(is_partial_trajectory, beyond_episode_end, reward, action, action_policy, game_over, value, observation)#

Bases: tuple

Create new instance of ModelTarget(is_partial_trajectory, beyond_episode_end, reward, action, action_policy, game_over, value, observation)

action#

Alias for field number 3

action_policy#

Alias for field number 4

beyond_episode_end#

Alias for field number 1

game_over#

Alias for field number 5

is_partial_trajectory#

Alias for field number 0

observation#

Alias for field number 7

reward#

Alias for field number 2

value#

Alias for field number 6

class SimpleMCTSModel(observation_spec, action_spec, num_unroll_steps, num_sampled_actions=None, encoding_net_ctor=<function create_simple_encoding_net>, dynamics_net_ctor=<function create_simple_dynamics_net>, prediction_net_ctor=<function create_simple_prediction_net>, game_over_logit_thresh=1.0, initial_alpha=0.0, target_entropy=None, alpha_adjust_rate=0.001, train_reward_function=True, train_game_over_function=True, train_policy=True, train_repr_prediction=False, debug_summaries=False, name='SimpleMCTSModel')[source]#

Bases: alf.algorithms.mcts_models.MCTSModel

Parameters
  • observation_spec (TensorSpec) – representing the observations.

  • action_spec (BoundedTensorSpec) – representing the actions.

  • num_sampled_actions (int) – the number of actions sampled from the action distribution. For continuous actions or multi-dimensional discrete actions, this many actions will be sampled from the action distribution. For 1-dimensional (scalar) discrete actions, the num_sampled_actions actions with the largest probability will be chosen.

  • dynamics_net_ctor (Callable) – Called as dynamics_net_ctor((observation_spec, action_spec)) to create the dynamics net. The created net should take a tuple of (observation, action) as input and output the next observation.

  • prediction_net_ctor (Callable) – Called as prediction_net_ctor(observation_spec, action_spec) to create the prediction net. The created net should take the latent_state as input and output the prediction for (value, reward, action_distribution, game_over_logit).

  • game_over_logit_thresh (float) – the threshold for treating the state as game over: the state is treated as game over if the game-over logit is greater than this.

  • initial_alpha (float) – initial value for the weight of entropy regularization

  • target_entropy (float) – if provided, will adjust alpha automatically so that the entropy is not smaller than this.

  • alpha_adjust_rate (float) – the speed to adjust alpha

  • train_reward_function (bool) – whether to predict reward

  • train_game_over_function (bool) – whether to predict game over

  • train_repr_prediction (bool) – whether to train to predict future latent representation. This implements the self-supervised consistency loss described in Ye et al. Mastering Atari Games with Limited Data. The loss is -cosine(prediction_net(projection_net(x)), projection_net(y)), where x is the representation calculated by dynamics_net and y is the representation calculated by representation_net from the corresponding future observations.

  • train_policy (bool) – whether to train a policy. Note that training policy is REQUIRED when the model is used in MCTS algorithm.
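To make the similarity term in the train_repr_prediction description concrete, here is a minimal sketch of a negative cosine loss between two vectors. It only illustrates the -cosine(...) part: the projection and prediction heads are omitted, negative_cosine is a hypothetical helper, plain lists stand in for tensors, and the eps guard against zero-length vectors is an assumption of this sketch.

```python
import math


def negative_cosine(prediction, target, eps=1e-8):
    """Return -cos(prediction, target); -1 means perfectly aligned."""
    dot = sum(a * b for a, b in zip(prediction, target))
    norm_p = math.sqrt(sum(a * a for a in prediction))
    norm_t = math.sqrt(sum(b * b for b in target))
    # eps avoids division by zero when either vector is all zeros.
    return -dot / max(norm_p * norm_t, eps)


aligned_loss = negative_cosine([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
orthogonal_loss = negative_cosine([1.0, 0.0], [0.0, 1.0])
```

Minimizing this quantity pushes the predicted representation toward the direction of the target representation, independent of their magnitudes.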

prediction_model(dyn_state, pred_state)[source]#

Calculate the prediction given the latent state of the dynamics model and the state of the prediction model.

Returns

the following fields need to be provided:

  • value_pred

  • reward_pred: provide if need to predict reward

  • game_over: provide if need to predict game over

  • actions: provide if actions are sampled

  • action_probs

  • state (ModelState): dyn_state, pred_state

  • action_distribution

  • game_over_logit: provide if need to predict game over

Return type

ModelOutput

property repr_spec#

Returns the spec of the representation.

Used by the downstream RL algorithms as their observation spec.

training: bool#
class SimplePredictionNet(observation_spec, action_spec, trunk_net_ctor, num_quantiles=1, discrete_projection_net_ctor=<class 'alf.networks.projection_networks.CategoricalProjectionNetwork'>, continuous_projection_net_ctor=<class 'alf.networks.projection_networks.StableNormalProjectionNetwork'>, initial_game_over_bias=0.0)[source]#

Bases: alf.networks.network.Network

Parameters
  • observation_spec (TensorSpec) – describing the observation.

  • action_spec (BoundedTensorSpec) – describing the action.

  • trunk_net_ctor (Callable) – called as trunk_net_ctor(input_tensor_spec=observation_spec) to create a network which takes the observation as input and outputs a hidden representation which will be used as input for predicting value, reward, action_distribution and game_over_logit

  • initial_game_over_bias (float) – initial bias for predicting the logit of game_over. A suggested value is log(game_over_prob/(1 - game_over_prob))
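The suggested value for initial_game_over_bias is the logit of the expected game-over probability, so that the initial sigmoid output matches that probability. A small sketch, where game_over_bias and sigmoid are hypothetical helper names and the probability 0.01 is only an example value:

```python
import math


def game_over_bias(game_over_prob):
    """Logit of the expected game-over probability (0 < prob < 1)."""
    return math.log(game_over_prob / (1.0 - game_over_prob))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


# Example: if roughly 1% of steps end the game, start the logit here.
bias = game_over_bias(0.01)
initial_prob = sigmoid(bias)  # recovers the chosen probability
```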

forward(input, state=())[source]#

Predict (value, reward, action_distribution, game_over_logit)

Parameters
  • input (Tensor) – observation

  • state – not used.

Returns

A tuple of ((value, reward, action_distribution, game_over_logit), ())

training: bool#
create_simple_dynamics_net(input_tensor_spec)[source]#
create_simple_encoding_net(observation_spec)[source]#
create_simple_prediction_net(observation_spec, action_spec)[source]#
get_unique_num_actions(action_spec)[source]#

alf.algorithms.mdq_algorithm#

Multi-Dimensional Q-Learning Algorithm.

class MdqAlgorithm(observation_spec, action_spec, critic_network, reward_spec=TensorSpec(shape=(), dtype=torch.float32), epsilon_greedy=None, env=None, config=None, critic_loss_ctor=None, target_entropy=<function calc_default_target_entropy_quantized>, initial_log_alpha=0.0, target_update_tau=0.05, target_update_period=1, distill_noise=0.01, critic_optimizer=None, alpha_optimizer=None, debug_summaries=False, name='MdqAlgorithm')[source]#

Bases: alf.algorithms.off_policy_algorithm.OffPolicyAlgorithm

Multi-Dimensional Q-Learning Algorithm.

Parameters
  • observation_spec (nested TensorSpec) – representing the observations.

  • action_spec (nested BoundedTensorSpec) – representing the actions.

  • critic_network (MdqCriticNetwork) – an instance of MdqCriticNetwork

  • reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).

  • epsilon_greedy (float) – a floating value in [0,1], representing the chance of action sampling instead of taking argmax. This can help prevent a dead loop in some deterministic environment like Breakout. Only used for evaluation. If None, its value is taken from config.epsilon_greedy and then alf.get_config_value(TrainerConfig.epsilon_greedy).

  • env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. env only needs to be provided to the root algorithm.

  • config (TrainerConfig) – config for training. It only needs to be provided to the algorithm which performs train_iter() by itself.

  • critic_loss_ctor (None|OneStepTDLoss|MultiStepLoss) – a critic loss constructor. If None, a default OneStepTDLoss will be used.

  • initial_log_alpha (float) – initial value for variable log_alpha.

  • target_entropy (float|Callable) – If a floating value, it’s the target average policy entropy, for updating alpha. If a callable function, then it will be called on the action spec to calculate a target entropy. Note that in MDQ algorithm, as the continuous action is represented by a discrete distribution for each action dimension, calc_default_target_entropy_quantized is used to compute the target entropy by default.

  • target_update_tau (float) – Factor for soft update of the target networks.

  • target_update_period (int) – Period for soft update of the target networks.

  • distill_noise (float) – the std of random Gaussian noise added to the action used for distillation.

  • critic_optimizer (torch.optim.optimizer) – The optimizer for critic.

  • alpha_optimizer (torch.optim.optimizer) – The optimizer for alpha.

  • debug_summaries (bool) – True if debug summaries should be created.

  • name (str) – The name of this algorithm.
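The roles of target_update_tau and target_update_period can be illustrated with the common soft-update rule. This is a generic sketch, not MdqAlgorithm's code: soft_update is a hypothetical helper, flat lists of floats stand in for network parameters, and the convention target ← (1 - tau) * target + tau * source is an assumption of this sketch.

```python
def soft_update(target_params, source_params, tau):
    """Move each target parameter a fraction tau toward the source."""
    return [(1.0 - tau) * t + tau * s
            for t, s in zip(target_params, source_params)]


target = [0.0, 0.0]
source = [1.0, -2.0]
target_update_tau = 0.05
target_update_period = 1

for step in range(1, 4):
    # Only update the target every target_update_period steps.
    if step % target_update_period == 0:
        target = soft_update(target, source, target_update_tau)
```

A small tau makes the target network trail the trained network slowly, which is the usual motivation for soft updates in value-based methods.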

after_update(root_inputs, info)[source]#

Do things after completing one gradient update (i.e. update_with_gradient()). This function can be used for post-processing following one minibatch update, such as copying a training model to a target model in SAC, DQN, etc.

Parameters
  • root_inputs (nest) – temporally batched inputs for the rollout_step() of the root algorithm collected during unroll().

  • info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() for on-policy training or train_step() for off-policy training.

calc_loss(info)[source]#

Calculate the loss at each step for each sample.

Parameters

info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).

Returns

loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).

Return type

LossInfo

predict_step(time_step, state)[source]#

Predict for one step of observation.

This is only used for evaluation, so it only needs to perform the computations for generating the action distribution.

Parameters
  • time_step (TimeStep) – Current observation and other inputs for computing action.

  • state (nested Tensor) – should be consistent with predict_state_spec

Returns

  • output (nested Tensor): should be consistent with action_spec.

  • state (nested Tensor): should be consistent with predict_state_spec.

Return type

AlgStep

rollout_step(time_step, state)[source]#

Rollout for one step of inputs.

It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for calc_loss(). For off-policy training, it needs to generate necessary information for train_step().

Parameters
  • inputs (nested Tensor) – inputs for prediction.

  • state (nested Tensor) – network state (for RNN).

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match rollout_state_spec.

  • info (nested Tensor): For on-policy training it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored in the replay buffer and retrieved for train_step() as rollout_info.

Return type

AlgStep

train_step(inputs, state, rollout_info)[source]#

Perform one step of training computation.

It is called to calculate output for every time step for a batch of experience from replay buffer. It also needs to generate necessary information for calc_loss().

Parameters
  • inputs (nested Tensor) – inputs for train.

  • state (nested Tensor) – consistent with train_state_spec.

  • rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match train_state_spec.

  • info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.

Return type

AlgStep

training: bool#
class MdqAlphaInfo(alpha_loss, neg_entropy)#

Bases: tuple

Create new instance of MdqAlphaInfo(alpha_loss, neg_entropy)

alpha_loss#

Alias for field number 0

neg_entropy#

Alias for field number 1

class MdqCriticInfo(critic_free_form, target_critic_free_form, critic_adv_form, distill_target, kl_wrt_prior)#

Bases: tuple

Create new instance of MdqCriticInfo(critic_free_form, target_critic_free_form, critic_adv_form, distill_target, kl_wrt_prior)

critic_adv_form#

Alias for field number 2

critic_free_form#

Alias for field number 0

distill_target#

Alias for field number 3

kl_wrt_prior#

Alias for field number 4

target_critic_free_form#

Alias for field number 1

class MdqCriticState(critic, target_critic)#

Bases: tuple

Create new instance of MdqCriticState(critic, target_critic)

critic#

Alias for field number 0

target_critic#

Alias for field number 1

class MdqInfo(reward, step_type, discount, action, critic, alpha)#

Bases: tuple

Create new instance of MdqInfo(reward, step_type, discount, action, critic, alpha)

action#

Alias for field number 3

alpha#

Alias for field number 5

critic#

Alias for field number 4

discount#

Alias for field number 2

reward#

Alias for field number 0

step_type#

Alias for field number 1

class MdqLossInfo(critic, distill, alpha)#

Bases: tuple

Create new instance of MdqLossInfo(critic, distill, alpha)

alpha#

Alias for field number 2

critic#

Alias for field number 0

distill#

Alias for field number 1

class MdqState(critic)#

Bases: tuple

Create new instance of MdqState(critic,)

critic#

Alias for field number 0

alf.algorithms.merlin_algorithm#

Implementation of MERLIN algorithm. See class MerlinAlgorithm for detail.

class MBPLossInfo(decoder, vae)#

Bases: tuple

Create new instance of MBPLossInfo(decoder, vae)

decoder#

Alias for field number 0

vae#

Alias for field number 1

class MBPState(latent_vector, mem_readout, rnn_state, memory)#

Bases: tuple

Create new instance of MBPState(latent_vector, mem_readout, rnn_state, memory)

latent_vector#

Alias for field number 0

mem_readout#

Alias for field number 1

memory#

Alias for field number 3

rnn_state#

Alias for field number 2

class MemoryBasedActor(observation_spec, action_spec, memory, reward_spec=TensorSpec(shape=(), dtype=torch.float32), epsilon_greedy=None, num_read_keys=1, lstm_size=(256, 256), latent_dim=200, loss=None, loss_class=<class 'alf.algorithms.actor_critic_loss.ActorCriticLoss'>, loss_weight=1.0, debug_summaries=False, name='mba')[source]#

Bases: alf.algorithms.on_policy_algorithm.OnPolicyAlgorithm

The policy module for MERLIN model.

Parameters
  • observation_spec (nested TensorSpec) – representing the observations.

  • action_spec (nested BoundedTensorSpec) – representing the actions.

  • memory (MemoryWithUsage) – the memory module from MemoryBasedPredictor

  • reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).

  • epsilon_greedy (float) – a floating value in [0,1], representing the chance of action sampling instead of taking argmax. This can help prevent a dead loop in some deterministic environment like Breakout. Only used for evaluation. If None, its value is taken from alf.get_config_value(TrainerConfig.epsilon_greedy).

  • num_read_keys (int) – number of keys for reading memory.

  • latent_dim (int) – the dimension of the hidden representation of VAE.

  • lstm_size (list[int]) – size of lstm layers

  • loss (None|ActorCriticLoss) – an object for calculating the loss for reinforcement learning. If None, a default ActorCriticLoss will be used.

  • loss_class (type) – the class of the loss. The signature of its constructor: loss_class(debug_summaries)

  • name (str) – name of the algorithm.

calc_loss(train_info)[source]#

Calculate loss.

predict_step(time_step, state)[source]#

Predict for one step of observation.

This is only used for evaluation, so it only needs to perform the computations for generating the action distribution.

Parameters
  • time_step (TimeStep) – Current observation and other inputs for computing action.

  • state (nested Tensor) – should be consistent with predict_state_spec

Returns

  • output (nested Tensor): should be consistent with action_spec.

  • state (nested Tensor): should be consistent with predict_state_spec.

Return type

AlgStep

rollout_step(time_step, state)[source]#

Train one step.

Parameters
  • time_step (TimeStep) – time_step.observation should be the latent vector.

  • state (nested Tensor) – state of the model

training: bool#
class MemoryBasedPredictor(action_spec, encoders, decoders, num_read_keys=3, lstm_size=(256, 256), latent_dim=200, memory_size=1350, loss_weight=1.0, name='mbp')[source]#

Bases: alf.algorithms.algorithm.Algorithm

The Memory Based Predictor.

It’s described in: Wayne et al “Unsupervised Predictive Memory in a Goal-Directed Agent” arXiv:1803.10760

Parameters
  • action_spec (nested BoundedTensorSpec) – representing the actions.

  • encoders (nested Network) – the nest should match observation_spec

  • decoders (nested Algorithm) – the nest should match observation_spec

  • num_read_keys (int) – number of keys for reading memory.

  • lstm_size (list[int]) – size of lstm layers for MBP and MBA

  • latent_dim (int) – the dimension of the hidden representation of VAE.

  • memory_size (int) – number of memory slots

  • loss_weight (float) – weight for the loss

  • name (str) – name of the algorithm.

decode_step(latent_vector, observations)[source]#

Calculate decoding loss.

encode_step(inputs, state)[source]#

Calculate latent vector.

Parameters
  • inputs (tuple) – a tuple of (observation, prev_action).

  • state (MBPState) – RNN state

Returns

  • output: latent vector

  • state: next_state

  • info (LossInfo): loss

Return type

AlgStep

property memory#

Return the external memory of this module.

predict_step(inputs, state)[source]#

Predict one step.

Parameters
  • inputs (tuple) – a tuple of (observation, action).

  • state (nested Tensor) – RNN state

Returns

  • output: latent vector

  • state: next state

  • info: empty tuple

Return type

AlgStep

train_step(inputs, state)[source]#

Train one step.

Parameters

inputs (tuple) – a tuple of (observation, action).

Returns

  • output: latent vector

  • state: next state

  • info (LossInfo): loss

Return type

AlgStep

training: bool#
class MerlinAlgorithm(observation_spec, action_spec, encoders, decoders, reward_spec=TensorSpec(shape=(), dtype=torch.float32), env=None, config=None, latent_dim=200, lstm_size=(256, 256), memory_size=1350, rl_loss=None, optimizer=None, debug_summaries=False, name='Merlin')[source]#

Bases: alf.algorithms.on_policy_algorithm.OnPolicyAlgorithm

MERLIN model.

This implements the MERLIN model described in Wayne et al “Unsupervised Predictive Memory in a Goal-Directed Agent” arXiv:1803.10760

Current differences:

  • No action encoding and decoding

  • No retroactive memory update

  • No prediction of state-action value

  • Value prediction does not use action distribution as feature.

  • No q-value prediction

  • Image encoding and decoding use batch-norm, which the paper did not use.

Parameters
  • action_spec (nested BoundedTensorSpec) – representing the actions.

  • encoders (nested Network) – the nest should match observation_spec

  • decoders (nested Algorithm) – the nest should match observation_spec

  • reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).

  • env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. Running multiple environments in parallel is crucial to on-policy algorithms as it increases the diversity of data and decreases temporal correlation. env only needs to be provided to the root Algorithm.

  • config (TrainerConfig) – config for training. config only needs to be provided to the algorithm which performs train_iter() by itself.

  • latent_dim (int) – the dimension of the hidden representation of VAE.

  • lstm_size (list[int]) – size of lstm layers for MBP and MBA

  • memory_size (int) – number of memory slots

  • rl_loss (None|ActorCriticLoss) – an object for calculating the loss for reinforcement learning. If None, a default ActorCriticLoss will be used.

  • optimizer (torch.optim.Optimizer) – The optimizer for training.

  • debug_summaries – True if debug summaries should be created.

  • name (str) – name of the algorithm.

calc_loss(info)[source]#

Calculate loss.

predict_step(time_step, state)[source]#

Predict for one step of observation.

This is only used for evaluation, so it only needs to perform the computations for generating the action distribution.

Parameters
  • time_step (TimeStep) – Current observation and other inputs for computing action.

  • state (nested Tensor) – should be consistent with predict_state_spec

Returns

  • output (nested Tensor): should be consistent with action_spec.

  • state (nested Tensor): should be consistent with predict_state_spec.

Return type

AlgStep

rollout_step(time_step, state)[source]#

Train one step.

training: bool#
class MerlinInfo(mbp_info, mba_info)#

Bases: tuple

Create new instance of MerlinInfo(mbp_info, mba_info)

mba_info#

Alias for field number 1

mbp_info#

Alias for field number 0

class MerlinLossInfo(mba, mbp)#

Bases: tuple

Create new instance of MerlinLossInfo(mba, mbp)

mba#

Alias for field number 0

mbp#

Alias for field number 1

class MerlinState(mbp_state, mba_state)#

Bases: tuple

Create new instance of MerlinState(mbp_state, mba_state)

mba_state#

Alias for field number 1

mbp_state#

Alias for field number 0
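
The state and info containers above are plain named tuples, so they can be read by field name or by position. The sketch below re-declares two of them locally with the documented field order; these local definitions are stand-ins for illustration and the placeholder string values are not real tensors.

```python
from collections import namedtuple

# Local stand-ins that mirror the documented field order.
MBPState = namedtuple(
    "MBPState", ["latent_vector", "mem_readout", "rnn_state", "memory"])
MerlinState = namedtuple("MerlinState", ["mbp_state", "mba_state"])

mbp_state = MBPState(latent_vector="z", mem_readout="r", rnn_state="h", memory="m")
state = MerlinState(mbp_state=mbp_state, mba_state="actor_state")

# Fields can be read by name or by position ("Alias for field number N").
first_by_index = state[0]
memory_by_index = state[0][3]

# Tuples are immutable; _replace returns an updated copy.
new_state = state._replace(mba_state="next_actor_state")
```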

class ResnetDecodingNetwork(input_tensor_spec, output_tensor_spec=TensorSpec(shape=(3, 64, 64), dtype=torch.float32), name='ResnetDecodingNetwork')[source]#

Bases: alf.networks.network.Network

Image decoding network using ResNet bottleneck blocks.

This is not a generic network, it implements ImageDecoder described in 2.2.1 of “Unsupervised Predictive Memory in a Goal-Directed Agent”

Parameters
  • input_tensor_spec (TensorSpec) – input latent spec.

  • output_tensor_spec (TensorSpec) – desired output shape. Height and width need to be divisible by 8.

forward(observation, state=())[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool#
class ResnetEncodingNetwork(input_tensor_spec, output_size=500, output_activation=<built-in method tanh of type object>, use_fc_bn=False, norm_layer=None, name='ResnetEncodingNetwork')[source]#

Bases: alf.networks.network.Network

Image encoding network using ResNet bottleneck blocks.

This is not a generic network, it implements ImageEncoder described in 2.1.1 of “Unsupervised Predictive Memory in a Goal-Directed Agent”

Parameters
  • input_tensor_spec (nested TensorSpec) – input observations spec.

  • output_size (int) – dimension of the encoding result

  • output_activation (Callable) – activation for the output

  • use_fc_bn (bool) – whether to use batch normalization for the final FC layer.

  • norm_layer (nn.Module|None) – optional additional layer for normalization.

forward(observation, state=())[source]#

Defines the computation performed at every call.

Should be overridden by all subclasses.

Note

Although the recipe for forward pass needs to be defined within this function, one should call the Module instance afterwards instead of this since the former takes care of running the registered hooks while the latter silently ignores them.

training: bool#

alf.algorithms.mi_estimator#

Mutual Information Estimator.

class MIEstimator(x_spec, y_spec, model=None, fc_layers=(256), sampler='buffer', buffer_size=65536, optimizer=None, estimator_type='DV', averager=None, name='MIEstimator')[source]#

Bases: alf.algorithms.algorithm.Algorithm

Mutual Information Estimator.

Implements several mutual information estimators from Belghazi et al., “Mutual Information Neural Estimation”, and Hjelm et al., “Learning Deep Representations by Mutual Information Estimation and Maximization”.

Currently, 4 types of estimator are implemented, which are based on the following variational lower bounds:

  • DV: \(\sup_T E_P(T) - \log E_Q(\exp(T))\)

  • KLD: \(\sup_T E_P(T) - E_Q(\exp(T)) + 1\)

  • JSD: \(\sup_T -E_P(\mathrm{softplus}(-T)) - E_Q(\mathrm{softplus}(T)) + \log(4)\)

  • ML: \(\sup_q E_P(\log(q(y|x)) - \log(P(y)))\)

where P is the joint distribution of X and Y, and Q is the product marginal distribution of P. Both DV and KLD are lower bounds for \(KLD(P||Q)=MI(X, Y)\). However, JSD is not a lower bound for mutual information, it is a lower bound for \(JSD(P||Q)\), which is closely correlated with MI as pointed out in Hjelm et al.

For ML, \(P(y)\) is the marginal distribution of y, and it needs to be provided. The current implementation uses a normal distribution with diagonal variance for \(q(y|x)\), so it only supports continuous y. If \(P(y|x)\) can be reasonably approximated as a diagonal normal distribution and \(P(y)\) is known, then ‘ML’ may give a better estimate of the mutual information.

Assuming the function class of T is rich enough to represent any function, for KLD and JSD, T will converge to \(\log(\frac{P}{Q})\) and hence \(E_P(T)\) can also be used as an estimator of \(KLD(P||Q)=MI(X,Y)\). For DV, \(T\) will converge to \(\log(\frac{P}{Q}) + c\), where \(c=\log E_Q(\exp(T))\).

Among DV, KLD and JSD, DV and KLD seem to give a better estimation of PMI than JSD. But JSD might be numerically more stable than DV and KLD because of the use of softplus instead of exp. And DV is more stable than KLD because of the logarithm.
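
The DV, KLD and JSD expressions above can be evaluated exactly on a small discrete example. The sketch below is not ALF code: mi_bounds and the two critics are hypothetical helpers, the critic T is a plain Python function, and the expectations are computed by enumeration rather than by sampling.

```python
import math


def softplus(t):
    # Numerically stable log(1 + exp(t)).
    return max(t, 0.0) + math.log1p(math.exp(-abs(t)))


def mi_bounds(joint, critic):
    """Evaluate the DV, KLD and JSD objectives for a fixed critic T(x, y).

    ``joint`` maps (x, y) pairs to probabilities under P.
    """
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p

    def e_p(f):  # expectation under the joint distribution P
        return sum(p * f(critic(x, y)) for (x, y), p in joint.items())

    def e_q(f):  # expectation under the product of marginals Q
        return sum(px[x] * py[y] * f(critic(x, y)) for x in px for y in py)

    return {
        "DV": e_p(lambda t: t) - math.log(e_q(math.exp)),
        "KLD": e_p(lambda t: t) - e_q(math.exp) + 1.0,
        "JSD": -e_p(lambda t: softplus(-t)) - e_q(softplus) + math.log(4.0),
    }


joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
q = 0.25  # both marginals are uniform, so Q(x, y) = 0.5 * 0.5


def optimal_critic(x, y):
    return math.log(joint[(x, y)] / q)  # T = log(P / Q)


def zero_critic(x, y):
    return 0.0


true_mi = sum(p * math.log(p / q) for p in joint.values())
optimal = mi_bounds(joint, optimal_critic)
trivial = mi_bounds(joint, zero_critic)
```

With the critic log(P/Q), the DV and KLD values coincide with the true mutual information, while a constant critic gives 0 for all three objectives, consistent with their being lower bounds that are tightened by training T.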

Several strategies are implemented in order to estimate \(E_Q(\cdot)\):

  • ‘buffer’: store \(y\) to a buffer and randomly retrieve samples from the buffer.

  • ‘double_buffer’: store both \(x\) and \(y\) to buffers and randomly retrieve samples from the two buffers.

  • ‘shuffle’: randomly shuffle batch \(y\)

  • ‘shift’: shift batch \(y\) by one sample, i.e. torch.cat([y[-1:, ...], y[0:-1, ...]], dim=0)

  • direct sampling: You can also provide the marginal distribution of \(y\) to train_step(). In this case, sampler is ignored and samples of \(y\) for estimating \(E_Q(.)\) are sampled from y_distribution.

If you need the gradient of \(y\), you should use the sampler ‘shift’ or ‘shuffle’.

Among these, ‘buffer’ and ‘shift’ seem to perform better and ‘shuffle’ performs worst. ‘buffer’ incurs additional storage cost. ‘shift’ has the assumption that y samples from one batch are independent. If the additional memory is not a concern, we recommend ‘buffer’ sampler so that there is no need to worry about the assumption of independence.
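
The sampling strategies can be sketched on plain Python lists. This is an illustration only: the helper names are hypothetical, batches are lists instead of tensors, and drawing from the buffer with replacement is an assumption rather than a statement about how ALF samples.

```python
import random


def shift_sampler(y):
    """Shift the batch by one sample, like torch.cat([y[-1:], y[:-1]])."""
    return y[-1:] + y[:-1]


def shuffle_sampler(y, rng=random):
    """Return a randomly permuted copy of the batch."""
    return rng.sample(y, len(y))


class BufferSampler:
    """Keep the most recent ``capacity`` samples of y and draw from them."""

    def __init__(self, capacity):
        self._capacity = capacity
        self._buffer = []

    def add(self, y):
        self._buffer.extend(y)
        # Drop the oldest entries once the capacity is exceeded.
        self._buffer = self._buffer[-self._capacity:]

    def sample(self, batch_size, rng=random):
        # Assumption: sample with replacement so any batch size is allowed.
        return [rng.choice(self._buffer) for _ in range(batch_size)]


y_batch = [10, 11, 12, 13]
shifted = shift_sampler(y_batch)  # pairs x[i] with y[i - 1]
shuffled = shuffle_sampler(y_batch)

buffer = BufferSampler(capacity=6)
buffer.add([1, 2, 3, 4])
buffer.add(y_batch)
from_buffer = buffer.sample(batch_size=4)
```

The shift and shuffle variants only reorder the current batch, which is why gradients with respect to y remain available, whereas buffered samples come from earlier batches.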

MIEstimator can also be used to estimate conditional mutual information \(MI(X,Y|Z)\) using KLD, JSD or ML. In this case, you should let x represent \(X\) and \(Z\), and y represent \(Y\). When calling train_step(), you need to provide y_distribution, which is the distribution \(P(Y|z)\). Note that DV cannot be used for estimating conditional mutual information. See mi_estimator_test.py for an example.

Parameters
  • x_spec (nested TensorSpec) – spec of x

  • y_spec (nested TensorSpec) – spec of y

  • model (Network) – can be called as model([x, y]) and return a Tensor with shape=[batch_size, 1]. If None, a default MLP with fc_layers will be created.

  • fc_layers (tuple[int]) – size of hidden layers. Only used if model is None.

  • sampler (str) – type of sampler used to get samples from marginal distribution, should be one of ['buffer', 'double_buffer', 'shuffle', 'shift'].

  • buffer_size (int) – capacity of buffer for storing y for sampler ‘buffer’ and ‘double_buffer’.

  • optimizer (torch.optim.Optimizer) – optimizer

  • estimator_type (str) – one of ‘DV’, ‘KLD’, ‘JSD’ or ‘ML’

  • averager (EMAverager) – averager used to maintain a moving average of \(exp(T)\). Only used for ‘DV’ estimator. If None, a ScalarAdaptiveAverager will be created.

  • name (str) – name of this estimator

calc_pmi(x, y, y_distribution=None)[source]#

Return estimated pointwise mutual information.

The pointwise mutual information is defined as:

\[\log \frac{P(x|y)}{P(x)} = \log \frac{P(y|x)}{P(y)}\]
Parameters
  • x (Tensor) – x

  • y (Tensor) – y

  • y_distribution (DiagMultivariateNormal) – needs to be provided for ‘ML’ estimator.

Returns

pointwise mutual information between x and y.

Return type

Tensor
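
As a quick check of the definition, the two forms of pointwise mutual information agree on any discrete joint distribution. The toy distribution and helper names below are assumptions made for illustration; calc_pmi itself works with a learned critic rather than a known joint.

```python
import math

joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_x = {x: sum(p for (xi, _), p in joint.items() if xi == x) for x in (0, 1)}
p_y = {y: sum(p for (_, yi), p in joint.items() if yi == y) for y in (0, 1)}


def pmi_via_x(x, y):
    # log P(x|y) / P(x), with P(x|y) = P(x, y) / P(y)
    return math.log((joint[(x, y)] / p_y[y]) / p_x[x])


def pmi_via_y(x, y):
    # log P(y|x) / P(y), with P(y|x) = P(x, y) / P(x)
    return math.log((joint[(x, y)] / p_x[x]) / p_y[y])


# Averaging the pointwise values under the joint gives MI(X, Y).
mutual_information = sum(p * pmi_via_x(x, y) for (x, y), p in joint.items())
```

Pairs that co-occur more often than independence predicts, such as (0, 0) here, have positive PMI, and the rarer pairs have negative PMI.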

train_step(inputs, y_distribution=None, state=None)[source]#

Perform training on one batch of inputs.

Parameters
  • inputs (tuple(nested Tensor, nested Tensor)) – tuple of x and y

  • y_distribution (nested td.Distribution) – the marginal distribution of y. If None, the sampling method given by sampler in the constructor is used to generate the samples for the marginal distribution of \(Y\).

  • state – not used

Returns

  • outputs (Tensor): shape is [batch_size], its mean is the estimated MI for estimators ‘ML’, ‘DV’ and ‘KLD’, and the Jensen-Shannon divergence for estimator ‘JSD’

  • state: not used

  • info (LossInfo): info.loss is the loss

Return type

AlgStep

training: bool#

alf.algorithms.monet_algorithm#

class MoNetAlgorithm(n_slots, slot_size, input_tensor_spec, attention_unet_cls=<class 'alf.algorithms.monet_algorithm.MoNetUNet'>, encoder_cls=<class 'alf.networks.encoding_networks.EncodingNetwork'>, decoder_cls=<function SpatialBroadcastDecodingNetwork>, recurrent_attention=True, beta=0.0, gamma=0.0, name='MoNetAlgorithm')[source]#

Bases: alf.algorithms.algorithm.Algorithm

Implement the MoNet algorithm in the paper:

Burgess et al. 2019, MONet: Unsupervised Scene Decomposition and Representation

The algorithm can be thought of as one kind of VAEs except that it’s expected to produce object-centric posterior latent embeddings.

  1. We follow the exact form of image reconstruction loss in the paper. For each pixel, the mask values are the component weights of a GMM, and the predicted pixel values are the means of the GMM (log of weighted probs). Another implementation https://github.com/stelzner/monet uses an upper bound of this loss, where the mask values are weights of the mean square errors between a pixel and its predicted values (weighted log probs).

  2. We also support generating attention masks all at once, which could speed up the attention process if the number of slots is large. However, we do observe that the recurrent process usually gives better performance than this one-time process.

  3. Each slot has a different pre-assigned fixed sigma for its Gaussian model. The sigmas are automatically generated. The unequal sigmas are crucial for breaking symmetry when generating attention masks for the slots.
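
The difference between the exact reconstruction loss and its upper bound (point 1 above) can be seen on a single pixel. The sketch below uses scalar pixel values and made-up masks, means and sigmas; the function names are hypothetical and the numbers are illustrative, not the sigmas that the algorithm generates.

```python
import math


def gaussian_log_prob(x, mean, sigma):
    """Log density of a scalar Gaussian N(mean, sigma^2) at x."""
    return (-0.5 * ((x - mean) / sigma) ** 2
            - math.log(sigma) - 0.5 * math.log(2.0 * math.pi))


def mixture_nll(x, masks, means, sigmas):
    """Exact loss: negative log of the mask-weighted sum of probabilities."""
    weighted = [m * math.exp(gaussian_log_prob(x, mu, s))
                for m, mu, s in zip(masks, means, sigmas)]
    return -math.log(sum(weighted))


def weighted_nll(x, masks, means, sigmas):
    """Upper bound: mask-weighted sum of negative log probabilities."""
    return -sum(m * gaussian_log_prob(x, mu, s)
                for m, mu, s in zip(masks, means, sigmas))


# One pixel, three slots; the masks sum to 1 across slots.
pixel = 0.3
masks = [0.7, 0.2, 0.1]
means = [0.25, 0.6, 0.9]
sigmas = [0.09, 0.11, 0.11]  # illustrative values only

exact = mixture_nll(pixel, masks, means, sigmas)
bound = weighted_nll(pixel, masks, means, sigmas)
```

Because the logarithm is concave and the masks sum to 1, Jensen's inequality guarantees that the exact mixture loss never exceeds the weighted one.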

Parameters
  • n_slots (int) – number of slots (or objects) pre-defined. Note that background is also counted as an “object”.

  • slot_size (int) – the dimension of each slot embedding.

  • input_tensor_spec (Union[TensorSpec, List[ForwardRef], Tuple[()], Tuple[ForwardRef, …], Dict[str, ForwardRef]]) – the spec of input images

  • attention_unet_cls (Callable) –

    creates the attention UNet that generates masks for the slots. Depending on the value of recurrent_attention, the UNet’s input and output channels might change. The user doesn’t need to specify the input and output specs for this UNet, as they are handled automatically by the algorithm.

    • If recurrent_attention==True, this UNet receives RGB+attention_scope and outputs attention logits for the current iteration. Input shape: [B,C+1,H,W]; output shape: [B,2,H,W].

    • Otherwise it receives RGB and outputs n_slots channels (all attention logits). Input shape: [B,C,H,W]; output shape: [B,n_slots,H,W].

    In either case, the UNet’s output should be non-activated.

  • encoder_cls (Callable) – creates the posterior encoder of MoNet. Note that this encoder operates on each individual slot independently, and thus it’s invariant to the slot order. For each slot, the encoder accepts a concatenation of the image and an attention mask for the slot, in a shape of [B,C+1,H,W]. The encoder outputs a non-activated vector of shape [B,2*slot_size], representing the mean and log variance of the slot Gaussian posterior.

  • decoder_cls (Callable) – creates the decoder of MoNet. The decoder also operates on each individual slot independently, and it should reconstruct both the image (the part masked by the attention; 3 channels) and the attention mask input to the encoder (1 channel). The output should be non-activated. Input shape: [B,slot_size]; output shape: [B,C+1,H,W].

  • recurrent_attention (bool) – if True, recurrently generates attention masks where each iteration conditions on the scope as the remaining attention; otherwise all attention masks are generated at once.

  • beta (float) – weight for the VAE KLD term, sometimes this KLD can be ignored.

  • gamma (float) – weight for the KLD between generated attention masks and the reconstructed masks. A positive value might help make the masks more regular and compact.
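
The recurrent attention process with a shrinking scope can be sketched for a single pixel. This follows the stick-breaking scheme of the MONet paper and is an assumption about the general idea, not a copy of the ALF implementation: each step is reduced to one scalar logit (equivalent to a two-way softmax), and stick_breaking_masks is a hypothetical helper.

```python
import math


def stick_breaking_masks(logits):
    """Turn per-step attention logits into masks for one pixel.

    ``logits`` holds one scalar per recurrent step (n_slots - 1 steps);
    the last slot receives whatever scope remains.
    """
    log_scope = 0.0  # log(1): the whole pixel is unexplained at the start
    masks = []
    for logit in logits:
        log_alpha = -math.log1p(math.exp(-logit))  # log sigmoid(logit)
        log_one_minus_alpha = -math.log1p(math.exp(logit))  # log(1 - sigmoid)
        # This slot claims a fraction alpha of the remaining scope.
        masks.append(math.exp(log_scope + log_alpha))
        # The rest is passed on to the following slots.
        log_scope += log_one_minus_alpha
    masks.append(math.exp(log_scope))
    return masks


masks = stick_breaking_masks([0.5, -1.0, 2.0])  # 3 steps give 4 slots
```

By construction the masks are positive and sum to 1 at every pixel, so they can serve directly as mixture weights in the reconstruction loss.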

calc_loss(info)[source]#

Calculate the loss at each step for each sample.

Parameters

info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).

Returns

loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).

Return type

LossInfo

static make_gaussian(z_mean_and_log_var)[source]#
train_step(inputs, state=())[source]#

Run a training step of MoNet.

Parameters

inputs (Tensor) – the input image

Returns

  • output (VAEOutput): contains the rsampled posterior z and the mode of the posterior distribution z_mode.

  • state: empty

  • info (MoNetInfo):
    • loss: the overall loss

    • kld: kl divergence between posterior and prior (before beta)

    • rec_loss: image reconstruction loss

    • mask_rec_loss: mask reconstruction loss (before gamma)

    • full_rec: the fully reconstructed image from all slots (shape [B,C,H,W])

    • mask: the attention masks output by the attention network (note not the reconstructed one; shape [B,slots,H,W])

    • z_dist: the posterior distribution

Return type

AlgStep
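
Since kld and mask_rec_loss are reported before their weights are applied, the overall loss is presumably a weighted sum of the three terms. The sketch below states that assumption explicitly; monet_total_loss is a hypothetical helper and the numbers are arbitrary.

```python
def monet_total_loss(rec_loss, kld, mask_rec_loss, beta=0.0, gamma=0.0):
    """Combine the three loss terms with their weights (assumed form)."""
    return rec_loss + beta * kld + gamma * mask_rec_loss


default_total = monet_total_loss(rec_loss=1.5, kld=0.8, mask_rec_loss=0.2)
weighted_total = monet_total_loss(1.5, 0.8, 0.2, beta=0.5, gamma=0.25)
```

With the default beta=0.0 and gamma=0.0, only the image reconstruction term contributes under this assumption.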

training: bool#
class MoNetInfo(kld, rec_loss, mask_rec_loss, full_rec, mask, z_dist)#

Bases: tuple

Create new instance of MoNetInfo(kld, rec_loss, mask_rec_loss, full_rec, mask, z_dist)

full_rec#

Alias for field number 3

kld#

Alias for field number 0

mask#

Alias for field number 4

mask_rec_loss#

Alias for field number 2

rec_loss#

Alias for field number 1

z_dist#

Alias for field number 5

class MoNetUNet(input_tensor_spec, filters, nonskip_fc_layers, output_channels, name='MoNetUNet')[source]#

Bases: alf.networks.network.Network

Implement the UNet architecture used by MoNet. See Appendix B.2 of the MoNet paper https://arxiv.org/abs/1901.11390 for details.

The architecture is slightly different from the one in the paper, where for the downsampling path, we don’t downsample for the first block but always downsample for the other blocks. For an illustration,

                 (img) 16       16 (output)
            (3x3 conv) |  skip  | (3x3 conv + 1x1 conv)
                       16 ----> 16
(3x3 conv + maxpool 2) |  skip  | (3x3 conv + upsample 2)
                       8 -----> 8
(3x3 conv + maxpool 2) |  skip  | (3x3 conv + upsample 2)
                       4 -----> 4
                        \      /
                          MLP
Parameters
  • input_tensor_spec (Union[TensorSpec, List[ForwardRef], Tuple[()], Tuple[ForwardRef, …], Dict[str, ForwardRef]]) – spec of the input image

  • filters (Tuple[int]) – a tuple of output channels along the downsampling path, each for a conv layer. The upsampling path uses a reversed tuple.

  • nonskip_fc_layers (Tuple[int]) – a tuple of fc layer sizes for the bottleneck connection (nonskip) of the UNet.

  • output_channels (int) – final output channels. The output features are non-activated.
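
The resolution pattern in the diagram above (no downsampling in the first block, halving in every later block) can be reproduced with a small helper. unet_spatial_sizes is a hypothetical function for illustration and only tracks spatial size, not channels.

```python
def unet_spatial_sizes(input_size, num_blocks):
    """Spatial size after each block of the downsampling path.

    The first block keeps the resolution; every later block halves it.
    """
    sizes = []
    size = input_size
    for block in range(num_blocks):
        if block > 0:
            size //= 2  # 2x2 max-pooling
        sizes.append(size)
    return sizes


down = unet_spatial_sizes(16, 3)
up = list(reversed(down))  # the upsampling path mirrors the skip connections
```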

forward(inputs, state=())[source]#

Do a forward step of the UNet.

Parameters

inputs (Tensor) – the input image of shape [B,C,H,W] where C can be any value.

Returns

  • output: an output image of the shape [B,K,H,W], where K is output_channels. The output image is non-activated.

  • state: empty

Return type

tuple

training: bool#

alf.algorithms.muzero_algorithm#

MuZero algorithm.

class MuzeroAlgorithm(observation_spec, action_spec, discount, reward_spec=TensorSpec(shape=(), dtype=torch.float32), representation_learner_ctor=<class 'alf.algorithms.muzero_representation_learner.MuzeroRepresentationImpl'>, mcts_algorithm_ctor=<class 'alf.algorithms.mcts_algorithm.MCTSAlgorithm'>, reward_transformer=None, config=None, enable_amp=True, checkpoint=None, debug_summaries=False, name='MuZero')[source]#

Bases: alf.algorithms.off_policy_algorithm.OffPolicyAlgorithm

MuZero algorithm. MuZero is described in the paper: Schrittwieser et al. Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model.

This is a wrapper that combines two sub algorithm components:

  1. A Muzero-style representation learner.

    The representation learner employs a MCTSModel to train a translation from a raw observation to its latent representation. The model is also used to predict the reward, values, policy, etc which will be used in the MCTS algorithm.

  2. A MCTS-based policy algorithm. It will perform tree search using the model provided by the representation learner to give the final policy on each predict and rollout step.

NOTE: Currently, the MCTS-based policy algorithm is assumed to NOT have any learnable parameters. This means that training will only update the parameters of the underlying model in the representation learner, and training related hooks for example train_step() and preprocess_experience() will delegate directly to their counterparts in the representation learner. This behavior can be changed if needed in the future.
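
The delegation pattern described in the note can be sketched with toy classes. Everything below is hypothetical: the classes are stand-ins written for this illustration, and plain dictionaries replace ALF's AlgStep.

```python
class ToyRepresentationLearner:
    """Stand-in for the representation learner; owns all learnable state."""

    def __init__(self):
        self.train_calls = 0

    def train_step(self, exp, state, rollout_info):
        self.train_calls += 1
        return {"output": None, "state": state, "info": "repr_loss_info"}


class ToyMCTS:
    """Stand-in for the MCTS policy; it has no learnable parameters."""

    def predict_step(self, time_step, state):
        return {"output": "action", "state": state, "info": ()}


class ToyMuzeroWrapper:
    """Combines the two components in the way the note describes."""

    def __init__(self, repr_learner, mcts):
        self._repr_learner = repr_learner
        self._mcts = mcts

    def predict_step(self, time_step, state):
        # Acting goes through tree search.
        return self._mcts.predict_step(time_step, state)

    def train_step(self, exp, state, rollout_info):
        # Training hooks are forwarded unchanged to the representation learner.
        return self._repr_learner.train_step(exp, state, rollout_info)


learner = ToyRepresentationLearner()
wrapper = ToyMuzeroWrapper(learner, ToyMCTS())
train_result = wrapper.train_step(exp="batch", state=(), rollout_info=None)
predict_result = wrapper.predict_step(time_step="obs", state=())
```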

Parameters
  • observation_spec (TensorSpec) – representing the observations.

  • action_spec (BoundedTensorSpec) – representing the actions.

  • representation_learner_ctor (Callable[…, MuzeroRepresentationImpl]) – It will be called to construct a MuZero-style representation learner. It is expected to be called as representation_learner_ctor(observation_spec=?, action_spec=?, reward_spec=?, discount=?, reward_transformer=?, enable_amp=?, config=?, debug_summaries=?, name=?).

  • mcts_algorithm_ctor (Callable[…, MCTSAlgorithm]) – will be called as mcts_algorithm_ctor(observation_spec=?, action_spec=?, discount=?, debug_summaries=?, name=?) to construct an MCTSAlgorithm instance. The constructed MCTS algorithm is assumed to have no learnable parameters. It also relies on the model from the representation learner to run MCTS.

  • reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).

  • reward_transformer (Callable|None) – if provided, will be used to transform reward.

  • config (Optional[TrainerConfig]) – The trainer config that will eventually be assigned to self._config.

  • enable_amp (bool) – whether to use automatic mixed precision for inference. This usually makes the algorithm run faster. However, the result may be different (most likely due to random fluctuation). Note that rollout_step is exempted from using AMP.

  • checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.

  • debug_summaries (bool) –

  • name (str) –

after_update(root_inputs, info)[source]#

Do things after completing one gradient update (i.e. update_with_gradient()). This function can be used for post-processing following one minibatch update, such as copying a training model to a target model in SAC, DQN, etc.

Parameters
  • root_inputs (nest) – temporally batched inputs for the rollout_step() of the root algorithm collected during unroll().

  • info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() for on-policy training or train_step() for off-policy training.

calc_loss(info)[source]#

Calculate the loss at each step for each sample.

Parameters

info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).

Returns

loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).

Return type

LossInfo

predict_step(time_step, state)[source]#

Predict for one step of observation.

This is only used for evaluation, so it only needs to perform the computations for generating the action distribution.

Parameters
  • time_step (TimeStep) – Current observation and other inputs for computing action.

  • state (nested Tensor) – should be consistent with predict_state_spec

Returns

  • output (nested Tensor): should be consistent with action_spec.

  • state (nested Tensor): should be consistent with predict_state_spec.

Return type

AlgStep

preprocess_experience(root_inputs, rollout_info, batch_info)[source]#

This function is called on the experiences obtained from a replay buffer. An example usage of this function is to calculate advantages and returns in PPOAlgorithm.

The shapes of tensors in experience are assumed to be \((B, T, ...)\).

Parameters
  • root_inputs (nest) – input for rollout_step() of the root algorithm. This is from the replay buffer. Note that this is not the same as the input of rollout_step() of self unless self is the root algorithm.

  • rollout_info (nested Tensor) – AlgStep.info from rollout_step() for this algorithm.

  • batch_info (BatchInfo) – information about this batch of data

Returns

  • processed root_inputs

  • processed rollout_info

Return type

tuple

rollout_step(time_step, state)[source]#

Rollout for one step of inputs.

It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for calc_loss(). For off-policy training, it needs to generate necessary information for train_step().

Parameters
  • inputs (nested Tensor) – inputs for prediction.

  • state (nested Tensor) – network state (for RNN).

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match rollout_state_spec.

  • info (nested Tensor): For on-policy training it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored in the replay buffer and retrieved for train_step() as rollout_info.

Return type

AlgStep

set_path(path)[source]#

Set the path from the root algorithm to this algorithm.

See AlgorithmInterface.path for description about path. This function is called by the trainer before training starts. It needs to be implemented if the algorithm contains some other sub-algorithms.

If an algorithm does not have any sub-algorithm or its sub-algorithm does not need to access the root replay buffer directly, it does not implement this function.

train_step(exp, state, rollout_info)[source]#

Perform one step of training computation.

It is called to calculate output for every time step for a batch of experience from replay buffer. It also needs to generate necessary information for calc_loss().

Parameters
  • inputs (nested Tensor) – inputs for train.

  • state (nested Tensor) – consistent with train_state_spec.

  • rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match train_state_spec.

  • info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.

Return type

AlgStep

training: bool#

alf.algorithms.muzero_representation_learner#

MuZero algorithm.

class LinearTdStepFunc(max_bootstrap_age, min_td_steps=1)[source]#

Bases: object

Linearly decrease td steps from max_td_steps to min_td_steps based on the age of a sample.

If the age of a sample is more than max_bootstrap_age, its td steps will be min_td_steps. This is the “dynamic horizon” trick described in the paper “Mastering Atari Games with Limited Data”.
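
A linear schedule of this kind can be sketched as follows. The function is a hypothetical scalar version for illustration: the argument names mirror the description above, and rounding to the nearest integer is an assumption rather than the documented behavior.

```python
def linear_td_steps(age, max_td_steps, max_bootstrap_age, min_td_steps=1):
    """Interpolate td steps from max_td_steps (age 0) down to min_td_steps."""
    # Fraction of the way to max_bootstrap_age, clipped to [0, 1].
    fraction = min(max(age / max_bootstrap_age, 0.0), 1.0)
    steps = max_td_steps - (max_td_steps - min_td_steps) * fraction
    # Assumption: round to the nearest whole number of steps.
    return int(round(steps))


schedule = [linear_td_steps(age, max_td_steps=5, max_bootstrap_age=100)
            for age in (0, 50, 100, 250)]
```

Older samples were generated by an older policy, so bootstrapping over fewer steps limits how much of that stale trajectory enters the value target.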

class MuzeroInfo(action, value, target, loss)#

Bases: tuple

Create new instance of MuzeroInfo(action, value, target, loss)

action#

Alias for field number 0

loss#

Alias for field number 3

target#

Alias for field number 2

value#

Alias for field number 1

class MuzeroRepresentationImpl(observation_spec, action_spec, model_ctor, num_unroll_steps, td_steps, discount, reward_spec=TensorSpec(shape=(), dtype=torch.float32), recurrent_gradient_scaling_factor=0.5, reward_transformer=None, calculate_priority=None, train_reward_function=True, train_game_over_function=True, train_repr_prediction=False, train_policy=True, reanalyze_algorithm_ctor=None, reanalyze_ratio=0.0, reanalyze_td_steps=5, reanalyze_td_steps_func=None, reanalyze_batch_size=None, full_reanalyze=False, priority_func="lambda loss_info: loss_info.extra['value'].sqrt().sum(dim=0)", data_transformer_ctor=None, data_augmenter=None, target_update_tau=1.0, target_update_period=1000, config=None, enable_amp=True, random_action_after_episode_end=False, optimizer=None, checkpoint=None, debug_summaries=False, name='MuzeroRepresentationImpl')[source]#

Bases: alf.algorithms.off_policy_algorithm.OffPolicyAlgorithm

MuZero-style Representation Learner.

MuZero is described in the paper: Schrittwieser et al. Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model.

The pseudocode can be downloaded from https://arxiv.org/src/1911.08265v2/anc/pseudocode.py

This representation learner trains the underlying MCTSModel to

  1. Most importantly, produce a latent representation from an observation

  2. Predict the next latent representation given the current latent + an action

  3. Predict various targets (e.g. reward, value)

Among the above, 1) can be used as the representation in combination with another RL algorithm; 2) and 3) can be used in policy improvements that require a predictive model (e.g. Monte Carlo Tree Search).

The model is trained with supervision on target prediction in 2) and 3). Some of the targets may be computed with the reanalyze component. Please refer to the original MuZero paper and the following paper for details.

Online and Offline Reinforcement Learning by Planning with a Learned Model.

Parameters
  • observation_spec (TensorSpec) – representing the observations.

  • action_spec (BoundedTensorSpec) – representing the actions.

  • model_ctor (Callable) – will be called as model_ctor(observation_spec=?, action_spec=?, debug_summaries=?) to construct the model. The model should follow the interface alf.algorithms.mcts_models.MCTSModel.

  • num_unroll_steps (int) – steps for unrolling the model during training.

  • td_steps (int) – bootstrap this many steps into the future for calculating the discounted return. -1 means to bootstrap to the end of the game, which can only be used for environments whose rewards are zero except for the last step, as the current implementation only uses the reward at the last step to calculate the return.

  • reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).

  • recurrent_gradient_scaling_factor (float) – the gradient that goes through model.recurrent_inference is scaled by this factor. This is suggested in Appendix G of the MuZero paper.

  • reward_transformer (Callable|None) – if provided, will be used to transform reward.

  • calculate_priority (bool) – whether to calculate priority. If not provided, it will be the same as TrainerConfig.priority_replay. This is only useful if priority replay is enabled.

  • train_reward_function (bool) – whether to train the reward function. If False, reward should only be given at the last step of an episode.

  • train_game_over_function (bool) – whether to train the game-over function.

  • train_repr_prediction (bool) – whether to train to predict future latent representation.

  • train_policy (bool) – whether to train a policy. Note that training the policy is REQUIRED when the model is used in an MCTS algorithm.

  • reanalyze_algorithm_ctor (Callable) – will be called as reanalyze_algorithm_ctor(observation_spec=?, action_spec=?, discount=?, debug_summaries=?, name=?) to construct an Algorithm instance for reanalyze. It can also optionally accept an additional argument ‘model’. If so, a model constructed using model_ctor will be passed to the constructor.

  • reanalyze_ratio (float) – a float in [0., 1.] giving the fraction of the data retrieved from the replay buffer that is reanalyzed. Reanalyzing means using a recent model to calculate the value and policy targets.

  • reanalyze_td_steps (int) – the n for the n-step return for reanalyzing.

  • reanalyze_td_steps_func (Callable) – If provided, will be called as reanalyze_td_steps_func(sample_age, reanalyze_td_steps, current_max_age) to calculate the td_steps in reanalyze. sample_age is a Tensor whose elements are between 0 and 1 indicating the age of each sample. The age of the latest sample is 0. The age of the sample collected at the beginning of the training is current_max_age.

  • reanalyze_batch_size (int|None) – reanalyzing all the data for one training iteration at once may use too much memory. If so, provide a number here so that the data is reanalyzed in several batches.

  • full_reanalyze (bool) – if False, during reanalyze only the first num_unroll_steps+1 steps are calculated using MCTS, and the next reanalyze_td_steps are calculated from the model directly. If True, all are calculated using MCTS.

  • priority_func (Union[Callable, str]) – the function for calculating priority. If it is a str, eval(priority_func) will be called first to convert it to a Callable. It is called as priority_func(loss_info), where loss_info is the temporally stacked LossInfo structure returned from MCTSModel.calc_loss().

  • data_transformer_ctor (None|Callable|list[Callable]) – if provided, will be used to construct the data transformer. Otherwise, the one provided in config will be used.

  • data_augmenter (Optional[Callable]) – If provided, will be called to perform data augmentation as data_augmenter(observation) for training observations, where the shape of observation is [B, T, …] if train_repr_prediction is False, and [B, T*(R+1), …] if train_repr_prediction is True. B is mini-batch size, T is mini-batch length and R is num_unroll_steps.

  • target_update_tau (float) – Factor for soft update of the target networks used for reanalyzing.

  • target_update_period (int) – Period for soft update of the target networks used for reanalyzing.

  • config (Optional[TrainerConfig]) – The trainer config that will eventually be assigned to self._config.

  • enable_amp (bool) – whether to use automatic mixed precision for inference. This usually makes the algorithm run faster. However, the result may be different (most likely due to random fluctuation).

  • random_action_after_episode_end – If False, the actions used to predict future states after the end of an episode will be the same as the last action. If True, they will be uniformly sampled.

  • optimizer (Optional[Optimizer]) – the optimizer for independently training the representation.

  • checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.

  • debug_summaries (bool) –

  • name (str) –
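
The td_steps and discount arguments describe an n-step bootstrapped return. The sketch below shows that target in plain Python under the usual textbook definition. The function name n_step_return and the list-based layout of rewards and values are illustrative assumptions and not part of ALF, and the td_steps=-1 case is not handled.

```python
def n_step_return(rewards, values, t, td_steps, discount):
    """Discounted n-step return starting at step t.

    Assumed (hypothetical) layout: rewards[k] is the reward received after
    acting at step k, and values[k] is the value estimate at step k.
    """
    end = min(t + td_steps, len(rewards))
    ret = 0.0
    for k in range(t, end):
        ret += (discount ** (k - t)) * rewards[k]
    # Bootstrap from the value estimate unless we ran past the stored values.
    if end < len(values):
        ret += (discount ** (end - t)) * values[end]
    return ret


# Two reward steps plus a bootstrapped value at step 2.
example = n_step_return([1.0, 1.0, 1.0, 1.0], [0.0, 0.0, 10.0, 0.0, 0.0],
                        t=0, td_steps=2, discount=0.5)
```

Here the example evaluates 1 + 0.5 * 1 + 0.25 * 10.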

after_update(root_inputs, info)[source]#

Do things after completing one gradient update (i.e. update_with_gradient()). This function can be used for post-processing after one minibatch update, such as copying a training model to a target model in SAC, DQN, etc.

Parameters
  • root_inputs (nest) – temporally batched inputs for the rollout_step() of the root algorithm collected during unroll().

  • info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() for on-policy training or train_step() for off-policy training.
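
As a rough illustration of the kind of target-model copy that after_update() is meant for, the following sketch applies a periodic soft update controlled by a tau and a period, mirroring the target_update_tau and target_update_period arguments (defaults 1.0 and 1000 in the constructor signature). The helper names and the flat list-of-floats parameters are assumptions for illustration; how ALF counts steps internally is not shown in this page.

```python
def soft_update(target, source, tau):
    """Move each target parameter a fraction tau toward the source."""
    return [(1.0 - tau) * t + tau * s for t, s in zip(target, source)]


def maybe_update_target(step, target, source, tau=1.0, period=1000):
    """Apply the soft update only on every `period`-th gradient update."""
    if step % period == 0:
        return soft_update(target, source, tau)
    return target


half_way = soft_update([0.0, 2.0], [1.0, 4.0], tau=0.5)
```

With tau=1.0 the soft update reduces to a plain copy of the training parameters.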

calc_loss(info)[source]#

Calculate the loss at each step for each sample.

Parameters

info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).

Returns

loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).

Return type

LossInfo

property model#
predict_step(time_step, state)[source]#

Predict for one step of observation.

This is only used for evaluation, so it only needs to perform the computations for generating the action distribution.

Parameters
  • time_step (TimeStep) – Current observation and other inputs for computing action.

  • state (nested Tensor) – should be consistent with predict_state_spec

Returns

  • output (nested Tensor): should be consistent with action_spec.

  • state (nested Tensor): should be consistent with predict_state_spec.

Return type

AlgStep

preprocess_experience(root_inputs, rollout_info, batch_info)[source]#

Fill rollout_info with MuzeroInfo.

In particular, the training targets for representation learning are computed here with reanalyze and/or bootstrapping.

Note that the shape of experience is [B, T, …], where B is the batch size and T is the mini-batch length.

rollout_step(time_step, state)[source]#

Rollout for one step of inputs.

It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for calc_loss(). For off-policy training, it needs to generate necessary information for train_step().

Parameters
  • inputs (nested Tensor) – inputs for prediction.

  • state (nested Tensor) – network state (for RNN).

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match rollout_state_spec.

  • info (nested Tensor): For on-policy training it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored into the replay buffer and later retrieved for train_step() as rollout_info.

Return type

AlgStep

train_step(exp, state, rollout_info)[source]#

Perform one step of training computation.

It is called to calculate output for every time step for a batch of experience from replay buffer. It also needs to generate necessary information for calc_loss().

Parameters
  • inputs (nested Tensor) – inputs for train.

  • state (nested Tensor) – consistent with train_state_spec.

  • rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match train_state_spec.

  • info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.

Return type

AlgStep

training: bool#
class MuzeroRepresentationLearner(observation_spec, action_spec, config, training_options=None, reward_spec=TensorSpec(shape=(), dtype=torch.float32), impl_cls=<class 'alf.algorithms.muzero_representation_learner.MuzeroRepresentationImpl'>, debug_summaries=False, name='MuZeroRepresentationLearner')[source]#

Bases: alf.algorithms.off_policy_algorithm.OffPolicyAlgorithm

Learn representation following the MuZero style.

This is a thin wrapper over the MuzeroRepresentationImpl, so as to make it possible to work in combination with an RL algorithm (within Agent).

Construct a MuzeroRepresentationLearner.

Parameters
  • observation_spec (TensorSpec) – representing the observations.

  • action_spec (BoundedTensorSpec) – representing the actions.

  • config (TrainerConfig) – The trainer config, usually passed down from Agent.

  • training_options (Optional[MuzeroRepresentationTrainingOptions]) – The representation learner trains its underlying model independently of the RL algorithm, and therefore needs a separate set of parameters for the training options. See MuzeroRepresentationTrainingOptions for details. If not set, training will not happen.

  • reward_spec – a rank-1 or rank-0 tensor spec representing the reward(s). Will be passed down to the underlying wrapped MuzeroRepresentationImpl.

  • impl_cls (Callable[…, MuzeroRepresentationImpl]) – a callable to construct the underlying MuzeroRepresentationImpl. It will be called as impl_cls( observation_spec=?, action_spec=?, reward_spec=?, config=?, debug_summaries=?).

  • debug_summaries (bool) –

  • name (str) –

after_train_iter(experience, info)[source]#

Do things after completing one training iteration (i.e. train_iter() that consists of one or multiple gradient updates). This function can be used for training additional modules that have their own training logic (e.g., on/off-policy, replay buffers, etc). These modules should be added to _trainable_attributes_to_ignore in the parent algorithm.

Other things might also be possible as long as they should be done once every training iteration.

This function serves the same purpose as after_update if there is always only one gradient update in each training iteration. Otherwise it is called less frequently than after_update.

Parameters
  • root_inputs (nest|None) – temporally batched inputs for the rollout_step() of the root algorithm collected during unroll(). In the case where no data is available from the rollout_step() (e.g. in an offline pre-training phase where the online interaction has not started yet) root_inputs will be None.

  • rollout_info (nest|None) – information collected from rollout_step() for this algorithm during unroll(). In the case where no data is available from the rollout_step() (e.g. in an offline pre-training phase where the online interaction has not started yet) rollout_info will be None.

after_update(root_inputs, info)[source]#

Do things after completing one gradient update (i.e. update_with_gradient()). This function can be used for post-processing after one minibatch update, such as copying a training model to a target model in SAC, DQN, etc.

Parameters
  • root_inputs (nest) – temporally batched inputs for the rollout_step() of the root algorithm collected during unroll().

  • info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() for on-policy training or train_step() for off-policy training.

calc_loss(info)[source]#

Calculate the loss at each step for each sample.

Parameters

info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).

Returns

loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).

Return type

LossInfo

property output_spec#

Access the spec of the produced representation.

This will be used as the observation spec for the subsequent RL algorithm.

predict_step(time_step, state)[source]#

Predict for one step of observation.

This is only used for evaluation, so it only needs to perform the computations for generating the action distribution.

Parameters
  • time_step (TimeStep) – Current observation and other inputs for computing action.

  • state (nested Tensor) – should be consistent with predict_state_spec

Returns

  • output (nested Tensor): should be consistent with action_spec.

  • state (nested Tensor): should be consistent with predict_state_spec.

Return type

AlgStep

preprocess_experience(root_inputs, rollout_info, batch_info)[source]#

This function is called on the experiences obtained from a replay buffer. An example usage of this function is to calculate advantages and returns in PPOAlgorithm.

The shapes of tensors in experience are assumed to be \((B, T, ...)\).

Parameters
  • root_inputs (nest) – input for rollout_step() of the root algorithm. This is from the replay buffer. Note that this is not the same as the input of rollout_step() of self unless self is the root algorithm.

  • rollout_info (nested Tensor) – AlgStep.info from rollout_step() for this algorithm.

  • batch_info (BatchInfo) – information about this batch of data

Returns

  • processed root_inputs

  • processed rollout_info

Return type

tuple

rollout_step(time_step, state)[source]#

Rollout for one step of inputs.

It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for calc_loss(). For off-policy training, it needs to generate necessary information for train_step().

Parameters
  • inputs (nested Tensor) – inputs for prediction.

  • state (nested Tensor) – network state (for RNN).

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match rollout_state_spec.

  • info (nested Tensor): For on-policy training it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored into the replay buffer and later retrieved for train_step() as rollout_info.

Return type

AlgStep

train_step(exp, state, rollout_info)[source]#

Perform one step of training computation.

It is called to calculate output for every time step for a batch of experience from replay buffer. It also needs to generate necessary information for calc_loss().

Parameters
  • inputs (nested Tensor) – inputs for train.

  • state (nested Tensor) – consistent with train_state_spec.

  • rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match train_state_spec.

  • info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.

Return type

AlgStep

training: bool#
class MuzeroRepresentationTrainingOptions(interval: int = 1, mini_batch_length: int = 1, mini_batch_size: int = 256, num_updates_per_train_iter: int = 10, replay_buffer_length: int = 100000, initial_collect_steps: int = 2000, priority_replay: bool = True, priority_replay_alpha: float = 1.2, priority_replay_beta: float = 0.0)[source]#

Bases: tuple

The options for training the Muzero Representation.

When used together with an RL algorithm, the representation training does not necessarily share the training options with the RL algorithm. Therefore, we use this class to hold the training options private to the Muzero representation learner.

Create new instance of MuzeroRepresentationTrainingOptions(interval, mini_batch_length, mini_batch_size, num_updates_per_train_iter, replay_buffer_length, initial_collect_steps, priority_replay, priority_replay_alpha, priority_replay_beta)

initial_collect_steps: int#

Alias for field number 5

interval: int#

Alias for field number 0

mini_batch_length: int#

Alias for field number 1

mini_batch_size: int#

Alias for field number 2

num_updates_per_train_iter: int#

Alias for field number 3

priority_replay: bool#

Alias for field number 6

priority_replay_alpha: float#

Alias for field number 7

priority_replay_beta: float#

Alias for field number 8

replay_buffer_length: int#

Alias for field number 4
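
Because the options class is a tuple with named fields, it behaves like a standard Python named tuple. The stand-in below copies the field names and defaults from the signature above to show how name-based and index-based access line up. TrainingOptionsSketch is a hypothetical class for illustration, not the ALF class itself.

```python
from typing import NamedTuple


class TrainingOptionsSketch(NamedTuple):
    """Hypothetical stand-in mirroring the documented fields and defaults."""
    interval: int = 1
    mini_batch_length: int = 1
    mini_batch_size: int = 256
    num_updates_per_train_iter: int = 10
    replay_buffer_length: int = 100000
    initial_collect_steps: int = 2000
    priority_replay: bool = True
    priority_replay_alpha: float = 1.2
    priority_replay_beta: float = 0.0


opts = TrainingOptionsSketch(mini_batch_size=128)
# A field name and its "field number" refer to the same value.
same_value = opts.replay_buffer_length == opts[4]
# Tuples are immutable, so _replace returns a modified copy.
longer = opts._replace(mini_batch_length=4)
```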

alf.algorithms.oac_algorithm#

Optimistic Actor Critic algorithm.

class OacAlgorithm(observation_spec, action_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), actor_network_cls=<class 'alf.networks.actor_distribution_networks.ActorDistributionNetwork'>, critic_network_cls=<class 'alf.networks.critic_networks.CriticNetwork'>, q_network_cls=<class 'alf.networks.q_networks.QNetwork'>, epsilon_greedy=None, use_entropy_reward=True, calculate_priority=False, num_critic_replicas=2, env=None, config=None, critic_loss_ctor=None, target_entropy=None, prior_actor_ctor=None, target_kld_per_dim=3.0, initial_log_alpha=0.0, explore=True, explore_delta=6.8, beta_ub=4.6, max_log_alpha=None, target_update_tau=0.05, target_update_period=1, dqda_clipping=None, actor_optimizer=None, critic_optimizer=None, alpha_optimizer=None, checkpoint=None, debug_summaries=False, name='OacAlgorithm')[source]#

Bases: alf.algorithms.sac_algorithm.SacAlgorithm

Optimistic Actor Critic algorithm, described in:

Ciosek et al "Better Exploration with Optimistic Actor-Critic", arXiv:1910.12807

Refer to SacAlgorithm for Args besides the following.

Parameters
  • explore (bool) – default is True for OAC algorithm, where only continuous action space is supported. When ‘explore’ is False, OAC is the same as SAC.

  • explore_delta (float) – parameter controlling the degree of optimism when shifting the mean of the target policy to obtain the mean of the exploration policy.

  • beta_ub (float) – parameter for computing the upper bound of the Q value: \(Q_{ub}(s,a) = \mu_Q(s,a) + \beta_{ub} \sigma_Q(s,a)\)
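
The upper bound controlled by beta_ub can be sketched numerically as follows. Taking the mean and the population standard deviation across the critic replicas is an assumption based on the two-critic form in the OAC paper; the function name q_upper_bound is hypothetical and the actual ALF implementation operates on tensors.

```python
from statistics import mean, pstdev


def q_upper_bound(q_values, beta_ub=4.6):
    """Optimistic estimate mu_Q + beta_ub * sigma_Q over critic replicas."""
    mu = mean(q_values)
    # Population std; for two critics this equals |q1 - q2| / 2.
    sigma = pstdev(q_values)
    return mu + beta_ub * sigma


optimistic = q_upper_bound([1.0, 3.0], beta_ub=2.0)
```

For the two estimates 1.0 and 3.0, the mean is 2.0 and the spread is 1.0, so a larger beta_ub gives a more optimistic value.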

rollout_step(inputs, state)[source]#

Same as SacAlgorithm.rollout_step except that explore is set to be self._explore when calling _predict_action.

training: bool#

alf.algorithms.off_policy_algorithm#

Base class for off policy algorithms.

class OffPolicyAlgorithm(observation_spec, action_spec, train_state_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), predict_state_spec=None, rollout_state_spec=None, is_on_policy=None, reward_weights=None, env=None, config=None, optimizer=None, checkpoint=None, is_eval=False, overwrite_policy_output=False, debug_summaries=False, name='RLAlgorithm')[source]#

Bases: alf.algorithms.rl_algorithm.RLAlgorithm

OffPolicyAlgorithm implements the basic off-policy training pipeline. The user needs to implement rollout_step() and train_step().

  • rollout_step() is called to generate actions at every environment step.

  • train_step() is called to generate necessary information for training.

The following is the pseudo code to illustrate how OffPolicyAlgorithm is used:

# (1) collect stage
for _ in range(steps_per_collection):
    # collect experience and store to replay buffer
    policy_step = rollout_step(time_step, policy_step.state)
    experience = make_experience(time_step, policy_step)
    store experience to replay buffer
    action = sample action from policy_step.action
    time_step = env.step(action)

# (2) train stage
for _ in range(training_steps_per_collection):
    # sample experiences and perform training
    experiences = sample batch from replay_buffer
    batched_train_info = []
    for experience in experiences:
        policy_step = train_step(experience, state)
        add policy_step.info to batched_train_info
    loss = calc_loss(experiences, batched_train_info)
    update_with_gradient(loss)
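
The two stages in the pseudo code can be made concrete with a toy, self-contained loop. Everything in this sketch is an assumption for illustration: the environment, the random behavior policy, the deque used as a replay buffer, and the single scalar parameter are hypothetical stand-ins and do not use any ALF API.

```python
import random
from collections import deque

random.seed(0)
replay_buffer = deque(maxlen=1000)   # stand-in for the replay buffer
weight = 0.0                         # single trainable parameter


def rollout_step(obs):
    """Toy behavior policy: pick a random action in {0, 1}."""
    return random.choice([0, 1])


def env_step(obs, action):
    """Toy environment: the observation is a counter, reward is 2 * action."""
    return obs + 1, 2.0 * action


# (1) collect stage: act in the environment and store experience.
obs = 0
for _ in range(200):
    action = rollout_step(obs)
    next_obs, reward = env_step(obs, action)
    replay_buffer.append((obs, action, reward))
    obs = next_obs

# (2) train stage: sample batches and fit reward ~= weight * action by SGD.
for _ in range(500):
    batch = random.sample(list(replay_buffer), 32)
    grad = sum(2.0 * (weight * a - r) * a for _, a, r in batch) / len(batch)
    weight -= 0.1 * grad
```

The data used for the gradient step comes from the buffer rather than from the latest rollout, which is the defining property of the off-policy pipeline.
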
Parameters
  • observation_spec (nested TensorSpec) – representing the observations.

  • action_spec (nested BoundedTensorSpec) – representing the actions.

  • train_state_spec (nested TensorSpec) – for the network state of train_step().

  • reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).

  • rollout_state_spec (nested TensorSpec) – for the network state of rollout_step(). If None, it’s assumed to be the same as train_state_spec.

  • predict_state_spec (nested TensorSpec) – for the network state of predict_step(). If None, it’s assumed to be the same as rollout_state_spec.

  • is_on_policy (None|bool) – whether the algorithm is on-policy or not.

  • reward_weights (None|list[float]) – this is only used when the reward is multidimensional. If not None, the weighted sum of rewards is the reward for training. Otherwise, the sum of rewards is used.

  • env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. Running multiple environments in parallel is crucial to on-policy algorithms as it increases the diversity of data and decreases temporal correlation. env only needs to be provided to the root Algorithm.

  • config (TrainerConfig) – config for training. config only needs to be provided to the algorithm which performs a training iteration by itself.

  • optimizer (torch.optim.Optimizer) – The default optimizer for training.

  • checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.

  • is_eval (bool) – True if this algorithm is used for evaluation only, during deployment. In this case, the algorithm does not need to create certain components such as value_network for ActorCriticAlgorithm, critic_networks for SacAlgorithm.

  • overwrite_policy_output (bool) – if True, overwrite the policy output with next_step.prev_action. This option can be used in some cases such as data collection.

  • debug_summaries (bool) – If True, debug summaries will be created.

  • name (str) – Name of this algorithm.
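
The checkpoint argument packs two pieces of information into one “prefix@path” string. A minimal way to separate them is sketched below, assuming the prefix itself contains no “@”. The helper name and the example prefix and path are hypothetical, and ALF's own parsing may differ.

```python
def split_checkpoint_spec(spec):
    """Split a 'prefix@path' string at the first '@'."""
    prefix, sep, path = spec.partition("@")
    if not sep:
        raise ValueError("expected a string of the form 'prefix@path'")
    return prefix, path


# Hypothetical prefix and path, for illustration only.
prefix, path = split_checkpoint_spec("some_prefix@/tmp/example_ckpt")
```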

property on_policy#

Whether this is on-policy training.

For on-policy training, train_step() will not be called. And info passed to calc_loss() is info collected from rollout_step().

For off-policy training, train_step() will be called with the experience from replay buffer. And info passed to calc_loss() is info collected from train_step.

An algorithm can override this to indicate whether it is an on-policy or off-policy algorithm. If an algorithm does not override this, it needs to support both on-policy and off-policy training, which means that rollout_step() and train_step() need to have the correct behavior for on-policy and off-policy training. It can check whether it is on-policy training by calling this function.

Returns

True if on-policy training, False if off-policy training, None if not set.

Return type

bool | None

training: bool#

alf.algorithms.on_policy_algorithm#

Base class for on-policy RL algorithms.

class OnPolicyAlgorithm(observation_spec, action_spec, train_state_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), predict_state_spec=None, rollout_state_spec=None, is_on_policy=None, reward_weights=None, env=None, config=None, optimizer=None, checkpoint=None, is_eval=False, overwrite_policy_output=False, debug_summaries=False, name='RLAlgorithm')[source]#

Bases: alf.algorithms.off_policy_algorithm.OffPolicyAlgorithm

OnPolicyAlgorithm implements the basic on-policy training procedure.

User needs to implement rollout_step() and calc_loss().

rollout_step() is called to generate actions for every environment step. It also needs to generate necessary information for training.

update_with_gradient() is called every unroll_length steps (specified in config.TrainerConfig). All the training information collected by every rollout_step() is batched and provided as arguments for calc_loss().

The following is the pseudo code to illustrate how OnPolicyAlgorithm can be used:

for _ in range(unroll_length):
    policy_step = rollout_step(time_step, policy_step.state)
    collect information from time_step into experience
    collect information from policy_step.info into train_info
    time_step = env.step(policy_step.output)
loss = calc_loss(experience, train_info)
update_with_gradient(loss)
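
To make the temporal batching in the pseudo code concrete, the toy loop below collects one info row per rollout step from a batch of environments, producing a time-major structure of shape (T, B) like the one calc_loss() expects. The policy, the environment, and all names are hypothetical stand-ins, not ALF code.

```python
unroll_length = 4   # T
batch_size = 3      # B, number of parallel environments


def rollout_step(observations):
    """Toy policy: returns one action and one info value per environment."""
    actions = [obs % 2 for obs in observations]
    info = [float(obs) for obs in observations]
    return actions, info


def env_step(observations, actions):
    """Toy batched environment: each observation advances by action + 1."""
    return [obs + act + 1 for obs, act in zip(observations, actions)]


observations = list(range(batch_size))
train_info = []
for _ in range(unroll_length):
    actions, info = rollout_step(observations)
    train_info.append(info)          # one row per time step
    observations = env_step(observations, actions)

# train_info is now time-major: T rows, each with B entries.
shape = (len(train_info), len(train_info[0]))
```
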
Parameters
  • observation_spec (nested TensorSpec) – representing the observations.

  • action_spec (nested BoundedTensorSpec) – representing the actions.

  • train_state_spec (nested TensorSpec) – for the network state of train_step().

  • reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).

  • rollout_state_spec (nested TensorSpec) – for the network state of rollout_step(). If None, it’s assumed to be the same as train_state_spec.

  • predict_state_spec (nested TensorSpec) – for the network state of predict_step(). If None, it’s assumed to be the same as rollout_state_spec.

  • is_on_policy (None|bool) – whether the algorithm is on-policy or not.

  • reward_weights (None|list[float]) – this is only used when the reward is multidimensional. If not None, the weighted sum of rewards is the reward for training. Otherwise, the sum of rewards is used.

  • env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. Running multiple environments in parallel is crucial to on-policy algorithms as it increases the diversity of data and decreases temporal correlation. env only needs to be provided to the root Algorithm.

  • config (TrainerConfig) – config for training. config only needs to be provided to the algorithm which performs a training iteration by itself.

  • optimizer (torch.optim.Optimizer) – The default optimizer for training.

  • checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.

  • is_eval (bool) – True if this algorithm is used for evaluation only, during deployment. In this case, the algorithm does not need to create certain components such as value_network for ActorCriticAlgorithm, critic_networks for SacAlgorithm.

  • overwrite_policy_output (bool) – if True, overwrite the policy output with next_step.prev_action. This option can be used in some cases such as data collection.

  • debug_summaries (bool) – If True, debug summaries will be created.

  • name (str) – Name of this algorithm.
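
The reward_weights rule for multidimensional rewards can be written out in a few lines. This is a plain-Python sketch of the documented behavior (weighted sum if weights are given, plain sum otherwise); the function name scalarize_reward is an assumption and ALF applies the equivalent operation to tensors.

```python
def scalarize_reward(reward, reward_weights=None):
    """Collapse a multidimensional reward into one training scalar."""
    if reward_weights is None:
        return sum(reward)
    if len(reward_weights) != len(reward):
        raise ValueError("reward_weights must match the reward dimension")
    return sum(w * r for w, r in zip(reward_weights, reward))


unweighted = scalarize_reward([1.0, 2.0, 3.0])
weighted = scalarize_reward([1.0, 2.0, 3.0], [1.0, 0.0, 0.5])
```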

property on_policy#

Whether this is on-policy training.

For on-policy training, train_step() will not be called. And info passed to calc_loss() is info collected from rollout_step().

For off-policy training, train_step() will be called with the experience from replay buffer. And info passed to calc_loss() is info collected from train_step.

An algorithm can override this to indicate whether it is an on-policy or off-policy algorithm. If an algorithm does not override this, it needs to support both on-policy and off-policy training, which means that rollout_step() and train_step() need to have the correct behavior for on-policy and off-policy training. It can check whether it is on-policy training by calling this function.

Returns

True if on-policy training, False if off-policy training, None if not set.

Return type

bool | None

train_step(inputs, state, rollout_info)[source]#

Perform one step of training computation.

It is called to calculate output for every time step for a batch of experience from replay buffer. It also needs to generate necessary information for calc_loss().

Parameters
  • inputs (nested Tensor) – inputs for train.

  • state (nested Tensor) – consistent with train_state_spec.

  • rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match train_state_spec.

  • info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.

Return type

AlgStep

training: bool#

alf.algorithms.one_step_loss#

class OneStepTDLoss(gamma=0.99, td_error_loss_fn=<function element_wise_squared_loss>, debug_summaries=False, name='OneStepTDLoss')[source]#

Bases: alf.algorithms.td_loss.TDLoss

Parameters
  • gamma (Union[float, List[float]]) – A discount factor for future rewards. For multi-dim reward, this can also be a list of discounts, each discount applies to a reward dim.

  • td_error_loss_fn (Callable) – A function for computing the TD errors loss. This function takes as input the target and the estimated Q values and returns the loss for each element of the batch.

  • debug_summaries (bool) – True if debug summaries should be created

  • name (str) – The name of this loss.
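
For a single transition, the one-step TD loss that gamma and td_error_loss_fn parameterize can be sketched as follows. The scalar inputs, the done flag, and the local squared-loss helper are assumptions for illustration; the ALF class works on batched tensors and its internals are not shown in this page.

```python
def element_wise_squared_loss(target, value):
    """Local stand-in for a squared-error loss on (target, estimate)."""
    return (target - value) ** 2


def one_step_td_loss(reward, next_value, value, gamma=0.99, done=False,
                     td_error_loss_fn=element_wise_squared_loss):
    """Loss between the one-step TD target and the current estimate."""
    # No bootstrapping past the end of an episode.
    bootstrap = 0.0 if done else gamma * next_value
    target = reward + bootstrap
    return td_error_loss_fn(target, value)


loss = one_step_td_loss(reward=1.0, next_value=2.0, value=0.5, gamma=0.5)
```

Here the target is 1.0 + 0.5 * 2.0 = 2.0, so the squared error against the estimate 0.5 is 2.25.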

training: bool#
class OneStepTDQRLoss(num_quantiles=50, gamma=0.99, td_error_loss_fn=<function huber_function>, sum_over_quantiles=False, debug_summaries=False, name='OneStepTDQRLoss')[source]#

Bases: alf.algorithms.td_loss.TDQRLoss

One step temporal difference quantile regression loss.

Parameters
  • num_quantiles (int) – the number of quantiles.

  • gamma (Union[float, List[float]]) – A discount factor for future rewards. For multi-dim reward, this can also be a list of discounts, each discount applies to a reward dim.

  • td_error_loss_fn (Callable) – A function for computing the TD errors loss. This function takes as input the target and the estimated Q values and returns the loss for each element of the batch.

  • sum_over_quantiles (bool) – If True, the quantile regression loss will be summed along the quantile dimension. Otherwise, it will be averaged along the quantile dimension instead. Default is False.

  • debug_summaries (bool) – True if debug summaries should be created

  • name (str) – The name of this loss.
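
The effect of sum_over_quantiles is easiest to see on a small quantile regression loss. The sketch below follows the common quantile Huber formulation with quantile midpoints and a Huber threshold of 1. Those choices, and the function names, are assumptions for illustration and may not match ALF's implementation term for term.

```python
def huber(u, kappa=1.0):
    """Huber function: quadratic near zero, linear in the tails."""
    a = abs(u)
    return 0.5 * u * u if a <= kappa else kappa * (a - 0.5 * kappa)


def quantile_huber_loss(pred_quantiles, target_quantiles,
                        sum_over_quantiles=False, kappa=1.0):
    """Quantile regression loss between predicted and target quantiles."""
    n = len(pred_quantiles)
    per_quantile = []
    for i, pred in enumerate(pred_quantiles):
        tau = (i + 0.5) / n   # quantile midpoint for the i-th output
        total = 0.0
        for target in target_quantiles:
            u = target - pred
            # Asymmetric weight: under- and over-estimates cost differently.
            weight = abs(tau - (1.0 if u < 0 else 0.0))
            total += weight * huber(u, kappa)
        per_quantile.append(total / len(target_quantiles))
    if sum_over_quantiles:
        return sum(per_quantile)
    return sum(per_quantile) / n


summed = quantile_huber_loss([0.0, 0.0], [1.0, 1.0], sum_over_quantiles=True)
averaged = quantile_huber_loss([0.0, 0.0], [1.0, 1.0])
```

The two reductions differ only by a factor equal to the number of quantiles, which matters mainly for the effective learning rate.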

training: bool#

alf.algorithms.particle_vi_algorithm#

A generic generator.

class ParVIAlgorithm(particle_dim, num_particles=10, entropy_regularization=1.0, par_vi='gfsf', critic_input_dim=None, critic_hidden_layers=(100, 100), critic_l2_weight=10.0, critic_iter_num=2, critic_use_bn=True, critic_optimizer=None, optimizer=None, debug_summaries=False, name='ParVIAlgorithm')[source]#

Bases: alf.algorithms.algorithm.Algorithm

ParVIAlgorithm maintains a set of particles that keep chasing some target distribution. Two particle-based variational inference (par_vi) methods are implemented:

  1. Stein Variational Gradient Descent (SVGD):

    Liu, Qiang, and Dilin Wang. “Stein Variational Gradient Descent: A General Purpose Bayesian Inference Algorithm.” NIPS. 2016.

  2. Wasserstein Particle-based VI with Smooth Functions (GFSF):

    Liu, Chang, et al. “Understanding and accelerating particle-based variational inference.” International Conference on Machine Learning. 2019.

Create a ParVIAlgorithm.

Parameters
  • particle_dim (int) – dimension of the particles.

  • num_particles (int) – number of particles.

  • entropy_regularization (float) – weight of the repulsive term in par_vi.

  • par_vi (string) –

    par_vi methods. Options are [svgd, gfsf, None]:

    • svgd: empirical expectation of SVGD is evaluated by reusing the same batch of particles.

    • gfsf: Wasserstein gradient flow with smoothed functions. It involves a kernel matrix inversion, so it is computationally more expensive, but in some cases the convergence seems faster than svgd approaches.

  • critic_input_dim (int) – dimension of critic input, used for minmax.

  • critic_hidden_layers (tuple) – sizes of hidden layers of the critic, used for minmax.

  • critic_l2_weight (float) – weight of L2 regularization in training the critic, used for minmax.

  • critic_iter_num (int) – number of critic updates for each generator train_step, used for minmax.

  • critic_use_bn (bool) – whether to use batch norm for each layer of the critic, used for minmax.

  • critic_optimizer (torch.optim.Optimizer) – Optimizer for training the critic, used for minmax.

  • optimizer (torch.optim.Optimizer) – (optional) optimizer for training

  • name (str) – name of this generator

property num_particles#
property particles#
predict_step(state=None)[source]#

Generate outputs given inputs.

Parameters

state – not used

Returns

  • output (Tensor): shape is [num_particles, output_dim]

  • state: not used

Return type

AlgStep

train_step(loss_func, transform_func=None, entropy_regularization=None, loss_mask=None, state=None)[source]#
Parameters
  • loss_func (Callable) – loss_func(loss_inputs) returns a Tensor or a namedtuple of tensors with field loss, which is a Tensor of shape [num_particles], a loss term for optimizing the generator.

  • transform_func (Callable) –

    transform function on particles. Used in function-value-based par_vi, where each particle represents the parameters of a neural network function. It is called as transform_func(particles), which returns the following:

    • outputs: outputs of the network parameterized by particles, evaluated on a predefined training batch.

    • extra_outputs: outputs of network parameterized by particles evaluated on additional sampled data.

  • entropy_regularization (float) – weight of the repulsive term in par_vi. If None, use self._entropy_regularization.

  • loss_mask (Tensor) – mask indicating which samples are valid for loss propagation.

  • state – not used

Returns

  • output (Tensor): shape is [num_particles, dim]

  • state: not used

  • info (LossInfo): loss

Return type

AlgStep

training: bool#

alf.algorithms.planning_algorithm#

class CEMPlanAlgorithm(feature_spec, action_spec, population_size, planning_horizon, reward_spec=TensorSpec(shape=(), dtype=torch.float32), elite_size=50, max_iter_num=5, epsilon=0.01, tau=0.9, scalar_var=None, upper_bound=None, lower_bound=None, name='CEMPlanAlgorithm')[source]#

Bases: alf.algorithms.planning_algorithm.RandomShootingAlgorithm

CEM-based planning method.

This method uses the Cross-Entropy Method (CEM) to optimize an action trajectory by minimizing a given cost function. The optimized action trajectory is termed a ‘plan’, which can be used by other components such as an MPC-based controller. This has been used by some MBRL works such as Deep Reinforcement Learning in a Handful of Trials using Probabilistic Dynamics Models.

To speed up planning, the plan obtained at the previous time step is used, when possible, to initialize the mean of the plan distribution at the current time step, after proper shifting and padding.
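A minimal sketch of the shifting and padding idea, assuming the plan is a plain list with one entry per time step. The helper name shift_plan and the choice of padding value are hypothetical; how ALF pads the tail is not stated here.

```python
def shift_plan(prev_plan, pad_value):
    """Drop the first (already executed) step and pad the tail.

    The result has the same length as prev_plan and can serve as the
    initial mean of the plan distribution at the next time step.
    """
    return prev_plan[1:] + [pad_value]


# Previous plan over a horizon of 4 steps (assumed toy values).
next_mean = shift_plan([0.1, 0.2, 0.3, 0.4], pad_value=0.0)
```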

Create a CEMPlanAlgorithm.

Parameters
  • population_size (int) – the size of the population for optimization

  • planning_horizon (int) – planning horizon in terms of time steps

  • reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).

  • elite_size (int) – the number of elites selected in each round

  • max_iter_num (int|Tensor) – the maximum number of CEM iterations

  • epsilon (float) – a minimum variance threshold. If the variance of the population falls below it, the CEM iteration will stop.

  • tau (float) –

    a value in (0, 1) for softly updating the population mean and variance:

    mean = (1 - tau) * mean + tau * new_mean
    var = (1 - tau) * var + tau * new_var
    

  • scalar_var (None|float) – the value that will be used to construct the initial diagonal covariance matrix of the multi-dimensional Gaussian used by the CEM optimizer. If value is None, 0.5 * (upper_bound - lower_bound) is used.

  • upper_bound (int) – upper bound for elements in solution; action_spec.maximum will be used if not specified

  • lower_bound (int) – lower bound for elements in solution; action_spec.minimum will be used if not specified
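How population_size, elite_size, max_iter_num, epsilon and tau interact can be sketched with a small, dependency-free CEM loop. This is an illustration under stated assumptions, not ALF's implementation: the function cem_minimize is hypothetical, it optimizes a flat vector rather than a batched action trajectory, it uses 0.5 * (upper - lower) as the initial variance, and it stops when the largest per-dimension variance falls below epsilon.

```python
import random


def cem_minimize(cost_fn, dim, lower, upper, population_size=100,
                 elite_size=10, max_iter_num=5, epsilon=0.01, tau=0.9,
                 seed=0):
    """Cross-Entropy Method over a diagonal Gaussian (sketch)."""
    rng = random.Random(seed)
    mean = [0.5 * (lower + upper)] * dim
    var = [0.5 * (upper - lower)] * dim
    for _ in range(max_iter_num):
        # Sample a population and clip every element to the bounds.
        population = [[
            min(upper, max(lower, rng.gauss(m, v ** 0.5)))
            for m, v in zip(mean, var)
        ] for _ in range(population_size)]
        # Keep the elite_size samples with the lowest cost.
        elites = sorted(population, key=cost_fn)[:elite_size]
        new_mean = [sum(e[d] for e in elites) / elite_size
                    for d in range(dim)]
        new_var = [
            sum((e[d] - new_mean[d]) ** 2 for e in elites) / elite_size
            for d in range(dim)
        ]
        # Soft update, as in the description of tau.
        mean = [(1 - tau) * m + tau * nm for m, nm in zip(mean, new_mean)]
        var = [(1 - tau) * v + tau * nv for v, nv in zip(var, new_var)]
        if max(var) < epsilon:
            break
    return mean


# Toy cost with its minimum at 0.3 in every dimension (assumed example).
best = cem_minimize(lambda x: sum((xi - 0.3) ** 2 for xi in x),
                    dim=3, lower=-1.0, upper=1.0, max_iter_num=20)
```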

predict_plan(time_step, state, epislon_greedy)[source]#

Compute the plan based on the provided observation and action.

Parameters
  • time_step (TimeStep) – input data for next step prediction

  • state (PlannerState) – input planner state

Returns

planned action for the given inputs

Return type

action

training: bool#
class PlanAlgorithm(feature_spec, action_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), planning_horizon=25, upper_bound=None, lower_bound=None, name='PlanningAlgorithm')[source]#

Bases: alf.algorithms.off_policy_algorithm.OffPolicyAlgorithm

Planning Module

This module plans for actions based on initial observation and specified reward and dynamics functions

Create a PlanningAlgorithm.

Parameters
  • reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).

  • planning_horizon (int) – planning horizon in terms of time steps

  • upper_bound (int) – upper bound for elements in solution; action_spec.maximum will be used if not specified

  • lower_bound (int) – lower bound for elements in solution; action_spec.minimum will be used if not specified

  • particles_per_replica (int) – number of particles used for each replica

predict_plan(time_step, state, epsilon_greedy)[source]#

Compute the plan based on the provided observation and action.

Parameters
  • time_step (TimeStep) – input data for next step prediction

  • state (PlannerState) – input planner state

Returns

planned action for the given inputs

Return type

action

set_action_sequence_cost_func(action_seq_cost_func)[source]#

Set a function for evaluating the action sequences for planning.

Parameters

action_seq_cost_func (Callable) – cost function to be used for planning. action_seq_cost_func takes the initial observation and action sequences of the shape [B, population, unroll_steps, action_dim] as input, and returns the accumulated cost along the unrolled trajectory, with the shape of [B, population].
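The shape contract of the cost function can be illustrated with nested lists standing in for tensors. The function below is a hypothetical toy: the scalar observation, the additive dynamics and the squared-state cost are assumptions chosen only to show how a [B, population, unroll_steps, action_dim] input reduces to a [B, population] output.

```python
def action_seq_cost(init_obs, action_seqs):
    """Accumulated cost per candidate sequence (toy sketch).

    init_obs: list of B scalar observations.
    action_seqs: nested list of shape
        [B][population][unroll_steps][action_dim].
    Returns a nested list of shape [B][population].
    """
    costs = []
    for obs, candidates in zip(init_obs, action_seqs):
        row = []
        for seq in candidates:
            state, total = obs, 0.0
            for action in seq:
                state = state + sum(action)  # toy dynamics (assumed)
                total += state ** 2          # toy per-step cost (assumed)
            row.append(total)
        costs.append(row)
    return costs


# B=2, population=3, unroll_steps=2, action_dim=1.
costs = action_seq_cost(
    [1.0, 0.0],
    [
        [[[-1.0], [0.0]], [[0.0], [0.0]], [[1.0], [1.0]]],
        [[[0.0], [0.0]], [[1.0], [-1.0]], [[2.0], [0.0]]],
    ],
)
```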

train_step(time_step, state, rollout_info=None)[source]#
Parameters
  • time_step (TimeStep) – input data for dynamics learning

  • state (PlannerState) – input planner state

Returns

  • output: empty tuple ()

  • state (PlannerState): updated planner state

  • info (PlannerInfo):

Return type

AlgStep

training: bool#
class PlannerInfo(planner)#

Bases: tuple

Create new instance of PlannerInfo(planner,)

planner#

Alias for field number 0

class PlannerState(prev_plan)#

Bases: tuple

Create new instance of PlannerState(prev_plan,)

prev_plan#

Alias for field number 0

class RandomShootingAlgorithm(feature_spec, action_spec, population_size, reward_spec=TensorSpec(shape=(), dtype=torch.float32), planning_horizon=25, upper_bound=None, lower_bound=None, name='RandomShootingAlgorithm')[source]#

Bases: alf.algorithms.planning_algorithm.PlanAlgorithm

Random Shooting-based planning method.

This method uses a Random Shooting approach to optimize an action trajectory by minimizing a given cost function. The optimized action trajectory is termed a ‘plan’, which can be used by other components such as an MPC-based controller. It has been used in Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning.
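A minimal sketch of random shooting, assuming uniform sampling within scalar bounds and a user-supplied cost function. The function random_shooting and its arguments are hypothetical and operate on a single, unbatched plan.

```python
import random


def random_shooting(cost_fn, planning_horizon, action_dim, lower, upper,
                    population_size, seed=0):
    """Sample candidate plans uniformly and keep the cheapest one."""
    rng = random.Random(seed)
    best_plan, best_cost = None, float("inf")
    for _ in range(population_size):
        plan = [[rng.uniform(lower, upper) for _ in range(action_dim)]
                for _ in range(planning_horizon)]
        cost = cost_fn(plan)
        if cost < best_cost:
            best_plan, best_cost = plan, cost
    return best_plan, best_cost


def energy(plan):
    """Toy cost (assumed): total squared action magnitude."""
    return sum(a ** 2 for step in plan for a in step)


plan, cost = random_shooting(energy, planning_horizon=5, action_dim=2,
                             lower=-1.0, upper=1.0, population_size=200)
```

A larger population_size can only lower (or keep) the best cost found for a given seed, at the price of more cost evaluations.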

Create a RandomShootingAlgorithm.

Parameters
  • population_size (int) – the size of the population for random shooting

  • reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).

  • planning_horizon (int) – planning horizon in terms of time steps

  • upper_bound (int) – upper bound for elements in solution; action_spec.maximum will be used if not specified

  • lower_bound (int) – lower bound for elements in solution; action_spec.minimum will be used if not specified

after_update(root_inputs, info)[source]#

Do things after completing one gradient update (i.e. update_with_gradient()). This function can be used for post-processing following one minibatch update, such as copying a training model to a target model in SAC, DQN, etc.

Parameters
  • root_inputs (nest) – temporally batched inputs for the rollout_step() of the root algorithm collected during unroll().

  • info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() for on-policy training or train_step() for off-policy training.

predict_plan(time_step, state, epsilon_greedy)[source]#

Compute the plan based on the provided observation and action.

Parameters
  • time_step (TimeStep) – input data for next step prediction

  • state (PlannerState) – input planner state

Returns

planned action for the given inputs

Return type

action

train_step(time_step, state, rollout_info=None)[source]#
Parameters
  • time_step (TimeStep) – input data for planning

  • state – state for planning (previous observation)

Returns

  • output: empty tuple ()

  • state (DynamicsState): state for training

  • info (DynamicsInfo):

Return type

AlgStep

training: bool#

alf.algorithms.ppg_algorithm#

Phasic Policy Gradient Algorithm.

class PPGAlgorithm(observation_spec, action_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), env=None, config=None, aux_options=PPGAuxOptions(enabled=True, interval=32, mini_batch_length=None, mini_batch_size=8, num_updates_per_train_iter=6), encoding_network_ctor=<class 'alf.networks.encoding_networks.EncodingNetwork'>, policy_optimizer=None, aux_optimizer=None, epsilon_greedy=None, checkpoint=None, debug_summaries=False, name='PPGAlgorithm')[source]#

Bases: alf.algorithms.off_policy_algorithm.OffPolicyAlgorithm

PPG Algorithm.

Implementation of the paper: https://arxiv.org/abs/2009.04416

PPG can be viewed as a variant of PPO, with two differences:

  1. It uses a special network structure (DisjointPolicyValueNetwork) that has an extra auxiliary value head in addition to the policy head and value head. In the current implementation, the auxiliary value head also tries to estimate the value function, similar to the (actual) value head.

  2. It does PPO updates in normal iterations. However, after every specified number of iterations, it performs auxiliary phase updates based on auxiliary phase losses (different from the PPO loss, see algorithms/ppg/ppg_aux_phase_loss.py for details). Auxiliary phase updates do not require new rollouts. Instead, they are performed on all of the experience collected since the last auxiliary phase update.
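The alternation between policy iterations and auxiliary phases can be sketched as a simple schedule. This is an illustration only: run_ppg_schedule and its string placeholders are hypothetical, and the only assumption taken from this page is that an auxiliary phase runs every interval iterations on the experience gathered since the previous one.

```python
def run_ppg_schedule(num_iterations, interval):
    """Return the order of policy and auxiliary phases (sketch)."""
    aux_buffer = []  # experience kept for the next auxiliary phase
    log = []
    for it in range(1, num_iterations + 1):
        aux_buffer.append(f"rollout_{it}")  # placeholder for experience
        log.append(("policy", it))          # normal PPO update
        if it % interval == 0:
            # Auxiliary phase: reuse everything collected since the last
            # one, without new rollouts, then start a fresh buffer.
            log.append(("aux", len(aux_buffer)))
            aux_buffer.clear()
    return log


schedule = run_ppg_schedule(num_iterations=4, interval=2)
```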

Parameters
  • observation_spec (nested TensorSpec) – representing the observations.

  • action_spec (nested BoundedTensorSpec) – representing the actions.

  • reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).

  • env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. env only needs to be provided to the root Algorithm. NOTE: env will default to None if PPGAlgorithm is run via Agent.

  • config (TrainerConfig) – config for training. config only needs to be provided to the algorithm which performs train_iter() by itself.

  • aux_options – Options that control the auxiliary phase training.

  • encoding_network_ctor (Callable[[TensorSpec], Network]) – Function to construct the encoding network from an input tensor spec. The constructed network will be called with forward(observation, state).

  • policy_optimizer (torch.optim.Optimizer) – The optimizer for training the policy phase of PPG.

  • aux_optimizer (torch.optim.Optimizer) – The optimizer for training the auxiliary phase of PPG.

  • epsilon_greedy (float) – a floating value in [0,1], representing the chance of action sampling instead of taking argmax. This can help prevent a dead loop in some deterministic environment like Breakout. Only used for evaluation. If None, its value is taken from config.epsilon_greedy and then alf.get_config_value(TrainerConfig.epsilon_greedy). It is used in predict_step() during evaluation.

  • checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.

  • debug_summaries (bool) – True if debug summaries should be created.

  • name (str) – Name of this algorithm.
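The meaning of epsilon_greedy described above (the chance of sampling an action instead of taking the argmax) can be sketched for a discrete action distribution. The helper epsilon_greedy_action is hypothetical and works on a plain list of probabilities rather than on ALF distributions.

```python
import random


def epsilon_greedy_action(probs, epsilon, rng):
    """With probability epsilon sample from probs, else take the argmax."""
    if rng.random() < epsilon:
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return max(range(len(probs)), key=probs.__getitem__)


rng = random.Random(0)
# epsilon=0.0 always returns the most likely action (index 1 here).
greedy = epsilon_greedy_action([0.1, 0.7, 0.2], epsilon=0.0, rng=rng)
```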

after_train_iter(experience, info)[source]#

Run auxiliary update if conditions are met

PPG requires running an auxiliary update after a certain number of policy update iterations. This is currently checked and performed in the after_train_iter() hook.

calc_loss(info)[source]#

Calculate the loss at each step for each sample.

Parameters

info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).

Returns

loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).

Return type

LossInfo

predict_step(inputs, state)[source]#

Predict for one step of observation.

This is only used for evaluation, so it only needs to perform the computations for generating the action distribution.

Parameters
  • time_step (TimeStep) – Current observation and other inputs for computing action.

  • state (nested Tensor) – should be consistent with predict_state_spec

Returns

  • output (nested Tensor): should be consistent with action_spec.

  • state (nested Tensor): should be consistent with predict_state_spec.

Return type

AlgStep

rollout_step(inputs, state)[source]#

Rollout step for PPG algorithm

Besides running the network prediction, it also stores the experience in the auxiliary replay buffer so that it can be consumed by the auxiliary phase updates.

Return type

AlgStep

train_step(inputs, state, plain_rollout_info)[source]#

Perform one step of training computation.

It is called to calculate output for every time step for a batch of experience from replay buffer. It also needs to generate necessary information for calc_loss().

Parameters
  • inputs (nested Tensor) – inputs for train.

  • state (nested Tensor) – consistent with train_state_spec.

  • rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match train_state_spec.

  • info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.

Return type

AlgStep

training: bool#

alf.algorithms.ppo_algorithm#

PPO algorithm.

class PPOAlgorithm(observation_spec, action_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), reward_weights=None, actor_network_ctor=<class 'alf.networks.actor_distribution_networks.ActorDistributionNetwork'>, value_network_ctor=<class 'alf.networks.value_networks.ValueNetwork'>, epsilon_greedy=None, env=None, config=None, loss=None, loss_class=<class 'alf.algorithms.actor_critic_loss.ActorCriticLoss'>, optimizer=None, checkpoint=None, debug_summaries=False, name='ActorCriticAlgorithm')[source]#

Bases: alf.algorithms.actor_critic_algorithm.ActorCriticAlgorithm

PPO Algorithm. Implement the simplified surrogate loss in equation (9) of “Proximal Policy Optimization Algorithms” https://arxiv.org/abs/1707.06347

It works with ppo_loss.PPOLoss. It should have the same behavior as baselines.ppo2.

Parameters
  • observation_spec (nested TensorSpec) – representing the observations.

  • action_spec (nested BoundedTensorSpec) – representing the actions.

  • reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).

  • reward_weights (None|list[float]) – this is only used when the reward is multidimensional. In that case, the weighted sum of the v values is used for training the actor if reward_weights is not None. Otherwise, the sum of the v values is used.

  • env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. env only needs to be provided to the root Algorithm.

  • epsilon_greedy (float) – a floating value in [0,1], representing the chance of action sampling instead of taking argmax. This can help prevent a dead loop in some deterministic environment like Breakout. Only used for evaluation. If None, its value is taken from config.epsilon_greedy and then alf.get_config_value(TrainerConfig.epsilon_greedy).

  • config (TrainerConfig) – config for training. config only needs to be provided to the algorithm which performs train_iter() by itself.

  • actor_network_ctor (Callable) – Function to construct the actor network. actor_network_ctor needs to accept input_tensor_spec and action_spec as its arguments and return an actor network. The constructed network will be called with forward(observation, state).

  • value_network_ctor (None | Callable) – Function to construct the value network. value_network_ctor needs to accept input_tensor_spec as its argument and return a value network. The constructed network will be called with forward(observation, state) and returns the value tensor for each observation given the observation and network state. Note that if the algorithm is constructed for evaluation or deployment only, value_network_ctor can be set to None and the value network will not be constructed at all.

  • loss (None|ActorCriticLoss) – an object for calculating loss. If None, a default loss of class loss_class will be used.

  • loss_class (type) – the class of the loss. The signature of its constructor: loss_class(debug_summaries)

  • optimizer (torch.optim.Optimizer) – The optimizer for training

  • checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.

  • debug_summaries (bool) – True if debug summaries should be created.

  • name (str) – Name of this algorithm.
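The “prefix@path” checkpoint format can be illustrated with a small parser. The helper split_checkpoint_spec and the example prefix and path are hypothetical. ALF performs its own parsing, and the exact syntax of the prefix is described under Algorithm, not here.

```python
def split_checkpoint_spec(spec):
    """Split a 'prefix@path' string at the first '@' (sketch)."""
    prefix, sep, path = spec.partition("@")
    if not sep:
        raise ValueError(f"expected 'prefix@path', got {spec!r}")
    return prefix, path


# Made-up prefix and path, for illustration only.
prefix, path = split_checkpoint_spec("alg.sub_alg@/tmp/train/ckpt-100")
```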

property on_policy#

Whether it is on-policy training.

For on-policy training, train_step() will not be called. And info passed to calc_loss() is info collected from rollout_step().

For off-policy training, train_step() will be called with the experience from replay buffer. And info passed to calc_loss() is info collected from train_step.

An algorithm can override this to indicate whether it is an on-policy or off-policy algorithm. If an algorithm does not override this, it needs to support both on-policy and off-policy training, which means that rollout_step() and train_step() need to have the correct behavior for on-policy and off-policy training. It can check whether it is on-policy training by calling this function.

Returns

True if on-policy training, False if off-policy training, None if not set.

Return type

bool | None

preprocess_experience(root_inputs, rollout_info, batch_info)[source]#

Compute advantages and put them into exp.rollout_info.
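A common way to compute such advantages is generalized advantage estimation (GAE), sketched below for a single trajectory. This is an assumption about the general technique, not a transcription of ALF's code: the function gae is hypothetical, episode boundaries and per-step discounts are ignored, and values is assumed to hold one more entry than rewards (the bootstrap value). The names gamma and td_lambda mirror the PPOLoss parameters.

```python
def gae(rewards, values, gamma=0.99, td_lambda=0.95):
    """Generalized advantage estimation for one trajectory (sketch).

    rewards has length T, values has length T + 1.
    Returns (advantages, returns), both of length T.
    """
    T = len(rewards)
    advantages = [0.0] * T
    running = 0.0
    for t in reversed(range(T)):
        # One-step TD error at time t.
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # Exponentially weighted sum of future TD errors.
        running = delta + gamma * td_lambda * running
        advantages[t] = running
    returns = [a + v for a, v in zip(advantages, values[:T])]
    return advantages, returns


# With gamma = td_lambda = 1 and zero values, the advantages reduce to
# the undiscounted reward-to-go.
adv, ret = gae([1.0, 2.0, 3.0], [0.0, 0.0, 0.0, 0.0],
               gamma=1.0, td_lambda=1.0)
```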

train_step(inputs, state, rollout_info)[source]#

Perform one step of training computation.

It is called to calculate output for every time step for a batch of experience from replay buffer. It also needs to generate necessary information for calc_loss().

Parameters
  • inputs (nested Tensor) – inputs for train.

  • state (nested Tensor) – consistent with train_state_spec.

  • rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match train_state_spec.

  • info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.

Return type

AlgStep

training: bool#
class PPOInfo(step_type, discount, reward, action, rollout_log_prob, rollout_action_distribution, returns, advantages, action_distribution, value, reward_weights)#

Bases: tuple

Create new instance of PPOInfo(step_type, discount, reward, action, rollout_log_prob, rollout_action_distribution, returns, advantages, action_distribution, value, reward_weights)

action#

Alias for field number 3

action_distribution#

Alias for field number 8

advantages#

Alias for field number 7

discount#

Alias for field number 1

returns#

Alias for field number 6

reward#

Alias for field number 2

reward_weights#

Alias for field number 10

rollout_action_distribution#

Alias for field number 5

rollout_log_prob#

Alias for field number 4

step_type#

Alias for field number 0

value#

Alias for field number 9
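The “Alias for field number N” entries follow from PPOInfo being a named tuple: each field can be read by name or by position. The stand-in below, PPOInfoSketch, is a local illustration with the same field order as listed above. It is not the ALF class.

```python
from collections import namedtuple

# Local stand-in with the same field order as PPOInfo (illustration only).
PPOInfoSketch = namedtuple("PPOInfoSketch", [
    "step_type", "discount", "reward", "action", "rollout_log_prob",
    "rollout_action_distribution", "returns", "advantages",
    "action_distribution", "value", "reward_weights"
])

# Fill every field with its own index to make the aliasing visible.
info = PPOInfoSketch(*range(11))
updated = info._replace(advantages=-1)  # named tuples are immutable
```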

alf.algorithms.ppo_loss#

Loss for PPO algorithm.

class PPOLoss(gamma=0.99, td_error_loss_fn=<function element_wise_squared_loss>, td_lambda=0.95, normalize_advantages=True, compute_advantages_internally=False, advantage_clip=None, entropy_regularization=None, td_loss_weight=1.0, importance_ratio_clipping=0.2, log_prob_clipping=0.0, check_numerics=False, debug_summaries=False, name='PPOLoss')[source]#

Bases: alf.algorithms.actor_critic_loss.ActorCriticLoss

PPO loss.

Implement the simplified surrogate loss in equation (9) of Proximal Policy Optimization Algorithms.

The total loss equals

(policy_gradient_loss      # (L^{CLIP} in equation (9))
+ td_loss_weight * td_loss # (L^{VF} in equation (9))
- entropy_regularization * entropy)

This loss works with PPOAlgorithm. The advantages and returns are pre-computed by PPOAlgorithm.preprocess_experience(). One known difference from baselines.ppo2 is that the value estimate is not clipped here, while baselines.ppo2 also clips the value if it deviates from the returns too much.
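The total loss above can be sketched for a single sample in plain Python. The function ppo_sample_loss is hypothetical: it assumes that td_loss and entropy are already computed, uses 0.0 as a stand-in for a disabled entropy term, and implements the clipped surrogate objective of the cited paper, with importance_ratio_clipping as the clipping epsilon.

```python
import math


def ppo_sample_loss(log_prob, rollout_log_prob, advantage, td_loss,
                    entropy, importance_ratio_clipping=0.2,
                    td_loss_weight=1.0, entropy_regularization=0.0):
    """Total PPO loss for one sample (sketch)."""
    eps = importance_ratio_clipping
    # Importance ratio between the current and the rollout policy.
    ratio = math.exp(log_prob - rollout_log_prob)
    clipped_ratio = min(max(ratio, 1.0 - eps), 1.0 + eps)
    # Negated clipped surrogate objective (L^{CLIP}).
    policy_gradient_loss = -min(ratio * advantage,
                                clipped_ratio * advantage)
    return (policy_gradient_loss
            + td_loss_weight * td_loss
            - entropy_regularization * entropy)


# Identical policies give ratio 1, so the policy term is -advantage.
loss = ppo_sample_loss(log_prob=-1.0, rollout_log_prob=-1.0,
                       advantage=2.0, td_loss=0.5, entropy=1.0,
                       entropy_regularization=0.1)
```

Clipping only takes effect when it makes the objective smaller: a ratio of 2 with a positive advantage is clipped to 1 + epsilon, while the same ratio with a negative advantage is left unclipped.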

Parameters
  • gamma (float|list[float]) – A discount factor for future rewards. For multi-dim reward, this can also be a list of discounts, each discount applies to a reward dim.

  • td_error_loss_fn (Callable) – A function for computing the TD errors loss. This function takes as input the target and the estimated Q values and returns the loss for each element of the batch.

  • td_lambda (float) – Lambda parameter for TD-lambda computation.

  • normalize_advantages (bool) – If True, normalize advantages to zero mean and unit variance within the batch for calculating the policy gradient.

  • compute_advantages_internally (bool) – Normally PPOLoss does not compute the advantage; it expects the info to carry the already-computed advantage. If this flag is set to True, PPOLoss will instead compute the advantage internally without depending on the input info, because loading a very large amount of experience into GPU memory to compute advantages may not always be possible.

  • advantage_clip (float) – If set, clip advantages to \([-x, x]\)

  • entropy_regularization (float) – Coefficient for entropy regularization loss term.

  • td_loss_weight (float) – the weight for the loss of td error.

  • importance_ratio_clipping (float) – Epsilon in clipped, surrogate PPO objective. See the cited paper for more detail.

  • log_prob_clipping (float) – If >0, clipping log probs to the range (-log_prob_clipping, log_prob_clipping) to prevent inf/NaN values.

  • check_numerics (bool) – If True, check for NaN/Inf values. For debugging only.

  • name (str) –

training: bool#

alf.algorithms.predictive_representation_learner#

PredictiveRepresentationLearner.

class PredictiveRepresentationLearner(observation_spec, action_spec, num_unroll_steps, decoder_ctor, encoding_net_ctor, dynamics_net_ctor, reward_spec=TensorSpec(shape=(), dtype=torch.float32), config=None, postprocessor=None, encoding_optimizer=None, dynamics_optimizer=None, postprocessor_optimizer=None, checkpoint=None, debug_summaries=False, name='PredictiveRepresentationLearner')[source]#

Bases: alf.algorithms.algorithm.Algorithm

Learn representation based on the prediction of future values.

PredictiveRepresentationLearner contains 3 Modules:

  • encoding_net: it is a Network that encodes the raw observation to a latent vector.

  • dynamics_net: it is a Network that generates the future latent states from the current latent state.

  • decoder: it is an Algorithm that decodes the target values from the latent state and calculates the loss.

Parameters
  • observation_spec (nested TensorSpec) – describing the observation.

  • action_spec (nested BoundedTensorSpec) – describing the action.

  • num_unroll_steps (int) – the number of future steps to predict. num_unroll_steps of 0 means no future prediction and hence dynamics_net_ctor is ignored.

  • decoder_ctor (Callable|[Callable]) – each individual constructor is called as decoder_ctor(observation) to construct the decoder algorithm. It should follow the Algorithm interface. In addition to the interface of Algorithm, it should also implement a member function get_target_fields(), which returns a nest of the names of target fields. See SimpleDecoder for an example of decoder.

  • encoding_net_ctor (Callable) – called as encoding_net_ctor(observation_spec) to construct the encoding Network. The network takes the raw observation as input and outputs the latent representation. encoding_net can be an RNN.

  • dynamics_net_ctor (Callable) – called as dynamics_net_ctor(action_spec) to construct the dynamics Network. It must be an RNN. The constructed network takes the action as input and outputs the future latent representation. If the state_spec of the dynamics net is exactly the same as the state_spec of the encoding net, the current state of the encoding net will be used as the initial state of the dynamics net. Otherwise, a linear projection will be used to convert the current latent representation to the initial state for the dynamics net.

  • reward_spec – NOT USED. Only present as representation learner interface to be used with Agent.

  • config (Optional[TrainerConfig]) – The trainer config. Present as representation learner interface to be used with Agent.

  • postprocessor (None|Callable) – If provided, will be called as postprocessor(latent) to get the actual representation, where latent is the output from encoding_net.

  • encoding_optimizer (Optimizer|None) – if provided, will be used to optimize the parameter for the encoding net.

  • dynamics_optimizer (Optimizer|None) – if provided, will be used to optimize the parameter for the dynamics net.

  • postprocessor_optimizer (Optimizer|None) – if provided, will be used to optimize the parameter for the postprocessor.

  • checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.

  • debug_summaries (bool) – whether to generate debug summaries

  • name (str) – name of this instance.

get_decoder(target_field)[source]#

Get the decoder which predicts the target specified by target_field.

Parameters

target_field – the name of the prediction quantity corresponding to the decoder

Returns

decoder (Algorithm)

property output_spec#
predict_multi_step(init_latent, actions, target_field=None, state=None)[source]#

Perform multi-step predictions based on the initial latent representation and action sequences.

Parameters
  • init_latent (Tensor) – the latent representation for the initial step of the prediction

  • actions (Tensor) – [B, unroll_steps, action_dim]

  • target_field (None|str|[str]) – the name or a list of names of the quantities to be predicted. It is used for selecting the corresponding decoder. If None, all the available decoders will be used for generating predictions.

  • state

Returns

predicted target of shape [B, unroll_steps + 1, d], where d is the dimension of the predicted target. The return is a list of Tensors when there are multiple targets to be predicted.

Return type

prediction (Tensor|[Tensor])
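The multi-step prediction can be sketched for a single batch element with scalar latents. The function unroll_predictions and the toy dynamics and decoder are hypothetical stand-ins for dynamics_net and decoder. The point of the sketch is only that the initial latent is decoded as well, which gives unroll_steps + 1 predictions.

```python
def unroll_predictions(init_latent, actions, dynamics_fn, decoder_fn):
    """Decode the initial latent and every unrolled latent (sketch)."""
    latent = init_latent
    predictions = [decoder_fn(latent)]
    for action in actions:
        latent = dynamics_fn(latent, action)  # next latent state
        predictions.append(decoder_fn(latent))
    return predictions


# Toy stand-ins (assumed): additive dynamics and a doubling decoder.
preds = unroll_predictions(
    init_latent=1.0,
    actions=[1.0, 2.0],
    dynamics_fn=lambda latent, action: latent + action,
    decoder_fn=lambda latent: 2.0 * latent,
)
```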

predict_step(inputs, state)[source]#

Predict for one step of inputs.

Parameters
  • inputs (nested Tensor) – inputs for prediction.

  • state (nested Tensor) – network state (for RNN).

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match predict_state_spec.

  • info (nest): information for analyzing the agent. In particular, if an element of the info is alf.summary.render.Image, it will be rendered during play. See alf/summary/render.py for detail.

Return type

AlgStep

preprocess_experience(root_inputs, rollout_info, batch_info)[source]#

Fill experience.rollout_info with PredictiveRepresentationLearnerInfo

Note that the shape of experience is [B, T, …].

The target is a Tensor (or a nest of Tensors) when there is only one decoder. When there are multiple decoders, the target is a list, and each of its elements is a Tensor (or a nest of Tensors), which is used as the target for the corresponding decoder.

rollout_step(inputs, state)[source]#

Rollout for one step of inputs.

It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for calc_loss(). For off-policy training, it needs to generate necessary information for train_step().

Parameters
  • inputs (nested Tensor) – inputs for prediction.

  • state (nested Tensor) – network state (for RNN).

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match rollout_state_spec.

  • info (nested Tensor): For on-policy training it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored into the replay buffer and retrieved for train_step() as rollout_info.

Return type

AlgStep

train_step(root_inputs, state, rollout_info)[source]#

Perform one step of training computation.

It is called to calculate output for every time step for a batch of experience from replay buffer. It also needs to generate necessary information for calc_loss().

Parameters
  • inputs (nested Tensor) – inputs for train.

  • state (nested Tensor) – consistent with train_state_spec.

  • rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match train_state_spec.

  • info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.

Return type

AlgStep

training: bool#
class PredictiveRepresentationLearnerInfo(action, mask, target)#

Bases: tuple

Create new instance of PredictiveRepresentationLearnerInfo(action, mask, target)

action#

Alias for field number 0

mask#

Alias for field number 1

target#

Alias for field number 2

class SimpleDecoder(input_tensor_spec, target_field, decoder_net_ctor, loss_ctor=functools.partial(<class 'torch.nn.modules.loss.SmoothL1Loss'>, reduction='none'), loss_weight=1.0, summarize_each_dimension=False, optimizer=None, normalize_target=False, append_target_field_to_name=True, debug_summaries=False, name='SimpleDecoder')[source]#

Bases: alf.algorithms.algorithm.Algorithm

A simple decoder with elementwise loss between the target and the predicted value.

It is used to predict the target value from the given representation. Its loss can be used to train the representation.

Parameters
  • input_tensor_spec (TensorSpec) – describing the input tensor.

  • target_field (str) – name of the field in the experience to be used as the decoding target.

  • decoder_net_ctor (Callable) – called as decoder_net_ctor(input_tensor_spec=input_tensor_spec) to construct an instance of Network for decoding. The network should take the latent representation as input and output the predicted value of the target.

  • loss_ctor (Callable) – loss function with signature loss(y_pred, y_true). Note that it should not reduce to a scalar. It should at least keep the batch dimension in the returned loss.

  • loss_weight (float) – weight for the loss.

  • optimizer (Optimizer|None) – if provided, it will be used to optimize the parameters of decoder_net.

  • normalize_target (bool) – whether to normalize target. Note that the effect of this is to change the loss. The predicted value itself is not normalized.

  • append_target_field_to_name (bool) – whether to append the target field to the name of the decoder. If True, the actual name used will be name.target_field

  • debug_summaries (bool) – whether to generate debug summaries

  • name (str) – name of this instance

calc_loss(target, predicted, mask=None)[source]#

Calculate the loss between target and predicted.

Parameters
  • target (Tensor) – target to be predicted. Its shape is [T, B, …]

  • predicted (Tensor) – predicted target. Its shape is [T, B, …]

  • mask (bool Tensor) – indicating which target should be predicted. Its shape is [T, B].

Returns

LossInfo
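
As a rough illustration of the shapes involved, the sketch below computes an elementwise Smooth L1 loss in plain Python (so that it runs without PyTorch or ALF), averages over the feature dimension so that a [T, B] loss remains, and zeroes the entries where mask is False. The helper names, the averaging over features, and the zero-fill for masked entries are assumptions made for illustration; the actual calc_loss() may reduce or weight the loss differently.

```python
def smooth_l1(pred, target, beta=1.0):
    """Elementwise Smooth L1 loss for two floats."""
    diff = abs(pred - target)
    if diff < beta:
        return 0.5 * diff * diff / beta
    return diff - 0.5 * beta


def masked_loss(target, predicted, mask=None):
    """Return a [T][B] nested list of per-sample losses.

    target, predicted: nested lists of shape [T][B][D].
    mask: optional nested list of bools of shape [T][B].
    """
    loss = []
    for t, (tgt_row, pred_row) in enumerate(zip(target, predicted)):
        row = []
        for b, (tgt, pred) in enumerate(zip(tgt_row, pred_row)):
            # Average over the feature dimension; keep the time and batch dims.
            per_sample = sum(smooth_l1(p, y) for p, y in zip(pred, tgt)) / len(tgt)
            if mask is not None and not mask[t][b]:
                per_sample = 0.0  # assumption: masked-out targets contribute nothing
            row.append(per_sample)
        loss.append(row)
    return loss


target = [[[0.0, 0.0], [1.0, 1.0]]]      # T=1, B=2, D=2
predicted = [[[0.5, 2.0], [1.0, 1.0]]]
mask = [[True, False]]
loss = masked_loss(target, predicted, mask)
```

The result keeps one loss value per (time step, batch element) pair, which is the property the loss_ctor description asks for.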

get_target_fields()[source]#
predict_step(repr, state=())[source]#

Predict for one step of inputs.

Parameters
  • inputs (nested Tensor) – inputs for prediction.

  • state (nested Tensor) – network state (for RNN).

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match predict_state_spec.

  • info (nest): information for analyzing the agent. In particular, if an element of the info is alf.summary.render.Image, it will be rendered during play. See alf/summary/render.py for details.

Return type

AlgStep

train_step(repr, state=())[source]#

Perform one step of training computation.

It is called to calculate output for every time step for a batch of experience from replay buffer. It also needs to generate necessary information for calc_loss().

Parameters
  • inputs (nested Tensor) – inputs for train.

  • state (nested Tensor) – consistent with train_state_spec.

  • rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match train_state_spec.

  • info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate the loss or override update_with_gradient() to do customized training.

Return type

AlgStep

training: bool#

alf.algorithms.prior_actor#

Prior action policies for KL regularized RL.

class SameActionPriorActor(observation_spec, action_spec, same_action_noise=0.1, same_action_prob=0.9, debug_summaries=False, name='SameActionPriorActor')[source]#

Bases: alf.algorithms.algorithm.Algorithm

SameActionPriorActor can be used as a prior for KLD regularized RL algorithms. It encodes the prior intuition that the next action should be the same as the previous action most of the time. More specifically, the distribution for each action dimension is a mixture of two components:

  1. a flat TruncatedNormal with loc equal to the median of the action range and scale equal to the action range.

  2. a sharp TruncatedNormal with loc equal to the previous action and scale equal to the action range multiplied by same_action_noise.

The mixture weight depends on step_type:

  1. If the step_type is FIRST, the mixture weight is [1.0, 0]

  2. Otherwise the mixture weight is [1-same_action_prob, same_action_prob]
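
The two components and the step-dependent weights described above can be written down for a single action dimension as follows. This is a plain-Python sketch of the description only; the function names are hypothetical and the defaults are copied from the constructor signature.

```python
def prior_components(low, high, prev_action, same_action_noise=0.1):
    """Return (loc, scale) for the flat and the sharp component of one action dim."""
    action_range = high - low
    flat = ((low + high) / 2.0, action_range)
    sharp = (prev_action, action_range * same_action_noise)
    return flat, sharp


def mixture_weights(is_first_step, same_action_prob=0.9):
    """Return [weight_flat, weight_sharp] for the current step."""
    if is_first_step:
        # FIRST step: all weight goes to the flat component.
        return [1.0, 0.0]
    return [1.0 - same_action_prob, same_action_prob]


flat, sharp = prior_components(low=-1.0, high=1.0, prev_action=0.3)
weights_first = mixture_weights(is_first_step=True)
weights_mid = mixture_weights(is_first_step=False)
```

With the default same_action_prob of 0.9, about 90% of the prior mass on non-first steps sits in a narrow bump around the previous action.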

Parameters
  • observation_spec (nested TensorSpec) – representing the observations.

  • action_spec (nested BoundedTensorSpec) – representing the actions.

  • same_action_noise (float) – the noise added to the previous action if the new action is the same as the previous action.

  • same_action_prob (float) – the probability that the next action is the same as the previous action.

  • debug_summaries (bool) – True if debug summaries should be created.

  • name (str) – The name of this algorithm.

predict_step(inputs, state)[source]#

Calculate the distribution of the next action.

Parameters

inputs (TimeStep) – time step structure

Returns

  • output (Distribution): the distribution of the action

  • state: ()

  • info: ()

Return type

AlgStep

rollout_step(inputs, state)[source]#

Rollout for one step of inputs.

It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for calc_loss(). For off-policy training, it needs to generate necessary information for train_step().

Parameters
  • inputs (nested Tensor) – inputs for prediction.

  • state (nested Tensor) – network state (for RNN).

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match rollout_state_spec.

  • info (nested Tensor): For on-policy training it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored into the replay buffer and retrieved for train_step() as rollout_info.

Return type

AlgStep

train_step(inputs, state, unroll_info=())[source]#

Perform one step of training computation.

It is called to calculate output for every time step for a batch of experience from replay buffer. It also needs to generate necessary information for calc_loss().

Parameters
  • inputs (nested Tensor) – inputs for train.

  • state (nested Tensor) – consistent with train_state_spec.

  • rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match train_state_spec.

  • info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate the loss or override update_with_gradient() to do customized training.

Return type

AlgStep

training: bool#
class TruncatedNormal(loc, scale, low, high, validate_args=None)[source]#

Bases: torch.distributions.distribution.Distribution

Normal distribution truncated to the range between low and high.

Currently, only log_prob() is implemented.

Parameters
  • loc (Tensor) – mean of the untruncated Normal

  • scale (Tensor) – standard deviation of the untruncated Normal

  • low (Tensor) – lower bound of the truncation range

  • high (Tensor) – upper bound of the truncation range

log_prob(value)[source]#

Log-probability of value.

Parameters

value (Tensor) – the samples whose log_prob is to be calculated

Returns

log probability of value
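
For reference, the textbook log-density of a normal distribution truncated to [low, high] is the untruncated log-density minus the log of the probability mass that the untruncated normal places inside the interval. The scalar sketch below uses only the standard library and is not ALF's implementation; in particular, returning -inf outside the range is an assumption.

```python
import math


def normal_cdf(z):
    """CDF of the standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def truncated_normal_log_prob(value, loc, scale, low, high):
    """Log-density of Normal(loc, scale) truncated to [low, high]."""
    if not (low <= value <= high):
        return -math.inf  # assumption: zero density outside the range
    z = (value - loc) / scale
    log_pdf = -0.5 * z * z - 0.5 * math.log(2.0 * math.pi) - math.log(scale)
    # Probability mass of the untruncated normal inside [low, high].
    mass = normal_cdf((high - loc) / scale) - normal_cdf((low - loc) / scale)
    return log_pdf - math.log(mass)


log_p = truncated_normal_log_prob(0.5, loc=0.0, scale=1.0, low=-1.0, high=2.0)
```

Because the mass inside the interval is below 1, the truncated log-density is always higher than the untruncated one at the same in-range point.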

rsample()[source]#

Generates a sample_shape shaped reparameterized sample or sample_shape shaped batch of reparameterized samples if the distribution parameters are batched.

sample()[source]#

Generates a sample_shape shaped sample or sample_shape shaped batch of samples if the distribution parameters are batched.

class UniformPriorActor(observation_spec, action_spec, debug_summaries=False, name='UniformPriorActor')[source]#

Bases: alf.algorithms.algorithm.Algorithm

UniformPriorActor can be used as a prior for KLD regularized RL algorithms. It generates a prior distribution for the next action using limited information, which can be used as the prior distribution in KLD.

The action distribution is always a uniform distribution defined by the valid range of the action specified in action_spec
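
A uniform prior over a box-shaped action range has a constant log-density that depends only on the bounds. The sketch below assumes independent action dimensions and sums their log-densities; the function name is hypothetical and this is not the class's own code.

```python
import math


def uniform_log_prob(action, low, high):
    """Log-density of a uniform distribution over the box [low, high]."""
    if any(not (l <= a <= h) for a, l, h in zip(action, low, high)):
        return -math.inf  # assumption: zero density outside the valid range
    # Each dimension contributes -log(width) to the joint log-density.
    return -sum(math.log(h - l) for l, h in zip(low, high))


log_p = uniform_log_prob([0.2, -0.5], low=[-1.0, -1.0], high=[1.0, 1.0])
```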

Parameters
  • observation_spec (nested TensorSpec) – representing the observations.

  • action_spec (nested BoundedTensorSpec) – representing the actions.

  • debug_summaries (bool) – True if debug summaries should be created.

  • name (str) – The name of this algorithm.

predict_step(inputs, state)[source]#

Predict for one step of inputs.

Parameters
  • inputs (nested Tensor) – inputs for prediction.

  • state (nested Tensor) – network state (for RNN).

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match predict_state_spec.

  • info (nest): information for analyzing the agent. In particular, if an element of the info is alf.summary.render.Image, it will be rendered during play. See alf/summary/render.py for details.

Return type

AlgStep

rollout_step(inputs, state)[source]#

Rollout for one step of inputs.

It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for calc_loss(). For off-policy training, it needs to generate necessary information for train_step().

Parameters
  • inputs (nested Tensor) – inputs for prediction.

  • state (nested Tensor) – network state (for RNN).

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match rollout_state_spec.

  • info (nested Tensor): For on-policy training it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored into the replay buffer and retrieved for train_step() as rollout_info.

Return type

AlgStep

train_step(inputs, state, rollout_info=None)[source]#

Perform one step of training computation.

It is called to calculate output for every time step for a batch of experience from replay buffer. It also needs to generate necessary information for calc_loss().

Parameters
  • inputs (nested Tensor) – inputs for train.

  • state (nested Tensor) – consistent with train_state_spec.

  • rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match train_state_spec.

  • info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate the loss or override update_with_gradient() to do customized training.

Return type

AlgStep

training: bool#
normcdf(a, b)[source]#

alf.algorithms.qrsac_algorithm#

Quantile Regression Soft Actor Critic Algorithm.

class QrsacAlgorithm(observation_spec, action_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), actor_network_cls=<class 'alf.networks.actor_distribution_networks.ActorDistributionNetwork'>, critic_network_cls=<class 'alf.networks.critic_networks.CriticNetwork'>, epsilon_greedy=None, use_entropy_reward=False, normalize_entropy_reward=False, calculate_priority=False, num_critic_replicas=2, min_critic_by_critic_mean=False, env=None, config=None, critic_loss_ctor=None, target_entropy=None, prior_actor_ctor=None, target_kld_per_dim=3.0, initial_log_alpha=0.0, max_log_alpha=None, target_update_tau=0.05, target_update_period=1, dqda_clipping=None, actor_optimizer=None, critic_optimizer=None, alpha_optimizer=None, checkpoint=None, debug_summaries=False, reproduce_locomotion=False, name='QrsacAlgorithm')[source]#

Bases: alf.algorithms.sac_algorithm.SacAlgorithm

Quantile regression actor critic algorithm.

A SAC variant that applies the following quantile regression based distributional RL approach to model the critic function:

Dabney et al "Distributional Reinforcement Learning with Quantile Regression",
arXiv:1710.10044

Currently, only continuous action space is supported.

Refer to SacAlgorithm for Args besides the following. Args used for discrete and mixed actions are omitted.

Parameters
  • min_critic_by_critic_mean (bool) – If True, compute the min quantile distribution of critic replicas by choosing the one with the lowest distribution mean. Otherwise, compute the min quantile by taking a minimum value across all critic replicas for each quantile value.

  • checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.
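
The two behaviours selected by min_critic_by_critic_mean can be contrasted on a toy example. This plain-Python sketch represents each critic replica as a list of quantile values; the function name and data layout are assumptions for illustration, not the algorithm's tensor code.

```python
def min_quantiles(critic_quantiles, min_critic_by_critic_mean):
    """critic_quantiles: list over replicas, each a list of quantile values."""
    if min_critic_by_critic_mean:
        # Pick the whole distribution of the replica with the lowest mean.
        means = [sum(q) / len(q) for q in critic_quantiles]
        best = means.index(min(means))
        return list(critic_quantiles[best])
    # Otherwise take the minimum across replicas separately for each quantile.
    return [min(values) for values in zip(*critic_quantiles)]


replicas = [[1.0, 2.0, 6.0],   # mean 3.0
            [0.0, 3.0, 4.0]]   # mean 2.33...
by_mean = min_quantiles(replicas, True)
per_quantile = min_quantiles(replicas, False)
```

In this example the per-quantile minimum mixes values from both replicas, while the mean-based choice returns one replica's distribution unchanged.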

training: bool#

alf.algorithms.reward_learning_algorithm#

class FixedRewardFunction(reward_func, name='FixedRewardFunction')[source]#

Bases: alf.algorithms.reward_learning_algorithm.RewardEstimationAlgorithm

Fixed Reward Estimation Module with hand-crafted computational rules.

Parameters

reward_func (Callable) –

a function for computing reward. It takes as input:

  1. observation (Tensor of shape [batch_size, observation_dim])

  2. action (Tensor of shape [batch_size, num_actions]) and returns a reward Tensor of shape [batch_size]

compute_reward(obs, action, state)[source]#

Compute reward based on the current observation and action.

Parameters
  • obs (Tensor) – observation

  • action (Tensor) – action

  • state – state for reward calculation

Returns

  • reward (Tensor): computed reward for the given input.

  • state: updated state; currently the input state is simply passed through.

train_step(time_step, state=(), rollout_info=None)[source]#
Parameters
  • time_step (TimeStep) – input data for dynamics learning

  • state – state for reward learning

Returns

AlgStep

training: bool#
class RewardEstimationAlgorithm(name='RewardEstimationAlgorithm')[source]#

Bases: alf.algorithms.algorithm.Algorithm

Reward Estimation Module

This module is responsible for computing/predicting rewards

Create a RewardEstimationAlgorithm.

compute_reward(obs, action, state)[source]#

Compute reward based on the provided observation and action.

Parameters
  • obs (Tensor) – observation

  • action (Tensor) – action

  • state –

Returns

  • reward (Tensor): computed reward for the given input.

train_step(time_step, state, rollout_info=None)[source]#
Parameters
  • time_step (TimeStep) – input data for dynamics learning

  • state (Tensor) – state for dynamics learning (previous observation)

Returns

AlgStep

training: bool#

alf.algorithms.rl_algorithm#

Base class for RL algorithms.

class RLAlgorithm(observation_spec, action_spec, train_state_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), predict_state_spec=None, rollout_state_spec=None, is_on_policy=None, reward_weights=None, env=None, config=None, optimizer=None, checkpoint=None, is_eval=False, overwrite_policy_output=False, debug_summaries=False, name='RLAlgorithm')[source]#

Bases: alf.algorithms.algorithm.Algorithm

Abstract base class for RL Algorithms.

RLAlgorithm provides basic functions and a generic interface for RL algorithms.

The key interface functions are:

  1. predict_step(): one step of computation of action for evaluation.

  2. rollout_step(): one step of computation for rollout. It is used for collecting experiences during training. Unlike predict_step(), rollout_step() may include additional computations for training. For on-policy algorithms (e.g., AC, PPO, etc), the collected experiences will be immediately used to update parameters after one rollout (multiple rollout steps) is performed; for off-policy algorithms (e.g., SAC, DDPG, etc), these collected experiences will be put into a replay buffer.

  3. train_step(): only used for off-policy training. The training data are sampled from the replay buffer filled by rollout_step().

  4. train_iter(): perform one iteration of training (rollout [and train]). train_iter() is called num_iterations times by Trainer. We provide a default implementation. Users can choose to implement their own train_iter().

  5. update_with_gradient(): Do one gradient update based on the loss. It is used by the default train_iter() implementation. You can override to implement your own update_with_gradient().

  6. calc_loss(): calculate the loss based on the experience and the train_info collected from rollout_step() or train_step(). It is used by the default implementation of train_iter(). If you want to use the default train_iter(), you need to implement calc_loss().

  7. after_update(): called by train_iter() after every call to update_with_gradient(), mainly for some postprocessing steps such as copying a training model to a target model in SAC or DQN.

  8. after_train_iter(): called by train_iter() after every call to train_from_unroll() (on-policy training iter) or train_from_replay_buffer (off-policy training iter). It’s mainly for training additional modules that have their own training logic (e.g., on/off-policy, replay buffers, etc). Other things might also be possible as long as they should be done once every training iteration.
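
The order in which these interface functions interact can be pictured with a toy class that only records calls. This is a schematic reading of the list above, not ALF's train_iter(): the class, its constructor arguments, and the way a "batch" is taken from the buffer are all hypothetical.

```python
class ToyAlgorithm:
    """Records the order in which the interface functions are invoked."""

    def __init__(self, is_on_policy, unroll_length=2, num_updates=1):
        self.is_on_policy = is_on_policy
        self.unroll_length = unroll_length
        self.num_updates = num_updates  # hypothetical: updates per off-policy iteration
        self.calls = []
        self.replay_buffer = []

    def rollout_step(self):
        self.calls.append("rollout_step")
        return "rollout_info"

    def train_step(self, rollout_info):
        self.calls.append("train_step")
        return "train_info"

    def calc_loss(self, info):
        self.calls.append("calc_loss")
        return 0.0

    def update_with_gradient(self, loss):
        self.calls.append("update_with_gradient")

    def after_update(self):
        self.calls.append("after_update")

    def after_train_iter(self):
        self.calls.append("after_train_iter")

    def _update(self, infos):
        loss = self.calc_loss(infos)
        self.update_with_gradient(loss)
        self.after_update()

    def train_iter(self):
        # Rollout: collect experience for unroll_length environment steps.
        infos = [self.rollout_step() for _ in range(self.unroll_length)]
        if self.is_on_policy:
            # On-policy: the fresh rollout info feeds calc_loss() directly.
            self._update(infos)
        else:
            # Off-policy: store the experience, then train on data from the buffer.
            self.replay_buffer.extend(infos)
            for _ in range(self.num_updates):
                batch = self.replay_buffer[-self.unroll_length:]  # stand-in for sampling
                self._update([self.train_step(info) for info in batch])
        self.after_train_iter()


on_policy = ToyAlgorithm(is_on_policy=True)
on_policy.train_iter()
off_policy = ToyAlgorithm(is_on_policy=False)
off_policy.train_iter()
```

Comparing on_policy.calls with off_policy.calls shows the one structural difference named in the list: train_step() appears only in the off-policy path.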

Parameters
  • observation_spec (nested TensorSpec) – representing the observations.

  • action_spec (nested BoundedTensorSpec) – representing the actions.

  • train_state_spec (nested TensorSpec) – for the network state of train_step().

  • reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).

  • rollout_state_spec (nested TensorSpec) – for the network state of rollout_step(). If None, it’s assumed to be the same as train_state_spec.

  • predict_state_spec (nested TensorSpec) – for the network state of predict_step(). If None, it’s assumed to be the same as rollout_state_spec.

  • is_on_policy (None|bool) – whether the algorithm is on-policy or not.

  • reward_weights (None|list[float]) – this is only used when the reward is multidimensional. If not None, the weighted sum of rewards is the reward for training. Otherwise, the sum of rewards is used.

  • env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. Running multiple environments in parallel is crucial to on-policy algorithms as it increases the diversity of data and decreases temporal correlation. env only needs to be provided to the root Algorithm.

  • config (TrainerConfig) – config for training. config only needs to be provided to the algorithm which performs a training iteration by itself.

  • optimizer (torch.optim.Optimizer) – The default optimizer for training.

  • checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.

  • is_eval (bool) – True if this algorithm is used for evaluation only, during deployment. In this case, the algorithm does not need to create certain components such as value_network for ActorCriticAlgorithm or critic_networks for SacAlgorithm.

  • overwrite_policy_output (bool) – if True, overwrite the policy output with next_step.prev_action. This option can be used in some cases such as data collection.

  • debug_summaries (bool) – If True, debug summaries will be created.

  • name (str) – Name of this algorithm.
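
The reward_weights rule in the parameter list above (weighted sum if weights are given, plain sum otherwise) amounts to the following. The sketch uses plain Python lists in place of tensors and a hypothetical function name.

```python
def scalarize_reward(reward, reward_weights=None):
    """Collapse a multidimensional reward (list of floats) into one scalar."""
    if reward_weights is None:
        return sum(reward)
    if len(reward_weights) != len(reward):
        raise ValueError("reward_weights must have one entry per reward dimension")
    return sum(w * r for w, r in zip(reward_weights, reward))


reward = [1.0, -2.0, 0.5]
unweighted = scalarize_reward(reward)
weighted = scalarize_reward(reward, [1.0, 0.5, 2.0])
```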

property action_spec#

Return the action spec.

finish_train()[source]#

Finish training and release resources if necessary.

get_metrics()[source]#

Returns the metrics monitored by this driver.

Returns

Return type

list[StepMetric]

get_step_metrics()[source]#

Get the step metrics against which summaries are generated.

Returns

step metrics EnvironmentSteps and NumberOfEpisodes.

Return type

list[StepMetric]

has_multidim_reward()[source]#

Check if the algorithm uses multi-dim reward or not.

Returns

True if the reward has multiple dims.

Return type

bool

is_rl()[source]#

Always return True for RLAlgorithm.

load_offline_replay_buffer(untransformed_observation_spec)[source]#

Load replay buffer from a replay buffer checkpoint. It will construct a replay buffer (self._offline_replay_buffer) holding the data loaded from the checkpoint, which can be used for model training, e.g. in the hybrid training pipeline or in other ways.

Parameters

untransformed_observation_spec (nested TensorSpec) – spec that describes the structure of the untransformed observations.

property observation_spec#

Return the observation spec.

predict_step(inputs, state)[source]#

Predict for one step of observation.

This is only used for evaluation, so it only needs to perform the computations for generating the action distribution.

Parameters
  • time_step (TimeStep) – Current observation and other inputs for computing action.

  • state (nested Tensor) – should be consistent with predict_state_spec

Returns

  • output (nested Tensor): should be consistent with action_spec.

  • state (nested Tensor): should be consistent with predict_state_spec.

Return type

AlgStep

property reward_weights#

Return the current reward weights.

property rollout_info_spec#

The spec for the AlgStep.info returned from rollout_step().

set_reward_weights(reward_weights)[source]#

Update reward weights; this function can be called at any step during training. Once called, the updated reward weights are expected to be used by the algorithm in the next.

Parameters

reward_weights (Tensor) – a tensor that is compatible with self._reward_spec.

summarize_metrics()[source]#

Generate summaries for metrics AverageEpisodeLength, AverageReturn, etc.

summarize_reward(name, rewards)[source]#
summarize_rollout(experience, custom_summary=None)[source]#

Generate summaries for rollout.

Parameters
  • experience (Experience) – experience collected from rollout_step().

  • custom_summary (Optional[Callable[[Experience], None]]) – when specified it is a function that will be called every time when this summarize_rollout hook is called. This provides a convenient way for the user to extend summarize_rollout from ALF configs.

summarize_train(experience, train_info, loss_info, params)[source]#

Generate summaries for training & loss info after each gradient update.

For on-policy algorithms, experience.rollout_info is empty, while for off-policy algorithms, it is available. However, the statistics in both train_info and experience.rollout_info are for the data sampled from the replay buffer. They store the up-to-date model outputs and the historical model outputs (on the past rollout data), respectively. They do not represent the model outputs on the current ongoing rollout.

Parameters
  • experience (Experience) – experiences collected from the most recent unroll() or from a replay buffer. It also has been used for the most recent update_with_gradient().

  • train_info (nested Tensor) – AlgStep.info returned by either rollout_step() (on-policy training) or train_step() (off-policy training).

  • loss_info (LossInfo) – loss

  • params (list[Parameter]) – list of parameters with gradients

train_iter()[source]#

Perform one iteration of training.

Users may choose to implement their own train_iter().

Returns

the number of samples being trained on (including duplicates).

Return type

int

training: bool#
unroll(**kwargs)#
adjust_replay_buffer_length(config, num_earliest_frames_ignored=0)[source]#

Adjust the replay buffer length for whole replay buffer training.

Normally we just respect the replay buffer length set in the config. However, for a specific case where the user asks to do “whole replay buffer training”, we need to adjust the user provided length to achieve desired behavior.

Parameters
  • config (TrainerConfig) – The trainer config of the training session

  • num_earliest_frames_ignored (int) – ignore the earliest so many frames from the buffer when sampling or gathering. This is typically required when FrameStacker is used. See ReplayBuffer for details.

Return type

int

Returns

An integer representing the adjusted replay buffer length.

alf.algorithms.rnd_algorithm#

class RNDAlgorithm(target_net, predictor_net, encoder_net=None, reward_adapt_speed=None, observation_adapt_speed=None, observation_spec=None, optimizer=None, clip_value=-1.0, keep_stacked_frames=1, name='RNDAlgorithm')[source]#

Bases: alf.algorithms.algorithm.Algorithm

Exploration by Random Network Distillation, Burda et al. 2019.

This module generates the intrinsic reward based on the prediction errors of randomly generated state embeddings.

Suppose we have a fixed randomly initialized target network g: s -> e_t and a trainable predictor network h: s -> e_p, then the intrinsic reward is

r = |e_t - e_p|^2

The reward is expected to be higher for novel states.
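
To make the novelty intuition concrete, the sketch below uses tiny bias-free linear maps as stand-ins for the target network g and the predictor network h, computes r = |e_t - e_p|^2, and shows the reward for a state shrinking after the predictor takes one gradient step on that state. All names, weights, and the learning rate are made up for illustration, and the normalization and clipping options of the class are ignored.

```python
def linear(weights, state):
    """Apply a bias-free linear map; weights is a list of rows."""
    return [sum(w * x for w, x in zip(row, state)) for row in weights]


def intrinsic_reward(e_t, e_p):
    """Squared L2 distance between target and predicted embeddings."""
    return sum((a - b) ** 2 for a, b in zip(e_t, e_p))


def predictor_sgd_step(pred_w, state, e_t, lr):
    """One gradient step on the predictor weights; the target stays fixed."""
    e_p = linear(pred_w, state)
    # d/dw_ij of sum_i (e_p_i - e_t_i)^2 is 2 * (e_p_i - e_t_i) * state_j
    return [[w - lr * 2.0 * (p - t) * x for w, x in zip(row, state)]
            for row, p, t in zip(pred_w, e_p, e_t)]


target_w = [[0.5, -1.0], [2.0, 0.3]]   # fixed target network g (never trained)
pred_w = [[0.0, 0.0], [0.0, 0.0]]      # trainable predictor network h
state = [1.0, 2.0]

e_t = linear(target_w, state)
reward_before = intrinsic_reward(e_t, linear(pred_w, state))
pred_w = predictor_sgd_step(pred_w, state, e_t, lr=0.05)
reward_after = intrinsic_reward(e_t, linear(pred_w, state))
```

A state that has been trained on yields a smaller prediction error, so states the predictor has not yet fitted stand out with a larger reward.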

Parameters
  • encoder_net (EncodingNetwork) – a shared network that encodes observation to embeddings before being input to target_net or predictor_net; its parameters are not trainable.

  • target_net (EncodingNetwork) – the random fixed network that generates target state embeddings to be fitted.

  • predictor_net (EncodingNetwork) – the trainable network that predicts target embeddings. If fully trained given enough data, predictor_net will become target_net eventually.

  • reward_adapt_speed (float) – speed for adaptively normalizing intrinsic rewards; if None, no normalizer is used.

  • observation_adapt_speed (float) – speed for adaptively normalizing observations. Only useful if observation_spec is not None.

  • observation_spec (TensorSpec) – the observation tensor spec; used for creating an adaptive observation normalizer.

  • optimizer (torch.optim.Optimizer) – The optimizer for training

  • clip_value (float) – if positive, the rewards will be clipped to [-clip_value, clip_value]; only used for reward normalization.

  • keep_stacked_frames (int) – a non-negative integer indicating how many stacked frames we want to keep as the observation. If >0, we only keep the last so many frames for RND to make predictions on, as suggested by the original paper Burda et al. 2019. For Atari games, this argument is usually 1 (with frame_stacking==4). If it’s 0, the observation is unchanged. For other games, the user is responsible for setting this value correctly depending on how many channels an observation has at each time step.

  • name (str) –

calc_loss(info)[source]#

Calculate the loss at each step for each sample.

Parameters

info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).

Returns

loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).

Return type

LossInfo

predict_step(inputs, state)[source]#

Predict for one step of inputs.

Parameters
  • inputs (nested Tensor) – inputs for prediction.

  • state (nested Tensor) – network state (for RNN).

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match predict_state_spec.

  • info (nest): information for analyzing the agent. In particular, if an element of the info is alf.summary.render.Image, it will be rendered during play. See alf/summary/render.py for details.

Return type

AlgStep

rollout_step(inputs, state)[source]#

Rollout for one step of inputs.

It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for calc_loss(). For off-policy training, it needs to generate necessary information for train_step().

Parameters
  • inputs (nested Tensor) – inputs for prediction.

  • state (nested Tensor) – network state (for RNN).

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match rollout_state_spec.

  • info (nested Tensor): For on-policy training it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored into the replay buffer and retrieved for train_step() as rollout_info.

Return type

AlgStep

train_step(inputs, state, rollout_info=None)[source]#

Perform one step of training computation.

It is called to calculate output for every time step for a batch of experience from replay buffer. It also needs to generate necessary information for calc_loss().

Parameters
  • inputs (nested Tensor) – inputs for train.

  • state (nested Tensor) – consistent with train_state_spec.

  • rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match train_state_spec.

  • info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate the loss or override update_with_gradient() to do customized training.

Return type

AlgStep

training: bool#

alf.algorithms.sac_algorithm#

Soft Actor Critic Algorithm.

class ActionType(value)#

Bases: enum.Enum

An enumeration.

Continuous = 2#
Discrete = 1#
Mixed = 3#
class SacActionState(actor_network, critic)#

Bases: tuple

Create new instance of SacActionState(actor_network, critic)

actor_network#

Alias for field number 0

critic#

Alias for field number 1

class SacActorInfo(actor_loss, neg_entropy)#

Bases: tuple

Create new instance of SacActorInfo(actor_loss, neg_entropy)

actor_loss#

Alias for field number 0

neg_entropy#

Alias for field number 1

class SacAlgorithm(observation_spec, action_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), actor_network_cls=<class 'alf.networks.actor_distribution_networks.ActorDistributionNetwork'>, critic_network_cls=<class 'alf.networks.critic_networks.CriticNetwork'>, q_network_cls=<class 'alf.networks.q_networks.QNetwork'>, reward_weights=None, epsilon_greedy=None, use_entropy_reward=True, normalize_entropy_reward=False, calculate_priority=False, num_critic_replicas=2, env=None, config=None, critic_loss_ctor=None, target_entropy=None, prior_actor_ctor=None, target_kld_per_dim=3.0, initial_log_alpha=0.0, max_log_alpha=None, target_update_tau=0.05, target_update_period=1, dqda_clipping=None, actor_optimizer=None, critic_optimizer=None, alpha_optimizer=None, checkpoint=None, debug_summaries=False, reproduce_locomotion=False, name='SacAlgorithm')[source]#

Bases: alf.algorithms.off_policy_algorithm.OffPolicyAlgorithm

Soft Actor Critic algorithm, described in:

Haarnoja et al "Soft Actor-Critic Algorithms and Applications", arXiv:1812.05905v2

This implementation differs from tf_agents.agents.sac.sac_agent in three ways:

1. To reduce computation, here we sample actions only once for calculating actor, critic, and alpha loss while tf_agents.agents.sac.sac_agent samples actions for each loss. This difference has little influence on the training performance.

2. We calculate losses for every sampled step. Both \((s_t, a_t)\) and \((s_{t+1}, a_{t+1})\) in a sampled transition are used to calculate the actor, critic and alpha losses, while tf_agents.agents.sac.sac_agent only uses \((s_t, a_t)\), and its critic loss for \(s_{t+1}\) is 0. Handle this difference carefully: it is equivalent to applying a coefficient of 0.5 on the critic loss.

3. We mask out StepType.LAST steps when calculating losses, but tf_agents.agents.sac.sac_agent does not. We believe the correct implementation should mask out LAST steps. This may lead to different performance on the same tasks.

In addition to the continuous actions addressed by the original paper, this algorithm also supports discrete actions and a mixture of discrete and continuous actions. The networks for computing Q values \(Q(s,a)\) and sampling actions can be divided into 3 cases according to action types:

  1. Discrete only: a QNetwork is used for estimating Q values. There will be no actor network to learn because actions can be directly sampled from the Q values: \(p(a|s) \propto \exp(\frac{Q(s,a)}{\alpha})\).

  2. Continuous only: a CriticNetwork is used for estimating Q values. An ActorDistributionNetwork for sampling actions will be learned according to Q values.

  3. Mixed: a QNetwork is used for estimating Q values. The input of this particular QNetwork (dubbed as “Universal Q Network”) is augmented with all continuous actions as (observation, continuous_action), while the output heads correspond to discrete actions. So a Q value \(Q(s, a_{cont}, a_{disc}=k)\) is estimated by the \(k\)-th output head of the network given \(a_{cont}\) as the augmented input to \(s\). Still only an ActorDistributionNetwork is needed for first sampling continuous actions, and then a discrete action is sampled from Q values conditioned on the continuous actions. See alf/docs/notes/sac_with_hybrid_action_types.rst for training details.
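
For the discrete case, sampling from \(p(a|s) \propto \exp(\frac{Q(s,a)}{\alpha})\) is a softmax over the Q values with temperature \(\alpha\). The sketch below shows this with the standard library only; the function names are hypothetical and the real algorithm operates on batched tensors.

```python
import math
import random


def action_probs_from_q(q_values, alpha):
    """Softmax of Q(s, a) / alpha over the discrete actions."""
    scaled = [q / alpha for q in q_values]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_action(q_values, alpha, rng=random):
    """Draw one discrete action index according to the softmax probabilities."""
    probs = action_probs_from_q(q_values, alpha)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]


q = [1.0, 2.0, 0.0]
probs_soft = action_probs_from_q(q, alpha=1.0)
probs_sharp = action_probs_from_q(q, alpha=0.1)
action = sample_action(q, alpha=1.0, rng=random.Random(0))
```

A smaller \(\alpha\) concentrates the probability on the action with the highest Q value, while a larger \(\alpha\) flattens the distribution.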

In addition to the entropy regularization described in the SAC paper, we also support KL-Divergence regularization if a prior actor is provided. In this case, the training objective is:

\(E_\pi(\sum_t \gamma^t(r_t - \alpha D_{\rm KL}(\pi(\cdot|s_t)||\pi^0(\cdot|s_t))))\)

where \(\pi^0\) is the prior actor.
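As a rough illustration of this objective, the following sketch evaluates the KL-regularized discounted return for a single trajectory, using small discrete distributions in place of the policy and the prior actor. The helper names and the use of discrete distributions are assumptions made for the example; the expectation over trajectories is not computed.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kl_regularized_return(rewards, policies, priors, alpha, gamma):
    """Compute sum_t gamma^t * (r_t - alpha * KL(pi(.|s_t) || pi0(.|s_t)))."""
    total = 0.0
    for t, (r, pi, pi0) in enumerate(zip(rewards, policies, priors)):
        total += gamma ** t * (r - alpha * kl_divergence(pi, pi0))
    return total


uniform = [0.5, 0.5]
peaked = [0.9, 0.1]
# The policy equals the prior, so the KL penalty vanishes.
same = kl_regularized_return([1.0, 1.0], [uniform, uniform],
                             [uniform, uniform], alpha=0.1, gamma=0.5)
# The policy deviates from the prior, so the return is penalized.
penalized = kl_regularized_return([1.0, 1.0], [peaked, peaked],
                                  [uniform, uniform], alpha=0.1, gamma=0.5)
```

When the policy matches the prior at every step, the objective reduces to the ordinary discounted return.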

Parameters
  • observation_spec (nested TensorSpec) – representing the observations.

  • action_spec (nested BoundedTensorSpec) – representing the actions; can be a mixture of discrete and continuous actions. The number of continuous actions can be arbitrary while only one discrete action is allowed currently. If it’s a mixture, then it must be a tuple/list (discrete_action_spec, continuous_action_spec).

  • reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).

  • actor_network_cls (Callable) – is used to construct the actor network. The constructed actor network will be called to sample continuous actions. All of its output specs must be continuous. Note that we don’t need a discrete actor network because a discrete action can simply be sampled from the Q values.

  • critic_network_cls (None or Callable) – is used to construct the critic network for estimating Q(s,a) given that the action is continuous. Note that if the algorithm is constructed for evaluation or deployment only, critic_network_cls can be set to None and the network will not be constructed at all.

  • q_network (Callable) – is used to construct QNetwork for estimating Q(s,a) given that the action is discrete. Its output spec must be consistent with the discrete action in action_spec.

  • reward_weights (None|list[float]) – this is only used when the reward is multidimensional. In that case, the weighted sum of the q values is used for training the actor if reward_weights is not None. Otherwise, the sum of the q values is used.

  • epsilon_greedy (float) – a floating value in [0,1], representing the chance of action sampling instead of taking argmax. This can help prevent a dead loop in some deterministic environment like Breakout. Only used for evaluation. If None, its value is taken from config.epsilon_greedy and then alf.get_config_value(TrainerConfig.epsilon_greedy).

  • use_entropy_reward (bool) – whether to include entropy as reward

  • normalize_entropy_reward (bool) – if True, normalize entropy reward to reduce bias in episodic cases. Only used if use_entropy_reward==True.

  • calculate_priority (bool) – whether to calculate priority. This is only useful if priority replay is enabled.

  • num_critic_replicas (int) – number of critics to be used. Default is 2.

  • env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. env only needs to be provided to the root algorithm.

  • config (TrainerConfig) – config for training. It only needs to be provided to the algorithm which performs train_iter() by itself.

  • critic_loss_ctor (None|OneStepTDLoss|MultiStepLoss) – a critic loss constructor. If None, a default OneStepTDLoss will be used.

  • initial_log_alpha (float) – initial value for variable log_alpha.

  • max_log_alpha (float|None) – if not None, log_alpha will be capped at this value.

  • target_entropy (float|Callable|None) – If a floating value, it’s the target average policy entropy, for updating alpha. If a callable function, then it will be called on the action spec to calculate a target entropy. If None, a default entropy will be calculated. For the mixed action type, discrete action and continuous action will have separate alphas and target entropies, so this argument can be a 2-element list/tuple, where the first is for discrete action and the second for continuous action.

  • prior_actor_ctor (Callable) – If provided, it will be called using prior_actor_ctor(observation_spec, action_spec, debug_summaries=debug_summaries) to construct a prior actor. The output of the prior actor is the distribution of the next action. Two prior actors are implemented: alf.algorithms.prior_actor.SameActionPriorActor and alf.algorithms.prior_actor.UniformPriorActor.

  • target_kld_per_dim (float) – alpha is dynamically adjusted so that the KLD is about target_kld_per_dim * dim.

  • target_update_tau (float) – Factor for soft update of the target networks.

  • target_update_period (int) – Period for soft update of the target networks.

  • dqda_clipping (float) – when computing the actor loss, clips the gradient dqda element-wise between [-dqda_clipping, dqda_clipping]. Will not perform clipping if dqda_clipping == 0.

  • actor_optimizer (torch.optim.Optimizer) – The optimizer for actor.

  • critic_optimizer (torch.optim.Optimizer) – The optimizer for critic.

  • alpha_optimizer (torch.optim.Optimizer) – The optimizer for alpha.

  • debug_summaries (bool) – True if debug summaries should be created.

  • checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.

  • reproduce_locomotion (bool) – if True, some slight tweaks are added to the original SAC to roughly reproduce its reported results on MuJoCo locomotion tasks. These include uniform action sampling in the beginning and different masks for actor and critic losses.

  • name (str) – The name of this algorithm.
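The target_update_tau and target_update_period parameters are commonly combined as a periodic Polyak (soft) update of the target networks. The sketch below assumes that standard form; SoftUpdater is a hypothetical helper that works on plain Python lists and is not ALF's implementation.

```python
class SoftUpdater:
    """Blend source weights into target weights every `period` calls."""

    def __init__(self, tau, period):
        self.tau = tau
        self.period = period
        self._counter = 0

    def __call__(self, source, target):
        """Return the new target weights as a list."""
        self._counter += 1
        if self._counter % self.period != 0:
            # Not an update step: the target is left unchanged.
            return list(target)
        # Soft update: target <- (1 - tau) * target + tau * source.
        return [(1 - self.tau) * t + self.tau * s
                for s, t in zip(source, target)]


updater = SoftUpdater(tau=0.5, period=2)
source = [1.0, 2.0]
target = [0.0, 0.0]
target = updater(source, target)  # call 1: skipped
target = updater(source, target)  # call 2: blended with tau=0.5
```

With tau=1.0 the update becomes a hard copy of the source weights, and a larger period makes the target network change less often.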

after_update(root_inputs, info)[source]#

Do things after completing one gradient update (i.e. update_with_gradient()). This function can be used for post-processing following one minibatch update, such as copying a training model to a target model in SAC, DQN, etc.

Parameters
  • root_inputs (nest) – temporally batched inputs for the rollout_step() of the root algorithm collected during unroll().

  • info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() for on-policy training or train_step() for off-policy training.

calc_loss(info)[source]#

Calculate the loss at each step for each sample.

Parameters

info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).

Returns

loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).

Return type

LossInfo

predict_step(inputs, state)[source]#

Predict for one step of observation.

This is only used for evaluation, so it only needs to perform the computations for generating the action distribution.

Parameters
  • time_step (TimeStep) – Current observation and other inputs for computing action.

  • state (nested Tensor) – should be consistent with predict_state_spec

Returns

  • output (nested Tensor): should be consistent with action_spec.

  • state (nested Tensor): should be consistent with predict_state_spec.

Return type

AlgStep

rollout_step(inputs, state)[source]#

rollout_step() basically predicts actions like what is done by predict_step(). Additionally, if states are to be stored in a replay buffer, then this function also calls _critic_networks and _target_critic_networks to maintain their states.

train_step(inputs, state, rollout_info)[source]#

Perform one step of training computation.

It is called to calculate output for every time step for a batch of experience from replay buffer. It also needs to generate necessary information for calc_loss().

Parameters
  • inputs (nested Tensor) – inputs for train.

  • state (nested Tensor) – consistent with train_state_spec.

  • rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match train_state_spec.

  • info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.

Return type

AlgStep

training: bool#
class SacCriticInfo(critics, target_critic)#

Bases: tuple

Create new instance of SacCriticInfo(critics, target_critic)

critics#

Alias for field number 0

target_critic#

Alias for field number 1

class SacCriticState(critics, target_critics)#

Bases: tuple

Create new instance of SacCriticState(critics, target_critics)

critics#

Alias for field number 0

target_critics#

Alias for field number 1

class SacInfo(reward, step_type, discount, action, action_distribution, actor, critic, alpha, log_pi, discounted_return)#

Bases: tuple

Create new instance of SacInfo(reward, step_type, discount, action, action_distribution, actor, critic, alpha, log_pi, discounted_return)

action#

Alias for field number 3

action_distribution#

Alias for field number 4

actor#

Alias for field number 5

alpha#

Alias for field number 7

critic#

Alias for field number 6

discount#

Alias for field number 2

discounted_return#

Alias for field number 9

log_pi#

Alias for field number 8

reward#

Alias for field number 0

step_type#

Alias for field number 1

class SacLossInfo(actor, critic, alpha)#

Bases: tuple

Create new instance of SacLossInfo(actor, critic, alpha)

actor#

Alias for field number 0

alpha#

Alias for field number 2

critic#

Alias for field number 1

class SacState(action, actor, critic)#

Bases: tuple

Create new instance of SacState(action, actor, critic)

action#

Alias for field number 0

actor#

Alias for field number 1

critic#

Alias for field number 2
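The "Alias for field number N" entries above mean that each field can be read either by name or by position. The sketch below re-declares one of these structures with the standard library's collections.namedtuple purely for illustration; it is an assumption that the real ALF classes behave the same way for these basic operations.

```python
from collections import namedtuple

# Illustrative re-declaration; the real class is defined in ALF.
SacLossInfo = namedtuple("SacLossInfo", ["actor", "critic", "alpha"])

info = SacLossInfo(actor=0.1, critic=0.2, alpha=0.3)

# "critic: Alias for field number 1" means both accesses agree.
by_name = info.critic
by_index = info[1]

# Tuples are immutable; _replace returns a modified copy.
updated = info._replace(alpha=0.0)

# Positional unpacking follows the declared field order.
actor, critic, alpha = info
```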

alf.algorithms.sarsa_algorithm#

SARSA Algorithm.

class SarsaAlgorithm(observation_spec, action_spec, actor_network_ctor, critic_network_ctor, reward_spec=TensorSpec(shape=(), dtype=torch.float32), num_critic_replicas=2, env=None, config=None, critic_loss_cls=<class 'alf.algorithms.one_step_loss.OneStepTDLoss'>, target_entropy=None, epsilon_greedy=None, use_entropy_reward=False, calculate_priority=False, initial_alpha=1.0, ou_stddev=0.2, ou_damping=0.15, actor_optimizer=None, critic_optimizer=None, alpha_optimizer=None, target_update_tau=0.05, target_update_period=10, use_smoothed_actor=False, dqda_clipping=0.0, on_policy=False, checkpoint=None, debug_summaries=False, name='SarsaAlgorithm')[source]#

Bases: alf.algorithms.rl_algorithm.RLAlgorithm

SARSA Algorithm.

SARSA updates the Q function using the following loss:

\[||Q(s_t,a_t) - \text{nograd}(r_t + \gamma * Q(s_{t+1}, a_{t+1}))||^2\]

See https://en.wikipedia.org/wiki/State-action-reward-state-action
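The loss above can be sketched for scalar values as follows. The function names are hypothetical, and the gradient is written out by hand to make the role of nograd explicit: the target is treated as a constant, so no gradient flows through \(Q(s_{t+1}, a_{t+1})\).

```python
def sarsa_td_loss(q_t, reward, q_next, gamma):
    """Squared TD error (Q(s_t,a_t) - nograd(r_t + gamma * Q(s_{t+1},a_{t+1})))^2."""
    target = reward + gamma * q_next  # treated as a constant
    return (q_t - target) ** 2


def sarsa_td_grad(q_t, reward, q_next, gamma):
    """Gradient of the loss with respect to q_t only."""
    target = reward + gamma * q_next
    # Because of nograd, there is no term for d(loss)/d(q_next).
    return 2.0 * (q_t - target)


loss = sarsa_td_loss(q_t=1.0, reward=0.5, q_next=2.0, gamma=0.5)
grad = sarsa_td_grad(q_t=1.0, reward=0.5, q_next=2.0, gamma=0.5)
```

Here the target is 0.5 + 0.5 * 2.0 = 1.5, so the loss is 0.25 and the gradient with respect to \(Q(s_t,a_t)\) is -1.0.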

Currently, this is only implemented for continuous action problems. The policy is derived in a DDPG/SAC manner by maximizing \(Q(a(s_t), s_t)\), where \(a(s_t)\) is the action.

Parameters
  • action_spec (nested BoundedTensorSpec) – representing the actions.

  • observation_spec (nested TensorSpec) – spec for observation.

  • actor_network_ctor (Callable) – Function to construct the actor network. actor_network_ctor needs to accept input_tensor_spec and action_spec as its arguments and return an actor network. The constructed network will be called with forward(observation, state).

  • critic_network_ctor (Callable) – Function to construct the critic network. critic_network_ctor needs to accept input_tensor_spec which is a tuple of (observation_spec, action_spec). The constructed network will be called with forward((observation, action), state).

  • reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).

  • num_critic_replicas (int) – number of critics to be used. Default is 2.

  • env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. Running multiple environments in parallel is crucial to on-policy algorithms as it increases the diversity of data and decreases temporal correlation. env only needs to be provided to the root Algorithm.

  • config (TrainerConfig) – config for training. config only needs to be provided to the algorithm which performs train_iter() by itself.

  • initial_alpha (float|None) – If provided, will add -alpha*entropy to the loss to encourage diverse action.

  • target_entropy (float|Callable|None) – If a floating value, it’s the target average policy entropy, for updating alpha. If a callable function, then it will be called on the action spec to calculate a target entropy. If None, a default entropy will be calculated.

  • epsilon_greedy (float) – a floating value in [0,1], representing the chance of action sampling instead of taking argmax. This can help prevent a dead loop in some deterministic environment like Breakout. Only used for evaluation. If None, its value is taken from config.epsilon_greedy and then alf.get_config_value(TrainerConfig.epsilon_greedy).

  • use_entropy_reward (bool) – If True, will use alpha*entropy as additional reward.

  • calculate_priority (bool) – whether to calculate priority. This is only useful if priority replay is enabled.

  • ou_stddev (float) – Only used for DDPG. Standard deviation for the Ornstein-Uhlenbeck (OU) noise added in the default collect policy.

  • ou_damping (float) – Only used for DDPG. Damping factor for the OU noise added in the default collect policy.

  • target_update_tau (float) – Factor for soft update of the target networks.

  • target_update_period (int) – Period for soft update of the target networks.

  • use_smoothed_actor (bool) – use a smoothed version of actor for predict and rollout. This option can be used if on_policy is False.

  • dqda_clipping (float) – when computing the actor loss, clips the gradient dqda element-wise between [-dqda_clipping, dqda_clipping]. Does not perform clipping if dqda_clipping == 0.

  • actor_optimizer (torch.optim.Optimizer) – The optimizer for actor.

  • critic_optimizer (torch.optim.Optimizer) – The optimizer for critic networks.

  • alpha_optimizer (torch.optim.Optimizer) – The optimizer for alpha. Only used if initial_alpha is not None.

  • on_policy (bool) – whether it is used as an on-policy algorithm.

  • checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.

  • debug_summaries (bool) – True if debug summaries should be created.

  • name (str) – The name of this algorithm.
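The dqda_clipping rule described in the parameter list can be written as a small element-wise function. This is a plain-Python sketch on a list of gradient values; clip_dqda is a hypothetical name, and the real computation is performed on tensors inside the actor loss.

```python
def clip_dqda(dqda, dqda_clipping):
    """Clip each gradient element to [-dqda_clipping, dqda_clipping]."""
    if dqda_clipping == 0:
        # A value of 0 disables clipping.
        return list(dqda)
    return [max(-dqda_clipping, min(dqda_clipping, g)) for g in dqda]


clipped = clip_dqda([-3.0, 0.5, 2.0], dqda_clipping=1.0)
unclipped = clip_dqda([-3.0, 0.5, 2.0], dqda_clipping=0.0)
```

In this example only the first and last elements exceed the bound, so they are replaced by -1.0 and 1.0 while 0.5 is left unchanged.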

after_update(root_inputs, info)[source]#

Do things after completing one gradient update (i.e. update_with_gradient()). This function can be used for post-processing following one minibatch update, such as copying a training model to a target model in SAC, DQN, etc.

Parameters
  • root_inputs (nest) – temporally batched inputs for the rollout_step() of the root algorithm collected during unroll().

  • info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() for on-policy training or train_step() for off-policy training.

calc_loss(info)[source]#

Calculate the loss at each step for each sample.

Parameters

info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).

Returns

loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).

Return type

LossInfo

convert_train_state_to_predict_state(state)[source]#

Convert RNN state for train_step() to RNN state for predict_step().

predict_step(inputs, state)[source]#

Predict for one step of observation.

This is only used for evaluation, so it only needs to perform the computations for generating the action distribution.

Parameters
  • time_step (TimeStep) – Current observation and other inputs for computing action.

  • state (nested Tensor) – should be consistent with predict_state_spec

Returns

  • output (nested Tensor): should be consistent with action_spec.

  • state (nested Tensor): should be consistent with predict_state_spec.

Return type

AlgStep

rollout_step(inputs, state)[source]#

Rollout for one step of inputs.

It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for calc_loss(). For off-policy training, it needs to generate necessary information for train_step().

Parameters
  • inputs (nested Tensor) – inputs for prediction.

  • state (nested Tensor) – network state (for RNN).

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match rollout_state_spec.

  • info (nested Tensor): For on-policy training it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored into the replay buffer and retrieved for train_step() as rollout_info.

Return type

AlgStep

train_step(inputs, state, rollout_info)[source]#

Perform one step of training computation.

It is called to calculate output for every time step for a batch of experience from replay buffer. It also needs to generate necessary information for calc_loss().

Parameters
  • inputs (nested Tensor) – inputs for train.

  • state (nested Tensor) – consistent with train_state_spec.

  • rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match train_state_spec.

  • info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.

Return type

AlgStep

training: bool#
class SarsaInfo(reward, step_type, discount, action_distribution, actor_loss, critics, target_critics, neg_entropy)#

Bases: tuple

Create new instance of SarsaInfo(reward, step_type, discount, action_distribution, actor_loss, critics, target_critics, neg_entropy)

action_distribution#

Alias for field number 3

actor_loss#

Alias for field number 4

critics#

Alias for field number 5

discount#

Alias for field number 2

neg_entropy#

Alias for field number 7

reward#

Alias for field number 0

step_type#

Alias for field number 1

target_critics#

Alias for field number 6

class SarsaLossInfo(actor, critic, alpha, neg_entropy)#

Bases: tuple

Create new instance of SarsaLossInfo(actor, critic, alpha, neg_entropy)

actor#

Alias for field number 0

alpha#

Alias for field number 2

critic#

Alias for field number 1

neg_entropy#

Alias for field number 3

class SarsaState(prev_observation, prev_step_type, actor, critics, target_critics, noise)#

Bases: tuple

Create new instance of SarsaState(prev_observation, prev_step_type, actor, critics, target_critics, noise)

actor#

Alias for field number 2

critics#

Alias for field number 3

noise#

Alias for field number 5

prev_observation#

Alias for field number 0

prev_step_type#

Alias for field number 1

target_critics#

Alias for field number 4

alf.algorithms.taac_algorithm#

class ActPredOutput(dists, b, actor_a, taus, q_values2)#

Bases: tuple

Create new instance of ActPredOutput(dists, b, actor_a, taus, q_values2)

actor_a#

Alias for field number 2

b#

Alias for field number 1

dists#

Alias for field number 0

q_values2#

Alias for field number 4

taus#

Alias for field number 3

class Distributions(beta_dist, b1_a_dist)#

Bases: tuple

Create new instance of Distributions(beta_dist, b1_a_dist)

b1_a_dist#

Alias for field number 1

beta_dist#

Alias for field number 0

Mode#

alias of alf.algorithms.taac_algorithm.AlgorithmMode

class TAACTDLoss(gamma=0.99, td_error_loss_fn=<function element_wise_squared_loss>, debug_summaries=False, name='TAACTDLoss')[source]#

Bases: torch.nn.modules.module.Module

This TD loss implements the compare-through multi-step Q operator \(\mathcal{T}^{\pi^{\text{ta}}}\) proposed in the TAAC paper. For a sampled trajectory, it compares the beta action \(\tilde{b}_n\) sampled from the current policy with the historical rollout beta action \(b_n\) step by step, and uses the minimum \(n\) that has \(\tilde{b}_n\lor b_n=1\) as the target step for bootstrapping.
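The search for the target step can be sketched as follows. This is a simplified illustration for a single trajectory: the function name, the convention that the search starts at \(n=1\), and the fallback to the last available step when no such \(n\) exists are all assumptions of this example, not details confirmed by the description above.

```python
def compare_through_target_step(rollout_b, current_b):
    """Return the smallest n >= 1 with current_b[n] == 1 or rollout_b[n] == 1.

    Both arguments are lists of 0/1 beta actions indexed by time step.
    """
    last = len(rollout_b) - 1
    for n in range(1, last + 1):
        if current_b[n] == 1 or rollout_b[n] == 1:
            return n
    # Assumption: fall back to the last step of the sampled trajectory.
    return last


# Rollout switched at step 3; the current policy never switches.
n1 = compare_through_target_step([1, 0, 0, 1, 0], [1, 0, 0, 0, 0])
# The current policy switches earlier, at step 2.
n2 = compare_through_target_step([1, 0, 0, 1, 0], [1, 0, 1, 0, 0])
# Neither switches: use the last step.
n3 = compare_through_target_step([1, 0, 0, 0], [1, 0, 0, 0])
```

As long as both the rollout and the current policy keep repeating the previous action, the trajectory segment is shared and the bootstrapping step can be pushed further into the future.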

Parameters
  • gamma (float|list[float]) – A discount factor for future rewards. For multi-dim reward, this can also be a list of discounts, each discount applies to a reward dim.

  • td_error_loss_fn (Callable) – A function for computing the TD error loss. This function takes as input the target and the estimated Q values and returns the loss for each element of the batch.

  • debug_summaries (bool) – True if debug summaries should be created.

  • name (str) – The name of this loss.

forward(info, value, target_value)[source]#

Calculate the TD loss. The first dimension of all the tensors is the time dimension and the second dimension is the batch dimension.

Parameters
  • info (TaacInfo) – TaacInfo collected from train_step().

  • value (torch.Tensor) – the tensor for the value at each time step. The loss is between this and the calculated return.

  • target_value (torch.Tensor) – the tensor for the value at each time step. This is used to calculate return.

Returns

TD loss with the extra field same as the loss.

Return type

LossInfo

property gamma#

Return the \(\gamma\) value for discounting future rewards.

Returns

a rank-0 or rank-1 (multi-dim reward) floating tensor.

Return type

Tensor

training: bool#
class TaacActorInfo(actor_loss, b1_a_entropy, beta_entropy, adv, value_loss)#

Bases: tuple

Create new instance of TaacActorInfo(actor_loss, b1_a_entropy, beta_entropy, adv, value_loss)

actor_loss#

Alias for field number 0

adv#

Alias for field number 3

b1_a_entropy#

Alias for field number 1

beta_entropy#

Alias for field number 2

value_loss#

Alias for field number 4

class TaacAlgorithm(name='TaacAlgorithm', *args, **kwargs)[source]#

Bases: alf.algorithms.taac_algorithm.TaacAlgorithmBase

Model temporal abstraction by action repetition. See

“TAAC: Temporally Abstract Actor-Critic for Continuous Control”, Yu et al., arXiv 2021.

for algorithm details.

See TaacAlgorithmBase for argument description.

training: bool#
class TaacAlgorithmBase(observation_spec, action_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), actor_network_cls=<class 'alf.networks.actor_distribution_networks.ActorDistributionNetwork'>, critic_network_cls=<class 'alf.networks.critic_networks.CriticNetwork'>, actor_observation_processors=Detach(), reward_weights=None, num_critic_replicas=2, epsilon_greedy=None, env=None, config=None, target_update_tau=0.05, target_update_period=1, critic_loss_ctor=None, actor_optimizer=None, critic_optimizer=None, alpha_optimizer=None, initial_alpha=1.0, debug_summaries=False, randomize_first_state_tau=False, b1_advantage_clipping=None, max_repeat_steps=None, target_entropy=None, checkpoint=None, name='TaacAlgorithmBase')[source]#

Bases: alf.algorithms.off_policy_algorithm.OffPolicyAlgorithm

Temporally abstract actor-critic algorithm.

In a nutshell, for inference TAAC adds a second stage that chooses between a candidate trajectory \(\hat{\tau}\) output by an SAC actor and the previous trajectory \(\tau^-\). For policy evaluation, TAAC uses a compare-through Q operator for TD backup by re-using state-action sequences that have shared actions between rollout and training. For policy improvement, the new actor gradient is approximated by multiplying the \(\frac{\partial Q}{\partial a}\) term in the original SAC’s actor gradient by a scaling factor, where the scaling factor is the optimal probability of choosing \(\hat{\tau}\) in the second stage.

Different sub-algorithms implement different forms of the ‘trajectory’ concept, for example, it can be a constant function representing the same action, or a quadratic function.

Parameters
  • observation_spec (nested TensorSpec) – representing the observations.

  • action_spec (BoundedTensorSpec) – representing the continuous action.

  • reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).

  • actor_network_cls (Callable) – is used to construct the actor network. The constructed actor network will be called to sample continuous actions.

  • critic_network_cls (Callable) – is used to construct the critic network for estimating Q(s,a) given that the action is continuous.

  • actor_observation_processors (Nest) – a nest of observation processors applied to the inputs of the actor network. Note that any configured input_preprocessors of actor_network_cls will be overwritten by a tuple of this one and a preprocessor of the prev action, for modeling \(\pi(a|s,a^-)\).

  • reward_weights (None|list[float]) – this is only used when the reward is multidimensional. In that case, the weighted sum of the q values is used for training the actor if reward_weights is not None. Otherwise, the sum of the q values is used.

  • num_critic_replicas (int) – number of critics to be used. Default is 2.

  • epsilon_greedy (float) – a floating value in [0,1], representing the chance of action sampling instead of taking argmax. This can help prevent a dead loop in some deterministic environment like Breakout. Only used for evaluation. If None, its value is taken from config.epsilon_greedy and then alf.get_config_value(TrainerConfig.epsilon_greedy).

  • env (Environment) – The environment to interact with. env is a batched environment, which means that it runs multiple simulations simultaneously. env only needs to be provided to the root algorithm.

  • config (TrainerConfig) – config for training. It only needs to be provided to the algorithm which performs train_iter() by itself.

  • target_update_tau (float) – Factor for soft update of the target networks.

  • target_update_period (int) – Period for soft update of the target networks.

  • critic_loss_ctor (None|OneStepTDLoss|MultiStepLoss) – a critic loss constructor. If None, a default TAACTDLoss will be used.

  • actor_optimizer (torch.optim.Optimizer) – The optimizer for actor.

  • critic_optimizer (torch.optim.Optimizer) – The optimizer for critic.

  • alpha_optimizer (torch.optim.Optimizer) – The optimizer for alpha.

  • initial_alpha (float) – the initial entropy weight for both policies.

  • debug_summaries (bool) – True if debug summaries should be created.

  • randomize_first_state_tau (bool) – whether to randomize state.tau at the beginning of an episode during rollout and training. Potentially this helps exploration. This was turned off in Yu et al. 2021.

  • b1_advantage_clipping (None|tuple[float]) – option for clipping the advantage (defined as \(Q(s,\hat{\tau}) - Q(s,\tau^-)\)) when computing \(\beta_1\). If not None, it should be a pair of numbers [min_adv, max_adv].

  • max_repeat_steps (None|int) – the max number of steps to repeat during rollout and evaluation. This value doesn’t impact the switch during training.

  • target_entropy (Callable|tuple[Callable]|None) – If a callable function, then it will be called on the action spec to calculate a target entropy. If None, a default entropy will be calculated. To set separate entropy targets for the two stage policies, this argument can be a tuple of two callables.

  • checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.

  • name (str) – name of the algorithm
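The reward_weights rule in the parameter list (weighted sum of the q values if weights are given, plain sum otherwise) can be sketched for a single multi-dimensional Q estimate. The function combine_q_values is a hypothetical helper on plain Python lists, not part of ALF.

```python
def combine_q_values(q_values, reward_weights=None):
    """Reduce per-reward-dimension Q values to one scalar for the actor."""
    if reward_weights is None:
        # No weights: use the plain sum of the q values.
        return sum(q_values)
    if len(reward_weights) != len(q_values):
        raise ValueError("reward_weights must match the reward dimension")
    return sum(w * q for w, q in zip(reward_weights, q_values))


plain = combine_q_values([1.0, 2.0])
weighted = combine_q_values([1.0, 2.0], reward_weights=[0.5, 0.25])
```

In this example the plain sum is 3.0, while the weighted sum is 0.5 * 1.0 + 0.25 * 2.0 = 1.0.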

after_update(root_inputs, info)[source]#

Do things after completing one gradient update (i.e. update_with_gradient()). This function can be used for post-processing following one minibatch update, such as copying a training model to a target model in SAC, DQN, etc.

Parameters
  • root_inputs (nest) – temporally batched inputs for the rollout_step() of the root algorithm collected during unroll().

  • info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() for on-policy training or train_step() for off-policy training.

calc_loss(info)[source]#

Calculate the loss at each step for each sample.

Parameters

info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).

Returns

loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).

Return type

LossInfo

predict_step(inputs, state)[source]#

Predict for one step of observation.

This is only used for evaluation, so it only needs to perform the computations for generating the action distribution.

Parameters
  • time_step (TimeStep) – Current observation and other inputs for computing action.

  • state (nested Tensor) – should be consistent with predict_state_spec

Returns

  • output (nested Tensor): should be consistent with action_spec.

  • state (nested Tensor): should be consistent with predict_state_spec.

Return type

AlgStep

rollout_step(inputs, state)[source]#

Rollout for one step of inputs.

It is called to calculate output for every environment step. For on-policy training, it also needs to generate necessary information for calc_loss(). For off-policy training, it needs to generate necessary information for train_step().

Parameters
  • inputs (nested Tensor) – inputs for prediction.

  • state (nested Tensor) – network state (for RNN).

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match rollout_state_spec.

  • info (nested Tensor): For on-policy training it will be temporally batched and passed as info for calc_loss(). For off-policy training, it will be stored into the replay buffer and retrieved for train_step() as rollout_info.

Return type

AlgStep

summarize_rollout(experience)[source]#

Generate summaries for rollout.

Parameters
  • experience – experience collected from rollout_step().

  • custom_summary – when specified it is a function that will be called every time when this summarize_rollout hook is called. This provides a convenient way for the user to extend summarize_rollout from ALF configs.

train_step(inputs, state, rollout_info)[source]#

Perform one step of training computation.

It is called to calculate output for every time step for a batch of experience from replay buffer. It also needs to generate necessary information for calc_loss().

Parameters
  • inputs (nested Tensor) – inputs for train.

  • state (nested Tensor) – consistent with train_state_spec.

  • rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match train_state_spec.

  • info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.

Return type

AlgStep

training: bool#
class TaacCriticInfo(critics, target_critic, value_loss)#

Bases: tuple

Create new instance of TaacCriticInfo(critics, target_critic, value_loss)

critics#

Alias for field number 0

target_critic#

Alias for field number 1

value_loss#

Alias for field number 2

class TaacInfo(reward, step_type, tau, prev_tau, discount, action_distribution, rollout_b, b, actor, critic, alpha, repeats)#

Bases: tuple

Create new instance of TaacInfo(reward, step_type, tau, prev_tau, discount, action_distribution, rollout_b, b, actor, critic, alpha, repeats)

action_distribution#

Alias for field number 5

actor#

Alias for field number 8

alpha#

Alias for field number 10

b#

Alias for field number 7

critic#

Alias for field number 9

discount#

Alias for field number 4

prev_tau#

Alias for field number 3

repeats#

Alias for field number 11

reward#

Alias for field number 0

rollout_b#

Alias for field number 6

step_type#

Alias for field number 1

tau#

Alias for field number 2

class TaacLAlgorithm(name='TaacLAlgorithm', inverse_mode=True, *args, **kwargs)[source]#

Bases: alf.algorithms.taac_algorithm.TaacAlgorithmBase

TaacL: Piecewise linear trajectory policy for continuous control.

For a linear trajectory, let \(a\) be the action and \(v\) the first derivative. Its dynamics is:

\[\begin{split}\begin{array}{ll} v_{t+1} &\leftarrow v_t\\ a_{t+1} &\leftarrow v_{t+1} + a_t\\ \end{array}\end{split}\]

TaacL’s trajectory is piece-wise linear. Each time the policy decides whether to repeat the previous linear traj or generate a new one. Importantly, to generate a new one the policy doesn’t directly generate the entire set of two parameters \((a,v)\) because this will result in bad exploration in the action space. Instead,

\[\begin{split}\begin{array}{ll} a_{t+1} &\sim \pi\\ v_{t+1} &\leftarrow a_{t+1} - a_t\\ \end{array}\end{split}\]

For \(a\in[0,1]\) and \(v\in[0,1]\), the actual dynamics is \(a_{t+1}\leftarrow \max(\min(a_t+2v_{t+1},1),-1)\).

See TaacAlgorithmBase for other argument description.

Parameters

inverse_mode (bool) – this argument decides how the new traj is computed when b=1. If it’s False, then the new action is treated as the new first derivative v; otherwise the new action is treated as the new action a, and v is inversely inferred.
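The dynamics described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not ALF's implementation: `taacl_step`, `clip`, and `sampled` (a stand-in for a draw from the policy) are hypothetical names, the scaling factor applied to v in the actual dynamics is omitted, and actions are simply clamped to [-1, 1].

```python
def clip(x, low=-1.0, high=1.0):
    """Clamp x to [low, high]."""
    return max(min(x, high), low)


def taacl_step(a, v, b, sampled=None, inverse_mode=True):
    """Advance a piecewise-linear trajectory by one step (illustrative only).

    b == 0 repeats the previous linear trajectory; b == 1 starts a new one
    from ``sampled``, a hypothetical stand-in for a draw from the policy.
    """
    if b == 1:
        if inverse_mode:
            # sampled is the new action a_{t+1}; infer v_{t+1} = a_{t+1} - a_t.
            v = sampled - a
        else:
            # sampled is used directly as the new first derivative v_{t+1}.
            v = sampled
    # Linear dynamics a_{t+1} = v_{t+1} + a_t, clamped to the action range.
    a = clip(a + v)
    return a, v


a, v = 0.0, 0.0
a, v = taacl_step(a, v, b=1, sampled=0.2)  # start a new linear trajectory
trajectory = [a]
for _ in range(5):  # keep repeating it
    a, v = taacl_step(a, v, b=0)
    trajectory.append(a)
```

Repeating the trajectory keeps adding the same v to the action until the clamp is reached, which is the "piece-wise linear" behavior the description refers to.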

training: bool#
class TaacLossInfo(actor, critic, alpha)#

Bases: tuple

Create new instance of TaacLossInfo(actor, critic, alpha)

actor#

Alias for field number 0

alpha#

Alias for field number 2

critic#

Alias for field number 1

class TaacQAlgorithm(name='TaacQAlgorithm', inverse_mode=True, *args, **kwargs)[source]#

Bases: alf.algorithms.taac_algorithm.TaacLAlgorithm

TaacQ: Piecewise quadratic trajectory policy for continuous control.

For a quadratic trajectory, let \(a\) be the action, \(u\) be the second derivative, and \(v\) be the first derivative. Its dynamics is:

\[\begin{split}\begin{array}{ll} u_{t+1} &\leftarrow u_t\\ v_{t+1} &\leftarrow u_{t+1} + v_t\\ a_{t+1} &\leftarrow v_{t+1} + a_t\\ \end{array}\end{split}\]

TaacQ’s trajectory is piece-wise quadratic. Each time the policy decides whether to repeat the previous quadratic traj or generate a new one. Importantly, to generate a new one the policy doesn’t directly generate the entire set of three parameters \((a,v,u)\) because this will result in bad exploration in the action space. Instead,

\[\begin{split}\begin{array}{ll} a_{t+1} &\sim \pi\\ v_{t+1} &\leftarrow a_{t+1} - a_t\\ u_{t+1} &\leftarrow v_{t+1}\\ \end{array}\end{split}\]

where the last two steps assume resetting \(v_t\) to zero.

For \(a\in[0,1]\), \(v\in[0,1]\), and \(u\in[0,1]\), the actual dynamics is \(v_{t+1}\leftarrow \max(\min(v_t+2u_{t+1},1),-1)\) and \(a_{t+1}\leftarrow \max(\min(a_t+2v_{t+1},1),-1)\).

See TaacAlgorithmBase for other argument description.

Parameters

inverse_mode (bool) – this argument decides how the new traj is computed when b=1. If it’s False, then the new action is treated as the new second derivative u; otherwise the new action is treated as the new action a, and u is inversely inferred. In either case, the current v is first set to 0, and then a new v is computed.
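A minimal sketch of the quadratic dynamics, again in plain Python and under the same caveats: `taacq_step` and `sampled` are hypothetical names, and the scaling and clipping of the actual dynamics are left out to keep the recurrence visible.

```python
def taacq_step(a, v, u, b, sampled=None, inverse_mode=True):
    """Advance a piecewise-quadratic trajectory by one step (illustrative only).

    b == 0 repeats the previous quadratic trajectory; b == 1 starts a new one
    from ``sampled``, a hypothetical stand-in for a draw from the policy.
    """
    if b == 1:
        # In either mode the current first derivative is reset to zero first.
        v = 0.0
        if inverse_mode:
            # sampled is the new action; with v reset, u_{t+1} = a_{t+1} - a_t.
            u = sampled - a
        else:
            # sampled is used directly as the new second derivative.
            u = sampled
    v = u + v  # v_{t+1} = u_{t+1} + v_t
    a = v + a  # a_{t+1} = v_{t+1} + a_t
    return a, v, u


a, v, u = 0.0, 0.0, 0.0
a, v, u = taacq_step(a, v, u, b=1, sampled=0.1)  # start a new trajectory
actions = [a]
for _ in range(2):  # repeat it twice
    a, v, u = taacq_step(a, v, u, b=0)
    actions.append(a)
```

With a constant u of 0.1 the action increments grow step by step (0.1, then 0.2, then 0.3), so the actions trace a quadratic curve rather than a straight line.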

training: bool#
class TaacState(tau, repeats)#

Bases: tuple

Create new instance of TaacState(tau, repeats)

repeats#

Alias for field number 1

tau#

Alias for field number 0

class Tau(a, v, u)#

Bases: tuple

Create new instance of Tau(a, v, u)

a#

Alias for field number 0

u#

Alias for field number 2

v#

Alias for field number 1

alf.algorithms.td_loss#

class TDLoss(gamma=0.99, td_error_loss_fn=<function element_wise_squared_loss>, td_lambda=0.95, normalize_target=False, debug_summaries=False, name='TDLoss')[source]#

Bases: torch.nn.modules.module.Module

Temporal difference loss.

Let \(G_{t:T}\) be the bootstrapped return from t to T:

\[G_{t:T} = \sum_{i=t+1}^T \gamma^{i-t-1}R_i + \gamma^{T-t} V(s_T)\]

If td_lambda = 1, the target for step t is \(G_{t:T}\).

If td_lambda = 0, the target for step t is \(G_{t:t+1}\).

If 0 < td_lambda < 1, the target for step t is the \(\lambda\)-return:

\[G_t^\lambda = (1 - \lambda) \sum_{i=t+1}^{T-1} \lambda^{i-t-1}G_{t:i} + \lambda^{T-t-1} G_{t:T}\]

There is a simple relationship between \(\lambda\)-return and the generalized advantage estimation \(\hat{A}^{GAE}_t\):

\[G_t^\lambda = \hat{A}^{GAE}_t + V(s_t)\]

where the generalized advantage estimation is defined as:

\[\hat{A}^{GAE}_t = \sum_{i=t}^{T-1}(\gamma\lambda)^{i-t}(R_{i+1} + \gamma V(s_{i+1}) - V(s_i))\]

References:

Schulman et al. High-Dimensional Continuous Control Using Generalized Advantage Estimation

Sutton et al. Reinforcement Learning: An Introduction, Chapter 12, 2018
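The relationship between the \(\lambda\)-return and GAE can be checked numerically. The sketch below is plain Python with hypothetical helper names (`n_step_return`, `lambda_return`, `gae`); it uses the weights \(\gamma^{i-t-1}\) and \(\lambda^{i-t-1}\) so that the mixture weights sum to one, works on a single trajectory, and ignores the step_type and discount handling that a real loss needs at episode boundaries.

```python
def n_step_return(rewards, values, t, T, gamma):
    """G_{t:T} = sum_{i=t+1}^{T} gamma^(i-t-1) R_i + gamma^(T-t) V(s_T).

    rewards[i] holds R_i (rewards[0] is unused) and values[i] holds V(s_i).
    """
    g = sum(gamma ** (i - t - 1) * rewards[i] for i in range(t + 1, T + 1))
    return g + gamma ** (T - t) * values[T]


def lambda_return(rewards, values, t, gamma, lam):
    """Lambda-return: a weighted mixture of the bootstrapped returns G_{t:i}."""
    T = len(values) - 1
    mix = sum(lam ** (i - t - 1) * n_step_return(rewards, values, t, i, gamma)
              for i in range(t + 1, T))
    tail = lam ** (T - t - 1) * n_step_return(rewards, values, t, T, gamma)
    return (1 - lam) * mix + tail


def gae(rewards, values, t, gamma, lam):
    """Generalized advantage estimate: discounted sum of one-step TD errors."""
    T = len(values) - 1
    return sum((gamma * lam) ** (i - t)
               * (rewards[i + 1] + gamma * values[i + 1] - values[i])
               for i in range(t, T))


rewards = [0.0, 1.0, 0.5, -0.2, 2.0]  # made-up numbers for illustration
values = [0.3, 0.1, -0.4, 0.8, 1.2]
gamma, lam = 0.99, 0.95
gaps = [abs(lambda_return(rewards, values, t, gamma, lam)
            - (gae(rewards, values, t, gamma, lam) + values[t]))
        for t in range(len(values) - 1)]
```

Every entry of `gaps` should be zero up to floating-point error, which is the identity \(G_t^\lambda = \hat{A}^{GAE}_t + V(s_t)\) stated above.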

Parameters
  • gamma (Union[float, List[float]]) – A discount factor for future rewards. For multi-dim reward, this can also be a list of discounts, each discount applies to a reward dim.

  • td_error_loss_fn (Callable) – A function for computing the TD errors loss. This function takes as input the target and the estimated Q values and returns the loss for each element of the batch.

  • td_lambda (float) – Lambda parameter for TD-lambda computation.

  • normalize_target (bool) – whether to normalize target. Note that the effect of this is to change the loss. The critic value itself is not normalized.

  • debug_summaries (bool) – True if debug summaries should be created.

  • name (str) – The name of this loss.

compute_td_target(info, target_value)[source]#

Calculate the td target.

The first dimension of all the tensors is the time dimension and the second dimension is the batch dimension.

Parameters
  • info (namedtuple) – experience collected from unroll() or a replay buffer. All tensors are time-major. info should contain the following fields: reward, step_type, and discount.

  • target_value (torch.Tensor) – the time-major tensor for the value at each time step. This is used to calculate return. target_value can be same as value.

Returns

td_target

forward(info, value, target_value)[source]#

Calculate the loss.

The first dimension of all the tensors is the time dimension and the second dimension is the batch dimension.

Parameters
  • info (namedtuple) – experience collected from unroll() or a replay buffer. All tensors are time-major. info should contain the following fields: reward, step_type, and discount.

  • value (Tensor) – the time-major tensor for the value at each time step. The loss is between this and the calculated return.

  • target_value (Tensor) – the time-major tensor for the value at each time step. This is used to calculate return. target_value can be same as value.

Returns

loss info, with the extra field being the same as loss.

Return type

LossInfo

property gamma#

Return the \(\gamma\) value for discounting future rewards.

Returns

a rank-0 or rank-1 (multi-dim reward) floating tensor.

Return type

Tensor

training: bool#
class TDQRLoss(num_quantiles=50, gamma=0.99, td_error_loss_fn=<function huber_function>, td_lambda=1.0, sum_over_quantiles=False, debug_summaries=False, name='TDQRLoss')[source]#

Bases: alf.algorithms.td_loss.TDLoss

Temporal difference quantile regression loss. Compared to TDLoss, GAE support has not been implemented.

Parameters
  • num_quantiles (int) – the number of quantiles.

  • gamma (Union[float, List[float]]) – A discount factor for future rewards. For multi-dim reward, this can also be a list of discounts, each discount applies to a reward dim.

  • td_error_loss_fn (Callable) – A function for computing the TD errors loss. This function takes as input the target and the estimated Q values and returns the loss for each element of the batch.

  • td_lambda (float) – Lambda parameter for TD-lambda computation. Currently only supports 1 and 0.

  • sum_over_quantiles (bool) – If True, the quantile regression loss will be summed along the quantile dimension. Otherwise, it will be averaged along the quantile dimension instead. Default is False.

  • debug_summaries (bool) – True if debug summaries should be created

  • name (str) – The name of this loss.

forward(info, value, target_value)[source]#

Calculate the loss.

The first dimension of all the tensors is the time dimension and the second dimension is the batch dimension.

Parameters
  • info (namedtuple) – experience collected from unroll() or a replay buffer. All tensors are time-major. info should contain the following fields: reward, step_type, and discount.

  • value (Tensor) – the time-major tensor for the value at each time step. The loss is between this and the calculated return.

  • target_value (Tensor) – the time-major tensor for the value at each time step. This is used to calculate return. target_value can be same as value.

Returns

loss info, with the extra field being the same as loss.

Return type

LossInfo

training: bool#

alf.algorithms.trac_algorithm#

Trust region actor-critic algorithm.

class TracAlgorithm(observation_spec, action_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), env=None, config=None, ac_algorithm_cls=<class 'alf.algorithms.actor_critic_algorithm.ActorCriticAlgorithm'>, action_dist_clip_per_dim=0.01, checkpoint=None, debug_summaries=False, name='TracAlgorithm')[source]#

Bases: alf.algorithms.rl_algorithm.RLAlgorithm

Trust-region actor-critic. It compares the action distributions after the SGD with the action distributions from the previous model. If the average distance is too big, the new parameters are shrunk as:

    w_new' = w_old + 0.9 * distance_clip / distance * (w_new - w_old)

If the distribution is Categorical, the distance is \(||logits_1 - logits_2||^2\), and if the distribution is Deterministic, it is \(||loc_1 - loc_2||^2\); otherwise it is \(KL(d1||d2) + KL(d2||d1)\). The reason for using \(||logits_1 - logits_2||^2\) for categorical distributions is that KL can be small even if there are large differences in logits when the entropy is small. This means that KL cannot fully capture how much the distribution has changed.
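The shrinking rule can be sketched on a flat list of parameters. This is a hedged illustration only: `shrink_update` is a hypothetical helper, parameters are plain Python floats rather than network tensors, and treating `distance > distance_clip` as the "too big" condition is an assumption.

```python
def shrink_update(w_old, w_new, distance, distance_clip):
    """Pull an update back toward the old parameters if it moved too far.

    distance is the average distance between the old and new action
    distributions; distance_clip is the allowed maximum (assumed threshold).
    """
    if distance <= distance_clip:
        # The update stayed inside the trust region: keep it unchanged.
        return list(w_new)
    # Scale the step (w_new - w_old) down by 0.9 * distance_clip / distance.
    scale = 0.9 * distance_clip / distance
    return [o + scale * (n - o) for o, n in zip(w_old, w_new)]


w_old = [0.0, 0.0]
w_new = [1.0, 2.0]
shrunk = shrink_update(w_old, w_new, distance=0.1, distance_clip=0.01)
kept = shrink_update(w_old, w_new, distance=0.005, distance_clip=0.01)
```

Here the measured distance is ten times the clip value, so the step is scaled by 0.09; in the second call the update is within the limit and is returned unchanged.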

Parameters
  • action_spec (nested BoundedTensorSpec) – representing the actions.

  • ac_algorithm_cls (type) – Actor Critic Algorithm cls.

  • action_dist_clip_per_dim (float) – action dist clip per dimension

  • checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.

  • debug_summaries (bool) – True if debug summaries should be created.

  • name (str) – Name of this algorithm.

after_update(root_inputs, info)[source]#

Adjust actor parameter according to KL-divergence.

calc_loss(info)[source]#

Calculate the loss at each step for each sample.

Parameters

info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).

Returns

loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).

Return type

LossInfo

predict_step(time_step, state)[source]#

Predict for one step of observation.

This is only used for evaluation, so it only needs to perform the computations for generating the action distribution.

Parameters
  • time_step (TimeStep) – Current observation and other inputs for computing action.

  • state (nested Tensor) – should be consistent with predict_state_spec

Returns

  • output (nested Tensor): should be consistent with action_spec.

  • state (nested Tensor): should be consistent with predict_state_spec.

Return type

AlgStep

preprocess_experience(root_inputs, rollout_info, batch_info)[source]#

This function is called on the experiences obtained from a replay buffer. An example usage of this function is to calculate advantages and returns in PPOAlgorithm.

The shapes of tensors in experience are assumed to be \((B, T, ...)\).

Parameters
  • root_inputs (nest) – input for rollout_step() of the root algorithm. This is from replay buffer. Note this is not same as the input of rollout_step() of self unless self is the root algorithm.

  • rollout_info (nested Tensor) – AlgStep.info from rollout_step() for this algorithm.

  • batch_info (BatchInfo) – information about this batch of data

Returns

  • processed root_inputs

  • processed rollout_info

Return type

tuple

rollout_step(time_step, state)[source]#

Rollout for one step.

train_step(exp, state, rollout_info)[source]#

Perform one step of training computation.

It is called to calculate output for every time step for a batch of experience from replay buffer. It also needs to generate necessary information for calc_loss().

Parameters
  • inputs (nested Tensor) – inputs for train.

  • state (nested Tensor) – consistent with train_state_spec.

  • rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match train_state_spec.

  • info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.

Return type

AlgStep

training: bool#
class TracExperience(observation, step_type, state, action_param, prev_action)#

Bases: tuple

Create new instance of TracExperience(observation, step_type, state, action_param, prev_action)

action_param#

Alias for field number 3

observation#

Alias for field number 0

prev_action#

Alias for field number 4

state#

Alias for field number 2

step_type#

Alias for field number 1

class TracInfo(action_distribution, observation, state, ac, prev_action)#

Bases: tuple

Create new instance of TracInfo(action_distribution, observation, state, ac, prev_action)

ac#

Alias for field number 3

action_distribution#

Alias for field number 0

observation#

Alias for field number 1

prev_action#

Alias for field number 4

state#

Alias for field number 2

alf.algorithms.vae#

Variational auto encoder.

class DiscreteVAE(z_spec, input_tensor_spec=None, z_network_cls=<class 'alf.networks.encoding_networks.EncodingNetwork'>, prior_input_tensor_spec=None, prior_z_network_cls=None, mode='st', gumbel_temp_scheduler=1.0, beta=1.0, target_kld_per_categorical=None, beta_optimizer=None, name='DiscreteVAE')[source]#

Bases: alf.algorithms.vae.VariationalAutoEncoder

VAE with a discrete posterior distribution. The latent z might be a single categorical variable or a vector of categoricals. Because the re-parameterization trick can no longer be applied to the discrete distribution, we instead use the straight-through (ST) gradient estimator to train the encoder.

Bengio et al., "Estimating or Propagating Gradients Through Stochastic
Neurons for Conditional Computation", 2013.

In short, we can re-parameterize the one-hot latent embedding \(z\) as

\[\hat{z} = z + z_{prob} - SG(z_{prob})\]

Because \(z\) is a sampled discrete variable, it has no gradient. So the parameter gradient is

\[\frac{\partial L}{\partial \hat{z}}\frac{\partial \hat{z}}{\partial \theta} = \frac{\partial L}{\partial \hat{z}}\frac{\partial z_{prob}}{\partial \theta}\]

Alternatively, we provide the option of ST Gumbel Softmax gradient estimator.

Jang et al., "CATEGORICAL REPARAMETERIZATION WITH GUMBEL-SOFTMAX", 2017.

This estimator applies the above ST trick to the Gumbel-softmax distribution, which uses the Gumbel trick to reparameterize the categorical sampling process. The paper claims that the ST Gumbel-softmax gradient estimator has a lower variance than the plain ST estimator.
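The forward pass of Gumbel-softmax sampling can be sketched without any deep learning framework. The code below is an illustration under stated assumptions, not the class's implementation: `gumbel_softmax_sample` and `softmax` are hypothetical helpers, and only the sampled values are shown, since the straight-through gradient itself needs an autograd framework with a stop-gradient operation.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def gumbel_softmax_sample(logits, temperature, rng):
    """Return (soft, hard) samples for one categorical variable.

    soft is the relaxed Gumbel-softmax sample; hard is the one-hot sample
    obtained by argmax, which is what a straight-through forward pass emits.
    """
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)  # guard against log(0)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append(logit + gumbel)
    soft = softmax([n / temperature for n in noisy])
    k = max(range(len(noisy)), key=noisy.__getitem__)
    hard = [1.0 if i == k else 0.0 for i in range(len(noisy))]
    return soft, hard


rng = random.Random(0)
soft, hard = gumbel_softmax_sample([1.0, 0.5, -2.0], temperature=1.0, rng=rng)
```

Lower temperatures push `soft` toward the one-hot `hard` sample, which is why a temperature scheduler is a natural knob for this mode.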

Parameters
  • z_spec (BoundedTensorSpec) – a tensor spec for the discrete posterior. It has to be rank-one, representing a vector of discrete variables. The value bound of each variable must be identical and the lower bound has to be 0.

  • input_tensor_spec (Union[TensorSpec, List[ForwardRef], Tuple[()], Tuple[ForwardRef, …], Dict[str, ForwardRef]]) – the input spec.

  • z_network_cls (Callable) – an encoding network to encode input data into a vector of logits. If prior_z_network_cls is None, this network must handle input with spec input_tensor_spec. If prior_z_network_cls is not None, this network must handle input with spec (prior_input_tensor_spec, input_tensor_spec, prior_z_network.output_spec).

  • prior_input_tensor_spec (Union[TensorSpec, List[ForwardRef], Tuple[()], Tuple[ForwardRef, …], Dict[str, ForwardRef]]) – the input spec for prior_z_network.

  • prior_z_network_cls (Callable) – an encoding network that outputs a vector of logits representing the prior z distribution given the prior input.

  • mode (str) – either ‘st’ or ‘st-gumbel’.

  • gumbel_temp_scheduler (Scheduler) – the temperature scheduler for gumbel-softmax. Only used when mode=='st-gumbel'.

  • beta (float) – the weight for KL-divergence

  • target_kld_per_categorical (float) – if not None, then this will be used as the target KLD per Categorical to automatically tune beta.

  • beta_optimizer (Optimizer) – if not None, will be used to train beta.

  • name (str) –

property output_spec#

Because the output is a floating one-hot vector, the shape is rank-two.

training: bool#
class VAEInfo(kld, z_std, loss, beta_loss, beta)#

Bases: tuple

Create new instance of VAEInfo(kld, z_std, loss, beta_loss, beta)

beta#

Alias for field number 4

beta_loss#

Alias for field number 3

kld#

Alias for field number 0

loss#

Alias for field number 2

z_std#

Alias for field number 1

class VAEOutput(z, z_mode, z_std)#

Bases: tuple

Create new instance of VAEOutput(z, z_mode, z_std)

z#

Alias for field number 0

z_mode#

Alias for field number 1

z_std#

Alias for field number 2

class VariationalAutoEncoder(z_dim, input_tensor_spec=None, preprocess_network=None, z_prior_network=None, beta=1.0, target_kld_per_dim=None, beta_optimizer=None, checkpoint=None, name='VariationalAutoEncoder')[source]#

Bases: alf.algorithms.algorithm.Algorithm

VariationalAutoEncoder encodes data into diagonal multivariate gaussian, performs sampling with reparametrization trick, and returns KL divergence between posterior and prior.

Mathematically:

\(\log p(x) \geq E_z \log P(x|z) - \beta KL(q(z|x) || prior(z))\)

The train_step() method returns the sampled z and the KLD. It is up to the user of this class to decode the returned z, compute the reconstruction loss, and combine it with the KL loss returned here to optimize the whole network.
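The two quantities involved, a reparameterized sample and the KL term, can be sketched in plain Python. This is a minimal illustration under assumptions, not the class's code: `sample_z`, `kld_to_standard_normal`, and `total_loss` are hypothetical helpers, the prior is assumed to be a standard normal (which would not hold when a prior network is supplied), and the reconstruction loss is passed in as a number because the decoder is the user's responsibility.

```python
import math
import random


def sample_z(z_mean, z_log_var, rng):
    """Reparameterization trick: z = mean + std * eps with eps ~ N(0, 1)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(z_mean, z_log_var)]


def kld_to_standard_normal(z_mean, z_log_var):
    """KL(N(mean, diag(exp(log_var))) || N(0, I)), summed over dimensions."""
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(z_mean, z_log_var))


def total_loss(reconstruction_loss, kld, beta=1.0):
    """Combine a user-computed reconstruction loss with the weighted KL term."""
    return reconstruction_loss + beta * kld


rng = random.Random(0)
z = sample_z([0.0, 1.0], [0.0, -1.0], rng)
kld = kld_to_standard_normal([1.0], [0.0])
loss = total_loss(reconstruction_loss=2.0, kld=kld, beta=0.5)
```

The KL term vanishes when the posterior already equals the assumed prior (zero mean, zero log-variance), and beta controls how strongly deviations from it are penalized.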

See vae_test.py for example usages to train vanilla vae, conditional vae and vae with prior network on mnist dataset.

Parameters
  • z_dim (int) – dimension of latent vector z, namely, the dimension for generating z_mean and z_log_var.

  • input_tensor_spec (Union[TensorSpec, List[ForwardRef], Tuple[()], Tuple[ForwardRef, …], Dict[str, ForwardRef]]) – the input spec which can be a nest. If preprocess_network is None, then it must be provided.

  • preprocess_network (EncodingNetwork) – an encoding network to preprocess input data before projecting it into (mean, log_var). If z_prior_network is None, this network must handle input with spec input_tensor_spec. If z_prior_network is not None, this network must handle input with spec (z_prior_network.input_tensor_spec, input_tensor_spec, z_prior_network.output_spec). If this is None, an MLP of hidden sizes (z_dim*2, z_dim*2) will be used.

  • z_prior_network (EncodingNetwork) – an encoding network that outputs concatenation of a prior mean and prior log var given the prior input. The network shouldn’t activate its output.

  • beta (float) – the weight for KL-divergence

  • target_kld_per_dim (float) – if not None, then this will be used as the target KLD per dim to automatically tune beta.

  • beta_optimizer (Optimizer) – if not None, will be used to train beta.

  • checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.

  • name (str) –

train_step(inputs, state=())[source]#
Parameters
  • inputs (nested Tensor) – data to be encoded. If there is a prior network, then inputs is a tuple of (prior_input, new_obs).

  • state (Tensor) – empty tuple ()

Returns

  • output (VAEOutput):

  • state: empty tuple ()

  • info (VAEInfo):

Return type

AlgStep

training: bool#

alf.algorithms.vq_vae#

Vector Quantized Variational AutoEncoder Algorithm.

class Vqvae(input_tensor_spec, num_embeddings, embedding_dim, encoder_ctor=<class 'alf.networks.encoding_networks.EncodingNetwork'>, decoder_ctor=<class 'alf.networks.encoding_networks.EncodingNetwork'>, optimizer=None, commitment_loss_weight=1.0, checkpoint=None, debug_summaries=False, name='Vqvae')[source]#

Bases: alf.algorithms.algorithm.Algorithm

Vector Quantized Variational AutoEncoder (VQVAE) algorithm, described in:

A van den Oord et al. “Neural Discrete Representation Learning”, NeurIPS 2017.

VQVAE differs from the standard VAE mainly in the following aspects:

  1. Discrete latent is used, instead of continuous latent as in standard VAE.

  2. Standard VAE uses Gaussian prior and posterior. VQVAE can be viewed as using a deterministic form of posterior, which is a categorical distribution with one-hot samples computed by nearest neighbor matching (Eq.1 of the paper). By using a uniform prior, the KL divergence is constant.
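The nearest neighbor matching in point 2 can be sketched as follows. This is an illustrative example in plain Python, not the algorithm's implementation: `quantize` is a hypothetical helper, the codebook is a small hand-written list, and squared Euclidean distance is assumed as the matching criterion.

```python
import math


def quantize(z_e, codebook):
    """Map an encoder output to its nearest codebook embedding.

    Returns the index of the nearest embedding, the corresponding one-hot
    posterior sample, and the quantized vector itself.
    """
    dists = [sum((a - b) ** 2 for a, b in zip(z_e, e)) for e in codebook]
    k = min(range(len(codebook)), key=dists.__getitem__)
    one_hot = [1.0 if i == k else 0.0 for i in range(len(codebook))]
    return k, one_hot, codebook[k]


codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]  # made-up embeddings
k, one_hot, z_q = quantize([0.9, 1.2], codebook)

# KL between a one-hot posterior and a uniform prior over K codes is log K,
# regardless of which code was selected.
kl_to_uniform = math.log(len(codebook))
```

Because the posterior puts all its mass on one code, its KL divergence to a uniform prior depends only on the codebook size, which is why that term can be dropped from the training loss.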

Parameters
  • input_tensor_spec (TensorSpec) – the tensor spec of the input.

  • num_embeddings (int) – the number of embeddings (size of codebook)

  • embedding_dim (int) – the dimensionality of embedding vectors

  • encoder_ctor (Callable) – called as encoder_ctor(observation_spec) to construct the encoding Network. The network takes the raw observation as input and outputs the latent representation.

  • decoder_ctor (Callable) – called as decoder_ctor(latent_spec) to construct the decoder.

  • optimizer (Optimizer|None) – if provided, it will be used to optimize the parameters of encoder_net, decoder_net and the embedding vectors.

  • commitment_loss_weight (float) – the weight for commitment loss.

  • checkpoint (None|str) – a string in the format of “prefix@path”, where the “prefix” is the multi-step path to the contents in the checkpoint to be loaded. “path” is the full path to the checkpoint file saved by ALF. Refer to Algorithm for more details.

predict_step(inputs, state=())[source]#

Predict for one step of inputs.

Parameters
  • inputs (nested Tensor) – inputs for prediction.

  • state (nested Tensor) – network state (for RNN).

Returns

  • output (nested Tensor): prediction result.

  • state (nested Tensor): should match predict_state_spec.

  • info (nest): information for analyzing the agent. In particular, if an element of the info is alf.summary.render.Image, it will be rendered during play. See alf/summary/render.py for detail.

Return type

AlgStep

train_step(inputs, state=())[source]#
Parameters

inputs (tensor) – with the shape the same as input_tensor_spec

training: bool#
class VqvaeLossInfo(quantization, commitment, reconstruction)#

Bases: tuple

Create new instance of VqvaeLossInfo(quantization, commitment, reconstruction)

commitment#

Alias for field number 1

quantization#

Alias for field number 0

reconstruction#

Alias for field number 2