alf.algorithms.ppg#

alf.algorithms.ppg.disjoint_policy_value_network#

class DisjointPolicyValueNetwork(observation_spec, action_spec, encoding_network_ctor=<class 'alf.networks.encoding_networks.EncodingNetwork'>, is_sharing_encoder=False, discrete_projection_net_ctor=<class 'alf.networks.projection_networks.CategoricalProjectionNetwork'>, continuous_projection_net_ctor=<class 'alf.networks.projection_networks.NormalProjectionNetwork'>, name='DisjointPolicyValueNetwork')[source]#

Bases: alf.networks.network.Network

A composite network with a policy component and a value component.

This network captures a category of networks proposed in the Phasic Policy Gradient paper. It consists of two components and three heads:

Value Component: a single value head that estimates the value function
Policy Component: one policy head that outputs the action distribution, and one auxiliary value head that behaves as a secondary value function estimator

The output of this network is a triplet, corresponding to the 3 heads in the order of (action distribution, value function, auxiliary value function).

About Architecture:

The Value Component and the Policy Component may share the same encoding network or have their own encoding network. When the encoding network is shared, it is called the “shared” architecture. If the encoding network is not shared, it is called the “dual” architecture.

NOTE that in the “shared” architecture, the encoder output is detached before it is fed to the value head. As a result, gradients from the value head cannot update the parameters of the encoder.

See https://github.com/HorizonRobotics/alf/issues/965 for a graphical illustration of such two different architectures.

NOTE:

The is_sharing_encoder = True situation corresponds to the ‘detached’ arch in OpenAI’s implementation and to the Single-Network PPG in the original paper. However, OpenAI’s implementation and the paper differ in an important way here. The paper reads (quoted):

During the policy phase, we detach the value function gradient at the last layer shared between the policy and value heads, preventing the value function gradient from influencing shared parameters. During the auxiliary phase, we take the value function gradient with respect to all parameters, including shared parameters.

In their implementation, the “true” (as opposed to the aux) value head is always detached, in both the policy and the auxiliary phase.

Our implementation follows OpenAI’s implementation, which keeps the true value head always detached.
In OpenAI’s implementation, the FC and Conv layers are initialized in a non-standard way. Our implementation initializes such layers with standard approaches.

The constructor of DisjointPolicyValueNetwork

Note that there are two projection network constructor parameters. They exist because the action spec can be a nest that mixes discrete and continuous actions, in which case each action gets a projection network created by the constructor that matches its type.
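As a rough illustration of how two constructors can serve a mixed action spec, the sketch below walks a nested dict of specs and picks a constructor per leaf. Everything in it is hypothetical: ActionSpecSketch, discrete_ctor, continuous_ctor and make_projection_nets are stand-ins that only mimic the documented Callable[[int, BoundedTensorSpec], Network] shape, and they are not ALF's implementation.

```python
from collections import namedtuple

# Hypothetical, simplified action spec. The real BoundedTensorSpec also
# carries shape, dtype and bounds.
ActionSpecSketch = namedtuple("ActionSpecSketch", ["name", "is_discrete"])


def discrete_ctor(input_size, action_spec):
    # Stand-in for discrete_projection_net_ctor; returns a tag, not a network.
    return ("categorical", input_size, action_spec.name)


def continuous_ctor(input_size, action_spec):
    # Stand-in for continuous_projection_net_ctor; returns a tag, not a network.
    return ("normal", input_size, action_spec.name)


def make_projection_nets(input_size, action_spec):
    """Walk a (possibly nested) dict of specs and pick a constructor per leaf."""
    if isinstance(action_spec, dict):
        return {
            key: make_projection_nets(input_size, sub_spec)
            for key, sub_spec in action_spec.items()
        }
    ctor = discrete_ctor if action_spec.is_discrete else continuous_ctor
    return ctor(input_size, action_spec)


mixed_action_spec = {
    "gear": ActionSpecSketch("gear", is_discrete=True),
    "steering": ActionSpecSketch("steering", is_discrete=False),
}
projection_nets = make_projection_nets(64, mixed_action_spec)
```

In this sketch the discrete leaf is handled by the discrete constructor and the continuous leaf by the continuous one, which is the separation the two parameters are meant to allow.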

Parameters

observation_spec (nest of TensorSpec) – specifies the shape and type of the input observation.
action_spec (nest of TensorSpec) – specifies the shape and type of the output action. The type of output action distribution is implicitly derived from this.
encoding_network_ctor (Callable[..., Network]) – A constructor that creates the encoding network. Depending on whether the encoding network is shared between the value component and the policy component, 1 or 2 encoding networks will be created using this constructor.
is_sharing_encoder (bool) – When set to true, the encoding network is shared between the value and the policy component, resulting in a “shared” architecture disjoint network. When set to false, the encoding network is not shared, resulting in a “dual” architecture disjoint network.
discrete_projection_net_ctor (Callable[[int, BoundedTensorSpec], Network]) – constructor that generates a discrete projection network that outputs discrete actions.
continuous_projection_net_ctor (Callable[[int, BoundedTensorSpec], Network]) – constructor that generates a continuous projection network that outputs continuous actions.
name (str) – the name of the network
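The effect of is_sharing_encoder on how many encoders are built can be sketched as follows. This is a minimal illustration under stated assumptions: CountingEncoder and build_encoders are hypothetical names, the real constructor receives arguments such as the input spec, and the detaching of the encoder output before the value head is not modeled here.

```python
class CountingEncoder:
    """Hypothetical stand-in for an encoding network; counts constructions."""
    instances = 0

    def __init__(self):
        CountingEncoder.instances += 1


def build_encoders(encoding_network_ctor, is_sharing_encoder):
    """Return (policy_encoder, value_encoder) for the two components."""
    policy_encoder = encoding_network_ctor()
    if is_sharing_encoder:
        # "shared" architecture: both components reuse one encoder.
        value_encoder = policy_encoder
    else:
        # "dual" architecture: the value component gets its own encoder.
        value_encoder = encoding_network_ctor()
    return policy_encoder, value_encoder


shared = build_encoders(CountingEncoder, is_sharing_encoder=True)
shared_count = CountingEncoder.instances

CountingEncoder.instances = 0
dual = build_encoders(CountingEncoder, is_sharing_encoder=False)
dual_count = CountingEncoder.instances
```

Here the constructor is called once in the shared case and twice in the dual case, matching the "1 or 2 encoding networks" wording of the parameter description.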

forward(observation, state, require_aux=True)[source]#

Computes the action distribution, aux value and value estimation

In PPG’s policy phase update, auxiliary estimation is not needed as it does not participate in computing the policy phase loss. Depending on whether require_aux is set to True or False, forward will choose to compute auxiliary value estimation or not accordingly.

NOTE: Skipping the auxiliary value estimation saves only a small amount of computation. The main reason to skip it in PPG’s policy phase is to make PPG work with DDP (PyTorch’s DistributedDataParallel). DDP needs to wait for all parameters that contribute to the output of forward() to go through backward(). If the auxiliary value estimation were computed, DDP would fail, because those parameters would not go through backward() in the policy phase update.

Parameters

observation (nested torch.Tensor) – a tensor that is consistent with the encoding network
state – the state(s) for RNN based network
require_aux (bool) – When set to False, return () as the auxiliary value estimation in the output.

Returns

output (Triplet): network output in the order of policy (action distribution), value function estimation, auxiliary value function estimation
state (Triplet): RNN states in the order of policy, value, aux value
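The role of require_aux can be sketched with plain callables standing in for the three heads. This is only an illustration: forward_sketch and make_head are hypothetical names, the heads here are scalar functions rather than networks, and RNN state handling is omitted.

```python
def forward_sketch(encoded, policy_head, value_head, aux_head, require_aux=True):
    """Return (action distribution, value, aux value) in the documented order."""
    action_distribution = policy_head(encoded)
    value = value_head(encoded)
    # When require_aux is False the aux head is never evaluated, so its
    # parameters do not contribute to the output at all.
    aux = aux_head(encoded) if require_aux else ()
    return action_distribution, value, aux


calls = []


def make_head(name, scale):
    # Hypothetical head: records that it was called and scales its input.
    def head(x):
        calls.append(name)
        return scale * x
    return head


policy_head = make_head("policy", 1.0)
value_head = make_head("value", 2.0)
aux_head = make_head("aux", 3.0)

with_aux = forward_sketch(0.5, policy_head, value_head, aux_head, require_aux=True)
calls_with_aux = list(calls)

calls.clear()
without_aux = forward_sketch(0.5, policy_head, value_head, aux_head, require_aux=False)
calls_without_aux = list(calls)
```

In the second call the recorded calls contain only the policy and value heads, which is the property that matters for the DDP note above.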

training: bool#

alf.algorithms.ppg.ppg_aux_algorithm#

class PPGAuxAlgorithm(observation_spec, action_spec, reward_spec=TensorSpec(shape=(), dtype=torch.float32), config=None, optimizer=None, dual_actor_value_network=None, aux_options=PPGAuxOptions(enabled=True, interval=32, mini_batch_length=None, mini_batch_size=8, num_updates_per_train_iter=6), debug_summaries=False, name='PPGAuxAlgorithm')[source]#

Bases: alf.algorithms.off_policy_algorithm.OffPolicyAlgorithm

An algorithm that performs the auxiliary phase update of PPG.

The algorithm is used as a sub-algorithm of PPGAlgorithm. Auxiliary phase updates do not require new rollouts. Instead, the algorithm collects all of the experiences since the last auxiliary phase update in ITS OWN replay buffer.

Construct a PPGAuxAlgorithm instance.

Parameters

observation_spec (nested TensorSpec) – representing the observations.
action_spec (nested BoundedTensorSpec) – representing the actions.
reward_spec (TensorSpec) – a rank-1 or rank-0 tensor spec representing the reward(s).
config (TrainerConfig) – config for training. config only needs to be provided to the algorithm which performs train_iter() by itself.
optimizer (Optional[Optimizer]) – optimizer used for auxiliary phase update.
dual_actor_value_network – the underlying network for PPG algorithm. PPGAuxAlgorithm does not own the network. Instead, this should be shared reference from the parent PPGAlgorithm.
aux_options (PPGAuxOptions) – Options that control the auxiliary phase training.
name (str) – Name of this algorithm.
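The interplay between the private replay buffer and the interval option can be pictured with the toy scheduler below. It is a sketch under explicit assumptions, not ALF's logic: AuxPhaseSchedulerSketch and after_policy_iteration are hypothetical names, interval is assumed to count policy-phase training iterations, and the buffer is assumed to be emptied after each auxiliary update.

```python
class AuxPhaseSchedulerSketch:
    """Toy model: buffer experience, run an aux update every `interval` iterations."""

    def __init__(self, interval):
        self.interval = interval
        self.replay = []       # stands in for the algorithm's own replay buffer
        self.iterations = 0
        self.aux_updates = 0

    def observe_for_aux_replay(self, exp):
        # Save experience for the next auxiliary phase update.
        self.replay.append(exp)

    def after_policy_iteration(self):
        """Return how many buffered experiences an aux update consumed (0 if none ran)."""
        self.iterations += 1
        if self.iterations % self.interval != 0:
            return 0
        consumed = len(self.replay)
        self.aux_updates += 1
        self.replay.clear()    # assumption: start collecting afresh
        return consumed


scheduler = AuxPhaseSchedulerSketch(interval=4)
consumed_per_iteration = []
for iteration in range(8):
    scheduler.observe_for_aux_replay({"iteration": iteration})
    consumed_per_iteration.append(scheduler.after_policy_iteration())
```

With interval=4, the sketch runs an auxiliary update on the 4th and 8th iterations, each consuming the four experiences gathered since the previous one.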

calc_loss(info)[source]#

Calculate the loss at each step for each sample.

Parameters

info (nest) – information collected for training. It is batched from each AlgStep.info returned by rollout_step() (on-policy training) or train_step() (off-policy training).

Returns

loss at each time step for each sample in the batch. The shapes of the tensors in loss info should be \((T, B)\).

Return type

LossInfo

property interval#

observe_for_aux_replay(exp)[source]#

Save the experience in the replay buffer for auxiliary phase update.

Parameters: exp (nested Tensor) – Experience to be saved. The shape is [B, …] where B is the batch size of the batch environment.

train_step(inputs, state, plain_rollout_info)[source]#

Perform one step of training computation.

It is called to calculate output for every time step for a batch of experience from replay buffer. It also needs to generate necessary information for calc_loss().

Parameters

inputs (nested Tensor) – inputs for train.
state (nested Tensor) – consistent with train_state_spec.
plain_rollout_info (nested Tensor) – info from rollout_step(). It is retrieved from replay buffer.

Returns

output (nested Tensor): prediction result.
state (nested Tensor): should match train_state_spec.
info (nested Tensor): information for training. It will be temporally batched and passed as info for calc_loss(). If this is LossInfo, calc_loss() in Algorithm can be used. Otherwise, the user needs to override calc_loss() to calculate loss or override update_with_gradient() to do customized training.

Return type

AlgStep

training: bool#

class PPGAuxOptions(enabled, interval, mini_batch_length, mini_batch_size, num_updates_per_train_iter)#

Bases: tuple

Create new instance of PPGAuxOptions(enabled, interval, mini_batch_length, mini_batch_size, num_updates_per_train_iter)

enabled#: Alias for field number 0

interval#: Alias for field number 1

mini_batch_length#: Alias for field number 2

mini_batch_size#: Alias for field number 3

num_updates_per_train_iter#: Alias for field number 4
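Because the options object is a named tuple, fields can be read by name or by position, and a modified copy can be made with _replace. The sketch below uses a stand-in type, PPGAuxOptionsSketch, built with the same field order and the default values shown in the PPGAuxAlgorithm signature; it does not import the real class.

```python
from collections import namedtuple

# Stand-in with the documented field order; not the real ALF class.
PPGAuxOptionsSketch = namedtuple("PPGAuxOptionsSketch", [
    "enabled",
    "interval",
    "mini_batch_length",
    "mini_batch_size",
    "num_updates_per_train_iter",
])

# Default values as they appear in the PPGAuxAlgorithm signature.
defaults = PPGAuxOptionsSketch(
    enabled=True,
    interval=32,
    mini_batch_length=None,
    mini_batch_size=8,
    num_updates_per_train_iter=6,
)

# Named tuples are immutable; _replace returns a modified copy.
less_frequent = defaults._replace(interval=64)
```

Reading defaults[1] and defaults.interval gives the same value, which is what "Alias for field number 1" refers to.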

alf.algorithms.ppg.ppg_aux_phase_loss#

class PPGAuxPhaseLoss(td_error_loss_fn=<function element_wise_squared_loss>, policy_kl_loss_weight=1.0, gamma=0.999, td_lambda=0.95, name='PPGAuxPhaseLoss')[source]#

Bases: alf.algorithms.algorithm.Loss

Implementation of the PPG Auxiliary Phase Loss Function

The loss is used in the auxiliary update phase of the Phasic Policy Gradient (PPG) algorithm. The total loss is a (weighted) sum of three components:

td_loss_actual: the MSE-like loss between the value head’s value estimation and the TD-based value target.
td_loss_aux: the MSE-like loss between the auxiliary value head’s value estimation and the TD-based value target.
policy_kl_loss: this is the behavior cloning loss that measures the KL divergence between the old policy and the target policy

Since the first 2 components are comparable, there is one weight defined for the last one (behavior cloning KLD) to tune the relative significance of the components.

For detailed illustration of the PPG auxiliary phase loss, see https://github.com/HorizonRobotics/alf/issues/965#issuecomment-897949432
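How the three components might combine can be sketched in plain Python. The sketch assumes a squared-error loss for both TD terms, categorical action distributions with strictly positive probabilities, a KL divergence taken from the old policy to the target policy, and unit weights on the two TD terms. The names squared_loss_sketch, categorical_kl and aux_phase_loss_sketch are hypothetical, and the real loss operates on tensors.

```python
import math


def squared_loss_sketch(estimates, targets):
    # Element-wise squared error between estimates and TD-based targets.
    return [(e - t) ** 2 for e, t in zip(estimates, targets)]


def categorical_kl(p, q):
    # KL(p || q) for two categorical distributions given as probability lists.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def aux_phase_loss_sketch(value, aux, td_target, old_probs, new_probs,
                          policy_kl_loss_weight=1.0):
    td_loss_actual = squared_loss_sketch(value, td_target)
    td_loss_aux = squared_loss_sketch(aux, td_target)
    policy_kl_loss = [categorical_kl(p, q) for p, q in zip(old_probs, new_probs)]
    # Only the behavior cloning term carries a tunable weight.
    total = [
        a + b + policy_kl_loss_weight * k
        for a, b, k in zip(td_loss_actual, td_loss_aux, policy_kl_loss)
    ]
    return total, (td_loss_actual, td_loss_aux, policy_kl_loss)


total, parts = aux_phase_loss_sketch(
    value=[1.0, 0.0],
    aux=[2.0, 0.0],
    td_target=[0.0, 0.0],
    old_probs=[[0.5, 0.5], [0.5, 0.5]],
    new_probs=[[0.5, 0.5], [0.25, 0.75]],
    policy_kl_loss_weight=2.0,
)
```

In the first sample the policies agree, so only the two TD terms contribute. In the second sample both value estimates match the target, so the total is the weighted KL term alone.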

Construct a PPGAuxPhaseLoss instance with parameters

Parameters

td_error_loss_fn (Callable) – a binary tensor operator that computes the MSE-like loss between the two inputs, and it defines how we aggregate the error between the value estimations and the TD-based value targets
policy_kl_loss_weight (float) – this parameter is used to tune up and down the relative significance of the behavior cloning KLD in the total loss
gamma (Union[float, List[float]]) – A discount factor for future rewards. For multi-dim reward, this can also be a list of discounts, each discount applies to a reward dim.
td_lambda (float) – Lambda parameter for TD-lambda computation.
name (str) – the name of the loss
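To make the roles of gamma and td_lambda concrete, here is a textbook λ-return computation in plain Python. It is a sketch under assumptions and not ALF's routine: lambda_returns is a hypothetical name, rewards[t] is taken to be the reward received after step t, values holds one extra bootstrap entry, and step types, per-step discounts and multi-dimensional rewards are ignored.

```python
def lambda_returns(rewards, values, gamma=0.999, td_lambda=0.95):
    """Compute lambda-returns backwards from the bootstrap value.

    rewards: list of length T, reward received after each step.
    values:  list of length T + 1, value estimates including the bootstrap.
    """
    assert len(values) == len(rewards) + 1
    returns = [0.0] * len(rewards)
    next_return = values[-1]
    for t in reversed(range(len(rewards))):
        # Blend the one-step bootstrap with the longer lambda-return.
        next_return = rewards[t] + gamma * (
            (1.0 - td_lambda) * values[t + 1] + td_lambda * next_return)
        returns[t] = next_return
    return returns


# td_lambda = 1 with a zero bootstrap reduces to the discounted reward sum.
monte_carlo = lambda_returns([1.0, 1.0, 1.0], [0.0, 0.0, 0.0, 0.0],
                             gamma=1.0, td_lambda=1.0)

# td_lambda = 0 reduces to the one-step TD target r + gamma * V(next).
one_step = lambda_returns([1.0, 1.0], [0.0, 5.0, 7.0],
                          gamma=0.5, td_lambda=0.0)
```

The two calls show the extremes that td_lambda interpolates between: full Monte Carlo returns and one-step TD targets.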

forward(info)[source]#

Computes loss based on the input PPGTrainInfo

Parameters: info (PPGTrainInfo) – provide the inputs for computing the loss, which includes the value targets, the value estimations, and the action distributions from both the old policy and the target policy
Return type: LossInfo

training: bool#

class PPGAuxPhaseLossInfo(td_loss_actual, td_loss_aux, policy_kl_loss)#

Bases: tuple

Create new instance of PPGAuxPhaseLossInfo(td_loss_actual, td_loss_aux, policy_kl_loss)

policy_kl_loss#: Alias for field number 2

td_loss_actual#: Alias for field number 0

td_loss_aux#: Alias for field number 1

alf.algorithms.ppg.ppg_utils#

class PPGRolloutInfo(action_distribution, action, log_prob, value, aux, step_type, discount, reward, reward_weights)#

Bases: tuple

Create new instance of PPGRolloutInfo(action_distribution, action, log_prob, value, aux, step_type, discount, reward, reward_weights)

action#: Alias for field number 1

action_distribution#: Alias for field number 0

aux#: Alias for field number 4

discount#: Alias for field number 6

log_prob#: Alias for field number 2

reward#: Alias for field number 7

reward_weights#: Alias for field number 8

step_type#: Alias for field number 5

value#: Alias for field number 3

class PPGTrainInfo(action_distribution=(), action=(), log_prob=(), value=(), aux=(), step_type=(), discount=(), reward=(), reward_weights=(), rollout_action_distribution=(), rollout_value=(), rollout_log_prob=())[source]#

Bases: alf.algorithms.ppg.ppg_utils.PPGTrainInfo

Data structure that stores extra derived information for training in addition to the original rollout information.

Such extra information is derived during training updates and used across calls to train_step().

It is designed as a separate class (as opposed to being merged into PPGRolloutInfo) because we want to make explicit which fields are derived during training, as distinct from the rollout information.

Create new instance of PPGTrainInfo(action_distribution, action, log_prob, value, aux, step_type, discount, reward, reward_weights, rollout_action_distribution, rollout_value, rollout_log_prob)

absorbed(rollout_info)[source]#

Combines the PPGTrainInfo and the PPGRolloutInfo.

This function generates a new PPGTrainInfo instead of updating self in place.

In train_step(), we would like to keep the derived information in PPGTrainInfo while updating most of the shared fields (with PPGRolloutInfo) from evaluation of the updated network. This function makes it easy to do that.

Parameters: rollout_info (PPGRolloutInfo) – the result of rollout or evaluation that needs to be combined with self
Return type: PPGTrainInfo
Returns: A new PPGTrainInfo that combines the useful part from both parties.
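The idea behind absorbed() can be sketched with two small named tuples. This is an illustration only: RolloutInfoSketch, TrainInfoSketch and absorbed_sketch are hypothetical, they carry just a subset of the real fields, and the sketch overwrites every shared field, whereas the description above says only that most of them are updated.

```python
from collections import namedtuple

# Reduced stand-ins for PPGRolloutInfo and PPGTrainInfo.
RolloutInfoSketch = namedtuple("RolloutInfoSketch", ["value", "log_prob"])
TrainInfoSketch = namedtuple(
    "TrainInfoSketch",
    ["value", "log_prob", "rollout_value", "rollout_log_prob"])


def absorbed_sketch(train_info, rollout_info):
    """Return a new train info with shared fields taken from rollout_info.

    The derived rollout_* fields are kept from train_info, and train_info
    itself is left untouched.
    """
    return train_info._replace(**rollout_info._asdict())


before = TrainInfoSketch(
    value=0.1, log_prob=-1.2, rollout_value=0.1, rollout_log_prob=-1.2)
fresh = RolloutInfoSketch(value=0.4, log_prob=-0.9)
after = absorbed_sketch(before, fresh)
```

After the call, the shared fields reflect the fresh evaluation while the rollout_* fields still hold the values recorded at rollout time.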

ppg_network_forward(network, inputs, state, require_aux=True, epsilon_greedy=None)[source]#

Evaluates the forward pass of the network for rollout or training.

The signature completely mimics rollout_step() of Algorithm.

Parameters

network (DisjointPolicyValueNetwork) – the network whose forward pass is to be performed.
inputs (TimeStep) – carries the observation that is needed as input to the network.
state (nested Tensor) – carries the state for RNN-based network
require_aux (bool) – whether to compute and return auxiliary estimation. See DisjointPolicyValueNetwork.forward() for details.
epsilon_greedy (Optional[float]) – if set to None, the action will be sampled strictly based on the action distribution. If set to a value in [0, 1], epsilon-greedy sampling will be used to sample the action from the action distribution, and the float value determines the chance of action sampling instead of taking argmax.

Return type

AlgStep
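The epsilon_greedy behavior described above can be sketched for a single categorical distribution. This is a hypothetical helper, epsilon_greedy_sample, working on a list of probabilities with Python's random module. It is not the sampling routine ALF uses and it ignores batching and continuous actions.

```python
import random


def epsilon_greedy_sample(probs, epsilon_greedy=None, rng=None):
    """Pick an action index from a categorical distribution.

    None: always sample from the distribution.
    Float in [0, 1]: sample with that probability, otherwise take the argmax.
    """
    rng = rng or random.Random()

    def sample():
        return rng.choices(range(len(probs)), weights=probs, k=1)[0]

    if epsilon_greedy is None or rng.random() < epsilon_greedy:
        return sample()
    # Greedy branch: index of the most probable action.
    return max(range(len(probs)), key=probs.__getitem__)


rng = random.Random(0)
greedy_actions = [
    epsilon_greedy_sample([0.2, 0.7, 0.1], 0.0, rng) for _ in range(20)]
sampled_actions = [
    epsilon_greedy_sample([0.0, 1.0, 0.0], 1.0, rng) for _ in range(20)]
unset_actions = [
    epsilon_greedy_sample([0.2, 0.7, 0.1], None, rng) for _ in range(20)]
```

With epsilon_greedy=0.0 the sketch always returns the argmax, and with 1.0 or None it always samples, so the float acts as the sampling probability, as in the parameter description.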