ALF Knowledge Base#

Reading the source code of ALF#

The whole training process in ALF can be understood as a loop of calling RLAlgorithm.train_iter().

while not end:
  algorithm.train_iter()

This loop is in PolicyTrainer._train(). Besides calling RLAlgorithm.train_iter(), it also takes care of things like checkpointing, etc.

So if you want to get a good understanding of ALF, you can go directly to the source code of train_iter(). Depending on whether the algorithm is on-policy or off-policy, train_iter() calls _train_iter_on_policy() or _train_iter_off_policy(), which are implemented in OnPolicyAlgorithm and OffPolicyAlgorithm respectively.
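
The dispatch described above can be pictured with a toy stand-in. This is a simplified sketch, not ALF's actual code: `ToyAlgorithm`, its `on_policy` flag, the `calls` log, and `toy_train` are hypothetical names used only for illustration. Only the method names `train_iter`, `_train_iter_on_policy`, and `_train_iter_off_policy` come from the text.

```python
class ToyAlgorithm:
    """Hypothetical stand-in for RLAlgorithm; not ALF's real class."""

    def __init__(self, on_policy):
        self.on_policy = on_policy  # assumed flag, for illustration only
        self.calls = []  # records which branch each iteration took

    def train_iter(self):
        # Dispatch on the algorithm type, as the text describes.
        if self.on_policy:
            return self._train_iter_on_policy()
        return self._train_iter_off_policy()

    def _train_iter_on_policy(self):
        self.calls.append("on_policy")
        return 1  # pretend one training step happened

    def _train_iter_off_policy(self):
        self.calls.append("off_policy")
        return 1


def toy_train(algorithm, num_iterations):
    """Mimics the `while not end` loop with a fixed iteration budget."""
    steps = 0
    for _ in range(num_iterations):
        steps += algorithm.train_iter()
    return steps
```

The real loop in PolicyTrainer._train() also handles checkpointing and similar bookkeeping, which this sketch leaves out.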

Debugging using VS Code#

Currently, ALF uses separate processes to launch multiple environments. Because VS Code does not support debugging with multiprocessing, you need to stop ALF from starting separate processes by setting the following config:

create_environment.num_parallel_environments=1
create_environment.nonparallel=True
TrainerConfig.evaluate=False

The last line, evaluate=False, makes debugging safe for the rare simulators that crash when two unwrapped (thread) environments coexist in the main process.

Training scheme#

There are two core concepts in ALF when training an algorithm: unroll and training iteration, which alternate with each other.

An unroll is when the algorithm collects new time steps, using actions generated by its network inference to interact with the environment. No learning happens during an unroll. Each unroll proceeds for unroll_length steps before the algorithm switches to a training iteration. Depending on whether the algorithm is on-policy or off-policy, an unroll stores the collected data in different ways:

on-policy: temporarily caches the collected data. Once the next training iteration finishes, the data will be discarded (one-time use).
off-policy: uses a replay buffer to store the collected data, which can potentially be used by many training iterations in the future.

A training iteration is when the algorithm actually learns from experiences (an experience is a historical time step) and updates its network parameters. Depending on whether the algorithm is on-policy or off-policy, a training iteration does different things:

on-policy: computes gradients on the cached collected data for a parameter update. Note that in this case, a training iteration can only have one update.
off-policy: samples and computes gradients on a batch of data from a replay buffer in either of the two following ways:
- whole_replay_buffer_training=True: the entire replay buffer will be used for the current training iteration. The buffer will be shuffled first (along the batch size dimension while keeping temporal information) and then divided into minibatches of (mini_batch_size, mini_batch_length). Each minibatch results in an optimizer step. This entire process (shuffling-dividing-stepping) will be repeated num_updates_per_train_iter times.
- whole_replay_buffer_training=False: samples a batch of size (mini_batch_size * num_updates_per_train_iter, mini_batch_length) from the replay buffer, shuffles the batch along the batch size dimension, and divides it into minibatches of (mini_batch_size, mini_batch_length).
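
The whole_replay_buffer_training=False path can be sketched in plain Python. This is only an illustration under stated assumptions: the replay buffer is modeled as a list of per-environment trajectories, `sample_minibatches` is a hypothetical helper, and the sampling strategy (uniform over trajectories and start positions) is assumed rather than taken from ALF. Presumably each resulting minibatch drives one optimizer step, giving num_updates_per_train_iter updates.

```python
import random


def sample_minibatches(replay_buffer, mini_batch_size, mini_batch_length,
                       num_updates_per_train_iter, seed=0):
    """Toy version of the `whole_replay_buffer_training=False` path.

    `replay_buffer` is a plain list of per-environment trajectories
    (lists of experiences); this layout is an assumption of the sketch.
    """
    rng = random.Random(seed)
    batch_size = mini_batch_size * num_updates_per_train_iter

    # Sample `batch_size` contiguous sequences of `mini_batch_length` steps.
    batch = []
    for _ in range(batch_size):
        trajectory = rng.choice(replay_buffer)
        start = rng.randrange(len(trajectory) - mini_batch_length + 1)
        batch.append(trajectory[start:start + mini_batch_length])

    # Shuffle along the batch dimension only; each sequence keeps its order.
    rng.shuffle(batch)

    # Split into minibatches of shape (mini_batch_size, mini_batch_length).
    return [
        batch[i:i + mini_batch_size]
        for i in range(0, batch_size, mini_batch_size)
    ]


# Two environments, 10 steps each; experiences are labelled (env, t).
buffer = [[(env, t) for t in range(10)] for env in range(2)]
minibatches = sample_minibatches(buffer, mini_batch_size=4,
                                 mini_batch_length=3,
                                 num_updates_per_train_iter=5)
```

Here `minibatches` holds 5 minibatches, each with 4 sequences of 3 consecutive steps from a single environment.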

Overall, for an on-policy algorithm, at any moment the number of environment steps is always equal to the number of experiences that have been used for training. For an off-policy algorithm, the former number is likely to be much smaller than the latter one.

Algorithm#

Algorithm is the most important concept in ALF. (TODO: more description about the design.)

TimeStep#

TimeStep is a data structure that stores the information from the result of each environment step. It contains eight fields:

step_type: type of this step. It has three possible values:
- StepType.FIRST is the first step of an episode, which is typically the step generated from env.reset().
- StepType.LAST is the last step of an episode.
- StepType.MID is for all the other steps in an episode.
reward: reward from the previous action. In most RL literature, the reward for an action \(a_t\) at time \(t\) is usually written as \(r_t\). However, in ALF, \(r_t\) will always represent the reward for the previous action at time \(t-1\).
discount: discount value for discounting future reward. When calculating the cumulative discounted return, discount is used to discount the future reward. There are some subtleties in how this value is set, which we will describe later.
observation: observation from the environment. It can be a nest of Tensors. It is obtained after the environment executes prev_action.
prev_action: the previous action taken by the agent.
env_id: which environment this TimeStep comes from. This id information can be used by replay buffers and metrics if there are multiple environments accessing them asynchronously.
untransformed: a nest that represents the entire time step itself before any transformation (e.g., observation or reward transformation); it is used when subalgorithms observe experience for replay.
env_info: A dictionary containing information returned by Gym environments’ info.
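
As a rough picture of the structure, the eight fields can be mirrored with a plain named tuple. This is a sketch, not ALF's definition: `ToyTimeStep` is a hypothetical name, the field types are simplified (ALF uses tensors and nests), and the integer values given to `StepType` are illustrative.

```python
from enum import IntEnum
from typing import Any, NamedTuple


class StepType(IntEnum):
    # Integer values are illustrative; check ALF's definition before
    # relying on them.
    FIRST = 0
    MID = 1
    LAST = 2


class ToyTimeStep(NamedTuple):
    """Plain-Python mirror of the eight fields listed above."""
    step_type: StepType
    reward: float      # reward for prev_action, i.e. the action at t-1
    discount: float
    observation: Any   # may be a nest of Tensors in ALF; any object here
    prev_action: Any
    env_id: int
    untransformed: Any
    env_info: dict


# The reward stored at a step belongs to that step's prev_action.
first = ToyTimeStep(StepType.FIRST, 0.0, 1.0, [0.0, 0.0], 0, 0, (), {})
mid = ToyTimeStep(StepType.MID, 1.5, 1.0, [0.1, 0.0], 1, 0, (), {})
```

In this example, `mid.reward` is the reward received for `mid.prev_action`, the action chosen at the `first` step.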

About `TimeStep.discount`#

When a gym environment is registered, there is an optional parameter named max_episode_steps, which defaults to None. For example, the following is the registration for the MountainCar environment:

register(
    id='MountainCar-v0',
    entry_point='gym.envs.classic_control:MountainCarEnv',
    max_episode_steps=200,
    reward_threshold=-110.0,
)

Gym creates an EnvSpec object for each registered environment. EnvSpec has an attribute timestep_limit which returns the value of max_episode_steps.

A gym environment can be loaded by using gym.make() defined in gym.envs.registration. If timestep_limit of the spec of this environment is not None, this function will wrap the environment using gym.wrappers.time_limit.TimeLimit. This wrapper will end an episode by returning done=True if the number of steps exceeds max_episode_steps.

Each TimeStep is associated with a discount value. In general, if an episode ends, TimeStep.step_type is set to StepType.LAST and TimeStep.discount is set to 0 to prevent using the value estimation at the last step. However, if an episode ends because max_episode_steps is reached, we want to keep the original discount instead of 0 so that the value estimation at the last step can be properly used to estimate the value of previous steps. To achieve this, we create the environment in the following way to avoid gym.wrappers.time_limit.TimeLimit:

gym_spec = gym.spec(environment_name)
gym_env = gym_spec.make()

Then we use the wrapper environments.alf_wrappers.TimeLimit to wrap the environment to limit the steps so that it does not change the discount when max_episode_steps is reached.
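
The intended behavior of such a time-limit wrapper can be sketched with toy classes. This is not the code of environments.alf_wrappers.TimeLimit: `CountdownEnv`, `ToyTimeLimit`, `run_episode`, and the dict-based time steps are hypothetical and exist only to show that truncation marks the step as LAST while leaving the discount unchanged.

```python
class CountdownEnv:
    """Hypothetical toy environment that terminates by itself after `n` steps."""

    def __init__(self, n):
        self._n = n
        self._t = 0

    def reset(self):
        self._t = 0
        return {"step_type": "FIRST", "discount": 1.0}

    def step(self, action):
        self._t += 1
        if self._t >= self._n:
            # Normal game end: the discount drops to 0.
            return {"step_type": "LAST", "discount": 0.0}
        return {"step_type": "MID", "discount": 1.0}


class ToyTimeLimit:
    """Ends an episode at `max_episode_steps` but keeps the discount."""

    def __init__(self, env, max_episode_steps):
        self._env = env
        self._max = max_episode_steps
        self._steps = 0

    def reset(self):
        self._steps = 0
        return self._env.reset()

    def step(self, action):
        time_step = self._env.step(action)
        self._steps += 1
        if self._steps >= self._max and time_step["step_type"] != "LAST":
            # Time-limit truncation: mark LAST, keep the original discount.
            time_step = dict(time_step, step_type="LAST")
        return time_step


def run_episode(env):
    """Steps `env` with a dummy action until a LAST step is returned."""
    time_step = env.reset()
    while time_step["step_type"] != "LAST":
        time_step = env.step(action=0)
    return time_step


truncated = run_episode(ToyTimeLimit(CountdownEnv(n=100), max_episode_steps=5))
terminated = run_episode(ToyTimeLimit(CountdownEnv(n=3), max_episode_steps=5))
```

The `truncated` step ends with discount 1.0 (time limit reached), while the `terminated` step ends with discount 0.0 (normal game end).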

The following table summarizes how step type and discount affect the learning.

Step type	Discount	Value used for bootstrapping the previous value?	Value to be learned?	Note
`FIRST`	1	No	Yes	First step of an episode
`MID`	1	Yes	Yes	Any step other than `FIRST` and `LAST`
`LAST`	0	No	No	Last step because of a normal game end
`LAST`	1	Yes	No	Last step because of time limit
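
To make the table concrete, here is a minimal sketch of one-step TD targets that respects both columns. It assumes a target of the form \(r_{t+1} + \gamma \cdot \text{discount}_{t+1} \cdot V_{t+1}\), which is a standard choice and not necessarily the exact loss any ALF algorithm uses; `one_step_td_targets` is a hypothetical helper, and step types are plain strings.

```python
def one_step_td_targets(step_types, rewards, discounts, values, gamma=0.99):
    """Returns a target per step, or None where no value is learned.

    Follows the ALF convention that rewards[t + 1] is the reward for the
    action taken at step t. Inputs are equal-length lists for one trajectory.
    """
    targets = []
    for t in range(len(values)):
        if step_types[t] == "LAST" or t + 1 == len(values):
            # LAST steps are not learned; the final entry has no successor.
            targets.append(None)
            continue
        # discounts[t + 1] == 0 switches off bootstrapping from step t + 1.
        bootstrap = gamma * discounts[t + 1] * values[t + 1]
        targets.append(rewards[t + 1] + bootstrap)
    return targets


step_types = ["FIRST", "MID", "LAST"]
rewards = [0.0, 1.0, 1.0]
values = [5.0, 5.0, 5.0]

# Normal game end: the LAST step has discount 0, so its value is ignored.
normal_end = one_step_td_targets(step_types, rewards, [1.0, 1.0, 0.0],
                                 values, gamma=0.9)
# Time limit: the LAST step keeps discount 1, so its value is bootstrapped.
time_limit = one_step_td_targets(step_types, rewards, [1.0, 1.0, 1.0],
                                 values, gamma=0.9)
```

The target for the MID step is the bare reward in `normal_end`, but includes the discounted value of the LAST step in `time_limit`. Because a LAST step never gets a target, the FIRST step that follows it in a longer trajectory is never used for bootstrapping.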

DataTransformers#

A DataTransformer takes in data from rollout or replay, does some processing and returns the modified data.

It is a useful abstraction for organizing all kinds of data processing. For example, ObservationNormalizer normalizes input data to have zero mean and unit standard deviation.

However, it is important to note that when combining multiple data transformers into a SequentialDataTransformer, certain rules on the order must be followed:

If UntransformedTimeStep is used to save a reference to the original TimeStep, it must be the very first data transformer in the list.
HindsightExperienceTransformer, FrameStacker, or any other data transformer that needs to access the replay buffer directly for data must come before all other data transformers that are not UntransformedTimeStep.

The reason is the following: in off-policy training, the replay buffer stores raw input that has not been processed by any data transformer. If, say, ObservationNormalizer is applied before hindsight, then data retrieved by replay will be normalized, whereas hindsight data pulled directly from the replay buffer will not be. The data will be mismatched, causing training to suffer and potentially fail.

For the same reason, one needs to be very careful when retrieving any data directly from the replay buffer while data transformers are present. For example, when RewardClip or RewardNormalizer is present, any rewards retrieved directly from the replay buffer (hence raw rewards) must be manually processed using the same clipping or normalization transformations. Otherwise, the results will likely be incorrect.

It is very hard to debug such an error at the moment. We try to raise errors where we suspect a problematic sequence of data transformers is present, but it does not catch all problems. Ultimately, it’s the developer’s responsibility to make sure the sequence of data transformers is applied correctly to produce consistent data across rollout and training, and also within the same batch of data during either rollout or training.
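
The two ordering rules can be expressed as a small check over transformer names. This is a hypothetical helper, not the validation ALF performs: `check_transformer_order` and `REPLAY_ACCESSING` are made-up names, and the set contains only the two replay-accessing transformers named above.

```python
# Transformers that read the replay buffer directly (only those named above).
REPLAY_ACCESSING = {"HindsightExperienceTransformer", "FrameStacker"}


def check_transformer_order(names):
    """Returns a list of rule violations for a sequence of transformer names."""
    problems = []
    # Rule 1: UntransformedTimeStep, if used, must be the very first.
    if "UntransformedTimeStep" in names and names[0] != "UntransformedTimeStep":
        problems.append("UntransformedTimeStep must come first")
    # Rule 2: replay-accessing transformers precede all other transformers.
    seen_other = None
    for name in names:
        if name == "UntransformedTimeStep":
            continue
        if name in REPLAY_ACCESSING:
            if seen_other is not None:
                problems.append(f"{name} must come before {seen_other}")
        elif seen_other is None:
            seen_other = name
    return problems


good = check_transformer_order(
    ["UntransformedTimeStep", "FrameStacker", "ObservationNormalizer"])
bad = check_transformer_order(
    ["ObservationNormalizer", "HindsightExperienceTransformer"])
```

A check like this only catches ordering mistakes in the configured sequence; it cannot detect code that reads raw data from the replay buffer by hand.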

Environment#

The training algorithms learn through interaction with environments. The interface of an environment for an algorithm is defined by AlfEnvironment. The interface provides support for batched environment step and reset. That means that, from the perspective of the algorithm, it can step and reset multiple environments synchronously.

Typically, we have a third-party environment following the gym.Env interface. The following steps are taken to obtain a batched AlfEnvironment from the name of a gym environment.

1. Create a gym environment. Typically, the gym environment is created using the following code:

gym_spec = gym.spec(environment_name)
gym_env = gym_spec.make()

2. Apply a series of gym wrappers. One of the most often used gym wrappers is ImageChannelFirst, which converts images from channel-last format to channel-first format. ALF uses channel-first format for its convolution layers.

3. Wrap the gym environment as a non-batched AlfEnvironment using AlfGymWrapper. All of its inputs/outputs are numpy.ndarray.

4. Apply a series of ALF environment wrappers. All of its inputs/outputs are numpy.ndarray.

5. Wrap the non-batched ALF environment with ProcessEnvironment. It provides an interface using CPU torch.Tensor and interacts with the underlying AlfEnvironment using numpy.ndarray.

6. Use ParallelAlfEnvironment to manage a set of ProcessEnvironment instances and obtain a batched AlfEnvironment. During step(), ParallelAlfEnvironment unstacks the action to get individual actions and calls step() of each ProcessEnvironment. After obtaining all the individual TimeStep objects from the ProcessEnvironment instances, it stacks them into a batched TimeStep and converts it to the default device. The inter-process communication takes place inside ProcessEnvironment.
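
The unstack-step-stack pattern from step 6 can be sketched with plain Python lists. This is an illustration only: `ToySubEnv` and `ToyBatchedEnv` are hypothetical, everything runs sequentially in one process, and dicts of lists stand in for the tensors and TimeStep structures that ALF uses.

```python
class ToySubEnv:
    """Hypothetical single environment; its observation echoes the action."""

    def __init__(self, env_id):
        self.env_id = env_id

    def step(self, action):
        return {
            "observation": action * 10,
            "reward": 1.0,
            "env_id": self.env_id,
        }


class ToyBatchedEnv:
    """Unstacks a batched action, steps each sub-environment, restacks."""

    def __init__(self, envs):
        self._envs = envs

    def step(self, batched_action):
        # Unstack: one action per sub-environment.
        time_steps = [
            env.step(action)
            for env, action in zip(self._envs, batched_action)
        ]
        # Stack: turn a list of dicts into a dict of lists (the "batch").
        return {
            key: [ts[key] for ts in time_steps]
            for key in time_steps[0]
        }


batched_env = ToyBatchedEnv([ToySubEnv(i) for i in range(3)])
batched_time_step = batched_env.step([1, 2, 3])
```

In ALF, each sub-environment lives in its own process and the stacked result is moved to the default device; the sketch keeps only the data flow.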

The load() function from various environment suites such as suite_gym or suite_socialbot handles steps 1-4 for each of these environment suites. alf.environments.utils.create_environment handles all the above steps by creating a ParallelAlfEnvironment using the load() function.

It is possible to directly implement a batched AlfEnvironment without following the above steps. suite_carla is such an example.

`ParallelAlfEnvironment` and `ThreadEnvironment`#

A ThreadEnvironment is directly created in a thread of the main process and it can only wrap one Gym environment. A ParallelAlfEnvironment wraps a collection of Gym environments in subprocesses. Sometimes a Gym environment will crash or behave abnormally if it’s wrapped by a ThreadEnvironment. So ParallelAlfEnvironment is usually preferred for single or multiple training environments.

However, gin/alf configurations that are used by subprocesses will not be considered by the main process as “operative”. So to help debug, sometimes a ThreadEnvironment is additionally created because it uses gin/alf configurations in the main process. If an evaluation environment is needed, this thread environment can also serve as the evaluation environment.

To resolve the conflict between the two, TrainerConfig provides a flag no_thread_env_for_conf. The logic of creating an evaluation environment or a thread env is illustrated below:

`TrainerConfig` flags	`evaluate=True`	`evaluate=False`
`no_thread_env_for_conf=True`	`eval_env` \(\leftarrow\) `ParallelAlfEnvironment` (N=1)	`None`
`no_thread_env_for_conf=False`	`eval_env` \(\leftarrow\) `ThreadEnvironment`	Is training env `ParallelAlfEnvironment`? Yes: `ThreadEnvironment` No: `None`
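
The table can also be read as a small decision function. The sketch below only restates the table; `extra_environment` and its string return values are hypothetical and do not correspond to any ALF API.

```python
def extra_environment(evaluate, no_thread_env_for_conf,
                      training_env_is_parallel):
    """Returns the kind of extra environment created, following the table."""
    if evaluate:
        if no_thread_env_for_conf:
            return "eval_env: ParallelAlfEnvironment (N=1)"
        return "eval_env: ThreadEnvironment"
    if no_thread_env_for_conf:
        return None
    # No evaluation, but a thread env may still be needed so that the
    # configurations become operative in the main process.
    return "ThreadEnvironment" if training_env_is_parallel else None
```

For example, `extra_environment(False, False, True)` returns `"ThreadEnvironment"`: no evaluation is requested, but the training environments run in subprocesses, so a thread environment is created for the configurations.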

Snapshot#

Sometimes we might want to play an old model that was trained a long time ago, even though ALF code has been changed since then. So by default, ALF stores a snapshot (all python files) under the root dir of a training job. This snapshot has a path like <training_root_dir>/alf. To disable storing a snapshot, when training or grid searching, you can specify a flag --nostore_snapshot in the command line.

alf.bin.play will by default use the current ALF code for playing. To play a trained model with its snapshot, you can specify the flag --use_alf_snapshot. By doing so, alf.bin.play will give a higher priority to the ALF snapshot under the training directory.

To correctly use a snapshot, it is important to avoid relative paths/imports when writing your conf files. For example, suppose a conf file imports sac_conf.py under the same directory, as in the following:

# sac_conf1.py     # under 'alf/examples'
import sac_conf    # under 'alf/examples'
algo_cls = sac_conf.SacAlgorithm
...

When this conf is played with a snapshot, it is supposed to import the sac_conf.py file of the ALF snapshot. However, if alf.bin.play is run in the current alf/examples that also contains the newest version of sac_conf.py, the old (desired) sac_conf.py will be shadowed. As another example,

# sac_conf1.py    # under 'alf/examples'
import sys
sys.path.append("./sac")
import sac_conf   # under 'alf/examples/sac'
algo_cls = sac_conf.SacAlgorithm
...

This will append the wrong path (depending on the current working directory) to sys.path when playing with a snapshot.

When playing with a snapshot, one thing is always guaranteed: the module alf is always under the correct python path. So you should always ensure that modules are imported relative to the root module alf. The safe way to write the above examples is:

# sac_conf1.py     # under 'alf/examples'
from alf.examples import sac_conf
algo_cls = sac_conf.SacAlgorithm
...

and

# sac_conf1.py    # under 'alf/examples'
from alf.examples.sac import sac_conf
algo_cls = sac_conf.SacAlgorithm
...

In this way, no matter whether you are playing with a snapshot or not, the correct python files are used.

Note

When playing with a snapshot, if the behaviors are unexpected, remember to check if you’re using relative paths incorrectly.
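
The shadowing problem comes down to import search order, which can be reproduced without ALF. The sketch below creates two temporary directories that each contain a module with the same name and shows that the directory appearing first on sys.path wins. All names (`toy_snapshot_conf`, `import_version`, the directory names) are hypothetical, and the sketch makes no claim about how alf.bin.play arranges its paths.

```python
import importlib
import os
import sys
import tempfile


def import_version(search_dirs, module_name="toy_snapshot_conf"):
    """Imports `module_name` with `search_dirs` placed first on sys.path."""
    sys.modules.pop(module_name, None)  # forget any earlier import
    old_path = list(sys.path)
    sys.path[:0] = search_dirs
    try:
        importlib.invalidate_caches()
        return importlib.import_module(module_name).VERSION
    finally:
        # Restore the interpreter state so the demo has no side effects.
        sys.path[:] = old_path
        sys.modules.pop(module_name, None)


with tempfile.TemporaryDirectory() as root:
    # Two directories holding a same-named module with different contents.
    for name in ("snapshot", "current"):
        os.makedirs(os.path.join(root, name))
        with open(os.path.join(root, name, "toy_snapshot_conf.py"), "w") as f:
            f.write(f"VERSION = {name!r}\n")
    snapshot = os.path.join(root, "snapshot")
    current = os.path.join(root, "current")
    current_first = import_version([current, snapshot])
    snapshot_first = import_version([snapshot, current])
```

A bare `import sac_conf` is resolved the same way, so which file is loaded depends on the search path at play time. An absolute import rooted at alf avoids this because alf itself is guaranteed to resolve to the right location.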

Differences from the TensorFlow version of ALF#

The PyTorch version of ALF has several subtle differences from the TensorFlow version. Knowing these differences may help in reproducing some of the experiments.

1. alf.initializers.variance_scaling_init(). It functions similarly to tf.compat.v1.keras.initializers.VarianceScaling. However, there is one key difference: its gain parameter corresponds to the square root of the scale parameter of VarianceScaling. Because of this, the following parameters also have a different meaning from their corresponding parameters used in ALF-tf:

logits_init_output_factor of alf.networks.CategoricalProjectionNetwork corresponds to logits_init_output_factor of tf_agents CategoricalProjectionNetwork used by ALF-tf. logits_init_output_factor of ALF-pytorch should be set to the square root of logits_init_output_factor of tf_agents.
projection_output_init_gain of alf.networks.NormalProjectionNetwork corresponds to init_means_output_factor of tf_agents NormalProjectionNetwork used by ALF-tf. projection_output_init_gain should be set to the square root of init_means_output_factor.

2. gym_wrappers.ContinuousActionClip. In ALF-pytorch, by default, we add this wrapper to clip out-of-bound continuous actions for all gym environments (note that most environments supported by ALF are gym environments, even though they may not be named so). ContinuousActionClip can often help the algorithm obtain higher rewards at the beginning of training, because without clipping the environment may calculate the reward using an out-of-bound action. But sometimes, using this wrapper can hurt the final performance. You can disable it by setting the following in the config:

suite_gym.wrap_env.clip_action=False
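
For difference 1 above, the square-root relationship makes porting a value from an ALF-tf config a one-line conversion. The helper below is hypothetical (`tf_scale_to_alf_gain` is not part of ALF) and simply applies gain = sqrt(scale); the numeric inputs are made-up examples.

```python
import math


def tf_scale_to_alf_gain(scale):
    """Converts a VarianceScaling-style `scale` to the equivalent `gain`."""
    if scale < 0:
        raise ValueError("scale must be non-negative")
    return math.sqrt(scale)


# A tf_agents factor of 0.25 corresponds to an ALF-pytorch value of 0.5.
ported_gain = tf_scale_to_alf_gain(0.25)
```

The same conversion applies when porting logits_init_output_factor or init_means_output_factor from ALF-tf.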
