ALF Knowledge Base#
Reading the source code of ALF#
The whole training process in ALF can be understood as a loop of calling
RLAlgorithm.train_iter().
while not end:
    algorithm.train_iter()
This loop is in PolicyTrainer._train().
Besides calling RLAlgorithm.train_iter(), it also takes care of things like
checkpointing, etc.
So if you want to get a good understanding of ALF, you can go directly to
read the source code of train_iter().
Depending on whether the algorithm is on-policy or off-policy, train_iter()
calls _train_iter_on_policy() or _train_iter_off_policy(), respectively. These
are implemented in OnPolicyAlgorithm
and OffPolicyAlgorithm, respectively.
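The structure described above can be sketched as follows. This is a minimal illustration, not ALF's actual code: `ToyAlgorithm`, its `is_on_policy` flag, the `train()` helper, and the `num_iterations` stop condition are hypothetical stand-ins. Only the method names `train_iter`, `_train_iter_on_policy`, and `_train_iter_off_policy` come from the text.

```python
class ToyAlgorithm:
    """Hypothetical stand-in for RLAlgorithm; not ALF's real class."""

    def __init__(self, is_on_policy):
        self.is_on_policy = is_on_policy
        self.calls = []  # records which branch each iteration took

    def train_iter(self):
        # Dispatch on the algorithm type, as described above.
        if self.is_on_policy:
            return self._train_iter_on_policy()
        return self._train_iter_off_policy()

    def _train_iter_on_policy(self):
        self.calls.append("on_policy")
        return 1  # pretend one training step was taken

    def _train_iter_off_policy(self):
        self.calls.append("off_policy")
        return 1


def train(algorithm, num_iterations):
    """Toy version of the outer loop (checkpointing etc. omitted)."""
    iteration = 0
    while iteration < num_iterations:
        algorithm.train_iter()
        iteration += 1
    return iteration


algorithm = ToyAlgorithm(is_on_policy=False)
train(algorithm, num_iterations=3)
print(algorithm.calls)
```

Reading the real train_iter() with this skeleton in mind should make it easier to see where unrolling and parameter updates fit in.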
Debugging using VS Code#
Currently, ALF uses separate processes to launch multiple environments. Because VS Code does not support debugging with multiprocessing, you need to prevent ALF from starting separate processes by setting the following config:
create_environment.num_parallel_environments=1
create_environment.nonparallel=True
TrainerConfig.evaluate=False
The last line evaluate=False is to make it safe for rare simulators which
crash when two unwrapped (thread) envs coexist in the main process.
Training scheme#
There are two core concepts in ALF when training an algorithm: unroll and training iteration. The two alternate with each other.
An unroll is when the algorithm collects new time steps, using actions
generated by its network inference to interact with the environment. No learning
happens during an unroll. Each unroll proceeds for unroll_length steps
before the algorithm switches to a training iteration. Depending on whether
the algorithm is on-policy or off-policy, an unroll stores the collected data
in different ways:
on-policy: temporarily caches the collected data. Once the next training iteration finishes, the data will be discarded (one-time use).
off-policy: uses a replay buffer to store the collected data, which can potentially be used by many training iterations in the future.
A training iteration is when the algorithm actually learns from experiences (an experience is a historical time step) and updates its network parameters. Depending on whether the algorithm is on-policy or off-policy, a training iteration does different things:
on-policy: computes gradients on the cached collected data for a parameter update. Note that in this case, a training iteration can only have one update.
off-policy: samples a batch of data from a replay buffer and computes gradients on it in either of the following two ways:
whole_replay_buffer_training=True: the entire replay buffer is used for the current training iteration. The buffer is first shuffled (along the batch size dimension, keeping temporal information intact) and then divided into minibatches of shape (mini_batch_size, mini_batch_length). Each minibatch results in an optimizer step. This entire process (shuffling, dividing, stepping) is repeated num_updates_per_train_iter times.
whole_replay_buffer_training=False: samples a batch of size (mini_batch_size * num_updates_per_train_iter, mini_batch_length) from the replay buffer, shuffles the batch along the batch size dimension, and divides it into minibatches of shape (mini_batch_size, mini_batch_length).
Overall, for an on-policy algorithm, at any moment the number of environment steps is always equal to the number of experiences that have been used for training. For an off-policy algorithm, the former number is likely to be much smaller than the latter one.
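As a rough illustration of the whole_replay_buffer_training=False case, the sketch below shuffles a sampled batch along the batch dimension and cuts it into minibatches. It uses plain Python lists instead of tensors, and `split_into_minibatches` is a hypothetical helper written for this example, not an ALF function.

```python
import random


def split_into_minibatches(batch, mini_batch_size, rng):
    """Shuffle along the batch dimension, then cut into minibatches.

    `batch` is a list of sequences; each sequence (length mini_batch_length)
    stays intact, so temporal order inside a sequence is preserved.
    """
    shuffled = list(batch)
    rng.shuffle(shuffled)
    return [
        shuffled[i:i + mini_batch_size]
        for i in range(0, len(shuffled), mini_batch_size)
    ]


mini_batch_size = 4
mini_batch_length = 2
num_updates_per_train_iter = 3

# Pretend these were sampled from the replay buffer: 12 sequences of length 2.
# Each entry is (sequence id, time index).
sampled = [[(b, t) for t in range(mini_batch_length)]
           for b in range(mini_batch_size * num_updates_per_train_iter)]

minibatches = split_into_minibatches(sampled, mini_batch_size, random.Random(0))
for minibatch in minibatches:
    pass  # one optimizer step per minibatch would go here

print(len(minibatches), len(minibatches[0]), len(minibatches[0][0]))  # 3 4 2
```

With these numbers, one training iteration performs three optimizer steps, each on a minibatch of four sequences of length two.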
Algorithm#
Algorithm is the most important concept in ALF. (TODO: more description about the design.)
TimeStep#
TimeStep is a data structure that stores the information from the result
of each environment step. It contains eight fields:
step_type: type of this step. It has three possible values:
  StepType.FIRST is the first step of an episode, which is typically the step generated from env.reset().
  StepType.LAST is the last step of an episode.
  StepType.MID is for all the other steps in an episode.
reward: reward from the previous action. In most RL literature, the reward for an action \(a_t\) at time \(t\) is usually written as \(r_t\). However, in ALF, \(r_t\) always represents the reward for the previous action at time \(t-1\).
discount: discount value for discounting future reward. When calculating the cumulative discounted return, discount is used to discount the future reward. There are some subtleties in how this value is set, which we will describe later.
observation: observation from the environment. It can be a nest of Tensors. It is obtained after the environment executes prev_action.
prev_action: the previous action taken by the agent.
env_id: which environment this TimeStep comes from. This id can be used by replay buffers and metrics if there are multiple environments accessing them asynchronously.
untransformed: a nest that represents the entire time step itself before any transformation (e.g., observation or reward transformation); used for experience replay observing by subalgorithms.
env_info: a dictionary containing information returned by Gym environments' info.
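To make the field list and the reward convention concrete, here is a small sketch using a NamedTuple. `ToyTimeStep` and the integer codes in `StepType` are illustrative assumptions, not ALF's real definitions, and the zero reward on the first step is likewise assumed for the example.

```python
from typing import Any, NamedTuple


class StepType:
    """Integer codes are placeholders chosen for this sketch."""
    FIRST = 0
    MID = 1
    LAST = 2


class ToyTimeStep(NamedTuple):
    step_type: int
    reward: float  # reward for prev_action, not for the next action
    discount: float
    observation: Any
    prev_action: Any
    env_id: int
    untransformed: Any = ()
    env_info: Any = None


# Step produced by reset(): there is no meaningful previous action yet,
# so this sketch assumes a zero reward.
first = ToyTimeStep(StepType.FIRST, 0.0, 1.0,
                    observation=[0.0], prev_action=0, env_id=0)

# After the agent executes action 1, the next time step carries both that
# action (prev_action) and the reward it earned (reward).
second = ToyTimeStep(StepType.MID, 0.5, 1.0,
                     observation=[0.1], prev_action=1, env_id=0)

print(ToyTimeStep._fields)
```

Note how `second.reward` belongs to `second.prev_action`: both describe what happened between the two time steps.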
About TimeStep.discount#
When a gym environment is registered, there
is an optional parameter named max_episode_steps, which has a default value
of None. For example, the following is the registration for
the MountainCar environment:
register(
    id='MountainCar-v0',
    entry_point='gym.envs.classic_control:MountainCarEnv',
    max_episode_steps=200,
    reward_threshold=-110.0,
)
Gym creates an EnvSpec object for each registered environment.
EnvSpec has an attribute timestep_limit which returns the value
of max_episode_steps.
A gym environment can be loaded by using gym.make() defined in
gym.envs.registration. If timestep_limit of the spec of this
environment is not None, this function will wrap the environment using
gym.wrappers.time_limit.TimeLimit. This wrapper will end an episode by
returning done=True if the number of steps exceeds
max_episode_steps.
Each TimeStep is associated with a discount value. In general,
if an episode ends, TimeStep.step_type is set to StepType.LAST
and TimeStep.discount is set to 0 to prevent using the value estimation
at the last step. However, if an episode ends because
max_episode_steps is reached, we want to keep the original
discount instead of 0 so that the value estimation at the last step can
be properly used to estimate the value of previous steps. To achieve
this, we create the environment in the following way to avoid
gym.wrappers.time_limit.TimeLimit:
gym_spec = gym.spec(environment_name)
gym_env = gym_spec.make()
Then we use the wrapper environments.alf_wrappers.TimeLimit to wrap
the environment to limit the steps so that it does not change the discount when
max_episode_steps is reached.
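The idea of a time limit that ends the episode without zeroing the discount can be sketched as below. `ToyEnv` and `ToyTimeLimit` are hypothetical classes that use plain dicts as time steps. They only illustrate the behavior described above and are not the code of environments.alf_wrappers.TimeLimit.

```python
class ToyEnv:
    """Hypothetical never-ending environment returning dict time steps."""

    def reset(self):
        return {"step_type": "FIRST", "discount": 1.0}

    def step(self, action):
        return {"step_type": "MID", "discount": 1.0}


class ToyTimeLimit:
    """Ends an episode after `max_episode_steps` without touching discount."""

    def __init__(self, env, max_episode_steps):
        self._env = env
        self._max_episode_steps = max_episode_steps
        self._num_steps = 0

    def reset(self):
        self._num_steps = 0
        return self._env.reset()

    def step(self, action):
        time_step = self._env.step(action)
        self._num_steps += 1
        if self._num_steps >= self._max_episode_steps:
            # Mark the episode end, but keep the discount as it was.
            time_step = dict(time_step, step_type="LAST")
        return time_step


env = ToyTimeLimit(ToyEnv(), max_episode_steps=3)
steps = [env.reset()] + [env.step(0) for _ in range(3)]
print([(s["step_type"], s["discount"]) for s in steps])
```

The final step is marked as the last one, yet its discount stays at 1, so its value estimate can still be used for bootstrapping.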
The following table summarizes how step type and discount affect the learning.
| Step type | Discount | Value used for bootstrapping the previous value? | Value to be learned? | Note |
|---|---|---|---|---|
| StepType.FIRST | 1 | No | Yes | First step of an episode |
| StepType.MID | 1 | Yes | Yes | Any step other than FIRST and LAST |
| StepType.LAST | 0 | No | No | Last step because of a normal game end |
| StepType.LAST | 1 | Yes | No | Last step because of time limit |
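The effect of the discount on bootstrapping can be illustrated with a standard one-step TD target, \(r_{t+1} + \gamma \cdot \text{discount}_{t+1} \cdot V(s_{t+1})\). This is a generic sketch under ALF's reward convention, not ALF's loss code. The function name, the `is_last` mask, and the example numbers are all assumptions.

```python
def one_step_td_targets(rewards, discounts, values, is_last, gamma):
    """Return (targets, mask) for steps 0..T-2 of one sequence of length T.

    rewards[t] is the reward for the action taken at step t-1 (ALF convention),
    so the target for step t uses rewards[t + 1] and discounts[t + 1].
    mask[t] is 0 when step t is a LAST step, whose value is not learned.
    """
    targets, mask = [], []
    for t in range(len(values) - 1):
        targets.append(rewards[t + 1] + gamma * discounts[t + 1] * values[t + 1])
        mask.append(0.0 if is_last[t] else 1.0)
    return targets, mask


rewards = [0.0, 1.0, 1.0]
values = [5.0, 6.0, 7.0]
is_last = [False, False, True]

# Normal game end: the LAST step has discount 0, so its value is ignored.
normal_end, _ = one_step_td_targets(
    rewards, [1.0, 1.0, 0.0], values, is_last, gamma=0.5)

# Time limit: the LAST step keeps discount 1, so its value is bootstrapped.
time_limit, _ = one_step_td_targets(
    rewards, [1.0, 1.0, 1.0], values, is_last, gamma=0.5)

print(normal_end)  # [4.0, 1.0]
print(time_limit)  # [4.0, 4.5]
```

The two runs differ only in the target of the second-to-last step, which is exactly the distinction made by the last two rows of the table.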
DataTransformers#
A DataTransformer takes in data from rollout or replay, does some processing and returns the modified data.
It is a useful abstraction for organizing all kinds of data processing. For example,
ObservationNormalizer normalizes input data to have zero mean and unit standard deviation.
However, it is important to note that when combining multiple data transformers
into a SequentialDataTransformer, certain rules on the order must be
followed:
If UntransformedTimeStep is used to save a reference to the original TimeStep, it must be the very first data transformer in the list.
HindsightExperienceTransformer, FrameStacker, or any other data transformer that needs to access the replay buffer directly for data must come before all other data transformers that are not UntransformedTimeStep.
The reason is the following: in off-policy training, the replay buffer stores
raw input that has not been processed by any data transformer. If, say,
ObservationNormalizer is applied before hindsight, then data retrieved by
replay will be normalized, whereas hindsight data pulled directly from the replay
buffer will not be. The two kinds of data will be mismatched, causing training to
suffer and potentially fail.
For the same reason, one needs to be very careful when retrieving any data directly from the replay buffer when data transformers are present. For example, when RewardClip or RewardNormalizer is present, we need to manually process any rewards retrieved directly from the replay buffer (hence raw rewards) using the same clipping or normalization transformations. Otherwise, the results will likely be incorrect.
It is very hard to debug such an error at the moment. We try to raise errors where we suspect a problematic sequence of data transformers is present, but it does not catch all problems. Ultimately, it’s the developer’s responsibility to make sure the sequence of data transformers is applied correctly to produce consistent data across rollout and training, and also within the same batch of data during either rollout or training.
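The two ordering rules can be expressed as a simple check over transformer names. The function below is a hypothetical helper written for illustration. It is not the validation ALF performs, and the set of replay-accessing transformers is passed in by the caller as an assumption.

```python
def check_transformer_order(names, replay_accessing):
    """Return a list of problems found in a sequence of transformer names.

    names: transformer class names in application order.
    replay_accessing: names of transformers that read the replay buffer directly.
    """
    problems = []
    if "UntransformedTimeStep" in names and names[0] != "UntransformedTimeStep":
        problems.append("UntransformedTimeStep must come first")
    seen_other = False
    for name in names:
        if name == "UntransformedTimeStep":
            continue
        if name in replay_accessing:
            if seen_other:
                problems.append(name + " must come before ordinary transformers")
        else:
            seen_other = True
    return problems


REPLAY_ACCESSING = {"HindsightExperienceTransformer", "FrameStacker"}

good = ["UntransformedTimeStep", "HindsightExperienceTransformer",
        "ObservationNormalizer"]
bad = ["ObservationNormalizer", "HindsightExperienceTransformer"]

print(check_transformer_order(good, REPLAY_ACCESSING))  # []
print(check_transformer_order(bad, REPLAY_ACCESSING))
```

A check like this only catches ordering mistakes. It cannot detect raw data being pulled from the replay buffer elsewhere in an algorithm.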
Environment#
The training algorithms learn through interaction with environments. The interface of an environment for an algorithm is defined by AlfEnvironment. The interface provides support for batched environment step and reset. That means that, from the perspective of the algorithm, it can step and reset multiple environments synchronously.
Typically, we have a third-party environment following the gym.Env interface. It takes the following steps to obtain a batched AlfEnvironment from the name of a gym environment.
1. Create a gym environment. Typically, the gym environment is created using the following code:
gym_spec = gym.spec(environment_name)
gym_env = gym_spec.make()
2. Apply a series of gym wrappers. One of the most often used gym wrappers is ImageChannelFirst, which converts images from channel-last format to channel-first format. ALF uses channel-first format for its convolution layers.
3. Wrap the gym environment as a non-batched AlfEnvironment using
AlfGymWrapper.
All of its inputs/outputs are numpy.ndarray.
4. Apply a series of ALF environment wrappers.
All of its inputs/outputs are numpy.ndarray.
5. Wrap the non-batched ALF environment with ProcessEnvironment.
It provides an interface using CPU torch.Tensor and interacts with the underlying
AlfEnvironment using numpy.ndarray.
6. Use ParallelAlfEnvironment
to manage a set of ProcessEnvironments and obtain a batched AlfEnvironment.
During step(), ParallelAlfEnvironment unstacks the action to get individual
actions and calls step() of each ProcessEnvironment. After obtaining all
the individual TimeSteps from the ProcessEnvironments, it stacks them as a
batched TimeStep and converts it to the default device. The inter-process
communication takes place inside ProcessEnvironment.
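The unstack, step, and stack pattern of step 6 can be sketched with plain Python objects. `ToyProcessEnv` and `ToyParallelEnv` are hypothetical classes. They run everything in one process and use dicts of lists instead of tensors, so they only mirror the data flow, not the inter-process communication or device conversion.

```python
class ToyProcessEnv:
    """Hypothetical single environment; runs in-process for simplicity."""

    def __init__(self, env_id):
        self._env_id = env_id

    def step(self, action):
        # Echo the action back so the batching is easy to inspect.
        return {
            "env_id": self._env_id,
            "prev_action": action,
            "reward": float(action),
        }


class ToyParallelEnv:
    """Hypothetical batched environment built from several single ones."""

    def __init__(self, envs):
        self._envs = envs

    def step(self, batched_action):
        # 1. Unstack the batched action into one action per environment.
        # 2. Step every environment with its own action.
        time_steps = [
            env.step(action)
            for env, action in zip(self._envs, batched_action)
        ]
        # 3. Stack the individual time steps field by field.
        return {key: [ts[key] for ts in time_steps] for key in time_steps[0]}


parallel_env = ToyParallelEnv([ToyProcessEnv(i) for i in range(3)])
batched = parallel_env.step([3, 1, 2])
print(batched)
```

From the algorithm's point of view there is a single step() call that takes a batch of actions and returns a batch of results.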
The load() function from various environment suites such as suite_gym
or suite_socialbot
handles steps 1-4 for each of these environment suites. alf.environments.utils.create_environment
handles all the above steps by creating a ParallelAlfEnvironment using the load()
function.
It is possible to directly implement a batched AlfEnvironment without following
the above steps. suite_carla
is such an example.
ParallelAlfEnvironment and ThreadEnvironment#
A ThreadEnvironment is directly created in a thread of the main process and
it can only wrap one Gym environment. A ParallelAlfEnvironment wraps a
collection of Gym environments in subprocesses. Sometimes a Gym environment will
crash or behave abnormally if it’s wrapped by a ThreadEnvironment.
So ParallelAlfEnvironment is usually preferred for single or multiple
training environments.
However, gin/alf configurations that are used by subprocesses will not be considered
by the main process as “operative”. So to help debug, sometimes a ThreadEnvironment
is additionally created because it uses gin/alf configurations in the main process.
If an evaluation environment is needed, this thread environment can also serve
as the evaluation environment.
To resolve the conflict between the two, TrainerConfig provides a flag no_thread_env_for_conf.
The logic of creating an evaluation environment or a thread env is illustrated
below:
[Table not recoverable. The only surviving cell reads: "Is training env ParallelAlfEnvironment? Yes: ThreadEnvironment. No: None."]
Snapshot#
Sometimes we might want to play an old model that was trained a long time ago,
even though ALF code has been changed since then. So by default, ALF stores a
snapshot (all python files) under the root dir of a training job. This snapshot
has a path like <training_root_dir>/alf. To disable storing a snapshot, when
training or grid searching, you can specify a flag --nostore_snapshot in the
command line.
alf.bin.play will by default use the current ALF code for playing. To play a
trained model with its snapshot, you can specify the flag --use_alf_snapshot.
By doing so, alf.bin.play will give a higher priority to the ALF snapshot under
the training directory.
To correctly use a snapshot, it is important to avoid relative paths/imports
when writing your conf files. For example, suppose a conf file
imports sac_conf.py under the same directory, as in the following:
# sac_conf1.py # under 'alf/examples'
import sac_conf # under 'alf/examples'
algo_cls = sac_conf.SacAlgorithm
...
When this conf is played with a snapshot, it is supposed to import the sac_conf.py
file of the ALF snapshot. However, if alf.bin.play is run in the current
alf/examples that also contains the newest version of sac_conf.py,
the old (desired) sac_conf.py will be shadowed. As another example,
# sac_conf1.py # under 'alf/examples'
import sys
sys.path.append("./sac")
import sac_conf # under 'alf/examples/sac'
algo_cls = sac_conf.SacAlgorithm
...
which will append the wrong path (depending on what the current path is) to
sys.path when playing with a snapshot.
When playing with a snapshot, one thing is always guaranteed: the module alf
is always under the correct python path. So you should always ensure that modules
are imported relative to the root module alf. The safe ways of writing
the above examples are:
# sac_conf1.py # under 'alf/examples'
from alf.examples import sac_conf
algo_cls = sac_conf.SacAlgorithm
...
and
# sac_conf1.py # under 'alf/examples'
from alf.examples.sac import sac_conf
algo_cls = sac_conf.SacAlgorithm
...
In this way, no matter whether you are playing with a snapshot or not, the correct python files are used.
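The shadowing problem comes from how Python resolves top-level imports: the first matching directory on sys.path wins. The self-contained demo below shows this with two temporary directories. The module name `toy_shadow_conf` and the directory names are made up for the example. How alf.bin.play actually arranges sys.path is not shown here.

```python
import importlib
import os
import sys
import tempfile

with tempfile.TemporaryDirectory() as root:
    snapshot_dir = os.path.join(root, "snapshot")
    current_dir = os.path.join(root, "current")
    # Write two versions of the same top-level module.
    for directory, version in [(snapshot_dir, "snapshot"),
                               (current_dir, "current")]:
        os.makedirs(directory)
        with open(os.path.join(directory, "toy_shadow_conf.py"), "w") as f:
            f.write("VERSION = %r\n" % version)

    # Whichever directory appears first on sys.path wins the import.
    sys.path[:0] = [current_dir, snapshot_dir]
    importlib.invalidate_caches()
    toy_shadow_conf = importlib.import_module("toy_shadow_conf")
    loaded_version = toy_shadow_conf.VERSION

    # Clean up so the demo leaves no trace in the interpreter.
    sys.path.remove(current_dir)
    sys.path.remove(snapshot_dir)
    del sys.modules["toy_shadow_conf"]

print(loaded_version)  # current
```

Here the "current" copy shadows the "snapshot" copy even though both are importable, which is the failure mode that importing relative to the root module alf avoids.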
Note
When playing with a snapshot, if the behaviors are unexpected, remember to check if you’re using relative paths incorrectly.
Differences with the TensorFlow version of ALF#
The PyTorch version of ALF has several subtle differences from the TensorFlow version. Knowing these differences may help in reproducing some of the experiments.
1. alf.initializers.variance_scaling_init(). It functions similarly to
tf.compat.v1.keras.initializers.VarianceScaling.
However, there is one key difference: its gain parameter corresponds to the square
root of the scale parameter of VarianceScaling. Because of this, the following
parameters also have different meanings from their corresponding parameters used in
ALF-tf:
logits_init_output_factor of alf.networks.CategoricalProjectionNetwork corresponds to logits_init_output_factor of the tf_agents CategoricalProjectionNetwork used by ALF-tf. logits_init_output_factor of ALF-pytorch should be set to the square root of logits_init_output_factor of tf_agents.
projection_output_init_gain of alf.networks.NormalProjectionNetwork corresponds to init_means_output_factor of the tf_agents NormalProjectionNetwork used by ALF-tf. projection_output_init_gain should be set to the square root of init_means_output_factor.
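When porting a config from ALF-tf, the conversion amounts to taking a square root. The helper below is a hypothetical convenience function, not part of ALF, and the value 0.25 is just an example.

```python
import math


def tf_factor_to_alf_gain(tf_factor):
    """Convert an ALF-tf output factor (a variance scale) to an ALF-pytorch gain."""
    if tf_factor < 0:
        raise ValueError("factor must be non-negative")
    return math.sqrt(tf_factor)


# A factor of 0.25 in ALF-tf corresponds to a gain of 0.5 in ALF-pytorch.
print(tf_factor_to_alf_gain(0.25))  # 0.5
```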
2. gym_wrappers.ContinuousActionClip. In ALF-pytorch, by default, we add this
wrapper to clip out-of-bound continuous actions for all gym environments
(note that most environments supported by ALF are gym environments, even if they
are not named so). ContinuousActionClip can often help the algorithm
obtain higher rewards at the beginning of training because, without clipping, the environment may
calculate the reward using an out-of-bound action. But sometimes,
using this wrapper can hurt the final performance. You can disable it by setting
the following in the config:
suite_gym.wrap_env.clip_action=False
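For reference, clipping an action to its bounds is an element-wise operation, as in the sketch below. `clip_action` is a hypothetical function on plain lists. It only illustrates the idea and is not the code of gym_wrappers.ContinuousActionClip.

```python
def clip_action(action, low, high):
    """Clip each action dimension into [low, high] (element-wise)."""
    return [min(max(a, lo), hi) for a, lo, hi in zip(action, low, high)]


clipped = clip_action([1.5, -0.2, -3.0],
                      low=[-1.0, -1.0, -1.0],
                      high=[1.0, 1.0, 1.0])
print(clipped)  # [1.0, -0.2, -1.0]
```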