alf.environments.simple
alf.environments.simple.noisy_array
- class NoisyArray(K=11, M=100, auto_noise=False)
Bases: gym.core.Env
A synthetic noisy array to test the agent’s robustness to random noise. The binary array has length K+M: the first K entries form a one-hot vector whose 1 marks the agent’s current location, and the remaining M bits form a noise vector in {0,1}^M. For example (K=5, M=3):
0 0 1 0 0 | 0 1 1
and the agent is at i==2 now.
The agent always starts from i==0. The goal is to reach i==K-1 (it cannot step on the noise vector). It has three actions: LEFT, RIGHT, and FIRE. The FIRE action changes the noise vector into some random M bits, without changing the agent’s position. Both LEFT and RIGHT won’t change the noise vector.
In the example above, if the next action is FIRE, then the resulting array might be
0 0 1 0 0 | 1 1 0
If the next action is RIGHT, then the resulting array should be:
0 0 0 1 0 | 0 1 1
The game ends when the array looks like
0 0 0 0 1 | X X X
- Parameters
K (int) – the length of the one-hot position vector. K-1 is the minimum number of steps needed to take the agent from left to right and obtain a reward of 1
M (int) – the length of the noise vector. The total observation length is K+M
auto_noise (bool) – if True, the noise vector will change automatically at every step, and FIRE becomes “no-operation”.
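The dynamics described above can be sketched in a few lines of plain Python. This is an illustrative stand-in, not the ALF implementation: the class name `NoisyArraySketch`, the `seed` argument, the clamping of LEFT at position 0, the random initial noise, and the reward of 0 on non-terminal steps are all assumptions made for the example.

```python
import random

LEFT, FIRE, RIGHT = 0, 1, 2  # action ids as documented for NoisyArray


class NoisyArraySketch:
    """Hypothetical stand-in for NoisyArray, for illustration only."""

    def __init__(self, K=11, M=100, auto_noise=False, seed=None):
        self.K, self.M, self.auto_noise = K, M, auto_noise
        self._rng = random.Random(seed)
        self.reset()

    def _sample_noise(self):
        # M independent random bits
        return [self._rng.randint(0, 1) for _ in range(self.M)]

    def _observation(self):
        # One-hot position (length K) followed by the noise bits (length M)
        onehot = [0] * self.K
        onehot[self._pos] = 1
        return onehot + self._noise

    def reset(self):
        self._pos = 0  # the agent always starts from i == 0
        self._noise = self._sample_noise()  # assumption: noise starts random
        return self._observation()

    def step(self, action):
        if action == LEFT:
            self._pos = max(self._pos - 1, 0)  # assumption: clamp at the left edge
        elif action == RIGHT:
            self._pos = min(self._pos + 1, self.K - 1)
        # FIRE resamples the noise; with auto_noise it changes at every step
        if self.auto_noise or action == FIRE:
            self._noise = self._sample_noise()
        done = self._pos == self.K - 1
        reward = 1.0 if done else 0.0  # assumption: 0 on non-terminal steps
        return self._observation(), reward, done, {}
```

With K=5 and M=3, four RIGHT actions move the 1 from index 0 to index 4 and end the episode, while the last three bits stay fixed unless FIRE is used or `auto_noise` is set.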
- FIRE = 1
- LEFT = 0
- RIGHT = 2
- render(mode='human', close=False)
Renders the environment.
The set of supported modes varies per environment. (And some environments do not support rendering at all.) By convention, if mode is:
human: render to the current display or terminal and return nothing. Usually for human consumption.
rgb_array: Return a numpy.ndarray with shape (x, y, 3), representing RGB values for an x-by-y pixel image, suitable for turning into a video.
ansi: Return a string (str) or StringIO.StringIO containing a terminal-style text representation. The text can include newlines and ANSI escape sequences (e.g. for colors).
Note
Make sure that your class’s metadata ‘render.modes’ key includes the list of supported modes. It’s recommended to call super() in implementations to use the functionality of this method.
- Parameters
mode (str) – the mode to render with
Example:
    import numpy as np
    from gym import Env

    class MyEnv(Env):
        metadata = {'render.modes': ['human', 'rgb_array']}

        def render(self, mode='human'):
            if mode == 'rgb_array':
                return np.array(...)  # return RGB frame suitable for video
            elif mode == 'human':
                ...  # pop up a window and render
            else:
                super(MyEnv, self).render(mode=mode)  # just raise an exception
- reset()
Resets the state of the environment and returns an initial observation.
- Returns
the initial observation.
- Return type
observation (object)
- step(action)
Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.
Accepts an action and returns a tuple (observation, reward, done, info).
- Parameters
action (object) – an action provided by the agent
- Returns
observation (object): agent’s observation of the current environment
reward (float): amount of reward returned after the previous action
done (bool): whether the episode has ended, in which case further step() calls will return undefined results
info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)
- Return type
tuple (observation, reward, done, info)
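The reset()/step() contract above implies a simple interaction loop: reset once, step until done is True, then reset again before reuse. A minimal sketch follows, assuming only that the environment returns the (observation, reward, done, info) tuple; `run_episode`, `max_steps`, and the stub `CountdownEnv` are hypothetical names introduced for the example and are not part of ALF or gym.

```python
def run_episode(env, policy, max_steps=1000):
    """Roll out one episode and return (total_reward, steps_taken)."""
    obs = env.reset()
    total_reward, steps = 0.0, 0
    for _ in range(max_steps):
        obs, reward, done, info = env.step(policy(obs))
        total_reward += reward
        steps += 1
        if done:
            # Further step() calls would be undefined; stop here and let
            # the caller reset() before the next episode.
            break
    return total_reward, steps


class CountdownEnv:
    """Hypothetical stub: the episode ends after n steps with reward 1."""

    def __init__(self, n=3):
        self.n = n

    def reset(self):
        self._left = self.n
        return self._left

    def step(self, action):
        self._left -= 1
        done = self._left == 0
        return self._left, (1.0 if done else 0.0), done, {}


total, steps = run_episode(CountdownEnv(3), policy=lambda obs: 0)
```

The `max_steps` cap is a design choice for the sketch: it keeps the loop finite when a policy never reaches a terminal state.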
alf.environments.simple.stochastic_with_risky_branch
- class StochasticWithRiskyBranch(seed=None)
Bases: gym.core.Env
A simple stochastic MDP:
s0 -> a0 - 50% -> s1 -> a0 -> T (reward=2)
All other actions terminate with reward 0. T is the terminal state.
The optimal action at s0 is a1, with q(s0, a1) = 1.8. The optimal q(s0, a0) is 1.5.
However, if Q-learning happens to be conditioned on the action sequence, then q(s0, a0, a0) will contain mostly experience of <s0, a0, s1, a0, s0> and very little of <s0, a0, s3, a0, s0>, leading to an average of about 2. q(s0, a0, a1) will be around 1, and q(s0, a1, a0) will be around 1.8. The agent could therefore end up choosing a0 at s0.
- Parameters
seed (int) – random seed for the environment.
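The gap between state-conditioned and action-sequence-conditioned values in the description can be checked with a short calculation. The sketch below is built on assumptions: only the s1 branch is spelled out in the text, so the reward table for s2 and s3, the 50/50 split after a0, and the epsilon-greedy second step (`eps`) are inferred from the quoted q-values rather than taken from the ALF source.

```python
# Hypothetical reward table: (state, second action) -> terminal reward.
# Only ("s1", 0) = 2 is stated explicitly; the rest are assumptions chosen to
# match the q-values quoted in the description.
REWARD = {
    ("s1", 0): 2.0, ("s1", 1): 0.0,
    ("s3", 0): 0.0, ("s3", 1): 1.0,
    ("s2", 0): 1.8, ("s2", 1): 0.0,
}
# Assumed transition probabilities out of s0 for each first action.
NEXT = {0: {"s1": 0.5, "s3": 0.5}, 1: {"s2": 1.0}}


def state_q(first_action):
    """q(s0, a) when the second action may depend on the observed state."""
    return sum(p * max(REWARD[(s, a)] for a in (0, 1))
               for s, p in NEXT[first_action].items())


def sequence_q(first_action, second_action, eps=0.1):
    """Average return over episodes whose action sequence is (first, second),
    when the second action is chosen eps-greedily given the state."""
    weight = total = 0.0
    for s, p in NEXT[first_action].items():
        greedy = max((0, 1), key=lambda a: REWARD[(s, a)])
        prob = 1 - eps / 2 if second_action == greedy else eps / 2
        weight += p * prob
        total += p * prob * REWARD[(s, second_action)]
    return total / weight


best_sequence = max(((a, b) for a in (0, 1) for b in (0, 1)),
                    key=lambda ab: sequence_q(*ab))
```

Under these assumptions `state_q` ranks a1 above a0 at s0, while `sequence_q(0, 0)` comes out at about 1.9 and exceeds `sequence_q(1, 0)`, so `best_sequence` starts with a0. The sequence (a0, a0) is sampled mostly from s1, where a0 is the greedy choice, which inflates its average.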
- render(mode='human', close=False)
Renders the environment.
The set of supported modes varies per environment. (And some environments do not support rendering at all.) By convention, if mode is:
human: render to the current display or terminal and return nothing. Usually for human consumption.
rgb_array: Return a numpy.ndarray with shape (x, y, 3), representing RGB values for an x-by-y pixel image, suitable for turning into a video.
ansi: Return a string (str) or StringIO.StringIO containing a terminal-style text representation. The text can include newlines and ANSI escape sequences (e.g. for colors).
Note
Make sure that your class’s metadata ‘render.modes’ key includes the list of supported modes. It’s recommended to call super() in implementations to use the functionality of this method.
- Parameters
mode (str) – the mode to render with
Example:
    import numpy as np
    from gym import Env

    class MyEnv(Env):
        metadata = {'render.modes': ['human', 'rgb_array']}

        def render(self, mode='human'):
            if mode == 'rgb_array':
                return np.array(...)  # return RGB frame suitable for video
            elif mode == 'human':
                ...  # pop up a window and render
            else:
                super(MyEnv, self).render(mode=mode)  # just raise an exception
- reset()
Resets the state of the environment and returns an initial observation.
- Returns
the initial observation.
- Return type
observation (object)
- step(action)
Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.
Accepts an action and returns a tuple (observation, reward, done, info).
- Parameters
action (object) – an action provided by the agent
- Returns
observation (object): agent’s observation of the current environment
reward (float): amount of reward returned after the previous action
done (bool): whether the episode has ended, in which case further step() calls will return undefined results
info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)
- Return type
tuple (observation, reward, done, info)