alf.environments.simple

alf.environments.simple.noisy_array

class NoisyArray(K=11, M=100, auto_noise=False)

Bases: gym.core.Env

A synthetic noisy array for testing an agent’s robustness to random noise. The binary array has length K+M: the first K entries form a one-hot vector whose 1 marks the agent’s current location, and the remaining M bits form a noise vector in {0,1}^M. For example (K=5, M=3):

0 0 1 0 0 | 0 1 1

and the agent is at i==2 now.

The agent always starts from i==0. The goal is to reach i==K-1 (it cannot step on the noise vector). It has three actions: LEFT, RIGHT, and FIRE. The FIRE action changes the noise vector into some random M bits, without changing the agent’s position. Both LEFT and RIGHT won’t change the noise vector.

In the example above, if the next action is FIRE, then the resulting array might be

0 0 1 0 0 | 1 1 0

If the next action is RIGHT, then the resulting array should be:

0 0 0 1 0 | 0 1 1

The game ends when the array looks like

0 0 0 0 1 | X X X

Parameters
  • K (int) – the length of the position subarray. K-1 is the minimum number of steps needed to take the agent from left to right and obtain a reward of 1.

  • M (int) – the length of the noise vector. The total observation length is K+M.

  • auto_noise (bool) – if True, the noise vector will change automatically at every step, and FIRE becomes “no-operation”.
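The dynamics described above can be illustrated with a minimal standalone sketch. It is not ALF's implementation: the helper name `noisy_array_step`, the clamping at the left edge, and the reward of 0 on non-terminal steps are assumptions made for illustration. Only the action ids, the observation layout, and the reward of 1 at the goal come from the description.

```python
import random

# Action ids as documented for NoisyArray.
LEFT, FIRE, RIGHT = 0, 1, 2


def noisy_array_step(position, noise, action, K, rng, auto_noise=False):
    """Hypothetical helper: apply one action to a (position, noise) state.

    Returns (position, noise, observation, reward, done).
    """
    if action == LEFT:
        position = max(position - 1, 0)      # assumption: clamp at the left edge
    elif action == RIGHT:
        position = min(position + 1, K - 1)  # the agent cannot enter the noise bits
    # FIRE resamples the noise; with auto_noise the noise changes on every step,
    # so FIRE has no additional effect.
    if auto_noise or action == FIRE:
        noise = [rng.randint(0, 1) for _ in noise]
    onehot = [1 if i == position else 0 for i in range(K)]
    done = position == K - 1
    reward = 1.0 if done else 0.0            # assumption: 0 on non-terminal steps
    return position, noise, onehot + noise, reward, done


# Reproduce the K=5, M=3 example: agent at i==2, noise bits 0 1 1, action RIGHT.
rng = random.Random(0)
pos, noise, obs, reward, done = noisy_array_step(2, [0, 1, 1], RIGHT, K=5, rng=rng)
print(obs)  # one-hot position followed by the unchanged noise bits
```

Because LEFT and RIGHT leave the noise untouched, any change in the last M bits carries no information about progress toward the goal, which is what makes the environment a robustness test.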

FIRE = 1
LEFT = 0
RIGHT = 2
render(mode='human', close=False)

Renders the environment.

The set of supported modes varies per environment. (And some environments do not support rendering at all.) By convention, if mode is:

  • human: render to the current display or terminal and return nothing. Usually for human consumption.

  • rgb_array: Return a numpy.ndarray with shape (x, y, 3), representing RGB values for an x-by-y pixel image, suitable for turning into a video.

  • ansi: Return a string (str) or StringIO.StringIO containing a terminal-style text representation. The text can include newlines and ANSI escape sequences (e.g. for colors).

Note

Make sure that your class’s metadata ‘render.modes’ key includes the list of supported modes. It’s recommended to call super() in implementations to use the functionality of this method.

Parameters

mode (str) – the mode to render with

Example:

    class MyEnv(Env):
        metadata = {'render.modes': ['human', 'rgb_array']}

        def render(self, mode='human'):
            if mode == 'rgb_array':
                return np.array(...)  # return RGB frame suitable for video
            elif mode == 'human':
                ...  # pop up a window and render
            else:
                super(MyEnv, self).render(mode=mode)  # just raise an exception

reset()

Resets the state of the environment and returns an initial observation.

Returns

observation (object) – the initial observation.

step(action)

Run one timestep of the environment’s dynamics. When the end of an episode is reached, you are responsible for calling reset() to reset this environment’s state.

Accepts an action and returns a tuple (observation, reward, done, info).

Parameters

action (object) – an action provided by the agent

Returns

  • observation (object) – agent’s observation of the current environment

  • reward (float) – amount of reward returned after the previous action

  • done (bool) – whether the episode has ended, in which case further step() calls will return undefined results

  • info (dict) – auxiliary diagnostic information (helpful for debugging, and sometimes learning)

alf.environments.simple.stochastic_with_risky_branch

class StochasticWithRiskyBranch(seed=None)

Bases: gym.core.Env

A simple stochastic MDP:

s0 -> a0 - 50% -> s1 -> a0 -> T (reward=2)
          |-- 50% -> s3 -> a1 -> T (reward=1)
   |--> a1 - 100% -> s2 -> a0 -> T (reward=1.8)

All other actions terminate with reward 0. T is the terminal state.

The optimal action at s0 is a1, with q(s0, a1) = 1.8. The optimal q(s0, a0) is 0.5 × 2 + 0.5 × 1 = 1.5.

However, if Q-learning is conditioned on the action sequence, then q(s0, a0, a0) will be estimated mostly from experience of <s0, a0, s1, a0, s0> and very little of <s0, a0, s3, a0, s0>, giving an average of about 2. Similarly, q(s0, a0, a1) will be around 1 and q(s0, a1, a0) around 1.8. The agent could therefore end up choosing a0 at s0.
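The state-conditioned Q-values at s0 follow directly from the transition diagram. The sketch below is one way to check them. The dictionary encoding, the function names, and the undiscounted return are assumptions for illustration and are not taken from the ALF source.

```python
# Hypothetical encoding of the diagram:
# (state, action) -> list of (probability, next_state, reward)
TERMINAL = "T"
MDP = {
    ("s0", "a0"): [(0.5, "s1", 0.0), (0.5, "s3", 0.0)],
    ("s0", "a1"): [(1.0, "s2", 0.0)],
    ("s1", "a0"): [(1.0, TERMINAL, 2.0)],
    ("s3", "a1"): [(1.0, TERMINAL, 1.0)],
    ("s2", "a0"): [(1.0, TERMINAL, 1.8)],
}
ACTIONS = ("a0", "a1")


def q_value(state, action):
    """Expected undiscounted return of taking `action`, then acting optimally."""
    outcomes = MDP.get((state, action))
    if outcomes is None:
        return 0.0  # all other actions terminate with reward 0
    return sum(p * (r + state_value(nxt)) for p, nxt, r in outcomes)


def state_value(state):
    """Optimal value of a state: the best Q-value over both actions."""
    if state == TERMINAL:
        return 0.0
    return max(q_value(state, a) for a in ACTIONS)


q_a0 = q_value("s0", "a0")  # averages the s1 branch (2) and the s3 branch (1)
q_a1 = q_value("s0", "a1")  # deterministic path through s2
print(q_a0, q_a1)
```

Conditioning on the state after a0 lets the agent pick a0 at s1 and a1 at s3. A value conditioned only on the action sequence cannot make that distinction, which is the source of the bias described above.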

Parameters

seed (int) – random seed for the environment.

render(mode='human', close=False)

Renders the environment.

The set of supported modes varies per environment. (And some environments do not support rendering at all.) By convention, if mode is:

  • human: render to the current display or terminal and return nothing. Usually for human consumption.

  • rgb_array: Return a numpy.ndarray with shape (x, y, 3), representing RGB values for an x-by-y pixel image, suitable for turning into a video.

  • ansi: Return a string (str) or StringIO.StringIO containing a terminal-style text representation. The text can include newlines and ANSI escape sequences (e.g. for colors).

Note

Make sure that your class’s metadata ‘render.modes’ key includes the list of supported modes. It’s recommended to call super() in implementations to use the functionality of this method.

Parameters

mode (str) – the mode to render with

Example:

    class MyEnv(Env):
        metadata = {'render.modes': ['human', 'rgb_array']}

        def render(self, mode='human'):
            if mode == 'rgb_array':
                return np.array(...)  # return RGB frame suitable for video
            elif mode == 'human':
                ...  # pop up a window and render
            else:
                super(MyEnv, self).render(mode=mode)  # just raise an exception

reset()

Resets the state of the environment and returns an initial observation.

Returns

observation (object) – the initial observation.

step(action)

Run one timestep of the environment’s dynamics. When the end of an episode is reached, you are responsible for calling reset() to reset this environment’s state.

Accepts an action and returns a tuple (observation, reward, done, info).

Parameters

action (object) – an action provided by the agent

Returns

  • observation (object) – agent’s observation of the current environment

  • reward (float) – amount of reward returned after the previous action

  • done (bool) – whether the episode has ended, in which case further step() calls will return undefined results

  • info (dict) – auxiliary diagnostic information (helpful for debugging, and sometimes learning)