tf2rl.envs package

Submodules

tf2rl.envs.atari_wrapper module

The MIT License

Copyright (c) 2017 OpenAI (http://openai.com)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

class tf2rl.envs.atari_wrapper.NoopResetEnv(env, noop_max=30)

Bases: gym.core.Wrapper

__init__(env, noop_max=30)

Sample initial states by taking a random number of no-ops on reset. No-op is assumed to be action 0.
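
A minimal usage sketch, assuming a NoFrameskip Atari id is available in the installed gym (the id below is illustrative):

    import gym
    from tf2rl.envs.atari_wrapper import NoopResetEnv

    # Each reset() now begins the episode after a random number (1..noop_max) of no-op actions.
    env = NoopResetEnv(gym.make("PongNoFrameskip-v4"), noop_max=30)
    obs = env.reset()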

reset(**kwargs)

Do no-op action for a number of steps in [1, noop_max].

step(ac)

Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.

Accepts an action and returns a tuple (observation, reward, done, info).

Parameters

action (object) – an action provided by the agent

Returns

  • observation (object): agent’s observation of the current environment

  • reward (float): amount of reward returned after previous action

  • done (bool): whether the episode has ended, in which case further step() calls will return undefined results

  • info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)

Return type

tuple (observation, reward, done, info)

class tf2rl.envs.atari_wrapper.FireResetEnv(env)

Bases: gym.core.Wrapper

__init__(env)

Take action on reset for environments that are fixed until firing.
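
A usage sketch, using Breakout as an example of a game that stays frozen until FIRE is pressed (the env id is illustrative):

    import gym
    from tf2rl.envs.atari_wrapper import FireResetEnv

    # reset() automatically presses FIRE so the game actually starts.
    env = FireResetEnv(gym.make("BreakoutNoFrameskip-v4"))
    obs = env.reset()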

reset(**kwargs)

Resets the environment to an initial state and returns an initial observation.

Note that this function should not reset the environment’s random number generator(s); random variables in the environment’s state should be sampled independently between multiple calls to reset(). In other words, each call of reset() should yield an environment suitable for a new episode, independent of previous episodes.

Returns

the initial observation.

Return type

observation (object)

step(ac)

Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.

Accepts an action and returns a tuple (observation, reward, done, info).

Parameters

action (object) – an action provided by the agent

Returns

  • observation (object): agent’s observation of the current environment

  • reward (float): amount of reward returned after previous action

  • done (bool): whether the episode has ended, in which case further step() calls will return undefined results

  • info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)

Return type

tuple (observation, reward, done, info)

class tf2rl.envs.atari_wrapper.EpisodicLifeEnv(env)

Bases: gym.core.Wrapper

__init__(env)

Make end-of-life == end-of-episode, but only reset on true game over. Done by DeepMind for the DQN and co. since it helps value estimation.
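
A usage sketch (the env id is illustrative); the comment restates the behaviour described above:

    import gym
    from tf2rl.envs.atari_wrapper import EpisodicLifeEnv

    env = EpisodicLifeEnv(gym.make("BreakoutNoFrameskip-v4"))
    obs = env.reset()
    # done becomes True whenever a life is lost, but the underlying game is
    # only truly restarted once all lives are exhausted (see reset() below).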

step(action)

Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.

Accepts an action and returns a tuple (observation, reward, done, info).

Parameters

action (object) – an action provided by the agent

Returns

  • observation (object): agent’s observation of the current environment

  • reward (float): amount of reward returned after previous action

  • done (bool): whether the episode has ended, in which case further step() calls will return undefined results

  • info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)

Return type

tuple (observation, reward, done, info)

reset(**kwargs)

Reset only when lives are exhausted. This way all states are still reachable even though lives are episodic, and the learner need not know about any of this behind-the-scenes.

class tf2rl.envs.atari_wrapper.MaxAndSkipEnv(env, skip=4)

Bases: gym.core.Wrapper

__init__(env, skip=4)

Return only every skip-th frame

step(action)

Repeat action, sum reward, and max over last observations.
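
A sketch of the “max over last observations” idea: Atari renders some sprites only on alternate frames, so the emitted observation is the pixel-wise maximum of the most recent raw frames (the two-frame buffer mirrors the Baselines implementation this wrapper follows, which is an assumption here):

    import numpy as np

    # Two consecutive raw frames; a flickering sprite may be visible in only one of them.
    frame_a = np.random.randint(0, 256, size=(210, 160, 3), dtype=np.uint8)
    frame_b = np.random.randint(0, 256, size=(210, 160, 3), dtype=np.uint8)

    # The element-wise maximum keeps the sprite visible in the returned observation.
    max_frame = np.maximum(frame_a, frame_b)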

reset(**kwargs)

Resets the environment to an initial state and returns an initial observation.

Note that this function should not reset the environment’s random number generator(s); random variables in the environment’s state should be sampled independently between multiple calls to reset(). In other words, each call of reset() should yield an environment suitable for a new episode, independent of previous episodes.

Returns

the initial observation.

Return type

observation (object)

class tf2rl.envs.atari_wrapper.ClipRewardEnv(env)

Bases: gym.core.RewardWrapper

__init__(env)
reward(reward)

Bin reward to {+1, 0, -1} by its sign.
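
A minimal sketch of the sign-based binning described above (reproducing the stated rule, not the wrapper’s internals):

    import numpy as np

    def clip_reward(reward):
        # Map any positive reward to +1, zero to 0, and any negative reward to -1.
        return float(np.sign(reward))

    assert clip_reward(7.5) == 1.0
    assert clip_reward(0.0) == 0.0
    assert clip_reward(-3.0) == -1.0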

class tf2rl.envs.atari_wrapper.WarpFrame(env, width=84, height=84, grayscale=True, dict_space_key=None)

Bases: gym.core.ObservationWrapper

__init__(env, width=84, height=84, grayscale=True, dict_space_key=None)

Warp frames to 84x84 as done in the Nature paper and later work. If the environment uses dictionary observations, dict_space_key can be specified to indicate which observation should be warped.
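
A usage sketch (the env id is illustrative; the expected output shape follows from grayscale=True and the 84x84 target size):

    import gym
    from tf2rl.envs.atari_wrapper import WarpFrame

    env = WarpFrame(gym.make("PongNoFrameskip-v4"), width=84, height=84, grayscale=True)
    obs = env.reset()
    print(obs.shape)  # expected: (84, 84, 1)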

observation(obs)
class tf2rl.envs.atari_wrapper.ProcessFrame84(env=None)

Bases: gym.core.ObservationWrapper

__init__(env=None)
observation(obs)
static process(frame)
class tf2rl.envs.atari_wrapper.FrameStack(env, k)

Bases: gym.core.Wrapper

__init__(env, k)

Stack the k last frames. Returns a lazy array, which is much more memory-efficient. See also baselines.common.atari_wrappers.LazyFrames.

reset()

Resets the environment to an initial state and returns an initial observation.

Note that this function should not reset the environment’s random number generator(s); random variables in the environment’s state should be sampled independently between multiple calls to reset(). In other words, each call of reset() should yield an environment suitable for a new episode, independent of previous episodes.

Returns

the initial observation.

Return type

observation (object)

step(action)

Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.

Accepts an action and returns a tuple (observation, reward, done, info).

Parameters

action (object) – an action provided by the agent

Returns

  • observation (object): agent’s observation of the current environment

  • reward (float): amount of reward returned after previous action

  • done (bool): whether the episode has ended, in which case further step() calls will return undefined results

  • info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)

Return type

tuple (observation, reward, done, info)

class tf2rl.envs.atari_wrapper.ScaledFloatFrame(env)

Bases: gym.core.ObservationWrapper

__init__(env)
observation(observation)
class tf2rl.envs.atari_wrapper.LazyFrames(frames)

Bases: object

__init__(frames)

This object ensures that common frames between the observations are only stored once. It exists purely to optimize memory usage, which can be huge for DQN’s 1M-frame replay buffers. This object should only be converted to a numpy array just before being passed to the model. You’d not believe how complex the previous solution was.
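
A hedged usage sketch: when FrameStack is used without a conversion wrapper such as NdarrayFrames, observations come back as LazyFrames and should be materialized only where the model needs them (the env id is illustrative):

    import numpy as np
    import gym
    from tf2rl.envs.atari_wrapper import FrameStack

    env = FrameStack(gym.make("BreakoutNoFrameskip-v4"), k=4)
    obs = env.reset()           # a LazyFrames object; frames are shared, not copied
    obs_array = np.array(obs)   # convert to a plain ndarray just before feeding the model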

class tf2rl.envs.atari_wrapper.NdarrayFrames(env)

Bases: gym.core.Wrapper

__init__(env)
reset()

Resets the environment to an initial state and returns an initial observation.

Note that this function should not reset the environment’s random number generator(s); random variables in the environment’s state should be sampled independently between multiple calls to reset(). In other words, each call of reset() should yield an environment suitable for a new episode, independent of previous episodes.

Returns

the initial observation.

Return type

observation (object)

step(action)

Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.

Accepts an action and returns a tuple (observation, reward, done, info).

Parameters

action (object) – an action provided by the agent

Returns

  • observation (object): agent’s observation of the current environment

  • reward (float): amount of reward returned after previous action

  • done (bool): whether the episode has ended, in which case further step() calls will return undefined results

  • info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)

Return type

tuple (observation, reward, done, info)

tf2rl.envs.atari_wrapper.make_atari(env_id, max_episode_steps=None)
tf2rl.envs.atari_wrapper.wrap_deepmind(env, episode_life=True, clip_rewards=True, frame_stack=False, scale=False)

Configure environment for DeepMind-style Atari.

tf2rl.envs.atari_wrapper.wrap_dqn(env, stack_frames=4, episodic_life=True, reward_clipping=True, wrap_ndarray=False)

Apply a common set of wrappers for Atari games.
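
A pipeline sketch following the Baselines convention these helpers are ported from; the env id, and the assumption that make_atari expects a NoFrameskip id, are illustrative:

    import gym
    from tf2rl.envs.atari_wrapper import make_atari, wrap_deepmind, wrap_dqn

    # Baselines-style: build the raw NoFrameskip env, then apply DeepMind preprocessing
    # (episodic life, reward clipping, 84x84 frames, optional frame stacking).
    env = wrap_deepmind(make_atari("BreakoutNoFrameskip-v4"), frame_stack=True)

    # Alternatively, wrap_dqn bundles a common set of wrappers in a single call.
    env = wrap_dqn(gym.make("BreakoutNoFrameskip-v4"))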

tf2rl.envs.dmc_wrapper module

class tf2rl.envs.dmc_wrapper.DMCWrapper(env, k, obs_shape, wait_ms=33.333333333333336, **kwargs)

Bases: tf2rl.envs.frame_stack_wrapper.FrameStack

Wrapper class to visualize DMC environments.

__init__(env, k, obs_shape, wait_ms=33.333333333333336, **kwargs)
render()

Renders the environment.

The set of supported modes varies per environment. (And some environments do not support rendering at all.) By convention, if mode is:

  • human: render to the current display or terminal and return nothing. Usually for human consumption.

  • rgb_array: Return a numpy.ndarray with shape (x, y, 3), representing RGB values for an x-by-y pixel image, suitable for turning into a video.

  • ansi: Return a string (str) or StringIO.StringIO containing a terminal-style text representation. The text can include newlines and ANSI escape sequences (e.g. for colors).

Note

Make sure that your class’s metadata ‘render.modes’ key includes the list of supported modes. It’s recommended to call super() in implementations to use the functionality of this method.

Parameters

mode (str) – the mode to render with

Example:

    class MyEnv(Env):
        metadata = {'render.modes': ['human', 'rgb_array']}

        def render(self, mode='human'):
            if mode == 'rgb_array':
                return np.array(...)  # return RGB frame suitable for video
            elif mode == 'human':
                ...  # pop up a window and render
            else:
                super(MyEnv, self).render(mode=mode)  # just raise an exception

tf2rl.envs.env_utils module

tf2rl.envs.env_utils.get_act_dim(env)

tf2rl.envs.frame_stack_wrapper module

class tf2rl.envs.frame_stack_wrapper.FrameStack(env, k, obs_shape, channel_first=False)

Bases: gym.core.Wrapper

__init__(env, k, obs_shape, channel_first=False)
reset()

Resets the environment to an initial state and returns an initial observation.

Note that this function should not reset the environment’s random number generator(s); random variables in the environment’s state should be sampled independently between multiple calls to reset(). In other words, each call of reset() should yield an environment suitable for a new episode, independent of previous episodes.

Returns

the initial observation.

Return type

observation (object)

step(action)

Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.

Accepts an action and returns a tuple (observation, reward, done, info).

Parameters

action (object) – an action provided by the agent

Returns

  • observation (object): agent’s observation of the current environment

  • reward (float): amount of reward returned after previous action

  • done (bool): whether the episode has ended, in which case further step() calls will return undefined results

  • info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)

Return type

tuple (observation, reward, done, info)

tf2rl.envs.multi_thread_env module

class tf2rl.envs.multi_thread_env.MultiThreadEnv(env_fn, batch_size, thread_pool=4, max_episode_steps=1000)

Bases: object

This class holds multiple environments; when step() is called, all of them advance by one step.

It provides TensorFlow operators for manipulating the batched environments.

__init__(env_fn, batch_size, thread_pool=4, max_episode_steps=1000)
Parameters

  • env_fn (function) – Function to make an environment

  • batch_size (int) – Batch size

  • thread_pool (int) – Thread pool size

  • max_episode_steps (int) – Maximum number of steps per episode
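
A construction sketch using the NumPy-level entry points documented below; the env id is illustrative, and the assumption that py_observation() returns the current batched observations follows only from its name:

    import gym
    import numpy as np
    from tf2rl.envs.multi_thread_env import MultiThreadEnv

    # env_fn is called to create each of the batch_size environments.
    envs = MultiThreadEnv(env_fn=lambda: gym.make("Pendulum-v0"),
                          batch_size=8, thread_pool=4, max_episode_steps=200)
    envs.py_reset()
    obs = envs.py_observation()                   # assumed: batched observations, [batch_size, dim_obs]
    actions = np.zeros((8, 1), dtype=np.float32)  # Pendulum has a 1-dimensional action space
    result = envs.py_step(actions)                # batched obs, reward, done arrays (see py_step below)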

property original_env
step(actions, name=None)
Parameters

  • actions (tf.Tensor) – Actions whose shape is float32[batch_size, dim_action]

  • name (str) – Operator name

Returns

  • obs: tf.Tensor of shape [batch_size, dim_obs]

  • reward: tf.Tensor of shape [batch_size]

  • done: tf.Tensor of shape [batch_size]

  • env_info: None

py_step(actions)
Parameters

actions (np.array) – Actions whose shape is [batch_size, dim_action]

Returns

  • obs: np.array

  • reward: np.array

  • done: np.array

py_observation()
py_reset()
property max_action
property min_action
property state_dim

tf2rl.envs.normalizer module

MIT License

Copyright (c) 2017 Preferred Networks, Inc.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

class tf2rl.envs.normalizer.EmpiricalNormalizer(shape, batch_axis=0, eps=0.01, dtype=<class 'numpy.float32'>, until=None, clip_threshold=None)

Bases: object

Normalize mean and variance of values based on empirical values.

Parameters

  • shape (int or tuple of int) – Shape of input values except batch axis.

  • batch_axis (int) – Batch axis.

  • eps (float) – Small value for stability.

  • dtype (dtype) – Dtype of input values.

  • until – If this argument is specified, the normalizer learns input values only until the sum of batch sizes exceeds it.
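
A minimal sketch of the documented interface (the shape and data are illustrative):

    import numpy as np
    from tf2rl.envs.normalizer import EmpiricalNormalizer

    normalizer = EmpiricalNormalizer(shape=(3,))
    batch = np.random.randn(32, 3).astype(np.float32)

    # Update the running mean/variance without producing normalized outputs.
    normalizer.experience(batch)
    print(normalizer.mean, normalizer.std)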

__init__(shape, batch_axis=0, eps=0.01, dtype=<class 'numpy.float32'>, until=None, clip_threshold=None)
property mean
property std
experience(x)

Update the empirical statistics from input values without computing their normalized outputs.

inverse(y)

tf2rl.envs.utils module

tf2rl.envs.utils.is_discrete(space)
tf2rl.envs.utils.get_act_dim(action_space)
tf2rl.envs.utils.is_mujoco_env(env)
tf2rl.envs.utils.is_atari_env(env)
tf2rl.envs.utils.make(id, **kwargs)

Make gym.Env with version tolerance

Parameters

id (str) – Id specifying a gym.Env registered in gym.envs.registry. Valid format is “^(?:[\w:-]+\/)?([\w:.-]+)-v(\d+)$”. See https://github.com/openai/gym/blob/v0.21.0/gym/envs/registration.py#L17-L19

Returns

Environment

Return type

gym.Env
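
A usage sketch of the version-tolerant factory (the env id is illustrative; exactly how a missing version is resolved is not specified above and is not assumed here):

    from tf2rl.envs.utils import make

    # Requests "Pendulum-v0"; the helper resolves the id against gym's registry
    # with version tolerance, as documented above.
    env = make("Pendulum-v0")
    obs = env.reset()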

Module contents