tf2rl.envs package

Submodules

tf2rl.envs.atari_wrapper module

The MIT License

Copyright (c) 2017 OpenAI (http://openai.com)

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

class tf2rl.envs.atari_wrapper.NoopResetEnv(env, noop_max=30)

Bases: gym.core.Wrapper

__init__(env, noop_max=30)

Sample initial states by taking a random number of no-ops on reset. No-op is assumed to be action 0.
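
A minimal usage sketch, assuming a NoFrameskip Atari id is available in the installed gym (the id below is illustrative):

    import gym
    from tf2rl.envs.atari_wrapper import NoopResetEnv

    # Each reset() now begins the episode after a random number (1..noop_max) of no-op actions.
    env = NoopResetEnv(gym.make("PongNoFrameskip-v4"), noop_max=30)
    obs = env.reset()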

reset(**kwargs)

Do no-op action for a number of steps in [1, noop_max].

step(ac)

Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.

Accepts an action and returns a tuple (observation, reward, done, info).

Parameters

action (object) – an action provided by the agent

Returns

  • observation (object): agent’s observation of the current environment

  • reward (float): amount of reward returned after previous action

  • done (bool): whether the episode has ended, in which case further step() calls will return undefined results

  • info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)

Return type

tuple (observation, reward, done, info)

class tf2rl.envs.atari_wrapper.FireResetEnv(env)

Bases: gym.core.Wrapper

__init__(env)

Take action on reset for environments that are fixed until firing.
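
A usage sketch, using Breakout as an example of a game that stays frozen until FIRE is pressed (the env id is illustrative):

    import gym
    from tf2rl.envs.atari_wrapper import FireResetEnv

    # reset() automatically presses FIRE so the game actually starts.
    env = FireResetEnv(gym.make("BreakoutNoFrameskip-v4"))
    obs = env.reset()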

reset(**kwargs)

Resets the environment to an initial state and returns an initial observation.

Note that this function should not reset the environment’s random number generator(s); random variables in the environment’s state should be sampled independently between multiple calls to reset(). In other words, each call of reset() should yield an environment suitable for a new episode, independent of previous episodes.

Returns

the initial observation.

Return type

observation (object)

step(ac)

Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.

Accepts an action and returns a tuple (observation, reward, done, info).

Parameters

action (object) – an action provided by the agent

Returns

  • observation (object): agent’s observation of the current environment

  • reward (float): amount of reward returned after previous action

  • done (bool): whether the episode has ended, in which case further step() calls will return undefined results

  • info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)

Return type

tuple (observation, reward, done, info)

class tf2rl.envs.atari_wrapper.EpisodicLifeEnv(env)

Bases: gym.core.Wrapper

__init__(env)

Make end-of-life == end-of-episode, but only reset on true game over. Done by DeepMind for the DQN and co. since it helps value estimation.
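
A usage sketch (the env id is illustrative); the comment restates the behaviour described above:

    import gym
    from tf2rl.envs.atari_wrapper import EpisodicLifeEnv

    env = EpisodicLifeEnv(gym.make("BreakoutNoFrameskip-v4"))
    obs = env.reset()
    # done becomes True whenever a life is lost, but the underlying game is
    # only truly restarted once all lives are exhausted (see reset() below).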

step(action)

Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.

Accepts an action and returns a tuple (observation, reward, done, info).

Parameters

action (object) – an action provided by the agent

Returns

  • observation (object): agent’s observation of the current environment

  • reward (float): amount of reward returned after previous action

  • done (bool): whether the episode has ended, in which case further step() calls will return undefined results

  • info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)

Return type

tuple (observation, reward, done, info)

reset(**kwargs)

Reset only when lives are exhausted. This way all states are still reachable even though lives are episodic, and the learner need not know about any of this behind-the-scenes.

class tf2rl.envs.atari_wrapper.MaxAndSkipEnv(env, skip=4)

Bases: gym.core.Wrapper

__init__(env, skip=4)

Return only every skip-th frame

step(action)

Repeat action, sum reward, and max over last observations.
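
A sketch of the “max over last observations” idea: Atari renders some sprites only on alternate frames, so the emitted observation is the pixel-wise maximum of the most recent raw frames (the two-frame buffer mirrors the Baselines implementation this wrapper follows, which is an assumption here):

    import numpy as np

    # Two consecutive raw frames; a flickering sprite may be visible in only one of them.
    frame_a = np.random.randint(0, 256, size=(210, 160, 3), dtype=np.uint8)
    frame_b = np.random.randint(0, 256, size=(210, 160, 3), dtype=np.uint8)

    # The element-wise maximum keeps the sprite visible in the returned observation.
    max_frame = np.maximum(frame_a, frame_b)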

reset(**kwargs)

Resets the environment to an initial state and returns an initial observation.

Note that this function should not reset the environment’s random number generator(s); random variables in the environment’s state should be sampled independently between multiple calls to reset(). In other words, each call of reset() should yield an environment suitable for a new episode, independent of previous episodes.

Returns

the initial observation.

Return type

observation (object)

class tf2rl.envs.atari_wrapper.ClipRewardEnv(env)

Bases: gym.core.RewardWrapper

__init__(env)
reward(reward)

Bin reward to {+1, 0, -1} by its sign.
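
A minimal sketch of the sign-based binning described above (reproducing the stated rule, not the wrapper’s internals):

    import numpy as np

    def clip_reward(reward):
        # Map any positive reward to +1, zero to 0, and any negative reward to -1.
        return float(np.sign(reward))

    assert clip_reward(7.5) == 1.0
    assert clip_reward(0.0) == 0.0
    assert clip_reward(-3.0) == -1.0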

class tf2rl.envs.atari_wrapper.WarpFrame(env, width=84, height=84, grayscale=True, dict_space_key=None)

Bases: gym.core.ObservationWrapper

__init__(env, width=84, height=84, grayscale=True, dict_space_key=None)

Warp frames to 84x84 as done in the Nature paper and later work. If the environment uses dictionary observations, dict_space_key can be specified to indicate which observation should be warped.
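
A usage sketch (the env id is illustrative; the expected output shape follows from grayscale=True and the 84x84 target size):

    import gym
    from tf2rl.envs.atari_wrapper import WarpFrame

    env = WarpFrame(gym.make("PongNoFrameskip-v4"), width=84, height=84, grayscale=True)
    obs = env.reset()
    print(obs.shape)  # expected: (84, 84, 1)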

observation(obs)
class tf2rl.envs.atari_wrapper.ProcessFrame84(env=None)

Bases: gym.core.ObservationWrapper

__init__(env=None)
observation(obs)
static process(frame)
class tf2rl.envs.atari_wrapper.FrameStack(env, k)

Bases: gym.core.Wrapper

__init__(env, k)

Stack the k last frames. Returns a lazy array, which is much more memory-efficient. See also baselines.common.atari_wrappers.LazyFrames.

reset()

Resets the environment to an initial state and returns an initial observation.

Note that this function should not reset the environment’s random number generator(s); random variables in the environment’s state should be sampled independently between multiple calls to reset(). In other words, each call of reset() should yield an environment suitable for a new episode, independent of previous episodes.

Returns

the initial observation.

Return type

observation (object)

step(action)

Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.

Accepts an action and returns a tuple (observation, reward, done, info).

Parameters

action (object) – an action provided by the agent

Returns

  • observation (object): agent’s observation of the current environment

  • reward (float): amount of reward returned after previous action

  • done (bool): whether the episode has ended, in which case further step() calls will return undefined results

  • info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)

Return type

tuple (observation, reward, done, info)

class tf2rl.envs.atari_wrapper.ScaledFloatFrame(env)

Bases: gym.core.ObservationWrapper

__init__(env)
observation(observation)
class tf2rl.envs.atari_wrapper.LazyFrames(frames)

Bases: object

__init__(frames)

This object ensures that common frames between the observations are only stored once. It exists purely to optimize memory usage, which can be huge for DQN’s 1M-frame replay buffers. This object should only be converted to a numpy array just before being passed to the model. You’d not believe how complex the previous solution was.
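
A hedged usage sketch: when FrameStack is used without a conversion wrapper such as NdarrayFrames, observations come back as LazyFrames and should be materialized only where the model needs them (the env id is illustrative):

    import numpy as np
    import gym
    from tf2rl.envs.atari_wrapper import FrameStack

    env = FrameStack(gym.make("BreakoutNoFrameskip-v4"), k=4)
    obs = env.reset()           # a LazyFrames object; frames are shared, not copied
    obs_array = np.array(obs)   # convert to a plain ndarray just before feeding the model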

class tf2rl.envs.atari_wrapper.NdarrayFrames(env)

Bases: gym.core.Wrapper

__init__(env)
reset()

Resets the environment to an initial state and returns an initial observation.

Note that this function should not reset the environment’s random number generator(s); random variables in the environment’s state should be sampled independently between multiple calls to reset(). In other words, each call of reset() should yield an environment suitable for a new episode, independent of previous episodes.

Returns

the initial observation.

Return type

observation (object)

step(action)

Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.

Accepts an action and returns a tuple (observation, reward, done, info).

Parameters

action (object) – an action provided by the agent

Returns

  • observation (object): agent’s observation of the current environment

  • reward (float): amount of reward returned after previous action

  • done (bool): whether the episode has ended, in which case further step() calls will return undefined results

  • info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)

Return type

tuple (observation, reward, done, info)

tf2rl.envs.atari_wrapper.make_atari(env_id, max_episode_steps=None)
tf2rl.envs.atari_wrapper.wrap_deepmind(env, episode_life=True, clip_rewards=True, frame_stack=False, scale=False)

Configure environment for DeepMind-style Atari.

tf2rl.envs.atari_wrapper.wrap_dqn(env, stack_frames=4, episodic_life=True, reward_clipping=True, wrap_ndarray=False)

Apply a common set of wrappers for Atari games.
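
A pipeline sketch following the Baselines convention these helpers are ported from; the env id, and the assumption that make_atari expects a NoFrameskip id, are illustrative:

    import gym
    from tf2rl.envs.atari_wrapper import make_atari, wrap_deepmind, wrap_dqn

    # Baselines-style: build the raw NoFrameskip env, then apply DeepMind preprocessing
    # (episodic life, reward clipping, 84x84 frames, optional frame stacking).
    env = wrap_deepmind(make_atari("BreakoutNoFrameskip-v4"), frame_stack=True)

    # Alternatively, wrap_dqn bundles a common set of wrappers in a single call.
    env = wrap_dqn(gym.make("BreakoutNoFrameskip-v4"))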

tf2rl.envs.dmc_wrapper module

class tf2rl.envs.dmc_wrapper.DMCWrapper(env, k, obs_shape, wait_ms=33.333333333333336, **kwargs)

Bases: tf2rl.envs.frame_stack_wrapper.FrameStack

Wrapper class to visualize DMC environments.

__init__(env, k, obs_shape, wait_ms=33.333333333333336, **kwargs)
render()

Renders the environment.

The set of supported modes varies per environment. (And some environments do not support rendering at all.) By convention, if mode is:

  • human: render to the current display or terminal and return nothing. Usually for human consumption.

  • rgb_array: Return a numpy.ndarray with shape (x, y, 3), representing RGB values for an x-by-y pixel image, suitable for turning into a video.

  • ansi: Return a string (str) or StringIO.StringIO containing a terminal-style text representation. The text can include newlines and ANSI escape sequences (e.g. for colors).

Note

Make sure that your class’s metadata ‘render.modes’ key includes the list of supported modes. It’s recommended to call super() in implementations to use the functionality of this method.

Parameters

mode (str) – the mode to render with

Example:

    class MyEnv(Env):
        metadata = {'render.modes': ['human', 'rgb_array']}

        def render(self, mode='human'):
            if mode == 'rgb_array':
                return np.array(...)  # return RGB frame suitable for video
            elif mode == 'human':
                ...  # pop up a window and render
            else:
                super(MyEnv, self).render(mode=mode)  # just raise an exception

tf2rl.envs.env_utils module

tf2rl.envs.env_utils.get_act_dim(env)

tf2rl.envs.frame_stack_wrapper module

class tf2rl.envs.frame_stack_wrapper.FrameStack(env, k, obs_shape, channel_first=False)

Bases: gym.core.Wrapper

__init__(env, k, obs_shape, channel_first=False)
reset()

Resets the environment to an initial state and returns an initial observation.

Note that this function should not reset the environment’s random number generator(s); random variables in the environment’s state should be sampled independently between multiple calls to reset(). In other words, each call of reset() should yield an environment suitable for a new episode, independent of previous episodes.

Returns

the initial observation.

Return type

observation (object)

step(action)

Run one timestep of the environment’s dynamics. When end of episode is reached, you are responsible for calling reset() to reset this environment’s state.

Accepts an action and returns a tuple (observation, reward, done, info).

Parameters

action (object) – an action provided by the agent

Returns

  • observation (object): agent’s observation of the current environment

  • reward (float): amount of reward returned after previous action

  • done (bool): whether the episode has ended, in which case further step() calls will return undefined results

  • info (dict): contains auxiliary diagnostic information (helpful for debugging, and sometimes learning)

Return type

tuple (observation, reward, done, info)

tf2rl.envs.multi_thread_env module

class tf2rl.envs.multi_thread_env.MultiThreadEnv(env_fn, batch_size, thread_pool=4, max_episode_steps=1000)

Bases: object

This class holds multiple environments; when step() is called, all of them advance by one step.

It provides TensorFlow operators for manipulating the batched environments.

__init__(env_fn, batch_size, thread_pool=4, max_episode_steps=1000)
Parameters

  • env_fn (function) – Function to make an environment

  • batch_size (int) – Batch size

  • thread_pool (int) – Thread pool size

  • max_episode_steps (int) – Maximum number of steps per episode
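
A construction sketch using the NumPy-level entry points documented below; the env id is illustrative, and the assumption that py_observation() returns the current batched observations follows only from its name:

    import gym
    import numpy as np
    from tf2rl.envs.multi_thread_env import MultiThreadEnv

    # env_fn is called to create each of the batch_size environments.
    envs = MultiThreadEnv(env_fn=lambda: gym.make("Pendulum-v0"),
                          batch_size=8, thread_pool=4, max_episode_steps=200)
    envs.py_reset()
    obs = envs.py_observation()                   # assumed: batched observations, [batch_size, dim_obs]
    actions = np.zeros((8, 1), dtype=np.float32)  # Pendulum has a 1-dimensional action space
    result = envs.py_step(actions)                # batched obs, reward, done arrays (see py_step below)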

property original_env
step(actions, name=None)
Parameters

  • actions (tf.Tensor) – Actions whose shape is float32[batch_size, dim_action]

  • name (str) – Operator name

Returns

  • obs: tf.Tensor of shape [batch_size, dim_obs]

  • reward: tf.Tensor of shape [batch_size]

  • done: tf.Tensor of shape [batch_size]

  • env_info: None

py_step(actions)
Parameters

actions (np.array) – Actions whose shape is [batch_size, dim_action]

Returns

  • obs: np.array

  • reward: np.array

  • done: np.array

py_observation()
py_reset()
property max_action
property min_action
property state_dim

tf2rl.envs.normalizer module

MIT License

Copyright (c) 2017 Preferred Networks, Inc.

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the “Software”), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED “AS IS”, WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

class tf2rl.envs.normalizer.EmpiricalNormalizer(shape, batch_axis=0, eps=0.01, dtype=<class 'numpy.float32'>, until=None, clip_threshold=None)

Bases: object

Normalize mean and variance of values based on empirical values.

Parameters

  • shape (int or tuple of int) – Shape of input values except batch axis.

  • batch_axis (int) – Batch axis.

  • eps (float) – Small value for stability.

  • dtype (dtype) – Dtype of input values.

  • until – If this argument is specified, the normalizer learns input values only until the sum of batch sizes exceeds it.
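
A minimal sketch of the documented interface (the shape and data are illustrative):

    import numpy as np
    from tf2rl.envs.normalizer import EmpiricalNormalizer

    normalizer = EmpiricalNormalizer(shape=(3,))
    batch = np.random.randn(32, 3).astype(np.float32)

    # Update the running mean/variance without producing normalized outputs.
    normalizer.experience(batch)
    print(normalizer.mean, normalizer.std)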

__init__(shape, batch_axis=0, eps=0.01, dtype=<class 'numpy.float32'>, until=None, clip_threshold=None)
property mean
property std
experience(x)

Update the empirical statistics from input values without computing their normalized outputs.

inverse(y)

tf2rl.envs.utils module

tf2rl.envs.utils.is_discrete(space)
tf2rl.envs.utils.get_act_dim(action_space)
tf2rl.envs.utils.is_mujoco_env(env)
tf2rl.envs.utils.is_atari_env(env)
tf2rl.envs.utils.make(id, **kwargs)

Make gym.Env with version tolerance

Parameters

id (str) – Id specifying a gym.Env registered in gym.envs.registry. Valid format is “^(?:[\w:-]+\/)?([\w:.-]+)-v(\d+)$”. See https://github.com/openai/gym/blob/v0.21.0/gym/envs/registration.py#L17-L19

Returns

Environment

Return type

gym.Env
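
A usage sketch of the version-tolerant factory (the env id is illustrative; exactly how a missing version is resolved is not specified above and is not assumed here):

    from tf2rl.envs.utils import make

    # Requests "Pendulum-v0"; the helper resolves the id against gym's registry
    # with version tolerance, as documented above.
    env = make("Pendulum-v0")
    obs = env.reset()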

Module contents