tf2rl.algos package
Submodules
tf2rl.algos.apex module
- tf2rl.algos.apex.import_tf()
- tf2rl.algos.apex.explorer(global_rb, queue, trained_steps, is_training_done, lock, env_fn, policy_fn, set_weights_fn, noise_level, n_env=64, n_thread=4, buffer_size=1024, episode_max_steps=1000, gpu=0)
Collect transitions and store them in the prioritized replay buffer.
- Parameters
global_rb – multiprocessing.managers.AutoProxy[PrioritizedReplayBuffer] Prioritized replay buffer shared among multiple explorers and a single learner. Because this object is shared across processes, it must be locked via the lock object before operating on it.
queue – multiprocessing.Queue A FIFO shared with the learner and evaluator to get the latest network weights. It is process safe, so no lock is needed when using it.
trained_steps – multiprocessing.Value Shared counter of applied gradient steps.
is_training_done – multiprocessing.Event multiprocessing.Event object to share the training status.
lock – multiprocessing.Lock Lock to synchronize with other processes.
env_fn – function Method object to generate an environment.
policy_fn – function Method object to generate an explorer.
set_weights_fn – function Method object to set network weights received from the queue.
noise_level – float Noise level for exploration. For epsilon-greedy policies such as DQN variants this is epsilon; for DDPG variants it is the variance of the Normal distribution.
n_env – int Number of environments to distribute. If this is set to more than 1, MultiThreadEnv is used.
n_thread – int Number of threads used in MultiThreadEnv.
buffer_size – int Size of the local buffer. When it fills with transitions, they are added to global_rb.
episode_max_steps – int Maximum number of steps of an episode.
gpu – int GPU id. If this is set to -1, the process uses only the CPU.
- Returns
None
- tf2rl.algos.apex.learner(global_rb, trained_steps, is_training_done, lock, env, policy_fn, get_weights_fn, n_training, update_freq, evaluation_freq, gpu, queues)
Update network weights using samples collected by explorers.
- Parameters
global_rb – multiprocessing.managers.AutoProxy[PrioritizedReplayBuffer] Prioritized replay buffer shared among multiple explorers and a single learner. Because this object is shared across processes, it must be locked via the lock object before operating on it.
trained_steps – multiprocessing.Value Shared counter of applied gradient steps.
is_training_done – multiprocessing.Event multiprocessing.Event object to share the training status.
lock – multiprocessing.Lock Lock to synchronize with other processes.
env – OpenAI Gym compatible environment object
policy_fn – function Method object to generate an explorer.
get_weights_fn – function Method object to get network weights and put them into the queue.
n_training – int Maximum number of times to apply gradients. Once the number of gradient applications exceeds this value, training finishes by setting is_training_done to True.
update_freq – int Frequency at which to update parameters, i.e., to put network parameters into the queues.
evaluation_freq – int Frequency at which to call the evaluator.
gpu – int GPU id. If this is set to -1, the process uses only the CPU.
queues – List List of Queues shared with explorers to send the latest network parameters.
- Returns
None
- tf2rl.algos.apex.evaluator(is_training_done, env, policy_fn, set_weights_fn, queue, gpu, save_model_interval=1000000, n_evaluation=10, episode_max_steps=1000, show_test_progress=False)
Evaluate trained network weights periodically.
- Parameters
is_training_done – multiprocessing.Event multiprocessing.Event object to share the training status.
env – OpenAI Gym compatible environment object
policy_fn – function Method object to generate an explorer.
set_weights_fn – function Method object to set network weights received from the queue.
queue – multiprocessing.Queue A FIFO shared with the learner to get the latest network weights. It is process safe, so no lock is needed when using it.
gpu – int GPU id. If this is set to -1, the process uses only the CPU.
save_model_interval – int Interval at which to save the model.
n_evaluation – int Number of episodes to evaluate.
episode_max_steps – int Maximum number of steps of an episode.
show_test_progress – bool If True, render is called to visualize the evaluation process.
- tf2rl.algos.apex.apex_argument(parser=None)
- tf2rl.algos.apex.prepare_experiment(env, args)
- tf2rl.algos.apex.run(args, env_fn, policy_fn, get_weights_fn, set_weights_fn)
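A minimal sketch of wiring these functions together with the DDPG agent from tf2rl.algos.ddpg. The helper functions, the Pendulum-v0 choice, and the update_target_variables import are assumptions modeled on the tf2rl example scripts, not guarantees of this module's API:

    import gym
    from tf2rl.algos.apex import apex_argument, run
    from tf2rl.algos.ddpg import DDPG
    from tf2rl.misc.target_update_ops import update_target_variables  # assumed helper

    def env_fn():
        return gym.make("Pendulum-v0")

    def policy_fn(env, name, memory_capacity=int(1e6), gpu=-1, noise_level=0.1):
        # Each explorer/learner process builds its own policy instance.
        return DDPG(
            state_shape=env.observation_space.shape,
            action_dim=env.action_space.high.size,
            max_action=env.action_space.high[0],
            memory_capacity=memory_capacity,
            name=name, sigma=noise_level, gpu=gpu)

    def get_weights_fn(policy):
        # Weights the learner pushes to explorers/evaluator through the queues.
        return [policy.actor.weights,
                policy.critic.weights,
                policy.critic_target.weights]

    def set_weights_fn(policy, weights):
        # Copy received weights into the local networks (tau=1 means full copy).
        actor_weights, critic_weights, critic_target_weights = weights
        update_target_variables(policy.actor.weights, actor_weights, tau=1.)
        update_target_variables(policy.critic.weights, critic_weights, tau=1.)
        update_target_variables(policy.critic_target.weights, critic_target_weights, tau=1.)

    if __name__ == "__main__":
        args = apex_argument().parse_args()
        run(args, env_fn, policy_fn, get_weights_fn, set_weights_fn)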
tf2rl.algos.bi_res_ddpg module
- class tf2rl.algos.bi_res_ddpg.BiResDDPG(*args, **kwargs)
Bases:
tf2rl.algos.ddpg.DDPG
Bi-Res-DDPG Agent: https://arxiv.org/abs/1905.01072
Command Line Args:
--n-warmup (int): Number of warmup steps before training. The default is 1e4.
--batch-size (int): Batch size for training. The default is 32.
--gpu (int): GPU id. -1 disables GPU. The default is 0.
--memory-capacity (int): Replay Buffer size. The default is 1e6.
--eta (float): Gradient mixing factor. The default is 0.05.
- __init__(eta=0.05, name='BiResDDPG', **kwargs)
Initialize BiResDDPG agent
- Parameters
eta (float) – Gradient mixing factor. The default is 0.05.
name (str) – Name of agent. The default is "BiResDDPG".
state_shape (iterable of int) –
action_dim (int) –
max_action (float) – Size of maximum action (-max_action <= action <= max_action). The default is 1.
lr_actor (float) – Learning rate for actor network. The default is 0.001.
lr_critic (float) – Learning rate for critic network. The default is 0.001.
actor_units (iterable of int) – Number of units at hidden layers of actor.
critic_units (iterable of int) – Number of units at hidden layers of critic.
sigma (float) – Standard deviation of Gaussian noise. The default is 0.1.
tau (float) – Weight update ratio for target network: target = (1-tau)*target + tau*network. The default is 0.005.
n_warmup (int) – Number of warmup steps before training. The default is 1e4.
memory_capacity (int) – Replay Buffer size. The default is 1e6.
batch_size (int) – Batch size. The default is 256.
discount (float) – Discount factor. The default is 0.99.
max_grad (float) – Maximum gradient. The default is 10.
gpu (int) – GPU id. -1 disables GPU. The default is 0.
- compute_td_error(states, actions, next_states, rewards, dones)
Compute TD error
- Parameters
states –
actions –
next_states –
rewards –
dones –
- Returns
Sum of two TD errors.
- Return type
np.ndarray
- static get_argument(parser=None)
Create or update argument parser for command line program
- Parameters
parser (argparse.ArgumentParser, optional) – argument parser
- Returns
argument parser
- Return type
argparse.ArgumentParser
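As a quick illustration, the sketch below constructs the agent on a toy continuous-control task; Pendulum-v0 is an arbitrary choice, and the remaining keyword arguments fall through to the DDPG base class:

    import gym
    from tf2rl.algos.bi_res_ddpg import BiResDDPG

    env = gym.make("Pendulum-v0")
    agent = BiResDDPG(
        eta=0.05,  # gradient mixing factor for the two residual TD errors
        state_shape=env.observation_space.shape,
        action_dim=env.action_space.high.size,
        max_action=env.action_space.high[0],
        gpu=-1)    # CPU only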
tf2rl.algos.categorical_dqn module
- class tf2rl.algos.categorical_dqn.QFunc(*args, **kwargs)
Bases:
tensorflow.python.keras.engine.training.Model
- __init__(state_shape, action_dim, units=(32, 32), name='CategoricalQFunc', enable_dueling_dqn=False, enable_noisy_dqn=False, n_atoms=51)
- call(inputs)
Calls the model on new inputs.
In this case call just reapplies all ops in the graph to the new inputs (e.g. build a new computational graph from the provided inputs).
- Parameters
inputs – A tensor or list of tensors.
training – Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode.
mask – A mask or list of masks. A mask can be either a tensor or None (no mask).
- Returns
A tensor if there is a single output, or a list of tensors if there is more than one output.
- property n_atoms
- class tf2rl.algos.categorical_dqn.CategoricalDQN(*args, **kwargs)
Bases:
tf2rl.algos.policy_base.OffPolicyAgent
Categorical DQN Agent: https://arxiv.org/abs/1707.06887
Categorical DQN supports the following algorithms:
Dueling Network: https://arxiv.org/abs/1511.06581
Noisy Network: https://arxiv.org/abs/1706.10295
Command Line Args:
--n-warmup (int): Number of warmup steps before training. The default is 1e4.
--batch-size (int): Batch size of training. The default is 32.
--gpu (int): GPU id. -1 disables GPU. The default is 0.
--memory-capacity (int): Replay Buffer size. The default is 1e6.
--enable-double-dqn: Enable DDQN
--enable-dueling-dqn: Enable Dueling Network
--enable-noisy-dqn: Enable Noisy Network
- __init__(state_shape, action_dim, q_func=None, name='DQN', lr=0.001, adam_eps=1e-07, units=(32, 32), epsilon=0.1, epsilon_min=None, epsilon_decay_step=1000000, n_warmup=10000, target_replace_interval=5000, memory_capacity=1000000, enable_double_dqn=False, enable_dueling_dqn=False, enable_noisy_dqn=False, **kwargs)
Initialize Categorical DQN
- Parameters
state_shape (iterable of int) – Observation space shape
action_dim (int) – Dimension of discrete action
q_func (QFunc) – Custom Q function class. If None (default), the Q function is constructed with QFunc.
name (str) – Name of agent. The default is "DQN".
lr (float) – Learning rate. The default is 0.001.
adam_eps (float) – Epsilon for Adam. The default is 1e-7.
units (iterable of int) – Units of hidden layers. The default is (32, 32).
epsilon (float) – Initial epsilon of e-greedy. The default is 0.1.
epsilon_min (float) – Minimum epsilon after decay.
epsilon_decay_step (int) – Number of steps over which epsilon decays. The default is 1e6.
n_warmup (int) – Number of warmup steps before training. The default is 1e4.
target_replace_interval (int) – Number of steps between target network updates. The default is 5e3.
memory_capacity (int) – Size of replay buffer. The default is 1e6.
enable_double_dqn (bool) – Whether to use Double DQN. The default is False.
enable_dueling_dqn (bool) – Whether to use Dueling Network. The default is False.
enable_noisy_dqn (bool) – Whether to use Noisy Network. The default is False.
optimizer (tf.keras.optimizers.Optimizer) – Custom optimizer
batch_size (int) – Batch size. The default is 256.
discount (float) – Discount factor. The default is 0.99.
max_grad (float) – Maximum gradient. The default is 10.
gpu (int) – GPU id. -1 disables GPU. The default is 0.
- get_action(state, test=False, tensor=False)
Get action
- Parameters
state – Observation state
test (bool) – When False (default), policy returns an exploratory action.
tensor (bool) – When True, the return type is tf.Tensor.
- Returns
Selected action
- Return type
tf.Tensor or np.ndarray or float
- train(states, actions, next_states, rewards, done, weights=None)
Train DQN
- Parameters
states –
actions –
next_states –
rewards –
done –
weights (optional) – Weights for importance sampling
- compute_td_error(states, actions, next_states, rewards, dones)
Compute TD error
- Parameters
states –
actions –
next_states –
rewards –
dones –
- Returns
tf.Tensor: TD error
- static get_argument(parser=None)
Create or update argument parser for command line program
- Parameters
parser (argparse.ArgumentParser, optional) – argument parser
- Returns
argument parser
- Return type
argparse.ArgumentParser
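A hedged usage sketch, assuming the Trainer helper from tf2rl.experiments.trainer (as used in the project's examples, not documented in this module); CartPole-v0 stands in for any discrete-action environment:

    import gym
    from tf2rl.algos.categorical_dqn import CategoricalDQN
    from tf2rl.experiments.trainer import Trainer  # assumed helper

    parser = Trainer.get_argument()
    parser = CategoricalDQN.get_argument(parser)
    args = parser.parse_args()

    env = gym.make("CartPole-v0")
    test_env = gym.make("CartPole-v0")
    policy = CategoricalDQN(
        state_shape=env.observation_space.shape,
        action_dim=env.action_space.n,
        enable_double_dqn=args.enable_double_dqn,
        enable_dueling_dqn=args.enable_dueling_dqn,
        enable_noisy_dqn=args.enable_noisy_dqn,
        gpu=args.gpu)
    Trainer(policy, env, args, test_env=test_env)()  # run the training loop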
tf2rl.algos.curl_sac module
- class tf2rl.algos.curl_sac.CURL(*args, **kwargs)
Bases:
tf2rl.algos.sac_ae.SACAE
Contrastive Unsupervised Representations for Reinforcement Learning (CURL) Agent: https://arxiv.org/abs/2004.04136
Command Line Args:
--n-warmup (int): Number of warmup steps before training. The default is 1e4.
--batch-size (int): Batch size of training. The default is 32.
--gpu (int): GPU id. -1 disables GPU. The default is 0.
--memory-capacity (int): Replay Buffer size. The default is 1e5.
--alpha (float): Temperature parameter. The default is 0.2.
--auto-alpha: Automatic alpha tuning.
--stop-q-grad: Whether to stop gradients after the convolutional layers of the Encoder
- __init__(*args, **kwargs)
Initialize CURL
- Parameters
action_dim (int) –
obs_shape (iterable of int) – The default is (84, 84, 9).
n_conv_layers (int) – Number of convolutional layers at encoder. The default is 4.
n_conv_filters (int) – Number of filters in convolutional layers. The default is 32.
feature_dim (int) – Number of features after encoder. These features are treated as the SAC input. The default is 50.
tau_encoder (float) – Target network update rate for Encoder. The default is 0.05.
tau_critic (float) – Target network update rate for Critic. The default is 0.01.
auto_alpha (bool) – Automatic alpha tuning. The default is True.
lr_sac (float) – Learning rate for SAC. The default is 1e-3.
lr_encoder (float) – Learning rate for Encoder. The default is 1e-3.
lr_decoder (float) – Learning rate for Decoder. The default is 1e-3.
update_critic_target_freq (int) – The default is 2.
update_actor_freq (int) – The default is 2.
lr_alpha (float) – Learning rate for alpha. The default is 1e-4.
init_temperature (float) – Initial temperature. The default is 0.1.
stop_q_grad (bool) – Whether to stop gradient propagation after the encoder convolutional network. The default is False.
lambda_latent_val (float) – AE loss = REC loss + lambda_latent_val * latent loss. The default is 1e-6.
decoder_weight_lambda (float) – Weight decay of AdamW for Decoder. The default is 1e-7.
name (str) – Name of network. The default is "CURL".
max_action (float) –
actor_units (iterable of int) – Numbers of units at hidden layers of actor. The default is (256, 256).
critic_units (iterable of int) – Numbers of units at hidden layers of critic. The default is (256, 256).
alpha (float) – Temperature parameter. The default is 0.2.
n_warmup (int) – Number of warmup steps before training. The default is int(1e4).
memory_capacity (int) – Replay Buffer size. The default is int(1e6).
batch_size (int) – Batch size. The default is 256.
discount (float) – Discount factor. The default is 0.99.
max_grad (float) – Maximum gradient. The default is 10.
gpu (int) – GPU id. -1 disables GPU. The default is 0.
- train(states, actions, next_states, rewards, dones, weights=None)
Train CURL
- Parameters
states –
actions –
next_states –
rewards –
dones –
weights (optional) – Weights for importance sampling
tf2rl.algos.d2rl_sac module
- class tf2rl.algos.d2rl_sac.DenseCriticQ(*args, **kwargs)
Bases:
tf2rl.algos.sac.CriticQ
- call(states, actions)
Calls the model on new inputs.
In this case call just reapplies all ops in the graph to the new inputs (e.g. build a new computational graph from the provided inputs).
- Parameters
inputs – A tensor or list of tensors.
training – Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode.
mask – A mask or list of masks. A mask can be either a tensor or None (no mask).
- Returns
A tensor if there is a single output, or a list of tensors if there is more than one output.
- class tf2rl.algos.d2rl_sac.DenseGaussianActor(*args, **kwargs)
- class tf2rl.algos.d2rl_sac.D2RLSAC(*args, **kwargs)
Bases:
tf2rl.algos.sac.SAC
- __init__(*args, **kwargs)
Initialize SAC
- Parameters
state_shape (iterable of int) –
action_dim (int) –
name (str) – Name of network. The default is "SAC".
max_action (float) –
lr (float) – Learning rate. The default is 3e-4.
lr_alpha (float) – Learning rate for alpha. The default is 3e-4.
actor_units (iterable of int) – Numbers of units at hidden layers of actor. The default is (256, 256).
critic_units (iterable of int) – Numbers of units at hidden layers of critic. The default is (256, 256).
tau (float) – Target network update rate.
alpha (float) – Temperature parameter. The default is 0.2.
auto_alpha (bool) – Automatic alpha tuning.
init_temperature (float) – Initial temperature
n_warmup (int) – Number of warmup steps before training. The default is int(1e4).
memory_capacity (int) – Replay Buffer size. The default is int(1e6).
batch_size (int) – Batch size. The default is 256.
discount (float) – Discount factor. The default is 0.99.
max_grad (float) – Maximum gradient. The default is 10.
gpu (int) – GPU id. -1 disables GPU. The default is 0.
tf2rl.algos.ddpg module
- class tf2rl.algos.ddpg.Actor(*args, **kwargs)
Bases:
tensorflow.python.keras.engine.training.Model
- __init__(state_shape, action_dim, max_action, units=(400, 300), name='Actor')
- call(inputs)
Calls the model on new inputs.
In this case call just reapplies all ops in the graph to the new inputs (e.g. build a new computational graph from the provided inputs).
- Parameters
inputs – A tensor or list of tensors.
training – Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode.
mask – A mask or list of masks. A mask can be either a tensor or None (no mask).
- Returns
A tensor if there is a single output, or a list of tensors if there is more than one output.
- class tf2rl.algos.ddpg.Critic(*args, **kwargs)
Bases:
tensorflow.python.keras.engine.training.Model
- __init__(state_shape, action_dim, units=(400, 300), name='Critic')
- call(states, actions)
Calls the model on new inputs.
In this case call just reapplies all ops in the graph to the new inputs (e.g. build a new computational graph from the provided inputs).
- Parameters
inputs – A tensor or list of tensors.
training – Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode.
mask – A mask or list of masks. A mask can be either a tensor or None (no mask).
- Returns
A tensor if there is a single output, or a list of tensors if there is more than one output.
- class tf2rl.algos.ddpg.DDPG(*args, **kwargs)
Bases:
tf2rl.algos.policy_base.OffPolicyAgent
DDPG agent: https://arxiv.org/abs/1509.02971
Command Line Args:
--n-warmup (int): Number of warmup steps before training. The default is 1e4.
--batch-size (int): Batch size for training. The default is 32.
--gpu (int): GPU id. -1 disables GPU. The default is 0.
--memory-capacity (int): Replay Buffer size. The default is 1e6.
- __init__(state_shape, action_dim, name='DDPG', max_action=1.0, lr_actor=0.001, lr_critic=0.001, actor_units=(400, 300), critic_units=(400, 300), sigma=0.1, tau=0.005, n_warmup=10000, memory_capacity=1000000, **kwargs)
Initialize DDPG agent
- Parameters
state_shape (iterable of int) –
action_dim (int) –
name (str) – Name of agent. The default is "DDPG".
max_action (float) – Size of maximum action (-max_action <= action <= max_action). The default is 1.
lr_actor (float) – Learning rate for actor network. The default is 0.001.
lr_critic (float) – Learning rate for critic network. The default is 0.001.
actor_units (iterable of int) – Number of units at hidden layers of actor.
critic_units (iterable of int) – Number of units at hidden layers of critic.
sigma (float) – Standard deviation of Gaussian noise. The default is 0.1.
tau (float) – Weight update ratio for target network: target = (1-tau)*target + tau*network. The default is 0.005.
n_warmup (int) – Number of warmup steps before training. The default is 1e4.
memory_capacity (int) – Replay Buffer size. The default is 1e6.
batch_size (int) – Batch size. The default is 256.
discount (float) – Discount factor. The default is 0.99.
max_grad (float) – Maximum gradient. The default is 10.
gpu (int) – GPU id. -1 disables GPU. The default is 0.
- get_action(state, test=False, tensor=False)
Get action
- Parameters
state – Observation state
test (bool) – When False (default), policy returns an exploratory action.
tensor (bool) – When True, the return type is tf.Tensor.
- Returns
Selected action
- Return type
tf.Tensor or np.ndarray or float
- train(states, actions, next_states, rewards, done, weights=None)
Train DDPG
- Parameters
states –
actions –
next_states –
rewards –
done –
weights (optional) – Weights for importance sampling
- compute_td_error(states, actions, next_states, rewards, dones)
Compute TD error
- Parameters
states –
actions –
next_states –
rewards –
dones –
- Returns
tf.Tensor: TD error
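For reference, the sketch below mirrors the quick-start example in the tf2rl README; Trainer lives in tf2rl.experiments.trainer and is assumed here rather than documented in this module:

    import gym
    from tf2rl.algos.ddpg import DDPG
    from tf2rl.experiments.trainer import Trainer  # assumed helper, as in the README

    parser = Trainer.get_argument()
    parser = DDPG.get_argument(parser)
    args = parser.parse_args()

    env = gym.make("Pendulum-v0")
    test_env = gym.make("Pendulum-v0")
    policy = DDPG(
        state_shape=env.observation_space.shape,
        action_dim=env.action_space.high.size,
        max_action=env.action_space.high[0],
        gpu=-1,        # run on CPU
        batch_size=32,
        n_warmup=500)  # short warmup for a toy task
    trainer = Trainer(policy, env, args, test_env=test_env)
    trainer()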
tf2rl.algos.dqn module
- class tf2rl.algos.dqn.QFunc(*args, **kwargs)
Bases:
tensorflow.python.keras.engine.training.Model
- __init__(state_shape, action_dim, units=(32, 32), name='QFunc', enable_dueling_dqn=False, enable_noisy_dqn=False)
- call(inputs)
Calls the model on new inputs.
In this case call just reapplies all ops in the graph to the new inputs (e.g. build a new computational graph from the provided inputs).
- Parameters
inputs – A tensor or list of tensors.
training – Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode.
mask – A mask or list of masks. A mask can be either a tensor or None (no mask).
- Returns
A tensor if there is a single output, or a list of tensors if there is more than one output.
- class tf2rl.algos.dqn.DQN(*args, **kwargs)
Bases:
tf2rl.algos.policy_base.OffPolicyAgent
DQN Agent: https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf
DQN supports the following algorithms:
Dueling Network: https://arxiv.org/abs/1511.06581
Noisy Network: https://arxiv.org/abs/1706.10295
Command Line Args:
--n-warmup (int): Number of warmup steps before training. The default is 1e4.
--batch-size (int): Batch size of training. The default is 32.
--gpu (int): GPU id. -1 disables GPU. The default is 0.
--memory-capacity (int): Replay Buffer size. The default is 1e6.
--enable-double-dqn: Enable DDQN
--enable-dueling-dqn: Enable Dueling Network
--enable-noisy-dqn: Enable Noisy Network
- __init__(state_shape, action_dim, q_func=None, name='DQN', lr=0.001, adam_eps=1e-07, units=(32, 32), epsilon=0.1, epsilon_min=None, epsilon_decay_step=1000000, n_warmup=10000, target_replace_interval=5000, memory_capacity=1000000, enable_double_dqn=False, enable_dueling_dqn=False, enable_noisy_dqn=False, optimizer=None, **kwargs)
Initialize DQN agent
- Parameters
state_shape (iterable of int) – Observation space shape
action_dim (int) – Dimension of discrete action
q_func (QFunc) – Custom Q function class. If None (default), the Q function is constructed with QFunc.
name (str) – Name of agent. The default is "DQN".
lr (float) – Learning rate. The default is 0.001.
adam_eps (float) – Epsilon for Adam. The default is 1e-7.
units (iterable of int) – Units of hidden layers. The default is (32, 32).
epsilon (float) – Initial epsilon of e-greedy. The default is 0.1.
epsilon_min (float) – Minimum epsilon after decay.
epsilon_decay_step (int) – Number of steps over which epsilon decays. The default is 1e6.
n_warmup (int) – Number of warmup steps before training. The default is 1e4.
target_replace_interval (int) – Number of steps between target network updates. The default is 5e3.
memory_capacity (int) – Size of replay buffer. The default is 1e6.
enable_double_dqn (bool) – Whether to use Double DQN. The default is False.
enable_dueling_dqn (bool) – Whether to use Dueling Network. The default is False.
enable_noisy_dqn (bool) – Whether to use Noisy Network. The default is False.
optimizer (tf.keras.optimizers.Optimizer) – Custom optimizer
batch_size (int) – Batch size. The default is 256.
discount (float) – Discount factor. The default is 0.99.
max_grad (float) – Maximum gradient. The default is 10.
gpu (int) – GPU id. -1 disables GPU. The default is 0.
- get_action(state, test=False, tensor=False)
Get action
- Parameters
state – Observation state
test (bool) – When False (default), policy returns an exploratory action.
tensor (bool) – When True, the return type is tf.Tensor.
- Returns
Selected action
- Return type
tf.Tensor or np.ndarray or float
- train(states, actions, next_states, rewards, done, weights=None)
Train DQN
- Parameters
states –
actions –
next_states –
rewards –
done –
weights (optional) – Weights for importance sampling
- compute_td_error(states, actions, next_states, rewards, dones)
Compute TD error
- Parameters
states –
actions –
next_states –
rewards –
dones –
- Returns
tf.Tensor: TD error
- static get_argument(parser=None)
Create or update argument parser for command line program
- Parameters
parser (argparse.ArgumentParser, optional) – argument parser
- Returns
argument parser
- Return type
argparse.ArgumentParser
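The get_action interface can also be exercised directly, as in this small sketch (CartPole-v0 is an arbitrary discrete-action environment):

    import gym
    from tf2rl.algos.dqn import DQN

    env = gym.make("CartPole-v0")
    policy = DQN(
        state_shape=env.observation_space.shape,
        action_dim=env.action_space.n,
        enable_double_dqn=True,
        gpu=-1)

    obs = env.reset()
    exploratory = policy.get_action(obs)        # epsilon-greedy during training
    greedy = policy.get_action(obs, test=True)  # greedy action at test time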
tf2rl.algos.gaifo module
- class tf2rl.algos.gaifo.Discriminator(*args, **kwargs)
Bases:
tf2rl.algos.gail.Discriminator
- __init__(state_shape, units=(32, 32), enable_sn=False, output_activation='sigmoid', name='Discriminator')
- class tf2rl.algos.gaifo.GAIfO(*args, **kwargs)
Bases:
tf2rl.algos.gail.GAIL
Generative Adversarial Imitation from Observation (GAIfO) Agent: https://arxiv.org/abs/1807.06158
Command Line Args:
--n-warmup (int): Number of warmup steps before training. The default is 1e4.
--batch-size (int): Batch size of training. The default is 32.
--gpu (int): GPU id. -1 disables GPU. The default is 0.
--memory-capacity (int): Replay Buffer size. The default is 1e4.
--enable-sn: Enable Spectral Normalization
- __init__(state_shape, units=(32, 32), lr=0.001, enable_sn=False, name='GAIfO', **kwargs)
Initialize GAIfO
- Parameters
state_shape (iterable of int) –
units (iterable of int) – The default is (32, 32).
lr (float) – Learning rate. The default is 0.001.
enable_sn (bool) – Whether to enable Spectral Normalization. The default is False.
name (str) – The default is "GAIfO".
- train(agent_states, agent_next_states, expert_states, expert_next_states, **kwargs)
Train GAIfO
- Parameters
agent_states –
agent_next_states –
expert_states –
expert_next_states –
- inference(states, actions, next_states)
Infer Reward with GAIfO
- Parameters
states –
actions –
next_states –
- Returns
Reward
- Return type
tf.Tensor
tf2rl.algos.gail module
- class tf2rl.algos.gail.Discriminator(*args, **kwargs)
Bases:
tensorflow.python.keras.engine.training.Model
- __init__(state_shape, action_dim, units=(32, 32), enable_sn=False, output_activation='sigmoid', name='Discriminator')
- call(inputs)
Calls the model on new inputs.
In this case call just reapplies all ops in the graph to the new inputs (e.g. build a new computational graph from the provided inputs).
- Parameters
inputs – A tensor or list of tensors.
training – Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode.
mask – A mask or list of masks. A mask can be either a tensor or None (no mask).
- Returns
A tensor if there is a single output, or a list of tensors if there is more than one output.
- compute_reward(inputs)
- class tf2rl.algos.gail.GAIL(*args, **kwargs)
Bases:
tf2rl.algos.policy_base.IRLPolicy
Generative Adversarial Imitation Learning (GAIL) Agent: https://arxiv.org/abs/1606.03476
Command Line Args:
--n-warmup (int): Number of warmup steps before training. The default is 1e4.
--batch-size (int): Batch size of training. The default is 32.
--gpu (int): GPU id. -1 disables GPU. The default is 0.
--memory-capacity (int): Replay Buffer size. The default is 1e4.
--enable-sn: Enable Spectral Normalization
- __init__(state_shape, action_dim, units=[32, 32], lr=0.001, enable_sn=False, name='GAIL', **kwargs)
Initialize GAIL
- Parameters
state_shape (iterable of int) –
action_dim (int) –
units (iterable of int) – The default is [32, 32].
lr (float) – Learning rate. The default is 0.001.
enable_sn (bool) – Whether to enable Spectral Normalization. The default is False.
name (str) – The default is "GAIL".
- train(agent_states, agent_acts, expert_states, expert_acts, **kwargs)
Train GAIL
- Parameters
agent_states –
agent_acts –
expert_states –
expert_acts –
- inference(states, actions, next_states)
Infer Reward with GAIL
- Parameters
states –
actions –
next_states –
- Returns
Reward
- Return type
tf.Tensor
- static get_argument(parser=None)
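A self-contained sketch of the train/inference cycle on random toy batches; in practice the agent batches come from policy rollouts and the expert batches from recorded demonstrations:

    import numpy as np
    from tf2rl.algos.gail import GAIL

    state_dim, action_dim, batch = 3, 1, 32
    irl = GAIL(state_shape=(state_dim,), action_dim=action_dim,
               units=[32, 32], gpu=-1)

    # Toy stand-ins for rollout and demonstration batches.
    agent_states = np.random.rand(batch, state_dim).astype(np.float32)
    agent_acts = np.random.rand(batch, action_dim).astype(np.float32)
    expert_states = np.random.rand(batch, state_dim).astype(np.float32)
    expert_acts = np.random.rand(batch, action_dim).astype(np.float32)
    next_states = np.random.rand(batch, state_dim).astype(np.float32)

    irl.train(agent_states, agent_acts, expert_states, expert_acts)
    rewards = irl.inference(agent_states, agent_acts, next_states)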
tf2rl.algos.policy_base module
- class tf2rl.algos.policy_base.Policy(*args, **kwargs)
Bases:
tensorflow.python.keras.engine.training.Model
- __init__(name, memory_capacity, update_interval=1, batch_size=256, discount=0.99, n_warmup=0, max_grad=10.0, n_epoch=1, gpu=0)
- get_action(observation, test=False)
- static get_argument(parser=None)
- class tf2rl.algos.policy_base.OnPolicyAgent(*args, **kwargs)
Bases:
tf2rl.algos.policy_base.Policy
Base class for on-policy agents
- __init__(horizon=2048, lam=0.95, enable_gae=True, normalize_adv=True, entropy_coef=0.01, vfunc_coef=1.0, **kwargs)
- static get_argument(parser=None)
- class tf2rl.algos.policy_base.OffPolicyAgent(*args, **kwargs)
Bases:
tf2rl.algos.policy_base.Policy
Base class for off-policy agents
- __init__(memory_capacity, **kwargs)
- static get_argument(parser=None)
- class tf2rl.algos.policy_base.IRLPolicy(*args, **kwargs)
Bases:
tf2rl.algos.policy_base.Policy
- __init__(n_training=1, memory_capacity=0, **kwargs)
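These base classes mainly fix the constructor contract and the get_action/get_argument hooks. A hypothetical minimal subclass, just to illustrate the shape of that contract (the class and its behavior are illustrative assumptions, not part of tf2rl):

    import numpy as np
    from tf2rl.algos.policy_base import OffPolicyAgent

    class RandomAgent(OffPolicyAgent):
        """Hypothetical agent that ignores observations and acts randomly."""

        def __init__(self, action_dim, memory_capacity=int(1e6), **kwargs):
            super().__init__(memory_capacity=memory_capacity,
                             name="RandomAgent", **kwargs)
            self._action_dim = action_dim

        def get_action(self, observation, test=False):
            # A real agent would map the observation through its networks here.
            return np.random.uniform(-1., 1., size=self._action_dim)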
tf2rl.algos.ppo module
- class tf2rl.algos.ppo.PPO(*args, **kwargs)
Bases:
tf2rl.algos.vpg.VPG
Proximal Policy Optimization (PPO) Agent: https://arxiv.org/abs/1707.06347
Command Line Args:
--batch-size (int): Batch size of training. The default is 32.
--gpu (int): GPU id. -1 disables GPU. The default is 0.
--horizon (int): The default is 2048.
--normalize_adv: Normalize Advantage.
--enable-gae: Enable GAE.
- __init__(clip=True, clip_ratio=0.2, name='PPO', **kwargs)
Initialize PPO
- Parameters
clip (bool) – Whether to clip or not. The default is True.
clip_ratio (float) – The probability ratio is clipped between 1-clip_ratio and 1+clip_ratio. The default is 0.2.
name (str) – Name of agent. The default is "PPO".
state_shape (iterable of int) –
action_dim (int) –
is_discrete (bool) –
actor –
critic –
actor_critic –
max_action (float) – Maximum action size.
actor_units (iterable of int) – Numbers of units at hidden layers of actor. The default is (256, 256).
critic_units (iterable of int) – Numbers of units at hidden layers of critic. The default is (256, 256).
lr_actor (float) – Learning rate of actor. The default is 1e-3.
lr_critic (float) – Learning rate of critic. The default is 3e-3.
hidden_activation_actor (str) – Activation for actor. The default is "relu".
hidden_activation_critic (str) – Activation for critic. The default is "relu".
horizon (int) – Number of steps of online episode horizon. The horizon must be a multiple of batch_size. The default is 2048.
enable_gae (bool) – Enable GAE. The default is True.
normalize_adv (bool) – Normalize Advantage. The default is True.
entropy_coef (float) – Entropy coefficient. The default is 0.01.
vfunc_coef (float) – Mixing ratio factor for actor and critic: actor_loss + vfunc_coef*critic_loss.
batch_size (int) – Batch size. The default is 256.
- train(states, actions, advantages, logp_olds, returns)
Train PPO
- Parameters
states –
actions –
advantages –
logp_olds –
returns –
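A constructor sketch for a continuous-action task; note the horizon/batch_size constraint from the parameter list above (Pendulum-v0 is an arbitrary choice):

    import gym
    from tf2rl.algos.ppo import PPO

    env = gym.make("Pendulum-v0")
    policy = PPO(
        state_shape=env.observation_space.shape,
        action_dim=env.action_space.high.size,
        is_discrete=False,
        max_action=env.action_space.high[0],
        clip_ratio=0.2,
        horizon=2048,
        batch_size=64,  # horizon must be a multiple of batch_size
        gpu=-1)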
tf2rl.algos.sac module
- class tf2rl.algos.sac.CriticQ(*args, **kwargs)
Bases:
tensorflow.python.keras.engine.training.Model
- __init__(state_shape, action_dim, critic_units=(256, 256), name='qf')
- call(states, actions)
Calls the model on new inputs.
In this case call just reapplies all ops in the graph to the new inputs (e.g. build a new computational graph from the provided inputs).
- Parameters
inputs – A tensor or list of tensors.
training – Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode.
mask – A mask or list of masks. A mask can be either a tensor or None (no mask).
- Returns
A tensor if there is a single output, or a list of tensors if there is more than one output.
- class tf2rl.algos.sac.SAC(*args, **kwargs)
Bases:
tf2rl.algos.policy_base.OffPolicyAgent
Soft Actor-Critic (SAC) Agent: https://arxiv.org/abs/1801.01290
Command Line Args:
--n-warmup (int): Number of warmup steps before training. The default is 1e4.
--batch-size (int): Batch size of training. The default is 32.
--gpu (int): GPU id. -1 disables GPU. The default is 0.
--memory-capacity (int): Replay Buffer size. The default is 1e6.
--alpha (float): Temperature parameter. The default is 0.2.
--auto-alpha: Automatic alpha tuning.
- __init__(state_shape, action_dim, name='SAC', max_action=1.0, lr=0.0003, lr_alpha=0.0003, actor_units=(256, 256), critic_units=(256, 256), tau=0.005, alpha=0.2, auto_alpha=False, init_temperature=None, n_warmup=10000, memory_capacity=1000000, **kwargs)
Initialize SAC
- Parameters
state_shape (iterable of int) –
action_dim (int) –
name (str) – Name of network. The default is "SAC".
max_action (float) –
lr (float) – Learning rate. The default is 3e-4.
lr_alpha (float) – Learning rate for alpha. The default is 3e-4.
actor_units (iterable of int) – Numbers of units at hidden layers of actor. The default is (256, 256).
critic_units (iterable of int) – Numbers of units at hidden layers of critic. The default is (256, 256).
tau (float) – Target network update rate.
alpha (float) – Temperature parameter. The default is 0.2.
auto_alpha (bool) – Automatic alpha tuning.
init_temperature (float) – Initial temperature
n_warmup (int) – Number of warmup steps before training. The default is int(1e4).
memory_capacity (int) – Replay Buffer size. The default is int(1e6).
batch_size (int) – Batch size. The default is 256.
discount (float) – Discount factor. The default is 0.99.
max_grad (float) – Maximum gradient. The default is 10.
gpu (int) – GPU id. -1 disables GPU. The default is 0.
- get_action(state, test=False)
Get action
- Parameters
state – Observation state
test (bool) – When False (default), policy returns an exploratory action.
- Returns
Selected action
- Return type
tf.Tensor or float
- train(states, actions, next_states, rewards, dones, weights=None)
Train SAC
- Parameters
states –
actions –
next_states –
rewards –
dones –
weights (optional) – Weights for importance sampling
- compute_td_error(states, actions, next_states, rewards, dones)
Compute TD error
- Parameters
states –
actions –
next_states –
rewards –
dones –
- Returns
np.ndarray: TD error
- static get_argument(parser=None)
Create or update argument parser for command line program
- Parameters
parser (argparse.ArgumentParser, optional) – argument parser
- Returns
argument parser
- Return type
argparse.ArgumentParser
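A sketch of constructing the agent and querying actions; the test flag switches between the stochastic policy and its deterministic evaluation action (Pendulum-v0 is an arbitrary choice):

    import gym
    from tf2rl.algos.sac import SAC

    env = gym.make("Pendulum-v0")
    policy = SAC(
        state_shape=env.observation_space.shape,
        action_dim=env.action_space.high.size,
        max_action=env.action_space.high[0],
        auto_alpha=True,  # tune the temperature automatically
        gpu=-1)

    obs = env.reset()
    explore_act = policy.get_action(obs)          # sampled (exploratory) action
    eval_act = policy.get_action(obs, test=True)  # deterministic action for evaluation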
tf2rl.algos.sac_ae module
- class tf2rl.algos.sac_ae.SACAE(*args, **kwargs)
Bases:
tf2rl.algos.sac.SAC
SAC+AE Agent: https://arxiv.org/abs/1910.01741
Command Line Args:
--n-warmup (int): Number of warmup steps before training. The default is 1e4.
--batch-size (int): Batch size of training. The default is 32.
--gpu (int): GPU id. -1 disables GPU. The default is 0.
--memory-capacity (int): Replay Buffer size. The default is 1e5.
--alpha (float): Temperature parameter. The default is 0.2.
--auto-alpha: Automatic alpha tuning.
--stop-q-grad: Whether to stop gradients after the convolutional layers of the Encoder
- __init__(action_dim, obs_shape=(84, 84, 9), n_conv_layers=4, n_conv_filters=32, feature_dim=50, tau_encoder=0.05, tau_critic=0.01, auto_alpha=True, lr_sac=0.001, lr_encoder=0.001, lr_decoder=0.001, update_critic_target_freq=2, update_actor_freq=2, lr_alpha=0.0001, init_temperature=0.1, stop_q_grad=False, lambda_latent_val=1e-06, decoder_weight_lambda=1e-07, skip_making_decoder=False, name='SACAE', **kwargs)
Initialize SAC+AE
- Parameters
action_dim (int) –
obs_shape (iterable of int) – The default is (84, 84, 9).
n_conv_layers (int) – Number of convolutional layers at encoder. The default is 4.
n_conv_filters (int) – Number of filters in convolutional layers. The default is 32.
feature_dim (int) – Number of features after encoder. These features are treated as the SAC input. The default is 50.
tau_encoder (float) – Target network update rate for Encoder. The default is 0.05.
tau_critic (float) – Target network update rate for Critic. The default is 0.01.
auto_alpha (bool) – Automatic alpha tuning. The default is True.
lr_sac (float) – Learning rate for SAC. The default is 1e-3.
lr_encoder (float) – Learning rate for Encoder. The default is 1e-3.
lr_decoder (float) – Learning rate for Decoder. The default is 1e-3.
update_critic_target_freq (int) – The default is 2.
update_actor_freq (int) – The default is 2.
lr_alpha (float) – Learning rate for alpha. The default is 1e-4.
init_temperature (float) – Initial temperature. The default is 0.1.
stop_q_grad (bool) – Whether to stop gradient propagation after the encoder convolutional network. The default is False.
lambda_latent_val (float) – AE loss = REC loss + lambda_latent_val * latent loss. The default is 1e-6.
decoder_weight_lambda (float) – Weight decay of AdamW for Decoder. The default is 1e-7.
skip_making_decoder (bool) – Whether to skip making Decoder. The default is False.
name (str) – Name of network. The default is "SACAE".
max_action (float) –
actor_units (iterable of int) – Numbers of units at hidden layers of actor. The default is (256, 256).
critic_units (iterable of int) – Numbers of units at hidden layers of critic. The default is (256, 256).
alpha (float) – Temperature parameter. The default is 0.2.
n_warmup (int) – Number of warmup steps before training. The default is int(1e4).
memory_capacity (int) – Replay Buffer size. The default is int(1e6).
batch_size (int) – Batch size. The default is 256.
discount (float) – Discount factor. The default is 0.99.
max_grad (float) – Maximum gradient. The default is 10.
gpu (int) – GPU id. -1 disables GPU. The default is 0.
- get_action(state, test=False)
Get action
- Parameters
state – Observation state
test (bool) – When False (default), policy returns an exploratory action.
- Returns
Selected action
- Return type
tf.Tensor or float
Notes
When the input image has a different size, a cropped image is used.
- train(states, actions, next_states, rewards, dones, weights=None)
Train SAC+AE
- Parameters
states –
actions –
next_states –
rewards –
dones –
weights (optional) – Weights for importance sampling
- static get_argument(parser=None)
Create or update argument parser for command line program
- Parameters
parser (argparse.ArgumentParser, optional) – argument parser
- Returns
argument parser
- Return type
argparse.ArgumentParser
- compute_td_error(states, actions, next_states, rewards, dones)
Compute TD error
- Parameters
states –
actions –
next_states –
rewards –
dones –
- Returns
np.ndarray: TD error
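SAC+AE works on pixel observations. A constructor sketch assuming stacked frames as in the defaults above; the action dimension is a hypothetical placeholder for whatever task is being learned:

    from tf2rl.algos.sac_ae import SACAE

    # Three stacked 84x84 RGB frames -> obs_shape (84, 84, 9).
    policy = SACAE(
        action_dim=6,          # hypothetical action dimension
        obs_shape=(84, 84, 9),
        feature_dim=50,        # encoder output fed into SAC
        stop_q_grad=False,     # let critic gradients reach the conv encoder
        gpu=-1)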
tf2rl.algos.sac_discrete module
- class tf2rl.algos.sac_discrete.CriticQ(*args, **kwargs)
Bases:
tensorflow.python.keras.engine.training.Model
- Compared with the original (continuous) version of SAC, the output of the Q-function moves from Q: S x A -> R to Q: S -> R^|A|.
- __init__(state_shape, action_dim, critic_units=(256, 256), name='qf')
- call(states)
Calls the model on new inputs.
In this case call just reapplies all ops in the graph to the new inputs (e.g. build a new computational graph from the provided inputs).
- Parameters
inputs – A tensor or list of tensors.
training – Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode.
mask – A mask or list of masks. A mask can be either a tensor or None (no mask).
- Returns
A tensor if there is a single output, or a list of tensors if there is more than one output.
- class tf2rl.algos.sac_discrete.SACDiscrete(*args, **kwargs)
Bases:
tf2rl.algos.sac.SAC
- __init__(state_shape, action_dim, *args, actor_fn=None, critic_fn=None, target_update_interval=None, **kwargs)
Initialize SAC
- Parameters
state_shape (iterable of int) –
action_dim (int) –
name (str) – Name of network. The default is "SAC".
max_action (float) –
lr (float) – Learning rate. The default is 3e-4.
lr_alpha (float) – Learning rate for alpha. The default is 3e-4.
actor_units (iterable of int) – Numbers of units at hidden layers of actor. The default is (256, 256).
critic_units (iterable of int) – Numbers of units at hidden layers of critic. The default is (256, 256).
tau (float) – Target network update rate.
alpha (float) – Temperature parameter. The default is 0.2.
auto_alpha (bool) – Automatic alpha tuning.
init_temperature (float) – Initial temperature
n_warmup (int) – Number of warmup steps before training. The default is int(1e4).
memory_capacity (int) – Replay Buffer size. The default is int(1e6).
batch_size (int) – Batch size. The default is 256.
discount (float) – Discount factor. The default is 0.99.
max_grad (float) – Maximum gradient. The default is 10.
gpu (int) – GPU id. -1 disables GPU. The default is 0.
- train(states, actions, next_states, rewards, dones, weights=None)
Train SAC
- Parameters
states –
actions –
next_states –
rewards –
dones –
weights (optional) – Weights for importance sampling
- compute_td_error(states, actions, next_states, rewards, dones)
Compute TD error
- Parameters
states –
actions –
next_states –
rewards –
dones –
- Returns
np.ndarray: TD error
- static get_argument(parser=None)
Create or update argument parser for command line program
- Parameters
parser (argparse.ArgumentParser, optional) – argument parser
- Returns
argument parser
- Return type
argparse.ArgumentParser
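Because the Q-function maps S -> R^|A|, the agent takes the number of discrete actions as action_dim, as in this sketch (CartPole-v0 is an arbitrary choice):

    import gym
    from tf2rl.algos.sac_discrete import SACDiscrete

    env = gym.make("CartPole-v0")
    policy = SACDiscrete(
        state_shape=env.observation_space.shape,
        action_dim=env.action_space.n,  # number of discrete actions
        gpu=-1)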
tf2rl.algos.td3 module
- class tf2rl.algos.td3.Critic(*args, **kwargs)
Bases:
tensorflow.python.keras.engine.training.Model
- __init__(state_shape, action_dim, units=(400, 300), name='Critic')
- call(states, actions)
Calls the model on new inputs.
In this case call just reapplies all ops in the graph to the new inputs (e.g. build a new computational graph from the provided inputs).
- Parameters
inputs – A tensor or list of tensors.
training – Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode.
mask – A mask or list of masks. A mask can be either a tensor or None (no mask).
- Returns
A tensor if there is a single output, or a list of tensors if there is more than one output.
- class tf2rl.algos.td3.TD3(*args, **kwargs)
Bases:
tf2rl.algos.ddpg.DDPG
Twin Delayed Deep Deterministic policy gradient (TD3) Agent: https://arxiv.org/abs/1802.09477
Command Line Args:
--n-warmup (int): Number of warmup steps before training. The default is 1e4.
--batch-size (int): Batch size for training. The default is 32.
--gpu (int): GPU id. -1 disables GPU. The default is 0.
--memory-capacity (int): Replay Buffer size. The default is 1e6.
- __init__(state_shape, action_dim, name='TD3', actor_update_freq=2, policy_noise=0.2, noise_clip=0.5, critic_units=(400, 300), **kwargs)
Initialize TD3
- Parameters
state_shape (iterable of int) – Observation state shape
action_dim (int) – Action dimension
name (str) – Network name. The default is "TD3".
actor_update_freq (int) – Number of critic updates per actor update.
policy_noise (float) –
noise_clip (float) –
critic_units (iterable of int) – Numbers of units at hidden layers of critic. The default is (400, 300).
max_action (float) – Size of maximum action (-max_action <= action <= max_action). The default is 1.
lr_actor (float) – Learning rate for actor network. The default is 0.001.
lr_critic (float) – Learning rate for critic network. The default is 0.001.
actor_units (iterable of int) – Number of units at hidden layers of actor.
sigma (float) – Standard deviation of Gaussian noise. The default is 0.1.
tau (float) – Weight update ratio for target network: target = (1-tau)*target + tau*network. The default is 0.005.
n_warmup (int) – Number of warmup steps before training. The default is 1e4.
memory_capacity (int) – Replay Buffer size. The default is 1e6.
batch_size (int) – Batch size. The default is 256.
discount (float) – Discount factor. The default is 0.99.
max_grad (float) – Maximum gradient. The default is 10.
gpu (int) – GPU id. -1 disables GPU. The default is 0.
- compute_td_error(states, actions, next_states, rewards, dones)
Compute TD error
- Parameters
states –
actions –
next_states –
rewards –
dones –
- Returns
Sum of two TD errors.
- Return type
np.ndarray
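A constructor sketch highlighting the TD3-specific knobs, delayed actor updates and clipped target-policy smoothing noise (Pendulum-v0 is an arbitrary choice):

    import gym
    from tf2rl.algos.td3 import TD3

    env = gym.make("Pendulum-v0")
    policy = TD3(
        state_shape=env.observation_space.shape,
        action_dim=env.action_space.high.size,
        max_action=env.action_space.high[0],
        actor_update_freq=2,  # two critic updates per actor update
        policy_noise=0.2,     # std of target-policy smoothing noise
        noise_clip=0.5,       # noise clipped to [-0.5, 0.5]
        gpu=-1)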
tf2rl.algos.vail module
- class tf2rl.algos.vail.Discriminator(*args, **kwargs)
Bases:
tensorflow.python.keras.engine.training.Model
- LOG_SIG_CAP_MAX = 2
- LOG_SIG_CAP_MIN = -20
- EPS = 1e-06
- __init__(state_shape, action_dim, units=(32, 32), n_latent_unit=32, enable_sn=False, name='Discriminator')
- call(inputs)
Calls the model on new inputs.
In this case call just reapplies all ops in the graph to the new inputs (e.g. build a new computational graph from the provided inputs).
- Parameters
inputs – A tensor or list of tensors.
training – Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode.
mask – A mask or list of masks. A mask can be either a tensor or None (no mask).
- Returns
A tensor if there is a single output, or a list of tensors if there is more than one output.
- compute_reward(inputs)
- class tf2rl.algos.vail.VAIL(*args, **kwargs)
Bases:
tf2rl.algos.gail.GAIL
Variational Adversarial Imitation Learning (VAIL) Agent: https://arxiv.org/abs/1810.00821
Command Line Args:
--n-warmup (int): Number of warmup steps before training. The default is 1e4.
--batch-size (int): Batch size of training. The default is 32.
--gpu (int): GPU id. -1 disables GPU. The default is 0.
--memory-capacity (int): Replay Buffer size. The default is 1e4.
--enable-sn: Enable Spectral Normalization
- __init__(state_shape, action_dim, units=(32, 32), n_latent_unit=32, lr=5e-05, kl_target=0.5, reg_param=0.0, enable_sn=False, enable_gp=False, name='VAIL', **kwargs)
Initialize VAIL
- Parameters
state_shape (iterable of int) –
action_dim (int) –
units (iterable of int) – The default is (32, 32).
n_latent_unit (int) – Number of latent units. The default is 32.
lr (float) – Learning rate. The default is 5e-5.
kl_target (float) – The default is 0.5.
reg_param (float) – The default is 0.
enable_sn (bool) – Whether to enable Spectral Normalization. The default is False.
enable_gp (bool) – Whether the loss function includes a gradient penalty. The default is False.
name (str) – The default is "VAIL".
- train(agent_states, agent_acts, expert_states, expert_acts, **kwargs)
Train VAIL
- Parameters
agent_states –
agent_acts –
expert_states –
expert_acts –
tf2rl.algos.vpg module
- class tf2rl.algos.vpg.CriticV(*args, **kwargs)
Bases:
tensorflow.python.keras.engine.training.Model
- __init__(state_shape, units, name='critic_v', hidden_activation='relu')
- call(inputs)
Calls the model on new inputs.
In this case call just reapplies all ops in the graph to the new inputs (e.g. build a new computational graph from the provided inputs).
- Parameters
inputs – A tensor or list of tensors.
training – Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode.
mask – A mask or list of masks. A mask can be either a tensor or None (no mask).
- Returns
A tensor if there is a single output, or a list of tensors if there is more than one output.
- class tf2rl.algos.vpg.VPG(*args, **kwargs)
Bases:
tf2rl.algos.policy_base.OnPolicyAgent
VPG Agent: https://proceedings.neurips.cc/paper/1999/file/464d828b85b0bed98e80ade0a5c43b0f-Paper.pdf
Command Line Args:
--batch-size (int): Batch size of training. The default is 32.
--gpu (int): GPU id. -1 disables GPU. The default is 0.
--horizon (int): The default is 2048.
--normalize_adv: Normalize Advantage.
--enable-gae: Enable GAE.
- __init__(state_shape, action_dim, is_discrete, actor=None, critic=None, actor_critic=None, max_action=1.0, actor_units=(256, 256), critic_units=(256, 256), lr_actor=0.001, lr_critic=0.003, hidden_activation_actor='relu', hidden_activation_critic='relu', name='VPG', **kwargs)
Initialize VPG
- Parameters
state_shape (iterable of int) –
action_dim (int) –
is_discrete (bool) –
actor –
critic –
actor_critic –
max_action (float) – Maximum action size.
actor_units (iterable of int) – Numbers of units at hidden layers of actor. The default is (256, 256).
critic_units (iterable of int) – Numbers of units at hidden layers of critic. The default is (256, 256).
lr_actor (float) – Learning rate of actor. The default is 1e-3.
lr_critic (float) – Learning rate of critic. The default is 3e-3.
hidden_activation_actor (str) – Activation for actor. The default is "relu".
hidden_activation_critic (str) – Activation for critic. The default is "relu".
name (str) – Name of agent. The default is "VPG".
horizon (int) – Number of steps of online episode horizon. The horizon must be a multiple of batch_size. The default is 2048.
enable_gae (bool) – Enable GAE. The default is True.
normalize_adv (bool) – Normalize Advantage. The default is True.
entropy_coef (float) – Entropy coefficient. The default is 0.01.
vfunc_coef (float) – Mixing ratio factor for actor and critic: actor_loss + vfunc_coef*critic_loss.
batch_size (int) – Batch size. The default is 256.
- get_action(state, test=False)
Get action and probability
- Parameters
state – Observation state
test (bool) – When False (default), policy returns an exploratory action.
- Returns
np.ndarray or float: Selected action
np.ndarray or float: Log(p)
- Return type
np.ndarray or float
- get_action_and_val(state, test=False)
Get action, probability, and critic value
- Parameters
state – Observation state
test (bool) – When False (default), policy returns an exploratory action.
- Returns
np.ndarray: Selected action
np.ndarray: Log(p)
np.ndarray: Critic value
- Return type
np.ndarray
- train(states, actions, advantages, logp_olds, returns)
Train VPG
- Parameters
states –
actions –
advantages –
logp_olds –
returns –
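Finally, a constructor-and-query sketch for this on-policy agent; per the docstrings above, get_action returns both the sampled action and its log-probability (Pendulum-v0 is an arbitrary choice):

    import gym
    from tf2rl.algos.vpg import VPG

    env = gym.make("Pendulum-v0")
    policy = VPG(
        state_shape=env.observation_space.shape,
        action_dim=env.action_space.high.size,
        is_discrete=False,
        max_action=env.action_space.high[0],
        horizon=512,
        batch_size=64,  # horizon must be a multiple of batch_size
        gpu=-1)

    obs = env.reset()
    action, logp = policy.get_action(obs)  # sampled action and Log(p)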