tf2rl.algos package

Submodules

tf2rl.algos.apex module

tf2rl.algos.apex.import_tf()
tf2rl.algos.apex.explorer(global_rb, queue, trained_steps, is_training_done, lock, env_fn, policy_fn, set_weights_fn, noise_level, n_env=64, n_thread=4, buffer_size=1024, episode_max_steps=1000, gpu=0)

Collect transitions and store them to prioritized replay buffer.

Parameters
  • global_rb – multiprocessing.managers.AutoProxy[PrioritizedReplayBuffer] Prioritized replay buffer shared among multiple explorers and a single learner. Because this object is shared across processes, operations on it must be guarded with the lock object.

  • queue – multiprocessing.Queue A FIFO queue shared with the learner and evaluator to receive the latest network weights. It is process safe, so no explicit locking is needed.

  • trained_steps – multiprocessing.Value Shared counter of the number of gradient steps applied so far.

  • is_training_done – multiprocessing.Event Event object used to share the training status across processes.

  • lock – multiprocessing.Lock Lock used to guard operations on objects shared with other processes, such as global_rb.

  • env_fn – function Function that generates an environment.

  • policy_fn – function Function that generates an explorer policy.

  • set_weights_fn – function Function that sets network weights received from the queue.

  • noise_level – float Noise level for exploration. For epsilon-greedy policies such as DQN variants this is epsilon; for DDPG variants it is the variance of the Gaussian noise.

  • n_env – int Number of environments to run in parallel. If greater than 1, MultiThreadEnv is used.

  • n_thread – int Number of threads used in MultiThreadEnv.

  • buffer_size – int Size of the local buffer. When it is filled with transitions, they are flushed to global_rb.

  • episode_max_steps – int Maximum number of steps of an episode.

  • gpu – int GPU id. If this is set to -1, then this process uses only CPU.

Returns

None

tf2rl.algos.apex.learner(global_rb, trained_steps, is_training_done, lock, env, policy_fn, get_weights_fn, n_training, update_freq, evaluation_freq, gpu, queues)

Update network weights using samples collected by explorers.

Parameters
  • global_rb – multiprocessing.managers.AutoProxy[PrioritizedReplayBuffer] Prioritized replay buffer shared among multiple explorers and a single learner. Because this object is shared across processes, operations on it must be guarded with the lock object.

  • trained_steps – multiprocessing.Value Shared counter of the number of gradient steps applied so far.

  • is_training_done – multiprocessing.Event Event object used to share the training status across processes.

  • lock – multiprocessing.Lock Lock used to guard operations on objects shared with other processes.

  • env – OpenAI Gym compatible environment object.

  • policy_fn – function Function that generates an explorer policy.

  • get_weights_fn – function Function that gets the current network weights and puts them into the queues.

  • n_training – int Maximum number of gradient updates. Once this number is reached, training finishes by setting is_training_done to True.

  • update_freq – int Frequency at which network parameters are put into the queues.

  • evaluation_freq – int Frequency at which the evaluator is called.

  • gpu – int GPU id. If this is set to -1, then this process uses only CPU.

  • queues – List List of queues shared with explorers, used to send the latest network parameters.

Returns

None

tf2rl.algos.apex.evaluator(is_training_done, env, policy_fn, set_weights_fn, queue, gpu, save_model_interval=1000000, n_evaluation=10, episode_max_steps=1000, show_test_progress=False)

Evaluate trained network weights periodically.

Parameters
  • is_training_done – multiprocessing.Event Event object used to share the training status across processes.

  • env – OpenAI Gym compatible environment object.

  • policy_fn – function Function that generates an explorer policy.

  • set_weights_fn – function Function that sets network weights received from the queue.

  • queue – multiprocessing.Queue A FIFO queue shared with the learner to receive the latest network weights. It is process safe, so no explicit locking is needed.

  • gpu – int GPU id. If this is set to -1, then this process uses only CPU.

  • save_model_interval – int Interval at which the model is saved.

  • n_evaluation – int Number of episodes to evaluate.

  • episode_max_steps – int Maximum number of steps of an episode.

  • show_test_progress – bool If True, render is called to visualize the evaluation process.

tf2rl.algos.apex.apex_argument(parser=None)
tf2rl.algos.apex.prepare_experiment(env, args)
tf2rl.algos.apex.run(args, env_fn, policy_fn, get_weights_fn, set_weights_fn)
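
The following sketch shows how these helpers can be wired together for a DDPG policy on an OpenAI Gym environment. It is modeled on the Ape-X examples bundled with tf2rl; the keyword arguments of policy_fn, the DDPG attribute names (actor, critic, critic_target), and the exact weight layout exchanged between get_weights_fn and set_weights_fn are assumptions and must match what explorer, learner, and evaluator actually pass around.

  import gym

  from tf2rl.algos.apex import apex_argument, run
  from tf2rl.algos.ddpg import DDPG

  # Parse Ape-X specific command line arguments.
  parser = apex_argument()
  args = parser.parse_args()

  def env_fn():
      # Each explorer process creates its own environment instance.
      return gym.make("Pendulum-v0")

  def policy_fn(env, name, memory_capacity=int(1e6), gpu=-1, noise_level=0.3):
      # One DDPG instance per process; explorers usually run on CPU (gpu=-1).
      return DDPG(
          state_shape=env.observation_space.shape,
          action_dim=env.action_space.high.size,
          max_action=env.action_space.high[0],
          sigma=noise_level,
          memory_capacity=memory_capacity,
          name=name,
          gpu=gpu)

  def get_weights_fn(policy):
      # Learner side: collect the current network weights to broadcast.
      return [policy.actor.weights,
              policy.critic.weights,
              policy.critic_target.weights]

  def set_weights_fn(policy, weights):
      # Explorer/evaluator side: load the weights received from the queue.
      actor_weights, critic_weights, critic_target_weights = weights
      policy.actor.set_weights(actor_weights)
      policy.critic.set_weights(critic_weights)
      policy.critic_target.set_weights(critic_target_weights)

  run(args, env_fn, policy_fn, get_weights_fn, set_weights_fn)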

tf2rl.algos.bi_res_ddpg module

class tf2rl.algos.bi_res_ddpg.BiResDDPG(*args, **kwargs)

Bases: tf2rl.algos.ddpg.DDPG

Bi-Res-DDPG Agent: https://arxiv.org/abs/1905.01072

Command Line Args:

  • --n-warmup (int): Number of warmup steps before training. The default is 1e4.

  • --batch-size (int): Batch size for training. The default is 32.

  • --gpu (int): GPU id. -1 disables GPU. The default is 0.

  • --memory-capacity (int): Replay Buffer size. The default is 1e6.

  • --eta (float): Gradient mixing factor. The default is 0.05.

__init__(eta=0.05, name='BiResDDPG', **kwargs)

Initialize BiResDDPG agent

Parameters
  • eta (float) – Gradient mixing factor.

  • name (str) – Name of agent. The default is "BiResDDPG".

  • state_shape (iterable of int) –

  • action_dim (int) –

  • max_action (float) – Size of maximum action. (-max_action <= action <= max_action). The default is 1.

  • lr_actor (float) – Learning rate for actor network. The default is 0.001.

  • lr_critic (float) – Learning rate for critic network. The default is 0.001.

  • actor_units (iterable of int) – Number of units at hidden layers of actor.

  • critic_units (iterable of int) – Number of units at hidden layers of critic.

  • sigma (float) – Standard deviation of Gaussian noise. The default is 0.1.

  • tau (float) – Weight update ratio for target network. target = (1-tau)*target + tau*network The default is 0.005.

  • n_warmup (int) – Number of warmup steps before training. The default is 1e4.

  • memory_capacity (int) – Replay Buffer size. The default is 1e4.

  • batch_size (int) – Batch size. The default is 256.

  • discount (float) – Discount factor. The default is 0.99.

  • max_grad (float) – Maximum gradient. The default is 10.

  • gpu (int) – GPU id. -1 disables GPU. The default is 0.

compute_td_error(states, actions, next_states, rewards, dones)

Compute TD error

Parameters
  • states

  • actions

  • next_states

  • rewards

  • dones

Returns

Sum of two TD errors.

Return type

np.ndarray

static get_argument(parser=None)

Create or update argument parser for command line program

Parameters

parser (argparse.ArgumentParser, optional) – argument parser

Returns

argument parser

Return type

argparse.ArgumentParser
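
A minimal construction sketch, assuming a continuous-control OpenAI Gym environment (Pendulum-v0 is only illustrative):

  import gym

  from tf2rl.algos.bi_res_ddpg import BiResDDPG

  env = gym.make("Pendulum-v0")
  agent = BiResDDPG(
      state_shape=env.observation_space.shape,
      action_dim=env.action_space.high.size,
      max_action=env.action_space.high[0],
      eta=0.05,   # gradient mixing factor (documented default)
      gpu=-1)     # CPU only

  obs = env.reset()
  action = agent.get_action(obs)   # exploratory action for a single observation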

tf2rl.algos.categorical_dqn module

class tf2rl.algos.categorical_dqn.QFunc(*args, **kwargs)

Bases: tensorflow.python.keras.engine.training.Model

__init__(state_shape, action_dim, units=(32, 32), name='CategoricalQFunc', enable_dueling_dqn=False, enable_noisy_dqn=False, n_atoms=51)
call(inputs)

Calls the model on new inputs.

In this case call just reapplies all ops in the graph to the new inputs (e.g. build a new computational graph from the provided inputs).

Parameters
  • inputs – A tensor or list of tensors.

  • training – Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode.

  • mask – A mask or list of masks. A mask can be either a tensor or None (no mask).

Returns

A tensor if there is a single output, or a list of tensors if there is more than one output.

property n_atoms
class tf2rl.algos.categorical_dqn.CategoricalDQN(*args, **kwargs)

Bases: tf2rl.algos.policy_base.OffPolicyAgent

Categorical DQN Agent: https://arxiv.org/abs/1707.06887

Categorical DQN supports the following extensions, enabled via the flags below: Double DQN, Dueling Network, and Noisy Network.

Command Line Args:

  • --n-warmup (int): Number of warmup steps before training. The default is 1e4.

  • --batch-size (int): Batch size of training. The default is 32.

  • --gpu (int): GPU id. -1 disables GPU. The default is 0.

  • --memory-capacity (int): Replay Buffer size. The default is 1e6.

  • --enable-double-dqn: Enable DDQN

  • --enable-dueling-dqn: Enable Dueling Network

  • --enable-noisy-dqn: Enable Noisy Network

__init__(state_shape, action_dim, q_func=None, name='DQN', lr=0.001, adam_eps=1e-07, units=(32, 32), epsilon=0.1, epsilon_min=None, epsilon_decay_step=1000000, n_warmup=10000, target_replace_interval=5000, memory_capacity=1000000, enable_double_dqn=False, enable_dueling_dqn=False, enable_noisy_dqn=False, **kwargs)

Initialize Categorical DQN

Parameters
  • state_shape (iterable of int) – Observation space shape

  • action_dim (int) – Dimension of discrete action

  • q_func (QFunc) – Custom Q function class. If None (default), Q function is constructed with QFunc.

  • name (str) – Name of agent. The default is "DQN"

  • lr (float) – Learning rate. The default is 0.001.

  • adam_eps (float) – Epsilon for Adam. The default is 1e-7

  • units (iterable of int) – Units of hidden layers. The default is (32, 32)

  • epsilon (float) – Initial epsilon of epsilon-greedy. The default is 0.1

  • epsilon_min (float) – Minimum epsilon after decay.

  • epsilon_decay_step (int) – Number of steps over which epsilon decays. The default is 1e6

  • n_warmup (int) – Number of warmup steps before training. The default is 1e4

  • target_replace_interval (int) – Number of steps between target network updates. The default is 5e3

  • memory_capacity (int) – Size of replay buffer. The default is 1e6

  • enable_double_dqn (bool) – Whether to use Double DQN. The default is False

  • enable_dueling_dqn (bool) – Whether to use a Dueling network. The default is False

  • enable_noisy_dqn (bool) – Whether to use a noisy network. The default is False

  • optimizer (tf.keras.optimizers.Optimizer) – Custom optimizer

  • batch_size (int) – Batch size. The default is 256.

  • discount (float) – Discount factor. The default is 0.99.

  • max_grad (float) – Maximum gradient. The default is 10.

  • gpu (int) – GPU id. -1 disables GPU. The default is 0.

get_action(state, test=False, tensor=False)

Get action

Parameters
  • state – Observation state

  • test (bool) – When False (default), policy returns exploratory action.

  • tensor (bool) – When True, return type is tf.Tensor

Returns

Selected action

Return type

tf.Tensor or np.ndarray or float

train(states, actions, next_states, rewards, done, weights=None)

Train DQN

Parameters
  • states

  • actions

  • next_states

  • rewards

  • done

  • weights (optional) – Weights for importance sampling

compute_td_error(states, actions, next_states, rewards, dones)

Compute TD error

Parameters
  • states

  • actions

  • next_states

  • rewards

  • dones

Returns

tf.Tensor: TD error

static get_argument(parser=None)

Create or update argument parser for command line program

Parameters

parser (argparse.ArgumentParser, optional) – argument parser

Returns

argument parser

Return type

argparse.ArgumentParser
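
A construction sketch for a discrete-action OpenAI Gym environment (CartPole-v0 is only illustrative). In practice, training is usually driven by tf2rl's Trainer class, as shown below under tf2rl.algos.dqn.

  import gym

  from tf2rl.algos.categorical_dqn import CategoricalDQN

  env = gym.make("CartPole-v0")
  agent = CategoricalDQN(
      state_shape=env.observation_space.shape,
      action_dim=env.action_space.n,
      enable_double_dqn=True,    # optional extension
      gpu=-1)

  obs = env.reset()
  action = agent.get_action(obs)   # epsilon-greedy action index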

tf2rl.algos.curl_sac module

class tf2rl.algos.curl_sac.CURL(*args, **kwargs)

Bases: tf2rl.algos.sac_ae.SACAE

Contrastive Unsupervised Representations for Reinforcement Learning (CURL) Agent: https://arxiv.org/abs/2004.04136

Command Line Args:

  • --n-warmup (int): Number of warmup steps before training. The default is 1e4.

  • --batch-size (int): Batch size of training. The default is 32.

  • --gpu (int): GPU id. -1 disables GPU. The default is 0.

  • --memory-capacity (int): Replay Buffer size. The default is 1e5.

  • --alpha (float): Temperature parameter. The default is 0.2.

  • --auto-alpha: Automatic alpha tuning.

  • --stop-q-grad: Whether to stop gradients after the convolutional layers of the Encoder

__init__(*args, **kwargs)

Initialize CURL

Parameters
  • action_dim (int) –

  • obs_shape (iterable of int) – The default is (84, 84, 9)

  • n_conv_layers (int) – Number of convolutional layers at encoder. The default is 4

  • n_conv_filters (int) – Number of filters in convolutional layers. The default is 32

  • feature_dim (int) – Number of features after the encoder. These features are treated as the SAC input. The default is 50

  • tau_encoder (float) – Target network update rate for Encoder. The default is 0.05

  • tau_critic (float) – Target network update rate for Critic. The default is 0.01

  • auto_alpha (bool) – Automatic alpha tuning. The default is True

  • lr_sac (float) – Learning rate for SAC. The default is 1e-3

  • lr_encoder (float) – Learning rate for Encoder. The default is 1e-3

  • lr_decoder (float) – Learning rate for Decoder. The default is 1e-3

  • update_critic_target_freq (int) – The default is 2

  • update_actor_freq (int) – The default is 2

  • lr_alpha (float) – Learning rate for alpha. The default is 1e-4.

  • init_temperature (float) – Initial temperature. The default is 0.1

  • stop_q_grad (bool) – Whether to stop gradient propagation after the encoder's convolutional network. The default is False

  • lambda_latent_val (float) – AE loss = REC loss + lambda_latent_val * latent loss. The default is 1e-6

  • decoder_weight_lambda (float) – Weight decay of AdamW for Decoder. The default is 1e-7

  • name (str) – Name of network. The default is "CURL"

  • max_action (float) –

  • actor_units (iterable of int) – Numbers of units at hidden layers of actor. The default is (256, 256).

  • critic_units (iterable of int) – Numbers of units at hidden layers of critic. The default is (256, 256).

  • alpha (float) – Temperature parameter. The default is 0.2.

  • n_warmup (int) – Number of warmup steps before training. The default is int(1e4).

  • memory_capacity (int) – Replay Buffer size. The default is int(1e6).

  • batch_size (int) – Batch size. The default is 256.

  • discount (float) – Discount factor. The default is 0.99.

  • max_grad (float) – Maximum gradient. The default is 10.

  • gpu (int) – GPU id. -1 disables GPU. The default is 0.

train(states, actions, next_states, rewards, dones, weights=None)

Train CURL

Parameters
  • states

  • actions

  • next_states

  • rewards

  • dones

  • weights (optional) – Weights for importance sampling
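
A construction sketch with dummy image input. The observation shape, action dimension, and dtype below are illustrative assumptions; CURL expects image observations such as stacked frames from pixel-based control tasks.

  import numpy as np

  from tf2rl.algos.curl_sac import CURL

  # (84, 84, 9) corresponds to a stack of three 84x84 RGB frames
  # (the documented default obs_shape).
  agent = CURL(action_dim=6, obs_shape=(84, 84, 9), max_action=1., gpu=-1)

  dummy_obs = np.zeros((84, 84, 9), dtype=np.uint8)   # illustrative observation
  action = agent.get_action(dummy_obs)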

tf2rl.algos.d2rl_sac module

class tf2rl.algos.d2rl_sac.DenseCriticQ(*args, **kwargs)

Bases: tf2rl.algos.sac.CriticQ

call(states, actions)

Calls the model on new inputs.

In this case call just reapplies all ops in the graph to the new inputs (e.g. build a new computational graph from the provided inputs).

Parameters
  • inputs – A tensor or list of tensors.

  • training – Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode.

  • mask – A mask or list of masks. A mask can be either a tensor or None (no mask).

Returns

A tensor if there is a single output, or a list of tensors if there is more than one output.

class tf2rl.algos.d2rl_sac.DenseGaussianActor(*args, **kwargs)

Bases: tf2rl.policies.tfp_gaussian_actor.GaussianActor

class tf2rl.algos.d2rl_sac.D2RLSAC(*args, **kwargs)

Bases: tf2rl.algos.sac.SAC

__init__(*args, **kwargs)

Initialize SAC

Parameters
  • state_shape (iterable of int) –

  • action_dim (int) –

  • name (str) – Name of network. The default is "SAC"

  • max_action (float) –

  • lr (float) – Learning rate. The default is 3e-4.

  • lr_alpha (float) – Learning rate for alpha. The default is 3e-4.

  • actor_units (iterable of int) – Numbers of units at hidden layers of actor. The default is (256, 256).

  • critic_units (iterable of int) – Numbers of units at hidden layers of critic. The default is (256, 256).

  • tau (float) – Target network update rate.

  • alpha (float) – Temperature parameter. The default is 0.2.

  • auto_alpha (bool) – Automatic alpha tuning.

  • init_temperature (float) – Initial temperature

  • n_warmup (int) – Number of warmup steps before training. The default is int(1e4).

  • memory_capacity (int) – Replay Buffer size. The default is int(1e6).

  • batch_size (int) – Batch size. The default is 256.

  • discount (float) – Discount factor. The default is 0.99.

  • max_grad (float) – Maximum gradient. The default is 10.

  • gpu (int) – GPU id. -1 disables GPU. The default is 0.

tf2rl.algos.ddpg module

class tf2rl.algos.ddpg.Actor(*args, **kwargs)

Bases: tensorflow.python.keras.engine.training.Model

__init__(state_shape, action_dim, max_action, units=(400, 300), name='Actor')
call(inputs)

Calls the model on new inputs.

In this case call just reapplies all ops in the graph to the new inputs (e.g. build a new computational graph from the provided inputs).

Parameters
  • inputs – A tensor or list of tensors.

  • training – Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode.

  • mask – A mask or list of masks. A mask can be either a tensor or None (no mask).

Returns

A tensor if there is a single output, or a list of tensors if there is more than one output.

class tf2rl.algos.ddpg.Critic(*args, **kwargs)

Bases: tensorflow.python.keras.engine.training.Model

__init__(state_shape, action_dim, units=(400, 300), name='Critic')
call(states, actions)

Calls the model on new inputs.

In this case call just reapplies all ops in the graph to the new inputs (e.g. build a new computational graph from the provided inputs).

Parameters
  • inputs – A tensor or list of tensors.

  • training – Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode.

  • mask – A mask or list of masks. A mask can be either a tensor or None (no mask).

Returns

A tensor if there is a single output, or a list of tensors if there is more than one output.

class tf2rl.algos.ddpg.DDPG(*args, **kwargs)

Bases: tf2rl.algos.policy_base.OffPolicyAgent

DDPG agent: https://arxiv.org/abs/1509.02971

Command Line Args:

  • --n-warmup (int): Number of warmup steps before training. The default is 1e4.

  • --batch-size (int): Batch size for training. The default is 32.

  • --gpu (int): GPU id. -1 disables GPU. The default is 0.

  • --memory-capacity (int): Replay Buffer size. The default is 1e6.

__init__(state_shape, action_dim, name='DDPG', max_action=1.0, lr_actor=0.001, lr_critic=0.001, actor_units=(400, 300), critic_units=(400, 300), sigma=0.1, tau=0.005, n_warmup=10000, memory_capacity=1000000, **kwargs)

Initialize DDPG agent

Parameters
  • state_shape (iterable of int) –

  • action_dim (int) –

  • name (str) – Name of agent. The default is "DDPG".

  • max_action (float) – Size of maximum action. (-max_action <= action <= max_action). The default is 1.

  • lr_actor (float) – Learning rate for actor network. The default is 0.001.

  • lr_critic (float) – Learning rate for critic network. The default is 0.001.

  • actor_units (iterable of int) – Number of units at hidden layers of actor.

  • critic_units (iterable of int) – Number of units at hidden layers of critic.

  • sigma (float) – Standard deviation of Gaussian noise. The default is 0.1.

  • tau (float) – Weight update ratio for target network. target = (1-tau)*target + tau*network The default is 0.005.

  • n_warmup (int) – Number of warmup steps before training. The default is 1e4.

  • memory_capacity (int) – Replay Buffer size. The default is 1e4.

  • batch_size (int) – Batch size. The default is 256.

  • discount (float) – Discount factor. The default is 0.99.

  • max_grad (float) – Maximum gradient. The default is 10.

  • gpu (int) – GPU id. -1 disables GPU. The default is 0.

get_action(state, test=False, tensor=False)

Get action

Parameters
  • state – Observation state

  • test (bool) – When False (default), policy returns exploratory action.

  • tensor (bool) – When True, return type is tf.Tensor

Returns

Selected action

Return type

tf.Tensor or np.ndarray or float

train(states, actions, next_states, rewards, done, weights=None)

Train DDPG

Parameters
  • states

  • actions

  • next_states

  • rewards

  • done

  • weights (optional) – Weights for importance sampling

compute_td_error(states, actions, next_states, rewards, dones)

Compute TD error

Parameters
  • states

  • actions

  • next_states

  • rewards

  • dones

Returns

tf.Tensor: TD error
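
A quick-start sketch in the spirit of the tf2rl README, assuming the Trainer class from tf2rl.experiments.trainer (documented outside this module) and an OpenAI Gym environment:

  import gym

  from tf2rl.algos.ddpg import DDPG
  from tf2rl.experiments.trainer import Trainer

  parser = Trainer.get_argument()
  parser = DDPG.get_argument(parser)
  args = parser.parse_args()

  env = gym.make("Pendulum-v0")
  test_env = gym.make("Pendulum-v0")
  policy = DDPG(
      state_shape=env.observation_space.shape,
      action_dim=env.action_space.high.size,
      max_action=env.action_space.high[0],
      gpu=args.gpu,
      memory_capacity=args.memory_capacity,
      batch_size=args.batch_size,
      n_warmup=args.n_warmup)
  trainer = Trainer(policy, env, args, test_env=test_env)
  trainer()   # run the training loop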

tf2rl.algos.dqn module

class tf2rl.algos.dqn.QFunc(*args, **kwargs)

Bases: tensorflow.python.keras.engine.training.Model

__init__(state_shape, action_dim, units=(32, 32), name='QFunc', enable_dueling_dqn=False, enable_noisy_dqn=False)
call(inputs)

Calls the model on new inputs.

In this case call just reapplies all ops in the graph to the new inputs (e.g. build a new computational graph from the provided inputs).

Parameters
  • inputs – A tensor or list of tensors.

  • training – Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode.

  • mask – A mask or list of masks. A mask can be either a tensor or None (no mask).

Returns

A tensor if there is a single output, or a list of tensors if there is more than one output.

class tf2rl.algos.dqn.DQN(*args, **kwargs)

Bases: tf2rl.algos.policy_base.OffPolicyAgent

DQN Agent: https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf

DQN supports the following extensions, enabled via the flags below: Double DQN, Dueling Network, and Noisy Network.

Command Line Args:

  • --n-warmup (int): Number of warmup steps before training. The default is 1e4.

  • --batch-size (int): Batch size of training. The default is 32.

  • --gpu (int): GPU id. -1 disables GPU. The default is 0.

  • --memory-capacity (int): Replay Buffer size. The default is 1e6.

  • --enable-double-dqn: Enable DDQN

  • --enable-dueling-dqn: Enable Dueling Network

  • --enable-noisy-dqn: Enable Noisy Network

__init__(state_shape, action_dim, q_func=None, name='DQN', lr=0.001, adam_eps=1e-07, units=(32, 32), epsilon=0.1, epsilon_min=None, epsilon_decay_step=1000000, n_warmup=10000, target_replace_interval=5000, memory_capacity=1000000, enable_double_dqn=False, enable_dueling_dqn=False, enable_noisy_dqn=False, optimizer=None, **kwargs)

Initialize DQN agent

Parameters
  • state_shape (iterable of int) – Observation space shape

  • action_dim (int) – Dimension of discrete action

  • q_func (QFunc) – Custom Q function class. If None (default), Q function is constructed with QFunc.

  • name (str) – Name of agent. The default is "DQN"

  • lr (float) – Learning rate. The default is 0.001.

  • adam_eps (float) – Epsilon for Adam. The default is 1e-7

  • units (iterable of int) – Units of hidden layers. The default is (32, 32)

  • epsilon (float) – Initial epsilon of epsilon-greedy. The default is 0.1

  • epsilon_min (float) – Minimum epsilon after decay.

  • epsilon_decay_step (int) – Number of steps over which epsilon decays. The default is 1e6

  • n_warmup (int) – Number of warmup steps before training. The default is 1e4

  • target_replace_interval (int) – Number of steps between target network updates. The default is 5e3

  • memory_capacity (int) – Size of replay buffer. The default is 1e6

  • enable_double_dqn (bool) – Whether to use Double DQN. The default is False

  • enable_dueling_dqn (bool) – Whether to use a Dueling network. The default is False

  • enable_noisy_dqn (bool) – Whether to use a noisy network. The default is False

  • optimizer (tf.keras.optimizers.Optimizer) – Custom optimizer

  • batch_size (int) – Batch size. The default is 256.

  • discount (float) – Discount factor. The default is 0.99.

  • max_grad (float) – Maximum gradient. The default is 10.

  • gpu (int) – GPU id. -1 disables GPU. The default is 0.

get_action(state, test=False, tensor=False)

Get action

Parameters
  • state – Observation state

  • test (bool) – When False (default), policy returns exploratory action.

  • tensor (bool) – When True, return type is tf.Tensor

Returns

Selected action

Return type

tf.Tensor or np.ndarray or float

train(states, actions, next_states, rewards, done, weights=None)

Train DQN

Parameters
  • states

  • actions

  • next_states

  • rewards

  • done

  • weights (optional) – Weights for importance sampling

compute_td_error(states, actions, next_states, rewards, dones)

Compute TD error

Parameters
  • states

  • actions

  • next_states

  • rewards

  • dones

Returns

tf.Tensor: TD error

static get_argument(parser=None)

Create or update argument parser for command line program

Parameters

parser (argparse.ArgumentParser, optional) – argument parser

Returns

argument parser

Return type

argparse.ArgumentParser
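
A quick-start sketch in the spirit of the tf2rl README, again assuming tf2rl.experiments.trainer.Trainer and a discrete-action Gym environment:

  import gym

  from tf2rl.algos.dqn import DQN
  from tf2rl.experiments.trainer import Trainer

  parser = Trainer.get_argument()
  parser = DQN.get_argument(parser)
  args = parser.parse_args()

  env = gym.make("CartPole-v0")
  test_env = gym.make("CartPole-v0")
  policy = DQN(
      state_shape=env.observation_space.shape,
      action_dim=env.action_space.n,
      enable_double_dqn=args.enable_double_dqn,
      enable_dueling_dqn=args.enable_dueling_dqn,
      enable_noisy_dqn=args.enable_noisy_dqn,
      target_replace_interval=300,
      gpu=args.gpu,
      memory_capacity=args.memory_capacity,
      batch_size=args.batch_size,
      n_warmup=args.n_warmup)
  trainer = Trainer(policy, env, args, test_env=test_env)
  trainer()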

tf2rl.algos.gaifo module

class tf2rl.algos.gaifo.Discriminator(*args, **kwargs)

Bases: tf2rl.algos.gail.Discriminator

__init__(state_shape, units=(32, 32), enable_sn=False, output_activation='sigmoid', name='Discriminator')
class tf2rl.algos.gaifo.GAIfO(*args, **kwargs)

Bases: tf2rl.algos.gail.GAIL

Generative Adversarial Imitation from Observation (GAIfO) Agent: https://arxiv.org/abs/1807.06158

Command Line Args:

  • --n-warmup (int): Number of warmup steps before training. The default is 1e4.

  • --batch-size (int): Batch size of training. The default is 32.

  • --gpu (int): GPU id. -1 disables GPU. The default is 0.

  • --memory-capacity (int): Replay Buffer size. The default is 1e4.

  • --enable-sn: Enable Spectral Normalization

__init__(state_shape, units=(32, 32), lr=0.001, enable_sn=False, name='GAIfO', **kwargs)

Initialize GAIfO

Parameters
  • state_shape (iterable of int) –

  • action_dim (int) –

  • units (iterable of int) – The default is (32, 32)

  • lr (float) – Learning rate. The default is 0.001

  • enable_sn (bool) – Whether to enable Spectral Normalization. The default is False

  • name (str) – The default is "GAIfO"

train(agent_states, agent_next_states, expert_states, expert_next_states, **kwargs)

Train GAIfO

Parameters
  • agent_states

  • agent_next_states

  • expert_states

  • expert_next_states

inference(states, actions, next_states)

Infer Reward with GAIfO

Parameters
  • states

  • actions

  • next_states

Returns

Reward

Return type

tf.Tensor

tf2rl.algos.gail module

class tf2rl.algos.gail.Discriminator(*args, **kwargs)

Bases: tensorflow.python.keras.engine.training.Model

__init__(state_shape, action_dim, units=(32, 32), enable_sn=False, output_activation='sigmoid', name='Discriminator')
call(inputs)

Calls the model on new inputs.

In this case call just reapplies all ops in the graph to the new inputs (e.g. build a new computational graph from the provided inputs).

Parameters
  • inputs – A tensor or list of tensors.

  • training – Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode.

  • mask – A mask or list of masks. A mask can be either a tensor or None (no mask).

Returns

A tensor if there is a single output, or a list of tensors if there is more than one output.

compute_reward(inputs)
class tf2rl.algos.gail.GAIL(*args, **kwargs)

Bases: tf2rl.algos.policy_base.IRLPolicy

Generative Adversarial Imitation Learning (GAIL) Agent: https://arxiv.org/abs/1606.03476

Command Line Args:

  • --n-warmup (int): Number of warmup steps before training. The default is 1e4.

  • --batch-size (int): Batch size of training. The default is 32.

  • --gpu (int): GPU id. -1 disables GPU. The default is 0.

  • --memory-capacity (int): Replay Buffer size. The default is 1e4.

  • --enable-sn: Enable Spectral Normalization

__init__(state_shape, action_dim, units=[32, 32], lr=0.001, enable_sn=False, name='GAIL', **kwargs)

Initialize GAIL

Parameters
  • state_shape (iterable of int) –

  • action_dim (int) –

  • units (iterable of int) – The default is [32, 32]

  • lr (float) – Learning rate. The default is 0.001

  • enable_sn (bool) – Whether to enable Spectral Normalization. The default is False

  • name (str) – The default is "GAIL"

train(agent_states, agent_acts, expert_states, expert_acts, **kwargs)

Train GAIL

Parameters
  • agent_states

  • agent_acts

  • expert_states

  • expert_acts

inference(states, actions, next_states)

Infer Reward with GAIL

Parameters
  • states

  • actions

  • next_states

Returns

Reward

Return type

tf.Tensor

static get_argument(parser=None)
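
A sketch of the discriminator update and reward inference with dummy NumPy batches; the shapes and data are illustrative. In practice the expert batches come from recorded demonstrations, and the inferred reward replaces the environment reward when training a companion RL policy.

  import numpy as np

  from tf2rl.algos.gail import GAIL

  state_dim, action_dim, batch_size = 8, 2, 32
  irl = GAIL(state_shape=(state_dim,), action_dim=action_dim, gpu=-1)

  # Dummy minibatches standing in for agent rollouts and expert demonstrations.
  agent_states = np.random.rand(batch_size, state_dim).astype(np.float32)
  agent_acts = np.random.rand(batch_size, action_dim).astype(np.float32)
  expert_states = np.random.rand(batch_size, state_dim).astype(np.float32)
  expert_acts = np.random.rand(batch_size, action_dim).astype(np.float32)

  # One discriminator update on agent vs. expert samples.
  irl.train(agent_states, agent_acts, expert_states, expert_acts)

  # Pseudo-rewards for the agent batch; the third argument matches the
  # documented inference signature (a placeholder array is passed here).
  rewards = irl.inference(agent_states, agent_acts, agent_states)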

tf2rl.algos.policy_base module

class tf2rl.algos.policy_base.Policy(*args, **kwargs)

Bases: tensorflow.python.keras.engine.training.Model

__init__(name, memory_capacity, update_interval=1, batch_size=256, discount=0.99, n_warmup=0, max_grad=10.0, n_epoch=1, gpu=0)
get_action(observation, test=False)
static get_argument(parser=None)
class tf2rl.algos.policy_base.OnPolicyAgent(*args, **kwargs)

Bases: tf2rl.algos.policy_base.Policy

Base class for on-policy agents

__init__(horizon=2048, lam=0.95, enable_gae=True, normalize_adv=True, entropy_coef=0.01, vfunc_coef=1.0, **kwargs)
static get_argument(parser=None)
class tf2rl.algos.policy_base.OffPolicyAgent(*args, **kwargs)

Bases: tf2rl.algos.policy_base.Policy

Base class for off-policy agents

__init__(memory_capacity, **kwargs)
static get_argument(parser=None)
class tf2rl.algos.policy_base.IRLPolicy(*args, **kwargs)

Bases: tf2rl.algos.policy_base.Policy

__init__(n_training=1, memory_capacity=0, **kwargs)
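
The base classes define the interface shared by the agents in this package. Below is a toy subclass sketch; the override set (get_action and train) is inferred from the concrete agents documented in this section, so treat it as an assumption rather than a formal contract.

  import numpy as np

  from tf2rl.algos.policy_base import OffPolicyAgent

  class RandomAgent(OffPolicyAgent):
      """Toy agent illustrating the interface expected of Policy subclasses."""

      def __init__(self, state_shape, action_dim, memory_capacity=int(1e6), **kwargs):
          super().__init__(name="RandomAgent", memory_capacity=memory_capacity, **kwargs)
          self._action_dim = action_dim

      def get_action(self, observation, test=False):
          # A real agent would run its policy network here.
          return np.random.uniform(-1., 1., size=self._action_dim)

      def train(self, states, actions, next_states, rewards, dones, weights=None):
          # A real agent would compute losses and apply gradients here.
          pass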

tf2rl.algos.ppo module

class tf2rl.algos.ppo.PPO(*args, **kwargs)

Bases: tf2rl.algos.vpg.VPG

Proximal Policy Optimization (PPO) Agent: https://arxiv.org/abs/1707.06347

Command Line Args:

  • --batch-size (int): Batch size of training. The default is 32.

  • --gpu (int): GPU id. -1 disables GPU. The default is 0.

  • --horizon (int): The default is 2048.

  • --normalize_adv: Normalize Advantage.

  • --enable-gae: Enable GAE.

__init__(clip=True, clip_ratio=0.2, name='PPO', **kwargs)

Initialize PPO

Parameters
  • clip (bool) – Whether clip or not. The default is True.

  • clip_ratio (float) – Probability ratio is clipped between 1-clip_ratio and 1+clip_ratio.

  • name (str) – Name of agent. The default is "PPO".

  • state_shape (iterable of int) –

  • action_dim (int) –

  • is_discrete (bool) –

  • actor

  • critic

  • actor_critic

  • max_action (float) – maximum action size.

  • actor_units (iterable of int) – Numbers of units at hidden layers of actor. The default is (256, 256).

  • critic_units (iterable of int) – Numbers of units at hidden layers of critic. The default is (256, 256).

  • lr_actor (float) – Learning rate of actor. The default is 1e-3.

  • lr_critic (float) – Learning rate of critic. The default is 3e-3.

  • hidden_activation_actor (str) – Activation for actor. The default is "relu".

  • hidden_activation_critic (str) – Activation for critic. The default is "relu".

  • horizon (int) – Number of steps of online episode horizon. The horizon must be a multiple of batch_size. The default is 2048.

  • enable_gae (bool) – Enable GAE. The default is True.

  • normalize_adv (bool) – Normalize Advantage. The default is True.

  • entropy_coef (float) – Entropy coefficient. The default is 0.01.

  • vfunc_coef (float) – Mixing ratio factor for actor and critic. actor_loss + vfunc_coef*critic_loss

  • batch_size (int) – Batch size. The default is 256.

train(states, actions, advantages, logp_olds, returns)

Train PPO

Parameters
  • states

  • actions

  • advantages

  • logp_olds

  • returns
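
A construction sketch for a continuous-control Gym environment. On-policy training is normally orchestrated by tf2rl's on-policy trainer (documented outside this module); the environment and hyperparameters below are illustrative.

  import gym

  from tf2rl.algos.ppo import PPO

  env = gym.make("Pendulum-v0")
  agent = PPO(
      state_shape=env.observation_space.shape,
      action_dim=env.action_space.high.size,
      is_discrete=False,
      max_action=env.action_space.high[0],
      clip_ratio=0.2,    # probability ratio clipped to [0.8, 1.2]
      horizon=2048,
      batch_size=64,     # horizon must be a multiple of batch_size
      gpu=-1)

  obs = env.reset()
  action, logp, value = agent.get_action_and_val(obs)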

tf2rl.algos.sac module

class tf2rl.algos.sac.CriticQ(*args, **kwargs)

Bases: tensorflow.python.keras.engine.training.Model

__init__(state_shape, action_dim, critic_units=(256, 256), name='qf')
call(states, actions)

Calls the model on new inputs.

In this case call just reapplies all ops in the graph to the new inputs (e.g. build a new computational graph from the provided inputs).

Parameters
  • inputs – A tensor or list of tensors.

  • training – Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode.

  • mask – A mask or list of masks. A mask can be either a tensor or None (no mask).

Returns

A tensor if there is a single output, or a list of tensors if there is more than one output.

class tf2rl.algos.sac.SAC(*args, **kwargs)

Bases: tf2rl.algos.policy_base.OffPolicyAgent

Soft Actor-Critic (SAC) Agent: https://arxiv.org/abs/1801.01290

Command Line Args:

  • --n-warmup (int): Number of warmup steps before training. The default is 1e4.

  • --batch-size (int): Batch size of training. The default is 32.

  • --gpu (int): GPU id. -1 disables GPU. The default is 0.

  • --memory-capacity (int): Replay Buffer size. The default is 1e6.

  • --alpha (float): Temperature parameter. The default is 0.2.

  • --auto-alpha: Automatic alpha tuning.

__init__(state_shape, action_dim, name='SAC', max_action=1.0, lr=0.0003, lr_alpha=0.0003, actor_units=(256, 256), critic_units=(256, 256), tau=0.005, alpha=0.2, auto_alpha=False, init_temperature=None, n_warmup=10000, memory_capacity=1000000, **kwargs)

Initialize SAC

Parameters
  • state_shape (iterable of int) –

  • action_dim (int) –

  • name (str) – Name of network. The default is "SAC"

  • max_action (float) –

  • lr (float) – Learning rate. The default is 3e-4.

  • lr_alpha (float) – Learning rate for alpha. The default is 3e-4.

  • actor_units (iterable of int) – Numbers of units at hidden layers of actor. The default is (256, 256).

  • critic_units (iterable of int) – Numbers of units at hidden layers of critic. The default is (256, 256).

  • tau (float) – Target network update rate.

  • alpha (float) – Temperature parameter. The default is 0.2.

  • auto_alpha (bool) – Automatic alpha tuning.

  • init_temperature (float) – Initial temperature

  • n_warmup (int) – Number of warmup steps before training. The default is int(1e4).

  • memory_capacity (int) – Replay Buffer size. The default is int(1e6).

  • batch_size (int) – Batch size. The default is 256.

  • discount (float) – Discount factor. The default is 0.99.

  • max_grad (float) – Maximum gradient. The default is 10.

  • gpu (int) – GPU id. -1 disables GPU. The default is 0.

get_action(state, test=False)

Get action

Parameters
  • state – Observation state

  • test (bool) – When False (default), policy returns exploratory action.

Returns

Selected action

Return type

tf.Tensor or float

train(states, actions, next_states, rewards, dones, weights=None)

Train SAC

Parameters
  • states

  • actions

  • next_states

  • rewards

  • dones

  • weights (optional) – Weights for importance sampling

compute_td_error(states, actions, next_states, rewards, dones)

Compute TD error

Parameters
  • states

  • actions

  • next_states

  • rewards

  • dones

Returns

np.ndarray: TD error

static get_argument(parser=None)

Create or update argument parser for command line program

Parameters

parser (argparse.ArgumentParser, optional) – argument parser

Returns

argument parser

Return type

argparse.ArgumentParser
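
A construction sketch for a continuous-control Gym environment (Pendulum-v0 is only illustrative):

  import gym

  from tf2rl.algos.sac import SAC

  env = gym.make("Pendulum-v0")
  agent = SAC(
      state_shape=env.observation_space.shape,
      action_dim=env.action_space.high.size,
      max_action=env.action_space.high[0],
      auto_alpha=True,   # tune the temperature automatically
      gpu=-1)

  obs = env.reset()
  exploratory_action = agent.get_action(obs)
  evaluation_action = agent.get_action(obs, test=True)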

tf2rl.algos.sac_ae module

class tf2rl.algos.sac_ae.SACAE(*args, **kwargs)

Bases: tf2rl.algos.sac.SAC

SAC+AE Agent: https://arxiv.org/abs/1910.01741

Command Line Args:

  • --n-warmup (int): Number of warmup steps before training. The default is 1e4.

  • --batch-size (int): Batch size of training. The default is 32.

  • --gpu (int): GPU id. -1 disables GPU. The default is 0.

  • --memory-capacity (int): Replay Buffer size. The default is 1e5.

  • --alpha (float): Temperature parameter. The default is 0.2.

  • --auto-alpha: Automatic alpha tuning.

  • --stop-q-grad: Whether to stop gradients after the convolutional layers of the Encoder

__init__(action_dim, obs_shape=(84, 84, 9), n_conv_layers=4, n_conv_filters=32, feature_dim=50, tau_encoder=0.05, tau_critic=0.01, auto_alpha=True, lr_sac=0.001, lr_encoder=0.001, lr_decoder=0.001, update_critic_target_freq=2, update_actor_freq=2, lr_alpha=0.0001, init_temperature=0.1, stop_q_grad=False, lambda_latent_val=1e-06, decoder_weight_lambda=1e-07, skip_making_decoder=False, name='SACAE', **kwargs)

Initialize SAC+AE

Parameters
  • action_dim (int) –

  • obs_shape (iterable of int) – The default is (84, 84, 9)

  • n_conv_layers (int) – Number of convolutional layers at encoder. The default is 4

  • n_conv_filters (int) – Number of filters in convolutional layers. The default is 32

  • feature_dim (int) – Number of features after the encoder. These features are treated as the SAC input. The default is 50

  • tau_encoder (float) – Target network update rate for Encoder. The default is 0.05

  • tau_critic (float) – Target network update rate for Critic. The default is 0.01

  • auto_alpha (bool) – Automatic alpha tuning. The default is True

  • lr_sac (float) – Learning rate for SAC. The default is 1e-3

  • lr_encoder (float) – Learning rate for Encoder. The default is 1e-3

  • lr_decoder (float) – Learning rate for Decoder. The default is 1e-3

  • update_critic_target_freq (int) – The default is 2

  • update_actor_freq (int) – The default is 2

  • lr_alpha (float) – Learning rate for alpha. The default is 1e-4.

  • init_temperature (float) – Initial temperature. The default is 0.1

  • stop_q_grad (bool) – Whether to stop gradient propagation after the encoder's convolutional network. The default is False

  • lambda_latent_val (float) – AE loss = REC loss + lambda_latent_val * latent loss. The default is 1e-6

  • decoder_weight_lambda (float) – Weight decay of AdamW for Decoder. The default is 1e-7

  • skip_making_decoder (bool) – Whether to skip building the Decoder. The default is False

  • name (str) – Name of network. The default is "SACAE"

  • max_action (float) –

  • actor_units (iterable of int) – Numbers of units at hidden layers of actor. The default is (256, 256).

  • critic_units (iterable of int) – Numbers of units at hidden layers of critic. The default is (256, 256).

  • alpha (float) – Temperature parameter. The default is 0.2.

  • n_warmup (int) – Number of warmup steps before training. The default is int(1e4).

  • memory_capacity (int) – Replay Buffer size. The default is int(1e6).

  • batch_size (int) – Batch size. The default is 256.

  • discount (float) – Discount factor. The default is 0.99.

  • max_grad (float) – Maximum gradient. The default is 10.

  • gpu (int) – GPU id. -1 disables GPU. The default is 0.

get_action(state, test=False)

Get action

Parameters
  • state – Observation state

  • test (bool) – When False (default), policy returns exploratory action.

Returns

Selected action

Return type

tf.Tensor or float

Notes

When the input image has a different size, a cropped image is used

train(states, actions, next_states, rewards, dones, weights=None)

Train SAC+AE

Parameters
  • states

  • actions

  • next_states

  • rewards

  • dones

  • weights (optional) – Weights for importance sampling

static get_argument(parser=None)

Create or update argument parser for command line program

Parameters

parser (argparse.ArgumentParser, optional) – argument parser

Returns

argument parser

Return type

argparse.ArgumentParser

compute_td_error(states, actions, next_states, rewards, dones)

Compute TD error

Parameters
  • states

  • actions

  • next_states

  • rewards

  • dones

Returns

np.ndarray: TD error

tf2rl.algos.sac_discrete module

class tf2rl.algos.sac_discrete.CriticQ(*args, **kwargs)

Bases: tensorflow.python.keras.engine.training.Model

Compared with the original (continuous) version of SAC, the output of the Q-function changes from Q: S x A -> R to Q: S -> R^|A|

__init__(state_shape, action_dim, critic_units=(256, 256), name='qf')
call(states)

Calls the model on new inputs.

In this case call just reapplies all ops in the graph to the new inputs (e.g. build a new computational graph from the provided inputs).

Parameters
  • inputs – A tensor or list of tensors.

  • training – Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode.

  • mask – A mask or list of masks. A mask can be either a tensor or None (no mask).

Returns

A tensor if there is a single output, or a list of tensors if there is more than one output.

class tf2rl.algos.sac_discrete.SACDiscrete(*args, **kwargs)

Bases: tf2rl.algos.sac.SAC

__init__(state_shape, action_dim, *args, actor_fn=None, critic_fn=None, target_update_interval=None, **kwargs)

Initialize SAC

Parameters
  • state_shape (iterable of int) –

  • action_dim (int) –

  • name (str) – Name of network. The default is "SAC"

  • max_action (float) –

  • lr (float) – Learning rate. The default is 3e-4.

  • lr_alpha (float) – Learning rate for alpha. The default is 3e-4.

  • actor_units (iterable of int) – Numbers of units at hidden layers of actor. The default is (256, 256).

  • critic_units (iterable of int) – Numbers of units at hidden layers of critic. The default is (256, 256).

  • tau (float) – Target network update rate.

  • alpha (float) – Temperature parameter. The default is 0.2.

  • auto_alpha (bool) – Automatic alpha tuning.

  • init_temperature (float) – Initial temperature

  • n_warmup (int) – Number of warmup steps before training. The default is int(1e4).

  • memory_capacity (int) – Replay Buffer size. The default is int(1e6).

  • batch_size (int) – Batch size. The default is 256.

  • discount (float) – Discount factor. The default is 0.99.

  • max_grad (float) – Maximum gradient. The default is 10.

  • gpu (int) – GPU id. -1 disables GPU. The default is 0.

train(states, actions, next_states, rewards, dones, weights=None)

Train SAC

Parameters
  • states

  • actions

  • next_states

  • rewards

  • dones

  • weights (optional) – Weights for importance sampling

compute_td_error(states, actions, next_states, rewards, dones)

Compute TD error

Parameters
  • states

  • actions

  • next_states

  • rewards

  • dones

Returns

np.ndarray: TD error

static get_argument(parser=None)

Create or update argument parser for command line program

Parameters

parser (argparse.ArgumentParser, optional) – argument parser

Returns

argument parser

Return type

argparse.ArgumentParser

tf2rl.algos.td3 module

class tf2rl.algos.td3.Critic(*args, **kwargs)

Bases: tensorflow.python.keras.engine.training.Model

__init__(state_shape, action_dim, units=(400, 300), name='Critic')
call(states, actions)

Calls the model on new inputs.

In this case call just reapplies all ops in the graph to the new inputs (e.g. build a new computational graph from the provided inputs).

Parameters
  • inputs – A tensor or list of tensors.

  • training – Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode.

  • mask – A mask or list of masks. A mask can be either a tensor or None (no mask).

Returns

A tensor if there is a single output, or a list of tensors if there is more than one output.

class tf2rl.algos.td3.TD3(*args, **kwargs)

Bases: tf2rl.algos.ddpg.DDPG

Twin Delayed Deep Deterministic policy gradient (TD3) Agent: https://arxiv.org/abs/1802.09477

Command Line Args:

  • --n-warmup (int): Number of warmup steps before training. The default is 1e4.

  • --batch-size (int): Batch size for training. The default is 32.

  • --gpu (int): GPU id. -1 disables GPU. The default is 0.

  • --memory-capacity (int): Replay Buffer size. The default is 1e6.

__init__(state_shape, action_dim, name='TD3', actor_update_freq=2, policy_noise=0.2, noise_clip=0.5, critic_units=(400, 300), **kwargs)

Initialize TD3

Parameters
  • state_shape (iterable of int) – Observation state shape

  • action_dim (int) – Action dimension

  • name (str) – Network name. The default is "TD3".

  • actor_update_freq (int) – Number of critic updates per actor update.

  • policy_noise (float) –

  • noise_clip (float) –

  • critic_units (iterable of int) – Numbers of units at hidden layer of critic. The default is (400, 300)

  • max_action (float) – Size of maximum action. (-max_action <= action <= max_action). The default is 1.

  • lr_actor (float) – Learning rate for actor network. The default is 0.001.

  • lr_critic (float) – Learning rate for critic network. The default is 0.001.

  • actor_units (iterable of int) – Number of units at hidden layers of actor.

  • sigma (float) – Standard deviation of Gaussian noise. The default is 0.1.

  • tau (float) – Weight update ratio for target network. target = (1-tau)*target + tau*network The default is 0.005.

  • n_warmup (int) – Number of warmup steps before training. The default is 1e4.

  • memory_capacity (int) – Replay Buffer size. The default is 1e4.

  • batch_size (int) – Batch size. The default is 256.

  • discount (float) – Discount factor. The default is 0.99.

  • max_grad (float) – Maximum gradient. The default is 10.

  • gpu (int) – GPU id. -1 disables GPU. The default is 0.

compute_td_error(states, actions, next_states, rewards, dones)

Compute TD error

Parameters
  • states

  • actions

  • next_states

  • rewards

  • dones

Returns

Sum of two TD errors.

Return type

np.ndarray
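
A construction sketch that also shows how compute_td_error can be used to obtain per-transition priorities for a prioritized replay buffer. The batch shapes below are illustrative assumptions about the replay buffer layout.

  import gym
  import numpy as np

  from tf2rl.algos.td3 import TD3

  env = gym.make("Pendulum-v0")
  agent = TD3(
      state_shape=env.observation_space.shape,
      action_dim=env.action_space.high.size,
      max_action=env.action_space.high[0],
      actor_update_freq=2,   # delayed policy updates
      gpu=-1)

  # Dummy minibatch; in practice these come from the replay buffer.
  batch = 32
  states = np.random.rand(batch, env.observation_space.shape[0]).astype(np.float32)
  actions = np.random.rand(batch, env.action_space.high.size).astype(np.float32)
  rewards = np.random.rand(batch, 1).astype(np.float32)
  dones = np.zeros((batch, 1), dtype=np.float32)
  td_errors = agent.compute_td_error(states, actions, states, rewards, dones)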

tf2rl.algos.vail module

class tf2rl.algos.vail.Discriminator(*args, **kwargs)

Bases: tensorflow.python.keras.engine.training.Model

LOG_SIG_CAP_MAX = 2
LOG_SIG_CAP_MIN = -20
EPS = 1e-06
__init__(state_shape, action_dim, units=(32, 32), n_latent_unit=32, enable_sn=False, name='Discriminator')
call(inputs)

Calls the model on new inputs.

In this case call just reapplies all ops in the graph to the new inputs (e.g. build a new computational graph from the provided inputs).

Parameters
  • inputs – A tensor or list of tensors.

  • training – Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode.

  • mask – A mask or list of masks. A mask can be either a tensor or None (no mask).

Returns

A tensor if there is a single output, or a list of tensors if there is more than one output.

compute_reward(inputs)
class tf2rl.algos.vail.VAIL(*args, **kwargs)

Bases: tf2rl.algos.gail.GAIL

Variational Adversarial Imitation Learning (VAIL) Agent: https://arxiv.org/abs/1810.00821

Command Line Args:

  • --n-warmup (int): Number of warmup steps before training. The default is 1e4.

  • --batch-size (int): Batch size of training. The default is 32.

  • --gpu (int): GPU id. -1 disables GPU. The default is 0.

  • --memory-capacity (int): Replay Buffer size. The default is 1e4.

  • --enable-sn: Enable Spectral Normalization

__init__(state_shape, action_dim, units=(32, 32), n_latent_unit=32, lr=5e-05, kl_target=0.5, reg_param=0.0, enable_sn=False, enable_gp=False, name='VAIL', **kwargs)

Initialize VAIL

Parameters
  • state_shape (iterable of int) –

  • action_dim (int) –

  • units (iterable of int) – The default is (32, 32)

  • lr (float) – Learning rate. The default is 5e-5

  • kl_target (float) – The default is 0.5

  • reg_param (float) – The default is 0

  • enable_sn (bool) – Whether to enable Spectral Normalization. The default is False

  • enable_gp (bool) – Whether the loss function includes a gradient penalty

  • name (str) – The default is "VAIL"

train(agent_states, agent_acts, expert_states, expert_acts, **kwargs)

Train VAIL

Parameters
  • agent_states

  • agent_acts

  • expert_states

  • expert_acts

tf2rl.algos.vpg module

class tf2rl.algos.vpg.CriticV(*args, **kwargs)

Bases: tensorflow.python.keras.engine.training.Model

__init__(state_shape, units, name='critic_v', hidden_activation='relu')
call(inputs)

Calls the model on new inputs.

In this case call just reapplies all ops in the graph to the new inputs (e.g. build a new computational graph from the provided inputs).

Parameters
  • inputs – A tensor or list of tensors.

  • training – Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode.

  • mask – A mask or list of masks. A mask can be either a tensor or None (no mask).

Returns

A tensor if there is a single output, or a list of tensors if there is more than one output.

class tf2rl.algos.vpg.VPG(*args, **kwargs)

Bases: tf2rl.algos.policy_base.OnPolicyAgent

VPG Agent: https://proceedings.neurips.cc/paper/1999/file/464d828b85b0bed98e80ade0a5c43b0f-Paper.pdf

Command Line Args:

  • --batch-size (int): Batch size of training. The default is 32.

  • --gpu (int): GPU id. -1 disables GPU. The default is 0.

  • --horizon (int): The default is 2048.

  • --normalize_adv: Normalize Advantage.

  • --enable-gae: Enable GAE.

__init__(state_shape, action_dim, is_discrete, actor=None, critic=None, actor_critic=None, max_action=1.0, actor_units=(256, 256), critic_units=(256, 256), lr_actor=0.001, lr_critic=0.003, hidden_activation_actor='relu', hidden_activation_critic='relu', name='VPG', **kwargs)

Initialize VPG

Parameters
  • state_shape (iterable of int) –

  • action_dim (int) –

  • is_discrete (bool) –

  • actor

  • critic

  • actor_critic

  • max_action (float) – maximum action size.

  • actor_units (iterable of int) – Numbers of units at hidden layers of actor. The default is (256, 256).

  • critic_units (iterable of int) – Numbers of units at hidden layers of critic. The default is (256, 256).

  • lr_actor (float) – Learning rate of actor. The default is 1e-3.

  • lr_critic (float) – Learning rate of critic. The default is 3e-3.

  • hidden_activation_actor (str) – Activation for actor. The default is "relu".

  • hidden_activation_critic (str) – Activation for critic. The default is "relu".

  • name (str) – Name of agent. The default is "VPG".

  • horizon (int) – Number of steps of online episode horizon. The horizon must be a multiple of batch_size. The default is 2048.

  • enable_gae (bool) – Enable GAE. The default is True.

  • normalize_adv (bool) – Normalize Advantage. The default is True.

  • entropy_coef (float) – Entropy coefficient. The default is 0.01.

  • vfunc_coef (float) – Mixing ratio factor for actor and critic. actor_loss + vfunc_coef*critic_loss

  • batch_size (int) – Batch size. The default is 256.

get_action(state, test=False)

Get action and probability

Parameters
  • state – Observation state

  • test (bool) – When False (default), policy returns exploratory action.

Returns

Selected action and its log probability Log(p)

Return type

np.ndarray or float

get_action_and_val(state, test=False)

Get action, probability, and critic value

Parameters
  • state – Observation state

  • test (bool) – When False (default), policy returns exploratory action.

Returns

Selected action, its log probability Log(p), and the critic value

Return type

np.ndarray

train(states, actions, advantages, logp_olds, returns)

Train VPG

Parameters
  • states

  • actions

  • advantages

  • logp_olds

  • returns
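
A rollout-collection sketch using get_action_and_val. In practice the collected transitions, log probabilities, and values would be turned into advantages and returns (e.g., with GAE) before calling train; the environment and shapes here are illustrative.

  import gym

  from tf2rl.algos.vpg import VPG

  env = gym.make("Pendulum-v0")
  agent = VPG(
      state_shape=env.observation_space.shape,
      action_dim=env.action_space.high.size,
      is_discrete=False,
      max_action=env.action_space.high[0],
      horizon=256,
      batch_size=64,
      gpu=-1)

  # Collect one horizon of experience with the current policy.
  obs = env.reset()
  for _ in range(256):
      action, logp, value = agent.get_action_and_val(obs)
      next_obs, reward, done, _ = env.step(action)
      obs = env.reset() if done else next_obs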

Module contents