tf2rl.algos package
Submodules
tf2rl.algos.apex module
- tf2rl.algos.apex.import_tf()
- tf2rl.algos.apex.explorer(global_rb, queue, trained_steps, is_training_done, lock, env_fn, policy_fn, set_weights_fn, noise_level, n_env=64, n_thread=4, buffer_size=1024, episode_max_steps=1000, gpu=0)
Collect transitions and store them in the prioritized replay buffer.
- Parameters
global_rb – multiprocessing.managers.AutoProxy[PrioritizedReplayBuffer] Prioritized replay buffer shared among multiple explorers and a single learner. Because this object is shared across processes, it must be locked via the lock object before operating on it.
queue – multiprocessing.Queue A FIFO shared with the learner and evaluator to get the latest network weights. It is process safe, so no lock is needed when using it.
trained_steps – multiprocessing.Value Shared counter of applied gradient steps.
is_training_done – multiprocessing.Event multiprocessing.Event object to share the training status.
lock – multiprocessing.Lock Lock to synchronize with other processes.
env_fn – function Method object to generate an environment.
policy_fn – function Method object to generate an explorer.
set_weights_fn – function Method object to set network weights received from the queue.
noise_level – float Noise level for exploration. For epsilon-greedy policies such as DQN variants this is epsilon; for DDPG variants it is the variance of the Normal distribution.
n_env – int Number of environments to distribute. If this is set to more than 1, MultiThreadEnv is used.
n_thread – int Number of threads used in MultiThreadEnv.
buffer_size – int Size of the local buffer. When it fills with transitions, they are added to global_rb.
episode_max_steps – int Maximum number of steps of an episode.
gpu – int GPU id. If this is set to -1, the process uses only the CPU.
- Returns
None
- tf2rl.algos.apex.learner(global_rb, trained_steps, is_training_done, lock, env, policy_fn, get_weights_fn, n_training, update_freq, evaluation_freq, gpu, queues)
Update network weights using samples collected by explorers.
- Parameters
global_rb – multiprocessing.managers.AutoProxy[PrioritizedReplayBuffer] Prioritized replay buffer shared among multiple explorers and a single learner. Because this object is shared across processes, it must be locked via the lock object before operating on it.
trained_steps – multiprocessing.Value Shared counter of applied gradient steps.
is_training_done – multiprocessing.Event multiprocessing.Event object to share the training status.
lock – multiprocessing.Lock Lock to synchronize with other processes.
env – OpenAI Gym compatible environment object
policy_fn – function Method object to generate an explorer.
get_weights_fn – function Method object to get network weights and put them into the queue.
n_training – int Maximum number of times to apply gradients. Once the number of gradient applications exceeds this value, training finishes by setting is_training_done to True.
update_freq – int Frequency at which to update parameters, i.e., to put network parameters into the queues.
evaluation_freq – int Frequency at which to call the evaluator.
gpu – int GPU id. If this is set to -1, the process uses only the CPU.
queues – List List of Queues shared with explorers to send the latest network parameters.
- Returns
None
- tf2rl.algos.apex.evaluator(is_training_done, env, policy_fn, set_weights_fn, queue, gpu, save_model_interval=1000000, n_evaluation=10, episode_max_steps=1000, show_test_progress=False)
Evaluate trained network weights periodically.
- Parameters
is_training_done – multiprocessing.Event multiprocessing.Event object to share the training status.
env – OpenAI Gym compatible environment object
policy_fn – function Method object to generate an explorer.
set_weights_fn – function Method object to set network weights received from the queue.
queue – multiprocessing.Queue A FIFO shared with the learner to get the latest network weights. It is process safe, so no lock is needed when using it.
gpu – int GPU id. If this is set to -1, the process uses only the CPU.
save_model_interval – int Interval at which to save the model.
n_evaluation – int Number of episodes to evaluate.
episode_max_steps – int Maximum number of steps of an episode.
show_test_progress – bool If True, render is called to visualize the evaluation process.
- tf2rl.algos.apex.apex_argument(parser=None)
- tf2rl.algos.apex.prepare_experiment(env, args)
- tf2rl.algos.apex.run(args, env_fn, policy_fn, get_weights_fn, set_weights_fn)
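A minimal sketch of wiring these functions together with the DDPG agent from tf2rl.algos.ddpg. The helper functions, the Pendulum-v0 choice, and the update_target_variables import are assumptions modeled on the tf2rl example scripts, not guarantees of this module's API:

    import gym
    from tf2rl.algos.apex import apex_argument, run
    from tf2rl.algos.ddpg import DDPG
    from tf2rl.misc.target_update_ops import update_target_variables  # assumed helper

    def env_fn():
        return gym.make("Pendulum-v0")

    def policy_fn(env, name, memory_capacity=int(1e6), gpu=-1, noise_level=0.1):
        # Each explorer/learner process builds its own policy instance.
        return DDPG(
            state_shape=env.observation_space.shape,
            action_dim=env.action_space.high.size,
            max_action=env.action_space.high[0],
            memory_capacity=memory_capacity,
            name=name, sigma=noise_level, gpu=gpu)

    def get_weights_fn(policy):
        # Weights the learner pushes to explorers/evaluator through the queues.
        return [policy.actor.weights,
                policy.critic.weights,
                policy.critic_target.weights]

    def set_weights_fn(policy, weights):
        # Copy received weights into the local networks (tau=1 means full copy).
        actor_weights, critic_weights, critic_target_weights = weights
        update_target_variables(policy.actor.weights, actor_weights, tau=1.)
        update_target_variables(policy.critic.weights, critic_weights, tau=1.)
        update_target_variables(policy.critic_target.weights, critic_target_weights, tau=1.)

    if __name__ == "__main__":
        args = apex_argument().parse_args()
        run(args, env_fn, policy_fn, get_weights_fn, set_weights_fn)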
tf2rl.algos.bi_res_ddpg module
- class tf2rl.algos.bi_res_ddpg.BiResDDPG(*args, **kwargs)
Bases:
tf2rl.algos.ddpg.DDPG
Bi-Res-DDPG Agent: https://arxiv.org/abs/1905.01072
Command Line Args:
--n-warmup (int): Number of warmup steps before training. The default is 1e4.
--batch-size (int): Batch size for training. The default is 32.
--gpu (int): GPU id. -1 disables GPU. The default is 0.
--memory-capacity (int): Replay Buffer size. The default is 1e6.
--eta (float): Gradient mixing factor. The default is 0.05.
- __init__(eta=0.05, name='BiResDDPG', **kwargs)
Initialize BiResDDPG agent
- Parameters
eta (float) – Gradient mixing factor. The default is 0.05.
name (str) – Name of agent. The default is "BiResDDPG".
state_shape (iterable of int) –
action_dim (int) –
max_action (float) – Size of maximum action (-max_action <= action <= max_action). The default is 1.
lr_actor (float) – Learning rate for actor network. The default is 0.001.
lr_critic (float) – Learning rate for critic network. The default is 0.001.
actor_units (iterable of int) – Number of units at hidden layers of actor.
critic_units (iterable of int) – Number of units at hidden layers of critic.
sigma (float) – Standard deviation of Gaussian noise. The default is 0.1.
tau (float) – Weight update ratio for target network: target = (1-tau)*target + tau*network. The default is 0.005.
n_warmup (int) – Number of warmup steps before training. The default is 1e4.
memory_capacity (int) – Replay Buffer size. The default is 1e6.
batch_size (int) – Batch size. The default is 256.
discount (float) – Discount factor. The default is 0.99.
max_grad (float) – Maximum gradient. The default is 10.
gpu (int) – GPU id. -1 disables GPU. The default is 0.
- compute_td_error(states, actions, next_states, rewards, dones)
Compute TD error
- Parameters
states –
actions –
next_states –
rewards –
dones –
- Returns
Sum of two TD errors.
- Return type
np.ndarray
- static get_argument(parser=None)
Create or update argument parser for command line program
- Parameters
parser (argparse.ArgumentParser, optional) – argument parser
- Returns
argument parser
- Return type
argparse.ArgumentParser
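As a quick illustration, the sketch below constructs the agent on a toy continuous-control task; Pendulum-v0 is an arbitrary choice, and the remaining keyword arguments fall through to the DDPG base class:

    import gym
    from tf2rl.algos.bi_res_ddpg import BiResDDPG

    env = gym.make("Pendulum-v0")
    agent = BiResDDPG(
        eta=0.05,  # gradient mixing factor for the two residual TD errors
        state_shape=env.observation_space.shape,
        action_dim=env.action_space.high.size,
        max_action=env.action_space.high[0],
        gpu=-1)    # CPU only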
tf2rl.algos.categorical_dqn module
- class tf2rl.algos.categorical_dqn.QFunc(*args, **kwargs)
Bases:
tensorflow.python.keras.engine.training.Model
- __init__(state_shape, action_dim, units=(32, 32), name='CategoricalQFunc', enable_dueling_dqn=False, enable_noisy_dqn=False, n_atoms=51)
- call(inputs)
Calls the model on new inputs.
In this case call just reapplies all ops in the graph to the new inputs (e.g. build a new computational graph from the provided inputs).
- Parameters
inputs – A tensor or list of tensors.
training – Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode.
mask – A mask or list of masks. A mask can be either a tensor or None (no mask).
- Returns
A tensor if there is a single output, or a list of tensors if there is more than one output.
- property n_atoms
- class tf2rl.algos.categorical_dqn.CategoricalDQN(*args, **kwargs)
Bases:
tf2rl.algos.policy_base.OffPolicyAgent
Categorical DQN Agent: https://arxiv.org/abs/1707.06887
Categorical DQN supports the following algorithms:
Dueling Network: https://arxiv.org/abs/1511.06581
Noisy Network: https://arxiv.org/abs/1706.10295
Command Line Args:
--n-warmup (int): Number of warmup steps before training. The default is 1e4.
--batch-size (int): Batch size of training. The default is 32.
--gpu (int): GPU id. -1 disables GPU. The default is 0.
--memory-capacity (int): Replay Buffer size. The default is 1e6.
--enable-double-dqn: Enable DDQN
--enable-dueling-dqn: Enable Dueling Network
--enable-noisy-dqn: Enable Noisy Network
- __init__(state_shape, action_dim, q_func=None, name='DQN', lr=0.001, adam_eps=1e-07, units=(32, 32), epsilon=0.1, epsilon_min=None, epsilon_decay_step=1000000, n_warmup=10000, target_replace_interval=5000, memory_capacity=1000000, enable_double_dqn=False, enable_dueling_dqn=False, enable_noisy_dqn=False, **kwargs)
Initialize Categorical DQN
- Parameters
state_shape (iterable of int) – Observation space shape
action_dim (int) – Dimension of discrete action
q_func (QFunc) – Custom Q function class. If None (default), the Q function is constructed with QFunc.
name (str) – Name of agent. The default is "DQN".
lr (float) – Learning rate. The default is 0.001.
adam_eps (float) – Epsilon for Adam. The default is 1e-7.
units (iterable of int) – Units of hidden layers. The default is (32, 32).
epsilon (float) – Initial epsilon of e-greedy. The default is 0.1.
epsilon_min (float) – Minimum epsilon after decay.
epsilon_decay_step (int) – Number of steps over which epsilon decays. The default is 1e6.
n_warmup (int) – Number of warmup steps before training. The default is 1e4.
target_replace_interval (int) – Number of steps between target network updates. The default is 5e3.
memory_capacity (int) – Size of replay buffer. The default is 1e6.
enable_double_dqn (bool) – Whether to use Double DQN. The default is False.
enable_dueling_dqn (bool) – Whether to use Dueling Network. The default is False.
enable_noisy_dqn (bool) – Whether to use Noisy Network. The default is False.
optimizer (tf.keras.optimizers.Optimizer) – Custom optimizer
batch_size (int) – Batch size. The default is 256.
discount (float) – Discount factor. The default is 0.99.
max_grad (float) – Maximum gradient. The default is 10.
gpu (int) – GPU id. -1 disables GPU. The default is 0.
- get_action(state, test=False, tensor=False)
Get action
- Parameters
state – Observation state
test (bool) – When False (default), policy returns an exploratory action.
tensor (bool) – When True, the return type is tf.Tensor.
- Returns
Selected action
- Return type
tf.Tensor or np.ndarray or float
- train(states, actions, next_states, rewards, done, weights=None)
Train DQN
- Parameters
states –
actions –
next_states –
rewards –
done –
weights (optional) – Weights for importance sampling
- compute_td_error(states, actions, next_states, rewards, dones)
Compute TD error
- Parameters
states –
actions –
next_states –
rewards –
dones –
- Returns
tf.Tensor: TD error
- static get_argument(parser=None)
Create or update argument parser for command line program
- Parameters
parser (argparse.ArgumentParser, optional) – argument parser
- Returns
argument parser
- Return type
argparse.ArgumentParser
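A hedged usage sketch, assuming the Trainer helper from tf2rl.experiments.trainer (as used in the project's examples, not documented in this module); CartPole-v0 stands in for any discrete-action environment:

    import gym
    from tf2rl.algos.categorical_dqn import CategoricalDQN
    from tf2rl.experiments.trainer import Trainer  # assumed helper

    parser = Trainer.get_argument()
    parser = CategoricalDQN.get_argument(parser)
    args = parser.parse_args()

    env = gym.make("CartPole-v0")
    test_env = gym.make("CartPole-v0")
    policy = CategoricalDQN(
        state_shape=env.observation_space.shape,
        action_dim=env.action_space.n,
        enable_double_dqn=args.enable_double_dqn,
        enable_dueling_dqn=args.enable_dueling_dqn,
        enable_noisy_dqn=args.enable_noisy_dqn,
        gpu=args.gpu)
    Trainer(policy, env, args, test_env=test_env)()  # run the training loop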
tf2rl.algos.curl_sac module
- class tf2rl.algos.curl_sac.CURL(*args, **kwargs)
Bases:
tf2rl.algos.sac_ae.SACAE
Contrastive Unsupervised Representations for Reinforcement Learning (CURL) Agent: https://arxiv.org/abs/2004.04136
Command Line Args:
--n-warmup (int): Number of warmup steps before training. The default is 1e4.
--batch-size (int): Batch size of training. The default is 32.
--gpu (int): GPU id. -1 disables GPU. The default is 0.
--memory-capacity (int): Replay Buffer size. The default is 1e5.
--alpha (float): Temperature parameter. The default is 0.2.
--auto-alpha: Automatic alpha tuning.
--stop-q-grad: Whether to stop gradients after the convolutional layers of the Encoder
- __init__(*args, **kwargs)
Initialize CURL
- Parameters
action_dim (int) –
obs_shape (iterable of int) – The default is (84, 84, 9).
n_conv_layers (int) – Number of convolutional layers at encoder. The default is 4.
n_conv_filters (int) – Number of filters in convolutional layers. The default is 32.
feature_dim (int) – Number of features after encoder. These features are treated as the SAC input. The default is 50.
tau_encoder (float) – Target network update rate for Encoder. The default is 0.05.
tau_critic (float) – Target network update rate for Critic. The default is 0.01.
auto_alpha (bool) – Automatic alpha tuning. The default is True.
lr_sac (float) – Learning rate for SAC. The default is 1e-3.
lr_encoder (float) – Learning rate for Encoder. The default is 1e-3.
lr_decoder (float) – Learning rate for Decoder. The default is 1e-3.
update_critic_target_freq (int) – The default is 2.
update_actor_freq (int) – The default is 2.
lr_alpha (float) – Learning rate for alpha. The default is 1e-4.
init_temperature (float) – Initial temperature. The default is 0.1.
stop_q_grad (bool) – Whether to stop gradient propagation after the encoder convolutional network. The default is False.
lambda_latent_val (float) – AE loss = REC loss + lambda_latent_val * latent loss. The default is 1e-6.
decoder_weight_lambda (float) – Weight decay of AdamW for Decoder. The default is 1e-7.
name (str) – Name of network. The default is "CURL".
max_action (float) –
actor_units (iterable of int) – Numbers of units at hidden layers of actor. The default is (256, 256).
critic_units (iterable of int) – Numbers of units at hidden layers of critic. The default is (256, 256).
alpha (float) – Temperature parameter. The default is 0.2.
n_warmup (int) – Number of warmup steps before training. The default is int(1e4).
memory_capacity (int) – Replay Buffer size. The default is int(1e6).
batch_size (int) – Batch size. The default is 256.
discount (float) – Discount factor. The default is 0.99.
max_grad (float) – Maximum gradient. The default is 10.
gpu (int) – GPU id. -1 disables GPU. The default is 0.
- train(states, actions, next_states, rewards, dones, weights=None)
Train CURL
- Parameters
states –
actions –
next_states –
rewards –
dones –
weights (optional) – Weights for importance sampling
tf2rl.algos.d2rl_sac module
- class tf2rl.algos.d2rl_sac.DenseCriticQ(*args, **kwargs)
Bases:
tf2rl.algos.sac.CriticQ
- call(states, actions)
Calls the model on new inputs.
In this case call just reapplies all ops in the graph to the new inputs (e.g. build a new computational graph from the provided inputs).
- Parameters
inputs – A tensor or list of tensors.
training – Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode.
mask – A mask or list of masks. A mask can be either a tensor or None (no mask).
- Returns
A tensor if there is a single output, or a list of tensors if there is more than one output.
- class tf2rl.algos.d2rl_sac.DenseGaussianActor(*args, **kwargs)
- class tf2rl.algos.d2rl_sac.D2RLSAC(*args, **kwargs)
Bases:
tf2rl.algos.sac.SAC
- __init__(*args, **kwargs)
Initialize SAC
- Parameters
state_shape (iterable of int) –
action_dim (int) –
name (str) – Name of network. The default is "SAC".
max_action (float) –
lr (float) – Learning rate. The default is 3e-4.
lr_alpha (float) – Learning rate for alpha. The default is 3e-4.
actor_units (iterable of int) – Numbers of units at hidden layers of actor. The default is (256, 256).
critic_units (iterable of int) – Numbers of units at hidden layers of critic. The default is (256, 256).
tau (float) – Target network update rate.
alpha (float) – Temperature parameter. The default is 0.2.
auto_alpha (bool) – Automatic alpha tuning.
init_temperature (float) – Initial temperature
n_warmup (int) – Number of warmup steps before training. The default is int(1e4).
memory_capacity (int) – Replay Buffer size. The default is int(1e6).
batch_size (int) – Batch size. The default is 256.
discount (float) – Discount factor. The default is 0.99.
max_grad (float) – Maximum gradient. The default is 10.
gpu (int) – GPU id. -1 disables GPU. The default is 0.
tf2rl.algos.ddpg module
- class tf2rl.algos.ddpg.Actor(*args, **kwargs)
Bases:
tensorflow.python.keras.engine.training.Model
- __init__(state_shape, action_dim, max_action, units=(400, 300), name='Actor')
- call(inputs)
Calls the model on new inputs.
In this case call just reapplies all ops in the graph to the new inputs (e.g. build a new computational graph from the provided inputs).
- Parameters
inputs – A tensor or list of tensors.
training – Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode.
mask – A mask or list of masks. A mask can be either a tensor or None (no mask).
- Returns
A tensor if there is a single output, or a list of tensors if there is more than one output.
- class tf2rl.algos.ddpg.Critic(*args, **kwargs)
Bases:
tensorflow.python.keras.engine.training.Model
- __init__(state_shape, action_dim, units=(400, 300), name='Critic')
- call(states, actions)
Calls the model on new inputs.
In this case call just reapplies all ops in the graph to the new inputs (e.g. build a new computational graph from the provided inputs).
- Parameters
inputs – A tensor or list of tensors.
training – Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode.
mask – A mask or list of masks. A mask can be either a tensor or None (no mask).
- Returns
A tensor if there is a single output, or a list of tensors if there is more than one output.
- class tf2rl.algos.ddpg.DDPG(*args, **kwargs)
Bases:
tf2rl.algos.policy_base.OffPolicyAgent
DDPG agent: https://arxiv.org/abs/1509.02971
Command Line Args:
--n-warmup (int): Number of warmup steps before training. The default is 1e4.
--batch-size (int): Batch size for training. The default is 32.
--gpu (int): GPU id. -1 disables GPU. The default is 0.
--memory-capacity (int): Replay Buffer size. The default is 1e6.
- __init__(state_shape, action_dim, name='DDPG', max_action=1.0, lr_actor=0.001, lr_critic=0.001, actor_units=(400, 300), critic_units=(400, 300), sigma=0.1, tau=0.005, n_warmup=10000, memory_capacity=1000000, **kwargs)
Initialize DDPG agent
- Parameters
state_shape (iterable of int) –
action_dim (int) –
name (str) – Name of agent. The default is "DDPG".
max_action (float) – Size of maximum action (-max_action <= action <= max_action). The default is 1.
lr_actor (float) – Learning rate for actor network. The default is 0.001.
lr_critic (float) – Learning rate for critic network. The default is 0.001.
actor_units (iterable of int) – Number of units at hidden layers of actor.
critic_units (iterable of int) – Number of units at hidden layers of critic.
sigma (float) – Standard deviation of Gaussian noise. The default is 0.1.
tau (float) – Weight update ratio for target network: target = (1-tau)*target + tau*network. The default is 0.005.
n_warmup (int) – Number of warmup steps before training. The default is 1e4.
memory_capacity (int) – Replay Buffer size. The default is 1e6.
batch_size (int) – Batch size. The default is 256.
discount (float) – Discount factor. The default is 0.99.
max_grad (float) – Maximum gradient. The default is 10.
gpu (int) – GPU id. -1 disables GPU. The default is 0.
- get_action(state, test=False, tensor=False)
Get action
- Parameters
state – Observation state
test (bool) – When False (default), policy returns an exploratory action.
tensor (bool) – When True, the return type is tf.Tensor.
- Returns
Selected action
- Return type
tf.Tensor or np.ndarray or float
- train(states, actions, next_states, rewards, done, weights=None)
Train DDPG
- Parameters
states –
actions –
next_states –
rewards –
done –
weights (optional) – Weights for importance sampling
- compute_td_error(states, actions, next_states, rewards, dones)
Compute TD error
- Parameters
states –
actions –
next_states –
rewards –
dones –
- Returns
tf.Tensor: TD error
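For reference, the sketch below mirrors the quick-start example in the tf2rl README; Trainer lives in tf2rl.experiments.trainer and is assumed here rather than documented in this module:

    import gym
    from tf2rl.algos.ddpg import DDPG
    from tf2rl.experiments.trainer import Trainer  # assumed helper, as in the README

    parser = Trainer.get_argument()
    parser = DDPG.get_argument(parser)
    args = parser.parse_args()

    env = gym.make("Pendulum-v0")
    test_env = gym.make("Pendulum-v0")
    policy = DDPG(
        state_shape=env.observation_space.shape,
        action_dim=env.action_space.high.size,
        max_action=env.action_space.high[0],
        gpu=-1,        # run on CPU
        batch_size=32,
        n_warmup=500)  # short warmup for a toy task
    trainer = Trainer(policy, env, args, test_env=test_env)
    trainer()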
tf2rl.algos.dqn module
- class tf2rl.algos.dqn.QFunc(*args, **kwargs)
Bases:
tensorflow.python.keras.engine.training.Model
- __init__(state_shape, action_dim, units=(32, 32), name='QFunc', enable_dueling_dqn=False, enable_noisy_dqn=False)
- call(inputs)
Calls the model on new inputs.
In this case call just reapplies all ops in the graph to the new inputs (e.g. build a new computational graph from the provided inputs).
- Parameters
inputs – A tensor or list of tensors.
training – Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode.
mask – A mask or list of masks. A mask can be either a tensor or None (no mask).
- Returns
A tensor if there is a single output, or a list of tensors if there is more than one output.
- class tf2rl.algos.dqn.DQN(*args, **kwargs)
Bases:
tf2rl.algos.policy_base.OffPolicyAgent
DQN Agent: https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf
DQN supports the following algorithms:
Dueling Network: https://arxiv.org/abs/1511.06581
Noisy Network: https://arxiv.org/abs/1706.10295
Command Line Args:
--n-warmup (int): Number of warmup steps before training. The default is 1e4.
--batch-size (int): Batch size of training. The default is 32.
--gpu (int): GPU id. -1 disables GPU. The default is 0.
--memory-capacity (int): Replay Buffer size. The default is 1e6.
--enable-double-dqn: Enable DDQN
--enable-dueling-dqn: Enable Dueling Network
--enable-noisy-dqn: Enable Noisy Network
- __init__(state_shape, action_dim, q_func=None, name='DQN', lr=0.001, adam_eps=1e-07, units=(32, 32), epsilon=0.1, epsilon_min=None, epsilon_decay_step=1000000, n_warmup=10000, target_replace_interval=5000, memory_capacity=1000000, enable_double_dqn=False, enable_dueling_dqn=False, enable_noisy_dqn=False, optimizer=None, **kwargs)
Initialize DQN agent
- Parameters
state_shape (iterable of int) – Observation space shape
action_dim (int) – Dimension of discrete action
q_func (QFunc) – Custom Q function class. If None (default), the Q function is constructed with QFunc.
name (str) – Name of agent. The default is "DQN".
lr (float) – Learning rate. The default is 0.001.
adam_eps (float) – Epsilon for Adam. The default is 1e-7.
units (iterable of int) – Units of hidden layers. The default is (32, 32).
epsilon (float) – Initial epsilon of e-greedy. The default is 0.1.
epsilon_min (float) – Minimum epsilon after decay.
epsilon_decay_step (int) – Number of steps over which epsilon decays. The default is 1e6.
n_warmup (int) – Number of warmup steps before training. The default is 1e4.
target_replace_interval (int) – Number of steps between target network updates. The default is 5e3.
memory_capacity (int) – Size of replay buffer. The default is 1e6.
enable_double_dqn (bool) – Whether to use Double DQN. The default is False.
enable_dueling_dqn (bool) – Whether to use Dueling Network. The default is False.
enable_noisy_dqn (bool) – Whether to use Noisy Network. The default is False.
optimizer (tf.keras.optimizers.Optimizer) – Custom optimizer
batch_size (int) – Batch size. The default is 256.
discount (float) – Discount factor. The default is 0.99.
max_grad (float) – Maximum gradient. The default is 10.
gpu (int) – GPU id. -1 disables GPU. The default is 0.
- get_action(state, test=False, tensor=False)
Get action
- Parameters
state – Observation state
test (bool) – When False (default), policy returns an exploratory action.
tensor (bool) – When True, the return type is tf.Tensor.
- Returns
Selected action
- Return type
tf.Tensor or np.ndarray or float
- train(states, actions, next_states, rewards, done, weights=None)
Train DQN
- Parameters
states –
actions –
next_states –
rewards –
done –
weights (optional) – Weights for importance sampling
- compute_td_error(states, actions, next_states, rewards, dones)
Compute TD error
- Parameters
states –
actions –
next_states –
rewards –
dones –
- Returns
tf.Tensor: TD error
- static get_argument(parser=None)
Create or update argument parser for command line program
- Parameters
parser (argparse.ArgumentParser, optional) – argument parser
- Returns
argument parser
- Return type
argparse.ArgumentParser
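The get_action interface can also be exercised directly, as in this small sketch (CartPole-v0 is an arbitrary discrete-action environment):

    import gym
    from tf2rl.algos.dqn import DQN

    env = gym.make("CartPole-v0")
    policy = DQN(
        state_shape=env.observation_space.shape,
        action_dim=env.action_space.n,
        enable_double_dqn=True,
        gpu=-1)

    obs = env.reset()
    exploratory = policy.get_action(obs)        # epsilon-greedy during training
    greedy = policy.get_action(obs, test=True)  # greedy action at test time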
tf2rl.algos.gaifo module
- class tf2rl.algos.gaifo.Discriminator(*args, **kwargs)
Bases:
tf2rl.algos.gail.Discriminator
- __init__(state_shape, units=(32, 32), enable_sn=False, output_activation='sigmoid', name='Discriminator')
- class tf2rl.algos.gaifo.GAIfO(*args, **kwargs)
Bases:
tf2rl.algos.gail.GAIL
Generative Adversarial Imitation from Observation (GAIfO) Agent: https://arxiv.org/abs/1807.06158
Command Line Args:
--n-warmup (int): Number of warmup steps before training. The default is 1e4.
--batch-size (int): Batch size of training. The default is 32.
--gpu (int): GPU id. -1 disables GPU. The default is 0.
--memory-capacity (int): Replay Buffer size. The default is 1e4.
--enable-sn: Enable Spectral Normalization
- __init__(state_shape, units=(32, 32), lr=0.001, enable_sn=False, name='GAIfO', **kwargs)
Initialize GAIfO
- Parameters
state_shape (iterable of int) –
units (iterable of int) – The default is (32, 32).
lr (float) – Learning rate. The default is 0.001.
enable_sn (bool) – Whether to enable Spectral Normalization. The default is False.
name (str) – The default is "GAIfO".
- train(agent_states, agent_next_states, expert_states, expert_next_states, **kwargs)
Train GAIfO
- Parameters
agent_states –
agent_next_states –
expert_states –
expert_next_states –
- inference(states, actions, next_states)
Infer Reward with GAIfO
- Parameters
states –
actions –
next_states –
- Returns
Reward
- Return type
tf.Tensor
tf2rl.algos.gail module
- class tf2rl.algos.gail.Discriminator(*args, **kwargs)
Bases:
tensorflow.python.keras.engine.training.Model
- __init__(state_shape, action_dim, units=(32, 32), enable_sn=False, output_activation='sigmoid', name='Discriminator')
- call(inputs)
Calls the model on new inputs.
In this case call just reapplies all ops in the graph to the new inputs (e.g. build a new computational graph from the provided inputs).
- Parameters
inputs – A tensor or list of tensors.
training – Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode.
mask – A mask or list of masks. A mask can be either a tensor or None (no mask).
- Returns
A tensor if there is a single output, or a list of tensors if there is more than one output.
- compute_reward(inputs)
- class tf2rl.algos.gail.GAIL(*args, **kwargs)
Bases:
tf2rl.algos.policy_base.IRLPolicy
Generative Adversarial Imitation Learning (GAIL) Agent: https://arxiv.org/abs/1606.03476
Command Line Args:
--n-warmup (int): Number of warmup steps before training. The default is 1e4.
--batch-size (int): Batch size of training. The default is 32.
--gpu (int): GPU id. -1 disables GPU. The default is 0.
--memory-capacity (int): Replay Buffer size. The default is 1e4.
--enable-sn: Enable Spectral Normalization
- __init__(state_shape, action_dim, units=[32, 32], lr=0.001, enable_sn=False, name='GAIL', **kwargs)
Initialize GAIL
- Parameters
state_shape (iterable of int) –
action_dim (int) –
units (iterable of int) – The default is [32, 32].
lr (float) – Learning rate. The default is 0.001.
enable_sn (bool) – Whether to enable Spectral Normalization. The default is False.
name (str) – The default is "GAIL".
- train(agent_states, agent_acts, expert_states, expert_acts, **kwargs)
Train GAIL
- Parameters
agent_states –
agent_acts –
expert_states –
expert_acts –
- inference(states, actions, next_states)
Infer Reward with GAIL
- Parameters
states –
actions –
next_states –
- Returns
Reward
- Return type
tf.Tensor
- static get_argument(parser=None)
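A self-contained sketch of the train/inference cycle on random toy batches; in practice the agent batches come from policy rollouts and the expert batches from recorded demonstrations:

    import numpy as np
    from tf2rl.algos.gail import GAIL

    state_dim, action_dim, batch = 3, 1, 32
    irl = GAIL(state_shape=(state_dim,), action_dim=action_dim,
               units=[32, 32], gpu=-1)

    # Toy stand-ins for rollout and demonstration batches.
    agent_states = np.random.rand(batch, state_dim).astype(np.float32)
    agent_acts = np.random.rand(batch, action_dim).astype(np.float32)
    expert_states = np.random.rand(batch, state_dim).astype(np.float32)
    expert_acts = np.random.rand(batch, action_dim).astype(np.float32)
    next_states = np.random.rand(batch, state_dim).astype(np.float32)

    irl.train(agent_states, agent_acts, expert_states, expert_acts)
    rewards = irl.inference(agent_states, agent_acts, next_states)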
tf2rl.algos.policy_base module
- class tf2rl.algos.policy_base.Policy(*args, **kwargs)
Bases:
tensorflow.python.keras.engine.training.Model
- __init__(name, memory_capacity, update_interval=1, batch_size=256, discount=0.99, n_warmup=0, max_grad=10.0, n_epoch=1, gpu=0)
- get_action(observation, test=False)
- static get_argument(parser=None)
- class tf2rl.algos.policy_base.OnPolicyAgent(*args, **kwargs)
Bases:
tf2rl.algos.policy_base.Policy
Base class for on-policy agents
- __init__(horizon=2048, lam=0.95, enable_gae=True, normalize_adv=True, entropy_coef=0.01, vfunc_coef=1.0, **kwargs)
- static get_argument(parser=None)
- class tf2rl.algos.policy_base.OffPolicyAgent(*args, **kwargs)
Bases:
tf2rl.algos.policy_base.Policy
Base class for off-policy agents
- __init__(memory_capacity, **kwargs)
- static get_argument(parser=None)
- class tf2rl.algos.policy_base.IRLPolicy(*args, **kwargs)
Bases:
tf2rl.algos.policy_base.Policy
- __init__(n_training=1, memory_capacity=0, **kwargs)
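These base classes mainly fix the constructor contract and the get_action/get_argument hooks. A hypothetical minimal subclass, just to illustrate the shape of that contract (the class and its behavior are illustrative assumptions, not part of tf2rl):

    import numpy as np
    from tf2rl.algos.policy_base import OffPolicyAgent

    class RandomAgent(OffPolicyAgent):
        """Hypothetical agent that ignores observations and acts randomly."""

        def __init__(self, action_dim, memory_capacity=int(1e6), **kwargs):
            super().__init__(memory_capacity=memory_capacity,
                             name="RandomAgent", **kwargs)
            self._action_dim = action_dim

        def get_action(self, observation, test=False):
            # A real agent would map the observation through its networks here.
            return np.random.uniform(-1., 1., size=self._action_dim)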
tf2rl.algos.ppo module
- class tf2rl.algos.ppo.PPO(*args, **kwargs)
Bases:
tf2rl.algos.vpg.VPG
Proximal Policy Optimization (PPO) Agent: https://arxiv.org/abs/1707.06347
Command Line Args:
--batch-size (int): Batch size of training. The default is 32.
--gpu (int): GPU id. -1 disables GPU. The default is 0.
--horizon (int): The default is 2048.
--normalize_adv: Normalize Advantage.
--enable-gae: Enable GAE.
- __init__(clip=True, clip_ratio=0.2, name='PPO', **kwargs)
Initialize PPO
- Parameters
clip (bool) – Whether to clip or not. The default is True.
clip_ratio (float) – The probability ratio is clipped between 1-clip_ratio and 1+clip_ratio. The default is 0.2.
name (str) – Name of agent. The default is "PPO".
state_shape (iterable of int) –
action_dim (int) –
is_discrete (bool) –
actor –
critic –
actor_critic –
max_action (float) – Maximum action size.
actor_units (iterable of int) – Numbers of units at hidden layers of actor. The default is (256, 256).
critic_units (iterable of int) – Numbers of units at hidden layers of critic. The default is (256, 256).
lr_actor (float) – Learning rate of actor. The default is 1e-3.
lr_critic (float) – Learning rate of critic. The default is 3e-3.
hidden_activation_actor (str) – Activation for actor. The default is "relu".
hidden_activation_critic (str) – Activation for critic. The default is "relu".
horizon (int) – Number of steps of online episode horizon. The horizon must be a multiple of batch_size. The default is 2048.
enable_gae (bool) – Enable GAE. The default is True.
normalize_adv (bool) – Normalize Advantage. The default is True.
entropy_coef (float) – Entropy coefficient. The default is 0.01.
vfunc_coef (float) – Mixing ratio factor for actor and critic: actor_loss + vfunc_coef*critic_loss.
batch_size (int) – Batch size. The default is 256.
- train(states, actions, advantages, logp_olds, returns)
Train PPO
- Parameters
states –
actions –
advantages –
logp_olds –
returns –
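A constructor sketch for a continuous-action task; note the horizon/batch_size constraint from the parameter list above (Pendulum-v0 is an arbitrary choice):

    import gym
    from tf2rl.algos.ppo import PPO

    env = gym.make("Pendulum-v0")
    policy = PPO(
        state_shape=env.observation_space.shape,
        action_dim=env.action_space.high.size,
        is_discrete=False,
        max_action=env.action_space.high[0],
        clip_ratio=0.2,
        horizon=2048,
        batch_size=64,  # horizon must be a multiple of batch_size
        gpu=-1)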
tf2rl.algos.sac module
- class tf2rl.algos.sac.CriticQ(*args, **kwargs)
Bases:
tensorflow.python.keras.engine.training.Model
- __init__(state_shape, action_dim, critic_units=(256, 256), name='qf')
- call(states, actions)
Calls the model on new inputs.
In this case call just reapplies all ops in the graph to the new inputs (e.g. build a new computational graph from the provided inputs).
- Parameters
inputs – A tensor or list of tensors.
training – Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode.
mask – A mask or list of masks. A mask can be either a tensor or None (no mask).
- Returns
A tensor if there is a single output, or a list of tensors if there is more than one output.
- class tf2rl.algos.sac.SAC(*args, **kwargs)
Bases:
tf2rl.algos.policy_base.OffPolicyAgent
Soft Actor-Critic (SAC) Agent: https://arxiv.org/abs/1801.01290
Command Line Args:
--n-warmup (int): Number of warmup steps before training. The default is 1e4.
--batch-size (int): Batch size of training. The default is 32.
--gpu (int): GPU id. -1 disables GPU. The default is 0.
--memory-capacity (int): Replay Buffer size. The default is 1e6.
--alpha (float): Temperature parameter. The default is 0.2.
--auto-alpha: Automatic alpha tuning.
- __init__(state_shape, action_dim, name='SAC', max_action=1.0, lr=0.0003, lr_alpha=0.0003, actor_units=(256, 256), critic_units=(256, 256), tau=0.005, alpha=0.2, auto_alpha=False, init_temperature=None, n_warmup=10000, memory_capacity=1000000, **kwargs)
Initialize SAC
- Parameters
state_shape (iterable of int) –
action_dim (int) –
name (str) – Name of network. The default is "SAC".
max_action (float) –
lr (float) – Learning rate. The default is 3e-4.
lr_alpha (float) – Learning rate for alpha. The default is 3e-4.
actor_units (iterable of int) – Numbers of units at hidden layers of actor. The default is (256, 256).
critic_units (iterable of int) – Numbers of units at hidden layers of critic. The default is (256, 256).
tau (float) – Target network update rate.
alpha (float) – Temperature parameter. The default is 0.2.
auto_alpha (bool) – Automatic alpha tuning.
init_temperature (float) – Initial temperature
n_warmup (int) – Number of warmup steps before training. The default is int(1e4).
memory_capacity (int) – Replay Buffer size. The default is int(1e6).
batch_size (int) – Batch size. The default is 256.
discount (float) – Discount factor. The default is 0.99.
max_grad (float) – Maximum gradient. The default is 10.
gpu (int) – GPU id. -1 disables GPU. The default is 0.
- get_action(state, test=False)
Get action
- Parameters
state – Observation state
test (bool) – When False (default), policy returns an exploratory action.
- Returns
Selected action
- Return type
tf.Tensor or float
- train(states, actions, next_states, rewards, dones, weights=None)
Train SAC
- Parameters
states –
actions –
next_states –
rewards –
dones –
weights (optional) – Weights for importance sampling
- compute_td_error(states, actions, next_states, rewards, dones)
Compute TD error
- Parameters
states –
actions –
next_states –
rewards –
dones –
- Returns
np.ndarray: TD error
- static get_argument(parser=None)
Create or update argument parser for command line program
- Parameters
parser (argparse.ArgumentParser, optional) – argument parser
- Returns
argument parser
- Return type
argparse.ArgumentParser
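A sketch of constructing the agent and querying actions; the test flag switches between the stochastic policy and its deterministic evaluation action (Pendulum-v0 is an arbitrary choice):

    import gym
    from tf2rl.algos.sac import SAC

    env = gym.make("Pendulum-v0")
    policy = SAC(
        state_shape=env.observation_space.shape,
        action_dim=env.action_space.high.size,
        max_action=env.action_space.high[0],
        auto_alpha=True,  # tune the temperature automatically
        gpu=-1)

    obs = env.reset()
    explore_act = policy.get_action(obs)          # sampled (exploratory) action
    eval_act = policy.get_action(obs, test=True)  # deterministic action for evaluation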
tf2rl.algos.sac_ae module
- class tf2rl.algos.sac_ae.SACAE(*args, **kwargs)
Bases:
tf2rl.algos.sac.SAC
SAC+AE Agent: https://arxiv.org/abs/1910.01741
Command Line Args:
--n-warmup (int): Number of warmup steps before training. The default is 1e4.
--batch-size (int): Batch size of training. The default is 32.
--gpu (int): GPU id. -1 disables GPU. The default is 0.
--memory-capacity (int): Replay Buffer size. The default is 1e5.
--alpha (float): Temperature parameter. The default is 0.2.
--auto-alpha: Automatic alpha tuning.
--stop-q-grad: Whether to stop gradients after the convolutional layers of the Encoder
- __init__(action_dim, obs_shape=(84, 84, 9), n_conv_layers=4, n_conv_filters=32, feature_dim=50, tau_encoder=0.05, tau_critic=0.01, auto_alpha=True, lr_sac=0.001, lr_encoder=0.001, lr_decoder=0.001, update_critic_target_freq=2, update_actor_freq=2, lr_alpha=0.0001, init_temperature=0.1, stop_q_grad=False, lambda_latent_val=1e-06, decoder_weight_lambda=1e-07, skip_making_decoder=False, name='SACAE', **kwargs)
Initialize SAC+AE
- Parameters
action_dim (int) –
obs_shape (iterable of int) – The default is (84, 84, 9).
n_conv_layers (int) – Number of convolutional layers at encoder. The default is 4.
n_conv_filters (int) – Number of filters in convolutional layers. The default is 32.
feature_dim (int) – Number of features after encoder. These features are treated as the SAC input. The default is 50.
tau_encoder (float) – Target network update rate for Encoder. The default is 0.05.
tau_critic (float) – Target network update rate for Critic. The default is 0.01.
auto_alpha (bool) – Automatic alpha tuning. The default is True.
lr_sac (float) – Learning rate for SAC. The default is 1e-3.
lr_encoder (float) – Learning rate for Encoder. The default is 1e-3.
lr_decoder (float) – Learning rate for Decoder. The default is 1e-3.
update_critic_target_freq (int) – The default is 2.
update_actor_freq (int) – The default is 2.
lr_alpha (float) – Learning rate for alpha. The default is 1e-4.
init_temperature (float) – Initial temperature. The default is 0.1.
stop_q_grad (bool) – Whether to stop gradient propagation after the encoder convolutional network. The default is False.
lambda_latent_val (float) – AE loss = REC loss + lambda_latent_val * latent loss. The default is 1e-6.
decoder_weight_lambda (float) – Weight decay of AdamW for Decoder. The default is 1e-7.
skip_making_decoder (bool) – Whether to skip making Decoder. The default is False.
name (str) – Name of network. The default is "SACAE".
max_action (float) –
actor_units (iterable of int) – Numbers of units at hidden layers of actor. The default is (256, 256).
critic_units (iterable of int) – Numbers of units at hidden layers of critic. The default is (256, 256).
alpha (float) – Temperature parameter. The default is 0.2.
n_warmup (int) – Number of warmup steps before training. The default is int(1e4).
memory_capacity (int) – Replay Buffer size. The default is int(1e6).
batch_size (int) – Batch size. The default is 256.
discount (float) – Discount factor. The default is 0.99.
max_grad (float) – Maximum gradient. The default is 10.
gpu (int) – GPU id. -1 disables GPU. The default is 0.
- get_action(state, test=False)
Get action
- Parameters
state – Observation state
test (bool) – When False (default), policy returns an exploratory action.
- Returns
Selected action
- Return type
tf.Tensor or float
Notes
When the input image has a different size, a cropped image is used.
- train(states, actions, next_states, rewards, dones, weights=None)
Train SAC+AE
- Parameters
states –
actions –
next_states –
rewards –
dones –
weights (optional) – Weights for importance sampling
- static get_argument(parser=None)
Create or update argument parser for command line program
- Parameters
parser (argparse.ArgumentParser, optional) – argument parser
- Returns
argument parser
- Return type
argparse.ArgumentParser
- compute_td_error(states, actions, next_states, rewards, dones)
Compute TD error
- Parameters
states –
actions –
next_states –
rewards –
dones –
- Returns
np.ndarray: TD error
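SAC+AE works on pixel observations. A constructor sketch assuming stacked frames as in the defaults above; the action dimension is a hypothetical placeholder for whatever task is being learned:

    from tf2rl.algos.sac_ae import SACAE

    # Three stacked 84x84 RGB frames -> obs_shape (84, 84, 9).
    policy = SACAE(
        action_dim=6,          # hypothetical action dimension
        obs_shape=(84, 84, 9),
        feature_dim=50,        # encoder output fed into SAC
        stop_q_grad=False,     # let critic gradients reach the conv encoder
        gpu=-1)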
tf2rl.algos.sac_discrete module
- class tf2rl.algos.sac_discrete.CriticQ(*args, **kwargs)
Bases:
tensorflow.python.keras.engine.training.Model
- Compared with the original (continuous) version of SAC, the output of the Q-function moves from Q: S x A -> R to Q: S -> R^|A|.
- __init__(state_shape, action_dim, critic_units=(256, 256), name='qf')
- call(states)
Calls the model on new inputs.
In this case call just reapplies all ops in the graph to the new inputs (e.g. build a new computational graph from the provided inputs).
- Parameters
inputs – A tensor or list of tensors.
training – Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode.
mask – A mask or list of masks. A mask can be either a tensor or None (no mask).
- Returns
A tensor if there is a single output, or a list of tensors if there is more than one output.
- class tf2rl.algos.sac_discrete.SACDiscrete(*args, **kwargs)
Bases:
tf2rl.algos.sac.SAC
- __init__(state_shape, action_dim, *args, actor_fn=None, critic_fn=None, target_update_interval=None, **kwargs)
Initialize SAC
- Parameters
state_shape (iterable of int) –
action_dim (int) –
name (str) – Name of network. The default is "SAC".
max_action (float) –
lr (float) – Learning rate. The default is 3e-4.
lr_alpha (float) – Learning rate for alpha. The default is 3e-4.
actor_units (iterable of int) – Numbers of units at hidden layers of actor. The default is (256, 256).
critic_units (iterable of int) – Numbers of units at hidden layers of critic. The default is (256, 256).
tau (float) – Target network update rate.
alpha (float) – Temperature parameter. The default is 0.2.
auto_alpha (bool) – Automatic alpha tuning.
init_temperature (float) – Initial temperature
n_warmup (int) – Number of warmup steps before training. The default is int(1e4).
memory_capacity (int) – Replay Buffer size. The default is int(1e6).
batch_size (int) – Batch size. The default is 256.
discount (float) – Discount factor. The default is 0.99.
max_grad (float) – Maximum gradient. The default is 10.
gpu (int) – GPU id. -1 disables GPU. The default is 0.
- train(states, actions, next_states, rewards, dones, weights=None)
Train SAC
- Parameters
states –
actions –
next_states –
rewards –
dones –
weights (optional) – Weights for importance sampling
- compute_td_error(states, actions, next_states, rewards, dones)
Compute TD error
- Parameters
states –
actions –
next_states –
rewards –
dones –
- Returns
np.ndarray: TD error
- static get_argument(parser=None)
Create or update argument parser for command line program
- Parameters
parser (argparse.ArgumentParser, optional) – argument parser
- Returns
argument parser
- Return type
argparse.ArgumentParser
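Because the Q-function maps S -> R^|A|, the agent takes the number of discrete actions as action_dim, as in this sketch (CartPole-v0 is an arbitrary choice):

    import gym
    from tf2rl.algos.sac_discrete import SACDiscrete

    env = gym.make("CartPole-v0")
    policy = SACDiscrete(
        state_shape=env.observation_space.shape,
        action_dim=env.action_space.n,  # number of discrete actions
        gpu=-1)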
tf2rl.algos.td3 module
- class tf2rl.algos.td3.Critic(*args, **kwargs)
Bases:
tensorflow.python.keras.engine.training.Model
- __init__(state_shape, action_dim, units=(400, 300), name='Critic')
- call(states, actions)
Calls the model on new inputs.
In this case call just reapplies all ops in the graph to the new inputs (e.g. build a new computational graph from the provided inputs).
- Parameters
inputs – A tensor or list of tensors.
training – Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode.
mask – A mask or list of masks. A mask can be either a tensor or None (no mask).
- Returns
A tensor if there is a single output, or a list of tensors if there is more than one output.
- class tf2rl.algos.td3.TD3(*args, **kwargs)
Bases:
tf2rl.algos.ddpg.DDPG
Twin Delayed Deep Deterministic policy gradient (TD3) Agent: https://arxiv.org/abs/1802.09477
Command Line Args:
--n-warmup (int): Number of warmup steps before training. The default is 1e4.
--batch-size (int): Batch size for training. The default is 32.
--gpu (int): GPU id. -1 disables GPU. The default is 0.
--memory-capacity (int): Replay Buffer size. The default is 1e6.
- __init__(state_shape, action_dim, name='TD3', actor_update_freq=2, policy_noise=0.2, noise_clip=0.5, critic_units=(400, 300), **kwargs)
Initialize TD3
- Parameters
state_shape (iterable of int) – Observation state shape
action_dim (int) – Action dimension
name (str) – Network name. The default is "TD3".
actor_update_freq (int) – Number of critic updates per actor update.
policy_noise (float) –
noise_clip (float) –
critic_units (iterable of int) – Numbers of units at hidden layers of critic. The default is (400, 300).
max_action (float) – Size of maximum action (-max_action <= action <= max_action). The default is 1.
lr_actor (float) – Learning rate for actor network. The default is 0.001.
lr_critic (float) – Learning rate for critic network. The default is 0.001.
actor_units (iterable of int) – Number of units at hidden layers of actor.
sigma (float) – Standard deviation of Gaussian noise. The default is 0.1.
tau (float) – Weight update ratio for target network: target = (1-tau)*target + tau*network. The default is 0.005.
n_warmup (int) – Number of warmup steps before training. The default is 1e4.
memory_capacity (int) – Replay Buffer size. The default is 1e6.
batch_size (int) – Batch size. The default is 256.
discount (float) – Discount factor. The default is 0.99.
max_grad (float) – Maximum gradient. The default is 10.
gpu (int) – GPU id. -1 disables GPU. The default is 0.
- compute_td_error(states, actions, next_states, rewards, dones)
Compute TD error
- Parameters
states –
actions –
next_states –
rewards –
dones –
- Returns
Sum of two TD errors.
- Return type
np.ndarray
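A constructor sketch highlighting the TD3-specific knobs, delayed actor updates and clipped target-policy smoothing noise (Pendulum-v0 is an arbitrary choice):

    import gym
    from tf2rl.algos.td3 import TD3

    env = gym.make("Pendulum-v0")
    policy = TD3(
        state_shape=env.observation_space.shape,
        action_dim=env.action_space.high.size,
        max_action=env.action_space.high[0],
        actor_update_freq=2,  # two critic updates per actor update
        policy_noise=0.2,     # std of target-policy smoothing noise
        noise_clip=0.5,       # noise clipped to [-0.5, 0.5]
        gpu=-1)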
tf2rl.algos.vail module
- class tf2rl.algos.vail.Discriminator(*args, **kwargs)
Bases:
tensorflow.python.keras.engine.training.Model
- LOG_SIG_CAP_MAX = 2
- LOG_SIG_CAP_MIN = -20
- EPS = 1e-06
- __init__(state_shape, action_dim, units=(32, 32), n_latent_unit=32, enable_sn=False, name='Discriminator')
- call(inputs)
Calls the model on new inputs.
In this case call just reapplies all ops in the graph to the new inputs (e.g. build a new computational graph from the provided inputs).
- Parameters
inputs – A tensor or list of tensors.
training – Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode.
mask – A mask or list of masks. A mask can be either a tensor or None (no mask).
- Returns
A tensor if there is a single output, or a list of tensors if there is more than one output.
- compute_reward(inputs)
- class tf2rl.algos.vail.VAIL(*args, **kwargs)
Bases:
tf2rl.algos.gail.GAIL
Variational Adversarial Imitation Learning (VAIL) Agent: https://arxiv.org/abs/1810.00821
Command Line Args:
--n-warmup (int): Number of warmup steps before training. The default is 1e4.
--batch-size (int): Batch size of training. The default is 32.
--gpu (int): GPU id. -1 disables GPU. The default is 0.
--memory-capacity (int): Replay Buffer size. The default is 1e4.
--enable-sn: Enable Spectral Normalization
- __init__(state_shape, action_dim, units=(32, 32), n_latent_unit=32, lr=5e-05, kl_target=0.5, reg_param=0.0, enable_sn=False, enable_gp=False, name='VAIL', **kwargs)
Initialize VAIL
- Parameters
state_shape (iterable of int) –
action_dim (int) –
units (iterable of int) – The default is (32, 32).
n_latent_unit (int) – Number of latent units. The default is 32.
lr (float) – Learning rate. The default is 5e-5.
kl_target (float) – The default is 0.5.
reg_param (float) – The default is 0.
enable_sn (bool) – Whether to enable Spectral Normalization. The default is False.
enable_gp (bool) – Whether the loss function includes a gradient penalty. The default is False.
name (str) – The default is "VAIL".
- train(agent_states, agent_acts, expert_states, expert_acts, **kwargs)
Train VAIL
- Parameters
agent_states –
agent_acts –
expert_states –
expert_acts –
tf2rl.algos.vpg module
- class tf2rl.algos.vpg.CriticV(*args, **kwargs)
Bases:
tensorflow.python.keras.engine.training.Model
- __init__(state_shape, units, name='critic_v', hidden_activation='relu')
- call(inputs)
Calls the model on new inputs.
In this case call just reapplies all ops in the graph to the new inputs (e.g. build a new computational graph from the provided inputs).
- Parameters
inputs – A tensor or list of tensors.
training – Boolean or boolean scalar tensor, indicating whether to run the Network in training mode or inference mode.
mask – A mask or list of masks. A mask can be either a tensor or None (no mask).
- Returns
A tensor if there is a single output, or a list of tensors if there is more than one output.
- class tf2rl.algos.vpg.VPG(*args, **kwargs)
Bases:
tf2rl.algos.policy_base.OnPolicyAgent
VPG Agent: https://proceedings.neurips.cc/paper/1999/file/464d828b85b0bed98e80ade0a5c43b0f-Paper.pdf
Command Line Args:
--batch-size (int): Batch size of training. The default is 32.
--gpu (int): GPU id. -1 disables GPU. The default is 0.
--horizon (int): The default is 2048.
--normalize_adv: Normalize Advantage.
--enable-gae: Enable GAE.
- __init__(state_shape, action_dim, is_discrete, actor=None, critic=None, actor_critic=None, max_action=1.0, actor_units=(256, 256), critic_units=(256, 256), lr_actor=0.001, lr_critic=0.003, hidden_activation_actor='relu', hidden_activation_critic='relu', name='VPG', **kwargs)
Initialize VPG
- Parameters
state_shape (iterable of int) –
action_dim (int) –
is_discrete (bool) –
actor –
critic –
actor_critic –
max_action (float) – Maximum action size.
actor_units (iterable of int) – Numbers of units at hidden layers of actor. The default is (256, 256).
critic_units (iterable of int) – Numbers of units at hidden layers of critic. The default is (256, 256).
lr_actor (float) – Learning rate of actor. The default is 1e-3.
lr_critic (float) – Learning rate of critic. The default is 3e-3.
hidden_activation_actor (str) – Activation for actor. The default is "relu".
hidden_activation_critic (str) – Activation for critic. The default is "relu".
name (str) – Name of agent. The default is "VPG".
horizon (int) – Number of steps of online episode horizon. The horizon must be a multiple of batch_size. The default is 2048.
enable_gae (bool) – Enable GAE. The default is True.
normalize_adv (bool) – Normalize Advantage. The default is True.
entropy_coef (float) – Entropy coefficient. The default is 0.01.
vfunc_coef (float) – Mixing ratio factor for actor and critic: actor_loss + vfunc_coef*critic_loss.
batch_size (int) – Batch size. The default is 256.
- get_action(state, test=False)
Get action and probability
- Parameters
state – Observation state
test (bool) – When False (default), policy returns an exploratory action.
- Returns
np.ndarray or float: Selected action
np.ndarray or float: Log(p)
- Return type
np.ndarray or float
- get_action_and_val(state, test=False)
Get action, probability, and critic value
- Parameters
state – Observation state
test (bool) – When False (default), policy returns an exploratory action.
- Returns
np.ndarray: Selected action
np.ndarray: Log(p)
np.ndarray: Critic value
- Return type
np.ndarray
- train(states, actions, advantages, logp_olds, returns)
Train VPG
- Parameters
states –
actions –
advantages –
logp_olds –
returns –
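Finally, a constructor-and-query sketch for this on-policy agent; per the docstrings above, get_action returns both the sampled action and its log-probability (Pendulum-v0 is an arbitrary choice):

    import gym
    from tf2rl.algos.vpg import VPG

    env = gym.make("Pendulum-v0")
    policy = VPG(
        state_shape=env.observation_space.shape,
        action_dim=env.action_space.high.size,
        is_discrete=False,
        max_action=env.action_space.high[0],
        horizon=512,
        batch_size=64,  # horizon must be a multiple of batch_size
        gpu=-1)

    obs = env.reset()
    action, logp = policy.get_action(obs)  # sampled action and Log(p)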