tf2rl.experiments package

Submodules

tf2rl.experiments.irl_trainer module

class tf2rl.experiments.irl_trainer.IRLTrainer(policy, env, args, irl, expert_obs, expert_next_obs, expert_act, test_env=None)

Bases: tf2rl.experiments.trainer.Trainer

Trainer class for inverse reinforcement learning

Command Line Args:

  • --max-steps (int): The maximum steps for training. The default is int(1e6)

  • --episode-max-steps (int): The maximum steps for an episode. The default is int(1e3)

  • --n-experiments (int): Number of experiments. The default is 1

  • --show-progress: Call render function during training

  • --save-model-interval (int): Interval to save model. The default is int(1e4)

  • --save-summary-interval (int): Interval to save summary. The default is int(1e3)

  • --model-dir (str): Directory to restore model.

  • --dir-suffix (str): Suffix for directory that stores results.

  • --normalize-obs: Whether to normalize observations

  • --logdir (str): Output directory name. The default is "results"

  • --evaluate: Whether to evaluate the trained model

  • --test-interval (int): Interval to evaluate trained model. The default is int(1e4)

  • --show-test-progress: Call render function during evaluation.

  • --test-episodes (int): Number of episodes at test. The default is 5

  • --save-test-path: Save trajectories of evaluation.

  • --show-test-images: Show input images to neural networks when an episode finishes

  • --save-test-movie: Save rendering results.

  • --use-prioritized-rb: Use prioritized experience replay

  • --use-nstep-rb: Use Nstep experience replay

  • --n-step (int): Number of steps for nstep experience reward. The default is 4

  • --logging-level (DEBUG, INFO, WARNING): Choose logging level. The default is INFO

  • --expert-path-dir (str): Path to directory that contains expert trajectories

__init__(policy, env, args, irl, expert_obs, expert_next_obs, expert_act, test_env=None)

Initialize Trainer class

Parameters
  • policy – Policy to be trained

  • env (gym.Env) – Environment for training

  • args (Namespace or dict) – config parameters specified with command line

  • irl

  • expert_obs

  • expert_next_obs

  • expert_act

  • test_env (gym.Env) – Environment for test.

static get_argument(parser=None)

Create or update argument parser for command line program

Parameters

parser (argparse.ArgumentParser, optional) – argument parser

Returns

argument parser

Return type

argparse.ArgumentParser
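
A minimal usage sketch follows. The DDPG policy and GAIL discriminator come from tf2rl.algos and are not documented in this section, so their constructor arguments and their get_argument helpers are assumptions, as is starting the trainer by calling the instance; the expert trajectories are assumed to have been saved beforehand under the directory passed via --expert-path-dir.

    # Hypothetical sketch: imitation learning with IRLTrainer and GAIL-style rewards.
    import gym

    from tf2rl.algos.ddpg import DDPG
    from tf2rl.algos.gail import GAIL
    from tf2rl.experiments.irl_trainer import IRLTrainer
    from tf2rl.experiments.utils import restore_latest_n_traj

    parser = IRLTrainer.get_argument()
    parser = GAIL.get_argument(parser)   # assumption: algorithms extend the same parser
    args = parser.parse_args()

    env = gym.make("Pendulum-v0")
    test_env = gym.make("Pendulum-v0")

    # Expert transitions saved earlier (e.g. with tf2rl.experiments.utils.save_path);
    # the dictionary keys are assumptions.
    expert_trajs = restore_latest_n_traj(args.expert_path_dir, n_path=20, max_steps=1000)

    policy = DDPG(
        state_shape=env.observation_space.shape,
        action_dim=env.action_space.high.size,
        max_action=env.action_space.high[0])
    irl = GAIL(
        state_shape=env.observation_space.shape,
        action_dim=env.action_space.high.size)

    trainer = IRLTrainer(
        policy, env, args, irl,
        expert_obs=expert_trajs["obses"],
        expert_next_obs=expert_trajs["next_obses"],
        expert_act=expert_trajs["acts"],
        test_env=test_env)
    trainer()  # assumption: Trainer subclasses are started by calling the instance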

tf2rl.experiments.me_trpo_trainer module

class tf2rl.experiments.me_trpo_trainer.MeTrpoTrainer(*args, n_eval_episodes_per_model=5, **kwargs)

Bases: tf2rl.experiments.mpc_trainer.MPCTrainer

Trainer class for Model-Ensemble Trust-Region Policy Optimization (ME-TRPO): https://arxiv.org/abs/1802.10592

Command Line Args:

  • --max-steps (int): The maximum steps for training. The default is int(1e6)

  • --episode-max-steps (int): The maximum steps for an episode. The default is int(1e3)

  • --n-experiments (int): Number of experiments. The default is 1

  • --show-progress: Call render function during training

  • --save-model-interval (int): Interval to save model. The default is int(1e4)

  • --save-summary-interval (int): Interval to save summary. The default is int(1e3)

  • --model-dir (str): Directory to restore model.

  • --dir-suffix (str): Suffix for directory that stores results.

  • --normalize-obs: Whether to normalize observations

  • --logdir (str): Output directory name. The default is "results"

  • --evaluate: Whether to evaluate the trained model

  • --test-interval (int): Interval to evaluate trained model. The default is int(1e4)

  • --show-test-progress: Call render function during evaluation.

  • --test-episodes (int): Number of episodes at test. The default is 5

  • --save-test-path: Save trajectories of evaluation.

  • --show-test-images: Show input images to neural networks when an episode finishes

  • --save-test-movie: Save rendering results.

  • --use-prioritized-rb: Use prioritized experience replay

  • --use-nstep-rb: Use Nstep experience replay

  • --n-step (int): Number of steps for nstep experience reward. The default is 4

  • --logging-level (DEBUG, INFO, WARNING): Choose logging level. The default is INFO

  • --gpu (int): The default is 0

  • --max-iter (int): Maximum iteration. The default is 100

  • --horizon (int): Length of the online planning horizon

  • --n-sample (int): Number of samples. The default is 1000

  • --batch-size (int): Batch size. The default is 512.

  • --n-collect-steps (int): Number of steps to collect. The default is 100

  • --debug: Enable debug

__init__(*args, n_eval_episodes_per_model=5, **kwargs)

Initialize ME-TRPO

Parameters
  • policy – Policy to be trained

  • env (gym.Env) – Environment for training

  • args (Namespace or dict) – config parameters specified with command line

  • test_env (gym.Env) – Environment for test.

  • reward_fn (callable) – Reward function

  • buffer_size (int) – The default is int(1e6)

  • lr (float) – Learning rate for dynamics model. The default is 0.001.

  • n_eval_episodes_per_model (int) – Number of evaluation episodes per model. The default is 5

predict_next_state(obses, acts, idx=None)

Predict Next State

Parameters
  • obses

  • acts

  • idx (int) – Index of the dynamics model to use. If None (default), a model is chosen randomly.

Returns

next state

Return type

np.ndarray
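
The idx argument selects one member of the dynamics-model ensemble. The sketch below only illustrates that selection logic; how MeTrpoTrainer actually stores its models, and whether they predict absolute next states or state deltas, are assumptions.

    import numpy as np

    def predict_next_state_sketch(models, obses, acts, idx=None):
        """Illustrative only: query one member of a dynamics-model ensemble.

        `models` is assumed to be a list of callables mapping a concatenated
        (observation, action) batch to predicted next observations.
        """
        if idx is None:
            idx = np.random.randint(len(models))      # pick an ensemble member at random
        inputs = np.concatenate([obses, acts], axis=1)
        return models[idx](inputs)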

update_policy()

Update Policy

collect_transitions_real_env()

Collect Transitions from the Real Environment

collect_transitions_sim_env()

Generate transitions using dynamics model

finish_horizon(last_val=0)

TODO: This code is completely identical to the code defined in on_policy_trainer.py. Reuse it.

evaluate_policy(total_steps)

static get_argument(parser=None)

Create or update argument parser for command line program

Parameters

parser (argparse.ArgumentParser, optional) – argument parser

Returns

argument parser

Return type

argparse.ArgumentParser
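
The methods above compose into the usual ME-TRPO loop: collect real transitions, refit the ensemble, imagine rollouts, update the policy. The helper below is only a sketch of that ordering; the real trainer drives this loop internally and the epoch count is a placeholder.

    from tf2rl.experiments.me_trpo_trainer import MeTrpoTrainer

    def me_trpo_iteration(trainer: MeTrpoTrainer, n_dynamics_epochs: int = 10) -> None:
        """One conceptual ME-TRPO iteration built from the documented methods."""
        trainer.collect_transitions_real_env()            # gather data with the current policy
        trainer.fit_dynamics(n_epoch=n_dynamics_epochs)   # refit the model ensemble (from MPCTrainer)
        trainer.collect_transitions_sim_env()             # imagine rollouts with the ensemble
        trainer.update_policy()                           # policy update on the imagined data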

tf2rl.experiments.mpc_trainer module

class tf2rl.experiments.mpc_trainer.DynamicsModel(*args, **kwargs)

Bases: tensorflow.python.keras.engine.training.Model

__init__(input_dim, output_dim, units=[32, 32], name='DymamicsModel', gpu=0)

Initialize DynamicsModel

Parameters
  • input_dim (int) –

  • output_dim (int) –

  • units (iterable of int) – The default is [32, 32]

  • name (str) – The default is "DynamicsModel"

  • gpu (int) – The default is 0.

call(inputs)

Call Dynamics Model

Parameters

inputs (tf.Tensor) –

Returns

tf.Tensor
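
A minimal construction-and-call sketch. Feeding the concatenated (observation, action) vector and reading the output as the predicted next observation is an assumption about how the trainers use this model.

    import numpy as np
    import tensorflow as tf

    from tf2rl.experiments.mpc_trainer import DynamicsModel

    obs_dim, act_dim = 3, 1
    model = DynamicsModel(input_dim=obs_dim + act_dim, output_dim=obs_dim, units=[32, 32])

    obses = np.zeros((16, obs_dim), dtype=np.float32)
    acts = np.zeros((16, act_dim), dtype=np.float32)
    inputs = tf.constant(np.concatenate([obses, acts], axis=1))

    next_obses = model(inputs)   # forward pass via call(); returns a tf.Tensor of shape (16, 3)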

predict(inputs)

Generates output predictions for the input samples.

Computation is done in batches. This method is designed for performance with large-scale inputs. For small amounts of input that fit in one batch, directly using __call__ is recommended for faster execution, e.g., model(x), or model(x, training=False) if you have layers such as tf.keras.layers.BatchNormalization that behave differently during inference. Also note that test loss is not affected by regularization layers like noise and dropout.

Parameters
  • x – Input samples. It could be:

    • A NumPy array (or array-like), or a list of arrays (in case the model has multiple inputs).

    • A TensorFlow tensor, or a list of tensors (in case the model has multiple inputs).

    • A tf.data dataset.

    • A generator or keras.utils.Sequence instance.

    A more detailed description of unpacking behavior for iterator types (Dataset, generator, Sequence) is given in the Unpacking behavior for iterator-like inputs section of Model.fit.

  • batch_size – Integer or None. Number of samples per batch. If unspecified, batch_size will default to 32. Do not specify the batch_size if your data is in the form of dataset, generators, or keras.utils.Sequence instances (since they generate batches).

  • verbose – Verbosity mode, 0 or 1.

  • steps – Total number of steps (batches of samples) before declaring the prediction round finished. Ignored with the default value of None. If x is a tf.data dataset and steps is None, predict will run until the input dataset is exhausted.

  • callbacks – List of keras.callbacks.Callback instances. List of callbacks to apply during prediction. See [callbacks](/api_docs/python/tf/keras/callbacks).

  • max_queue_size – Integer. Used for generator or keras.utils.Sequence input only. Maximum size for the generator queue. If unspecified, max_queue_size will default to 10.

  • workers – Integer. Used for generator or keras.utils.Sequence input only. Maximum number of processes to spin up when using process-based threading. If unspecified, workers will default to 1. If 0, will execute the generator on the main thread.

  • use_multiprocessing – Boolean. Used for generator or keras.utils.Sequence input only. If True, use process-based threading. If unspecified, use_multiprocessing will default to False. Note that because this implementation relies on multiprocessing, you should not pass non-picklable arguments to the generator as they can’t be passed easily to child processes.

See the discussion of Unpacking behavior for iterator-like inputs for Model.fit. Note that Model.predict uses the same interpretation rules as Model.fit and Model.evaluate, so inputs must be unambiguous for all three methods.

Returns

Numpy array(s) of predictions.

Raises
  • RuntimeError – If model.predict is wrapped in tf.function.

  • ValueError – In case of mismatch between the provided input data and the model’s expectations, or in case a stateful model receives a number of samples that is not a multiple of the batch size.
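
As the note above suggests, for the small batches typical of one-step dynamics queries, calling the model directly is usually preferable to predict(); a short sketch with placeholder dimensions:

    import numpy as np
    import tensorflow as tf

    from tf2rl.experiments.mpc_trainer import DynamicsModel

    model = DynamicsModel(input_dim=4, output_dim=3)
    batch = np.random.randn(8, 4).astype(np.float32)

    out_eager = model(tf.constant(batch))            # direct call: returns a tf.Tensor, no batching machinery
    out_numpy = model.predict(batch, batch_size=32)  # Keras predict(): returns a NumPy array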

class tf2rl.experiments.mpc_trainer.RandomPolicy(max_action, act_dim)

Bases: object

__init__(max_action, act_dim)

Initialize RandomPolicy

Parameters
  • max_action (float) –

  • act_dim (int) –

get_action(obs)

Get random action

Parameters

obs

Returns

action

Return type

float

get_actions(obses)

Get batch actions

Parameters

obses

Returns

batch actions

Return type

np.ndarray
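
RandomPolicy is typically used to seed the replay buffer before any dynamics model has been fit. A behaviorally equivalent sketch follows; the uniform-sampling implementation and the class name are assumptions for illustration only.

    import numpy as np

    class RandomPolicySketch:
        """Illustrative stand-in for RandomPolicy; not the actual implementation."""

        def __init__(self, max_action: float, act_dim: int):
            self._max_action = max_action
            self._act_dim = act_dim

        def get_action(self, obs):
            # Uniform action in [-max_action, max_action]; the observation is ignored.
            return np.random.uniform(-self._max_action, self._max_action, self._act_dim)

        def get_actions(self, obses):
            return np.random.uniform(
                -self._max_action, self._max_action, (len(obses), self._act_dim))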

class tf2rl.experiments.mpc_trainer.MPCTrainer(policy, env, args, reward_fn, buffer_size=1000000, n_dynamics_model=1, lr=0.001, **kwargs)

Bases: tf2rl.experiments.trainer.Trainer

Trainer class for Model Predictive Control (MPC): https://arxiv.org/abs/1708.02596

Command Line Args:

  • --max-steps (int): The maximum steps for training. The default is int(1e6)

  • --episode-max-steps (int): The maximum steps for an episode. The default is int(1e3)

  • --n-experiments (int): Number of experiments. The default is 1

  • --show-progress: Call render function during training

  • --save-model-interval (int): Interval to save model. The default is int(1e4)

  • --save-summary-interval (int): Interval to save summary. The default is int(1e3)

  • --model-dir (str): Directory to restore model.

  • --dir-suffix (str): Suffix for directory that stores results.

  • --normalize-obs: Whether to normalize observations

  • --logdir (str): Output directory name. The default is "results"

  • --evaluate: Whether to evaluate the trained model

  • --test-interval (int): Interval to evaluate trained model. The default is int(1e4)

  • --show-test-progress: Call render function during evaluation.

  • --test-episodes (int): Number of episodes at test. The default is 5

  • --save-test-path: Save trajectories of evaluation.

  • --show-test-images: Show input images to neural networks when an episode finishes

  • --save-test-movie: Save rendering results.

  • --use-prioritized-rb: Use prioritized experience replay

  • --use-nstep-rb: Use Nstep experience replay

  • --n-step (int): Number of steps for nstep experience reward. The default is 4

  • --logging-level (DEBUG, INFO, WARNING): Choose logging level. The default is INFO

  • --gpu (int): The default is 0

  • --max-iter (int): Maximum iteration. The default is 100

  • --horizon (int): Length of the online planning horizon

  • --n-sample (int): Number of samples. The default is 1000

  • --batch-size (int): Batch size. The default is 512.

__init__(policy, env, args, reward_fn, buffer_size=1000000, n_dynamics_model=1, lr=0.001, **kwargs)

Initialize MPCTrainer class

Parameters
  • policy – Policy to be trained

  • env (gym.Env) – Environment for training

  • args (Namespace or dict) – config parameters specified with command line

  • test_env (gym.Env) – Environment for test.

  • reward_fn (callable) – Reward function

  • buffer_size (int) – The default is int(1e6)

  • n_dynamics_model (int) – Number of dynamics models. The default is 1.

  • lr (float) – Learning rate for dynamics model. The default is 0.001.

predict_next_state(obses, acts)

Predict Next State

Parameters
  • obses

  • acts

Returns

next state

Return type

np.ndarray

collect_episodes(n_rollout=1)

Collect Episodes

Parameters

n_rollout (int) – Number of rollouts. The default is 1

fit_dynamics(n_epoch=1)

Fit dynamics

Parameters

n_epoch (int) – Number of epochs to fit

static get_argument(parser=None)

Create or update argument parser for command line program

Parameters

parser (argparse.ArgumentParser, optional) – argument parser

Returns

argument parser

Return type

argparse.ArgumentParser
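
A minimal usage sketch. The reward_fn signature (batched observations and actions in, per-sample rewards out) and starting the trainer by calling the instance are assumptions not spelled out in this section; the task reward itself is purely illustrative.

    import gym
    import numpy as np

    from tf2rl.experiments.mpc_trainer import MPCTrainer, RandomPolicy

    parser = MPCTrainer.get_argument()
    args = parser.parse_args()

    env = gym.make("Pendulum-v0")

    def reward_fn(obses, acts):
        # Hypothetical reward: penalize distance from the origin and control effort.
        return -(np.sum(obses ** 2, axis=1) + 0.01 * np.sum(acts ** 2, axis=1))

    policy = RandomPolicy(
        max_action=env.action_space.high[0],
        act_dim=env.action_space.high.size)

    trainer = MPCTrainer(policy, env, args, reward_fn=reward_fn, n_dynamics_model=1)
    trainer()  # assumption: Trainer subclasses are started by calling the instance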

tf2rl.experiments.on_policy_trainer module

class tf2rl.experiments.on_policy_trainer.OnPolicyTrainer(*args, **kwargs)

Bases: tf2rl.experiments.trainer.Trainer

Trainer class for on-policy reinforcement learning

Command Line Args:

  • --max-steps (int): The maximum steps for training. The default is int(1e6)

  • --episode-max-steps (int): The maximum steps for an episode. The default is int(1e3)

  • --n-experiments (int): Number of experiments. The default is 1

  • --show-progress: Call render function during training

  • --save-model-interval (int): Interval to save model. The default is int(1e4)

  • --save-summary-interval (int): Interval to save summary. The default is int(1e3)

  • --model-dir (str): Directory to restore model.

  • --dir-suffix (str): Suffix for directory that stores results.

  • --normalize-obs: Whether to normalize observations

  • --logdir (str): Output directory name. The default is "results"

  • --evaluate: Whether to evaluate the trained model

  • --test-interval (int): Interval to evaluate trained model. The default is int(1e4)

  • --show-test-progress: Call render function during evaluation.

  • --test-episodes (int): Number of episodes at test. The default is 5

  • --save-test-path: Save trajectories of evaluation.

  • --show-test-images: Show input images to neural networks when an episode finishes

  • --save-test-movie: Save rendering results.

  • --use-prioritized-rb: Use prioritized experience replay

  • --use-nstep-rb: Use Nstep experience replay

  • --n-step (int): Number of steps for nstep experience reward. The default is 4

  • --logging-level (DEBUG, INFO, WARNING): Choose logging level. The default is INFO

__init__(*args, **kwargs)

Initialize On-Policy Trainer

Parameters
  • policy – Policy to be trained

  • env (gym.Env) – Environment for training

  • args (Namespace or dict) – config parameters specified with command line

  • test_env (gym.Env) – Environment for test.

finish_horizon(last_val=0)

Finish horizon
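
"Finishing the horizon" conventionally means bootstrapping a cut-off trajectory segment with last_val and computing its discounted returns and GAE advantages. The sketch below shows that computation in isolation; the buffer layout and hyperparameters actually used by OnPolicyTrainer are assumptions.

    import numpy as np

    def finish_horizon_sketch(rewards, values, last_val=0.0, gamma=0.99, lam=0.95):
        """Illustrative GAE-lambda computation for one path segment."""
        rewards = np.append(rewards, last_val)   # bootstrap the value of the cut-off state
        values = np.append(values, last_val)

        deltas = rewards[:-1] + gamma * values[1:] - values[:-1]   # TD residuals
        advantages = np.zeros_like(deltas)
        running = 0.0
        for t in reversed(range(len(deltas))):
            running = deltas[t] + gamma * lam * running
            advantages[t] = running

        returns = advantages + values[:-1]       # value targets (advantage plus baseline)
        return returns, advantages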

evaluate_policy(total_steps)

Evaluate policy

Parameters

total_steps (int) – Current total steps of training

tf2rl.experiments.trainer module

class tf2rl.experiments.trainer.Trainer(policy, env, args, test_env=None)

Bases: object

Trainer class for off-policy reinforcement learning

Command Line Args:

  • --max-steps (int): The maximum steps for training. The default is int(1e6)

  • --episode-max-steps (int): The maximum steps for an episode. The default is int(1e3)

  • --n-experiments (int): Number of experiments. The default is 1

  • --show-progress: Call render function during training

  • --save-model-interval (int): Interval to save model. The default is int(1e4)

  • --save-summary-interval (int): Interval to save summary. The default is int(1e3)

  • --model-dir (str): Directory to restore model.

  • --dir-suffix (str): Suffix for directory that stores results.

  • --normalize-obs: Whether to normalize observations

  • --logdir (str): Output directory name. The default is "results"

  • --evaluate: Whether to evaluate the trained model

  • --test-interval (int): Interval to evaluate trained model. The default is int(1e4)

  • --show-test-progress: Call render function during evaluation.

  • --test-episodes (int): Number of episodes at test. The default is 5

  • --save-test-path: Save trajectories of evaluation.

  • --show-test-images: Show input images to neural networks when an episode finishes

  • --save-test-movie: Save rendering results.

  • --use-prioritized-rb: Use prioritized experience replay

  • --use-nstep-rb: Use Nstep experience replay

  • --n-step (int): Number of steps for nstep experience reward. The default is 4

  • --logging-level (DEBUG, INFO, WARNING): Choose logging level. The default is INFO

__init__(policy, env, args, test_env=None)

Initialize Trainer class

Parameters
  • policy – Policy to be trained

  • env (gym.Env) – Environment for training

  • args (Namespace or dict) – config parameters specified with command line

  • test_env (gym.Env) – Environment for test.

evaluate_policy_continuously()

Periodically search for the latest checkpoint, and keep evaluating with the latest model until the user kills the process.

evaluate_policy(total_steps)

static get_argument(parser=None)

Create or update argument parser for command line program

Parameters

parser (argparse.ArgumentParser, optional) – argument parser

Returns

argument parser

Return type

argparse.ArgumentParser
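
A minimal off-policy training sketch. The DDPG policy comes from tf2rl.algos and is outside this section, so its constructor and get_argument helper are assumptions, as is starting training by calling the trainer instance.

    import gym

    from tf2rl.algos.ddpg import DDPG
    from tf2rl.experiments.trainer import Trainer

    parser = Trainer.get_argument()
    parser = DDPG.get_argument(parser)   # assumption: algorithms extend the same parser
    args = parser.parse_args()

    env = gym.make("Pendulum-v0")
    test_env = gym.make("Pendulum-v0")

    policy = DDPG(
        state_shape=env.observation_space.shape,
        action_dim=env.action_space.high.size,
        max_action=env.action_space.high[0])

    trainer = Trainer(policy, env, args, test_env=test_env)
    trainer()  # assumption: training is started by calling the trainer instance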

tf2rl.experiments.utils module

tf2rl.experiments.utils.save_path(samples, filename)

tf2rl.experiments.utils.restore_latest_n_traj(dirname, n_path=10, max_steps=None)

tf2rl.experiments.utils.get_filenames(dirname, n_path=None)

tf2rl.experiments.utils.load_trajectories(filenames, max_steps=None)

tf2rl.experiments.utils.frames_to_gif(frames, prefix, save_dir, interval=50, fps=30)

Convert frames to a GIF file
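
A sketch of how these helpers compose: save_path writes one trajectory to disk (this is what --save-test-path produces), restore_latest_n_traj reloads the most recent ones, and frames_to_gif dumps rendered frames. The trajectory dictionary keys and the file-naming pattern are assumptions.

    import os

    import numpy as np

    from tf2rl.experiments.utils import frames_to_gif, restore_latest_n_traj, save_path

    os.makedirs("results", exist_ok=True)

    # Hypothetical one-episode trajectory; the key names are assumptions.
    samples = {
        "obs": np.zeros((100, 3), dtype=np.float32),
        "act": np.zeros((100, 1), dtype=np.float32),
        "next_obs": np.zeros((100, 3), dtype=np.float32),
        "rew": np.zeros(100, dtype=np.float32),
        "done": np.zeros(100, dtype=np.bool_),
    }
    save_path(samples, filename="results/step_0_epi_0_return_0.0.pkl")

    # Reload the 10 most recent trajectory files, e.g. to feed IRLTrainer as expert data.
    trajs = restore_latest_n_traj("results", n_path=10, max_steps=1000)

    # Convert rendered frames (HxWx3 uint8 arrays) into a GIF under results/.
    frames = [np.zeros((64, 64, 3), dtype=np.uint8) for _ in range(30)]
    frames_to_gif(frames, prefix="eval", save_dir="results")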

Module contents