Implicit assumption that episodes only finish once max_ep_steps is reached
In Stable Baselines3 an episode can finish in two different ways:
- `max_ep_steps` is reached
- the step function returns `done` as True (which is useful for early stopping on success or unrecoverable failure)
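
The two termination paths can be sketched as a plain rollout loop. This is a minimal illustration, not the Stable Baselines3 implementation: `ToyEnv`, `fail_at` and `run_episode` are hypothetical names, and the four-value `step` return is assumed for simplicity.

```python
class ToyEnv:
    """Hypothetical environment: reports done=True once step `fail_at` is reached."""

    def __init__(self, fail_at=None):
        self.fail_at = fail_at  # None means the env never stops early
        self.t = 0

    def reset(self):
        self.t = 0
        return 0  # dummy observation

    def step(self, action):
        self.t += 1
        done = self.fail_at is not None and self.t >= self.fail_at
        return 0, 0.0, done, {}  # obs, reward, done, info


def run_episode(env, max_ep_steps):
    """Return (steps taken, reason the episode ended)."""
    env.reset()
    for step in range(1, max_ep_steps + 1):
        _, _, done, _ = env.step(action=0)
        if done:
            # Path 2: the step function signalled an early stop.
            return step, "done"
    # Path 1: the step budget ran out.
    return max_ep_steps, "max_ep_steps"
```

Any logic that assumes every episode lasts exactly `max_ep_steps` steps only holds for the first path.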
In randomized environments with sparse rewards, using `best_mean_reward` to determine `rl_model_best` can be problematic when episodes stop early: it can select models that fail as quickly as possible.
Using the `success_rate` instead would fix this problem, but has detrimental consequences for environments that never stop early and are intended to accumulate successes within an episode.
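
A toy calculation shows how the two selection criteria can disagree. The reward scheme here (a cost of 1 per step plus a bonus of 100 on success) is purely an assumption for illustration and is not taken from the project. The helper names and episode data are likewise hypothetical.

```python
def make_episode(steps, success):
    # Assumed reward scheme: -1 per step, +100 bonus on success.
    return {"reward": -steps + (100 if success else 0), "success": success}


def mean_reward(episodes):
    return sum(ep["reward"] for ep in episodes) / len(episodes)


def success_rate(episodes):
    return sum(ep["success"] for ep in episodes) / len(episodes)


# Hypothetical evaluation results for two model checkpoints.
models = {
    # Always fails after 5 steps, so it pays almost no step cost.
    "fails_fast": [make_episode(5, False)] * 4,
    # Succeeds half the time, but only after long episodes.
    "sometimes_succeeds": [make_episode(150, True), make_episode(200, False)] * 2,
}

best_by_reward = max(models, key=lambda name: mean_reward(models[name]))
best_by_success = max(models, key=lambda name: success_rate(models[name]))

print(best_by_reward)   # fails_fast
print(best_by_success)  # sometimes_succeeds
```

Under these assumptions the mean reward prefers the checkpoint that never succeeds. The per-episode success flag picks the other one, but because it is binary it cannot tell apart episodes that accumulate one success from episodes that accumulate many.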
Edited by Pascal Gleske