Implicit assumption that episodes only finish once max_ep_steps is reached

In Stable Baselines3 an episode can finish in two different ways:

- `max_ep_steps` is reached
- the step function returns `done` as `True` (useful for stopping early on success or unrecoverable failure)
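The two termination paths can be sketched with a plain rollout loop. This is only an illustration, not the Stable Baselines3 implementation: `run_episode`, `step_fn`, and the two toy step functions are hypothetical names, and `step_fn` is assumed to return a `(reward, done)` pair.

```python
from typing import Callable, Tuple

# Hypothetical signature: step_fn(t) -> (reward, done)
StepFn = Callable[[int], Tuple[float, bool]]


def run_episode(step_fn: StepFn, max_ep_steps: int) -> Tuple[float, int, str]:
    """Roll out one episode and report the return, its length, and why it ended."""
    total_reward = 0.0
    for t in range(1, max_ep_steps + 1):
        reward, done = step_fn(t)
        total_reward += reward
        if done:
            # Early stop: the step function signalled success or failure.
            return total_reward, t, "done"
    # No early stop: the step limit ended the episode.
    return total_reward, max_ep_steps, "max_ep_steps"


def fails_at_step_3(t: int) -> Tuple[float, bool]:
    """Toy step function that reports an unrecoverable failure at step 3."""
    return 0.0, t == 3


def never_done(t: int) -> Tuple[float, bool]:
    """Toy step function that never stops early."""
    return 0.0, False
```

With these toy functions, `run_episode(fails_at_step_3, 10)` ends after 3 steps for the reason `"done"`, while `run_episode(never_done, 10)` runs all 10 steps and ends for the reason `"max_ep_steps"`.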

In randomized environments with sparse rewards, using `best_mean_reward` to determine `rl_model_best` can be problematic when episodes stop early: it can end up selecting models that fail as quickly as possible.

Using the `success_rate` instead would fix this problem, but it has detrimental consequences for environments that never stop early and are intended to accumulate successes within an episode.
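A toy comparison of the two selection criteria may make the trade-off concrete. Everything here is an assumption for illustration: the `Episode` record, the helper functions, the episode data, and in particular the small per-step penalty (`STEP_PENALTY`), which is just one possible way for a short failed episode to score better than a long one. None of it is taken from the project's actual reward function.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class Episode:
    total_reward: float  # undiscounted return of the episode
    successes: int       # number of successes achieved within the episode


def mean_reward(episodes: List[Episode]) -> float:
    return mean(e.total_reward for e in episodes)


def success_rate(episodes: List[Episode]) -> float:
    # Binary per-episode success: at least one success counts as 1.
    return mean(1.0 if e.successes > 0 else 0.0 for e in episodes)


STEP_PENALTY = -0.01  # assumption: small cost per step

# Case 1: early stopping. One model fails after 2 steps every time; the other
# solves 1 of 5 episodes at step 50 and times out at 100 steps otherwise.
fail_fast = [Episode(2 * STEP_PENALTY, 0) for _ in range(5)]
slow_solver = [Episode(1.0 + 50 * STEP_PENALTY, 1)] + [
    Episode(100 * STEP_PENALTY, 0) for _ in range(4)
]

# Case 2: no early stopping, successes accumulate (+1 each, no penalty).
one_hit = [Episode(1.0, 1) for _ in range(5)]
five_hits = [Episode(5.0, 5) for _ in range(5)]
```

Under these assumptions, mean reward ranks `fail_fast` above `slow_solver` even though only `slow_solver` ever succeeds, while the success rate ranks them the other way round. In the second case the success rate cannot tell `one_hit` from `five_hits`, whereas mean reward can.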

Edited Aug 27, 2023 by Pascal Gleske
