Improve HAC testing transitions
There are currently two problems with HAC testing transitions.
- During online HER sampling in MBCHAC, the fraction of testing transitions is currently hardcoded to 0.1. It should instead depend on the `test_transiction_percentage` command-line parameter.
- The bigger problem is that testing transitions in the original HAC implementation require the discount factor \gamma to be set to 0 and the penalty for not achieving a testing subgoal to be set to the number of steps per episode. This, however, is not possible if we want to use the DRL algorithm implementations from the stable-baselines3 repository. So how do we get around this? One option is to stick with it and model it this way in our old MBCHAC implementation as well, to check whether it makes a big difference. Another option is to simply not care about gamma. As far as I know, a high penalty would just produce a large negative value that is clipped anyway when computing the loss.
- Find a better way to implement testing transitions. Summary of drawbacks: a) A testing transition remains in the replay buffer, so a penalty recorded when a layer could not achieve a subgoal stays there even if the layer becomes able to achieve it later. b) It is hard to implement without touching stable-baselines3 code.
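
To make the two points above concrete, here is a minimal sketch in plain Python. It is not taken from MBCHAC or stable-baselines3: `parse_args`, `is_testing_transition` and `clipped_target` are hypothetical helpers, the flag name is copied from this issue, and the clipping range [-H, 0] is an assumption based on the remark that the value is clipped when computing the loss. Whether the stable-baselines3 algorithms clip targets this way is not confirmed here.

```python
import argparse
import random


def parse_args(argv=None):
    """Read the testing-transition fraction from the command line.

    The flag name is copied from the issue text; the default of 0.1
    mirrors the value that is currently hardcoded.
    """
    parser = argparse.ArgumentParser()
    parser.add_argument("--test_transiction_percentage", type=float, default=0.1)
    return parser.parse_args(argv)


def is_testing_transition(fraction, rng=random):
    """Mark a sampled transition as a testing transition with probability `fraction`."""
    return rng.random() < fraction


def clipped_target(reward, next_q, gamma, horizon):
    """One-step target r + gamma * Q', clipped to [-horizon, 0].

    Assumption: Q-values and targets are bounded to [-horizon, 0],
    where `horizon` is the number of steps per episode.
    """
    target = reward + gamma * next_q
    return max(-float(horizon), min(0.0, target))


# Use the parsed fraction instead of a hardcoded 0.1.
args = parse_args(["--test_transiction_percentage", "0.2"])
fraction = args.test_transiction_percentage
flagged = sum(is_testing_transition(fraction) for _ in range(1000))
print(f"{flagged} of 1000 sampled transitions flagged as testing transitions")

# Compare the HAC-style target (gamma = 0) with a target that keeps gamma.
H = 10
penalty = -float(H)  # penalty for missing a testing subgoal
next_q = -5.0        # any next-state value in [-H, 0]
print(clipped_target(penalty, next_q, gamma=0.0, horizon=H))
print(clipped_target(penalty, next_q, gamma=0.98, horizon=H))
```

Under the clipping assumption, the penalty of -H already sits at the lower bound, and since the next-state value is never positive, adding the discounted term can only push the target further below the bound. Both calls therefore give the same clipped target, which is the argument for not caring about gamma. Without such clipping the two targets would differ.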
Edited by Manfred Eppe