Find a workaround for HAC testing transitions

HAC testing transitions are necessary for the agent to learn not to propose subgoals that are unachievable by the agent. However, they require to set the \gamma during the TD-Learning loss computation to zero. In our current architecture, this is not possible because we want to be able to use the unmodified SAC/TD3/DDPG implementations from stable baselines. Hence, we have to find a workaround for the subgoal testing transitions.

Modify our old HAC code so that gamma is not set to 0 as in this implementation and compare the performance. Maybe it does not make such a big difference at all. Here, we use \gamma=0.99 and the penalty is half the number of steps of the lower-level layer (cf. in HHERReplayBuffer. __init__())

Edited Mar 17, 2021 by Manfred Eppe