Rethink Q-value computation in SAC with respect to goal replay
In SAC, the target Q-value is computed as follows:
replay_data.rewards + (1 - replay_data.dones) * gamma * target_q
One problem is that the dones refer to the transitions that were actually executed, not to the hindsight (relabeled) transitions. They should instead depend on the hindsight transitions: if a transition is done in hindsight, no discounted future target_q should be added to its reward.
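To make the difference concrete, here is a minimal NumPy sketch that evaluates the same target expression twice, once with the executed dones and once with hindsight dones. The function name `td_target`, the array `hindsight_dones`, and the sparse reward convention (-1 per step, 0 when the relabeled goal is reached) are assumptions for illustration and are not taken from the actual code base. How the hindsight dones are derived (for example, from the relabeled reward or a success flag) is left open here.

```python
import numpy as np

def td_target(rewards, dones, target_q, gamma):
    """Same expression as in the issue: r + (1 - done) * gamma * target_q."""
    rewards = np.asarray(rewards, dtype=np.float64)
    dones = np.asarray(dones, dtype=np.float64)
    target_q = np.asarray(target_q, dtype=np.float64)
    return rewards + (1.0 - dones) * gamma * target_q

# Two relabeled transitions; the second one reaches the hindsight goal.
rewards = [-1.0, 0.0]          # hindsight rewards (assumed sparse: 0 on success)
executed_dones = [0.0, 0.0]    # dones as stored for the executed transitions
hindsight_dones = [0.0, 1.0]   # hypothetical dones recomputed for the relabeled goal
target_q = [-5.0, -5.0]
gamma = 0.5

# Current behaviour: the successful hindsight transition still bootstraps.
with_executed = td_target(rewards, executed_dones, target_q, gamma)

# Proposed behaviour: no discounted future value once done in hindsight.
with_hindsight = td_target(rewards, hindsight_dones, target_q, gamma)

print(with_executed)
print(with_hindsight)
```

With the executed dones, the second transition still receives `gamma * target_q` even though the relabeled goal has already been reached. With the hindsight dones, its target reduces to the reward alone.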