Rethink the Q-value computation of SAC with respect to Goal Replay

In SAC, the target Q-value is computed as follows: replay_data.rewards + (1 - replay_data.dones) * gamma * target_q

The problem is that the dones refer to the transitions that were actually executed, not to the hindsight transitions. With goal replay, however, the dones should be determined by the hindsight transitions: if a transition is done in hindsight (for example, because the relabeled goal is reached), the discounted future target_q should not be added to the target.
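
To make the difference concrete, here is a minimal sketch of how the target changes when the dones are recomputed from the relabeled goals. It is not the existing implementation: the helper names hindsight_dones and td_target, the distance threshold, and the sparse reward convention (0 on success, -1 otherwise) are all assumptions made for illustration, and it further assumes that reaching the relabeled goal is the only reason a hindsight transition counts as done.

```python
import numpy as np


def hindsight_dones(achieved_goal, relabeled_goal, threshold=0.05):
    """Hypothetical helper: a transition is done in hindsight if the
    achieved goal lies within `threshold` of the relabeled goal."""
    distance = np.linalg.norm(achieved_goal - relabeled_goal, axis=-1)
    return (distance < threshold).astype(np.float32)


def td_target(rewards, dones, gamma, target_q):
    """Same formula as in the issue text, with the dones passed in explicitly."""
    return rewards + (1.0 - dones) * gamma * target_q


# Two relabeled transitions: the first misses the relabeled goal, the second reaches it.
achieved = np.array([[0.0, 0.0], [1.0, 1.0]])
relabeled = np.array([[1.0, 1.0], [1.0, 1.0]])
env_dones = np.array([0.0, 0.0])    # dones of the actually executed transitions
rewards = np.array([-1.0, 0.0])     # assumed sparse hindsight rewards
target_q = np.array([-5.0, -5.0])   # placeholder target network output
gamma = 0.99

# Current behaviour: bootstraps from target_q even though the goal is reached in hindsight.
current = td_target(rewards, env_dones, gamma, target_q)

# Proposed behaviour: no bootstrapping for the transition that is done in hindsight.
proposed = td_target(rewards, hindsight_dones(achieved, relabeled), gamma, target_q)

print("current :", current)
print("proposed:", proposed)
```

In this toy batch the two targets differ only for the second transition, which is exactly the case described above: it is not done in the executed episode but is done in hindsight.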