Re-check early stopping mechanism
I got some weird results when stopping an episode early on success. This may be fixed now: I discovered that in SAC, the Q-backup value depends on whether the episode is done, and the done flag was not being stored properly in the replay buffer, which may have caused the weird results. I still have to re-check in a simple environment (FetchReach) whether the results are now more intuitive.
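For reference, here is a minimal sketch of how the done flag enters the SAC backup, assuming the standard clipped double-Q soft target. The function name, argument names, and the default `gamma` and `alpha` values are hypothetical and not taken from this codebase.

```python
import numpy as np

def sac_q_target(reward, done, next_q1, next_q2, next_log_prob,
                 gamma=0.99, alpha=0.2):
    """Soft Bellman target for a batch of transitions (illustrative only)."""
    # Soft value of the next state: clipped double-Q minus the entropy term.
    soft_value = np.minimum(next_q1, next_q2) - alpha * next_log_prob
    # (1 - done) removes the bootstrap term for terminal transitions.
    return reward + gamma * (1.0 - done) * soft_value

# Two transitions; the second one ends the episode (early stop on success).
reward = np.array([0.0, 0.0])
next_q1 = np.array([-1.0, -1.0])
next_q2 = np.array([-2.0, -0.5])
next_log_prob = np.array([-1.0, -1.0])

done_correct = np.array([0.0, 1.0])  # done flag stored properly
done_broken = np.array([0.0, 0.0])   # done flag lost in the replay buffer

target_correct = sac_q_target(reward, done_correct, next_q1, next_q2, next_log_prob)
target_broken = sac_q_target(reward, done_broken, next_q1, next_q2, next_log_prob)
print(target_correct, target_broken)
```

If the done flag is always stored as 0, the terminal transition still bootstraps from the next state's Q-values, so its target differs from the intended one. That is the kind of discrepancy the FetchReach re-check should rule out.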