## Improve MBSAC

Phil Brafield has now finished his master's thesis, so we should start improving MBSAC. As we discussed yesterday, there are four steps we need to perform to reach the goal of having an abstract representation on which we can, hopefully, train a policy more efficiently, and which also addresses the noisy-TV problem.

- Work on improving the intrinsic reward with a forward model
- Implement the abstraction function \phi. This also requires implementing the inverse model, because otherwise the forward model would just learn the identity function.
- Implement the intrinsic reward based on the prediction error in the abstract space.
- Have the policy operate in the abstract space. If we want to use HER, this also requires learning an inverse abstraction function \phi^-1, because otherwise we cannot implement the state-2-goal function that is necessary to compute the hindsight rewards.
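To make the role of the inverse model in step two concrete, here is a minimal sketch (not the thesis implementation) of how the forward and inverse losses could be computed around \phi. The linear weight matrices, dimensions, and the 0.5 weighting are illustrative assumptions standing in for small MLPs and a tuned hyperparameter:

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, act_dim, latent_dim = 8, 2, 4

# Hypothetical linear models standing in for small networks.
W_phi = rng.normal(size=(obs_dim, latent_dim))               # abstraction phi
W_fwd = rng.normal(size=(latent_dim + act_dim, latent_dim))  # forward model
W_inv = rng.normal(size=(2 * latent_dim, act_dim))           # inverse model

def phi(s):
    return s @ W_phi

def losses(s, a, s_next):
    z, z_next = phi(s), phi(s_next)
    z_pred = np.concatenate([z, a], axis=-1) @ W_fwd        # predict phi(s')
    a_pred = np.concatenate([z, z_next], axis=-1) @ W_inv   # predict action
    fwd_loss = np.mean((z_pred - z_next) ** 2)
    # The inverse loss forces phi to keep action-relevant information;
    # with the forward loss alone, phi can degenerate (e.g., map every
    # state to the same latent, making the forward loss trivially small).
    inv_loss = np.mean((a_pred - a) ** 2)
    return fwd_loss, inv_loss

s = rng.normal(size=(16, obs_dim))
a = rng.normal(size=(16, act_dim))
s_next = s + 0.1 * rng.normal(size=(16, obs_dim))
fwd_loss, inv_loss = losses(s, a, s_next)
total = fwd_loss + 0.5 * inv_loss  # 0.5 is an illustrative balance weight
```

In a real implementation both losses would be backpropagated through \phi jointly, which is exactly what prevents the degenerate solution.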

This issue is about the first step. Phil implemented the intrinsic reward as the normalized entropy of the ensemble's predictions. The steps for future work are as follows:

- Automate the architecture of the forward model. Right now, it is manually defined in the .yaml file.
- Implement a different normalization method. Phil normalized based on the minimum and maximum entropy over the whole training process. It is probably better to normalize using the minimum and maximum over a sliding window of the past X episodes. This requires adding a corresponding parameter "fwd_normalize_over_n_episodes" (instead of episodes, we could also use steps).
- Compare using the entropy with using the (normalized) prediction error as intrinsic reward. We could also use both, each with its own balance value.
- Try multiplying the prediction confidence (i.e., one minus the normalized entropy) with the prediction error. This way, the system would generate intrinsic reward if it is very certain about a prediction and that prediction turns out to be wrong. It would not generate intrinsic reward if the prediction is wrong but the prediction confidence is low.
- Use a decaying balance parameter, i.e., reduce the weight of the intrinsic reward as training progresses and the model becomes better trained.
- Implement a visualization method that helps to identify overfitting. For example, this could be a heatmap showing where the ant moves in the ant environment. A more general approach would be to record the variance of the observation vector for each individual episode, then compare the average per-episode variance and the overall variance across all episodes against the same statistics from a run without intrinsic reward.
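The sliding-window normalization from the second bullet could be sketched as follows. The parameter name `fwd_normalize_over_n_episodes` follows the suggestion above; the Gaussian-entropy computation is one illustrative choice and may differ from Phil's implementation:

```python
from collections import deque
import numpy as np

def ensemble_entropy(predictions):
    """Entropy of a diagonal Gaussian fit to the ensemble's predictions.
    predictions: array of shape (ensemble_size, obs_dim).
    This is an assumed definition, not necessarily the thesis's."""
    var = predictions.var(axis=0) + 1e-8
    return 0.5 * float(np.sum(np.log(2 * np.pi * np.e * var)))

class SlidingWindowNormalizer:
    """Min/max normalization over the last N episodes instead of the
    whole training process."""

    def __init__(self, fwd_normalize_over_n_episodes=50):
        self.mins = deque(maxlen=fwd_normalize_over_n_episodes)
        self.maxs = deque(maxlen=fwd_normalize_over_n_episodes)

    def end_episode(self, episode_entropies):
        # Record the per-episode extremes; old episodes fall out of
        # the deques automatically once the window is full.
        self.mins.append(min(episode_entropies))
        self.maxs.append(max(episode_entropies))

    def normalize(self, entropy):
        if not self.mins:
            return 0.0  # no statistics yet
        lo, hi = min(self.mins), max(self.maxs)
        if hi <= lo:
            return 0.0
        return float(np.clip((entropy - lo) / (hi - lo), 0.0, 1.0))

norm = SlidingWindowNormalizer(fwd_normalize_over_n_episodes=2)
norm.end_episode([1.0, 3.0])
reward = norm.normalize(2.0)  # 0.5: midway between window min and max
```

Using steps instead of episodes would only change what `end_episode` records; the windowed min/max logic stays the same.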