Recurrent / GATELORD Policies

This issue is about adding recurrent policies / critics to the implementation, in collaboration with Chris Gumbsch.

We have the following options:

Use off-policy SAC. We can then keep our usual setup with goal-conditioned envs, but the problem is that the replay buffer does not support episodicity of state transitions. That is, our current replay buffer does not sample state transitions in the order in which they occurred within their episodes. This requires one of the following:
- Implement an episodicity-aware replay buffer and adapt the training process to enable episodicity.
- Store hidden states in the replay buffer. This has the disadvantage that the stored hidden states become outdated over time, because the network that produced them keeps changing during training.
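To make the first option more concrete, here is a minimal sketch of what an episodicity-aware buffer could look like. It is not based on the existing replay buffer code: the names `Transition`, `EpisodicReplayBuffer`, `max_episodes` and `seq_len` are hypothetical and only serve as illustration. The idea is to store whole episodes and to sample contiguous, in-order sub-sequences from them, so that a recurrent network can be unrolled over each sampled sequence.

```python
import random
from collections import namedtuple

# Hypothetical transition record; the field names are illustrative only.
Transition = namedtuple("Transition", "obs action reward next_obs done")


class EpisodicReplayBuffer:
    """Stores whole episodes and samples contiguous, in-order sub-sequences."""

    def __init__(self, max_episodes):
        self.max_episodes = max_episodes
        self.episodes = []   # completed episodes, oldest first
        self._current = []   # transitions of the episode being collected

    def add(self, transition):
        """Append a transition; close the episode when `done` is set."""
        self._current.append(transition)
        if transition.done:
            self.episodes.append(self._current)
            self._current = []
            # Evict the oldest episode once the capacity is exceeded.
            if len(self.episodes) > self.max_episodes:
                self.episodes.pop(0)

    def sample(self, batch_size, seq_len, rng=random):
        """Return `batch_size` sequences of `seq_len` consecutive transitions."""
        candidates = [ep for ep in self.episodes if len(ep) >= seq_len]
        if not candidates:
            raise ValueError("no stored episode is long enough")
        batch = []
        for _ in range(batch_size):
            ep = rng.choice(candidates)
            start = rng.randrange(len(ep) - seq_len + 1)
            # Slicing keeps the original temporal order of the episode.
            batch.append(ep[start:start + seq_len])
        return batch


# Usage: collect three dummy episodes of five steps, then sample sequences.
buf = EpisodicReplayBuffer(max_episodes=100)
for ep_idx in range(3):
    for t in range(5):
        buf.add(Transition((ep_idx, t), 0, 0.0, (ep_idx, t + 1), t == 4))
batch = buf.sample(batch_size=4, seq_len=3)
```

With such a buffer, the training process would have to unroll the recurrent network over each sampled sequence, starting from a zero or otherwise initialized hidden state, instead of treating transitions as independent samples. How this would interact with goal relabeling in the goal-conditioned setup is left open here.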
Use an experimental PPO implementation that supports recurrence. This can be done in two ways:
- Quick but not sustainable: use the SB-contrib repository and quickly exchange the layers there.
- Slow but sustainable: copy the PPO implementation into Scilab-RL and use it there.
Tasks:

Edited Jan 30, 2023 by Manfred Eppe