I am working on developing new reinforcement learning (RL) algorithms.
I tried to look around for examples and solutions for similar problems, but I did not really find much.
My problem is a bit unusual: the learner has to pick an action from a set of available ones, and each action is characterized by a d-dimensional vector.
Basically, each action needs to be passed through the same NN, which generates a smaller representation; these representations are then used to sample an action via a fairly complex variant of softmax.
Just to make it simpler: let’s assume the network is just a single dense block with two outputs, and that I have a gluon block which takes a 2D vector and returns some kind of score.
Currently I just call the network once per action and then have some logic handling the outputs, but that doesn’t seem like the right way to do it. I would like to create a gluon block which takes an nd array (one row per action) and returns the index of the selected action.
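For context, the batched version of that logic (score all candidate actions in one forward pass over an (N, d) array, then softmax-sample an index) can be sketched framework-agnostically. Here is a minimal NumPy sketch; `W`, `b`, and `select_action` are hypothetical stand-ins for the gluon block and its learned parameters, not actual Gluon API:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shared parameters of a single dense "scoring" layer,
# standing in for the gluon block's learned weights (d=2 inputs -> 1 score).
W = rng.normal(size=(2, 1))
b = np.zeros(1)

def select_action(actions, rng):
    """actions: (N, d) array, one row per candidate action.

    Scores every action with the shared layer in one batched matrix
    multiply, then samples an index from a softmax over the scores.
    """
    scores = (actions @ W + b).ravel()      # (N,) one score per action
    z = scores - scores.max()               # numerically stable softmax
    probs = np.exp(z) / np.exp(z).sum()
    return rng.choice(len(actions), p=probs)

actions = rng.normal(size=(5, 2))           # 5 candidate actions, d = 2
idx = select_action(actions, rng)           # index of the sampled action
```

In a gluon block the same idea would mean the forward method receives the whole (N, d) array, so the dense layer runs once over the batch instead of once per action inside a Python loop.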
My questions are:
- I am not a NN expert: does this make sense to you? I know it makes sense from the ML point of view, but I am not sure the software structure makes any sense.
- What do you think is the best way to structure these blocks?
- I saw a few RL examples, but TBH some of them seem quite ‘basic’. Do you have any good references?
- I guess this could be seen as a multi-task regression problem where the tasks are all the same and the parameters are shared, but that seems like overkill, and I would still need to select the action from the scores at the end. Any thoughts on this?