Is there any example implementation of Deep Deterministic Policy Gradient (DDPG) for the Gluon API? If there isn’t one, can someone help me with the implementation?
I tried to implement it myself, but I got stuck at the point where I have to update the actor network.
I implemented the following training routine:
```python
if do_training():
    # Sample random batch from replay buffer
    states, actions, rewards, next_states, terminals = replay_buffer.sample(batch_size=BATCH_SIZE)

    # Calculate target y with actor and critic target networks
    target_actions = actor_target(next_states)
    target_qvalues = critic_target(next_states, target_actions)
    y = rewards + (1.0 - terminals) * DISCOUNT_FACTOR * target_qvalues

    # Update critic network by minimizing reward prediction error
    with autograd.record():
        qvalues = critic(states, actions)
        loss = l2_loss(qvalues, y)
    loss.backward()
    trainer_critic.step(BATCH_SIZE)  # actual update with gluon.Trainer

    # Let actor propose particular action for given state
    actor_action = actor(states)
    actor_action.attach_grad()

    # Compute Q(state, action) and backpropagate w.r.t. actions
    with autograd.record():
        qvalues = critic(states, actor_action)
    qvalues.backward()
    action_gradients = actor_action.grad
```
My first problem is that the gradients in `action_gradients` are identical for every sample in the batch, so I am not sure this is correct. My second problem is that I do not know how to proceed with the algorithm: how can I update the actor weights with the gradients calculated by the critic network?