Introduction
Just a disclaimer: treat this as a supplement to Lilian Weng's policy gradient blog, cause I made these notes while going through that grind — but ofc a wonderful experience. If you are fine with derivations & math then go ahead dawg.
Agent acts in the environment → how the environment reacts to actions is defined by the model (which we may or may not know) → reward function.
The model defines the transition probability of the environment:
$$
P(s' \mid s, a)
$$
The effective transition probability under our policy is:
$$
P^{\pi}(s' \mid s) = \sum_{a} \pi(a \mid s) \, P(s' \mid s, a)
$$
If we know the model → we can do model-based RL → Dynamic Programming.
Otherwise we go model-free, or we try to learn the model explicitly as part of the algorithm.
The model = transition probability & reward dynamics.
Model-based → relies on a model of the environment, whether it is known or the algorithm learns it explicitly.
Model-free → no dependency on the model during learning. It is like ignoring how the world works, but the methods tend to work anyway: they just map states to the best action to maximize reward through trial & error.
Like Q-learning, or RL methods that learn the policy $\pi$ directly.
Joint: $P(A \cap B)$, like $P(s',r \mid s,a)$
Marginal: $P(A) = \sum_{B} P(A,B) = \int P(A,B) \, dB$
Conditional: $P(A \mid B) = \frac{P(A \cap B)}{P(B)}$
$$
S_t \; \xrightarrow{A_t} \; [R_{t+1}, S_{t+1}]
$$
The state transition function can be defined in terms of $P(s',r \mid s,a)$.
So say a vending machine is the state $(s)$, and the action $(a)$ is insert coin:
Outcome A: get soda $(s')$ & reward +1 (it worked)
Outcome B: get soda $(s')$ & reward +5 (machine glitched & gave your coin back)
Outcome C: machine eats your coin (fail state) & reward -1
now
$$
P(s', r \mid s, a) = P[ S_{t+1}=s' , R_{t+1}=r \mid S_t = s , A_t = a ]
$$
$$
P_{ss'}^a = P(s'|s,a) = P[ S_{t+1}=s' \mid S_t = s , A_t = a ] = \sum_{r \in R} P(s',r \mid s,a)
$$
this marginalizes out the rewards;
if the reward were always the same for a specific move, this summation is a formality that doesn't change the numbers.
$$
R(s,a) = E[ R_{t+1} \mid S_t = s , A_t = a ] = \sum_{r \in R} r \sum_{s' \in S} P(s',r \mid s,a)
$$
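To make the marginalization concrete, here is a tiny Python sketch of the vending machine; the probabilities (0.90 / 0.05 / 0.05) are invented for illustration:

```python
# Hypothetical vending-machine dynamics P(s', r | s, a) for one (s, a):
# entries are (next_state, reward, probability); probabilities are made up.
outcomes = [
    ("soda", +1, 0.90),   # A: it worked
    ("soda", +5, 0.05),   # B: glitch, got the coin back too
    ("fail", -1, 0.05),   # C: coin eaten
]

# marginalize out the reward -> P(s' | s, a)
p_next = {}
for s_next, r, p in outcomes:
    p_next[s_next] = p_next.get(s_next, 0.0) + p

# expected reward R(s, a) = sum over (s', r) of r * P(s', r | s, a)
expected_r = sum(r * p for _, r, p in outcomes)
print(p_next, expected_r)  # P(soda) ≈ 0.95, P(fail) ≈ 0.05, R ≈ 1.1
```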
Now Policy → agent's behavior function $\pi$ tells which action to take in state $s$.
• Deterministic $\quad \pi(s) = a$
• Stochastic $\quad \pi(a|s) = P_\pi [ A=a \mid S=s ]$
$$
G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}
$$
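The return can be computed directly from the definition, or by the backward recursion $G_t = R_{t+1} + \gamma G_{t+1}$; a quick sketch with made-up rewards:

```python
# G_t = sum_k gamma^k R_{t+k+1}, computed two ways on toy rewards.
rewards = [1.0, 0.0, 2.0, 3.0]
gamma = 0.9

# direct sum from the definition
g_direct = sum(gamma**k * r for k, r in enumerate(rewards))

# backward recursion G_t = R_{t+1} + gamma * G_{t+1}
g = 0.0
for r in reversed(rewards):
    g = r + gamma * g

print(g_direct, g)  # both ≈ 4.807
```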
State Value (s)
$$
V_\pi(s) = E_\pi [ G_t \mid S_t = s ]
$$
Action Value (s,a)
$$
Q_\pi(s,a) = E_\pi [ G_t \mid S_t = s , A_t = a ]
$$
$$
= R(s,a) + \gamma E_\pi [ V_\pi(S_{t+1}) \mid S_t = s , A_t = a ]
$$
if we marginalize out the action we again get the value of the state
$$
V_\pi(s) = \sum_{a \in A} Q_\pi(s,a) \, \pi(a|s)
$$
Advantage (A)
$$
A_\pi(s,a) = Q_\pi(s,a) - V_\pi(s)
$$
$$
V(s) = E[ R_{t+1} + \gamma V(S_{t+1}) \mid S_t = s ]
$$
$$
Q(s,a) = E[ R_{t+1} + \gamma E_{a \sim \pi} Q(S_{t+1}, a) \mid S_t = s , A_t = a ]
$$
the next action is sampled from the policy
$$
Q_\pi(s,a) = R(s,a) + \gamma \sum_{s' \in S} P^a_{ss'} V_\pi(s')
$$
⇒ the transition probability is just a way to cover all possible ways/situations to reach $s'$
$$
V_\pi(s) =
\sum_{a \in A} \pi(a|s)
\left(
R(s,a)
+ \gamma \sum_{s' \in S} P^a_{ss'} V_\pi(s')
\right)
$$
$$
Q_\pi(s,a) =
R(s,a)
+ \gamma
\sum_{s' \in S}
P^a_{ss'}
\sum_{a' \in A}
\pi(a'|s')
Q_\pi(s',a')
$$
$$
V_*(s) = \max_{\pi} V_\pi(s)
\quad , \quad
Q_*(s,a) = \max_{\pi} Q_\pi(s,a)
$$
↓
among all policies pick the one that gives highest expected return.
this is that policy
$$
\pi_* = \arg\max_{\pi} V_\pi(s)
\quad , \quad
\pi_* = \arg\max_{\pi} Q_\pi(s,a)
$$
returns a policy
$$
V_{\pi_*}(s) = V_*(s)
\qquad
Q_{\pi_*}(s,a) = Q_*(s,a)
$$
⇒ best possible strategy & best possible reward
So, to find the optimal policy $(\pi_*)$: if the agent learns the optimal
values, say $(Q_*)$, then finding the policy becomes easy, cause
then it's just
$$
\pi_*(s) = \arg\max_a Q_*(s,a)
$$
in optimal case
$$
V_*(s) = \max_{a \in A} Q_*(s,a)
$$
cause it's like we now know the
true Q values, no need to explore,
hence act greedily / deterministically
$$
Q_*(s,a) =
R(s,a)
+ \gamma
\sum_{s' \in S}
P^a_{ss'}
V_*(s')
$$
⇒ again to cover all cases
$$
V_*(s) =
\max_{a \in A}
\left(
R(s,a)
+ \gamma
\sum_{s' \in S}
P^a_{ss'}
V_*(s')
\right)
$$
$$
Q_*(s,a) =
R(s,a)
+ \gamma
\sum_{s' \in S}
P_{ss'}^a
\max_{a' \in A}
Q_*(s',a')
$$
In most cases we don't know $P_{ss'}^a$ or $R(s,a)$
$$
V(s) =
\sum_{a}
\pi(a|s)
\sum_{s',r}
P(s',r \mid s,a)
\left(
r + \gamma V(s')
\right)
$$
$$
V_{t+1}(s) =
E_\pi
[ r + \gamma V_t(s') \mid S_t = s ]
$$
→ does not change how the agent behaves,
it just makes the values more accurate for the current
policy $\pi$
$$
Q_\pi(s,a) =
\sum_{s',r}
P(s',r \mid s,a)
(r + \gamma V_\pi(s'))
$$
$$
=
E
[ R_{t+1} + \gamma V_\pi(S_{t+1})
\mid S_t = s , A_t = a ]
$$
→ this changes the agent's behavior.
When we evaluate $V_{t+1}$ we update our knowledge of the current
scenario, & when we improve $Q_\pi \to \pi'$ we update our
behavior based on that knowledge.
$$
\pi'(s) = \arg\max_a Q_\pi(s,a)
$$
So basically we have state values, Q values & a policy;
now say we updated our policy, then we
will update our state values, which would
lead to Q-value updates, then our new policy
becomes this argmax, & then we loop again.
This is Policy Iteration, a common algorithm.
Generalized Policy iteration
$$
\pi_0 \to V^{\pi_0}
\to \pi_1 \to V^{\pi_1}
\to \pi_2
\to \dots
\to V_*
$$
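The whole GPI loop fits in a few lines on a toy problem; a minimal sketch with an invented 2-state, 2-action MDP (state 1 absorbing, action 1 in state 0 pays +1):

```python
# Policy iteration on an invented 2-state, 2-action MDP.
# P[s][a] = list of (prob, next_state, reward); state 1 is absorbing.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 1, 0.0)], 1: [(1.0, 1, 0.0)]},
}
gamma = 0.9

def evaluate(pi, V, tol=1e-8):
    # policy evaluation: Bellman expectation backups until V stops changing
    while True:
        delta = 0.0
        for s in P:
            v = sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][pi[s]])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < tol:
            return V

def greedy(V):
    # improvement: pi'(s) = argmax_a sum P(s',r|s,a) (r + gamma V(s'))
    return {s: max(P[s], key=lambda a, s=s: sum(p * (r + gamma * V[s2])
                                                for p, s2, r in P[s][a]))
            for s in P}

pi, V = {0: 0, 1: 0}, {0: 0.0, 1: 0.0}
while True:
    V = evaluate(pi, V)
    new_pi = greedy(V)
    if new_pi == pi:
        break
    pi = new_pi

print(pi, V)  # in state 0 the optimal action is 1 (grab the +1 reward)
```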
Now, Monte Carlo is model-free but has more variance,
as one episode can skew things. In Temporal
Difference we don't wait for the whole episode;
in TD learning we use bootstrapping
$$
V(S_t) \leftarrow
V(S_t)
+ \alpha
\left(
R_{t+1}
+ \gamma V(S_{t+1})
- V(S_t)
\right)
$$
$$
Q(S_t, A_t) \leftarrow
Q(S_t, A_t)
+ \alpha
\left(
R_{t+1}
+ \gamma Q(S_{t+1}, A_{t+1})
- Q(S_t, A_t)
\right)
$$
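One TD(0) backup, with arbitrary numbers just to show the bootstrapped target:

```python
# One TD(0) backup: V(S_t) += alpha * (R + gamma * V(S_{t+1}) - V(S_t)).
# All the numbers are invented.
V = {"s": 0.5, "s_next": 1.0}
alpha, gamma, r = 0.1, 0.9, 2.0

td_target = r + gamma * V["s_next"]   # bootstraps on the current V(s_next)
td_error = td_target - V["s"]
V["s"] += alpha * td_error
print(V["s"])  # 0.5 + 0.1 * (2.9 - 0.5) = 0.74
```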
now SARSA is on-policy TD control (it drives TD learning toward the optimal policy),
and Q-learning is off-policy TD control.
So in SARSA, the next action $A_{t+1}$ comes from the current policy itself, e.g. greedily
$$
A_{t+1} = \arg\max_{a \in A} Q(S_{t+1}, a)
$$
↓
$\varepsilon$-greedy in action, so sometimes even an action with a
lower Q value gets selected.
Q learning
$$
Q(S_t, A_t) \leftarrow
\dots
\left(
R_{t+1}
+ \gamma \max_{a \in A} Q(S_{t+1}, a)
\right)
\dots
$$
the key diff is that Q-learning doesn't follow the current policy to
pick the second action $A_{t+1}$; it estimates the
best one directly with the max.
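A sketch of how the two targets differ on the same transition; the toy Q-table, reward, and $\varepsilon$ are all invented:

```python
import numpy as np

# Same transition (s --r=0.5--> s2), two different bootstrap targets.
rng = np.random.default_rng(0)
Q = {("s2", "left"): 1.0, ("s2", "right"): 3.0}
gamma, eps, r = 0.9, 0.1, 0.5

# Q-learning: bootstrap on the greedy next action (off-policy)
q_target = r + gamma * max(Q[("s2", a)] for a in ("left", "right"))

# SARSA: bootstrap on whatever the epsilon-greedy policy actually takes
if rng.random() < eps:
    a_next = str(rng.choice(["left", "right"]))   # explore
else:
    a_next = "right"                              # greedy under this table
sarsa_target = r + gamma * Q[("s2", a_next)]

print(q_target)      # always 0.5 + 0.9 * 3.0 = 3.2
print(sarsa_target)  # usually 3.2, but 1.4 whenever exploration picks "left"
```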
But the Q table will easily become large,
hence we use a model (a neural network) to predict the Q value as
$$
Q(s,a;\theta)
$$
Training can be hard & unstable due to the deadly triad: non-linear function approximation, off-policy learning, and bootstrapping.
So the DQN loss looks like
$$
L(\theta) =
E_{(s,a,r,s') \sim U(D)}
\left[
\left(
r
+ \gamma
\max_{a'}
Q(s',a';\theta^-)
- Q(s,a;\theta)
\right)^2
\right]
$$
↓
the expectation is over transitions sampled uniformly
from the replay memory $D$
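The loss can be sketched with plain numpy; the arrays below stand in for network outputs and are entirely invented:

```python
import numpy as np

# DQN-style targets for a sampled minibatch. q_next stands in for the
# target network Q(s', .; theta^-), q_taken for the online Q(s, a; theta).
rng = np.random.default_rng(1)
r = np.array([1.0, 0.0, 0.5, 2.0])   # rewards in the batch (invented)
done = np.array([0, 0, 1, 0])        # terminal-transition mask
gamma = 0.99

q_next = rng.normal(size=(4, 3))     # target-net values over 3 actions
q_taken = rng.normal(size=4)         # online-net values for taken actions

# y = r + gamma * max_a' Q(s', a'; theta^-), no bootstrap at terminals
y = r + gamma * (1 - done) * q_next.max(axis=1)
loss = np.mean((y - q_taken) ** 2)   # the squared TD error, averaged
print(y, loss)
```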
in generalized n-step TD learning
$$
V(S_t) \leftarrow
V(S_t)
+ \alpha
\left(
G_t^n - V(S_t)
\right)
$$
↓
$$
G_t^n =
R_{t+1}
+ \gamma R_{t+2}
+ \cdots
+ \gamma^n V(S_{t+n})
$$
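A quick sketch of the n-step return with made-up rewards and a made-up bootstrap value:

```python
# n-step return G_t^n = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n V(S_{t+n}).
# Toy rewards; v_tail stands for the bootstrapped V(S_{t+n}).
rewards = [1.0, 0.0, 2.0]    # R_{t+1} .. R_{t+n}, so n = 3
v_tail, gamma = 4.0, 0.9

g_n = sum(gamma**k * r for k, r in enumerate(rewards)) + gamma**len(rewards) * v_tail
print(g_n)  # 1 + 0 + 0.81*2 + 0.729*4 = 5.536
```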
up to now we learn state/action values and then
select actions accordingly; in PG we learn the policy directly.
$$
\pi(a|s ; \theta)
$$
policy with $\theta$ param
$$
J(\theta)
$$
: expected total reward , we want to max this
Discrete →
$$
J(\theta) = V_{\pi_\theta}(S_1)
$$
the return from the initial state, so the
goal is to find $\theta$ that
maximizes the value of the initial state
(every episode starts from the same
starting state here)
Continuous / stationary case →
many starting states, or the agent
runs forever, i.e. a continuing task
(stationary distribution)
$$
d^{\pi_\theta}(s)
$$
← So we track how often the
agent visits each state:
the probability that the agent is in state $s$
if it follows $\pi_\theta$ for a long time
$$
J(\theta) =
\sum_{s \in S}
d^{\pi_\theta}(s)
V_{\pi_\theta}(s)
$$
this is for
each state
$$
=
\sum_{s}
d^{\pi_\theta}(s)
\sum_{a \in A}
\pi(a|s,\theta)
Q_\pi(s,a)
$$
now the Policy Gradient theorem allows us to calculate the gradient
$$
\nabla J(\theta)
$$
without actually calculating the derivative of the state distribution
$$
d^{\pi_\theta}(s)
$$
Cause when we change $\theta$ to improve $\pi$ we
accidentally change where we go, i.e. the state distribution
$$
d^{\pi_\theta}(s)
$$
cause they are dependent.
Now this
$$
d^{\pi_\theta}(s)
$$
is the problem, & we'll prove that it can be replaced by
$$
d(s)
$$
just a distribution not dependent on $\theta$
$$
J(\theta)
=
E_{\pi_\theta}[r]
=
\sum_{s \in S}
d^{\pi_\theta}(s)
\sum_{a \in A}
\pi(a|s;\theta)
R(s,a)
$$
$$
J(\theta)
=
\sum_{s \in S}
d(s)
\sum \dots
$$
$$
\nabla J(\theta)
=
\sum_{s \in S}
d(s)
\sum_{a \in A}
\nabla_\theta \pi(a|s;\theta)
\, Q_\pi(s,a)
$$
$$
\nabla J(\theta)
=
\sum_{s \in S}
d(s)
\sum_{a \in A}
\pi(a|s;\theta)
\nabla_\theta
\ln \pi(a|s;\theta)
Q_\pi(s,a)
$$
$$
=
E_{\pi_\theta}
\left[
\nabla
\ln \pi(a|s;\theta)
Q_\pi(s,a)
\right]
$$
REINFORCE / Monte Carlo PG → since we have the whole episode, we have $G_t$
$$
\nabla J(\theta)
=
E_{\pi_\theta}
\left[
\nabla_\theta
\ln \pi_\theta(a_t|s_t)
G_t
\right]
$$
So
$$
\theta
\leftarrow
\theta
+
\alpha
\gamma^t
G_t
\nabla_\theta
\ln \pi_\theta (A_t | S_t)
$$
→ like $G_t$ at state $t = R_{t+1} + \gamma R_{t+2} + \cdots$
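One REINFORCE update for a linear-softmax policy; a minimal sketch with invented features, return, and step size:

```python
import numpy as np

# One REINFORCE update for a linear-softmax policy over 2 actions.
theta = np.zeros((2, 3))           # one weight row per action
x = np.array([1.0, 0.5, -0.2])     # features of s_t (invented)
alpha, G, a = 0.1, 2.0, 0          # step size, sampled return, action taken

logits = theta @ x
pi = np.exp(logits - logits.max())
pi /= pi.sum()                     # starts at [0.5, 0.5]

# grad of log softmax: (one-hot(a) - pi), outer product with the features
grad_log = (np.eye(2)[a] - pi)[:, None] * x[None, :]
theta += alpha * G * grad_log      # theta <- theta + alpha * G_t * grad log pi
print(theta)  # row for action 0 moved to +0.1*x, row 1 to -0.1*x
```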
now here again variance is the problem,
so we use the advantage as a baseline-corrected signal,
which is
$$
A(s,a) = Q(s,a) - V(s)
$$
and it's easy to make this REINFORCE with a
baseline, like
$$
\nabla J(\theta)
=
E_{\pi_\theta}
\left[
\nabla_\theta
\ln \pi_\theta (a_t | s_t)
(G_t - V(s_t))
\right]
$$
monte carlo with baseline
Now in Actor critic Temporal difference comes into play
Now instead of $G_t$ or baseline,
$$
A(s_t, a_t)
\approx
r_t
+ \gamma V(s_{t+1})
- V(s_t)
$$
bootstrap
Critic is value network
$$
V(s)
$$
can be
$$
Q(s,a;\omega)
$$
or
$$
V(s;\omega)
$$
Actor is policy network
$$
\pi(a|s;\theta)
$$
updates in direction suggested by critic
now policy param $\theta$
$$
\theta
\leftarrow
\theta
+
\alpha_\theta
\, Q(s,a;\omega)
\, \nabla_\theta
\ln \pi(a|s;\theta)
$$
& for critic value param $\omega$
$$
\omega
\leftarrow
\omega
+
\alpha_\omega
\, G_{t:t+1}
\, \nabla_\omega
Q(s,a;\omega)
$$
↓
$$
G_{t:t+1}
=
r_t
+
\gamma
Q(s',a';\omega)
-
Q(s,a;\omega)
$$
this eqn is basically gradient descent
$$
\omega_{\text{new}}
\leftarrow
\omega_{\text{old}}
-
\alpha
\frac{\partial L}{\partial \omega}
$$
now
$$
L(\omega)
=
\frac{1}{2}
(G_{t:t+1})^2
$$
basically MSE of the TD error
now the gradient (holding the bootstrapped target fixed)
$$
\frac{\partial L}{\partial \omega}
=
-\,
G_{t:t+1}
\,
\frac{\partial Q(s,a;\omega)}{\partial \omega}
$$
So that gives the eqn
$$
\omega
\leftarrow
\omega
+
\alpha_\omega
\, (\text{TD error})
\, \nabla_\omega
Q(s,a;\omega)
$$
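One combined actor-critic step (tabular critic, softmax actor); every number below is invented:

```python
import numpy as np

# One actor-critic step: tabular critic V, softmax actor over 2 actions.
V = {"s": 0.2, "s_next": 0.6}
theta = np.zeros(2)                 # actor logits for state s
a, r = 1, 1.0
alpha_th, alpha_w, gamma = 0.1, 0.5, 0.9

# critic: TD error, then move V toward the bootstrapped target
delta = r + gamma * V["s_next"] - V["s"]    # 1 + 0.54 - 0.2 = 1.34
V["s"] += alpha_w * delta

# actor: step in the direction the critic suggests
pi = np.exp(theta) / np.exp(theta).sum()    # [0.5, 0.5]
grad_log = np.eye(2)[a] - pi
theta += alpha_th * delta * grad_log
print(V["s"], theta)
```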
in A3C the first A is for Asynchronous: multiple
workers update global params in an async setting;
otherwise same as A2C.
Evolution Strategies: think of a distribution
over $d$-dimensional parameter vectors; now a sample
gives us one neural network's
params. So we basically have
mean + variance × epsilon.
Wherever we get a high fitness score, that becomes
our answer.
$$
J(\mu)
=
E_{\theta \sim p_\mu}
[F(\theta)]
$$
$$
\nabla_\mu
E_{\theta \sim \mathcal{N}(\mu,\sigma^2)}
[F(\theta)]
=
\nabla_\mu
\int
F(\theta)
p_\mu(\theta)
d\theta
$$
$$
=
\int
F(\theta)
\nabla_\mu
p_\mu(\theta)
d\theta
$$
$$
\nabla_\mu p_\mu(\theta)
=
p_\mu(\theta)
\nabla_\mu
\ln p_\mu(\theta)
$$
$$
=
\int
F(\theta)
p_\mu(\theta)
\nabla_\mu
\ln p_\mu(\theta)
d\theta
$$
$$
=
E_\theta
\left[
F(\theta)
\nabla_\mu
\log p_\mu(\theta)
\right]
$$
$$
\nabla_\mu
\log p_\mu(\theta)
=
\frac{\theta - \mu}{\sigma^2}
$$
$$
=
E_\theta
\left[
F(\theta)
\frac{\theta - \mu}{\sigma^2}
\right]
$$
$$
\theta
=
\mu
+
\sigma \varepsilon
\quad ,
\quad
\varepsilon \sim \mathcal{N}(0,I)
\quad ,
\quad
\theta \sim \mathcal{N}(\mu,\sigma^2 I)
$$
$$
=
\frac{1}{\sigma}
E_{\varepsilon \sim \mathcal{N}(0,I)}
\left[
F(\mu + \sigma \varepsilon)
\varepsilon
\right]
$$
then the usual gradient ascent step
$$
\mu_{\text{new}}
=
\mu_{\text{old}}
+
\alpha
\cdot
\text{Gradient}
$$
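A minimal ES sketch, assuming a toy fitness function that peaks at $\theta = 3$ (the fitness, population size, and learning rate are all invented; no antithetic sampling or other tricks):

```python
import numpy as np

# ES gradient estimate: grad_mu E[F(mu + sigma*eps)] ≈ (1/sigma) E[F(.) * eps].
rng = np.random.default_rng(0)

def F(theta):
    # toy fitness, a stand-in for "reward of the sampled network"
    return -np.sum((theta - 3.0) ** 2)

mu = np.zeros(5)
sigma, lr, pop = 0.1, 0.01, 200

for _ in range(300):
    eps = rng.standard_normal((pop, mu.size))
    fit = np.array([F(mu + sigma * e) for e in eps])
    fit = (fit - fit.mean()) / (fit.std() + 1e-8)  # standardize fitness
    grad = (eps * fit[:, None]).mean(axis=0) / sigma
    mu += lr * grad          # mu_new = mu_old + alpha * gradient estimate

print(mu)  # drifts toward [3, 3, 3, 3, 3]
```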
• In value-based methods we can use a different behavior policy /
replay buffers, cause the updates don't depend on
the policy, they only need transitions,
like
$$
Q(s,a)
\leftarrow
r
+
\gamma
\max_{a'}
Q(s',a')
$$
while policy grad :
$$
\nabla_\theta J(\theta)
=
E_{(s,a) \sim \pi_\theta}
\left[
Q^\pi(s,a)
\nabla_\theta
\log \pi_\theta(a|s)
\right]
$$
So the expectation is over the current policy $\pi_\theta$, but
the replay buffer holds old transitions $\sim \pi_{\text{old}}$,
so there is a distribution mismatch.
So that's why in PG, if using off-policy data, we
use importance sampling, where
$$
\frac{\pi_\theta(a|s)}{\beta(a|s)}
$$
new / old = importance weighting
$$
\nabla_\theta J(\theta)
=
E_{(s,a) \sim \mu}
\left[
\frac{\pi_\theta(a|s)}{\mu(a|s)}
Q(s,a)
\nabla_\theta
\log \pi_\theta(a|s)
\right]
$$
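Importance sampling itself is one line of numpy; a sketch with toy discrete policies (all numbers invented):

```python
import numpy as np

# Estimate E_pi[f(a)] from samples drawn under a different behavior
# policy beta, by reweighting each sample with pi/beta.
rng = np.random.default_rng(0)
pi = np.array([0.7, 0.3])      # target policy over two actions
beta = np.array([0.5, 0.5])    # behavior policy that generated the data
f = np.array([1.0, 10.0])      # any per-action quantity, e.g. Q(s, .)

a = rng.choice(2, size=100_000, p=beta)
est = np.mean((pi[a] / beta[a]) * f[a])   # reweighted estimate
true = float(np.sum(pi * f))              # 0.7*1 + 0.3*10 = 3.7
print(est, true)
```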
• Now a Summary
Reinforce →
$$
\nabla_\theta J(\theta)
=
E
\left[
G_t
\nabla_\theta
\log \pi_\theta (a_t | s_t)
\right]
$$
$$
\theta
\leftarrow
\theta
+
\alpha
G_t
\nabla_\theta
\log \pi_\theta (a_t | s_t)
$$
no bias, high variance due to the full return
now with a Q value instead
Actor Critic →
$$
\nabla_\theta J(\theta)
=
E
\left[
Q_\omega (s_t, a_t)
\nabla_\theta
\log \pi_\theta (a_t | s_t)
\right]
$$
$$
\theta
\leftarrow
\theta
+
\alpha_\theta
\, Q_\omega(s_t, a_t)
\, \nabla_\theta
\log \pi_\theta (a_t | s_t)
$$
now for value → how much to correct → TD error
$$ \delta_t =
r_t
+
\gamma Q_\omega(s',a')
-
Q_\omega(s,a)
$$
grad direction
$$
\nabla_\omega Q_\omega(s_t, a_t)
$$
$$
\omega
\leftarrow
\omega
+
\alpha_\omega
\, \delta_t
\, \nabla_\omega Q_\omega(s,a)
$$
See, the critic minimizes the error
$$
\min_\omega
( V_\omega(s_t) - \text{target} )^2
$$
so grad
$$
\nabla_\omega L
=
-2 \delta_t
\nabla_\omega V_\omega(s_t)
$$
Actor max reward
$$
\max_\theta
E
\left[
A_t
\log \pi_\theta (a_t | s_t)
\right]
$$
grad
$$
\nabla_\theta J
=
A_t
\nabla_\theta
\log \pi_\theta (a_t | s_t)
$$
So
$$
L_{\text{critic}}
=
(-V(s_t)
+
r_t
+
\gamma V(s_{t+1}))^2
$$
$$
\delta_t
=
r_t
+
\gamma V(s_{t+1})
-
V(s_t)
$$
$$
L_{\text{critic}}
=
\delta_t^2
$$
in Advantage Actor critic (A2C)
$$
\theta
\leftarrow
\theta
+
\alpha_\theta
\, A_t
\, \nabla_\theta
\log \pi_\theta (a_t | s_t)
$$
$$
A_t
=
Q(s_t,a_t)
-
V(s_t)
$$
$$
\omega
\leftarrow
\omega
+
\alpha_\omega
\, \delta_t
\, \nabla_\omega V_\omega(s_t)
$$
$$
\delta_t
=
A_t
\approx
r_t
+
\gamma V(s_{t+1})
-
V(s_t)
$$
now in DPG, see, the policy outputs one action,
$a = \mu(s)$, not stochastic like $\pi(\cdot|s)$, a
distribution.
So then where does randomness come from? Maybe
the env is stochastic, or the reward is (remember that
soda machine),
or sometimes we add noise.
But still, how do we take a gradient of action probabilities when the policy
outputs a single action???
Also, PG is naturally on-policy, but DPG can be off-policy
since the action is deterministic; when
sampling trajectories we add noise ($a = \mu_\theta(s) + \varepsilon$)
the general PG theorem form is
$$
\nabla_\theta J(\theta)
=
\sum_s p^\pi(s)
\sum_a
\pi(a|s)
\nabla_\theta
\log \pi(a|s)
Q(s,a)
$$
$$
=
E_{(s,a)\sim\pi_\theta}
\left[
Q^{\pi_\theta}(s,a)
\nabla_\theta
\log \pi(a|s)
\right]
$$
in DPG
$$
\nabla_\theta J(\theta)
=
E_{s \sim p^\mu}
\left[
\nabla_\theta
\pi_\theta(s)
\;
\nabla_a
Q^\pi(s,a)
\Big|
a = \pi_\theta(s)
\right]
$$
$$
p^\mu(s)
=
\sum_{t=0}^{\infty}
\gamma^t
P(s_t = s \mid \mu)
$$
with actions taken as $a_t = \mu_\theta(s_t)$
If you're confused, see, it's simple:
PG
$$
\nabla_\theta J(\theta)
=
\sum_s
p^\pi(s)
\sum_a
\pi(a|s)
\nabla_\theta
\log \pi(a|s)
Q^\pi(s,a)
$$
here
$$
p^\pi(s)
=
\sum_{t=0}^{\infty}
\gamma^t
P(s_t = s)
$$
now the stationary dist is the one normalized to sum to 1,
which is the probability-distribution rule
$$
p^\pi(s)
=
\frac{d^\pi(s)}{1-\gamma}
$$
here $\frac{1}{1-\gamma}$ is constant factor
but in DPG states comes from $p^\mu$ not $p^\pi$
So off policy hence we must mention which
distribution we are integrating over
$$
\nabla_\theta J(\theta)
=
\int
p^\mu(s)
\;
\nabla_a Q^\mu(s,a)
\;
\nabla_\theta \mu_\theta(s)
\Big|
_{a=\mu_\theta(s)}
\; ds
$$
$\mu$ → behavior policy
$\mu_\theta$ → actor & target policy
deterministic action
$$
\nabla_\theta
Q(s, \mu_\theta(s))
=
\frac{\partial Q(s,a)}{\partial a}
\cdot
\frac{\partial \mu_\theta(s)}{\partial \theta}
\Bigg|
_{a=\mu_\theta(s)}
$$
$$
\nabla_\theta J(\theta)
=
E_{s \sim p^\mu}
\left[
\nabla_a Q^\mu(s,a)
\;
\nabla_\theta \mu_\theta(s)
\Big|
_{a=\mu_\theta(s)}
\right]
$$
then DDPG is DPG + DQN (replay buffer & target networks)
now that gradient of Q means: slightly change the action → how much does the Q value change;
slightly change the actor params
→
how much does the action change.
here
$$
\nabla_a Q
\cdot
\nabla_\theta \mu
$$
means: how should I
change the actor params so that the action moves
uphill in Q.
So here instead of using probabilities we directly
push actions uphill.
Stochastic PG
$$
\nabla_\theta J(\theta)
=
E
\left[
Q(s,a)
\nabla_\theta
\log \pi_\theta(a|s)
\right]
$$
increase the probability of good actions
DPG
$$
\nabla_\theta J(\theta)
=
E
\left[
\nabla_a Q(s,a)
\cdot
\nabla_\theta \mu_\theta(s)
\right]
$$
move actions in direction that improves Q.
now D4PG → Distributed Distributional Deep Deterministic PG.
Here's what the words mean: many actors (distributed), a value distribution $Z$ instead of a scalar Q value (distributional), a deep neural net, and a single
deterministic action.
in MADDPG → single env, many agents, like:
- think of two robots carrying a
table, one pushes forward, the other
backward, so the table doesn't move.
Multi-agent RL is when the state transition depends on the joint action,
like single agent
$$
P(s' \mid s, a)
$$
multi agent (all action at once)
$$
P(s' \mid s, a_1, a_2, \dots, a_N)
$$
now agent $i$ only controls $a_i$, not the rest of the actions
$a_{-i}$,
but the next state depends on them. So from agent $i$'s
perspective,
the next-state distribution given only its own action must average
over what the other agents might do. So marginalization; also see how the agent works:
$$
P(s' \mid s, a_i, a_{-i})
$$
$$
P(s' \mid s, a_i)
=
\sum_{a_{-i}}
P(s' \mid s, a_i, a_{-i})
P(a_{-i} \mid s)
$$
$$
P(a_{-i} \mid s)
=
\prod_{j \ne i}
\pi_j(a_j \mid s_j)
$$
other agent policies
$$
\pi_j(a_j \mid s_j)
$$
now there is one actor per agent, so multiple actors;
but our marginalization would cause non-stationarity as the other
policies keep changing. So the critic directly sees all actions
$$
Q_i^\mu (s, a_1, a_2, \dots, a_n)
$$
So the critic is centralized but the actors aren't; here $o$ stands for observation
$$
a_i
=
\mu_i(o_i)
$$
$$
Q_i^\mu
(
\vec{o},
a_1,
\dots,
a_n
)
$$
↓
$$
\vec{o}
=
(o_1, \dots, o_n),
\qquad
\vec{\mu}
=
(\mu_1, \dots, \mu_n) \text{ parameterized by } \vec{\theta} = (\theta_1, \dots, \theta_n),
\qquad
\vec{a}
=
(a_1, \dots, a_n)
$$
Actor update
$$
\nabla_{\theta_i} J(\theta_i)
=
E_{\vec{o}, \vec{a}}
\left[
\nabla_{a_i}
Q_i^\mu
(
\vec{o},
a_1,
\dots,
a_n
)
\;
\nabla_{\theta_i}
\mu_{\theta_i}(o_i)
\Big|
_{a_i=\mu_{\theta_i}(o_i)}
\right]
$$
Critic update
$$
L(\theta_i)
=
E_{\vec{o}, a_1, \dots, a_n, r_1, \dots, r_n, \vec{o}'}
\left[
\left(
Q_i^\mu
(
\vec{o},
a_1,
\dots,
a_n
)
- y
\right)^2
\right]
$$
↓
$$
y
=
r_i
+
\gamma
Q_i^{\vec{\mu}'}
(
\vec{o}',
a_1',
\dots,
a_n'
)
\Big|
_{a_j'=\mu_j'(o_j')}
$$
So here
$$
a_j' = \mu_j'(o_j')
$$
in some MARL settings, other agents' policies are inaccessible.
So agent $i$ may not have access to the policies $\mu_j$ or
$\mu_j'$.
So without knowing $\mu_j'$ we can't get
$a_j'$ for the target $Q_i^{\mu'}(\dots)$, blocking the critic update.
So each agent $i$ learns an approximation $\mu_{ji}$ of every
other agent $j$'s policy.
Then the loss is just the negative log likelihood, plus an entropy regularizer:
$$
L(\phi_{ji})
=
-
E_{o_j, a_j}
\left[
\log \mu_{ji}(a_j \mid o_j)
+
\lambda
H(\mu_{ji})
\right]
$$
now in TRPO we constrain how much the policy can move away from $\pi_{\text{old}}$,
so it can't update too much / wildly: maximize
$$
J(\theta)
=
E_{s \sim p^{\pi_{\text{old}}}, \;
a \sim \pi_{\text{old}}}
\left[
\frac{\pi_\theta(a \mid s)}
{\pi_{\text{old}}(a \mid s)}
\;
\hat{A}^{\pi_{\text{old}}}(s,a)
\right]
$$
subject to the trust-region constraint
$$
E_{s \sim p^{\pi_{\text{old}}}}
\left[
D_{KL}
\big(
\pi_{\text{old}}(\cdot \mid s)
\,\|\,
\pi_\theta(\cdot \mid s)
\big)
\right]
\le \delta
$$
in PPO we use clipping instead, as the KL-div constraint can get complex to optimize
$$
r(\theta)
=
\frac{\pi_\theta(a \mid s)}
{\pi_{\text{old}}(a \mid s)}
$$
$$
J^{\text{TRPO}}(\theta)
=
E
\left[
r(\theta)
\hat{A}^{\pi_{\text{old}}}(s,a)
\right]
$$
$$
J^{\text{CLIP}}(\theta)
=
E
\left[
\min\left(
r(\theta)\hat{A}^{\pi_{\text{old}}}(s,a),
\text{clip}(r(\theta),1-\epsilon,1+\epsilon)\hat{A}^{\pi_{\text{old}}}(s,a)
\right)
\right]
$$
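The clipped surrogate per sample, sketched with invented ratios and advantages:

```python
import numpy as np

# PPO clipped surrogate per sample (ratios and advantages invented).
eps = 0.2
ratio = np.array([0.5, 1.0, 1.5, 3.0])
adv = np.array([1.0, 1.0, 1.0, -1.0])

unclipped = ratio * adv
clipped = np.clip(ratio, 1 - eps, 1 + eps) * adv
j_clip = np.minimum(unclipped, clipped)
print(j_clip)  # [0.5, 1.0, 1.2, -3.0]
```

Note the last entry: with a negative advantage and a large ratio, the min keeps the unclipped -3, so the objective stays pessimistic instead of being rescued by the clip.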
3 failure modes are → PPO is sensitive to initialization when the locally
optimal actions are close to the initialization; in discrete action spaces with sparse high
rewards, standard PPO might get stuck at suboptimal actions; and in continuous action spaces
PPO is
unstable when rewards
vanish outside of the suboptimal actions.
see
$$
\pi(a \mid s)
\sim
\mathcal{N}(\mu(s), \sigma(s))
$$
now this can output any number,
say $a_{\text{raw}} = 2.3$, but the env clips it to 1.
Now the reward is computed using $a_{\text{env}}$,
but the ratio uses $\pi_{\text{old}}(a_{\text{raw}} \mid s)$,
$\pi_{\text{new}}(a_{\text{raw}} \mid s)$;
see the mismatch. Also $r(\theta)$
is sensitive when $a_{\text{raw}}$ lies in the tail
of the probability distribution.
when the action clips to 1 the reward is flat, so
$\hat{A} \approx 0$ while $r(\theta)$ fluctuates wildly:
high variance × tiny advantage → terrible signal to noise
$$
D_{KL}
(\pi_{\text{old}} \,\|\, \pi_\theta)
$$
→ penalty when $\pi_\theta$ puts low prob
on actions that old policy liked
$$
D_{KL}
(\pi_\theta \,\|\, \pi_{\text{old}})
$$
→ penalty when $\pi_\theta$ puts high prob
on actions old policy considered very unlikely
now PPG → basically sharing parameters between the
policy and value heads sometimes causes harmful
interference, so train in phases,
hence the name Phasic PG.
Phase 1, for the policy: nothing changed, same loss
as before, just the value head is frozen
$$
L_{\text{CLIP}}(\theta)
=
E
\left[
\min\left(
r(\theta)\hat{A},
\text{clip}(r(\theta),1-\epsilon,1+\epsilon)\hat{A}
\right)
\right]
$$
Phase 2 Auxiliary phase :- here value improvement +
distillation into policy.
$$
L_{\text{joint}}
=
L_{\text{aux}}
+
\beta_{\text{clone}}
\;
E_t
\left[
KL
\big(
\pi_{\text{old}}(\cdot \mid s_t),
\pi_\theta(\cdot \mid s_t)
\big)
\right]
$$
↓
$$
L_{\text{aux}}
=
L_{\text{value}}
=
E_t
\left[
\frac{1}{2}
\left(
V_\omega(s_t)
-
\hat{V}_t^{\text{targ}}
\right)^2
\right]
$$
now, very important: the ACER method uses Retrace for the Q value.
An off-policy actor-critic already uses the sampling ratio,
so why does it need this?
See, to make on-policy things off-policy we have to
introduce the importance sampling ratio, right.
Now, see, off-policy actor-critic is single-step
learning, like
$$
Q(s_t, a_t)
\leftarrow
r
+
\gamma
V(s_{t+1})
$$
but this is a bootstrapped estimate, meaning we assume our
prediction is true. So there is bias & hence
it might be slow & take more iterations to converge.
Also, in off-policy AC the policies aren't that drifted.
Also, actor-critic is more sensitive to critic bias cause the
actor literally follows the critic's advantage estimate,
& if it has bias the actor might yolo sometimes; hence
ACER wants a better estimate, which multi-step or
n-step learning gives, but at the cost of higher variance:
in 1-step we scale by a single ratio $\rho_t$,
in n-step it's a product $\rho_t \times \rho_{t+1} \times \rho_{t+2} \cdots$,
so the n-step product becomes the instability source.
in ACER we use replay buffer so policies can differ
a lot.
and in pure off policy Q learning we just use
$$
\max_{a'}
Q(s', a')
$$
instead of
$$
\sum_{a'}
\pi(a'|s')
Q(s',a')
$$
hence no importance sampling ratio is needed, cause there is no
expectation under the policy at all.
In actor-critic, since it's 1-step, there's just one ratio term,
but in ACER it's a product, hence the problem.
So in ACER → Multi step returns, off policy correction, Replay buffer
see
$$
\delta_t
=
R_t
+
\gamma
E_{a \sim \pi}
Q(S_{t+1}, a)
-
Q(S_t, A_t)
$$
→ future update using the current
policy $\pi$.
$$
Q(S_t, A_t)
\leftarrow
Q(S_t, A_t)
+
\alpha
\delta_t
$$
so the update
$$
\Delta Q(S_t, A_t)
=
\alpha
\delta_t
$$
now introducing imp sampling
$$
\Delta Q^{\text{imp}}(S_t, A_t)
=
\gamma^t
\prod_{1 \le \tau \le t}
\frac{\pi(A_\tau | S_\tau)}
{\beta(A_\tau | S_\tau)}
\;
\delta_t
$$
now this would have high variance so
$$
\Delta Q^{\text{ret}}(S_t, A_t)
=
\gamma^t
\prod_{1 \le \tau \le t}
\min
\left(
c,
\frac{\pi(A_\tau|S_\tau)}
{\beta(A_\tau|S_\tau)}
\right)
\;
\delta_t
$$
so ACER uses $Q^{\text{ret}}$ as the critic target, minimizing
$$
\left(
Q^{\text{ret}}(s_t, a_t)
-
Q(s_t,a_t)
\right)^2
$$
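The truncated product can be sketched directly; the ratios, $c$, and the TD error are invented:

```python
import numpy as np

# Retrace-style truncation: clip each per-step ratio at c before the product.
c, gamma = 1.0, 0.99
ratios = np.array([0.5, 3.0, 1.2])    # pi/beta along a stored trajectory
delta = 1.7                           # TD error at step t (invented)

trunc = np.minimum(c, ratios)         # [0.5, 1.0, 1.0]
update = gamma**len(ratios) * trunc.prod() * delta
print(update)

# The raw product 0.5 * 3.0 * 1.2 = 1.8 can blow up over long trajectories;
# the truncated one is bounded by c**n.
```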
so the ACER policy gradient
$$
\hat{g}_t^{\text{acer}}
=
\omega_t
\left(
Q^{\text{ret}}(S_t, A_t)
-
V_\omega(S_t)
\right)
\nabla_\theta
\ln
\pi_\theta(A_t | S_t)
$$
$$
\omega_t
=
\frac{\pi(A_t | S_t)}
{\beta(A_t | S_t)}
$$
see, the value is a baseline & it's
evaluated under the current policy; $Q(s,a)$
needs correction cause we are evaluating a
specific action $a$ taken by the behavior policy,
& so there is an $\omega_t$ at the actor level
as well as at the critic level.
Now this eqn might also have high variance,
as $\omega_t$ might be large,
so truncate aggressively + correct truncation bias
$$
\min(c, \omega_t)
\left(
Q^{\text{ret}} - V
\right)
\nabla
\log
\pi(A_t | S_t)
$$
$$
E_{a \sim \pi_\theta}
\left[
\max(0, \omega(a) - c)
\frac{Q(s,a) - V(S_t)}
{\omega(a)}
\nabla
\log
\pi(a|S_t)
\right]
$$
alright, see: in the first term we truncated the
weight, dropping the $\omega - c$ excess, so now we add that part back; &
see, earlier the action came from the behavior policy,
but here we make $a \sim \pi_\theta$ cause we are trying
to estimate an expectation under the current policy $\pi$.
So
$$
E_\beta
\left[
\omega(a)
F(a)
\right]
=
E_\pi
\left[
F(a)
\right]
$$
now SAC (soft actor critic) → off policy
$$
J(\theta)
=
\sum_{t=1}^{T}
E_{(s_t,a_t)\sim P_{\pi_\theta}}
\left[
r(s_t,a_t)
+
\alpha
H(\pi_\theta(\cdot|s_t))
\right]
$$
← trajectories collected with policy $\pi_\theta$
see, there is entropy maximization as well, so
it keeps on exploring; here the policy is a stochastic
one, as a deterministic one wouldn't make sense.
$$
H(\pi(\cdot|S_t))
=
-
E_{a\sim\pi}
\left[
\log \pi(a|S_t)
\right]
$$
now, classic vs soft: how the reward & value func change
reward
$$
r_t
\quad \rightarrow \quad
r_t + \alpha H(\pi(\cdot|S_t))
$$
Value
$$
V^\pi(s)
=
E[R_t \mid S_t=s]
\quad \rightarrow \quad
E
\left[
R_t
+
\alpha
\sum_{k=t}^{\infty}
H_k
\;\middle|\;
S_t=s
\right]
$$
Q func
$$
Q^\pi(s,a)
=
E
\left[
R_t
\;\middle|\;
S_t=s, A_t=a
\right]
\quad \rightarrow \quad
r(s,a)
+
\gamma
E_{s'}
\left[
V^\pi(s')
\right]
$$
Now these two below are the classic relations; since the reward changed, the value functions change accordingly
$$
V^\pi(s)
=
E_{a\sim\pi}
\left[
Q^\pi(s,a)
\right]
$$
$$
R_t
=
\sum_{k=t}^{\infty}
\gamma^{k-t}
r_k
$$
soft return
$$
R_t^{\text{soft}}
=
\sum_{k=t}^{\infty}
\gamma^{k-t}
\left(
r_k
+
\alpha
H(\pi(\cdot|S_k))
\right)
$$
optimal soft
$$
Q^*(s,a)
=
r(s,a)
+
\gamma
E_{s'\sim p(\cdot|s,a)}
\left[
V^*(s')
\right]
$$
$$
V^*(s)
=
E_{a\sim\pi^*}
\left[
Q^*(s,a)
-
\alpha
\log \pi^*(a|s)
\right]
$$
now soft RL definitions
$$
R_t^{\text{soft}}
=
\sum_{k=t}^{\infty}
\gamma^{k-t}
(r_k + \alpha H(\pi(\cdot|S_k)))
$$
$$
V^\pi(s)
=
E_{a\sim\pi}
\left[
Q^\pi(s,a)
-
\alpha
\log \pi(a|s)
\right]
$$
$$
Q^\pi(s,a)
=
r(s,a)
+
\gamma
E_{s'\sim p(\cdot|s,a)}
\left[
V^\pi(s')
\right]
$$
now MSE losses
$$
J_V(\psi)
=
E_{s_t \sim D}
\left[
\frac{1}{2}
\left(
V_\psi(s_t)
-
E_{a_t\sim\pi_\theta}
[
Q_\omega(s_t,a_t)
-
\log \pi_\theta(a_t|s_t)
]
\right)^2
\right]
$$
$$
\nabla_\psi J_V(\psi)
=
\nabla_\psi V_\psi(s_t)
\left(
V_\psi(s_t)
-
Q_\omega(s_t,a_t)
+
\log \pi_\theta(a_t|s_t)
\right)
$$
D is replay buffer
soft Q func
$$
J_Q(\omega)
=
E_{(s_t,a_t)\sim D}
\left[
\frac{1}{2}
\left(
Q_\omega(s_t,a_t)
-
(r(s_t,a_t)
+
\gamma
E_{s_{t+1}}
[V_\psi(s_{t+1})])
\right)^2
\right]
$$
here the $V_\psi(s_{t+1})$ in the target is computed with a target value network (a slowly-updated copy of $\psi$)
now there is a thing called Boltzmann distribution related to equilibrium systems.
$$
P(E_i)
=
\frac{1}{Z}
e^{-E_i / k_B T}
$$
$$
Z
=
\sum_{i=1}^{N}
e^{-E_i / k_B T}
$$
each state $i$ has an energy $E_i$. The probability of finding the system in a particular state $i$ with that
energy
is
proportional to the exponential of minus the energy
divided by (Boltzmann constant × temperature). $Z$ is the partition function, basically a normalizing constant.
and as they are probabilities
$$
\sum_{i=1}^{N} P(E_i) = 1
$$
now again our objective was
$$
\mathbb{E}[Q(s,a)] + \alpha H(\pi)
$$
we are solving a constrained problem, so calculus tells us the optimal solution should satisfy
$$
\pi^*(a|s)
\propto
\exp\!\left(
\frac{Q(s,a)}{\alpha}
\right)
$$
see, this RHS is a Boltzmann distribution over Q.
KL divergence makes this the target, so
our policy moves toward it:
$$
\pi_{\text{new}}
=
\arg\min_{\pi' \in \Pi}
D_{KL}
\left(
\pi'(\cdot|s)
\,\bigg\|\,
\frac{
\exp(Q^{\text{old}}(s,\cdot)/\alpha)
}{
Z^{\text{old}}(s)
}
\right)
$$
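The Boltzmann-over-Q target is easy to see numerically; a sketch with one invented 3-action Q row at two temperatures:

```python
import numpy as np

# Boltzmann policy over Q: pi(a|s) proportional to exp(Q(s,a)/alpha).
q = np.array([1.0, 2.0, 3.0])   # toy Q row (invented)

def boltzmann(q, alpha):
    z = np.exp((q - q.max()) / alpha)   # subtract max for numerical stability
    return z / z.sum()

p_hot = boltzmann(q, alpha=10.0)   # high temperature: near-uniform, high entropy
p_cold = boltzmann(q, alpha=0.1)   # low temperature: near-greedy
print(p_hot, p_cold)
```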
policy objective
$$
J_\pi(\theta)
=
\nabla_\theta
D_{KL}
\left(
\pi_\theta(\cdot|s_t)
\,\bigg\|\,
\frac{
\exp(Q_\omega(s_t,\cdot)/\alpha)
}{
Z_\omega(s_t)
}
\right)
$$
$$
=
E_{a \sim \pi_\theta}
\left[
\log \pi_\theta(a_t|s_t)
-
\frac{1}{\alpha}
Q_\omega(s_t,a_t)
+
\log Z_\omega(s_t)
\right]
$$
here $Z_\omega(s_t)$ is the partition func that normalizes the
distribution; as it does not depend on the action, its
grad w.r.t. $\theta$ is zero, & since it's complex to
calc, this is good news.
Also, $\pi_\theta$ might not be able to represent an arbitrary
distribution, so we restrict it to simpler families like
a Gaussian, a GMM, or similar,
as if Gaussian
$$
\pi_\theta(a|s)
=
\mathcal{N}
\big(
\mu_\theta(s),
\sigma_\theta(s)
\big)
$$
hence a more expressive $\pi$ → better approximation of the Boltzmann target
→ better performance, but harder optimization & more compute.
see, this loss was
$$
J_\pi(\theta)
=
\mathbb{E}_{a_t \sim \pi_\theta}
\Big[
\alpha \log \pi_\theta(a_t|s_t)
-
Q_\omega(s_t,a_t)
+
\log Z_\omega(s_t)
\Big]
$$
here the temperature $\alpha$: we want the
entropy to stay $\ge H_0$, a constant,
to keep the policy exploring.
So now this is constrained optimization, handled
with the Lagrangian concept: maximize expected return
subject to the entropy constraint
$$
\mathbb{E}
\big[
-\log \pi_\theta(a_t|s_t)
\big]
\ge H_0
$$
Lagrangian:
$$
\mathcal{L}
=
\mathbb{E}
\Big[
Q_\omega(s_t,a_t)
-
\alpha
\big(
\log \pi_\theta(a_t|s_t)
+
H_0
\big)
\Big]
$$
& so the temp ($\alpha$) eqn comes up
$$
J(\alpha)
=
\mathbb{E}_{a_t \sim \pi_t}
\Big[
-\alpha \log \pi_t(a_t|s_t)
-
\alpha H_0
\Big]
$$
so $\alpha$ automatically adjusts.
if entropy low → increase $\alpha$,
if entropy high → decrease $\alpha$
TD3 → Twin Delayed Deep Deterministic Policy Gradient.
The main issue it targets is overestimation of the value func:
the policy follows the overestimate, which leads the update
astray & it kind of collapses.
The idea is inspired by Double Q-learning:
two critic networks
take minimum to reduce overestimation bias
target:
$$
y
\leftarrow
r_t
+
\gamma
\min_{i=1,2}
Q_{\theta_i'}
(s', \tilde{a})
$$
$$
\tilde{a}
=
\pi_{\theta'}(s')
+
\epsilon,
\quad
\epsilon
\sim
\text{clip}
\big(
\mathcal{N}(0,\sigma),
-c,
c
\big)
$$
see, sometimes there are artificial peaks in Q at some
action; adding some noise means the
neighbours, which have pretty low value, drag
the peak down.
this is target policy smoothing
and two target critic networks minimum of them
would act as target.
also delayed policy updates:
the critic is updated more frequently,
the actor more slowly,
to stabilize learning.
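The TD3 target in a few lines, with toy closed-form functions standing in for the two target critics and the target actor (everything here is invented for illustration):

```python
import numpy as np

# TD3 target: add clipped noise to the target action, then bootstrap on
# the minimum of two critics.
rng = np.random.default_rng(0)
gamma, sigma, c = 0.99, 0.2, 0.5

def q1(s, a): return -(a - 1.0) ** 2 + 2.0     # target critic 1 (toy)
def q2(s, a): return -(a - 1.0) ** 2 + 1.5     # target critic 2 (lower, toy)
def actor_target(s): return 0.8                # target actor mu'(s') (toy)

s_next, r = 0.0, 1.0
eps = float(np.clip(rng.normal(0.0, sigma), -c, c))  # target policy smoothing
a_tilde = actor_target(s_next) + eps
y = r + gamma * min(q1(s_next, a_tilde), q2(s_next, a_tilde))
print(y)  # bootstraps on q2, the pessimistic (minimum) critic
```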
SVPG (Stein Variational PG): why learn one policy, learn a whole population
of them that cooperate & repel.
maintain set of policies
$\{ \theta_i \}_{i=1}^N$
update combines:
1) policy gradient term → improve return
2) repulsive term → encourage diversity
update direction:
$$
\Delta \theta_i
=
\frac{1}{N}
\sum_{j=1}^N
\Big[
k(\theta_j,\theta_i)
\nabla_{\theta_j} J(\theta_j)
+
\nabla_{\theta_j}
k(\theta_j,\theta_i)
\Big]
$$
first term → pulls toward high reward,
second term → pushes particles apart
so instead of single optimum
you approximate distribution over good policies.
IMPALA method → Importance Weighted Actor-Learner Architecture
notation: value net $V_\theta$ (params $\theta$), policy $\pi_\phi$ (params $\phi$), behavior (old) policy $\mu$
multi-step corrected target:
$$
V_t
=
V_\theta(s_t)
+
\sum_{i=t}^{t+n-1}
\gamma^{\,i-t}
\Big(
\prod_{j=t}^{i-1} c_j
\Big)
\delta_i
$$
where
$$
\delta_i
=
\rho_i
\big(
r_i
+
\gamma V_\theta(s_{i+1})
-
V_\theta(s_i)
\big)
$$
importance ratios:
$$
\rho_i
=
\min
\left(
\bar{\rho},
\frac{\pi_\phi(a_i|s_i)}{\mu(a_i|s_i)}
\right)
$$
$$
c_i
=
\min
\left(
\bar{c},
\frac{\pi_\phi(a_i|s_i)}{\mu(a_i|s_i)}
\right)
$$
single-step case:
$$
V_t
=
V_\theta(s_t)
+
\delta_t
$$
and update:
$$
\theta
\leftarrow
\theta
+
\alpha
\big(
V_t
-
V_\theta(s_t)
\big)
\nabla_\theta V_\theta(s_t)
$$
pure importance sampling multi-step TD:
$$
V_t
=
\rho_t \delta_t
+
\gamma \rho_t \rho_{t+1} \delta_{t+1}
+
\dots
+
\gamma^n V_\theta(s_{t+n})
$$
where
$$
\rho_t
=
\frac{\pi_\phi(a_t|s_t)}{\mu(a_t|s_t)}
$$
variance explodes due to the product of ratios,
so instead we rewrite using TD errors
and clip the ratios → the V-trace correction
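The V-trace target can be computed directly from stored data; a sketch with invented ratios, rewards, and value estimates:

```python
import numpy as np

# V-trace target: V_t = V(s_t) + sum_i gamma^i (prod_{j<i} c_j) delta_i.
gamma, rho_bar, c_bar = 0.99, 1.0, 1.0
ratios = np.array([0.8, 2.0, 0.5])      # pi/mu at steps t, t+1, t+2 (invented)
r = np.array([1.0, 0.0, 0.5])
v = np.array([0.3, 0.2, 0.1, 0.0])      # V_theta(s_t) ... V_theta(s_{t+3})

rho = np.minimum(rho_bar, ratios)       # clipped rho_i
c = np.minimum(c_bar, ratios)           # clipped c_i
delta = rho * (r + gamma * v[1:] - v[:-1])

vt, coeff = v[0], 1.0
for i in range(len(delta)):
    vt += gamma**i * coeff * delta[i]   # coeff = product of c_j for j < i
    coeff *= c[i]
print(vt)
```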
value update:
$$
\Delta \theta
=
\big(
V_t
-
V_\theta(s_t)
\big)
\nabla_\theta V_\theta(s_t)
$$
actor update:
$$
\Delta \phi
=
\rho_t
\nabla_\phi
\log
\pi_\phi(a_t|s_t)
\Big(
r_t
+
\gamma V_\theta(s_{t+1})
-
V_\theta(s_t)
\Big)
+
\nabla_\phi H(\pi_\phi)
$$
summary:
off-policy correction via clipped importance ratios gives
stable multi-step learning;
decoupled actors generate data under $\mu$,
the learner updates toward $\pi$ using V-trace
Hope this helped you, & if there was some mistake you can DM me at @tm23twt on twitter. Till then keep learning :)