RL Basics & Policy Gradient

reinforcement learning • pretty math heavy, but broken down, baby

Introduction

Just a disclaimer: consider this a supplement to Lilian Weng's policy gradient blog, cause I made these notes while going through that hell (but ofc a wonderful experience). If you are fine with derivations & math, then go ahead, dawg.

The agent acts in the environment → how the environment reacts to actions is defined by the model (which we may or may not know) → a reward function scores the outcome.

RL Basics & Environment

The model defines the transition probability of the environment:

$$ P(s' \mid s, a) $$

The effective transition probability under our policy is:

$$ P^{\pi}(s' \mid s) = \sum_{a} \pi(a \mid s) \, P(s' \mid s, a) $$

If we know the model → we can do model-based RL → Dynamic Programming.

Otherwise, either go model-free or try to learn the model explicitly as part of the algorithm.

The model = transition probabilities & reward dynamics.

Model-based vs Model-free

Model-based → rely on a model of the environment, either it is known or the algorithm learns it explicitly.

Model-free → no dependency on the model during learning. It is like ignoring how the world works, but the methods tend to work anyway: they just map states to the best actions to maximize reward through trial & error.

Examples: Q-learning, or reinforcement learning methods that learn the policy $\pi$ directly.

Probability Basics

Joint $P = P(A \cap B)$ like $P(s',r|s,a)$

Marginal $P(A) = \sum_{b} P(A, b) = \int P(A,b) \, db$

Conditional $P(A|B) = \frac{P(A \cap B)}{P(B)}$

State Transitions

$$ S_t \; \xrightarrow{A_t} \; [R_{t+1}, S_{t+1}] $$

The state transition function can be defined in terms of $P(s',r \mid s,a)$

So say a vending machine is in state $(s)$, and the action $(a)$ is "insert coin"

Outcome A : get soda $(s')$ & reward +1 (it worked)

B : get soda $(s')$ & reward +5 (machine glitched & gave your coin back)

C : machine eats your coin ($s'$ = fail) & reward -1

now

$$ P(s', r \mid s, a) = P[ S_{t+1}=s' , R_{t+1}=r \mid S_t = s , A_t = a ] $$

$$ P_{ss'}^a = P(s'|s,a) = P[ S_{t+1}=s' \mid S_t = s , A_t = a ] = \sum_{r \in R} P(s',r \mid s,a) $$

this marginalizes out rewards

if the reward were always the same for a specific move, this summation would be a formality that doesn't change the numbers.

$$ R(s,a) = E[ R_{t+1} \mid S_t = s , A_t = a ] = \sum_{r \in R} r \sum_{s' \in S} P(s',r \mid s,a) $$
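To make the marginalization concrete, here is a tiny Python sketch of the vending machine; the probability split (0.90 / 0.02 / 0.08) is my assumption for illustration, not anything canonical:

```python
# Hypothetical vending-machine dynamics P(s', r | s = "coin_in", a = "insert"):
# each entry is (next_state, reward, probability), made-up numbers
outcomes = [
    ("soda", +1, 0.90),   # A: it worked
    ("soda", +5, 0.02),   # B: glitch, extra reward
    ("fail", -1, 0.08),   # C: machine ate the coin
]

# Marginalize rewards out to get P(s' | s, a)
p_next = {}
for s_next, r, p in outcomes:
    p_next[s_next] = p_next.get(s_next, 0.0) + p

# Expected immediate reward R(s, a) = sum_r r * sum_s' P(s', r | s, a)
R_sa = sum(r * p for _, r, p in outcomes)
```

Here `p_next["soda"]` collapses outcomes A & B into one transition probability, exactly the summation the equation above describes.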

Policy & Value Functions

Now Policy → agent's behavior function $\pi$ tells which action to take in state $s$.

• Deterministic $\quad \pi(s) = a$

• Stochastic $\quad \pi(a|s) = P_\pi [ A=a \mid S=s ]$

Return & Value

$$ G_t = R_{t+1} + \gamma R_{t+2} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} $$
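Computing $G_t$ right-to-left is the standard trick; a minimal sketch:

```python
def discounted_return(rewards, gamma):
    """G_t = sum_k gamma^k * R_{t+k+1}, accumulated right-to-left."""
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
    return G

# three rewards of 1 with gamma = 0.9: 1 + 0.9 + 0.81 = 2.71
G = discounted_return([1.0, 1.0, 1.0], gamma=0.9)
```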

State Value (s) $$ V_\pi(s) = E_\pi [ G_t \mid S_t = s ] $$

Action Value (s,a) $$ Q_\pi(s,a) = E_\pi [ G_t \mid S_t = s , A_t = a ] $$

$$ = R(s,a) + \gamma E_\pi [ V_\pi(S_{t+1}) \mid S_t = s , A_t = a ] $$

Bellman Equations

if we marginalize out the action we again get the value of the state

$$ V_\pi(s) = \sum_{a \in A} Q_\pi(s,a) \, \pi(a|s) $$

Advantage (A) $$ A_\pi(s,a) = Q_\pi(s,a) - V_\pi(s) $$

$$ V(s) = E[ R_{t+1} + \gamma V(S_{t+1}) \mid S_t = s ] $$

$$ Q(s,a) = E\left[ R_{t+1} + \gamma \, E_{a' \sim \pi}\big[ Q(S_{t+1}, a') \big] \;\middle|\; S_t = s , A_t = a \right] $$ the next action $a'$ is sampled from the policy

$$ Q_\pi(s,a) = R(s,a) + \gamma \sum_{s' \in S} P^a_{ss'} V_\pi(s') $$

⇒ the transition probability is just a way to cover all possible ways / situations of reaching $s'$

$$ V_\pi(s) = \sum_{a \in A} \pi(a|s) \left( R(s,a) + \gamma \sum_{s' \in S} P^a_{ss'} V_\pi(s') \right) $$

Optimality

$$ Q_\pi(s,a) = R(s,a) + \gamma \sum_{s' \in S} P^a_{ss'} \sum_{a' \in A} \pi(a'|s') Q_\pi(s',a') $$

$$ V_*(s) = \max_{\pi} V_\pi(s) \quad , \quad Q_*(s,a) = \max_{\pi} Q_\pi(s,a) $$

among all policies, pick the one that gives the highest expected return; these are the optimal values

$$ \pi_* = \arg\max_{\pi} V_\pi(s) \quad , \quad \pi_* = \arg\max_{\pi} Q_\pi(s,a) $$

returns a policy

$$ V_{\pi_*}(s) = V_*(s) \qquad Q_{\pi_*}(s,a) = Q_*(s,a) $$

⇒ best possible strategy & best possible reward

So, to find the optimal policy $(\pi_*)$: if the agent learns the optimal values, say $(Q_*)$, then finding the policy becomes easy, cause

then it's just

$$ \pi_*(s) = \arg\max_a Q_*(s,a) $$

in optimal case

$$ V_*(s) = \max_{a \in A} Q_*(s,a) $$ cause now we know the true Q values, so there's no need to explore; just act greedily (deterministically)

$$ Q_*(s,a) = R(s,a) + \gamma \sum_{s' \in S} P^a_{ss'} V_*(s') $$

⇒ again to cover all cases

$$ V_*(s) = \max_{a \in A} \left( R(s,a) + \gamma \sum_{s' \in S} P^a_{ss'} V_*(s') \right) $$

$$ Q_*(s,a) = R(s,a) + \gamma \sum_{s' \in S} P_{ss'}^a \max_{a' \in A} Q_*(s',a') $$
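Applied repeatedly, the Bellman optimality backup gives value iteration. A sketch on a made-up 2-state, 2-action MDP (all numbers are mine, chosen only so the optimal policy is easy to see):

```python
import numpy as np

# Toy deterministic MDP: P[s, a] = next state, R[s, a] = reward (made up)
P = np.array([[0, 1],
              [0, 1]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
gamma = 0.9

V = np.zeros(2)
for _ in range(500):
    # Bellman optimality backup: V*(s) = max_a ( R(s,a) + gamma * V*(s') )
    V = np.max(R + gamma * V[P], axis=1)

pi_star = np.argmax(R + gamma * V[P], axis=1)   # greedy policy from V*
```

With these numbers the optimal policy bounces between the states (take $a=1$ in state 0, $a=0$ in state 1), collecting rewards 1 & 2 forever.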

Policy Iteration (GPI)

In most cases we don't know $P_{ss'}^a$ or $R(s,a)$

$$ V(s) = \sum_{a} \pi(a|s) \sum_{s',r} P(s',r \mid s,a) \left( r + \gamma V(s') \right) $$

$$ V_{t+1}(s) = E_\pi [ r + \gamma V_t(s') \mid S_t = s ] $$ → this doesn't change how the agent behaves, it just makes the values more accurate for the current policy $\pi$

$$ Q_\pi(s,a) = \sum_{s',r} P(s',r \mid s,a) (r + \gamma V_\pi(s')) $$

$$ = E [ R_{t+1} + \gamma V_\pi(S_{t+1}) \mid S_t = s , A_t = a ] $$

→ this changes the agent's behavior

When we evaluate $V_{t+1}$ we update our knowledge of the current scenario, & when we improve $Q_\pi \to \pi'$ we update our behavior based on that knowledge.

$$ \pi'(s) = \arg\max_a Q_\pi(s,a) $$

So basically we have state values, Q values & a policy. Say we updated our policy: then we update our state values, which leads to Q value updates, then our new policy becomes this argmax, & the loop repeats.

This is Policy Iteration, & the general recipe is called Generalized Policy Iteration (GPI)

$$ \pi_0 \to V^{\pi_0} \to \pi_1 \to V^{\pi_1} \to \pi_2 \to \dots \to V_* $$
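The $\pi_0 \to V^{\pi_0} \to \pi_1 \to \dots$ loop as code, on a made-up 2-state, 2-action MDP (deterministic transitions, my own numbers):

```python
import numpy as np

# Toy MDP: P[s, a] = next state, R[s, a] = reward (made up)
P = np.array([[0, 1],
              [0, 1]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
gamma, n = 0.9, 2
states = np.arange(n)

pi = np.zeros(n, dtype=int)                      # pi_0: arbitrary start
for _ in range(20):                              # the GPI loop
    V = np.zeros(n)
    for _ in range(500):                         # policy evaluation sweeps
        V = R[states, pi] + gamma * V[P[states, pi]]
    new_pi = np.argmax(R + gamma * V[P], axis=1) # greedy improvement
    if np.array_equal(new_pi, pi):               # policy stable -> done
        break
    pi = new_pi
```

Evaluation makes $V$ accurate for the current policy; improvement takes the argmax; the loop stops when the policy no longer changes.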

MC vs TD Learning

Now, Monte Carlo is model-free but has more variance, since a single episode can skew things. In Temporal Difference (TD) learning we don't wait for the whole episode.

So in TD learning we use bootstrapping:

$$ V(S_t) \leftarrow V(S_t) + \alpha \left( R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right) $$

$$ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left( R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right) $$

SARSA & Q-Learning

now SARSA is on-policy TD control (finding the optimal policy within TD learning)

Q learning is off-policy TD control

In both SARSA & Q-learning the behavior is $\varepsilon$-greedy: usually

$$ A_{t+1} = \arg\max_{a \in A} Q(S_{t+1}, a) $$

but with probability $\varepsilon$ a random action is picked, so sometimes even an action with a lower Q value gets selected.

Q learning

$$ Q(S_t, A_t) \leftarrow \dots \left( R_{t+1} + \gamma \max_{a \in A} Q(S_{t+1}, a) \right) \dots $$

the key diff is that Q-learning doesn't follow the current policy to pick the second action $A_{t+1}$ when bootstrapping; it plugs in the best (greedy) estimate directly, while SARSA uses the action it actually takes.
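A side-by-side sketch of the two targets on made-up numbers (the transition and the pre-seeded Q values are my assumptions):

```python
import numpy as np

n_states, n_actions, alpha, gamma = 3, 2, 0.5, 0.9
Q = np.zeros((n_states, n_actions))

# one hypothetical transition (s, a, r, s') plus the follow-up action a'
s, a, r, s_next, a_next = 0, 1, 1.0, 1, 0
Q[1, 0], Q[1, 1] = 0.2, 0.7      # pretend these were learned earlier

# SARSA (on-policy): bootstrap with the action a' the policy actually took
sarsa_target = r + gamma * Q[s_next, a_next]    # 1 + 0.9*0.2 = 1.18

# Q-learning (off-policy): bootstrap with the greedy action, whatever we did
qlearn_target = r + gamma * Q[s_next].max()     # 1 + 0.9*0.7 = 1.63

Q[s, a] += alpha * (qlearn_target - Q[s, a])    # TD step toward the target
```

Same transition, different bootstrap: the targets differ exactly when the exploratory action $a'$ is not the greedy one.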

Deep Q-Networks (DQN)

But now the Q-learning table easily becomes large, hence we use a model (a neural network) to predict the Q value as $$ Q(s,a;\theta) $$

Training can be hard & unstable due to the deadly triad: non-linear function approximation, off-policy learning, bootstrapping.

So the DQN loss looks like

$$ L(\theta) = E_{(s,a,r,s') \sim U(D)} \left[ \left( r + \gamma \max_{a'} Q(s',a';\theta^-) - Q(s,a;\theta) \right)^2 \right] $$

here $U(D)$ means minibatches are sampled uniformly from replay memory $D$, & $\theta^-$ are the (periodically synced) target network params
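A numpy sketch of the DQN target computation; the lookup tables standing in for the networks and the minibatch values are all made up (a real DQN uses neural nets & a sampled replay buffer):

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.99

# tables standing in for Q(s,a;theta) and the frozen target Q(s,a;theta^-)
Q_online = rng.normal(size=(4, 2))
Q_target = Q_online.copy()                 # periodically synced copy

# a minibatch "sampled uniformly" from replay memory D: (s, a, r, s', done)
s    = np.array([0, 1, 2])
a    = np.array([1, 0, 1])
r    = np.array([1.0, 0.0, -1.0])
s2   = np.array([1, 2, 3])
done = np.array([0, 0, 1])

# y = r + gamma * max_a' Q(s', a'; theta^-); no bootstrap at terminal states
y = r + gamma * (1 - done) * Q_target[s2].max(axis=1)
td_error = y - Q_online[s, a]
loss = np.mean(td_error ** 2)              # L(theta), minimized w.r.t. Q_online
```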

n-step TD Learning

in generalized n-step TD learning

$$ V(S_t) \leftarrow V(S_t) + \alpha \left( G_t^n - V(S_t) \right) $$

$$ G_t^n = R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n}) $$
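A minimal sketch of the n-step return with a bootstrapped tail (numbers made up):

```python
def n_step_return(rewards, v_boot, gamma, n):
    """G_t^n = r_{t+1} + ... + gamma^{n-1} r_{t+n} + gamma^n V(S_{t+n})."""
    G = gamma ** n * v_boot           # bootstrapped tail
    for k, r in enumerate(rewards[:n]):
        G += gamma ** k * r           # discounted real rewards
    return G

# 1 + 0.9*1 + 0.81*10 = 10.0 with gamma = 0.9, n = 2
G = n_step_return([1.0, 1.0], v_boot=10.0, gamma=0.9, n=2)
```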

Policy Gradient (PG)

now: so far we learned state/action values & then selected actions accordingly; in PG we learn the policy directly.

$$ \pi(a|s ; \theta) $$ policy with $\theta$ param

$$ J(\theta) $$ : expected total reward , we want to max this

Discrete →

$$ J(\theta) = V_{\pi_\theta}(S_1) $$

this is the return from the initial state, so the goal is to find $\theta$ that maximizes the value of the initial (starting) state

Continuous / Stationary case → many starting states, or the agent runs forever (a continuing task)

(stationary distribution) $$ d^{\pi_\theta}(s) $$ ← So we track how often agent visit each state

i.e. the probability that the agent is in state $s$ if it follows $\pi_\theta$ for a long time

$$ J(\theta) = \sum_{s \in S} d^{\pi_\theta}(s) V_{\pi_\theta}(s) $$ this weighs each state's value by how often we visit it

$$ = \sum_{s} d^{\pi_\theta}(s) \sum_{a \in A} \pi(a|s,\theta) Q_\pi(s,a) $$

Policy Gradient Theorem

now the Policy Gradient Theorem allows us to calculate the gradient

$$ \nabla J(\theta) $$

without actually calculating the derivative of the state distribution

$$ d^{\pi_\theta}(s) $$

Cause when we change $\theta$ to improve $\pi$ we accidentally change where we go, $d^{\pi_\theta}(s)$, cause they are dependent.

now this $d^{\pi_\theta}(s)$ is the problem, & the theorem shows it can be replaced by $d(s)$, just a distribution treated as not depending on $\theta$

$$ J(\theta) = E_{\pi_\theta}[r] = \sum_{s \in S} d^{\pi_\theta}(s) \sum_{a \in A} \pi(a|s;\theta) R(s,a) $$

$$ J(\theta) = \sum_{s \in S} d(s) \sum_{a \in A} \pi(a|s;\theta) R(s,a) $$

$$ \nabla J(\theta) = \sum_{s \in S} d(s) \sum_{a \in A} \nabla_\theta \pi(a|s;\theta) \, Q_\pi(s,a) $$

$$ \nabla J(\theta) = \sum_{s \in S} d(s) \sum_{a \in A} \pi(a|s;\theta) \nabla_\theta \ln \pi(a|s;\theta) Q_\pi(s,a) $$

$$ = E_{\pi_\theta} \left[ \nabla \ln \pi(a|s;\theta) Q_\pi(s,a) \right] $$

REINFORCE (Monte Carlo PG)

REINFORCE / Monte Carlo PG → since we run whole episodes, we have the full return $G_t$

$$ \nabla J(\theta) = E_{\pi_\theta} \left[ \nabla_\theta \ln \pi_\theta(a_t|s_t) G_t \right] $$

So

$$ \theta \leftarrow \theta + \alpha \gamma^t G_t \nabla_\theta \ln \pi_\theta (A_t | S_t) $$

→ like $G_t$ at state $t = R_{t+1} + \gamma R_{t+2} + \cdots$
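A sketch of REINFORCE on a 3-armed bandit with a softmax policy. To keep it deterministic I use the exact expected gradient instead of sampled episodes; that substitution, and the reward numbers, are my assumptions:

```python
import numpy as np

n_actions, alpha = 3, 0.1
theta = np.zeros(n_actions)            # logits of a softmax policy (one state)
true_G = np.array([0.0, 0.5, 1.0])     # made-up expected return per action

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

for _ in range(500):
    pi = softmax(theta)
    # expected REINFORCE gradient: sum_a pi(a) * G(a) * grad log pi(a),
    # where grad log pi(a) for softmax logits is one_hot(a) - pi
    grad = np.zeros(n_actions)
    for a in range(n_actions):
        grad += pi[a] * true_G[a] * (np.eye(n_actions)[a] - pi)
    theta += alpha * grad              # gradient ascent on expected return

pi = softmax(theta)                    # policy concentrates on the best action
```

In a real run the inner sum is replaced by a single sampled action & sampled return, which is exactly where the variance comes from.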

but here again we get high variance

Advantage & Baselines

so we use advantage baseline which is

$$ A(s,a) = Q(s,a) - V(s) $$

and indeed we can also make it REINFORCE with a baseline, like

$$ \nabla J(\theta) = E_{\pi_\theta} \left[ \nabla_\theta \ln \pi_\theta (a_t | s_t) (G_t - V(s_t)) \right] $$

monte carlo with baseline

Actor-Critic Methods

Now in Actor critic Temporal difference comes into play

Now instead of $G_t$ or baseline,

$$ A(s_t, a_t) \approx r_t + \gamma V(s_{t+1}) - V(s_t) $$      bootstrap

Critic is the value network $V(s)$, which can be $Q(s,a;\omega)$ or $V(s;\omega)$

Actor is policy network $$ \pi(a|s;\theta) $$ updates in direction suggested by critic

now policy param $\theta$

$$ \theta \leftarrow \theta + \alpha_\theta \, Q(s,a;\omega) \, \nabla_\theta \ln \pi(a|s;\theta) $$

& for critic value param $\omega$

$$ \omega \leftarrow \omega + \alpha_\omega \, G_{t:t+1} \, \nabla_\omega Q(s,a;\omega) $$

$$ G_{t:t+1} = r_t + \gamma Q(s',a';\omega) - Q(s,a;\omega) $$

this eqn is basically gradient descent on a loss

$$ \omega_{\text{new}} \leftarrow \omega_{\text{old}} - \alpha \frac{\partial L}{\partial \omega} $$

now

$$ L(\omega) = \frac{1}{2} (G_{t:t+1})^2 $$

basically MSE of TD error

now the gradient, treating the target as constant (note the minus sign, which turns the descent update into a plus):

$$ \frac{\partial L}{\partial \omega} = -(\text{TD error}) \times \frac{\partial Q(s,a;\omega)}{\partial \omega} $$

So that gives eqn

$$ \omega \leftarrow \omega + \alpha_\omega \, (\text{TD error}) \, \nabla_\omega Q(s,a;\omega) $$
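One actor-critic step with the gradients replaced by made-up placeholder vectors (a real implementation backprops through the actor & critic networks):

```python
import numpy as np

gamma, alpha_theta, alpha_omega = 0.9, 0.1, 0.1

# one hypothetical transition and the critic's current estimates
r, Q_sa, Q_s2a2 = 1.0, 0.5, 1.0
delta = r + gamma * Q_s2a2 - Q_sa            # TD error G_{t:t+1} = 1.4

# placeholder vectors standing in for real network gradients
grad_log_pi = np.array([0.2, -0.1])          # actor: grad_theta log pi(a|s)
grad_Q = np.array([1.0, 0.5])                # critic: grad_omega Q(s,a;omega)

theta_step = alpha_theta * Q_sa * grad_log_pi    # actor follows the critic
omega_step = alpha_omega * delta * grad_Q        # critic shrinks the TD error
```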

A2C / A3C (Async Methods)

in A3C the first A is Asynchronous: multiple workers collect experience & update the global params asynchronously; A2C is the synchronous, batched version of the same setup.

Evolution Strategies

Evolution Strategies: think of a distribution over $d$-dimensional parameter vectors; a single sample gives us one neural network's params, built as mean + sigma × epsilon. Samples with high fitness scores pull the distribution toward themselves.

$$ J(\mu) = E_{\theta \sim p_\mu} [F(\theta)] $$

$$ \nabla_\mu E_{\theta \sim \mathcal{N}(\mu,\sigma^2)} [F(\theta)] = \nabla_\mu \int F(\theta) p_\mu(\theta) d\theta $$

$$ = \int F(\theta) \nabla_\mu p_\mu(\theta) d\theta $$

$$ \nabla_\mu p_\mu(\theta) = p_\mu(\theta) \nabla_\mu \ln p_\mu(\theta) $$

$$ = \int F(\theta) p_\mu(\theta) \nabla_\mu \ln p_\mu(\theta) d\theta $$

$$ = E_\theta \left[ F(\theta) \nabla_\mu \log p_\mu(\theta) \right] $$

$$ \nabla_\mu \log p_\mu(\theta) = \frac{\theta - \mu}{\sigma^2} $$

$$ = E_\theta \left[ F(\theta) \frac{\theta - \mu}{\sigma^2} \right] $$

$$ \theta = \mu + \sigma \varepsilon \quad , \quad \varepsilon \sim \mathcal{N}(0,I) \quad , \quad \theta \sim \mathcal{N}(\mu,\sigma^2 I) $$

$$ = \frac{1}{\sigma} E_{\varepsilon \sim \mathcal{N}(0,I)} \left[ F(\mu + \sigma \varepsilon) \varepsilon \right] $$

then same

$$ \mu_{\text{new}} = \mu_{\text{old}} + \alpha \cdot \text{Gradient} $$
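An ES sketch on a made-up quadratic fitness; the fitness standardization step is a common variance-reduction trick, not something derived above:

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(theta):
    # made-up fitness with its peak at theta = [3, -1]
    return -np.sum((theta - np.array([3.0, -1.0])) ** 2)

mu, sigma, alpha, pop = np.zeros(2), 0.5, 0.05, 200
for _ in range(300):
    eps = rng.standard_normal((pop, 2))              # eps ~ N(0, I)
    F = np.array([fitness(mu + sigma * e) for e in eps])
    F = (F - F.mean()) / (F.std() + 1e-8)            # standardize fitness
    # gradient estimate: (1 / (pop * sigma)) * sum_i F_i * eps_i
    mu += alpha / (pop * sigma) * (F[:, None] * eps).sum(axis=0)
# mu drifts toward the fitness peak [3, -1]
```

Note nothing is ever differentiated through `fitness`; the gradient w.r.t. $\mu$ comes purely from correlating noise with scores.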

Off-policy PG & Importance Sampling

• In value-based methods we can use a different behavior policy / replay buffers, since the updates don't depend on the policy that generated the data, only on the transitions.

like

$$ Q(s,a) \leftarrow r + \gamma \max_{a'} Q(s',a') $$

while policy grad :

$$ \nabla_\theta J(\theta) = E_{(s,a) \sim \pi_\theta} \left[ Q^\pi(s,a) \nabla_\theta \log \pi_\theta(a|s) \right] $$

So the expectation is over the current policy $\pi_\theta$, but a replay buffer holds old transitions $\sim \pi_{\text{old}}$, so there is a distribution mismatch.

That's why in PG, if using off-policy data, we use importance sampling, where

$$ \frac{\pi_\theta(a|s)}{\beta(a|s)} $$ current / behavior = the importance weight (the behavior policy is written $\beta$, or $\mu$ below)

$$ \nabla_\theta J(\theta) = E_{(s,a) \sim \mu} \left[ \frac{\pi_\theta(a|s)}{\mu(a|s)} Q(s,a) \nabla_\theta \log \pi_\theta(a|s) \right] $$
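A quick numerical check that the importance weight fixes the mismatch, on a made-up 2-action example:

```python
import numpy as np

rng = np.random.default_rng(0)

# behavior policy beta generated the data; we want expectations under pi
beta_probs = np.array([0.8, 0.2])
pi_probs   = np.array([0.3, 0.7])
f = np.array([1.0, 10.0])                 # any per-action quantity, e.g. Q(s,a)

a = rng.choice(2, size=100_000, p=beta_probs)   # samples from beta
w = pi_probs[a] / beta_probs[a]                 # importance weights pi / beta
est = np.mean(w * f[a])                         # ~= E_{a~pi}[f(a)]

true_val = np.sum(pi_probs * f)                 # 0.3*1 + 0.7*10 = 7.3
```

The reweighted mean lands near 7.3 even though most samples came from the "wrong" action; the large weights on rare actions are exactly the variance problem discussed later.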

Policy Gradient Summary

• Now a Summary

Reinforce →

$$ \nabla_\theta J(\theta) = E \left[ G_t \nabla_\theta \log \pi_\theta (a_t | s_t) \right] $$

$$ \theta \leftarrow \theta + \alpha G_t \nabla_\theta \log \pi_\theta (a_t | s_t) $$

no bias, but high variance due to the full return

now swap $G_t$ for a learned Q value

Actor Critic →

$$ \nabla_\theta J(\theta) = E \left[ Q_\omega (s_t, a_t) \nabla_\theta \log \pi_\theta (a_t | s_t) \right] $$

$$ \theta \leftarrow \theta + \alpha_\theta \, Q_\omega(s_t, a_t) \, \nabla_\theta \log \pi_\theta (a_t | s_t) $$

now for value → how much to correct → TD error

$$ \delta_t = r_t + \gamma Q_\omega(s',a') - Q_\omega(s,a) $$

grad direction

$$ \nabla_\omega Q_\omega(s_t, a_t) $$

$$ \omega \leftarrow \omega + \alpha_\omega \, \delta_t \, \nabla_\omega Q_\omega(s,a) $$

The critic minimizes the squared error

$$ \min_\omega ( V_\omega(s_t) - \text{target} )^2 $$

so with $\delta_t = \text{target} - V_\omega(s_t)$ the gradient is

$$ \nabla_\omega L = -2\, \delta_t \nabla_\omega V_\omega(s_t) $$

Actor max reward

$$ \max_\theta E \left[ A_t \log \pi_\theta (a_t | s_t) \right] $$

grad

$$ \nabla_\theta J = A_t \nabla_\theta \log \pi_\theta (a_t | s_t) $$

So

$$ L_{\text{critic}} = \left( r_t + \gamma V(s_{t+1}) - V(s_t) \right)^2 $$

$$ \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) $$

$$ L_{\text{critic}} = \delta_t^2 $$

in Advantage Actor critic (A2C)

$$ \theta \leftarrow \theta + \alpha_\theta \, A_t \, \nabla_\theta \log \pi_\theta (a_t | s_t) $$

$$ A_t = Q(s_t,a_t) - V(s_t) $$

$$ \omega \leftarrow \omega + \alpha_\omega \, \delta_t \, \nabla_\omega V_\omega(s_t) $$

a gradient step on the squared TD error $\delta_t^2$, where the TD error doubles as the advantage estimate:

$$ A_t \approx \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) $$


Deterministic PG (DPG)

now in DPG, see, the policy outputs one action $a = \mu(s)$; it's not stochastic like $\pi(\cdot|s)$, a distribution.

So then where does randomness come from? Maybe the env is stochastic, or the reward is (remember that soda machine), or sometimes we add noise ourselves.

But still, how do we take a gradient of action probabilities when the policy outputs a single action???

also, PG is naturally on-policy but DPG is off-policy, since the action is deterministic & for sampling trajectories we add noise ( $a = \mu_\theta(s) + \varepsilon$ )

the most general PG theorem form is

$$ \nabla_\theta J(\theta) = \sum_s p^\pi(s) \sum_a \pi(a|s) \nabla_\theta \log \pi(a|s) Q(s,a) $$

$$ = E_{(s,a)\sim\pi_\theta} \left[ Q^{\pi_\theta}(s,a) \nabla_\theta \log \pi(a|s) \right] $$

in DPG

$$ \nabla_\theta J(\theta) = E_{s \sim p^\mu} \left[ \nabla_\theta \pi_\theta(s) \; \nabla_a Q^\pi(s,a) \Big| a = \pi_\theta(s) \right] $$

$$ p^\mu(s) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s \mid \mu) $$ with the actions inside $Q$ evaluated at $a_t = \pi_\theta(s_t)$

If you're confused, see, it's simple:

PG

$$ \nabla_\theta J(\theta) = \sum_s p^\pi(s) \sum_a \pi(a|s) \nabla_\theta \log \pi(a|s) Q^\pi(s,a) $$

here

$$ p^\pi(s) = \sum_{t=0}^{\infty} \gamma^t P(s_t = s) $$

now the stationary distribution is the one that sums to 1, which is the rule for probability distributions, so

$$ p^\pi(s) = \frac{d^\pi(s)}{1-\gamma} $$

here $\frac{1}{1-\gamma}$ is constant factor

but in DPG states come from $p^\mu$, not $p^\pi$, so it's off-policy; hence we must state which distribution we are integrating over

$$ \nabla_\theta J(\theta) = \int p^\mu(s) \; \nabla_a Q^\mu(s,a) \; \nabla_\theta \mu_\theta(s) \Big| _{a=\mu_\theta(s)} \; ds $$

$\mu$ → behavior policy     $\mu_\theta$ → actor & target policy     deterministic action

$$ \nabla_\theta Q(s, \mu_\theta(s)) = \frac{\partial Q(s,a)}{\partial a} \cdot \frac{\partial \mu_\theta(s)}{\partial \theta} \Bigg| _{a=\mu_\theta(s)} $$

$$ \nabla_\theta J(\theta) = E_{s \sim p^\mu} \left[ \nabla_a Q^\mu(s,a) \; \nabla_\theta \mu_\theta(s) \Big| _{a=\mu_\theta(s)} \right] $$
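The chain rule above, numerically, with a made-up linear actor and quadratic critic (both are my toy choices, not from the blog):

```python
# Toy scalar example: actor a = mu_theta(s) = theta * s,
# critic Q(s, a) = -(a - 2)^2, so dQ/da = -2 * (a - 2)
theta, s, alpha = 0.0, 1.0, 0.1
for _ in range(200):
    a = theta * s                         # deterministic action
    dQ_da = -2.0 * (a - 2.0)              # how Q changes if the action changes
    da_dtheta = s                         # how the action changes with theta
    theta += alpha * dQ_da * da_dtheta    # push the action uphill in Q
# theta converges to 2.0, the action that maximizes Q
```

No probabilities anywhere: the update is pure chain rule through the action.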

DDPG (DPG + DQN)

the DDPG algorithm = DPG + DQN (replay buffer & target networks)

now that gradient chain means: slightly change the action → how much does the Q value change; slightly change the actor params → how much does the action change

here

$$ \nabla_a Q \cdot \nabla_\theta \mu $$

means how should i change actor params so that action moves uphill in Q

So here, instead of working with probabilities, we directly push actions uphill in Q.

Stochastic PG

$$ \nabla_\theta J(\theta) = E \left[ Q(s,a) \nabla_\theta \log \pi_\theta(a|s) \right] $$

inc prob. of good actions

DPG

$$ \nabla_\theta J(\theta) = E \left[ \nabla_a Q(s,a) \cdot \nabla_\theta \mu_\theta(s) \right] $$

move actions in direction that improves Q.

D4PG

now D4PG → Distributed Distributional Deep Deterministic Policy Gradient

here's what the words mean: Distributed = many parallel actors; Distributional = instead of a scalar Q value the critic outputs a return distribution $Z$; Deep = neural nets; Deterministic = single action output

Multi-Agent RL (MADDPG)

in MADDPG → single env, many agents. Think of two robots carrying a table: one pushes forward, the other backward, & the table doesn't move.

Multi-agent RL is when the state transition depends on the joint action.

like single agent

$$ P(s' \mid s, a) $$

multi agent (all action at once)

$$ P(s' \mid s, a_1, a_2, \dots, a_N) $$

now agent $i$ only controls $a_i$, not the rest of the actions $a_{-i}$,

but the next state depends on them, so from agent $i$'s perspective

the next-state distribution given only its own action must average over what the other agents might do. That's marginalization; the env really works like

$$ P(s' \mid s, a_i, a_{-i}) $$

$$ P(s' \mid s, a_i) = \sum_{a_{-i}} P(s' \mid s, a_i, a_{-i}) P(a_{-i} \mid s) $$

$$ P(a_{-i} \mid s) = \prod_{j \ne i} \pi_j(a_j \mid s_j) $$

other agent policies

$$ \pi_j(a_j \mid s_j) $$

now each agent has its own actor, so multiple actors

but the marginalization above causes non-stationarity, since the other policies keep changing. So instead the critic directly sees all actions:

$$ Q_i^\mu (s, a_1, a_2, \dots, a_n) $$

So the critic is centralized but the actors aren't; here $o$ stands for observation

$$ a_i = \mu_i(o_i) $$

$$ Q_i^\mu ( \vec{o}, a_1, \dots, a_n ) $$

$$ \vec{o} = (o_1, \dots, o_n), \qquad \vec{a} = (a_1, \dots, a_n), \qquad \vec{\mu} = (\mu_1, \dots, \mu_n) \text{ parameterized by } \vec\theta = (\theta_1, \dots, \theta_n) $$

Actor update

$$ \nabla_{\theta_i} J(\theta_i) = E_{\vec{o}, \vec{a}} \left[ \nabla_{a_i} Q_i^\mu ( \vec{o}, a_1, \dots, a_n ) \; \nabla_{\theta_i} \mu_{\theta_i}(o_i) \Big| _{a_i=\mu_{\theta_i}(o_i)} \right] $$

Critic update

$$ L(\theta_i) = E_{\vec{o}, a_1, \dots, a_n, r_1, \dots, r_n, \vec{o}'} \left[ \left( Q_i^\mu ( \vec{o}, a_1, \dots, a_n ) - y \right)^2 \right] $$

$$ y = r_i + \gamma Q_i^{\vec{\mu}'} ( \vec{o}', a_1', \dots, a_n' ) \Big| _{a_j'=\mu_j'(o_j')} $$

So here

$$ a_j' = \mu_j'(o_j') $$

in some MARL settings, other agents' policies are inaccessible:

agent $i$ may not have access to the policies $\mu_j$ or $\mu_j'$,

and without knowing $\mu_j'$ we can't get the $a_j'$ needed for the target $Q_i^{\vec\mu'}(\dots)$, blocking the critic update.

So each agent $i$ learns an approximation $\mu_{ji}$ of every other agent $j$'s policy.

then the loss is just a (negative) log-likelihood with an entropy bonus

$$ L(\phi_{ji}) = - E_{o_j, a_j} \left[ \log \mu_{ji}(a_j \mid o_j) + \lambda H(\mu_{ji}) \right] $$

PPO & TRPO

TRPO

now in TRPO we constrain how far the update can move away from $\pi_{\text{old}}$, so the policy can't change too wildly (a trust region):

$$ J(\theta) = E_{s \sim p^{\pi_{\text{old}}}, \; a \sim \pi_{\text{old}}} \left[ \frac{\pi_\theta(a \mid s)} {\pi_{\text{old}}(a \mid s)} \; \hat{A}^{\pi_{\text{old}}}(s,a) \right] $$

$$ E_{s \sim p^{\pi_{\text{old}}}} \left[ D_{KL} \big( \pi_{\text{old}}(\cdot \mid s) \,\|\, \pi_\theta(\cdot \mid s) \big) \right] \le \delta $$

PPO

in PPO we use clipping instead, since the KL-constrained optimization can get complex

$$ r(\theta) = \frac{\pi_\theta(a \mid s)} {\pi_{\text{old}}(a \mid s)} $$

$$ J^{\text{TRPO}}(\theta) = E \left[ r(\theta) \hat{A}^{\pi_{\text{old}}}(s,a) \right] $$

$$ J^{\text{CLIP}}(\theta) = E \left[ \min\left( r(\theta)\hat{A}^{\pi_{\text{old}}}(s,a), \text{clip}(r(\theta),1-\epsilon,1+\epsilon)\hat{A}^{\pi_{\text{old}}}(s,a) \right) \right] $$
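The clipped surrogate as code; the sample ratios and advantages are made up:

```python
import numpy as np

def ppo_clip_objective(logp_new, logp_old, adv, eps=0.2):
    """Per-sample PPO clipped surrogate (to be maximized)."""
    ratio = np.exp(logp_new - logp_old)
    return np.minimum(ratio * adv, np.clip(ratio, 1 - eps, 1 + eps) * adv)

adv = np.array([1.0, 1.0, -1.0])
logp_old = np.zeros(3)
logp_new = np.log(np.array([1.5, 0.5, 1.5]))   # ratios 1.5, 0.5, 1.5
obj = ppo_clip_objective(logp_new, logp_old, adv)
# gains are clipped at 1+eps (1.5 -> 1.2), but the negative-advantage term
# at ratio 1.5 stays at -1.5: the min never hides a pessimistic term
```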

PPO Failure Modes

3 failure modes are → (1) PPO is sensitive to initialization when locally optimal actions are close to the initialization; (2) in discrete action spaces with sparse high rewards, standard PPO can get stuck at suboptimal actions; (3) in continuous action spaces, PPO is unstable when rewards vanish outside a bounded support.

see

$$ \pi(a \mid s) \sim \mathcal{N}(\mu(s), \sigma(s)) $$

now the raw sample can be any number, say $a_{\text{raw}} = 2.3$, but the env clips it to 1. The reward is computed using $a_{\text{env}}$, while $\pi_{\text{old}}(a_{\text{raw}} \mid s)$ & $\pi_{\text{new}}(a_{\text{raw}} \mid s)$ use the raw action: see the mismatch. Also $r(\theta)$ is sensitive when $a_{\text{raw}}$ lies in the tail of the Gaussian.

when the action clips to 1 the reward is flat, so $\hat{A} \approx 0$ while $r(\theta)$ fluctuates wildly: tiny advantage × high-variance ratio → terrible signal-to-noise

$$ D_{KL} (\pi_{\text{old}} \,\|\, \pi_\theta) $$ → penalty when $\pi_\theta$ puts low prob on actions that old policy liked

$$ D_{KL} (\pi_\theta \,\|\, \pi_{\text{old}}) $$ → penalty when $\pi_\theta$ puts high prob on actions old policy considered very unlikely

Phasic Policy Gradient (PPG)

now PPG → when parameters are shared between the policy & value heads, the two objectives can interfere with each other, so train them in phases, hence the name Phasic Policy Gradient.

Phase 1, the policy phase :- nothing changed, same loss as before, just the value head is frozen

$$ L_{\text{CLIP}}(\theta) = E \left[ \min\left( r(\theta)\hat{A}, \text{clip}(r(\theta),1-\epsilon,1+\epsilon)\hat{A} \right) \right] $$

Phase 2 Auxiliary phase :- here value improvement + distillation into policy.

$$ L_{\text{joint}} = L_{\text{aux}} + \beta_{\text{clone}} \; E_t \left[ KL \big( \pi_{\text{old}}(\cdot \mid s_t), \pi_\theta(\cdot \mid s_t) \big) \right] $$

$$ L_{\text{aux}} = L_{\text{value}} = E_t \left[ \frac{1}{2} \left( V_\omega(s_t) - \hat{V}_t^{\text{targ}} \right)^2 \right] $$

ACER (Actor-Critic with Replay)

now a very important question: the ACER method uses Retrace for its Q values, but off-policy actor-critic already uses the sampling ratio, so why does it need this thing?

see, to make on-policy things off-policy we have to introduce the importance sampling ratio, right.

now see, off-policy actor-critic is single-step learning, like

$$ Q(s_t, a_t) \leftarrow r + \gamma V(s_{t+1}) $$

but this is a bootstrapped estimate, meaning we assume our own prediction is true, so there is bias, & it might be slow & take more iterations to converge.

also, in (1-step) off-policy AC the policies haven't drifted far apart

also, actor-critic is more sensitive to critic bias, because the actor literally follows the critic's advantage estimate; if that estimate is biased, the policy might yolo off in the wrong direction. Hence ACER wants a better estimate, which multi-step / n-step returns give, but those bring high variance.

now in 1-step we scale by a single ratio $\rho_t$; in n-step it's a product $\rho_t \times \rho_{t+1} \times \rho_{t+2} \cdots$. So n-step becomes an instability source.

in ACER we use replay buffer so policies can differ a lot.

and in pure off policy Q learning we just use

$$ \max_{a'} Q(s', a') $$

instead of

$$ \sum_{a'} \pi(a'|s') Q(s',a') $$

hence no imp sampling ratio needed. cause no policy evaluation at all.

In actor critic since 1 step just one ratio term but in ACER its a product hence problem.

Retrace & Importance Sampling

So in ACER → Multi step returns, off policy correction, Replay buffer

see

$$ \delta_t = R_t + \gamma E_{a \sim \pi} Q(S_{t+1}, a) - Q(S_t, A_t) $$

→ future update using the current policy $\pi$.

$$ Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \delta_t $$

so the update

$$ \Delta Q(S_t, A_t) = \alpha \delta_t $$

now introducing imp sampling

$$ \Delta Q^{\text{imp}}(S_t, A_t) = \gamma^t \prod_{1 \le \tau \le t} \frac{\pi(A_\tau | S_\tau)} {\beta(A_\tau | S_\tau)} \; \delta_t $$

now this would have high variance so

$$ \Delta Q^{\text{ret}}(S_t, A_t) = \gamma^t \prod_{1 \le \tau \le t} \min \left( c, \frac{\pi(A_\tau|S_\tau)} {\beta(A_\tau|S_\tau)} \right) \; \delta_t $$

so ACER uses this as target

$$ \left( Q^{\text{ret}}(s_t, a_t) - Q(s_t,a_t) \right)^2 $$
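The truncated-product Retrace weighting, numerically, with made-up ratios and TD errors:

```python
import numpy as np

gamma, c = 0.9, 1.0
# made-up per-step ratios pi/beta and TD errors along a stored trajectory
ratios = np.array([2.0, 0.5, 3.0])
deltas = np.array([1.0, -0.5, 0.2])

trunc = np.minimum(c, ratios)                    # min(c, ratio) per step
weights = gamma ** np.arange(3) * np.cumprod(trunc)
correction = np.sum(weights * deltas)            # total update for Q(S_0, A_0)
# truncation keeps every factor <= c, so the product can no longer blow up
```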

so acer pg

$$ \hat{g}_t^{\text{acer}} = \omega_t \left( Q^{\text{ret}}(S_t, A_t) - V_\omega(S_t) \right) \nabla_\theta \ln \pi_\theta(A_t | S_t) $$

$$ \omega_t = \frac{\pi(A_t | S_t)} {\beta(A_t | S_t)} $$

see: the value $V$ is a baseline & it's already an expectation under the current policy; $Q(s,a)$ needs correction cause we are evaluating the specific action $a$ taken by the behavior policy

& the ratio $\omega_t$ shows up at both the actor level & the critic level.

now this eqn also might have high variance as $\omega_t$ might be large.

so truncate aggressively + correct truncation bias

$$ \min(c, \omega_t) \left( Q^{\text{ret}} - V \right) \nabla \log \pi(A_t | S_t) $$

$$ E_{a \sim \pi_\theta} \left[ \max(0, \omega(a) - c) \frac{Q(s,a) - V(S_t)} {\omega(a)} \nabla \log \pi(a|S_t) \right] $$

alright, see: in the first term we truncated the weight at $c$, dropping the $\omega - c$ excess, so now we add it back. & notice the action, which earlier came from the behavior policy, is here sampled $a \sim \pi_\theta$, cause we are estimating an expectation under the current policy $\pi$.

So

$$ E_\beta \left[ \omega(a) F(a) \right] = E_\pi \left[ F(a) \right] $$

Soft Actor-Critic (SAC)

now SAC (soft actor critic) → off policy

$$ J(\theta) = \sum_{t=1}^{T} E_{(s_t,a_t)\sim P_{\pi_\theta}} \left[ r(s_t,a_t) + \alpha H(\pi_\theta(\cdot|s_t)) \right] $$

← trained with policy $\pi_\theta$

see, there is entropy maximization as well, so it keeps on exploring; the policy here is a stochastic one, since a deterministic policy would make the entropy term meaningless.

$$ H(\pi(\cdot|S_t)) = - E_{a\sim\pi} \left[ \log \pi(a|S_t) \right] $$

now, classic vs soft: how do the reward & value functions change?

reward

$$ r_t \quad \rightarrow \quad r_t + \alpha H(\pi(\cdot|S_t)) $$

Value

$$ V^\pi(s) = E[R_t \mid S_t=s] \quad \rightarrow \quad E \left[ R_t + \alpha \sum_{k=t}^{\infty} H_k \;\middle|\; S_t=s \right] $$

Q func

$$ Q^\pi(s,a) = E \left[ R_t \;\middle|\; S_t=s, A_t=a \right] \quad \rightarrow \quad r(s,a) + \gamma E_{s'} \left[ V^\pi(s') \right] $$

now, these two are from the classic setting; since the reward changed, the functions change accordingly, using

$$ V^\pi(s) = E_{a\sim\pi} \left[ Q^\pi(s,a) \right] $$

$$ R_t = \sum_{k=t}^{\infty} \gamma^{k-t} r_k $$

soft return

$$ R_t^{\text{soft}} = \sum_{k=t}^{\infty} \gamma^{k-t} \left( r_k + \alpha H(\pi(\cdot|S_k)) \right) $$

optimal soft

$$ Q^*(s,a) = r(s,a) + \gamma E_{s'\sim p(\cdot|s,a)} \left[ V^*(s') \right] $$

$$ V^*(s) = E_{a\sim\pi^*} \left[ Q^*(s,a) - \alpha \log \pi^*(a|s) \right] $$

now soft RL definitions

$$ R_t^{\text{soft}} = \sum_{k=t}^{\infty} \gamma^{k-t} (r_k + \alpha H(\pi(\cdot|S_k))) $$

$$ V^\pi(s) = E_{a\sim\pi} \left[ Q^\pi(s,a) - \alpha \log \pi(a|s) \right] $$

$$ Q^\pi(s,a) = r(s,a) + \gamma E_{s'\sim p(\cdot|s,a)} \left[ V^\pi(s') \right] $$

SAC Objectives & Losses

now MSE losses

$$ J_V(\psi) = E_{s_t \sim D} \left[ \frac{1}{2} \left( V_\psi(s_t) - E_{a_t\sim\pi_\theta} [ Q_\omega(s_t,a_t) - \log \pi_\theta(a_t|s_t) ] \right)^2 \right] $$

$$ \nabla_\psi J_V(\psi) = \nabla_\psi V_\psi(s_t) \left( V_\psi(s_t) - Q_\omega(s_t,a_t) + \log \pi_\theta(a_t|s_t) \right) $$

D is replay buffer

soft Q func

$$ J_Q(\omega) = E_{(s_t,a_t)\sim D} \left[ \frac{1}{2} \left( Q_\omega(s_t,a_t) - (r(s_t,a_t) + \gamma E_{s_{t+1}} [V_\psi(s_{t+1})]) \right)^2 \right] $$

in practice the bootstrap uses a target value network (a slowly-updated copy $V_{\bar\psi}$ of $\psi$)

Boltzmann Distribution & Equilibrium

now there is a thing called Boltzmann distribution related to equilibrium systems.

$$ P(E_i) = \frac{1}{Z} e^{-E_i / k_B T} $$

$$ Z = \sum_{i=1}^{N} e^{-E_i / k_B T} $$

each state $i$ has an energy $E_i$. The probability of finding the system in a particular state $i$ is proportional to the exponential of the (negative) energy divided by the Boltzmann constant times the temperature. $Z$ is the partition function, basically a normalizing constant.

and as they are probabilities

$$ \sum_{i=1}^{N} P(E_i) = 1 $$

now recall our entropy-regularized objective was

$$ \mathbb{E}[Q(s,a)] + \alpha H(\pi) $$

maximizing this is a constrained problem, & calculus (Lagrange) tells us the optimal solution should satisfy

$$ \pi^*(a|s) \propto \exp\!\left( \frac{Q(s,a)}{\alpha} \right) $$

see, this RHS is a Boltzmann distribution over Q. Take it as the target; a KL divergence then pulls our policy toward it

$$ \pi_{\text{new}} = \arg\min_{\pi' \in \Pi} D_{KL} \left( \pi'(\cdot|s) \,\bigg\|\, \frac{ \exp(Q^{\text{old}}(s,\cdot)/\alpha) }{ Z^{\text{old}}(s) } \right) $$
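For discrete actions the Boltzmann target itself is cheap to compute; made-up Q values, and the max-subtraction is a standard numerical-stability trick:

```python
import numpy as np

def boltzmann_policy(q, alpha):
    """pi(a) proportional to exp(Q(s,a)/alpha); alpha is the temperature."""
    z = np.exp((q - q.max()) / alpha)   # subtract max for numerical stability
    return z / z.sum()

q = np.array([1.0, 2.0, 3.0])
p_hot  = boltzmann_policy(q, alpha=10.0)   # high temperature -> near uniform
p_cold = boltzmann_policy(q, alpha=0.1)    # low temperature -> near greedy
```

The temperature $\alpha$ interpolates between pure exploration (uniform) and pure exploitation (argmax), which is exactly the knob SAC tunes.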

policy objective

$$ J_\pi(\theta) = D_{KL} \left( \pi_\theta(\cdot|s_t) \,\bigg\|\, \frac{ \exp(Q_\omega(s_t,\cdot)/\alpha) }{ Z_\omega(s_t) } \right) $$

$$ = E_{a \sim \pi_\theta} \left[ \log \pi_\theta(a_t|s_t) - \frac{1}{\alpha} Q_\omega(s_t,a_t) + \log Z_\omega(s_t) \right] $$

here $Z_\omega(s_t)$ is the partition function that normalizes the distribution; since it doesn't depend on the action, its gradient w.r.t. $\theta$ is zero, & it's complex to calculate, so that's convenient.

also, $\pi_\theta$ might not be able to represent an arbitrary distribution, so we restrict it to a simpler family like a Gaussian, a GMM, or similar.

as if Gaussian

$$ \pi_\theta(a|s) = \mathcal{N} \big( \mu_\theta(s), \sigma_\theta(s) \big) $$

hence a more expressive $\pi$ → better approximation of the Boltzmann target → better performance, but harder optimization & more compute.

see, this loss (scaled through by $\alpha$) was

$$ J_\pi(\theta) = \mathbb{E}_{a_t \sim \pi_\theta} \Big[ \alpha \log \pi_\theta(a_t|s_t) - Q_\omega(s_t,a_t) + \log Z_\omega(s_t) \Big] $$

here the temperature $\alpha$: we want the entropy to stay $\ge H_0$, a constant, to keep the policy exploratory.

So this is constrained optimization, & we use the Lagrangian concept:

maximize expected return subject to entropy constraint

$$ \mathbb{E} \big[ -\log \pi_\theta(a_t|s_t) \big] \ge H_0 $$

Lagrangian:

$$ \mathcal{L} = \mathbb{E} \Big[ Q_\omega(s_t,a_t) - \alpha \big( \log \pi_\theta(a_t|s_t) + H_0 \big) \Big] $$

& so the temp ($\alpha$) eqn comes up

$$ J(\alpha) = \mathbb{E}_{a_t \sim \pi_t} \Big[ -\alpha \log \pi_t(a_t|s_t) - \alpha H_0 \Big] $$

so $\alpha$ automatically adjusts: if entropy is low → increase $\alpha$ (explore more), if entropy is high → decrease $\alpha$.

TD3 (Twin Delayed DDPG)

TD3 → Twin Delayed Deep Deterministic Policy Gradient

the main issue it targets is overestimation of the value function: the policy chases the overestimated values, those feed back into the updates, & learning kind of collapses.

idea inspired from Double Q-learning

two critic networks take minimum to reduce overestimation bias

target:

$$ y \leftarrow r_t + \gamma \min_{i=1,2} Q_{\theta_i'} (s', \tilde{a}) $$

$$ \tilde{a} = \pi_{\theta'}(s') + \epsilon, \quad \epsilon \sim \text{clip} \big( \mathcal{N}(0,\sigma), -c, c \big) $$

see, sometimes there are artificial narrow peaks in Q at some action; adding noise to the target action means its neighbours, which have much lower value, drag the estimate down, smoothing the peak.

this is target policy smoothing

and the minimum of the two target critic networks acts as the target.

also: delayed policy updates, i.e. the critic is updated more frequently & the actor more slowly, to stabilize learning.
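A sketch of the TD3 target from the equations above, with made-up scalar critics; note $q_2$ deliberately overestimates, so the min picks $q_1$:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, sigma, c = 0.99, 0.2, 0.5

def q1(s, a): return -(a - 1.0) ** 2           # made-up target critic 1
def q2(s, a): return -(a - 1.0) ** 2 + 0.3     # critic 2 overestimates by 0.3

s_next, r = 0.0, 1.0
a_target = 0.8                                  # target actor output pi_theta'(s')
eps = np.clip(rng.normal(0.0, sigma), -c, c)    # clipped smoothing noise
a_tilde = a_target + eps

# twin trick: bootstrap with the MINIMUM of the two target critics
y = r + gamma * min(q1(s_next, a_tilde), q2(s_next, a_tilde))
```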

SVPG (Stein Variational PG)

why learn one policy when you can learn a whole population of them that cooperate & repel each other?

maintain set of policies $\{ \theta_i \}_{i=1}^N$

update combines:

1) policy gradient term → improve return

2) repulsive term → encourage diversity

update direction:

$$ \Delta \theta_i = \frac{1}{N} \sum_{j=1}^N \Big[ k(\theta_j,\theta_i) \nabla_{\theta_j} J(\theta_j) + \nabla_{\theta_j} k(\theta_j,\theta_i) \Big] $$

first term → pulls toward high reward, second term → pushes particles apart

so instead of single optimum you approximate distribution over good policies.

IMPALA & V-trace

IMPALA method → Importance Weighted Actor-Learner Architecture

value function $V_\theta$ with params $\theta$; policy $\pi_\phi$ with params $\phi$; old (behavior) policy $\mu$

multi-step corrected target:

$$ V_t = V_\theta(s_t) + \sum_{i=t}^{t+n-1} \gamma^{\,i-t} \Big( \prod_{j=t}^{i-1} c_j \Big) \delta_i $$

where

$$ \delta_i = \rho_i \big( r_i + \gamma V_\theta(s_{i+1}) - V_\theta(s_i) \big) $$

importance ratios:

$$ \rho_i = \min \left( \bar{\rho}, \frac{\pi_\phi(a_i|s_i)}{\mu(a_i|s_i)} \right) $$

$$ c_i = \min \left( \bar{c}, \frac{\pi_\phi(a_i|s_i)}{\mu(a_i|s_i)} \right) $$

single-step case:

$$ V_t = V_\theta(s_t) + \delta_t $$

and update:

$$ \theta \leftarrow \theta + \alpha \big( V_t - V_\theta(s_t) \big) \nabla_\theta V_\theta(s_t) $$

pure importance sampling multi-step TD:

$$ V_t = V_\theta(s_t) + \sum_{i=t}^{t+n-1} \gamma^{\,i-t} \Big( \prod_{j=t}^{i} \rho_j \Big) \big( r_i + \gamma V_\theta(s_{i+1}) - V_\theta(s_i) \big) $$

where

$$ \rho_t = \frac{\pi_\phi(a_t|s_t)}{\mu(a_t|s_t)} $$

variance explodes due to product of ratios

so instead rewrite using TD errors and clip ratios → V-trace correction

value update:

$$ \Delta \theta = \big( V_t - V_\theta(s_t) \big) \nabla_\theta V_\theta(s_t) $$

actor update:

$$ \Delta \phi = \rho_t \nabla_\phi \log \pi_\phi(a_t|s_t) \Big( r_t + \gamma V_\theta(s_{t+1}) - V_\theta(s_t) \Big) + \nabla_\phi H(\pi_\phi) $$
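Putting the V-trace pieces together on a made-up 3-step trajectory:

```python
import numpy as np

gamma, rho_bar, c_bar = 0.99, 1.0, 1.0
# made-up 3-step trajectory from replay: ratios pi/mu, rewards, values
ratio = np.array([1.5, 0.4, 2.0])         # pi_phi(a|s) / mu(a|s) per step
r = np.array([0.0, 1.0, 0.0])
V = np.array([0.5, 0.6, 0.2, 0.0])        # V_theta(s_t), ..., V_theta(s_{t+3})

rho = np.minimum(rho_bar, ratio)          # clipped delta weights
c = np.minimum(c_bar, ratio)              # clipped trace-cutting coefficients
delta = rho * (r + gamma * V[1:] - V[:-1])

# v_t = V(s_t) + sum_i gamma^(i-t) * (prod_{j<i} c_j) * delta_i
vs, coeff = V[0], 1.0
for i in range(3):
    vs += coeff * delta[i]
    coeff *= gamma * c[i]
```

Because every factor is clipped at 1, the coefficient can only shrink along the trajectory, which is what keeps the multi-step correction stable.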

summary:

off-policy correction via clipped ratios → stable multi-step learning

decoupled actors generate data under $\mu$; the learner updates toward $\pi$ using V-trace

Hope this helped you, & if there was some mistake you can DM me at @tm23twt on Twitter. Till then keep learning :)