TOC
0. Basic Idea1. Environment1.1 Create an “Ideal World”1.2 Quantify the Environment2. Agent2.1 An Intelligent Participant in the “Ideal World”2.2 How Intelligent Agents Act2.3 How Agents Become Intelligent3. Vanilla ModelFree RL Algorithms3.1 QLearning3.2 Policy Optimization3.3 OnPolicy v.s. OffPolicy4. Further Issues and Advanced Models4.1 Experience Replay4.2 Prioritized Experience Replay4.3 Overestimation, Target Network, Double DQN4.4 Dueling Network4.5 Deterministic Policy Gradient (DPG)4.6 Stochastic Policy for Continuous Action Space4.7 Trust Region Policy Optimization (TRPO)4.8 Proximal Policy Optimization (PPO)5. MultiAgent Reinforcement Learning5.1 Settings and Convergence5.2 Architectures of MARL6. Pipeline of Trading ExampleReferences
0. Basic Idea
1. Environment
1.1 Create an “Ideal World”
Take trading as an example. In this circumstance, the “Environment” is an exchange, the “Agent” is an trader. When we construct the environment, we are restricting the behaviors of agents like exchanges restrict traders in real life. For example, in Ashare market, exchanges restrict that traders cannot short selling and set the margin levels etc. We can quantity these restrictions and simulate a trading environment based on our assumptions.
Besides, the “Ideal World” also contains the interaction rules between the agent and the environment. For example, exchanges record the balance of traders’ accounts and whenever there is an transaction, the balance will change at the mean time.
In real world, exchanges can let different types of traders enter into the market (arbitragers, speculators, hedgers, simpletons, etc.) as long as they obey the trading rules and laws. In the environmnet module, we do not care about the agent types either even though our target is to try to get a smart agent after training. We are just setting up an objective world based on our assumptions.
1.2 Quantify the Environment
gym is a package helping us turn our “Ideal World” into codes. Actually, the package has two main functions.
Firstly, it’s a kind of “dataset” containing many environments (Atari, MuJoCo, Toy Text Classic Control, Box2D, etc.). It acts as an arena for different algorithms (used to train agents) just like ImageNet and MNIST datasets. Take the Atari games for example,
# gym official example import gym env = gym.make("LunarLanderv2", render_mode="human") observation, info = env.reset(seed=42) for _ in range(1000): action = policy(observation) # Userdefined policy function observation, reward, terminated, truncated, info = env.step(action) if terminated or truncated: observation, info = env.reset() env.close()
The scores of these games provide a measure of agents trained by different algorithms and thus provide us a reference for the comparison of different algorithms.
Secondly, it provides us the API (
gym.Env
) to custom our own environments. The API (class gym.Env
) is userfriendly and is compatible with other deep learning packages (pytorch, tensorflow, stable baselines, tianshou, etc.). It’s quite easy to set up our “Ideal World“ using gym.Env
by completing the four functions:class customEnvironment(Env): def __init__(self, df): ... # input data, inner attributes, etc self.action_space = # requisite, how agents can behavior self.observation_space = # requisite, what kinds of information can agents observe (provided by environment) def step(self, action): ... # interaction between agents and environments, like the settlement process between traders and exchanges return observation, reward, done, {} # return new information omitted by the environment, the reward (requisite attributes), whether the epsisode ends (one round game over) and other information if any. def reset(self): ... # Initialize the environment, like exchanges clear all records of existing transcations return observation def render(self, mode='human', close=False): ... # Print the information of each step when called. In Atari games, it plots the game widgets all the time
2. Agent
2.1 An Intelligent Participant in the “Ideal World”
Take trading as an example, whether simpletons or experienced fund managers, they can access the market information (observations like stocks’ prices, breaking news, fundamental messages, etc.) and the rewards of their actions (transcations with exchange). The difference between the simpletons and experienced fund managers, however, is how they take a use of these market information (observations & rewards). In other words, they have different policies () instructing them to take actions further (make transcations with exchange) after grabbing the information (observations & rewards). can be deterministic or stochastic . Just like experienced market makers won’t behavior immutably which may expose their market making patterns, we often use stochastic policies to get robust agents.
A simpleton may not reflect current situations nor evaluate actions it’s taking. While an experienced fund managers always do so. We can quantify the reflection and evaluation using value funcitons.
Notations:
 : status at time
 : action taken at time
 : reward of
Definitions:
 Discounted return: , where can be finite or infinite
 Actionvalue function: , where is the policy used
 is the expected discounted return when taking action in the circumstance of status
 For policy , evaluates how good it is for an agent to pick action while being in status
 Statevalue function: , (discrete action space) (continuous action space)
 is the expected discounted return w.r.t acitons under the circumstance of status
 For fixed policy , evaluates how good the situation is in status
 evaluates how good the policy is
 Optimal Actionvalue function:
 The expected discounted return of taking action in status under optimal policy in the environment
 Optimal Statevalue function:
 The expected discounted return of arbitrary actions in status under optimal policy in the environment
2.2 How Intelligent Agents Act
Firstly, suppose we have a good policy . Then once the intelligent agent observes status , it can random sample an action with .
Secondly, suppose we know the optimal actionvalue function . Then once we observe the status , we choose the action that maximizes the value
2.3 How Agents Become Intelligent
The above two ways intelligent agents act imply two main methods of ModelFree RL algorithms: Policy Optimation and QLearning. Actually, we can classify the RL algorithms as below
ModelFree v.s. ModelBased
The difference between training process of these two kinds of algorithms is whether there is an inner model simulating the environment. For ModelBased RL, it can estimate the observations & rewards of actions without even taking them thanks to the inner model. The inner model can be learned or given (like Alpha Zero). When the inner model is learnable, the real experience (observations & rewards) update the model each step as well.
ModelFree RL and ModelBased RL have different training paths. The explored space in one episode (one round game) of ModelFree RL (blue line) and ModelBased RL (blue line plus red lines) are shown below.
For ModelFree RL, they are comparably straight forward because there are no models or simulations. It uses the accurate environment instead of model predictions.
For ModelBased RL, they are more sample efficient which can decrease the chance of damaging hardward in some application scenario. However, the training process of ModelBased RL has higher computation cost.
3. Vanilla ModelFree RL Algorithms
3.1 QLearning
3.1.1 Gradient Descent
In this kind of algorithms, we try to approximate through training. Once we have it, we can choose the action that maximize during implementation, i.e. .
The core idea of updating our (denoted by while training) is using the equation below which is one kind of TD (temporary difference) algorithms.
where is the learning rate (a hyperparameter) and is the discounted rate.
The equation is actually, kind of, applying the idea of “gradient descend“. The prediction from our model is and the target (label) is which is called TD target. The reason why it can be viewed as a “label” is that it contains , which is an observed value instead of an approximation and thus making TD target closer to the real value than . The TD target comes from the Bellman Equation
The last equation is based on the Markov Process assumption, which means the information
which implies
The last equation is using Monte Carlo approximation.
When we use L2 loss function that is where is the label, the gradient is . Following gradient descent method, we get equation .
As for the formation of Qlearning, there are two specific versions. Frist, if state space (observation space) and action space are both discrete which can be listed in a table, we can use Tabular Version. Second, if state space or action space is continuous, we can construct a network to approximate . The network we construct is called Deep Q Network (DQN).
3.1.2 Tabular Version
Since is a function having only two inputs, we can draw a table and list all possible states and actions in it.
Pseudocode: Tabular Version Qlearning
Algorithm parameters: step size , small
Initialize , for all , arbitrarily except that
Loop for each episode:
Initialize
Loop for each step of episode:
Choose from using policy derived from (e.g. greedy)
Take action , observe
until is terminal
3.1.3 Deep Q Network (DQN)
Denote the network we construct by which is parameterized by and has inputs . The output is the (optimal) score of taking in status . One Mario game example is shown below where the DQN is a convolution network because inputs are images. We should design specifical networks for different issues. For example, RNN and LSTM architectures are better choices for timeseries inputs.
At this time the TD target is
And the loss function is
Using gradient descent method, we can update the parameters by
Pseudocode: DQN
Algorithm parameters: step size , small
Initialize with random parameters
Loop for each episode:
Initialize
Loop for each step of episode:
Choose from using policy derived from (e.g. greedy)
Predict the value:
Differentiate the value network:
Take action , observe
Calculate TD target:
Gradient descent:
until is terminal
3.2 Policy Optimization
3.2.1 Gradient Ascend
In this kind of algorithms, we try to approximate through training. Once we have it, we can randomly draw an action with . One Mario game example is shown below. The final layer is softmax which makes sum of outputs equal to 1.
Denote the network we construct to approximate by . Recall the Statevalue function is . We do not know either. There are two methods to calculate it, Reinforce method and ActorCritic method.
3.2.2 Reinforce Method
It means that we do not update the network until we go through one episode (complete one round game). At this time, is not regarded as a funtion parameterized by but determined by observed trajectories. This method is also called Reinforce method.
Assume one trajectory is , where is drawn from policy network and state transitions and rewards are determined by the environment. Then we can use the observation as an approximation of . Thus . Differentiate w.r.t we get
The last equation is using MC approximation, where is randomly sampled according to the PDF . Obviously, is an unbiased estimate of .
Pseudocode: Reinforce Algorithm: MonteCarlo PolicyGradient Control (episodic) for
Input: a differentiable policy parameterization
Algorithm parameter: step size
Initialize policy parameter
Loop for each episode:
Generate an episode , following
Loop for each step of episode:
The algorithm is trying to make bigger by gradient ascent. Bigger means the agent is performing better.
Besides, there is an empirical trick to train policy network efficiently. Using baseline , which does not vary with , the expected value of the update is unchanged while the variance can be smaller. Therefore, the updating algorithm can converge faster. can be learnable or not. Denote , replacing with yields the Reinforce Method with Baseline.
3.2.3 ActorCritic Method
It means we update the network with the progress of episode and is also an learnable network. That is where are parameters of network and respectively. In each step during one episode, we have two training tasks:
 Update policy network to increase the statevalue : It makes actor gradually performs better. Note that supervision is purely from the value network (critic) which may not be ground truth.
 Update value network to better estimate the return: It makes critic’s judgement more accurate. The supervision is purely from the rewards which is observed from the environment.
The training process is shown as below
Pseudocode: A2C
Input: a differentiable policy parameterization , a differentiable statevalue parameterization
Algorithm parameter: step size
Initialize policy parameter
Loop for each episode:
Initialize
Loop for each step of episode:
Sample randomly according to
Take action , observe
Sample randomly according to (do not perform )
Evaluate value network: and
Compute TD error: (using baseline — A2C)
Differentiate value network:
Update value network:
Differentiate policy network:
Update policy network:
until is terminal
3.3 OnPolicy v.s. OffPolicy
There is a kind of classification dividing Deep Reinfocement Learning algorithms into two categories: onpolicy algorithms and offpolicy algorithms.
As vanilla modelfree algorithms shown above, at each step during one episode, agent will choose an action to behave according to a policy. This policy is called behavior policy. For example, behavior policy of DQN is EpsilonGreedy policy where agent’s actions start off completely randomly and slowly begin to get more greedy and exploitative over time (not being greedy all the time). When calculating action value , agent will choose imagine an action to imitate according to a policy. This policy is called update policy. For DQN, the update policy assumes that we’re following a greedy policy that we are choosing actions which we believe will net us the most reward (being greedy all the time).
Onpolicy algorithms are those whose behavior policy is the same as its update policy. Offpolicy algorithms are those whose behavior policy is different from its update policy. Therefore, DQN is one kind of offpolicy algorithms while policy optimization algorithms are onpolicy algorithms.
4. Further Issues and Advanced Models
4.1 Experience Replay
Recall above all training algorithms, the data used in each step (during one episode) is no more than which is defined as a transition. It can be viewed as a single training data like a piece of photo in CV tasks.
For vanilla modelfree algorithms, there is a waste of experience because is only used once, followed by . What’s worse, and are strongly correlated. Consecutive states make the onpolicy algorithms using transitions sequentially harmful to the training process.
Experience replay is a trick which exploits a replay buffer storing transitions as below
During training, we can randomly sample a transition from replay buffer to calculate TD error and apply gradient descent.
4.2 Prioritized Experience Replay
Actually, not all transitions are equally important. Take Mario game for example, transitions below left are very common while transitions below right are very rare. If we let both be sampled at the same rate, the agent we train most likely not to pass the right scenario. Therefore, we should use importance sampling rather than uniform sampling.
 Option 1: Sampling probability
 Option 2: Sampling probability
 Transitions are sorted so that is in the descending order
 rank(t) is the rank of the tth transition
In sum, big shall be given high priotity.
Besides, we should scale learning rate to eliminate prediction bias. And if a transition is newly collected whose is unknown, we simply set its to the maximum and give it the highest priority.
The prioritized experience replay is shown as below.
4.3 Overestimation, Target Network, Double DQN
4.3.1 Overestimation
TD learning makes DQN overestimate action values. There are two reasons for it: Maximization and Bootstrapping.
Maximization
Let be observed real numbers. Add zeromean random noise to and obtain . The zeromean noise does not affect the mean that is . But the zeromean noise increases the maximum that is and decreases the minimum that is .
As for action values in DQN, let be true actionvalues. The noisy estimation made by DQN are and suppose they are unbiased that is . However, is typically an overestimation .
Therefore, is an overestimation of the true action value at time . The TD target, is thereby an overestimation. Since TD learning pushes towards which overestimates the true actionvalue, DQN’s estimations are also overestimated than true values.
Bootstrapping
In RL, bootstrapping means using an estimated value in the update step for the same kind of estimated value. TD learning performs bootstrapping because TD target in part uses which is an overestimation. When is used for updating , the overestimation is propagated back to DQN.
Maximization and Bootstrapping make DQN overestiamte action values. What’s worse, the overestimation is nonuniform (otherwise overestimation is not a problem). When we use transition to update , the TD target overestimates . TD algorithm pushes towards which make overestimates . The more frequently appears in the replay buffer, the worse overestimates .
4.3.2 Target Network
Target Network is proposed to avoid bootstrapping. Target network has the same network structure as the DQN while has different parameters: .
During training, we use to control the agent and collect experience: and use to compute TD target: . Note that in each step during one episode, we only update . There are two ways to update target network periodically:
However, target network cannot eliminate maximization.
4.3.3 Double DQN
When calculating TD target, we can divide it into to steps: choose the action and calculate action value.
Algorithms  Action Selection  Evaluation 
Naive DQN  Use DQN:
 Use DQN:

Target Network  Use Target Network:
 Use Target Network:

Double DQN  Use DQN:
 Use Target Network:

Double DQN is the best among the three though overestimation still happens. Double DQN can alleviate overestimation because:
4.4 Dueling Network
Dueling network changes the architecture of DQN to train a more robust agent. Define optimal advantage function: . It implies that
Thus
In the new architecture, we approximate by a neural network , approximate by a neural network . Thus, approximate by the dueling network:
which has two parameters: . Denote as .
A dueling network architecture for Mario game example is shown as below
The learning process of dueling network is the same as other DQNs. Tricks can be used in the same way like: Prioritized experience repaly, Double DQN, Multistep TD target, etc.
Although in equation , is equal to zero, it can help to overcome nonidentifiability. Thus, dueling network has robust performance. In practice, we can also repalce with which yields better performance in experiments.
4.5 Deterministic Policy Gradient (DPG)
ModelFree algorithms discussed above, however, all have discrete action space. Networks’ outputs are finite. For continuous action space, we can discretize the continuous dimensions. But this may result in dimension disaster which makes models hard to converge. For example, the robotic arm has two continuous dimensions. If we divide each dimensions into ten discrete parts (which obviously cannot perform well), there will be 20 dimensions.
DPG (or DDPG: Deep Deterministic Policy Gradient) is one kind of ActorCritic method. The architecture is shown below
Unlike ActorCritic algorithms mentioned above, DPG’s policy network (actor) is deterministic whose output is a determined value instead of probability distribution. The value network (critic) has two inputs . It’s output is a scalar evaluating how good the action is.
Pseudocode: DPG
Input: a differentiable policy parameterization , a differentiable statevalue parameterization
Algorithm parameter: step size
Initialize policy parameter
Loop for each episode:
Initialize
Loop for each step of episode:
Take action , observe
Value network makes prediction for time :
Value network makes prediction for time : where but is not taken
Compute TD error: (using baseline)
Differentiate value network:
Update value network:
Calculate DPG:
Update policy network:
until is terminal
Target Networks can be applied during the training of policy network to avoid bootstrapping. Target Networks have the same network structures as original network but with different parameters. At this time, value network makes a prediction for time : while target network makes a prediction for time :
The parameters of network is updated periodically.
Besides, tricks like experience replay, multstep TD target, etc. can also be applied during the training of DPG.
In the end, the comparisons between stochastic policy and deterministic policy are shown below.
4.6 Stochastic Policy for Continuous Action Space
For continuous action space, we can also design a different stochastic policy network. Let the degree of freedom be (the robotic arm above has two). Let be functions of state . Let be the ith elements of respectively. The policy function is then the PDF of multivariate normal:
We use networks to approximate . For , we use . For , denote . Approxiating with yields better performance in practice.
Once we get welltrained and , we can compute mean and log variance using the network when observing state :
Then compute . Finally, randomly sample action by .
To train the policy network , we should apply another network — Auxiliary Network (only used while training).
First, take log funtion of the policy network:
The whole network architecture is shown as below. Note that layers before dense layer should be designed according to specific problems, here is only a simple example.
From the definition: . Therefore,
 In Reinforce Method:
 In A2C Method:
4.7 Trust Region Policy Optimization (TRPO)
The Policy Optimization algorithms mentioned above have two main problems. First, samples are not exploited efficiently. Whether the offpolicy method (reinforce) or onpolicy method (actorcritic), each transition is only used once for updating parameter of policy network. Second, the convergence process is not smooth because each transition only updates policy network once and the effect of bad transition has further influence than traditional supervised learning. It will make agents we train less robust. TRPO and PPO below are proposed to avoid these two problems.
Importance sampling lemma: the below equation is obvious
In policy optimization algorithms, the object funtion is where . can also be if using policy optimization with baseline ( means advantage). Accoring to importance sampling lemma,
Applicaing MC approximation, we can approximate by
is one trajectory observed in one episode using .
Then we can update using Trust Region Method
The subject condition is to avoid too large step size where can be
 L2 regularization:
 KL divergence:
This restriction makes the convergence curve smoother, resulting in more robust model. Besides, the optimization task takes many iterations to complete which uses one trajectory (samples) multiple times (sample efficient). However, there will be more computation cost during the optimization process. It’s a tradeoff.
Pseudocode: TRPO
Input: a differentiable policy parameterization
Algorithm parameter: step size
Initialize policy parameter
Loop for each episode:
Generate an episode , following
Calculate
Approximation: , is parameter of last episode
Maximization (inner loop):
4.8 Proximal Policy Optimization (PPO)
PPO is an adjust of TRPO which is easier to calculate buy yields similar performance. The objective function of PPO is
is dynamic during training:
 When , increase
 When , decrease
PPO2 is another adjust algorithm, its objection funtion is
where is the advantage in baseline policy optimization and function has ouput as below left. ’s output is shown as below right.
Pseudocode: PPO
Input: a differentiable policy parameterization
Algorithm parameter: step size
Initialize policy parameter
Loop for each episode:
Generate an episode , following
Calculate
Calculate or
or
5. MultiAgent Reinforcement Learning
5.1 Settings and Convergence
When there are more than one agents in the system, agents are usually not independent. Every agent’s action can affect the next state and thus every agent can affect all the other agents. Unless the agents are independent of each other, singleagent RL methods do not work well for MARL. There are four types common settings:
 Fully cooperative, e.g. industrial robots.
 Fully competitive, e.g. predator and prey.
 Mixed cooperative & competitive, e.g. robotic soccer.
 Selfinterested, e.g. automated trading systems.
When no agent can get better expected return by improving its own policy, the MARL converges. If there is only one agent, convergence means the objective function does not increase any more. If there are muliple agents, Nash equilibrium means convergence.
5.2 Architectures of MARL
They are three typical architectures of MARL:
 Fully decentralized: Every agent uses its own observations and rewards to learn its policy. Agents do not communicate.
 Fully centralized: The agents send everything to the central controller. The controller makes decisions for all the agents.
 Centralized training with decentralized execution: A central controller is used during training. The contraoller is disabled after training.
In MARL, single agent may or may not have full knowledge of the state . Let be the ith agent’s observation. If it’s partial observation . Full observation means .
5.2.1 Fully Decentralized
During training, each agent interacts with environment independently and they do not share observations or actions. Take A2C for example, each agent has its own policy network (actor): and value network (critic): . At this time, the training process is the same as single agent setting.
During execution, each agent randomly samples an action from its own policy network.
This does not work well because the architecture ignores the influence between agents.
5.2.2 Fully Centralized
There is a policy network in central controller . During training, at each step in one episode, central controller receive observations and rewards of all agents and take them as inputs. Then central controller instructs actions to each agents to further interact with the environment.
During execution, central controller randomly samples actions from trained policy network and instructs each agent’s action.
In this architecture, communication and synchronization cost time and realtime decision is impossible.
5.2.3 Centralized Training with Decentralized Execution
In this architecture, each agent has its own policy network (actor): . The central controller has value networks (critic): . During training, the central controller knows all the agents’ observations, actions and rewards.
There are critic networks in central controller, specific for each agent. The central controller trains the critics for all . To update , TD algorithm take as inputs:
 All the actions:
 All the observation:
 The th reward:
Then each critic sends state value to its actor (agent’s policy network). Each agent locally trains the actor using policy gradient with input . Actors’ training can be local because they do not need inputs from other agents.
After training, the central controller is useless. During execution, each agent interacts with the environment independently by randomly sampling actions from its policy network .
Comparisons among these three architectures are shown below
5.2.4 Parameters Sharing
In architectures above, there are policy networks and value networks Parameter sharing means and for some and .
Whether there should be parameters sharing depends on application scenario. For example, soccer robots should not have shared parameters because they play different roles on the field. While Tesla atuodrived cars can have shared parameters because each car is selfinterested.
6. Pipeline of Trading Example
There are some popular packages implementing DRL algorithms like Stablebaselines3, Tianshou. Below is an example of PPO applied to train a trader agent.
References
 Prof. Shusen Wang: Deep Reinforcement Learning, Youtube.
 Prof. Hungyi Lee: DRL Lecture 2: Proximal Policy Optimization (PPO), Youtube.
 OpenAI: Spinning Up.
 Richard S. Sutton, Andrew G. Barto: Reinforcement Learning: An Introduction, 2nd edition.
 OpenAI: Gym Documentation.
 DLRRM: StableBaselines3 Docs.
 TSAIL (Tsinghua Machine Learning Group): Tianshou Docs.
Loading Comments...