Summary Review of Reinforcement Learning Algorithms
In the general framework of reinforcement learning (RL), exist an agent that each time step decides and executes an action a ∈ A to interact with the environment/simulator. Then the environment reacts by moving the agent from the present state of the environment sₜ∈ S, to the next state sₜ+1 ∈ S. During each step, the environment emits a scalar value signal, called reward rₜ ∈ R for reaching this state. Indicating on the agent if the chosen action results in a good or bad state. The interaction between the agent and the environment can continue forever or until some stop criteria would be satisfied. The main goal of the RL agent is to gather the largest amount of cumulative reward, which is called return R ∈ R.
where γ ∈ [0, 1] is the discount factor, Rt and rt are the return and reward function on time-step t. T ∈ R is the horizon of the MDP. To increase the return, the agent has to find an optimal policy. A policy is a function that outputs the action or the probability distribution over actions, in the case
of the policy being stochastic, that the agent should execute in the environment given its present state. Thus an optimal policy could be thought of as the best strategy employed by the agent to interact with the environment, in order to collect the biggest amount of cumulative rewards. Thus reinforcement learning algorithms try to discover the optimal policy, which increases the return of the agent during its interaction with the environment.
Moreover, the environment could be deterministic or stochastic. Deterministic means that the identical action of the agent been in the same state end up in identical next state. Stochastic environment, indicate that identical action of the agent in the same state won’t necessarily result in the same next state. Of course, these uncertainties have as result to make finding the optimal policy harder. The problem is commonly defined mathematically as a Markov decision process (MDP). An MDP is a way to represent the environment’s dynamics. It formulates the manner in which the environment will interact with the actions that the agent may take at a particular state.
An MDP, to be more precise, use the transition function. The transition probability function defines the resulting location of the agent on the environment. Having as input the present state of the agent and the actions that the policy has selected. In the scenario of a stochastic environment,
the transition function outputs the probability of transferring the agent to any of the next states. An MDP is also associated with the reward function. Collectively, the transition and reward functions are frequently referred to as the model of environment M = (p, r).
However, regularly doesn’t exist the transition and reward functions in all the environments. Hence, the agent can’t estimate the optimal policy using the MDP, because it is unknown. In the absence of those functions, when the MDP is unknown, the agent interacts with the environment and observes its responses in order to estimate the optimal policy. Over time, the agent gains an understanding of how the environment reacts to its actions, and it can thus begin to approximate the optimal policy. In this manner, the agent estimates the best way of behaving in an unfamiliar (or partially known) environment by interacting with it using a “trial-and-error” approach. The main split on RL is on-policy vs off-policy, in the context of this blog post are being reviewed and summarized several representative algorithms of RL.
1. On Policy Algorithms
In this section, we will focus on three on-policy learning algorithms which use policy gradient methods. Those methods depend on optimizing their unique policy with respect to the expected return using stochastic gradient ascent. Policy gradient methods can deal with continuous states and action space precisely the same way as the discrete ones. Convergence at least to a local optimum is guaranteed. Also, policy gradient methods could be used either in model-free or model-based environment settings as they’re a generic formulation.
1.1 REINFORCE
The REINFORCE [3] algorithm, also sometimes called as Monte-Carlo policy gradient or Vanilla Policy Gradient (VPG), is the simplest policy gradient method that relies on Qπ (s, a) ∈ R function. More complicated methods have been built and developed using REINFORCE, like the Trust Region Policy Optimization (TRPO) or the Proximal Policy Optimization (PPO).
In general, a policy gradient algorithm is a policy iteration approach, where policy is directly manipulated to reach the optimal policy that maximizes the expected return. This kind of algorithm is model-free reinforcement learning, so the environment dynamics or transition probability function aren’t known. Additionally, as it is depicted in (Figure 1), the use of the Mote-Carlo sampling from REINFORCE implies that the RL agent sample from the starting state to the goal state, rather than bootstrapping as Temporal Difference Learning and Dynamic programming doing.
In REINFORCE, as in other policy gradient algorithms, the gradient steps aim to minimize loss function by incrementally modifying the policy network parameters. Thus the resulting loss function in the original REINFORCE algorithm is derived as:
where the log term is the log probability of taking some action a at some state s, and Rₜ is the return, the sum of the discounted rewards from the time-step t up until the end of the episode. In practice, the above loss function isn’t typically used as its performance is restricted by the high variance in the return R over entire episodes. To combat this, the advantage estimate function is introduced in place of the return function R. The advantage function is given by:
where V (sₜ) is a learned value function that estimates the value of a given state Q(sₜ, aₜ) is the Q-function at state t, also called as action-value function, which is formulated as
which indicates the value of state s and an action an under the policy π. The r is the reward received from transitioning from state st, into state sₜ+1 , by taking action a, and γ is the discount rate. Adding the advantage in the loss function (Equation 2) then becomes:
Since the value function is learned over time as more updates are performed, it introduces some margin bias caused by the imperfect estimates but decreases the overall variance as well.
Aₜ^(1) ∈ R is high bias, low variance, while the Aₜ^(∞)∈ R is unbiased, high variance. A technique called Generalized Advantage Estimation (GAE) [5] is employed to compute the advantage in the loss term:
This introduces a hyperparameter λ ∈ [0, 1] that might be utilized to tune the amount of variance vs bias in each update, where λ=1 leads to maximum variance and nil bias, and λ=0 leads to opposite results.
1.2 Trust Region Policy Optimization (TRPO)
Trust Region Policy Optimization (TRPO) [6] is analogous to the natural policy gradient method since it uses gradient ascent to retrieve policies that increase the return. Also, is effective for optimizing large nonlinear policies such as neural networks. TRPO could be applied to environments with continuous or discrete action spaces. It updates the policy by taking the widest step possible that enhances its performance. However, first-order optimization isn’t very accurate for curved areas. It could get overconfident and make poor decisions that will ruin the progress of the training. For this reason, it should satisfy a special constraint on how close the new and old
policies can be. The constraint is expressed by the use of Kullback-Leibler (KL) Divergence [7], which is a measure of distance between two probability distributions. This imposes a constraint that the KL divergence is bounded at every point within the state space S. The loss function of TPRO algorithm measure how new policy π_θnew performs in comparison to the previous policy π_θold using data from the old policy:
where A(s,a) is advantage function (Equation 3) and D_KL (θ_old , θ_new ) is the mean of Kullback- Leibler divergence between policies across states visited by the old policy:
1.3 Proximal Policy Optimization (PPO)
Proximal policy optimization (PPO) [8] is an on-policy optimization algorithm that performs multiple epochs of stochastic gradient ascent in every policy update. PPO has the stability and reliability of TRPO. However, it is much easier to implement and has higher sample complexity. There are 2 primary variants of PPO: PPO-Penalty and PPO-Clip.
PPO-Penalty [8]: approximately solves a KL-constrained update, like TRPO. However, penalizes the KL-divergence among the objective function rather than creating a strict constraint. This variant of PPO automatically adjusts the penalty coefficient by updating parameter δ, over the course of training so that could be scaled appropriately.
PPO-Clip [8]: relies on specialized clipping within the objective function to eliminate incentives for the new policy to diverge from the old policy. As a result, it doesn’t use at all the KL constraint in the objective function.
The approach of PPO-Clip will be explained more in this section since it introduces novelty using the clipping while the PPO-Penalty indicates an improved implementation of TRPO where on each iteration compute and update δ of the KL-Divergence consistent with the curvature of the surface.
To optimize its policy, PPO alternate between sampling data generated from the current policy. Thus, for improving its policy on the sampled data, it is performing several epochs of stochastic gradient ascent. The novel objective function that enables multiple epochs of mini-batch updates is given by:
The goal for this objective function is as follows. The second term, clip(kₜ (θ), 1-e, 1+e)Aₜ^ (π_θold), modifies the surrogate objective function by clipping the probability ratio, which discards the incentive for moving kₜ outside of the interval [1-e, 1+e]. The probability ratio kt is clipped depending on whether the advantage is positive or negative.
When the output of the advantage function is positive (left plot on Figure 2), the objective function will increase if the action becomes more likely that is, since π_θnew (a|s) increases. Although the min places a proportional limit on how much the objective function can increase. When π_θnew (a|s)> (1+e ) π_θold (a|s), the min kicks in and this term hits a ceiling of (1+e) * Aₜ^(π_θold) (s, a). As a result, the new policy doesn’t profit by going away from the previous policy. In the opposite case of a negative advantage function (right plot on Figure 2), the objective function will increase if the action becomes less likely that it is since π_θnew (a|s) decreases. Having the max on this term introduces a limit of the amount that the objective could increase. Since π_θnew (a|s) < (1-e) π_θold (a|s), the max kicks in and this term hits a ceiling of (1-e) Aₜ^(π_θold) (s, a). Thus, again the new policy doesn’t profit by going far from the previous policy.
PPO uses a loss function that incorporates the policy surrogate and value function error terms. This objective can further be expanded by adding an entropy bonus term to provide sufficient exploration. Combining these terms, we obtain the following objective, which is maximized on each iteration:
where c1, c2 ∈ R are coefficients, S ∈ [0, 1] denotes the entropy bonus and Lₜ^(VF) ∈ R is a squared error loss (V_θ (st) − Vₜ ^(target) )^2. PPO tries to optimize a stochastic policy in an on-policy way. This suggests that it explores by sampling actions according to the most recent version of its stochastic policy. The amount of randomness in action selection depends on both initial conditions and training procedures. Over the course of training, the policy generally becomes less random, because the update rule encourages it to exploit rewards that it is already found. This could cause the policy to get trapped in local optima.
2. Off-Policy Algorithms
In this section, we will concentrate on 3 Markov Decision Process off-policy algorithms. Off-policy reinforcement learning algorithms are able to learn a new target policy by the historical data obtained from different behavior policies — under of non-stationary episodic Markov Decision Processes. This is convenient for having parallel learning (which could facilitate speeds up the convergence) or continuous exploration with the utilization of additional behavioral policies.
2.1 Q-learning
Q-learning [9] is a model-free reinforcement learning. It can also be viewed as a technique of asynchronous dynamic programming (DP) [1]. It provides agents with the aptitude of learning to act optimally in Markovian domains by experiencing the residuals of actions. For any finite Markov decision process (FMDP), Q-learning is able to retrieve an optimal policy in the sense of maximizing the expected value of the total reward over the successive steps, starting from the current state. It is important to note that Q-learning can only be used in environments with discrete action spaces. The theory of dynamic programming (DP) [22, 23] assures that exist at least one optimal stationary policy π^∗ such that
where p(s, a, s’) is the transition probability function and s, s’ depicts the state and next state accordingly. Using Dynamic programming (DP) instead of Q-learning will provide a variety of ways for calculating the optimal value function V^(∗) (s) and policy π^(∗) (a|s). However, DP assumes that the reward function and transition function are known. The task that is facing the Q-learning algorithm is to retrieve an optimal policy π^(∗) (a|s) without initially knowing reward and transition functions. Therefore the Q-function (or action-value) having the policy π(a|s) is derived as:
The optimal Q-function Q^(∗) (s, a) is formulated as:
In other words, the Q value function is the expected discounted reward for executing action a at state s by following policy π. The objective of the Q-learning algorithm is to estimate the Q values to retrieve the optimal policy. The Q-learning agent is following a 5 step process as is depicted in (Figure 3).
In addition, the Q-learning algorithm uses the action-replay process ARP (as described in the first implementation) where currently has the convenient name of Q-Table. Q-Table is a data structure used to compute the maximum expected future rewards for each action at any given state. Basically, this table will direct agent selection on the best action at each state. As a result, the Q-Learning algorithm tries to update the values of the Q-table. The equation to update the Q-table is:
where lr ∈ R is the learning rate and max Q’ (s’, a’ ) is the maximum expected future reward given the next state s’ and all possible actions on the next state. The optimal Q-function follows a fundamental identity rule known as the Bellman equation [10]. This is based on the following hypothesis: if the optimal value Q^(∗) (s’, a’) of the next state s’ was known for all possible actions a’, then the optimal strategy is to choose the action a’ maximizing the r + Q^(∗) (s’, a’).
2.2 Deep Q-learning (DQN)
Deep Q-learning [11] is a deep neural network algorithm that is trained with a variant of the Q-learning [9] algorithm using stochastic gradient descent to update its weights. The basic idea behind several reinforcement learning algorithms is to estimate the action-value function Q(s, a), by using the Bellman equation as an iterative update
Such value iteration algorithms converge to the optimal action-value function as i →∞ [1]. This approach is completely impractical in a real case scenario because the action-value function is estimated separately for each sequence, with no generalization. Instead, a function approximator is commonly used to estimate the action-value function Q(s, a; θ) ≈ Q∗ (s, a). In the machine learning community this is often a linear function approximator, but a non-linear function approximation, such as a neural network, may be used instead. A neural network function approximator with weights θ is considered as a Q-network. The Q-network is trained to minimize below loss functions L(θ) during each iteration
where
is the target for each iteration i. Note that DQN is a model-free algorithm as it solves the reinforcement learning task directly using samples from the emulator, rather than explicitly constructing an estimate of the model.
It is also off-policy since it learns a greedy strategy to select the next action
a = max_a Q(s, a; θ), while following a behavior distribution that ensures sufficient exploration of the state space. In practice, the behavior distribution is commonly chosen by an e-greedy strategy that follows the greedy strategy with probability 1 -e and selects a random action with probability e ∈ [0, 1]. DQN can’t be straightforwardly applied to continuous domains since it depends on finding the action that maximizes the Q(s, a) function, which in the continuous-valued case requires an iterative optimization process at every step.
Additionally, to reduce the issues of correlated data and non-stationary distributions, DQN use an experience replay mechanism [12] at each time-step eₜ = (sₜ, aₜ, rₜ, sₜ+1 ) from a data-set D = (e_1, …, e_N ), N ∈ N. Agent randomly samples previous transitions and thus smooth the training
distribution over a large number of previous behaviors. After using the experience replay buffer, the agent chooses and executes an action based on an e-greedy policy. Randomizing the samples breaks the strong correlations between them, therefore reducing the variance of the updates. In general, DQN only stores the last N experience pairs in replay buffer and samples uniformly at random from D when updating. Uniform sampling offers the same weight to all transitions in the replay buffer.
2.3 Combining Improvements in Deep Reinforcement Learning (Rainbow)
Combining Improvements in Deep Reinforcement Learning (Rainbow) [13] algorithm is a Q-learning-based algorithm by Deepmind that combines six independent enhancements (double Q-learning, prioritized replay, dueling networks, noisy nets, multi-step learning, and distributional RL) on the DQN algorithm. Each of these features addresses a limitation and improves the overall performance of the agent on the benchmark suite of 57 Atari 2600 games from the Arcade Learning Environment [14]. The use of Double Q-learning is to reduce overestimation bias by decoupling the maximization performed to calculate the target value
the term hat(θ) represents the parameters of a target network; a periodic copy of the online network which isn’t directly optimized. So, it is use the Q network (θ) to choose the best action with the argmax, and then the target network (hat(θ)) to get the value estimate. This decorrelates the noise
and avoids amplification.
DQN samples uniformly from the replay buffer which is wasteful. For this reason, RAINBOW is using a prioritized replay buffer. it is better to sample transitions with probability pₜ ∈ [0, 1] relative to the last encountered absolute TD error where the predicted reward diverges greatly from the expected reward and more recent transitions have a higher priority
where ω is a hyper-parameter that determines the shape of the distribution. Note that stochastic transitions might also be favored to be stored on replay buffer, even when there’s little left to learn about them.
The dueling network that exists in RAINBOW is a neural network architecture designed for value-based RL. Dueling DQN alters the neural network’s architecture by splitting it into two separate estimators after convolution. A primary estimator is a single number, the state value V(s). The second estimator produces one number for each action, indicating how much worse an action is in comparison to the best action in a state. In Q-Learning, the target value is calculated solely on the current reward. In Multi-step Q-Learning, rewards from n ∈ N steps are added together, and the Q function value is only added at the end.
Multi-step targets with suitably tuned n usually lead to faster learning [1]. Noisy nets propose a noisy linear layer that combines a deterministic and a noisy stream for better exploration of e-greedy policies
where e^b ∈ R and e^w ∈ R are random variables and denote the element-wise product. This transformation can then be used in place of the standard linear y = b + W x. The main idea of Distributional RL is to work directly with the complete distribution of the return rather than with its expectation. Let’s suppose that the random variable Z(s, a) be the return obtained by starting from state s, performing action a from the current policy. Then the Q-function gets the form of:
Instead of trying to minimize the error of Double Q-learning (Equation 19), which is basically a distance between expectations. Rainbow minimizes a distributional error using Kullbeck-Leibler divergence between the two distributions. So, Rainbow define the target distribution as
dₜ (n) =(rₜ^(n)+ γₜ^(n) z, p_(hat(θ)) (s’, a’)). The loss functions that it use is:
where Φz is the projection onto z. Also, the KL loss might be more robust to noisy stochastic environments because the loss can continue to decrease, even when the returns are not deterministic. Rainbow variants prioritize transitions by the KL loss since this is what the algorithm is minimizing:
3. Actor-Critic, Off Policy Algorithms
Actor-critic methods are Temporal-Difference (TD) [1] methods that have a separate memory structure to explicitly represent the policy independent of the value function. The policy structure is known as the actor because it is used to select actions. The estimated value function is known as the critic because it criticizes the actions made by the actor. Typically, the critic uses an approximation architecture and simulation to learn a state-value function V(s), which is then used to update the actor’s policy parameters and determine whether things have gone better or worse than expected. Such methods, as long as they are gradient-based, may have desirable convergence properties, in contrast to critic-only methods for which convergence is guaranteed in rather limited settings. They also hold the promise of delivering faster convergence (due to variance reduction) than actor-only methods.
3.1 Deep Deterministic Policy Gradient (DDPG)
Deep Deterministic Policy Gradient (DDPG) [2] is a model-free, off-policy actor-critic algorithm using deep function approximators. It could learn policies in high-dimensional, continuous action spaces. It uses off-policy data and the Bellman equation to learn the Q-function and based on that learn the policy. DDPG is based on the deterministic policy gradient (DPG) [15] algorithm. The DPG algorithm maintains a parameterized actor function μ(s|θμ ) which specifies the current policy by deterministically mapping states to a specific action. The critic Q(s, a) is learned using the Bellman equation as in Q-learning. The Bellman equation is the starting point for learning an approximator of Q^(∗) (s, a). As it happens in DQN, DDPG supposes the approximator is a neural network Q(s, a; θ) or Q_θ (s, a) with parameters θ. Then it is using the mean-squared Bellman error (MSBE) function, which indicates roughly how closely Q_θ comes to satisfying the Bellman
equation:
All standard algorithms that train a deep neural network to approximate the optimal Q-function Q∗(s, a) make use of an experience replay buffer. This is the set D = (e_1 , …, e_N ), N ∈ N of previous experiences, at each time-step eₜ = (sₜ , aₜ , rₜ , sₜ+1 ). In order for the algorithm to have stable behavior, the replay buffer should be large enough to contain a wide range of experiences. However, it may not always be good to keep everything.
Furthermore, Q-learning algorithms make use of target networks. The term in the loss function (Equation 26) r+γ maxa0 Q_θ (s’, a’) is called target. Since by minimizing the mean-squared Bellman error (MSBE) loss, DDPG is attempting to make the Q-function more similar to this target. The target, however, is dependent on the same parameters θ that the algorithm is trained on. The solution is to use a group of parameters that comes close to θ by using a second neural network called a target network. The parameters of the target network are denoted θ_targ. To be noted that the target neural network has a time delay from the Q_θ network when is needed to update its parameters θ_targ. The target network is simply copied over from the main network every some-fixed-number of steps in DQN-based algorithms. Polyak averaging is used in DDPG style algorithms to update the target network once per main network update.
where ρ is a hyperparameter between 0 and 1 (usually close to 1). DDPG wants to learn a deterministic policy μ_φ (s) which gives the action that maximizes Q_θ (s, a). Because the action space is continuous but assumes that the Q-function is differentiable with respect to action, DDPG performs a stochastic gradient ascent (with respect to policy parameters only) to solve
Putting it all together, Q-learning in DDPG is performed by minimizing the following MSBE loss with stochastic gradient descent:
where μ_φ_targ is the target policy. To construct an exploration policy μ’, DDPG add noise sampled from a noise process N to it is actor policy
N can be chosen to suit the environment.
3.2 Twin Delayed Deep Deterministic (TD3)
In Q-learning, the value estimate is updated with a greedy target y = r + max_a’ Q(s’ , a’ ). However, if the target is susceptible to error (bias) e ∈ R, then the maximum over the value along with its error will generally be greater than the true maximum,
This is a common mode of failure for the DDPG algorithm. The learned Q-function starts grossly overestimating Q-values. Then leads to sub-optimal policies updates and divergent behavior, since it tries to exploit the errors in the Q-function. Twin Delayed DDPG (TD3) [16] is an algorithm built on the DDPG using Double Q-learning that addresses this issue by introducing three new features: clipped double-Q learning, target policy smoothing, and “delayed” policy update. Together, these three features result in substantially improved performance of DDPG.
Using Clipped Double Q-learning, the value target can’t introduce any additional overestimation, as in the case of using just the standard Q-learning target approach. However, this update rule may induce an underestimation bias. This is far preferable to overestimation bias. Both Q-functions are using a single target. This target is calculated using whichever of the two Q-functions
will give the lower target value:
and then both Loss functions are learned by using regression on this target:
Using the smaller Q-function value for the target and regressing towards it helps to avoid Q-function overestimation bias. Also, as in DDPG the actor policy is learned just by maximizing Q_φ1 :
However, the delayed policy updates feature of TD3 means that the actor policy is updated less frequently than the Q-functions. By sufficiently delaying the policy updates TD3 reduces the likelihood of repeating updates with respect to an unchanged critic. The less frequent policy updates will occur in the actor policy. Will result to estimate a value function V (s) with lower variance, and in principle, should emerge in higher quality policy updates.
Target policy smoothing essentially acts as a regularizer for the algorithm. Using the clipped noise, the target action is clipped to fall within the valid action range (all valid actions a satisfy a_Low ≤ a ≤ a_High ). As a result, the target actions are as follows:
This addresses a specific failure mode in DDPG. If the Q-function approximator develops an inaccurate sharp peak for some actions, the policy will rapidly exploit that peak, resulting in brittle or incorrect behavior. This can be avoided by smoothing the Q-function over similar actions, which is what target policy smoothing is intended to do.
3.3 Soft Actor-Critic (SAC)
Soft Actor-Critic (SAC) [17] is an algorithm that optimizes a stochastic policy in an off-policy way. It is bridging the gap between stochastic policy optimization and DDPG-style methods. It is not a direct successor to TD3, despite the fact that both were released around the same time, but it integrates the clipped double-Q trick. The central feature of SAC is entropy regularization which provides a substantial improvement in exploration and robustness. The policy is trained to maximize a trade-off among expected return and entropy, which is a measure of the policy’s randomness. Entropy is closely related to the exploration-exploitation trade-off: increasing entropy leads to more exploration, which can speed up learning later on. It can also keep the policy from convergent to a local optimum. Entropy is a quantity that indicates how random a random variable is. Assume x ∈ R is a random variable with a probability mass function or density function P. The entropy H of x is calculated from its distribution P as follows:
In an entropy-regularized reinforcement learning environment, the policies are designed to maximize the expected future return, as well as the expected future entropy. As a result, this will maximize the value function V^π (s). The equation of the value function that includes the entropy is derived as:
where β > 0 is a temperature parameter that determines the relative importance of the entropy term against the reward. Thus, parameter β controls the stochasticity of the optimal policy. More exploration corresponds to a higher β. Lower β values indicate greater exploitation. Subsequently, the Q function including the entropy term from every timestep except the first one will be:
SAC also uses mean-squared Bellman error (MSBE) as a loss function for its policy π_θ. The loss function is incorporating a clipped double-Q-function Q_φ1, Q_φ2. During training, SAC selects the minimum Q-value between the two Q approximators. As a result, the loss function for the Q-networks in SAC is:
where the target is given by:
ã’ indicates that the next actions have to be sampled from the policy and not from the replay buffer D. SAC uses the reparameterization trick [18] to optimize the target policy. This trick is used to make sure that sampling from the policy πθ (·|s) is a differentiable process so that there are no problems in backpropagating the errors. Moreover, independent noise is added to achieve
better exploration:
tanh in the SAC policy ensures that actions are bounded to a finite range, while σ_θ (s) describes the standard deviation which is parameterized by θ.
References
[1] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.
[2] Timothy P Lillicrap et al. “Continuous control with deep reinforcement learning”. In: arXiv preprint arXiv:1509.02971 (2015).
[3] Ronald J Williams. “Simple statistical gradient-following algorithms for connectionist reinforcement learning”. In: Machine learning 8.3–4 (1992), pp. 229–256.
[4] Lilian Weng. Difference between Monte-Carlo, Temporal-Difference Learning, and Dynamic programming. url: https : / / lilianweng . github . io / lil — log / 2018 / 02 / 19 / a — long -peek-into-reinforcement-learning.html.
[5] John Schulman et al. “High-dimensional continuous control using generalized advantage estimation”. In: arXiv preprint arXiv:1506.02438 (2015).
[6] John Schulman et al. “Trust region policy optimization”. In: International conference on machine learning. PMLR. 2015, pp. 1889–1897.
[7] S. Kullback and R. A. Leibler. “On Information and Sufficiency”. In: The Annals of Mathematical Statistics 22.1 (1951), pp. 79–86. doi: 10.1214/aoms/1177729694. URL: https://doi.org/10.1214/aoms/1177729694.
[8] John Schulman et al. “Proximal policy optimization algorithms”. In: arXiv preprint arXiv:1707.06347 (2017).
[9] Christopher JCH Watkins and Peter Dayan. “Q-learning”. In: Machine learning 8.3–4 (1992), pp. 279–292.
[10] Richard Bellman. “Dynamic programming”. In: Science 153.3731 (1966), pp. 34–37.
[11] Volodymyr Mnih et al. “Playing atari with deep reinforcement learning”. In: arXiv preprint arXiv:1312.5602 (2013).
[12] Long-Ji Lin. Reinforcement learning for robots using neural networks. Tech. rep. Carnegie-Mellon Univ Pittsburgh PA School of Computer Science, 1993.
[13] Matteo Hessel et al. “Rainbow: Combining improvements in deep reinforcement learning”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 32. 1. 2018.
[14] Marc G. Bellemare et al. “The Arcade Learning Environment: An Evaluation Platform for General Agents”. In: CoRR abs/1207.4708 (2012). arXiv: 1207.4708. url: http://arxiv. org/abs/1207.4708.
[15] David Silver et al. “Deterministic policy gradient algorithms”. In: International conference on machine learning. PMLR. 2014, pp. 387–395.
[16] Scott Fujimoto, Herke Hoof, and David Meger. “Addressing function approximation error in actor-critic methods”. In: International Conference on Machine Learning. PMLR. 2018, pp. 1587–1596.
[17] Tuomas Haarnoja et al. “Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor”. In: International Conference on Machine Learning. PMLR. 2018, pp. 1861–1870.
[18] Diederik P Kingma and Max Welling. “Auto-encoding variational bayes”. In: arXiv preprint arXiv:1312.6114 (2013).