Summary Review of Imitation Learning Algorithms.
Reinforcement learning has been successfully applied in domains where the reward function is well defined. In real-world scenarios, however, it is often not clear at all what the reward function should be, and manually designing and tweaking one for a given task is incredibly difficult, with plenty of room for error. Instead of attempting to learn from sparse rewards or a hand-crafted reward function, the agent in imitation learning (IL) is provided with a collection of demonstrations and attempts to learn the best policy by imitating the expert's decisions. The main component of IL is the environment, a Markov Decision Process (MDP) with an unknown reward function r(s, a). The agent takes actions in this environment according to its policy π(a|s). To recover a better policy, the agent also uses the expert's demonstrations (trajectories) τ = (s_0, a_0, s_1, a_1, . . . ), where the actions follow the expert's optimal policy π*(a|s). Imitation learning, or learning from expert trajectories, can be implemented in the following ways.
Behavioral Cloning: Behavioral cloning (BC) [1] is the most basic imitation learning technique: it learns the expert's policy directly through supervised learning. The agent is provided with expert demonstrations as state-action pairs, which are assumed to be independent and identically distributed (i.i.d.) samples so that standard supervised learning methods can be applied. Behavioral cloning can be extremely effective in some applications, but it is quite problematic in the vast majority of cases. The primary issue is the i.i.d. assumption itself: supervised learning assumes the state-action pairs are independently and identically distributed, whereas the MDP breaks this assumption because the action taken in a given state determines the following state. This also implies that errors made by the agent in earlier states accumulate in subsequent states. As a result, a slip made by the agent can easily place it in a state that the expert never visited and on which the agent was therefore never trained. In such states, the behavior of the agent is undefined and can lead to fatal failures.
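To make the supervised-learning view concrete, here is a minimal behavioral-cloning sketch; the network architecture, the tensor names expert_states / expert_actions, and the MSE loss (for continuous actions) are illustrative assumptions rather than details from [1].

```python
# Minimal behavioral-cloning sketch: fit a policy to expert (state, action) pairs
# by plain supervised regression (illustrative hyperparameters).
import torch
import torch.nn as nn

def behavioral_cloning(expert_states, expert_actions, epochs=100, lr=1e-3):
    state_dim, action_dim = expert_states.shape[1], expert_actions.shape[1]
    policy = nn.Sequential(
        nn.Linear(state_dim, 64), nn.Tanh(),
        nn.Linear(64, 64), nn.Tanh(),
        nn.Linear(64, action_dim),
    )
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    loss_fn = nn.MSELoss()  # MSE for continuous actions; use cross-entropy if discrete

    for _ in range(epochs):
        # The expert pairs are treated as i.i.d. supervised data.
        loss = loss_fn(policy(expert_states), expert_actions)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return policy
```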
Direct Policy Learning: Direct policy learning (DPL) [2] improves on behavioral cloning by assuming that the algorithm has access to an interactive expert during training. As in behavioral cloning, the agent starts from a set of expert demonstrations and uses supervised learning to fit a policy π(a|s). The agent then rolls out this policy in the environment, and the expert is consulted to evaluate the roll-out trajectory. This gives the agent additional training data, which is fed back into the supervised learning step, and the loop is repeated until convergence; a sketch of this loop is given below.
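In this sketch, env, expert_action, and fit_supervised (e.g. the behavioral-cloning routine above) are assumed helpers, and the aggregation scheme is a generic illustration rather than the exact procedure of [2].

```python
# Direct policy learning sketch: repeatedly roll out the learner and query the
# interactive expert on the visited states, then retrain on the aggregated data.
def direct_policy_learning(env, expert_action, fit_supervised,
                           n_iterations=10, horizon=200):
    states, actions = [], []

    def rollout(actor):
        s = env.reset()
        for _ in range(horizon):
            a = actor(s)
            # Always record the *expert's* decision for the visited state.
            states.append(s)
            actions.append(expert_action(s))
            s, _, done, _ = env.step(a)
            if done:
                s = env.reset()

    rollout(expert_action)                        # initial expert demonstrations
    policy = fit_supervised(states, actions)      # supervised learning step

    for _ in range(n_iterations):
        rollout(policy)                           # roll out the current learner policy
        policy = fit_supervised(states, actions)  # retrain on the aggregated dataset
    return policy
```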
Inverse Reinforcement Learning: This is the approach discussed further in this blog post. Inverse reinforcement learning (IRL) [3] attempts to estimate the reward function of the environment from the expert's demonstrations and then uses reinforcement learning to recover the optimal policy. In effect, it learns a cost function that prioritizes some trajectories over others. Due to their bi-level optimization structure, many existing IRL methods must solve a series of computationally expensive reinforcement learning problems and therefore frequently fail to scale to large, high-dimensional environments. Knowing the reward function is, in some ways, more robust than knowing the expert policy: if part of the environment changes, behavioral cloning will produce sub-optimal behavior, whereas the reward function still contains enough information to adapt the agent's behavior and achieve the task. In the literature, IRL and inverse optimal control (IOC) are often used as synonyms; the main difference is that optimal control is phrased in terms of a cost function c(s, a), while IRL is phrased in terms of a reward function r(s, a).
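Schematically, the bi-level structure looks as follows (a generic sketch of the idea, not the exact objective of any particular IRL method):

```latex
\hat{r} \in \arg\max_{r} \; \mathbb{E}_{\pi^{*}}\!\left[ \sum_{t} \gamma^{t} r(s_t, a_t) \right]
        - \max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t} \gamma^{t} r(s_t, a_t) \right]
\qquad \text{(outer: a reward under which the expert outperforms every policy)}

\hat{\pi} \in \arg\max_{\pi} \; \mathbb{E}_{\pi}\!\left[ \sum_{t} \gamma^{t} \hat{r}(s_t, a_t) \right]
\qquad \text{(inner: reinforcement learning on the recovered reward)}
```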
1. Generative Adversarial Imitation Learning (GAIL)
Generative Adversarial Imitation Learning (GAIL) [4] is a model-free imitation learning algorithm in the inverse reinforcement learning family that obtains significant performance gains over existing model-free methods in imitating expert trajectories in large, high-dimensional environments. The algorithm is a combination of imitation learning and generative adversarial networks (GAN) [5]. GAIL characterizes the policy obtained by running reinforcement learning on a cost function c ∈ C = R^(S×A) learned by maximum causal entropy IRL.
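Written out as in [4], this objective is (Equation 1):

```latex
\max_{c \in \mathcal{C}} \left( \min_{\pi \in \Pi} -H(\pi) + \mathbb{E}_{\pi}[c(s, a)] \right) - \mathbb{E}_{\pi_E}[c(s, a)]
```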
where H(π) ≜ E_π[−log π(a|s)] is the γ-discounted causal entropy [6] of the policy π, c(s, a) is the cost of a state-action pair, and Π is the set of all stationary stochastic policies. Maximum causal entropy IRL looks for a cost function c that assigns low cost to the expert policy and high cost to other policies, thereby allowing the expert policy to be found via a certain reinforcement learning procedure (RL: cost → policy).
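This procedure is defined in [4] as (Equation 2):

```latex
\text{RL}(c) = \arg\min_{\pi \in \Pi} \; -H(\pi) + \mathbb{E}_{\pi}[c(s, a)]
```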
This RL procedure maps a cost function to high-entropy policies that minimize the expected cumulative cost. Because C is such a high-dimensional space, IRL can easily overfit when provided with a finite dataset. For this reason, a convex cost function regularizer ψ: R^(S×A) → R̄ is used, where R̄ denotes the extended real numbers R ∪ {∞}. Equation 3 below is the regularized IRL objective, which finds a cost function such that the expert performs better than all other policies, using the cost regularization term ψ (IRL: demonstrations → cost).
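As stated in [4], Equation 3 reads:

```latex
\text{IRL}_{\psi}(\pi_E) = \arg\max_{c \in \mathbb{R}^{S \times A}} \; -\psi(c)
  + \left( \min_{\pi \in \Pi} -H(\pi) + \mathbb{E}_{\pi}[c(s, a)] \right)
  - \mathbb{E}_{\pi_E}[c(s, a)]
```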
Overall, the agent tries to learn the policy given by RL(c̃), i.e., the policy obtained by running reinforcement learning on the output of IRL. The learned policy is thus characterized by RL(IRL_ψ(π_E)) = RL ∘ IRL_ψ(π_E).
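The central characterization in [4] (Equation 4) is that this composition reduces to occupancy measure matching, regularized by the convex conjugate ψ*:

```latex
\text{RL} \circ \text{IRL}_{\psi}(\pi_E) = \arg\min_{\pi \in \Pi} \; -H(\pi) + \psi^{*}(\rho_{\pi} - \rho_{\pi_E})
```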
where ρ_π(s, a) = π(a|s) Σ_{t=0}^{∞} γ^t P(s_t = s | π) is the occupancy measure of a policy (the distribution of state-action pairs encountered while navigating the environment with policy π), and ψ* is the convex conjugate of ψ. In other words, ψ-regularized inverse reinforcement learning implicitly seeks a policy whose occupancy measure is close to the expert's. If there were no cost regularization at all, the recovered policy would exactly match the expert occupancy measure. However, the expert trajectory distribution is provided only as a finite set of samples, so in large environments most of the expert occupancy measure values will be zero. Exact occupancy measure matching would then force the learned policy to never visit these unseen state-action pairs, simply due to lack of data.
With a linear cost function regularizer ψ restricted to some chosen features of the states (e.g., apprenticeship learning [3]), the goal becomes finding a policy that performs better than the expert, min_π max_{c ∈ C} E_π[c(s, a)] − E_{π_E}[c(s, a)], which leads to an imitation learning algorithm that exactly matches occupancy measures. However, a linear cost function regularizer is unmanageable in large environments and needs careful tuning. For this reason, GAIL uses the regularizer below, which places a low penalty on cost functions c that assign some amount of negative cost to expert state-action pairs.
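This is the generative adversarial (GA) regularizer of [4] (Equation 5):

```latex
\psi_{GA}(c) = \mathbb{E}_{\pi_E}\left[ g(c(s, a)) \right],
\qquad
g(x) =
\begin{cases}
-x - \log(1 - e^{x}) & \text{if } x < 0 \\
+\infty & \text{otherwise}
\end{cases}
```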
An interesting property of the GA regularizer is that it is an average over expert data and can therefore adjust to arbitrary expert datasets, whereas the linear apprenticeship regularizer is always fixed and cannot adapt to the data. The choice of ψ_GA draws the connection between imitation learning and generative adversarial networks, which train a generative model G by having a discriminator D: S × A → (0, 1) distinguish between the distribution of data generated by G and the true data distribution. When D cannot distinguish data generated by G from the true data, G has successfully matched the true data distribution.
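Taking the convex conjugate of ψ_GA yields, as shown in [4], exactly the optimal GAN discriminator objective (Equation 6):

```latex
\psi_{GA}^{*}(\rho_{\pi} - \rho_{\pi_E}) = \max_{D \in (0,1)^{S \times A}} \;
  \mathbb{E}_{\pi}\left[ \log D(s, a) \right] + \mathbb{E}_{\pi_E}\left[ \log\left(1 - D(s, a)\right) \right]
```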
Equation 6 is the optimal negative log loss of the binary classification problem of distinguishing between state-action pairs of π and π_E. In this setting, the learner's occupancy measure is analogous to the data distribution generated by G, and the expert's occupancy measure is analogous to the true data distribution. The discriminator can be interpreted as a local cost function providing learning signals to the policy. It turns out that this optimal loss is, up to a constant shift, the Jensen-Shannon divergence D_JS(ρ_π, ρ_{π_E}) = D_KL(ρ_π || (ρ_π + ρ_{π_E})/2) + D_KL(ρ_{π_E} || (ρ_π + ρ_{π_E})/2), which is a squared metric between distributions. Treating the causal entropy H as a policy regularizer, controlled by λ > 0, we obtain a new imitation learning algorithm (Equation 7).
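Written out as in [4]:

```latex
\min_{\pi} \; \psi_{GA}^{*}(\rho_{\pi} - \rho_{\pi_E}) - \lambda H(\pi)
= \min_{\pi} \max_{D \in (0,1)^{S \times A}} \;
  \mathbb{E}_{\pi}\left[ \log D(s, a) \right] + \mathbb{E}_{\pi_E}\left[ \log\left(1 - D(s, a)\right) \right] - \lambda H(\pi)
```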
which minimizes a true metric between occupancy measures. Unlike linear apprenticeship learning algorithms, it can imitate expert policies exactly. GAIL relies crucially on the TRPO policy step to decrease (Equation 7) with respect to π(a|s). TRPO takes a natural gradient step, constrained so that π_(θ_i+1)(a|s) does not stray too far from π_(θ_i)(a|s), as measured by the KL divergence between the two policies averaged over the states in the sampled trajectories. This carefully constructed step scheme prevents divergence caused by the high noise in the gradient estimate.
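To make the training procedure concrete, below is a minimal sketch of the adversarial loop for a discrete-action, old-style Gym environment. It is an illustration, not the reference implementation of [4]: the networks, optimizers, the tensor expert_sa (assumed to hold concatenated expert state-action pairs), and the plain REINFORCE-style policy update standing in for the TRPO step are all assumptions made for brevity.

```python
# Minimal GAIL-style training step (illustrative sketch; a REINFORCE-style update
# replaces the TRPO step of [4], and a discrete-action, old-style Gym API is assumed).
import torch
import torch.nn as nn

def gail_step(env, policy, discriminator, expert_sa, pi_opt, d_opt,
              horizon=1000, gamma=0.99, lam=1e-3):
    # 1. Roll out the current policy, storing states, actions, and log-probabilities.
    states, actions, logps = [], [], []
    s = torch.as_tensor(env.reset(), dtype=torch.float32)
    for _ in range(horizon):
        dist = torch.distributions.Categorical(logits=policy(s))
        a = dist.sample()
        states.append(s); actions.append(a.float().unsqueeze(0)); logps.append(dist.log_prob(a))
        obs, _, done, _ = env.step(a.item())  # the environment reward is never used
        s = torch.as_tensor(env.reset() if done else obs, dtype=torch.float32)

    learner_sa = torch.cat([torch.stack(states), torch.stack(actions)], dim=-1)

    # 2. Discriminator step: push D towards 1 on learner pairs and 0 on expert pairs,
    #    i.e. gradient ascent on E_pi[log D] + E_piE[log(1 - D)] (Equation 6).
    bce = nn.BCEWithLogitsLoss()
    d_loss = bce(discriminator(learner_sa), torch.ones(len(learner_sa), 1)) + \
             bce(discriminator(expert_sa), torch.zeros(len(expert_sa), 1))
    d_opt.zero_grad(); d_loss.backward(); d_opt.step()

    # 3. Policy step: minimize the expected discounted surrogate cost c(s,a) = log D(s,a)
    #    minus the lam-weighted entropy bonus (Equation 7), here with a plain
    #    policy-gradient update instead of TRPO.
    with torch.no_grad():
        costs = torch.log(torch.sigmoid(discriminator(learner_sa)).squeeze(-1) + 1e-8)
        returns, running = torch.zeros_like(costs), 0.0
        for t in reversed(range(len(costs))):
            running = costs[t] + gamma * running
            returns[t] = running
    logps = torch.stack(logps)
    pi_loss = (logps * returns).mean() - lam * (-logps).mean()
    pi_opt.zero_grad(); pi_loss.backward(); pi_opt.step()
```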
IRL was traditionally defined as the act of finding a cost function under which the expert policy is optimal. However, it can also be viewed as a procedure that tries to induce a policy matching the expert's occupancy measure (a generative model).
2. Sample-Efficient Imitation Learning via Generative Adversarial Nets (SAM)
Sample-Efficient Imitation Learning via Generative Adversarial Nets (SAM) [7] is a model-free imitation learning algorithm that combines deterministic policy gradients with an adversarial reward learning procedure, achieving a significant increase in sample efficiency. Sample-efficient here means limiting the number of agent-environment interactions. Interactions are reduced because the agent keeps experiential data in memory to perform policy updates and therefore requires considerably fewer new samples (interactions) per iteration, as it can re-exploit the stored transitions. In addition, it does not rely on any preprocessing to obtain these gains in sample efficiency. SAM builds on the DDPG and GAN algorithms, which bring the sample efficiency of off-policy actor-critic methods and leverage gradient information. GANs and actor-critic architectures can both be framed as bilevel optimization problems, each involving two competing components: in one problem the policy is trained against the reward, while in the other the policy is trained against the critic.
SAM is composed of three interconnected learning networks: a reward network (parameters φ), a policy network (parameters θ), and a critic network (parameters ψ). The reward network, with parameters φ, operates as the discriminator, and its training objective is a cross-entropy loss.
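Schematically, and assuming the standard GAN convention of [5] with expert pairs as the "real" class (the exact sign conventions and the weighting of the gradient penalty follow [7]), the loss looks like:

```latex
\ell(\phi) = -\,\mathbb{E}_{(s,a) \sim \pi_\theta}\!\left[ \log\left(1 - D_{\phi}(s, a)\right) \right]
             -\,\mathbb{E}_{(s,a) \sim \pi_E}\!\left[ \log D_{\phi}(s, a) \right]
             + R_{GP}(\phi)
```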
where R_(GP)(φ) is a gradient penalty that improves the stability of the discriminator gradient and D_φ is the discriminator. The reward network is trained on each iteration, first on the most recently collected mini-batch from π_θ(a|s), then on mini-batches sampled from the replay buffer D. The critic, in turn, is trained by minimizing a loss L(ψ) that relies only on samples from the replay buffer D. This loss incorporates three components: a 1-step Bellman residual L_1(ψ), an n-step Bellman residual L_n(ψ), and a weight-decay regularizer R_WD(ψ).
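One consistent reading of this decomposition (the exact combination is given in [7]; the placement of v here is an assumption based on the description that follows) is:

```latex
L(\psi) = L_{1}(\psi) + L_{n}(\psi) + v \, R_{WD}(\psi)
```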
where v ∈ [0, 1] is a hyperparameter that determines how much weight decay is used. Note that in SAM the decay regularizer is applied only to the critic, not to the policy, since regularizing the policy would make it appear in two different optimization problems. The 1-step and n-step look-ahead residuals are defined as follows.
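In standard DDPG / n-step form (with target networks ψ′ and θ′ and the learned surrogate reward r_φ; the exact notation in [7] may differ slightly), they are:

```latex
L_{1}(\psi) = \mathbb{E}_{s_t \sim \rho^{\beta},\, a_t \sim \beta}\!\left[ \left( Q_{\psi}(s_t, a_t) - y_t^{(1)} \right)^{2} \right],
\qquad
y_t^{(1)} = r_{\phi}(s_t, a_t) + \gamma\, Q_{\psi'}\!\left(s_{t+1}, \mu_{\theta'}(s_{t+1})\right)

L_{n}(\psi) = \mathbb{E}_{s_t \sim \rho^{\beta},\, a_t \sim \beta}\!\left[ \left( Q_{\psi}(s_t, a_t) - y_t^{(n)} \right)^{2} \right],
\qquad
y_t^{(n)} = \sum_{k=0}^{n-1} \gamma^{k}\, r_{\phi}(s_{t+k}, a_{t+k}) + \gamma^{n}\, Q_{\psi'}\!\left(s_{t+n}, \mu_{\theta'}(s_{t+n})\right)
```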
where E_{s_t ∼ ρ^β, a_t ∼ β}[·] indicates that transitions have been sampled from the replay buffer D, collected by the off-policy behavior policy β. Applying multi-step targets with a suitably tuned n ∈ N often leads to faster learning and better stability, as seen in Rainbow (link). The policy μ_θ is then updated by a gradient ascent step.
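In the standard deterministic policy gradient form (again written with assumed notation rather than copied from [7]):

```latex
\nabla_{\theta} J(\theta) \approx \mathbb{E}_{s \sim \rho^{\beta}}\!\left[ \nabla_{a} Q_{\psi}(s, a) \big|_{a = \mu_{\theta}(s)} \, \nabla_{\theta}\, \mu_{\theta}(s) \right]
```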
This policy gradient points toward regions of the parameter space whose roll-outs have high similarity with the demonstrator trajectories. Exploration is treated as in DDPG, using adaptive parameter perturbation noise or noise sampled from an Ornstein-Uhlenbeck process; a small sketch of the latter is given below.
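For completeness, here is a minimal Ornstein-Uhlenbeck noise process of the kind commonly paired with DDPG-style agents; the theta, sigma, and dt values are typical defaults, not parameters taken from [7].

```python
# Ornstein-Uhlenbeck exploration noise (illustrative defaults, not taken from [7]).
import numpy as np

class OUNoise:
    def __init__(self, action_dim, mu=0.0, theta=0.15, sigma=0.2, dt=1e-2):
        self.mu, self.theta, self.sigma, self.dt = mu, theta, sigma, dt
        self.x = np.full(action_dim, mu, dtype=np.float64)

    def reset(self):
        # Restart the process at its long-run mean at the beginning of each episode.
        self.x[:] = self.mu

    def sample(self):
        # dx = theta * (mu - x) * dt + sigma * sqrt(dt) * N(0, I)
        dx = self.theta * (self.mu - self.x) * self.dt \
             + self.sigma * np.sqrt(self.dt) * np.random.randn(*self.x.shape)
        self.x = self.x + dx
        return self.x.copy()

# Usage: act with mu_theta(s) + noise.sample() during exploration.
```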
References
[1] Claude Sammut. "Behavioral Cloning". In: Encyclopedia of Machine Learning. Ed. by Claude Sammut and Geoffrey I. Webb. Boston, MA: Springer US, 2010, pp. 93-97. isbn: 978-0-387-30164-8. doi: 10.1007/978-0-387-30164-8_69. url: https://doi.org/10.1007/978-0-387-30164-8_69.
[2] Jessica Chemali and Alessandro Lazaric. “Direct policy iteration with demonstrations”. In: Twenty-Fourth International Joint Conference on Artificial Intelligence. 2015.
[3] Pieter Abbeel and Andrew Y Ng. “Apprenticeship learning via inverse reinforcement learning”. In: Proceedings of the twenty-first international conference on Machine learning. 2004, p. 1.
[4] Jonathan Ho and Stefano Ermon. “Generative adversarial imitation learning”. In: arXiv preprint arXiv:1606.03476 (2016).
[5] Ian J Goodfellow et al. “Generative adversarial networks”. In: arXiv preprint arXiv:1406.2661 (2014).
[6] Michael Bloem and Nicholas Bambos. “Infinite time horizon maximum causal entropy inverse reinforcement learning”. In: 53rd IEEE Conference on Decision and Control. IEEE. 2014, pp. 4911–4916.
[7] Lionel Blondé and Alexandros Kalousis. “Sample-efficient imitation learning via generative adversarial nets”. In: The 22nd International Conference on Artificial Intelligence and Statistics. PMLR. 2019, pp. 3138–3148.