Introduction to Machine Learning (ML) and Reinforcement Learning (RL)
Definition of Machine Learning
Machine learning (ML) [1] is the process of creating mathematical models from data X and labels Y (in the supervised problem setting), defined as

X ∈ R^{n×A},  Y ∈ R^{m×B},

where n, m are the numbers of samples and A, B represent the dimensionality of the data and labels, respectively, in order to help a computer learn without direct human instruction. Machine learning aims to identify patterns within the given data; by applying those patterns to new, unfamiliar data, the algorithms can produce accurate predictions and thus create a trained model h_θ, where θ = {θ_0, …, θ_k} ∈ H, with k ∈ N, is the set of parameters of the hypothesis h_θ, drawn from the space H of possible parameters, mapping x to y, as in

ŷ = h_θ(x),

optimized with respect to an objective/loss function:

θ^∗ = arg min_θ L(h_θ(x), y).

Loss functions quantify the discrepancy between the predictions of the model being trained and the actual problem instances, providing feedback to the model on how it should change the parameters θ for a better mapping of x to y. Data abundance and the computational capacity of modern hardware have allowed such models to produce results of high accuracy.
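As a concrete, purely illustrative instance of these definitions, the sketch below fits a linear hypothesis h_θ(x) = θᵀx to a toy dataset by gradient descent on a mean-squared-error loss; the linear form, the loss, and the learning rate are assumptions made for the example, not prescribed above.

```python
import numpy as np

# Toy supervised dataset: n = 100 samples of dimensionality A = 3, scalar labels (B = 1).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                      # data X in R^{n x A}
true_theta = np.array([1.5, -2.0, 0.5])
Y = X @ true_theta + 0.1 * rng.normal(size=100)    # labels Y

# Linear hypothesis h_theta(x) = theta^T x, optimized by gradient descent on MSE.
theta = np.zeros(3)
lr = 0.1
for _ in range(500):
    predictions = X @ theta                        # h_theta(x) for every sample
    gradient = 2 * X.T @ (predictions - Y) / len(Y)
    theta -= lr * gradient                         # step on the loss L(theta)

print("learned parameters:", theta)                # close to true_theta
```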
Machine Learning and Reinforcement Learning Terms
Supervised versus Unsupervised: The fundamental difference between supervised and unsupervised learning is that supervised learning [2] uses ground-truth labels, or in other words, has prior knowledge of what the output values of the samples should be. The goal of supervised learning is therefore to learn a function that, given a set of input data samples and their desired outputs/labels, approximates the relationship between input and output. In contrast, unsupervised learning [3] uses no label information during training, allowing the algorithm to act on new data without the support of label guidance. Here the task of the algorithm is to group unsorted information according to similarities, patterns, and differences in the data; the goal is thus to infer the natural structure present within a collection of data points. A short code sketch contrasting the two settings follows the lists below. The two main forms of supervised learning algorithms are:
- Regression: mapping an input signal to a continuous output interval.
- Classification: the output belongs to a discrete space that represents a set of predefined outcomes.
On the other hand, the two major forms of unsupervised learning algorithms are:
- Clustering: finding “clusters” of observations in a dataset that are similar to each other.
- Association: finding relationships that occur frequently within a large collection of data points and using them to discover the rules that govern the data.
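A minimal sketch contrasting the two settings, using scikit-learn as an assumed (illustrative) library choice: the classifier is fitted with the ground-truth labels, while the clustering algorithm sees only the unlabeled data.

```python
# Supervised classification versus unsupervised clustering on the same toy data.
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=200, centers=3, random_state=0)

# Supervised: the ground-truth labels y guide the fit.
classifier = LogisticRegression(max_iter=1000).fit(X, y)
print("supervised accuracy:", classifier.score(X, y))

# Unsupervised: only X is seen; points are grouped by similarity.
cluster_ids = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print("first ten cluster assignments:", cluster_ids[:10])
```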
Online versus Offline learning: Many learning systems are categorized as online or offline (or somewhere in between). Online learning algorithms [4] work with data as soon as it becomes available. Strictly online algorithms improve incrementally from each chunk of recent data when it arrives, then discard that data and never use it again. It is not a requirement, but it is commonly desirable for an online algorithm to forget older examples over time, so that it can adapt to nonstationary populations. Stochastic gradient descent with back-propagation, as employed in neural networks, is an example of online learning. On the other hand, offline learning algorithms [5] work with data in bulk, from a dataset. Strictly offline learning algorithms need to be re-run from scratch in order to learn from new or updated data. Support vector machines and random forests are strictly offline algorithms (although online variants of them now exist [6]).
Online algorithms are more general: it is straightforward to construct an offline algorithm from a strictly online one, simply by iterating over a stored dataset, whereas the reverse is not feasible for a strictly offline algorithm. This does not necessarily make online algorithms superior; compromises are often made in terms of sample efficiency, CPU cost, or accuracy when using an online algorithm. Approaches such as mini-batches in neural network training can be viewed as attempts to find common ground between online and offline algorithms.
In addition, experience replay, a common reinforcement learning technique employed in Deep Q-Networks among others, is an in-between approach: although in theory one would need to store all the experience necessary to fully train an agent, in practice a rolling history is stored and sampled from.
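A minimal sketch of such a rolling replay buffer, assuming transitions of the form (state, action, reward, next state, done); the capacity and the uniform sampling are illustrative choices.

```python
import random
from collections import deque

# Rolling replay buffer: recent transitions are kept, the oldest fall out,
# and training batches are drawn uniformly at random.
class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.storage = deque(maxlen=capacity)   # old entries are discarded automatically

    def add(self, state, action, reward, next_state, done):
        self.storage.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.storage, batch_size)

buffer = ReplayBuffer(capacity=1000)
for t in range(50):                              # fake transitions for illustration
    buffer.add(state=t, action=0, reward=1.0, next_state=t + 1, done=False)
print(len(buffer.sample(8)), "transitions sampled")
```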
A Markov Decision Process (MDP) [7] is a discrete-time stochastic control process. It is formalized as a tuple (S, A, ρ0, p, r, γ), where S and A respectively denote the state and action spaces. The dynamics of the MDP are defined by a transition distribution with conditional density p(s_{t+1} | s_t, a_t), along with ρ0 ∈ [0, 1], the density of the distribution from which the initial state is sampled. Finally, γ ∈ [0, 1] denotes the discount factor, and r: S × A → R is the reward function. Although the analysis usually involves an infinite-horizon MDP, the MDPs employed in practice are episodic, with γ = 0 at episode termination.
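As an illustration, the tuple (S, A, ρ0, p, r, γ) can be encoded directly for a small tabular MDP; the two-state, two-action numbers below are invented purely for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tabular MDP (S, A, rho0, p, r, gamma) with |S| = 2 states and |A| = 2 actions.
n_states, n_actions = 2, 2
rho0 = np.array([1.0, 0.0])                     # initial-state distribution
p = np.array([                                   # p[s, a, s'] = p(s' | s, a)
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.1, 0.9]],
])
r = np.array([[1.0, 0.0],                        # r[s, a]
              [0.0, 2.0]])
gamma = 0.95

# Sampling one step of the discrete-time process.
s = rng.choice(n_states, p=rho0)
a = rng.integers(n_actions)
s_next = rng.choice(n_states, p=p[s, a])
print(f"s={s}, a={a}, r={r[s, a]}, s'={s_next}")
```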
Reinforcement learning (RL) [8] is the field of sequential decision-making under uncertainty. This family of algorithms consists of an agent (the decision maker), an environment, typically stated in the form of a Markov decision process (MDP), and a policy that indicates the action the agent should take given its current state. The agent interacts with a previously unknown environment and receives rewards upon interaction according to its policy. The agent's goal is to find a policy that gathers the largest amount of cumulative reward. Reinforcement learning algorithms aim to strike a balance between exploration (of uncharted territory) and exploitation (of current knowledge).
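The interaction described above amounts to a simple loop; the toy corridor environment and the random placeholder policy in the sketch below are assumptions made purely for illustration.

```python
import random

# Toy environment: the agent walks on positions 0..4 and is rewarded at position 4.
def step(state, action):
    next_state = max(0, min(4, state + action))
    reward = 1.0 if next_state == 4 else 0.0
    done = next_state == 4
    return next_state, reward, done

def random_policy(state):
    return random.choice([-1, +1])               # placeholder for a learned policy

gamma, state, ret, discount = 0.9, 0, 0.0, 1.0
for t in range(100):
    action = random_policy(state)                # agent acts according to its policy
    state, reward, done = step(state, action)    # environment responds with r and s'
    ret += discount * reward                     # accumulate discounted reward
    discount *= gamma
    if done:
        break
print("discounted return:", ret)
```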
Imitation Learning (IL) [11] is a subfield of reinforcement learning which addresses the problem of an agent learning to act in an environment in order to imitate the behavior of an expert demonstrator. No direct supervision is provided to the agent; it is never directly told what the optimal action is, nor does it receive a reinforcement signal from the environment upon interaction. Instead, the agent is provided with a set of expert trajectories to use as a guide for its learning process.
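One simple instantiation of this idea is behavioral cloning, which treats the expert trajectories as a supervised dataset; the synthetic expert data and the logistic-regression policy below are illustrative assumptions, not the only way to imitate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic "expert" demonstrations: states in R^2, and the expert picks action 1
# whenever the first state component is positive.
rng = np.random.default_rng(0)
expert_states = rng.normal(size=(500, 2))
expert_actions = (expert_states[:, 0] > 0).astype(int)

# Behavioral cloning: fit a policy to imitate the expert's state -> action mapping.
policy = LogisticRegression().fit(expert_states, expert_actions)

new_state = np.array([[0.3, -1.2]])
print("imitated action:", policy.predict(new_state)[0])
```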
Policy: A policy π: A × S → [0, 1] expresses the sequential decision-making process of the agent. In practice we define a parameterized policy π_θ, for example modeled via a neural network with parameters θ, and denote by π_θ(a_t | s_t) the conditional probability density concentrated at the action a_t ∈ A when the agent is in state s_t ∈ S.
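A minimal sketch of such a parameterized policy, assuming a linear-softmax form for π_θ(a_t | s_t); the parameter values are arbitrary.

```python
import numpy as np

# Parameterized stochastic policy pi_theta(a | s): linear scores per action,
# passed through a softmax so that the outputs form a probability distribution.
def pi_theta(state, theta):
    scores = theta @ state                        # one score per action
    exp_scores = np.exp(scores - scores.max())    # numerically stable softmax
    return exp_scores / exp_scores.sum()

theta = np.array([[0.5, -0.2],                    # parameters, one row per action
                  [-0.3, 0.8]])
state = np.array([1.0, 2.0])
probs = pi_theta(state, theta)
action = np.random.default_rng(0).choice(len(probs), p=probs)
print("pi_theta(a|s) =", probs, "sampled action:", action)
```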
On-policy versus Off-policy: On-policy algorithms work with a single policy, often denoted π, and require all observations (state, action, reward, next state) to be generated using that policy. Off-policy algorithms work with two policies: the target policy and the behavior policy. The policy being learned is called the target policy, usually denoted π, while the policy being followed to generate the observations is called the behavior policy, usually denoted μ: A × S → [0, 1] [12] or β: A × S → [0, 1] [8] in the literature. A common scenario for off-policy learning is to generate observations with an exploratory behavior policy β, while the greedy target policy π picks the best action at every time step. The relationship between the two settings is that off-policy learning is usually a strict generalization of on-policy learning: any off-policy algorithm can be turned into an equivalent on-policy one by setting π = β.
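The distinction shows up clearly in the classic tabular update rules, sketched below: SARSA (on-policy) bootstraps from the action actually taken by the behavior policy, whereas Q-learning (off-policy) bootstraps from the greedy target policy regardless of how the data were generated. The step size and discount values are illustrative.

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    # On-policy: the bootstrap target uses a_next, the action the agent
    # actually took under its (possibly exploratory) policy.
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])

def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # Off-policy: the bootstrap target uses the greedy action in s_next,
    # i.e. the target policy, not necessarily the behavior policy's choice.
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

Q = np.zeros((3, 2))
sarsa_update(Q, s=0, a=1, r=1.0, s_next=1, a_next=0)
q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
print(Q[0, 1])
```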
Model-based versus Model-free: A model-based reinforcement learning algorithm [8], equipped with a model M = (p, r), tries to retrieve the optimal policy π^∗(a_t | s_t) using the transition function p(s_{t+1} | s_t, a_t) and the reward function r(s_t, a_t). The agent might have access only to an approximation of those two functions, which can be learned during its interactions with the environment to improve its estimate of the optimal policy. In other words, the policy defines the behavior of the agent, whereas the transition probability and reward functions describe the dynamics of the environment. Because of the approximations of those two functions, however, the agent might never find the optimal policy.
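When p and r are known (or have been estimated), the agent can plan with them directly; the value-iteration sketch below reuses the small tabular model format from the MDP example above and is one illustrative model-based approach, not the only one.

```python
import numpy as np

# Known (or estimated) model: p[s, a, s'] and r[s, a], as in the MDP example.
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.95

# Value iteration: repeatedly back up through the model, then act greedily.
V = np.zeros(2)
for _ in range(1000):
    Q = r + gamma * p @ V                         # Q[s, a] = r(s,a) + gamma * E[V(s')]
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
policy = Q.argmax(axis=1)                         # greedy policy derived from the model
print("V* ≈", V, "pi* =", policy)
```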
On the contrary, a model-free algorithm [8] tries to retrieve the optimal policy without using or estimating the dynamics (transition and reward functions) of the environment. In practice, it either estimates a value function V_π(s) ∈ R, as in

V_π(s) = E_π[ Σ_{t=0}^{∞} γ^t r(s_t, a_t) | s_0 = s ],

or the policy π(a|s) directly from its interactions with the environment. A value function can be seen as a function that evaluates how good or bad a state is, since it predicts the future reward obtainable from that state. From the optimal value function V^∗(s), an optimal policy π^∗(a_t | s_t) can be derived by acting greedily with respect to the value function.
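A model-free counterpart is TD(0) evaluation, which estimates V_π(s) from sampled transitions alone and never consults p or r; the random-walk environment below is an assumption made for illustration.

```python
import random

# Five-state random walk; only the transition into the right terminal state gives reward.
def step(state):
    next_state = state + random.choice([-1, +1])
    reward = 1.0 if next_state == 5 else 0.0
    done = next_state in (0, 5)
    return next_state, reward, done

# TD(0): update V(s) from sampled transitions only, with no access to p or r.
V = {s: 0.0 for s in range(6)}
alpha, gamma = 0.1, 1.0
for episode in range(5000):
    state, done = 3, False
    while not done:
        next_state, reward, done = step(state)
        target = reward + (0.0 if done else gamma * V[next_state])
        V[state] += alpha * (target - V[state])   # move the estimate toward the TD target
        state = next_state
print({s: round(v, 2) for s, v in V.items()})
```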
Prediction versus Control: The prediction problem in RL [8] is to estimate the value of a particular state or state/action pair, given an environment and a policy. The (optimal) control problem, by contrast, is to discover the best-fitting policy for the given environment. Solving the control problem with value-based methods involves both estimating the value of being in a specific state (i.e. solving the prediction problem) and adjusting the policy to make higher-value choices based on those estimates. This is called generalized policy iteration.
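A compact sketch of generalized policy iteration on the small tabular model used earlier: iterative policy evaluation solves the prediction problem for the current policy, and greedy improvement addresses the control problem. The iteration counts are illustrative.

```python
import numpy as np

p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])
gamma = 0.95
policy = np.zeros(2, dtype=int)                   # start from an arbitrary policy

for _ in range(100):
    # Prediction: evaluate the current policy by iterative backups.
    V = np.zeros(2)
    for _ in range(500):
        V = np.array([r[s, policy[s]] + gamma * p[s, policy[s]] @ V
                      for s in range(2)])
    # Control: improve the policy greedily with respect to its value estimates.
    new_policy = (r + gamma * p @ V).argmax(axis=1)
    if np.array_equal(new_policy, policy):        # stable policy => optimal
        break
    policy = new_policy

print("optimal policy:", policy, "V ≈", V)
```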
References
[1] Tom M. Mitchell. Machine Learning. McGraw-Hill, 1997.
[2] Pádraig Cunningham, Matthieu Cord, and Sarah Jane Delany. “Supervised learning”. In: Machine learning techniques for multimedia. Springer, 2008, pp. 21–49.
[3] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. “Unsupervised learning”. In: The elements of statistical learning. Springer, 2009, pp. 485–585.
[4] David Saad. On-line learning in neural networks. 17. Cambridge University Press, 2009.
[5] Michel Benaim. “The “off line learning approximation” in continuous time neural networks: An adiabatic theorem”. In: Neural Networks 6.5 (1993), pp. 655–665.
[6] Seyda Ertekin, Leon Bottou, and C Lee Giles. “Nonconvex online support vector machines”. In: IEEE Transactions on Pattern Analysis and Machine Intelligence 33.2 (2010), pp. 368–381.
[7] Ronald A. Howard. Dynamic Programming and Markov Processes. 1960.
[8] Richard S Sutton and Andrew G Barto. Reinforcement learning: An introduction. MIT press, 2018.