Chapter 3: Reinforcement learning
Reinforcement learning, often abbreviated RL, is a subfield of machine learning that studies how intelligent agents should act in an environment in order to maximize a notion of cumulative reward. Reinforcement learning is one of the three basic paradigms of machine learning, alongside supervised learning and unsupervised learning.
The key difference is that reinforcement learning, unlike supervised learning, does not require labelled input/output pairs to be presented, nor does it require sub-optimal actions to be explicitly corrected. Instead, the emphasis is on striking a balance between exploration (of uncharted territory) and exploitation (of current knowledge).
Because many reinforcement learning algorithms designed for this setting use dynamic programming techniques, the environment is typically stated in the form of a Markov decision process (MDP). The primary distinction between classical dynamic programming methods and reinforcement learning algorithms is that the latter do not assume prior knowledge of an exact mathematical model of the MDP, and they target large-scale MDPs for which exact methods become infeasible.
Because of its generality, reinforcement learning is studied across a wide variety of fields, including game theory, control theory, operations research, information theory, simulation-based optimization, multi-agent systems, swarm intelligence, and statistics. In the operations research and control literature, reinforcement learning is referred to as neuro-dynamic programming or approximate dynamic programming. The problems of interest in reinforcement learning have also been studied in the theory of optimal control, which is concerned primarily with the existence and characterization of optimal solutions and with algorithms for their exact computation, and less with learning or approximation, particularly in the absence of a mathematical model of the environment. In economics and game theory, reinforcement learning may be used to explain how equilibrium can emerge under bounded rationality.
Basic reinforcement learning is modeled as a Markov decision process (MDP), which consists of:
- a set of environment and agent states, $\mathcal{S}$;
- a set of actions of the agent, $\mathcal{A}$;
- $P_a(s, s') = \Pr(S_{t+1} = s' \mid S_t = s, A_t = a)$, the probability of transition (at time $t$) from state $s$ to state $s'$ under action $a$;
- $R_a(s, s')$, the immediate reward received after the transition from $s$ to $s'$ under action $a$.
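To make these four components concrete, the following is a minimal sketch of how a small finite MDP could be written down in Python; the two-state, two-action example and all of its numbers are invented purely for illustration.

```python
# Minimal sketch of a finite MDP stored in plain Python dictionaries.
# The states, actions, probabilities, and rewards below are made up
# for illustration; they do not come from the text.

states = ["s0", "s1"]        # S: the set of states
actions = ["a0", "a1"]       # A: the set of actions

# P[a][s][s2] = probability of moving from state s to state s2 under action a
P = {
    "a0": {"s0": {"s0": 0.7, "s1": 0.3}, "s1": {"s0": 0.4, "s1": 0.6}},
    "a1": {"s0": {"s0": 0.1, "s1": 0.9}, "s1": {"s0": 0.8, "s1": 0.2}},
}

# R[a][s][s2] = immediate reward received after the transition s -> s2 under a
R = {
    "a0": {"s0": {"s0": 0.0, "s1": 1.0}, "s1": {"s0": 0.0, "s1": 2.0}},
    "a1": {"s0": {"s0": 0.0, "s1": 5.0}, "s1": {"s0": -1.0, "s1": 0.0}},
}

# Each row of P must sum to one, since it is a probability distribution over s2.
assert all(abs(sum(P[a][s].values()) - 1.0) < 1e-9 for a in actions for s in states)
```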
The goal of reinforcement learning is for the agent to learn an optimal, or nearly optimal, policy that maximizes the cumulative reward built up from the immediate rewards (the "reward function", or another user-provided reinforcement signal). This is analogous to mechanisms that appear to occur in animal psychology. For instance, biological brains are hardwired to interpret signals such as pain and hunger as negative reinforcements, and to interpret pleasure and the consumption of food as positive reinforcements. Under certain conditions, animals can learn to engage in behaviors that maximize these rewards; that is, animals are capable of learning through reinforcement.
A basic reinforcement learning agent interacts with its environment in discrete time steps.
At each time $t$, the agent receives the current state $S_t$ and reward $R_t$. It then chooses an action $A_t$ from the set of available actions, which is subsequently sent to the environment. The environment moves to a new state $S_{t+1}$, and the reward $R_{t+1}$ associated with the transition $(S_t, A_t, S_{t+1})$ is determined. The goal of a reinforcement learning agent is to learn a policy
$$\pi : \mathcal{S} \times \mathcal{A} \to [0, 1], \qquad \pi(s, a) = \Pr(A_t = a \mid S_t = s),$$
which maximizes the expected cumulative reward.
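As an illustration of this interaction loop, here is a minimal Python sketch; the `env` object with `reset()`/`step()` methods and the `policy` function are hypothetical placeholders, not a specific library's API.

```python
# Minimal sketch of the agent-environment loop: at each step the agent
# observes the state, picks an action from its policy, and receives the
# next state and reward from the environment.

def run_episode(env, policy, max_steps=1000):
    """Run one episode and return the list of rewards that were collected."""
    state = env.reset()                      # initial state S_0 (assumed interface)
    rewards = []
    for t in range(max_steps):
        action = policy(state)               # choose A_t according to the policy
        next_state, reward, done = env.step(action)   # environment returns S_{t+1}, R_{t+1}
        rewards.append(reward)
        state = next_state
        if done:                             # episode ended (e.g. a terminal state was reached)
            break
    return rewards
```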
When the problem is stated as an MDP, it is assumed that the agent directly observes the current state of the environment; in this case, the problem is said to have full observability. If the agent only has access to a subset of the states, or if the observed states are corrupted by noise, the agent is said to have partial observability, and the problem must be formalized as a partially observable Markov decision process. In either case, the set of actions available to the agent may be restricted. For instance, the state of an account balance could be restricted to be positive; if the current value of the state is 3 and a state transition attempts to reduce the value by 4, the transition is not permitted.
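A minimal sketch of such an action restriction, using the account-balance example above (the particular action set of deposits and withdrawals is a hypothetical illustration):

```python
# Only actions that keep the balance non-negative are made available.
ALL_ACTIONS = [-4, -1, +1, +4]   # hypothetical changes applied to the balance

def available_actions(balance):
    """Return the subset of actions that keep the balance non-negative."""
    return [a for a in ALL_ACTIONS if balance + a >= 0]

print(available_actions(3))   # [-1, 1, 4]: lowering the balance by 4 is not permitted
```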
The notion of regret arises when the performance of the agent is compared to that of an agent that acts optimally. In order to act near-optimally, the agent must reason about the long-term consequences of its actions (i.e., maximize future rewards), even though the immediate reward associated with this may be negative.
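As a rough formalization (assuming, for simplicity, a fixed horizon of $T$ steps), regret can be written as the gap between the expected reward collected by an optimal agent and the expected reward collected by the learning agent:
$$\mathrm{Regret}(T) = \mathbb{E}\!\left[\sum_{t=1}^{T} r_t^{*}\right] - \mathbb{E}\!\left[\sum_{t=1}^{T} r_t\right],$$
where $r_t^{*}$ and $r_t$ denote the rewards received at step $t$ by the optimal agent and by the learning agent, respectively; the precise definition varies with the setting.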
Reinforcement learning is therefore particularly well suited to problems that involve a trade-off between short-term and long-term rewards. It has been applied successfully to a wide variety of problems, including robot control and Go (AlphaGo).
Two factors make reinforcement learning powerful: the use of samples to optimize performance, and the use of function approximation to handle large environments. Thanks to these two key components, reinforcement learning can be used in large environments in the following situations:
- A model of the environment is known, but an analytic solution is not available;
- Only a simulation model of the environment is given (the subject of simulation-based optimization);
- The only way to collect information about the environment is to interact with it.
The first two of these problems can be regarded as planning problems (since some form of model is available), whereas the last one is a genuine learning problem. Reinforcement learning, however, converts both planning problems into machine learning problems.
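To illustrate both ingredients at once (learning from samples and function approximation), the following is a minimal sketch of TD(0) policy evaluation with a linear value-function approximation; `env`, `policy`, and the feature map `phi` are hypothetical placeholders that a concrete application would have to supply.

```python
import numpy as np

def td0_linear(env, policy, phi, num_features, episodes=100, alpha=0.05, gamma=0.99):
    """Estimate the value function of `policy` from sampled interaction,
    approximating V(s) by the linear form phi(s) . w."""
    w = np.zeros(num_features)
    for _ in range(episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)        # assumed interface
            # Bootstrapped TD target: immediate reward plus discounted estimate of the next state.
            target = reward + (0.0 if done else gamma * (phi(next_state) @ w))
            td_error = target - phi(state) @ w
            w += alpha * td_error * phi(state)                  # gradient-style weight update
            state = next_state
    return w
```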
The trade-off between exploration and exploitation has been studied in great depth through the multi-armed bandit problem and, for finite state space MDPs, in Burnetas and Katehakis (1997).
Reinforcement learning requires clever exploration mechanisms; selecting actions at random, without reference to an estimated probability distribution, leads to poor performance. The case of (small) finite MDPs is relatively well understood. However, owing to the lack of algorithms that scale well with the number of states (or that scale to problems with infinite state spaces), simple exploration methods remain the most practical.
One such method is $\varepsilon$-greedy, where $0 < \varepsilon < 1$ is a parameter controlling the amount of exploration versus exploitation. With probability $1 - \varepsilon$, exploitation is chosen, and the agent selects the action that it believes has the best long-term effect (ties between actions are broken uniformly at random). Alternatively, with probability $\varepsilon$, exploration is chosen, and the action is selected uniformly at random. $\varepsilon$ is usually a fixed parameter, but it can be adjusted either according to a schedule (making the agent explore progressively less) or adaptively based on heuristics.
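A minimal sketch of $\varepsilon$-greedy action selection, assuming the agent keeps an array `q_values` of estimated long-term values for the actions available in the current state:

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Pick an action index: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        # Exploration: choose uniformly at random among all actions.
        return int(rng.integers(len(q_values)))
    # Exploitation: choose a highest-valued action, breaking ties uniformly at random.
    best = np.flatnonzero(q_values == np.max(q_values))
    return int(rng.choice(best))

rng = np.random.default_rng(0)
# With epsilon = 0.1, the greedy action is chosen roughly 90% of the time.
action = epsilon_greedy(np.array([0.2, 0.5, 0.5]), epsilon=0.1, rng=rng)
```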
Even if the issue of exploration is disregarded, and even if the state is observable (which is assumed from here on), the problem remains of using past experience to determine which actions lead to higher cumulative rewards.
The agent's action selection is modeled as a map called the policy:
$$\pi : \mathcal{S} \times \mathcal{A} \to [0, 1], \qquad \pi(s, a) = \Pr(A_t = a \mid S_t = s).$$
The policy map gives the probability of taking action $a$ when in state $s$. There are also deterministic policies.
The value function $V_\pi(s)$ is defined as the expected return starting from state $s$, i.e. $S_0 = s$, and successively following policy $\pi$. Hence, roughly speaking, the value function estimates "how good" it is to be in a given state:
$$V_\pi(s) = \mathbb{E}[G \mid S_0 = s] = \mathbb{E}\!\left[\sum_{t=0}^{\infty} \gamma^{t} R_{t+1} \,\middle|\, S_0 = s\right],$$
where the random variable $G$ denotes the discounted return, defined as the sum of future discounted rewards:
$$G = \sum_{t=0}^{\infty} \gamma^{t} R_{t+1} = R_1 + \gamma R_2 + \gamma^{2} R_3 + \cdots,$$
where $R_{t+1}$ is the reward at step $t$ and $\gamma \in [0, 1)$ is the discount rate.
Since gamma is less than one, events in the distant future are weighted less than events in the immediate future.
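A minimal sketch of computing the discounted return from a finite list of rewards (the infinite sum is truncated at the end of the episode; the reward sequence is a made-up example):

```python
def discounted_return(rewards, gamma=0.9):
    """Compute G = R_1 + gamma*R_2 + gamma^2*R_3 + ... for a finite episode."""
    g = 0.0
    # Iterate backwards so each reward ends up multiplied by the correct power of gamma.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([1.0, 0.0, 2.0], gamma=0.9))   # 1 + 0.9*0 + 0.81*2 = 2.62
```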
The algorithm must find a policy with the maximum expected return. From the theory of MDPs it is known that the search may be restricted to the set of policies...