Remark | Supervised Learning | Unsupervised Learning |
---|---|---|
Data | iid | iid |
Labels | Available | None |
Generalize to | New samples | New samples |
Data is in the form of $\left(\underbrace{x}_{\text{Input}}, \underbrace{y}_{\text{Output}} \right)$, i.e. (sample, target/label)
Protocol summary $$ \text{Input} \longrightarrow \text{MODEL} \longrightarrow \text{Output} $$ or $$ y = f(x) $$
Independent and identically distributed samples
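For concreteness, here is a minimal sketch of this protocol: learn $f$ from iid (sample, label) pairs and predict on a new sample. The synthetic data and the least-squares line are illustrative assumptions, not tied to any particular library.

```python
# Minimal supervised-learning sketch: fit y = f(x) from (sample, label) pairs.
# The synthetic data and the linear model are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=100)                   # iid input samples
y = 3.0 * x + 0.5 + rng.normal(0, 0.1, size=100)  # labels y = f(x) + noise

# Learn f as a line: solve least squares for (slope, intercept).
A = np.stack([x, np.ones_like(x)], axis=1)
(slope, intercept), *_ = np.linalg.lstsq(A, y, rcond=None)

# Generalize to a new sample.
print("prediction at x=0.7:", slope * 0.7 + intercept)
```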
Reinforcement learning (RL) is a paradigm in machine learning that is rooted in interaction, sequences and delayed consequences.
In contrast to supervised and unsupervised learning, RL involves building systems that continually interact with their environments during the learning process.
Image from RL book http://incompleteideas.net/book/the-book.html
State and action sets can be discrete or continuous. For example, board positions in a game are discrete, while a robot arm's joint angles are continuous.
After interacting for a while, the agent has a record of the returns ($G$), i.e. the total rewards collected. The goal/objective of the agent is to maximize the return.
NOTE: Reward is sometimes modelled as a cost to be minimized, e.g. in robotics
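To make the interaction loop concrete, here is a minimal sketch with a hypothetical five-state corridor environment and a random policy; the dynamics, the reward of 1 at the goal and the policy are all illustrative assumptions.

```python
# Agent-environment loop sketch: interact until the episode ends, recording the
# return G. The corridor environment and random policy are assumptions.
import random

def env_step(state, action):
    """Toy dynamics: action moves the state by +/-1; reward 1 only at the goal (state 4)."""
    next_state = max(0, min(4, state + action))
    reward = 1.0 if next_state == 4 else 0.0
    done = next_state == 4
    return next_state, reward, done

state, G, done = 0, 0.0, False
while not done:
    action = random.choice([-1, +1])        # the agent picks an action
    state, reward, done = env_step(state, action)
    G += reward                             # accumulate the return G
print("return collected this episode:", G)
```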
Remark | Supervised Learning | Unsupervised Learning | Reinforcement Learning |
---|---|---|---|
Data | iid | iid | Correlated |
Labels | Available | None | Delayed, sparse |
Generalize to | New samples | New samples | New experience |
RL data is in the form of (state, action, new_state, reward), i.e. $\left(\underbrace{S_t, A_t}_{\text{Inputs}}, \underbrace{S_{t+1}, R_t}_{\text{Outputs}} \right)$
Delayed labels/rewards $\longrightarrow$ consequence of an action now may only be realized later
Protocol summary $$ S_{t+1} = f_1 (S_t, A_t), \qquad R_t = f_2 (S_t, A_t) $$
Markov dependency between transitions: the next state and reward depend only on the current state and action.
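A small sketch of this protocol, with explicit $f_1$ (transitions) and $f_2$ (rewards) for a hypothetical corridor task; note both depend only on $(S_t, A_t)$, which is the Markov property. The dynamics and rewards are illustrative assumptions.

```python
# Explicit f1 and f2 for a toy corridor MDP; data arrives as correlated
# (state, action, new_state, reward) tuples. Dynamics/rewards are assumptions.
def f1(state, action):            # S_{t+1} = f1(S_t, A_t)
    return max(0, min(4, state + action))

def f2(state, action):            # R_t = f2(S_t, A_t)
    return 1.0 if f1(state, action) == 4 else 0.0

state = 0
for action in [+1, +1, -1, +1, +1, +1]:
    new_state, reward = f1(state, action), f2(state, action)
    print((state, action, new_state, reward))
    state = new_state
```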
Episodic tasks: Interaction is easily broken down into segments (subsequences). For example, play a game until a win/loss. This leads to the notion of terminal states; the environment resets at the end of the episode.
Continual tasks: No simple way of breaking the interaction into segments. For example, process control or a robot butler's lifetime.
In either case, we can collect a trajectory of the agent's interaction as the sequence of states and actions $\xi = \left( (S_t, A_t), \ldots, (S_{t+k}, A_{t+k}) \right)$
Recall that the return is given by the sum of rewards over a trajectory $\xi_k$: $$G_{0:T} = R_0 + R_1 + \ldots + R_T = \sum_{(S_i, A_i) \in \xi_k} f_2(S_i, A_i) $$
Discounting: Future rewards are worth less than current rewards. This is done by multiplying rewards by a discount factor $\gamma \in [0, 1]$ (a small numeric sketch follows the remarks below) $$ G_{0:T} = \gamma^0 R_0 + \gamma^1 R_1 + \ldots + \gamma^T R_T = \sum_{(S_i, A_i) \in \xi_k} \gamma^i f_2(S_i, A_i) $$
Interpretations/remarks
Discounted rewards are well suited for continual tasks, while undiscounted ones suit episodic tasks.
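As referenced above, a short computation of the discounted return for a recorded reward sequence; the reward values and $\gamma = 0.9$ are assumptions.

```python
# Discounted return G = sum_i gamma^i * R_i over a recorded reward sequence.
def discounted_return(rewards, gamma=0.9):
    return sum(gamma ** i * r for i, r in enumerate(rewards))

rewards = [0.0, 0.0, 1.0, 0.0, 5.0]
print(discounted_return(rewards, gamma=1.0))  # undiscounted (episodic) return: 6.0
print(discounted_return(rewards, gamma=0.9))  # discounted: later rewards count less
```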
The agent's choice of actions while interacting with the environment is provided by a policy $(\pi)$, i.e. a prescription of what to do when encountering each state.
The policy can be deterministic or stochastic.
In small environments, this can be a simple lookup table for each state. For large environments, we typically use some parametric function to represent the policy.
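A sketch of the two representations just mentioned: a lookup table for a small discrete environment, and a parametric (softmax over linear scores) stochastic policy for larger ones. The states, actions and weights are illustrative assumptions.

```python
# Tabular vs parametric policy sketch; the weights and features are assumptions.
import numpy as np

# Tabular, deterministic policy: one prescribed action per state.
table_policy = {0: "left", 1: "right", 2: "right", 3: "left"}

# Parametric, stochastic policy: action probabilities from linear scores.
weights = np.array([[0.2, -0.1],
                    [0.5,  0.3]])                # shape (n_actions, n_features)

def parametric_policy(state_features):
    scores = weights @ state_features
    probs = np.exp(scores - scores.max())        # softmax over actions
    return probs / probs.sum()

print(table_policy[2])                           # "right"
print(parametric_policy(np.array([1.0, 0.5])))   # [p(action 0), p(action 1)]
```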
The overarching goal of the agent is to maximize the expected return.
To do this, the agent needs to use a good policy. The best such policy is called the optimal policy.
The value of a state is the expected return, starting at that state, and using a given policy.
The policy is used to collect the returns, which are then averaged to get the expectation.
Defined formally with the state value function $$ v_\pi(s) = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \vert S_t = s \right]$$
Additionally, the action-value function (Q-function) is often used and is defined as $$ q_\pi(s, a) = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \vert S_t = s, A_t = a \right]$$
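One way to make these definitions concrete is a Monte Carlo sketch: roll out the policy $\pi$ from state $s$ many times, compute each discounted return, and average. The corridor dynamics, the random policy and $\gamma$ are illustrative assumptions.

```python
# Monte Carlo estimate of v_pi(s): average discounted returns of rollouts from s.
import random

def f1(state, action):                    # toy corridor transitions, as before
    return max(0, min(4, state + action))

def f2(state, action):                    # reward 1 on reaching the goal state 4
    return 1.0 if f1(state, action) == 4 else 0.0

def rollout_return(start, gamma=0.9, max_steps=50):
    state, G, discount = start, 0.0, 1.0
    for _ in range(max_steps):
        action = random.choice([-1, +1])  # pi: uniformly random policy
        G += discount * f2(state, action)
        discount *= gamma
        state = f1(state, action)
        if state == 4:                    # terminal state ends the episode
            break
    return G

v_estimate = sum(rollout_return(2) for _ in range(1000)) / 1000
print("estimated v_pi(s=2):", v_estimate)
```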
To find the optimal policy, we need to compare any given pair of policies.
A policy $\pi_1$ is better than $\pi_2$ if it generates better expected returns for all states in the environment. The optimal policy is one that is better than or equal to all other policies possible for the environment. There are corresponding optimal value and action-value functions.
Solving for the optimal policy (and optionally the value function) when given a model.
In RL we can learn a number of things:
Model: A function for generating transitions and rewards, i.e. $f_1$ and $f_2$. For example, physics laws and a hand-crafted reward mapping.
Solution methods largely fall into the taxonomy above.
Image from OpenAI Spinning Up: https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html
Application of RL requires certain characteristics. However, satisfying these characteristics does not mean RL is the only viable approach.
Also, one can sometimes modify the formulation of a problem to fit RL, or to move from RL to other paradigms.
We want to maximize energy output of a solar panel. Solar panel output is determined by a number of factors, including:
Total irradiance is then the combination of these, factoring in the angle of incidence of the sun's rays (see the sketch below) $$ R_t = \theta_d R_d + \theta_f R_f + \theta_r R_r $$
Image source: https://www.redarc.com.au
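As referenced above, a small numeric sketch of the irradiance combination; the coefficient values, component magnitudes and the $\theta_f$ naming are illustrative assumptions, not measured data.

```python
# Total irradiance R_t as a weighted sum of direct, diffuse and reflected parts.
# All numeric values below are illustrative assumptions.
def total_irradiance(R_direct, R_diffuse, R_reflected,
                     theta_d=0.9, theta_f=0.7, theta_r=0.2):
    return theta_d * R_direct + theta_f * R_diffuse + theta_r * R_reflected

print(total_irradiance(R_direct=800.0, R_diffuse=120.0, R_reflected=40.0), "W/m^2")
```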
Source: Dave Abel: https://github.com/david-abel/solar_panels_rl
Action set: {tilt north, tilt south, tilt west, tilt east, do nothing}. Pick some fixed tilt value, e.g. $2^\circ$
State: (latitude, longitude, sun's position (azimuth angle, altitude angle), clouds)
Reward: Energy generated, taking into account the energy used to move the panel
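A rough sketch of this formulation, simplified to a single (north/south) tilt axis: tilt actions change the panel angle, and the reward is energy generated minus the energy cost of moving. The sun model, energy numbers and tilt step are all illustrative assumptions.

```python
# Simplified solar-panel environment sketch (one tilt axis); all numbers assumed.
import math

TILT_STEP = 2.0                       # degrees per action (the fixed tilt value)
ACTIONS = {"tilt_north": +TILT_STEP, "tilt_south": -TILT_STEP, "do_nothing": 0.0}

def step(panel_tilt, sun_altitude, action):
    new_tilt = panel_tilt + ACTIONS[action]
    # Energy roughly proportional to the cosine of the panel-sun misalignment.
    energy = 100.0 * max(0.0, math.cos(math.radians(new_tilt - sun_altitude)))
    move_cost = 1.0 if action != "do_nothing" else 0.0
    return new_tilt, energy - move_cost   # (new state component, reward)

tilt = 0.0
for hour, sun_alt in enumerate([20.0, 45.0, 70.0, 45.0, 20.0]):
    tilt, r = step(tilt, sun_alt, "tilt_north")
    print(f"hour {hour}: tilt={tilt:.0f} deg, reward={r:.1f}")
```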
Q-learning intuition:
Recall the key pieces from before (value, policy, reward, etc). Q-learning is part of a family of algorithms of the form
$$ \text{New Estimate} = \text{Old Estimate} + \text{StepSize} (\text{Target} - \text{Old Estimate}) $$
The target is the expected return of a state.
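A minimal sketch of the tabular Q-learning update in exactly this form, where the target is $R + \gamma \max_{a'} Q(S', a')$. The toy corridor environment, purely random exploration and the hyperparameters are assumptions.

```python
# Tabular Q-learning: new = old + step_size * (target - old).
from collections import defaultdict
import random

alpha, gamma, n_actions = 0.1, 0.9, 2
Q = defaultdict(lambda: [0.0] * n_actions)

def q_update(state, action, reward, next_state):
    target = reward + gamma * max(Q[next_state])              # bootstrapped target
    Q[state][action] += alpha * (target - Q[state][action])   # move toward target

def f1(state, action):                    # corridor dynamics: action 1 = right
    return max(0, min(4, state + (1 if action == 1 else -1)))

def f2(state, action):
    return 1.0 if f1(state, action) == 4 else 0.0

for episode in range(200):
    state = 0
    while state != 4:                     # state 4 is terminal
        action = random.randrange(n_actions)                  # random exploration
        next_state, reward = f1(state, action), f2(state, action)
        q_update(state, action, reward, next_state)
        state = next_state

print({s: [round(q, 2) for q in Q[s]] for s in range(5)})
```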
Addis Ababa: latitude, longitude (9.033572, 38.763707), time zone Africa/Addis_Ababa
Panel size: $1\text{m} \times 1\text{m}$. Run for a few days with simulated clouds and states.
Once you have a task you want to solve with RL, here is a recommended recipe:
`gym` from OpenAI, `rllab`, `simple_rl` (GitHub is your friend), `pytorch` and `tensorflow` all have implementations of basic RL algorithms; a minimal usage sketch is shown below.
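A minimal usage sketch, assuming the classic `gym` API; newer `gymnasium` releases return extra values from `reset()` and `step()`, so adjust accordingly.

```python
# Run one episode of CartPole with a random policy (classic gym API assumed).
import gym

env = gym.make("CartPole-v1")
obs = env.reset()
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()           # placeholder random policy
    obs, reward, done, info = env.step(action)   # classic 4-tuple step API
    total_reward += reward
print("episode return:", total_reward)
env.close()
```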
Controls concept | maps to | RL concept |
---|---|---|
Direct control | $\longrightarrow$ | Model free RL |
Indirect control | $\longrightarrow$ | Model based RL |
Indirect methods use system identification (aka model learning) to build a model, on which control laws (aka policies) are then developed (see the sketch below).
Also related to tracking (a reference trajectory is given).
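A rough sketch of the indirect route: system identification (fit a dynamics model from logged transitions by least squares), then a control law built on the learned model that tracks a reference. The scalar linear system, the data and the one-step controller are all illustrative assumptions.

```python
# Indirect control sketch: identify s' = a*s + b*u from data, then derive a
# one-step tracking control law on the learned model. Everything here is assumed.
import numpy as np

rng = np.random.default_rng(0)

# Logged transitions from an unknown (here, secretly linear) scalar system.
a_true, b_true = 0.8, 0.5
states = rng.normal(size=200)
actions = rng.normal(size=200)
next_states = a_true * states + b_true * actions + rng.normal(0, 0.01, size=200)

# System identification (model learning): least-squares fit of a and b.
X = np.stack([states, actions], axis=1)
(a_hat, b_hat), *_ = np.linalg.lstsq(X, next_states, rcond=None)

# Control law (policy) on the learned model: choose u so that the predicted
# next state equals the reference, i.e. solve a_hat*s + b_hat*u = s_ref.
def policy(state, s_ref=0.0):
    return (s_ref - a_hat * state) / b_hat

print("identified model:", round(a_hat, 3), round(b_hat, 3))
print("action from state 1.0:", round(policy(1.0), 3))
```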