(mini) Intro to Reinforcement Learning

Data Science Africa (DSA) 2019 Summer School, Addis Ababa

Billy Okal

Voyage Auto

Recap: Supervised and Unsupervised Learning

| Remark | Supervised Learning | Unsupervised Learning |
| --- | --- | --- |
| Data | iid | iid |
| Labels | Available | None |
| Generalize to | New samples | New samples |

Data is in the form of $\left(\underbrace{x}_{\text{Input}}, \underbrace{y}_{\text{Output}} \right)$, i.e. (sample, target/label)

Protocol summary $$ \text{Input} \longrightarrow \text{MODEL} \longrightarrow \text{Output} $$ or $$ y = f(x) $$

Independent and identically distributed samples


Reinforcement learning (RL) is a paradigm in machine learning that is rooted in interaction, sequences, and delayed consequences.

In contrast to supervised and unsupervised learning, RL involves building systems that continually interact with their environments during the learning process.


Image from RL book http://incompleteideas.net/book/the-book.html

  • Environment: The world that the agent lives in and interacts with.
  • Agent: The main character. Contrast with the 'model' in SL and UL.
  • State, $(s)$, Observation: A full/complete description of a system/world vs. a partial description of the same system. The collection of all the states in the environment is the state set $\mathcal{S} = \{s_1, \ldots, s_n\}$.
  • Action, $(a)$: What the agent can do. The set of all actions in an environment is the action set $\mathcal{A} = \{a_1, \ldots, a_m\}$.

State and action sets can be discrete or continuous. For example, a board game has a finite, discrete set of moves, while a robot arm's joint angles form continuous state and action sets.

  • Transition, $(f_1(\ldots))$: The underlying dynamics of the environment, e.g. the laws of physics. Can be deterministic or stochastic. This is the first environment response/feedback to an agent's interaction.
  • Reward, $(r = f_2(\ldots))$: The second environment response/feedback. It is a measure of the consequence of the agent's single interaction, i.e. how good or bad the state (and action) is. Can be deterministic or stochastic (both are sketched right after this list).
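
The following is a minimal sketch of how $f_1$ and $f_2$ produce the environment's feedback to an agent's actions. The 1-D walk environment here is made up for illustration, not an example from these slides.

```python
# A toy 1-D walk: the agent starts at position 0 and tries to reach position 4.
import random

STATES = list(range(5))          # state set S = {0, 1, 2, 3, 4}
ACTIONS = [-1, +1]               # action set A = {step left, step right}

def f1(s, a):
    """Transition: deterministic dynamics that keep the agent inside the walk."""
    return min(max(s + a, 0), len(STATES) - 1)

def f2(s, a):
    """Reward: +1 only when the action reaches the rightmost state."""
    return 1.0 if f1(s, a) == len(STATES) - 1 else 0.0

s = 0
for t in range(6):               # a few steps of interaction with a random agent
    a = random.choice(ACTIONS)
    s_next, r = f1(s, a), f2(s, a)
    print(f"t={t}  s={s}  a={a:+d}  s'={s_next}  r={r}")
    s = s_next
```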

After interacting for a while, the agent has a record of the return ($G$), i.e. the total reward collected. The goal/objective of the agent is to maximize the return.

NOTE: Reward is sometimes modelled as a cost, e.g. in robotics

Reinforcement Learning vs SL vs UL

| Remark | Supervised Learning | Unsupervised Learning | Reinforcement Learning |
| --- | --- | --- | --- |
| Data | iid | iid | Correlated |
| Labels | Available | None | Delayed, sparse |
| Generalize to | New samples | New samples | New experience |

RL data is in the form of (state, action, new_state, reward), $\left(\underbrace{S_t, A_t}_{\text{Inputs}}, \underbrace{S_{t+1}, R_t}_{\text{Outputs}} \right)$

Delayed labels/rewards $\longrightarrow$ consequence of an action now may only be realized later

Protocol summary $$ S_{t+1} = f_1 (S_t, A_t), \qquad R_t = f_2 (S_t, A_t) $$

Markov dependency between transitions.

Further RL Concepts

Episodic tasks: Interaction is easily broken into segments/subsequences, for example playing a game until a win/loss. This leads to the notion of terminal states. The environment resets at the end of the episode.

Continual tasks: No simple way of breaking the interaction into segments, for example process control or a robot butler's lifetime.

In either case, we can collect a trajectory of the agent's interaction as the sequence of states and actions $\xi = (S_t, A_t), \ldots, (S_{t+k}, A_{t+k})$.

Recall, the return is given by the sum of rewards over a trajectory $\xi_k$: $$G_{0:T} = R_0 + R_1 + \ldots + R_T = \sum_{(S_i, A_i) \in \xi_k} f_2(S_i, A_i) $$

Discounting: Future rewards are worth less than current rewards. Done by multiplying rewards by a parameter $\gamma \in [0, 1]$: $$ G_{0:T} = \gamma^0 R_0 + \gamma^1 R_1 + \ldots + \gamma^T R_T = \sum_{(S_i, A_i) \in \xi_k} \gamma^i f_2(S_i, A_i) $$

Interpretations/remarks

  • Inflation. Future rewards are worth less than current rewards.
  • Tackles delayed gratification
  • The agent could die with probability given by $1 - \gamma$

Discounted rewards are well suited for continual tasks, while undiscounted ones suit episodic tasks.
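
As a small illustration, the return can be computed from a list of collected rewards as below; the reward values are hypothetical, and $\gamma = 1$ recovers the undiscounted episodic case.

```python
# Compute the (discounted) return G from a trajectory's rewards.
def discounted_return(rewards, gamma=0.9):
    return sum(gamma ** i * r for i, r in enumerate(rewards))

rewards = [0.0, 0.0, 1.0, 0.0, 2.0]            # hypothetical rewards R_0, ..., R_4
print(discounted_return(rewards, gamma=1.0))   # undiscounted return: 3.0
print(discounted_return(rewards, gamma=0.9))   # discounted return: ~2.12
```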

Policies

The agent's choice of actions while interacting with the environment is provided by a policy $(\pi)$, i.e. a prescription of what to do when encountering each state.

The policy can be deterministic or stochastic.

In small environments, the policy can be a simple lookup table over states. For large environments we typically use some parametric function to represent the policy.
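
For illustration, here is a minimal sketch of two policy representations (the states, actions, and value estimates below are made up): a deterministic lookup table, and a stochastic $\epsilon$-greedy policy derived from action-value estimates (values are covered in the next section).

```python
import random

# Tabular, deterministic policy: state -> action
table_policy = {"s0": "left", "s1": "stay", "s2": "right"}

def epsilon_greedy(q_values, state, actions, epsilon=0.1):
    """Mostly pick the highest-valued action, but explore with probability epsilon."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: q_values.get((state, a), 0.0))

print(table_policy["s1"])
print(epsilon_greedy({("s1", "left"): 0.2, ("s1", "right"): 0.9},
                     state="s1", actions=["left", "stay", "right"]))
```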

The RL Task

The overarching goal of the agent is to maximize the expected return.

To do this, the agent needs to use a good policy. The best such policy is called the optimal policy

Value

The value of a state is the expected return, starting at that state, and using a given policy.

The policy is used to collect the returns, which are then averaged to get the expectation.

Defined formally with the state value function $$ v_\pi(s) = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \vert S_t = s \right]$$

Additionally, the action-value function (Q-function) is often used and is defined as $$ q_\pi(s, a) = \mathbb{E}_\pi\left[ \sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \vert S_t = s, A_t = a \right]$$
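
As a sketch of what this expectation means in practice, $v_\pi(s)$ can be estimated by Monte Carlo: roll out the policy from $s$ many times and average the discounted returns. The `env_step` and `policy` callables below are assumed, not part of the slides.

```python
def mc_state_value(env_step, policy, s0, gamma=0.9, n_episodes=100, horizon=50):
    """Estimate v_pi(s0). env_step(s, a) -> (s_next, reward, done); policy(s) -> a."""
    total = 0.0
    for _ in range(n_episodes):
        s, g, discount = s0, 0.0, 1.0
        for _ in range(horizon):
            a = policy(s)
            s, r, done = env_step(s, a)
            g += discount * r
            discount *= gamma
            if done:
                break
        total += g
    return total / n_episodes      # average return = Monte Carlo estimate of the expectation
```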

RL Task (continued)

To find the optimal policy, we need to compare any given pair of policies.

A policy $\pi_1$ is better than $\pi_2$ if it generates better expected returns for all states in the environment. The optimal policy is the one that is better than or equal to all other possible policies for the environment. There are corresponding optimal value and action-value functions.


Planning

Solving for the optimal policy (and optionally value function) when given a model.

Learning

In RL we can learn a number of things

  • The policy
  • Value and action-value functions
  • Environment models (e.g. transition function)

Solution Methods


Model: A function for generating transitions and rewards, i.e. $f_1$ and $f_2$. For example, physics laws and a hand-crafted reward mapping.

Solution methods largely fall into the taxonomy above.

  • Model-based methods use the models above to find the optimal policy, often requiring fewer samples. A recent example is AlphaZero.
  • Model-free methods directly try to find the optimal policy without using a model, often requiring more samples. A popular example is Q-learning.

Image from OpenAI Spinning Up: https://spinningup.openai.com/en/latest/spinningup/rl_intro2.html

When to use RL?

Applying RL requires the problem to have certain characteristics. However, having these characteristics does not mean RL is the only viable approach.

  • Interaction
  • Sequential
  • Delayed, maybe sparse labels

Also, one can sometimes modify the formulation of a problem to fit RL, or to move from RL to other paradigms.

Examples of RL Applications

  • Industrial/commercial:
    • Elevator dispatch,
    • Autonomous vehicles
    • Robotics (AVs, etc)
    • Recommendations like news feed
    • Traffic light control
    • Reactions in Chemistry
  • Games:
    • TD-Gammon,
    • AlphaGo, AlphaZero,
    • Dota 2 (OpenAI Five)

DSA Example: Solar Panel Optimization

We want to maximize energy output of a solar panel. Solar panel output is determined by a number of factors, including:

  • Direct irradiance $R_d$
  • Diffuse irradiance $R_f$
  • Reflective irradiance $R_r$

Total irradiance is then the combination of these, factoring in the angle of incidence of the sun's rays: $$ R_t = \theta_d R_d + \theta_f R_f + \theta_r R_r $$
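
As a tiny sketch of this formula, the coefficient values below are made up; in practice the $\theta$ weights depend on the panel's tilt and the sun's angle of incidence.

```python
def total_irradiance(R_d, R_f, R_r, theta_d=0.8, theta_f=0.6, theta_r=0.2):
    """Weighted combination of direct, diffuse, and reflective irradiance."""
    return theta_d * R_d + theta_f * R_f + theta_r * R_r

print(total_irradiance(R_d=600.0, R_f=150.0, R_r=50.0))  # W/m^2, made-up numbers
```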

Image source: https://www.redarc.com.au

Formulation details

Actions, transition, reward?

Is this task episodic or continual?

Action set: {tilt north, tilt south, tilt west, tilt east, do nothing} - Pick some fixed tilt step, e.g. $2^\circ$

State: (latitude, longitude, sun's position (azimuth angle, altitude angle), clouds)

Reward: Energy generated, taking into account the energy used to move the panel
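
A minimal sketch of this formulation is below. The names, tilt step, and movement cost are illustrative assumptions, not the actual DSA demo code.

```python
ACTIONS = ["tilt north", "tilt south", "tilt west", "tilt east", "do nothing"]
TILT_STEP_DEG = 2.0      # fixed tilt step from the action set above
MOVE_COST = 0.5          # assumed energy cost (arbitrary units) of moving the panel

def apply_action(tilt_ns, tilt_ew, action):
    """Transition sketch for the panel's own orientation (sun/clouds evolve on their own)."""
    if action == "tilt north":
        tilt_ns += TILT_STEP_DEG
    elif action == "tilt south":
        tilt_ns -= TILT_STEP_DEG
    elif action == "tilt east":
        tilt_ew += TILT_STEP_DEG
    elif action == "tilt west":
        tilt_ew -= TILT_STEP_DEG
    return tilt_ns, tilt_ew

def reward(energy_generated, action):
    """Reward sketch: energy produced minus the (assumed) cost of moving the panel."""
    return energy_generated - (0.0 if action == "do nothing" else MOVE_COST)
```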

Simple Experiments

  • Baseline - Solar tracker
  • RL method: Use Q-learning as a first simple algorithm

Q-learning intuition:

Recall the key pieces from before (value, policy, reward, etc). Q-learning is part of a family of algorithms of the form

$$ \text{New Estimate} = \text{Old Estimate} + \text{StepSize} (\text{Target} - \text{Old Estimate}) $$

The target is an estimate of the return from the current state; in Q-learning it is the immediate reward plus the discounted value of the best action in the next state.
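
A minimal tabular Q-learning sketch of this update is below; the state and action names are made up for illustration.

```python
from collections import defaultdict

ACTIONS = ["tilt north", "tilt south", "tilt west", "tilt east", "do nothing"]
Q = defaultdict(float)   # Q-table; unseen (state, action) pairs default to 0

def q_learning_update(s, a, r, s_next, alpha=0.1, gamma=0.9):
    """One Q-learning step on the (s, a, r, s') experience tuple."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in ACTIONS)
    # New estimate = old estimate + step size * (target - old estimate)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

q_learning_update(s="state_0", a="tilt east", r=1.2, s_next="state_1")
print(Q[("state_0", "tilt east")])   # 0.12 after one update
```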

Preliminary Results

Addis Ababa: lat, long (9.033572, 38.763707), Time zone Africa/Addis_Ababa

Panel size: $1m \times 1m$, Run for a few days with simulated clouds and states


Beyond the Basics

Once you have a task you want to solve with RL, here is a recommended recipe:

  1. Carefully formulate the task in the language of RL. Identify the key pieces: episodic vs continual, states (discrete vs continuous), which models are available, and what the reward function is
  2. Start simple, build intuitions BEFORE investing a lot
  3. Always have a baseline to compare to. Remember sometimes you may not even need machine learning

Further Topics

  • Deep RL: trying to bring some of the progress with DL to RL
  • Bandits (popular in recommendations)
  • Formal framework for RL: MDPs, POMDPs
  • Multi-agent RL -- connections to Game theory and Mechanism Design
  • Imitation Learning
  • Inverse Reinforcement Learning

Get Started Resources

  • Simple environments and tasks
  • Always use visualization to peek into what is going on
  • Open source code
    • gym from OpenAI
    • rllab, simple_rl; GitHub is your friend
    • the pytorch and tensorflow ecosystems have implementations of basic RL algorithms
  • Tutorials, blogs
    • Spinning Up from OpenAI - focus on Deep RL and policy optimization methods
  • Videos
    • Online edu platforms: Coursera, Udacity
    • YouTube, e.g. David Silver's class, Andrew Ng's class

Extra: RL vs Control?

| Controls concept | maps to | RL concept |
| --- | --- | --- |
| Direct control | $\longrightarrow$ | Model-free RL |
| Indirect control | $\longrightarrow$ | Model-based RL |

Indirect methods use system identification (aka model learning) to build a model, on which control laws (aka policies) are then developed.

Also related to tracking (reference trajectory given)
