Markov Decision Processes

17. Markov Decision Processes

Learning outcomes

  • Model a task as a Markov decision process
  • Compute an optimal policy with Dynamic Programming

Reinforcement Learning

  • Training a dog, e.g. to give you the ball:
    • First have the dog sit in front of you, show it your empty hand, and wait. The dog will try all kinds of things, because it has no idea what you want it to do.
    • When it puts its paw out, you give it a reward – to tell it that this was the right thing to do.
    • If you keep rewarding it, the dog's behaviour will be reinforced: it learns what to do to get the treat.
  • Two important phases of reinforcement learning:
    • Exploration – at first the agent does not know what it is supposed to do; only when it gets a reward does it learn what to do.
    • Credit assignment – figuring out what generated the reward.
      • It is difficult to see which earlier action actually produced the reward, but there are algorithms for this.

Building an RL agent

  • How do we go from a problem in the world to an algorithm?
  • Example: an agent (a robot) wants to reach the treasure box, and there are things in the environment it needs to avoid.
  • Reaching the goal gives a positive reward, the things to avoid give a negative reward, etc.
  • How do we formalize this as a machine learning problem?

Modelling

  • (figure: the agent–environment interaction loop)
  • We are building the agent, and it interacts with the environment.
  • All the agent can do is choose an action (from a set of available actions); the environment changes as a consequence of that action, and the agent observes this change.
  • The agent observes a state and a reward, and based on the current state it chooses its next action. A minimal sketch of this loop is shown below.
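
A rough Python sketch of this interaction loop (the Environment class and its reset/step methods are a made-up toy, not a specific library API):

    import random

    class Environment:
        """Made-up toy environment: the agent starts in state 0 and must reach state 3."""

        def reset(self):
            self.state = 0
            return self.state

        def step(self, action):
            # The environment changes as a consequence of the action...
            self.state = max(0, min(3, self.state + action))
            # ...and the agent observes the new state, a reward, and whether the episode ended.
            reward = 1.0 if self.state == 3 else 0.0
            done = self.state == 3
            return self.state, reward, done

    env = Environment()
    state = env.reset()
    done = False
    while not done:
        action = random.choice([-1, +1])        # the agent chooses an action (here: at random)
        state, reward, done = env.step(action)  # the environment reacts; the agent observes state and reward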

A mathematical model of the world

  • (figures: the grid-world example and the components of the model, listed below)

  • States – the possible situations the agent can encounter; the state only needs to include what is necessary for the agent to make its decision.
  • Actions – what the agent can do.
  • Transitions – how the environment changes as a consequence of the agent's action.
    • This is described in terms of which next state you end up in if you take a given action in a given state.
  • Reward – you get to choose it; if you attach a negative reward to something, the agent will learn to avoid it.

More formally

  • (slide: the formal definition of an MDP; see below)
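  • In standard notation, matching the components above, an MDP is a tuple $(S, A, T, R, \gamma)$ where:
    • $S$ – the set of states
    • $A$ – the set of actions
    • $T(s' \mid s, a) = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$ – the transition function
    • $R(s, a, s')$ – the reward function
    • $\gamma \in [0, 1]$ – the discount factor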

Example: Tic Tac Toe

  • If we wanted to play tic-tac-toe, what would the variables be? You need to know about:
    • S – states – the board configurations (rotations and reflections make many of them equivalent). The state representation is very important: there is one state per decision situation, and a tabular agent has to learn each state separately, so the more states, the more learning.
      • Nowadays we also use CNNs, so the state becomes a set of features; you can come up with more complex feature representations.
    • A – actions – you can put your mark in any empty square, so at most 9 actions.
    • T – transition function – the transition encodes the opponent's behaviour: the next state depends on what they do.
      • The environment is often unknown.
    • Reward – only at the end of the game can you tell how well you did (see the sketch after this list).
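
A rough sketch of how these pieces could look in code (this representation is an illustrative choice, not something fixed by the lecture):

    # Made-up representation: the board is a tuple of 9 cells, each 'X', 'O' or '' (empty).
    # One such board configuration is a state.
    EMPTY_BOARD = ('',) * 9

    def legal_actions(board):
        """Actions: put your mark in any empty square, so at most 9 of them."""
        return [i for i, cell in enumerate(board) if cell == '']

    def place(board, action, player):
        """The part of the transition we control: place the mark.
        The full transition would also include the opponent's (unknown) reply."""
        cells = list(board)
        cells[action] = player
        return tuple(cells)

    def reward(board):
        """The reward only arrives at the end of the game: +1 if X won, -1 if O won, 0 otherwise."""
        lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
                 (0, 3, 6), (1, 4, 7), (2, 5, 8),
                 (0, 4, 8), (2, 4, 6)]
        for a, b, c in lines:
            if board[a] != '' and board[a] == board[b] == board[c]:
                return 1.0 if board[a] == 'X' else -1.0
        return 0.0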

The Markov property

  • It is called Markov because we assume that the current state is enough to predict the next state and the reward.
    • The current state representation contains enough information to tell what to expect next.
  • A process is Markovian iff the state contains enough information to predict the next state and the reward (written out below).
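  • Written out, in the notation above: $\Pr(s_{t+1}, r_{t+1} \mid s_t, a_t) = \Pr(s_{t+1}, r_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0)$ – the full history adds nothing beyond the current state and action.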

Goals

  • We are NOT maximizing the immediate reward – we are trying to find the best behaviour in the long run, which is the return!
    • $G_t = r_{t+1} + r_{t+2} + \dots + r_T$
    • We may accept a negative reward in the short term.
  • T is the maximum number of actions you take (and may be unknown).
  • The discounted return: $G_t = r_{t+1} + \gamma\, r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k \ge 0} \gamma^{k}\, r_{t+k+1}$
  • γ – the discount factor – lets you discount a reward for the fact that it is far away in the future.
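  • For example, each extra step of delay multiplies a reward's contribution by γ: with $\gamma = 0.9$, a reward of +1 is worth $0.9^3 = 0.729$ after three extra steps of delay, and only $0.9^{20} \approx 0.12$ after twenty.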

Behaviours

  • A behaviour is represented by a policy: a function that, for each state, tells us which action the agent will take in that state.
    • Deterministic policy – if the agent is in the same state twice, it will take the same action.
      • $a_t = \pi(s_t)$
    • Stochastic policy – in a given state, each action is taken with some probability.
      • $\pi(a \mid s) = \Pr(a_t = a \mid s_t = s)$
  • When learning, you don't want to take the same action every time, because you also want to explore different options – the agent will never discover better actions if its behaviour is fully deterministic.
  • Important! The exploration–exploitation trade-off: should I exploit the knowledge I already have, or explore something new? You need to keep exploring if you want to end up optimal.
  • At the beginning you need some kind of randomness, so you have to explore; one common way to do this is shown below.
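
One common way to get this randomness is an ε-greedy policy; a minimal sketch (the q_values table here is a stand-in for whatever value estimates the agent keeps):

    import random

    def epsilon_greedy(q_values, state, actions, epsilon=0.1):
        """With probability epsilon explore (pick a random action);
        otherwise exploit the action that currently looks best."""
        if random.random() < epsilon:
            return random.choice(actions)                                 # explore
        return max(actions, key=lambda a: q_values.get((state, a), 0.0))  # exploit

In practice epsilon is often decayed over time, so the agent explores a lot at the beginning and mostly exploits later.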

Value

  • We defined the goal: to maximize the long-term reward, the return.
  • This is the thing we are really maximizing.
  • $G_t = r_{t+1} + \gamma\,\big(r_{t+2} + \gamma\, r_{t+3} + \dots\big) = r_{t+1} + \gamma\, G_{t+1}$
    • If you factor out γ, what is left in the parentheses is the same quantity we started with, one step later.
    • Return = a recursive definition of itself = immediate reward + γ · (return from the next state).
  • $V^\pi(s) = \mathbb{E}_\pi\!\left[\, G_t \mid s_t = s \,\right]$
    • The value of a state under a certain policy = the expected return if we start from that state and follow that policy.
  • The value is therefore also recursive (see below).
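  • Written out, the recursion is $V^\pi(s) = \mathbb{E}_\pi\!\left[\, r_{t+1} + \gamma\, V^\pi(s_{t+1}) \mid s_t = s \,\right]$: the expected immediate reward plus the discounted value of the next state.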

Understanding the value

  • (slide: the value written as the expected immediate reward plus γ times the value of the next state)
  • γ appears as the discount factor because the next state's value lies in the future.

  • (slide: the expectation now also runs over the environment's transition probabilities)
  • The environment is also probabilistic.

  • (slide: the expectation runs over both the chosen action and the resulting next state)
  • You first choose the action, THEN the environment chooses the next state.
  • Using the joint probability of an action and a next state, you weight the value of each possible next state; you do this for every state (full equation below).
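  • Putting these steps together gives the standard Bellman expectation equation, in the $S, A, T, R$ notation from above:
    • $V^\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} T(s' \mid s, a)\,\big[\, R(s, a, s') + \gamma\, V^\pi(s') \,\big]$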

Dynamic programming

  • In most of ML we don't model anything and just learn everything from experience; dynamic programming instead assumes we are given a model of the environment.
  • (slide: the Bellman equation turned into an update rule)
    • If we have these probabilities, which tell us how the environment is going to behave, then we can turn the equation that defines the value – the Bellman equation – into an update rule.
    • The action value Q(s, a) is the corresponding sum over next states.
  • Optimal policy – in every state, always choose the action with the best value.
  • Update rule:
    • Apply it repeatedly until convergence (the updates go to zero); then we have the (action) value function. A sketch of this procedure follows the list.
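
A minimal value-iteration sketch along these lines (the two-state MDP at the end is a made-up toy, not the lecture's grid world):

    # Value iteration: turn the Bellman equation into an update rule
    # and apply it until the values stop changing.
    def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
        """T[s][a] is a list of (prob, next_state) pairs, R[s][a] is the reward."""
        V = {s: 0.0 for s in states}               # the initialization does not matter
        while True:
            delta = 0.0
            for s in states:
                q = {a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in T[s][a])
                     for a in actions(s)}
                best = max(q.values()) if q else 0.0
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < tol:                        # convergence: the fixed point is reached
                break
        # Optimal policy: in every state, pick the action with the best value.
        policy = {s: max(actions(s),
                         key=lambda a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in T[s][a]))
                  for s in states if actions(s)}
        return V, policy

    # Toy example: state 0 can move to the goal state 1; the goal is terminal.
    states = [0, 1]
    actions = lambda s: ['go'] if s == 0 else []
    T = {0: {'go': [(1.0, 1)]}, 1: {}}
    R = {0: {'go': 1.0}, 1: {}}
    print(value_iteration(states, actions, T, R))   # V is approximately {0: 1.0, 1: 0.0}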

Bellman Equation

  • In the simulator the values are initialized (to zero) – the initialization doesn't matter anyway.
  • After the first update each value is the reward of 1 plus γ times the next state's value, which is still zero.
  • After the next update the value is 2, because it is the immediate reward plus the next state's value, which is now 1.
  • So over the sweeps the values grow 1, 2, 3, … because each state sees one step further into the future at every sweep.
  • When the equation has reached its fixed point, this is the optimal value function.
    • The optimal policy, however, is not unique.
  • The top-left corner keeps changing at every step for a while, because the updated information about states, actions and rewards takes longer to propagate that far.
