Markov Decision Processes

17. Markov Decision Processes

Learning outcomes

  • Model a task as a Markov decision process
  • Compute an optimal policy with Dynamic Programming

Reinforcement Learning

  • Training a dog, e.g. to give you the ball:
    • First have the dog sit in front of you, show it your empty hand, and wait. The dog will try all kinds of things, because it has no idea what you want it to do.
    • When it puts its paw out, you give it a reward – to tell it that this was the right thing to do.
    • If you keep rewarding it, the dog's behaviour will be reinforced: it learns what to do to get the treat.
  • Two important phases of reinforcement learning:
    • Exploration – at first the agent does not know what it is supposed to do; only when it gets a reward does it learn what to do.
    • Credit assignment – figuring out what generated the reward.
      • It is difficult to see which earlier action actually produced the reward, but there are algorithms for this.

Building an RL agent

  • How do we go from a problem in the world to an algorithm?
  • Example: an agent (a robot) wants to reach the treasure box, and there are things in the environment it needs to avoid.
  • Reaching the goal gives a positive reward, the things to avoid give a negative reward, etc.
  • How do we formalize this as a machine learning problem?

Modelling

  • (figure: the agent–environment interaction loop)
  • We are building the agent, and it interacts with the environment.
  • All the agent can do is choose an action (from a set of available actions); the environment changes as a consequence of that action, and the agent observes this change.
  • The agent observes a state and a reward, and based on the current state it chooses its next action. A minimal sketch of this loop is shown below.
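
A rough Python sketch of this interaction loop (the Environment class and its reset/step methods are a made-up toy, not a specific library API):

    import random

    class Environment:
        """Made-up toy environment: the agent starts in state 0 and must reach state 3."""

        def reset(self):
            self.state = 0
            return self.state

        def step(self, action):
            # The environment changes as a consequence of the action...
            self.state = max(0, min(3, self.state + action))
            # ...and the agent observes the new state, a reward, and whether the episode ended.
            reward = 1.0 if self.state == 3 else 0.0
            done = self.state == 3
            return self.state, reward, done

    env = Environment()
    state = env.reset()
    done = False
    while not done:
        action = random.choice([-1, +1])        # the agent chooses an action (here: at random)
        state, reward, done = env.step(action)  # the environment reacts; the agent observes state and reward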

A mathematical model of the world

  • (figures: the grid-world example and the components of the model, listed below)

  • States – the possible situations the agent can encounter; the state only needs to include what is necessary for the agent to make its decision.
  • Actions – what the agent can do.
  • Transitions – how the environment changes as a consequence of the agent's action.
    • This is described in terms of which next state you end up in if you take a given action in a given state.
  • Reward – you get to choose it; if you attach a negative reward to something, the agent will learn to avoid it.

More formally

  • (slide: the formal definition of an MDP; see below)
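  • In standard notation, matching the components above, an MDP is a tuple $(S, A, T, R, \gamma)$ where:
    • $S$ – the set of states
    • $A$ – the set of actions
    • $T(s' \mid s, a) = \Pr(s_{t+1} = s' \mid s_t = s, a_t = a)$ – the transition function
    • $R(s, a, s')$ – the reward function
    • $\gamma \in [0, 1]$ – the discount factor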

Example: Tic Tac Toe

  • If we wanted to play tic-tac-toe, what would the variables be? You need to know about:
    • S – states – the board configurations (rotations and reflections make many of them equivalent). The state representation is very important: there is one state per decision situation, and a tabular agent has to learn each state separately, so the more states, the more learning.
      • Nowadays we also use CNNs, so the state becomes a set of features; you can come up with more complex feature representations.
    • A – actions – you can put your mark in any empty square, so at most 9 actions.
    • T – transition function – the transition encodes the opponent's behaviour: the next state depends on what they do.
      • The environment is often unknown.
    • Reward – only at the end of the game can you tell how well you did (see the sketch after this list).
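
A rough sketch of how these pieces could look in code (this representation is an illustrative choice, not something fixed by the lecture):

    # Made-up representation: the board is a tuple of 9 cells, each 'X', 'O' or '' (empty).
    # One such board configuration is a state.
    EMPTY_BOARD = ('',) * 9

    def legal_actions(board):
        """Actions: put your mark in any empty square, so at most 9 of them."""
        return [i for i, cell in enumerate(board) if cell == '']

    def place(board, action, player):
        """The part of the transition we control: place the mark.
        The full transition would also include the opponent's (unknown) reply."""
        cells = list(board)
        cells[action] = player
        return tuple(cells)

    def reward(board):
        """The reward only arrives at the end of the game: +1 if X won, -1 if O won, 0 otherwise."""
        lines = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
                 (0, 3, 6), (1, 4, 7), (2, 5, 8),
                 (0, 4, 8), (2, 4, 6)]
        for a, b, c in lines:
            if board[a] != '' and board[a] == board[b] == board[c]:
                return 1.0 if board[a] == 'X' else -1.0
        return 0.0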

The Markov property

  • It is called Markov because we assume that the current state is enough to predict the next state and the reward.
    • The current state representation contains enough information to tell what to expect next.
  • A process is Markovian iff the state contains enough information to predict the next state and the reward (written out below).
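  • Written out, in the notation above: $\Pr(s_{t+1}, r_{t+1} \mid s_t, a_t) = \Pr(s_{t+1}, r_{t+1} \mid s_t, a_t, s_{t-1}, a_{t-1}, \dots, s_0, a_0)$ – the full history adds nothing beyond the current state and action.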

Goals

  • We are NOT maximizing the immediate reward – we are trying to find the best behaviour in the long run, which is the return!
    • $G_t = r_{t+1} + r_{t+2} + \dots + r_T$
    • We may accept a negative reward in the short term.
  • T is the maximum number of actions you take (and may be unknown).
  • The discounted return: $G_t = r_{t+1} + \gamma\, r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k \ge 0} \gamma^{k}\, r_{t+k+1}$
  • γ – the discount factor – lets you discount a reward for the fact that it is far away in the future.
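  • For example, each extra step of delay multiplies a reward's contribution by γ: with $\gamma = 0.9$, a reward of +1 is worth $0.9^3 = 0.729$ after three extra steps of delay, and only $0.9^{20} \approx 0.12$ after twenty.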

Behaviours

  • A behaviour is represented by a policy: a function that, for each state, tells us which action the agent will take in that state.
    • Deterministic policy – if the agent is in the same state twice, it will take the same action.
      • $a_t = \pi(s_t)$
    • Stochastic policy – in a given state, each action is taken with some probability.
      • $\pi(a \mid s) = \Pr(a_t = a \mid s_t = s)$
  • When learning, you don't want to take the same action every time, because you also want to explore different options – the agent will never discover better actions if its behaviour is fully deterministic.
  • Important! The exploration–exploitation trade-off: should I exploit the knowledge I already have, or explore something new? You need to keep exploring if you want to end up optimal.
  • At the beginning you need some kind of randomness, so you have to explore; one common way to do this is shown below.
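
One common way to get this randomness is an ε-greedy policy; a minimal sketch (the q_values table here is a stand-in for whatever value estimates the agent keeps):

    import random

    def epsilon_greedy(q_values, state, actions, epsilon=0.1):
        """With probability epsilon explore (pick a random action);
        otherwise exploit the action that currently looks best."""
        if random.random() < epsilon:
            return random.choice(actions)                                 # explore
        return max(actions, key=lambda a: q_values.get((state, a), 0.0))  # exploit

In practice epsilon is often decayed over time, so the agent explores a lot at the beginning and mostly exploits later.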

Value

  • We defined the goal: to maximize the long-term reward, the return.
  • This is the thing we are really maximizing.
  • $G_t = r_{t+1} + \gamma\,\big(r_{t+2} + \gamma\, r_{t+3} + \dots\big) = r_{t+1} + \gamma\, G_{t+1}$
    • If you factor out γ, what is left in the parentheses is the same quantity we started with, one step later.
    • Return = a recursive definition of itself = immediate reward + γ · (return from the next state).
  • $V^\pi(s) = \mathbb{E}_\pi\!\left[\, G_t \mid s_t = s \,\right]$
    • The value of a state under a certain policy = the expected return if we start from that state and follow that policy.
  • The value is therefore also recursive (see below).
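  • Written out, the recursion is $V^\pi(s) = \mathbb{E}_\pi\!\left[\, r_{t+1} + \gamma\, V^\pi(s_{t+1}) \mid s_t = s \,\right]$: the expected immediate reward plus the discounted value of the next state.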

Understanding the value

  • (slide: the value written as the expected immediate reward plus γ times the value of the next state)
  • γ appears as the discount factor because the next state's value lies in the future.

  • (slide: the expectation now also runs over the environment's transition probabilities)
  • The environment is also probabilistic.

  • (slide: the expectation runs over both the chosen action and the resulting next state)
  • You first choose the action, THEN the environment chooses the next state.
  • Using the joint probability of an action and a next state, you weight the value of each possible next state; you do this for every state (full equation below).
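  • Putting these steps together gives the standard Bellman expectation equation, in the $S, A, T, R$ notation from above:
    • $V^\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} T(s' \mid s, a)\,\big[\, R(s, a, s') + \gamma\, V^\pi(s') \,\big]$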

Dynamic programming

  • In most of ML we don't model anything and just learn everything from experience; dynamic programming instead assumes we are given a model of the environment.
  • (slide: the Bellman equation turned into an update rule)
    • If we have these probabilities, which tell us how the environment is going to behave, then we can turn the equation that defines the value – the Bellman equation – into an update rule.
    • The action value Q(s, a) is the corresponding sum over next states.
  • Optimal policy – in every state, always choose the action with the best value.
  • Update rule:
    • Apply it repeatedly until convergence (the updates go to zero); then we have the (action) value function. A sketch of this procedure follows the list.
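
A minimal value-iteration sketch along these lines (the two-state MDP at the end is a made-up toy, not the lecture's grid world):

    # Value iteration: turn the Bellman equation into an update rule
    # and apply it until the values stop changing.
    def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
        """T[s][a] is a list of (prob, next_state) pairs, R[s][a] is the reward."""
        V = {s: 0.0 for s in states}               # the initialization does not matter
        while True:
            delta = 0.0
            for s in states:
                q = {a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in T[s][a])
                     for a in actions(s)}
                best = max(q.values()) if q else 0.0
                delta = max(delta, abs(best - V[s]))
                V[s] = best
            if delta < tol:                        # convergence: the fixed point is reached
                break
        # Optimal policy: in every state, pick the action with the best value.
        policy = {s: max(actions(s),
                         key=lambda a: R[s][a] + gamma * sum(p * V[s2] for p, s2 in T[s][a]))
                  for s in states if actions(s)}
        return V, policy

    # Toy example: state 0 can move to the goal state 1; the goal is terminal.
    states = [0, 1]
    actions = lambda s: ['go'] if s == 0 else []
    T = {0: {'go': [(1.0, 1)]}, 1: {}}
    R = {0: {'go': 1.0}, 1: {}}
    print(value_iteration(states, actions, T, R))   # V is approximately {0: 1.0, 1: 0.0}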

Bellman Equation

  • In the simulator the values are initialized (to zero) – the initialization doesn't matter anyway.
  • After the first update each value is the reward of 1 plus γ times the next state's value, which is still zero.
  • After the next update the value is 2, because it is the immediate reward plus the next state's value, which is now 1.
  • So over the sweeps the values grow 1, 2, 3, … because each state sees one step further into the future at every sweep.
  • When the equation has reached its fixed point, this is the optimal value function.
    • The optimal policy, however, is not unique.
  • The top-left corner keeps changing at every step for a while, because the updated information about states, actions and rewards takes longer to propagate that far.
