UniFa Developer Blog

This blog is written by members of the Product Development Division at UniFa Co., Ltd.

Markov Decision Processes

This is Shakil from the Product Engineering Department. This time I wanted to talk a bit about Markov Decision Processes (MDPs), which form the basis of Reinforcement Learning.

What are MDPs (Markov Decision Processes)?

MDPs provide a mathematical toolset for modeling sequential decision-making problems in which the next state depends only on the current state and the action taken from it.

Why are MDPs interesting?

MDPs can be used to model and solve a wide range of problems, from Self-Driving Cars to Recommendation Systems to Complex Video Games like DOTA 2.

OpenAI Five, built by OpenAI, defeated the human world champions of DOTA 2 in 2019 by using MDPs as one of the tools to model and master the game.

So how can we model a problem with MDPs?

In order to be modeled by a Markov Decision Process, a problem must have the following properties.

  • There must be Observable states
  • There must be Actions an Agent/Actor can take to affect these Observed States
  • The Probability of Transitioning to a Future Observable State must depend only on the Current State and the Action taken, not on the full history of earlier states
  • There must be a Reward Distribution to Indicate Goal States or Undesirable States

For example, in DOTA 2 a possible observable state, S_1, could be the enemy having 9 towers remaining. An action, A_1, could be attacking one of those towers, and another state, S_2, would be the enemy having 8 towers remaining. You can transition from S_1 to S_2 by taking A_1, and transitioning to S_2 depends on being in state S_1 to begin with. When transitioning from S_1 to S_2, the acting agent can get a small reward, R_1, for destroying an enemy tower, as doing so has brought it closer to winning the game.

Five Components of MDPs

  • States: What you can observe
  • Actions/Policy: Things you will be able to do in a state
  • Transition Function: Probabilities of transitioning from one state to another
    •  Pr(S_{t+1} | S_t, A_t)
      • where t represents the timestep
  • Rewards: Gold for reaching goals, Poison for falling into pit - R(S_t, A_t)
  • Discount Factor, γ: a value between 0 and 1
    • γ ∈ [0, 1]
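To make these five components concrete, here is a minimal sketch in Python of a toy two-state MDP; the states, actions, probabilities and reward values below are invented purely for illustration.

    # Toy MDP with the five components written out explicitly.
    # All names and numbers here are illustrative assumptions, not from a real problem.

    states = ["healthy", "sick"]   # States: what you can observe
    actions = ["rest", "work"]     # Actions: things you can do in a state

    # Transition Function: Pr(S_{t+1} | S_t, A_t)
    transitions = {
        ("healthy", "work"): {"healthy": 0.7, "sick": 0.3},
        ("healthy", "rest"): {"healthy": 0.95, "sick": 0.05},
        ("sick",    "work"): {"healthy": 0.2, "sick": 0.8},
        ("sick",    "rest"): {"healthy": 0.6, "sick": 0.4},
    }

    # Rewards: R(S_t, A_t)
    rewards = {
        ("healthy", "work"): 10.0,
        ("healthy", "rest"): 0.0,
        ("sick",    "work"): 2.0,
        ("sick",    "rest"): -1.0,
    }

    # Discount Factor: gamma, a value between 0 and 1
    gamma = 0.9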

States

A state is what an actor in an environment observes at a given timestep. The state should be defined such that it contains all the information that affects the transition from the current state to a future one. For example, in the following Grid World environment, a state is one of the boxes in the grid. It can be represented as a coordinate like (0, 0).

f:id:unifa_tech:20220228142539p:plain
Shiba Inu Grid World
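As a small sketch (assuming a 3x3 grid purely for illustration), the state space of such a Grid World can be written down as a list of coordinates:

    # States of a hypothetical 3x3 Grid World, one (row, col) coordinate per box.
    GRID_SIZE = 3
    states = [(row, col) for row in range(GRID_SIZE) for col in range(GRID_SIZE)]

    print(states[0])    # (0, 0) -- the top-left box
    print(len(states))  # 9 states in total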

Policy/Action

A Policy function gives us the probabilities of the possible actions an Actor can take in a given State, S_t.

Formally Defined as:

- \pi(S_t) = Pr(A_t | S_t) - where  \pi is the Policy Function

For example, in the following figure, the Question is a State, and the Options A, B, C, and D are the possible Actions that the Actor can take in this state.

f:id:unifa_tech:20220228142521j:plain
Actions
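As a sketch of the definition above, a stochastic Policy can be represented as a lookup table that maps each state to a probability distribution over the available actions; the state name and probabilities here are placeholders, not taken from the figure.

    import random

    # pi(S_t) = Pr(A_t | S_t): for each state, a distribution over actions.
    # The state key and the probabilities are illustrative assumptions.
    policy = {
        "question_1": {"A": 0.1, "B": 0.6, "C": 0.2, "D": 0.1},
    }

    def sample_action(state):
        """Draw one action according to the Policy's distribution for this state."""
        dist = policy[state]
        actions, probs = zip(*dist.items())
        return random.choices(actions, weights=probs, k=1)[0]

    print(sample_action("question_1"))  # most often "B"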

Rewards

In MDPs, positive rewards are used to push the Actor towards good states, and negative rewards are used to pull the Actor away from bad states. For example, in the following figure, the Actor gets a cookie for good behaviour and a scolding for bad behaviour.

Formally Defined as:

- Reward(t) = R(S_t, S_{t+1})

f:id:unifa_tech:20220228142445j:plain
Reward and Punishment
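In code, a reward function following the definition above can be sketched as a mapping from the transition (S_t, S_{t+1}) to a number; the state names and values are invented for illustration.

    # Reward(t) = R(S_t, S_{t+1}): the reward depends on the transition that happened.
    # State names and reward values are illustrative assumptions.
    def reward(state, next_state):
        if next_state == "good_behaviour":
            return 1.0    # cookie
        if next_state == "bad_behaviour":
            return -1.0   # scolding
        return 0.0        # nothing notable happened

    print(reward("neutral", "good_behaviour"))  # 1.0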

State Transition Probabilities

Often, the problems we model with MDPs are non-deterministic, i.e. taking the same action in the same state can take us to different future states, as illustrated in the example below.

To model this non-determinism that is often seen in real-life scenarios, an MDP has a State Transition Probability Function, which gives us the probability of transitioning to a state, S_2, given our current state, S_1, and current action, A_1.

Formally Defined as:

- Transition Probabilities = Pr(S_{t+1} | S_t, A_t )

f:id:unifa_tech:20220228182856j:plain
Non Determinism In The Environment
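A minimal sketch of such a State Transition Probability Function, and of sampling a next state from it; the states, action and probabilities are made up for illustration.

    import random

    # Pr(S_{t+1} | S_t, A_t): for each (state, action) pair, a distribution over next states.
    transition = {
        ("at_tower", "attack"): {"tower_destroyed": 0.6, "at_tower": 0.3, "pushed_back": 0.1},
    }

    def step(state, action):
        """Sample a next state according to the transition probabilities."""
        dist = transition[(state, action)]
        next_states, probs = zip(*dist.items())
        return random.choices(next_states, weights=probs, k=1)[0]

    # The same action in the same state can lead to different future states:
    print([step("at_tower", "attack") for _ in range(5)])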

Discount Factor

γ ∈ [0, 1] is a discount factor representing the difference in importance between present and future rewards.

The Discount Factor is a value between 0 and 1. It basically acts as a weight which is used to reduce the importance of states and rewards that are received further away from the current state. More about this is discussed in the Aggregate Reward/Return section.
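For instance, assuming γ = 0.9, a reward received k timesteps in the future is weighted by γ^k, so its contribution shrinks the further away it is:

    gamma = 0.9
    for k in range(5):
        print(k, round(gamma ** k, 3))  # 1.0, 0.9, 0.81, 0.729, 0.656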

Aggregate Reward / Return

The Aggregate Reward, or Return, for a state S_t is the sum of all the rewards that an Actor receives from the Environment, starting from state S_t, until the end of an Episode or until a certain timestep.

Rewards that are more timesteps away from the state S_t for which the Return is calculated are weighted lower by multiplying them by the discount factor (which, as mentioned before, is between 0 and 1). The discount factor is raised to a higher power the more timesteps away from state S_t the reward was received. This is done so that the Actor gives more importance to states and rewards nearer to it.

Formally Aggregate Reward is defined as follows:

f:id:unifa_tech:20220228154453p:plain
- G_t = R_t + \gamma R_{t+1} + \gamma^2 R_{t+2} + \dots = \sum_{k=0}^{\infty} \gamma^k R_{t+k} - where  G_t is the Aggregate Return

In the example below, as the states get closer to the actual reward shown in green (e.g. S_{t+3}), their Aggregate Return increases, indicating that a real, large reward is nearby. States that are further away (e.g. S_t) get a smaller portion of the same reward, due to the weighting the discount factor applies to rewards that are more timesteps away.

f:id:unifa_tech:20220228142854p:plain
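As a sketch, G_t can be computed from a list of rewards collected from timestep t onwards; the hypothetical episode below, with a single large reward a few timesteps away, mirrors the example above.

    def discounted_return(rewards, gamma=0.9):
        """G_t = sum of gamma^k * R_{t+k} over the rewards observed from timestep t onwards."""
        return sum((gamma ** k) * r for k, r in enumerate(rewards))

    # A hypothetical episode where the only reward is a large one 3 timesteps away:
    rewards = [0.0, 0.0, 0.0, 10.0]
    print(round(discounted_return(rewards), 2))      # 7.29 -- the reward as seen from the distant state S_t
    print(round(discounted_return(rewards[3:]), 2))  # 10.0 -- the same reward as seen from right next to it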

Conclusion

The purpose of modeling problems as MDPs is to eventually find an optimal Action Policy for every State. An Optimal Action Policy is defined as a Policy that maximises the mean Cumulative Reward that an Actor receives over its lifetime.
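As a closing sketch, the mean Cumulative Reward of a Policy can be estimated by simply running many episodes and averaging their discounted returns; the toy MDP and the two candidate Policies below are invented purely for illustration.

    import random

    # Toy MDP: from "start", the "safe" action reaches the goal slowly but never the pit,
    # while "risky" reaches the goal faster but sometimes falls into the pit.
    # All states, actions, probabilities and rewards are illustrative assumptions.
    transition = {
        ("start", "safe"):  {"start": 0.9, "goal": 0.1},
        ("start", "risky"): {"start": 0.5, "goal": 0.4, "pit": 0.1},
    }
    reward = {"start": 0.0, "goal": 10.0, "pit": -10.0}
    gamma = 0.9

    def rollout(action, max_steps=50):
        """Run one episode from 'start' and return its discounted return."""
        state, total, discount = "start", 0.0, 1.0
        for _ in range(max_steps):
            if state != "start":  # "goal" and "pit" are terminal
                break
            dist = transition[(state, action)]
            next_states, probs = zip(*dist.items())
            state = random.choices(next_states, weights=probs, k=1)[0]
            total += discount * reward[state]
            discount *= gamma
        return total

    # Estimate the mean Cumulative Reward of each fixed Policy;
    # the better Policy is the one with the higher average.
    for action in ("safe", "risky"):
        mean_return = sum(rollout(action) for _ in range(10000)) / 10000
        print(action, round(mean_return, 2))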

unifa-e.com