ユニファ開発者ブログ

This blog is written by members of the Product Development division of Unifa Inc. (ユニファ株式会社).

Reinforcement Learning: The Hello World Version

By Matthew Millar, R&D Scientist at ユニファ

What is Reinforcement Learning?

Reinforcement learning (RL) is a subset of Artificial Intelligence (AI) that takes an agent-based approach. An agent-based AI learns by interacting with its environment rather than from a dataset. The environment is controlled or altered by the actions of the agent, and those changes are tied to goals so that the agent knows the result of each action. From these goals, appropriate actions can be learned for many different types of tasks. RL is one of the three learning paradigms in machine learning; the other two are supervised and unsupervised learning. RL does not require labeled data as supervised or semi-supervised learning does.

In general, RL is modeled as a Markov Decision Process (MDP), which takes both the environment and agent states, denoted as S. It also defines the set of all possible actions available to the agent, denoted as A. A probability function gives the likelihood of a transition between states (s1 -> s2) when a certain action (a) is performed. A reward is also needed to drive the system to learn; the usual choice is an immediate reward received after the transition between states, denoted as Ra(s, s'). There must also be a set of rules that controls what the agent receives as feedback, i.e., what it can observe.

[Figure: Probability]
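For reference, the textbook definition of the MDP transition probability (written here in the same plain notation as above, not taken from the figure) is:

Pa(s, s') = Pr( st+1 = s' | st = s, at = a )

with Ra(s, s') being the immediate reward received after moving from s to s' via action a.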

The interaction between an agent and its environment happens in discrete time steps. At each time step t the agent receives an observation (ot) and a reward (rt). The agent then chooses an action from its predefined set of actions (at). This action is sent to the environment, which moves the environment to its next state (st+1). As the environment moves to its next state, a new reward is generated (rt+1). This state change creates a transition (st, at, st+1), which allows the agent to learn from its actions.
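As a minimal sketch of that loop (my own illustration, assuming the classic Gym API where env.step() returns four values, as it does later in this post):

import gym

env = gym.make("CartPole-v0")
obs = env.reset()                                  # initial observation o_0
for t in range(100):
    action = env.action_space.sample()             # placeholder policy: random action a_t
    obs, reward, done, info = env.step(action)     # next observation o_{t+1} and reward r_{t+1}
    if done:                                       # episode finished; start a new one
        obs = env.reset()
env.close()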

The agent has to learn optimal behavior, which leads to the notion of regret. Regret compares the performance of the current agent to that of an “optimal” agent and looks at the differences between both the actions taken and the rewards received. This is where instant rewards can hinder the performance of a model: the agent has to consider the long-term outcomes of its actions. Even if the immediate reward for a certain action is negative, the total long-term reward could be higher. This is all done to maximize future outcomes and rewards, which allows RL to be applied to both short-term goals and long-term solutions, though normally not both at the same time.

For an agent to increase the total reward for a task, a balance between exploration and exploitation needs to be struck. Exploitation is basically using what already works: the agent keeps performing the actions that have given it the best rewards so far, which may not be optimal and may cap the reward it can collect for a given set of actions. This is where exploration comes in, hopefully improving the result by trying different actions at any given time step. In general, simple exploration strategies are preferred, as very few exploration algorithms scale up to very complex environments and action sets. One way is to choose the best known long-term action and, at random times, pick a different action to see if it can improve on the already established “optimal” long-term option.
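A minimal sketch of that idea is epsilon-greedy action selection (my own illustration, not code from this project): with probability epsilon the agent explores, otherwise it exploits the best known action.

import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    # q_values: estimated long-term return for each action in the current state
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))   # explore: random action
    return int(np.argmax(q_values))               # exploit: best known action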

There are many options for controlling the learning process of an RL agent, and deciding which actions are beneficial is still difficult to get right. The simplest method, if the action set is finite, is to use a policy. A policy maps states of the environment to probabilities of taking each action, and policies are the basis of the other learning algorithms. The state-value function uses a policy by looking at the expected return when starting from a given state and following that policy from then on; this evaluates the benefit of being in certain states over others. The Monte Carlo (MC) method can be used to perform policy evaluation and improvement. The MC evaluation step looks at every state-action pair and averages all the returns observed for each pair, which builds up an accurate Q-table that fully evaluates the policy. The improvement step uses a greedy policy: for each state it picks the action that maximizes the estimated return. These are just two methods for controlling the learning of an RL agent.
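As a rough sketch of the MC evaluation step described above (my own simplified version, assuming episodes have already been collected as lists of (state, action, reward) tuples and that states are hashable):

from collections import defaultdict
import numpy as np

def mc_evaluate(episodes, gamma=0.99):
    # Average the discounted return observed after each (state, action) pair.
    returns = defaultdict(list)
    for episode in episodes:                      # episode: [(state, action, reward), ...]
        g = 0.0
        for state, action, reward in reversed(episode):
            g = reward + gamma * g                # discounted return from this step onward
            returns[(state, action)].append(g)
    # Q-table: mean observed return for each state-action pair
    return {sa: np.mean(gs) for sa, gs in returns.items()}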

Simple Example:

That was a very quick overview of the main principles of RL. Now let's get to our first project.
So, let's start with the hello world of OpenAI Gym. Note that this is just a first step on our way to the final project that I have planned.
The first step is to get all the dependencies.

Tensorflow: pip install --upgrade tensorflow
Keras: pip install Keras
Numpy: pip install numpy
OpenAI Gym: pip install gym

Now that we have all the requirements, let's get down to the nuts and bolts.
Let's define our rewards method:

Calculate Rewards (reward, gamma value)
		The new reward is reward + (previous reward * gamma value) 

This method lets us calculate the reward for every action.
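A minimal Python sketch of that pseudocode (my own implementation of the usual discounted-reward calculation; the original function may differ in details):

import numpy as np

def calculate_rewards(rewards, gamma=0.99):
    # Work backwards through the episode, accumulating the discounted return.
    discounted = np.zeros_like(rewards, dtype=np.float32)
    running = 0.0
    for i in reversed(range(len(rewards))):
        running = rewards[i] + running * gamma   # reward + (previous reward * gamma)
        discounted[i] = running
    return discounted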
Now let's use a custom log loss function as well:

Custom Loss (True Value, Predicted Value)
		Log(True * (True – Predicted) + (1 – True) * (True + Predicted))
		Take the mean of the results
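In Keras that pseudocode could look roughly like this (a sketch of my own; I am assuming it is the my_ri_loss passed to compile() further below):

import keras.backend as K

def my_ri_loss(y_true, y_pred):
    # log(true * (true - pred) + (1 - true) * (true + pred)), averaged over the batch
    return K.mean(K.log(y_true * (y_true - y_pred) + (1 - y_true) * (y_true + y_pred)))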

Now a scoring function to see how accurate our AI is during testing.

Evaluate Model (Model, number of tests)
		Total rewards = [] # Holds the summed rewards for each iteration
		For each iteration or run through the environment
			Sum of rewards = 0
			Have the agent run through the environment, collecting rewards
			Add the collected rewards to the Sum of rewards
			After the iteration, add the Sum of rewards to the Total rewards list
		Then take the mean, median, mode, or any other summary you want as the score
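A Python sketch of this evaluation loop (my own version, assuming the test_model defined below and the classic Gym step API):

import numpy as np

def evaluate_model(env, test_model, num_tests=10):
    total_rewards = []
    for _ in range(num_tests):
        obs = env.reset()
        episode_reward = 0.0
        done = False
        while not done:
            probs = test_model.predict(obs.reshape(1, -1))[0]  # action probabilities
            action = int(np.argmax(probs))                     # greedy action for evaluation
            obs, reward, done, _ = env.step(action)
            episode_reward += reward
        total_rewards.append(episode_reward)
    return np.mean(total_rewards)   # mean score; median or mode also possible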

Now for the Keras model part:

# Set up the Keras model (imports added for completeness)
from keras import layers
from keras.models import Model
from keras.initializers import glorot_uniform
from keras.optimizers import Adam

lr = 0.001  # learning rate (value assumed here)

# Note: env must already be created (see gym.make("CartPole-v0") below)
action_count = env.action_space.n                     # number of possible actions
input_shape = layers.Input(shape=env.reset().shape)   # observation input
adv = layers.Input(shape=[1])                         # extra input used only during training

x = layers.Dense(8, activation="relu",
                 kernel_initializer=glorot_uniform(seed=42))(input_shape)
out = layers.Dense(action_count, activation="softmax",
                   kernel_initializer=glorot_uniform(seed=42))(x)

# Create the training model and the prediction model
train_model = Model(inputs=[input_shape, adv], outputs=out)
train_model.compile(loss=my_ri_loss, optimizer=Adam(lr))
test_model = Model(inputs=[input_shape], outputs=out)

This will allow us to try out one of the test environments that ship with OpenAI Gym.
Let's start with:

env = gym.make("CartPole-v0")
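The training loop itself is not shown here, so as a rough sketch only (my assumption of how the pieces might be wired together), one episode of experience could be collected like this and then fed to train_model along with the discounted rewards from calculate_rewards:

import numpy as np

def run_episode(env, test_model):
    # Collect one episode of (state, action, reward) using the current policy.
    states, actions, rewards = [], [], []
    obs = env.reset()
    done = False
    while not done:
        probs = test_model.predict(obs.reshape(1, -1))[0]
        action = int(np.random.choice(len(probs), p=probs))  # sample from the softmax policy
        states.append(obs)
        actions.append(action)
        obs, reward, done, _ = env.step(action)
        rewards.append(reward)
    return np.array(states), np.array(actions), np.array(rewards)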

Putting it all together should give you this.

[Figure: CartRI]
With the output of

Average reward for training episode 100: 19.30 Test Score: 13.60 Loss: 0.030333 
Average reward for training episode 200: 22.45 Test Score: 10.50 Loss: 0.022194 
…..
The average reward for training episode 8200: 25.68 Test Score: 191.40 Loss: -0.003017 
Solved in 8199 episodes!

Conclusion:

This is the end of this brief introduction to RL. We learned the basics of reinforcement learning and how to code up a simple example in Python. Now we can start to get into more complex models and later even teach something to run!

Part two so soon!?!?

Well, yes, as that was a simple version and easy to follow. Let's look at using a retro game and start playing things like Space Invaders and Mario!
The first thing we need to do is install Gym Retro using
pip3 install gym-retro
And that's it. To test it out we will use the built-in ROM for Airstriker-Genesis, but other ROMs can be downloaded and used here as well.

import retro

def main():
    # Create the environment using the bundled Airstriker-Genesis ROM
    env = retro.make(game='Airstriker-Genesis')
    obs = env.reset()
    while True:
        # Take a random action and render each frame
        obs, rew, done, info = env.step(env.action_space.sample())
        env.render()
        if done:
            # Episode over; restart the game
            obs = env.reset()
    env.close()

if __name__ == "__main__":
    main()

And that will give you the screen output below.

[Figure: ShooterRI]

Now that we know it is working, we need to import other games into the environment. The simplest way is to import all the ROMs that come with Gym Retro by going into the terminal and typing this:

python3 -m retro.import /path/to/your/ROMs/directory/

But you will need to find ROMs that work with gym-retro.

Our next step will be to move to Unity3D and see how to implement RL in a 3D environment instead of these simpler 2D games. In a future post we will look at teaching an AI to walk.