Light Avoiding SBOT


Authors: Martin Ďurčo, Dominik Kapišinský

In our project we use the SBOT robot. The goal of the project is to make the robot learn the best strategy for running away from light. [[Image:light_sbot_1.jpg|thumb|250px|Light avoiding SBot]]

The Sbot has two sensors from which we can obtain information about the intensity of light. Its environment is dark, and we use a spotlight to provide the light stimulus for reinforcement learning. The measurements from the two light sensors are the inputs the RL algorithm uses to decide which action to take from the set of actions. Each sensor has 3 states: dark, middle and light. [[Image:light_sbot_2.jpg|thumb|250px|Light sensors]] [[Image:light_sbot_3.jpg|thumb|250px|Light avoiding SBot]]

The set of actions consists of three actions: moreLeft, moreRight and Forward. That gives 27 rules in total (3*3*3 = actions * states of the first sensor * states of the second sensor). Each of the 27 rules is assigned a weight (or quality), and the RL algorithm adjusts these weights based on the reward the Sbot receives for its previous action. At the end of training, the best rules are those with the highest weights. A minimal sketch of the rule table and the weight update is given below.
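The following is a minimal sketch in C of how such a rule table, action selection and weight update can look. It is not the authors' source code (that is linked at the bottom of the page); the sensor thresholds, the 10 % exploration rate and the reward value are assumptions, and the initial weight of 200 is only suggested by the tables in the Results section.

<pre>
/* Illustrative sketch only -- not the project's actual source code.
 * Thresholds, exploration rate and reward are assumptions. */
#include <stdio.h>
#include <stdlib.h>

#define N_STATES   3        /* dark, middle, light per sensor   */
#define N_ACTIONS  3        /* FORWARD, moreLEFT, moreRIGHT     */
#define INIT_W     200      /* assumed initial weight of a rule */

enum action { FORWARD, MORE_LEFT, MORE_RIGHT };

/* 3 x 3 x 3 = 27 rule weights: weight[s1][s2][action] */
static int weight[N_STATES][N_STATES][N_ACTIONS];

/* Map a raw sensor reading to one of the 3 states (thresholds assumed). */
static int discretize(int raw)
{
    if (raw < 300) return 0;   /* dark   */
    if (raw < 700) return 1;   /* middle */
    return 2;                  /* light  */
}

/* Pick the action with the highest weight for the current state pair,
 * with a small random exploration probability. */
static int choose_action(int s1, int s2)
{
    if (rand() % 100 < 10)                 /* 10 % exploration (assumed) */
        return rand() % N_ACTIONS;
    int best = FORWARD;
    for (int a = 1; a < N_ACTIONS; a++)
        if (weight[s1][s2][a] > weight[s1][s2][best])
            best = a;
    return best;
}

/* Reinforce the rule that fired: positive reward if the light got weaker
 * after the move, negative if it got stronger (reward definition assumed). */
static void update_rule(int s1, int s2, int a, int reward)
{
    weight[s1][s2][a] += reward;
}

int main(void)
{
    for (int i = 0; i < N_STATES; i++)
        for (int j = 0; j < N_STATES; j++)
            for (int a = 0; a < N_ACTIONS; a++)
                weight[i][j][a] = INIT_W;

    /* One illustrative step with made-up sensor readings. */
    int s1 = discretize(250), s2 = discretize(800);
    int a  = choose_action(s1, s2);
    update_rule(s1, s2, a, 20);            /* pretend the light got weaker */
    printf("state (%d,%d) action %d new weight %d\n",
           s1, s2, a, weight[s1][s2][a]);
    return 0;
}
</pre>

The whole learned behaviour therefore fits into a single 3x3x3 array of 27 integer weights.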

==Reinforcement learning==

Reinforcement learning is an area of machine learning in computer science, concerned with how an agent ought to take actions in an environment so as to maximize some notion of cumulative reward. In the operations research and control literature the field where reinforcement learning methods are studied is called approximate dynamic programming.

In machine learning, the environment is typically formulated as a Markov decision process (MDP), and many reinforcement learning algorithms for this context are closely related to dynamic programming techniques. The main difference from these classical techniques is that reinforcement learning algorithms do not need knowledge of the MDP and they target large MDPs where exact methods become infeasible.

Reinforcement learning differs from standard supervised learning in that correct input/output pairs are never presented, nor sub-optimal actions explicitly corrected. Further, there is a focus on on-line performance, which involves finding a balance between exploration (of uncharted territory) and exploitation (of current knowledge). The exploration vs. exploitation trade-off in reinforcement learning has been most thoroughly studied through the multi-armed bandit problem and in finite MDPs.

===Introduction===

The basic reinforcement learning model consists of:

* a set of environment states S
* a set of actions A
* rules of transitioning between states
* rules that determine the scalar immediate reward of a transition
* rules that describe what the agent observes

The rules are often stochastic. The observation typically involves the scalar immediate reward associated to the last transition. In many works, the agent is also assumed to observe the current environmental state, in which case we talk about full observability, whereas in the opposing case we talk about partial observability. Sometimes the set of actions available to the agent is restricted (e.g., you cannot spend more money than what you possess).

A reinforcement learning agent interacts with its environment in discrete time steps. At each time t, the agent receives an observation o<sub>t</sub>, which typically includes the reward r<sub>t</sub>. It then chooses an action a<sub>t</sub> from the set of actions available, which is subsequently sent to the environment. The environment moves to a new state s<sub>t+1</sub> and the reward r<sub>t+1</sub> associated with the transition (s<sub>t</sub>, a<sub>t</sub>, s<sub>t+1</sub>) is determined. The goal of a reinforcement learning agent is to collect as much reward as possible. The agent can choose any action as a function of the history and it can even randomize its action selection.
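As a concrete illustration of this observe-act-reward loop, here is a small self-contained sketch in C. The "environment" is just a stand-in integer light level with arbitrary numbers, not the real SBOT setup.

<pre>
/* Minimal sketch of the observe -> act -> reward loop described above.
 * The environment here is a stand-in integer, not the real SBOT. */
#include <stdio.h>

static int light = 800;                       /* stand-in environment state */

static int observe(void)        { return light; }
static int select_action(int o) { return (o > 500) ? 1 : 0; }  /* 1 = turn, 0 = forward */
static int step(int a)
{
    light += (a == 1) ? -100 : 50;            /* acting changes the environment state */
    return (a == 1) ? 10 : -10;               /* reward r_{t+1} for the transition    */
}

int main(void)
{
    for (int t = 0; t < 10; t++) {
        int o_t = observe();                  /* observation o_t              */
        int a_t = select_action(o_t);         /* action a_t                   */
        int r   = step(a_t);                  /* environment moves to s_{t+1} */
        printf("t=%d obs=%d action=%d reward=%d\n", t, o_t, a_t, r);
    }
    return 0;
}
</pre>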

When the agent's performance is compared to that of an agent which acts optimally from the beginning, the difference in performance gives rise to the notion of regret. Note that in order to act near optimally, the agent must reason about the long term consequences of its actions: In order to maximize my future income I better go to school now, although the immediate monetary reward associated with this might be negative.

Thus, reinforcement learning is particularly well suited to problems which include a long-term versus short-term reward trade-off. It has been applied successfully to various problems, including robot control, elevator scheduling, telecommunications, backgammon and checkers (Sutton and Barto 1998, Chapter 11).

Two components make reinforcement learning powerful: The use of samples to optimize performance and the use of function approximation to deal with large environments. Thanks to these two key components, reinforcement learning can be used in large environments in any of the following situations:

* A model of the environment is known, but an analytic solution is not available
* Only a simulation model of the environment is given (the subject of simulation-based optimization)
* The only way to collect information about the environment is by interacting with it

The first two of these problems could be considered planning problems (since some form of the model is available), while the last one could be considered as a genuine learning problem. However, under a reinforcement learning methodology both planning problems would be converted to machine learning problems.

==Results (see the video below)==

Each table lists, for every pair of sensor states (first sensor, second sensor), the weights of the FORWARD, moreLEFT and moreRIGHT actions.

After 1 minute (reinforcement learning ON):

{| class="wikitable"
! First sensor state !! Second sensor state !! FORWARD !! moreLEFT !! moreRIGHT
|-
| 0 || 0 || 442 || 193 || 268
|-
| 0 || 1 || 189 || 200 || 200
|-
| 0 || 2 || 200 || 349 || 184
|-
| 1 || 0 || 189 || 200 || 200
|-
| 1 || 1 || 233 || 200 || 200
|-
| 1 || 2 || 184 || 229 || 200
|-
| 2 || 0 || 237 || 200 || 229
|-
| 2 || 1 || 211 || 200 || 200
|-
| 2 || 2 || 171 || 197 || 299
|}

After 2 minutes (reinforcement learning OFF): the weights are unchanged, since no learning took place in this interval.

{| class="wikitable"
! First sensor state !! Second sensor state !! FORWARD !! moreLEFT !! moreRIGHT
|-
| 0 || 0 || 442 || 193 || 268
|-
| 0 || 1 || 189 || 200 || 200
|-
| 0 || 2 || 200 || 349 || 184
|-
| 1 || 0 || 189 || 200 || 200
|-
| 1 || 1 || 233 || 200 || 200
|-
| 1 || 2 || 184 || 229 || 200
|-
| 2 || 0 || 237 || 200 || 229
|-
| 2 || 1 || 211 || 200 || 200
|-
| 2 || 2 || 171 || 197 || 299
|}

After 3 minutes (reinforcement learning ON again):

{| class="wikitable"
! First sensor state !! Second sensor state !! FORWARD !! moreLEFT !! moreRIGHT
|-
| 0 || 0 || 505 || 271 || 316
|-
| 0 || 1 || 276 || 200 || 225
|-
| 0 || 2 || 310 || 357 || 127
|-
| 1 || 0 || 196 || 201 || 205
|-
| 1 || 1 || 200 || 200 || 200
|-
| 1 || 2 || 217 || 251 || 132
|-
| 2 || 0 || 154 || 113 || 465
|-
| 2 || 1 || 204 || 225 || 200
|-
| 2 || 2 || 203 || 207 || 264
|}
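The learned behaviour can be read directly from the last table: for every pair of sensor states, the action with the highest weight wins. The following sketch (not part of the project code) extracts this greedy policy from the values logged after 3 minutes:

<pre>
/* Reading the logged weight table: for every state pair the learned
 * behaviour is simply the action with the highest weight. The numbers
 * are copied from the "after 3 minutes" table above. */
#include <stdio.h>

static const char *action_name[3] = { "FORWARD", "moreLEFT", "moreRIGHT" };

static const int weight[3][3][3] = {
    { {505, 271, 316}, {276, 200, 225}, {310, 357, 127} },
    { {196, 201, 205}, {200, 200, 200}, {217, 251, 132} },
    { {154, 113, 465}, {204, 225, 200}, {203, 207, 264} },
};

int main(void)
{
    for (int s1 = 0; s1 < 3; s1++)
        for (int s2 = 0; s2 < 3; s2++) {
            int best = 0;
            for (int a = 1; a < 3; a++)
                if (weight[s1][s2][a] > weight[s1][s2][best])
                    best = a;
            printf("state (%d,%d): %s\n", s1, s2, action_name[best]);
        }
    return 0;
}
</pre>

Assuming the encoding 0 = dark, 1 = middle, 2 = light from the introduction, the extracted policy tends to drive forward when both sensors are dark and to turn whenever one sensor reports light, which matches the light-avoiding behaviour shown in the video.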

==Video==

http://www.youtube.com/watch?v=KFvERq0Zsfc

==Source Code==

Source code: [[Media:Light_avoiding_source.rar|Light_avoiding_source.rar]]