Line Following Sbot Using Reinforcement Learning

[[Image:sbot_linefollow.jpg|thumb|250px|Sbot following line]]

Our goal is to make Sbot learn to follow a line. There are several algorithms that can do this, but we'll focus on a [http://en.wikipedia.org/wiki/Reinforcement_learning Reinforcement learning] algorithm.
  
Sbot has three sensors (we added one sensor for our purposes), from which we can obtain information about the intensity of reflected light. In our environment the line is black while the background is white, so we can reduce the sensor readings to two states - "on line" and "out of line". Measurements from the two border sensors are the inputs which the RL algorithm uses to decide which action to take from the set of actions, while the middle sensor is used to provide reinforcement. The set of actions consists of three actions: moreLeft, moreRight and balance. So there are 12 different rules (if ''condition'' then ''action'') that can be constructed. We have 4 different conditions (2 inputs, each of which can be in two states), so there are 81 (3^4) different policies, from which we will try to find the best one. Each of the 12 rules has an assigned weight (or quality), and the RL algorithm corrects these weights based on the reward Sbot gets for its previous action. At the end of training, the best rules will be those with the highest weights.
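To make the rule table concrete, the following minimal C sketch (our own illustration, not the Sbot source linked at the bottom of the page) shows one way to represent the 4 conditions, 3 actions and 12 rule weights; the type names, the state encoding and the initial weight of 0.5 are arbitrary choices.

<pre>
#include <stdio.h>

/* Binary reading of one border sensor. */
typedef enum { OUT_OF_LINE = 0, ON_LINE = 1 } sensor_state_t;

/* The three possible actions. */
typedef enum { MORE_LEFT = 0, MORE_RIGHT = 1, BALANCE = 2 } action_t;

#define NUM_STATES  4   /* two border sensors, each "on line" or "out of line" */
#define NUM_ACTIONS 3   /* moreLeft, moreRight, balance */

/* Weight (quality) of every rule "if condition then action": 4 x 3 = 12 rules. */
static double Q[NUM_STATES][NUM_ACTIONS];

/* Encode the two border sensor readings into one of the 4 conditions. */
static int encode_state(sensor_state_t left, sensor_state_t right)
{
    return (int)left * 2 + (int)right;
}

int main(void)
{
    /* Start with all 12 weights equal; 0.5 is an arbitrary initial value. */
    for (int s = 0; s < NUM_STATES; s++)
        for (int a = 0; a < NUM_ACTIONS; a++)
            Q[s][a] = 0.5;

    printf("state(left on line, right out of line) = %d\n",
           encode_state(ON_LINE, OUT_OF_LINE));
    return 0;
}
</pre>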
  
 
Before programming, we have to choose how Sbot will get a reward. We have these options:

# Place some photocells on the track along the line to monitor the progress of the robot: each time the robot reaches the next photocell, it receives a positive reinforcement signal.
# Mount an overhead camera above the line, place a colored shape on top of the robot so that its position can be easily recognized by the camera, and set up a sequence of reward locations. When the robot reaches these positions, it receives a positive reinforcement signal.
# Add a third sensor between the two mounted sensors; the measurement from that middle sensor will serve as a binary reward - if the sensor can see the line, the robot gets a reward, otherwise not.
 
  
The last option seems to be the simplest one, because it does not depend on any other devices, so we will use it in our project.
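For illustration, the binary reward could be obtained roughly as in the sketch below; the sensor function and the threshold are hypothetical stand-ins for the real Sbot light-sensor interface.

<pre>
#include <stdio.h>

/* Assumed calibration: reflected-light readings below this mean "black line". */
#define BLACK_THRESHOLD 300

/* Stub standing in for the real middle light sensor of the Sbot;
 * the actual sensor API and value range differ. */
static int read_middle_sensor(void)
{
    return 120;   /* pretend the middle sensor currently sees the dark line */
}

/* Binary reward: 1 if the middle sensor sees the line, 0 otherwise. */
static int compute_reward(void)
{
    return (read_middle_sensor() < BLACK_THRESHOLD) ? 1 : 0;
}

int main(void)
{
    printf("reward = %d\n", compute_reward());
    return 0;
}
</pre>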
==Algorithm==
A well-known problem in reinforcement learning is the exploration vs. exploitation trade-off. In our program, we select a random action with probability <math>p</math>, which is computed as follows:
  
<math>p = p_{min} + \left(1 - \frac{\sum_{i=1}^w r_i}{r_{max}}\right) \cdot (1-p_{min})</math>
 
where <math>p_{min}</math> is a constant meaning the minimal probability of selecting a random action; <math>r_i</math> is the reward obtained <math>i</math> steps back; <math>w</math> is the size of the time window and <math>r_{max}</math> is the highest possible sum of rewards over the last <math>w</math> steps.
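A sketch of this computation in C, assuming a window of w = 20 steps and p_min = 0.1 (both values are our own choice, not taken from the project):

<pre>
#include <stdio.h>

#define W      20            /* size of the reward time window (assumed value)         */
#define P_MIN  0.1           /* minimal probability of a random action (assumed value) */
#define R_MAX  ((double)W)   /* highest possible sum of binary rewards in the window   */

/* Exploration probability from the formula above: p stays near 1 while recent
 * rewards are poor and shrinks towards P_MIN as the sum of the last W rewards
 * approaches R_MAX. rewards[] holds the last W binary rewards. */
static double exploration_probability(const int rewards[W])
{
    double sum = 0.0;
    for (int i = 0; i < W; i++)
        sum += rewards[i];
    return P_MIN + (1.0 - sum / R_MAX) * (1.0 - P_MIN);
}

int main(void)
{
    int rewards[W] = {0};   /* no rewards received yet -> p == 1.0 */
    printf("p = %.2f\n", exploration_probability(rewards));
    return 0;
}
</pre>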
 
If the probability <math>p</math> is greater than a random value between 0 and 1, a random action is chosen. Otherwise, we select the new action stochastically from among the rules whose condition matches the current state.
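A possible implementation of this selection step is sketched below; we assume a weight-proportional ("roulette wheel") draw among the three matching rules, since the text does not specify the exact stochastic scheme.

<pre>
#include <stdio.h>
#include <stdlib.h>

#define NUM_STATES  4
#define NUM_ACTIONS 3   /* 0 = moreLeft, 1 = moreRight, 2 = balance */

/* Pick an action for the current state. With probability p a uniformly random
 * action is chosen (exploration); otherwise an action is drawn with probability
 * proportional to the weights Q[state][*] of the rules matching the state.
 * The weight-proportional draw is our assumption - the text only says the
 * action is selected stochastically from the matching rules. */
static int select_action(double Q[NUM_STATES][NUM_ACTIONS], int state, double p)
{
    if ((double)rand() / RAND_MAX < p)
        return rand() % NUM_ACTIONS;            /* exploring: random action */

    double total = 0.0;
    for (int a = 0; a < NUM_ACTIONS; a++)
        total += Q[state][a];

    double r = (double)rand() / RAND_MAX * total;
    for (int a = 0; a < NUM_ACTIONS; a++) {
        r -= Q[state][a];
        if (r <= 0.0)
            return a;
    }
    return NUM_ACTIONS - 1;                     /* numerical safety fallback */
}

int main(void)
{
    double Q[NUM_STATES][NUM_ACTIONS] = {
        {0.5, 0.5, 0.5}, {0.9, 0.1, 0.3}, {0.1, 0.9, 0.3}, {0.2, 0.2, 0.9}
    };
    printf("chosen action: %d\n", select_action(Q, 1, 0.1));
    return 0;
}
</pre>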
 
After the action is performed, we take a measurement from the middle sensor, compute the reward and save it into the array of the last <math>w</math> rewards (of course, the oldest reward is replaced). Then we have to update the weights of the rules. If the reward was 1, the weight is updated by the formula:
 
<math>Q(s,a) = Q(s,a)\cdot(1-\alpha) + \alpha</math>
 
where <math>\alpha</math> is a constant (in our program set to 0.2).

If the reward was 0, the weight is updated according to the state where the last action led: if it is a good state, the last used rule was good too, even though it did not receive a reward. So the weight is updated by the formula:
 
<math>Q(s,a) = Q(s,a)\cdot(1-\alpha) + \alpha \cdot \max_{a'} Q(s',a')</math>

where <math>s'</math> is the state to which the last action led.
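The two update rules can be combined into one function, as in the following sketch (it follows our reading of the formulas above, with alpha = 0.2):

<pre>
#include <stdio.h>

#define NUM_STATES  4
#define NUM_ACTIONS 3
#define ALPHA       0.2   /* learning-rate constant from the text */

/* Update the weight of the rule (state, action) that was just used, following
 * the two update formulas above. reward is the binary reward from the middle
 * sensor; new_state is the state the action led to. */
static void update_weight(double Q[NUM_STATES][NUM_ACTIONS],
                          int state, int action, int reward, int new_state)
{
    if (reward == 1) {
        /* rewarded rule: pull its weight towards 1 */
        Q[state][action] = Q[state][action] * (1.0 - ALPHA) + ALPHA;
    } else {
        /* no reward: pull the weight towards the best weight in the new state */
        double best = Q[new_state][0];
        for (int a = 1; a < NUM_ACTIONS; a++)
            if (Q[new_state][a] > best)
                best = Q[new_state][a];
        Q[state][action] = Q[state][action] * (1.0 - ALPHA) + ALPHA * best;
    }
}

int main(void)
{
    double Q[NUM_STATES][NUM_ACTIONS] = {
        {0.5, 0.5, 0.5}, {0.5, 0.5, 0.5}, {0.5, 0.5, 0.5}, {0.5, 0.5, 0.5}
    };
    update_weight(Q, 1, 0, 1, 2);                 /* rule (state 1, moreLeft) rewarded */
    printf("Q[1][moreLeft] = %.2f\n", Q[1][0]);   /* 0.5 * 0.8 + 0.2 = 0.60 */
    return 0;
}
</pre>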
 
We wait 80 ms after each update, so the Sbot performs approximately 10 actions per second.
 
Note that the moreLeft and moreRight actions stop one wheel while the other keeps running. As a result, the robot also moves forward while turning left or right, so we do not need to compute the reward with respect to the robot's previous actions.
 
The robot learns only when its state changes. This is important for a correct learning process: otherwise we could get different sensor readings within the same state, the Sbot would be confused and learning would be inefficient.
 
Random actions are necessary for the learning process, but if we want the Sbot to follow the line, each random action can lead it off the line. So we decided to separate the training and testing phases. While in the training phase the robot did not follow the line all the way, in the testing phase (after 20-30 training trials) it is able to follow the line all the way.
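Putting the pieces together, the whole training/testing cycle could look roughly like the sketch below. All sensor and motor calls are dummy stubs so the file compiles off-robot, the step counts in main are arbitrary, and a greedy action pick replaces the stochastic one for brevity; the real Sbot code in the archive below may be organized quite differently.

<pre>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define NUM_STATES  4
#define NUM_ACTIONS 3
#define W           20       /* reward window size (assumed)               */
#define P_MIN       0.1      /* minimal exploration probability (assumed)  */
#define ALPHA       0.2      /* learning rate from the text                */
#define STEP_US     80000    /* 80 ms pause -> roughly 10 actions/second   */

static double Q[NUM_STATES][NUM_ACTIONS];
static int    rewards[W];

/* Dummy stubs standing in for the real sensors and motors. */
static int  read_state(void)      { return rand() % NUM_STATES; }
static int  read_reward(void)     { return rand() % 2; }
static void perform_action(int a) { (void)a; }

static double rand01(void) { return (double)rand() / RAND_MAX; }

static int best_action(int s)
{
    int best = 0;
    for (int a = 1; a < NUM_ACTIONS; a++)
        if (Q[s][a] > Q[s][best]) best = a;
    return best;
}

static void run(int steps, int training)
{
    int state = read_state();
    for (int t = 0; t < steps; t++) {
        /* exploration probability from the reward window; zero when testing */
        double sum = 0.0;
        for (int i = 0; i < W; i++)
            sum += rewards[i];
        double p = training ? P_MIN + (1.0 - sum / W) * (1.0 - P_MIN) : 0.0;

        /* greedy pick shown here for brevity; the text selects stochastically
         * among the matching rules (see the selection sketch above) */
        int action = (rand01() < p) ? rand() % NUM_ACTIONS : best_action(state);
        perform_action(action);
        usleep(STEP_US);                          /* ~10 actions per second */

        int reward = read_reward();
        rewards[t % W] = reward;                  /* replace the oldest reward */

        int new_state = read_state();
        if (training && new_state != state) {     /* learn only on a state change */
            if (reward == 1)
                Q[state][action] = Q[state][action] * (1.0 - ALPHA) + ALPHA;
            else
                Q[state][action] = Q[state][action] * (1.0 - ALPHA)
                                   + ALPHA * Q[new_state][best_action(new_state)];
        }
        state = new_state;
    }
}

int main(void)
{
    for (int s = 0; s < NUM_STATES; s++)
        for (int a = 0; a < NUM_ACTIONS; a++)
            Q[s][a] = 0.5;

    run(100, 1);   /* short off-robot training demo; the robot trains much longer */
    run(20, 0);    /* testing phase: follow the learned rules, no exploration     */
    return 0;
}
</pre>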
 
Video (recorded with a mobile phone, so the quality is poor):
* [[Media:sbotrl1.3gp|Robot follows line after it has been trained on this track using the RL algorithm described above]]
* [[Media:sbotrl2.3gp|Same robot and program, follows line on a track different than it was trained on]]
 
Source code: [[{{ns:media}}:sbot.zip]]
