Difference between revisions of "Line Following Sbot Using Reinforcement Learning"

From RoboWiki
Jump to: navigation, search
 
(11 intermediate revisions by the same user not shown)
Line 20: Line 20:
 
If probability p is greater than a random value between 0 and 1, random action will be chosen. Otherwise, we'll select a new action stochastically, from the rules with actual state in condition.
 
If probability p is greater than a random value between 0 and 1, random action will be chosen. Otherwise, we'll select a new action stochastically, from the rules with actual state in condition.
  
After action is performed, we take measurement from middle sensor, compute reward and save it into the array of last w rewards (of course, the oldest reward is replaced). Then we have to update weights of rules. If reward was positive, weight will be updated by formula:
+
After action is performed, we take measurement from middle sensor, compute reward and save it into the array of last w rewards (of course, the oldest reward is replaced). Then we have to update weights of rules. If reward was 1, weight will be updated by formula:
  
<math>Q(s,a) = {Q(s,a)}\dot{(1-\alpha)}\dot{\alpha}</math>
+
<math>Q(s,a) = Q(s,a)\cdot(1-\alpha)\cdot \alpha</math>
  
where <math>\alpha</math> is constant (in our program set to 0.2)
+
where <math>\alpha</math> is constant (in our program set to 0.2).
 +
If given reward was 0, weight will be updated according state where last action lead. If it's good state, last used rule was good too, although didn't get reward. So weight is updated by formula:
 +
 
 +
<math>Q(s,a) = \max_a Q(s,a) \cdot \alpha</math>
 +
 
 +
We wait after each update 80 ms, so the sbot will produce cca. 10 actions per second.
 +
 
 +
Note that actions left and right stop one wheel while other is running. Result is, that robot moves forward while turn left or right, too. So we needn't compute reward with respect to robot's last actions.
 +
 
 +
Robot learns only if state is changed. This is important for correct learning process. Otherwise we can get different results from sensors in the same state, sbot is confused and learning is inefficient.
 +
 
 +
Random actions are necessary in learning process. But if we want from sbot to follow line, each random action can lead him out of the line. So we decide separate training and testing phase. And while in training phase robot didn't follow line all the way, in testing phase (after 20-30 trails) it will be able to follow line all the way.
 +
 
 +
Video (from mobile phone = poor quality):
 +
* [[Media:sbotrl1.3gp|Robot follows line after it has been trained on this track using the RL algorithm described above]]
 +
* [[Media:sbotrl2.3gp|Same robot and program, follows line on a track different than it was trained on]]
 +
 
 +
Source code: [[{{ns:media}}:sbot.zip]]

Latest revision as of 14:47, 27 November 2009

Sbot following line

Our goal is to make Sbot learn to follow line. There are some algorithms to do that, but we'll focus on Reinforcement learning algorithm.

Sbot has three sensors (we added one sensor for our purposes), from which we can obtain an information about intensity of reflected light. In our environment line is black while background is white, so we can reduce these information from sensors into two states - "on line" and "out of line". Measurements from two border sensors are inputs, which RL algorithm will use to decide, which action from set of actions it has to make and middle sensor is used to provide reinforcement. Set of actions consists of these three actions: moreLeft, moreRight and balance. So there are 12 different rules (if condition then action) that can be constructed. We have 4 different conditions (2 inputs, each can be in two states), so there are 81 (3^4) different policies, from which we will try to find the best one. Each of 12 rules will have assigned a weight (or quality) and RL algorithm will make some corrections of these weights, based on reward, which Sbot get from previous action. In the end of training, the best rules will be those with the highest weights.

Before programming, we have to choose the way, how Sbot will get a reward. We have these options:

  1. Place some photocells on the track along the line to monitor the progress of the robot: Each time robot reaches the next photocell, it will receive positive reinforcement signal.
  2. Mount overhead camera about line, place some color shape on top of the robot, so its position can be recognize by camera easily and setup a sequence of reward locations. When the robot reaches these positions, it receive positive reinforcement signal.
  3. Add third sensor between two mounted sensors and measurement from that middle sensor will serve as a binary reward - if sensor can see the line, robot get a reward, otherwise not.

The last option seems to be the simpliest one, so we will use it in our project.

Algorithm

Well-known problem in reinforcement learning is the exploration vs. exploitation trade-off. In our program, we select random action with probability <math>p</math> which is compute as follows:

<math>p = p_{min} + \frac{\sum_{i=1}^w r_i}{r_{max}} + (1-p_{min})</math>

where <math>p_{min}</math> is constant and means minimal probability of selection of random action; <math>r_i</math> is reward given i-steps back; <math>w</math> is size of time window and <math>r_{max}</math> is the highest possible sum of rewards for last <math>w</math> steps.

If probability p is greater than a random value between 0 and 1, random action will be chosen. Otherwise, we'll select a new action stochastically, from the rules with actual state in condition.

After action is performed, we take measurement from middle sensor, compute reward and save it into the array of last w rewards (of course, the oldest reward is replaced). Then we have to update weights of rules. If reward was 1, weight will be updated by formula:

<math>Q(s,a) = Q(s,a)\cdot(1-\alpha)\cdot \alpha</math>

where <math>\alpha</math> is constant (in our program set to 0.2). If given reward was 0, weight will be updated according state where last action lead. If it's good state, last used rule was good too, although didn't get reward. So weight is updated by formula:

<math>Q(s,a) = \max_a Q(s,a) \cdot \alpha</math>

We wait after each update 80 ms, so the sbot will produce cca. 10 actions per second.

Note that actions left and right stop one wheel while other is running. Result is, that robot moves forward while turn left or right, too. So we needn't compute reward with respect to robot's last actions.

Robot learns only if state is changed. This is important for correct learning process. Otherwise we can get different results from sensors in the same state, sbot is confused and learning is inefficient.

Random actions are necessary in learning process. But if we want from sbot to follow line, each random action can lead him out of the line. So we decide separate training and testing phase. And while in training phase robot didn't follow line all the way, in testing phase (after 20-30 trails) it will be able to follow line all the way.

Video (from mobile phone = poor quality):

Source code: Media:sbot.zip