Control using Reinforcement Learning - Filip Pavlove

From RoboWiki
Revision as of 02:27, 6 May 2021 by Robot

Goal

The main goal of this project is to use reinforcement learning algorithms to control a lunar lander (see https://gym.openai.com/envs/LunarLander-v2/).

The landing pad is always at coordinates (0,0). The coordinates are the first two numbers in the state vector. Moving from the top of the screen to the landing pad and reaching zero speed is rewarded with about 100 to 140 points. If the lander moves away from the landing pad, it loses that reward again. An episode finishes when the lander crashes or comes to rest, receiving an additional -100 or +100 points, respectively. Each leg in contact with the ground is worth +10 points. Firing the main engine costs -0.3 points per frame. The environment is considered solved at 200 points. Landing outside the landing pad is possible. Fuel is infinite, so an agent can learn to fly and then land on its first attempt. Four discrete actions are available: do nothing, fire the left orientation engine, fire the main engine, and fire the right orientation engine.

Methodology

Two algorithms were used.

 Q-learning: https://link.springer.com/article/10.1007/BF00992698
 Deep Q-learning: https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf

Both algorithms are implemented in Python. The deep learning methods were implemented in the PyTorch framework.
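As an illustration of the first algorithm, here is a minimal sketch of a tabular Q-learning update with epsilon-greedy action selection. It is not the project's code: the names make_q_table, choose_action and q_update, and the default values alpha=0.1 and gamma=0.99, are assumptions chosen for the example. States are assumed to be hashable tuples of bucket indices.

```python
import random
from collections import defaultdict

N_ACTIONS = 4  # do nothing, left engine, main engine, right engine


def make_q_table():
    # Hypothetical helper: unvisited states start with all action values at 0.0.
    return defaultdict(lambda: [0.0] * N_ACTIONS)


def choose_action(q, state, epsilon, rng=random):
    # Epsilon-greedy: explore with probability epsilon, otherwise act greedily.
    if rng.random() < epsilon:
        return rng.randrange(N_ACTIONS)
    values = q[state]
    return max(range(N_ACTIONS), key=values.__getitem__)


def q_update(q, state, action, reward, next_state, done, alpha=0.1, gamma=0.99):
    # Standard Q-learning target: r + gamma * max_a' Q(s', a'),
    # with no bootstrapping from terminal states.
    # alpha and gamma are assumed values, not taken from the project.
    target = reward if done else reward + gamma * max(q[next_state])
    q[state][action] += alpha * (target - q[state][action])
    return q[state][action]
```

A training loop would call choose_action on the discretized observation, step the environment, and pass the resulting transition to q_update.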

Q-learning

The state space of the lander consists of six continuous values (x-position, y-position, x-velocity, y-velocity, angle, angular velocity) and two boolean values (left-leg and right-leg contact with the ground).

The continuous state variables were discretized into several "buckets". Each entry of the Q-table represents the maximum expected future reward for taking a given action in a given state. The total number of (state, action) pairs in the lander's Q-table is 12*10*5*5*6*5*2*2*4 = 1440000. The bounds of the state variables were derived from the distributions of the observed values:

Figures: Hist x positions.png, Hist angle.png, Hist y velocity.png (histograms of the observed x positions, angles, and y velocities).
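The bucketing described above can be sketched as follows. The bucket counts (12, 10, 5, 5, 6, 5 for the continuous variables, 2 and 2 for the leg contacts, 4 actions) come from the text. The numeric bounds in BOUNDS are placeholders invented for this example, since the actual bounds were read off the histograms and are not listed here. The function names are hypothetical, and equal-width buckets are an assumption.

```python
from bisect import bisect_right
from math import prod

# Bucket counts per continuous variable, as stated in the text:
# x, y, x-velocity, y-velocity, angle, angular velocity.
BUCKETS = (12, 10, 5, 5, 6, 5)
N_LEG_STATES = (2, 2)  # left-leg and right-leg contact flags
N_ACTIONS = 4

# ASSUMED bounds (low, high) for each continuous variable; placeholders only.
BOUNDS = (
    (-1.0, 1.0), (0.0, 1.5), (-1.0, 1.0),
    (-1.5, 0.5), (-0.6, 0.6), (-1.0, 1.0),
)


def bucket_index(value, low, high, n_buckets):
    # Equal-width buckets; values outside [low, high] fall into the
    # first or last bucket.
    width = (high - low) / n_buckets
    edges = [low + width * i for i in range(1, n_buckets)]
    return bisect_right(edges, value)


def discretize(obs):
    # Map an 8-element observation to a tuple of bucket indices.
    continuous = tuple(
        bucket_index(v, lo, hi, n)
        for v, (lo, hi), n in zip(obs[:6], BOUNDS, BUCKETS)
    )
    legs = tuple(int(bool(v)) for v in obs[6:8])
    return continuous + legs


# Number of (state, action) pairs: 12*10*5*5*6*5*2*2*4.
table_size = prod(BUCKETS) * prod(N_LEG_STATES) * N_ACTIONS
```

The tuple returned by discretize is hashable, so it can be used directly as a key into a dictionary-based Q-table or as an index into a nested array.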