LCS in context of ML- Martin Mihál

From RoboWiki
Revision as of 22:38, 29 May 2017 by Robot (talk | contribs) (Data & Results)
Jump to: navigation, search

Goal of project

Goal of the project is show possibilities of LCS algorithm in non-robotic enviroment - in area of - we can say - "pure" machine learning. I'll use data of activity of user of given website(http://www.publico.es/) from Piano(http://piano.io/) system and the goal of whole process will be define set of rules which best define users which will come back to website - that means they are "loyal" in some way and we can try to target them with locking of some content and deliver them offer. On the other site, users which are not loyal yet(they won't probablycoma back) - we want to give them freedom in browsing of given website and and develop interest or "addiction" and give them offer later.

Overview

I'll use data of activity of user from period(in this experiment month) A and we want to predict if user will have any interaction in following period B. Between period A and B are some days because we want to track "long-term" loyality no "from-day-to-day" loyality.

Preparing of data

In the process of preparing data are important to take in consideration 2 factors:

  1. Understanding of data we have and problem we want solve
  2. Type of data which used algorithm need


As a target value I will use number of pageviews(PV) in period B.

For describing users I'm gonna use following data from period A(all data are in binar value (1/0) because that's format of data which LCS need, except number of PV in period A which I will use in different way). All selected variables has background in data - for example we want track if people from different cities/countries has different properties, we check also how many days there is between first and last PV in period A, if they visit particular sections etc. :

  1. jan_visits - only non-binary feature - number of PV in period A
  2. stayMoreThan7 - if the difference between first and last PV was more than 7 days in period A
  3. stayMoreThan14 - if the difference between first and last PV was more than 14 days in period A
  4. stayMoreThan21 - if the difference between first and last PV was more than 21 days in period A
  5. stayMoreThan28 - if the difference between first and last PV was more than 28 days in period A
  6. fromMobile - if user is from mobile or tablet device
  7. fromDesktop - if user is from desktop device
  8. HadDirectVisits - if user visit website direct at least once (write url in browser)
  9. HadSocialVisits - if user come from social website like FB, Twitter etc at least one
  10. HadExternalVisits - if user come from different source at least one
  11. HadSearchVisits - if user come from search engine at least one
  12. isIOSuser - if user is iOS user
  13. visitSection13 - of user visit at least 13th most popular section of website
  14. visitSection12 - of user visit at least 12th most popular section of website
  15. visitSection11- of user visit at least 11th most popular section of website
  16. visitSection10 - of user visit at least 10th most popular section of website
  17. visitSection9 - of user visit at least 9th most popular section of website
  18. visitSection8 - of user visit at least 8th most popular section of website
  19. visitSection7 - of user visit at least 7th most popular section of website
  20. visitSection6 - of user visit at least 6th most popular section of website
  21. visitSection5 - of user visit at least 5th most popular section of website
  22. visitSection4 - of user visit at least 4th most popular section of website
  23. visitSection3 - of user visit at least 3th most popular section of website
  24. visitSection2 - of user visit at least 2th most popular section of website
  25. visitSection1- of user visit at least 1th most popular section of website
  26. fromUK - if user is from UK
  27. fromBrazil - if user is from Brazil
  28. fromUSA - if user is from USA
  29. fromPortugal - if user is from Portugal
  30. fromOther - if user is not from Portugal or US or Brazil or UK (most user are from those countries)
  31. isAdblockUser - if user is using adbocker
  32. fromLisbao - if user is from Lisbao city
  33. fromBraga - if user is from Braga city
  34. fromCoimbra if user is from Coimbra city
  35. feb_visits - TARGET VALUE - HOW MANY PV HAS GIVEN USER IN FEBUARY

LCS

I'm going to try solve this problem with LCS, more precisely with XCS version of LCS. I don't want to go deep into topic about LCS here so I'm providing bunch of links which help in uderstanding of that:

  1. https://www.youtube.com/watch?v=CRge_cZ2cJc&t=619s
  2. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.35.3146&rep=rep1&type=pdf
  3. https://pythonhosted.org/xcs/


Basicaly result of LCS is set of rule which define value in given column(# means doesn't matter that value).

For example rule "1##################1#######0##1###" in my data means "NOT FROM USA AND IS FROM LISBAO CITY AND IS iOS USER AND HAS DIFF BETWEEN 1. AND LAST DAYS AT LEAST 14 DAYS - OTHER VALUE DOESN'T MATTER"


For every rule from rule population we have:

  1. action - in our case 1/0 if user will come back or not
  2. fittnes - fittnes of given rule in
  3. experience -how old is rule in learning process


process of learning is doing by creation of new rules and test it, mutate them etc(check mentioned sources above) - the important aspect of learning is rewarding for given rules - it's doing by value FEB_VISIT - so if rule which are meet has lot of PV in period B, it will be stronger.

After all, we have set of rules and we will predict for every user 1 or 0:

  1. predict 1 - if MORE ACTION OF RULES ARE SAYING 1(vote are multiply by it's fittnes)
  2. predict 0 - if MORE ACTION OF RULES ARE SAYING 0(vote are multiply by it's fittnes)

Actually, it's kind of ensemble learning.

Data & Results

Useri.png

As I mentioned only not-binary variable is 'number of PV in period A'. I will use it in following way:

  1. FILTER USER WITH AT LEAST X PV(X > PV IN PERIOD A)
  2. DIVIDE TO TRAIN & TEST DATA
  3. RUN LCS ON TRAIN DATA
  4. RESULT FOR TEST & TRAIN DATA WITH Y BEST(BY FITNESS) LCS RULES

I set after experiments and trade off with time Y(number of used rules) to 100

1, for x > 3:

test data => TRUE POSITIVE(TP) = 1820, FALSE POSITIVE(FP) = 886, TRUE NEGATIVE(TN) = 622, FALSE NEGATIVE (FN) = 854 => Precision is 0.6726 train data => TRUE POSITIVE(TP) = 7480, FALSE POSITIVE(FP) = 3308, TRUE NEGATIVE(TN) = 2708, FALSE NEGATIVE (FN) = 3360


2, for x > 5: test data => TRUE POSITIVE(TP) = 850, FALSE POSITIVE(FP) =376 , TRUE NEGATIVE(TN) = 233, FALSE NEGATIVE (FN) = 261 => Precision 0.6933 train data => TRUE POSITIVE(TP) = 3412, FALSE POSITIVE(FP) = 1505, TRUE NEGATIVE(TN) = 970, FALSE NEGATIVE (FN) = 958


1, for x > 7:

test data => TRUE POSITIVE(TP) = 684, FALSE POSITIVE(FP) = 569, TRUE NEGATIVE(TN) = 148, FALSE NEGATIVE (FN) = 84 => Precision 0.5459 train data => TRUE POSITIVE(TP) = 2638, FALSE POSITIVE(FP) = 2205, TRUE NEGATIVE(TN) = 664, FALSE NEGATIVE (FN) = 443


Best precision for this experiment set, almost70 %, is to have at least 5 PV and I'm providing also 15 most improtant rule for that:

  1. 111#100###00#0000##1##1###01#01#00
  2. 00000110#00##00####0#0#0#0#0100#00
  3. 1#####00#00#0###111#111000####100
  4. 0#####0###0#000#0#01####00#1001#0#
  5. #0000###00##00000#1##1##00##0##000
  6. 1110###10#0#0#000011#1#10##10##000
  7. 1#1###1##0000##010#1#1##0001#00#00
  8. #110#110000##00010#011###001000000
  9. 1100##1#0###00#0###0#1110#0100#0##
  10. 1110###10000#0#00###0#110#01##1000
  11. 11###011#0#0#000#####1#100#1001000
  12. #000#0#0#0000####0#00#1001#0000##0
  13. 1#101001####00000010001100#1#000##
  14. 1100###1##0#0000##0#1#1##0001##0#0
  15. #00#1#00##0#0#000##00#0#0##0000#00
  16. 11#01#1#0#0##0#00###1011#00100#00#
  17. #0#00##00#0##000#00#10##00010##000

Code

CODE HERE

Conclusion

LCS provide interesting approach to solving machine learning problem. The main advantage of this approach is extremely straightforward representation of rules from rule population on comparison of often used neural network or random forrest.