Difference between revisions of "LCS in context of ML- Martin Mihál"

Latest revision as of 23:47, 29 May 2017

Goal of project

Goal of the project is show possibilities of LCS algorithm in non-robotic enviroment - in area of - we can say - "pure" machine learning. I'll use data of activity of user of given website(http://www.publico.es/) from Piano(http://piano.io/) system and the goal of whole process will be define set of rules which best define users which will come back to website - that means they are "loyal" in some way and we can try to target them with locking of some content and deliver them offer. On the other site, users which are not loyal yet(they won't probablycoma back) - we want to give them freedom in browsing of given website and and develop interest or "addiction" and give them offer later.

Overview

I'll use data of activity of user from period(in this experiment month) A and we want to predict if user will have any interaction in following period B. Between period A and B are some days because we want to track "long-term" loyality no "from-day-to-day" loyality.

Preparing of data

In the process of preparing data are important to take in consideration 2 factors:

Understanding of data we have and problem we want solve
Type of data which used algorithm need

As a target value I will use number of pageviews(PV) in period B.

For describing users I'm gonna use following data from period A(all data are in binar value (1/0) because that's format of data which LCS need, except number of PV in period A which I will use in different way). All selected variables has background in data - for example we want track if people from different cities/countries has different properties, we check also how many days there is between first and last PV in period A, if they visit particular sections etc. :

jan_visits - only non-binary feature - number of PV in period A
stayMoreThan7 - if the difference between first and last PV was more than 7 days in period A
stayMoreThan14 - if the difference between first and last PV was more than 14 days in period A
stayMoreThan21 - if the difference between first and last PV was more than 21 days in period A
stayMoreThan28 - if the difference between first and last PV was more than 28 days in period A
fromMobile - if user is from mobile or tablet device
fromDesktop - if user is from desktop device
HadDirectVisits - if user visit website direct at least once (write url in browser)
HadSocialVisits - if user come from social website like FB, Twitter etc at least one
HadExternalVisits - if user come from different source at least one
HadSearchVisits - if user come from search engine at least one
isIOSuser - if user is iOS user
visitSection13 - of user visit at least 13th most popular section of website
visitSection12 - of user visit at least 12th most popular section of website
visitSection11- of user visit at least 11th most popular section of website
visitSection10 - of user visit at least 10th most popular section of website
visitSection9 - of user visit at least 9th most popular section of website
visitSection8 - of user visit at least 8th most popular section of website
visitSection7 - of user visit at least 7th most popular section of website
visitSection6 - of user visit at least 6th most popular section of website
visitSection5 - of user visit at least 5th most popular section of website
visitSection4 - of user visit at least 4th most popular section of website
visitSection3 - of user visit at least 3th most popular section of website
visitSection2 - of user visit at least 2th most popular section of website
visitSection1- of user visit at least 1th most popular section of website
fromUK - if user is from UK
fromBrazil - if user is from Brazil
fromUSA - if user is from USA
fromPortugal - if user is from Portugal
fromOther - if user is not from Portugal or US or Brazil or UK (most user are from those countries)
isAdblockUser - if user is using adbocker
fromLisbao - if user is from Lisbao city
fromBraga - if user is from Braga city
fromCoimbra if user is from Coimbra city
feb_visits - TARGET VALUE - HOW MANY PV HAS GIVEN USER IN FEBUARY

LCS

I'm going to try solve this problem with LCS, more precisely with XCS version of LCS. I don't want to go deep into topic about LCS here so I'm providing bunch of links which help in uderstanding of that:

Basicaly result of LCS is set of rule which define value in given column(# means doesn't matter that value).

For example rule "1##################1#######0##1###" in my data means "NOT FROM USA AND IS FROM LISBAO CITY AND IS iOS USER AND HAS DIFF BETWEEN 1. AND LAST DAYS AT LEAST 14 DAYS - OTHER VALUE DOESN'T MATTER"

For every rule from rule population we have:

action - in our case 1/0 if user will come back or not
fittnes - fittnes of given rule in
experience -how old is rule in learning process

process of learning is doing by creation of new rules and test it, mutate them etc(check mentioned sources above) - the important aspect of learning is rewarding for given rules - it's doing by value FEB_VISIT - so if rule which are meet has lot of PV in period B, it will be stronger.

After all, we have set of rules and we will predict for every user 1 or 0:

predict 1 - if MORE ACTION OF RULES ARE SAYING 1(vote are multiply by it's fittnes)
predict 0 - if MORE ACTION OF RULES ARE SAYING 0(vote are multiply by it's fittnes)

Actually, it's kind of ensemble learning.

Data & Results

As I mentioned only not-binary variable is 'number of PV in period A'. I will use it in following way:

FILTER USER WITH AT LEAST X PV(X > PV IN PERIOD A)
DIVIDE TO TRAIN & TEST DATA
RUN LCS ON TRAIN DATA
RESULT FOR TEST & TRAIN DATA WITH Y BEST(BY FITNESS) LCS RULES

I set after experiments and trade off with time Y(number of used rules) to 100

1, for x > 3:

test data => TRUE POSITIVE(TP) = 1820, FALSE POSITIVE(FP) = 886, TRUE NEGATIVE(TN) = 622, FALSE NEGATIVE (FN) = 854 => Precision is 0.6726
train data => TRUE POSITIVE(TP) = 7480, FALSE POSITIVE(FP) = 3308, TRUE NEGATIVE(TN) = 2708, FALSE NEGATIVE (FN) = 3360

2, for x > 5:

test data => TRUE POSITIVE(TP) = 850, FALSE POSITIVE(FP) =376 , TRUE NEGATIVE(TN) = 233, FALSE NEGATIVE (FN) = 261 => Precision 0.6933
train data => TRUE POSITIVE(TP) = 3412, FALSE POSITIVE(FP) = 1505, TRUE NEGATIVE(TN) = 970, FALSE NEGATIVE (FN) = 958

1, for x > 7:

test data => TRUE POSITIVE(TP) = 684, FALSE POSITIVE(FP) = 569, TRUE NEGATIVE(TN) = 148, FALSE NEGATIVE (FN) = 84 => Precision 0.5459
train data => TRUE POSITIVE(TP) = 2638, FALSE POSITIVE(FP) = 2205, TRUE NEGATIVE(TN) = 664, FALSE NEGATIVE (FN) = 443

Best precision for this experiment set, almost 70 %, is to have at least 5 PV and I'm providing also 15 most improtant rule for that:

111#100###00#0000##1##1###01#01#00
00000110#00##00####0#0#0#0#0100#00
1#####00#00#0###111#111000####100
0#####0###0#000#0#01####00#1001#0#
#0000###00##00000#1##1##00##0##000
1110###10#0#0#000011#1#10##10##000
1#1###1##0000##010#1#1##0001#00#00
#110#110000##00010#011###001000000
1100##1#0###00#0###0#1110#0100#0##
1110###10000#0#00###0#110#01##1000
11###011#0#0#000#####1#100#1001000
#000#0#0#0000####0#00#1001#0000##0
1#101001####00000010001100#1#000##
1100###1##0#0000##0#1#1##0001##0#0
#00#1#00##0#0#000##00#0#0##0000#00
11#01#1#0#0##0#00###1011#00100#00#
#0#00##00#0##000#00#10##00010##000

Algorithm parameters

After experiments I end up with following parameters:

algorithm.wildcard_probability = .8
algorithm.exploration_probability = .1
algorithm.discount_factor = .05
algorithm.do_ga_subsumption = True
algorithm.do_action_set_subsumption = True
algorithm.max_population_size = 300
algorithm.exploration_probability = .1
algorithm.ga_threshold = 1
algorithm.crossover_probability = .5
algorithm.do_action_set_subsumption = True
algorithm.do_ga_subsumption = False

Code

CODE HERE Media:Projekt.7z

Conclusion

LCS provide interesting approach to solving machine learning problem. The main advantage of this approach is extremely straightforward representation of rules from rule population on comparison of often used neural network or random forrest.

@@ Line 13: / Line 13: @@
 As a target value I will use number of pageviews(PV) in period B.
-For describing users I'm gonna use following data from period A(all data are in binar value (1/0) because that's format of data which LCS need, except number of PV in period A which I will use in different way)
+For describing users I'm gonna use following data from period A(all data are in binar value (1/0) because that's format of data which LCS need, except number of PV in period A which I will use in different way). All selected variables has background in data - for example we want track if people from different cities/countries has different properties, we check also how many days there is between first and last PV in period A, if they visit particular sections etc. :
-==Data selection==
+#jan_visits - only non-binary feature - number of PV in period A
-TODO
+#stayMoreThan7 - if the difference between first and last PV was more than 7 days in period A
+#stayMoreThan14 - if the difference between first and last PV was more than 14 days in period A
+#stayMoreThan21 - if the difference between first and last PV was more than 21 days in period A
+#stayMoreThan28 - if the difference between first and last PV was more than 28 days in period A
+#fromMobile - if user is from mobile or tablet device
+#fromDesktop - if user is from desktop device
+#HadDirectVisits - if user visit website direct at least once (write url in browser)
+#HadSocialVisits - if user come from social website like FB, Twitter etc at least one
+#HadExternalVisits - if user come from different source at least one
+#HadSearchVisits - if user come from search engine at least one
+#isIOSuser - if user is iOS user
+#visitSection13 - of user visit at least 13th most popular section of website
+#visitSection12 - of user visit at least 12th most popular section of website
+#visitSection11- of user visit at least 11th most popular section of website
+#visitSection10 - of user visit at least 10th most popular section of website
+#visitSection9 - of user visit at least 9th most popular section of website
+#visitSection8 - of user visit at least 8th most popular section of website
+#visitSection7 - of user visit at least 7th most popular section of website
+#visitSection6 - of user visit at least 6th most popular section of website
+#visitSection5 - of user visit at least 5th most popular section of website
+#visitSection4 - of user visit at least 4th most popular section of website
+#visitSection3 - of user visit at least 3th most popular section of website
+#visitSection2 - of user visit at least 2th most popular section of website
+#visitSection1- of user visit at least 1th most popular section of website
+#fromUK - if user is from UK
+#fromBrazil - if user is from Brazil
+#fromUSA - if user is from USA
+#fromPortugal - if user is from Portugal
+#fromOther - if user is not from Portugal or US or Brazil or UK (most user are from those countries)
+#isAdblockUser - if user is using adbocker
+#fromLisbao - if user is from Lisbao city
+#fromBraga - if user is from Braga city
+#fromCoimbra if user is from Coimbra city
+#feb_visits - TARGET VALUE - HOW MANY PV HAS GIVEN USER IN FEBUARY
-==LCS algorithm==
+==LCS ==
-TODO
+I'm  going to try solve this problem with LCS, more precisely with XCS version of LCS. I don't want to go deep into topic about LCS here so I'm providing bunch of links which help in uderstanding of that:
+#https://www.youtube.com/watch?v=CRge_cZ2cJc&t=619s
+#http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.35.3146&rep=rep1&type=pdf
+#https://pythonhosted.org/xcs/
-==Results==
-TODO
+Basicaly result of LCS is set of rule which define value in given column(# means doesn't matter that value).
+For example rule "1##################1#######0##1###" in my data means "NOT FROM USA AND IS FROM LISBAO CITY AND IS iOS USER AND HAS DIFF BETWEEN 1. AND LAST DAYS AT LEAST 14 DAYS - OTHER VALUE DOESN'T MATTER"
+For every rule from rule population we have:
+#action - in our case 1/0 if user will come back or not
+#fittnes - fittnes of given rule in
+#experience -how old is rule in learning process
+process of learning is doing by creation of new rules and test it, mutate them etc(check mentioned sources above) - the important aspect of learning is rewarding for given rules - it's doing by value FEB_VISIT - so if rule which are meet has lot of PV in period B, it will be stronger.
+After all, we have set of rules and we will predict for every user 1 or 0:
+# predict 1 - if MORE ACTION OF RULES ARE SAYING 1(vote are multiply by it's fittnes)
+# predict 0   - if MORE ACTION OF RULES ARE SAYING 0(vote are multiply by it's fittnes)
+Actually,  it's kind of ensemble learning.
+==Data & Results==
+[[Image:Useri.png]]
+As I mentioned only not-binary variable is 'number of PV in period A'. I will use it in following way:
+# FILTER USER WITH AT LEAST X PV(X > PV IN PERIOD A)
+# DIVIDE TO TRAIN & TEST DATA
+# RUN LCS ON TRAIN DATA
+# RESULT FOR TEST & TRAIN DATA WITH Y BEST(BY FITNESS) LCS RULES
+I set after experiments and trade off with time Y(number of used rules) to 100
+, for x > 3:
+#test data => TRUE POSITIVE(TP) = 1820, FALSE POSITIVE(FP) = 886, TRUE NEGATIVE(TN) = 622, FALSE NEGATIVE (FN) = 854 => Precision is 0.6726
+#train data => TRUE POSITIVE(TP) = 7480, FALSE POSITIVE(FP) = 3308, TRUE NEGATIVE(TN) = 2708, FALSE NEGATIVE (FN) = 3360
+, for x > 5:
+#test data => TRUE POSITIVE(TP) = 850, FALSE POSITIVE(FP) =376 , TRUE NEGATIVE(TN) = 233, FALSE NEGATIVE (FN) = 261 => Precision	0.6933
+#train data => TRUE POSITIVE(TP) = 3412, FALSE POSITIVE(FP) = 1505, TRUE NEGATIVE(TN) = 970, FALSE NEGATIVE (FN) = 958
+, for x > 7:
+#test data => TRUE POSITIVE(TP) = 684, FALSE POSITIVE(FP) = 569, TRUE NEGATIVE(TN) = 148, FALSE NEGATIVE (FN) = 84 => Precision	0.5459
+#train data => TRUE POSITIVE(TP) = 2638, FALSE POSITIVE(FP) = 2205, TRUE NEGATIVE(TN) = 664, FALSE NEGATIVE (FN) = 443
+Best precision for this experiment set, almost 70 %, is to have at least 5 PV and I'm providing also 15 most improtant rule for that:
+# 111#100###00#0000##1##1###01#01#00
+# 00000110#00##00####0#0#0#0#0100#00
+# 1#####00#00#0###111#111000####100
+# 0#####0###0#000#0#01####00#1001#0#
+# #0000###00##00000#1##1##00##0##000
+# 1110###10#0#0#000011#1#10##10##000
+# 1#1###1##0000##010#1#1##0001#00#00
+# #110#110000##00010#011###001000000
+# 1100##1#0###00#0###0#1110#0100#0##
+# 1110###10000#0#00###0#110#01##1000
+# 11###011#0#0#000#####1#100#1001000
+# #000#0#0#0000####0#00#1001#0000##0
+# 1#101001####00000010001100#1#000##
+# 1100###1##0#0000##0#1#1##0001##0#0
+# #00#1#00##0#0#000##00#0#0##0000#00
+# 11#01#1#0#0##0#00###1011#00100#00#
+# #0#00##00#0##000#00#10##00010##000
+==Algorithm parameters==
+After experiments I end up with following parameters:
+# algorithm.wildcard_probability = .8
+# algorithm.exploration_probability = .1
+# algorithm.discount_factor = .05
+# algorithm.do_ga_subsumption = True
+# algorithm.do_action_set_subsumption = True
+# algorithm.max_population_size = 300
+# algorithm.exploration_probability = .1
+# algorithm.ga_threshold = 1
+# algorithm.crossover_probability = .5
+#algorithm.do_action_set_subsumption = True
+#algorithm.do_ga_subsumption = False
+==Code==
+CODE HERE [[Media:Projekt.7z]]
+==Conclusion==
+LCS provide interesting approach to solving machine learning problem. The main advantage of this approach is extremely straightforward representation of rules from rule population on comparison of often used neural network or random forrest.

Difference between revisions of "LCS in context of ML- Martin Mihál"

Latest revision as of 23:47, 29 May 2017

Contents

Goal of project

Overview

Preparing of data

LCS

Data & Results

Algorithm parameters

Code

Conclusion

Navigation menu

Views

Personal tools

Navigation

Search

Tools