
International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR) ISSN 2249-6831 Vol. 3, Issue 3, Aug 2013, 209-216 © TJPRC Pvt. Ltd.

A NOVEL RECOMMENDATION SYSTEM USING ROUGH SET CLUSTERING AND CLOSED SEQUENTIAL PATTERN MINING

MITCHELL D’SILVA1 & DEEPALI VORA2

1 M.E. Student, Information Technology, Vidyalankar Institute of Technology, Mumbai, Maharashtra, India
2 Information Technology, Vidyalankar Institute of Technology, Mumbai, Maharashtra, India

ABSTRACT

The World Wide Web is a massive source of information that is widely used today because useful, dynamically changing information is constantly available. However, users are often overwhelmed by the large number of web pages and find it difficult to locate information relevant to their interests. This paper presents a novel recommendation system that provides navigation recommendations to users. The proposed model integrates preprocessing, rough set clustering using upper approximation, sequential pattern mining using the PrefixSpan technique, and closed sequential pattern mining. Rough set clustering creates clusters of transactions that have similar browsing patterns. Mining such similar transactions lets PrefixSpan generate sequential patterns more efficiently. Closed sequential pattern mining uses a post-pruning strategy to generate a compact set of closed sequential patterns from the set of sequential patterns generated by PrefixSpan. The combination of these techniques helps to provide more efficient and accurate recommendations to users.

KEYWORDS: Preprocessing, Rough Set Clustering, Upper Approximation, Sequential Pattern Mining Using PrefixSpan, Closed Sequential Pattern Mining

INTRODUCTION

The web has become an important resource for people around the world to search for information and to communicate. It has gained importance because of its huge and dynamically changing content, better presentation of material, emerging e-commerce applications, and discussion facilities such as blogs, forums and chats. However, users find it difficult to locate material relevant to their interests in this vast source. Users follow the links in a website that they think are relevant until they find the desired information in one or more pages. This increases both the browsing time and the network traffic.

These problems can be addressed by providing useful navigation recommendations to users based on the browsing patterns of previous users. The web browsing patterns of users are stored in web logs. Web usage mining is then used to analyse the web log files and extract useful patterns. Using various web usage mining techniques, the next page a user is likely to visit can be predicted. This makes it easier for the user to browse the website and find relevant content quickly. It also reduces network latency by pre-fetching the recommended web pages, and it avoids visits to unnecessary pages.

This paper discusses an intelligent recommendation system implemented using rough set clustering with upper approximation, sequential pattern mining using PrefixSpan, and closed sequential pattern mining. These techniques are applied to the web server log data. The web log data is first preprocessed to filter out unwanted and ambiguous data and to organize the relevant data before any of the data mining techniques are applied. Rough sets are then used to find relationships within imprecise data, eliminate irrelevant attributes, and discover the relationships between objects and attributes so that knowledge discovery can be done efficiently. Rough set clustering forms clusters of web sessions that have similar browsing patterns. In rough clustering, an element can belong to more than one cluster. The upper approximation includes elements that possibly belong to a given concept. The rough clusters are repeatedly upper approximated until the result is a set of mutually disjoint equivalence classes.

The sequential pattern mining technique PrefixSpan is then applied to the individual clusters. PrefixSpan determines all frequent subsequences, or sequential patterns, that are present in the sequence database consisting of web sessions with similar browsing patterns. Sequential patterns whose support exceeds the specified minimum support are selected and a projected database is created for each of them; this is repeated until no sequence with sufficient support remains. The number of patterns generated by PrefixSpan can grow exponentially, which becomes a problem when the database contains long frequent sequential patterns. Closed sequential pattern mining is therefore used to reduce the number of patterns generated by PrefixSpan through a post-pruning strategy. It eliminates every pattern in the set that has a super-sequence in the set with the same support. The final output thus consists of all the patterns that have no super-sequence with the same occurrence frequency. The closed sequential patterns are stored in the database, and recommendations are provided by matching the user’s current web browsing sequence with the stored patterns.

BACKGROUND OF STUDY

Several techniques are used to improve the effectiveness and efficiency of recommendation systems. Many papers combine clustering with association rule mining. Upper approximation based rough set clustering combined with dynamic kth-order association rule mining using Apriori is proposed in [1] for providing navigation recommendations. Rough set clustering forms clusters of transactions that have similar browsing patterns. Association rule mining using the Apriori algorithm is then applied to each cluster for personalization. The authors showed experimentally that association rule mining on clusters provides much better results than on non-clustered data. However, the Apriori algorithm is not suitable for large databases or when the sequential patterns to be mined are numerous and long. It generates too many rules, and there is no guarantee that the generated rules are relevant [1].

A collaborative filtering technique using a k-nearest-neighbor strategy is proposed in [2]; it consumes a lot of time in dynamically finding the k nearest neighbors. Moreover, it uses association rule mining for personalization with multiple support and confidence levels, which are complex to fix.

A pairwise-nearest-neighbor (PNN) based clustering technique combined with Markov model based sequential pattern mining is proposed in [3]. The PNN approach is time consuming because it merges every pair of clusters and updates the distance values after every merge. To reduce the number of distance updates, only the first k neighbors are considered. A major advantage of PNN is that every object is a member of only one cluster, which improves recommendation accuracy. The kth-order Markov model is then applied to each cluster. This work showed that recommendation accuracy improves when the data is clustered before the Markov model is applied.

Another paper [4] integrated rough set clustering with a Markov model. This approach suffers from a lack of prediction accuracy because in rough set clustering an object can belong to more than one cluster, which reduces cluster tightness. The kth-order Markov model is then used for sequential mining, which reduces the prediction accuracy if coverage is low.

A recommendation system that integrates clustering, association rule mining and a Markov model is proposed in [5]. It uses K-means clustering to form clusters of similar browsing patterns. Association rule mining is applied to each cluster to generate frequent patterns. Higher-order Markov models achieve higher accuracy because they use a longer browsing history, but they have higher state space complexity. Experimental results showed that this integration provides better prediction accuracy than using each technique individually [5].

A modified Markov model is proposed in [6] that attempts to alleviate the scalability issue in the number of paths. Experiments showed that this approach improves the prediction time without compromising the prediction accuracy.

A sequential pattern mining technique called PrefixSpan, which is efficient for mining sequential patterns in large databases, is proposed in [7]. PrefixSpan mines the complete set of frequent patterns in a sequence database, and these patterns can be provided as recommendations. It examines only the prefix subsequences and projects only their corresponding postfix subsequences into projected databases [7]. Moreover, it reduces the size of the projected databases, leading to more efficient processing. However, the PrefixSpan algorithm generates a large number of patterns, and establishing the minimum support is also a difficult task. These drawbacks are addressed by the closed sequential pattern mining technique proposed in [8]. This technique reduces the number of sequential patterns generated by PrefixSpan by removing every pattern whose super-sequence exists in the set with the same support as the pattern being removed.

PROPOSED RECOMMENDATION SYSTEM

The proposed recommendation system is presented in Figure 1. It involves preprocessing, rough set clustering using upper approximation, sequential pattern mining using the PrefixSpan technique, and closed sequential pattern mining.

Figure 1: Proposed Web Recommendation System

The web server access log file is used as input by the proposed system. The log data is preprocessed in three stages, namely data cleansing, user identification and session identification, thereby filtering all irrelevant entries from the log data. The relevant pages are then passed as input to rough set clustering using the upper approximation method, which forms clusters of sessions that have similar browsing patterns. The sequential pattern mining technique PrefixSpan is then applied to each cluster to mine all frequent sequential patterns in the cluster. Closed sequential pattern mining is then used to remove the redundant patterns generated by PrefixSpan. The resulting patterns are stored in the database, and recommendations are given to the user by matching the user’s current browsing sequence with the patterns in the database. The following sections describe each block in detail.



Web Server Access Log File

The input data for the proposed recommendation system is the web server access log of the website. All the pages visited by the user are recorded in the log with a unique session id for each user. The log records the IP address, user-id, timestamp, method, URL, protocol, status code, size, referrer and agent.

Pre-Processing

Preprocessing is used to extract useful data from the raw log data and arrange it in a form suitable for pattern discovery. The original log file contains a large number of irrelevant entries because, when an HTML document is downloaded, several graphic files and scripts are downloaded along with it. In addition, entries produced by web spiders and web crawlers, which run continuously, are irrelevant to the mining task. Thus the original web log files need to be cleaned, analyzed and transformed into a suitable standard format for mining. Preprocessing includes three stages, namely data cleaning, user identification and session identification [10], as shown in Figure 2.
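As an illustration of the input described above, the following minimal sketch parses one access log entry into the fields listed for the log. It assumes the Apache "combined" log layout, which the paper does not specify; the names LOG_PATTERN and parse_log_line and the sample entry are hypothetical and introduced only for this example.

```python
import re
from datetime import datetime

# Assumption: entries follow the Apache "combined" log layout. The paper does
# not name the exact format, so this pattern is illustrative only.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_log_line(line):
    """Return a dict of the fields in one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    entry = match.groupdict()
    entry["status"] = int(entry["status"])
    entry["size"] = 0 if entry["size"] == "-" else int(entry["size"])
    entry["time"] = datetime.strptime(entry["time"], "%d/%b/%Y:%H:%M:%S %z")
    return entry

# Hypothetical sample entry, not taken from a real log.
sample = ('192.0.2.10 - alice [13/Jun/2013:05:10:45 -0500] '
          '"GET /A.html HTTP/1.1" 200 2326 '
          '"http://example.com/index.html" "Mozilla/5.0"')
entry = parse_log_line(sample)
print(entry["ip"], entry["url"], entry["status"])
```

Returning None for lines that do not match keeps malformed entries out of the later stages without stopping the scan of the log.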

Figure 2: Stages in Preprocessing

Data Cleansing

The raw web log data is cleaned to remove all irrelevant entries. HTTP is a stateless protocol, and the browser issues a separate request for every file that makes up a page. When the user requests a web page, the graphics and scripts it uses are downloaded automatically along with it. However, only the entry for the requested web page is relevant and should be retained in the log file for web usage mining, because the graphics and scripts were not requested by the user. Thus preprocessing should remove the following irrelevant entries:

All log entries with filename extensions such as .jpeg, .gif, .bmp, .jpg, .JPEG, .GIF, .JPG. For example: 123.456.78.9 – [13/Jun/2013:05:10:45 -0500] “GET advertisement/pics/ad_logo.gif HTTP/1.1” 200 72 A.html Mozilla/3.04 (Win95, I)

Records with failed HTTP status codes. Entries having status codes above 299 or below 200 should be eliminated. For example: 123.456.78.9 – [13/Jun/2013:09:10:55 -0500] “GET A.html HTTP/1.1” 404 350 Mozilla/3.04 (Win95, I)

Log entries that are generated by robots. For example: 123.456.78.9 – [13/Jun/2013:07:15:45 -0500] “GET /dw/0,1855,2872,00.html HTTP/1.1” 200 32100 “-” “Mozilla/5.0 (compatible; Googlebot/2.1; http://www.google.com/bot.html)”

User Identification

This stage identifies unique users from the web server access log. Different user-ids indicate different users. In the absence of a user-id, the IP address is used to identify users: a new IP address indicates a new user. If the IP addresses are the same, different browsers or operating systems, as reported in the user agent field, indicate different users. If the IP address, browser and operating system are all the same, the referrer information is taken into account. If the referrer is set, the referring page must already have been visited by the user; that is, if the requested page is not directly reachable by a hyperlink from a page the user has visited, a new user with the same IP address is assumed.

Session Identification

A session is the sequence of pages visited, or activities performed, by a single user within a predefined period of time. The purpose of session identification is to divide the pages accessed by each user into individual sessions. This is achieved with a timeout. If the time between two consecutive page accesses exceeds a certain limit, the session expires and it is assumed that the user is starting a new session. Most applications use a default timeout period of 30 minutes.

Rough Set Clustering Using Upper Approximation

A rough set is defined by a pair of sets that give the lower approximation and the upper approximation of the original set [1]. The lower approximation is the union of all the equivalence classes contained in the target set. It is the complete collection of elements that definitely belong to the target set. The upper approximation is the union of all the equivalence classes whose intersection with the target set is not empty. It consists of the elements that possibly belong to the target set. Because the upper approximation successively adds data elements to a cluster, the method is termed upper approximation based rough agglomerative clustering.
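The lower and upper approximations defined above can be illustrated with a small sketch. The function names and the toy equivalence classes of session ids are assumptions made for this example and do not come from the paper.

```python
def lower_approximation(equivalence_classes, target):
    """Union of the equivalence classes that lie entirely inside the target set."""
    result = set()
    for cls in equivalence_classes:
        if cls <= target:
            result |= cls
    return result

def upper_approximation(equivalence_classes, target):
    """Union of the equivalence classes that share at least one element with the target set."""
    result = set()
    for cls in equivalence_classes:
        if cls & target:
            result |= cls
    return result

# Hypothetical partition of five sessions into equivalence classes.
classes = [{"s1", "s2"}, {"s3"}, {"s4", "s5"}]
target = {"s1", "s2", "s4"}

lower = lower_approximation(classes, target)  # sessions that definitely belong
upper = upper_approximation(classes, target)  # sessions that possibly belong
print(sorted(lower), sorted(upper))
```

In this toy case s5 is not in the target set but enters the upper approximation because it shares an equivalence class with s4, which is how the upper approximation adds elements to a cluster.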
Let U be the universe of discourse. A binary relation R between any two transactions x and y is defined such that xRy = 1 if the similarity of the two transactions is greater than or equal to the predefined threshold value. Transactions whose similarity value meets the predefined threshold are called similar learning patterns. Rough set clustering is applied to the preprocessed data to form clusters of transactions that have similar learning patterns. The steps for rough set clustering are as follows:

Construct a Similarity Matrix

Given two transactions x and y, the measure of similarity between x and y is given by the Jaccard coefficient as

sim(x, y) = |x ∩ y| / |x ∪ y|

where sim(x, y) ∈ [0, 1]. sim(x, y) = 1 when the two transactions x and y are exactly identical, and sim(x, y) = 0 when the two transactions x and y have no common items [4].

Calculate the Similarity Class for Each Transaction

The similarity class of y, denoted by R(y), is the set of transactions that are similar to y. It is given by R(y) = {x ∈ T | xRy}, i.e. all the transactions whose similarity value sim(x, y) ≥ threshold are taken as members of the similarity class [4]. Different threshold values (th) give different similarity classes. For a fixed threshold th ∈ [0, 1], a transaction from a given similarity class may be similar to an object of another similarity class.
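The two steps above can be sketched as follows. The sketch treats each transaction as the set of pages it contains; the function names, the toy transactions and the threshold of 0.5 are assumptions chosen for illustration.

```python
def jaccard(x, y):
    """Jaccard coefficient of two transactions viewed as sets of pages."""
    x, y = set(x), set(y)
    union = x | y
    if not union:
        return 1.0  # assumption: two empty transactions count as identical
    return len(x & y) / len(union)

def similarity_classes(transactions, threshold):
    """Return the similarity matrix and, for each index y, the class R(y)."""
    n = len(transactions)
    matrix = [[jaccard(transactions[i], transactions[j]) for j in range(n)]
              for i in range(n)]
    # R(y) holds every transaction x whose similarity to y meets the threshold.
    classes = {y: {x for x in range(n) if matrix[x][y] >= threshold}
               for y in range(n)}
    return matrix, classes

# Hypothetical transactions, each a list of visited pages.
transactions = [
    ["A", "B", "C"],
    ["A", "B", "D"],
    ["E", "F"],
    ["B", "D"],
]
matrix, classes = similarity_classes(transactions, threshold=0.5)
for y, members in classes.items():
    print(y, sorted(members))
```

Here transaction 1 appears in the similarity classes of transactions 0, 1 and 3, which shows how one transaction can be similar to members of more than one class.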



Compute Upper Approximation

Upper approximation is performed by considering objects in the k neighborhood. The upper approximation is given by

Aupper(X) = {x ∈ U | [x]R ∩ X ≠ ∅}

where
U is the universal set,
R is an equivalence relation on U,
[x]R denotes the equivalence class of R containing x, and
X is characterized in A by a pair of sets, its lower and upper approximation in A.

All the transactions whose intersection with the class under consideration is not empty are combined. This step is repeated until the results of two successive iterations are the same [1].

Sequential Pattern Mining Using PrefixSpan

PrefixSpan is a pattern-growth algorithm used for mining the complete set of sequential web access patterns from a sequence database. Its main idea is to project only on frequent prefixes rather than projecting the sequence database by considering all possible occurrences of frequent subsequences. An α-projected sequence database is the set of subsequences in the sequence database that are suffixes of the sequences having the prefix α. In every step, the algorithm looks for frequent sequences with prefix α in the corresponding projected database [8].

Given a sequence α = <e1, e2, …, en>, a sequence β = <e′1, e′2, …, e′m> (m ≤ n) is called a prefix of α if and only if (1) e′i = ei for i ≤ m-1, (2) e′m ⊆ em, and (3) all the items in (em - e′m) are alphabetically after those in e′m [8].

Given sequences α and β such that β is a subsequence of α, i.e., β ⊆ α, a subsequence α′ of sequence α is called a projection of α w.r.t. prefix β if and only if (1) α′ has prefix β and (2) there exists no proper super-sequence α′′ of α′ such that α′′ is a subsequence of α and also has prefix β [8].

The PrefixSpan algorithm is applied to the individual clusters to generate sequential patterns. Each cluster is treated as a sequence database, and a minimum support value is predefined. The steps for PrefixSpan are as follows:

Find All Length-1 Sequential Patterns from Each Transaction

A cluster consists of several transactions, and each transaction is considered a sequence. Scan each cluster to find all frequent items in its transactions. Each of these frequent items is a length-1 sequential pattern. Count the number of occurrences of each length-1 sequential pattern in the cluster and represent the pattern and its support count as <pattern>: support count. For example: <a>: 5, <b>: 7. All the sequential patterns whose support count is less than the specified minimum support count are eliminated.

Construct Projected Database for Each Length-1 Sequential Pattern

The search space is then divided into one projected database per length-1 sequential pattern. If n length-1 sequential patterns meet the predefined minimum support count, there will be n projected databases. For example, the projected database for <a> consists of all the pages following <a> in each transaction. Pages preceding <a> are not included in the projected database. In the sequence <(cd)(ab)(ef)gh>, only the subsequence <(_b)(ef)gh> should be considered for mining sequential patterns having prefix <a>. If the sequence is <a(abc)(gh)e(bc)>, the subsequence to be considered is <(abc)(gh)e(bc)>.
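The first two steps can be sketched as follows, under the simplifying assumption that every element of a sequence is a single page (no itemsets such as (ab)), which is the usual case for web access sessions. The function names and the toy cluster are hypothetical.

```python
from collections import Counter

def frequent_items(sequences, min_support):
    """Count in how many sequences each page occurs and keep the frequent pages."""
    counts = Counter()
    for seq in sequences:
        counts.update(set(seq))  # a page counts once per sequence
    return {item: n for item, n in counts.items() if n >= min_support}

def project(sequences, item):
    """For every sequence containing item, keep the pages after its first occurrence."""
    projected = []
    for seq in sequences:
        if item in seq:
            suffix = seq[seq.index(item) + 1:]
            if suffix:  # empty suffixes cannot contribute further patterns
                projected.append(suffix)
    return projected

# Hypothetical cluster of four sessions.
cluster = [
    ["a", "b", "c"],
    ["a", "c", "d"],
    ["b", "a", "c"],
    ["d", "b"],
]
length1 = frequent_items(cluster, min_support=2)
projected_a = project(cluster, "a")
print(length1)
print(projected_a)
```

In the third session the page b precedes a, so it is left out of the projected database for <a>, matching the rule that preceding pages are not included.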



Find Subsets of Sequential Patterns from the Projected Databases

The subsets of sequential patterns are found by mining the projected databases. For each projected database, again find all length-1 sequential patterns and compute their support counts. All the patterns that meet the specified minimum support count are selected and appended to the prefix of that projected database, thereby generating length-2 sequential patterns. A projected database is then created for each newly formed length-2 sequential pattern, and the same process is repeated until no more frequent subsequences can be generated from that projected database.

Mining Closed Sequential Web Access Patterns

The number of frequent sequential patterns generated by PrefixSpan can be very large, which is difficult to deal with when the data is large. To reduce the number of patterns generated by PrefixSpan, closed sequential pattern mining is implemented. It mines closed sequential web access patterns from the complete set of sequential web access patterns (FS) using a post-pruning approach. The set of closed frequent sequential patterns is defined as CS = {α | α ∈ FS and ∄ β ∈ FS such that α ⊂ β and support(α) = support(β)} [8]. Each sequential web access pattern FSi in the pattern set FS is compared with the other patterns in the set, for instance FSj. The pattern FSi is removed from the pattern set FS if and only if the supports of FSi and FSj are the same and FSi is a proper subsequence of FSj [8]. The remaining closed sequential patterns are then stored in the database.

Generation of Recommendations for User Sequences

Recommendations are provided to the user only if the length of the user’s web access sequence S satisfies the thresholds minlen and maxlen. For example, if minlen is set to 5, the user receives recommendations only if the access sequence has length greater than or equal to 5. Similarly, if maxlen is 20, the user does not receive any recommendations if the length of the access sequence is greater than 20. The user’s access sequence is then compared with the sequential patterns in the database, and the postfixes that follow the access sequence in the matching patterns are provided as recommendations to the user.
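The last three stages above (recursive mining of projected databases, post-pruning to closed patterns, and matching a user sequence) can be sketched end to end as follows. This is a simplified illustration, not the authors' implementation: it assumes every sequence element is a single page, it matches a user sequence only against patterns that start with that sequence, and it ranks postfixes by support. These matching and ranking rules, the function names and the toy cluster are assumptions made for this example.

```python
from collections import Counter

def prefixspan(sequences, min_support, prefix=()):
    """Return {pattern: support} for all frequent patterns that extend prefix."""
    patterns = {}
    counts = Counter()
    for seq in sequences:
        counts.update(set(seq))  # a page counts once per sequence
    for item, support in counts.items():
        if support < min_support:
            continue
        pattern = prefix + (item,)
        patterns[pattern] = support
        # Projected database: pages after the first occurrence of item.
        projected = [seq[seq.index(item) + 1:] for seq in sequences if item in seq]
        patterns.update(prefixspan(projected, min_support, pattern))
    return patterns

def is_subsequence(small, big):
    """True if small can be obtained from big by deleting pages."""
    it = iter(big)
    return all(page in it for page in small)

def closed_patterns(patterns):
    """Post-pruning: drop a pattern if a longer super-sequence has the same support."""
    return {
        p: s for p, s in patterns.items()
        if not any(len(q) > len(p) and s == t and is_subsequence(p, q)
                   for q, t in patterns.items())
    }

def recommend(current, patterns, minlen=1, maxlen=20):
    """Return postfixes of stored patterns that begin with the current sequence."""
    if not minlen <= len(current) <= maxlen:
        return []
    current = tuple(current)
    matches = [(support, pattern[len(current):])
               for pattern, support in patterns.items()
               if len(pattern) > len(current) and pattern[:len(current)] == current]
    matches.sort(key=lambda m: (-m[0], m[1]))  # assumption: rank by support
    return [postfix for _, postfix in matches]

# Hypothetical cluster of sessions.
cluster = [["a", "b", "c"], ["a", "b", "c"], ["a", "c"], ["b", "c"]]
all_patterns = prefixspan(cluster, min_support=2)
closed = closed_patterns(all_patterns)
print(len(all_patterns), len(closed))
print(recommend(["a"], closed))
```

On this toy cluster the seven frequent patterns shrink to four closed patterns; for instance <a> is dropped because <a c> has the same support.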

CONCLUSIONS

This paper presents a novel recommendation system implemented by integrating preprocessing, rough set clustering using upper approximation, sequential pattern mining using the PrefixSpan algorithm, and closed sequential pattern mining. Using rough set clustering prior to sequential pattern mining is expected to improve the mining efficiency and to provide more accurate recommendations to users, since clustering groups together the transactions that have similar browsing patterns. In addition, closed sequential pattern mining is employed to reduce the large number of patterns generated by PrefixSpan, so that the set of sequential patterns remains manageable, especially when the database to be mined is large. Navigation recommendations are then provided to the user by matching the user’s current web access sequence with the closed sequential patterns stored in the database. Thus, the proposed recommendation system, which integrates several techniques, is expected to provide more efficient, accurate and effective recommendations to users than those provided by using each technique individually.



REFERENCES

1. Anitha and Dr. N. Krishnan. (January 2011). A Dynamic Web Mining Framework for E-learning Recommendations using Rough Sets and Association Rule Mining. International Journal of Computer Applications (0975 - 8887), Vol. 12, No. 11, pp. 36 – 41.

2. Bamshad Mobasher, Honghua Dai, Tao Luo and Miki Nakagawa. (November 9, 2001). Effective Personalization Based on Association Rule Discovery from Web Usage Data. 3rd ACM Workshop on Web Information and Data Management, Atlanta, Georgia, USA, ACM 0-12345-67-8.

3. Anitha. (October 2010). A New Web Usage Mining Approach for Next Page Access Prediction. International Journal of Computer Applications (0975 - 8887), Vol. 8, No. 11, pp. 7 – 10.

4. Siriporn Chimphlee, Naomie Salim, Mohd Salihin Bin Ngadiman, Witcha Chimphlee and Surat Srinoy. (2006). Rough Sets Clustering and Markov model for Web Access Prediction. Proceedings of the Postgraduate Annual Research Seminar, pp. 470 – 475.

5. Faten Khalil, Jiuyong Li and Hua Wang. (2008). Integrating Recommendation Models for Improved Web Page Prediction Accuracy. 31st Australasian Computer Science Conference in Research and Practice in Information Technology, Vol. 74.

6. Mamoun A. Awad and Issa Khalil. (August 2012). Prediction of User’s Web-Browsing Behavior: Application of Markov Model. IEEE Transactions on Systems, Man and Cybernetics, Vol. 42, No. 4.

7. Jian Pei, Jiawei Han, Behzad Mortazavi-Asl and Helen Pinto. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. Supported by Natural Sciences and Engineering Research Council of Canada, the Networks of Centres of Excellence of Canada and the Hewlett-Packard Lab, USA.

8. Utpala Niranjan, Dr. R.B.V. Subramanyam and Dr. V. Khanaa. (May 2010). An Efficient System Based on Closed Sequential Patterns for Web Recommendations. International Journal of Computer Science Issues (1694 - 0784), Vol. 7, Issue 3, No. 4.

9. Ms. Jyoti, Dr. A. K. Sharma and Dr. Amit Goel. (2009). A novel Approach for Clustering Web User Sessions using RST. International Journal on Computer Science and Engineering (0975 - 3397), Vol. 2(1), pp. 56 – 61.

10. V. Sathiyamoorthi and Dr. V. Murali Bhaskaran. (November 2009). Data Preparation Techniques for Web Usage Mining in World Wide Web – An approach. International Journal of Recent Trends in Engineering, Vol. 2, No. 4.

11. Doru Tanasa and Brigitte Trousse. (March / April 2004). Advanced Data Preprocessing for Intersites Web Usage Mining. IEEE Computer Society (1094 - 7167).

