
Full Paper Proc. of Int. Conf. on Advances in Computer Engineering 2012

An Investigation of RSS News Feeds Categorisation Processes: A Case Study of Online Newspapers

Thakerng Wongsirichot, Paweena Jantasara, and Suchada Srisuwan
Information and Communication Technology Programme, Faculty of Science, Prince of Songkla University, Thailand
Email: {thakerng.w@gmail.com, susie_potter@hotmail.com, weena_jj@hotmail.com}

Abstract— Really Simple Syndication (RSS) is used as a one-way communication channel from one sender to many receivers. The Thai online newspaper industry has employed RSS for many years. With the growth of RSS news feeds in online newspapers, information overload is hard to avoid. Methods for categorising news into groups are therefore important for overcoming the information overload problem. We present a preliminary investigation of categorisation processes in the Thai newspaper industry. Experimental tests were performed and produced acceptable results. The results are expressed with two measures from Information Retrieval theory, the cosine similarity and the Euclidean distance. Limitations, further studies, and applications are also discussed.

Index Terms—RSS, News Feeds, Text Categorisation, Information Retrieval

I. INTRODUCTION

Really Simple Syndication (RSS) is a common family of web feed formats used to publish messages instantly over the Internet. RSS is widely used as a message vehicle in industries such as online newspapers, education, and medicine. Fundamentally, an RSS document is written in XML. Because of the frequency and volume of RSS documents, especially in online newspapers, a very large amount of data is produced, which may cause “information overload”. Online newspapers deliver messages continuously. Usually, online newspaper companies predetermine the interval between feeds. For example, normal news may be delivered every hour, while breaking news is delivered more frequently, such as every 30 minutes. However, the criteria are set by the companies according to their internal organisational policies. [1]

Given the very large amount of information, a categorisation process is useful for grouping similar topics or subjects together. Text retrieval is a sub-field of Information Retrieval that assists in identifying and categorising words into predefined groups. It measures similarity by computing a distance between two words. [2] A number of well-known edit distances can be employed in our research: the Levenshtein edit distance, the Hamming distance, the episode distance, and the longest common subsequence distance. These edit distances are computed and then transformed into similarity measurements.

II. THEORETICAL BACKGROUND

In general, RSS news feeds include only significant words or important short phrases in their messages. Excessive use of insignificant words or short phrases may exceed the maximum size allowed for an RSS news feed. Thus, a set of keywords and vital phrases must be selected and included in the RSS news feed. In particular, some online newspapers’ websites provide only a single line to display a short message from the RSS headline feed. Therefore, the size of the message should be carefully monitored. An example of an online newspaper or news agency that provides an RSS feed source is the British Broadcasting Corporation (BBC), as shown in Figure 1. [3]

Figure 1. The British Broadcasting Corporation (BBC) web site
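Because an RSS document is XML, the headline text of each news item can be read with a standard XML parser. The following Java sketch is illustrative only and is not the authors' code: it assumes an RSS 2.0 layout in which each news item is an item element containing a title element, and the class name RssHeadlines and the method titles are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: extracts headline text from an RSS 2.0 document.
final class RssHeadlines {

    /** Returns the trimmed text of the first title element of every item element. */
    static List<String> titles(String rssXml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(rssXml)));

            List<String> result = new ArrayList<>();
            // Only item elements are visited, so the channel's own title is skipped.
            NodeList items = doc.getElementsByTagName("item");
            for (int i = 0; i < items.getLength(); i++) {
                Element item = (Element) items.item(i);
                NodeList titleNodes = item.getElementsByTagName("title");
                if (titleNodes.getLength() > 0) {
                    result.add(titleNodes.item(0).getTextContent().trim());
                }
            }
            return result;
        } catch (Exception e) {
            // Malformed XML or parser configuration problems end up here.
            throw new IllegalArgumentException("Not a readable RSS document", e);
        }
    }
}
```

Each returned headline is a short message of the kind described above and could then be passed on to word segmentation.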

In English RSS news feeds, the selected words in a message still carry some stopwords and symbols, which can be omitted during word segmentation by using Information Retrieval techniques. [4] Thai words and phrases, on the other hand, are commonly written or typed without symbols, such as periods, that mark where words and sentences stop. [5] This makes word segmentation considerably more difficult.

To compare words, a similarity measurement technique must be selected. In our observations and investigations, two similarity measures, the cosine similarity and the Euclidean distance, have been employed. The cosine similarity measures the similarity between two words represented as vectors. Its value is 1 if the two words are equal, that is, if the angle between the two vectors is zero. Otherwise, it is less than 1 and varies with the angle. [6] In Information Retrieval, the cosine similarity must not be negative and the angle must not be greater than 90 degrees. The cosine similarity is computed with the following normalised formula.

$$\cos(\theta) = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} (A_i)^2}\,\sqrt{\sum_{i=1}^{n} (B_i)^2}}$$

© 2012 ACEEE DOI: 02.ACE.2012.03.19
In the cosine similarity formula, $A_i$ is an attribute vector of document A and $B_i$ is an attribute vector of document B.

Another widely used similarity measure is derived from the Euclidean distance. The Euclidean distance is originally a measure of the distance between two points, p and q, in N dimensions. Information Retrieval adapts the Euclidean distance to measure the difference between two documents. Over N dimensions, the Euclidean distance is computed with the following formula. [7]

$$d(p, q) = \sqrt{\sum_{i=1}^{N} (q_i - p_i)^2}$$

where $p_i$ is a representation of document A and $q_i$ is a representation of document B.

III. INFORMATION PROCESSING AND ALGORITHMS

Information processing theory describes the process of transforming data into information. [8] Our investigation process also conforms to this theory, as represented in Figure 2.

Figure 2. Input – Process – Output of the program

A. The Overall Mechanisms of the System

• Input: A number of well-known Thai RSS news feeds are selected. The news feeds are fed periodically into our Java-based application. The news feeds are formatted in XML.
• Process: The word segmentation and selection processes include methods for dividing Thai news feeds into words. The method ThaiBreak(String input): String is used for Thai word segmentation. However, some words are broken incorrectly, which may affect the final segmentation and selection results. After word segmentation and selection are complete, the entire set of selected words is compared with our word corpus. The word corpus contains words that do not affect the meaning of sentences, such as adjectives and adverbs. Selected words that are identical to words in the corpus are excluded.
• Output: The program presents pairs of RSS news feeds ordered by the similarity measures.

B. Word Segmentation and Selection

The process of word segmentation and selection in our study contains two sub-procedures. First, the procedure divides an RSS document into a set of selected words. Second, the set of selected words is compared with the word corpus.

Procedure docClassify
Description: Divides an RSS document into a set of selected words.
Input
    RSS_doc.xml: An RSS document
    Word_corpus.xml: An XML document containing unnecessary words
Output
    KeywordSet
Methods
    Read RSS_doc.xml
    Convert RSS_doc.xml to String Str_RSS_doc
    Call ThaiBreak(String Str_RSS_doc): Cls_Str_RSS
    For every i in Cls_Str_RSS
        If (i+1 = NULL)
            KeywordSet = e
        Else
            e = e + i
        End If
    End for
    Return KeywordSet

IV. EXPERIMENTAL DESIGN AND RESULTS

The experimental tests were performed over various news sources and news types, including crime and entertainment. A set of ten news feeds was selected as a sample for demonstration purposes. The news feeds were collected over a single day and put together regardless of news type and news source. For the purpose of the investigation, some of the news feeds convey the same meaning in different words. Specifically, the pairs of news feeds with similar meanings are A and B, C and D, E and F, and G and H. The Euclidean distance was used in the first experiment, and its results are as follows.

Figure 3. Euclidean distance from the experimental test
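The two measures defined in Section II can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that each news feed has already been reduced to a list of keywords and that the attribute vector of a feed holds its keyword counts over the combined vocabulary of the two feeds being compared. The class name FeedSimilarity and all method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Hypothetical helper: compares two keyword lists with the measures from Section II.
final class FeedSimilarity {

    /** Sorted list of the distinct keywords that occur in either feed. */
    static List<String> vocabulary(List<String> a, List<String> b) {
        TreeSet<String> terms = new TreeSet<>(a);
        terms.addAll(b);
        return new ArrayList<>(terms);
    }

    /** Assumed vector representation: how often each vocabulary term occurs. */
    static double[] termCounts(List<String> keywords, List<String> vocabulary) {
        double[] counts = new double[vocabulary.size()];
        for (String word : keywords) {
            int index = vocabulary.indexOf(word);
            if (index >= 0) {
                counts[index]++;
            }
        }
        return counts;
    }

    /** d(p, q) = sqrt(sum over i of (q_i - p_i)^2). */
    static double euclidean(double[] p, double[] q) {
        requireSameLength(p, q);
        double sum = 0.0;
        for (int i = 0; i < p.length; i++) {
            double diff = q[i] - p[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** cos(theta) = sum(A_i * B_i) / (sqrt(sum(A_i^2)) * sqrt(sum(B_i^2))). */
    static double cosine(double[] a, double[] b) {
        requireSameLength(a, b);
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // an empty feed is treated as similar to nothing
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    private static void requireSameLength(double[] x, double[] y) {
        if (x.length != y.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
    }
}
```

Calling euclidean and cosine for every pair among the ten sample feeds would produce one matrix per measure. Because keyword counts are never negative, the cosine value in this sketch always lies between 0 and 1.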



In particular, the columns of interest are A, C, E and G, from which the similarity of the set of news feeds can be read. Figure 3 can be simplified by using the Reversed Euclidean distance (R.E.).

$$\mathrm{R.E.}(p_i, q_i) = \frac{1}{d(p_i, q_i)}$$

where $p_i$ is a representation of document A and $q_i$ is a representation of document B. Because R.E. is the reciprocal of the distance, a larger value indicates a more similar pair. The reversed values of those columns are presented in the following table.

TABLE I. REVERSED EUCLIDEAN DISTANCE FROM THE EXPERIMENTAL TEST

The maximum value in column A is 0.088 (B), in column C is 0.141 (D), in column E is 0.218 (F), and in column G is 0.108 (H). In each case the maximum falls on the feed that was chosen as having a similar meaning, which shows that the Euclidean distance is one of the preferred measures in this case study.

The second measure investigated is the cosine similarity. The same ten news feeds are used as samples. The results are shown in Figure 4.

Figure 4. Cosine similarity from the experimental test

The cosine similarity measure is a powerful distinguisher for our research study. The maximum values of cosine similarity, excluding 1.00, in columns A, C, E and G are 0.53 (B), 0.80 (C), 0.92 (E) and 0.31 (H), respectively. We can partially conclude that both measures are relatively effective. However, a number of uncontrollable factors remain, such as the words in the corpus, the word segmentation method, and the complexity of messages in news feeds. Additionally, in real-life situations, news feeds from different sources may not be published at the same time. This may cause another complicated problem, namely time variation.

V. LIMITATION AND FUTURE RESEARCH

Even though the findings of our case study are acceptable, we are aware of some limitations and problems. Future research will be conducted to overcome them. The following are some of the limitations and directions for future research.

• News Feeds Sources: A wider variety of news feed sources should be included in further studies. Currently, we employ only five sources as a starting point. This leads to some bias; for example, one source may take news from another source and rewrite it.
• Thai words segmentation: A proper Thai word segmentation procedure must be developed rather than using the default method from the Java package. There is currently a considerable amount of research on Thai word segmentation.
• Other uncontrollable factors: Other uncontrollable factors, such as the words in the corpus and the complexity of messages in news feeds, should be at least partially controlled. Otherwise, the quality of categorisation may decrease.
• News Traceability: News is an emerging event in the real world. News does not stop after we read it; it expands within its context. A technique to trace the news that people are interested in is possible and would be useful.
• Time variation: Because many news feed sources can be included, each source may publish its news at different time intervals. Therefore, news feeds from different sources may overlap.
• Online social networking services: The emergence of social networking services may lead to interesting research on news feeds. For example, Twitter.com [9] has recently become one of the popular social networking services. News is published by individuals and commercial organisations. Some people repeat (retweet) the same news to different people, which may lead to some forms of the information overload problem.

CONCLUSION

The investigation of RSS news feed categorisation processes is at a preliminary stage. We investigated different similarity measures in order to categorise news feeds from various sources, and reached some early conclusions. Thai RSS news feeds present some difficulties in word segmentation. The two measures, Euclidean distance and cosine similarity, achieved acceptable outputs in classifying similarities between different RSS news feeds. Further studies must target uncontrollable factors, categorisation methods, and news traceability.



ACKNOWLEDGMENT

This research is fully supported by the Information and Communication Technology (ICT) Programme, Faculty of Science, Prince of Songkla University, Thailand. The authors would like to thank the lecturers, staff and students at ICT for their helpful support.

REFERENCES

[1] MediaThink, “RSS: The Next Big Thing On Line,” MediaThink White Paper, July 2004, pp. 1-7.
[2] G. Navarro, “A guided tour to approximate string matching,” ACM Computing Surveys, vol. 33, March 2001, pp. 31–88.
[3] British Broadcasting Corporation [Online]. Available: http://www.bbc.co.uk
[4] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Addison-Wesley, 1999.
[5] C. Tanprasert and S. Sae-Tang, “Thai type style recognition,” Proc. of the 1999 IEEE International Symposium on Circuits and Systems, May 1999, pp. 336–339.
[6] P. N. Tan, M. Steinbach and V. Kumar, Introduction to Data Mining, Addison-Wesley, pp. 500-510.
[7] D. Bailey, An efficient euclidean distance transform, Springer Berlin, pp. 394-408.
[8] P. Edwards, Systems Analysis & Design, McGraw-Hill International Editions, 1993.
[9] Twitter.com [Online]. Available: http://www.twitter.com
