
International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR) ISSN 2249-6831 Vol. 3, Issue 2, June 2013, 379-388 © TJPRC Pvt. Ltd.


1 Assistant Professor, Department of IT, DJSCOE, Vile Parle (West), Mumbai, Maharashtra, India

2 Principal, Institute of Computer Science, MET, Bandra (West), Mumbai, Maharashtra, India

ABSTRACT
With the continuing development of information technology, data sets in many domains are growing beyond the petascale. This makes it difficult to run document clustering algorithms at a central site and increases the computational requirements. Distributed computing is therefore explored for document clustering, giving rise to distributed document clustering. Here, we propose high performance document clustering using Hadoop and MapReduce. Hadoop is an open source software framework that supports data intensive distributed applications, and MapReduce is a parallel programming technique. In the proposed model, we also introduce a high level of semantics for better accuracy and quality of document clustering.

KEYWORDS: Document Clustering, Distributed Document Clustering, Parallel Document Clustering, Semantic Document Clustering, Hadoop, MapReduce

INTRODUCTION
The steady progress of computer hardware technology in the last few years has led to large supplies of powerful and affordable computers, data collection equipment, and storage media. Due to this progress, a huge number of databases and information repositories are available for transaction management, information retrieval, and data analysis. This has produced tremendous growth in the volume of text documents available on the internet, in digital libraries and repositories, news sources, company-wide intranets, and digitized personal information such as blog articles and emails. With the increase in the number of electronic documents, it is hard to organize, analyze, and present these documents efficiently by manual effort [1]. This has brought challenges for the effective and efficient automatic organization of text documents [2].

Document clustering, an unsupervised machine learning approach, is used for this purpose. It organizes documents into groups called clusters, where the documents in each cluster share some common properties according to a defined similarity measure [3]. There are various subcategories of document clustering, such as soft, hard, partitioning, and hierarchical, which are classified further [4]. Applications of document clustering include finding similar documents, organizing large document collections, duplicate content detection, news aggregation, and search optimization [5, 6, 7].

Various algorithms have been proposed for document clustering, starting from the work of Croft, which clustered document titles. Then k-means, hierarchical agglomerative clustering, and many variations of these algorithms were proposed over the years. Concepts such as frequent itemsets, fuzzy theory, neural networks, genetic algorithms, self-organizing maps, and non-negative matrix factorization have also been studied to increase the efficiency of document clustering. These algorithms are studied and compared in [8].

To increase quality and efficiency, semantics and natural language processing have also been applied to document clustering; this includes latent semantic indexing, frequent word phrases, WordNet, ontology, part-of-speech


Neepa Shah & Sunita Mahajan

tagging, sense disambiguation, machine learning, and many more. An exhaustive survey of these methods is given in [9]. Research is still ongoing to find more semantically oriented approaches that further increase the accuracy and quality of clusters.

These algorithmic and conceptual changes alone are not enough in the current scenario of large datasets, because the huge volume of documents produced by advances in technology has two key characteristics: (a) data is no longer maintained in central databases, and (b) the computers processing this data are no longer central supercomputers, but rather a huge network of computers of all scales. Centralized mining cannot scale to the magnitude of the data [10]. Storing and processing massive document collections in a distributed system is an alternative. In distributed computing, a problem is divided into many tasks, each of which is solved by one computer. However, issues such as task scheduling, fault tolerance, and inter-machine communication are very tricky for programmers with little experience of parallel and distributed systems [11].

To address these issues, scalable and easy-to-use tools for distributed processing are emerging. The most popular of these is Hadoop, an open source software framework for data intensive distributed applications. Hadoop is a highly scalable, fault-tolerant distributed system for data storage. This scalability and reliability is due to its file system, the Hadoop Distributed File System (HDFS). MapReduce is a parallel programming abstraction and runtime support for scalable data processing on Hadoop. Thus, Hadoop provides a reliable shared storage and analysis system: storage is provided by HDFS and analysis by MapReduce [12]. Here, we propose a model for high performance, semantically related document clustering using Hadoop and MapReduce.

Section 2 highlights related work in the areas of parallel and distributed document clustering. The use of Hadoop and MapReduce for document clustering is covered in Section 3. Section 4 gives our proposed work model. We conclude in Section 5.

RELATED WORK
In this section we give a brief overview of work done so far on increasing the performance and speedup of document clustering using parallel and distributed techniques.

Parallel Document Clustering
Here we highlight algorithms based on parallel processing techniques such as the message passing model, the multithreading model, and hardware based implementations such as the Graphics Processing Unit (GPU).

Hierarchical clustering algorithms are practically infeasible for large document collections due to their quadratic time complexity. Principal Direction Divisive Partitioning (PDDP), a divisive algorithm (another category of hierarchical methods), uses the concept of principal component analysis to reduce the dimensionality of the vector space. PDDP belongs to the family of SVD based algorithms. In [13], a parallel implementation of PDDP with the message passing model is proposed. Here, different parts of each data file are distributed to different processors, and improvements in execution space and time are observed. In [14], three parallel algorithms are proposed and tested over a large web document dataset: a parallel implementation of the PDDP algorithm, a k-means algorithm based on the message passing model, and a hybrid of parallel PDDP and parallel k-means. The hybrid algorithm showed similar or better parallel runtime than parallel k-means and parallel PDDP.

Semantics Based Distributed Document Clustering: Proposal


A new parallel algorithm for bisecting k-means on message-passing multiprocessor systems, with a prediction step to balance the workloads of multiple processors, is proposed in [15]. The experimental results show linear speedup of the algorithm with the number of processors and data points. It also scales better than parallel k-means with respect to the desired number of clusters.

In the past few years, accelerator based computing, where a portion of the computation is performed on special purpose hardware, has been gaining popularity. With the arrival of GPUs, accelerator based computation has become mainstream [16]. In [17], the use of GPUs to accelerate data-intensive document clustering algorithms is presented. The TF-IDF computation is also demonstrated on GPU devices through NVIDIA’s CUDA. Traditional algorithms are redesigned for a parallel CPU-GPU co-processing framework to execute effectively on the many-core GPU architecture. Experiments demonstrate up to 10 times speedup over a single-node CPU desktop. A functionally equivalent multi-threaded MPI application in C++ is developed for performance comparison. The GPU cluster implementation shows dramatic speedups over the C++ implementation, ranging from 30 times to more than 50 times.

A high performance method for parallelizing the agglomerative document clustering algorithm on a message-passing multiprocessor system is proposed in [18]. The proposed parallel algorithm proved better in terms of decreasing the clustering time, increasing the overall speedup, and achieving high quality. A multithreaded implementation of the parallel anchors hierarchy, a machine learning approach, using XMT-C idioms on the Cray XMT is explored in [19]. Running times and speedup are reported for different numbers of processors on the Cray XMT for 500,000 Wikipedia documents.
Distributed Document Clustering
In this section we highlight techniques based on distributed concepts, such as RACHET, Distributed Information Bottleneck (DIB), collaborative peer-to-peer clustering, and its variations.

Napster is a centralized search engine and Gnutella is a distributed search engine. The centralized approach has inherent scalability problems, and the Gnutella network suffers from a flooding problem. Therefore, an intelligent way of partitioning documents among the machines participating in a distributed network, to facilitate keyword search, is proposed in [20]. Probabilistic Latent Semantic Analysis (PLSA) with the Expectation-Maximization algorithm is used. The dataset is small because PLSA has high computational overhead.

RACHET is a distributed clustering technique suitable for very large, high dimensional, and horizontally distributed data. However, RACHET was evaluated on small datasets, and standard evaluation measures were not applied to calculate the quality of the clusters. In [21], the reliability and applicability of RACHET are tested on a standard dataset with widely known evaluation parameters. Here, local dendrograms are generated at each site using single link, average link, and complete link (agglomerative hierarchical clustering methods). These dendrograms are then pruned to obtain descriptive statistics that capture the major information, and merged at a single site to produce a global dendrogram. Distributed clustering shows a 30% degradation in performance due to distribution, but runs about 40% faster than the centralized version.

In [22], DIB, a distributed document clustering approach, is proposed. It is a two-stage agglomerative information bottleneck algorithm that generates local clusters. The high-dimensional document vector is reduced significantly in the first stage by finding word clusters. In the second stage, document clusters are obtained from these word clusters. DIB extracts



compact but interpretable local models from the document clusters generated at each site. The number of local clusters to be generated at each site is determined automatically. An efficient merging technique combines these local models into a global model. Experiments on medium scale real-world datasets demonstrate a solution of comparable quality to the distributed document clustering problem with minimized transfer cost.

Two methods for distributed document clustering are proposed in [10]. The first, collaborative P2P document clustering, treats nodes as collaborating peers in a peer-to-peer network, with the goal of improving the quality of individual local clustering solutions. The second, hierarchically-distributed P2P document clustering (HP2PC), produces one clustering solution across the whole network. It addresses the scalability of network size and the complexity of distributed clustering by modeling the distributed clustering problem as a hierarchy of node neighborhoods. For large networks, HP2PC is more appropriate due to its scalability. For smaller networks, either collaborative P2P or HP2PC can be used. The proposed algorithms offer a high degree of flexibility, scalability, and interpretability for large distributed document collections.

A distributed document clustering approach based on the peer-to-peer model is proposed for search engines in [23] for effective and enhanced searching. This model, Distributed Document Clustering for Search Engine (DDCSE), refines search results by filtering out irrelevant data. TF-IDF is used for data representation and initial cluster generation, with cosine similarity as the similarity measure. Clustering quality is tested over three algorithms, namely Hierarchical Agglomerative Clustering, Single Pass Clustering, and K-Nearest Neighbors, using Entropy and F-measure as evaluation measures. Significant improvement in quality is achieved.
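The TF-IDF representation and cosine similarity measure mentioned above can be illustrated with a minimal Python sketch. It is not the implementation used in [23]; the weighting variant (raw term frequency times log(N/df)), whitespace tokenization, and the function names tf_idf_vectors and cosine are assumptions made for illustration only.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return one sparse {term: weight} vector per document.

    Assumed weighting: raw term frequency * log(N / document frequency).
    """
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an all-zero vector is treated as dissimilar to everything
    return dot / (norm_u * norm_v)

if __name__ == "__main__":
    corpus = [
        "hadoop cluster storage",
        "hadoop cluster compute",
        "football match score",
    ]
    vecs = tf_idf_vectors(corpus)
    print(cosine(vecs[0], vecs[1]), cosine(vecs[0], vecs[2]))
```

In this toy corpus the first two documents share weighted terms and therefore score higher than the first and third, which share none. A clustering algorithm would use such pairwise scores to decide which documents belong together.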
Variations of HP2PC for horizontally partitioned data, namely Enhanced HP2PC (EHP2PC) and Semantic enabled EHP2PC (SEHP2PC), are proposed in [24] to address the issues of modularity, flexibility, and scalability in document clustering. The approach is based on a multilayer network of super nodes and peer nodes. These nodes first create local clusters using the k-means algorithm, and then global clusters are created. Optimization is performed using initial centroid estimation. A multidomain ontology covering the medical, finance, sports, computer, artificial intelligence, and web mining domains is used to add semantics for better accuracy. Experiments show that it achieves quality comparable to its centralized counterpart along with significant speedup. The clustering process is improved by term relationships obtained through semantic analysis; thus, SEHP2PC performed better than the other variants. The SEHP2PC framework is enhanced further using fuzzy document clustering (FSEHP2PC) in [25] for better clustering accuracy. Experiments comparing the performance, speedup, and accuracy of HP2PC, EHP2PC, SEHP2PC, and FSEHP2PC showed FSEHP2PC to be better than the other frameworks.

HADOOP AND MAPREDUCE FOR DOCUMENT CLUSTERING
Hadoop is an open source software framework that supports data intensive distributed applications. MapReduce is a software framework for processing large data sets in a distributed fashion over several machines. The core idea behind MapReduce is to map a data set into a collection of <key, value> pairs, and then reduce all pairs with the same key. The programmer writes two scripts: the map script and the reduce script. The framework splits the input into parts and passes each part to a different machine. Each machine runs the map script on the portion of data assigned to it. The map script maps this data to <key, value> pairs according to its specification. The generated <key, value> pairs are then shuffled, i.e. pairs with the same key are grouped and passed to a single machine, which runs the reduce script over them [26]. Thus, Map transforms a set of data into key-value pairs and Reduce aggregates the values for each key, typically into a single value. This process is shown in Figure 1 [27].



Figure 1: MapReduce Framework

Figure 2 [26] shows a word count example, where word occurrences are counted to obtain frequencies. The reduce script simply sums the values of the <key, value> pairs with the same key.
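The word count example can be mimicked on a single machine with the short Python sketch below. It only simulates the map, shuffle, and reduce phases in memory and does not use the Hadoop API; the function names and the sample input splits are hypothetical.

```python
from collections import defaultdict

def map_words(split):
    """Map step: emit a <word, 1> pair for every word in one input split."""
    for word in split.lower().split():
        yield word, 1

def shuffle(pairs):
    """Shuffle step: group all values that share the same key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(word, values):
    """Reduce step: sum the values collected for one key."""
    return word, sum(values)

def word_count(splits):
    # Each split stands in for the portion of input given to one machine.
    pairs = [pair for split in splits for pair in map_words(split)]
    return dict(reduce_counts(k, v) for k, v in shuffle(pairs).items())

if __name__ == "__main__":
    print(word_count(["deer bear river", "car car river", "deer car bear"]))
```

On a real cluster the framework performs the shuffle and distributes the map and reduce calls across machines; only the two user-supplied functions change from one application to another.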

Figure 2: MapReduce Process for WordCount

Below we give an overview of the Hadoop-MapReduce framework as applied to document clustering.

In [28], Cloud computing is highlighted as one of the latest computing paradigms, leading to computing as the 5th utility (after water, electricity, gas, and telephony). The differences between Cloud computing and two other well-known computing paradigms, Cluster computing and Grid computing, are also discussed. The Aneka enterprise Cloud technology (developed by the authors of [28]) is compared with representative industrial Cloud platforms such as Amazon EC2, Microsoft Azure, Google App Engine, and Sun Grid.

In [27], a MapReduce algorithm for computing pairwise document similarity in large document collections is proposed. On a collection of 900,000 newswire articles, the algorithm shows linear growth in running time and space in terms of the number of documents. This is achieved by removing high frequency terms. After these terms are removed, two separate MapReduce jobs are used for indexing and for pairwise similarity of documents.

Distributed document clustering based on MapReduce is implemented in [11]. For the document pre-processing stage, i.e. the calculation of TF-IDF weights, a new iterative algorithm is given on the MapReduce framework. Then k-means is run on the MapReduce framework for document clustering. The highest frequency terms (with a 95% cut) are ignored to speed up the algorithm on MapReduce. Experiments show linear growth in running time and space as the dataset size increases.
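The two-job structure described for pairwise document similarity (an indexing job followed by a pairwise similarity job, with high frequency terms removed) can be sketched in Python as follows. This is a single-machine illustration under stated assumptions, not the algorithm of Elsayed et al. as published: raw term frequencies serve as term weights, the cut-off is expressed as a hypothetical max_df parameter, and all function names are invented for this example.

```python
from collections import defaultdict
from itertools import combinations

def build_index(docs, max_df):
    """Job 1 (indexing): build term -> [(doc_id, term_frequency), ...].

    Terms that occur in more than max_df documents are dropped, which
    keeps the number of document pairs generated per term small.
    """
    postings = defaultdict(list)
    for doc_id, text in docs.items():
        counts = defaultdict(int)
        for term in text.lower().split():
            counts[term] += 1
        for term, tf in counts.items():
            postings[term].append((doc_id, tf))  # conceptually: emit <term, (doc, tf)>
    return {t: p for t, p in postings.items() if len(p) <= max_df}

def pairwise_similarity(index):
    """Job 2: each term adds tf_i * tf_j to every document pair sharing it."""
    scores = defaultdict(int)
    for plist in index.values():
        # Map: emit <(d1, d2), w1 * w2>; Reduce: sum contributions per pair.
        for (d1, w1), (d2, w2) in combinations(sorted(plist), 2):
            scores[(d1, d2)] += w1 * w2
    return dict(scores)

if __name__ == "__main__":
    docs = {"d1": "the a b b", "d2": "the b c", "d3": "the a c c"}
    print(pairwise_similarity(build_index(docs, max_df=2)))
```

Because a term that appears in n documents generates on the order of n squared pairs, discarding the most frequent terms is what keeps the second job tractable, which matches the motivation given for the frequency cut.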



Non-negative Matrix Factorization based feature extraction, with the extracted features used as meaningful labels for the clusters produced by the k-means algorithm, is proposed in [29]. For large scale document collections, a parallel implementation of k-means using the MapReduce framework on Hadoop is given.

To overcome the drawbacks of keyword based matching and increase the accuracy and quality of clustering, the Latent Semantic Indexing (LSI) approach has been explored by many researchers. In [30], to improve the speed of LSI, MapReduce based distributed LSI and k-means for document clustering on the Hadoop architecture are proposed. Standalone LSI and distributed k-means LSI using socket programming are used for comparison. The results show a great improvement in speed.

The Scientific Computing Cloud (SciCloud) [31] is demonstrated for solving resource hungry, computationally intensive scientific, mathematical, and academic problems. A comparison between the Hadoop MapReduce framework and the Twister framework is also given. Three problems, namely Conjugate Gradient (CG), k-medoid clustering (Partitioning Around Medoids (PAM) and Clustering Large Applications (CLARA)), and integer factoring, are tested on both frameworks. The experimental analysis shows that Hadoop MapReduce struggles with iterative problems but suits embarrassingly parallel problems very well, while Twister handles iterative problems much more efficiently.

R is used for statistical computing and graphics. It has recently been extended with the tm package for text mining tasks such as filters, transformations, and data export. With the arrival of Hadoop and MapReduce for processing large datasets using a parallel programming model, the authors of [32] propose the package tm.plugin.dc for integrating tm with Hadoop for distributed text mining. The performance of text mining tasks such as stemming, stopword removal, and document-term-matrix construction is compared with the Message Passing Interface (MPI), i.e. data parallelism using the R/MPI module. The R/MPI implementation is limited in dataset size by the main memory of the calling machine. Thus, Hadoop is preferred for text corpora containing documents with more text and needing more disk space.

In [33], the popular k-means algorithm is adapted to work in parallel using Hadoop and MapReduce. Document preprocessing is done in parallel using MapReduce, and then a parallel k-means clustering algorithm is applied on MapReduce and Hadoop. Experiments compare the time required by the stand-alone version against the parallel version for the same number of documents, and the parallel version is also evaluated for different numbers of nodes. The parallel version is reported to achieve high accuracy and efficiency.

The goal of the Apache Mahout [34] machine learning library is to build scalable machine learning libraries. It includes many machine learning and data mining algorithms for clustering, classification, collaborative filtering, and frequent pattern mining. Of these, four clustering and two classification codes are evaluated for performance on two cloud runtimes, Granules and Hadoop, in [35]. Experiments show that the Granules based implementation outperforms Hadoop.
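Several of the works surveyed in this section run k-means on MapReduce. A common way to express one k-means iteration as a map step and a reduce step is sketched below in Python. It is a single-machine illustration and not the code of any cited work; dense tuples stand in for document vectors, squared Euclidean distance is an assumed choice of distance, and the function names are hypothetical.

```python
from collections import defaultdict

def nearest(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )

def kmeans_map(points, centroids):
    """Map: emit <cluster_id, point> for each document vector."""
    for point in points:
        yield nearest(point, centroids), point

def kmeans_reduce(cluster_id, members):
    """Reduce: the new centroid is the component-wise mean of the members."""
    n = len(members)
    return cluster_id, tuple(sum(col) / n for col in zip(*members))

def kmeans_iteration(points, centroids):
    """One MapReduce job: assign points, then recompute centroids."""
    groups = defaultdict(list)  # stands in for the shuffle phase
    for cid, point in kmeans_map(points, centroids):
        groups[cid].append(point)
    updated = list(centroids)   # clusters with no members keep their centroid
    for cid, members in groups.items():
        updated[cid] = kmeans_reduce(cid, members)[1]
    return updated

def kmeans(points, centroids, max_iter=20):
    """Driver: repeat the job until the centroids stop changing."""
    centroids = list(centroids)
    for _ in range(max_iter):
        new = kmeans_iteration(points, centroids)
        if new == centroids:
            break
        centroids = new
    return centroids

if __name__ == "__main__":
    data = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
    print(kmeans(data, [(0.0, 0.0), (10.0, 10.0)]))
```

The driver loop launches one job per iteration, which is consistent with the observation reported in [31] that iterative algorithms are a less natural fit for Hadoop MapReduce than embarrassingly parallel ones.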

PROPOSED WORK MODEL
Distributed clustering solves two problems: the infeasibility of collecting data at a central site, and the intractability of traditional clustering algorithms on huge data sets [36]. In our proposal, we will work on distributed document clustering using cloud computing tools such as Hadoop and Granules. The focus will be to work on very large datasets to address scalability, reliability, and speedup, and to build a semantic understanding of documents before clustering, using ontology, to increase accuracy.

Ontology Usage for Semantics
Many documents do not contain common words even though they contain similar semantic information. For instance, a document that describes hockey should be retrieved under the concept game even if it does not contain the word game. To deal with this problem, a concept-based model using ontologies is necessary [37].
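The hockey and game example can be made concrete with a small Python sketch that enriches a document's terms with concepts from an is-a hierarchy before clustering. The hierarchy IS_A below is a hypothetical toy stand-in for a real domain ontology, and the expansion strategy (appending a limited number of ancestor concepts) is one simple possibility rather than the method of the proposed model.

```python
# Hypothetical toy is-a hierarchy; a real system would load a domain ontology.
IS_A = {
    "hockey": "game",
    "cricket": "game",
    "game": "activity",
    "laptop": "computer",
    "computer": "device",
}

def ancestors(term, is_a=IS_A):
    """All concepts reachable by following is-a links upward from term."""
    found = []
    # The membership check also stops the walk if the hierarchy has a cycle.
    while term in is_a and is_a[term] not in found:
        term = is_a[term]
        found.append(term)
    return found

def expand_with_concepts(tokens, is_a=IS_A, max_levels=1):
    """Append up to max_levels ancestor concepts of each token to the terms."""
    expanded = list(tokens)
    for token in tokens:
        expanded.extend(ancestors(token, is_a)[:max_levels])
    return expanded

if __name__ == "__main__":
    print(expand_with_concepts(["hockey", "stick"]))
    print(expand_with_concepts(["cricket", "bat"]))
```

After expansion, a hockey document and a cricket document share the term game even though neither originally contained it, so a term-based similarity measure can now relate them.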



Ontology is defined by Gruber [38] as “Ontology is an explicit specification of a conceptualization of a given domain”, and by Campbell [39] as “Ontology consists of a representational vocabulary with precise definitions of the meanings of the terms of this vocabulary plus a set of formal axioms that constrain interpretation and well-formed use of these terms”. An ontology [40, 41] typically provides a vocabulary that describes a domain of interest and a specification of the meaning of the terms used in the vocabulary. Ontologies are thus a practical means to conceptualize what is expressed in a computer format. The term ontology is often used to refer to a range of linguistic and conceptual resources such as thesauri, dictionaries, taxonomies, and formal knowledge representations [37]. Ontology is applied in text clustering algorithms to make use of word meanings [15]. Figure 3 shows the various areas of study for our model.

Figure 3: Various Threads and Fields of Focus for Our Model

CONCLUSIONS
Document clustering is a fundamental operation used in unsupervised document organization, automatic topic extraction, and information retrieval. In this paper, we have given an overview of the research done in the area of document clustering with the major goal of increasing speed using parallel and distributed technologies. We have proposed a work model and identified the fields that need to be explored for our work. We further suggest making use of semantics, such as ontology and natural language processing, to increase the quality of clusters.


REFERENCES
1. Rekha Baghel and Dr. Renu Dhir, “A Frequent Concepts Based Document Clustering Algorithm,” Int’l Journal of Computer Applications, Vol. 4, No.5, pp. 0975 – 8887, Jul. 2010

2. A. Huang, “Similarity measures for text document clustering,” In Proc. of the 6th New Zealand Computer Science Research Student Conference NZCSRSC, pp. 49-56, 2008

3. Nicholas Andrews and Edward Fox, “Recent developments in document clustering,” Technical report published by citeseer, pp. 1-25, Oct. 2007




4. Chun-Ling Chen, Frank S.C. Tseng, and Tyne Liang, “An integration of WordNet and fuzzy association rule mining for multi-label document clustering,” Data and Knowledge Engineering, Vol. 69, Issue 11, pp. 1208-1226, Nov. 2010

5. Michael Steinbach, George Karypis, and Vipin Kumar, “A comparison of document clustering techniques,” In KDD Workshop on Text Mining, 2002

6. Pankaj Jajoo, “Document Clustering,” Masters’ Thesis, IIT Kharagpur, 2008

7. Chih-Ping Wei, Chin-Sheng Yang, Han-Wei Hsiao, and Tsang-Hsiang Cheng, “Combining preference and content-based approaches for improving document clustering effectiveness,” Int’l Journal of Information Processing and Management, Vol. 42, Issue 2, pp. 350-372, Mar. 2006

8. Neepa Shah and Sunita Mahajan, “Document Clustering: A Detailed Review,” Int’l Journal of Applied Information Systems, Vol. 4, No. 5, pp. 30-38, Oct. 2012

9. Neepa Shah and Sunita Mahajan, “Semantic based Document Clustering: A Detailed Review,” Int’l Journal of Computer Applications, Vol. 52, No. 5, pp. 42-52, Aug. 2012

10. K. Hammouda, “Distributed Document Clustering and Cluster Summarization in Peer-to-Peer Environments,” PhD Thesis, Dept. of System Design Engineering, University of Waterloo, 2007

11. J. Wan, W. Yu, and X. Xu, “Design and implementation of distributed document clustering based on MapReduce,” in Proc. the 2nd symposium on International Computer Science and Computational Technology, pp. 278–280, 2009

12. Tom White, “Hadoop: The Definitive Guide,” O'Reilly Media, Inc., 2009

13. Deyun Gao, Xiaohu Li, and Zheyuan Yu, “Parallel Clustering of Large Document Collections,” 2003

14. ShuTing Xu and Jun Zhang, “A hybrid parallel web document clustering algorithm and its performance study,” Technical report, Department of Computer Science, University of Kentucky, 2003

15. Yanjun Li, “High performance text document clustering,” PhD Thesis, Wright State University, USA, 2007

16. Surendra Byna and Xian-He Sun, “Special issue on Data Intensive Computing,” Journal of Parallel and Distributed Computing, Vol. 71, Issue 2, pp. 143-144, Feb. 2011

17. Yongpeng Zhang, Frank Mueller, Xiaohui Cui, and Thomas Potok, “Data-intensive document clustering on graphics processing unit (GPU) clusters,” Journal of Parallel and Distributed Computing, Vol. 71, No. 2, pp. 211-224, Feb. 2011

18. Aboutabl Amal Elsayed and Elsayed Mohamed Nour, “A Novel Parallel Algorithm for Clustering Documents Based on the Hierarchical Agglomerative Approach,” Int’l Journal of Computer Science & Information Technology, Vol. 3, Issue 2, pp. 152, Apr. 2011

19. Jace A. Mogill and David J. Haglin, “Toward Parallel Document Clustering,” IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum (IPDPSW), 2011, pp. 1700-1709, May 2011

20. J. Li and R. Morris, “Document clustering for distributed fulltext search,” In 2nd MIT Student Oxygen Workshop, Cambridge, Aug. 2002



21. Debzani Deb, M. Muztaba Fuad, and Rafal Angryk, “Distributed hierarchical document clustering,” ACST'06 Proceedings of the 2nd IASTED International Conference on Advances in Computer Science and Technology, pp. 328-333, 2006

22. Debzani Deb and Rafal Angryk, “Distributed document clustering using word-clusters,” In IEEE Symposium on Computational Intelligence and Data Mining (CIDM), pp. 376-383, 2007

23. Chang Liu, Song-nian Yu, and Qiang Guo, “Distributed Document Clustering for Search Engine,” International Conference on Wavelet Analysis and Pattern Recognition, 2009. ICWAPR 2009, pp. 454-459, Jul. 2009

24. Thangamani M. and P. Thangaraj, “Peer to Peer Distributed Document Clustering with Global Centroid Optimization Technique and Semantic Analysis,” European Journal of Scientific Research, Vol. 49, No. 3, pp. 387-402, 2011

25. Thangamani M. and P. Thangaraj, “Ontology Based Fuzzy Document Clustering for Distributed P2P Network,” Global Journal of Computer Science and Technology, Vol. 11, No. 5, pp. 21-38, Apr. 2011

26. Diana MacLean, “A Very Brief Introduction to MapReduce,” 2011

27. Tamer Elsayed, Jimmy Lin, and Douglas Oard, “Pairwise document similarity in large collections with MapReduce,” Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies, Jun. 2008

28. Rajkumar Buyya, Chee Shin Yeo, Srikumar Venugopal, James Broberg, and Ivona Brandic, “Cloud Computing and Emerging IT Platforms: Vision, Hype, and Reality for Delivering Computing as the 5th Utility,” Future Generation Computer Systems, Vol. 25, No. 6, pp. 599-616, Jun. 2009

29. Dipesh Shrestha, “Text Mining with Lucene and Hadoop: Document Clustering with Feature Extraction,” PhD Thesis, Wakhok University, 2009

30. Yang Liu, Maozhen Li, Hammoud, S., Khalid Alham, N., and Ponraj M., “A MapReduce based distributed LSI,” 7th Int’l Conference on Fuzzy Systems and Knowledge Discovery (FSKD), 2010, Vol. 6, pp. 2978-2982, Aug. 2010

31. Satish Srirama, Pelle Jakovits, and Eero Vainikko, “Adapting scientific computing problems to clouds using MapReduce,” Future Generation Computer Systems, Vol. 28, No. 1, pp. 184-192, 2012

32. Stefan Theußl, Ingo Feinerer, and Kurt Hornik, “Distributed Text Mining in R,” 2011

33. Ping ZHOU, Jingsheng LEI, and Wenjun YE, “Large-Scale Data Sets Clustering Based on MapReduce and Hadoop,” Journal of Computational Information Systems, pp. 5956-5963, 2011

34. “What is Apache Mahout?” available at

35. Kathleen Ericson and Shrideep Pallickara, “On the performance of high dimensional data clustering and classification algorithms,” Future Generation Computer Systems, Jun. 2012

36. Khaled Hammouda and Mohamed Kamel, “Distributed collaborative Web document clustering using cluster keyphrase summaries,” Information Fusion, Vol. 9, Issue 4, pp. 465-480, 2008

37. Saurabh Sharma and Vishal Gupta, “Recent Developments in Text Clustering Techniques,” Int’l Journal of Computer Applications, Vol. 37, No. 6, pp. 14-19, Jan. 2012



38. Thomas R. Gruber, “Toward principles for the design of ontologies used for knowledge sharing,” Int’l Journal of Human-Computer Studies, Vol. 43, Issues 5-6, pp. 907-928, Nov. 1995

39. A. E. Campbell and S. C. Shapiro, “Ontological Mediation: An Overview,” Proceedings of the IJCAI Workshop on Basic Ontological Issues in Knowledge Sharing, AAAI Press, 1995

40. Chris Biemann, “Ontology Learning from Text: A Survey of Methods,” LDV-Forum, Vol. 20, No. 2, pp. 75-93, 2005

41. Christopher Brewster and Kieron O’Hara, “Knowledge representation with ontologies: Present challenges Future possibilities,” Int’l Journal of Human-Computer Studies, Vol. 65, No. 7, pp. 563-568, Jul. 2007

