
SoftwareX 6 (2017) 98–106


RCrawler: An R package for parallel web crawling and scraping

Salim Khalil *, Mohamed Fakir

Department of Informatics, Faculty of Sciences and Technics, Beni Mellal, Morocco

* Corresponding author. E-mail addresses: khalilsalim1@gmail.com (S. Khalil), fakfad@yahoo.fr (M. Fakir).

Article info

Article history: Received 8 November 2016; Received in revised form 2 March 2017; Accepted 13 April 2017

Keywords: Web crawler; Web scraper; R package; Parallel crawling; Web mining; Data collection

Abstract

RCrawler is a contributed R package for domain-based web crawling and content scraping. As the first implementation of a parallel web crawler in the R environment, RCrawler can crawl, parse, store pages, extract contents, and produce data that can be directly employed for web content mining applications. However, it is also flexible, and could be adapted to other applications. The main features of RCrawler are multi-threaded crawling, content extraction, and duplicate content detection. In addition, it includes functionalities such as URL and content-type filtering, depth level controlling, and a robots.txt parser. Our crawler has a highly optimized system, and can download a large number of pages per second while being robust against certain crashes and spider traps. In this paper, we describe the design and functionality of RCrawler, and report on our experience of implementing it in an R environment, including different optimizations that handle the limitations of R. Finally, we discuss our experimental results.

© 2017 The Authors. Published by Elsevier B.V. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
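To make these features concrete, the sketch below shows how a domain-based crawl might be launched with the package. It is an illustrative example, not code from the paper: the parameter names (Website, no_cores, no_conn, MaxDepth, Obeyrobots, RequestsDelay) follow the package manual linked in the code metadata, but may differ slightly between package versions.

```r
# A hedged sketch of launching a domain-based crawl with RCrawler.
# Parameter names follow the package manual and may vary by version.
library(Rcrawler)

Rcrawler(Website   = "http://www.example.com", # seed domain to crawl
         no_cores  = 4,     # worker processes for parallel crawling
         no_conn   = 4,     # simultaneous HTTP connections
         MaxDepth  = 2,     # stop following links beyond this depth
         Obeyrobots = TRUE, # parse and honor the site's robots.txt
         RequestsDelay = 1) # politeness delay (seconds) between requests
```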

Code metadata

Current code version: v 0.1
Permanent link to code/repository used for this code version: http://github.com/ElsevierSoftwareX/SOFTX-D-16-00090
Legal Code License: MIT
Code versioning system used: git
Software code languages, tools, and services used: R, Java
Compilation requirements, operating environments & dependencies: 64-bit operating system; R environment version 3.2.3 and up (64-bit); R packages: httr, rJava, xml2, data.table, foreach, doParallel, parallel
If available, link to developer documentation/manual: https://github.com/salimk/Rcrawler/blob/master/man/RcrawlerMan.pdf
Support email for questions: khalilsalim1@gmail.com
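Given these dependencies, a plausible setup might look as follows. This is a sketch rather than instructions from the paper; it assumes the package can be installed from the GitHub repository above via devtools:

```r
# Install the R package dependencies listed above; "parallel" ships
# with base R and needs no separate installation.
install.packages(c("httr", "rJava", "xml2", "data.table",
                   "foreach", "doParallel"))

# Install Rcrawler itself, e.g. from the author's GitHub repository
# (assumes the devtools package is available).
devtools::install_github("salimk/Rcrawler")
```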

1. Introduction

The explosive growth of data available on the World Wide Web has made this the largest publicly accessible data bank in the world. Therefore, it presents an unprecedented opportunity for data mining and knowledge discovery. Web mining, the field of science that aims to discover useful knowledge from information available over the internet, can be classified into three categories, depending on the mining goals and the information garnered: web structure mining, web usage mining, and web content mining [1]. Web structure mining extracts patterns from the linking structures between pages, and presents the web as a directed graph in which the nodes represent pages and the directed edges represent links [2]. Web usage mining mines user activity patterns, gathered from the analysis of web log records, in order to understand user behavior during website visits [3]. Web content mining extracts and mines valuable information from web content. The latter is performed with two objectives: search results mining, which aims to improve search engines and information retrieval [4]; and web page content mining, which mines web page content for analysis and exploration purposes. This work is part of a project that aims to extract useful knowledge from online newspaper content.

In web content mining, data collection constitutes a substantial task (see Fig. 1). Indeed, several web mining applications use web crawlers for the process of retrieving and collecting data from the web [5]. Web crawlers, or spiders, are programs that automatically browse and download web pages by following hyperlinks in a methodical, automated manner.
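To illustrate the crawler definition above, here is a minimal, self-contained crawl loop in R. It is not RCrawler's implementation, only a sketch of the fetch/extract/queue cycle the text describes, built on httr and xml2 (both listed among the package dependencies); the function name crawl and the seed-prefix domain filter are illustrative choices.

```r
# A minimal, illustrative crawl loop (not RCrawler's implementation):
# fetch a page, extract its links, and queue unseen same-domain URLs.
library(httr)
library(xml2)

crawl <- function(seed, max_pages = 10) {
  frontier <- seed          # URLs waiting to be visited
  visited  <- character(0)  # URLs already downloaded
  while (length(frontier) > 0 && length(visited) < max_pages) {
    url      <- frontier[1]
    frontier <- frontier[-1]
    if (url %in% visited) next
    resp <- tryCatch(GET(url), error = function(e) NULL)
    if (is.null(resp) || http_error(resp)) next
    visited <- c(visited, url)
    page  <- read_html(content(resp, as = "text", encoding = "UTF-8"))
    links <- xml_attr(xml_find_all(page, "//a[@href]"), "href")
    # keep only absolute links that start with the seed URL
    # (a crude same-domain filter; drops relative links)
    links    <- links[substring(links, 1, nchar(seed)) == seed]
    frontier <- unique(c(frontier, links))
  }
  visited
}

crawl("http://www.example.com")
```

A production crawler would additionally normalize relative URLs, respect robots.txt, throttle requests, and parallelize downloads, which is precisely the machinery RCrawler packages up.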

http://dx.doi.org/10.1016/j.softx.2017.04.004

