Real-time Streaming Data Analysis using Spark


ISSN 2347 - 3983

Volume 6, No. 1, January 2018

Kyeongjoo Kim et al., International Journal of Emerging Trends in Engineering Research, 6(1), January 2018, 1 – 5

International Journal of Emerging Trends in Engineering Research

Available Online at http://www.warse.org/IJETER/static/pdf/file/ijeter01612018.pdf

https://doi.org/10.30534/ijeter/2018/01612018

Kyeongjoo Kim1, Jihyun Song2, Minsoo Lee3

1 Dept. of Computer Science and Engineering, Ewha Womans University, Seoul, Korea, Email: kjkimkr@ewhain.net
2 Dept. of Computer Science and Engineering, Ewha Womans University, Seoul, Korea, Email: ssongji7583@ewhain.net
3 Dept. of Computer Science and Engineering, Ewha Womans University, Seoul, Korea, Email: mlee@ewha.ac.kr

ABSTRACT

Streaming data generates a large amount of information in real time. Various types of streaming data exist, and analyzing such data in real time is an important issue. Streaming data such as IoT sensor data, SNS data from Twitter, and stock data can contain very sensitive information that needs to be analyzed in real time. Our approach to analyzing streaming data is based on real-time Twitter data. We used Spark Streaming, provided by the Apache Spark API, to process the streaming data. We analyzed the hashtags of real-time Twitter streaming data to find related information about specific keywords of interest. In this paper, we performed two kinds of analysis: first, finding the top 10 hashtags related to a keyword; second, analyzing how many of the words used are related to happiness, represented as integer values.

2. RELATED RESEARCH

2.1 Spark streaming

Spark Streaming enables scalable, high-throughput, fault-tolerant stream processing of live data streams. It is an extension of the core Spark API and can be used to stream live data and perform real-time processing. Spark Streaming provides a single platform for ingesting data in order to perform fast, live processing in Apache Spark [1]. Data streaming is a technique for transferring data so that it can be processed as a steady and continuous stream. Streaming technologies are becoming increasingly important, and Spark Streaming can be used to stream real-time data from various sources such as Twitter, stock markets, IoT sensors and geographical systems and to perform analytics on it to help businesses. The fundamental stream unit is the DStream, which is basically a series of RDDs used to process the real-time data. The major features of Spark Streaming are scaling, speed, fault tolerance, integration, and business analysis [2].

The Spark Streaming workflow has four high-level stages. The first stage streams data from various sources. These can be real-time streaming sources such as Kafka, Flume or Parquet, or static/batch sources such as HBase, MySQL, PostgreSQL, MongoDB and Cassandra. Next, Spark can be used to perform machine learning on the data via its MLlib API [3]. Spark SQL can then be used to perform further operations on this data. Finally, the streaming output can be stored in various data storage systems such as HBase, Cassandra, Kafka, HDFS and local file systems.

The Spark Streaming workflow contains four components: Streaming Context, DStream, caching, and accumulators, broadcast variables and checkpoints [4]. Streaming Context consumes a stream of data in Spark and is the main entry point for Spark Streaming functionality. The Discretized Stream (DStream) is the basic abstraction provided by Spark Streaming. Accumulators are variables that are only added to through an associative and commutative operation and are used to implement counters or sums.
Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks.
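
The DStream and accumulator ideas described in this section can be illustrated with a minimal plain-Python sketch. It is a conceptual model only and does not use the Spark API: it groups timestamped records into fixed batch intervals, much as a DStream presents a live stream as a series of RDDs, and keeps a running counter in the spirit of an accumulator. The function name `discretize`, the parameter `batch_interval` and the sample records are hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def discretize(records, batch_interval):
    """Group (timestamp, value) records into consecutive batches.

    Each batch holds the values whose timestamp falls in
    [k * batch_interval, (k + 1) * batch_interval). This mimics how a
    DStream presents a live stream as a series of small batches (RDDs).
    Plain-Python model only; it does not call Spark.
    """
    batches = defaultdict(list)
    for timestamp, value in records:
        batches[int(timestamp // batch_interval)].append(value)
    # Return batches in time order; intervals with no records are skipped.
    return [batches[k] for k in sorted(batches)]


# Hypothetical sample stream: (seconds since start, tweet text).
records = [
    (0.2, "new #iphone8 today"),
    (0.9, "love it"),
    (1.4, "#iphone8 #apple"),
    (3.1, "so happy #iphone8"),
]

tweet_count = 0  # accumulator-like counter: it is only ever added to
for batch in discretize(records, batch_interval=1.0):
    tweet_count += len(batch)
    print(len(batch), "record(s) in batch:", batch)

print("total tweets seen:", tweet_count)
```

In Spark Streaming itself the batching is done by the framework according to the batch interval given to the Streaming Context; the sketch only shows the shape of the data that per-batch operations would see.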

Key words: Hashtag, Sentimental analysis, Spark streaming, Twitter

1. INTRODUCTION

Twitter is one of the biggest platforms among social network systems (SNS), and its instant messages (i.e. tweets) are published every second. Users tend to express their real feelings freely on Twitter, which makes it an ideal source for capturing opinions on various topics of interest, such as brands, products or celebrities. Hashtags, which start with the symbol “#” ahead of the keywords or phrases that users want to emphasize, are widely used in tweets. In this paper, we performed hashtag analysis on real-time Twitter data using Apache Spark Streaming. Specifically, we used the keyword iphone8, as it is a popular keyword. We analyzed hashtags about iphone8 in two ways. The first is to find the top 10 hashtags that appear with “#iphone8”; the second is to measure how closely tweets with the iphone8 hashtag are related to happiness, using a happiness word table.
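
The two analyses can be sketched in plain Python on a small in-memory list of tweets. This is an illustrative sketch under stated assumptions, not the implementation used in the paper: the sample tweets, the word values in `happiness_table`, and the function names `top_hashtags` and `happiness_score` are all hypothetical, and the sketch assumes that the keyword hashtag itself is excluded from the ranking and that a tweet's score is the sum of the integer values of its words.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#\w+")


def top_hashtags(tweets, keyword, n=10):
    """Return the n most frequent hashtags that co-occur with `keyword`.

    Only tweets containing the keyword hashtag are considered, and the
    keyword itself is left out of the ranking (an assumption).
    """
    counts = Counter()
    for text in tweets:
        tags = [tag.lower() for tag in HASHTAG.findall(text)]
        if keyword in tags:
            counts.update(tag for tag in tags if tag != keyword)
    return counts.most_common(n)


def happiness_score(text, word_table):
    """Sum the integer happiness values of the words in one tweet."""
    words = re.findall(r"[a-z']+", text.lower())
    return sum(word_table.get(word, 0) for word in words)


# Hypothetical sample data, for illustration only.
tweets = [
    "Just got the #iPhone8 #apple",
    "#iphone8 vs #iphoneX #apple",
    "Nothing to see here #android",
    "So happy with my #iphone8, great camera",
]
happiness_table = {"happy": 3, "great": 2, "sad": -2}

print(top_hashtags(tweets, "#iphone8"))
for text in tweets:
    if "#iphone8" in text.lower():
        print(happiness_score(text, happiness_table), text)
```

In a streaming setting the same two functions would be applied to each batch of incoming tweets rather than to a fixed list.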


