
Stream Processing Elasticity Metric: Availability

Aditya Tiwari, Manas Mhapuskar, Sreelatha Chalamkuri

Abstract—

Stream processing is a computing technique developed to manage large, high-volume sets of data in real time. Unlike conventional database systems, stream processing systems handle and work on data in real time. In this paper we present the conceptual architecture of distributed data stream processing systems and the metrics used to evaluate data stream processing and the elasticity of a system. We also survey several techniques and systems that help us compare the availability, elasticity and scalability of a varied set of database systems. We further study adaptive processing graphs, operator elasticity in the context of streaming middleware, executor-centric methods, the management of irregular data flows, and approaches that optimize system performance.

Keywords: availability, scalability, latency, processing graph

I. INTRODUCTION

We all know that a single click online (e.g., on a web page) produces a lot of data, which flows directly into computer networks. To avoid losing this data, we need systems that can process huge amounts of data within short periods and that adapt easily to changes in the input, because data streams are never the same and their throughput always varies. Choosing the right data stream processing system has therefore always been a challenge, since we never know what the variations in the data stream will be. Because we cannot predict them, we can instead expose the system to severe conditions to test it. Either way, we need metrics that measure the behavior of the system under extreme conditions.

II. DATA STREAM PROCESSING SYSTEMS ARCHITECTURE

The author of [1] considers distributed stream processing systems. Earlier systems were monolithic (all components of the system shared the same memory), whereas current systems consist of different processes, where some nodes share the same memory while others do not. In a distributed data stream processing system these processes are linked so that they form a directed acyclic graph. The processes are referred to as operators, because they perform the different operations required to produce the system's output. The operators communicate with each other over network connections. The protocol generally used is TCP/IP, as it can prevent information loss: if failures occur during the transfer of data, they lead to an output that differs from the expected output, and such a change in the output is seen only when failures occur during the computations.

fig: Data stream processing system conceptual architecture

The architecture is shown in the figure, where the circles represent the operators, which are hosted by the nodes, and the arrows connecting the operators represent the network connections. The input data stream enters the data stream processing system through the operators that are connected to the outside world; these are called edge operators. The function of an edge operator is to convert the input data stream into a sequence of events. These events are usually treated as tuples, or records, with a key and a payload, and the operators communicate with each other using these tuples. The key identifies the tuple, and the payload is the data being processed by the operators.

III. METRICS FOR EVALUATING DISTRIBUTED DATA STREAM PROCESSING SYSTEMS

According to [2], in a distributed data stream processing system the throughput metric is the number of events processed in a given amount of time, and the response time is the total time the system takes to respond to a query passed to it. The paper also proposes a performance evaluation framework called FINCoS, which has the following characteristics: flexibility, independence from workloads, neutrality, correctness checking, and scalability. With the framework the authors propose further features to evaluate, such as correctness and the capacity of the system to adapt itself to different input stream loads, but this evaluation process ran into some issues, and the authors propose new features for evaluating data stream processing systems in [3]. There, data stream processing systems are also evaluated on the basis of memory consumption, latency, maximum peak latency and the post-peak latency variation ratio. Because most of the system's processing happens in main memory, the authors state that main memory consumption is important to consider as a primary feature. Latency is defined as the delay between the time when an event enters the system and the time when the system responds to that event. The maximum peak latency is the delay that occurs while the system is processing a maximum load, and the post-peak latency variation ratio is the average delay after the system has dealt with a maximum load divided by the average delay in the steady phase. The authors state that these latencies capture the behavior of the system when it is exposed to extreme conditions. In [4], the authors consider network latency, jitter (delays that change over time) and processing latency in the cloud environment; their results show that latency is the main feature to consider when evaluating cloud-based data stream processing systems.
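The metric definitions in this section can be sketched in a few lines of Python. This is only an illustrative reading of the prose definitions, not the evaluation code of [2] or [3]: the function names are hypothetical, latencies are assumed to be given in milliseconds, and the maximum peak latency is assumed to be the largest latency observed while the system is under its maximum load.

```python
from statistics import mean


def throughput(num_events, elapsed_seconds):
    """Number of events processed per second over an observation interval."""
    return num_events / elapsed_seconds


def latency(t_in, t_out):
    """Delay between an event entering the system and the system's response."""
    return t_out - t_in


def max_peak_latency(peak_latencies):
    """Largest latency seen during the peak-load phase (assumed reading)."""
    return max(peak_latencies)


def post_peak_latency_variation_ratio(post_peak_latencies, steady_latencies):
    """Average latency after the peak divided by average steady-phase latency."""
    return mean(post_peak_latencies) / mean(steady_latencies)
```

A ratio close to 1 would indicate that the system returns to its steady-phase behavior after a peak, while a larger value would indicate that it is still recovering.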

fig: calculating the latency metric of a system in multiple paths

The information latency is the delay that occurs between the moment the system processes a certain number of events and the moment it produces its outputs, and the system may show a different information latency for different inputs. To analyze this, consider a data stream processing system with a sensor attached to it. The sensor produces tuples consisting of a key and a float value; the tuples are classified by the range their value falls in, and an average for each range is then computed per unit of time.

IV. MEASURING THE LATENCIES.

Grabs and Lu state in [5] that latencies are of two types: information latency and system latency. System latency is the delay that occurs while the system is processing an event. It can be given as $\Delta t = t_{out} - t_{in}$, where $t_{in}$ is the time at which the input tuple enters the system and $t_{out}$ is the time at which the corresponding output tuple leaves it. When a tuple can follow several paths through the system, we must calculate $\Delta t_i$ for each path $i$ and then take the average over all paths.
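As a small worked sketch of the system latency described above, the following Python computes the difference between output and input times for each path and averages the results. The function names and the representation of a path as a pair of timestamps are assumptions made for illustration only.

```python
from statistics import mean


def system_latency(t_in, t_out):
    """System latency of one path: output time minus input time."""
    return t_out - t_in


def average_system_latency(path_timestamps):
    """Average the per-path latencies.

    path_timestamps is a list of (t_in, t_out) pairs, one pair per path
    that a tuple can take through the operator graph.
    """
    return mean([system_latency(t_in, t_out) for t_in, t_out in path_timestamps])
```

For example, three paths with timestamps (0, 4), (1, 3) and (2, 8) have latencies 4, 2 and 6, so the averaged system latency is 4 time units.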

fig: calculation of the information latency metric from a tuple. The information latency is given below:

V. BENCHMARKING ELASTIC QUERY PROCESSING ON BIGDATA

Elastic scaling is important for realizing the full ability of a cloud. It describes the potential of a system to grow and shrink in response to changes in performance requirements. Elastic scaling can be challenging both for RDBMSs and for cloud-based environments, where the number of nodes is high. In [6] the authors evaluate the elasticity of both RDBMSs and cloud systems by proposing an elastic query processing benchmark, which can be used to compare a wide range of systems. The benchmark is based on the TPC-H queries and data generator, but it does not adopt the TPC-H benchmark execution.

VI. METRICS

The paper aims to evaluate the elasticity of a system, and to do so it takes two metrics into consideration: scaling overhead and elastic overhead. The number of nodes determines the execution time of a single stream that is given a complete query set. Scaling overhead is the amount of time wasted while the system is stable. It is calculated as the difference between the time spent during measurement and the target time for the number of queries in a phase during which the system is stable, multiplied by the number of workers in that phase. Elastic overhead is the amount of time wasted because of sub-optimal elasticity. It is defined as the difference between the measured time and the reference time for a part; it is measured for every phase and every worker individually, and the overall sum is then taken.
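The two overheads can be illustrated with a minimal Python sketch. It follows the prose definitions given here rather than the benchmark's own implementation; the function names, the argument layout and the use of seconds as the time unit are assumptions.

```python
def scaling_overhead(measured_time, target_time, num_workers):
    """Time wasted in one stable phase.

    The gap between measured and target time is multiplied by the
    number of workers active in that phase.
    """
    return (measured_time - target_time) * num_workers


def elastic_overhead(measurements):
    """Time wasted through sub-optimal elasticity.

    measurements is a list of (measured_time, reference_time) pairs,
    one pair per combination of phase and worker; the individual
    differences are summed to give the overall value.
    """
    return sum(measured - reference for measured, reference in measurements)
```

For instance, a stable phase that took 120 seconds against a target of 100 seconds on 4 workers gives a scaling overhead of 80 worker-seconds.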

VII. ELASTIC SCALING OF DATA PARALLEL OPERATORS IN STREAM PROCESSING

The paper [7] studies operator elasticity in the context of streaming middleware. An interesting aspect of runtime elasticity is that it is not restricted to changes in traffic patterns: resource availability also keeps changing as long-running applications finish their data analysis tasks. The elastic operator algorithm is expected to reach the best operating level with little overhead and to readjust when conditions change. It is observed that using sets of elastic operators together with the operating system's policies yields the best level of efficiency. This is significant because it speeds up the entire application globally, under the assumption that the application's physical layout cannot be changed at runtime. The objective behind offering this ability in SPADE is to reduce the amount of tuning time and guesswork that developers typically must spend when applications are deployed on various platforms and runtime configurations. The idea of elasticity can be extended beyond a single multicore machine, so that it also benefits distributed resources. Other kinds of parallelism that are common in streaming applications, such as task parallelism and pipelining, can also be addressed.

VIII. STUDY AND COMPARISON OF ELASTIC CLOUD DATABASES: MYTH OR REALITY?

According to [8], measuring the capacity for storage elasticity helps to identify the actual elasticity of particular databases. The results confirm the theoretical analysis of the expected elasticity with regard to the advantage of systems, such as HBase, that do not need to move all the data. For systems that do need to move data, the impact on the time required to complete all the transfers is also noted. That new MongoDB nodes serve requests quickly is a benefit, as the load is spread faster over a larger set of nodes. However, because a new node starts serving requests immediately after it has downloaded a complete chunk, it is also serving requests while it pre-allocates big files on disk to make room for the next chunks. The pre-allocations consume a large amount of I/O and thus degrade the performance of the node. If the data set is too big to fit into memory, I/O becomes crucial, and events like compactions, merges and pre-allocations can have a strong impact on performance, even bigger than the impact of adding new nodes to small clusters. It is therefore important to select infrastructure appropriate to the data set being handled. The goal of these measurements is to observe in practice the real elasticity and scalability of these databases, because unforeseen behaviors are very likely to appear in comparison with the polished picture presented by the databases themselves.

IX. ELASTICUTOR: RAPID ELASTICITY FOR REALTIME STATEFUL STREAM PROCESSING

In [9], the authors propose a method that improves elasticity by overcoming the limitations of key repartitioning at the operator level. They present the Elasticutor framework (rapid elasticity for real-time stateful stream processing), which enables swift elasticity for stream processing systems. Elasticutor follows a novel executor-centric approach that statically binds executors to operators while allowing executors to scale independently. This approach decouples the scaling of operators from the global coordination otherwise needed for processing. The Elasticutor framework is made up of two main components: elastic executors, which perform load balancing, and a main scheduler that optimizes the use of computational resources. Experiments with real stock exchange transactions demonstrate that, compared with a resource-centric approach to elasticity, Elasticutor increases throughput and achieves an average latency that is orders of magnitude lower.

X. PROACTIVE ELASTICITY AND ENERGY AWARENESS IN DATA STREAM PROCESSING

In [10], the authors propose a predictive method for elastic data stream processing operators on multicores. The method is built around Model Predictive Control (MPC), which is used to construct a predictive controller. The controller adjusts the number of cores used and the CPU frequency, and can thereby control power consumption while accommodating throughput and latency requirements. The methodology targets multicore-based shared-memory systems and allows good Self-Adaptive and Self-Organized system (SASO) trade-offs to be achieved.

XI. EFFICIENT ELASTIC BURST DETECTION IN DATA STREAMS

In [11], the authors introduce the notion of monitoring data streams based on an elastic window model and make the case for this new model. The purpose is to identify abnormal aggregates in data streams. Detection operates on sliding windows rather than on the data stream as a whole: several windows are monitored, and those whose aggregate is abnormal or differs significantly from other periods are reported. The authors propose the algorithm for Gamma Ray burst detection in large data sets. With this model, the system can discover the size of the sliding window while monitoring data streams. The paper also proposes a new data structure for the efficient detection of elastic bursts and other aggregates. Experiments on real data sets demonstrate that the algorithm is faster than a brute-force algorithm by orders of magnitude.

XII. MEASURING ELASTICITY FOR CLOUD DATABASES

With the growing use of the internet, the proliferation of data sources has increased the number of “Big Data” storage problems. Many of these data sources are huge, and they grow even bigger in a short time. Distributed databases are suitable for this type of data set, but they need to scale and to be elastic to account for the increasing demand for computational power and storage. The aim of [12] is to present measurement results that analyze the elasticity of three chosen databases: Cassandra, MongoDB and HBase. The authors chose these three because they are the most prevalent horizontally scalable NoSQL databases in use. The authors ran realistic loads on 48 nodes, using Wikipedia content to create the data set and the Rackspace cloud infrastructure to run the tests. These were chosen because they allow the methodology to be described exactly and provide an effectively unlimited degree of elasticity, permitting consistent assessments of different databases at varying scales. The results clearly show that the technical choices made by each database strongly influence how it responds when new nodes are added to the cluster. From this description, HBase is the clear winner: because of its technical choices and architecture, it performed much less data transfer when new nodes were added. The article provides results only for systems scaling up, not for scaling down. The problems encountered when measuring the performance of MongoDB were not clearly addressed. It is uncertain what the results of the test would be with different parameter values, such as the read-only percentage, or with a different statistical distribution. The article also does not cover the measurement space or how to refine the elasticity measure.

XIII. LATENCY-AWARE ELASTIC SCALING FOR DISTRIBUTED DATA STREAM PROCESSING SYSTEMS

A distributed data stream processing system normally uses a static number of processing nodes, chosen to cover the expected maximum workload. But since maximum loads occur infrequently, the system is underutilized most of the time. The system should therefore acquire and release processing nodes by itself to match the workload; such systems are known as elastic. The most important goal for any data stream processing system is to meet the SLA constraints on end-to-end latency. Latency spikes that violate these constraints can be caused by excessive system load or by frequent operator movements between hosts. An elastically scaling data stream processing system can avoid overload situations through online load balancing, and it aims to raise utilization by adjusting the number of hosts employed by the system. FUGU consists of two components: a stream processing engine and a management component, which scales the processing engine elastically by deciding on operator movements among the hosts. The management component is an integrated unit: it monitors the hosts, reaches scaling decisions, and coordinates them with the data stream processing engine. The problem in elastic scaling of a data stream processing system is deciding when and where to move operators once a scaling decision has been made. Hence the scaling strategies used to address this problem are defined first. A placement strategy is then presented, which FUGU uses to decide where to assign the operators that need to be moved. The presented model reduces the number of latency violations that can occur during elastic scaling of the processing engine by close to 50% compared with the previous approach.

XIV. KEEP CALM AND REACT WITH FORESIGHT: STRATEGIES FOR LOW-LATENCY AND ENERGY-EFFICIENT ELASTIC DATA STREAM PROCESSING

Data stream processing is a computing paradigm that enables the real-time processing of continuous data streams, which must be processed on the fly. These applications are fed by irregular flows of data that must be processed in a timely manner to detect anomalies, provide real-time incremental responses to users, and take immediate decisions. Elasticity is an important feature in this paradigm: it allows applications to scale the resources they use up and down to match dynamic requirements and workloads. Elasticity is, however, a challenging problem for data stream processing applications that maintain internal state while processing input data flows, because the runtime system must be able to split and migrate the data structures that hold the state while preserving correctness and performance. In stream processing engines (SPEs), elasticity usually relies on automatic adaptation to the actual workload by scaling resources up and down. This work, however, introduces a model-based predictive approach rather than a heuristic-based reactive one, presenting a predictive method for elastic data stream operators on multicores. The approach has two aspects: it brings Model Predictive Control into data stream processing, and it considers providing latency guarantees. In these contexts, performance guarantees are of paramount importance, and elasticity makes it possible to meet them with high probability while lowering operating costs. Distributed optimization and Game Theory are some of the approaches used to solve this problem.

XV. SELF-ADAPTIVE PROCESSING GRAPH WITH OPERATOR FISSION FOR ELASTIC STREAM PROCESSING.

The proposed elastic stream processing system checks the state of each processing operator separately and adapts the processing graph using both a reactive and a predictive method. The solution can anticipate operator load using the statistical information it collects. Two algorithms determine the state of an operator: a short-term algorithm, which focuses on detecting sudden peaks of traffic, and a mid-term algorithm, which focuses on finding patterns in the traffic. Stream processing engines (SPEs) are software solutions that process unbounded streams of unstructured events in a distributed fashion. Most SPEs model an application as a graph in which the edges are flows of events and the vertices are processing tasks. The analyzer module determines the state of each operator from the information provided by the monitor module, using the two algorithms. The short-term algorithm is run at one fixed interval and the mid-term algorithm at another, both intervals being system parameters. The time window of the short-term algorithm must be smaller than that of the mid-term algorithm, so that the two tackle different traffic behaviors. The mid-term algorithm needs enough data to provide precise predictions. In this area of SPEs, the elasticity of computational resources is linked to the processing operators. In commercial systems the allocation is fixed at the beginning of the execution: the number of processing operators stays the same, and the system needs to be restarted to adapt to changes in the event flow. StreamCloud uses an elasticity protocol that defines conditions that trigger provisioning, decommissioning or load balancing. These conditions are all based on the CPU utilization of the machines, not on the data parallelism that is linked to the graph topology. Results show that the elastic approach can increase the number of operator replicas, adapting the graph topology to the data rate and to sudden peaks of data. The outcome depends on the parameters of the algorithms, for example the number of clones generated when the system anticipates an overloaded state and the threshold at which an operator is considered overloaded.
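The interplay of a short-term and a mid-term check can be sketched as follows. This is a simplified stand-in, not the algorithms of the surveyed paper: it assumes each check is a moving average of operator utilization compared against a single threshold, and the class name, window sizes and threshold value are all hypothetical.

```python
from collections import deque
from statistics import mean


class OperatorAnalyzer:
    """Tracks one operator's utilization over a short and a mid-term window."""

    def __init__(self, short_window=5, mid_window=30, overload_threshold=0.8):
        # The short-term window must be smaller than the mid-term window.
        if short_window >= mid_window:
            raise ValueError("short-term window must be smaller than mid-term window")
        self.short = deque(maxlen=short_window)
        self.mid = deque(maxlen=mid_window)
        self.threshold = overload_threshold

    def observe(self, utilization):
        """Record one utilization sample (0.0 to 1.0) from the monitor."""
        self.short.append(utilization)
        self.mid.append(utilization)

    def short_term_overloaded(self):
        """Reactive check: flags a sudden peak in the most recent samples."""
        return bool(self.short) and mean(self.short) > self.threshold

    def mid_term_overloaded(self):
        """Trend check: only answers once the mid-term window is full."""
        return len(self.mid) == self.mid.maxlen and mean(self.mid) > self.threshold
```

In this sketch a sudden burst trips the short-term check while the mid-term check stays quiet until the higher load persists, which mirrors the division of labor between the two algorithms described above.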


XVI. ELASTIC STREAM PROCESSING IN THE CLOUD.

The focus of this paper is elastic computing over data streams, as enabled by Cloud computing. One of the important tasks in stream processing is to adapt to highly varying environments, resource shortages, and variations in the quality of incoming data. Earlier work discusses a set of eight rules that should be followed in real-time stream processing [20]. Of the eight rules, the following three are particularly relevant to elastic stream processing in the Cloud: handling stream imperfections, guaranteeing data safety and availability, and automatic partitioning and scaling of applications. The solutions can be divided into strategies that operate on single events and strategies that concern the processing logic more broadly. Many strategies build on fundamental contributions that date back to before the Cloud era, while recent work focuses on the Cloud and on resource elasticity.


Sr. no: 1
Paper Name: Latency-aware elastic scaling for distributed data stream processing systems
Method: FUGU method: 1. Distributed data stream 2. Management
Problems: System overload situations
Solutions: Number of latency violations reduced by 50%

Sr. no: 2
Paper Name: Keep Calm and React with Foresight: Strategies for Low-Latency and Energy-Efficient Elastic Data Stream Processing
Method: Use of QoS to detect anomalies in stream time processing
Problems: Elasticity problem in processing input flow
Solutions: Distributed optimization and Game Theory

Sr. no: 3
Paper Name: Elastic Scaling of Data Parallel Operators in Stream Processing
Method: Using adaptive algorithm
Problems: Managing resource availability and additional needs
Solutions: Obtains best efficiency

Sr. no: 4
Paper Name: Metrics and tools for evaluating DSP
Method: Considering distributed DaSP systems using former versions of DaSP systems
Problems: The transfer of data within operation
Solutions: Explained using the architecture of DSPS systems

Sr. no: 5
Paper Name: Self-adaptive processing graph with operator fission for elastic stream processing
Method: Two algorithms are used for determining the operator state: short-term and mid-term
Problems: The graph topology adapted by the operators will result in an increase in the replicas using the elastic approach
Solutions: To anticipate operator load using statistical information accumulated from them

Sr. no: 6
Paper Name: Efficient Elastic Burst Detection in Data Streams
Method: Monitoring abnormalities in the sliding windows over data streams
Problems: Gamma burst detection
Solutions: Usage of novel data structure

Sr. no: 7
Paper Name: Proactive elasticity and energy awareness in data stream processing
Method: Predictive methodology using MPC
Problems: Elastic data streaming processing
Solutions: Creating a predictive controller improves the outcome

Sr. no: 8
Paper Name: Study and Comparison of Elastic Cloud Databases
Method: Comparing several databases
Problems: Identifying appropriate elasticity for databases
Solutions: Selecting appropriate by measuring elasticity and scalability

Sr. no: 9
Paper Name: Elasticutor: Rapid Elasticity for Realtime Stateful Stream Processing
Method: Executor-centric approach
Problems: Operator-level key repartitioning prohibits rapid elasticity
Solutions: Elasticutor framework that has two levels of optimization

Sr. no: 10
Paper Name: A framework for performance evaluation of complex event processing systems
Method: FINCoS framework
Problems: Evaluating the metrics of data stream processing systems
Solutions: Evaluation was done using correctness and capacity of the system, which again raised some issues

Sr. no: 11
Paper Name: Performance evaluation and benchmarking
Method: Calculation of main memory consumption and latency
Problems: Evaluating the metrics of data stream processing systems
Solutions: By calculating amount of memory consumption and latency, maximum peak latency and post-peak latency

Sr. no: 12
Paper Name: Adaptive provisioning of stream processing in clouds
Method: By calculating on the basis of throughput, processing latency and network latency
Problems: Evaluating the metrics in cloud environment
Solutions: Considering latency as the main metric and evaluating it

Sr. no: 13
Paper Name: Measuring performance of complex event processing system
Method: Calculating it using input tuples, keys and payload
Problems: Calculation of system latency
Solutions: Evaluating system latency and information latency

Sr. no: 14
Paper Name: Benchmarking elastic query processing on big data
Method: Elastic query processing benchmark
Problems: Evaluate the elasticity of the system
Solutions: By considering the two metrics, scaling overhead and elastic overhead

Sr. no: 15
Paper Name: Elastic Stream Processing in the Cloud
Method: Integrating elasticity into the cloud computing environments
Problems: Stream processing cannot process in highly dynamic environments
Solutions: Using eight rules of cloud to solve the stream processing problem


REFERENCES

[1] Andre Leon S. Gradvohl, “Metrics and Tool for Evaluating Data Stream Processing Systems,” 2018 6th International Conference on Future Internet of Things and Cloud Workshops. School of Technology, University of Campinas.
[2] M. R. N. Mendes, P. Bizarro, and P. Marques, “A framework for performance evaluation of complex event processing systems,” in DEBS'08: Proceedings of the second international conference on Distributed event-based systems, USA, Jul. 2008.
[3] M. R. Mendes, P. Bizarro, and P. Marques, “Performance evaluation and benchmarking,” ser. Lecture Notes in Computer Science, R. Nambiar and M. Poess, Eds. Berlin, Heidelberg, 2009.
[4] M. R. Mendes, P. Bizarro, and P. Marques, “Performance evaluation and benchmarking,” ser. Lecture Notes in Computer Science, R. Nambiar and M. Poess, Eds. Berlin, Heidelberg: Springer, 2009, ch. A Performance Study of Event Processing Systems, pp. 221–236.
[5] J. Cerviño, E. Kalyvianaki, J. Salvachúa, and P. Pietzuch, “Adaptive provisioning of stream processing systems in the cloud,” in Proceedings - 2012 IEEE 28th International Conference on Data Engineering Workshops, ICDEW 2012, 2012, pp. 295–301.
[6] T. Grabs and M. Lu, “Measuring Performance of Complex Event Processing Systems,” in Topics in Performance Evaluation, Measurement and Characterization. Seattle: Springer Berlin Heidelberg, 2012, pp. 83–96.
[7] G. M. Tiziano De Matteis, “Proactive elasticity and Energy Awareness in Data Stream Processing.”
[8] T. Dory, “Study and Comparison of Elastic Cloud Databases,” 21.
[9] D. S. Yunyue Zhu, “Efficient Elastic Burst Detection in Data Streams.”
[10] B. M. P. V. R. Thibault Dory, “Measuring Elasticity for Cloud Databases.”
[11] S. Schneider, H. Andrade, B. Gedik, A. Biem and K.-L. Wu, “Elastic scaling of data parallel operators in stream processing,” 10 July 2009.
[12] T. J. F. R. T. M. M. W. Z. Z. Li Wang, “Elasticutor: Rapid Elasticity for Realtime Stateful Stream Processing,” pp. 1-12, 2017.
[13] Z. J. G. H. C. F. Thomas Heinze, “Latency-aware Elastic Scaling for Distributed Data Stream,” 2016.
[14] B. S. S. D. Waldemar Hummer, “Elastic Stream Processing in the Cloud,” 2012.
[15] G. M. Tiziano De Matteis, “Keep Calm and React with Foresight: Strategies for Low Latency,” 2014.
[16] D. W. E. R. Nicolas Hidalgo, “Self-adaptive processing graph with operator fission for elastic stream,” 11 June 2016.
[17] Y. Ahmad and U. Çetintemel, “Network-aware query processing for stream-based applications,” in VLDB, pp. 456–467, 2004.
[18] R. C. Fernandez, M. Migliavacca, E. Kalyvianaki, and P. Pietzuch, “Integrating scale out and fault tolerance in stream processing using operator state management,” in SIGMOD, pp. 725–736, 2013.
[19] T. Heinze, V. Pappalardo, Z. Jerzak, and C. Fetzer, “Auto-scaling techniques for elastic data stream processing,” in ICDEW, 2014.
[20] M. Stonebraker, U. Çetintemel, and S. Zdonik, “The 8 requirements of real-time stream processing,” ACM SIGMOD Record, pp. 42–47, 2005.
