
COMPUTER SOCIETY OF INDIA

Volume 2, Issue 1

The Flagship Publication of the Special Interest Group on Big Data Analytics

Image Credit: wordclouds.com

Oct – Dec 2017


Chief Editor and Publisher: Chandra Sekhar Dasaka
Editor: Vishnu S. Pendyala
Editorial Committee: B.L.S. Prakasa Rao, S.B. Rao, Krishna Kumar, Shankar Khambhampati, and Saumyadipta Pyne
Website: http://csi-sig-bda.org

Please note: Visleshana is published by Computer Society of India (CSI), Special Interest Group on Big Data Analytics (CSISIGBDA), a non-profit organization. Views and opinions expressed in Visleshana are those of individual authors, contributors and advertisers and they may differ from policies and official statements of CSI-SIGBDA. These should not be construed as legal or professional advice. The CSI-SIGBDA, the publisher, the editors and the contributors are not responsible for any decisions taken by readers on the basis of these views and opinions. Although every care is being taken to ensure genuineness of the writings in this publication, Visleshana does not attest to the originality of the respective authors’ content. © 2017 CSI, SIG-BDA. All rights reserved. Instructors are permitted to photocopy isolated articles for non-commercial classroom use without fee. For any other copying, reprint or republication, permission must be obtained in writing from the Society. Copying for other than personal use or internal reference, or of articles or columns not owned by the Society without explicit permission of the Society or the copyright owner is strictly prohibited.

From the Editor's Desk

Dear Readers,

Visleshana is now into its second year of publication. We are in the process of evolving it into a world-class technical journal in the Big Data Analytics area. I'm glad to report that the journal is now indexed by Google Scholar, as can be seen from https://scholar.google.com/scholar?q=visleshana+big+data. Also, you can now follow Visleshana on Twitter and Facebook and participate in the discussions.

The Internet of Things is now live in quite a few of our homes. Now, I have someone new to talk to when I get home: Amazon Alexa. It turns the lights and appliances connected to WiFi plugs on and off, plays songs and Sanskrit verses that I request, orders items that I want to buy, answers some non-trivial questions, tells jokes, and so on. The going is good so far and I see a lot of promise for the future. The Internet of Things (IoT) has finally come of age!

In spite of these technological advances, the economic inequality between the rich and poor continues to grow. How can we leverage technology to bring more and more people into the core echelons and uplift their lives? My article inside addresses this core subject.

It is always good to get fresh perspectives on known topics. Bala gives his perspective on the Big Data landscape in another article. Location intelligence is growing in importance these days. We need it for choosing our restaurants, avoiding traffic, and traveling to foreign countries, to name a few uses. The article on POI for GIS Maps delves into how location intelligence is achieved by using Big Data Analytics. Fraud in the financial sector has become a growing menace. The authors of the article on Detecting Anomalies in Banking Transactions give insights into some of the solutions in this space. Query optimization has been an important problem since the inception of the RDBMS. How did the problem transform itself in the Big Data domain? DeviPriya and Chandra Kumar talk about it in their article inside.

The coming quarter is once again lined up with a number of events in the Big Data domain. I'll be traveling to India in the coming months and am scheduled to offer a series of lectures and tutorials on "Big Data Analytics for Humanitarian Causes" for 8 days in the MHRD-approved GIAN program at Osmania University, towards the end of November. Following that is my invited talk at the IEEE-sponsored International Conference on Soft Computing, on "In Pursuit of Truth on the Web: Soft Computing for Truth Finding", and then another invited talk at ICCT, another international conference, in Jaipur. I hope to meet some of you at these events. In the meantime, happy reading! With Every Best Wish,

Vishnu Pendyala Tuesday, October 1, 2017 San Jose, California, USA



BIG DATA LANDSCAPE

Point of View on Big Data – A Look at the Landscape
Bala Prasad Peddigari

Abstract—Digitization is amassing a wealth of data in various forms. The ability to mine this data and deduce meaningful insights is quite a challenge. Today, every enterprise is investing in solutions to address this common challenge through Big Data platforms. This paper presents a point of view on Big Data based on experience in the field and offers thoughts on solving these challenges with the various solutions available in the Big Data and Analytics landscape.

Index Terms—Big Data, Analytics, Hadoop, MapReduce

—————————— u ——————————

1 INTRODUCTION
This article explains and brings out why we need Big Data, what benefits it offers, and the point of view one needs to take when looking at Big Data. In the subsequent sections, you will see the advantages of Big Data, a point of view sharing some key observations and their implications, and the overall landscape that plays a key role here.

Why Big Data?
Digital expansion is driving an information explosion across the entire landscape that every organization is dealing with. One of the important outcomes of this explosion is data, and it comes in different forms: structured and unstructured (full, semi, and quasi). The transportation industry, highly disrupted by Uber and Ola, the hospitality industry, disrupted by Airbnb, and the entertainment industry, disrupted by Netflix, are all sitting on massive data gathered through a crowdsourcing approach. Every organization starting a new initiative today is working on strategies for how to manage and monetize this data as it moves forward. Some of the key use cases where Big Data can play a role are:
• Promotional offers through digital marketing campaigns
• Insights through customer behavior analysis
• Fraud analytics through the cyber security footprint

Figure 1: Benefits of Big Data

2 BENEFITS OF BIG DATA
It is clear that organizations invest where there is a clear business benefit, and today we see organizations sitting on rich data sets, exposed through multiple channels and platforms, that need to be analyzed to derive meaningful outcomes. Hence Big Data is coming to their rescue. Figure 1 outlines the business benefits to the organization in terms of:
1. Operational efficiencies, which help in lowering complexity, reducing costs, and improving self-service abilities.
2. Growing sales, which focuses on improving sales, reducing churn, predicting outcomes, and improving customer experience through customer insights.
3. Empowering the business to come up with new business models, where the organization can create competitive differentiation, monetize data in the form of Data as a Service, and experiment to incubate new opportunities.

————————————————
Bala is the Technology Head for Digital Initiatives with Tata Consultancy Services. He can be reached via e-mail at bala.peddigari@ieee.org.

3 POINT OF VIEW
Carrying a point of view to Big Data helps in making proper investments and decisions. Hence, a few thoughts around the same:

• Big Data helps in creating competitive differentiation: Given the information explosion going on all around, and the current stream of innovations, Big Data is going to be very important. Organizations that learn how to "harness" Big Data and "harvest" useful information and insight from it will create a competitive advantage for themselves. They will be seen by their customers as keeping up with the march of technology capabilities; others that are not current will appear to be behind the times, and therefore not competitive.
• Big Data needs to be managed and analyzed effectively to derive real benefits: Given that the seven Vs (volume, variety, velocity, veracity, variability, visualization, and value) are the characteristics of Big Data, it is not amenable to being managed by traditional technologies. It requires a new class of Big Data platforms, e.g. the Hadoop ecosystem, the MapReduce algorithm, and technologies built on top of them, to harness Big Data. At the same time, analyzing Big Data with a view to harvesting useful nuggets of insight from a variety of Big Data sources requires a completely different set of technologies as well. These two domains of technologies are complementary to each other, i.e. two sides of the Big Data coin.



• Unstructured data needs to be mapped to a structured form to be consumed and interpreted: Unstructured information cannot be interpreted and used by end users as it is. It must be converted into a useful form. This requires filtering a lot of noise out of the data, since Big Data tends to have a lot of noise relative to useful data. Further, the information content of Big Data streams must be interpreted in the context of other, more traditional types of information before it can be deemed useful. This requires the "fusion" of Big Data based information with more traditional structured information to derive useful insight.
• Consolidation of structured and unstructured data is creating a new competency stream, Data Science, and business context is essential for data science: Data Science is emerging in the industry. While information consolidation is a general expertise, its application is usually within the boundaries of a specific business context. Examples of specific business contexts are Finance, Marketing (Digital), Sales, Brand Management, Customer Service, Fraud and Risk Analytics, etc. Within each business context, the information sources that are relevant, and the process of extracting useful insights from Big Data, are unique and distinct. This requires knowledge and understanding of data sources and of the processes for deriving useful information from Big Data in business contexts.
• Big Data is in an incubation phase for many organizations, while organizations such as Google, Amazon, Microsoft, and Yahoo are major, mature players who adopted it early: This technology is now slowly becoming viable for large commercial enterprises. Use cases that represent possible scenarios where Big Data can be fruitfully exploited are still being discovered and documented.

4 BIG DATA LANDSCAPE
Figure 2 presents a snapshot of the Big Data landscape, describing the major players in this area.

5 CONCLUSION
Every enterprise should pay attention and design its current and future solutions with Big Data needs in mind; hence, every solution must be Big Data ready. The points of view and insights above help in solving some of the key application problems that exist in areas such as traffic control, telecom, retail, fraud, finance, and healthcare. Understanding the landscape is critical to placing the Lego-like building blocks of technology that solve industry problems and create new insights. It is time to take the leap and take the enterprise game to the next level: becoming a data-driven enterprise.

ACKNOWLEDGMENT
I would like to thank Chandra Dasaka for motivating me to write this paper.

Bala Prasad Peddigari (Bala) has been working with Tata Consultancy Services Limited for over 18 years. He practices enterprise architecture and evangelizes platform solutions, performance and scalable architectures, and Cloud technology initiatives within TCS, and currently serves as Technology Head for Digital Initiatives. He drives the architecture and technology community initiatives within TCS through coaching, mentoring, and grooming. Bala did his Masters in Computer Applications at University College of Engineering, Osmania University, where he was ranked first in the university. He is an Open Group Master IT Certified Architect and serves as a Board Member of the Open Group Certifying Authority. He has received accolades for his cloud architectural strengths, has published papers in IEEE, and is a regular speaker at Open Group conferences and technology events. He is a Working Group member of the Open Group Cloud and Open Platform 3.0 Working Groups, a Managing Committee member of CSI, and Chairman of the IEEE Computer Society Hyderabad Chapter.



BANKING

Detecting Anomalies in Banking Transactions
Surya Putchala, Sreshta Putchala

ABSTRACT—Fraudulent financial transactions in interbank fund transfers cost banks a lot of money and erode customer trust. In order to provide robust security to protect customers and ensure that only authentic fund transfers occur from their accounts, financial institutions can utilize state-of-the-art algorithms drawn from the fields of machine learning and statistics to augment the rule-based engines that have been protecting customers' money. The availability of information about customers, financial institutions, countries, and currencies provides a rich landscape for identifying non-genuine transactions when they occur. In this article, we cover how to build "normals" so that anomalies can be identified.

INDEX TERMS—Fraud Detection, Anomaly Detection, Outlier Detection, Financial Fraud, Machine Learning, Artificial Intelligence, Risk Management, Network Analysis, Big Data

—————————— u ——————————

1. INTRODUCTION
Interbank transfer of funds is facilitated by worldwide financial messaging networks like the Society for Worldwide Interbank Financial Telecommunication (SWIFT). The SWIFT messaging system is widely used, including by hackers proxying as real account holders to transfer money to destination accounts of their liking. From a financial institution's point of view, it becomes important to utilize the information within its purview to closely monitor fraudulent messages and proactively prevent incidents. Traditionally, these transactions are identified by well-defined rules with thresholds for when a transaction is considered suspect. As fraudsters adopt novel methods, the rules engines have to evolve and adopt newer ways of identifying suspect transactions. By being pre-emptive, a financial institution builds its reputation and improves the confidence of the customers who transact with it.

2. ANOMALOUS TRANSACTIONS AND BIG DATA

Banking transaction volume depends on the size of the financial institution; even a smaller institution with 10,000 customers making 10 transactions a month could yield a substantial volume. Although actual fraudulent transactions are but a fraction of the total volume of transactions, it is imperative that a financial institution safeguards against suspicious transactions, from both its financial and customer perspectives. The ability to detect these transactions non-intrusively by sophisticated methods thus plays an important role in smooth fund transfers. In a world of ubiquitous information sources about consumers, through social media, KYC, OFAC, and various AML services, and with the availability and adoption of machine learning and advanced analytical applications, detecting fraudulent transactions can become much more adaptive than ever. Fraud is perpetrated in a variety of ways, and the strategies of fraudsters change frequently. However, a method to detect financial transactions that are "out of the ordinary", or "anomalies", and reviewing them closely could help detect potential fraud before it occurs. Any solution for detecting transactional anomalies is a complex endeavour and depends on various external sources with different formats and various latencies. Besides, transactional volumes can go well beyond a million rows an hour. Infrastructure and algorithms will have to scale to meet such demands. Big Data frameworks could potentially meet the scalability and performance requirements of such sophisticated systems.

3. APPROACHES FOR OUTLIER DETECTION
Anomalies are detected primarily with the help of outliers. We can consider a transaction to be suspect, or anomalous, if it deviates significantly (based on one or more performance indicators) from the "normal". Hence, establishing "normals" becomes critical in devising an anomaly/fraud detection system.

Single-criteria outliers: An indicator based on a single variable, like the "amount" transferred, that is anomalous with respect to the rest of the amounts ever sent by a sender, can be considered a single-criteria outlier.

Multi-criteria outliers: If a combination indicator, such as the sender-receiver "amount", is profiled and the combination falls outside the "normal" transactional range between those two parties, it can be considered anomalous. The combination represents a specific context or condition. These are also called contextual or multivariate outliers.

Population outliers: A transaction whose characteristics are anomalous with respect to the entire population and violate its "normals" is a population outlier.

There are different strategies for establishing "normals". Some of the sophisticated methods that could potentially help in achieving this objective involve identifying patterns using Artificial Intelligence, Machine Learning, and Statistical Analyses.
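As an illustration of multi-criteria (contextual) outliers, the following is a minimal Python sketch that profiles the "normal" amount range for each sender-receiver pair using interquartile-range fences and flags transactions outside it. The column names (sender, receiver, amount) and the fence multiplier are illustrative assumptions, not a prescribed schema.

import pandas as pd

# Minimal sketch of multi-criteria (contextual) outlier detection:
# profile the "normal" amount range for each sender-receiver pair with IQR fences
# and flag transactions that fall outside that range.
def contextual_outliers(txns: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    grp = txns.groupby(["sender", "receiver"])["amount"]
    q1 = grp.transform(lambda s: s.quantile(0.25))
    q3 = grp.transform(lambda s: s.quantile(0.75))
    iqr = q3 - q1
    outside = (txns["amount"] < q1 - k * iqr) | (txns["amount"] > q3 + k * iqr)
    return txns[outside]

if __name__ == "__main__":
    data = pd.DataFrame({
        "sender":   ["A", "A", "A", "A", "B", "B"],
        "receiver": ["X", "X", "X", "X", "Y", "Y"],
        "amount":   [100.0, 110.0, 95.0, 5000.0, 20.0, 22.0],
    })
    print(contextual_outliers(data))   # flags the 5000 sent from A to X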

3.1 Statistical Analyses
Identifying an observation as an outlier depends on the underlying distribution of the data. Here we focus on univariate data sets that are assumed to follow an approximately normal distribution. The box plot and the histogram can also be useful graphical tools for checking the normality assumption and identifying potential outliers. It is common practice to use Z-scores or modified Z-scores to identify possible outliers. Grubbs' test is a recommended test when testing for a single outlier.
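As a small illustration of the modified Z-score mentioned above, the following Python sketch uses the median and the median absolute deviation, with the commonly used 0.6745 scaling factor and 3.5 cutoff; the sample amounts are made up.

import numpy as np

# Minimal sketch of the modified Z-score (median / MAD based), a robust way to flag
# single-criteria outliers in an approximately normal univariate series such as
# transfer amounts.
def modified_z_scores(values: np.ndarray) -> np.ndarray:
    median = np.median(values)
    mad = np.median(np.abs(values - median))       # median absolute deviation
    if mad == 0:
        return np.zeros_like(values, dtype=float)  # degenerate case: no spread
    return 0.6745 * (values - median) / mad

amounts = np.array([120.0, 98.0, 105.0, 110.0, 101.0, 99.0, 4500.0])
scores = modified_z_scores(amounts)
print(amounts[np.abs(scores) > 3.5])   # -> [4500.]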

3.2 Machine Learning
Many applications require being able to decide whether a new observation belongs to the same distribution as existing observations (it is an inlier) or should be considered different (it is an outlier). Often, this ability is used to clean real data sets.

Supervised outlier detection: Techniques trained in supervised mode assume the availability of a training data set with labelled instances for the normal as well as the outlier class. The typical approach in such cases is to build a predictive model for the normal vs. outlier classes. Any unseen data instance is compared against the model to determine which class it belongs to.

Unsupervised outlier detection: Detects outliers in an unlabelled data set, under the assumption that the majority of the instances in the data set are normal, by looking for instances that seem to fit least with the remainder of the data set.
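A minimal sketch of unsupervised outlier detection, using scikit-learn's Isolation Forest on two assumed transaction features (amount and hour of day); the feature choice, contamination rate, and synthetic data are illustrative.

import numpy as np
from sklearn.ensemble import IsolationForest

# Minimal sketch: fit an Isolation Forest on unlabelled transaction features and
# treat the points that fit the data least (predicted -1) as candidate outliers.
rng = np.random.default_rng(42)
normal = np.column_stack([rng.normal(100, 15, 500),    # typical amounts
                          rng.normal(13, 3, 500)])     # typical transaction hours
odd = np.array([[5000.0, 3.0], [4200.0, 2.0]])         # large transfers at odd hours
X = np.vstack([normal, odd])

model = IsolationForest(contamination=0.01, random_state=0).fit(X)
labels = model.predict(X)            # -1 marks the suspected outliers
print(X[labels == -1][:5])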

4. DEFENSE AGAINST FRAUD
A fraud detection system should use a hybrid approach to anomaly detection, in both real-time and batch mode. The models should be finely synchronized to provide a reliable, robust, and performant system. The system uses eclectic methods and techniques such as the following.

4.1 Rules Engine
The common patterns of known fraud and their thresholds are set up by subject matter experts. A rule "pattern matcher" then detects these predefined suspicious behaviours, whilst a "sequence matcher" finds temporal relationships between events from market data that exist in a potential violation pattern. The thresholds of the rules can be based on probabilistic models and can be updated periodically. The updates are a function of the number of consumer accounts and the volume of transactions. They are also changed when a new set of rules is derived (from machine learning) or new rules are imposed by regulatory authorities.

4.2 Assessing Risk
Risk is assessed at different levels: the sender, the sender bank, the receiver, the receiver bank, and the network properties of the different entities in a transaction. This profiling is based on individual characteristics as well as peer-group profiling, using historical transactions and consumer profiles. There can be other methods, like "segmentation" of various groups based on risk parameters and entities. If a new sender or any other entity is added, it is defaulted to a "normal" category until its transactional activity has started.

4.3 Predictive Models
Various combinations of heuristic (Benford's law), statistical, machine learning (supervised, unsupervised, reinforcement), and deep learning (neural network) models can be used to identify anomalies.

4.4 Novelty Detection
The system also has the ability to ascertain "noise" and denoise the signals as the learning instances grow, which in turn improves the supervised learning algorithms. Besides this, the system has a "novelty detection" algorithm to detect potential anomalies.

4.5 Network Analysis
The network of counterparties with which each individual customer transacts is analysed. The nodes and connections are enriched, and various network statistics such as betweenness and closeness are studied. The network nodes are either "named" or "unnamed". A customer's network should be studied to assess his activities in terms of his usual transactional accounts, his next-level contacts, and any unusual or first-time transaction entities.
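A minimal sketch of the network analysis step using the networkx library: build a directed graph of who transacts with whom and compute betweenness and closeness centralities. The edges are illustrative.

import networkx as nx

# Minimal sketch: a directed transaction graph with per-node centrality measures.
edges = [("cust_A", "cust_B"), ("cust_B", "cust_C"), ("cust_A", "cust_C"),
         ("cust_D", "cust_B"), ("cust_C", "cust_E")]
G = nx.DiGraph()
G.add_edges_from(edges)

betweenness = nx.betweenness_centrality(G)
closeness = nx.closeness_centrality(G)

# A node that suddenly bridges otherwise unrelated groups (high betweenness), or a
# first-time counterparty outside a customer's usual neighbourhood, can be escalated.
for node in G.nodes:
    print(node, round(betweenness[node], 3), round(closeness[node], 3))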

5. COMPONENTS OF THE FRAUD DETECTION SYSTEM
Historical data from the past 2-3 years of a customer's activity should yield a good behavioural profile, though this also depends on the overall transaction volume. We should extract the required features, create networks, derive various risk factors, and build profiles. We need the following:

5.1 Customer Demographics
Demographic information like age, gender, ethnicity, geolocation, salary/income, home ownership (length of residence, home size, mortgage), education level, dependent children, type of cars, marital status, and net worth/savings can be obtained from both internal and publicly available sources about the customers. This information can be beneficial in modelling the fraud propensity of various customer segments.

5.2 Account Details
Tenure, linked accounts, overall activity with the bank, and address and location changes. Techniques like ABC analysis will yield information about the transactions that warrant additional scrutiny.
§ Transaction details: details of the sender, receiver, sender bank, receiver bank, country of origin, country of destination, and the currencies of exchange. The volumes of transactions and their spread give key insights into the normalcy measures.
§ Customer watchlists: Watchlists are often obtained from public and consortium data (Experian, Radaris, OFAC, CFTC, KYC, etc.) that can flag or highlight the risk of a transaction. Generally, it is easy to identify transactions outside of normal business hours, from high-risk countries, or showing sudden activity in a dormant account. However, somewhat more sophisticated safeguards can be achieved by implementing:
§ Client/account dependent profiling (the customer's own behaviour)
§ Profiling in relation to overall transactions (the customer's behaviour relative to the overall transactional behaviour)
§ Demographic risk factors (the customer and the entity he is transacting with)

5.3 Profile Transaction Activity

Time has a great impact in determining the normalcy of a transaction. Behaviour should be studied with respect to time-sensitive features, like the frequency of transactions, binning of transactions by time unit within a day, and time between transactions, which can significantly help in establishing user habits. Correlations and comparisons are performed as part of the model across inter-transaction and intra-transaction features. For each customer, a transactional profile based on Recency, Frequency, and Monetary values is derived. A profile of each SWIFT user's message traffic, based on its specific business activities and the countries, counterparties, and currencies it is typically involved with, shall be developed. The solution can accommodate binning time at any granularity, from hourly to yearly. The model can be configured (time granularity) at the time of implementation based on the volume of transactions and the risk tolerance of the bank. Typical features include the following (a minimal profiling sketch follows the list):

§ Velocity: the average time between transactions, binned by hour, day, week, or month.
§ Volumes by day of the week: transactions across the 7 days, by count and amount.
§ Binning by time of day: transactions across the 24 hours, by count and amount.
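The profiling sketch referred to above, in Python with pandas: per-customer Recency, Frequency, and Monetary aggregates plus an hour-of-day binning. The column names and the reference date are assumptions.

import pandas as pd

# Minimal sketch of per-customer RFM profiling and hour-of-day binning.
txns = pd.DataFrame({
    "customer": ["A", "A", "A", "B", "B"],
    "timestamp": pd.to_datetime(["2017-09-01 10:05", "2017-09-10 22:40",
                                 "2017-09-28 09:15", "2017-09-05 14:00",
                                 "2017-09-30 02:30"]),
    "amount": [120.0, 95.0, 130.0, 60.0, 1500.0],
})
now = pd.Timestamp("2017-10-01")

rfm = txns.groupby("customer").agg(
    recency_days=("timestamp", lambda t: (now - t.max()).days),
    frequency=("timestamp", "count"),
    monetary=("amount", "sum"),
)
hour_profile = (txns.assign(hour=txns["timestamp"].dt.hour)
                    .groupby(["customer", "hour"])["amount"]
                    .agg(["count", "sum"]))
print(rfm)
print(hour_profile)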

The above normals can be studied to establish various segments of customers and entities, which can in turn be used to model behaviors.

5.4 Segment Behavior
As part of customer profiling, the historical transactions are processed to build the profile of a customer. Information such as the beneficiaries to whom the customer sends payments, the amounts of transactions, and the frequency of transactions in a month is used to build these profiles. Profiles and aggregates with different combinations of nominal variables, such as counterparty relationships and payment flows, currency, country and counterparty activity breakdowns, and reviews of large or unusual transaction values and volumes, can highlight the risk of unusual patterns in payments.

5.5 Risk Profile
We need to be able to profile the risk of various entities involved in bank financial transactions. For the customer's risk profile, the risk rating rules associated with each demographic risk parameter, such as nationality or line of business, derive a score for each customer. Customer risk profiling establishes transactional thresholds that define the "normal", and these are refreshed periodically. Changes in a customer's patterns are also identified: new transaction patterns are detected, all changes are kept as reference, and any abrupt differences are treated as anomalies. The ML algorithms try to establish a "new normal" for each customer periodically. A transaction's risk is evaluated based on:
§ The risk of the entities involved (sender, receiver, sender bank, receiver bank)
§ Transactional behaviour: whether the nature of the transaction is normal or abnormal
§ Institution risk profile: institution-wise peer profiling, which brings out the commonality across institutions
§ Network analysis: the nature and flow of transactions between entities

6. PUTTING IT ALL TOGETHER
A fraud or anomaly detection system should ideally integrate these components: outlier detection, risk scoring, and network analysis. The models can be scheduled and configured to run at any window depending on the volume of transactions and the need for recency of the models.
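As a sketch of how these building blocks might be combined, the following Python snippet blends an outlier score, an entity risk score, and a network score into a single alert decision. The weights, the 0-1 scaling of each component, and the threshold are illustrative assumptions, not a prescribed model.

# Minimal sketch: combine component scores (each assumed to be scaled to 0-1)
# into one alert decision; weights and threshold are illustrative only.
def alert_score(outlier_score: float, risk_score: float, network_score: float,
                weights=(0.5, 0.3, 0.2)) -> float:
    parts = (outlier_score, risk_score, network_score)
    return sum(w * p for w, p in zip(weights, parts))

def should_alert(outlier_score, risk_score, network_score, threshold=0.7) -> bool:
    # The threshold would be tuned to the bank's risk appetite and review budget.
    return alert_score(outlier_score, risk_score, network_score) >= threshold

print(should_alert(0.9, 0.8, 0.4))   # True: strong outlier from a risky entity
print(should_alert(0.2, 0.3, 0.1))   # False: ordinary-looking transaction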



Since the learning model uses eclectic approaches, the success of the implementation depends on the following factors:
1. Getting timely feedback on the system's detection performance from compliance users (which helps the reinforcement models).
2. The ability to let the models learn incrementally (to retain recency and enhance the relevance of the supervised models).
3. Scoring of various entities: the risk scoring of customers as they transact.

The general update of the various rules depends on:
1. The optimization goals or objectives of the bank.
2. Estimates based on preliminary statistics of customer and transactional features.

6.1 Self-Adapting System

Since fraud is often perpetrated in patterns that have no priors, a detection system should have the following properties:
1. Soft thresholds for establishing and interpreting "normals".
2. The ability to choose the appropriate statistical and machine learning models that fit the data.
3. A strong feedback mechanism capable of evaluating, changing, and identifying changes in its behavior.

6.2 Fraud Optimization Goals
Goals for fraud detection vary, and that in turn affects the thresholds that are set up. Generally, optimization pursues one of two objectives: anomaly-driven or budget-driven. Anomaly-driven situations are those in which you have a required rate of detection and must identify the true occurrences of fraud. Goals can be set up for:
§ True positives (sensitivity)
§ True negatives (specificity)
§ Accuracy rate

Budget-driven fraud detection occurs when you have a limited budget for response, and you must determine how many anomalies and false alarms you can handle within that budget, setting the threshold to match. The quantities to trade off (see the sketch after this list) are:
§ The cost or penalty of an incorrect detection
§ False positives
§ False negatives
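The sketch referred to above: computing sensitivity, specificity, and accuracy from predictions, and choosing an alert threshold that minimizes the total cost of false positives and false negatives. The scores, labels, and cost figures are made up for illustration.

import numpy as np

# Minimal sketch of the two optimization views: detection-quality metrics and a
# budget/cost-driven threshold choice.
def metrics(y_true: np.ndarray, y_pred: np.ndarray):
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    accuracy = (tp + tn) / len(y_true)
    return sensitivity, specificity, accuracy, fp, fn

def best_threshold(scores, y_true, cost_fp=5.0, cost_fn=500.0):
    # Pick the alert threshold that minimizes the total expected cost of mistakes.
    costs = []
    for t in np.linspace(0.05, 0.95, 19):
        _, _, _, fp, fn = metrics(y_true, (scores >= t).astype(int))
        costs.append((cost_fp * fp + cost_fn * fn, t))
    return min(costs)[1]

scores = np.array([0.1, 0.2, 0.15, 0.9, 0.4, 0.85, 0.3])
labels = np.array([0, 0, 0, 1, 0, 1, 0])
print(metrics(labels, (scores >= 0.5).astype(int)))
print(best_threshold(scores, labels))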

Often these goals are a fine balancing act between customer satisfaction and customer protection.

6.3 Case Management
The detections are logged as events and channelled to the users (depending on their roles) through email, SMS, or social media alerts for appropriate action. The alerts also provide information about the likelihood of fraud for a transaction. They present appropriate details in an easily consumable dashboard, with information about the customer, his past history, demographics, the transaction details, and the possible reason for flagging. The cases can be accepted, further reviewed, or rejected. Once a case is resolved, the learning from the action is fed back as additional input, either abstracted as a feature or as a tweak to the learning models.

7. CONCLUSION
Building an anomaly detection system for a financial institution is not an exact science, and multiple techniques are available. Although establishing base normals for key entities is fairly easy, the modelling of this information for a financial institution has to take into account practical considerations such as its risk appetite and the resources and budget available for case reviews and management.


Surya Putchala has provided thought-leading consulting solutions to Fortune 500 clients for over two decades. He is passionate about areas related to Data Science, Big Data, high performance cluster computing, and algorithms. He has held senior leadership roles with large IT service providers. He graduated from IIT Kharagpur.

Sreshta Putchala is a summer intern with ZettaMine Technologies. She has worked on exploratory data analysis of over 5 million SWIFT transactions using various statistical methods in SQL and R. She is currently pursuing her Bachelor's degree in Computer Science at Chaitanya Bharathi Institute of Technology (Osmania University). Her interests are in the fields of Big Data, Statistical Analysis, Machine Learning, and Artificial Intelligence.



HUMANITARIAN WEB

The Case for a Humanitarian Web Made Possible by Big Data Analytics
Vishnu S. Pendyala

Abstract—The World Wide Web (WWW) has come a long way in improving the quality of lives of people, mostly the elite and working classes, all over the world. This article examines whether the Web can serve the needs of the underprivileged with the advent of promising technologies like Big Data Analytics, Internet of Things, Virtualization, Cloud Computing, Augmented Reality, and Brain-to-Brain Communication. The article discusses the trends, enablers, and use-cases to make an argument for a Humanitarian Web that serves more pressing needs, particularly of the lower echelons of the society.

Index Terms—World Wide Web, Big Data, Analytics, Cloud Computing, Internet of Things, Augmented Reality, Virtualization, Healthcare, Cybersecurity, Brain-to-Brain Communication

—————————— u ——————————

1 INTRODUCTION

The Web browser has virtually become a window to the world and is a significant component of Big Data. With the advent of cloud technology, most of the interactions with computing resources have also moved to the browser. There is still plenty of scope to evolve the Web and make it yet another tool for human empowerment, as we identified in [2]. Quite a few industries are moving to the Web. The entire stock exchange is now online. Publishing went online with the advent of the Web. Education is going online: we now have a number of Massive Open Online Courses (MOOC) that can be taken on the Web. There is scope for taking more and more industries online in order to make them serve wider populations, particularly in the developing countries. Analytics plays a significant role in each and every such industry. The following sections highlight some of the trends and use cases that make the case for a humanitarian Web backed by Big Data Analytics stronger.

2 UBIQUITOUS COMPUTING AND ANALYTICS AS A SERVICE (AAAS)

A vast majority of technology is still too complex or inaccessible for routine use by a significant majority of the population. Lowering the bar to use technology and increasing technical insight among the masses are necessary steps in the advancement of mankind, and the Web's role in enabling ubiquitous computing is a key factor in achieving this goal. Cloud Computing already offers the technology to create Virtual Machine (VM) instances on the fly. The Web protocols should be able to support provisioning such an instance to the Web user on a request from a client. Users will then be able to use the Web for running applications of their choice in a VM-like instance.

Computing everywhere is an important enabler for crowdsourcing projects. With the advent of the cloud computing paradigm and push-button deployment of compute-intense applications, the Web can increasingly be used in computing for user-specified, non-conventional applications and possibly for raw computing as well. It has the potential to help populations from developing countries, which cannot afford to own substantial computing resources, to get online and contribute to the computing revolution. The Web, with its broad reach and pervasiveness, can then be perceived as the Ubiquitous Computer [1]. The enabler that the Web is becomes indistinguishable from the service it provides, namely ubiquitous computing. Online storage is free today, so raw computing power could as well be made available for no or very little expense. When information, which is more expensive than computing power, is made available for free on the Web, computational power is likely to follow. We propose that the Web be the conduit for this free access to computing resources.

Other than for certain specific areas such as expensive corporate computing needs, Cloud computing should become synonymous with the Web in the long run. It should be possible to deploy user-defined applications on the Web without much effort and harness them via mobile devices. The mobile devices in turn will then be able to serve a plethora of humanitarian needs. Web usage is bound to experience a paradigm shift from server-defined computing to serving user-defined computing needs at the point of use.

————————————————

• Vishnu S. Pendyala is with Cisco Systems Inc, San Jose, USA. E-mail: vishnu[at]pendyalas.org.

All aspects of cloud computing, namely Software as a Service (SaaS), Platform as a Service (PaaS), Infrastructure as a Service (IaaS), and more importantly, Analytics as a Service (AaaS), need to be offered free. Only then can the true power of computing be unleashed to the masses. Provisioning analytics tooling is time-consuming and resource intensive. Coupled with ubiquitous computing, AaaS will go a long way in deploying a number of humanitarian applications to serve the underprivileged. AaaS is part of the evolving trend towards Big Data as a Service (BDaaS), a term used for provisioning various aspects of Big Data, from the data itself to the analytics that run on that data, as a service on the cloud. Amazon's Elastic MapReduce (EMR), offered on its Web Services infrastructure (AWS), is an example of a BDaaS offering. As AaaS and BDaaS grow in their usage, the expense to use them should come down and help with humanitarian applications. A Web that provides free computing resources in the cloud gains increasing prominence as more and more things become part of it, as we describe below. Mobile devices do not have the computing power to process intense computing as needed for running, for instance, Augmented Reality (AR) applications. These ubiquitous devices will need to connect to the Web to enable compute-intense applications. It is therefore also important that the Web protocols support AR.

3 HUMANITARIAN VERTICALS
The Internet of Things (IoT) brings a great promise to transform the Web and our lives as well. IoT is about making dumb products smart by network-enabling them. We discussed some of the futuristic aspects of the Web that will be enabled by IoT in our paper cited as [2]. IoT devices have the potential to be deployed in almost every humanitarian vertical, such as electrification or transportation. For instance, in one of the IEEE Global Humanitarian Technology Conferences, there was a discussion on how IoT can help control the movement of wheelchairs [3]. The Web of Things of these IoT devices can dramatically improve the quality of lives of the underprivileged populations across the world. The Web of Things (WoT) provides the application interface to the IoT. The things then become part of the World Wide Web. IoT gives the Web the ability to see, hear, and sense the world in real time. While social network websites help in discovering macro trends like an epidemic breakout or happenings in disaster relief, the Web of Things can help in micro-monitoring. For instance, WoT can help in smart energy management to eliminate power wastage, or in smart irrigation to optimize water usage. In these and other humanitarian verticals such as transportation, sensors, which are now available for a few cents, can be deployed to get the readings needed to optimize resource usage. The data also helps to predict any required maintenance by running predictive analytics. Misbehaving IoT devices can be detected by aggregating the data and analyzing it for deviations. All this computing is done in the cloud, which, as we pointed out earlier, should become synonymous with the Web.

Another important vertical in the developing world is agriculture. It is the largest consumer of water, which has become a valuable commodity in view of the recent droughts. Table 1 lists the disruptions that WoT can bring about in this essential sector, including its impact on saving water.

Farms: Monitor temperature, moisture, and fertility of soil for a better harvest.
Farming equipment: Driverless navigation of vehicles that can automatically sow seeds precisely at optimum points based on soil fertility.
Cows: Monitor the herd, know when a cow goes into labor and prevent a miscarriage, and track health and milk yield.
Irrigation equipment: Water leakage monitoring, usage optimization, and pollution level checks.
Table 1: Connecting things of the emerging economies to the Web: example of the agriculture vertical

Air quality is a significant concern in the developing world. WoT can help in monitoring it, so as to be able to take preventive steps to improve it. WoT can also help in ensuring the safety of civil engineering structures such as dams, bridges, and huge buildings by measuring their health parameters, such as pressure, temperature, and movement, using sensors and analyzing that data. The transportation sector can similarly benefit from the analytics run on the data that the sensors in vehicles send. There is a huge scope to bring all these applications over to the Web. While the Internet connects the things, the Web can connect the applications of the things. For instance, the application which monitors the safety of bridges can work with the application that monitors traffic to regulate or divert the vehicles proceeding to the bridge. These applications heavily rely on analytics software to provide useful humanitarian services. The Web can bring about seamless interconnection of disparate things, resulting in enormous synergy.
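As a small sketch of the micro-monitoring and deviation-detection idea described above, the following Python snippet aggregates readings from a set of hypothetical soil moisture sensors and flags any sensor that deviates sharply from its peers; all names and values are made up.

import statistics

# Minimal sketch: aggregate readings from many field sensors and flag the ones that
# deviate sharply from their peers (a possible leak, failing device, or dry plot).
soil_moisture = {               # sensor id -> latest reading (percent), illustrative
    "plot-01": 31.0, "plot-02": 29.5, "plot-03": 30.2,
    "plot-04": 30.8, "plot-05": 12.1, "plot-06": 29.9,
}
readings = list(soil_moisture.values())
mean = statistics.mean(readings)
stdev = statistics.stdev(readings)

for sensor, value in soil_moisture.items():
    if stdev and abs(value - mean) > 2 * stdev:
        print(f"check {sensor}: reading {value} deviates from the field average {mean:.1f}")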

4 HEALTHCARE
Healthcare is placed high on the humanitarian agenda. The Web, using cloud technology, can facilitate the deployment of healthcare applications, supported by analytics engines, for mass usage. Medical diagnosis, which we presented in one of IEEE's flagship humanitarian conferences, GHTC [4], and further elaborated in [5] and, in the previous issue of Visleshana, in [6], is an example. Mining the information on social networks can help predict epidemics. Brain-to-brain communication is now a possibility. When human brains can communicate with microchips embedded in them, they can potentially become yet another thing in the Web of Things. Human brains can have an IP address and communicate with other human brains and machines directly, resulting in limitless man-machine synergy. The Web, with its power to connect, can help unleash this limitless potential.



As we stated above, WoT can help in micro-monitoring, and a good example is monitoring type-I diabetics. An ambitious humanitarian project would be to monitor all type-I diabetics in the world using a single WoT application. The diabetics, wearing insulin monitors, pumps, and other medical devices, can immensely benefit from such monitoring. Fatalities can be avoided and roads become safer, as the danger of a diabetic getting into a coma while driving can be prevented. Aggregating and analyzing anonymous data from those wearables may reveal patterns that can help in further medical research and inventions. There are other chronic conditions, like epilepsy, where such monitoring will immensely help.

A growing trend with medical providers is video consultation. Patients can consult with the doctor entirely online, by transmitting a real-time video captured by a camera attached to a web-enabled device. When combined with Web conferencing, a primary care physician can conference-in a specialist in the same session to provide a conclusive diagnosis and prescription to the patient. The physician can also combine this feature with the analytics applications described above to substantiate his findings. These Web-enabled applications are a boon to the many parts of the world where access to healthcare is unacceptably low. Web protocols do not yet support Augmented Reality (AR). An important research direction for the Web, as we identified in [2], is to encompass AR. Video consultation becomes much more effective with the advent of AR. The possibilities in healthcare with video and AR are many. For instance, images from MRI scans can be superimposed on a patient's body to note the exact location of points of concern.

It must be noted that whatever technology can enable needs to be evaluated against the backdrop of public policy. The medical domain is highly regulated, which is actually impeding research and investment in healthcare. Striking a delicate balance between public safety and entrepreneurial initiatives is essential for sustained advancement in this area.

5 CRISIS MANAGEMENT
The Web has played a major role in disaster relief in the past. During the Haiti earthquake, funds could be quickly pooled with the help of SMS and Web technology, and hundreds of people could collaborate using these technologies to assist in the relief operations. The Web can help in knowledge and volunteer management, decision and logistics support, and information dissemination in case of such emergencies. Web-based projects such as Ushahidi and the crisis mapping initiative of the National Institutes of Health have played a major role in disaster management. Social media has caused a revolution in the quality of life in the world. It has placed a lot of power in the hands of the people to influence and change the social order. Several public safety issues and human rights violations have been successfully addressed by tweets and videos on the social media sites. Mining the information on the social networking websites and running analytics can reveal important patterns and enable the creation of applications that can aid in humanitarian tasks. One such application is the Crisis Oriented Search Engine (COSE) proposed a few years ago. The COSE project indexes social media posts for subsequent retrieval and analysis. While data aggregation from the social media websites helps in on-site engagement, information posted on them can also help in outreach to solicit outsider involvement for fundraising or recruitment.

6 RESOURCE SHARING
There are growing numbers of people willing to share their resources and give back to society. Lyft, Uber, Wingz, and AirBnB are ventures based on this premise. With the advent of cloud computing and virtualization, people almost do not have to possess anything physically. It started with machines: a machine with the configuration the user wants can easily be provisioned in the cloud. Now there are applications that can be used to share not just machines, but even clothes. This technology is an important enabler in resource-constrained locations, where resource utilization needs to be optimized. More web applications need to be developed for mapping need with excess capacity to enable easy pooling and sharing of resources. These applications will continue to be supported by analytics engines. Education web portals offering Massive Open Online Courses (MOOC), such as Coursera, also fall in this category. Knowledge is the commodity that is shared through these courses. Beyond the basic needs, education is probably the next major concern after healthcare. The MOOC movement is limited to relatively advanced topics for the underprivileged. Learning has the best impact when the teacher is a peer. A significant humanitarian project would be to build the knowledge ground-up, for, of, and by the underprivileged populations, through MOOCs. The Web can possibly enable more such ventures for the underserved populations. Information exchange is key to collective development. While there are efforts to develop ontologies [7] and supply chain mechanisms [8] for information exchange in the humanitarian domain, more needs to be done to facilitate the exchange even among individuals. Another web-enabled project that can be taken up is to pair every child in a developing country with at least one mentor in the developed world. The mentor can then interact with the child using all possible mechanisms that the Web protocols support. The project can be extended to grown-ups and even to those fighting depression and other psychological ailments all over the world.

7 HUMAN RIGHTS
The Web has furthered the exercise of the fundamental rights of citizens in multiple ways. Websites such as change.org help citizens petition the government about humanitarian causes. But there has also been research on how civil and human rights, such as the right to privacy and the freedom to assemble, have been violated by governments seeking to gain control over the data and operations on the Web. The Chinese government's efforts at censoring collective action on the Web and the NSA's role in undermining NIST's cyber security standards are recent examples. Politics always has had, and will have, an upper hand over technology.



With the growing outreach and power of the Web, governments will naturally try to gain as much control over it as citizens will tolerate. Technology can help counter governments that encroach upon citizens' rights, but only to some extent. Data encryption has the potential to let the originator of the data decide who can access it. But if governments deter attempts at tougher encryption standards, as in the case of NIST, there is no recourse. Similarly, if governments decide to rein in tighter control by enforcing country-wide VPNs, technology will not be able to side with the citizens. VPNs certainly protect against hackers and malicious attacks from outside the borders, but can be used to give governments far more control over citizens' access to and activities on the Web.

Some of the controls and legal checks are necessary. For instance, health data by its very nature is personal. Privacy slows the technological pace but, in such domains, is necessary. Transmitting clear-text medical data and other privacy violations in the US can attract a seven-year prison term. Health-related Web transactions are therefore invariably encrypted. However, to enable mass web applications such as the online medical diagnosis that we described in [4], governments should allow the release of anonymous or pseudonymous data for research use. Privacy is still a basic human right that continues to be encroached upon not only by governments, but by private parties as well. There are numerous examples, such as "Uber's ride of glory", where sensitive information was unethically mined. Public policy needs to catch up with technology to prevent such privacy violations.

The Web has been helping governments in humanitarian efforts such as enhancing public safety. The trend towards smart cities includes predictive policing to prevent crime. By running analytics on the data collected from geographical locations, the software is able to predict where crime is likely to occur next and show it on the Web. The Web combined with mobile devices is empowering the common man in many ways and promoting citizen engagement. For instance, citizens can upload a picture of a crime scene or damaged infrastructure and file a complaint in minutes, as contrasted with the hours it takes with conventional methods. It can be easily seen that incorporating Augmented Reality in the Web protocols will immensely enhance all these applications.

8 CONCLUSION
The Web holds great potential to bring more and more people into the core echelons of the society. This article examined some of the ways this can be achieved.

REFERENCES
[1] Pendyala, Vishnu S., and Simon S.Y. Shim. "The Web as the Ubiquitous Computer." IEEE Computer 42.9 (2009): 90-92.
[2] Pendyala, Vishnu S., Simon S.Y. Shim, and Christoph Bussler. "The Web That Extends Beyond the World." Computer 48.5 (2015): 18-25.
[3] K. Akash, S. A., et al. "A Novel Strategy for Controlling the Movement of a Smart Wheelchair Using Internet of Things." Global Humanitarian Technology Conference - South Asia Satellite (GHTC-SAS), 2014 IEEE. IEEE, 2014.
[4] Pendyala, Vishnu S., et al. "A Text Mining Approach to Automated Healthcare for the Masses." Global Humanitarian Technology Conference (GHTC), 2014 IEEE. IEEE, 2014.
[5] Pendyala, V. S., and Figueira, S. "Automated Medical Diagnosis from Clinical Data." IEEE Third International Conference on Big Data Computing Service and Applications (BigDataService), April 2017.
[6] Pendyala, Vishnu S. "Towards Automated Healthcare for the Masses Using Text Mining Analytics." CSI SIGBDA Visleshana 1.4 (2017): 22-25.
[7] Clark, T., Keßler, C., and Purohit, H. "Feasibility of Information Interoperability in the Humanitarian Domain." 2015 AAAI Spring Symposium Series, December 2015.
[8] Rahman, M. A. "Improving Visibility of Humanitarian Supply Chains Through Web-Based Collaboration." In User-Centric Technology Design for Nonprofit and Civic Engagements, pp. 69-87. Springer International Publishing, 2014.

Vishnu S. Pendyala is a Senior Member of IEEE and the Computer Society of India, with over two decades of software experience with industry leaders like Cisco, Synopsys, Informix (now IBM), and Electronics Corporation of India Limited. Vishnu received the Ramanujam memorial gold medal at the State Math Olympiad and was a successful leader during his undergraduate years. He also played an active role in the Computer Society of India and was the Program Secretary for its annual convention, which was attended by over 1,500 delegates. Marquis Who's Who has selected Vishnu's biography for inclusion in multiple of its publications for multiple years. He is currently authoring a book on a Big Data topic to be published by Apress/Springer.



HIVE QUERIES

Efficient Measures for Improvement and Optimization of Big Data Hive Queries
K. DeviPriya, V. Chandra Kumar

Abstract—Big Data is the term used for huge datasets that cannot be processed using traditional techniques. These days, every sector, such as industry, education, hospitals, and companies, and devices such as IoT sensors, generate huge amounts of data on the order of terabytes, petabytes, yottabytes, and beyond. It is a difficult task to store and analyze this data using traditional RDBMSs and programming languages; doing so requires special tools to process the data in an effective manner. Hadoop is one such tool to store and process big data in an efficient way. MapReduce is a programming model under the Hadoop framework which processes data using <key, value> pairs. But because of the difficulty of programming in the MapReduce approach, users need another easy and effective way to handle big data. Hive is such a solution, processing large amounts of structured data. The Hive environment and commands give users an easy and flexible query interface for analyzing big data. A problem associated with Hive, however, is the lack of query optimization. In this paper, we first formulate the MapReduce approach for analyzing big data and discuss the problems and difficulties associated with map reducers. We then discuss the implementation and evaluation of Hive queries and how to create optimized Hive tables. The results are evaluated using a Hadoop Ubuntu virtual machine.

Index Terms—Big Data, Bucketing, Hadoop, Hive, MapReduce, Tez engine

——————————u——————————

1 INTRODUCTION

The age of Big Data has begun [1]. Data on web servers, social media, industries, bioinformatics, medical sciences, and so on has increased very quickly, and present technologies are unable to store it all; retrieving useful information from the stored servers is also a challenging task. These complex data sets may come in different formats: structured, semi-structured, or unstructured. Recently, industries have been spending millions in the big data area to meet these challenges. Apache Hadoop is one tool among the existing technologies to handle big data, and it is an open source project maintained by many people around the world. MapReduce, Pig, and Hive are some of the core components in the Hadoop framework. MapReduce is a batch processing model which analyzes and processes data in terms of mappers and reducers. Apache Pig is an interactive query processing model to process and analyze data by writing Pig Latin scripts. Hive is an interactive query language which processes data through Hive queries.

————————————————

• K. DeviPriya is with the Department of Computer Science and Engineering, Aditya Engineering College, Surampalem, Andhra Pradesh. E-mail: devipriya.kamparapu@aec.edu.in.
• V. Chandra Kumar is with the Department of Computer Science and Engineering, AIMS Engineering College, Mummidivaram. E-mail: chandrakumar.vichuri@gmail.com.

2 ARCHITECTURE OF HADOOP

2.1 HADOOP
The Hadoop file system is designed as a distributed file system. Hadoop runs on commodity hardware for storing complex datasets in a distributed way. HDFS holds vast amounts of data that are easily accessible by applications. The objective of Hadoop is to move the processing technique to the data instead of moving the data to the processing model. The detailed architecture of the Hadoop ecosystem is shown in Fig. 1 below. HDFS, the Hadoop Distributed File System [2], is used for storing large data sets across the different data nodes that are available in the Hadoop cluster.

Fig. 1. Ecosystem of Apache Hadoop.

A Hadoop cluster is a group of machines designed to store huge volumes of data in a distributed environment.





It supports the distributed storage of large datasets across the data nodes. The data nodes are controlled by a node designated as the NameNode, a master node that stores the metadata about the data, that is, information about where the data is stored.

MapReduce – a programming model that analyzes and processes any type of data in the <key, value> format using mapper and reducer classes. In this model a job is divided into tasks that are assigned to task trackers, and the controlling and monitoring of all these task trackers is handled by the job tracker.
Pig – a scripting language that supports interactive query processing. The language supported by Pig is Pig Latin, whose Load, Dump, Transform, Store, and other commands are used to analyze and process the data.
Mahout – an Apache project whose goal is to build scalable machine learning algorithms.
Hive – an important interactive query processing model that processes and handles data using the Hive Query Language. Users who are not familiar with programming languages can easily write Hive queries with basic SQL knowledge.
HBase – a NoSQL database. Apache HBase is a column-oriented key/value data store built to run on top of the Hadoop Distributed File System (HDFS).
Sqoop – a tool intended for the efficient transfer of vast amounts of data between Hadoop and relational databases.
Ambari – makes Hadoop management simpler by providing software for provisioning, managing, and monitoring Apache Hadoop clusters.
ZooKeeper – used to manage large clusters in the Hadoop file system; it is open source and distributed in nature.
Oozie – used to manage workflows of jobs in large clusters; it is a scheduler that schedules jobs in Apache Hadoop.

2.2 Characteristics of Big Data
Big data, the huge amount of data generated from different sources, is identified mainly by three characteristics:
1. Volume – the quantity of data generated from the sources.
2. Velocity – the speed at which the data is generated.
3. Variety – the formats of the datasets: structured, semi-structured, and unstructured. Examples include images, text, video, and audio.

2.3 Application Areas of Big Data
Big data is generated in many different application areas. The following are some of them [3]:
1. Geographic Information Systems (GIS) – the main objective of a GIS is better decision making about location. It includes collecting, modifying, managing, retrieving, and sorting geographical data. Geographical data is very large, and tools such as Apache Hadoop, MapReduce, and Apache Spark are needed to analyze it.
2. Cloud Control Systems (CCS) – a CCS manages large amounts of traffic hosting, content delivery, video streaming, and so on, which generate big data; an efficient processing tool for this area is the Apache Hadoop framework.
3. Social media – Facebook generates huge amounts of data such as posts, uploaded photos, and likes. According to statistics, the Facebook data warehouse holds 700 TB of data. Efficient processing of this data is possible through Hadoop and Hive.
4. Bioinformatics [csi-pgno36-7ref] – the study of the molecular mechanisms of life on Earth by analyzing genomic information. Biological data is very large, and understanding and analyzing it is a difficult and challenging task for researchers. In this area, BioPig and Crossbow have been developed for sequence analysis.

3 MAP REDUCE APPROACH
MapReduce is a framework for writing programs for parallel processing in a distributed environment. The approach is divided into two tasks, map and reduce: the map function is written first, followed by the reduce function. The number of mappers required to process the data is decided in the configuration settings. The map and reduce functions take their input and produce their output as <key, value> pairs. The following pseudocode describes the data flow from the input key/value pairs to the output list:

Map(key1, value1) -> list(key2, value2)
Reduce(key2, list(value2)) -> list(value3)

The MapReduce workflow consists of five steps (a small illustrative sketch of this flow follows the list):
1. Splitting – the input data is split based on delimiters such as space, comma, or newline.
2. Mapping – each input <key, value> pair is converted into another <key, value> format.
3. Intermediate splitting – the entire procedure runs in parallel on different nodes of the cluster; in order to group records in the reduce phase, records with the same key must be sent to the same node.
4. Reduce – aggregation operations such as sum, count, max, and min are performed.
5. Combining – the last phase, where all the data is combined together to form the result.
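To make the five steps concrete, the following small sketch simulates the word-count data flow on a single machine in Python; it is an illustration of the idea only, not Hadoop code, and the sample input string is invented for the example.

# Illustrative, single-machine simulation of the MapReduce word-count flow.
from collections import defaultdict

text = "big data hive big query hive hive"   # hypothetical input

# 1. Splitting: break the input on whitespace.
records = text.split()

# 2. Mapping: emit a <key, value> pair (word, 1) for every record.
mapped = [(word, 1) for word in records]

# 3. Intermediate splitting: group pairs that share the same key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# 4. Reduce: aggregate the grouped values (here, a sum).
reduced = {key: sum(values) for key, values in groups.items()}

# 5. Combining: the final result.
print(reduced)   # {'big': 2, 'data': 1, 'hive': 3, 'query': 1}

In a real Hadoop job the mapping and reducing run in parallel across the nodes of the cluster, as described above, rather than in a single Python process.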





Creation of the mapper and reducer classes includes the following:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

class MyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text value, Context context) {
        // mapper logic: emit intermediate <key, value> pairs via context.write(...)
    }
}

class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context) {
        // reducer logic: aggregate the values for each key and write the result
    }
}

The classes required for MapReduce programs are available in the Hadoop API. MapReduce can be implemented in different programming languages, but its challenge is that the developer must have good programming knowledge, which is a difficult requirement for ordinary users, and it only solves big data problems that fit the key/value format.

4 HIVE

Fig. 2. Components of Hive

Hive [4] is a tool that structures data into databases using the concepts of tables, columns, rows, partitions, bucketing, and so on. Hive supports primitive data types such as int, float, and double, and complex types such as struct, union, and map with key/value pairs. The user executes Hive queries through the CLI, a Web GUI, or JDBC/ODBC. If the user works through the CLI or Web GUI, the queries go directly to the Hive driver; if the user works through JDBC/ODBC (a JDBC program), the connection to the Hive driver is made through the Thrift server API. The Hive driver accepts the Hive queries from the user and sends them to the Hadoop Distributed File System (HDFS), which uses the NameNode, DataNodes, JobTracker, and TaskTrackers to receive and divide the work for parallel execution. The Metastore stores the schema of the Hive tables. The detailed architecture of Hive is shown in Fig. 2.

Hive is an easy, interactive query language: even people who do not know a programming language can easily write Hive queries to analyze big data. The syntax of Hive queries is similar to SQL. The following examples show how to store and retrieve data in tables using the Hive query language.

Creation of a table using Hive:

create table if not exists research (
  rid int,
  rname string,
  rarea string,
  yearofjoining date)
comment 'research details'
row format delimited
fields terminated by '\t'
lines terminated by '\n'
stored as textfile;

Loading data into the research table using Hive:

load data local inpath '/client/user/research.txt' overwrite into table research;

Retrieving data from the research table using Hive:

select * from research where rid=3;

These examples show the normal way of storing and retrieving data without applying any optimization. Optimizing Hive increases the performance of the queries.





The default way of creating tables does not provide optimization.

5 OPTIMIZATION OF HIVE QUERIES

For any type of data, the performance of the queries is an important challenge. Queries that run for a long time on big data not only consume system resources but also bring down the capacity of the server and the application, so query optimization becomes an important task. Hive without optimization is suitable for queries that require a scan of the entire table [5], but many queries run on Hive need only a limited amount of data to be analyzed and processed. For such requirements, users need some domain knowledge about the attributes of the table and must convey it to Hive. This is possible through partitioning of tables in Hive: partitioning is a feature that improves the performance of the queries. For some types of attributes partitioning is not possible; in that case it is better to apply bucketing (clustering) on the columns, so that rows with the same column value are stored in one bucket. Bucketing is particularly useful for join operations. The following techniques improve the optimization of Hive queries and are discussed with examples.

5.1 Hive Table Creation Using Partitioning
It is possible to create tables in Hive without partitioning, but then queries scan the entire table to produce the result. By applying partitioning, the records are stored in separate folders, and queries fetch only the required directories instead of fetching everything. The following syntax shows how to create partitions in a Hive table:

create external table tablename (column1 datatype, column2 datatype, ...) partitioned by (column datatype);

The following example shows the creation of a researcher table partitioned by the researcher's joining date:

create external table researcher_info111 (rid smallint, rname string) partitioned by (rjoindate date);

Loading data into a Hive table partition by partition:

load data local inpath '/home/user2/desktop/sample.txt' overwrite into table r_info partition (year='2016');

5.2 Hive Table Creation Using Bucketing
Bucketing [6] is another way of decomposing a table into manageable parts. Bucketing is based on a hash function, which improves query performance. The following example shows how to create a table with the bucketing concept:

create external table researcher_info (rid smallint, rname string) partitioned by (rjoindate date) clustered by (rid) into 256 buckets;

5.3 ORC Format for Storage
The Optimized Row Columnar (ORC) file format provides an efficient way to store data in the Hive database. It was designed to overcome the limitations of the other Hive file formats, and it improves performance when Hive is reading, writing, and processing data [7].

create table table_orc (column1 datatype, column2 datatype, column3 datatype, column4 datatype) stored as orc;

5.4 TEZ Instead of the MapReduce Engine
The TEZ engine is more efficient than MapReduce for interactive queries. The TEZ engine is selected by setting the following property:

set hive.execution.engine=tez;

The TEZ engine supports interactive queries with a single map phase followed by multiple reduce phases, whereas in MapReduce every reduce phase requires its own map phase. The response time of TEZ is better than that of MapReduce because of less job splitting and less HDFS access: in MapReduce the task is divided into more jobs and HDFS is accessed more often. TEZ does not write any temporary results to HDFS; only after all the map and reduce tasks complete is the final result stored in HDFS. In MapReduce, by contrast, the temporary results of the map and reduce phases are stored in HDFS, which is a time-consuming process.





The following points summarize how to optimize Hive queries; a consolidated sketch combining them follows the list.
1. Create Hive tables using partitions.
2. Create Hive tables using the bucketing concept.
3. Store the tables in ORC format.
4. Use the TEZ engine instead of the MapReduce engine.
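As a rough illustration of how the four measures can be combined, the sketch below issues the corresponding statements from Python; it assumes the PyHive package and a HiveServer2 instance on localhost:10000, and the table and column names are illustrative rather than taken from the experiments reported below.

# Minimal sketch: apply the four optimization measures from Python via PyHive.
from pyhive import hive

conn = hive.Connection(host="localhost", port=10000)   # assumed HiveServer2 endpoint
cur = conn.cursor()

# Measure 4: use the TEZ engine instead of the MapReduce engine.
cur.execute("SET hive.execution.engine=tez")
# Older Hive releases also expect bucketing to be enforced explicitly.
cur.execute("SET hive.enforce.bucketing=true")

# Measures 1-3: a partitioned, bucketed table stored in ORC format.
cur.execute("""
    CREATE EXTERNAL TABLE IF NOT EXISTS researcher_info_opt (
        rid   SMALLINT,
        rname STRING)
    PARTITIONED BY (rjoindate DATE)
    CLUSTERED BY (rid) INTO 256 BUCKETS
    STORED AS ORC
""")

# A query that filters on the partition column scans only the matching
# partition directories instead of the whole table.
cur.execute("SELECT rname FROM researcher_info_opt WHERE rjoindate = '2016-01-01'")
print(cur.fetchall())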

6 EXPERIMENTAL SETUP

We implemented the Hive queries on an Ubuntu virtual machine with Hadoop and Hive installed. The evaluated results are shown in the screenshots: creation of the researcher_info table using partitioning and bucketing, loading of data into the partitioned researcher_info table, and the output of SHOW PARTITIONS for researcher_info.

7 CONCLUSION
Hadoop is a framework that handles large amounts of data in different formats using MapReduce, Pig, Hive, and other tools. In this paper we first discussed the architecture of Hadoop and the characteristics and application areas of big data along with the supporting framework tools. We then described Hive and its architecture in detail, covering the CLI, Web GUI, Thrift server, JDBC/ODBC, and Metastore components. Hive is a top-level Hadoop project that processes vast amounts of structured data using the Hive query language. Hive without optimization is suitable for queries that require a scan of the entire table; in some applications, however, queries do not need to scan the entire table, and in that situation creating tables with partitioning and bucketing reduces the scanning time and improves query efficiency. In this paper we practically implemented the creation of tables and insertion of data using the partitioning and bucketing concepts. The file format also affects the efficiency of Hive, and the ORC file format is an efficient format for Hive storage. We also briefly described the motivation for using the TEZ engine instead of the MapReduce engine. Finally, the experimental results were evaluated on an Ubuntu virtual machine.

REFERENCES
[1] R. Singh, "Analyzing Performance of Apache TEZ and Map Reduce with Hadoop Multi Node Cluster on Amazon Cloud," Journal of Big Data, vol. 3, no. 19, 2016, doi: 10.1186/s40537-016-0051.
[2] S. Rama Sree and K. DeviPriya, "An Insight of Big Data Analytics Using Hadoop," CSI Communications, vol. 40, no. 8, Nov. 2016.
[3] S. Harinkhere et al., "Big Data Processing Techniques & Applications: A Technical Review," CSI Communications, vol. 40, no. 8, Nov. 2016.
[4] A. Thusoo et al., "Hive - A Petabyte Scale Data Warehouse Using Hadoop," Proc. 2010 IEEE 26th International Conference on Data Engineering (ICDE 2010), 2010.
[5] http://hadooptutorial.info/partitioning-in-hive/#Syntax
[6] http://www.hadooptpoint.com/hive-buckets-optimization-techniques/
[7] Y. Huai et al., "Major Technical Advancements in Apache Hive," Proc. 2014 ACM SIGMOD International Conference.

Mrs. K. Devi Priya (CSI - F8000838) is working as a Senior Assistant Professor in the Department of Computer Science and Engineering, Aditya Engineering College, Surampalem, Andhra Pradesh. She has 7 years of teaching experience and has taught subjects such as Web Technologies, Mobile Computing, Database Management Systems, Big Data Analytics, and Hadoop and Big Data. Her research interests include Big Data Analytics, Cloud Computing, and Network Security. She can be reached at k.devipriya20@gmail.com.

Mr. V. Chandra Kumar is currently working as an Assistant Professor in the Department of Computer Science and Engineering, AIMS College, Mummidivaram. He completed his M.Tech from JNTU, Kakinada, and has 7 years of teaching experience. He has taught subjects such as Web Technologies, Mobile Computing, Java, Computer Organization, and Computer Graphics. His research interests include Mobile Computing, Image Processing, Design and Analysis of Algorithms, and Big Data Analytics. He can be reached at chandrakumar.vichuri@gmail.com.





Locating Points of Interest for GIS Maps Using Apache Solr
Vivek P. Talekar, Sajidha S.A.

Abstract—A geographical information system (GIS) is an approach to capture, load, handle, analyze, manage, and present all kinds of geographical big data. Map making involves recording images, saving the data, inspecting it, and enriching the information associated with positions on the Earth's surface. It combines image mosaics with separate vector data and customer-provided attribute data to make single, information-rich pictures for GIS mapping ventures. A geomap is a map of a country, continent, or region, with colors and values assigned to specific regions. Points of Interest (POIs) are businesses and landmarks important to the find, guide, and display functionality of vehicle and pedestrian navigation, Internet mapping, and enterprise mapping solutions. A POI represents the activity at a specific location. The division of points of interest into subclasses is driven by specific POI behavior in POI relationships and POI attribution. Since the data required for the map-making process is of very large volume, finding particular information about a point of interest is a hectic process that initially took a long time, and fetching the data that goes into the fallout report should be faster. The main objective of this paper is to fetch the required information from terabytes of data in milliseconds using Apache Solr, whose functionality and indexing methods make searching and fetching faster.

Index Terms—GIS, map, Position, GeoMap, Point of Interest (POI), Apache Solr, Sharding the data, Indexing.

—————————— u ——————————

1. INTRODUCTION
"Suppose you have a young boy, and you ask him to travel into a small house and, in a short time period, collect as much as he can. You can do this easily because you know everything about that house, but he cannot. Let the boy carry with him some map or paper describing the house where he travels, and it will be a key to his search..." This illustration is solid counsel and sets the stage well for thinking about the different applications and distinctive research opportunities for geographic information systems (GIS) in business maps. GIS are rapidly expanding and being used for many purposes because they are capable tools that can bring out the valuable information kept in data that describes location (e.g., latitude and longitude, addresses, postal codes, provinces). Geographic information is a decision instrument that enables a user to build aggregate maps. The main feature of the system is that it distinguishes this from other object information, which gives the user great help in getting actual geographic locations. As we know, most maps contain a lot of useful information such as hotels, schools, colleges, offices, restaurants, and malls. Map reading and map drawing are important skills to learn in GIS.

2. PROBLEM STATEMENT
With the mostly traditional ways of searching or fetching data that is important for a business, the operations take a long time, because traditional approaches such as text search or word search can take much more time; this happens because an old method is used. To overcome that problem, we introduce a faster search engine using Apache Solr technology. Solr is used to extract useful information from the database, and it takes very little time to fetch the data, even from terabytes of data. This technique has more advantages because of its indexed searching: it makes searching and fetching faster than the traditional way and gives good results.

3. GEOGRAPHICAL INFORMATION SYSTEM (MAPPING)
It should be noted that a GIS is a very useful tool for the creation of any kind of map, and it can also be used for graphical presentation. For example, many spreadsheet packages now contain GIS capability that permits users to produce new kinds of map displays. Although this ability is useful for making graphical presentations and similar displays, it represents only some of the abilities a full GIS possesses. The geocoding process describes the method of connecting attribute information with the actual coordinates of a location on a map: for example, if somebody needs to place the locations of all stores on a map, they can geocode every store address with coordinates on the map to define POIs that are used to locate all the stores near each store. Geocoding is the process of creating data fields in the attribute dataset for the longitude (coordinate X) and latitude (coordinate Y) of each point of interest. The emerging link generates a GIS database that is a merging of two different datasets, the attribute data and the map data. Because of this significant combination, it provides more accurate data for a number of applications.




This dataset is very large and contains all the information about the POIs (nodes).
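As a simple illustration of the geocoding step, the sketch below attaches longitude (coordinate X) and latitude (coordinate Y) fields to a store's attribute record; the store, field names, and coordinates are hypothetical, chosen only to show the shape of a geocoded POI record.

# Hypothetical example: merge attribute data with map coordinates into a POI record.
store_attributes = {"name": "City Mall", "category": "shopping", "address": "MG Road"}
coordinates = {"lon": 80.2707, "lat": 13.0827}   # coordinate X (longitude), coordinate Y (latitude)

poi_record = {**store_attributes, **coordinates}
print(poi_record)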

Another feature of a GIS application is the display of geographic information: in other words, maps can be represented with the help of a GIS, and more than one set of data can be represented at the same time.

4. POINT OF INTEREST APPLICATION
Because of accurate POIs, it is very easy to reach a desired location. Accurate POIs help the map user find locations more precisely, so POI accuracy is an important concern when building POIs. This is also beneficial for a new user trying to find an exact location: the map gives the exact shortest distance between POIs, and an accurate POI gives all the detailed information about that location. Blindness affects nearly 45 million people worldwide, and because of rapid population growth this number is expected to more than double by the year 2020; navigation in indoor conditions is highly demanding for such users.

5. RELATED WORK
GIS is an effective decision support tool because it enables its clients not only to manage property information but also to capture, manage, and fuse spatial information into their analysis. Because of these abilities, as well as other industrial factors, the GIS business has seen huge growth in the past few years. Although government administration still represents the biggest segment of the GIS user community, much of this growth can be credited to the widespread dissemination of GIS into the business community [1]. For the search technology at the source, the k-shortest path method [2] is used to compute the minimum or shortest path query; here the variable k stands for the number of transit points contained in the query result, and the transit points are simply Points of Interest (POIs). The minimum distance between two nodes can be found using this method [2]. Points of Interest (POIs) and Regions of Interest (ROIs) are two kinds of image features widely used in many computer graphics applications, and detection of these features has received further study [3]. Statistics such as the count, average, and distribution of points of interest (PoIs), e.g., hotels and restaurants, on map services such as Google Maps and Foursquare provide significant information for applications such as marketing decision making. For example, knowledge of the PoI rating distribution enables us to assess a specific PoI's relative service-quality ranking.

In addition, a restaurant start-up can infer the food preferences of people in a geographic zone by looking at the popularity of restaurant PoIs serving different cuisines within the area of interest [4]. At the same time, it can estimate its market size from aggregate PoI statistics, for example the number of Foursquare users checked in at PoIs within the region. Likewise, a lodging start-up can use hotel PoI properties such as ratings and reviews to understand its market and rivals. In one technique, an index is produced progressively using a set of candidate transit points and a set of candidate destination points; the framework evaluates the shortest path rapidly using the index, any client can change the starting point, and the same index is reused for the same set of candidate transit points and candidate destination points. Social networks such as Facebook, YouTube, or LinkedIn contain a great deal of data such as likes, dislikes, phone numbers, and e-mail ids; one author used Apache Solr to build a search engine to extract helpful information from Facebook, since Solr makes it possible to retrieve the most important data [6]. Other authors use Apache Hadoop and Apache Storm to parallelize the process so that it can be faster than before: current approaches to information retrieval depend on content-based indexing strategies that allow data in multimedia datasets, such as images, to be retrieved efficiently, but the indexable information must first be extracted from the available dataset. Indexing is a necessary but time-consuming step, and Apache Hadoop and Apache Storm are used to make the system more parallel and speed the process up [7]. Another author uses Apache Solr for tagging tools that allow users to bulk tag or untag large datasets of objects in temporary work sessions, so that they can review their actions in real time before making the changes visible to end users; Apache Solr is used to provide a fully functional annotation-tagging environment over a full-text Solr index [8].

6. APACHE SOLR TECHNOLOGY
Apache Solr is an open source search server / web application, an open source enterprise search server that provides huge-scale scalability and is deployed on top of an application server such as Tomcat. Apache Solr can achieve fast search responses because, rather than searching the text directly, it searches an index instead.
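As a rough sketch of what an indexed POI search over Solr's HTTP API can look like, the request below queries a hypothetical core named poi for restaurant entries; the core name, field names, and host are assumptions for illustration, and the standard /select handler answers from the index rather than scanning the raw text.

# Hypothetical query against a local Solr core named "poi".
import requests

params = {
    "q": "category:restaurant",   # main query on the indexed category field
    "fq": "city:Chennai",         # filter query narrows the result set
    "rows": 10,                   # return at most ten documents
    "wt": "json",                 # request a JSON response
}
resp = requests.get("http://localhost:8983/solr/poi/select", params=params)
resp.raise_for_status()

result = resp.json()["response"]
print(result["numFound"], "matching POIs")
for doc in result["docs"]:
    print(doc.get("name"), doc.get("lat"), doc.get("lon"))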




Solr is powered by Lucene, which enables powerful matching capabilities such as phrases, wildcards, joins, and grouping across different data types.

Apache Solr supports a query language for structured and textual search. The top Hadoop vendors Cloudera, Hortonworks, and MapR all bundle Solr as the search engine for their big data projects, and Solr also works as an endpoint in various big data processing projects.

Solr works over HTTP REST APIs with XML and JSON formats and integrates with any programming language that supports these standards; for convenience, client libraries are available for Java, Python, Ruby, and C#.

7. REGULAR APPROACH AND NEW APPROACH
The traditional approach to getting the report is more time consuming: report generation takes almost two to three days. The traditional way to fetch the data requires a lot of time, both for processing the data and for report generation.

Fig. 1. Traditional approach of report generation.

In this project, on the first day the production engineer starts the PPP pipelining process, which is the main task of the project. After the PPP pipeline process completes, the actual report generation starts; it contains different kinds of intermediate steps, each performing the tasks allotted to it. In the extract-transform-load (ETL) process, all the data is processed: first the data is extracted from the PPP pipeline process. In this method, the data is taken from one supplier or a number of suppliers and may be in different formats; it is loaded into the pipeline for further processing, after which the database is ready with all the given data.

Transmission of the data from the PPP pipeline into the Oracle database takes too much time, since the data is transferred from the local machine to the Oracle machine; loading the data from the local machine to the Oracle machine takes a long time, and only after that does report generation start. To reduce this time, we introduce a new method using Apache Solr technology and Apache Tomcat.


Apache Solr works on an indexing method, and because of this indexed search the time required for report generation is much less compared to the traditional way. In the new approach, the input from the supplier is added into the PPP pipeline process, the data is loaded, and all the processing steps are executed; this takes some time for report generation. After the report generation, the data is loaded into Apache Solr with the help of a scripting language such as Python. Because of Apache Solr, the time taken to serve the report data is very small: since it uses the indexing method instead of the traditional way, fetching the data is simple and fast. Apache Tomcat is needed here as the server; it acts as the server on the machine and is used for deploying the Java servlets and JSP pages.

Fig. 2. Our approach using Apache Solr.
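The load into Solr described above can be scripted in a few lines; the sketch below posts geocoded POI documents to a hypothetical local core named poi through the JSON update handler, with the document fields invented for illustration.

# Hypothetical loader: post POI documents to Solr and commit them to the index.
import requests

docs = [
    {"id": "poi-1", "name": "City Mall", "category": "shopping", "lat": 13.0827, "lon": 80.2707},
    {"id": "poi-2", "name": "Lake View Hotel", "category": "hotel", "lat": 13.0500, "lon": 80.2121},
]

resp = requests.post(
    "http://localhost:8983/solr/poi/update?commit=true",   # commit so the docs become searchable
    json=docs,                                              # requests sends Content-Type: application/json
)
resp.raise_for_status()
print(resp.json()["responseHeader"]["status"])              # 0 indicates success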

8. APACHE SOLR ARCHITECTURE
Apache Solr distributes its index over a number of shards, which are nothing but the data divided into small parts. The raw data is stored in HDFS; raw data here simply means the data we have to process, which is of very large volume, often terabytes, and cannot be handled by traditional approaches. Apart from storage, the important task is to fetch the required data from these terabytes of data. Fetching data from such a large volume is a critical task: the old approach is very time consuming and hectic, whereas the Solr method takes very little time because it uses indexed searching, one of the fastest search techniques available. The raw data undergoes the ETL (extract-transform-load) process, since extracting the required data from terabytes of raw data is essential work. A web service comes into action after the ETL process is done, and the data is then merged into different shards, which keep the data in smaller pieces. Since Solr uses an index, that index is divided into chunks known as shards; a shard is just a logical partition of the data.

Fig. 3. Solr architecture.

In the traditional way, dividing the index into shards is done manually, distributed indexing is not supported, and load balancing is also a major issue. Apache Solr solves these problems explicitly: it supports distributed indexing and failover, and ZooKeeper plays an important role in load balancing and in handling process failures. Importantly, this is not a master/slave arrangement; every shard has replicas of its data, which helps with load balancing and when failures happen. ZooKeeper automatically elects a leader for each shard, and if the leader fails, one of the replicas is selected as the new leader.
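As a sketch of how such a sharded, replicated index might be created in a SolrCloud deployment, the call below uses the Collections API; the collection name, shard count, and replication factor are assumptions chosen for illustration, and a configset may also need to be supplied depending on the Solr version.

# Hypothetical Collections API call: a "poi" collection with four shards,
# each shard kept in two replicas for load balancing and failover.
import requests

resp = requests.get(
    "http://localhost:8983/solr/admin/collections",
    params={
        "action": "CREATE",
        "name": "poi",
        "numShards": 4,
        "replicationFactor": 2,
    },
)
resp.raise_for_status()
print(resp.json()["responseHeader"]["status"])   # 0 on success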

9. EXPERIMENTAL RESULTS
Apache Solr gives the expected results in less time. Since Solr works on indexed searching, it returns results faster. With Apache Solr, and Apache Tomcat as the server, the system becomes faster than traditional searching techniques; indexing plays the key role in this speed-up. Before the Apache Solr method was used, the required results took much more time to fetch from the large dataset.




This was not acceptable for the current map-making project, which requires fast results in order to produce the report.

Fig. 4. All phases.

Fig. 5. Normalization phase.

10. CONCLUSION AND FUTURE WORK
This is a survey paper on the topic of points of interest, which is a very interesting area of research. In future work, the accuracy and availability of POIs are the main concerns, because they raise the level of the mapping. The efficiency of POIs is also a challenging task: the quality and quantity of POIs used while creating maps is a central concern, since a smaller number of high-quality POIs is more efficient while creating new maps, and this has a large effect on map quality. Removing out-of-business POIs is also important: year by year some POIs change, some are updated, some are deleted, and some new POIs are added. Handling this successfully would also be a great piece of work.

REFERENCES
[1] B. E. Mennecke and M. D. Crossland, "Geographic Information Systems: Applications and Research Opportunities for Information Systems Researchers," Proceedings of the 1996 IEEE.
[2] K. Kaneko and S. Honda, "A Map Database System for Route Navigation with Multiple Transit Points and Destination Points," 2016 5th IIAI International Congress on Advanced Applied Informatics.
[3] Q. Li, Y. Gong, and Y. Lu, "Integration of Points of Interest and Regions of Interest," IEEE, 2014.
[4] J. A. D. C. Anuradha Jayakody, I. Murray, and J. Herrmann, "An Algorithm for Labeling Topological Maps to Represent Point of Interest for Vision Impaired Navigation," IEEE 2015 International Conference on Indoor Positioning and Indoor Navigation (IPIN), 13-16 October 2015, Banff, Alberta, Canada.
[5] "...Vision Impaired Navigation," IEEE 2015 International Conference on Indoor Positioning and Indoor Navigation (IPIN), 13-16 October 2015, Banff, Alberta, Canada.
[6] A. Khan and B. K. Ratha, "Business Data Extraction from Social Networking," 3rd International Conference on Recent Advances in Information Technology, IEEE, 2016.
[7] D. Mera, M. Batko, and P. Zezula, "Towards Fast Multimedia Feature Extraction: Hadoop or Storm," 2014 IEEE International Symposium on Multimedia.
[8] M. Artini et al., "TagTick: A Tool for Annotation Tagging over Solr Indexes," IEEE, 2014.

Vivek P. Talekar is a Master's student at the School of Computing Science and Engineering, Vellore Institute of Technology (VIT), Chennai. Vivek received his Bachelor's degree from the University of Pune and also has industrial internship experience. He is currently working on Data Mining and Big Data technology.

S. A. Sajidha is working as an Assistant Professor (Sel. Grade) at VIT University, Chennai. She has around 17 years of academic experience and works on Data Mining, Artificial Intelligence, and Big Data topics. She is currently pursuing her Ph.D. in Data Mining.




COMPUTER SOCIETY OF INDIA

The Flagship Publication of the Special Interest Group on Big Data Analytics

Call for Papers

Visleshana is the official publication dedicated to the area of Big Data Analytics from the Computer Society of India (CSI), the first and the largest body of computer professionals in India. Current and previous issues can be accessed at https://issuu.com/visleshana. Submissions, including technical papers, in-depth analyses, and research articles in IEEE transactions format (compsoc)* are invited for publication in "Visleshana", the flagship publication of SIGBDA, CSI, on topics that include but are not limited to the following:

• Big Data Architectures and Models
• The 'V's of Big Data: Volume, Velocity, Variety, Veracity, Value, Visualization
• Cloud Computing for Big Data
• Big Data Persistence, Preservation, Storage, Retrieval, Metadata Management
• Natural Language Processing Techniques for Big Data
• Algorithms and Programming Models for Big Data Processing
• Big Data Analytics, Mining and Metrics
• Machine learning techniques for Big Data
• Information Retrieval and Search Techniques for Big Data
• Big Data Applications and their Benchmarking, Performance Evaluation
• Big Data Service Reliability, Resilience, Robustness and High Availability
• Real-Time Big Data
• Big Data Quality, Security, Privacy, Integrity, and Fraud detection
• Visualization Analytics for Big Data
• Big Data for Enterprise, Vertical Industries, Society, and Smart Cities
• Big Data for e-governance and policy
• Big Data Value Creation: Case Studies
• Big Data for Scientific and Engineering Research
• Supporting Technologies for Big Data Research
• Detailed Surveys of Current Literature on Big Data

All submissions must be original, not under consideration for publication elsewhere or previously published. The Editorial Committee will review submissions for acceptance. Please send the submissions to the Editor, Vishnu S. Pendyala at visleshana@gmail.com.

* While we are working on our own templates, the manuscript templates currently in use can be downloaded from https://www.ieee.org/publications_standards/publications/cs_template_latex.tar (LaTeX) or https://www.ieee.org/publications_standards/publications/cs_template_word.zip (Word)




Visleshana, the flagship publication of Computer Society of India, Special Interest Group on Big Data Analytics (SIGBDA).
