
COMPUTER SOCIETY OF INDIA

Newsletter of the Special Interest Group on Big Data Analytics

Volume 1, Issue 1

October – December 2016

Chief Editor and Publisher Chandra Sekhar Dasaka

Editor Vishnu S. Pendyala

Editorial Committee B.L.S. Prakasa Rao, S.B. Rao, Krishna Kumar, Shankar Kambhampaty, and Saumyadipta Pyne

Website: http://csi-sig-bda.org

Please note: Visleshana is published by Computer Society of India, a nonprofit organization. Views and opinions expressed in Visleshana are those of individual authors, contributors and advertisers, and they may differ from policies and official statements of CSI. These should not be construed as legal or professional advice. The CSI, the publisher, the editors and the contributors are not responsible for any decisions taken by readers on the basis of these views and opinions. Although every care is being taken to ensure genuineness of the writings in this publication, Visleshana does not attest to the originality of the respective authors’ content.

© 2016 CSI. All rights reserved. Instructors are permitted to photocopy isolated articles for non-commercial classroom use without fee. For any other copying, reprint or republication, permission must be obtained in writing from the Society. Copying for other than personal use or internal reference, or of articles or columns not owned by the Society, without explicit permission of the Society or the copyright owner is strictly prohibited.

Image credit: wordclouds.com


From the Editor’s Desk

The strength of a professional group is in its publications. It is with immense pleasure that we bring you this inaugural edition of the Newsletter of CSI SIG-BDA, “Visleshana”. Web 2.0 unleashed several petabytes of data, so much so that Big Data became a phenomenon that we need to reckon with. The explosion of data is not limited to the Web, though. The sources of data are growing by leaps and bounds, with IoT being a significant one. An economy grows as more and more people join its core echelons. People are the most important economic resource. Big Data Analytics has the potential to bring people into the mainstream. CSI SIG-BDA is bound to act as a catalyst in enhancing and facilitating the impact of Big Data Analytics on mankind. Big data is like the milky ocean: churning it by running analytics in the cloud will return amazing results.

In this issue, Executive Council member Shankar Kambhampaty discusses how to leverage the three V’s of Big Data, namely, Volume, Velocity, and Variety. Web 2.0 enabled each one of us to become a journalist and express our ideas, which is good, because it brought so many truths and so many people to the forefront. There is also a downside to it: a significant part of the World Wide Web is not entirely true. The next article discusses the problem of Veracity, the fourth ‘V’ of Big Data, and how analytics can play a role in establishing the veracity of Big Data. One of the primary goals of technology is job creation. Krishna Kumar discusses career prospects in the areas of Business Analytics and Big Data in his article.

It is hoped that you will find all the information presented here useful and practical. In due course, it is hoped that this newsletter will grow into a full-fledged transactions-level journal. Your involvement is crucial in this journey. We invite research articles, letters, and practice papers for publication in the newsletter. Details are in the Call for Contributions included inside. Happy reading!

Vishnu S. Pendyala October 23, 2016



President’s Message

From Prof. Dr. Anirban Basu, Ph.D. (CS)
President, Computer Society of India

I am delighted to know that the CSI Special Interest Group on Big Data Analytics (SIG-BDA) is launching a monthly e-magazine. Big Data Analytics will play an important role in the days to come and I am glad that SIG-BDA is making all efforts to enlighten Computer Scientists on this emerging science. I hope that the e-magazine will be rich in content and have articles on the advances going on in this field. SIGs can play an important role in bringing together our members interested in the same subject and I am sure SIG-BDA will be a path setter. I convey my best wishes for the success of the e-magazine and all the endeavors of SIG-BDA.

Dr. Anirban Basu August 15, 2016



Honorary Chairman’s Message

C.R. Rao, Sc.D. (Cantab), F.R.S.
Recipient of National Medal of Science, India
National Medal of Science Laureate, USA

CRRAO AIMSCS, Prof. CR Rao Road, Hyderabad 500046.

I am extremely pleased to note that the Special Interest Group in Big Data Analytics of the Computer Society of India (CSI SIG-BDA), founded in 2014 by PC Mahalanobis Chair Professor Saumyadipta Pyne in CRRao AIMSCS in Hyderabad, is going to publish its Newsletter, very aptly titled 'Visleshana', which means "Analysis". The analysis of Big Data rightly presents many challenges to researchers and practitioners. Towards this, my hope is that Visleshana will serve as a useful resource for dissemination of information on Big Data and Analytics as well as a forum for lively interaction among its readership. I hope attempts will be made to report, from time to time, case studies where decisions based on Big Data analysis were implemented, along with the benefits that resulted. I wish Visleshana, its editors and the SIG members all success.

Best Regards,

C.R. Rao

20 October, 2016



Chairman and Convener’s Message

Dear Reader,

It gives me both pride and pleasure in equal measure, as the founder Convener and the first Chairman of CSI SIG BDA, to see the SIG, which was established in September 2014, now ready to publish the first issue of its new digital newsletter, Visleshanā. When asked to suggest a title, my first thought was that 'Visleshanā', a term which means "analysis" in Sanskrit, would help us remember the long-cherished analytical tradition of India in which the classical scholars of the past had masterfully dealt with large numbers, complex patterns, intricate rules of logic and grammar and, of course, mathematical computation.

Needless to say, BIG DATA is the most exciting and potentially disruptive approach in all of analytical sciences today. In fact, it is challenging the very foundations of the scientific and technological enterprise as we know it. The mind-boggling speed and integration of information, the challenging "V"s of data, the genuine concerns of customers, the astonishing power of analytics, and the convergence with high-performance computing will together shape a field that will continue to emerge and enrich science and engineering for a considerable period of time into the future.

Towards this, my hope is that Visleshanā will serve both as a timely resource that enables interested members of the CSI community to write about their views on Big Data and Analytics and as an active forum for sharing useful updates on the latest activities in the field. Hence, I strongly encourage the CSI community in general, and Big Data enthusiasts in particular, to consider contributing timely briefs and original articles for publication in Visleshanā and thus help us take it forward.

I sincerely thank Mr. Chandra Sekhar Dasaka (Chief Editor and Publisher), Mr. Vishnu S. Pendyala (Editor), and all my fellow members of the Newsletter Editorial Committee for their efforts, and wish Visleshanā an enriching journey.

Sincerely,

Saumyadipta Pyne.

September 30, 2016.

Professor, IIPH Hyderabad. Founder Convener and Chairman, CSI SIG BDA.



Big Data – How do we leverage it?
Shankar Kambhampaty

Traditionally, OLTP systems processed transaction data in relational databases and OLAP systems provided analytics on data in multi-dimensional data stores. This is all fine when the amount (volume) of data is in the range of terabytes, when the rate (velocity) at which data grows is linear, and when the types of data (variety) to be dealt with are limited and structured. Volume, Velocity and Variety are often referred to as the 3Vs of data.

But a few things are changing rapidly. The 3Vs are increasing dramatically in certain cases such as social media. Such data will typically have the following characteristics:
• Huge amount
• Rapid, unplanned growth
• Unstructured data types (photos, videos, text, graphs, etc.)

This is the BIG DATA problem! How do we leverage this in the context of an enterprise? From my experience, it can be done in two clear ways. What if we could pull large amounts of essentially unstructured data into a central infrastructure and were able to:

1) Identify relationships among the data items from different sources. This would open up several new possibilities for leveraging the relationships. A private bank dealing with its high net worth clients can analyze the relationships among key data items available in social media on their clients and position new products of interest to those clients.

2) Identify patterns and trends. This would help predict future events. A retail store would be able to look at unstructured data from various sources and predict the buying pattern of its clients with a certain profile.

There are surely many more ways of leveraging Big Data. I look forward to seeing your views. Businesses have to do something different with what they already have and with what is out there in the open in an unstructured form. Can they look at it differently to generate new opportunities? That, in my view, is the promise of Big Data!

Shankar Kambhampaty is CTO & Associate Director (Technology) at CSC, India. Read more about him at https://in.linkedin.com/in/shankar-kambhampaty-b1132946 and http://www.csc.com/innovation/insights/119613-distinguished_architect_shankar_kambhampaty



Veracity – the Fourth ‘V’ of Big Data
Vishnu S. Pendyala

It is estimated that 500 million tweets are posted in a single day. How much care and caution could have gone into that many tweets posted at that speed? The Web, being humanity’s largest source of information and interaction, probably has the most technological potential to improve the quality of life. The future potential of the Web is discussed in [1]. It can serve as a conduit for serving humanity and presents a huge opportunity to fill some of the pressing gaps. However, a substantial amount of content on the Web is not entirely true. This is part of the problem of Veracity (truthfulness), which is often considered the fourth ‘V’ of Big Data.

Motivation

Information in the age of the Web travels in microseconds to remote corners of the world, and so do lies and the tendency to tell lies. The size of the digital universe is expected to touch 35 ZB (zettabytes) in another 4 years. In terms of IP traffic, most of it is from consumers. Consumer-generated data has grown 44-fold from 2009. Indeed, consumers rather than businesses mostly build the digital universe. So, what does that imply? It means that most of the data is not validated and does not come with the backing of an establishment, as business data does. Lies in general have a substantial cost. Given the many uses of Big Data, falsity therein causes tremendous loss. If the ground truth is unreliable, the entire model built on it to provide useful services to underserved populations collapses.

In five years, one million new devices will come online every hour, creating billions of new interconnections and relationships, and producing more and more data. That is the promise of the Internet of Things (IoT). How reliable will the data measured or generated by the sensors in the IoT framework be? It is imperative that we have tools and techniques to deal with imprecision, inaccuracy, and plain falsehood. Veracity will therefore play a very important role in the evolution of Big Data Analytics.

Methodology

How do we know something is true? When it comes to the intricate matters of life, math comes to the rescue. Math is the heart of matter. Once expressed in math, the matter dissolves and yields, just as a person dissolves and yields when you touch their heart. But before jumping into mathematical abstractions, let us consider how this is done in the real world. After all, a lot of Computer Science solutions are derived from, or inspired by, real-world phenomena. It is often said that data is like people: interrogate it hard enough and it will tell you whatever you want to hear. In serious crimes, that is what the police do to the accused to extract the truth. We need to build the right kind of framework to interrogate Big Data; that is the whole idea of Big Data Analytics.

Establishing truth is the crux of legal battles. How do we tell if someone is telling the truth? We examine the associated details, that is, some features of the scenario. For instance, a lie detector examines features like the pulse rate, blood pressure, and breathing rate. Similarly, Big Data has some features, which the analytics programs will examine. Machine learning algorithms used in the analytics programs depend on features extracted from the real world to classify the data. The features are indicative of the class to which the data belongs. Features therefore are crucial to building a model to detect truth.

Building the Abstraction

To develop a mathematical model, let us take the instance of a fruit vendor. How do we know that the apples that the vendor is selling are really apples and the oranges are indeed oranges? It is from their features like color, shape, size, texture of the skin, hardness, and smell. The features are not constants and their values are not binary. They vary. We cannot expect any feature, like the shape or color, to take a particular value with certainty. For instance, we cannot say for sure that oranges are all spherical or apples are all red. Oranges could be oval as well, and apples can come in maroon and other shades of red. The best tool to model uncertainty is probability. Each feature can therefore be treated as a random variable with a probability distribution. We can also call the features predictor variables, since they help in prediction, or independent variables, like the independent variable ‘x’ in a co-ordinate system.

We pretty much know the probability distribution of the features of the fruits: we know what sizes and shapes apples come in by observing several apples from a set of apples. We call this set the “training set” because we train ourselves in recognizing apples by examining the features of the apples in this set. The variable that is dependent on these features is the category of the fruit, in this case, apple. So, we need to build a model from the training set, which will help predict the category of a fruit, given its features. We call the set of fruits from which the fruit to be identified is drawn the “test set” because the fruits from that set are used to test the model that we built from the training set. We can capture the above discussion in the following equation:

$$P(Y=1 \mid X) = f(X) \tag{1}$$

where X = {x1, x2, …, xn} is the set of random variables xi, and
• xi are the random variables representing the features of the fruit.
• P(Y=1|X) is the probability that a fruit is, say, an apple, given its features, X.

It can be easily seen that not all features are equally important. For instance, shape is not as important in determining the category as color or skin texture, because both oranges and apples can be more or less spherical. So, not all features carry the same weight. We therefore need to weigh each feature in arriving at the decision as to which fruit it is. f(X) in the above equation, expressed in terms of weights, is:

$$f(X) = w_0 + \sum_{i=1}^{n} w_i x_i \tag{2}$$

Before we substitute equation (2) into (1), we need to realize that the L.H.S. of equation (1) is always between 0 and 1, being a probability. So, we need to come up with a way to make the R.H.S. also always lie between 0 and 1. There are a few ways to do this, one of which is to use the “logistic function,” expressed as:

$$\mathrm{logistic}(z) = \frac{1}{1 + e^{-z}} \tag{3}$$

The above function is always positive but less than 1, which is ideal for our use. Using the logistic function in equation (1) gives us a model that helps in identifying the true classification of the fruits. A model thus built using the logistic function is called a “logistic regression” model. The equation in (1) now transforms to:

$$P(Y=1 \mid X) = \frac{1}{1 + e^{-\left(w_0 + \sum_{i=1}^{n} w_i x_i\right)}} \tag{4}$$
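As a minimal sketch of how equations (2) to (4) fit together, the Python snippet below computes the weighted sum, passes it through the logistic function, and returns the probability that a fruit is an apple. The feature names, the weights and the bias are hypothetical values chosen only for illustration; they are not taken from any real data set.

```python
import math


def linear_score(bias, weights, features):
    """Equation (2): f(X) = w0 + sum of w_i * x_i."""
    return bias + sum(w * x for w, x in zip(weights, features))


def logistic(z):
    """Equation (3): logistic(z) = 1 / (1 + e^(-z)).

    The two branches are algebraically identical; the second one
    avoids overflow in math.exp for large negative z.
    """
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def probability_is_apple(bias, weights, features):
    """Equation (4): P(Y=1|X) for one fruit."""
    return logistic(linear_score(bias, weights, features))


# Hypothetical feature order: [redness, skin_smoothness, roundness],
# each scaled to the range 0..1. Roundness gets a small weight because
# both apples and oranges are more or less spherical.
BIAS = -3.5
WEIGHTS = [4.0, 3.0, 0.2]

apple_like = [0.9, 0.8, 0.7]
orange_like = [0.1, 0.2, 0.9]

p_apple_like = probability_is_apple(BIAS, WEIGHTS, apple_like)
p_orange_like = probability_is_apple(BIAS, WEIGHTS, orange_like)

print(f"apple-like fruit:  P(apple) = {p_apple_like:.3f}")
print(f"orange-like fruit: P(apple) = {p_orange_like:.3f}")
```

With these made-up weights, the apple-like fruit scores above 0.5 and the orange-like fruit scores below it, so a threshold of 0.5 would separate the two.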

The problem of determining the truth now reduces to finding the weights w0, w1, …, wi, …, wn in the above equation (4). It is easy to understand that the weights should be chosen to reduce the error between the observed values and the values estimated using the above equation. We use what is called Maximum Likelihood estimation to reduce the error and thereby determine the weights, by applying the above equation to the training set. The word “Maximum” should bring the Calculus lesson on “Maxima and Minima” to mind. Indeed, the same methods are used in computing the weights using the Maximum Likelihood principle. When we are given a new fruit with given features, we substitute the weights derived from the training set and the features extracted from the given fruit into equation (4) to determine if the fruit is indeed what the vendor is trying to sell it as.

The above analogy applies to any Big Data set. The process is summarized in Fig. 1. Big Data can be in the form of social media posts like tweets, measurements from sensors in IoT devices, or other records containing valuable data. Features can be extracted and results annotated using manual or automatic processes. Crowdsourcing tools such as Amazon Mechanical Turk are typically used for manual annotation and feature extraction. An example of automated feature extraction, say, in the case of social media posts like tweets, is sentiment analysis, which gives out a sentiment score for each tweet.
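The weight-finding step can be sketched as follows. This is one simple way to apply the Maximum Likelihood principle, plain gradient ascent on the log-likelihood, and is offered as an illustration rather than as the method used by any particular library. The tiny training set, the feature names, the learning rate and the step count are all assumptions made up for this example.

```python
import math


def logistic(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict(weights, features):
    """Equation (4); weights[0] is w0, weights[1:] pair with the features."""
    z = weights[0] + sum(w * x for w, x in zip(weights[1:], features))
    return logistic(z)


def log_likelihood(weights, samples, labels):
    """Log of the probability the model assigns to the observed labels."""
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for features, label in zip(samples, labels):
        p = predict(weights, features)
        total += label * math.log(p + eps) + (1 - label) * math.log(1 - p + eps)
    return total


def fit(samples, labels, learning_rate=0.5, steps=2000):
    """Gradient ascent: repeatedly move the weights uphill on the log-likelihood."""
    weights = [0.0] * (len(samples[0]) + 1)
    for _ in range(steps):
        gradient = [0.0] * len(weights)
        for features, label in zip(samples, labels):
            error = label - predict(weights, features)  # observed minus estimated
            gradient[0] += error
            for i, x in enumerate(features):
                gradient[i + 1] += error * x
        weights = [w + learning_rate * g / len(samples)
                   for w, g in zip(weights, gradient)]
    return weights


# Hypothetical training set: [redness, skin_smoothness]; label 1 = apple, 0 = orange.
TRAIN_X = [[0.9, 0.8], [0.8, 0.9], [0.7, 0.7], [0.2, 0.3], [0.1, 0.2], [0.3, 0.1]]
TRAIN_Y = [1, 1, 1, 0, 0, 0]

weights = fit(TRAIN_X, TRAIN_Y)
new_fruit = [0.85, 0.75]  # a fruit the vendor claims is an apple
print(f"P(apple) for the new fruit = {predict(weights, new_fruit):.3f}")
```

Because this toy training set is perfectly separable, the weights would keep growing if the loop ran longer; the fixed number of steps acts as a crude stopping rule. Real data sets are rarely this clean, and library implementations typically add regularization to handle such cases.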



Fig. 1. The Process of determining truth of Big Data
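The feature-extraction stage of this process can be sketched in a few lines. The snippet below turns each post into a short list of numbers, including a crude sentiment score. The word lists and the choice of the other two features (token count and number of exclamation marks) are hypothetical and only for illustration; a real sentiment analyzer relies on far richer resources.

```python
import re

# Hypothetical mini-lexicon, for illustration only.
POSITIVE_WORDS = {"good", "great", "love", "confirmed", "safe"}
NEGATIVE_WORDS = {"bad", "terrible", "hate", "hoax", "fake"}


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def sentiment_score(text):
    """(positive words - negative words) / total tokens; 0.0 for empty text."""
    tokens = tokenize(text)
    if not tokens:
        return 0.0
    positives = sum(token in POSITIVE_WORDS for token in tokens)
    negatives = sum(token in NEGATIVE_WORDS for token in tokens)
    return (positives - negatives) / len(tokens)


def extract_features(text):
    """Map one post to the numeric feature list [x1, x2, x3]."""
    return [
        sentiment_score(text),       # x1: sentiment score
        float(len(tokenize(text))),  # x2: number of tokens
        float(text.count("!")),      # x3: number of exclamation marks
    ]


posts = [
    "Great news, the bridge is safe!",
    "This is a terrible hoax!!!",
]
feature_matrix = [extract_features(post) for post in posts]
for post, features in zip(posts, feature_matrix):
    print(features, "<-", post)
```

Each row of `feature_matrix` plays the role of X in equation (1); together with manually annotated labels, such rows would form the training set from which the weights are estimated.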

Once features such as the sentiment scores are expressed in math, we no longer have to deal with the original data. We can apply statistical techniques, such as the logistic regression used in the above example, to this mathematical abstraction to arrive at the classification.

Conclusion

Truthfulness of Big Data will play a pivotal role in its usefulness. In this article, we have seen how the veracity of Big Data can be treated as a math problem and then how analytics can be run on it. Complexity can grow with the application, but the underlying principles highlighted in this article remain the same. For a discussion on how some of these techniques can be applied to Twitter data, please see [2] in the references below.

References

[1] Pendyala, V. S., Shim, S. S., & Bussler, C. (2015). The web that extends beyond the world. Computer, 48(5), 18-25.
[2] Pendyala, V. S., & Figueira, S. (2015, October). Towards a truthful world wide web from a humanitarian perspective. In Global Humanitarian Technology Conference (GHTC), 2015 IEEE (pp. 137-143). IEEE.

Vishnu Pendyala is a Senior Member of IEEE and Computer Society of India, with over two decades of software experience with industry leaders like Cisco, Synopsys, Informix (now IBM), and Electronics Corporation of India Limited. Read more about him at https://www.linkedin.com/in/pendyala and his publications at: https://scholar.google.com/scholar?hl=en&q=vishnu+pendyala



Careers In Business Analytics
Krishna Kumar

Business Analytics is an area that is increasingly occupying the mind of every CEO in the world. We can safely assume that almost all CEOs have given serious thought to Business Analytics, or initiated some discussions or actions about it, over the past 12-18 months. They would have wondered, as a company:
• If they know all the data they are generating
• If they know all their customers across multiple channels
• If they have tapped into all the data to unearth their customer preferences

As with any emerging area, these thoughts give rise to vague plans, which crystallize over a period of time as companies experiment and realize the potential of data analytics. In this article we will discuss a few of the emerging roles in Business Analytics and explore what each role is expected to deliver for the business.

What exactly is Business Analytics?

Business Analytics is data-driven decision making. It is a process of:
1. Continuous exploration / investigation of data from all possible sources
2. Detecting patterns / relationships in the data
3. Identifying potential areas and experimenting in pockets to validate assumptions
4. Applying the business insights gained to drive the business direction of the company

Questions that are typically asked in this context:
a) What factors influence your business growth?
b) Who are your most profitable customers?
c) What are the things to do to enhance customer life-time value?
d) Which product / service offering helps you win customers?
e) Which product / service offering helps maximize your profits?
f) Why are customers not buying from the website after all the money we pump into advertising?

The questions are endless and will depend on the kind of business under consideration or the growth phase of the business. As you can envisage, the work to be done here calls for serious business knowledge, since it can define the future course of the company.

As with any emerging or maturing field, Business Analytics, which often sits at the confluence of different kinds of expertise, challenges the CEO to assemble the right team, with the right skill mix, to be put in charge of defining the future of the business. For the crack team in charge of attempting to define the future direction of the company, people are drawn both from within the company and from outside to fill the gaps in expertise needed for such a task. Often the roles are loosely defined to begin with; as the team works together, more gaps are identified, further expertise is sought, and new roles are defined.

In Business Analytics too, new roles are emerging. People to fill them are often difficult to find in the market, and these are therefore well-paid positions. So anyone with the right bent of mind who is willing to apply himself / herself to learning new skills can identify suitable roles in the Business Analytics arena that are close to one's core strengths and experience. As mentioned in the initial paragraphs, the roles are emerging and slowly getting crystallized, and job descriptions are getting better as we mature in this area. The new areas in business analytics, which include Big Data, cloud computing, and data management, are creating several new roles. Many of the Business Analytics roles today feature among the hottest skills in job surveys in 2015, for example, Data Mining, Data Engineering, Data Scientist, and Data Presentation. In fact, Harvard Business Review called Data Scientist the sexiest job of the 21st century in 2012. Lots of companies are looking for the right talent and you could be the one chosen if you can combine your current capabilities with new skills learned in this area.

Business Analytics is the right area for you if you have the following basic qualities:
• You love working on data and have good analytical skills
• You have basic knowledge of mathematics
• You have the drive to get to the top in the game and can do a lot of self-learning

Of course, there are several new courses being offered by well-known universities in areas such as Statistics and Data Science that equip you with the basic skills for such roles.



Call for Contributions

Submissions, including technical papers, in-depth analyses, and research articles, are invited for publication in “Visleshana”, the newsletter of SIG-BDA, CSI, on topics that include but are not limited to the following:
• Big Data Architectures and Models
• The ‘V’s of Big Data: Volume, Velocity, Variety, Veracity, Visualization
• Cloud Computing for Big Data
• Big Data Persistence, Preservation, Storage, Retrieval, Metadata Management
• Natural Language Processing Techniques for Big Data
• Algorithms and Programming Models for Big Data Processing
• Big Data Analytics, Mining and Metrics
• Machine learning techniques for Big Data
• Information Retrieval and Search Techniques for Big Data
• Big Data Applications and their Benchmarking, Performance Evaluation
• Big Data Service Reliability, Resilience, Robustness and High Availability
• Real-Time Big Data
• Big Data Quality, Security, Privacy, Integrity, Threat and Fraud detection
• Visualization Analytics for Big Data
• Big Data for Enterprise, Vertical Industries, Society, and Smart Cities
• Big Data for e-governance
• Innovations in Social Media and Recommendation Systems
• Experiences with Big Data Project Deployments, Best Practices
• Big Data Value Creation: Case Studies
• Big Data for Scientific and Engineering Research
• Supporting Technologies for Big Data Research
• Detailed Surveys of Current Literature on Big Data

We are also open to:
• News, Industry Updates, Job Opportunities
• Briefs on Big Data events of national and global importance
• Code snippets and practice related tips, techniques, and tools
• Letters, e-mails on relevant topics and feedback
• People matters: Executive Promotions and Career Moves

All submissions must be original, not previously published or under consideration for publication elsewhere. The Editorial Committee will review submissions for acceptance and reserves the right to edit the content. Please send the submissions to the editor, Vishnu S. Pendyala, at visleshana@gmail.com



The Flagship Publication of the Computer Society of India, Special Interest Group on Big Data Analytics (CSI SIGBDA). Vol 1 No.1 Oct - Dec 2...
