

Volume 1, Issue 3

Newsletter of the Special Interest Group on Big Data Analytics

April – June 2017

Chief Editor and Publisher: Chandra Sekhar Dasaka
Editor: Vishnu S. Pendyala
Editorial Committee: B.L.S. Prakasa Rao, S.B. Rao, Krishna Kumar, Shankar Khambhampati and Saumyadipta Pyne

Please note: Visleshana is published by Computer Society of India (CSI), Special Interest Group on Big Data Analytics (CSI-SIGBDA), a non-profit organization. Views and opinions expressed in Visleshana are those of individual authors, contributors and advertisers and they may differ from policies and official statements of CSI-SIGBDA. These should not be construed as legal or professional advice. The CSI-SIGBDA, the publisher, the editors and the contributors are not responsible for any decisions taken by readers on the basis of these views and opinions. Although every care is taken to ensure genuineness of the writings in this publication, Visleshana does not attest to the originality of the respective authors’ content. © 2017 CSI, SIG-BDA. All rights reserved. Instructors are permitted to photocopy isolated articles for non-commercial classroom use without fee. For any other copying, reprint or republication, permission must be obtained in writing from the Society. Copying for other than personal use or internal reference, or of articles or columns not owned by the Society without explicit permission of the Society or the copyright owner is strictly prohibited.

From the Editor’s Desk

Dear Readers,

We are making good progress towards making Visleshana a research publication of international quality; the content and format of this issue are a step in that direction. The primary publication of a Special Interest Group should be a transactions-type journal. The good news is that Big Data Analytics is attracting plenty of research and there is plenty of scope for exploration. We are therefore quite hopeful of receiving excellent research articles that will be cited many times.

This issue of Visleshana features a detailed research article on segmenting customers based on their Customer Lifetime Values (CLV). Customer relationship management is an important paradigm contributing to the success of a business, and CLV is a good metric that businesses should be using to nurture long-term customer relationships for effective monetization in the future. Authors Nikita and Surya recognized this importance and used the Pareto/NBD technique to model retail data, segment customers into gold, silver and bronze categories, and evaluate the accuracy of their methodology.

It is a great experience to hear experts elucidate complex technologies in simple terms. We are fortunate that Prof. B.L.S. Prakasa Rao, a world-renowned statistician and Shanti Swarup Bhatnagar awardee, has kindly consented for his Lecture Notes on Big Data to be excerpted into an article for this issue of Visleshana.

Engineering without regard to ethical considerations is like humanity without a conscience: it may actually take civilization backwards instead of helping it progress. Dr. Aruru, who is making remarkable contributions to Health Policy and Ethics, writes in her article about the emerging interdisciplinary area of Big Data Ethics and the timely, pertinent questions that it raises.

No discussion of Big Data Analytics is complete without discussing career opportunities. Krishna Kumar delves into this topic and presents excellent insights into Big Data careers, describing in a lucid manner the key roles and requisite skills of what is often dubbed “the sexiest job of the 21st century”.

IEEE’s international conference on Big Data Services was organized in San Francisco this year. I had the opportunity to attend and also to present a technical paper at the conference. I thought the readers would benefit from quick insights into the conference proceedings without actually having to sit through the entire conference or read the bulky proceedings, so I compiled and included my notes from the conference in this issue. You can read about Kinetic Drives, how Walmart is trying to solve the NP-hard problem of matching purchase items to packaging boxes, and several other interesting snippets from the presentations.

Happy Reading!

With Every Best Wish,
Vishnu S. Pendyala
April 16, 2017
San Jose, California, USA

April - June 2017 ^ Visleshana ^ Vol. 1 No.3



Customer Segmentation based on Lifetime Value

Nikita Naidu and Surya Putchala

Abstract—Businesses are built around customers. Understanding customer value is by far the most important factor affecting a business’s viability and sustainability. Customer lifetime value (CLV) gives an understanding of how profitable a customer will be throughout his journey with the business. Modelling CLV therefore becomes one of the most critical and challenging problems. In this paper, we describe some of the ways of calculating customer value, ranging from historic CLV to predictive CLV, in two business settings: contractual and non-contractual. Further, we present the results of our experiments on CLV modelling in a non-contractual business setting using the Pareto/NBD probabilistic modelling technique. We describe our method of classifying customers into gold, silver and bronze classes according to their lifetime value (LTV), and we suggest a soft-margin methodology for improving classification accuracy. This improved the classification accuracy on the dataset under study to 74%, which is encouraging.

Index Terms—Customer Lifetime Value (CLV)/Lifetime Value (LTV), Customer Segmentation, Historic CLV, Predictive CLV, Pareto/NBD probabilistic model.





There are many definitions of customer value, but the most popular is the amount of future revenue generated by the customer. Using CLV information, businesses can easily segment their customers into low-, medium- and high-value segments and target their resources at the right group, maximizing returns. If customer value is understood, we can:
• Determine which customers to invest in
• Identify new customers and markets to target
• Agree which product and service lines should be offered and promoted
• Change pricing to extract more value
• Identify the unprofitable customers
• Understand where to cut costs and investments that are not generating growth

A robust understanding of the lifetime value of the customers can provide a clearer view of the valuation of a business, in addition to the potential opportunities to increase that value. Customer lifetime value (CLV) is an important metric for understanding customers and attributing the value a customer brings to the business. Customers differ in their purchasing habits in terms of frequency of visits, recency of visits and the amount spent per visit; hence, customers differ in the value they bring to the business. The Pareto principle states that 80% of the revenue is generated by 20% of the customers. CLV helps businesses understand their most profitable users and tailor their marketing strategies to target those customers and maximize the return on investment.

There are many techniques available in the literature for modelling CLV. CLV can be historic or predictive. Historic CLV can be calculated simply from a user’s past behaviour, without any estimate of his future behaviour. Predictive CLV, however, requires modelling the customer’s purchase rate, to estimate how frequently he will buy in the future, and the customer’s average lifespan. We describe these in detail in Section 2. In [1], Peter and Bruce explain how businesses can be broadly classified into contractual and non-contractual types and present brilliant methods for modelling predictive CLV. We have selected Pareto/NBD modelling as an appropriate technique for the e-commerce retail data under study. The results and data description are provided in Section 3.



Historical CLV is calculated based on the past transactions of the customer and does not involve any estimate of future transactions. Predictive CLV, on the other hand, models the purchase behaviour of the customer and makes predictions about future transactions to estimate the potential revenue that the customer can generate.

2.1 Historical Customer Lifetime Value

Average Revenue Per User (ARPU) is one of the simplest and most used ways of calculating historic CLV. It involves calculating the average monthly revenue of every customer and then multiplying by 12 to get a total CLV for a year.

    Customer   Purchase Date   Amount
    James      Sep 07, 2016    $190
    James      Oct 10, 2016    $100
    Mary       Jan 01, 2017    $50
    Mary       Jan 02, 2017    $75
    Mary       Feb 20, 2017    $150

TABLE 1: Customer purchase details

Consider the toy example of customer purchases in Table 1. Suppose we want to calculate the aggregate CLV of the customers as on 01 March 2017. Customer James has bought twice in the past 6 months and customer Mary has bought 3 times in the past 2 months. Their ARPU can be calculated using the formula:

    ARPU = Total purchase amount / Total months    (1)

    Avg. revenue per month:
    James   ($190 + $100) / 6 = $48.33
    Mary    ($50 + $75 + $150) / 2 = $137.5
    ARPU    ($48.3 + $137.5) / 2 = $92.9

TABLE 2: Aggregated CLV using ARPU

The individual-level ARPU per month (using formula (1)) can be aggregated to obtain the ARPU of the entire population: ARPU = $92.9. To obtain a 6-month or 12-month CLV, multiply the ARPU by 6 or 12. The ARPU model can be improved by considering recency of visit as a factor to distinguish between active and inactive customers. It may also include frequency of visits as a measure to give more weight to frequent customers. The ARPU model nevertheless has big limitations. It does not capture heterogeneity in customer behaviour: it treats all customers as the same. Further, it assumes that customer behaviour remains constant throughout the customer journey, which is not true. ARPU estimations can therefore be misleading and give a poor estimate of CLV.
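The ARPU calculation above can be sketched in a few lines. This is a hypothetical helper, not code from the paper; the function name and data layout are our own, and the toy data is taken from Table 1:

```python
def arpu(transactions, months_active):
    """Per-customer monthly revenue (formula (1)) and its population average."""
    per_customer = {
        c: sum(amounts) / months_active[c]
        for c, amounts in transactions.items()
    }
    overall = sum(per_customer.values()) / len(per_customer)
    return per_customer, overall

# Toy data from Table 1, evaluated as on 01 March 2017
tx = {"James": [190, 100], "Mary": [50, 75, 150]}
months = {"James": 6, "Mary": 2}
per_customer, overall = arpu(tx, months)
# per_customer["James"] ≈ 48.33, per_customer["Mary"] = 137.5, overall ≈ 92.9
```

Multiplying `overall` by 6 or 12 then gives the 6-month or 12-month aggregate CLV, exactly as described above.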

2.2 Predictive Customer Lifetime Value

This method takes customer dropouts (churn) into consideration to determine the average lifespan of the customer with the organisation. However, it is not trivial to predict the future purchase pattern for every customer, as it depends on the type of business. There are two types of B2C businesses: subscription-based services (such as cable TV, digital newspapers, mobile plans, club memberships, software) and open retail (groceries, electronics, food stores, etc.). In a subscription service, the end of the customer journey is known definitively when the customer subscription comes up for renewal. These services are driven by a contractual setting throughout the customer journey: the customer or the business has to overtly and wilfully stop the service. The purchases can happen at regular or irregular intervals. On the contrary, in a retail store or e-commerce set-up (non-contractual), there is no event that marks the end of the customer journey. The latent parameters, i.e. customer lifespan, purchase rate and monetary spend, have to be derived from deviations in customer behaviour: a customer can leave at any time, or even come back later for a purchase. Hence, these different business scenarios have to be modelled with different features and metrics in order to estimate customer churn and thus CLV.

Predictive CLV in a contractual setting:
In contractual settings, CLV can be calculated by dividing the population into cohorts and capturing the dynamics of the customer retention ratio, using the following standard formula:

    CLV = m × Σ_{t=0}^{T} r^t / (1 + d)^t    (2)

where
    m = net cash flow per period (if active)
    r = retention rate
    d = discount rate
    T = horizon for the calculation

Consider the following illustration of a hypothetical scenario in a contractual setting. Assume:
• Each contract is annual, starting on January 1 and expiring at 11:59pm on December 31.
• An average net cash flow of $200/year.
• A 10% discount rate.

Table 3 shows the cohorts by acquisition date and Table 4 shows the year-wise retention ratio in each cohort. What is the expected residual value of the customer base at December 31, 2016? The aggregate retention can be calculated as:

    (2504 + 3264 + 4367 + 6334) / (3264 + 5179 + 7339 + 10000) = 0.64

The expected residual value of the customer base at December 31, 2016, using the standard formula (2), is then

    $26469 × Σ_t 0.64^t / (1 + 0.1)^t

(T can be any number of years)
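Formula (2) is easy to check numerically. The sketch below is our own helper, not code from the paper; it plugs in the illustration’s assumptions (m = $200/year, r = 0.64 aggregate retention, d = 10%):

```python
def contractual_clv(m, r, d, T):
    """CLV = m * sum_{t=0}^{T} r**t / (1 + d)**t  -- formula (2)."""
    return m * sum(r ** t / (1 + d) ** t for t in range(T + 1))

# $200/year net cash flow, 64% aggregate retention, 10% discount rate
clv_5yr = contractual_clv(m=200, r=0.64, d=0.10, T=5)   # ≈ $459.7
clv_inf = 200 / (1 - 0.64 / 1.10)                       # geometric-series limit ≈ $478.3
```

As T grows, the sum converges to the geometric-series limit m / (1 − r/(1+d)), which is why the exact choice of horizon matters little once r/(1+d) is well below 1.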

              Cohort 1   Cohort 2   Cohort 3   Cohort 4   Cohort 5
    Year 1      10000
    Year 2       8559      10000
    Year 3       6899       6334      10000
    Year 4       3264       5179       7339      10000
    Year 5       2504       3264       4367       6334      10000

TABLE 3: Number of active customers year by year in each cohort

              Yr 1→2   Yr 2→3   Yr 3→4   Yr 4→5
    Cohort 1    0.86     0.81     0.47     0.77
    Cohort 2             0.63     0.82     0.63
    Cohort 3                      0.73     0.60
    Cohort 4                               0.63
    Cohort 5                                 -

TABLE 4: Retention ratio in each cohort

Predictive CLV in a non-contractual setting: There are several different probabilistic models, but they all revolve around the same assumptions and modelling framework.




A typical probabilistic modelling framework:
1. Estimate the latent (unobserved) parameters x of the setting, i.e. customer lifespan, purchase rate and monetary spend, by treating behaviour as if it were random (probabilistic).
2. Select a distribution f(x|θ) that can fit this behaviour x. These are individual-level latent traits.
3. Specify a distribution f(θ) that characterizes the population-level distribution of the latent trait θ.

Next, we describe one of the most popular probabilistic modelling frameworks for the non-contractual setting: Pareto/NBD.

Pareto–Negative Binomial Distribution (Pareto/NBD): This is one of the most popular probabilistic modelling techniques for calculating CLV in a non-contractual setting. It makes the following assumptions about the latent parameters:
1. The individual-level purchase rate λ is characterized by a Poisson distribution.
2. λ is distributed across the population by a gamma distribution characterized by r and α.
3. The individual-level churn rate μ is characterized by an exponential distribution.
4. μ is distributed across the population by a gamma distribution characterized by s and β.

The model takes the recency, frequency, total transactions in the calibration period and total observed time for each customer in the population. It outputs the four parameters r, α, s, β, from which the purchase rate and the dropout (churn) rate can be estimated: the expected purchase rate is E(λ) = r/α and the expected dropout rate is E(μ) = s/β.

Estimating CLV using the Pareto/NBD model: The model outputs the expected number of transactions in the future at the customer level. It takes an individual customer’s recency, frequency, total transactions and observation time to produce the expected number of transactions T at the end of period t. CLV can then be simply calculated using:

    CLV = T × AOV    (3)

where
    T = number of transactions in time t
    AOV = Average Order Value

Thus, there are many ways of modelling CLV. The performance of these models must be validated by comparing the predicted CLV with the actual revenue generated by customers in the future. In the next section, we present our method of modelling CLV for the retail domain and show how it can be used to segment customers.
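The four Pareto/NBD assumptions can be illustrated with a small simulation of the model’s data-generating process. This is a sketch, not the fitting procedure, and the parameter values are illustrative (chosen so that E(λ) = r/α = 0.32 and E(μ) = s/β is tiny), not estimates from the paper:

```python
# Each customer draws lam ~ Gamma(r, alpha) and mu ~ Gamma(s, beta),
# then buys at Poisson rate lam until an exponential lifetime (rate mu) ends.
import math, random

def poisson_draw(lam, rng):
    # Knuth's method; adequate for the small means used here
    L, k, p = math.exp(-lam), 0, 1.0
    while True:
        p *= rng.random()
        if p <= L:
            return k
        k += 1

def simulate_pareto_nbd(r, alpha, s, beta, T, n, seed=42):
    rng = random.Random(seed)
    counts = []
    for _ in range(n):
        lam = rng.gammavariate(r, 1.0 / alpha)             # E(lam) = r / alpha
        mu = max(rng.gammavariate(s, 1.0 / beta), 1e-12)   # E(mu)  = s / beta
        alive = min(rng.expovariate(mu), T)                # lifetime observed in [0, T]
        counts.append(poisson_draw(lam * alive, rng))
    return counts

# With E(lam) = 0.32 purchases/month, a tiny dropout rate, and 6 months of
# observation, the mean purchase count comes out near 0.32 * 6 = 1.92.
counts = simulate_pareto_nbd(r=0.32, alpha=1.0, s=1.0, beta=16000.0, T=6, n=4000)
```

Fitting the model runs this logic in reverse: given each customer’s observed recency, frequency and observation time, maximum likelihood recovers r, α, s and β.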


We modelled predictive CLV using the Pareto/NBD model to obtain a lifetime value for every customer in the dataset. The dataset was divided into calibration data and holdout data. The customers in the calibration data were then segmented as gold (70% of total revenue), silver (20%) and bronze (10%) based on the percentage of the predicted total revenue (by CLV) generated by them. We compared the segmentation results to the actual revenue generated by the customers in the holdout data.

We also came up with a distinctive approach to segmentation: the soft margin. The classes (gold, silver, bronze) are only logical segments generated using filters (revenue cut-offs). The adjacent classes (gold and silver, silver and bronze) are separated with many customers falling on, or very close to, the margin. Hence, instead of a hard margin we propose a soft margin of ±5% of total revenue on either side of each margin. Our new segments are therefore gold (70% ± 5%), silver (20% ± 5%) and bronze (10% ± 5%). The idea is to classify the people falling in an overlapping region as members of both classes: people generating revenue between 65% and 75% are both gold and silver customers, and similarly people generating revenue between 5% and 15% are both silver and bronze customers.

Dataset and experimental setup: We used the Online Retail dataset from the UCI ML repository [3]. The data consists of 541909 instances. Preprocessing included removing negative transactions (returns, order cancellations), invoices with an order value of 0, and duplicate line items. We divided the data into calibration and holdout sets as follows:
• Dataset length: transactions between 2010-12-01 and 2011-12-31 (13 months)
• Cut-off date: 2011-05-31
• Calibration set: all transactions with invoice date ≤ cut-off date
• Holdout set: all transactions with invoice date > cut-off date
• Total customers in the entire dataset: 5213
• Customers in the calibration set (old/observed): 3131
• Customers in the holdout set (new/unobserved): 2082

Modelling and parameter estimation: The model was built on the calibration set and the CLV for all the customers (new and old) was predicted for the next 7 months (the holdout duration). Customers were observed for only 6 months, and based on their purchase patterns our model learned the churn rate and purchase rate of the customers as shown in Table 5. The expected purchase (transaction) rate of the general population is 0.32 transactions per month, while the dropout rate based on the inter-purchase transaction time is very low (0.0000617). The distributions of both are shown in Fig. 1 and Fig. 2.

    purchase rate E(λ) : 0.32 transactions/month
    churn rate    E(μ) : 0.0000617

TABLE 5: Model parameters
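The soft-margin segmentation described above can be made concrete with a short sketch. The code is ours, not the paper’s: it assumes customers are ranked by predicted CLV, cumulative revenue share is cut at 70%/90% for gold/silver/bronze, and a ±5% band around each cut assigns both adjacent labels:

```python
def segment(clv_by_customer, cuts=(0.70, 0.90), soft=0.05):
    """Map each customer to a set of class labels based on cumulative
    predicted-revenue share (G = gold, S = silver, B = bronze)."""
    total = sum(clv_by_customer.values())
    order = sorted(clv_by_customer, key=clv_by_customer.get, reverse=True)
    labels, cum = {}, 0.0
    for c in order:
        cum += clv_by_customer[c] / total
        if cum <= cuts[0] - soft:
            labels[c] = {"G"}
        elif cum <= cuts[0] + soft:
            labels[c] = {"G", "S"}          # 65%-75% band
        elif cum <= cuts[1] - soft:
            labels[c] = {"S"}
        elif cum <= cuts[1] + soft:
            labels[c] = {"S", "B"}          # 85%-95% band
        else:
            labels[c] = {"B"}
    return labels

# Hypothetical predicted CLVs summing to 100 for easy reading
clv = {"a": 60.0, "b": 20.0, "c": 12.0, "d": 5.0, "e": 3.0}
labels = segment(clv)
# "a" is pure gold; "c" (cumulative share 92%) is both silver and bronze
```

Setting `soft=0.0` recovers the regular (hard-margin) segmentation.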

Validation and performance: The model fits the calibration set well, with a log-likelihood of -6575.19. As shown in Fig. 3, the model’s estimate of the frequency of repeat transactions for the customers in the calibration period closely matches their actual frequency.




Fig. 1: Distribution of transaction rate

Fig. 2: Distribution of dropout rate

Fig. 3: Actual vs predicted frequency of transactions in the calibration period

Fig. 4: Actual vs predicted transactions in the calibration and holdout periods

Similarly, the model’s conditional expectation of the number of future transactions, given the number of transactions in the calibration period for a customer, closely follows the actual purchase pattern (Fig. 4). Thus, we can conclude that the model has learnt the purchase rate and churn rate very well.

Segmentation of customers and results: Customer lifetime value has been calculated using formula (3), CLV = T × AOV, where T is the customer’s number of transactions in time t and AOV is the Average Order Value. The customers have been segmented into Gold (G), Silver (S) and Bronze (B) classes based on the percentage of revenue predicted by CLV. We grouped the customers in two ways, regular margin and soft margin.

Regular margin:
• Gold (G) – customers generating the top 70% of the total revenue
• Silver (S) – customers generating the next 20% of the total revenue
• Bronze (B) – customers generating the remaining 10% of the total revenue

Soft margin:
• Gold (G) – customers generating the top 65% of the total revenue
• Gold–Silver (GS) – customers generating the 65%–75% band of the total revenue
• Silver (S) – customers generating the 75%–85% band of the total revenue
• Bronze–Silver (BS) – customers generating the 85%–95% band of the total revenue
• Bronze (B) – customers generating the last 5% of the total revenue

Further, we calculated the actual revenue generated by the customers in the holdout period and segmented them into the G, S, B buckets. Tables 6–8 show the confusion matrices obtained with regular margins for the different classes of customers. We obtained a prediction accuracy of 0.64 over all customers (refer to Table 9). Classification using the soft margin improves the prediction accuracy to 0.74 (refer to Table 13); the corresponding confusion matrices are shown in Tables 10–12. Any point in segment GS is classified as both Gold and Silver (and likewise for BS). The improvement in prediction accuracy may look obvious, since some of the bands have overlapping classes, but the highlight is the improvement in sensitivity and specificity






(rows: predicted class; columns: actual class)

                 actual G   actual S   actual B
    predicted G      535        323        234
    predicted S      142        527        874
    predicted B       28        257       2293

TABLE 6: Confusion matrix for regular margin, all customers

                 actual G   actual S   actual B
    predicted G      303        145        189
    predicted S       97        297        411
    predicted B       26        205       1458

TABLE 7: Confusion matrix for regular margin, old customers

                 actual G   actual S   actual B
    predicted G      232        178         45
    predicted S       45        230        463
    predicted B        2         52        835

TABLE 8: Confusion matrix for regular margin, new customers

              Gold             Silver           Bronze          Prediction
           Sens.  Spec.     Sens.  Spec.     Sens.  Spec.       accuracy
    All    0.76   0.88      0.48   0.75      0.67   0.84          0.64
    Old    0.71   0.88      0.46   0.80      0.71   0.78          0.66
    New    0.83   0.88      0.50   0.69      0.62   0.93          0.62

TABLE 9: Performance measures for each class of customers with regular margin
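The sensitivity and specificity figures in Table 9 can be reproduced directly from the confusion matrices. The helper below is our own code; it assumes rows are predicted classes and columns are actual classes, which is the orientation that makes the numbers in Tables 6 and 9 agree:

```python
def sens_spec(m, labels):
    """Per-class sensitivity (TPR) and specificity (TNR) from a square
    confusion matrix m with rows = predicted, columns = actual."""
    n = len(labels)
    total = sum(sum(row) for row in m)
    out = {}
    for k, lab in enumerate(labels):
        tp = m[k][k]
        actual = sum(m[i][k] for i in range(n))   # column sum
        predicted = sum(m[k])                     # row sum
        fn, fp = actual - tp, predicted - tp
        tn = total - tp - fn - fp
        out[lab] = (tp / (tp + fn), tn / (tn + fp))
    return out

# Table 6 (regular margin, all customers)
table6 = [[535, 323, 234],
          [142, 527, 874],
          [ 28, 257, 2293]]
rates = sens_spec(table6, "GSB")
# rates["G"] ≈ (0.76, 0.88), matching the Gold column of Table 9
```

The same helper applied to Tables 7, 8 and 10–12 reproduces the remaining rows of Tables 9 and 13.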



(rows: predicted class; columns: actual class)

                 actual G   actual S   actual B
    predicted G      586        246        186
    predicted S       91        604        563
    predicted B       28        257       2652

TABLE 10: Confusion matrix for soft margin, all customers

                 actual G   actual S   actual B
    predicted G      339         95        189
    predicted S       61        347        230
    predicted B       26        205       1639

TABLE 11: Confusion matrix for soft margin, old customers

                 actual G   actual S   actual B
    predicted G      247        151         45
    predicted S       30        257        285
    predicted B        2         52       1013

TABLE 12: Confusion matrix for soft margin, new customers

              Gold             Silver           Bronze          Prediction
           Sens.  Spec.     Sens.  Spec.     Sens.  Spec.       accuracy
    All    0.83   0.90      0.55   0.84      0.78   0.84          0.74
    Old    0.80   0.90      0.54   0.88      0.80   0.78          0.74
    New    0.89   0.90      0.56   0.81      0.75   0.93          0.73

TABLE 13: Performance measures for each class of customers with soft margin

in each class. (Sensitivity and specificity are calculated in the standard way, as the true positive rate and true negative rate respectively.) As per the Pareto principle, we expect a small group of people to contribute a very large share of the revenue; here, the gold customers are expected to generate high revenue (70%). So, to validate the CLV values, we plotted each class (B, G, S, as predicted by CLV for the holdout period) against the actual revenue generated in the holdout period. We see that the 17% of the population segmented as class G by the model indeed generates 63.3% of the revenue; refer to the pie chart in Fig. 5. Thus, our model predictions are highly accurate and can yield very useful insights on customer value.

As described by Peter and Bruce in [2], a customer’s average transaction value (average order value) is an imperfect estimate of his mean transaction value. That paper also presents the gamma-gamma extension to the Pareto/NBD model for estimating a customer’s mean transaction value. Our future work involves extending our model to estimate the mean transaction value using the gamma-gamma extension and studying its effect on Customer Lifetime Value.

The calculation of CLV is non-trivial and depends on the type of business: contractual or non-contractual. CLV helps simplify the allocation of marketing budget to various channels. It also helps in formulating strategic plans to acquire new customers, engage and retain existing customers, and increase loyalty to create a sustainable revenue stream. It is thus critical to predict CLV with high accuracy. Our model can predict CLV with 74% accuracy for both observed and new customers. Further, the model predicts the gold-class customers with 0.83 sensitivity and 0.90 specificity, successfully identifying the most important class of customers.

Fig. 5: Predicted CLV bins versus actual revenue

REFERENCES
[1] Peter S. Fader and Bruce G. S. Hardie, “Probability Models for Customer-Base Analysis”, cba tutorial handout.pdf
[2] Peter S. Fader and Bruce G. S. Hardie, “The Gamma-Gamma Model of Monetary Value”, gamma.pdf
[3] Online Retail dataset, UCI ML repository, datasets/Online+Retail




Nikita Naidu is a Data Scientist at Cappius Technologies with expertise in Artificial Intelligence, Machine Learning and Statistical Analysis. She is currently working on the design and modelling of the Customer Insights framework, a platform to understand the customers of various businesses that delivers actionable insights useful in all phases of the customer journey: Retention, Engagement and Acquisition. Earlier she worked on Root Cause Analysis and Continuous Improvement of business with Accenture Services Pvt Ltd. Nikita holds a Bachelor of Engineering in CS from the University of Pune and a Master of Technology in AI from the University of Hyderabad, where she was awarded the graduation Gold Medal for academic excellence.

Surya Putchala has provided thought-leading consulting solutions in the areas of Business Intelligence, Data Warehousing, Data Management and Analytics to Fortune 500 clients over the last two decades. Surya has a tremendous zeal for creating Analytics products that bring significant improvements in business performance, and he is currently the vision behind the comprehensive customer experience management platform called “Capptix”. He is passionate about areas related to Data Science, Big Data, High Performance Cluster Computing and Algorithms, and he constantly endeavours to evangelize the adoption of quantitative techniques for decision making in various verticals. He has held senior leadership roles with firms such as GE Capital, Cognizant, Accenture and HCL. He graduated from IIT Kharagpur.



Brief Notes On Big Data

B.L.S. Prakasa Rao

Abstract - Without any doubt, the most discussed current trend in statistics is BIG DATA. Different people think of different things when they hear about Big Data. For statisticians, the question is how to get usable information out of databases that are so huge and complex that many of the traditional or classical methods cannot handle them. For computer scientists, Big Data poses problems of data storage and management, communication and computation. For citizens, Big Data brings up questions of privacy and confidentiality. These brief notes give a cursory look at ideas on several aspects connected with the collection and analysis of Big Data. They are a compilation of ideas from different people, from various organizations and from different sources online. Our discussion does not cover computational aspects in the analysis of Big Data.

[Excerpted from C R RAO AIMSCS Lecture Notes Series]

1 WHAT IS BIG DATA? (Fan et al. (2013))

Big Data is relentless. It is continuously generated on a massive scale. It is generated by online interactions among people, by transactions between people and systems, and by sensor-enabled equipment such as aerial sensing technologies (remote sensing), information-sensing mobile devices, wireless sensor networks, etc. Big Data is relatable. It can be related, linked and integrated to provide highly detailed information. Such detail makes it possible, for instance, for banks to introduce individually tailored services and for health care providers to offer personalized medicine. Big Data is a class of data sets so large that it becomes difficult to process them using standard methods of data processing. The problems with such data include capture or collection, curation, storage, search, sharing, transfer, visualization and analysis. Big Data is difficult to work with using most relational database management systems and desktop statistics and visualization packages; it usually means data sets with sizes beyond the ability of commonly used software tools. When do we say that data is Big Data? Is there a way of quantifying the data? An advantage of studying Big Data is that additional information can be derived from the analysis of a single large set of related data, as compared to separate smaller sets with the same total amount of data, allowing correlations to be found. For instance, analysis of a large data set on the marketing of a product will lead to information on the business trend for that product. Big Data can make important contributions to international development. Analysis of Big Data leads to cost-effective ways to improve decision making in important areas such as health care, economic productivity, crime and security, natural disasters and resource management. Large data sets are encountered in meteorology, genomics, and biological and environmental research. They are also present in other areas such as internet search, finance and business informatics. Data sets are big as they are gathered using sensor technologies. There are also examples of Big Data in areas which we can call Big Science and in Science for research. These include the “Large Hadron Collider experiment”, which involves about 150 million sensors delivering data 40 million times per second. There are nearly 600 million collisions per second. After filtering and not recording 99.999% of them, there are 100 collisions of interest per second. The Large Hadron Collider experiment generates more than a petabyte (1000 trillion bytes) of data per year. The astronomical data collected by the Sloan Digital Sky Survey (SDSS) is another example of Big Data. Decoding the human genome, which earlier took ten years to process, can now be done in a week; this too is an example of Big Data. The human genome database is yet another example: a single human genome contains more than 3 billion base pairs, and the 1000 Genomes project has 200 terabytes (200 trillion bytes) of data. Human brain data is a further example. A single human brain scan consists of data on more than 200,000 voxel locations




which could be measured repeatedly at 300 time points. For Government, Big Data is present in climate simulation and analysis and in national security areas. For private-sector companies such as Flipkart and Amazon, Big Data comes from millions of backend operations every day, involving queries from customer transactions, from vendors, etc. Big Data sizes are a constantly moving target. Big Data involves increasing volume (amount of data), velocity (speed of data in and out) and variety (range of data types and sources). Big Data are high-volume, high-velocity and/or high-variety information assets, and they require new forms of processing to enable enhanced decision making, insight discovery and process optimization. During the last fifteen years, several companies abroad have adopted a data-driven approach to conduct more targeted services, reduce risks and improve performance. They are implementing specialized data analytics to collect, store, manage and analyze large data sets. For example, available financial data sources include stock prices, currency and derivative trades, transaction records, high-frequency trades, unstructured news and texts, and consumer confidence and business sentiment from social media and the internet, among others. Analyzing these massive data sets helps measure firms’ risks as well as systemic risks. Analysis of such data requires people who are familiar with sophisticated statistical techniques in areas such as portfolio management, stock regulation, proprietary trading, financial consulting and risk management. Big Data come in various types and sizes. Massive amounts of data are hidden in social networks such as Google, Facebook, LinkedIn, YouTube and Twitter. These data reveal numerous individual characteristics and have been exploited. Government or official statistics is Big Data. There are new types of data now: these data are not numbers, but come in the form of a curve (function), image, shape or network. The data might be “functional data”, such as a time series with measurements of blood oxygenation taken at a particular point and at different moments in time. Here the observed function is a sample from an infinite-dimensional space, since it involves knowing the oxygenation at infinitely many instants. The data from e-commerce is of the functional type, for instance the results of the auctioning of a commodity/item during a day by an auctioning company. Another type of data comprises correlated random functions. For instance, the observed data at

time t might be the region of the brain that is active at time t. Brain and neuroimaging data are typical examples of another type of functional data. These data is acquired to map the neuron activity of the human brain to find out how the human brain works. The next-generation functional data is not only a Big Data but complex. Examples include the following: (1) Aramiki,E; Maskawa, S. and Morita, M. (2011) used the data from Twitter to predict influenza epidemic; (2) Bollen, J., Mao, H. and Zeng, X. (2011) used the data from Twitter to predict stock market trends. Social media and internet contains massive amounts of information on the consumer preferences leading to information on the economic indicators, business cycles and political attitudes of the society. Analyzing large amount of economic and financial data is a difficult issue. One important tool for such analysis is the usual vector auto-regressive model involving generally at most ten variables and the number of parameters grows quadratically with the size of the model. Now a days econometricians need to analyze multivariate time series with more than hundreds of variables. Incorporating all these variables lead to over-fitting and bad prediction. One solution is to incorporate sparsity assumption. Another example, where a large number of variables might be present, is in portfolio optimization and risk management. Here the problem is estimating the covariance and inverse covariance matrices of the returns of the assets in the portfolio. If we have 1000 stocks to be managed, then there will be 500500 covariance parameters to be estimated. Even if we could estimate individual parameters, the total error in estimation can be large (Pourahmadi: Modern methods in Covariance Estimation with High-Dimensional Data (2013), Wiley, New York).
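As a rough illustration of the parameter explosion mentioned above, the number of free parameters in a p × p covariance matrix is p(p+1)/2; a short sketch in plain Python (the function name is mine):

```python
def cov_param_count(p):
    """Free parameters in a symmetric p x p covariance matrix:
    p diagonal variances plus p*(p-1)/2 off-diagonal covariances."""
    return p * (p + 1) // 2

# 1000 stocks -> 500500 parameters, matching the figure in the text
print(cov_param_count(1000))          # 500500

# even a small per-entry estimation error eps accumulates over all entries
eps = 1e-3
print(cov_param_count(1000) * eps)    # total error budget of about 500.5
```

This accumulation of small per-entry errors is exactly why shrinkage and sparsity assumptions become necessary in high dimensions.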

2 SOME ISSUES WITH BIG DATA (cf. Fokoue (2015); Buelens et al. (2014))
(i) Batch versus incremental data production: Big Data is generally delivered in a sequential and incremental manner, which leads to online learning methods. Online algorithms have the important advantage that the data do not have to be stored in memory: all that needs to be stored is the model built so far, which acts as a summary of the data seen up to the given time. If the

April - June 2017 ^ Visleshana ^ Vol. 1 No.3



sample size n is very large, the data cannot fit into the computer's memory, and one can consider building a learning method that receives the data sequentially or incrementally rather than trying to load the complete data set into memory. This can be termed sequentialization. Sequentialization is useful for streaming data and for massive data too large to be loaded into memory all at once.
(ii) Missing values and imputation schemes: With massive data it is quite common to be faced with missing values. One should first check whether they are missing systematically, that is in a pattern, or missing at random, and at what rate they are missing. Three approaches are suggested to take care of this problem: (a) deletion, which consists of deleting all the rows in the data matrix that contain any missing values; (b) central imputation, which consists of filling the missing cells of the data matrix with central tendencies like the mean, mode or median; and (c) model-based imputation methods such as the EM algorithm.
(iii) Inherent lack of structure and importance of preprocessing: Most Big Data is unstructured and needs preprocessing. With inherently unstructured data like text, preprocessing leads to data matrices (whose entries are term frequencies in the case of text) that contain too many zeroes, leading to the sparsity problem. The sparsity problem in turn leads to modeling issues.
(iv) Homogeneity versus heterogeneity: Some massive data sets have a homogeneous input space, that is, all the variables are of the same type. Examples of such data arise in audio, video and image processing. In other types of Big Data the input space consists of variables of different types. Such data arise in business, marketing and social sciences, where the variables can be categorical, ordinal, interval, count and real-valued.
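The sequentialization idea in (i) — keeping only a small summary of the data rather than the data itself — can be sketched with Welford's online algorithm for the running mean and variance (a standard method, not taken from the text):

```python
class RunningStats:
    """Online mean/variance (Welford's algorithm): stores three
    numbers, never the data stream itself."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        # one pass; each point can be discarded after this call
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        # sample variance of everything seen so far
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningStats()
for x in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(x)
print(stats.mean)        # 5.0
print(stats.variance())  # about 4.571
```

The stored "model" here is just (n, mean, m2), exactly the kind of compact summary that makes streaming computation feasible.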
(v) Differences in measurement: It is generally observed that the variables involved are measured on different scales, leading to modeling problems. One way to take care of this problem is to perform transformations that project the variables onto the same scale. This is done either by standardization, which leads all the variables to have mean zero and variance one, or by unitization, which consists of transforming the variables so that the support of each of them is the unit interval [0,1].
(vi) Selection bias and quality: When Big Data are discussed in relation to official statistics, one point of criticism is that Big Data are collected by mechanisms unrelated to probability sampling and are therefore not suitable for the production of official statistics. This is mainly because Big Data sets are not representative of a population of interest: they are selective by nature and therefore yield biased results. When a data set becomes available through some mechanism other than random sampling, there is no guarantee whatsoever that the data are representative unless the coverage is full. When considering the use of Big Data for official statistics, an assessment of selectivity has to be conducted. How does one assess the selectivity of Big Data?
(vii) No clarity about the target population: Another problem with Big Data for official statistics is that many Big Data sources contain records of events not necessarily directly associated with statistical units such as households, persons or enterprises. Big Data is often a by-product of some process not primarily aimed at data collection. Analysis of Big Data is data-driven, not hypothesis-driven. For Big Data, the coverage is large but incomplete and selective, and it may be unclear what the relevant target population is.
(viii) Comparison of data sources: Let us look at a comparison of different data sources for official statistics as compared to Big Data (Ref: Buelens et al. (2014)).
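The two rescaling transformations described in (v) can be written directly from their definitions (plain Python, population standard deviation assumed):

```python
def standardize(xs):
    """Rescale to mean 0 and variance 1 (z-scores)."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    sd = var ** 0.5
    return [(x - mean) / sd for x in xs]

def unitize(xs):
    """Rescale so the support is the unit interval [0, 1]."""
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

data = [10.0, 20.0, 30.0, 40.0]
print(standardize(data))  # symmetric z-scores around 0
print(unitize(data))      # [0.0, 0.333..., 0.666..., 1.0]
```

Either transformation puts heterogeneous variables onto a common scale before modeling; which one is preferable depends on whether outliers should retain their influence.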

For Big Data in official statistics, no approaches have been developed so far to measure the errors or to check the quality. It is clear that bias due to selectivity has a role to play in the accounting of Big Data.
(ix) Use of Big Data in official statistics: (a) Big Data can be the single source of data for the production of some statistic about a population of interest. Assessing the selectivity of the data is important. Correcting for selectivity can sometimes be achieved by choosing a suitable method of model-based inference (Leo Breiman (2001), Statistical Science, 16, 199-231). These methods are aimed at predicting values for missing/unobserved units. The




results will be biased if specific sub-populations are missing from the Big Data set.
(b) A Big Data set can be used as an auxiliary data set in a procedure mainly based on a sample survey. The possible gain of such an application for the sample survey is a likely reduction in sample size and the associated cost. Using small area models, the Big Data can be used as a predictor for a survey-based measurement.
(c) The Big Data mechanism can be used as a data collection strategy for sample surveys.
(d) Big Data may be used, irrespective of selectivity issues, as a preliminary survey. Findings obtained from Big Data can then be checked and investigated further through sample surveys.

3 COMPUTING ISSUES FOR BIG DATA (Fan et al. (2013))
As mentioned earlier, the massive sample size of Big Data is a challenge for traditional computing infrastructure. Big Data is highly dynamic, and it is often not feasible to store it in a centralized database. The fundamental approach to storing and processing such data is "divide and conquer": partition a large problem into more tractable, independent sub-problems; tackle each sub-problem in parallel on a different processing unit; and then combine the results from the individual sub-problems to obtain the final result. "Hadoop" is an example of basic software and programming infrastructure for Big Data processing, "MapReduce" is a programming model for processing large data sets in a parallel fashion, and "Cloud Computing" is suitable for storing and processing Big Data. We do not discuss the storage and computation problems connected with Big Data further in these brief notes.

REFERENCES:
Aramiki, E., Maskawa, S. and Morita, M. (2011) Twitter catches the flu: detecting influenza epidemics using Twitter. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 1568-1576.
Benjamini, Y. and Hochberg, Y. (1995) Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society, Series B, 57, 289-300.
Bollen, J., Mao, H. and Zeng, X. (2011) Twitter mood predicts the stock market. Journal of Computational Science, 2, 1-8.
Buelens, B., Daas, P., Burger, J., Puts, M. and van den Brakel, J. (2014) Selectivity of Big Data. Discussion Paper, Statistics Netherlands.
Fan, J., Han, F. and Liu, H. (2013) Challenges of Big Data analysis. arXiv:1308.1479v1 [stat.ML], 7 Aug 2013.
Fokoue, E. (2015) A taxonomy of Big Data for optimal predictive machine learning and data mining. arXiv:1501.0060v1 [stat.ML], 3 Jan 2015.
Leek, J. (2014) Why big data is in trouble: they forgot about applied statistics. Simply Statistics, May 7, 2014.
Pourahmadi, M. (2013) High-Dimensional Covariance Estimation, Wiley, New York.
Tibshirani, R. (1996) Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 267-288.
Statistics and Science: A Report of the London Workshop on the Future of the Statistical Sciences (2014) "Current trends and future challenges in statistics: Big Data", pp. 20-25.
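The "divide and conquer" pattern behind MapReduce, described in Section 3 above, can be sketched as a toy single-machine word count (real Hadoop jobs distribute the same map and reduce steps across many nodes):

```python
from collections import Counter
from functools import reduce

def map_chunk(lines):
    """Map step: each partition independently counts its own words."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts

def merge(c1, c2):
    """Reduce step: combine two partial results into one."""
    c1.update(c2)
    return c1

text = ["big data big", "data is big", "divide and conquer"]
# partition the problem: here one chunk per line; in a cluster,
# one chunk per processing node
partials = [map_chunk([line]) for line in text]
total = reduce(merge, partials, Counter())
print(total["big"])   # 3
print(total["data"])  # 2
```

Because each partial count depends only on its own chunk, the map steps can run in parallel, and the merge is associative, which is exactly what lets the final combination be done in any order.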

Bhagavatula Lakshmi Surya Prakasa Rao won the Shanti Swarup Bhatnagar Prize for Science and Technology in Mathematical Science in 1982 and the Outstanding Alumni award from Michigan State University. He worked at the Indian Institute of Technology, Kanpur at the beginning of his career and later moved to the Indian Statistical Institute, New Delhi. He was a Distinguished Scientist and Director of the Indian Statistical Institute, Kolkata from 1992 to 1995. He has also held visiting professorships at the University of California, Berkeley, the University of Illinois, the University of Wisconsin, Purdue University, the University of California, Davis and the University of Iowa. He was the Jawaharlal Nehru Chair Professor and the Dr. Homi J. Bhabha Chair Professor at the University of Hyderabad in 2006–08 and 2008–12, respectively. He is currently an Emeritus Professor at the Indian Statistical Institute and the Ramanujan Chair Professor at the CR Rao Advanced Institute of Mathematics, Statistics and Computer Science on the University of Hyderabad campus.



Ethical Expedients of Big Data: Questions that confront us Meghana V. Aruru Abstract - In the current digital era, big data presents unique opportunities while also posing ethical challenges regarding individual privacy and freedom. Ensuing debates over "society" versus "individuals" will provide interesting opportunities to explore and think about issues that might call for new ethical norms as data becomes freely available across the globe.


1 INTRODUCTION
"Technology is neither good nor bad; nor is it neutral. ... [Technology's] interaction with the social ecology is such that technical developments frequently have environmental, social, and human consequences that go far beyond the immediate purposes of the technical devices and practices themselves." - Melvin Kranzberg (1985), historian of technology.
The era of big data is upon us, although what constitutes big data is yet to be precisely understood. While such ambiguity could allow a multiplicity of perspectives to come together and enrich the field in its initial stages, it also leaves room for a certain vagueness that may give rise to matters of serious concern. For instance, today scientists, data analysts, sociologists and scholars of many other disciplines increasingly demand (and feel comfortable demanding) access to large amounts of data resulting from daily social interactions among people, often including data that originate in individuals' personal lives (including, of course, those of public figures). Is it always clear what the ethical implications of such research and analyses are? Ethical norms that were established after the horrors of World War II are rapidly outpaced, while ethicists strive to keep up with the world as it continues to evolve around them.(1) The situation is not unlike that faced by lawmakers and law enforcers, who may be found trying to catch up with the times amid the myriad cyber invasions and security threats that happen across the globe. With an increasing societal obsession to quantify human behavior and interactions, it is not always easy to comprehend the full range of costs and benefits of quantitative analysis of big data to change or affect individual behaviors (say, customer recommendation systems).(2) The excitement of gathering and churning massive amounts of data brings critical issues of quality (of knowledge) over quantity (of data) to the fore. Data generated from un-designed experiments or surveys may not be representative of the underlying population structures, much less of groups and individuals. The ethical dilemma about the overall utility of analyzing such data - whether freely available or obtained for a price - stems from the concern that, in the wrong hands, it could potentially compromise individual privacy and freedom, which are much-cherished ideals of the free societies that have developed post-WWII. The question, therefore, is: for any individual, is a personalized recommendation more important than one's privacy or consent (in the form of freedom)? The ethical risks of big data can be broadly categorized as: (i) risk of breach of privacy and confidentiality, (ii) risk of integrative analysis, and (iii) risk of predictive analytics.


2 RISK OF BREACH OF PRIVACY AND CONFIDENTIALITY
Privacy and freedom constitute individual rights in traditional biomedical ethics, with measures like "informed consent" of patients supposed to prevent unnecessary harm. Even with traditional research, the




question of whether informed consent is carried out well, and whether it hinders or fosters research, remains unanswered.(3–5) Now, in instances where newer data sources, such as social network datasets, are being merged and freely distributed, the question of informed consent must extend to: where exactly does the "individual" end and the "social" begin? Various stakeholders – government, industry, and academia among others – may use big data for specific purposes with the best interests of the population in mind, but possibly at the expense of the individual. For example, educational data about individual students could be used to preferentially admit students who are "predicted" to perform better in the future, as opposed to providing a wholesome environment for all students to fulfill their respective potential, thereby risking the creation of a class divide by design. Even if one assumes that a prediction is correct at the time it is made, that should not be an excuse to prevent any individual from striving to shape his or her own destiny. Such risks are well known in healthcare, where insurance companies may use data about populations to deny access to healthcare to, or even penalize, individuals based on larger group or population characteristics, regardless of whether a particular individual conforms to the group norms or not.(6–8)

3 RISK OF INTEGRATIVE ANALYSIS
Free and transparent discussion of organizational and governmental interests with regard to individual privacy, freedom, reputation and ownership of data is absolutely vital to preserving the integrity of the analytical process. Concerns about data ownership, legacy, sharing and usage require collective thinking, and should lead to conscious efforts in policy-making, because in the digital era data never die. The World Economic Forum has defined personal data as a 'new economic asset' in its work on 'Rethinking Personal Data'.(8) In this context, big data is a real game-changer. Personal data obtained through social media can open up issues of veracity and privacy when it comes to their analysis. Worse still, given the myriad different sources of information – some public and some not quite so, many of which might originally have existed for purposes very different from

the intended analytics – all of these now have the potential to be combined or triangulated to extract unanticipated insights (say, into an individual's current socio-economic or health status). While real-world information about an individual may not be readily available, an approximation based on the individual's interests, social circle, employment and education may well come close. For example, an analyst may be able to determine an individual's age, habits and preferences, financial worth, and more, from one's social media interactions, peer networks and details such as location or marital status. While data volume can help improve the accuracy of automated learning algorithms, concerns regarding the privacy of individual information will remain an open question for a considerable time to come.

4 RISK OF PREDICTIVE ANALYTICS
Big data analytics in healthcare uses predictive algorithms to forecast health events in real time. Improving patients' health, lowering the cost of care, increasing access to services and enabling precision medicine all constitute merits of analytics in healthcare. Patient engagement outside the healthcare system can be harnessed for big data analytics to identify support networks, develop key messaging, and improve clinical trial recruitment and retention, among other things. While for some types of data patient consent is sought, for data that exist in the public domain such consent is often not deemed necessary.(5,8–10) To consider another example of predictive analytics, PREDPOL is a predictive policing program that uses just three pieces of information – past type of crime, place of crime and time of crime – to make its predictions about crime events. PREDPOL tries to help law enforcement gather information and make inferences about a potential event's location and time – one which has not yet taken place – supposedly with the aim of preventing this future crime.(11) Knowledge can serve as a double-edged sword. The promise and lure of big data to reveal patterns and new knowledge from unexamined troves of data make the application of traditional ethics and law enforcement challenging and retrospective. Most organizations and governments understand the dynamic nature of data in response to changing environments.




Yet ethical questions and concerns seem to emerge retrospectively, and those responsible for addressing such issues are generally found "looking over the shoulder" in the context of some violation or disruption, as opposed to being proactive and seeking adequate protections well in advance.(1,12)

5 CONCLUSION
Big data is ultimately about the money and power that result from knowledge and insights. Omniscience and omnipotence are two sides of the same coin. Ethics and best practices have key roles to play in maintaining a balance between the quest for omniscience and the rights of every individual who contributes to, or is affected by, decisions based on such knowledge. Individuals should indeed have maximal ability to manage the flow of their own information across third-party systems. The enforcement of rules for transparency in data acquisition, provenance, retention, usage and sale should be the norm rather than the exception. Open and lively participation in ethical dialogue must continue, to determine ways to live efficiently and safely in an increasingly data-driven world while guarding against ever more innovative threats to personal information.

REFERENCES
[1] Metcalf J. Ethics Codes: History, Context, and Challenges. 2014.
[2] Adali S, Escriva R, Goldberg MK, Hayvanovych M, Magdon-Ismail M, Szymanski BK, et al. Measuring behavioral trust in social networks. In: Intelligence and Security Informatics (ISI), 2010 IEEE International Conference on. IEEE; 2010. p. 150-2.
[3] Grady C. Do IRBs protect human research participants? JAMA. 2010;304(10):1122.
[4] Steinbrook R. Improving protection for research subjects. N Engl J Med. 2002;346(18):1425-30.
[5] Metcalf J, Crawford K. Where are human subjects in big data research? The emerging ethics divide. Big Data Soc. 2016;3(1):2053951716650211.
[6] The Opportunities and Ethics of Big Data.
[7] Hermon R, Williams PAH. Big data in healthcare: What is it used for?
[8] World Economic Forum. Rethinking Personal Data: Strengthening Trust. 2012.
[9] Narayanan A, Huey J, Felten EW. A precautionary approach to big data privacy. 2016. p. 357-85.
[10] Boyd D. Networked Privacy. New York: Personal Democracy Forum; 2011.
[11] PREDPOL.
[12] Big Data Ethics: 8 Key Facts to Ponder. InformationWeek.

Dr. Meghana V. Aruru, PhD, MBA, is Associate Professor at the Indian Institute of Public Health, Hyderabad. Her research interests include Health Economics and Outcomes Research, Health Policy and Ethics, Health Promotion and Communication, and Translational Science. She received her PhD from the University of Illinois at Chicago, USA.



Big Data Careers Krishna Kumar Abstract - An increasing number of organizations are discovering that moving fast on a Big Data journey is not an easy task. On the one hand there is a dearth of talent in the latest tools and technologies; on the other, their existing business experts lack the requisite appreciation of the potential and capabilities of big data and the know-how to tap into it in a systematic manner. This article discusses some of the career profiles in the Big Data domain.

Many economists in the advanced economies of the world believe that "eventually, data will surpass crude oil in importance." In essence, an organization's success in realizing the Big Data dream boils down to its ability to put together a crack team of empowered people with diverse skills who can work together to make meaningful business assumptions, take incremental risks, conduct meaningful experiments, and learn to implement and scale up the successful implementations. The objectives of this core team differ depending on the strategic imperatives the organization is chasing, and the skill sets needed in the team also change significantly from assignment to assignment. Because of this, Big Data roles are continuously evolving. Some of the high-level objectives pursued by Big Data teams are:

• Recognition (image, text, audio, video, gestures, facial expressions)
• Scoring / ranking (e.g., FICO score)
• Segmentation (demographics-based marketing)
• Forecasts (e.g., sales/revenue/expenses)
• Optimization (risk management)
• Prediction (predict a value from some inputs)
• Classification (this bucket or that)
• Recommendation (shopping-cart add-ons)
• Anomaly detection (fraud)
• Pattern detection / grouping (classification without known causes, e.g., buying combos)

To achieve some of these common goals, different roles are gaining prominence in the world of big data. They are listed below.

1 CHIEF DATA OFFICER
This is akin to a CXO position, reporting directly to the board or to the CEO of the organization. The Chief Data Officer is responsible for the overall big data strategy within an organization: ensuring that data is accurate and secure and that customer privacy is governed correctly. He/she is responsible for crafting and managing standard data operating procedures, data accountability policies, data quality standards, and data privacy and ethical policies, and must understand how to combine different data sources (from within and outside the organization) with each other. In short, the Chief Data Officer manages the data of the organization as a strategic asset.

1.1 Key Skills
• Broad appreciation of business goals and how different data sources/types enable them
• Strong leadership and communication skills to interact effectively with the board and senior business leaders
• Experience in leading major information management programs in key business areas
• Expertise in creating and deploying best practices and methodologies
• Demonstrated policy thinking
• Knowledge of developing business cases for technical projects with a lot of uncertainties
• Familiarity with modelling techniques, predictive modelling and big data tool sets

This is a new role that has evolved in the last couple of years, and different organizations are actively evaluating this specialized role and its suitability for their organization. Twenty-plus years of experience is not unusual for this position.

2 BIG DATA SOLUTION ARCHITECT
A skilled architect with cross-industry, cross-functional and cross-domain know-how. He/she sketches the big data solution architecture, and monitors and governs its implementation. He/she puts the discovered data into an organized form so that it can be analysed, and structures the data so that different users can usefully query it in appropriate timeframes. He/she also ensures that data updates happen in a predetermined manner so the data continuously remains useful.

2.1 Key Skills
• Experience in designing conventional solution architectures before coming into the big data solutions space (15+ years of experience is very normal for this position)
• Experience in architecting large data warehouses, with a good understanding of cluster/parallel architecture as well as high-scale distributed RDBMS/NoSQL platforms
• Experience with cloud computing infrastructure like Amazon Web Services, Elastic MapReduce, Azure, etc.
  o Experience in major big data solutions like Hadoop/MapReduce, Hive, HBase, MongoDB, Cassandra
  o Depending on the project, experience in Impala, Mahout, Flume, ZooKeeper/Sqoop is important
• Firm understanding of major programming/scripting languages: Java, R, PHP, Ruby, Python, Linux shell
• ETL tool experience: Informatica, Talend, Pentaho
• Knowledge of data security, data privacy, etc.
• Capabilities:
  o Articulate pros/cons of various options
  o Benchmark systems, analyse system bottlenecks and propose solutions to eliminate them
  o Document complex use cases, solutions and recommendations
  o Work in fast-paced agile environments

3 DATA SCIENTIST
This is probably the most talked-about job profile in the world today, renowned as the "Sexiest Job of the 21st Century", as published by HBR. It is a very demanding role: the person playing it must have a deep appreciation of the business domain, combined with the statistical ability to appreciate the nature and variety of the data and the technical capability to leverage different technologies and tools to deal with data in such a way as to guide the business toward decisions that might solve the business problem at hand. In short, a data scientist is a business analyst, a data modeler, a statistician and a developer all rolled into one. Typically, a data scientist is familiar with the business domain and the datasets accompanying it. He/she creates sophisticated analytical models that help solve a business problem, e.g., pricing optimization across channels or predicting customer behavior.

3.1 Key Skills
• Business domain
  o Marketing, consumer behavior, supply chain, finance, healthcare, etc.
• Statistics / probability
  o R
  o Correlation, Bayesian clustering
  o Predictive analysis
• Computer science / software programming
  o Languages: Java, Python
• Written / verbal communication skills
• Technical proficiency
  o Database systems such as MySQL, Hive, etc.
  o Data mining, machine learning (Mahout)
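As a minimal instance of the statistics toolkit listed above, the Pearson correlation between two variables can be computed in a few lines (plain Python; the data here are hypothetical, and in practice a data scientist would reach for R or pandas):

```python
def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# hypothetical campaign data: ad spend (lakhs) vs. sales (lakhs)
ad_spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [2.1, 3.9, 6.2, 8.0, 9.8]
print(round(pearson(ad_spend, sales), 3))  # close to 1: strongly correlated
```

Correlation is only the first step; a data scientist would follow up with modeling to separate genuine drivers from coincidental co-movement.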

4 BIG DATA ENGINEER
Sources of data are ever expanding: different types of data (files, text messages, images, audio, video, gestures) from different kinds of sources such as application data, reports, social sites, sensors, etc. Based on the solution provided by the Big Data Solution Architect, a data engineer determines how to tap into the various kinds of useful data from different sources; how to bring the data into the organization (building data pipes); and how to store, retrieve, combine and serve it for the use of different stakeholders such as data scientists, machine learning scientists and analysts. He/she also determines how to archive and eventually retire the data.
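A data engineer's extract–transform–load pipe can be sketched as chained generators, so records stream through without ever being held in memory all at once (a toy example; the field names and cleansing rule are invented for illustration):

```python
def extract(raw_lines):
    """Extract: lazily parse raw comma-separated records."""
    for line in raw_lines:
        user, amount = line.split(",")
        yield {"user": user.strip(), "amount": float(amount)}

def transform(records):
    """Transform/cleanse: drop bad rows, normalise fields."""
    for r in records:
        if r["amount"] >= 0:           # hygiene rule: reject negative amounts
            r["user"] = r["user"].lower()
            yield r

def load(records):
    """Load: here simply collect; in practice, write to a data store."""
    return list(records)

raw = ["Alice, 10.5", "BOB, -3.0", "Carol, 7.25"]
clean = load(transform(extract(raw)))
print(clean)  # the negative record is filtered out
```

The same staged structure scales conceptually to real pipelines, where each stage becomes a distributed job rather than a generator.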




Big data engineers build large-scale data processing systems and algorithms. They typically analyse each source of data and determine the kind of pipe they need to set up for that specific data (depending on the complexity of the different Vs: volume, value, veracity, volatility, velocity, variety, etc.), cleanse it, get it ready for processing and serve it to the different stakeholders. The data engineer also develops strategies for staging and archiving data. In a way, the data engineer is also a "data hygienist", who ensures that the data coming into the system is clean and accurate and stays that way throughout the data life cycle.

4.1 Key Skills
• SQL and relational databases
• NoSQL databases like Redis and MongoDB
• Apache Hadoop and its ecosystem: MapReduce, Hive
• Apache Spark and its ecosystem
• Languages such as Java, C++, Ruby, Python and R

5.1 • • • • •

Key Skill areas Statistics Intermediate level Algebra/ calculus Programming skills – C++, Python Learning theory (intermediate level) Understanding of the inner workings of the arsenal of machine learning algorithms

6 BIG DATA ANALYSTS A big data Analyst primarily works with data in a given system and performs analysis of the given data set. He helps the data scientist in performing the necessary jobs. Many times Analysts graduate to don the role of data scientists after they gain valuable experience in the analyst role.

6.1 • • • • •


Key Skills Business acumen Should enjoy – discovering , solving problems Data Mining (Data auditing, aggregation, validation, reconciliation) Advanced data modelling Testing o A/B testing on different hypotheses – to directly/indirectly impact Key Performance indicators (KPIs) Creating clear/concise reports to explain results Technical Skills o SQL databases o BI platforms – tableau o Basic knowledge of Hadoop/ MapReduce o Statistical packages – R, Matlab, SPSS o Programming Languages

Machine learning scientists are those involved in crafting and using the predictive and correlative tools used to leverage data.

• They work in the R&D of algorithms that are used in adaptive systems.
• They build methods for predicting product suggestions and demand forecasting, and explore Big Data to automatically extract patterns.

In many situations, the machine learning engineer's final "output" is working software, and the audience for this output consists of other software components that run automatically with minimal human supervision. The decisions are made by machines, and they affect how a product or service behaves.

Machine learning scientists create algorithms that allow statistical analysis to be applied at high speed. They design the interrogation of data with enough statistical understanding to know when the results are not to be trusted. Statistics and programming are the two biggest assets of the machine learning practitioner.
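As a toy illustration of the demand-forecasting work mentioned above (a classic building block, not any specific production method), simple exponential smoothing produces a one-step-ahead forecast from a sales series:

```python
# Simple exponential smoothing: the forecast is a running level that blends
# each new observation with the previous level. Sales figures are invented.

def exp_smooth_forecast(series, alpha=0.5):
    """One-step-ahead forecast: level_t = alpha*y_t + (1-alpha)*level_{t-1}."""
    level = series[0]
    for y in series[1:]:
        level = alpha * y + (1 - alpha) * level
    return level

weekly_sales = [100, 104, 98, 110, 120]
print(exp_smooth_forecast(weekly_sales))
```

Real demand-forecasting systems layer trend, seasonality, and covariates on top of this idea, but the smoothing level is the core of many of them.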

They are the business folks who take the output of all the big data analytics and pilot it in the real world. They take the models and translate them into specific campaigns with the target audience to drive business results. During the piloting phase, they gain first-hand knowledge and document their learnings. If the outcomes are beneficial and valuable, they fine-tune the approach and then scale it up for an organization-wide implementation. In short, these folks are often experts in the functional area and in the behaviour of the target segment. They therefore define business steps that target items based on the model, and ensure that there are follow-through action items so that the implementation progresses in a campaign-like manner.

7.1 Key Skills
• Strong domain skills in the relevant area – to operationalize the experimentation
• Determining measures of success/failure
• Understanding of data models
• Strong communication skills to train final users, articulate the correct way to leverage the findings, and help them apply the right method to their specific data sets

8 BIG DATA VISUALISER
In Big Data, one of the key abilities is to visualize the data so that senior management in the organization can appreciate it and play with it to find new patterns and insights. Instead of the usual graphs and pie charts, the data visualiser can tell a compelling story through a mixture of interactive visuals that deliver the insights. A data visualiser has the necessary skills to turn abstract information from data analytics into appealing and understandable visualisations that clearly explain the results of the analyses.

8.1 Key Skills
• Creative thinker who understands UI/UX and has visualizing skills such as typography, interface design and visual art design
• Programming skills to build visualisations
• Good background in source control, testing frameworks and agile development practice
• Use of metadata, metrics, colour, size and position to highlight
• Technical skills
  o JavaScript, HTML, CSS, R
  o Modern visualisation frameworks such as Gephi, Processing, d3.js
  o Web libraries such as jQuery, LESS, functional JavaScript
  o Photoshop, Illustrator, InDesign
• Excellent written and verbal communication

9 BIG DATA PROGRAMMER
The canvas for learning programming extends across the various stages through which big data is sourced, processed and used. Programmer roles are available in all stages of the Big Data journey, from the source, to the models, to the visualization of results, to rolling out the solution on an organization-wide scale. Big Data programming involves the various tools and languages used by the various role players – the Big Data architect, the data engineer, ETL specialists, modellers, analysts, visualisers, etc.

9.1 Key Skills
• Programming languages like C++, Python, R
• Reporting tools – JasperReports, Kibana, Tableau, SAS, etc.
• Big Data tools – HBase, MapReduce, Python, Hive, Spark, etc.
• Frameworks such as Elasticsearch
• Ability to explore, self-learn, share knowledge and collaborate effectively in teams
• Ability to work with large data sets and understand the nuances of different types of data and sources

These roles are the ones usually available to fresh graduates or first-time entrants into the big data space. To gain proficiency in any of these tools, there are various online courses available that can kick-start the journey. Depending on their area of interest and acumen, newcomers can seek to become statisticians, data analysts, data engineers, etc. by learning other skills on the job. They can also choose to specialize in different kinds of data and carve out a specialist career for themselves.

Krishna Kumar Thiagarajan received his B.Tech from NIT Surat in 1991, an MBA from SP Jain, Mumbai in 1996, a CFA from ICFAI, and a Business Leadership diploma from U21 Global. He is a frequent speaker on contemporary IT topics such as Big Data, IoT, and Smart Cities with Hyderabad University and Digital India. He has varied interests: he designs dashboards and is a co-producer of a Marathi film, "Satrangi Re". He is a certified Six Sigma Black Belt holder. Krishna Kumar received an International HR Leader award from the USA and has certificates in Big Data from the University of California and in Executive Data Science from Johns Hopkins University, USA.




Notes from IEEE BigDataService 2017
Vishnu S. Pendyala
Abstract – The IEEE BigDataService 2017 international conference was held April 7–9 in San Francisco, USA. The conference attracted researchers from several countries and premier research institutes. The editor took quick notes while attending the conference so that the readership can get a quick overview of the topics covered; the notes from the first day of the conference are provided below. Full papers can be downloaded from IEEE Xplore soon. Details about the conference itself are at

——————————————

1 OPENING COMMENTS
▪ Three decades ago, non-conventional methods like Neural Networks and Fuzzy Logic were popular with control systems, but no one had the vision to realize the potential of these technologies.
▪ The future of Big Data is in applications and services; that is what this conference focuses on.
▪ The average acceptance rate seems to be around 20%.

2 KEYNOTE #1 Prof. Ling Liu, Georgia Tech, Distributed Data Intensive Systems Lab

IoT and Services Computing: A Marriage Made in Big Data
• IoT is a killer app for Big Data.
• Services provide new ways of packaging software – stackable modules.
• Cloud is making everything a service.
• Analytics make the "things" in IoT smart, to make them more responsible.
• Humans cannot right away make out trends in Big Data, but smart devices can.
• Algorithm as a service.
• Tiny computers in everything – including things like the freezer in the refrigerator, to avoid heat shocks to ice creams.
• First challenge for Big Data: discrete optimization problems. Example: where to insert sensors to detect water contamination.
• Most of these problems are NP-hard, so greedy algorithms are used as a first attempt at optimization.
• If the utility function is monotone and submodular, the greedy solution is theoretically guaranteed at least 63% (1 − 1/e) of the optimal value.
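The greedy approach in the sensor-placement example can be sketched as follows. The locations and the contamination events each one detects are invented for illustration; for monotone submodular coverage like this, greedy carries the (1 − 1/e) guarantee mentioned above.

```python
# Greedy maximum coverage: pick k sensor locations, each covering a set of
# contamination events, to maximize the total number of events detected.

def greedy_max_coverage(coverage, k):
    """coverage: dict mapping location -> set of events it detects."""
    chosen, covered = [], set()
    for _ in range(k):
        # pick the location with the largest marginal gain
        best = max(coverage, key=lambda loc: len(coverage[loc] - covered))
        if not coverage[best] - covered:
            break  # no location adds anything new
        chosen.append(best)
        covered |= coverage[best]
    return chosen, covered

coverage = {
    "A": {1, 2, 3},
    "B": {3, 4},
    "C": {4, 5, 6, 7},
    "D": {1, 7},
}
chosen, covered = greedy_max_coverage(coverage, k=2)
print(chosen, sorted(covered))
```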

• Second challenge: Deep Learning as a service.
• A Convolutional Neural Network is a simplified neural network.
• Every layer learns something and passes what is learnt to the next layer.
• Backward learning is used to correct errors.
• Apple's Siri, Google's AlphaGo, self-driving and face recognition technologies all use the same principles of deep learning.
• In Convolutional Neural Networks, we put everything in a grid, like in map-reduce we create a number of tasks.
• Third and final challenge: Big Graph Processing – very attractive big data processing algorithms.
• Graph queries and iterative algorithms are two different beasts.
• RDF is a good representation of natural language.
• Graph processing is quite demanding on memory resources, so it may require new ways of memory allocation, requiring us to rewrite operating systems.
• The other way is algorithm-based optimization.
• Hama is a popular Apache framework for graph algorithms.
• Smart things make IoT an Internet of Services (Services Web).
• IoT and Big Data are soon becoming essential utilities like water and electricity, which will be delivered by a services network.

Q&A
• Mobile devices come with a number of sensors, but the software using the data from them is still primitive. It is an example of a business model that needs improvement – a lot of opportunity here.
• Dropbox and Google Drive make their introductory offerings free – that business model is quite successful.
• Today, 15% accuracy is significant in CNNs because humans cannot achieve even 5% accuracy in some cases. That is where RDF and the semantic web come into the picture. RDF provides an excellent and accurate representation of natural language. The bottleneck in the case of RDF is processing. One day, when graph processing overcomes its bottleneck, the semantic web will be popular.

PAPER 1, 10:30AM
Enhanced Over-Sampling Techniques for Handling Imbalanced Big Data Set Classification
▪ Machine learning does not work well with imbalanced data sets – where one class dominates the other.
▪ How do we make machine learning work on imbalanced data sets?
▪ The solution presented is an improvement over SMOTE.
▪ SMOTE: Synthetic Minority Over-sampling Technique, available at:
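The paper's enhancement is not reproduced here, but the vanilla SMOTE idea it builds on can be sketched: synthesize a minority-class point by interpolating between a minority sample and one of its nearest minority neighbours. The two-feature data below is invented.

```python
import random

# Vanilla SMOTE, one synthetic point at a time: interpolate between a
# minority sample and a randomly chosen one of its k nearest minority
# neighbours. Features are plain numeric lists.

def smote_point(minority, i, k=2, rng=random):
    """Create one synthetic sample near minority[i]."""
    base = minority[i]
    # k nearest minority neighbours by squared Euclidean distance
    neighbours = sorted(
        (p for j, p in enumerate(minority) if j != i),
        key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
    )[:k]
    nb = rng.choice(neighbours)
    gap = rng.random()  # interpolation factor in [0, 1)
    return [a + gap * (b - a) for a, b in zip(base, nb)]

rng = random.Random(0)
minority = [[1.0, 1.0], [1.2, 0.9], [5.0, 5.0]]
print(smote_point(minority, 0, k=1, rng=rng))
```

Each synthetic point lies on the segment between two real minority samples, which is what lets the classifier see a denser minority region without duplicating data.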

PAPER 2, 11AM
Improved Sentiment Classification by Multimodal Fusion
• Sentiment Analysis (SA) is an aspect of data mining.
• Machine learning techniques are too specific to the problem and are not general enough for the purpose of SA.
• Naive Bayes, SVM and EM represent three different classes of algorithms.
• Data fusion techniques: majority voting, Borda count (rank based), Ordered Weighted Averaging (OWA), greatest max/min/product, Maximum Inverse Rank (MIR).
• Validation technique: k-fold cross-validation.
• Run the ML algorithms to build binary classifiers and combine the results from the various ML algorithms using the data fusion techniques.
• Uses the lexicon provided by NLTK.
• Uses a 10,000-tweet data set, also provided by NLTK.
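The simplest of the fusion rules listed, majority voting, can be sketched directly. The three prediction lists below stand in for binary sentiment labels (1 = positive, 0 = negative) from classifiers like Naive Bayes, SVM and EM; the values are invented.

```python
# Majority-voting fusion of binary classifier outputs: each tweet gets the
# label that most of the models agree on.

def majority_vote(predictions_per_model):
    """predictions_per_model: list of equal-length lists of 0/1 labels."""
    n_models = len(predictions_per_model)
    fused = []
    for votes in zip(*predictions_per_model):
        fused.append(1 if sum(votes) * 2 > n_models else 0)
    return fused

nb  = [1, 0, 1, 1]
svm = [1, 1, 0, 1]
em  = [0, 0, 1, 1]
print(majority_vote([nb, svm, em]))
```

Rank-based rules like Borda count generalize this by weighting each model's confidence instead of counting one vote per model.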

PAPER 3, 11:30AM
Towards Automatic Linkage of Knowledge Worker's Claims with Associated Evidence from Screenshots
• Use OS X tools like OSXInstrumener to collect data and make the associations.
• Other tools: OpenCV, Google Tesseract, difflib, BLEU, Jaccard, WordNet.
• Collaborative interaction corpus:

KEYNOTE #2 Speaker: Professor Bin Yu, UC Berkeley

Title: Mobile Cloud and Data, One Telekom Perspective
▪ Prediction vs. interpretation: prediction must be interpretable for human retention.
▪ Lasso is essentially L1-constrained least squares.
▪ Deep Convolutional Neural Networks: does deep learning resemble brain function?
▪ The human brain will always lead the way – humans can do much more than CNNs.
▪ Deep Dream patterns show consistency between Lasso and Ridge.
▪ Superheat plots for visualization of stable Deep Dream images; R has a package to do this.
▪ Interpretable models are possible through Predictability + Stability + Computability (PSC).
▪ UC Berkeley is coming up with a new Data Science major.
▪ More info: GR1NTZpTzBQRGM/edit and NTR5MVJWQjhoc2s/view
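A brief aside on the Lasso remark: in its penalized form, min ||y − Xw||² + λ|w|₁, each coordinate update (for standardized features) reduces to the soft-thresholding operator, which is what produces the exact zeros that make Lasso interpretable. A minimal sketch:

```python
# Soft-thresholding: the core operator inside Lasso coordinate descent.
# It shrinks a coefficient toward zero by lam and clips it to exactly 0
# when it is small -- the mechanism behind Lasso's sparsity.

def soft_threshold(rho, lam):
    if rho > lam:
        return rho - lam
    if rho < -lam:
        return rho + lam
    return 0.0

print(soft_threshold(3.0, 1.0), soft_threshold(-0.5, 1.0))
```

Ridge regression, by contrast, shrinks every coefficient proportionally and never sets one exactly to zero.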

PAPER 4, 2:30PM
CaPaR: A Career Path Recommendation Framework
▪ Mines resume data and job descriptions, and recommends jobs/skills using item-based collaborative filtering.
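This is not the paper's implementation, but the core of item-based collaborative filtering can be sketched in a few lines: score an unseen item for a user by its cosine similarity to items the user already has (here, skills on a resume; the data is invented).

```python
import math

# Item-based collaborative filtering: recommend a skill to a user based on
# how similar it is to the skills the user already lists.

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# rows = users, columns = skills (1 = skill present on resume)
users = [
    [1, 1, 0],   # user 0 has skills A, B
    [1, 1, 1],   # user 1 has A, B, C
    [0, 1, 1],   # user 2 has B, C
]

def item_score(users, user_row, target_item):
    """Similarity-weighted vote for recommending target_item to one user."""
    cols = list(zip(*users))  # item vectors across users
    target = cols[target_item]
    score = 0.0
    for item, owned in enumerate(users[user_row]):
        if owned and item != target_item:
            score += cosine(cols[item], target)
    return score

print(round(item_score(users, 0, 2), 3))  # how strongly to suggest skill C to user 0
```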




PAPER 5, 2:55PM
IRIS: A Goal-Oriented Big Data Analytics Framework Using Spark for Aligning with Business
▪ Used machine learning techniques running on Spark for Business Process Reengineering (BPR).

PAPER 6, 3:30PM
When Rule Engine Meets Big Data: Design and Implementation of a Distributed Rule Engine Using
▪ A distributed rule engine using (a) map-reduce or (b) message passing for rule matching.
▪ Rule matching via SparkRE SQL after representing rules as relational queries.
▪ Also used Drools and compared the results.
▪ Data sets from LUBM/OpenRuleBench.

PAPER 7, 4PM
Balanced Parallel Frequent Pattern Mining Over Massive Data Stream
▪ Three features of a data stream: continuity, unboundedness, and expiration.

PAPER 8, 4:20PM
Small Boxes Big Data: A Deep Learning Approach to Optimize Variable Sized Bin Packing
▪ Are we making good use of the space in the boxes used for packing?
▪ There are many variations of this problem, and all of them are NP-hard.
▪ Even using just the volume (one dimension) to optimize the space is NP-hard.
▪ 3-D optimization time grows exponentially, so only volume (1-D) is attempted.
▪ Heuristics are needed to bring down the complexity of the algorithm.
▪ Eight heuristics were used for this solution – none is the best for all situations.
▪ The heuristics are customized for each individual instance.
▪ Deep learning was used to train the model – less feature engineering and automatic feature selection.
▪ A heuristic indicator vector shows how the heuristics performed.

PAPER 9, 4:50PM
Scaling Collaborative Filtering to Large-Scale Bipartite Rating Graphs Using LensKit and Spark
▪ Graphs are getting larger and processing is not able to scale.
▪ The solution is to partition the graphs.
▪ What is the best graph partitioning scheme?
▪ Train a supervised model to predict the quality of CF.
▪ Try several partitioning schemes using structural features of graphs.
▪ This is the first attempt at using graph partitioning with CF.

PAPER 10, 5:10PM
Data Allocation of Large-Scale Key-Value Store System Using Kinetic Drives
• Key-value examples: userID (key) → userProfile (value); movieName (key) → movie (value).
• Kinetic Drive: the world's first Ethernet-connected hyper-scale storage; it has an IP address (instead of a SCSI bus address).
• Supports key-value pairs using LevelDB and can run key-value operations by itself – easy to scale, plug-and-play.
• Clients use Kinetic APIs to work with the drives.
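Paper 8's variable-sized bin-packing task can be illustrated with the classic first-fit decreasing baseline – not one of the paper's eight heuristics, just a common 1-D (volume-only) starting point:

```python
# First-fit decreasing: sort items by volume, then place each into the first
# open bin with room, opening a new bin only when none fits. Item volumes
# and the bin capacity below are made up.

def first_fit_decreasing(volumes, capacity):
    bins = []  # each bin is a list of item volumes
    for v in sorted(volumes, reverse=True):
        for b in bins:
            if sum(b) + v <= capacity:
                b.append(v)
                break
        else:
            bins.append([v])  # no existing bin fits: open a new one
    return bins

items = [4, 8, 1, 4, 2, 1]
print(first_fit_decreasing(items, capacity=10))
```

No single heuristic like this wins on every instance, which is exactly why the paper trains a deep model to choose among eight of them per instance.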



Call for Contributions

Submissions, including technical papers, in-depth analyses, and research articles, are invited for publication in "Visleshana", the newsletter of SIG-BDA, CSI, on topics that include but are not limited to the following:
• Big Data Architectures and Models
• The 'V's of Big Data: Volume, Velocity, Variety, Veracity, Visualization
• Cloud Computing for Big Data
• Big Data Persistence, Preservation, Storage, Retrieval, Metadata Management
• Natural Language Processing Techniques for Big Data
• Algorithms and Programming Models for Big Data Processing
• Big Data Analytics, Mining and Metrics
• Machine Learning Techniques for Big Data
• Information Retrieval and Search Techniques for Big Data
• Big Data Applications and their Benchmarking, Performance Evaluation
• Big Data Service Reliability, Resilience, Robustness and High Availability
• Real-Time Big Data
• Big Data Quality, Security, Privacy, Integrity, Threat and Fraud Detection
• Visualization Analytics for Big Data
• Big Data for Enterprise, Vertical Industries, Society, and Smart Cities
• Big Data for e-Governance
• Innovations in Social Media and Recommendation Systems
• Experiences with Big Data Project Deployments, Best Practices
• Big Data Value Creation: Case Studies
• Big Data for Scientific and Engineering Research
• Supporting Technologies for Big Data Research
• Detailed Surveys of Current Literature on Big Data
We are also open to:
• News, Industry Updates, Job Opportunities
• Briefs on Big Data events of national and global importance
• Code snippets and practice-related tips, techniques, and tools
• Letters and e-mails on relevant topics, and feedback
• People matters: Executive Promotions and Career Moves
All submissions must be original, not previously published or under consideration for publication elsewhere. The Editorial Committee will review submissions for acceptance and reserves the right to edit the content. Please send the submissions to the editor, Vishnu S. Pendyala, at

April - June 2017 ^ Visleshana ^ Vol. 1 No.3


Visleshana 1.3 April - June 2017  

The Flagship Publication of the Computer Society of India, Special Interest Group on Big Data Analytics (CSI SIGBDA). Vol.1 No.3 Apr - Jun 2...