
COMPUTER SOCIETY OF INDIA

Newsletter of the Special Interest Group on Big Data Analytics

Volume 1, Issue 2

January – March 2017

Chief Editor and Publisher Chandra Sekhar Dasaka

Editor Vishnu S. Pendyala

Editorial Committee B.L.S. Prakasa Rao S.B. Rao Krishna Kumar Shankar Khambhampati and Saumyadipta Pyne

Website: http://csi-sig-bda.org Please note: Visleshana is published by the Computer Society of India, a non-profit organization. Views and opinions expressed in Visleshana are those of individual authors, contributors and advertisers, and they may differ from policies and official statements of CSI. These should not be construed as legal or professional advice. The CSI, the publisher, the editors and the contributors are not responsible for any decisions taken by readers on the basis of these views and opinions. Although every care is being taken to ensure the genuineness of the writings in this publication, Visleshana does not attest to the originality of the respective authors’ content. © 2016 CSI. All rights reserved. Instructors are permitted to photocopy isolated articles for non-commercial classroom use without fee. For any other copying, reprint or republication, permission must be obtained in writing from the Society. Copying for other than personal use or internal reference, or of articles or columns not owned by the Society, without explicit permission of the Society or the copyright owner is strictly prohibited.

Image credit: wordclouds.com


From the Editor’s Desk
Time flies and here we are again with the next edition of Visleshana. I used my time off from work during Christmas to visit India (and a couple of other countries too). During this period, I was fortunate that Prof. B. Yegnanarayana kindly agreed to share his views with Visleshana. As you may know, Prof. Yegnanarayana has taught, mentored, and guided some of the big names in the Computer Science world, including the famed MIT professor, Dr. Anant Agarwal. This issue includes snippets from Prof. Yegnanarayana’s insightful answers to my questions during my meeting with him. From the Nobel Prize winner Sir C.V. Raman then to Prof. Agarwal of MIT now, he has interacted with them all. I’m sure you will cherish reading his bold views captured inside this issue. Fitting distributions to Big Data is an important step in prediction and analytics. I’m happy to present, inside, an interesting article by accomplished academicians on the topic. The data set chosen for the solution is from the actuaries – real data from the insurance industry. Giving us an industry perspective is Suresh Yaram from Computer Sciences Corporation. In his article inside, he details when and why to adopt the Cloud for Big Data. Also included in this issue is an article on what data from the CRM system is important to mine to understand your customers.

During my India trip, I visited IIIT, Bhubaneswar, GSSS Institute of Engineering and Technology for Women, Mysuru, CVR College of Engineering, and St. Martin’s Engineering College in my hometown, Hyderabad. I’m quite impressed by the academic thrust on Big Data Analytics in India. St. Martin’s Engineering College held a Faculty Development Program on Big Data Analytics using R, which I had the privilege of inaugurating during my visit. You can find the details of this and other events inside. I delivered keynotes related to Big Data Analytics at two IEEE-sponsored international conferences, ICEECCOT-2016 and the 15th ICIT, and am happy to note the response to the talks. The talks that I gave at CVR College of Engineering, Hyderabad, and to the members of the Computer Society of India, Hyderabad Chapter, also on topics related to Big Data Analytics, were likewise well attended and well received. I’m glad to be part of the Big Data movement and I’m sure you will be too. Please make your mark by authoring papers, and make a new year resolution to contribute at least one article to Visleshana this year. Happy Reading! With Every Best Wish,

Vishnu S. Pendyala January 14, 2017



Fitting Distributions to Big Data: Example of Large Claim Insurance Data V V HaraGopal

K.Navatha

Department of Statistics Osmania University Hyderabad, India Email: haragopal_vajjha@yahoo.com

Department of Statistics Siva Sivani Institute of Management Hyderabad, India Email: navathastats@gmail.com

Abstract
This paper presents a methodology for dealing with the problem of predicting claims in an environment of uncertainty. The work starts by introducing how actuarial modeling comes into practice for insurance claims data. The variable modeled is claim amounts from the Insurance Regulatory and Development Authority, Bombay, for the year 2010-11. The modeling process ascertains a statistical distribution that can capably model the claim amounts; the goodness of fit is then assessed both mathematically and graphically using Probability-Probability plots (P-P plots). Finally, this study gives a summary, conclusions and recommendations that can be used by insurance companies to improve their results concerning future claim inferences.

Keywords: Threshold, P-P Plot, Probability Distribution, Total Claim Paid, Goodness of Fit.

Introduction
Uncertainty refers to randomness, which is different from lack of predictability or market inefficiency. An emergent research view holds that financial markets are both uncertain and predictable. Also, markets can be efficient but also uncertain. Insurance companies typically face two major problems when they want to forecast future claim severity using the past or present behavior of claims paid. For this, one has to find an appropriate statistical probability distribution for the large claims paid, and then test how well this statistical distribution fits the claims data.

We collected secondary data from the Insurance Regulatory and Development Authority (IRDA): health data regarding their policies (June 2010 to 2011). We made certain assumptions on the data before use: (i) all claims paid are independent and identically distributed (i.i.d.); (ii) all future claims paid are predicted from the same distribution.

Before fitting any statistical distribution to the claim severity, the following steps were followed in the actuarial modeling process:
1. Selecting different families of distributions.
2. Estimating parameters for the different distributions.
3. Verifying the model fit.



Most of the data in insurance is positively skewed (skewed to the right). For this data, we fitted different statistical distributions, such as the Gamma, Pareto, Generalized Pareto, Lognormal, Weibull, etc., with the help of SPSS. A descriptive analysis of the data was carried out, covering measures such as the mean, median, mode, skewness, kurtosis, standard deviation and variance, and a histogram was plotted for the total claims paid, as shown in the graphical representation of the data. We then tested whether or not the data fit the assumed distributions well.

Big Data Analysis
In this section we evaluate and fit distributions to the data, taking total claims paid as the variable of interest. The data consist of 48,000 observations and 48 variables such as age, gender, premium, claims paid, etc.; we are interested in fitting a distribution for the variable total claims paid. Distributions can provide a good overall fit yet be a bad fit in the tails, and the main interest in this situation is in the tails of the data (Denuit et al., 2007). The insurer is interested not only in the maxima of the observations but also in the behavior of the large observations which exceed a threshold. Hence, for the claims data we considered claim amounts greater than 100,000. We started with summary statistics of the total claims paid, shown in Table 1 below. These are the basic measures of the data; the data show considerable variation.
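The analysis in the paper was performed in SPSS. Purely as an illustration of the preprocessing step described above, a minimal Python sketch might look as follows; the file name ("irda_health_claims_2010_11.csv") and the column name ("total_claims_paid") are assumptions, not part of the original study.

```python
import pandas as pd

# Hypothetical file and column names; the paper's data came from IRDA via SPSS.
claims = pd.read_csv("irda_health_claims_2010_11.csv")

# Keep only the large claims, i.e. total claims paid above the 100,000 threshold.
large = claims.loc[claims["total_claims_paid"] > 100_000, "total_claims_paid"]

# Descriptive statistics of the kind reported in Table 1.
summary = {
    "Mean": large.mean(),
    "Standard Error": large.sem(),
    "Median": large.median(),
    "Mode": large.mode().iloc[0],
    "Standard Deviation": large.std(),
    "Sample Variance": large.var(),
    "Kurtosis": large.kurtosis(),   # excess kurtosis, as in SPSS/Excel output
    "Skewness": large.skew(),
    "Range": large.max() - large.min(),
    "Minimum": large.min(),
    "Maximum": large.max(),
    "Sum": large.sum(),
    "Count": large.count(),
}
for name, value in summary.items():
    print(f"{name:>20}: {value:,.3f}")
```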

The data indicate that the mean > median, implying positive (right) skewness, with a high amount of kurtosis. We estimated the gender-wise difference using a two-sample independent Z-test. The calculated test statistic is Z = 1.82233, which is not significant (Z_tab = 1.96), indicating that there is no difference between the genders in the amounts claimed. This also suggests that the two samples have been drawn from the same population (Embrechts et al., 1997; Leadbetter et al., 1983; Resnick, 1987).

Table 1: Summary Statistics of Total Claims Paid
Mean                 189144.4
Standard Error       2520.594
Median               165000
Mode                 150000
Standard Deviation   76826.46
Sample Variance      5.9E+09
Kurtosis             4.682943
Skewness             1.638403
Range                529952
Minimum              100048
Maximum              630000
Sum                  1.76E+08
Count                929
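As a hedged illustration of the gender comparison described above, the following Python sketch implements a two-sample independent Z-test; the file and column names ("irda_health_claims_2010_11.csv", "total_claims_paid", "gender") are assumptions made for the example.

```python
import numpy as np
import pandas as pd

def two_sample_z(x, y):
    """Two-sample independent Z-test for a difference in means, using
    sample variances as plug-in estimates (appropriate for large samples)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    se = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
    return (x.mean() - y.mean()) / se

# Hypothetical column names, as in the preprocessing sketch above.
claims = pd.read_csv("irda_health_claims_2010_11.csv")
large = claims[claims["total_claims_paid"] > 100_000]

z = two_sample_z(
    large.loc[large["gender"] == "M", "total_claims_paid"],
    large.loc[large["gender"] == "F", "total_claims_paid"],
)
print(f"Z = {z:.5f} (compare with the two-sided critical value 1.96 at the 5% level)")
```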


Histogram
A histogram is a graphical representation of the data. The histogram for total claims paid is shown in Fig. 1, and the histogram with the fitted distributions and a normal curve superimposed on it is shown in Fig. 2. These show the skewness of the total claims paid: the total claims paid are heavy right-tailed and there is wide variation in the total claims paid.

Fig. 1: Histogram of total claims paid (frequency against claim-amount bins).

Fig. 2: Probability density function of the fitted Generalized Pareto distribution superimposed on the histogram of total claims paid.



Fig. 3: Probability density function of the fitted Gamma (3P) distribution superimposed on the histogram of total claims paid.

We also plotted P-P plots for each of the fitted distributions (Figs. 4 and 5). For each distribution, we set up the hypothesis that the claims paid follow that distribution; for example, for the Generalized Pareto Distribution (GPD): H0: the GPD is the best fit for the claims paid data, vs. H1: the GPD does not fit the claims paid data.
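The paper's P-P plots were produced with distribution-fitting software. Purely as an illustration, a P-P plot of a fitted GPD could be constructed along the following lines in Python; the synthetic data and the fixed threshold of 100,000 are assumptions made so the sketch runs on its own.

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Placeholder for the 929 large-claim amounts (> 100,000) used in the paper.
rng = np.random.default_rng(0)
data = 100_000 + stats.genpareto.rvs(c=0.1, scale=60_000, size=929, random_state=rng)

# Fit a Generalized Pareto distribution by maximum likelihood,
# holding the location fixed at the 100,000 threshold.
shape, loc, scale = stats.genpareto.fit(data, floc=100_000)

# P-P plot: empirical CDF positions against the model CDF at the sorted data.
x = np.sort(data)
p_empirical = (np.arange(1, len(x) + 1) - 0.5) / len(x)
p_model = stats.genpareto.cdf(x, shape, loc=loc, scale=scale)

plt.plot(p_empirical, p_model, "o", markersize=3, label="Gen. Pareto")
plt.plot([0, 1], [0, 1], "k-", linewidth=1)  # 45-degree reference line
plt.xlabel("P (Empirical)")
plt.ylabel("P (Model)")
plt.title("P-P Plot")
plt.legend()
plt.show()
```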

Fig. 4: P-P plot of the Generalized Pareto fit, P(Model) against P(Empirical).

From Fig. 4, we observe that the circles all lie quite close to the line, close enough to say that the data come from a Generalized Pareto distribution. There is a little random wriggle about the line, but this does not disqualify these data from following a Generalized Pareto distribution: close enough is good enough. We therefore accept the hypothesis that the GPD fits the claims paid data well. Similarly, we set up the hypothesis that the Gamma distribution is the best fit for the claims paid data vs. the Gamma distribution does not fit the claims paid data, and likewise for the other distributions.

Fig. 5: P-P plot of the Gamma fit, P(Model) against P(Empirical).

Before fitting the different distributions, we estimated their parameters. Generally, there are three methods for estimating the parameters:
1. Maximum Likelihood Estimation
2. Method of Moments
3. Probability Weighted Moments
The parameter estimates for all the distributions were obtained with the help of SPSS software and are given in the table below:

S.No  Distribution          Parameters
1     Beta                  β1 = 0.70384, β2 = 4.3841
2     Chi-Squared           α = 1.6759E+5
3     Chi-Squared (2P)      α = 77861, β = 89275.0
4     Exponential           α = 5.9668E-6
5     Exponential (2P)      α = 1.4794E-5, β = 1.0000E+5
6     Fatigue Life          α = 0.36397, β = 1.5724E+5
7     Fatigue Life (3P)     α = 1.0999, β = 45094.0, γ = 94920.0
8     Gamma                 α = 5.2054, β = 32196.0
9     Gamma (3P)            α = 0.89385, β = 73866.0, γ = 1.0000E+5
10    Gen. Extreme Value    k = 0.30225, α = 35661.0, β = 1.3201E+5
11    Gen. Gamma            k = 1.071, α = 5.9108, β = 32196.0
12    Gen. Gamma (4P)       k = 0.85423, α = 1.1531, β = 55493.0, γ = 1.0000E+5
13    Gen. Pareto           k = 0.10053, α = 61153.0, β = 99607.0
14    Log-Gamma             α = 1120.1, β = 0.01068
15    Lognormal             α = 0.35708, β = 11.957
16    Lognormal (3P)        α = 1.0518, β = 10.68, γ = 96405.0
17    Normal                α = 73457.0, β = 1.6759E+5
18    Pareto                α = 2.2497, β = 1.0000E+5
19    Pareto 2              α = 105.37, β = 1.7624E+7
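The estimates above came from SPSS. The same kind of maximum likelihood fitting can be sketched in Python with scipy, as below; the synthetic data standing in for the 929 large claims are an assumption of this sketch, and scipy's (shape, location, scale) parameterizations differ from the α/β/γ labels used in the table.

```python
import numpy as np
from scipy import stats

# Placeholder for the large-claim amounts (> 100,000); the paper's estimates
# were obtained from the IRDA data using SPSS.
rng = np.random.default_rng(1)
data = 100_000 + stats.genpareto.rvs(c=0.1, scale=60_000, size=929, random_state=rng)

# Generalized Pareto, with the location fixed at the 100,000 threshold.
gpd_shape, gpd_loc, gpd_scale = stats.genpareto.fit(data, floc=100_000)

# Three-parameter Gamma: shape, location (threshold) and scale are all free.
gamma_shape, gamma_loc, gamma_scale = stats.gamma.fit(data)

print(f"Gen. Pareto: k={gpd_shape:.5f}, loc={gpd_loc:.1f}, scale={gpd_scale:.1f}")
print(f"Gamma (3P) : shape={gamma_shape:.5f}, loc={gamma_loc:.1f}, scale={gamma_scale:.1f}")
```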

Goodness of Fit – Summary
After fitting the data with the various distributions, we tested the fits with the Kolmogorov-Smirnov (KS), Anderson-Darling and Chi-squared tests and found that the Generalized Pareto Distribution fits well under all three tests for the claim data. It is listed under Serial Number 13 in Table 5 below and ranks "1" under all three tests. One can therefore conclude that the claims paid follow the Generalized Pareto Distribution, with the Generalized Gamma coming second, and so on.

Table 5: Goodness-of-fit statistics and ranks
S.No  Distribution          Kolmogorov-Smirnov   Anderson-Darling   Chi-Squared
                            Statistic (Rank)     Statistic (Rank)   Statistic (Rank)
1     Beta                  0.07638 (9)          26.161 (11)        45.995 (9)
2     Chi-Squared           0.65708 (18)         1.01E+05 (19)      3272.7 (18)
3     Chi-Squared (2P)      0.65793 (19)         26255 (18)         3272.7 (19)
4     Exponential           0.44936 (17)         186.51 (16)        849.39 (16)
5     Exponential (2P)      0.05736 (6)          21.458 (7)         16.24 (4)
6     Fatigue Life          0.13111 (12)         25.303 (10)        114.69 (12)
7     Fatigue Life (3P)     0.04307 (3)          1.2994 (2)         17.675 (5)
8     Gamma                 0.17549 (14)         35.062 (14)        120.87 (14)
9     Gamma (3P)            0.03623 (2)          19.87 (5)          12.053 (3)
10    Gen. Extreme Value    0.0579 (7)           5.393 (4)          31.579 (6)
11    Gen. Gamma            0.14753 (13)         34.089 (13)        113.38 (11)
12    Gen. Gamma (4P)       0.04919 (5)          20.178 (6)         9.7931 (2)
13    Gen. Pareto           0.03365 (1)          0.54658 (1)        9.7867 (1)
14    Log-Gamma             0.11884 (10)         21.819 (8)         105.54 (10)
15    Lognormal             0.1228 (11)          23.294 (9)         115.73 (13)
16    Lognormal (3P)        0.04526 (4)          2.1979 (3)         45.65 (8)
17    Normal                0.17873 (15)         57.316 (15)        222 (15)
18    Pareto                0.07499 (8)          27.332 (12)        42.697 (7)
19    Pareto 2              0.44909 (16)         186.63 (17)        856.05 (17)
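The test statistics above were produced with SPSS. A minimal Python sketch of the same kind of comparison is given below; it covers the KS statistic and a rough binned Chi-squared comparison only (scipy's built-in Anderson-Darling test supports only a few named distributions), and the synthetic data are an assumption so that the sketch runs on its own.

```python
import numpy as np
from scipy import stats

# Placeholder for the 929 large-claim amounts (> 100,000).
rng = np.random.default_rng(2)
data = 100_000 + stats.genpareto.rvs(c=0.1, scale=60_000, size=929, random_state=rng)

candidates = {
    "Gen. Pareto": stats.genpareto,
    "Gamma (3P)": stats.gamma,
    "Lognormal (3P)": stats.lognorm,
}

for name, dist in candidates.items():
    params = dist.fit(data)                       # maximum likelihood estimates
    ks = stats.kstest(data, dist.cdf, args=params)

    # Rough chi-squared comparison over 10 equal-width bins of the data range.
    edges = np.linspace(data.min(), data.max(), 11)
    observed, _ = np.histogram(data, bins=edges)
    expected = len(data) * np.diff(dist.cdf(edges, *params))
    chi2 = np.sum((observed - expected) ** 2 / expected)

    print(f"{name:<15} KS = {ks.statistic:.5f} (p = {ks.pvalue:.3f})   Chi2 = {chi2:.2f}")
```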

CONCLUSION
We can conclude that the total claims paid follow the Generalized Pareto Distribution, and we can estimate the number of persons claiming above any given claim amount by using the fitted Generalized Pareto Distribution. This study can be extended to various other actuarial data sets, which will be helpful in processing the claims and other parameters of the data. Thus, we can apply this procedure of fitting distributions and estimating their parameters to variables other than claims, such as premium paid, etc.

REFERENCES
1. Denuit, M., Marechal, X., Pitrebois, S. and Walhin, J.F. (2007). Actuarial Modeling of Claim Counts: Risk Classification, Credibility and Bonus-Malus Systems. John Wiley & Sons Ltd.
2. Embrechts, P., Klüppelberg, C. and Mikosch, T. (1997). Modeling Extremal Events for Insurance and Finance. Berlin: Springer-Verlag.
3. Leadbetter, M.R., Lindgren, G. and Rootzén, H. (1983). Extremes and Related Properties of Random Sequences and Processes. Springer-Verlag. ISBN 0-387-90731-9.
4. Resnick, S.I. (1987). Extreme Values, Regular Variation and Point Processes. Springer-Verlag. ISBN 0-387-96481-9.



“Good teachers come by passion. Not by algorithm.” Interview with Prof. B. Yegnanarayana [Dr. Bayya Yegnanarayana is a renowned name in Computer Science. An Institute Professor at the International Institute of Information Technology (IIIT) Hyderabad and Professor Emeritus at BITS-Pilani, Hyderabad Campus, his teaching experience spans over five glorious decades. He was Professor & Microsoft Chair at IIIT-H from 2006 to 2012. Prior to joining IIIT-H, he was a professor at IIT Madras (1980 to 2006), a visiting associate professor at Carnegie Mellon University, Pittsburgh, USA (1977 to 1980), and a member of the faculty at IISc, Bangalore (1966 to 1978). He has published over 400 papers in the areas of signal processing, speech, image processing and neural networks. He has supervised 30 PhD and 42 MS theses. He is a Fellow of the Indian National Academy of Engineering, a Fellow of the Indian National Science Academy (INSA), a Fellow of the Indian Academy of Sciences, a Fellow of the IEEE (USA) and a Fellow of the International Speech Communication Association (ISCA). He was the recipient of the 3rd IETE Prof. S.V.C. Aiya Memorial Award in 1996. He received the Prof. S.N. Mitra Memorial Award for the year 2006 from the Indian National Academy of Engineering. He was awarded the 2013 Distinguished Alumnus Award of the Indian Institute of Science, Bangalore, and the Syed Husain Zaheer Medal (2014) of INSA. The Editor, Vishnu S. Pendyala, met Prof. Yegnanarayana in Hyderabad on Thursday, January 5, 2017, to elicit his views for publication in Visleshana. Below are some excerpts.]

1. According to Google scholar, your famed book on "Artificial Neural

Networks" has 762 citations since 1999, its year of publication. However, the most citations are in 10


2015 (172 citations) as contrasted to the year of publication, 1999 (5 citations), or the next, 2000 (4 citations), when the publisher markets it the most. How much of this recent upsurge can be attributed to the Big Data phenomenon? I don’t think it has anything to do with the Big Data phenomenon. I was told that the book is being adopted by many institutions as a textbook for this course, and hence the increased awareness. In fact, the publisher wanted to stop printing, as it is over 15 years old now, but then he told me that the demand is not coming down. They asked me several times to revise it, but I not only do not get enough time, but my ideas on this subject are changing continuously. I used to say, in a lighter vein, that I wrote the book when I did not know the subject, but now it is difficult to write, as I know a little more about it. 2. Signal data is one of the oldest forms of Big Data, having been used even before the term Big Data was coined. The megasamples per second that are used in signal processing are indeed Big Data. How similar or different is running analytics or algorithms on signal data from running them on other forms of Big Data, say, the video data from YouTube, text data from real-time tweets, or the business data from various enterprise applications? There is something wrong in this statement. Signal data cannot be called data, if you call text characters, words, or numbers data. Just to convince

you on this point, can you say that a scanned printed character expressed in pixels is data, in the same way that the characters/words in print are data? I tried to argue about this in many of my talks, but failed to convince many of my audience engrossed in the Big Data area. When we halve the values of signal data, we don’t lose information because the values are relative. But when you halve the values of numbers in Big Data, say the census data, the data becomes useless. 3. Both areas entail extracting features, parameter estimation, recognizing patterns, Principal Component Analysis, etc. The nature of the data is also quite similar, particularly in terms of the attributes listed on Page 16 of your book: fuzzy, probabilistic, noisy, and inconsistent. What other concepts, tools, or techniques can we possibly borrow from the signal processing world for the Big Data area? Actually, you are not supposed to do any large-scale analysis of signal or image data unless you know what information to represent in the signal and how to represent it. This is indeed a great scientific challenge, and unless this is understood and solved, it is not at all correct to talk about Big Data analysis on signals. Regarding the other attributes of the data you mentioned above, in retrospect, they are all terms used to escape from reality. Otherwise, it is impossible to classify any real data into any one or more of those categories. They were introduced in this domain since there are some theories developed for them, not necessarily for any specific tasks. That


is why now one can see the fate of fuzzy logic, as an example. It has practically reached a level where no further improvement is possible. Also it is not useful for many real world problems. 4. One of the limitations of Artificial Neural Networks that may come in the way of its widespread application to Big Data is its unsuitability for prediction tasks due to overfitting. Is there a way to work this limitation around? If I understood correctly, this statement appears to be wrong. There should not be any overfitting, if Big Data is used. Also, the idea of overfitting is due to lack of our understanding of working of the ANN. Of course the way it is implemented also leads to such questions. If you compare with any biological NN (BNN) interpretation, there is no concept of overfitting. Also, one should understand that BNN is a dynamically reconfigurable architecture (the term I am using), which no one knows what it is, and how to implement it. 5. Deep learning seems to hold substantial promise in the evolution of Big Data Analytics. What are some of the disruptive applications that have not been explored enough till now that you think can Deep Learning make possible? To respond to this, we should have a common understanding of what is Deep Learning (DL). In my opinion learning by deep neural networks (DNN) is not deep learning. And, DL need not involve DNN. I know most people hate this statement of mine, but there is very

little I can do about it. For example, if a grandmaster in chess comes up with moves to checkmate his opponent in a small number of moves, then I feel that is due to deep learning, not acquired by playing many more games than others, but by analysing each pattern deeply. 6. Can you please comment on the evolution of the field of Artificial Neural Networks? Could it have progressed any faster over all these decades? There is so much to talk about this, as each person may have his/her own interpretation. In my opinion, the field of ANN has not evolved much at all after the machines took over the process of dealing with them with big data. Just contrast the way people tried to look at different architectures during the 80s, in the form of CNN, TDNN, ART, Avalanche, neocognitron, counter-propagation, AANN; each of them was intended to address some issues specific to a task. That, in my opinion, is evolution, although it did not take off much, due to the dominance of machine learning, machine power and data. There is very little chance of making progress if we are stuck with CNN, RNN, AANN etc., and small variants of them. I feel that people are desperately waiting for the next event/breakthrough to take place, as almost everyone agrees in private that the current approaches have reached a dead end. Note that the idea of DL would not have arisen but for the RBM proposed by Hinton, although that is almost forgotten in the current pursuit of using CNN, AANN, RNN and their


variants. 7. Do you foresee anything beyond Deep Learning in that direction? Yes, of course. It will definitely happen, but not likely by the people engrossed in the current way of thinking and using. I only hope that it will happen in my lifetime so that I will be happy that I am right to some extent. 8. You have had a glorious teaching career spanning over 5 decades. How has the student profile changed over the years? Some students just don’t understand these algorithms. Has that ability improved over the years? In fact, students understood better earlier. There are many distractions today. Now we are not letting anyone understand or think. As you mentioned, I have been teaching for 5 decades, and the standards of students are much lower on average than they were in the 60s and 70s and even in the early 80s, when CS departments were not there. 9. How about the hi-tech education landscape in general? Has it improved substantially in these five decades? High tech has definitely improved, but at the cost of education in a big way. Many so-called education software developers, some of whom, as I understand, may not even have a degree, met me saying that they want to help me improve my teaching skills. I humbly told them that it is too late in my life to learn now how to teach better. As I wrote in my article for the Indian National Academy of Engineering (INAE)

Newsletter recently, I envisage that there are many excellent ways of enhancing learning through technology advancements, but it appears, currently at least, that they are only reducing our learning abilities significantly. To counter this, people come up with new definitions and measures of learning to show improvement of learning with their gadgets. I keep arguing in several forums that the moment you type a character on your keyboard, instead of writing on paper, your IQ level comes down by a small fraction. I will not be surprised if, in less than 10 years, people dig into the past for good ideas on education, after seeing how ignorant the current products of institutes of learning are about many basic things. We don’t find the same kind of commitment levels that we found before the 80s. An incident with my friend Prof. Pramod Moharir comes to my mind. He was a scientist at NGRI and a Bhatnagar award winner. When he was at IISc as my colleague in the 70s, I read a paper in JASA on the application of number theory to the diffusion of sound in rooms, and told him about it around 8 o’clock that night. The next day at 12 o’clock, he came up with a 10-page paper on the topic that he had written overnight. The paper was accepted by the journal within one month. Do we find that kind of passion anywhere now? 10. You have seen technologies like Machine Learning and the very concept of Big Data evolve over the decades. Can you please give some personal insights into your experience at the frontiers of technology from time to time?


We are just generating a lot of jargon. We don’t realize the implications when we introduce a new term. By the time freshers try to understand these terms, new terms are coming up. These terms are loosely coined, and most of the time their scope is not well defined. What is Machine Learning? I asked this question to several candidates whom I interviewed for teaching positions in the IITs and other places over the years. So far, I haven’t received a coherent response to this question. A common theme in the answers appears to be “Pattern Searching”. You take Data Mining. Take the example of the census data. From the data, you extract the purpose. Data itself is linked with purpose. But now, we keep collecting Big Data without purpose. Humans select first and then collect. Machines collect first and then attempt to select. Good teachers come by passion, not by algorithm. I met Sir CV Raman (Nobel Prize Winner in Physics) when I was about 24 years old. I could feel his passion for science and also for violin, when I was explaining the experiments in my Acoustics lab to him. I was fortunate to rub shoulders with many noted people in the field of Artificial Intelligence and speech signal processing. Some of my students now teach in top universities like MIT. Prof. Anant Agarwal of MIT did his B.Tech final year project under my guidance. The anechoic sound chamber I built from basic principles nearly 45 years ago in IISc, Bangalore for INR 8 lakhs those

days, is still being used. 11. You have a passion for Neural Networks. Do you use Neural Network concepts in your day-to-day life? Do you practice “Deep Learning” as a human? I also had a passion for the card game Bridge. Bridge helps you focus. You concentrate on what you are doing. It was my favourite game for 15 years, from 1965 to 80. I used to spend 4.5 days for work and 2.5 days for play. I played other games too, like snooker and billiards, but Bridge was my passion. I won some state-level Bridge tournaments in Karnataka, and in the process made contacts with many big people through the game. I combined the two passions and wrote a journal paper in 1989/90. (Editor: A Google search returns the following citation for the paper: “Yegnanarayana, B., Deepak Khemani, and Manish Sarkar. "Neural networks for contract bridge bidding." Sadhana 21.3 (1996): 395-413.”) I used to subscribe to around 15 journals, spending about 20% of my pay for them. I wouldn’t have time to read each and every article inside these journals, but applied deep learning ideas to get what I wanted from them. People say, “A picture is worth a thousand words.” But I say, “A word is worth a thousand pictures.” A word brings to mind many vivid images. You get more from a picture than from a video. That is deep learning. Humans and machines are complementary, but we are trying to replace one with the other.


We can’t draw a parallel between the Von Neumann bottleneck and the human brain. Nobody knows how human memory works. But we know every bit that moves in a computer. We should be exploiting the complementary nature of the two, rather than trying to replace one with the other.

12. What are your plans for the future? Good question. I have dozens of research issues to ponder over in speech and signal processing. I should only get time to think about them. Frankly, I feel that I can engage dozens of research scholars, provided they are interested in research, not necessarily only in the research degrees. Kalman (of the Kalman Filter fame) may have had only 30 or 40 papers to his credit. But there are an estimated 100,000 papers with “Kalman Filter” in the title. We need to produce quality papers, not a large quantity of papers.

13. What is your message to the budding data scientists who will be driving the Big Data innovation using various tools and algorithms, including your favourite Artificial Neural Networks and Deep Learning? Normally I don’t give advice or a message, as I feel that I am not that competent. But I strongly feel that everyone should do what they are doing with passion 80% of the time, and the remaining time they should look at their work critically, in the sense of asking what it cannot do and why. Forget about ANN and DL. Before long these terms will vanish into history.

Details of a few Recent Events

1. Keynote Address, “Quantifying Truth in Big Data: Mathematical Approaches,” IEEE sponsored international conference ICEECCOT-2016, Mysuru, Dec 10, 2016. http://www.iceeccot.com/ Detailed at https://www.facebook.com/vishnu.pendyala/posts/10212247585694715

2. Invited Talk, “Math Never Lies: Machine Learning for Fact Finding,” IEEE sponsored international conference, The 15th ICIT, Bhubaneswar, Dec 24, 2016. http://icit.github.io/ Detailed at https://www.facebook.com/vishnu.pendyala/posts/10212192168829328

3. Guest lecture, “Tools and Techniques for Trusting the Web,” Student Branch of the Computer Society of India at CVR College of Engineering, Hyderabad, Dec 5, 2016. Recording available at https://www.youtube.com/watch?v=JcpQMBZWs7o

4. Invited Talk, “Evolving a Truthful World Wide Web,” Computer Society of India, Hyderabad Chapter, Jan 7, 2017. https://www.meetup.com/CSI-Tech-Talk-Meetup/events/236605086/

5. Inauguration of Faculty Development Program on “Big Data Analytics using R” at St. Martin’s Engineering College, Hyderabad, Dec 29, 2016. https://www.linkedin.com/pulse/educations-roleupholding-value-system-vishnu-pendyala



International Workshop on Clinical Data Analytics, WCDA 2016: A Report Dr. Meghana V. Aruru Indian Institute of Public Health Hyderabad, India Email: meghana.a@iiphh.org

The International Workshop on Clinical Data Analytics (WCDA 2016)

was held on December 23rd, 2016, at the Indian Institute of Public Health (IIPH), Hyderabad. WCDA 2016 was organized by the Health Analytics Network (HAN) and co-sponsored by The International Indian Statistical Association (IISA) and the Indian Association of Statisticians in Clinical Trials (IASCT). The CSI Special Interest Group in Big Data Analytics was a partner of the event. WCDA 2016 was publicized at various venues including local universities, industry, interest groups, professional organizations, etc. The workshop brought together researchers, scholars and scientists from across India, including the University of Hyderabad, Thrombosis Research Institute, SAS Institute, Osmania Medical College, IIPH and Novartis Healthcare. Prof. N. Rao Chaganty from Old Dominion University, USA, and past president of IISA, delivered the first plenary talk on “New methods for longitudinal analysis of health and clinical data”. Dr. Ramesh Hariharan, CEO of Strand Life Sciences, discussed “Big Data Analytics for genomic medicine” in his plenary talk, followed by Dr. Vishwanath Iyer, Head of Exploratory Safety and Statistical Analytics at Novartis Healthcare, who presented case studies in analytics in clinical research. Prof. Sujit K. Ghosh from North Carolina State University, USA, and current President of IISA, delivered a plenary talk on “Bayesian Sample Size determination for clinical trials”. Mr. Souvik Bandyopadhyay of IIPH conducted a tutorial on “Functional Data Analysis using R” and Mr. Gunasekaran Manogaran from the Vellore Institute of Technology demonstrated the use of tools in modeling disease dynamics. Dr. Meghana Aruru of IIPH delivered a special talk on “Clinical Analytics Ethics” and the role of clinical analytics in pharmacovigilance. Prof. Saumyadipta Pyne from IIPH delivered the last special talk on “Decision making based on high velocity data streams”, where he discussed big data methods for predictive analytics in pharmacovigilance and monitoring data streams to develop early warning signals. The workshop concluded successfully with a panel discussion, led by Dr. Rajani Kanth Vangala, Director of Research, Thrombosis Research Institute, Bengaluru, and Dr. Viswanath Iyer from Novartis Healthcare, which discussed the current state of clinical analytics and future research directions.



The Health Analytics Network (HAN) brings together scientists, researchers and professionals across domains and different countries to enable consortia science in areas overlapping data analytics and healthcare. As part of this overarching goal, HAN conducts a series of workshops in various specific areas of health analytics. Its future workshops will cover areas in risk modeling, safety and pharmacovigilance among others. WCDA 2016 Organizers and Speakers: from left – Prof. Sujit K.Ghosh, Mr. Gunasekaran Manogaran, Prof. N.Rao Chaganty, Mr. Souvik Bandopadhyay, Dr. Rajani Kanth Vangala, Dr. Meghana Aruru, Dr. Viswanath Iyer, Prof. Saumyadipta Pyne (not in photo: Dr. Ramesh Hariharan)

Pictures from the Event

WCDA 2016 Organizers: (from left) – Prof. Saumyadipta Pyne and Dr. Meghana Aruru



Mining the CRM data to understand your Customers Surya Putchala

Sonali Singh

Global Head, Data Science Cappius Technologies Hyderabad, India Email: surya@cappius.com

Data Scientist Cappius Technologies Hyderabad, India Email: sonali@cappius.com

A business cannot survive without conducting ongoing efforts to better understand customer needs and deliver a product/service with a meaningful and compelling value proposition. In this hyper-technological world, customers are more informed, have more options, and have higher expectations than ever before. Hence, the more you know about your customers, the more effective your sales and marketing efforts will be. It is important to understand customer aspects like:
§ who they are
§ what they buy
§ why they buy it

With the advent of analytics, collecting and analyzing customer data is increasingly used to understand customer behaviour. Generally, the customer relationship management (CRM) system is a treasure trove of valuable information about customers. Customer insights allow you to up-sell and cross-sell, thus increasing profitability. One of the first steps to understanding customers is to group them into categories, i.e., segments that display similar behaviors and fulfil similar needs. This involves evolving different engagement plans for the different segments of customers. Segmentation provides us insights about the following:
§ How successful you are at keeping your ‘active customer’ segment engaged with your marketing messages, and whether this segment is growing
§ How many new leads and customers you’re acquiring
§ How many customers you’re losing
§ How many customers you are reactivating from ‘lapsed’ or ‘at risk’ status
§ How successful you are at converting one-time customers into repeat customers
§ What your retention rate is, compared to industry benchmarks

Some fundamental (naive to very advanced) strategies by which customers are typically segmented are outlined below.

Based on Order Status: Here we segment customers based on their order status. Segmenting them this way helps determine the incentive strategy.
§ First-time Customers: whether they are logged-in users or guest users, whether they are repeat buyers or first-time shoppers. The behavior of repeat buyers and first-time shoppers is not the same. If it is a first-time buyer, they can be offered a special discount so that they are more likely to complete their purchase.


§ Customers with Abandoned Carts: These customers loved your products, but something went wrong that made them leave your store. It might be your shipping fees or something else.
§ Customers with Cancelled Orders: You need to know why customers changed their minds about their purchases in order to fix and develop your website shopping experience.
§ Repeated Customers: These are the customers with more than two orders. Repeated customers are the most likely to become your “Loyal Customers”; you need to follow up with them and make sure they are satisfied with your products and service.
§ Loyal Customers: These are the customers with orders above a certain number or value. Loyal customers are your raving fans; they expect special treatment, such as offers and coupons, and they always share their positive word of mouth across their network of friends and families.
§ Inactive Customers: These are the customers with no purchases for some time. Inactive customers are easier to win back than new customers are to acquire. They already know you, and they bought from you once before. Customers are forgetful; they might just need a reminder, or might just need a coupon to come over and buy again.

Customer lifecycle segmentation
One approach is to look at how active customers are (recency), how frequently they’ve shopped (frequency) and how much they’ve spent (lifetime value). There is a big difference in revenue gained from a regular shopper who only buys discounted products and one who consistently buys high-value items. Lifecycle segmentation is a powerful approach that focuses on tailoring marketing messages to where a customer is in their journey with your product, brand or service.

a) Recency
Recency refers to how long back a customer purchased. Though the boundaries you set will depend on what type of business you’re running, you’d typically want to segment your customers into the following:
§ Active - those who have shopped recently. These customers may not need a special promotion, as the shopping experience and your products are still fresh in their minds
§ At risk - those who have previously purchased from you, but have not returned to make a purchase in the timeframe you’d usually expect (e.g. between 6 and 12 months)
§ Churned/Lapsed - those who have purchased previously but have gone way beyond the point you’d usually expect them to return to make another purchase
§ Any other customer cohorts

b) Frequency
Frequency refers to how often somebody has shopped with you.
§ Prospect/lead - someone who hasn’t shopped with you at all
§ One-off customer - somebody who has made a single purchase from you
§ Repeat customer - somebody who has made more than one purchase from you
§ Loyal customer - someone who has purchased a sufficient number of times to be considered ‘loyal’

c) Lifetime value
Utilize your transaction history to analyze the purchasing habits of segments of your customers and create individually targeted marketing campaigns. Consider the following KPIs for segmenting the customers:
§ Share of wallet
§ Purchase frequency
§ Average basket size
§ Product/category interests
§ Buying cycle

One of the most basic pieces of segmentation an online retailer can do is to recognize who their best customers are. Typically, the top 10 per cent of customers will produce 30-45 per cent of the revenue. It is worth investing some time in this segment, as research suggests that a high-value buyer can be as much as 30 times more valuable than the rest, so marketing strategies that improve spend performance will have a good effect on the bottom line. Usually a VIP/top/medium/low scale might suffice, ranked either by average order value (AoV), historic customer lifetime value (CLV) (i.e. the total amount that a customer has spent with you), or even predictive CLV (a projected view of how valuable a customer will be to you). Typically, CLV yields customer clusters that can be categorized like:
§ Economical customers
§ Bargain hunters
§ Big spenders
§ Evangelists
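As an illustration of the recency/frequency/lifetime-value idea described above, a minimal pandas sketch is shown below. The file name, column names and bucket thresholds are assumptions made for the example, not prescriptions.

```python
import pandas as pd

# Hypothetical CRM/order export: one row per order.
orders = pd.read_csv("orders.csv", parse_dates=["order_date"])
today = orders["order_date"].max()

# Recency, frequency and monetary (lifetime value) per customer.
rfm = orders.groupby("customer_id").agg(
    recency_days=("order_date", lambda d: (today - d.max()).days),
    frequency=("order_id", "count"),
    monetary=("order_value", "sum"),
)

# Lifecycle buckets based on recency; the thresholds depend on the business.
def lifecycle(days):
    if days <= 90:
        return "Active"
    if days <= 365:
        return "At risk"
    return "Churned/Lapsed"

rfm["lifecycle"] = rfm["recency_days"].apply(lifecycle)

# Frequency buckets: one-off vs. repeat vs. loyal customers.
rfm["frequency_segment"] = pd.cut(
    rfm["frequency"], bins=[0, 1, 4, float("inf")],
    labels=["One-off", "Repeat", "Loyal"],
)

# Value tiers by historic customer lifetime value (quartiles as a rough scale).
rfm["value_tier"] = pd.qcut(rfm["monetary"], q=4,
                            labels=["Low", "Medium", "Top", "VIP"])

print(rfm.head())
```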

While the above sections cover the ‘right timing’ aspect of relevancy, getting the ‘right message’ in front of your customer is crucial, too, if you’re going to attract their interest. Using data from an individual customer’s behavior and from trends in your customer base as a whole, it’s possible to personalize your messaging by segmenting based on:

Product affinity segmentation:
§ Products or categories viewed
§ Products or categories purchased
§ Products or categories most likely to result in cross-sell

Demographic segmentation
You may also be collecting lots of personal information about your customers, such as age, gender, preferences, income, etc. All this information can again be used to personalize how you target them and craft offers for them. At its most basic, you can segment your marketing efforts based on demographic data that you’ve accumulated about your customers. This might be:
§ Gender (e.g. creating a male- and a female-focused version of your newsletter if you sell to both genders)
§ Age (e.g. creating a student discount campaign for those that fall into the right age bracket)
§ Location: This involves breaking down customers based on their location. This way you can understand their local preferences and behaviors and then tailor your marketing messages (e.g. creating personalized ‘visit your local store’ campaigns to get people shopping offline).

Conclusion: We have discussed four major segmentation strategies. In most cases, the segmentation is based on simple rules such as value, geography, order aging, etc.; these can be accomplished with simple segregation of customer demographic or transactional data. More advanced segmentation can be achieved with machine learning algorithms.
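Continuing the hypothetical RFM table from the earlier sketch, one way such machine-learning-based segmentation might look is a k-means clustering of the scaled recency/frequency/monetary features; the choice of four clusters merely echoes the four CLV groupings mentioned above and is an assumption of this sketch.

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# 'rfm' is the recency/frequency/monetary table built in the earlier sketch.
features = rfm[["recency_days", "frequency", "monetary"]]
scaled = StandardScaler().fit_transform(features)

kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)
rfm["cluster"] = kmeans.fit_predict(scaled)

# Profile each cluster so it can be given a business-friendly label afterwards.
print(rfm.groupby("cluster")[["recency_days", "frequency", "monetary"]].mean())
```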

"ZettaMine". Earlier, he architected commercial Analytical Applications for Product MDM (epaCUBE) and Procurement Optimization (SolPro). Wearing several hats such as Vice President of TDWI (India Chapter), Gardener of Hyderabad Hadoop User Group, Founder of Hyderabad Data Science Group, he constantly endeavors to evangelize the adoption of quantitative techniques for decision making in various verticals. He held senior leadership roles with firms such as GE capital, Cognizant, Accenture and HCL. He graduated from IIT Kharagpur. Sonali Singh: Sonali is currently a Data Scientist at Cappius Technologies working on Customer analytics. This includes all phases of customer journey with an enterprise – acquisition, engagement, understanding and retention. She has utilized Machine Learning and Statistical Analysis for solving different customer scenarios. She has 4 publications to her credit. She graduated from Indian Institute of Space Science and Technology majoring in Geoinformatics.



Cloud adoption for Big Data: When and Why Suresh Yaram

Senior Architect, Computer Sciences Corporation
Email: yaramsuresh@gmail.com

The cloud has become an inevitable platform for handling huge volumes of data ingested at very high speeds from a variety of data sources (structured and unstructured); another demanding requirement is to deliver insights from the data to the intended business stakeholders in real time or near real time. The cloud platform makes this possible with its unlimited elasticity of compute and its inexpensive storage offerings.

Factors affecting Cloud platform adoption for Big Data workloads
• Unparalleled economies of scale with cloud environments and, at the same time, transfer of infrastructure management risk to the cloud service provider
• Minimal capital expenditure and reduced operating expenditure with the pay-as-you-go billing model, instead of provisioning for peak capacity at all times

Public Cloud Usage Scenarios for Big Data processing
• Handling of varying/transient workloads: it is very common that huge volumes of data come for processing from the on-line transaction processing (OLTP) systems during month-end or holiday seasons. In on-premise deployments, it is mandatory to provision the compute and storage infrastructure to meet peak workload requirements, which is prohibitively expensive. In such a scenario, it is recommended to move these peak workloads to a cloud platform where storage and compute capacity can be provisioned on demand. Typically, one should provision a certain amount of reserved capacity to handle constant workloads and on-demand capacity to handle varying workloads. An alternative solution could be to implement a hybrid cloud by having bare metal (private cloud) for high-I/O workloads and the public cloud for sporadic workloads. As someone rightly said, own the base and rent the spike.
• Handling huge amounts of data from external data sources: Leverage cloud processing capability if a huge amount of data is being ingested from social media sources for pre-processing, say for customer sentiment analysis or clickstream data.
• Building sandbox environments: This involves provisioning a very large but short-lived Hadoop sandbox cluster for development and testing purposes. Here, provisioning can happen elastically based on the storage and compute requirements of a short-duration data analytics project, and the cluster can be brought down when the project is over. The necessary storage and compute power can be provisioned on demand as required by the project.
• Handling huge amount of data from external data sources: Leverage cloud processing capability, if in case, a huge amount of data is being ingested from social media sources for pre-processing, say customer sentiment analysis or clickstream data. • Build sandbox environments: This involves provisioning of very large but short-lived Hadoop sandbox cluster for development and testing purposes. Here provisioning can happen elastically based on storage and compute requirements for a short duration of data analytics project and bring down the cluster when the project is over. The necessary storage and computer power can be provisioned on demand as required by the project. Major Cloud Service Providers There are a wide range of cloud providers offering big data services that includes Amazon Web Services (AWS), Microsoft, Oracle, Rackspace and Google. 22


Major Cloud Service Providers
There is a wide range of cloud providers offering big data services, including Amazon Web Services (AWS), Microsoft, Oracle, Rackspace and Google.
• The AWS Big Data ecosystem includes Amazon Elastic MapReduce (EMR) for data processing using the Hadoop MapReduce framework, the Amazon DynamoDB NoSQL database for handling JSON-like data, Amazon Redshift, a petabyte-scale data warehousing service on a scalable Massively Parallel Processing (MPP) architecture, and Amazon Simple Storage Service (S3) for storing any amount of data (EMRFS). AWS also provides Amazon Machine Learning (AML) and Amazon QuickSight for predictive analytics and data visualization.
• Microsoft Azure has the Azure HDInsight offering, which is built on the Hortonworks Data Platform (HDP) and is fully compliant with Apache Hadoop on Windows/Linux environments. Azure leverages Windows Azure Blobs for storing large amounts of unstructured data (HDFS).
• Oracle Big Data Cloud Service provides a complete big data environment that includes the Cloudera Hadoop Distribution (CDH) with Hadoop and Spark. This offering provides the ability to interact with on-premise Oracle and cloud-based Hadoop and NoSQL data sources seamlessly using ANSI SQL.
• The Rackspace Big Data offering includes the Hortonworks Data Platform (HDP), which is built on OpenStack (an open cloud platform supporting either public or private clouds). It uses "Cloud Files" as a "traditional" cloud-based object storage service.
• The Google big data offering includes the Google App Engine MapReduce framework or the Google Compute Engine Hadoop MapReduce framework for data processing, Google BigQuery, which is the heart of Google's big data offering, for data analysis similar to Hive, and the Google Prediction API and Google Chart Tools for machine learning and data visualization.
• AWS and Rackspace are the leading contestants in this space. AWS is best suited for large enterprises due to the worldwide availability of its data centers and its competitive pricing; Rackspace is more stable than AWS but more expensive, and hence suitable for mid-sized enterprises.
• Microsoft Azure HDInsight is ahead of AWS in its end-user tools. Azure comes with a Hive ODBC driver and a Hive add-on for Excel to deliver analytics using Excel.

Benefits of cloud adoption for Big Data processing
• Significant savings on the Capex and Opex components of the enterprise cost structure
• Unlimited scalability, self-manageability, elasticity and automatic fault tolerance of the infrastructure
• Focus on drawing insights out of the data instead of on managing the big data infrastructure
• Ability to manage a (hybrid) cloud environment with very limited administration resources

Risks and Challenges with cloud platform adoption
• Lack of maturity in data governance practices, the possibility of data leakage and data loss, and the lack of established data security and compliance standards have been recognized as possible obstacles to cloud adoption.


Call for Contributions

Submissions, including technical papers, in-depth analyses, and research articles are

invited for publication in “Visleshana”, the newsletter of SIG-BDA, CSI, in topics that include but are not limited to the following: • Big Data Architectures and Models • The ‘V’s of Big Data: Volume, Velocity, Variety, Veracity, Visualization • Cloud Computing for Big Data • Big Data Persistence, Preservation, Storage, Retrieval, Metadata Management • Natural Language Processing Techniques for Big Data • Algorithms and Programming Models for Big Data Processing • Big Data Analytics, Mining and Metrics • Machine learning techniques for Big Data • Information Retrieval and Search Techniques for Big Data • Big Data Applications and their Benchmarking, Performance Evaluation • Big Data Service Reliability, Resilience, Robustness and High Availability • Real-Time Big Data • Big Data Quality, Security, Privacy, Integrity, Threat and Fraud detection • Visualization Analytics for Big Data • Big Data for Enterprise, Vertical Industries, Society, and Smart Cities • Big Data for e-governance • Innovations in Social Media and Recommendation Systems • Experiences with Big Data Project Deployments, Best Practices • Big Data Value Creation: Case Studies • Big Data for Scientific and Engineering Research • Supporting Technologies for Big Data Research • Detailed Surveys of Current Literature on Big Data We are also open to: • News, Industry Updates, Job Opportunities, • Briefs on Big Data events of national and global importance • Code snippets and practice related tips, techniques, and tools • Letters, e-mails on relevant topics and feedback • People matters: Executive Promotions and Career Moves All submissions must be original, not previously published or under consideration for publication elsewhere. The Editorial Committee will review submissions for acceptance and reserves the right to edit the content. Please send the submissions to the editor, Vishnu S. Pendyala at visleshana@gmail.com

