Visleshana Vol. 2 No. 2



Volume 2, Issue 2

The Flagship Publication of the Special Interest Group on Big Data Analytics


Jan – Mar 2018

Chief Editor and Publisher: Chandra Sekhar Dasaka
Editor: Vishnu S. Pendyala
Editorial Committee: B.L.S. Prakasa Rao, S.B. Rao, Krishna Kumar, Shankar Khambhampati and Saumyadipta Pyne
Website:

Please note: Visleshana is published by Computer Society of India (CSI), Special Interest Group on Big Data Analytics (CSI-SIGBDA), a non-profit organization. Views and opinions expressed in Visleshana are those of individual authors, contributors and advertisers and they may differ from policies and official statements of CSI-SIGBDA. These should not be construed as legal or professional advice. The CSI-SIGBDA, the publisher, the editors and the contributors are not responsible for any decisions taken by readers on the basis of these views and opinions. Although every care is taken to ensure genuineness of the writings in this publication, Visleshana does not attest to the originality of the respective authors’ content. © 2018 CSI, SIG-BDA. All rights reserved. Instructors are permitted to photocopy isolated articles for non-commercial classroom use without fee. For any other copying, reprint or republication, permission must be obtained in writing from the Society. Copying for other than personal use or internal reference, or of articles or columns not owned by the Society without explicit permission of the Society or the copyright owner is strictly prohibited.

Dear Readers,

From the Editor’s Desk

Welcome back to yet another issue of Visleshana with interesting articles and information. The holiday season has been hectic with conferences and courses. I had the privilege of delivering two keynote addresses at international conferences sponsored by IEEE and, more importantly, offering a 7-day course on “Big Data Analytics for Humanitarian Causes,” sponsored by the Ministry of Human Resource Development, Government of India, through its GIAN program. Read more about the course contents, and a perspective on research in these areas based on the experience from the program, in two different articles inside.

Healthcare is probably the biggest challenge and opportunity of this century. In their article, Dr. Aruru and Prof. Pyne point out that the WHO has identified over a thousand epidemics, and discuss using Big Data Analytics for watching over such disease outbreaks.

Visleshana has always endeavored to bring a perspective from the industry in every issue. In their article, Mr. Surya and Ms. Sreshta discuss use cases of companies using analytics to predict sales prospects and future revenues. The article presents interesting insights into opportunity scoring, channel analyses and pipeline forecasting.

Text continues to be a substantial part of the Big Data revolution. Dr. Padmaja et al. present a Knowledge Representation framework for textual analysis in their article inside. Representation of knowledge is key to its processing, manipulation, and value generation, and algorithms depend substantially on that representation. There is still ample scope for enhancing representations for improved value generation from knowledge, and we can expect more research in this area as Big Data Analytics evolves.

The new year always brings with it new hope, new engagements, and new directions. I am happy to inform you that CSI SIGBDA has signed an MoU with NITW to offer courses that count towards either a Certificate or a PG Diploma. Read more about it inside.
Another development is the progress in orienting the SIG towards industry needs. Please also read inside about the SIG’s first meet with industry leaders, the beginning of membership enrollments, and news about the international conferences where our executive council members played a major role. Please remember that Visleshana is now indexed by Google Scholar. Also, you can now follow Visleshana on Twitter and Facebook and participate in the discussions. I am also glad to report that in about 7 months, Visleshana earned more than 10,000 impressions, 1,000+ reads, and more than 100 followers on Facebook. I strongly urge you to contribute to the publication and be part of its growing popularity. I thank my colleague Radhika Pakala for helping me with some editorial work for two of the articles included in this issue, and look forward to more help from more professionals. As always, happy reading and happiness always! With Every Best Wish,

Vishnu Pendyala San Jose, California, USA

January - March 2018 ^ Visleshana ^ Vol. 2 No. 2


Sales Funnel Analytics

Use Cases in Sales Funnel Analytics
Surya Putchala, Sreshta Putchala

ABSTRACT— Traditional sales funnel analyses and forecasts reported by Marketing and Sales teams are based on deterministic rules that are subjectively derived from experience and driven by biases. This results in pipelines containing skewed forecasts, which are often missed, and adds to the unpredictability of an Enterprise’s revenue projections. Quantitative methods eliminate such biases by applying predictive analytics to sales pipelines, addressing two specific problems: predicting the likelihood of deals being won in a given time period (opportunity scoring), and predicting the size of the deal, which determines revenue. This helps in taking decisive action on the information on pipeline, goals, sales performance and marketing.

INDEX TERMS — Sales Funnel, Lead Scoring, Opportunity Scoring, Machine Learning, Customer Segmentation, Customer Satisfaction, Sales Conversion, Big Data, Customer Relationship Management (CRM), Sales Forecasting

1. Introduction



Most companies struggle to create an accurate sales forecast. Poor pipeline visibility and inaccurate, intuition-based predictions from sales representatives lead to a culture where the end of a quarter is often a surprise and attaining sales quota is left up to chance. Even those deals that do come through are often significantly different from the deals originally forecast. The goals of Sales Funnel Analytics (SFA) are to optimize sales effort and marketing spend to:
§ Understand customer personas to run relevant marketing campaigns
§ Predict the various phases in the lifecycle of the customer journey
§ Predict customer lifetime value to devise an appropriate marketing strategy
§ Rank customers and predict the likelihood of different customer segments buying, responding or engaging
§ Recommend relevant products to customers
§ Convert customers and improve value creation
§ Retain and engage more customers
Acting on insights gained from understanding customers, products, marketing strategy and sales could:
§ Achieve forecasts with goals and quotas for your sales team
§ Analyze sales representative performance
§ Monitor progress at the sales rep level and understand who needs assistance
§ Understand the “real” value of the sales pipeline/deals
§ Benchmark sales reps’ performance and drive productivity
§ Enable comparative analysis of sales representatives, aligning the sales process according to their strengths (negotiation, lead generation, engagement, etc.)
§ Act on the integrated information on pipeline, goals, sales performance, marketing and calls


To achieve superior SFA effectiveness, deploying Big Data analytics that combines external open data, internal CRM systems, real-time tracking tools and Machine Learning algorithms can improve forecast accuracy.

2.1 Internal Sources
Many factors affect business success; identifying and capturing the most relevant data is critical to exploit the power of predictive analytics. The most accessible of all the sources of information are internal data, as indicated in the picture below:

Fig.1 Data Sources that are within an enterprise

Although customer data plays an extremely important role, details about the following are critical for analysing various KPIs related to Sales and Marketing activities:


§ Lead details (opportunity or lead, deal size, customer, rep assigned)
§ Marketing actions (campaigns, channels)
§ Representative details (basic information, previous leads, current leads)
§ Lead cycle details (time in each stage of the lead, effort)
§ E-mail/calendar activities (interactions with potential consumers)
§ Web/social media activity (page views, clicks, etc.)
§ Customer demographics (general preferences, geographic preferences, income, etc.)
§ Product/service features

2.2 External Sources
The quality of a lead may have several qualitative factors that need to be captured and quantified. Note that lead quality is different from lead progress through the funnel. Lead progression through the funnel is quantitative and can help in planning capacities and forecasting revenues or conversions.

Fig.2 Data Sources that are external to the enterprise

As the sophistication of analytics improves, so does the need for additional data, which must be sourced externally, as outlined in Fig. 2. To leverage such data, not only must new experiments and variations of the processes be carried out, but external data must be put in context as well.



SFA is essentially a Sales function. However, it has implications for marketing, product placement and customer satisfaction. The following are some of the use cases in this broad Sales and Marketing area.

3.1 Lead Scoring
A lead scoring model considers all closed leads or opportunities from a CRM system and extracts or derives features, whether demographics, activities or other attributes associated with the lead. The lead scoring mechanism is constructed from past lead behavior.

It addresses and derives the following insights:
§ Win prediction – predicts whether a lead will be won, given a set of features extracted from the data. This prediction also finds patterns to determine the probability that the lead converts within a pre-defined period.
§ Age prediction – how long it takes the lead to move to the next stage.
§ Lead momentum analysis – identify hot and warm leads.
§ Ranking the leads in order of priority.
§ Historical lead health – open leads, lost leads, average time of leads in each stage.
§ Identify inputs/actions required to close the deals.
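These scoring ideas can be made concrete with a toy sketch. The following is not the authors' model; the feature names and weights are hypothetical, and a production system would learn them from closed-lead history rather than hard-coding them:

```python
# Illustrative rule-based lead scoring: weighted sum of hypothetical
# engagement features, then rank leads hottest-first.
WEIGHTS = {"email_opens": 1.0, "page_views": 0.5, "demo_requested": 5.0}

def lead_score(lead):
    """Weighted sum of the lead's feature values (missing features count as 0)."""
    return sum(w * lead.get(f, 0) for f, w in WEIGHTS.items())

def rank_leads(leads):
    """Return leads sorted by score, highest first."""
    return sorted(leads, key=lead_score, reverse=True)

leads = [
    {"id": "A", "email_opens": 2, "page_views": 10, "demo_requested": 0},
    {"id": "B", "email_opens": 1, "page_views": 3, "demo_requested": 1},
]
ranked = rank_leads(leads)  # B (score 7.5) ranks ahead of A (score 7.0)
```

In practice the weights would be fitted, for example by logistic regression or random forests, on historical won/lost leads rather than set by hand.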

3.2 Opportunity Scoring
Opportunity scoring models make use of the funnel phases, which are time-dependent. Features are extracted at a specific stage of the funnel during its lifecycle, rather than from the outcome (won/lost) of the opportunity. Hence, at every stage, only the stage progression is used to score the opportunity, and the Sales team has visibility at the phase level as well as into the probability of that opportunity converting. The following questions are very important for Sales teams allocating their resources:
§ What is the probability that an opportunity converts?
§ What is the probability that an opportunity is won in a specified period/quarter?
§ Is the funnel healthy enough to meet the expected sales/revenue projections and the commits made by the Sales reps?
§ Which positive and negative factors are influencing the outcome of a deal?
Evaluating the opportunities currently in the pipeline helps in focusing on the effective ones. The goal is to:
§ Identify opportunities that are promising but not closed or committed.
§ Identify opportunities that are at risk but are committed by the Sales reps.
§ Identify unrealistic targets by assessing the quality of the pipeline.
§ Understand the “real” value of the sales pipeline/deals.
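The “real” value of a pipeline mentioned above can be illustrated by weighting each open deal by a stage-level win probability. A minimal sketch, with made-up stage names and probabilities:

```python
# Expected ("real") pipeline value: each deal's size weighted by a
# hypothetical per-stage win probability (illustrative numbers only).
STAGE_WIN_PROB = {"lead": 0.05, "qualified": 0.20, "proposal": 0.50, "negotiation": 0.80}

def expected_pipeline_value(deals):
    """Sum of deal size times the stage win probability over all open deals."""
    return sum(d["size"] * STAGE_WIN_PROB[d["stage"]] for d in deals)

pipeline = [
    {"size": 100_000, "stage": "qualified"},    # 20,000 expected
    {"size": 50_000, "stage": "negotiation"},   # 40,000 expected
]
value = expected_pipeline_value(pipeline)  # 60,000.0
```

A learned opportunity-scoring model would replace the fixed table with stage-conditional probabilities estimated from historical funnel data.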

3.3 Sales Rep Scoring
Understanding a Sales Representative’s behaviour deeply, and rating or scoring the representative, helps in the optimal allocation of leads.
§ Identify the strengths and weaknesses of a Sales Representative by assessing the performance record and understanding sales effectiveness.
§ Monitor progress at the Representative level and understand who needs additional inputs or assistance.
§ Perform comparative analysis of Sales reps, aligning the sales process according to their strengths (negotiation, lead generation, engagement, etc.)


Understanding a Sales rep’s behavior and propensities is as important as understanding the customer segments. The scoring will help in:
§ Benchmarking a Sales rep against peers
§ Assessing the rep’s propensity toward certain types of deals
§ Strength/susceptibility analysis
§ Understanding open opportunities for each rep
§ Average deal size for each rep
§ Stage-wise conversion rate for each rep

3.4 Channel-Partner Analyses
Understanding the effect of channel partners on the conversion rates of leads at each stage allows us to determine the positive and negative factors influencing the outcome. Channel effectiveness analysis goes beyond immediate impact and helps in promotional activities. Each channel’s conversion rate can be increased through comparative analysis with other channel partners, whether they are direct partners or third-party channels. Channel partners play a key role in bringing deep insights into the business, as they are in direct contact with the leads. Therefore, it is essential to know who the key partners are and to maintain ongoing feedback, as this allows you to make confident business decisions. Analysis of optimal campaign channels can be based on channel source, channel type, demographic influence and timing (seasonality). For each channel partner, at least three data points (previous, current and subsequent records) are needed to conduct analysis on channel effectiveness.

3.5 Pipeline Forecasting
The Sales funnel generally represents the progression of leads until they convert. Fig. 3 illustrates a marketing funnel and the phases it represents. A customer journey goes through the various phases represented in the diagram:

Some of the expected insights for this use case are:
§ Historical lead health – open leads, lost leads, average time of leads in each stage
§ Identify inputs/actions required to close the deals

3.6 Marketing and Lead Generation
The 4 Ps of Marketing strategy involve the following:
§ Product line (competing products in the market)
§ Pricing (comparative pricing in the market)
§ Promotion (competitor strategies)
§ Placement (marketplace dynamics)
Some steps to lay the foundations for marketing analysis are:
§ Building customer profiles using demographic, behavioral, transactional and customer-preference data.
§ Identifying customer segments: the number of groups and the characteristics of each group.
§ Targeting those customers through various channels, including social media.

3.7 Campaign Effectiveness (A/B Testing)
Campaigns are run through various channels (emails, web events, social media channels, etc.). Their effectiveness is measured in order to allocate marketing dollars.
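Measuring campaign effectiveness typically reduces to comparing conversion rates between variants. A self-contained sketch of a two-proportion z-test on hypothetical campaign counts (the figures are invented for illustration):

```python
import math

def two_proportion_z(conv_a, n_a, conv_b, n_b):
    """Two-sided two-proportion z-test for comparing campaign conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF via math.erf.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical campaigns: A converts 120/1000 emails, B converts 90/1000.
z, p = two_proportion_z(120, 1000, 90, 1000)
significant = p < 0.05  # here A's lift over B is statistically significant
```

The same test applies to any pair of channels or creatives, as long as assignments were randomized.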

3.8 Customer Satisfaction for Effective Sales
All customer touch points are rich with signals that can give an understanding of a customer’s behavior and thus help engage customers pro-actively and effectively. Failing to meet customers’ growing expectations has a negative impact, as customers respond with disloyalty and defection [1].
Reduce churn: model churn by considering shrinking customer base, declining product profitability, customer service requests, social media sentiment, competitor market share and wallet share [1].
Improve (brand) loyalty: provide customers with relevant and timely information to ensure they remain satisfied and engaged, and measure the effectiveness of referral programs.
Increase service levels: understand customer preferences in their lifecycle journey and customize service strategies.
Customer surveys: assess the current customer engagement and satisfaction levels. This is critical for NPS and reputation management.

3.9 Smart Lead Allocation

Fig.3 Typical Sales Funnel

The phases are: Awareness (when a product or service of a company is advertised through various channels); Interest (when potential customers get interested in the product or service that is advertised; this phase can be considered the “leads” of the business); Decision (when the customer is seriously considering buying the product, tantamount to the “opportunity”); and Action (when the lead is either converted or lost). A lead scoring model is constructed from past lead behavior.

Improve the odds of winning a lead by allocating the right resources at the right time, and allocate suitable leads to each rep for increased efficiency.
§ Match the sales representative to the lead
§ Identify the best sales rep for a lead


Product and Customer Matching

Product recommendation helps personalize interactions by presenting what is most relevant to


each customer [1]. The aim is to know which combination of channel, product and customer is effective, for example identifying the propensity of enrolments (which student, which university and which degree) to help build recommendations.
§ Recommendation – which product is a better fit
§ Propensity analysis – lead/customer preferences and actions
§ Measure combinations and choose the best possibilities
§ Determine the weak links between channels and products and strengthen them with various incentives


Sales funnels also depend on user preferences and brand traction; hence the traditional reliance on time series may perform sub-optimally in Sales Funnel Forecasting. Bottom-up forecasting, in contrast, is time-insensitive: the influencer variables are visible and available to predict either sales or demand, as in Fig. 4.

Sales Funnel Forecasting

Numerous factors affect the accuracy of sales forecasts. These factors are dynamic in nature, so sales forecast models will have to adapt dynamically as they change. Hence, a finely tuned engine of ensembled Machine Learning and ranking-and-scoring algorithms should be part of the Enterprise’s core capability. These factors include economic conditions such as market behavior and economic indicators (inflation, income).

Fig.4 Bottom-Up Forecast

Top-down forecasting is a temporal prediction in which the past behavior of the response variable is important. Traditional time series methods such as moving averages, ARIMA or ARIMAX can yield exceptional forecast accuracy. It is represented in Fig. 5.
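As a minimal illustration of such a temporal, top-down forecast (a toy stand-in, not ARIMA), simple exponential smoothing of a hypothetical revenue series gives a one-step-ahead prediction:

```python
def exp_smooth_forecast(series, alpha=0.5):
    """Simple exponential smoothing; returns the one-step-ahead forecast."""
    level = series[0]
    for y in series[1:]:
        # New level blends the latest observation with the running level.
        level = alpha * y + (1 - alpha) * level
    return level

# Hypothetical quarterly revenue figures.
revenue = [100, 110, 105, 120]
forecast = exp_smooth_forecast(revenue)  # 112.5
```

ARIMA or ARIMAX would additionally model trend, seasonality and exogenous regressors, but the smoothing idea of discounting older observations is the same.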

Consumer segments (product consumption trends and demographics) and forecast factors such as time horizon (short, medium or long term) and trends (seasonality, cyclicity) also affect forecasting accuracy.



Machine Learning algorithms should be chosen according to the nature of the data. However, there are some rules of thumb that can be applied to the use cases detailed above.

4.1 General SFA Use Cases
Some analytical scenarios and their probable analytical treatments are given below.
§ Overall sales forecast (by period, region, sales force): OLAP and visualization techniques
§ Win/revenue likelihood (opportunity scoring): random forests
§ Sales cycle length (duration to sales closure): Poisson regression
§ True value of a lead: regression methods
§ Deal/sale health check: dashboarding and visualization techniques
§ Sales-representative-level sales quota prediction or target: regression methods
§ Benchmark sales quota prediction: regression methods

4.2 Sales Funnel Forecasting Method
Sales forecasting can be treated either as a time-series problem or as a regression problem. Sales funnel forecasting needs to consider not only the trends, cyclicity and seasonality of the products and services being sold, but also substitute products, product versions and user preferences.

Fig.5 Top-Down Forecast

Integrating the bottom-up forecast, which can use an Artificial Neural Network (ANN) or regression, with the top-down forecast can yield a good short-term prediction.

Fig.6 Integrated Sales Forecasting for Sales Funnels

Aggregating forecasts at different levels of product, category or region can identify macro and micro factors of the business environment. Whether to use prediction or forecasting depends on the time horizon.

4.3 Sales Funnel Analysis
The fundamental questions about a sales lead, namely its likelihood of positive closure, the timeframe in which it could be closed, and the critical factors that affect deal success, can be addressed by the techniques outlined in Fig. 7.


§ Accurately predicting future revenue
§ Understanding lead conversion ratios
§ Objectively coaching your team and analyzing its strengths and weaknesses
§ Implementing strategies to improve your team and prioritizing the sales pipeline

SFA is a very broad area that covers domains such as marketing, sales and order fulfilment. It yields a wide variety of use cases and analytical techniques, and needs data from both internal and external sources.



Fig.7 Sales Funnel Prediction models




4.4 Some Key Dashboards in SFA


There are a few visualizations about the Sales Funnel that are important for the Sales team:
Opportunity scoring dashboards:
§ Lead age in pipeline, stage, stalled
§ Number of pushes
§ Average cycle length and inactive period
§ Benchmarks (in comparison to other leads: deal size, customer effort)
Sales Representative scoring dashboards:
§ Rep score and average deal size for each rep
§ Conversion rate (# leads converted / # leads allocated)
§ Converted revenue (converted revenue / total expected revenue)
§ Benchmarks (in comparison to other reps: deal size, cycle time, conversion rate, revenue earned)
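The two ratios in the Sales Representative dashboard above are simple to compute directly from CRM extracts; a sketch with hypothetical rep data (the names and figures are invented):

```python
# Illustrative computation of the two rep-scoring ratios above,
# using hypothetical CRM numbers.
reps = {
    "asha": {"converted": 8, "allocated": 40, "revenue": 200_000, "expected": 500_000},
    "ravi": {"converted": 5, "allocated": 20, "revenue": 150_000, "expected": 300_000},
}

def conversion_rate(r):
    """# leads converted / # leads allocated."""
    return r["converted"] / r["allocated"]

def converted_revenue_ratio(r):
    """Converted revenue / total expected revenue."""
    return r["revenue"] / r["expected"]

rates = {name: conversion_rate(r) for name, r in reps.items()}
```

Benchmarks then follow by comparing each rep's ratios against the team distribution.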



Sales Funnel analysis cannot function without the related areas of marketing and customer analytics; it also draws upon insights and information from customer analytics. SFA helps in taking decisive action on the information on pipeline, goals, sales performance and marketing. Generating sales insights with exceptional efficiency requires specific enterprise capabilities [1].

REFERENCES
[1] Cappius (2017), “Sales Analytics”, Cappius Technologies.
[2] A. Twigg (2016), “Applying Data Science to Sales Pipelines for Fun and Profit”, InsideSales.
[3] Z. Rotenberg and M. Baker (2013), “The Definitive Guide to Data-Driven Forecasting”, InsightSquared.
[4] Infer (2016), “Infer’s Guide to Predictive Lead Scoring”.
[5] Marketo (2010), “The Definitive Guide to Lead Scoring: A Marketo Workbook”, Marketo.
[6] R. Leist, “Traditional vs. Predictive Lead Scoring: What’s the Difference?”, HubSpot blog.

AUTHOR PROFILES
Surya Putchala is CEO of ZettaMine Analytics. He has provided thought-leading consulting solutions to Fortune 500 clients for over two decades. He is passionate about areas related to Data Science, Machine Learning, High Performance Cluster Computing and Algorithms. He has held senior leadership roles with large IT service providers. He graduated from IIT Kharagpur.

Sreshta Putchala interns with ZettaMine Analytics. She has applied various statistical methods using SQL and R and has cleansed Marketo and Salesforce data. She is currently pursuing her Bachelor’s degree in Computer Science at Chaitanya Bharathi Institute of Technology (Osmania University). Her interests are in the fields of Big Data, Machine Learning and Artificial Intelligence.



Surveillance of Disease Outbreaks

A Big Data Approach to Surveillance of Disease Outbreaks
Meghana Aruru* and Saumyadipta Pyne

Abstract— Disease surveillance data collection over geographical space and time has historically been a big data activity. Storage and near real-time analysis of public health data and metadata of large volume, velocity and variety are increasingly supported by emerging computational platforms. Guiding principles for the standardization of such data are being drafted and implemented by new international initiatives, thereby ushering in a new global culture of big data in health.

Index Terms—Big Data, Disease Outbreak, Epidemics and Pandemics, Health Analytics

1 INTRODUCTION



In a world increasingly connected by travel and trade, the risk of emerging epidemics is rising at the rate of 1 new disease per year [1]. In the last 5 years, the World Health Organization (WHO) has identified more than 1,100 epidemics including viral diseases such as polio, HIV, Marburg virus, Nipah virus, Ebola and avian flu [2]. In a globalized world, new and re-emerging pathogens can spread rapidly and infect large populations in many countries. Historic pandemics such as the "black death" and the "Spanish flu" are well studied. The 2009 H1N1 influenza pandemic was first detected in the USA but rapidly spread around the world. The U.S. Centers for Disease Control and Prevention (CDC) estimated in a modeling study that between 151,700 and 575,000 people died from the 2009 H1N1 viral infection worldwide [3]. More recently, a devastating Ebola outbreak in several West African countries led to more than 28,000 cases and more than 11,000 deaths between 2014 and 2016. To be prepared to tackle such sudden and severe stresses to their health (and other) systems, many, if not most, countries rely on disease surveillance mechanisms that are based on the systematic generation and analysis of disease outbreak data. Indeed, public health decision-making and action depend critically on the availability of such data. It is important to systematically monitor, report and respond to emerging crises in order to reduce the burden of epidemics. Naturally, such data has significant volume, velocity and variety.

2 DATA COLLECTION, ANALYSIS AND SHARING
Information is disseminated quickly through public health networks initially, and later through peer-reviewed journals and accompanying datasets. In unfolding emergencies, such timely available and readily usable information is critical for deciding the appropriate course(s) of action. Data sharing enables researchers to model critical paths and interventions towards preparedness for future events. To begin with, the acquisition of surveillance data requires the establishment of a public health network of various systems, including specialized clinical microbiology and pathology laboratories, emergency preparedness and response centers, and hospital reporting systems, among others. Coordination is essential between automated, semi-automated and manual data gathering and its subsequent dissemination for public health action. In developed countries like the United States, such sentinel surveillance systems are put in place to identify emerging threats and monitor existing threats through a wide network of public health laboratories, hospitals, epidemiological survey units, etc. A sentinel surveillance system uses high-quality data about diseases that cannot be monitored through passive surveillance systems.


• Dr. Meghana Aruru is Vice-President of Pramana Analytics. She is also an Adjunct faculty member at the MediCiti Institute of Medical Sciences. Her work focuses on health policy and communications. • Prof. Saumyadipta Pyne is the Scientific Director of the Public Health Dynamics Laboratory, University of Pittsburgh, USA. He holds Adjunct Professorship at the National Institute of Medical Statistics of Indian Council of Medical Research (ICMR). *Correspondence:

Data collected through sentinel surveillance systems can be used to identify trends, detect outbreaks and monitor disease burden in communities, thus providing timely information to policy makers and public health planners [4]. Naturally, data quality and standardization are important aspects that determine the utility of collected data. In an unfolding emergency, there is a critical time window during which data must be quickly gathered, analyzed and


disseminated for rapid response. In the absence of standardization, data gathered may be unreliable or unfit for immediate analysis. One of the first aims in developing surveillance systems is to generate data that is deemed ready to use. Many agencies, such as the U.S. National Notifiable Diseases Surveillance System (NNDSS), have detailed methods for data collection, analysis, standardization and sharing to achieve usable data [5].
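Once surveillance data are standardized and ready to use, even simple aberration-detection rules become possible. As an illustrative sketch (a classical threshold rule, not a method described in this article), a week is flagged when its case count exceeds the historical mean by more than two standard deviations:

```python
import statistics

def flag_outbreak(history, current, k=2.0):
    """Flag if the current count exceeds mean(history) + k * stdev(history)."""
    mean = statistics.mean(history)
    sd = statistics.stdev(history)
    return current > mean + k * sd

# Hypothetical weekly case counts from a sentinel site.
baseline = [12, 9, 11, 10, 8, 10, 11, 9]
alert = flag_outbreak(baseline, current=25)  # well above the baseline band
```

Operational systems use more robust variants (seasonal baselines, regression-based expected counts), but the principle of comparing incoming counts against a historical reference is the same.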

3 DATA STEWARDSHIP
Data stewardship involves good data management practices beyond proper collection, annotation and archival. Data quality maintained in the ‘long run’ leads to high-quality research and publications. Several guidelines exist on preserving and maintaining data. One such set of guidelines, established by the FAIR Data Initiative, an international consortium, is based on the following four foundational principles about a given dataset: 1. Findability, 2. Accessibility, 3. Interoperability, and 4. Reusability. Together, the FAIR Data Principles serve to optimize the outcomes associated with such data use [6]. (See box)
Data availability alone is simply not enough. Available data should be properly linked with well-established protocols on how to use them. Patterns may emerge from functionally linked datasets, calling for subsequent steps to rationalize them and conduct confirmation studies. It is therefore critical that relevant and useful metadata be systematically included and saved with all datasets, so that researchers can track provenance and justify the evidence from the uncovered patterns. The FAIR guidelines promote data sharing and accessibility for researchers and scientists around the world in the interest of furthering science. Data sharing for public health response is an emerging cultural phenomenon, and many projects aim to make standardized data available for use by public health scientists. Many such data repositories have adopted FAIR standards, e.g., Mendeley Data, Dataverse and Figshare. In public health, one such effort is Project Tycho® at the University of Pittsburgh in the USA, which “aims to advance the availability and use of public health data for science and policy” [7]. Researchers have digitized all available city and state notifiable disease data from as early as 1888 until as recently as 2011 (in the first version of the database), obtained mostly from hard copy sources.
Information corresponding to nearly 88 million cases was stored in a database that is open to interested parties without any

restriction through its online archive. This database is arguably among the earliest examples of systematic public health data integration, where millions of cases were digitized from hard copies and integrated with existing data in a standardized, usable format for researchers and public health scientists to access free of cost and add to the globally growing body of knowledge on outbreaks and epidemics. To further this contribution, researchers at Project Tycho® have collaborated with different countries in Southeast Asia to gather data on dengue surveillance and detect disease patterns at regional levels. This resource integrates data from the WHO DengueNet provided by WHO surveillance networks for its member countries.

Findability
• Data are uniquely and persistently identifiable
• Data are re-findable at any point in time, i.e. have rich metadata
• Metadata are actionable and allow distinction from other data
• Metadata are registered or indexed and searchable

Interoperability
• (Meta)data use vocabularies that follow FAIR principles
• (Meta)data include references to other (meta)data

Accessibility
• Data are accessible through a well-defined protocol
• The protocol is free, open and universally implementable
• Data are accessible upon appropriate authorization
• Metadata are accessible even when the data are unavailable

Reusability
• (Meta)data are well described and can be linked easily with other sources
• (Meta)data meet community standards
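How these principles translate into practice can be sketched with a toy metadata record; the field names below are illustrative and do not follow any standard FAIR schema:

```python
import json

# Hypothetical FAIR-style metadata record for a surveillance dataset.
record = {
    "identifier": "doi:10.0000/example-dengue-2018",  # persistent, unique (Findable)
    "title": "Weekly dengue case counts, sentinel sites",
    "access_protocol": "https",                        # open, well defined (Accessible)
    "vocabulary": "ICD-10",                            # shared terminology (Interoperable)
    "license": "CC-BY-4.0",                            # explicit reuse terms (Reusable)
    "references": ["doi:10.0000/example-source"],      # links to other (meta)data
}

def is_findable(meta):
    """Crude check: a record needs at least a persistent identifier and a title."""
    return bool(meta.get("identifier")) and bool(meta.get("title"))

serialized = json.dumps(record)  # metadata can be indexed even if data access is restricted
```

The point of the sketch is that each FAIR principle corresponds to a concrete, machine-checkable property of the metadata, which is what allows repositories to index and link datasets automatically.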

4 CONCLUSION
A Big Data approach to epidemiology would benefit from data stewardship principles to ensure transparency, reproducibility, and reusability of high-volume and high-velocity data. Multiple stakeholders can benefit from such data stewardship: researchers who are willing to share their data and software, organizations, funding agencies, and, indeed, a rich data science community that is interested in mining both new and existing data. Applying standardization techniques such as the FAIR guidelines aids the integration of massive data repositories in a systematic and automated manner, saving time and aiding pattern discovery. Big data surveillance repositories like Mendeley Data, Dataverse, Figshare and Project Tycho® illustrate the value of data collection, standardization and sharing to drive

January - March 2018 ^ Visleshana ^ Vol. 2 No. 2


Surveillance of Disease Outbreaks

research progress on major public health challenges. The evidence gathered can lead to confirmatory studies as well as to the framing and evaluation of policies for appropriate and timely public health responses. Importantly, it seems more certain than ever that a new and deeper Big Data "culture", one that goes well beyond the traditional tasks of data collection and analysis to include data curation, harmonization, standardization, annotation, and free sharing, is here to stay.






REFERENCES
[1] WHO | Disease outbreaks. WHO. Published 2017.
[2] Modjarrad K, Moorthy VS, Millett P, Gsell P-S, Roth C, Kieny M-P. Developing Global Norms for Sharing Data and Results during Public Health Emergencies. PLOS Med. 2016;13(1):e1001935. doi:10.1371/journal.pmed.1001935.
[3] First Global Estimates of 2009 H1N1 Pandemic Mortality Released by CDC-Led Collaboration | Spotlights (Flu) | CDC.
[4] WHO | Sentinel Surveillance. WHO. 2014. Accessed January 3, 2018.
[5] NNDSS | Centers for Disease Control and Prevention. Accessed January 3, 2018.
[6] Wilkinson MD, Dumontier M, Aalbersberg IJ, et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data. March 2016;3.
[7] van Panhuis WG, Grefenstette J, Jung SY, et al. Contagious Diseases in the United States from 1888 to the Present. N Engl J Med. 2013;369(22):2152-2158. doi:10.1056/NEJMms1215400.

How would you forecast an epidemic? How would you create a Big Data platform for disease surveillance like Project Tycho? What are the disease-modeling scenarios in India? The upcoming International Symposium on Health Analytics and Disease Modeling (HADM 2018) presents a unique opportunity to learn about and discuss such key areas in depth. HADM 2018 will be held on March 8-9, 2018, at the National Academy of Medical Sciences auditorium in New Delhi. It will be jointly organized by the Public Health Dynamics Laboratory of the University of Pittsburgh, USA, and the National Institute of Medical Statistics of the Indian Council of Medical Research (ICMR-NIMS), in partnership with SHARE India and Health Analytics Network. Distinguished experts from the USA, UK, France, Vietnam and India will present and discuss their research in modeling infectious and non-communicable diseases, and will present case studies, different analytical approaches, and big data resources. Registration and further information at



Humanitarian Causes

Big Data Analytics for Humanitarian Causes
Dr. N.Ch. Bhatra Charyulu and Dr. G. Jayasree
Abstract - Big Data Analytics is crucial for advancing humanitarian causes like medical and health care: it is needed to transform relevant information locked in text into structured data that can be used by computer processes aimed at improving patient care and advancing medicine. This is a report on a short-term course on "Big Data Analytics for Humanitarian Causes," sponsored by the Ministry of Human Resource Development (MHRD), Government of India, under the Global Initiative of Academic Networks (GIAN) scheme, and organized by the Department of CSE, UCE, Osmania University, Hyderabad, from 20th to 27th Nov 2017. The paper reports the main outcomes of the course, including an overview of state-of-the-art strategies for advancing the field and obstacles that need to be addressed, resulting in recommendations for a research agenda intended to advance the field.
Index Terms—Big Data Analytics, Text Mining, Humanitarian Causes, issues, challenges and future directions

—————————— u ——————————
Highlights
This report summarizes a sequence of sixteen one-and-a-half-hour lecture sessions and four tutorial sessions, delivered by Vishnu S. Pendyala, who works for Cisco, USA, as a Technical Leader. It focuses on the current state of the art in Humanitarian Causes, research strategies for advancing Big Data Analytics, and discussions of obstacles and challenges in this field.

1. INTRODUCTION
Big data is not just big; a comprehensive definition focuses mainly on the attributes of volume, variety, velocity and veracity, along with several more V's (as many as 42 by some counts). These bust the myth that big data is mainly about data volume: terabytes, sometimes petabytes. Big data isn't new, but the effective analytical leveraging of big data is recent. It taps sources of unstructured data (text and human language) and semi-structured data (from audio, video, and other devices). There is an exponential increase in web data sources, including logs, click streams, social media, geospatial data in logistics, text data from call center applications, etc. Big data analytics is really about two things: big data and analytics. The usage of big data is not only a technical challenge; it is also a business opportunity, providing trends for business intelligence.

2. HUMANITARIAN CAUSES
The applications of big data to various humanitarian causes were discussed, relating to health, medicine, agriculture, technology, etc. Scenarios of Twitter attacks, falsity and fraud on the Web, and their remedies were also part of these sessions. The role of social media in big data was discussed in detail. The humanitarian causes discussed fall under: (i) Crisis management: analyzing social media
————————————————

Prof. N.Ch. Bhatra Charyulu is with the Department of Statistics, University College of Science, Osmania University, Hyderabad, Telangana, 500 007. E-mail: Dr. G. Jayasree is with the Department of Statistics, University College of Science, Osmania University, Hyderabad, Telangana, 500 007. E-mail:

posts for recognizing and predicting trends; (ii) Health care: automated medical diagnosis of diseases based on decoding human DNA; (iii) Humanitarian verticals using BDA and Internet techniques; (iv) Data computing enabled for participation in crowdsourcing projects from underserved countries. The lectures raised many important questions: can the web contain only the truth and nothing but the truth? Can we impose syntax for truthful information? How do we detect lies from truth? Are there features associated with truth or lies on the web? Can we identify those features, if any? Can we model the expertise programmatically?

3. TECHNIQUES USED FOR ANALYTICS
The sequence of lectures covered basic concepts of text mining: the need for the study of text mining, categories of text mining, search engines, indexing, information extraction, information retrieval, web mining, document classification, document clustering, the text mining process, stemming, extracting phrases, case folding, lemmatization, data sources, various data corpora, decision optimization, analytical output, levels of text mining, and natural language processing. Various mathematical and statistical techniques were applied to text mining data related to medical and health data and social media data. a) Usage of the Vector Space Model for extracting text information was illustrated on text data. b) The usage of Zipf's and Heap's laws for text data and their properties was illustrated. In addition, life distributions like the Exponential and Pareto distributions were discussed along with their applications. c) Computation of various similarity and dissimilarity




measures, like the Jaccard measure (a binary relation), the correlation measure, and the Euclidean, Manhattan, and Minkowski distances, were discussed. d) Computation of term frequency (tf) and inverse document frequency (idf) was illustrated. e) Computation of Recall and Precision measures, the trade-off between them, the E and F measures, Gain, Cumulative Gain, Discounted Cumulative Gain, and Normalized Discounted Cumulative Gain was covered; the concepts of TREC and teleporting were outlined. f) Computation of the PageRank algorithm and the confusion matrix was illustrated for web mining data. g) Usage of the Boolean model for text data was discussed. h) Supervised and unsupervised machine learning techniques were detailed. i) The nearest neighbour classifier, an often-used technique in the text mining domain, performs queries on text: it estimates the distance between two strings and classifies the text on the basis of distance; it was illustrated on the data. j) Classification techniques like the likelihood ratio test and linear discriminant classification were explained and illustrated. k) The Support Vector Machine is one of the most effective and accurate classification algorithms; in this approach, hyperplanes and dimension-estimation-based techniques are used to discover or classify the data. l) The K-means technique, a classical approach to text categorization, uses a distance function to cluster data and is an effective, resource-preserving method of text mining; it was illustrated with a suitable example. m) The Bayesian classifier is a probability-based classification technique that uses word probabilities to classify text data; based on previous text and patterns, the data is evaluated and the class probability is measured. n) The determination of sample size and Sequential Probability Ratio Test procedures were illustrated. o) Change detection techniques like Cumulative Sums (CUSUM) and Receiver Operating Characteristic (ROC) curves, and anomaly detection techniques like KNN, were compared. p) Fitting of multiple linear regression and of logistic regression using the odds ratio was discussed. q) Argumentation is understanding the content of serial arguments, their linguistic structure, the relationship between preceding and following arguments, recognizing the underlying conceptual beliefs, and understanding the specific topic with comprehensive coherence. Different state-of-the-art techniques in machine learning, context-free grammars, predicate logic, and first-order logic were discussed.

Figure 1 Vishnu S. Pendyala delivering a GIAN lecture

4. FUTURE CHALLENGES AND REMEDIES
Various future research challenges in Big Data Analytics were posed: (i) Privacy and security: hacking can be fatal. (ii) Cost: the technology is still being commercialized. (iii) Datasets: a government mandate may be needed. (iv) Human expertise: intervention is needed. (v) Quality and availability of datasets. (vi) Multilingual support: 22 official languages and 1,652 different mother tongues in India alone. (vii) The high cost of Type-I and Type-II errors.

Figure 2 Valedictory function

Future directions suggested to the participants for pursuing research in conjunction with text information include: (i) processing images, videos, and audio in conjunction with text; (ii) real-time, stateful big data processing; (iii) aggregating health data to detect or predict epidemics and health trends; (iv) adding QA interfaces using NLP, IVR, and machine translation; (v) extending the ideas to monitor and proactively remedy abnormalities; (vi) change detection techniques on news media; (vii) machine learning algorithms in ensemble, for use in conjunction with change detection; (viii) tempered correlation analysis on live data; (ix) knowledge representation techniques to help detect and prevent anomalies and inconsistencies in the represented statements; (x) ways to prevent the posting of false data in the first place, since the approaches so far work only after the fact; (xi) other approaches, since content syndication and viral postings make voting ineffective on the web; (xii) investigating more machine learning techniques, deep learning, and KL divergence.

The lectures also suggested various remedies for fraud and false data created on the web: (i) governments maintaining websites that track fake news; (ii) a number of independent organizations and websites constantly involved in debunking false information; (iii) suspension of fake accounts, awareness campaigns, software enhancements to flag fraud in news feeds, and allowing users to provide credibility ratings, some of the experiments internet companies are trying; (iv) governments passing laws against online crime from time to time; (v) the FBI Internet Crime Complaint Center; (vi) deploying biometric identification, two-factor authentication, strong passwords, etc.; (vii) threat intelligence platforms; (viii) recognizing that technology is far better at prevention than legal remedies; (ix) prioritizing trustworthiness over popularity in showing search results. Two open questions closed the discussion: how can we build automated expertise in separating truth from lies that improves with experience, and what level of classification accuracy is acceptable?
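The core text-mining computations from Section 3, term frequency, inverse document frequency, and cosine similarity over the vector space model, can be sketched in a few lines of plain Python. This is a toy illustration with made-up documents, not material from the course:

```python
import math
from collections import Counter

docs = [
    "big data analytics for health care",
    "text mining of big data",
    "health care and medicine",
]

def tf_idf(term, doc, corpus):
    """Term frequency in doc, weighted by inverse document frequency."""
    words = doc.split()
    tf = words.count(term) / len(words)
    df = sum(1 for d in corpus if term in d.split())
    idf = math.log(len(corpus) / df) if df else 0.0
    return tf * idf

def cosine(a, b):
    """Cosine similarity between two documents as bags of words."""
    va, vb = Counter(a.split()), Counter(b.split())
    dot = sum(va[w] * vb[w] for w in va)
    na = math.sqrt(sum(v * v for v in va.values()))
    nb = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (na * nb)

# "health" appears in two of the three documents, so idf = log(3/2)
print(round(tf_idf("health", docs[0], docs), 3))  # 0.068
# Documents 0 and 2 share the words "health" and "care"
print(round(cosine(docs[0], docs[2]), 3))         # 0.408
```

A rare term in a short document gets a high tf-idf weight, which is exactly why the measure is useful for indexing and retrieval.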

ACKNOWLEDGMENT
The authors wish to thank Vishnu S. Pendyala for delivering such a wonderful sequence of lectures. We are also thankful to the MHRD for sponsoring the course under the GIAN scheme, and to the Department of Computer Science and Engineering, Osmania University, for organizing such a good short-term course.
Prof. N.Ch. Bhatra Charyulu works at Osmania University. He was awarded his M.Phil. in 1991 and his Ph.D. in 1994, both in Statistics. He has supervised 3 M.Phil. and 5 Ph.D. students at Osmania University. Around 50 research articles published in various reputed national and international journals are credited to him, and he has presented more than 30 papers at national and international conferences. He completed one Major Research Project sponsored by the UGC and is presently handling a project under UGC-UPEPURSE. His major areas of research are Design and Analysis of Experiments and Multivariate Data Analysis.
Dr. G. Jayasree also works at Osmania University. She completed her Ph.D. in 2003. She has published 10 research articles in various reputed national and international journals. Her major areas of research are Distribution Theory and Multivariate Analysis.

CSI SIGBDA MEETS INDUSTRY LEADERS
The first meeting of the CSI SIGBDA Executive Council with Big Data Analytics industry leaders took place on January 20, 2018, with the intent of getting the market pulse to further chart the course of the SIG for the next three to four years. The four-hour interactive event turned out to be a good networking opportunity and also served as a membership drive. Membership enrollment will begin in February 2018. The meeting was attended by key people from Deltamarch, Tech M, Cappius Technologies, Kony, Google, OLA, BestinTown Analytics, ZettaMine Technologies, Microsoft, Pyramid Consulting, and ITelligence. Prof. CR Rao of the University of Hyderabad also graced the occasion, representing academia. Sri Chandra Dasaka, secretary, spoke about the SIG and set the tone for the discussion by highlighting the recent MoU with NIT Warangal. Everyone present was highly appreciative of the SIG activities and agreed to meet again in two to three weeks' time. The event venue, food and drinks were sponsored by ITelligence. Sri Chandra informed the gathering that he had initiated work to integrate a payment gateway into the SIG's website so that membership can be obtained online. There are plans to rope in some of the dignitaries who were present into the Executive Council against the vacancies, and to form an Advisory Committee to review any consultancy or project requirements that the SIG may undertake in future. The committee will also help with selecting the most suitable company to take up such project work.



Object Knowledge Model

Textual Analysis of Big Data with Object Knowledge Model: A Knowledge Representation Framework
Dr. S Padmaja, Mr. Sasidhar Bandu and Prof. S Sameen Fatima
Abstract — This study aims to extract greater semantics from English-language sentences in big data by proposing a frame-based knowledge representation framework called the Object Knowledge Model (OKM). The model has only three prime non-terminals, i.e., the set of descriptors d, noun n, and verb v, and converts a sentence into an attributed sentence, enabling semantic/knowledge/context-based search on the contents of big data. The approach can be applied in various knowledge representation domains like big data analytics, the semantic web, cross-lingual and multilingual search, and information extraction of named entities.
Index Terms — Textual analysis of big data, Natural Language Processing, greater semantics, context-based search

—————————— u ——————————

1 INTRODUCTION
A novel approach to extract greater semantics from textual big data is proposed in this paper. It can be performed in any language with OKM, a frame-based knowledge representation framework, enabling textual analysis for semantic/content-based search in big data.
A. The limitations of the existing semantics in big data and motivation for OKM
Currently, extracting semantics from textual big data using web search engines depends on searching RDF/RDFS/OWL documents, which contain metadata about web pages but no semantic information on the content of the relevant pages. In reality, the user of the semantic web is more interested in a semantic search on the content of semantic web documents than in a search on only the metadata of those documents (as is done today, using the RDF/RDFS metadata descriptions of these documents). This is because a web user has a particular meaningful, semantic query in mind while searching: the query expects answers that reveal more of the content, given a few keywords representing the entities or relationships comprising it. RDF, RDFS, and OWL can describe the identification, description and classification of objects that are Web resources only, i.e., web pages, authors, publishers, etc. All possible metadata about web pages can be captured in RDF, RDFS, or OWL, but these do not address the internal subject content of the individual pages. It is high time for researchers to work on Sir Tim Berners-Lee's dream of making the semantic web the next version of the WWW and to put it into practice [1]. This study aims to enhance semantic/content/knowledge-based search of not only web pages, but also research papers, equity reports

etc. [4], [5]. This work was implemented using the Python Natural Language Processing libraries NLTK and Goslate. The heart of the OKM is its grammar, proposed in this paper, which will parse any given English sentence. It is written such that, given any sentence, it will generate the parse tree with non-leaf elements as the set of descriptors (d), nouns (n) and verbs (v), and leaf elements as attributed words of the sentence, e.g., Ram (111).
B. Related Work
Research in this area has been conducted by many researchers, but the most prominent work was done at Stanford, which developed POS tagging [2] and a grammar to parse a sentence and generate dependencies in English, later enhanced to cover other languages, e.g., Arabic, Chinese, and Spanish [3].
C. Knowledge Search using OKM
OKM enables frame-based knowledge representation. The knowledge content of a web page in any given Indian language can be converted to the OKM-based knowledge representation. For each sentence, an equivalent frame of knowledge is created; for the whole page, a complete knowledge base is created. Once the knowledge base is created using this approach, it can be searched using keywords denoting the objects or their relationships. A given keyword is used as input for searching the knowledge-base equivalent of a web page. If the input keyword indicates an object, then all the objects related to it can be located very easily in the frames of the knowledge base. Each knowledge frame can be searched with the given input literal (for an object) for string (i.e., keyword) matching. If a match is found, then




all the relationships of that string or keyword (object) with other objects, and the nature of those relationships, can be extracted directly and displayed as: input object -> relationship -> other object(s). Multiple matches can be shown in multiple displays.
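A minimal sketch of this frame search follows, assuming a toy in-memory knowledge base of (object, relationship, object) frames. The frame contents and function names are illustrative, not the authors' implementation:

```python
# Toy knowledge base: one frame per extracted relationship,
# stored as (object, relationship, other_object) triples.
FRAMES = [
    ("Ram", "went to", "school"),
    ("Ram", "went by", "bus"),
    ("letter", "is for", "brother"),
]

def search_frames(keyword, frames):
    """Return every frame in which the keyword matches an object,
    displayed as: input object -> relationship -> other object(s)."""
    hits = []
    for subj, rel, obj in frames:
        if keyword in (subj, obj):
            hits.append(f"{subj} -> {rel} -> {obj}")
    return hits

for line in search_frames("Ram", FRAMES):
    print(line)
# Ram -> went to -> school
# Ram -> went by -> bus
```

A full implementation would populate the frames from parsed, attributed sentences rather than a hand-written list.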

2. IMPLEMENTATION
A. Simple English sentence
First, let us see the structure of a simple sentence in English grammar: a set of descriptors (adjective, adverb, articles) + noun phrase + verb phrase. Ex: "Ram went to school by bus." "Brother, a letter for you."
B. Word Attributes
The attributes assigned are divided into two parts:
• Noun Attributes
• Verb Attributes

Noun Attributes: Given a sentence, all the nouns and pronouns in the sentence are attributed with features such as gender g (1: male, 2: female, 3: neutral), number n (1: singular, 2: dual, 3: plural) and case c (1-8), ordered as per the Sanskrit grammar cases. The cases, in order, are:
1) Subjective
2) Objective
3) Instrumental
4) Dative (indirect object)
5) Ablative
6) Genitive (possessive)
7) Locative
8) Vocative

Attributed noun word will look like: Noun word (g n c) ex: Ram (1 1 1).

Verb Attributes: Given a sentence, all the verbs in the sentence are attributed with features such as person p (1: first person, 2: second person, 3: third person), number n (1: singular, 2: dual, 3: plural) and tense t (-1: past, 0: present, 1: future). An attributed verb word will look like: verb word (p n t), e.g., went (3 1 -1). Now, given a sentence, say "Ram went to school by bus," when parsed through OKM it is converted into (((Ram 111) (went 31-1)) ((school 312)) ((bus 313))). Prepositions are neglected in the OKM framework. The parentheses assigned to the sentence segment it using the OKM grammar.

C. OKM Grammar
The OKM grammar is given below:
1) k ::= (d o)*
2) d ::= (D)*
3) D ::= set of descriptors (adjectives, adverbs)
4) o ::= on* ov*
5) on ::= nr na
6) na ::= (g n c)*
7) g ::= 1..3 ;; gender
8) n ::= 1..3 ;; number (singular, dual, plural)
9) c ::= 1..8 ;; case
10) nr ::= set of noun roots
11) ov ::= vr va
12) vr ::= set of verb roots/procedure/process/web service (with on_i as input and on_i+1 as output)
13) va ::= (p n t)*
14) p ::= 1..3 ;; person (first, second, third)
15) n ::= 1..3 ;; number (singular, dual, plural)
16) t ::= -1,0,1 ;; tense (past, present, future) and mood
An asterisk (*) denotes zero or more repetitions of the preceding element.

3. RESULTS
The parsed sentence was extracted with gender g and number n attributes for noun words, and number n and tense t attributes for verb words. For case c (a noun attribute) and person p (a verb attribute), the dependencies (noun and verb) were considered via sentence segmentation using the OKM grammar and Stanford universal dependencies. The sentence was parsed and the parse tree generated using the OKM framework. In the tree, the non-leaf elements are the descriptors d, nouns n, and verbs v, and the leaf elements are the attributed words of the sentence; these are compared with the parse trees generated by the Stanford grammar. The number 999 is the default value given to D in all the figures. Further, this work can be extended to any language. The output generated is shown in the figures below:

[Figures 2-7: parse trees generated by the OKM framework and by the Stanford grammar for the example sentences.]
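To make the attribution scheme concrete, here is a toy sketch (not the authors' NLTK-based implementation) that converts the paper's running example into OKM attributed form. A tiny hand-coded lexicon stands in for POS tagging and dependency parsing, and the attribute tuples are taken from the worked example above:

```python
# Toy lexicon: word -> (category, attributes)
# nouns carry (gender, number, case); verbs carry (person, number, tense)
LEXICON = {
    "Ram": ("n", (1, 1, 1)),      # male, singular, subjective
    "went": ("v", (3, 1, -1)),    # third person, singular, past
    "school": ("n", (3, 1, 2)),   # neutral, singular, objective
    "bus": ("n", (3, 1, 3)),      # neutral, singular, instrumental
}
SKIP = {"to", "by"}  # prepositions are neglected in the OKM framework

def okm_attribute(sentence):
    """Return the sentence as a list of (word, attributes) pairs."""
    out = []
    for w in sentence.split():
        if w in SKIP or w not in LEXICON:
            continue
        _, attrs = LEXICON[w]
        out.append((w, attrs))
    return out

print(okm_attribute("Ram went to school by bus"))
# [('Ram', (1, 1, 1)), ('went', (3, 1, -1)), ('school', (3, 1, 2)), ('bus', (3, 1, 3))]
```

The real system derives these attributes automatically from POS tags and universal dependencies rather than from a fixed lexicon.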

4. CONCLUSION
Given any sentence, the OKM grammar is able to parse it and segment it into features, but assigning the attributes to the words depends on the accuracy of the Stanford POS tagging and universal dependencies. Certain conclusions were drawn from the results:

1) The OKM is easy to understand. As can be seen in Fig 3(b), some non-terminals used by Stanford do not appear in any of the OKM figures (the example sentences above), and there are many non-terminals to remember in the Stanford grammar compared to the OKM grammar. This is because OKM has only three prime non-terminals, i.e., the set of descriptors d, noun n, and verb v, whereas Stanford has many non-terminals that are hard to remember. There are cases where the sentence segmentation is the same in both OKM and Stanford, as seen in the examples in Fig 4 and Fig 6.

2) In Stanford, the sentences "here my pen is" and "Jai Mata Di" are segmented as (here) (my pen is) and (Jai Mata Di) respectively, whereas in OKM each and every noun and verb is attributed with semantic knowledge that is not present in Stanford. OKM is more powerful than Stanford in sentence segmenting, as can be seen in some of the cases, for example in Fig 2, Fig 5 and Fig 7.

For the sentence in Fig 2, the sentence is segmented as:
Stanford: (Brother, a letter) (for you)
OKM: (Brother) (, a letter for you)
For the sentence "She is Rohits friend" in Fig 5:
Stanford: (She) (is Rohits friend)
OKM: (She is) (Rohits friend)
For the sentence "The leaf is falling from the tree" in Fig 7:
Stanford: (The leaf) (is falling from the tree)
OKM: (The) (leaf is falling) (from the tree)
In the above three examples one can clearly see the difference in sentence segmenting, where OKM is more intelligent compared to the Stanford grammar.

Possible applications:
• OKM can be applied in the Semantic Web as a tool for better knowledge representation and better semantic search of the Web.
• It can be applied in search engine technology for cross-lingual and multilingual search.
• It can be applied in text analysis and information extraction for machine learning classification of the named entities in a given text.
• It can be applied in language-based learning, by deploying OKM as a knowledge representation methodology with learning as input and modification/extension of the knowledge represented.

REFERENCES
[1] T. Berners-Lee, J. Hendler, and O. Lassila. The Semantic Web. Scientific American, 2001.
[2] Marie-Catherine De Marneffe and Christopher D. Manning. Stanford typed dependencies manual. Technical report, Stanford University, 2008.
[3] David D. Lewis. Feature selection and feature extraction for text categorization. In Proceedings of the Workshop on Speech and Natural Language, pages 212-217. Association for Computational Linguistics, 1992.
[4] Kirti D. Pakhale and S.S. Pawar. Focused retrieval of e-books using text learning and semantic search. International Journal of Innovative Research and Development, 3(7), 2014.
[5] Jane Zhang. Ontology and the semantic web. 2007.

Dr. Padmaja S received her PhD in Computer Science from Osmania University. She is an Associate Professor at KMIT, Hyderabad. She regularly contributes to scholarly journals and conferences and is also a reviewer. She is a resource person for various courses offered at research and academic institutions of repute. Natural Language Processing, Machine Learning, and Big Data Analytics are a few of her areas of research interest. She can be reached at for research collaboration.

Mr. Sasidhar B is an EFL Lecturer and heads the Professional Development Unit at Prince Sattam Bin Abdulaziz University, Saudi Arabia. He is a resource person for various institutions and corporate houses in India on ELT, Teacher Training, CALL and Educational Technology. He can be reached at

Prof. Syeda Sameen Fatima has over 33 years of experience in teaching, research and administration in India, the USA and the UAE. She took over as Principal in July 2016, and holds the distinction of being the first lady Principal in the history of the College of Engineering, Osmania University. Currently she is a Professor at the Department of Computer Science and also the Director, Centre for Women's Studies at Osmania University. She has published several papers in national and international journals and conferences. Her areas of interest include Machine Learning, Text Mining and Information Retrieval Systems. She is a life member of the Computer Society of India. She received the "Best Teacher Award" from the Government of Telangana, India, in the year 2017. She can be reached at .



Research in India

Big Data Analytics Research in India – A Perspective
Vishnu S. Pendyala
Abstract—Big Data Analytics is one of the main technological drivers of the economy today. There is ample scope for research in Big Data Analytics and for applying that research in day-to-day life. The author has reviewed, for multiple conferences and journals, several research papers on topics related to Big Data Analytics from scholars in India and other Asian countries. Based on that experience, the author felt the need to use the opportunity that the Ministry of Human Resources Development gave him, in the form of a one-week Global Initiative of Academic Networks (GIAN) course, to orient scholars in the country toward higher standards of research. The following is a perspective on the need to spruce up research in India and on how courses such as those organized under the GIAN initiative can contribute to this noble cause.
Index Terms—Research Standards, Big Data Analytics, India, GIAN

—————————— u ——————————



1 INTRODUCTION
The world is advancing because of research, particularly in the hi-tech areas. India has talent that is matchless. From times immemorial, Indians have proven their aptitude for in-depth understanding and discovery of the subtlest aspects of science, math and engineering, which are the basis for research in hi-tech. In spite of this, the western world has always taken the lead in research in modern times, mainly because of the exposure, funding, and environment that the developed world provides. The Government of India's Global Initiative of Academic Networks (GIAN) program [1] is a step toward bridging this gap; this article examines how. According to the GIAN portal, "Govt. of India approved a new program titled Global Initiative of Academic Networks (GIAN) in Higher Education aimed at tapping the talent pool of scientists and entrepreneurs, internationally to encourage their engagement with the institutes of Higher Education in India so as to augment the country's existing academic resources, accelerate the pace of quality reform, and elevate India's scientific and technological capacity to global excellence." There have been detailed studies, such as [2], on where India's knowledge economy is headed; this article, however, focuses only on initiatives such as the GIAN program. Several hundred courses were approved and organized all over India in the past year through the GIAN program. As part of the program, the author taught a 7-day course on "Big Data Analytics for Humanitarian Causes" at Osmania University, Hyderabad, from November 20-27, 2017. It was attended by about 50 research scholars, mostly professors of all grades, from New Delhi
————————————————

in the North to Chennai in the South. The intensive course covered several topics and research problems in the Big Data Analytics area. This article is a perspective on the role that programs such as GIAN can play in bringing India on par with the United States when it comes to research in hi-tech.

2 THE WESTERN SCENARIO
Within a month or two of the author's landing in the USA, a bookstore near his home hosted a free-to-attend talk by none other than the legendary Turing Award winner and Professor Emeritus Donald Knuth, of "The Art of Computer Programming" fame. He just walked in casually and interacted with the audience, answering each and every question patiently. Over the more than two decades the author lived in the USA, he got many such opportunities to interact with legends and be inspired. He did not make a lot out of those great opportunities because of his own personal commitments; but if someone did, he could well become another Raj Reddy, Satya Nadella, or Sundar Pichai. These legends, or soon-to-be legends, are not too different from the many engineers and research scholars in India in terms of upbringing. They just made excellent use of their opportunities, with utmost focus and dedication. Success begets success. Due to its head start, the West continues to lead in world-changing innovations and, unfortunately, countries like India are unable to make the impact they are highly capable of, for lack of exposure. In spite of the infrastructural and environmental drawbacks, India's contributions in some areas are exceptional, as acknowledged by foreign authors such as in [3].

• Vishnu S. Pendyala is the editor of Visleshana. He can be contacted on E-mail at pendyala(at) and on Twitter at @vishnupendyala.

January - March 2018 ^ Visleshana ^ Vol. 2 No. 2


Research in India

It is really heartening that the Government of India is proactively coming forward to bridge this gap through programs like GIAN. Such programs have the potential to give the much-needed exposure to the research methodologies, standards, and efforts that are advancing the world. It is also commendable that institutions like Osmania University are taking advantage of these programs and playing the much-needed role of catalyst in advancing research in the field of Computer Science and Engineering.

4 RAISING THE BAR FOR RESEARCH
Learning requires substantial effort. We should not forget that the Sanskrit word for training is śikṣā – the same word that also means punishment. This word literally came true for the author when he took what was then known as one of the toughest graduate Computer Science courses at Stanford University. The author of the textbook taught the course and covered approximately 1200 dense pages full of math and abstractions in less than 3 months. GIAN programs need not necessarily go through that kind of rigor, but the participants will make the best of the program if they practice a similar kind of focus and dedication. Standards are high when it comes to research. No good conference has an acceptance rate of more than 20% these days; that means 80% of the papers received by these conferences are rejected. That alone indicates how difficult it is to do research. The readers may find it interesting to note that even the "PageRank" algorithm, on which the company Google was founded, was rejected by conference organizers when the founders of the company were doing their PhD.

Figure 1 Inauguration of the GIAN course

3 MORE THAN TECHNOLOGY
It is not just the technology or the subject that is important for learning. Courses that teach most of the advanced topics are widely available. Even where courses are not available, excellent, lucidly written books on many topics are readily available in the market at affordable prices. Learning the subject should therefore not be too difficult. It is the methodology, the standards, and the practice that make a difference to research efforts. When programs like GIAN are organized, it is important to focus more on these than on the topics. Many years back, when the author graduated, the then Principal, Prof. B. N. Garudachar, said that education teaches not the subject, but how to learn the subject. It is very true: learning how to learn the subject is more important than learning the subject itself. It is for these reasons that, when the author received the information about this program from the college, he decided that the workshop should be research-oriented and not just a series of topics covered sequentially. The participants tried several innovative ways of learning and researching topics. For instance, they generated a concept map of the topics covered in each class. They were encouraged to use LaTeX for their reports and class notes. The tutorials included thought-provoking problems. The final exam was open-book and open-Internet and included a similar kind of thought-provoking problems.
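Since the participants were encouraged to prepare their reports and class notes in LaTeX, a minimal report skeleton of the kind they might start from is sketched below. This is only an illustrative sketch; the title, author name, file name, and section headings are hypothetical, not taken from the course materials.

```latex
\documentclass[11pt]{article}
\usepackage{amsmath}   % for mathematical notation in notes
\usepackage{graphicx}  % to include figures, e.g., a concept map exported as an image

\title{Class Notes: Big Data Analytics for Humanitarian Causes}
\author{A. Participant}  % placeholder name
\date{November 2017}

\begin{document}
\maketitle

\section{Topics Covered}
% Summarize the day's lecture here.

\section{Concept Map}
% Include the concept map generated for the class, e.g.:
% \includegraphics[width=\linewidth]{concept-map.png}

\section{Open Problems}
% Note the thought-provoking questions raised in the tutorials.

\end{document}
```

Compiling such a file with any standard LaTeX distribution produces a cleanly typeset report, and the same skeleton scales naturally from daily class notes to a full future-directions report.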

The standards are high, and rightfully so. Publishing a paper that no one reads, or doing research that no one uses, is probably worse than not doing any research at all. Quality is important and standards should be high. The other aspect that is clearly evident in the west is teamwork. A lot of work in the USA happens in teams, and everyone in a team works hard and contributes selflessly to the overall goal.

5 PRACTICING HIGHER STANDARDS OF RESEARCH
One becomes an expert by thinking like experts. By critically analyzing the work done by experts, we gain deeper insights. Brainstorming the work in a team leads to further ideas. One can then build upon those insights to come up with new ideas. In programs such as those offered through GIAN, the participants should endeavor to practice all these principles of research. In the course that the author offered, participants were asked, in teams, to fully understand a paper from a journal with a high impact factor, preferably on a humanitarian topic, and present it to the class. An economy grows as more and more people join its core echelons. One of the reasons the government exists is to make sure that the rich do not keep getting richer at the cost of the poor. It does no good to the economy or to civilization if the rich-versus-poor divide continues to increase. While mandates like Corporate Social Responsibility help in this respect, there is an increasing need for the research community to also share this social responsibility. IEEE organizes its signature conference every year




called the Global Humanitarian Technology Conference (GHTC) to highlight this responsibility. Programs such as GIAN should also give impetus to research towards humanitarian causes. The participants should take full advantage of the opportunity to contribute to global advancement in wholesome steps, leaving no one behind. Each of the participants in the course offered by the author, for instance, submitted a future-directions report that focused on humanitarian causes.



Prof. Sameen Fatima, Principal; Prof. K. Shyamala, HoD (CSE); and the other faculty, staff, and students of Osmania University College of Engineering, for their help in making the GIAN course achieve its goals and become a cherished experience for all involved.



6 CONCLUSION
India has played a substantial role in advancing the world from times immemorial. Programs such as the Government of India's GIAN initiative have the potential to bring India back to the forefront of research in the areas driving the world economy. Research scholars, particularly in India, are on a noble mission: research is the highest form of education and has ample scope of application within India.

REFERENCES
[1] MHRD, Government of India, GIAN Program, 2017.
[2] VijayRaghavan, K. "Knowledge and human resources: educational policies, systems, and institutions in a changing India." Technology in Society 30.3-4 (2008): 275-278.
[3] Hyndman, Kelly G., Steven M. Gruskin, and Chid S. Iyer. "Technology transfer: What India can learn from the United States." Journal of Intellectual Property Rights (2005).

Vishnu S. Pendyala is a Senior Member of the Computer Society of India and of the IEEE. He has delivered multiple keynote addresses at international conferences sponsored by IEEE and has over 400 citations of his publications. His research interests include Machine Learning, Big Data, Information Retrieval, the Semantic Web, and Artificial Intelligence.

ACKNOWLEDGMENT
The author wishes to place on record his gratitude to
—————————— u ——————————


Sri Chandra Dasaka, Secretary, seen signing the MoU along with NITW officials. Please see the detailed report and newspaper coverage on the next page.



CSI SIGBDA Signs MoU with NITW
Chandra Dasaka
Abstract— An MoU for conducting PG Diploma and certification-level courses in Big Data Analytics and related areas was jointly signed by NIT Warangal and SIGBDA during a specially organized function at NIT Warangal on 3rd January 2018. NIT Warangal was represented by Dr. N. V. Ramana Rao, Director, NITW, and SIGBDA by Chandra Dasaka, Secretary, SIGBDA. This is a trend-setting and very important step to help reduce the gap between what the industry expects and what academia produces. This article is a report on this path-breaking event.
Index Terms—CSI, SIGBDA, MoU, Certification, NITW, PG Diploma

—————————— u ——————————

1 EXTRACT FROM "THE HINDU" DAILY NEWSPAPER, DATED JANUARY 5, 2018
The E&ICT Academy, established under the National Institute of Technology (NIT) Warangal, signed a memorandum of understanding (MoU) with the Computer Society of India (CSI) on Thursday to offer modern interactive classroom-cum-laboratory courses.

2 SALIENT FEATURES OF THE MOU
The MoU signed has the following agreements recorded in it.

Speaking on the occasion, CSI secretary Chandra Dasaka said that its Special Interest Group on Big Data Analytics (SIGBDA) would offer a certification program for both industry personnel and graduate students. Academic expertise, industry experience, and sound technical skills would help the trainees become industry-ready in the area of big data analytics, he stated.


NIT director N.V. Ramana Rao signed the MoU and said that E&ICT Warangal had received an appreciation letter from the Union Ministry of Electronics and Information Technology for its practices. "This academy is a feather in the cap of NIT-W and has tie-ups with IITs, state-level technical academies, MoUs with engineering colleges, and has been improving standards of technical teaching in the states under its jurisdiction," he said.


• The MoU is valid for the next three years.
• Courses are offered towards a PG Diploma and Certification.
• The target audience is college students for the PG Diploma and working professionals for certification.
• There will be an advisory panel consisting of CXOs from industry and professors from NIT and IITs.
• The certification programs will be conducted over weekends.
• The PG Diploma courses will be of 9 – 12 months' duration; the certification courses will be of 3 to 4 months' duration.
• The program will be conducted predominantly in Hyderabad; the assessments and project presentations will be at the NIT Warangal campus.
• Industry will be involved in framing the syllabus, teaching, and mentoring project work.
• NIT will issue the certificates / diplomas.

The programs will commence from 15th March 2018.

[Pictures from the event are included on the previous page.]

D.V.L.N. Somayajulu, chair of the academy, informed the gathering that the E&ICT Academy had conducted about 170 faculty development programmes and trained 6,839 faculty members. Registrar K. Laxma Reddy and chief investigator N.V.S.N. Sarma were also present.


Chandra Dasaka is the Secretary of CSI SIGBDA and the Chief Editor and Publisher of Visleshana. He can be reached over e-mail at secretary(at)



NEWS FROM THE SIG
A successful IEEE workshop series co-chaired by Prof. Saumyadipta Pyne and collaborators, the 'International Workshop on Foundations of Big Data Computing (BigDF)', was held in Jaipur on Dec. 17, 2017 (IEEE Page: ID=41629).

Executive Council member of the SIG, Vishnu S. Pendyala, presented keynotes at the following IEEE-sponsored conferences held in December 2017, proceedings from which are expected to be published in Xplore.

International Conference on Soft Computing and its Engineering Applications (icSoftComp-2017)
Keynote: In Pursuit of Truth on the Web: Soft Computing for Truth Finding
(IEEE Page: ID=42570)

International Conference on Intelligent Communication and Computational Techniques (ICCT 2017)
Keynote: Addressing Uncertainty on the World Wide Web
(IEEE Page: ID=41378)

He also made a presentation on "Mining for Medical Expertise" to the IEEE members of the Hyderabad section on December 28, 2017. Photos from the talks are printed below.

Photo Courtesy: Organizers of the conferences and event




The Flagship Publication of the Special Interest Group on Big Data Analytics

Call for Papers

Visleshana is the official publication dedicated to the area of Big Data Analytics from the Computer Society of India (CSI), the first and the largest body of computer professionals in India. Current and previous issues can be accessed at
Submissions, including technical papers, in-depth analyses, and research articles in IEEE transactions format (compsoc)* are invited for publication in "Visleshana", the flagship publication of SIGBDA, CSI, in topics that include but are not limited to the following:
• Big Data Architectures and Models
• The 'V's of Big Data: Volume, Velocity, Variety, Veracity, Value, Visualization
• Cloud Computing for Big Data
• Big Data Persistence, Preservation, Storage, Retrieval, Metadata Management
• Natural Language Processing Techniques for Big Data
• Algorithms and Programming Models for Big Data Processing
• Big Data Analytics, Mining and Metrics
• Machine Learning Techniques for Big Data
• Information Retrieval and Search Techniques for Big Data
• Big Data Applications and their Benchmarking, Performance Evaluation
• Big Data Service Reliability, Resilience, Robustness and High Availability
• Real-Time Big Data
• Big Data Quality, Security, Privacy, Integrity, and Fraud Detection
• Visualization Analytics for Big Data
• Big Data for Enterprise, Vertical Industries, Society, and Smart Cities
• Big Data for e-Governance and Policy
• Big Data Value Creation: Case Studies
• Big Data for Scientific and Engineering Research
• Supporting Technologies for Big Data Research
• Detailed Surveys of Current Literature on Big Data
All submissions must be original, not under consideration for publication elsewhere or previously published. The Editorial Committee will review submissions for acceptance. Please send the submissions to the Editor, Vishnu S. Pendyala at

* While we are working on our own templates, the manuscript templates currently in use can be downloaded from (LaTeX) or (Word)


