
COMPUTER SOCIETY OF INDIA

Volume 1, Issue 4

The Flagship Publication of the Special Interest Group on Big Data Analytics

Jul – Sep 2017

Image Credit: wordclouds.com


Chief Editor and Publisher: Chandra Sekhar Dasaka
Editor: Vishnu S. Pendyala
Editorial Committee: B.L.S. Prakasa Rao, S.B. Rao, Krishna Kumar, Shankar Khambhampati and Saumyadipta Pyne
Website: http://csi-sig-bda.org

Please note: Visleshana is published by Computer Society of India (CSI), Special Interest Group on Big Data Analytics (CSISIGBDA), a non-profit organization. Views and opinions expressed in Visleshana are those of individual authors, contributors and advertisers and they may differ from policies and official statements of CSI-SIGBDA. These should not be construed as legal or professional advice. The CSI-SIGBDA, the publisher, the editors and the contributors are not responsible for any decisions taken by readers on the basis of these views and opinions. Although every care is being taken to ensure genuineness of the writings in this publication, Visleshana does not attest to the originality of the respective authors’ content. © 2017 CSI, SIG-BDA. All rights reserved. Instructors are permitted to photocopy isolated articles for non-commercial classroom use without fee. For any other copying, reprint or republication, permission must be obtained in writing from the Society. Copying for other than personal use or internal reference, or of articles or columns not owned by the Society without explicit permission of the Society or the copyright owner is strictly prohibited.

From the Editor's Desk

Dear Readers,

Time flies! It's almost a year since Visleshana was conceived. It has been a pleasure evolving it into the flagship publication of CSI SIGBDA and presenting this fourth issue of the magazine. Thanks to the help of everyone involved, we have been able to keep the quality of the publication substantially high. Through the process of reviews and friendly guidance, we are able to encourage a scientific approach among the authors. One of the authors of an article in the previous issue remarked in his reply, "It is indeed a great learning experience for us to improve the Article for Visleshana. It turned out to be a (full-fledged journal) paper!" Indeed, the plainly written Word document they submitted originally ended up being written in LaTeX, with a lot more rigor and detail, adhering to the template recommended for this publication. The same happened with one of the articles submitted for this issue as well: the authors worked diligently to substantially enhance their work based on the review comments and evolved it from the original Word document into a scientific paper in LaTeX format.

We are now running a plagiarism check on submissions and either rejecting submissions that overlap substantially with the existing literature or having any plagiarism issues rectified before publishing the articles. We have also incorporated a copyright transfer process for the published articles. The format of the articles has been standardized starting with the previous issue. For the first time, we already have a substantial article in the pipeline accepted for the next issue. In fact, we received a few articles which had to be rejected because of their relative quality. With the participation of researchers all over the world, we hope to maintain and further enhance the standards of the publication in terms of quantity and quality. We are now utilizing the services of the issuu website to provide our readers a book-like reading experience online. All the issues of Visleshana can now be read like physical magazine copies at https://issuu.com/visleshana.

Security is of utmost importance when it comes to Big Data. We are fortunate to feature Prof. Salman's framework for the security of stored data in this issue. The framework is unique in taking the characteristics of Big Data, such as Volume, Velocity, and Variety, into perspective when designing a security solution. Storage is a challenge with the humongous quantities of Big Data. Inside, you can find how compression techniques help in meeting this challenge while maintaining substantial fidelity. Mining emails to facilitate investigations, automating medical diagnosis from Big Data, and a report on NASSCOM's deliberations on enhancing Data Science skills in India are the other topics presented inside.

Happy Reading!

With Every Best Wish,
Vishnu S. Pendyala
Tuesday, July 4, 2017
San Jose, California, USA



EMAIL ANALYTICS

Email Analytics

Gajanan Mirkhale, School of Computing Science and Engineering, VIT University, Chennai
S. A. Sajidha, School of Computing Science and Engineering, VIT University, Chennai

Abstract—In the present article, our main focus is on criminal and civil investigation over large email datasets. It is very challenging for an investigator to perform an investigation due to the large size of an email dataset. This paper offers an interactive email analytics alternative to the current, manually intensive technique used to search for evidence in large email datasets. In the investigation process, many emails are irrelevant to the investigation, which forces the investigator to search carefully through all the emails in order to find the relevant ones manually. This process is very costly in terms of money and time. The proposed method helps to reduce the length of the investigation process. We combine Elasticsearch, Logstash and Kibana for data storage, data preprocessing, data visualization, data analytics and displaying results. This method reduces the number of emails which are irrelevant to the investigation.

Index Terms—Email, Enron, Data Visualization, Email search, Visual Analysis, Data Analytics.

1 INTRODUCTION

This paper describes a project under development known as Email Analytics, in which a large email dataset is received and generated. Email analytics represents techniques to discover evidence and information for an investigation from a large email dataset. For any large email dataset, we automate such activities to prevent the investigator from conducting a manual search, to reduce the effort and to save a lot of business time.

1.1 Email Communication

Email is extensively used for communication in our day-to-day tasks. Not surprisingly, it has become one of the most widely accepted communication media. This form of communication is easy to use and costs virtually nothing per message. In the digital age, people use written communication far more than ever before. In fact, email communication is not only used instead of letter writing; it has also replaced telephone calls in many situations and in professional environments. In the book Visualization Analysis and Design (Tamara Munzner, 2015), Munzner explains visual analytics for situations where the exact questions are not known in advance. Email analytics provides the ability to find patterns, trends and anomalies. A manual investigation is very difficult when the content of emails keeps changing. Email analytics is also used for analyzing an email corpus based on topic relations using text mining. In text summarization, a large collection of emails is transformed into a reduced and compact email dataset which represents the digest of the original email collection; this can be done using a topic modeling algorithm. A summarized email corpus helps in understanding the gist of a large email collection quickly and also saves a lot of time by avoiding reading each individual email.

Gajanan Mirkhale - VIT University, Chennai Campus. He is currently pursuing his Masters in Computer Science and Engineering with specialization in Big Data Analytics. E-mail: mirkhalegajanan.rajendra2015@vit.ac.in
Prof. S. A. Sajidha - School of Computing Science and Engineering, VIT University, Chennai Campus. E-mail: sajidha.sa@vit.ac.in

1.2 Objectives
Nowadays, when emails are investigated through keyword search over both headers and contents, it is still unclear which keywords would retrieve the best results. This methodology provides the best keywords for searching the emails and also reduces the number of emails retrieved from a large dataset. Data visualization presents relationships in the context of the data, making it easier to find human patterns, trends and anomalies. Topic modeling provides summarization, by which large collections of emails are transformed into a reduced and compact email dataset.

1.3 Challenges of Email Analytics
First, the data sets are very large and are growing rapidly, which makes finding relevant information challenging. Second, our interviews revealed a lack of a good set of investigative tools to deal with many of the issues created by large email data sets. These issues include:
• Reducing the size of keyword search results.
• Removing duplicate, irrelevant or unimportant emails from large email datasets.
• Discovering inconsistencies in the email data.
• Inability to summarize search results or different subsets of email data.
• Finding indirect connections between email accounts.

Currently there are no specific techniques or tools in the market to automate this process; the main reason is that the problem to be solved is very specific to an organization. Email analytics does not provide 100 percent accuracy, but it is enough to automate the process.

2 TOOLS UTILIZED IN THIS PROJECT

ELK Stack
The ELK Stack is a combination of Elasticsearch, Logstash and Kibana used for storing, visualizing and analyzing logs and other time-series data. It provides an end-to-end stack that delivers insights from structured and unstructured data. The ELK Stack is an open source product that makes searching, analyzing and visualizing data easier. In the ELK (Elasticsearch, Logstash and Kibana) Stack:
• Elasticsearch is used for deep search and data analysis.
• Logstash is used for centralized logging, log enrichment and parsing.
• Kibana is a powerful data visualization tool.

Fig. 1. Elk Stack Architecture
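As a quick illustration of the Elasticsearch leg of this stack, the sketch below indexes a single email document over the REST API and reads it back. It assumes a local node on http://localhost:9200 and a hypothetical enron/email index layout with made-up field values; it is not the project's actual ingestion code.

# Minimal sketch: store one email document in Elasticsearch and retrieve it.
# Index/type names and field values are illustrative assumptions.
import json
import requests

ES = "http://localhost:9200"

doc = {
    "sender": "jeff.dasovich@enron.com",
    "recipient": "susan.mara@enron.com",
    "subject": "CPUC hearing",
    "body": "Notes from today's hearing attached.",
    "date": "2001-05-14T09:30:00",
}

# Index (store) the document; Elasticsearch creates the index on first use.
resp = requests.put(ES + "/enron/email/1", data=json.dumps(doc),
                    headers={"Content-Type": "application/json"})
print(resp.json())

# Retrieve the same document by its ID.
print(requests.get(ES + "/enron/email/1").json()["_source"])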

PROGRAMMING LANGUAGES

JSON (JavaScript Object Notation)
JSON, short for JavaScript Object Notation, is a lightweight, text-based way of storing information in an organized, easy-to-use structure. It is an open standard designed for human-readable data interchange and acts as a middleware between the front end and the back end. The conventions used by JSON are familiar to programmers of languages such as C, C++, Java, Perl and Python.
• JSON is fast for pulling data from a database.
• It has schema support for browser compatibility with little effort.
• Using JSON we can transfer images, audio and video files of any size by using arrays.
• JSON is derived from the JavaScript language and uses the .json file extension.

Python
Python has been a very popular programming language for a long time, used by many companies, scientists, casual and professional programmers (apps, cloud/web services and web sites) and app scripters. Python has become one of the most popular dynamic programming languages, along with Perl, Ruby and others. Python and Ruby have become especially popular for building websites using web frameworks such as Rails (Ruby), Django and Flask (Python), and Web2py. Such languages are often called scripting languages.
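The two formats above come together when an email message is represented as JSON from Python. The listing below is a small illustration; the field names are assumptions, not the exact schema used in this project.

# Illustration: an email message serialized to and from JSON with Python.
import json

email = {
    "from": "phillip.allen@enron.com",
    "to": ["john.lavorato@enron.com"],
    "subject": "Re: schedule",
    "date": "2001-03-07T10:15:00",
    "body": "Attached is the updated gas schedule.",
}

text = json.dumps(email, indent=2)   # Python dict -> JSON text
print(text)
restored = json.loads(text)          # JSON text -> Python dict
print(restored["subject"])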

RELATED WORK

There are many areas of analysis related to our project. The first is research associated with the investigative analysis of email and text-based documents. Second, there has been research on the analysis and visualization of emails and collections of text documents. Finally, there is related work on topic modelling and relationships between different users in email collections.

The purpose of the paper by Haggerty J, Karran AJ, Lamb DJ and Taylor M (2011) on the forensic investigation of unstructured email data is to establish the need to create strategies and tools to investigate email data collections. However, their proposed system uses visualization only in the final presentation stage, which limits the benefits of visual analysis. In a follow-up paper (John Haggerty, Sheryllynne Haggerty, Mark Taylor, 2014), the objective is to automate the visualisation of quantitative (network) and qualitative (content) data within an email dataset.

Nowadays, emails are a key source of evidence during a digital investigation, and the investigator may be required to triage and analyse a large amount of data. Currently, the tools and techniques used to work through such data are manual, which makes this a time-consuming process. The main purpose of the paper by Jay Koven (2016) is to support criminal and civil investigations. For large datasets, an investigation usually involves many emails which are not related to the investigation, so the investigator manually sorts through the email data in order to find relevant emails. The aim is to automate the reduction of the number of emails in the search result. Using our ELK stack based technique, investigation is faster, which reduces both time and cost.

In the paper by Kerr (2003), the author shows the relationship between sender and receiver in email threads using a unique visualization which displays connections between sender and receiver as a series of arcing arrows. This work shows the importance of tracing sender and receiver connections in search results for email threads. Our work proposes a technique to show relationships between sender/receiver along with email subjects and connections, to give a clearer picture of the data.

3 TEXT PREPROCESSING

Text preprocessing is an important procedure when we deal with textual data. In order to apply any analysis on the content, we first need to preprocess the data. For preprocessing we use the following Natural Language Processing (NLP) strategies:
• Tokenization
• Lower case conversion
• Stop word removal
• Stemming
• Removal of punctuation marks
• Lemmatization

3.1 Tokenization
Tokenization is a technique used to split a stream of text into words, phrases, symbols or other


important components called tokens. The list of tokens then becomes the input for further processing, for example clustering, text mining or parsing.

3.2 Stop Word Removal
Stop words are words which are removed before doing any analysis on the content. They are basically the most common words in the English dictionary.

3.3 Stemming
Stemming is used to identify the root form of a word.

3.4 Lemmatization
Grouping together the different forms of a word in a corpus so that they can be analyzed as a single item is known in linguistics as lemmatization. To find the lemma for a given word, an algorithmic procedure from linguistics is used.
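A minimal Python sketch of the preprocessing steps described in Sections 3.1 to 3.4 is given below, using the NLTK library. It assumes the punkt, stopwords and wordnet resources have already been downloaded with nltk.download() and is only an illustration of the strategy, not the exact pipeline used in this project.

# Sketch of the NLP preprocessing steps: tokenization, lower-casing,
# punctuation and stop word removal, stemming and lemmatization.
import string

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

def preprocess(text):
    # Tokenization and lower case conversion
    tokens = [t.lower() for t in nltk.word_tokenize(text)]
    # Removal of punctuation marks
    tokens = [t for t in tokens if t not in string.punctuation]
    # Stop word removal
    stops = set(stopwords.words("english"))
    tokens = [t for t in tokens if t not in stops]
    # Stemming and lemmatization (shown side by side for comparison)
    stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
    return [(stemmer.stem(t), lemmatizer.lemmatize(t)) for t in tokens]

print(preprocess("The meetings were rescheduled to discuss gas prices."))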

4 ARCHITECTURE OF EMAIL ANALYTICS

The figure below shows the architectural design of our Email Analytics. In our proposed system, the data is provided to the system; the data is taken from the Enron dataset.

Fig. 2. Basic Email Analytics Architecture

The architecture of email analytics consists of the following components.

4.1 Email Dataset
The Enron dataset is 2.5 GB in size and contains 517,440 email messages in total. The Enron email dataset is available at http://www.cs.cmu.edu/~enron/ and contains emails in different formats such as EML, mbox, PST etc.

4.2 Data Preprocessing
In data preprocessing, Email Analytics consists of two integrated parts. Elasticsearch is used to create the search indexes for the various email fields and extracted entities. The Enron dataset has approximately 517,000 emails. Lucene indexing is efficient and is used in other forensic tools; Solr and Elasticsearch are also used in some forensic tools. The ELK stack is a popular combination of Elasticsearch, Logstash and Kibana, an end-to-end stack which can handle everything from data aggregation to data visualization. We need a database with a schema-less data model for the purpose of aggregated queries and fast searching. There are two options, Elasticsearch and Solr (both are based on Apache Lucene); we decided to go with Elasticsearch because of the full stack and AWS support. This also gives a better understanding of how to run all three components of the ELK stack used to analyze the data.

4.3 Elasticsearch
Elasticsearch is a full text search engine. It provides a REST API over multiple indexes that can be searched and queried. In the ELK stack, indexes are automatically created when we post a JSON document to an index scheme. The index scheme has three parts:
• Index name
• Index type
• Document ID

MAPPINGS: We need to define a schema for the index before we run the aggregate queries, which means that Elasticsearch needs to know the data types (integer, string, double) of the attributes in the schema. Elasticsearch does try to guess the attribute type, but we get predictable results with an explicit schema. A snippet from a mapping configuration is shown in Fig. 3.

Fig. 3. Mapping in Elasticsearch
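Fig. 3 shows the authors' actual mapping snippet; the sketch below is a hedged reconstruction of what such a mapping might look like when created over the REST API with Python's requests library, assuming an Elasticsearch 2.x-style schema and illustrative field names.

# Assumed reconstruction of an index mapping for the "enron" index.
# "not_analyzed" keeps sender/recipient as exact values for aggregation.
import json
import requests

mapping = {
    "mappings": {
        "email": {
            "properties": {
                "sender":    {"type": "string", "index": "not_analyzed"},
                "recipient": {"type": "string", "index": "not_analyzed"},
                "subject":   {"type": "string"},   # analyzed, full-text search
                "body":      {"type": "string"},   # analyzed, full-text search
                "date":      {"type": "date"},
            }
        }
    }
}

resp = requests.put("http://localhost:9200/enron",
                    data=json.dumps(mapping),
                    headers={"Content-Type": "application/json"})
print(resp.json())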

Analyzed Fields: String attributes are analyzed for full text search. This has unwanted side effects when you want to use such fields for aggregation. If an attribute is not needed for full-text search, it is enough to add a "not_analyzed" property, as in Fig. 4.

Fig. 4. Not analyzed in Elasticsearch

Logstash
Logstash is an event collection and forwarding pipeline. Various input, filter and output plug-ins serve to transform events easily. Input and output plug-ins may be added to a configuration file, as in Fig. 5.

Fig. 5. Creating Logstash configuration

Kibana
Kibana is a visualization tool for exploratory data analysis. Kibana connects with an Elasticsearch node and has access to all the indexes on the node. We can select one or more indexes and attributes in the available index for queries and graphs.

DISCOVERY: Open the browser at http://localhost:5601/ and the Kibana dashboard will open up. You should see the Settings page, where you need to select at least one index. The DISCOVER tab presents the count of records spread over the selected range of time by day. You can adjust this to be less granular (week, month, year) or more granular (hour, minute, second).

Fig. 6. Discovery of time by day

We can use the search bar for terms, and the page updates with results and new graphs. The ELK stack provides a simple way to analyze data sets. It is not meant to be a statistical analysis tool, but is more suited to business intelligence use cases. We found that Elasticsearch is very fast for reads; writes are slower because they create the indexes on the attributes and analyze some attributes as well. If a destination like Elasticsearch is not able to keep up with the velocity, Logstash may drop data due to back pressure. Elasticsearch is not used as the authoritative data source as it may drop data in case of network partitions.

Visual Analytics
Visualize: The Visualize tab opens the possible visualizations; most visualizations start with the simplest metric, the total count of all documents.

DASHBOARD: We can create a dashboard from the saved visualizations and save the dashboard. We can also set it to auto refresh, which will update the visuals as new data is collected by Elasticsearch.

Fig. 7. Dashboard Visualization

Relevant Emails
The process of analyzing the Enron email data set starts with a user-defined keyword search that can be filtered for specific senders and receivers. The Elasticsearch search engine returns the relevant data based on the search keywords. This gives the investigator greater flexibility to expand and reduce the email dataset. Looping over the search in this way gives better results.
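The keyword-plus-sender/receiver search described above can be expressed as an Elasticsearch bool query. The sketch below is an assumed example using the index and field names from the earlier illustrations, not the exact query generated by the system.

# Sketch: keyword search on the body, restricted to a single sender.
import json
import requests

query = {
    "query": {
        "bool": {
            "must": [
                {"match": {"body": "pipeline settlement"}}       # keyword search
            ],
            "filter": [
                {"term": {"sender": "jeff.dasovich@enron.com"}}  # sender filter
            ],
        }
    },
    "size": 20,
}

resp = requests.post("http://localhost:9200/enron/email/_search",
                     data=json.dumps(query),
                     headers={"Content-Type": "application/json"})
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"]["date"], hit["_source"]["subject"])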

5 RESULTS AND IMPLEMENTATION WORK

We present a technique for the discovery of evidence and information for an investigation from a large email dataset, preventing the investigator from having to conduct a manual search. In the book Visualization Analysis and Design (Tamara Munzner, 2015), Munzner explains visual analytics for cases where the exact questions are not known in advance. Email analytics provides the ability to find human patterns, trends and anomalies.


5.1 Development of the Email Analytics

The ELK stack is one of the most popular open source solutions not only for log management but also for data analysis, and it can be used to perform big data analysis effectively and efficiently. For the analysis we can take mailbox data from providers such as Gmail, Yahoo, Hotmail or Rediffmail. In general, each message contains the communication which individuals share, and every organization wants to analyze its corporate mail for trends and patterns. As a reference, we take our own mails from a Gmail account. After downloading the mails from the Gmail account, we save the data on the local machine. To download mails from Gmail manually:
• Log in to Gmail and go to My Account.
• Go to Personal info and privacy.
• Go to Control your content.
• On the right side, choose the Download your data option and click CREATE ARCHIVE.
• Select data to include: only Gmail, then click Next.
• Customize the archive format: file type Zip, archive size (max) up to 50 GB, delivery method Add to Drive.
• Click Create Archive, then download the data and save it in Drive.
• Save the data in .EML format.

Since the size of this mailbox is very small, we instead take the well-known Enron email dataset, as it is a huge collection of Enron emails. That data is in mbox format, which we need to transform into a single JSON file.

5.2 Getting Data from the Enron Corpus

The Enron email dataset in raw form is available at http://www.cs.cmu.edu/~enron/. It is organized as a collection of mailboxes by person and folder. The next step is to convert the Enron mail data into mbox format. This mbox is a large text file of concatenated Enron mail messages that is easily accessible by text-based tools; we have used a Python script for the conversion. After this, the mbox file is converted into JSON format, which is compatible with the ELK stack. When the volume of data is very large, the data to be pushed into Elasticsearch is imported in bulk by using the data file. Each Enron mail message is on a separate line with an additional entry which specifies the index (enron) and the email; in Elasticsearch the document ID is assigned automatically. Data used in Elasticsearch may be of two types: exact values and full text. Exact values refer to the actual values; for example, 2014 is not the same as 21-3-2014, and orange is not the same as ORANGE. Full text refers to any complete textual data. First, you need to specify the mapping, as shown below.

Fig. 8. Enron mapping

We can verify that the mapping has indeed been set.

Fig. 9. Verify mapping
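The following sketch shows how the mbox-to-JSON conversion described above could be written with Python's standard mailbox module, emitting the newline-delimited action/source pairs expected by the Elasticsearch bulk API. The paths, index name and chosen header fields are assumptions, not the authors' exact script; the resulting file is what gets loaded in the next step.

# Sketch: convert an mbox file into a newline-delimited bulk-load file.
import json
import mailbox

MBOX_PATH = "enron.mbox"      # hypothetical path to the converted corpus
OUT_PATH = "enron_bulk.json"  # file later posted to the _bulk endpoint

with open(OUT_PATH, "w") as out:
    for msg in mailbox.mbox(MBOX_PATH):
        doc = {
            "sender": msg.get("From"),
            "recipient": msg.get("To"),
            "subject": msg.get("Subject"),
            "date": msg.get("Date"),
            # Multipart bodies are skipped here for brevity.
            "body": msg.get_payload() if not msg.is_multipart() else "",
        }
        # Action line: index into "enron"; the document ID is left to
        # Elasticsearch, as mentioned in the text.
        out.write(json.dumps({"index": {"_index": "enron", "_type": "email"}}) + "\n")
        out.write(json.dumps(doc, default=str) + "\n")

The resulting newline-delimited file can then be posted to the Elasticsearch _bulk endpoint, which is the load step illustrated next.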


Now let us load all the mailbox data by using the JSON file, in the following manner:

Fig. 10. Load mailbox data using the JSON file

We can check whether all the data has been uploaded successfully.

Fig. 11. Data uploaded successfully

We can see that a total of 41,299 records containing different messages have been uploaded. Now let us start with some analysis of this data. Kibana provides very good analytic capability and associated charts. The diagram below shows the configuration of Kibana, in which you can create different indexes. An index pattern identifies one or more Elasticsearch indexes that you want to explore with Kibana; Kibana looks for index names that match the specified pattern. In Kibana, the default index loads automatically when you click on the Discover tab, and it is marked when you press the star to the left of the pattern list in the settings. To set the default index pattern:
• Go to Settings, then Indices.
• Select the pattern you want in the index pattern list.
• Click on the Favorite (star) button.

Fig. 12. Enron Searching

The histogram below shows the messages spread on a weekly basis. The date value is in terms of milliseconds. You can see that one particular week has a peak of 3,546 messages; there must have been something interesting happening that week.

Fig. 13. Enron histogram on a weekly basis

Now let us see who the top recipients of messages are.

Fig. 14. Top recipients of messages

We can see that Pete, Jae and Ken are the top senders of messages. In case you are wondering what exactly Enron employees used to discuss, let us check the top keywords from message subjects.

Fig. 15. Top keywords from message subjects

It seems that the most interesting discussions focused on enron, gas, energy and power. A lot more interesting analysis can be done with the Enron mail data. Enron data can be visualized using different facets found within each email and across multiple email archives. Kibana indexes provide full text search features and visualization for better decision-making.
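Outside Kibana, the same "top recipients" and "top subject keywords" charts can be approximated with Elasticsearch terms aggregations. The sketch below follows the assumed index and field names used earlier and is only illustrative; it also assumes that field data is available for the analyzed subject field.

# Sketch: terms aggregations over recipients and subject tokens.
import json
import requests

agg_query = {
    "size": 0,  # only the aggregations are needed, not the hits
    "aggs": {
        "top_recipients": {"terms": {"field": "recipient", "size": 10}},
        "top_subject_terms": {"terms": {"field": "subject", "size": 10}},
    },
}

resp = requests.post("http://localhost:9200/enron/email/_search",
                     data=json.dumps(agg_query),
                     headers={"Content-Type": "application/json"})
for name, agg in resp.json()["aggregations"].items():
    print(name)
    for bucket in agg["buckets"]:
        print("  ", bucket["key"], bucket["doc_count"])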


We create a Kibana visualization dashboard to view all senders and receivers of the Enron email within any archive.

Fig. 16. Enron Visualization 1

Type a term into the Kibana search box for a visualization and view the result. Simply click on any value within any chart in Kibana to filter down the results and the visualization.

Fig. 17. Enron Visualization 2

Fig. 18. Enron Visualization 3

6 CONCLUSIONS

We have introduced a new kind of analytics, Email Analytics, by proposing a new methodology for searching and mining useful emails from large email datasets. Analytics over an email dataset makes it easier for investigators to identify hidden relationships and anomalies within the email dataset, which improves and speeds up the results of the investigation process. To find relevant emails, entities and correspondents in the email data sets, investigators found a new and interactive visual technique which helps them in their decision-making. Emails which are relevant but were hidden from the initial search can be brought back into the final detailed examination through our search result reduction techniques. Our study and approach demonstrated that visual interaction can reduce the size of the result set so that the remaining emails can be examined in detail to find the emails relevant to the investigation.

ACKNOWLEDGMENTS

The authors thank Dr. Bharadwaja Kumar, Program Chair, School of Computing Science and Engineering, VIT University, Chennai Campus, for guiding us to achieve this feat. We appreciate all the help we received from our peers in the course of this research.

REFERENCES

[1] Bernard Kerr. Thread Arcs: An email thread visualization. IEEE Symposium on Information Visualization, 2003.
[2] C. Ramasubramanian and R. Ramya. InVEST: Intelligent visual email search and triage. DFRWS USA 2016 - Proceedings of the 16th Annual USA Digital Forensics Research Conference, Digital Investigation, 18, 2016.
[3] John Haggerty, Sheryllynne Haggerty, and Mark Taylor. Forensic triage of email network narratives through visualisation. Information Management and Computer Security, 22, 2014.
[4] John Haggerty, Sheryllynne Haggerty, and Mark Taylor. Enron corpus dataset. Information Management and Computer Security, https://www.cs.cmu.edu/~enron/.
[5] Haggerty J, Karran AJ, Lamb DJ, and Taylor M. A framework for the forensic investigation of unstructured email relationship data. International Journal of Digital Crime and Forensics, 2011.
[6] Apache Lucene, https://lucene.apache.org/.
[7] Apache Solr, https://lucene.apache.org/solr.
[8] Elastic, https://www.elastic.co/.
[9] Enron Dataset, http://www.cs.cmu.edu/~enron/.
[10] Munzner T., Maguire E. Visualization Analysis and Design. AK Peters Visualization Series. Boca Raton, FL: CRC Press, 2015.

Gajanan R. Mirkhale is pursuing the M.Tech CSE degree with specialization in Big Data Analytics from VIT University, Chennai, India, in 2017. He received the Bachelor of Engineering degree from M.S. Bidwe College of Engineering, Latur, in 2013, and the Diploma from Y.B. Patil Polytechnic, Pune, in 2010. He is currently employed by Innova Solution Pvt. Ltd. in Chennai, India, working on Big Data Analytics.

S. A. Sajidha is working as an Assistant Professor (Sel. Grade) at VIT University, Chennai. She has around 17 years of academic experience and is mainly working on Data Mining, Artificial Intelligence and Big Data. She is currently pursuing her Ph.D. in Data Mining.



DATA SECURITY

A Framework for Securing Data at Rest in the Big Data Domain

Salman Abdul Moiz

Abstract—Data is growing exponentially due to the enormous use of smart devices, the internet and social media. Big data refers to huge collections of data sets on the order of petabytes and exabytes. Since traditional database systems are not effective in managing huge volumes of data, such data is often stored and maintained by cloud service providers. Security during transmission of data is guaranteed using SSL (Secure Socket Layer). However, the issue is with securing the data at rest, which is currently entrusted to Service Level Agreements (SLAs). Encrypted database concepts were applied in the cloud environment to transform a plain query into an encrypted query and get back encrypted results. However, these solutions do not directly apply to the big data domain. The Volume, Velocity, Variety and Veracity in the big data domain add extra challenges to existing encrypted cloud database solutions. This article presents a framework for securing data at rest in the big data domain. The architecture of the proposed framework can be seen as an enhancement of the framework of encrypted databases in the cloud domain.

Index Terms—Big data, Cloud environment, Data at rest, Encrypted databases

1 INTRODUCTION

According to ISO 2015 [3], big data is characterized by 5 V's. Data that is coming from several sources is huge (Volume) and often arrives at high speed (Velocity); it is of diverse types (Variety), comes with noise (Veracity) and changes too fast (Variability). Traditional database systems can't manage data with these characteristics. Managing big data in today's threat environment is a challenging issue. The huge volumes of data are often stored in the cloud environment, which is often untrusted as key management is also realized by the service providers.

The data which is stored on a device or backup medium in digital form is often referred to as data at rest. Several solutions to secure data at rest using encrypted database mechanisms have been proposed by researchers. These mechanisms often work well for structured data, in which the request for data retrieval is sent to the cloud service provider as an encrypted structured query. However, the other dimensions such as Variety, Velocity and Veracity require the addition of new wrappers to the existing data-at-rest solutions in cloud environments. In this paper a framework to realize a few characteristics of the big data environment is presented. The strategies that can be applied for key management are presented and certain open issues and challenges are discussed.

The remaining part of this paper is organized as follows: Section 2 gives the state of the art of security in big data environments; Section 3 presents a framework for securing data at rest in big data environments; Section 4 highlights the open issues and challenges and Section 5 concludes the paper.

2 RELATED WORK

According to Trend Micro [7], today's threat environment deals with how security vendors manage threats in the presence of the Volume, Variety and Velocity of the data. From 1990 to 2010 the number of spam messages increased from 2 to 200 billion per day. The malwares detected in January 2008 alone were more than all malwares reported prior to 2008. With the increase of social media tools the volumes of data have increased exponentially. Since data is streamed from various devices at the same time, velocity becomes a challenging issue. The data generated can be in varied formats ranging from databases, documents and emails to transactions, audios, videos etc. There is a need to manage this variety of forms of data.

The security and encryption mechanisms can be realized based on the state of the digital data. There are three states of data: data at rest, data in transit and data in use.

Data at rest indicates that the data is stored in a repository in digital form. This state basically refers to the data stored on the cloud which is currently not being transmitted across the network.

Data in transit refers to the data that is travelling across a network. It could be in transit from local to cloud storage or vice versa. The data should be in encrypted form, and the Secure Socket Layer (SSL) ensures that data in transit remains integral.

Data in use refers to the data that is currently being processed. Data in use may also be prone to threats; however, device authentication and authorization mechanisms help in dealing with these threats.

2.1 Encrypting Critical Data Items

The entire data stored by cloud service providers is generally stored in encrypted form. Raghav et al. [1] proposed that the entire database need not be encrypted to allow efficient query processing. Instead


only the critical data items may be encrypted. However, the challenge is to dynamically select critical data items from the huge volumes of data. Secondly, when more and more features are added to the repository, the existing set of critical data items may no longer be sufficient. Further, the intruder can at least know which database objects and attributes are being accessed and may guess the critical data items from the other data items.

2.2 Security Measures in the Cloud Computing Environment
Venkata et al. [5] proposed security measures in the Cloud Computing Environment (CCE) that would improve the security of cloud environments using mechanisms like file encryption, network encryption, logging, node maintenance, rigorous testing of MapReduce jobs, a layered framework for assuring the cloud, and third-party secure data publication to the cloud. These approaches may not be suitable for the big data domain: there is a need for authentication of devices, and the encryption mechanisms used may not directly be suitable for big data. There is a need to revisit these approaches by considering the characteristics of the big data domain.

2.3 Data Mining Techniques for Preserving Privacy
Neetu et al. [2] presented a survey of techniques used for preserving privacy using data mining approaches. The selection of a particular technique depends upon the type of data set used [4]. In the anonymization technique, the sensitive attribute value(s) is replaced by other value(s) with the intention of not disclosing the sensitive data. Each tuple of a relation must be indistinguishably associated with a value to deal with identification risks. In the randomization technique an extra field, known as noise, is added to the current field(s). With the inclusion of the additional field the correct individual information cannot be recovered, but the combined effects are preserved: individual fields of the relation are randomized, while the aggregated values still yield correct results. In the "probabilistic results for queries" approach, the results of queries can be null, or probabilistic replies can be returned instead of the actual results [4]. There is a need to modify these approaches keeping in view the characteristics of the big data environment.

Several strategies like encrypted database concepts help in managing data at rest, but they are realized for structured queries and don't consider the velocity and veracity of the data. Hence there is a need to include wrapper modules to secure seamless processing, transmission and storage of data.

3 FRAMEWORK FOR SECURING DATA AT REST

Traditional database systems are suitable for the management of structured data. In order to store and process huge volumes of data, storage was outsourced to data centers or cloud service providers. The data in transit is secured using SSL and data at rest is managed using encrypted databases. However, these solutions are not effective in analyzing semi-structured and unstructured data at the same time (Variety). Further, traditional database systems can't manage the data deluge, which is the data generated by humans and devices. An enormous amount of data is generated each hour from various devices and is stored in cloud databases, and it may be generated in various formats.

3.1 Cloud Database Service
Traditionally, maintenance and management of data was the responsibility of the customers, using their private servers. With the increase in the volume of data, it became difficult to manage, so organizations started outsourcing it. Cloud databases are an essential service of the cloud computing environment, as the operational and management cost increases exponentially with the size of the data. There are two approaches for deploying a database in the cloud. The first is referred to as "do-it-yourself": the consumer subscribes to IaaS (Infrastructure as a Service) and uses its own database instance to realize the velocity, variety and veracity characteristics. In the second approach the consumer deploys its database into the cloud and it is managed by the Cloud Service Provider (CSP). Cloud services provide two types of environments to realize database services, viz. the multi-instance model and the multi-tenant model. In the multi-instance model, each consumer is provisioned with a unique DBMS running on a dedicated virtual machine subscribed to by that specific customer; in this environment, the consumers have better control over administrative and other security solutions. In the multi-tenant model, a tagging method is used which provides a pre-defined database environment that is shared by many tenants. The data of each resident tenant is tagged with an identifier that is used for unique identification.

3.2 Architecture for Managing Encrypted Databases
One of the effective techniques to ensure the confidentiality of sensitive data in the cloud environment is to use encryption of data in transit as well as data at rest.


Even if the database is outsourced to a cloud service provider (untrusted servers), encryption can be used to improve database security. However, encryption shouldn't hamper the normal functionality of databases; ideally, queries should be executed directly over the encrypted database. The database may be deployed on an untrusted or compromised machine. The cloud service providers, which are in the untrusted domain, shouldn't be in a position to hold the key for encrypting or decrypting the data. There is a need for a middleware or proxy which is in the trusted domain; the keys for encryption and decryption are available with this proxy or middleware. Proxy-based solutions were realized using tools like CryptDB. However, they don't directly apply to the big data domain, as the data may not always be in structured form. The following architecture is used to manage encrypted databases in the big data domain.

The data comes from various devices. There is a need for IPv6-enabled devices to realize device authentication. All the applications reside on these devices, which may range from desktops to mobile and smart devices in the scope of the Internet of Things. The cloud service providers store the data in encrypted form. There is a need for a strong middleware which can handle huge volumes of data of various forms, with noise, captured through fast streaming. The middleware is expected to realize five modules.

The data collection module is used to capture data. The data collector must be in a position to manage heterogeneous formats of data; the data may be in varied forms and may also contain noise. The functionality of the pre-processing module is to remove noise from the captured data; effective preprocessing mechanisms help in effective removal of noise. Since the data can be structured or semi-structured, NoSQL databases like MongoDB can be used. The other approach is to convert the semi-structured or unstructured data into structured tuples. As the data at the server of the data center is stored in encrypted form, the plain query needs to be translated into cipher form. The respective keys for encryption and decryption are maintained at the middleware. The data is stored in encrypted form at the cloud service provider. When information has to be retrieved from the cloud service provider, the middleware receives the data in encrypted form, decrypts it and sends it to the device which requested the information. Deterministic encryption [9], order-preserving encryption [8] and homomorphic techniques can be applied to retrieve data, realize range queries and realize the aggregate functions using the above architecture. The onion layers can be realized after implementation of the preprocessing and transition modules. The onion layers [8, 9] used in encrypted databases in cloud environments can be used in the big data domain; however, the data collector and preprocessing modules will act as the outer layers of the onion.

3.3 Key Management and Fault Tolerance

Key management is a challenging issue in the cloud and big data domains. The key may be stored at the client, the server or the middleware, or it may be shared among various devices. It may not be feasible to store the keys at the clients, as the devices may be prone to failures and may be lost. As the server is in the untrusted domain, it is not desirable to store the keys at the server. Keys may be shared among several trusted devices where each device maintains a partial key; in order to encrypt or decrypt the data, the key has to be reconstructed by collecting the partial keys from all devices, and if any device fails then it will be difficult to store or query the data. In the proposed framework, it is proposed to store the keys on the middleware or proxy, which is in the trusted domain. However, if the middleware fails, the system stops functioning. Hence there is


a need to realize fault tolerance of the middleware. It is proposed that the middleware be connected with its mirror copy in synchronous mode. Both images are expected to be in active-active mode, such that if one of the middleware instances fails, the other automatically takes over transaction management without any delay. When the traffic over the network increases, this also helps as a load balancer. The logs maintained at the various mirror copies help in recovery of the failed middleware. Since the data is stored at the cloud service provider, it is expected that only transaction replication will be used between the primary and backup copies of the middleware.
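A minimal Python sketch of the middleware role described above is given below: the keys stay in the trusted proxy and the cloud store only ever sees ciphertext. Fernet from the cryptography package stands in for randomized encryption of the value, and an HMAC token stands in for deterministic encryption so that equality lookups remain possible. This is an illustration under those assumptions, not the framework's actual implementation.

# Sketch of a trusted middleware that encrypts before storing.
import hmac
import hashlib

from cryptography.fernet import Fernet

class Middleware:
    def __init__(self):
        self._enc_key = Fernet.generate_key()   # kept only at the proxy
        self._mac_key = b"proxy-side-secret"    # hypothetical second key
        self._fernet = Fernet(self._enc_key)
        self._store = {}                        # stands in for the cloud store

    def _token(self, value: str) -> str:
        # Deterministic token: equal plaintexts map to equal tokens,
        # so the untrusted store can be queried by equality.
        return hmac.new(self._mac_key, value.encode(), hashlib.sha256).hexdigest()

    def put(self, key_value: str, payload: str) -> None:
        self._store[self._token(key_value)] = self._fernet.encrypt(payload.encode())

    def get(self, key_value: str) -> str:
        return self._fernet.decrypt(self._store[self._token(key_value)]).decode()

mw = Middleware()
mw.put("device-42", "temperature=71F; status=ok")
print(mw.get("device-42"))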

4 OPEN ISSUES AND CHALLENGES

The big data threat model has certain open issues and challenges:
• The proposed framework has to be implemented on high volumes of data to establish its robustness.
• Device authentication is needed before the data is processed at the various devices. Since various forms of data are generated from multiple devices, this has to be realized in the proposed framework.
• The two middleware modules of preprocessing and transition to structured queries require the design of optimization techniques, as huge data in varied forms is collected.
• In order to guarantee the integrity property, effective key management strategies such as key-based encryption are expected to be included in the proposed framework. Novel approaches for key management are desirable.
• There is a need for mechanisms to implement failure recovery of the middleware.

5 CONCLUSION

According to Gartner, "Big data information security is a necessary fight". The characteristics of the big data environment pose new security challenges. The existing database encryption solutions are not sufficient, as data is expected to be captured from several devices in varied forms. In this paper, a framework is presented which can be used to realize security solutions in the big data domain. The open issues and challenges highlight the enhancements that can be made to the proposed framework.

REFERENCES

[1] Raghav Toshnival, Kanishka Ghosh Dastidar, Asoke Nath, "Big Data Security Issues & Challenges", International Journal of Innovative Research in Advanced Engineering (IJIRAE), Issue 2, Vol. 22, pp. 15-20, 2015.
[2] Neetu Chaudhari, Satyajee Srivastava, "Big Data Security Issues and Challenges", International Conference on Computing, Communication & Automation (ICCCA 2016), pp. 60-64, 2016.
[3] Big Data Preliminary Report 2014, ISO/IEC JTC1, Information Technology, www.iso.org/iso/home/about/iso_members.htm, preliminary report 2015.
[4] Arun Thomas George, Arun Viswanathan, Kiran N.G., Phil Shelley, Doug Cutting, S. Gopalkrishnan, Big Data Spectrum, 2012.
[5] Venkata Narasimha Inukollu, Sailaja Arsi, Srinivas Rao Ravuri, "Security Issues Associated with Big Data in Cloud Computing", International Journal of Network Security & its Applications (IJNSA), Vol. 6, No. 3, pp. 45-46, 2014.
[6] P. R. Anisha, Kishore Kumar Reddy C., Srinvasulu Reddy K., Surender Reddy S., "Third Party Data Protection Applied to Cloud and XACML Implementation in Hadoop Environment with SPARQL", IOSR Journal of Computer Engineering, pp. 39-46, 2012.
[7] "Addressing Big Data Security Challenges: The Right Tools for Smart Protection", A Trend Micro white paper, pp. 1-7, Sept 2012.
[8] Alexandra Boldyreva et al., "Order-Preserving Encryption Revisited: Improved Security Analysis and Alternative Solutions", CRYPTO 2011, LNCS.
[9] Alexandra Boldyreva, Serge Fehr, Adam O'Neill, "On Notions of Security for Deterministic Encryption, and Efficient Constructions Without Random Oracles", CRYPTO 2008, LNCS, pp. 335-339, 2008.

Salman Abdul Moiz is working as an Associate Professor at the School of Computer & Information Sciences, University of Hyderabad. His areas of interest include Software Engineering, Distributed Databases, E-Learning and Fault Tolerance. He did his Ph.D. from Osmania University, M.Tech (CSE) from Osmania University, MCA from Osmania University and M.Phil (CS) from Madurai Kamaraj University. He is a Fellow of IETE, Senior Member of IEEE, Senior Member of ACM, Life Member of CSI, Life Member of EWB and Member of ISRS.



IMAGE COMPRESSION

Minimizing Big Data Storage Demand Using Polynomial Curve Fitting: The Case of Block Based Compression of Image Data

R. Srinivasa Rao, Vignana Jyothi Institute of Management
V. V. Haragopal, Osmania University

Abstract—Polynomial curve fitting is an efficient tool to approximate data and is used for minimizing data storage. Curve fitting is concerned with estimating pixel values using the parameters of an estimated model: a set of data points is approximated using these parameters, and the reconstructed data points give the approximated values. To find the model parameters, mean square error minimization is used. When a first order polynomial is used, the difficulty is that it results in a high amount of degradation in the quality of the image. Furthermore, a polynomial of order one alone is not appropriate all the time; when there is much variation in pixel values, a higher order polynomial, especially of order 2 or 3, is a better choice to minimize the error sum of squares, and hence a better compression can be achieved without degrading the quality of the image.

Index Terms—Curve fitting, block based compression, mega pixel, compression ratio


1 INTRODUCTION

With the widespread use of digital devices such as digital cameras and personal computers, and due to the accessibility of the internet, it has nowadays become very easy to exchange more and more images through various media. Due to the growing amount of visual data, it has become necessary to compress images in order to store the data and also to transfer it efficiently.

1.1 Image Compression
A common characteristic of most images is that neighboring pixels are correlated and therefore contain redundant information. The idea of image compression is to eliminate this redundant information in such a way that a high compression ratio is achieved without degrading the quality of the image.

1.2 Compression Types
Image compression techniques can be classified into two categories: lossless and lossy schemes. In a lossless scheme, the exact original data can be recovered, while in a lossy scheme, only a close approximation of the original data can be obtained.

Lossless Compression: A compression scheme which reduces the data redundancy without affecting the information contained in a file is called lossless compression. Lossless compression retrieves every single bit of data contained in a file after the file is decompressed, i.e. the complete information is restored.

Lossy Compression: A compression technique that does not restore the data to 100% of the original but reduces a file by eliminating certain redundant information permanently is referred to as lossy compression. Lossy compression is more suitable for still images; in many cases the loss may not be noticeable to the human eye or ear, and even if noticed it will not affect the application. The image reconstructed by lossy compression may differ from the original image. It gives a better compression ratio compared with lossless compression techniques, as it encodes only the information which is required for the reconstruction of the image and eliminates the unnecessary data.

1.3 Segmentation Schemes
With the widespread use of digital devices such as digital cameras and personal computers, more compound images containing text, graphics and natural images are available in digital form. Due to the compound nature of an image, there are three basic segmentation schemes: object based, layer based and block based. The object based scheme divides an image into regions where every region contains graphical objects, i.e. photographs, text, etc., but it is difficult to get good compression because the edges of the objects take extra bits in memory; it results in less compression and increased complexity [Said, A. and Drukarev, A.I. (1999)]. The layer based scheme divides an image into different layers, i.e. natural pictures, buildings, text, colors, and every layer is treated with a different compression method. The block based scheme is more efficient compared with the two schemes above. The advantage of this scheme is that the image can be easily segmented, for example into N x N blocks, and each block is treated with the same compression technique with less redundancy.

1.4 Performance Indicators
In order to measure the performance of a compression scheme and to quantify it, some performance indicators are used: the root mean square error (RMSE), and the measure of the quality of the reconstructed image called the peak signal to noise ratio (PSNR). The root mean square error in image compression is one of


many ways to quantify the difference between an original image and the approximated image, and is measured as

\[ RMSE = \left[ \frac{1}{M \times N} \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} \left[ g(x,y) - f(x,y) \right]^2 \right]^{1/2} \qquad (1.4.1) \]

where f(x,y), x = 0, 1, 2, ..., M-1, y = 0, 1, 2, ..., N-1 is the original image and g(x,y) is the reconstructed image. The peak signal to noise ratio (PSNR) is the ratio between the maximum possible power of a signal and the power of the corrupting noise that affects the fidelity of its representation, and is usually expressed on the logarithmic decibel scale. The PSNR is most commonly used as a measure of the quality of the reconstructed image and is defined through the mean square error:

\[ PSNR = 20 \log_{10} \left( \frac{255}{RMSE} \right) \qquad (1.4.2) \]

Here 255 is the maximum pixel value of the image when the pixels are represented using 8 bits. Typical values of the PSNR in lossy image and video compression are between 30 and 50 dB. A decompressed image having a higher PSNR is assumed to be of better quality, and the PSNR is inversely related to the RMSE: the higher the PSNR, the lower the RMSE, and vice versa. To compress an image we minimize the number of bits required to store it. To know how much data is compressed, we compute the compression ratio:

\[ CR = \frac{\text{Number of bits in the compressed image}}{\text{Number of bits in the original image}} \qquad (1.4.3) \]

The compression ratio affects the quality of an image: a higher compression ratio means that the degradation of the image is also higher. To get a good amount of compression, we give up some quality of the image by using polynomial curve fitting. Region based schemes usually provide high compression ratios but involve a lot of computation; block based schemes, in contrast, are fast and involve less computation. The process of constructing a compact representation to model the surface of an object based on a fairly large number of given data points is called surface fitting; in image processing, surface fitting is also called polynomial fitting. In the field of image processing, polynomial fitting has been used in image segmentation [Besl, P. and Jain, R. (1988)], image noise reduction [Sinha and Schunck (1992)] and quality improvement of block based compression [Kieu and Nguyen (2001)]. Salah Ameer (2009) implemented first order polynomial fitting for 8 x 8 blocks.

2 MATHEMATICAL FORMULATION OF POLYNOMIAL FITTING

An image can be approximated by using polynomial functions of different orders by subdividing the original image into blocks of various sizes. A mathematical framework using polynomial interpolation for image compression was presented by Eden et al. (1986). Here the two-dimensional block is taken as a one-dimensional array and approximated by one of

\[ f(x) = a_0 + a_1 x, \qquad f(x) = a_0 + a_1 x + a_2 x^2, \qquad f(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 \qquad (2.1) \]

Let these be the 1st, 2nd and 3rd order polynomials respectively. In case (1) of (2.1), f(x, y) is estimated using two parameters a_0, a_1; in case (2) of (2.1), f(x, y) is estimated using three parameters a_0, a_1, a_2; and in case (3) of (2.1), four parameters a_0, a_1, a_2, a_3 are used. Further, we can compute the error value, given by

\[ Error = \sum_{x} \sum_{y} \left( f(x,y) - g(x,y) \right)^2 \qquad (2.2) \]

where g(x, y) is the original intensity value and f(x, y) is the polynomial function. Let

\[ y = a_0 + \sum_{j=1}^{k} a_j x^j \]

be the k-th order polynomial used to estimate f(x, y). Using least squares curve fitting, a set of parameters a_0, a_j, j = 1, 2, ..., k is estimated in such a way that the sum of the squared errors is minimum. If a set of data is given, it can be approximated or estimated using these parameters (a_0, a_1, a_2, a_3, ...). With these parameter values the reconstruction is made by approximating the values of the image by curve fitting. If the curve fit is good, the estimated values and the original values are nearly the same. Hence, the main objective of the study is to find an efficient representation of an image using polynomial curve fitting. The best fitted polynomial will give the minimum error among the three polynomial functions.
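The numerical sketch below illustrates equations (2.1) and (2.2): a block, flattened to a one-dimensional array, is fitted with polynomials of order 1, 2 and 3 by least squares and the resulting error sums of squares are compared. It uses NumPy's polyfit in place of the closed-form derivation, and the sample block values are made up.

# Sketch: least squares fits of order 1, 2 and 3 to a flattened block.
import numpy as np

block = np.array([52, 55, 61, 66, 70, 61, 64, 73], dtype=float)  # toy 1-D block
x = np.arange(block.size)

for order in (1, 2, 3):
    coeffs = np.polyfit(x, block, order)        # a_k ... a_0
    fitted = np.polyval(coeffs, x)              # f(x) from eq. (2.1)
    sse = np.sum((fitted - block) ** 2)         # error from eq. (2.2)
    print(f"order {order}: {order + 1} coefficients, error = {sse:.2f}")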

3 PROCEDURE FOR BLOCK BASED IMAGE COMPRESSION USING POLYNOMIAL FUNCTION

First an image is subdivided into sub blocks of sizes 4 x 4,


For a block with little variation, a polynomial of order one is fitted. However, a first-order polynomial is not always appropriate: when there is much variation in the pixel values, a higher-order polynomial is a better choice to minimize the error sum of squares. We therefore fit a polynomial of order 2 or 3 if the resulting error of the block exceeds a pre-defined threshold, taken as 1%, 2% or 5% of the mean square of the original image, and finally we store only a few polynomial coefficients for a whole block of pixels. After reconstructing the image from the coefficients, the RMSE and PSNR are computed to measure the quality of the image.

3.1 Method
Step 1: Read the image into the MATLAB environment and call it the original image.
Step 2: Compute the mean sum of squares of the original image.
Step 3: Obtain the threshold as 1%, 2% or 5% of the mean sum of squares of the original image.
Step 4: Subdivide the original image into 4 x 4, 8 x 8 and 16 x 16 blocks; call each block a mega pixel.
Step 5: For each mega pixel, fit a polynomial of order 1 if the mean square error of the block is less than the threshold; otherwise fit a polynomial of order 2, and if the error is still above the threshold, a polynomial of order 3, and obtain the polynomial coefficients of the block.
Step 6: Obtain the estimated values for each mega pixel (block) to reconstruct the original image.

The procedure is illustrated step by step below for one 8 x 8 block.

Table 3 Digitized Data matrix

For this data we calculated:
Mean sum of squares: 0.3298
Threshold (5% of the mean sum of squares): 0.0165
1st order polynomial Mean Square Error: 0.0885

Since the mean square error is more than the threshold, we fit a polynomial of order 2. For this fit we computed:
Mean sum of squares: 0.3298
Threshold (5% of the mean sum of squares): 0.0165
2nd order polynomial Mean Square Error: 0.0878

The MSE is still more than the threshold, hence we fit a polynomial of order 3 and obtain the polynomial coefficients [-1.24E-6, 0.0002, -0.0094, 0.5913]. With these coefficients we reconstruct the sub-image; the estimated values for the 8 x 8 block are given in Table 1.

Table 1 Reconstructed Data matrix

4 EXPERIMENTAL RESULTS
The results of the experiments are summarized in the tables below.

Table 2 Comparison of mega pixel of size 4 x 4, 8 x 8 and 16 x 16 with threshold as 5% of mean square
Table 4 Comparison of mega pixel of size 4 x 4, 8 x 8 and 16 x 16 with threshold as 1% of mean square
Table 5 Comparison of mega pixel of size 4 x 4, 8 x 8 and 16 x 16 with threshold as 2% of mean square
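A compact sketch of the overall procedure of Section 3.1 in Python/NumPy is given below. This is an illustrative reconstruction, not the authors' MATLAB implementation; it reuses the hypothetical fit_block helper sketched after equation 2.2, interprets the "mean sum of squares" as the mean of the squared pixel values, and assumes the image dimensions are exact multiples of the block size.

```python
import numpy as np

def compress_image(image, block_size, threshold_fraction=0.05, max_order=3):
    """Block-based compression by polynomial fitting (Steps 1-6)."""
    img = image.astype(np.float64)
    threshold = threshold_fraction * np.mean(img ** 2)   # Steps 2-3
    reconstructed = np.zeros_like(img)
    stored_coeffs = []

    h, w = img.shape
    for r in range(0, h, block_size):                    # Step 4: mega pixels
        for c in range(0, w, block_size):
            block = img[r:r + block_size, c:c + block_size]
            for order in range(1, max_order + 1):        # Step 5: try order 1, 2, 3
                coeffs, approx, mse = fit_block(block, order)
                if mse < threshold or order == max_order:
                    break
            stored_coeffs.append(coeffs)                  # only coefficients are stored
            reconstructed[r:r + block_size, c:c + block_size] = approx  # Step 6
    return reconstructed, stored_coeffs
```

The reconstructed image returned by this sketch can then be compared with the original using the RMSE and PSNR functions shown earlier.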


Figure 1 Musical Instrument - Comparison of mega pixel of size 4 x 4, 8 x 8 and 16 x 16, threshold as 5% of mean square. Panels: Original image; Reconstructed image, 4 x 4 blocks (CR 15.9849, PSNR 28.5944); Reconstructed image, 8 x 8 blocks (CR 4.3934, PSNR 24.5058); Reconstructed image, 16 x 16 blocks (CR 1.2284, PSNR 21.6191).

Figure 2 Fort - Comparison of mega pixel of size 4 x 4, 8 x 8 and 16 x 16, threshold as 5% of mean square. Panels: Original image; Reconstructed image, 4 x 4 blocks (CR 16.2299, PSNR 31.2236); Reconstructed image, 8 x 8 blocks (CR 4.6026, PSNR 26.9362); Reconstructed image, 16 x 16 blocks (CR 1.2554, PSNR 23.9697).

Figure 3 Musical Instrument - Comparison of mega pixel of size 4 x 4, 8 x 8 and 16 x 16, threshold as 1% of mean square. Panels: Original image; Reconstructed image, 4 x 4 blocks (CR 19.1839, PSNR 28.6116); Reconstructed image, 8 x 8 blocks (CR 5.4575, PSNR 24.5109); Reconstructed image, 16 x 16 blocks (CR 1.4910, PSNR 21.6203).


Figure 4 Fort - Comparison of mega pixel of size 4 x 4, 8 x 8 and 16 x 16, threshold as 1% of mean square. Panels: Original image; Reconstructed image, 4 x 4 blocks (CR 20.6084, PSNR 31.2665); Reconstructed image, 8 x 8 blocks (CR 5.7726, PSNR 26.9456); Reconstructed image, 16 x 16 blocks (CR 1.5127, PSNR 23.9717).

Figure 5 Musical Instrument - Comparison of mega pixel of size 4 x 4, 8 x 8 and 16 x 16, threshold as 2% of mean square. Panels: Original image; Reconstructed image, 4 x 4 blocks (CR 17.5755, PSNR 28.6065); Reconstructed image, 8 x 8 blocks (CR 4.9962, PSNR 24.5094); Reconstructed image, 16 x 16 blocks (CR 1.3857, PSNR 21.6199).

Figure 6 Fort - Comparison of mega pixel of size 4 x 4, 8 x 8 and 16 x 16, threshold as 2% of mean square. Panels: Original image; Reconstructed image, 4 x 4 blocks (CR 18.3398, PSNR 31.2528); Reconstructed image, 8 x 8 blocks (CR 5.1521, PSNR 26.9432); Reconstructed image, 16 x 16 blocks (CR 1.3278, PSNR 23.9712).


5 CONCLUSIONS

When the "Musical Instrument" image is approximated using polynomial functions of different orders with the threshold set to 5% of the mean square of the original image, the compression ratio and PSNR with 4 x 4 blocks are 15.9849 and 28.5944 respectively, while with 8 x 8 blocks they are 4.3934 and 24.5058 respectively, showing that a good compression is achieved with 8 x 8 blocks as compared to the 4 x 4 block size. Furthermore, the gain in compression is approximately 72% while the quality of the image decreases by only about 14%, so it can be concluded that, with little degradation in quality, a good compression is achieved with the 8 x 8 block size. Similarly, for the same image and threshold, the compression ratio and PSNR with 8 x 8 blocks are 4.3934 and 24.5058 respectively, while with 16 x 16 blocks they are 1.2284 and 21.6191 respectively, showing that a good compression is achieved with the 16 x 16 block size as compared to the 8 x 8 block size; however, the reproduced image is not visually clear. Hence it can be concluded that, without compromising the quality of the image, a good compression is achieved with the 8 x 8 block size. The same pattern is observed in all the images.

On the basis of the mathematical measures of PSNR, RMSE and compression ratio, we compared various gray scale images for different block sizes and varying intensity levels, with threshold levels of 1%, 2% and 5% of the mean square of the original image. The observations are as follows: compression using polynomial functions of order 1, 2 and 3 gave low RMSE; compression with the 4 x 4 block size gave lower compression ratios and better image quality, while compression with the 16 x 16 block size gave poorer visual quality irrespective of the threshold level. For all images, compression with the 8 x 8 block size gave higher compression with less degradation in quality compared to the 16 x 16 block size.

Also, without changing the block size, a good compression is achieved with the 5% threshold level with negligible degradation in the quality of the image as compared to the 1% and 2% threshold levels. For instance, for the "Fort" image with the 4 x 4 block size, the PSNR and CR with a threshold of 1% of the mean square of the original image are 31.2665 and 20.6084 respectively, while with a threshold of 2% they are 31.2528 and 18.3398 respectively. These results show that, without changing the block size, moving to the 2% threshold causes only about a 0.04% decrease in image quality but about an 11% gain in compression. Similarly, with the 4 x 4 block size, the PSNR and CR at the 2% threshold are 31.2528 and 18.3398 respectively, and at the 5% threshold they are 31.2236 and 16.2299 respectively, showing that moving to the 5% threshold causes only about a further 0.09% decrease in quality but a further 11.5% gain in compression.
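For clarity, the percentages quoted above are relative changes computed directly from the reported values; for example, between the 1% and 2% thresholds, (31.2665 − 31.2528) / 31.2665 ≈ 0.0004, i.e., about 0.04% in PSNR, and (20.6084 − 18.3398) / 20.6084 ≈ 0.11, i.e., about 11% in the compression ratio.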

REFERENCES
[1] Besl P and Jain R (1988), "Segmentation through variable-order surface fitting", IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 10, pp. 167-192.
[2] Eden M, et al. (1986), "Polynomial representation of pictures", Signal Processing, vol. 10, pp. 385-393.
[3] Said A and Drukarev A. I. (1999), "Simplified segmentation for compound image compression", in Proceedings of ICIP, pp. 229-233.
[4] Salah Ameer, "Investigating polynomial fitting schemes for image compression", Ph.D. thesis, Waterloo, Ontario.
[5] Kieu T and Nguyen D (2001), "Surface fitting approach for reducing blocking artifacts in low bit rate Discrete Cosine Transform decoded images", Proc. IEEE Region 10 Int. Conf. on Electrical and Electronic Technology, vol. 1, pp. 23-27.
[6] Andrews H. C., Tescher A. G. and Kruger R. P. (1972), "Image Processing by Digital Computers", IEEE Spectrum, 9, 20-32.

Dr. R Srinivasa Rao is working as an Assistant Professor at Vignana Jyothi Institute of Management, Hyderabad. Prof. V. V. Haragopal is a Professor in the Department of Statistics at Osmania University, Hyderabad.




AUTOMATED HEALTHCARE

Towards Automated Healthcare for the Masses Using Text Mining Analytics Vishnu S. Pendyala Abstract— The current health system cannot be sustained, given its demand for ever-increasing financial and human capital resources. This article argues the need for deploying text mining analytics in healthcare, so as to make it available to large populations of the world. The article presents a brief survey of the work towards this goal and points to two other papers written by the author on the topic for those interested in the specifics. Index Terms—Text Mining, Healthcare, Expert System, Information Retrieval, TF-IDF

————————————————————

1 INTRODUCTION

The recently released National Health Policy 2017 by the Government of India proposes to assure healthcare for all Indian citizens, falling just short of making healthcare a fundamental right. It is a move in the right direction, but it still seems an ambitious goal. Even when the USA spent more than 17% of its GDP on healthcare (Source: http://worldbank.org), it could not assure healthcare as a fundamental right to its citizens. India's National Health Policy proposes a time-bound increase to a mere 2.5% of GDP for healthcare. It is a fact that a vast portion of the world's population does not even have access to proper healthcare, and the cost of healthcare is steeply increasing. Both are serious concerns that need to be addressed. If feasible, automated diagnosis, that is, machines doing the diagnosis instead of human doctors, will substantially help in both respects. The constitution of the World Health Organization considers the "highest attainable standard of health" a fundamental right of the people. How governments provision this right is left to the nations, and sadly, many governments are not in a position to pass on this right to their citizens; access to even basic healthcare is not at an acceptable level in many regions of the world.

2 INTERNATIONAL EFFORTS
Pandemics and disasters will only increase with time because people travel. But more than pandemics and communicable diseases, as populations prosper, it is the Non-Communicable Diseases (NCDs) such as diabetes and cardiovascular disease that become more life threatening. The two apex bodies in their respective fields, the International Telecommunication Union (ITU) and the World Health Organization, launched a four-year initiative to use mobile technologies to combat NCDs such as hypertension and cancer. The project has demonstrated improvements in a) disease and epidemic outbreak tracking, b) mobile healthcare telephone help lines, c) treatment compliance, d) appointment reminders, e) community mobilization, f) mobile surveys (surveys by mobile phone), g) surveillance, h) patient monitoring, i) information and decision support systems, pregnancy care advice by SMS, and j) patient record keeping. However, it still leaves much to be done. With the increasing demand for financial and human capital, it is almost impossible for the current healthcare system to scale to future needs. We therefore need an automated way of supplementing the healthcare system. We want to invent drastic solutions to help the masses.

3 TECHNOLOGICAL APPROACHES
An analysis of past and present trends shows that computing technology has been advancing at an exponential rate from the earliest times known to mankind. The human genome sequencing project, which cost more than a billion dollars a decade ago (Source: http://www.genome.gov), is now targeted to be done for just a thousand US dollars.


A substantial number of tasks that only humans could do are now completely automated, resulting in a far better quality of life. In spite of all this progress, one function that has remained an unsuccessful target for automation for decades is human expertise. While other branches of Artificial Intelligence (AI) have progressed in leaps and bounds over these decades, the Expert Systems branch of AI, which models human expertise, has not really taken off even after so many years. Expert Systems used for medical diagnosis, such as MYCIN, developed almost four decades ago, actually outperformed human experts even then, with the limited infrastructure of the day. Still, in spite of the huge gains in infrastructure and computing speeds, automated medical diagnosis for mass and general application is not widely used, for various reasons, some of which are beyond the scope of this paper. It may still be worth revisiting that idea, but with an entirely different approach, with the hope that one day automated general medical diagnosis, not specific to any disease or condition, becomes a reality.

4 PROBLEM REORIENTED
Earlier Expert Systems were rule based. Knowledge was modeled and reasoned over using first-order-logic systems. A number of rules were put in place so that the system could reason with them. The end result was still Information Retrieval (IR): retrieving the information needed by the user. We can therefore reorient the problem primarily as one of Information Retrieval and not as a knowledge engineering or expert reasoning problem. Developments in IR now make it possible to attempt self-diagnosis by doing an Internet search. It is quite typical to search for symptoms on the Internet to get an idea of the disease before receiving professional help. So, it can be hypothesized that IR as a technology is mature enough to be used for professional medical diagnosis.

The conventional programming paradigm has been to generalize from small data. The current trend is to personalize from big data using IR tools. The legacy of deterministic programming is fast yielding to the new model of probabilistic programming to deal with the uncertainty in big data.

5 ENABLING FACTORS
A few key enablers can make this happen: a) Mobile is doing what the World Wide Web did in the nineties, which is taking technology and services to the masses. b) Machines have already proved that diagnosis can be automated. c) We already have ways to cure at the molecular level by replacing mutated or damaged genes. d) There are companies which manufacture wearables that can measure the vitals and other health signals such as ECG. e) We so far could not replace the mind, but we have been successful in replicating some of its functionality. All these pieces of the puzzle are combined to unfold a vision for making automated diagnosis available to the masses using mobile devices connected to the cloud infrastructure, which is presented in the author's work cited as [11] and [12].

6 METHODOLOGY
The crux of the work described in these two papers, [11] and [12], can be summarized in a few sentences. The first step is to collect huge sets of "discharge sheets", which contain a systematic description of the symptoms and the diagnosis of a medical expert. This is the big data, or "text corpus", involved in the project. A sample discharge sheet is shown in Figure 1. Each discharge sheet is plotted as a vector in a multidimensional space: each word in the corpus is a dimension, and the value along a dimension is the TF-IDF score of that word for the document. TF-IDF stands for "Term Frequency–Inverse Document Frequency." It is a measure of a word's relative importance to a document. If a word occurs in all the documents in the corpus, its importance to any single document is low; hence the term "inverse" in front of "document frequency."



Figure 1 A Sample Discharge Sheet

The discharge sheets are now vectors in a multidimensional space. When a new set of symptoms arrives, it can be documented and plotted in the same multidimensional space using TF-IDF scores, just as described above. The only difference from the discharge sheets is that this new document of symptoms does not have a diagnosis listed on it. If we can find a discharge sheet that is similar to the document of symptoms, we can use the diagnosis listed on that discharge sheet for the given set of symptoms. Finding a similar discharge sheet is a matter of identifying the closest vector to the vector representing the symptoms, which can be done by computing the cosine similarity that we learnt in high school math.
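As an illustration of this retrieval step, here is a minimal Python sketch (an illustration only, not the implementation from [11] or [12]) that uses scikit-learn's TfidfVectorizer and cosine similarity to find the discharge sheet closest to a new set of symptoms; the variable names and the tiny in-memory corpus are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical corpus: free text of previously diagnosed discharge sheets.
discharge_sheets = [
    "fever cough chest pain ... diagnosis: pneumonia",
    "polyuria excessive thirst fatigue ... diagnosis: diabetes mellitus",
    "headache photophobia neck stiffness ... diagnosis: migraine",
]

# Plot each discharge sheet as a TF-IDF vector in the multidimensional term space.
vectorizer = TfidfVectorizer()
sheet_vectors = vectorizer.fit_transform(discharge_sheets)

# A new, undiagnosed document of symptoms, plotted in the same space.
symptoms = "patient reports fever with cough and chest pain"
symptom_vector = vectorizer.transform([symptoms])

# Cosine similarity identifies the closest discharge sheet; its recorded
# diagnosis would then be proposed for the new set of symptoms.
similarities = cosine_similarity(symptom_vector, sheet_vectors)[0]
closest = similarities.argmax()
print("Most similar discharge sheet:", discharge_sheets[closest])
```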

7 IMPROVING THE ACCURACY BY USING RELATED WORK
The ideas presented in the two papers can be combined with other related work to improve the accuracy of the diagnosis. For instance, using the augmented-reality-enabled Web, an image of a patient's MRI can be superimposed on his body to provide further information to the diagnosis application, which can be used to further confirm or fine-tune the diagnosis done using text mining. Missing data is a common problem with clinical repositories, including discharge sheets. Purwar et al. [2] present a novel way of imputing missing values using simple K-means clustering and use it for predicting disease onset, with highly accurate results. The authors of [3] apply a sequential learning framework to model and predict the progression of Alzheimer's Disease in a patient; the dataset they used for the purpose is humongous clinical diagnosis data, particularly comprising medical images of brain scans. Karamanli and others [4] use Artificial Neural Networks on clinical data to predict cases of Obstructive Sleep Apnea. Constantinou et al. [5] use Bayesian Networks to model expertise to support medical decisions.

As mentioned in the introduction, self-diagnosis is becoming popular, and [6] examines its effectiveness by evaluating symptom checker apps, concluding that there are issues with the symptom checkers. The authors of [7] use K-means clustering for knowledge extraction to improve the prediction of traumatic brain injury survival rates. A layered approach to data collection, management, and providing services out of the data is presented in [8] using Cloud and Big Data Analytics. The authors of [9] present a mobile application that deploys an intelligent classifier to predict heart disease, using machine learning algorithms on clinical data; integrated with the mobile application is a real-time monitoring component that constantly monitors the patient and raises an alarm when the vitals flag an emergency. In [10], the authors discuss organizing breast imaging examinations and mining them for structured reporting. Other techniques for diagnosis include using Multiple Logistic Regression (MLR) and Sequential Feature Selection (SFS) on a Coronary Artery Disease (CAD) dataset to select features, as described in [1], and applying a Neuro-Fuzzy Classifier (NFC) for CAD diagnosis. All these techniques can be used in an ensemble approach along with the text mining technique described in this article to automate medical diagnosis and improve its accuracy.


8 CONCLUSION
In this paper, we listed a few ways in which diagnosis can be automated. Self-diagnosis by doing a Web search often results in correct identification of the problem, indicating that the information available online is quite useful when it comes to medical diagnosis. Any medical diagnosis application may be missing out significantly if it does not leverage the humongous amount of information available on the Web. This is probably one future direction that interested readers can explore.

REFERENCES
[1] Marateb, H. R., & Goudarzi, S. (2015). A noninvasive method for coronary artery diseases diagnosis using a clinically interpretable fuzzy rule-based system. Journal of Research in Medical Sciences: the official journal of Isfahan University of Medical Sciences, 20(3), 214.
[2] Purwar, A., & Singh, S. K. (2015). Hybrid prediction model with missing value imputation for medical data. Expert Systems with Applications, 42(13), 5621-5631.
[3] Xie, Q., Wang, S., Zhu, J., Zhang, X., & Alzheimer's Disease Neuroimaging Initiative. (2016). Modeling and predicting AD progression by regression analysis of sequential clinical data. Neurocomputing, 195, 50-55.
[4] Karamanli, H., Yalcinoz, T., Yalcinoz, M. A., & Yalcinoz, T. (2016). A prediction model based on artificial neural networks for the diagnosis of obstructive sleep apnea. Sleep and Breathing, 20(2), 509-514.
[5] Constantinou, A. C., Fenton, N., Marsh, W., & Radlinski, L. (2016). From complex questionnaire and interviewing data to intelligent Bayesian network models for medical decision support. Artificial Intelligence in Medicine, 67, 75-93.
[6] Semigran, H. L., Linder, J. A., Gidengil, C., & Mehrotra, A. (2015). Evaluation of symptom checkers for self diagnosis and triage: audit study.
[7] Rodger, J. A. (2015). Discovery of medical Big Data analytics: Improving the prediction of traumatic brain injury survival rates by data mining Patient Informatics Processing Software Hybrid Hadoop Hive. Informatics in Medicine Unlocked, 1, 17-26.
[8] Zhang, Y., Qiu, M., Tsai, C. W., Hassan, M. M., & Alamri, A. (2015). Health-CPS: healthcare cyber-physical system assisted by cloud and big data.
[9] Otoom, A. F., Abdallah, E. E., Kilani, Y., Kefaye, A., & Ashour, M. (2015). Effective diagnosis and monitoring of heart disease. Heart, 9(1).
[10] Margolies, L. R., Pandey, G., Horowitz, E. R., & Mendelson, D. S. (2016). Breast imaging in the era of Big Data: Structured reporting and data mining. AJR. American Journal of Roentgenology, 206(2), 259.
[11] Pendyala, V. S., Fang, Y., Holliday, J., & Zalzala, A. (2014, October). A text mining approach to automated healthcare for the masses. In Global Humanitarian Technology Conference (GHTC), 2014 IEEE (pp. 28-35). IEEE.
[12] Pendyala, V. S., & Figueira, S. (2017, April). Automated medical diagnosis from clinical data. In IEEE Third International Conference on Big Data Computing Service and Applications (BigDataService), 2017. IEEE.

Vishnu S. Pendyala is a Senior Member of IEEE and the Computer Society of India, with over two decades of software experience with industry leaders like Cisco, Synopsys, Informix (now IBM), and Electronics Corporation of India Limited. Vishnu received the Ramanujam memorial gold medal at the State Math Olympiad and was a successful leader during his undergraduate years. He also played an active role in the Computer Society of India and was the Program Secretary for its annual convention, which was attended by over 1500 delegates. Marquis Who's Who has selected Vishnu's biography for inclusion in multiple of its publications for multiple years. He is currently authoring a book on a Big Data topic to be published by Apress / Springer.



NASSCOM

NASSCOM Think Tank Discussion on Talent and Skill Development for Data Science in India: A Report Saumyadipta Pyne and Meghana Aruru Abstract— NASSCOM held its ‘Data Science – Talent & Skill Development’ think-tank meeting on Thursday, June 22, 2017, at the Hyderabad International Convention Centre in Hyderabad, Telangana State, India. The meeting brought together industry and academic representatives to participate in a discussion on employment opportunities and training needs in the areas of Big Data Analytics (BDA) and Artificial Intelligence (AI). Following is a report on this significant event. Index Terms—Data Science, Big Data Analytics (BDA), and Artificial Intelligence (AI).

————————————————————

1 INTRODUCTION

NASSCOM's think tank focused on a crucial aspect of advancing Data Science in India. The specific objectives of the meeting were: 1. to identify specific job roles in BDA/AI; 2. to develop an understanding of how companies are skilling and reskilling their workforce; and 3. to evaluate the BDA/Data Science needs of the market. Following introductions of all the participants, the meeting began with the NASSCOM representatives presenting an overview of the analytics market in India and its projected growth. Currently, there are 600+ analytics firms operating in India, hiring 120,000+ professionals. Approximately 60% of these firms are directly engaged in BDA.

In terms of the market, key emerging forces such as ageing populations, disruptive technologies, geo-politics, etc., provide a push to build capacity and train professionals in the tools and techniques of BDA and AI. An estimated 150 million direct jobs were identified in 40 sectors, along with 200 million indirect jobs, indicating the need to train and develop analysts to meet the needs of a growing market.

The NASSCOM team presented the results of an exhaustive search of BDA/AI job listings on the professional networking website LinkedIn. Role descriptions and pathways to promotion were identified. Specifically, 6 new job roles in the Big Data space and 12 in the AI space were mentioned. Meeting participants were asked to look into each position title and the corresponding required skill sets, as well as role ambiguity or role overlap, if any, with other positions.

The following sections describe the issues and opportunities across different areas as identified in the meeting.

2 ROLES AND TITLES IN BDA AND AI

The primary agenda of this discussion section was to identify the new roles and skills needed beyond the traditional roles, the numbers of existing and potential jobs, and the challenges associated with timely recruitment and training.

A discussion on what constitutes a 'data scientist' ensued, followed by role descriptions across various sectors. In terms of BDA/AI roles, each sector uses its own terminology and definition of what constitutes a particular role. Such definitions may not be transferrable across sectors and may present challenges in recruitment and skilling/reskilling. For example, it was proposed that an employee moving from the finance sector to the supply chain/logistics sector may need to be reskilled in specific domains. Such reskilling is neither easy nor readily available, thus making it harder to map the role in the latter case. There was agreement around the table on developing a mechanism for mapping roles across industries and making job descriptions more consistent. Participants expressed a need to identify training and assessment modalities for capacity building.

3 CAPACITY BUILDING AND TRAINING

This discussion section was centered around three major questions: 1. How would industry propose to handle unskilled workers and train them? 2. How would industry propose to reskill workers into specific domains? 3. What are the common basic skills necessary for a data scientist?


One of the first and foremost areas of concern mentioned was a lack of depth in Machine Learning in general. Some industry representatives mentioned that they would prefer cross-domain ability, i.e., a blend of domain and technology expertise and a wider talent mix, over focused expertise in single domains. Other representatives noted that they found gaps in skills; e.g., an expert in Natural Language Processing may not be as skilled in statistical modeling. To this, some expressed that reskilling could be more difficult than hiring freshers and skilling them. Yet other representatives suggested filling the gap of data scientists/analysts by skilling non-analyst/non-technical employees. In terms of training and capacity building, partnerships with academia were suggested. Outcomes-based performance through programs such as the U.S. certification program INFORMS was discussed. It was proposed that job descriptions and roles be made consistent across industries to make capacity building easier. Industry representatives identified MOOCs on platforms such as Coursera and edX that may be helpful but would need to be mapped against existing courses and curricula within the country to identify gaps and opportunities.

The need for a common baseline program to prepare for data science positions, covering basic skills like mathematics, statistics and programming languages as well as structured thinking and client orientation, was discussed. Some experts highlighted the need for special skills (such as streams, graphs, etc.) required for innovating analytical products and Big Data solutions, and for tests to benchmark the same. Participants agreed that the Indian market would need significant capacity building and training at the academic level to develop BDA/AI professionals who would be market-ready by 2020. An urgency to develop such training was expressed, along with quality metrics and standards for certification/licensure, etc. Participants agreed that the job descriptions and roles found on LinkedIn were highly varied and needed mapping across industries/sectors. The meeting ended with a resolve from the experts to continue the role mapping process in both academia and industry, towards reducing the employment gaps in BDA/AI for the Indian market.

Prof. Saumyadipta Pyne and Dr. Meghana Aruru are faculty members at the Indian Institute of Public Health, Hyderabad, focusing on the areas of health analytics and health communications. They attended the NASSCOM think-tank meeting on 'Data Science – Talent & Skill Development' in Hyderabad on June 22, 2017.



CFP

COMPUTER SOCIETY OF INDIA

The Flagship Publication of the Special Interest Group on Big Data Analytics

Call for Papers

Visleshana is the official publication dedicated to the area of Big Data Analytics from the Computer Society of India (CSI), the first and the largest body of computer professionals in India. Current and previous issues can be accessed at https://issuu.com/visleshana. Submissions, including technical papers, in-depth analyses, and research articles in IEEE transactions format (compsoc)* are invited for publication in "Visleshana", the flagship publication of SIGBDA, CSI, in topics that include but are not limited to the following:
• Big Data Architectures and Models
• The 'V's of Big Data: Volume, Velocity, Variety, Veracity, Value, Visualization
• Cloud Computing for Big Data
• Big Data Persistence, Preservation, Storage, Retrieval, Metadata Management
• Natural Language Processing Techniques for Big Data
• Algorithms and Programming Models for Big Data Processing
• Big Data Analytics, Mining and Metrics
• Machine Learning Techniques for Big Data
• Information Retrieval and Search Techniques for Big Data
• Big Data Applications and their Benchmarking, Performance Evaluation
• Big Data Service Reliability, Resilience, Robustness and High Availability
• Real-Time Big Data
• Big Data Quality, Security, Privacy, Integrity, and Fraud Detection
• Visualization Analytics for Big Data
• Big Data for Enterprise, Vertical Industries, Society, and Smart Cities
• Big Data for e-Governance and Policy
• Big Data Value Creation: Case Studies
• Big Data for Scientific and Engineering Research
• Supporting Technologies for Big Data Research
• Detailed Surveys of Current Literature on Big Data
All submissions must be original, not under consideration for publication elsewhere or previously published. The Editorial Committee will review submissions for acceptance. Please send the submissions to the Editor, Vishnu S. Pendyala, at visleshana@gmail.com.

* Manuscript templates can be downloaded from

https://www.ieee.org/publications_standards/publications/cs_template_latex.tar (LaTeX) or https://www.ieee.org/publications_standards/publications/cs_template_word.zip (Word)

