






REGULAR FEATURES




New Products


Tips & Tricks




EDITORIAL, SUBSCRIPTIONS & ADVERTISING DELHI (HQ) D-87/1, Okhla Industrial Area, Phase I, New Delhi 110020 Ph: (011) 26810602, 26810603; Fax: 26817563 E-mail:




BACK ISSUES Kits ‘n’ Spares New Delhi 110020 Ph: (011) 26371661, 26371662 E-mail:


Ph: 011-40596600 E-mail:


MUMBAI Ph: (022) 24950047, 24928520 E-mail: BENGALURU Ph: (080) 25260394, 25260023 E-mail:

Hive: The SQL-like Data Warehouse Tool for Big Data

PUNE Ph: 08800295610/ 09870682995 E-mail: GUJARAT Ph: (079) 61344948 E-mail: JAPAN Tandem Inc., Ph: 81-3-3541-4166 E-mail:

Keras: Building Deep Learning Applications with High Levels of Abstraction



SINGAPORE Publicitas Singapore Pte Ltd Ph: +65-6836 2272 E-mail: TAIWAN J.K. Media, Ph: 886-2-87726780 ext. 10 E-mail: UNITED STATES E & Tech Media Ph: +1 860 536 6677 E-mail:

“AI must be viewed in a holistic manner”
Arjun Vishwanathan, associate director, emerging technologies, IDC India

Using jq to Consume JSON in the Shell

December 2017

Solus 3 GNOME: Linux for your desktop
Solus is an operating system that is designed for home computing. It ships with a variety of software out-of-the-box, so you can set it up without too much fuss.

On the DVD:
• Solus 3 GNOME
• A collection of open source software for Windows

Recommended system requirements: P4, 1GB RAM, DVD-ROM drive. Any objectionable material, if found on the DVD, is unintended, and should be attributed to the complex nature of Internet data. In case this DVD does not work properly, write to us at support@ for a free replacement. E-mail: cdteam@efy.

Subscription rates:
Five years: newsstand price ₹7200, you pay ₹4320
Three years: newsstand price ₹4320, you pay ₹3030
One year: newsstand price ₹1440, you pay ₹1150
Overseas (one year): US$ 120

Kindly add ₹50/- for outside Delhi cheques. Please send payments only in favour of EFY Enterprises Pvt Ltd. Non-receipt of copies may be reported to—do mention your subscription number.

Printed, published and owned by Ramesh Chopra. Printed at Tara Art Printers Pvt Ltd, A-46,47, Sec-5, Noida, on 28th of the previous month, and published from D-87/1, Okhla Industrial Area, Phase I, New Delhi 110020. Copyright © 2017. All articles in this issue, except for interviews, verbatim quotes, or unless otherwise explicitly mentioned, will be released under Creative Commons Attribution-NonCommercial 3.0 Unported License a month after the date of publication. Refer to for a copy of the licence. Although every effort is made to ensure accuracy, no responsibility whatsoever is taken for any loss due to publishing errors. Articles that cannot be used are returned to the authors if accompanied by a self-addressed and sufficiently stamped envelope. But no responsibility is taken for any loss or delay in returning the material. Disputes, if any, will be settled in a New Delhi court only.


FOSSBYTES
Compiled by: OSFY Bureau

Azure Functions gets Java support

Support for Java functions has been added to Microsoft’s Azure Functions serverless computing platform. The new beta inclusion is in addition to the existing support for JavaScript, C#, F#, Python, PHP, Bash, PowerShell and Batch code. Azure Functions has received all the features of the Java runtime, such as triggering options, data bindings and serverless models, with auto-scaling. The new support comes as an addition to the company’s recently announced capability to run the Azure Functions runtime on .NET Core. Developers with Java skills can use their existing tools to build new creations using Azure Functions. There is also support for plugins, and Microsoft has enabled native integration of Maven projects using a specific plugin. Azure Functions’ serverless computing platform already supports a list of development languages and platforms. It competes with Amazon Web Services’ AWS Lambda, which is widely known for its out-of-the-box serverless experience. Oracle, too, has recently announced its Fn project, which competes with Azure Functions.

Debian 9.2 ‘Stretch’ brings out 66 security fixes

The Debian Project has announced the second maintenance update to the Debian 9 ‘Stretch’ operating system. Debuting as version 9.2, the new platform includes a number of new features and security patches. The official announcement confirms that the new point release is not a new version of Debian 9, but merely improves the included packages. Therefore, instead of performing a clean install of Debian 9.2, you can opt for Debian’s up-to-date mirror.

Canonical drops 32-bit Ubuntu desktop ISO

Canonical has finally decided to drop support for the 32-bit live ISO release of the Ubuntu distribution. With most architectures today being 64-bit, it was only a matter of time before Linux distros stopped releasing 32-bit ISOs. Confirming the development, Canonical engineer Dimitri John Ledkov wrote, “…remove Ubuntu desktop i386 daily-live images from the release manifest for beta and final milestones of 17.10 and therefore, do not ship ubuntu-desktop-i386.iso artifact for 17.10.” It is worth noting that Canonical will only stop building the 32-bit Ubuntu Desktop Live ISO. The company will continue to support i386, which is becoming more of a purpose-built architecture for embedded devices. Canonical mainly wants to focus its efforts on the Internet of Things (IoT), where 32-bit x86 is still very common. You can continue to install Ubuntu on your 32-bit machines; however, Canonical will no longer release any new live ISOs for these machines. Canonical will continue to release minimal network installation ISOs for 32-bit hardware. These images will receive updates and security patches until the next announcement. Alongside Canonical, open source distributions such as Arch Linux have also recently phased out 32-bit support to encourage users to switch to newer hardware. 64-bit processors started becoming common after the launch of the AMD Opteron and Athlon 64 in 2003. Today, every mainstream processor available in the market is based on either the AMD64 or Intel 64 architecture.

“The Debian Project is pleased to announce the second update of its stable distribution Debian 9 (codenamed ‘Stretch’). This point release mainly adds corrections for security issues, along with a few adjustments for serious problems,” read the official announcement. Debian GNU/Linux 9.2 includes a total of 87 bug fixes and 66 new security improvements. Various apps and core components have also been improved in this release. Another notable change is the inclusion of Linux kernel 4.9.51 LTS. If you keep your Debian Stretch installation updated, you need not update these packages using the point release. The detailed changelog is published on the official Web page. | OPEN SOURCE FOR YOU | DECEMBER 2017 | 7


OpenMessaging debuts to provide an open standard for distributed messaging

The Linux Foundation has announced a new project to bring about an open standard for distributed messaging. Called OpenMessaging, this project is aimed at establishing a governance model and structure for companies working on messaging APIs. Leading Internet companies like Yahoo, Alibaba, Didi and Streamlio are contributing to the OpenMessaging project, primarily to solve the challenges of messaging and streaming applications. The open standard design from this project will be deployed in on-premise, cloud and hybrid infrastructure models. Scaling with messaging services is a big problem. The lack of compatibility between wire-level protocols and standard benchmarking is the most common issue faced. When data gets transferred across different streaming and messaging platforms, compatibility becomes a problem. Additional resources as well as higher maintenance costs are the main complaints about messaging platforms. Existing solutions lack standardised guidelines for fault tolerance, load balancing, security and administration. The needs of modern cloud-oriented messaging and streaming applications are very different. The Linux Foundation plans to address all these issues with the OpenMessaging project. The project is also designed to address the issue of redundant work for developers, and make it easier to meet the cutting-edge demands around smart cities, IoT and edge computing. The project contributors plan to facilitate a standard benchmark for application testing and enable platform independence.

SUSE Linux Enterprise Server for SAP applications coming to the IBM Cloud

SUSE has announced that SUSE Linux Enterprise Server for SAP applications will be available as an operating system for SAP solutions on the IBM Cloud. In addition, IBM Cloud is now a SUSE cloud service provider, giving customers a supported open source platform that makes them more agile and reduces operating costs as they only pay for what they use. “Customers need access to a secure and scalable cloud platform to run mission-critical workloads, one with the speed and agility of IBM Cloud, which is one of the largest open public cloud deployments in the world,” said Phillip Cockrell, SUSE VP of worldwide alliance sales. “As the public cloud grows increasingly more popular for production workloads, SUSE and IBM are offering enterprise-grade open source software fully supported on IBM Cloud. Whether big iron or public cloud, SUSE is committed to giving our customers the environments they need to succeed,” he added. Jay Jubran, director of offering management, IBM Cloud, said, “IBM Cloud is designed to give enterprises the power and performance they need to manage their mission-critical business applications. IBM provides a spectrum of fully managed and Infrastructure as a Service solutions to support SAP HANA applications, including SUSE Linux Enterprise Server as well as new bare metal servers with up to 8TB of memory.”

Spotlight on adopting serverless technologies

According to Gartner, “By 2022, most Platform as a Service (PaaS) offerings will evolve to a fundamentally serverless model, rendering the cloud platform architectures dominant in 2017 as legacy architectures.” Serverless is one of the hottest technologies in the cloud space today. The Serverless Summit organised on October 27 in Bengaluru by CodeOps Technologies put a spotlight on serverless technologies. The conference helped bring together people who are passionate about learning and adopting serverless technologies in their organisations. With speakers from three different continents and 250 participants from all over India, the event was a wonderful confluence of experts, architects, developers, DevOps practitioners, CXOs and enthusiasts.

The highlight of the conference was the keynote by John Willis (of ‘The DevOps Handbook’ fame) who travelled all the way from the US for the event. He talked about ‘DevOps in a Serverless World’ covering the best practices and how they manifest in a serverless environment. He also conducted a post-conference workshop on DevOps principles and practices. Serverless technology is an interesting shift in the architecture of digital solutions, where there is a convergence of serverless architecture, containers, microservices, events and APIs in the delivery of modular, flexible and dynamic solutions. This is what Gartner calls the ‘Mesh App and Services Architecture’ (or MASA, for short). With that theme, there were sessions on serverless frameworks and platforms like the open source Fn platform and Kubernetes frameworks (especially Fission), Adobe’s I/O runtime, and Microsoft’s Azure platform. Serverless technology applications covered at the event included sessions like ‘Serverless and IoT (Internet of Things) devices’, ‘Serverless and Blockchain’, etc. The hands-on sessions included building chatbots and artificial intelligence (AI) applications with serverless architectures. The conference ended with an interesting panel discussion between Anand Gothe (Prowareness), Noora (Euromonitor), John Willis (SJ Technologies), Sandeep Alur (Microsoft) and Vidyasagar Machupalli (IBM). Open Source For You (OSFY) was the media partner and the Cloud Native Computing Foundation (CNCF) was the community partner for the conference.

Linux support comes to Arduino Create

The Arduino team has announced a new update to the Arduino Create Web platform. The initial release has been sponsored by Intel and supports X86/X86_64 boards. This enables fast and easy development and deployment of Internet of Things (IoT) applications with integrated cloud services on Linux-based devices. With Arduino Create supporting Linux on Intel chips, users are now able to program their Linux devices as if these were regular Arduinos. The new Arduino Create features a Web editor, as well as cloud-based sharing and collaboration tools. The software provides a browser plugin, letting developers upload sketches to any connected Arduino board from the browser. Arduino Create now allows users to manage individual IoT devices, and configure them remotely and independently from where they are located. To further simplify the user journey, the Arduino team has also developed a novel out-of-the-box experience that will let anyone set up a new device from scratch via the cloud, without any previous knowledge, by following an intuitive Web-based wizard. In the coming months, the team has plans to expand support for Linux based IoT devices running on other hardware architectures too.

Microsoft announces new AI, IoT and machine learning tools for developers

At Connect(); 2017, Microsoft’s annual event for professional developers, executive VP Scott Guthrie announced Microsoft’s new data platform technologies and cross-platform developer tools. These tools will help increase developer productivity and simplify app development for intelligent cloud and edge technologies, across devices, platforms or data sources. Guthrie outlined the company’s vision and shared what is next for developers across a broad range of Microsoft and open source technologies. He also touched on key application scenarios and ways developers can use built-in artificial intelligence (AI) to support continuous innovation and continuous deployment of today’s intelligent applications. “With today’s intelligent cloud, emerging technologies like AI have the potential to change every facet of how we interact with the world,” Guthrie said. “Developers are in the forefront of shaping that potential. Today at Connect(); we’re announcing new tools and services that will help developers build applications and services for the AI-driven future, using the platforms, languages and collaboration tools they already know and love,” he added. Microsoft is continuing its commitment to delivering open technologies and contributing to and partnering with the open source community.


New version of Red Hat OpenShift Container Platform launched for hybrid cloud environments

Red Hat has launched Red Hat OpenShift Container Platform 3.7, the latest version of Red Hat’s enterprise-grade Kubernetes container application platform. The new platform helps IT organisations to build and manage applications that use services from the data centre to the public cloud. The newest iteration is claimed to be the industry’s most comprehensive enterprise Kubernetes platform; it includes native integrations with Amazon Web Services (AWS) Service Brokers that enable developers to bind services across AWS and on-premise resources to create modern applications while providing a consistent, open standards-based foundation to drive business evolution. “Modern, cloud-native applications are not monolithic stacks with clear-cut needs and resources; so to more effectively embrace modern applications, IT organisations need to re-imagine how their developers find, provision and consume critical services and resources across a hybrid architecture. Red Hat OpenShift Container Platform 3.7 addresses these needs head-on by providing hybrid access to services through its service catalogue, enabling developers to more easily find and bind the necessary services to their business-critical applications—no matter where these services exist—and adding close integration with AWS to further streamline cloud-native development and deployment,” said Ashesh Badani, vice president and general manager, OpenShift, Red Hat. Red Hat OpenShift Container Platform 3.7 will ship with OpenShift Template Broker, which turns any OpenShift Template into a discoverable service for application developers using OpenShift. OpenShift Templates are lists of OpenShift objects that can be implemented within specific parameters, making it easier for IT organisations to deploy reusable, composite applications comprising microservices.
Also included with the new platform is OpenShift Ansible Broker for provisioning and managing services through the OpenShift service catalogue by using Ansible to define OpenShift Services. OpenShift Ansible Broker enables users to provision services both on and off the OpenShift platform, helping to simplify and automate complex workflows involving varied services and applications across on-premise and cloud-based resources.

Announcing the general availability of Bash in Azure Cloud Shell

Microsoft has announced the availability of Bash in Azure Cloud Shell. Bash in Cloud Shell comes equipped with commonly used CLI tools, including Linux shell interpreters, Azure tools, text editors, source control, build tools, container tools, database tools, and more. Justin Luk, programme manager, Azure Compute, announced that Bash in Cloud Shell will provide an interactive Web-based, Linux command line experience from virtually anywhere. With a single click through the Azure portal, Azure documentation, or the Azure mobile app, users will gain access to a secure and authenticated Azure workstation to manage and deploy resources from a native Linux environment held in Azure. Bash in Cloud Shell will enable simple, secure authentication to use Azure resources with Azure CLI 2.0. Azure file shares enable file persistence through CloudDrive to store scripts and settings.

MariaDB Foundation gets a new platinum level sponsor

MariaDB Foundation recently announced that Microsoft has become a platinum sponsor. The sponsorship will help the Foundation in its goals to support continuity and open collaboration in the MariaDB ecosystem, and to drive adoption, serving an ever-growing community of users and developers. “Joining the MariaDB Foundation as a platinum member is a natural next step in Microsoft’s open source journey. In addition to Microsoft Azure’s strong support for open source technologies, developers can use their favourite database as a fully managed service on Microsoft Azure that will soon include MariaDB,” said Rohan Kumar, GM for database systems at Microsoft. Monty Widenius, founder of MySQL and MariaDB, stated, “Microsoft is here to learn from and contribute to the MariaDB ecosystem. The MariaDB Foundation welcomes and supports Microsoft towards this goal.” One of the fundamental principles in Azure is about choice. Customers of Azure will now be able to run the apps they love, and Microsoft wants to make sure that the MySQL and MariaDB experience on Windows and Linux hosts in Azure is excellent. Microsoft’s community engagement through open source foundations helps to nurture and advance the core technologies that the IT industry relies upon. MariaDB is a natural partner to Microsoft, as it is the fastest growing open source database. In most Linux distributions, MySQL has already been completely replaced with MariaDB.


Audacity 2.2.0 released with an improved look and feel

Audacity, a popular open source audio editing software, has received a significant update. The new version, dubbed Audacity 2.2.0, comes with four pre-configured, user-selectable themes. This enables you to choose the look and feel for Audacity’s interface. It also has playback support for MIDI files, and better organised menus. Around 198 bugs have been fixed in this newly released version — one of the major changes is the improved recovery from full file system errors. The menus are shorter and clearer than in previous Audacity versions, and have been simplified without losing functionality. The most commonly used functions are found in the top levels of the menus. The functions that have moved down into lower sub-menus are better organised. You can download the update to try it out on Windows/Mac or any Linux based operating system.

Blender tool to be used in French animation movie

The soon-to-be-made animated movie ‘I Lost My Body’ will use the open source Blender software tool. The film will combine Blender’s computer graphics with hand-drawn elements. At the recent Blender conference in Amsterdam, French filmmaker Jérémy Clapin and his crew gave a presentation on the processes and tools to be used in the making of ‘I Lost My Body’. The film will start production next year, with a likely release in 2019, adding to the open source animation showreel, thanks to Blender software.

All new Ubuntu 17.10 released

With the new release of Ubuntu, there is some good news for GNOME lovers. After a long time, Ubuntu has come up with some major changes. The new release has GNOME as the default desktop environment instead of Unity. Ubuntu 17.10 comes with the newest software enhancements and nine months of security and maintenance updates. It is based on the Linux Kernel release series 4.13. It includes support for the new IBM z14 mainframe CPACF instructions and new KVM features. 32-bit installer images are no longer provided for Ubuntu Desktop. Apart from this, GDM has replaced LightDM as the default display manager. The login screen now uses virtual terminal 1 instead of virtual terminal 7. Window control buttons are back on the right for the first time since 2010. Apps provided by GNOME have been updated to 3.26. Driverless printing support is now available for IPP Everywhere, Apple AirPrint and Wi-Fi Direct. LibreOffice has been updated to 5.4 and Python 2 is no longer installed by default. Python 3 has been updated to 3.6. Ubuntu 17.10 will be supported for nine months until July 2018. If you need long term support, it is recommended you use Ubuntu 16.04 LTS instead.

End of Linux Mint with the KDE Desktop environment

Linux Mint founder, Clement Lefebvre, announced in a blog post that the upcoming Linux Mint 18.3 Sylvia operating system will be the last release to feature a KDE edition. In the post, Lefebvre said, “Users of the KDE edition represent a portion of our user base. I know from their feedback that they really enjoy it. They will be able to install KDE on top of Linux Mint 19 of course, and I’m sure the Kubuntu PPA will continue to be available. They will also be able to port Mint software to Kubuntu itself, or they might want to trade a bit of stability away and move to a bleeding edge distribution such as Arch to follow upstream KDE more closely.” He added: “KDE is a fantastic environment but it’s also a different world, one which evolves away from us and away from everything we focus on. Their apps, their ecosystem and the Qt toolkit, which is central there, have very little in common with what we’re working on.” The bottom line is that Linux Mint 19 will be available only in Cinnamon, Xfce and MATE editions.

For more news, visit


Guest Column

Anil Seth

Exploring Software

Aadhaar Could Simplify Life in Important Ways

All of you have perhaps been busy fulfilling the requirements of linking your Aadhaar numbers with various organisations like banks, mobile phone service providers, etc, just as I have been. I am reminded of the time when relational databases were becoming popular. Back then, the IT departments had to worry about consistency of data when normalising it and removing duplication.


My name seems simple enough but it gets routinely misspelled. All too often it does not matter and you, like me, may choose to ignore the incorrect spelling. We tend to be more particular that the address is correct with an online shopping store, even if the name is misspelled. On the other hand, you will want to make sure that the name is correct on a travel document, even if there is a discrepancy in the address. Reconciling data can be a very difficult process. Hence, it comes as a surprise that, once PAN is linked to the bank accounts and Aadhaar is linked to the PAN, banks are forced to link the accounts with Aadhaar as well, creating the potential for discrepancies, especially as companies do not have Aadhaar and one person can create many companies. This set me thinking about some areas where the UID would have saved me a lot of effort and possibly, at no risk to my identity in the virtual world. Obviously, the use cases are for digitally comfortable citizens and should never be mandatory.

When registering a will

While an unregistered will may be valid, the formalities become much simpler if the will is registered. A local government office told me that registering a will is simple — just bring two witnesses, one of whom should be a gazetted officer (I wonder if there is an app to find one). It would be much simpler if I register the will using an Aadhaar verification, using biometrics. No witnesses needed. Now, no one needs to know what I would like to happen after I am no longer around.

When registering a nominee

If the Aadhaar ID of a nominee is mentioned, the nominee does not need to provide any documentation or witnesses other than the death certificate, for the nomination formalities to be completed. Even the verification of the person can be avoided if any money involved is transferred to the account linked to that Aadhaar ID.

Notarisation of documents

The primary purpose of notarisation is to ensure that the document is an authentic copy and the person signing could be prosecuted if the information therein is incorrect. This requires you to submit physical paper documents, whereas the desire is to be online and paperless. An easy option is to seed the public key with the Aadhaar database. It does not need to be issued by a certification authority. Any time a person digitally signs a digital document with his private key, it can be verified using the public key. There is no need to worry about securing access to the public key as, by its very nature, it is public. This can save on court time as well. In case of any disputes, no witnesses need be called; and even after many years, there will be no need to worry about the fallibility of the human mind.

Elimination of life certificates

Even after linking the Aadhaar number to the pension account, one still needs to go through the annual ritual of physically proving that one is alive! I am reminded of an ERP installation where the plant manager insisted on keeping track of the production status at various stages, against our advice. It took a mere week for him to realise his error. While his group was spending more time creating data, he himself was drowning in it. He had less control over production than he had before the ERP system was installed. Since we had anticipated this issue, it did not take us long to change the process to capture the minimum data, as we had recommended. Pension-issuing authorities should assume that the pensioner is alive till a death certificate is issued. The amount of data needed in the latter case is considerably less!

Lessons from programming

Most programmers learn from experience that exception handling is the crucial part of a well written program. More often than not, greater effort is required to design and handle the exceptions. Efficient programming requires that the common transactions take the minimal resources. The design and implementation must minimise the interactions needed with a user and not burden the user by providing unnecessary data. One hopes that any new usage of UID will keep these lessons in mind.

By: Dr Anil Seth
The author has earned the right to do what interests him. You can find him online at http://sethanil. and reach him via email at



Sandya Mannarswamy

In this month’s column, we discuss a real life NLP problem, namely, detecting duplicate questions in community question-answering forums.


hile we have been discussing many questions in machine learning (ML) and natural language processing (NLP), I had a number of requests from our readers to take up a real life ML/NLP problem with a sufficiently large data set, discuss the issues related to this specific problem and then go into designing a solution. I think it is a very good suggestion. Hence, over the next few columns, we will be focusing on one specific real life NLP problem, which is detecting duplicate questions in community question-answering (CQA) forums. There are a number of popular CQA forums such as Yahoo Answers, Quora and StackExchange where netizens post their questions and get answers from domain experts. CQA forums serve as a common means of distilling crowd intelligence and sharing it with millions of people. From a developer perspective, sites such as StackOverflow fill an important need by providing guidance and help across the world, 24x7. Given the enormous number of people who use such forums, and their varied skill levels, many questions get asked again and again. Since many users have similar informational needs, answers to new questions can typically be found either in whole or part from the existing question-answer archive of these forums. Hence, given a new incoming question, these forums typically display a list of similar or related questions, which could immediately satisfy the information needs of users, without them having to wait for their new question to be answered by other users. Many of these forums use simple keyword/tag based techniques for detecting duplicate questions. However, often, these automated lists returned by the forums are not accurate, frustrating users looking for answers. Given the challenges in identifying duplicate questions, some forums put in manual effort to tag duplicate questions. However, this is not scalable, given the rate at which new


questions get generated, and the need for specific domain expertise to tag a question as a duplicate. Hence, there is a strong requirement for automated techniques that can help in identifying questions that are duplicates of an incoming question.

Note that identifying duplicate questions is different from identifying ‘similar/related’ questions. Identifying similar questions is somewhat easier, as it only requires that there be considerable similarity between a question pair. On the other hand, in the case of duplicate questions, the answer to one question can serve as the answer to the second question. This identification requires stricter and more rigorous analysis.

At first glance, it appears that we can use various text similarity measures in NLP to identify duplicate questions. However, given that people express their information needs in widely different forms, it is a big challenge to identify exact duplicate questions automatically. For example, let us consider the following two questions:

Q1: I am interested in trying out local cuisine. Can you please recommend some local cuisine restaurants that are wallet-friendly in Paris?
Q2: I like to try local cuisine whenever I travel. I would like some recommendations for restaurants which are not too costly, but serve authentic local cuisine, in Athens.

Now consider applying different forms of text similarity measures. The above two questions score very high on various similarity measures: lexical, syntactic and semantic. While it is quite easy for humans to focus on the one dissimilarity, namely that the locations discussed in the two questions are different, it is not easy to teach machines that ‘some dissimilarities are more important than other dissimilarities.’ It also raises the question of whether the two words ‘Paris’ and ‘Athens’ would be considered extremely dissimilar at all. Given that one of the popular techniques for measuring word similarity these days is the use of word-embedding techniques such as Word2Vec, it is highly probable that ‘Paris’ and ‘Athens’ end up being mapped as reasonably similar, since they are both European capital cities and often appear in similar contexts.

Let us consider another example.

Q1: What’s the fastest way to get from Los Angeles to New York?
Q2: How do I get from Los Angeles to New York in the least amount of time?

While there may not be good word-based text similarity between the above two questions, the information needs of both questions are satisfied by a common answer, and hence this question pair needs to be marked as a duplicate. Let us consider yet another example.

Q1: How do I invest in the share market?
Q2: How do I invest in the share market in India?

Though Q1 and Q2 have considerable text similarity, they are not duplicates, since Q2 is a more specific question and, hence, cannot share the same answer as Q1. These examples illustrate the challenges involved in identifying duplicate questions.

Having chosen and defined our task, let us now decide on our data set. Last year, the CQA forum Quora released a data set for the duplicate question detection task. This data set was also used in a Kaggle competition involving the same task, so let us use it for our exploration. It is available at https://. Please download the train.csv and test.csv files for your exploratory data analysis.

Given that this was run as a Kaggle competition, there are a lot of forum discussions on Kaggle regarding the various solutions to this task. While I would encourage readers to go through them to enrich their knowledge, we are not going to use any non-text features as we attempt to solve this problem. For instance, many of the winners have used question ID as a feature in their solution.
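To make the similarity pitfall concrete, here is a small illustrative sketch (not from the column itself; it assumes scikit-learn is installed, and the exact score depends on the vectoriser settings) that computes the TF-IDF cosine similarity of the Paris/Athens question pair:

```python
# Illustrative sketch: surface similarity of two questions that are NOT duplicates.
# A bag-of-words view scores them as similar even though the answers differ
# entirely because of the single token 'Paris' vs 'Athens'.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

q1 = ("I am interested in trying out local cuisine. Can you please recommend "
      "some local cuisine restaurants that are wallet-friendly in Paris?")
q2 = ("I like to try local cuisine whenever I travel. I would like some "
      "recommendations for restaurants which are not too costly, but serve "
      "authentic local cuisine in Athens?")

# Vectorise both questions and compute the cosine of the angle between them.
vectors = TfidfVectorizer().fit_transform([q1, q2])
score = cosine_similarity(vectors[0], vectors[1])[0, 0]
print(f"TF-IDF cosine similarity: {score:.2f}")
```

Any non-trivial positive score here is a reminder that lexical overlap alone cannot separate duplicates from near-duplicates.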
Some others have used graph features, such as the number of neighbours a duplicate question pair has compared to a non-duplicate question pair. However, we felt that these features are extraneous to the text and quite dependent on the data. Hence, in order to arrive at a reliable solution, we will only look at text-based features in our approaches.

As with any ML/NLP task, let us begin with some exploratory data analysis. Here are a few questions for our readers (note: most of these tasks are quite easy, and can be done with simple commands in Python using Pandas, so I urge you to try them out).
1. How many entries are there in train.csv?
2. What columns are present in train.csv?
3. Can you find out whether this is a balanced data set or not? How many of the question pairs are duplicates?
4. Are there any NaNs present in the entries of the Question 1 and Question 2 columns?


5. Create a Bag of Words classifier and report its accuracy.

I suggest that readers (specifically those who have just started exploring ML and NLP) try these experiments and share the results in a Python Jupyter notebook. Please do send me a pointer to your notebook and we can discuss it in this column. Another exercise that is usually recommended is to go over the actual data and see what types of questions are marked as duplicates and what are not.

It would also be good to do some initial text exploration of the data set. I suggest that readers use the Stanford CoreNLP toolkit for this purpose, because it is more advanced in its text analysis than NLTK. Since Stanford CoreNLP is Java based, you need to run it as a server and talk to it through a Python client package. Please try the following experiments on the Quora data set.
1. Identify the different named entities present in the Quora train and test data sets. Can you cluster these entities?
2. Stanford CoreNLP supports parse trees. Can you use them for different types of questions, such as ‘what’, ‘where’, ‘when’ and ‘how’ questions?

While we can apply many of the classical machine learning techniques after identifying the appropriate features, I thought it would be more interesting to focus on some neural network based approaches, since the data set is sufficiently large (Quora actually used a random forest classifier initially). Next month, we will focus on some simple neural network based techniques to attack this problem.

I also wanted to point out a couple of NLP problems related to this task. One is the task of textual entailment recognition where, given a premise statement and a hypothesis statement, the task is to recognise whether the hypothesis follows from the premise, contradicts the premise or is neutral to it. Note that textual entailment is a 3-class classification problem. Another closely related task is that of paraphrase identification.
Given two statements S1 and S2, the task is to identify whether S1 and S2 are paraphrases. Some of the techniques that have been applied for paraphrase identification and textual entailment recognition can be leveraged for our task of duplicate question identification. I’ll discuss more on this in next month’s column. If you have any favourite programming questions/ software topics that you would like to discuss on this forum, please send them to me, along with your solutions and feedback, at sandyasm_AT_yahoo_DOT_com.
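The exploratory steps suggested above can be sketched as follows. This is only an illustrative outline, not part of the column: a tiny hand-made sample stands in for Quora's train.csv (the column names question1, question2 and is_duplicate follow the Kaggle data set, and the example questions are hypothetical), and scikit-learn supplies a naive bag-of-words baseline.

```python
# Sketch of the suggested exploratory data analysis and a bag-of-words baseline.
# Replace the toy DataFrame with pd.read_csv("train.csv") for the real exercise.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

train = pd.DataFrame({
    "question1": [
        "How do I invest in the share market?",
        "What's the fastest way to get from Los Angeles to New York?",
        "How do I learn Python?",
        "What is machine learning?",
    ],
    "question2": [
        "How do I invest in the share market in India?",
        "How do I get from Los Angeles to New York in the least amount of time?",
        "What is the best way to learn Python?",
        "How do airplanes fly?",
    ],
    "is_duplicate": [0, 1, 1, 0],
})

# Questions 1-4: size, columns, class balance and missing values.
print(len(train), list(train.columns))
print(train["is_duplicate"].value_counts(normalize=True))
print(train[["question1", "question2"]].isna().sum())

# Question 5: a naive bag-of-words baseline over the concatenated question pair.
pairs = train["question1"] + " " + train["question2"]
X = CountVectorizer().fit_transform(pairs)
clf = LogisticRegression().fit(X, train["is_duplicate"])
print("training accuracy:", clf.score(X, train["is_duplicate"]))
```

Note that concatenating the two questions throws away the pairwise structure; it is deliberately the weakest reasonable baseline against which the neural approaches discussed next month can be compared.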

By: Sandya Mannarswamy
The author is an expert in systems software and is currently working as a research scientist at Conduent Labs India (formerly Xerox India Research Centre). Her interests include compilers, programming languages, file systems and natural language processing. If you are preparing for systems software interviews, you may find it useful to visit Sandya’s LinkedIn group ‘Computer Science Interview Training (India)’.

| OPEN SOURCE FOR YOU | DECEMBER 2017 | 15

NEW PRODUCTS

Touch based gesture control headphones from Beyerdynamic
Audio equipment manufacturer Beyerdynamic has launched the Aventho wireless headphones with sound customisation technology. The headphones have been developed by Berlin based Mimi Hearing Technologies. These headphones sport touch based gesture control on the right ear cup, through which users can receive and disconnect calls, increase or decrease volume levels, etc. They use the Bluetooth 4.2 protocol with the aptX HD codec from Qualcomm, which guarantees the best sound even without wires. The Aventho wireless headphones come with Tesla sound transducers to offer great acoustic performance. The compact size and cushioned cups ensure comfort over long hours. Additional features include an impedance of 32 ohms, a transmission rate of 48kHz/24 bits, and a play time of more than 20 hours on a single charge. The Beyerdynamic Aventho headphones are packed in a sturdy fabric bag and are available in black and brown at retail stores.
Price: ₹ 24,999
Address: Beyerdynamic India Pvt Ltd, 1, 10th Main Road, Malleshwaram West, Bengaluru, Karnataka 560003

Google Pixel 2 now available in India
Search giant Google has launched two successors to its first smartphone in its Pixel hardware line: the Pixel 2 and Pixel 2 XL. The Pixel 2 comes with an ‘always-on’ display of 12.7cm (5 inches) with a full HD (1920 x 1080 pixels) AMOLED panel at 441ppi, and 2.5D Corning Gorilla Glass 5 protection. The Pixel 2 XL comes with a full 15.2cm (6 inches) QHD (2880 x 1440 pixels) pOLED panel at 538ppi, with 3D Corning Gorilla Glass protection. The Pixel 2 is powered by a 1.9GHz octa-core Qualcomm Snapdragon 835 processor and runs on Android’s latest 8.0.0 Oreo OS version. The Pixel 2 XL also runs on Android Oreo, and comes with the Adreno 540 GPU. On the camera front, both variants sport a 12.2 megapixel rear and an 8 megapixel front camera, which have optical and electronic image stabilisation along with fixed focus. Both phones offer a RAM of 4GB and internal storage of 64GB or 128GB (in two variants), which is not further expandable. Users can enjoy unlimited online storage for photos and videos. With a battery of 2700mAh for the Pixel 2 and 3520mAh for the Pixel 2 XL, the smartphones are designed with stereo front-firing speakers, and a headphone adaptor to connect a 3.5mm jack. On the connectivity and wireless front, the devices offer Wi-Fi 2.4GHz+5GHz 802.11 a/b/g/n/ac 2x2 MIMO with Bluetooth 5.0+LE, NFC, GPS, etc. The feature-packed smartphones have an aluminium unibody with a hybrid coating, and IP67 standard water and dust resistance. Both smartphones are available online and at retail stores in black, white and blue.
Price: ₹ 61,000 and ₹ 70,000 for the 64GB and 128GB variants of the Pixel 2; ₹ 73,000 and ₹ 82,000 for the 64GB and 128GB variants of the Pixel 2 XL.
Address: Google Inc., Unitech Signature Tower-II, Tower-B, Sector-15, Part-II, Village Silokhera, Gurugram, Haryana 122001; Ph: 91-12-44512900

iBall’s latest wireless keyboard is silent
iBall, a manufacturer of innovative technology products, has launched a wireless ‘keyboard and mouse’ set, which promises a silent workspace at home or the office. The set comes with unique silent keys and buttons for quiet typing and a distraction-free work environment. Crafted with a rich, piano-like sheen, the set adds elegance to the desk space. The ultra-slim and stylish keyboard has a sealed membrane, which ensures greater durability and reliability. It is designed with 104 special soft-feel keys, including a full numeric keypad. The wireless mouse features blue-eye technology and an optical tracking engine, which ensures responsive and accurate cursor movement, enabling users to work on any surface, ranging from wood to glass, the sofa or even a carpet. The high-speed 1600cpi mouse allows users to adjust its speed as per requirements. The device offers reliable performance up to 10 metres over 2.4GHz wireless transmission.
Price: ₹ 3,499
The keyboard and mouse set is available at all the leading stores across India with a three-year warranty.
Address: iBall, 87/93, Mistry Industrial Complex, MIDC Cross Road A, Near Hotel Tunga International, Andheri East, Mumbai, Maharashtra 400093; Ph: 02230815100

Voice assistant speakers launched by Amazon
Amazon has launched its voice assistant speaker in three variants: Echo Dot, Amazon Echo and Echo Plus. The devices connect to ‘Alexa’, a cloud based voice assistant service that helps users play music, set alarms, get information, access a calendar, get weather reports, etc. The speakers support Wi-Fi 802.11 a/b/g/n and the advanced audio distribution profile (A2DP). They enable 360-degree omni-directional audio to deliver crisp vocals and dynamic bass. With seven microphones, beamforming technology and noise cancellation, the speakers can take commands from any direction, even in noisy environments or while playing music. They are also capable of assisting users in controlling lights, switches, etc, with compatible connected devices, or even in ordering food online.
The Amazon Echo Dot is the most compact and affordable version, with a 1.52cm (0.6 inches) tweeter. The Amazon Echo and Echo Plus are larger variants, with 6.35cm (2.5 inches) woofers, and 1.52cm (0.6 inches) and 2.03cm (0.8 inches) tweeters, respectively. All the speakers come with four physical buttons on the top to control the volume, the microphone, etc. At the bottom, the devices have a power port and a 3.5mm audio output.

The prices, features and specifications are based on information provided to us, or as available on various websites and portals. OSFY cannot vouch for their accuracy.

Price: ₹ 3,149 for Echo Dot, ₹ 6,999 for Amazon Echo and ₹ 10,499 for Echo Plus
All three variants are available in black, grey and white.
Address: Amazon India, Brigade Gateway, 8th Floor, 26/1, Dr Rajkumar Road, Malleshwaram West, Bengaluru, Karnataka 560055
Compiled by: Aashima Sharma


The Growing Popularity of Bug Bounty Platforms

After a spate of high-profile ransomware and malware attacks infiltrated IT systems worldwide, Indian enterprises are now sitting up and adopting bug bounty programmes to protect their applications from hacking attacks.


The global security threat scenario has changed radically in recent times. If hackers of yore were mainly hobbyists testing the security limits of corporate systems as an intellectual challenge, the new threat comes from well-concerted plans hatched by criminal gangs working online with an eye to profit, or to compromise and damage information technology systems. The widespread hack attacks have also become possible because of the high degree of connectivity of devices, like smartphones, laptops and tablets, that run a variety of operating systems.

When consumer data gets compromised, it has an immediate impact on the brand and reputation of the affected company, as was evident when Verizon cut its purchase price for Yahoo by US$ 350 million after an online portal revealed that it had been repeatedly hacked. When the data of a company gets compromised and this is followed by frequent attempts to conceal the fact after the incident, it can seriously affect whether customers will continue to deal with the company in any way. In the final analysis, customers are not willing to put their data at risk with a vendor who does not value and protect their personal information. India has not been spared in this regard. Recent reports allege that customer data was compromised at telecom giant Reliance Jio and, before that, at online restaurant guide Zomato.

Companies need to team up with the right kind of hackers. Organisations cannot on their own match the wiles of the thousands of very smart hackers, and this battle cannot be fought with internal resources alone. Companies need to build a culture of information-sharing on security issues with government CERTs (computer emergency response teams), security companies and security researchers.

Advertorial

Countering malicious hackers needs a large number of ‘ethical hackers’, also known as ‘white hats’, who will probe your systems just as any hacker would, but responsibly report to you any vulnerabilities in your system. Many of them do this work for recognition, so don’t hesitate to name the person who helped you, and do appreciate the fact that they are spending a lot of their time identifying the security holes in your systems.

This concept is not new. It has been tried by a number of Internet, information technology, automobile and core industry companies. Google, Facebook, Microsoft, ABN AMRO, Deutsche Telekom and the US Air Force are some of the many organisations that have set up their own reward programmes. And it has helped these companies spot bugs in their systems that were not evident to their own in-house experts, because the more pairs of eyes checking your code, the better.

Some companies might hesitate to work with hobbyist researchers, since it is difficult to know, for example, whether they are encouraging criminal hackers or not. What if the hobbyists steal company data?

As more and more organisations go digital, startups now offer their services through Web or mobile applications, so their only assets are their software apps and customer data. Once a data breach happens, customer credentials get stolen or denial of service attacks occur, leading to huge losses in revenue, reputation and business continuity. By becoming part of a bug bounty platform, companies can create a security culture within the organisation.

Indian companies have a unique advantage if they decide to crowdsource the identification of security vulnerabilities in their IT infrastructure, since the country has one of the largest numbers of security researchers, who are part of the crowd willing to help organisations spot a bug before a criminal does.
The 2017 Bugcrowd report cited 11,663 researchers in India who worked on bug bounty programmes, behind only the US with about 14,244 white hat hackers. While most of them have jobs or identified themselves as students, 15 per cent of bug hunters were fully engaged in the activity, with this number expected to increase, according to Bugcrowd. Although Indian hackers earned over US$ 1.8 million in bounties in 2016-17, the bounties paid by Indian companies added up to a paltry US$ 50, according to HackerOne, indicating that local firms are not taking advantage of the crowdsourcing option.

Part of the reason is that Indian companies are still wary of having their security infrastructure, and any vulnerability in it, exposed to the public. This over-cautious approach could backfire in the long term, as it is always better to look for bugs cooperatively with responsible hackers in a controlled environment, rather than have the vulnerabilities eventually spotted and exploited by criminals.

Companies also take cover behind a smokescreen of denial when they are actually hit by cyber attacks, as Indian law does not make it mandatory to report security incidents to the CERT or any government agency. However, the regulatory framework is expected to change, with the Reserve Bank of India, for example, making it mandatory for banks to report cyber security incidents within two to six hours of the attacks being noticed.

Indian organisations also do not have a local platform for engaging with researchers, which would define the financial, technical and legal boundaries for the interaction in compliance with local regulations. Such a platform would give these companies the confidence that they can engage safely with people who are not on their payroll, even if their main objective is to hack for bugs.

Bug bounty platforms like SafeHats are connecting enterprises with white hat hacker communities in India. SafeHats, powered by Instasafe Technologies, a leading Security-as-a-Service provider, offers a curated platform that helps organisations create a responsible vulnerability disclosure policy that lays down the rules of engagement, empanels reputed researchers, and makes sure that the best and safest white hat hackers get to your systems before the bad guys do. SafeHats has been working with some leading banking organisations and e-commerce players to secure their applications. Once vulnerabilities are discovered, SafeHats helps to fix them and to launch secure apps to the market.

The key difference with this kind of platform is that organisations pay the security researchers only if a bug is found, and the amount paid is based on the severity of the bug. A large number of Indian enterprises are in dire need of tightening up their security, as the compute infrastructures of an increasing number of organisations are being breached. On the other hand, we see an opportunity for Indian companies to leverage the large talent pool of white hat hackers from India.
SafeHats in Bengaluru was born out of the need to bring Indian companies and hackers together in a safe environment. More organisations are now aware of their security needs after the high-profile WannaCry and Petya ransomware attacks. A lot of growth-stage startups have shown interest in adopting bug bounty programmes, as they have realised that application security is key to their next round of funding.

Sandip Panda, CEO of Instasafe, says, “Security is now an important topic in every organisation’s board room discussions. Investment in security is as important as investment in the product itself. Bug bounty platforms will create an entirely new security culture in India.”

By: Shasanka Sahu
The author works at Instasafe Technologies Pvt Ltd.

For U & Me

Interview Arjun Vishwanathan, associate director, emerging technologies, IDC India


What are the latest trends in the world of AI?

IDC predicts that by 2018, 75 per cent of enterprise and ISV development will include cognitive, AI or machine learning functionality in at least one application, including all business analytics tools. The adoption of AI solutions is set to grow at a fast pace, especially in the Asia Pacific region (excluding Japan) (APEJ). More than half of the organisations in this region are planning to adopt AI within a five-year timeline.

“AI must be viewed in a holistic manner”
Artificial intelligence (AI) is touching new heights across all verticals, including consumer services, e-commerce, mobile phones, life sciences and manufacturing, among others. But how will AI transform itself over time? Arjun Vishwanathan, associate director, emerging technologies, IDC India, discusses the transformation of AI in an exclusive conversation with Jagmeet Singh of OSFY. Edited excerpts...



What are your observations on the evolution of AI?

Global spending on cognitive and AI solutions will continue to see significant corporate investment over the next several years, achieving a compound annual growth rate (CAGR) of 54.4 per cent through 2020 when revenues will be more than US$ 46 billion. Around 59 per cent of organisations plan to make new software investments in cognitive or AI technologies, whereas 45 per cent will make new investments in hardware, IT services and business services. Data services have the lowest rank in all the categories. Overall, IDC forecasts that worldwide revenues for cognitive and AI systems will reach US$ 12.5 billion in 2017, an increase of 59.3 per cent over 2016.


In what ways can AI become the icing on the cake for enterprises moving towards digital transformation (DX)?

The adoption status of cognitive/AI solutions correlates highly with the information DX maturity of organisations. More organisations that have adopted AI solutions have moved into the later stages of information DX maturity, which is then managed and optimised by AI. To promote digital transformation that utilises IoT and cognitive systems, it is important for user companies to cultivate an ‘agile’ mindset. For instance, it is necessary to determine the ROI while using IoT and cognitive systems in actual situations.


How do you see the growth of machine learning and deep learning in the growing space of AI?

IDC predicts that by 2019, all effective IoT efforts will merge streaming analytics with machine learning trained on data lakes and content stores, accelerated by discrete or integrated processors. An increase in the use of machine learning will lower reliance on the programmatic model of development.


There is a belief that AI will one day become a major reason for unemployment in the IT world. What is your take on this?

AI must be viewed in a holistic manner. Having said as much, AI and cognitive developments are expected to make significant inroads into hitherto uncharted and mostly manual/human domains. IDC predicts that by 2022, nearly 40 per cent of operational processes will be self-healing and self-learning, minimising the need for human intervention or adjustments. Additionally, as IDC recently forecast, as much as 5 per cent of business revenues will come through interaction with a customer's digital assistant by 2019. All this merely proves that AI will increasingly complement businesses in driving new and more authentic experiences while also driving business value.

For U & Me


What are the obstacles slowing down the growth of AI nowadays?

The primary barriers to AI solutions include a shortage of skill sets, an understanding of vendor solutions, governance and regulatory implications.

IDC predicts that by 2022, nearly 40 per cent of operational processes will be self-healing and self-learning, minimising the need for human intervention or adjustments.


Do you think companies like Apple, Facebook, Google and Microsoft will take the current AI model to the next level in the future?

Amazon, Google, IBM and Microsoft certainly have the largest mindshare in the Asia-Pacific region. But domestic platforms, such as Alibaba PAI and Baidu PaddlePaddle, are even better known in local markets. Also, the IBM Watson platform has a larger mindshare compared with that of other bigger organisations.


Why should enterprises focus on enabling AI advances to move towards a profitable future?

Increased employee productivity and greater process automation are the most common expectations among organisations adopting or planning to adopt AI solutions. AI is presumed to bring significant business value to half of the organisations in the APAC region within two years. Customer service and support are the business processes that receive the most immediate benefits, whereas supply chain and physical assets management see the least urgency.

Open Source India 2017

OSI 2017: The Show Continues to Grow
The beautiful city of Bengaluru, threatened by thick dark clouds, kept convention delegates anxious about reaching the venue on time, since they also knew they would be braving the city's legendary traffic. Thankfully, the weather gods heard the OSI team's prayers and the clouds refrained from drenching the visitors, allowing many from the open source industry as well as enthusiasts from the community to reach the NIMHANS Convention Centre well before the 8:30 a.m. registration time.

While there were a lot of familiar faces, participants who have been loyally attending the event over the past years, it was also heartwarming to welcome new enthusiasts who had come on account of the positive word-of-mouth publicity the event has built up over the years. In terms of numbers, the 14th edition of the event, held on October 13 and 14, 2017, witnessed 2,367 unique visitors over the two days, breaking all previous attendance records. The event boasted of 70+ industry and community experts coming together to speak in the nine conference tracks and 14 hands-on workshops.

The visitors, as usual, comprised a cross-section of people in terms of their expertise and experience. Since the tracks for the


KEY FACTS
Show dates: October 13-14, 2017
Location: NIMHANS Convention Centre, Bengaluru, Karnataka, India
Number of exhibitors: 27
Brands represented: 33
Unique visitors: 2,367
Number of conferences: 09
Number of workshops: 14
Number of speakers: 70+

conferences are always planned with this diversity in mind, there were tracks for the creators (developers, project managers, R&D teams, etc) as well as for the implementers of open source software.

The star-studded speakers' list included international experts like Tony Wasserman (professor of the software management practice at Carnegie Mellon University, Silicon Valley), Joerg Simon (ISECOM and Fedora Project), and Soh Hiong (senior consultant, NetApp). There was also active participation from the government of India, with Debabrata Nayak (project director, NeGD, MeitY) and K. Rajasekhar (deputy director general, NIC, MeitY) delivering speeches at the event. Industry experts like Andrew Aitken (GM and global open source practice leader, Wipro Technologies), Sandeep Alur (director, technical engagements (partners), Microsoft Corporation India), Sanjay Manwani (MySQL India director, Oracle), Rajdeep Dua (director, developer relations, Salesforce), Gagan Mehra (director, information strategy, MongoDB), Rajesh Jeyapaul (architect, mentor and advocate, IBM), Valluri Kumar (chief architect, Huawei India) and Ramakrishna Rama (director, software, Dell India R&D) were amongst the 70+ experts who spoke at the event.

The many intriguing topics covered compelled visitors to stay glued to their seats till late in the evening on both days. A few topics that elicited special interest from the audience included a panel discussion on ‘Open Source vs Enterprise Open Source. Is this a Key Reason for the Success of Open Source?’, ‘Accelerating the Path to Digital with Cloud Data Strategy’, ‘Open Source - A Blessing or a Curse?’, ‘Intelligent Cloud and AI – The Next Big Leap’ and ‘IoT Considerations Decisive Open Source Policy’.

“We owe our success to the active participation of the community and the industry. It's exciting to see how this event, which had just a handful of exhibitors in its initial stages, has grown

to the stage at which, today, the who's who of open source are demonstrating their solutions to the audience. I hope that we continue to make this event even more exciting, and ensure it becomes one of the biggest open source events across the globe,” said Rahul Chopra, editorial director, EFY Group.

Delegates looking to buy workshop passes on the spot were disappointed, as all the workshops had sold out online even before the commencement of the event. Workshops like ‘Self-Service Automation in OpenStack Cloud’, ‘Building Machine Learning Pipelines with PredictionIO’, ‘Analyzing Packets using Wireshark’, ‘OpenShift DevOps Solution’, ‘Make your First Open Source IoT Product’ and ‘Tools and Techniques to Dive into the Mathematics of Machine Learning’ drew a lot of interest from the techie audience.

“We would like to thank our sponsors, Microsoft, IBM, Oracle, Salesforce, 2nd Quadrant, Wipro

Technologies, Zoho Corporation, Digital Ocean, SUSE, Siemens, Huawei and others for their valuable participation and support for the event. We look forward to having an even bigger showcase of open source technologies with the support of our existing partners as well as many new stalwarts from the tech industry,” said Atul Goel, vice president of events at EFY.

Adding to this, Rahul Chopra said, “With the overwhelming response of the tech audience and the demand for more sessions, we have decided to expand the event, starting from the 2018 edition. The event will now become a three-day affair instead of a two-day one. This means more tracks, more speakers, more workshops and more knowledge-sharing on open source. We have already announced the dates for the next edition. It's going to happen right here, at the same venue, on October 11, 12 and 13, 2018.”


Open Source India 2017

Key Tracks @ OSI 2017

Open Source and You (Success Stories)
This was a full-day track with multiple sessions, during which enterprise end users (CXOs, IT heads, etc) shared their success stories. This track was led by speakers like K. Rajasekhar (deputy director general, NIC, MeitY), Ravi Trivedi (founder,, Vikram Mehta (associate director, information security, MakeMyTrip), Dr Michael Meskes (CEO, credativ international GmbH) and Prasanna Lohar (head, technology, DCB Bank Ltd).

Application Development Day
This was a half-day track with multiple sessions on the role of open source in hybrid application development. Speakers included Vivek Sridhar (developer advocate, Digital Ocean), Rajdeep Dua (director, developer relations, Salesforce India) and Bindukaladeepan Chinasamy (senior technical evangelist, Microsoft), amongst others.

hot in OpenStack. This track was done in collaboration with the Open Stack community and attracted a lot of cloud architects, IT managers, CXOs, etc. Speakers for this track included famous OpenStack enthusiasts like Anil Bidari, Chandan Kumar, M. Ranga Swami and Janki Chhatbar, amongst others.

Cyber Security Day
Open source plays an important role with respect to security. So it was security experts, IT heads and project leaders working on mission-critical projects who attended this track, which had multiple sessions on understanding cyber security and open source. Speakers for this track included Biju George (co-founder, Instasafe), Sandeep Athiyarath (founder, FCOOS) and Joerg Simon (ISECOM and Fedora Project), amongst others.

Open Source in IoT
This half-day track with multiple sessions was on the role of open source in the Internet of Things (IoT). It was one of the most sought after tracks, and saw thought leaders like Adnyesh Dalpati (director, solutions architect and presales, Alef Mobitech), Rajesh Sola (education specialist, KPIT Technologies), Aahit Gaba (counsel, open source, HP Enterprise) and Ramakrishna Rama (director, software, Dell India R&D) sharing their knowledge on the subject.

The Cloud and Big Data

This was a half-day track with multiple sessions that helped IT managers/ heads in understanding the role of open source in different aspects of the cloud and in Big Data. This track had speakers like Sudhir Rawat (senior technical evangelist, Microsoft), Rajkumar Natarajan (CIO, Prodevans Technologies), Rajesh Jeyapaul (architect, mentor and advocate, A panel discus sion on ‘Open Source vs Ente IBM), Suman Debnath (project Is this a key re rprise Open Sou ason for the su rce: leader, Toshiba) and Sangeetha Priya ccess of Open Source?’ (director, Axcensa Technologies), amongst others.

Database Day Open source databases have always been of great importance. The event had speakers from leading database companies including Sujatha Sivakumar and Madhusudhan Joshi (Oracle India), Soh Hiong (senior consultant, NetApp), Pavan Deolasee (PostgreSQL consultant, 2nd Quadrant India), Gagan Mehra (director, Information Strategy, MongoDB) and Karthik P. R. (CEO and DB architect, Mydbops), amongst others.

OpenStack India This was a half-day track with multiple sessions on what’s

Container Day

This half-day track, conducted in collaboration with the Docker community, had leaders from the Docker community as well as experts from the industry share their thoughts. The speakers included Neependra Khare (founder, CloudYuga Technologies), Uma Mukkara (cofounder and COO, OpenEBS), Ananda Murthy (data centre solutions architect, Microfocus India) and Sudharshan Govindan (senior developer, IBM India), amongst others.

Open Source and You This half-day track was about the latest in open source and how its future is shaping up. This track had speakers like Krishna M. Kumar and Sanil Kumar D. (chief architects in cloud R&D, Huawei India), Vishal Singh (VP – IT infra and solutions, Eli Research India), Biju K. Nair (executive director, and Kumar Priyansh (developer, BackSlash Linux).

24 | December 2017 | OPEN SOURCE FOR YOU |

Open Source India 2017

Key workshops @ OSI 2017 Self-Service Automation in OpenStack Cloud (by Soh Hiong, senior consultant, NetApp) To build an OpenStack-powered cloud infrastructure, there is only one choice for block storage: SolidFire delivers a very comprehensive OpenStack block storage integration. This workshop helped participants learn how SolidFire's integration with OpenStack Cinder seamlessly enables self-service automation of storage and guarantees QoS for each and every application.

Building Machine Learning Pipelines with PredictionIO (by Manpreet Ghotra and Rajdeep Dua, Salesforce India) Apache PredictionIO is an open source machine learning server built on top of a state-of-the-art open source stack, for developers and data scientists to create predictive engines for any machine learning task. This workshop helped attendees understand how to build ML pipelines using PredictionIO.

OpenStack Cloud Solution (by Manoj and Asutosh, OpenStack consultants for Prodevans) The workshop helped attendees master their ability to work on Red Hat Enterprise Linux (RHEL) OpenStack platform services with Red Hat Ceph Storage, and implement advanced networking features using the OpenStack Neutron service.

OpenShift DevOps Solution (by Chanchal Bose, CTO, Prodevans) This was a hands-on workshop to get familiar with containers. It covered OpenShift's advantages and features as well as the DevOps CI/CD scenario on OpenShift.

Software Architecture: Principles, Patterns and Practices (by Ganesh Samarthyam, co-founder, CodeOps Technologies) Developers and designers aspiring to become architects, and hence wanting to learn about the architecture of open source applications using case studies and examples, participated in this workshop. It introduced key topics in software architecture, including architectural principles, constraints, non-functional requirements (NFRs), architectural styles and design patterns, viewpoints and perspectives, and architecture tools.

Ansible Automation (by Manoj, Sohail and Varsha, Ansible consultants for Prodevans) This workshop was for those with basic Linux knowledge and an interest in Linux systems administration. It covered an introduction to Ansible, and then highlighted its advantages and features. The key takeaway was how to automate infrastructure using Ansible.

Make your First Open Source IoT Product (by Arzan Amaria, senior solutions architect for the cloud and IoT, CloudThat) This workshop involved a hands-on session on building a small prototype of an ideal product using open source technologies. By following step-by-step instructions and with practical assistance, participants were able to build a connected device, a process that inspired them to innovate new products. The instructor shared the right logic and resources required for anyone to jumpstart the journey of IoT development.

Serverless Platforms: What to Expect and What to Ask For (by Monojit Basu, founder and director, TechYugadi IT Solutions and Consulting) The advent of serverless platforms evokes a feeling of déjà vu. This workshop narrowed resource utilisation down to the granularity of a function call while remaining in control of execution of the code! Serverless platforms offer various capabilities, and there are questions that users of these platforms need to be aware of. The goal of this workshop was to help attendees identify the right use cases for going serverless.

Hands-on experience of Kubernetes and Docker in action (by Sanil Kumar, chief architect, Cloud Team, Huawei India) This workshop provided exposure to and visualisation of the cloud from the PaaS perspective. It also introduced Kubernetes and containers (Docker). It was a hands-on session, designed to help participants understand how to start setting up and using Kubernetes and containers, apart from getting to learn application deployment in a cloud environment, inspecting containers, pods, applications, etc.

Analysing packets using Wireshark (by Sumita Narshetty, security researcher at QOS Technology) The workshop helped attendees understand packet capture and analyse packets using Wireshark. It covered different aspects of packet capture and the tools needed to analyse captured packets. Attendees also learned how to use Wireshark for troubleshooting network problems.

A workshop in progress at OSI 2017

Obstacle Avoidance Robot with Open Hardware (by Shamik Chakraborty, Amrita School of Engineering) This workshop explored the significance of robotics in Digital India and looked at how Make in India can be galvanised by robotics. It also propagated better STEM education for a better global job market for skilled professionals, and created a space for participants to envision the tech future.

Microservices Architecture with Open Source Framework (by Dibya Prakash, founder, ECD Zone) This workshop was designed for developers, architects and engineering managers. The objective was to discuss the highlevel implementation of the microservices architecture using Spring Boot and the JavaScript (Node.js) stack.

Hacking, Security and Hardening Overview for Developers – on the Linux OS and Systems Applications (by Kaiwan Billimoria, Linux consultant and trainer, kaiwanTECH) The phenomenal developments in technology, and especially in software-driven products (in domains like networking, telecom, embedded-automotive, infotainment, and now IoT, ML and AI), beg for better security in end products. Hackers are currently enjoying a field day and are only getting better at it, while product developers lag behind. This workshop was geared towards helping participants understand where software vulnerabilities exist, both while programming and after; OS hardening techniques; and what tools and methodologies help prevent and mitigate security issues.

Asheem Bakhtawar, regional director, India, Middle East and Africa, 2ndQuadrant India Pvt Ltd

Divyanshu Verma, senior engineering manager, Intel R&D

Tools and Techniques to Dive Into the Mathematics of Machine Learning (by Monojit Basu, founder and director, TechYugadi IT Solutions and Consulting) In order to build an accurate model for a machine learning problem, one needs better insights into the mathematics behind these models. For those primarily focused on the programming aspects of machine learning initiatives, this workshop gave the opportunity to regain a bit of mathematical context into some of the models and algorithms frequently used, and to learn about a few open source tools that will come in handy when performing deeper mathematical analysis of machine learning algorithms.

By: Omar Farooq The author is product head at Open Source India.

Balaji Kesavaraj head marketing, India and SAARC, Autodesk


Janardan Revuru Open Source Evangelist

Dibya Prakash founder, ECDZone

Dhiraj Khare national alliance manager, Liferay India



Reduce Security Risks with SELinux

Discover SELinux, a security module that provides an extra layer of access control security. It supports mandatory access control (MAC) and is an integral part of RHEL's security policy.


Security-Enhanced Linux, or SELinux, is an advanced access control mechanism built into most modern Linux distributions. It was initially developed by the US National Security Agency to protect computer systems from malicious tampering. Over time, SELinux was released as open source and various distributions have incorporated it into their code. To many systems administrators, SELinux is uncharted territory. It can seem quite daunting and, at times, even confusing. However, when properly configured, SELinux can greatly reduce a system's security risks, and knowing a bit about it can help you troubleshoot access-related error messages.

Basic SELinux security concepts

Security-Enhanced Linux is an additional layer of system security. The primary goal of SELinux is to protect the users' data from system services that have been compromised. Most Linux administrators are familiar with the standard user/group/other permissions security model, a user and group based model known as discretionary access control. SELinux provides an additional layer of security that is object based and controlled by more sophisticated rules, known as mandatory access control.

To allow remote anonymous access to a Web server, firewall ports must be opened. However, this gives malicious users an opportunity to crack the system through a security exploit: if they compromise the Web server process, they gain its permissions, that is, the permissions of the Apache user and Apache group. That user/group has read/write access to things like the document root (/var/www/html), as well as write access to /var, /tmp and any other directories that are world writable. Under discretionary access control, a compromised process can access any object that its permissions allow. But when SELinux enables mandatory access control, a particular context is given to every object. Every file, process, directory and port has a special security label, called an SELinux context. A context is a name that is used by the SELinux policy to determine whether a process can access a file, directory or port. By default, the policy does not allow any interaction unless an explicit rule grants access; if there is no rule, no access is allowed.

SELinux labels have several contexts: user, role, type and sensitivity. The targeted policy, which is the default policy in Red Hat Enterprise Linux, bases its rules on the third context, the type context. Type context names normally end with _t. The type context for the Web server is httpd_t. The type context for files and directories normally found in /var/www/html is httpd_sys_content_t, and for files and directories normally found in /tmp and /var/tmp it is tmp_t. The type context for Web server ports is httpd_port_t. There is a policy rule that permits Apache to access files and directories with a context normally found in /var/www/html and other Web server directories. There is no allow rule for files normally found in /tmp and /var/tmp, so access is not permitted: with SELinux, a malicious user who compromises the Web server process cannot access the /tmp directory. SELinux has rules for remote file systems such as NFS and CIFS, although all files on such file systems are labelled with the same context.
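On an SELinux system, labels can be viewed with commands like ls -Z and ps -Z. Even without such a system, the four-field structure is easy to illustrate; here is a minimal sketch that splits the Apache content label mentioned above into its fields (the label string itself is the only assumption):

```shell
# An SELinux context label has four colon-separated fields:
# user:role:type:sensitivity. Split an httpd content label with
# standard tools (no SELinux-enabled system is needed to run this).
ctx="system_u:object_r:httpd_sys_content_t:s0"
se_user=$(echo "$ctx" | cut -d: -f1)
se_role=$(echo "$ctx" | cut -d: -f2)
se_type=$(echo "$ctx" | cut -d: -f3)
se_level=$(echo "$ctx" | cut -d: -f4)
echo "type=$se_type"
```

The targeted policy bases its decisions on the third field, which is why the type component (httpd_sys_content_t here) is the one administrators deal with most often.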

SELinux modes

For troubleshooting purposes, SELinux protection can be temporarily disabled using SELinux modes. SELinux works in three modes: enforcing, permissive and disabled.

Enforcing mode: In the enforcing mode, SELinux actively denies access, for example, to a Web server attempting to read files with the tmp_t type context. In this mode, SELinux both logs the interactions and protects files.

Permissive mode: This mode is often used to troubleshoot issues. In permissive mode, SELinux allows all interactions, even if there is no explicit rule, and it logs the interactions that it would have denied in the enforcing mode. This mode can be used to temporarily allow access to content that SELinux is restricting. No reboot is required to go from enforcing mode to permissive mode.

Disabled mode: This mode completely disables SELinux. A system reboot is required to disable SELinux entirely, or to go from disabled mode to enforcing or permissive mode.

Figure 1: Checking the status of SELinux

SELinux status

To check the present status of SELinux, run the sestatus command on a terminal. It will tell you the mode SELinux is in.

# sestatus

Changing the current SELinux mode

Run the setenforce command with either 0 or 1 as the argument. A value of 1 specifies enforcing mode; 0 specifies permissive mode.

# setenforce

Figure 2: Changing the SELinux mode to enforcing mode

Setting the default SELinux mode

The configuration file that determines the SELinux mode at boot time is /etc/selinux/config. Note that it contains some useful comments. Use /etc/selinux/config to change the default SELinux mode at boot time. In the example shown in Figure 3, it is set to enforcing mode.

Figure 3: Default configuration file of SELinux

Initial SELinux context

Typically, the SELinux context of a file's parent directory determines its initial SELinux context; the context of the parent directory is assigned to newly created files. This works for commands like vim, cp and touch. However, if a file is created elsewhere and its attributes are preserved (as with mv or cp -a), the original SELinux context will be unchanged.

Figure 4: Checking the context of files
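Returning to the boot-time configuration mentioned above, the SELINUX= line in /etc/selinux/config can be extracted with standard text tools. A small sketch, using sample file content rather than reading a live system:

```shell
# Parse a sample of /etc/selinux/config (content is illustrative,
# not read from a real system).
sample='# This file controls the state of SELinux on the system.
SELINUX=enforcing
SELINUXTYPE=targeted'
# Only the SELINUX= line holds the boot-time mode; SELINUXTYPE names the policy.
mode=$(printf '%s\n' "$sample" | grep '^SELINUX=' | cut -d= -f2)
echo "boot-time mode: $mode"
```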

Changing the SELinux context of a file

There are two commands that are used to change the SELinux context of a file: chcon and restorecon. The chcon command changes the context of a file to the context specified as an argument to the command. Often, the -t option is used to specify only the type component of the context. The restorecon command is the preferred method for changing the SELinux context of a file or directory. Unlike chcon, the context is not explicitly specified when using this command; it uses rules in the SELinux policy to determine what the context of a file should be.

Defining SELinux default file context rules

The semanage fcontext command can be used to display or modify the rules that the restorecon command uses to set the default file context. It uses extended regular expressions to specify the path and filenames. The most common extended regular expression used in fcontext rules is (/.*)? which means "optionally match a / followed by any number of characters"; it matches the directory listed before the expression and everything in that directory, recursively. The restorecon command is part of the policycoreutils package, and semanage is part of the policycoreutils-python package. As shown in Figure 6, the context is preserved when the mv command is used, while the cp command does not preserve it; the copied file gets the same context as its parent directory. To reset the context, run restorecon, which relabels the files according to the rules for the parent directory. Figure 7 shows how to use semanage to add a context rule for a new directory: first, add a rule for the directory using the semanage fcontext command, and then use the restorecon command to apply it to the directory and all the files contained in it.
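The (/.*)? idiom is ordinary extended-regex syntax, so its effect can be checked with any tool that supports it, such as grep -E. A sketch using a hypothetical /virtual directory (no SELinux required):

```shell
# An fcontext pattern such as '/virtual(/.*)?' matches the directory itself
# and everything beneath it, but not mere prefix matches like /virtualbox.
matches=$(printf '%s\n' /virtual /virtual/www/index.html /virtualbox \
  | grep -cE '^/virtual(/.*)?$')
echo "$matches"   # two of the three paths match
```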

Figure 5: Restoring the context of a file from its parent directory

SELinux Booleans

SELinux Booleans are switches that change the behaviour of the SELinux policy. These are rules that can be enabled or disabled, and can be used by security administrators to tune the policy to make selective adjustments.



The getsebool command is used to display SELinux Booleans and their current values; the -a option makes it list all the Booleans. The setsebool command is used to modify them, and setsebool -P modifies the SELinux policy to make the modification persistent. semanage boolean -l will show whether or not a Boolean is persistent, along with a short description of it. To list only local modifications to the state of the SELinux Booleans (any setting that differs from the default in the policy), the -C option can be used with semanage boolean. In Figure 8, the Boolean was first modified, and then this modification was made persistent; the -C option was used with semanage to list the modifications.

Figure 6: Preserving the context of the file

Figure 7: Changing the context of a file

Troubleshooting in SELinux

Sometimes, SELinux prevents access to files on the server. Here are the steps that should be followed when this occurs.
• Before thinking of making any adjustments, consider that SELinux may be doing its job correctly by prohibiting the attempted access. If a Web server tries to access the files in /home, this could signal a compromise of the service if Web content isn't published by the users. If the access should be allowed, then additional steps need to be taken to solve the problem.
• The most common SELinux issue is an incorrect file context. This can occur when a file is created in a location with one file context, and moved into a place where a different context is expected. In most cases, running restorecon will correct the issue. Correcting issues in this way has a very narrow impact on the security of the rest of the system.
• Another remedy could be the adjustment of a Boolean. For example, the ftpd_anon_write Boolean controls whether anonymous FTP users can upload files. This Boolean may be turned on if you want to allow anonymous FTP users to upload files to a server.
• It is possible that the SELinux policy has a bug that prevents legitimate access. However, since SELinux has matured, this is a rare occurrence.

Figure 8: Changing the SELinux Boolean

By: Kshitij Upadhyay
The author is RHCSA and RHCE certified and loves to write about new technologies. He can be reached at

OSFY Magazine Attractions During 2017-18

March 2017: Open Source Firewall, Network Security and Monitoring
April 2017: Database Management and Optimisation
May 2017: Open Source Programming (Languages and Tools)
June 2017: Open Source and IoT
July 2017: Mobile App Development and Optimisation
August 2017: Docker and Containers
September 2017: Web and Desktop App Development
October 2017: Artificial Intelligence, Deep Learning and Machine Learning
November 2017: Open Source on Windows
December 2017: Big Data, Hadoop, PaaS, SaaS, IaaS and Cloud
January 2018: Data Security, Storage and Backup
February 2018: Best in the World of Open Source (Tools and Services)


How To Admin

Unit Testing Ansible Code with Molecule and Docker Containers

Molecule is an open source framework that is easy to use for validating and auditing Ansible code. It can be easily introduced into the CI/CD pipeline, thus keeping Ansible scripts relevant.


DevOps teams rely on 'Infrastructure as Code' (IaC) for productivity gains. Automation and speed are prerequisites in the cloud environment, where resources are identified by cattle nomenclature rather than pet nomenclature due to the sheer volumes involved. Ansible is one of the leading technologies in the IaC space. Its declarative style, ease of parameterisation and the availability of numerous modules make it a preferred framework to work with. Any code, if not tested regularly, gets outdated and becomes irrelevant over time, and the same applies to Ansible. Daily testing is a best practice and must be introduced for Ansible scripts too, for example, keeping track of the latest version of a particular software package during provisioning. Similarly, the dependency management repository used by apt and Yum

may have broken dependencies due to changes. Scripts may also fail if a dependent URL is not available. Those working with Ansible would have faced these challenges. This is where we introduce unit testing, which can run during the nightly build and can detect these failures well in advance. The Molecule project is a useful framework to introduce unit testing into Ansible code. One can use containers to test an individual role, or use an array of containers to test complex deployments. Docker containers are useful, as they save engineers from spawning multiple instances or using resource-hogging VMs in the cloud or on test machines. Docker is a lightweight technology that is used to verify the end state of the system; after the test, the provisioned resources are destroyed, thus cleaning up the environment.



Installing and working with Molecule is simple. Follow the steps shown below. First, get the OS updated as follows:

Table 1: A comparison of folders created by Ansible and Molecule

sudo apt-get update && sudo apt-get -y upgrade

Next, install Docker: sudo apt install

Now install Molecule with the help of Python Pip: sudo apt-get install python-pip python-dev build-essential sudo pip install --upgrade pip sudo pip install molecule

Both ansible-galaxy init and molecule init create an identical folder structure, and no changes are needed.









After the install, do a version check of Molecule:

molecule --version

If the Molecule version is not the latest, upgrade it as follows:

sudo pip install --upgrade molecule

It is always good to work with the latest version of Molecule, as there are significant changes compared to earlier versions. Enabling or disabling modules in the latest version of Molecule is more effective. For example, a common problem faced is the forced audit errors that make Molecule fail; when starting to test with Molecule, audit errors can pose a roadblock, and disabling the Lint module during the initial phase can give you some speed to concentrate on writing tests rather than trying to fix the audit errors.

Here are a few features of Molecule, though the full toolset offers more:
1. Create: Creates a virtualised provider, which in our case is the Docker container.
2. Converge: Uses the provisioner and runs the Ansible scripts against the target running Docker containers.
3. Idempotency: Uses the provisioner to check the idempotency of the Ansible scripts.
4. Lint: Does code audits of Ansible scripts, test code, test scripts, etc.
5. Verify: Runs the test scripts written.
6. Test: Runs the full sequence of steps needed by Molecule, i.e., create, converge, lint, verify and destroy.

The roles need to be initialised by the following command:

molecule init role --role-name abc

This will create all the folders needed to create the role, and is similar to the following command:

ansible-galaxy init abc

There are certain differences in the directory structure: all Molecule related scripts and test scripts are placed in the molecule folder, and some folders need to be created manually to take advantage of the file and template modules of Ansible.

The molecule folder has files in which one can put a pre-created Molecule playbook and test scripts. The environment that is created is named default; this can be changed as per the project's requirements. One file that will be of interest is molecule.yml, which is placed at:

./molecule/default/molecule.yml

Another file, which describes the playbook for the role, is playbook.yml, placed at:

./molecule/default/playbook.yml

Note: Molecule initialises the environment called default, but engineers can use a name as per the environment used in the project.

Figure 1: molecule.yml sample file
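For readers without access to Figure 1, a minimal molecule.yml for a Docker-based default scenario might look like the sketch below. This is an assumption based on the Molecule 2.x schema current at the time of writing; the section names and the image used are illustrative, so check the documentation of your installed version:

```yaml
# molecule/default/molecule.yml (illustrative sketch)
dependency:
  name: galaxy
driver:
  name: docker
lint:
  name: yamllint
platforms:
  - name: instance
    image: ubuntu:16.04
    privileged: true     # systemd-enabled images need privileges
provisioner:
  name: ansible
verifier:
  name: testinfra
```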


Figure 2: playbook.yml sample file

The sample molecule.yml file is shown in Figure 1. One needs to modify it as per the project's requirements.

Note: Docker images used for testing need to be systemd enabled and should have privileges.

Take a look at the playbook (Figure 2), and make sure that you have made all the changes needed for the parameters to be passed to the role. Now we are ready to write test scripts using Testinfra, which Molecule uses as its test validation framework. We need to go to the folder molecule/default/tests and create test assertions there; test assertions are one of the most important parts of a testing framework. The tests are located in the Python file at ./molecule/default/tests. The file contains assertions as shown in Figure 3. As can be seen, these assertions are declarative; hence it requires little effort to add them.

Figure 3: Sample assertions in the test file

The overall test is run with the molecule test command, as shown in Figure 4. It runs the steps in an opinionated way: cleaning and provisioning the infrastructure, checking idempotency, testing assertions and, at the end, cleaning up resources when the tests are finished. Molecule can be extended to test complex scenarios and distributed clustered environments, too, on test machines. As seen from the steps, Molecule can spin up a Docker container, do all the testing required and destroy the container after cleaning the infrastructure.

Figure 4: Molecule test run


By: Ranajit Jana
The author is a senior architect in service transformation (open source) at Wipro Technologies. He is interested in all the technologies related to microservices: containerisation, monitoring, DevOps, etc. You can contact him at




DevOps Series

Using Ansible to Deploy a Piwigo Photo Gallery Piwigo is Web-based photo gallery software written in PHP. In this tenth article in our DevOps series, we will use Ansible to install and configure a Piwigo instance.


Piwigo requires a MySQL database for its backend, and has a number of extensions and plugins developed by the community. You can install it on any shared Web hosting service provider, or install it on your own GNU/Linux server; it basically uses the (G)LAMP stack. In this article, we will use Ansible to install and configure a Piwigo instance, which is released under the GNU General Public License (GPL). You can add photos using the Piwigo Web interface or use an FTP client to synchronise the photos with the server. Each photo is made available in nine sizes, ranging from XXS to XXL. A number of responsive UI themes are available that make use of these different photo sizes, depending on whether you are viewing the gallery on a phone, tablet or computer. The software also allows you to add a watermark to your photos, and you can create nested albums. You can also tag your photos, and Piwigo stores metadata about the photos too. You can even use access control to make photos and albums private. My Piwigo gallery is available at


The Piwigo installation will be on an Ubuntu 15.04 image running as a guest OS using KVM/QEMU. The host system is a Parabola GNU/Linux-libre x86_64 system. Ansible is installed on the host system using the distribution package manager. The version of Ansible used is:

$ ansible --version
ansible
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/home/shakthi/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.14 (default, Sep 20 2017, 01:25:59) [GCC 7.2.0]


The /etc/hosts file should have an entry for the guest "ubuntu" VM, as indicated below: ubuntu

You should be able to issue commands from Ansible to the guest OS. For example:

$ ansible ubuntu -m ping

ubuntu | SUCCESS => {
    "changed": false,
    "ping": "pong"
}

On the host system, we will create a project directory structure to store the Ansible playbooks:

ansible/inventory/kvm/
       /playbooks/configuration/
       /playbooks/admin/

An 'inventory' file is created inside the inventory/kvm folder that contains the following:

ubuntu ansible_host= ansible_connection=ssh ansible_user=xetex ansible_password=pass

The Apache Web server needs to be installed first on the Ubuntu guest VM. The Ansible playbook for the same is as follows:

- name: Install Apache web server
  hosts: ubuntu
  become: yes
  become_method: sudo
  gather_facts: true
  tags: [web]

  tasks:
    - name: Update the software package repository
      apt:
        update_cache: yes

    - name: Install Apache
      package:
        name: "{{ item }}"
        state: latest
      with_items:
        - apache2

    - wait_for:
        port: 80

The Ansible playbook updates the software package repository by running apt-get update, and then proceeds to install the apache2 package. The playbook waits for the server to start and listen on port 80. An execution of the playbook is shown below:

$ ansible-playbook -i inventory/kvm/inventory playbooks/configuration/piwigo.yml --tags web -K
SUDO password:

PLAY [Install Apache web server] ******************

TASK [setup] ***********************************
ok: [ubuntu]

TASK [Update the software package repository] *************
changed: [ubuntu]

TASK [Install Apache] ***************************************
changed: [ubuntu] => (item=[u'apache2'])

TASK [wait_for] *********************************************
ok: [ubuntu]

PLAY RECAP **************************************************
ubuntu : ok=4 changed=2 unreachable=0 failed=0

The verbosity of the Ansible output can be increased by passing 'v' multiple times in the invocation; the more times 'v' is present, the greater the verbosity level. The -K option will prompt for the sudo password of the xetex user. If you now open, you should be able to see the default Apache2 index.html page, as shown in Figure 1.

Figure 1: Apache2 default index page


Piwigo requires a MySQL database server for its back-end,



and at least version 5.0. As the second step, you can install the same using the following Ansible playbook:

- name: Install MySQL database server
  hosts: ubuntu
  become: yes
  become_method: sudo
  gather_facts: true
  tags: [database]

  tasks:
    - name: Update the software package repository
      apt:
        update_cache: yes

    - name: Install MySQL
      package:
        name: "{{ item }}"
        state: latest
      with_items:
        - mysql-server
        - mysql-client
        - python-mysqldb

    - name: Start the server
      service:
        name: mysql
        state: started

    - wait_for:
        port: 3306

    - mysql_user:
        name: guest
        password: '*F7B659FE10CA9FAC576D358A16CC1BC646762FB2'
        encrypted: yes
        priv: '*.*:ALL,GRANT'
        state: present

The APT software repository is updated first, and the required MySQL packages are then installed. The database server is started, and the Ansible playbook waits for the server to listen on port 3306. For this example, a guest database user account with osfy as the password is chosen for the gallery Web application. In production, please use a stronger password. The hash for the password can be computed from the MySQL client as indicated below:

mysql> SELECT PASSWORD('osfy');
+-------------------------------------------+
| PASSWORD('osfy')                          |
+-------------------------------------------+
| *F7B659FE10CA9FAC576D358A16CC1BC646762FB2 |
+-------------------------------------------+
1 row in set (0.00 sec)

Also, the default MySQL root password is empty; you should change it after installation. The playbook can be invoked as follows:

$ ansible-playbook -i inventory/kvm/inventory playbooks/configuration/piwigo.yml --tags database -K

Piwigo is written using PHP (PHP Hypertext Preprocessor), and requires version 5.0 or later; the documentation website recommends version 5.2. The Ansible playbook to install PHP is given below:

- name: Install PHP
  hosts: ubuntu
  become: yes
  become_method: sudo
  gather_facts: true
  tags: [php]

  tasks:
    - name: Update the software package repository
      apt:
        update_cache: yes

    - name: Install PHP
      package:
        name: "{{ item }}"
        state: latest
      with_items:
        - php5
        - php5-mysql

The playbook updates the software package repository, and installs PHP5 and the php5-mysql database connectivity package. It can be invoked as follows:

$ ansible-playbook -i inventory/kvm/inventory playbooks/configuration/piwigo.yml --tags php -K

The final step is to download, install and configure Piwigo. The playbook for this is given below:

- name: Setup Piwigo
  hosts: ubuntu
  become: yes
  become_method: sudo
  gather_facts: true



How To Admin

Figure 4: Piwigo home page dest: “{{ piwigo_dest }}/gallery” remote_src: True

Figure 2: Piwigo install page

Figure 3: Piwigo install success page tags: [piwigo] vars: piwigo_dest: “/var/www/html” tasks: - name: Update the software package repository apt: update_cache: yes - name: Create a database for piwigo mysql_db: name: piwigo state: present - name: Create target directory file: path: “{{ piwigo_dest }}/gallery” state: directory - name: Download latest piwigo get_url: url: php?code=latest dest: “{{ piwigo_dest }}/” - name: Extract to /var/www/html/gallery unarchive: src: “{{ piwigo_dest }}/”

- name: Restart apache2 server service: name: apache2 state: restarted

The piwigo_dest variable stores the location of the default Apache hosting directory. The APT software package repository is then updated. Next, an exclusive MySQL database is created for this Piwigo installation. A target folder gallery is then created under /var/www/ html to store the Piwigo PHP files. Next, the latest version of Piwigo is downloaded (2.9.2, as on date) and extracted under the gallery folder. The Apache Web server is then restarted. You can invoke the above playbook as follows: $ ansible-playbook -i inventory/kvm/inventory playbooks/ configuration/piwigo.yml --tags piwigo -K

If you open the URL in a browser on the host system, you will see the screenshot given in Figure 2 to start the installation of Piwigo. After entering the database credentials and creating an admin user account, you should see the ‘success’ page, as shown in Figure 3. You can then go to to see the home page of Piwigo, as shown in Figure 4.
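The PASSWORD() hash used for the guest account comes from MySQL's old mysql_native_password scheme, which is simply two rounds of SHA-1. If you don't have a MySQL client at hand, the same hash can be computed with a few lines of Python; this is a sketch of the published algorithm, not an official MySQL API:

```python
import hashlib

def mysql_password_hash(plaintext):
    """Old-style MySQL PASSWORD(): '*' followed by UPPER(SHA1(SHA1(pw)))."""
    stage1 = hashlib.sha1(plaintext.encode("utf-8")).digest()
    stage2 = hashlib.sha1(stage1).hexdigest().upper()
    return "*" + stage2

# For 'osfy', this yields the 41-character hash to paste into the playbook
print(mysql_password_hash("osfy"))
```

The resulting string can be used in the encrypted mysql_user task exactly as the article does, so the plaintext password never needs to appear in the playbook.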


The Piwigo data is present in both the installation folder and in the MySQL database. It is thus important to make backups periodically, so that you can use these archive files to restore data, if required. The following Ansible playbook creates a target backup directory, makes a tarball of the installation folder, and dumps the database contents to a .sql file. The epoch timestamp is used in the filenames. The backup folder can be rsynced to a different system or to secondary backup storage.

- name: Backup Piwigo
  hosts: ubuntu
  become: yes
  become_method: sudo
  gather_facts: true
  tags: [backup]

  vars:
    piwigo_dest: "/var/www/html"

  tasks:
    - name: Create target directory
      file:
        path: "{{ piwigo_dest }}/gallery/backup"
        state: directory

    - name: Backup folder
      archive:
        path: "{{ piwigo_dest }}/gallery/piwigo"
        dest: "{{ piwigo_dest }}/gallery/backup/piwigo-backup-{{ ansible_date_time.epoch }}.tar.bz2"

    - name: Dump database
      mysql_db:
        name: piwigo
        state: dump
        target: "{{ piwigo_dest }}/gallery/backup/piwigo-{{ ansible_date_time.epoch }}.sql"

The above playbook can be invoked as follows:

$ ansible-playbook -i inventory/kvm/inventory playbooks/configuration/piwigo.yml --tags backup -K

Two backup files created by executing the above playbook are piwigo-1510053932.sql and piwigo-backup-1510053932.tar.bz2.

Cleaning up

You can uninstall the entire Piwigo installation using an Ansible playbook. This has to happen in the reverse order: Piwigo is removed first, followed by PHP, MySQL and Apache. A playbook to do this is included in the playbooks/admin folder, and is given below for reference:

---
- name: Uninstall Piwigo
  hosts: ubuntu
  become: yes
  become_method: sudo
  gather_facts: true
  tags: [uninstall]

  vars:
    piwigo_dest: "/var/www/html"

  tasks:
    - name: Delete piwigo folder
      file:
        path: "{{ piwigo_dest }}/gallery"
        state: absent

    - name: Drop database
      mysql_db:
        name: piwigo
        state: absent

    - name: Uninstall PHP packages
      package:
        name: "{{ item }}"
        state: absent
      with_items:
        - php5-mysql
        - php5

    - name: Stop the database server
      service:
        name: mysql
        state: stopped

    - name: Uninstall MySQL packages
      package:
        name: "{{ item }}"
        state: absent
      with_items:
        - python-mysqldb
        - mysql-client
        - mysql-server

    - name: Stop the web server
      service:
        name: apache2
        state: stopped

    - name: Uninstall apache2
      package:
        name: "{{ item }}"
        state: absent
      with_items:
        - apache2

The above playbook can be invoked as follows:

$ ansible-playbook -i inventory/kvm/inventory playbooks/admin/uninstall-piwigo.yml -K

You can visit the Piwigo website for more documentation.

By: Shakthi Kannan
The author is a free software enthusiast and blogger.


Hive: The SQL-like Data Warehouse Tool for Big Data The management of Big Data is crucial if enterprises are to benefit from the huge volumes of data they generate each day. Hive is a tool built on top of Hadoop that can help to manage this data.


Hive is a data warehouse infrastructure tool to process structured data in Hadoop. It resides on top of Hadoop to summarise Big Data, and makes querying and analysis easy. A little history about Apache Hive will help you understand why it came into existence. When Facebook started gathering data and ingesting it into Hadoop, back in 2006, the data was coming in at the rate of tens of GBs per day. Then, in 2007, it grew to 1TB a day and, within a few years, increased to around 15TB a day. Initially, Python scripts were written to ingest the data into Oracle databases, but with the increasing data rate and the growing diversity in the sources and types of incoming data, this was becoming difficult. The Oracle instances were getting filled pretty fast, and it was time to develop a new kind of system that could handle large amounts of data. It was Facebook that first built Hive, so that most people who had SQL skills could use the new system with minimal changes, compared to what was required with other RDBMSs.

The main features of Hive are:
• It stores schema in a database and processes data into HDFS.
• It is designed for OLAP.
• It provides an SQL-type language for querying, called HiveQL or HQL.
• It is familiar, fast, scalable and extensible.

Hive architecture is shown in Figure 1. The components of Hive are listed in Table 1.



Table 1: The components of Hive

User interface: Hive is data warehouse infrastructure software that can create interactions between the user and HDFS. The user interfaces that Hive supports are Hive Web UI, the Hive command line, and Hive HD Insight (in Windows Server).

Meta store: Hive chooses respective database servers to store the schema or metadata of tables, databases and columns in a table, along with their data types and HDFS mapping.

HiveQL process engine: HiveQL is similar to SQL for querying the schema information in the meta store. It replaces the traditional approach of the MapReduce program: instead of writing a MapReduce program in Java, we can write a query for a MapReduce job and process it.

Execution engine: The conjunction of the HiveQL process engine and MapReduce is the Hive execution engine, which processes the query and generates results that are the same as MapReduce results.

HDFS or HBASE: The Hadoop distributed file system or HBASE comprises the data storage techniques for storing data into the file system.

Figure 1: Hive architecture

Figure 2: Hive configuration

The importance of Hive in Hadoop

Apache Hive lets you work with Hadoop in a very efficient manner. It is a complete data warehouse infrastructure that is built on top of the Hadoop framework. Hive is uniquely placed to query data, and perform powerful analysis and data summarisation while working with large volumes of data. An integral part of Hive is the HiveQL query, which is an SQL-like interface that is used extensively to query what is stored in databases. Hive has the distinct advantage of deploying high-speed data reads and writes within the data warehouses while managing large data sets that are distributed across multiple locations, all thanks to its SQL-like features. It provides a structure to the data that is already stored in the database. The users are able to connect with Hive using a command line tool and a JDBC driver.
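As a flavour of what the SQL-like interface looks like, the following HiveQL fragment defines a schema over delimited text data and runs an aggregate query; the table and column names are purely illustrative assumptions, not taken from the article:

```sql
-- Define a schema over comma-separated text data (schema-on-read)
CREATE TABLE IF NOT EXISTS page_views (
  view_time TIMESTAMP,
  user_id   BIGINT,
  page_url  STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- An SQL-style aggregate that Hive compiles into a MapReduce job
SELECT page_url, COUNT(*) AS views
FROM page_views
GROUP BY page_url;
```

Queries like the SELECT above are compiled by the HiveQL process engine into MapReduce jobs, which is exactly the replacement for hand-written Java MapReduce described in Table 1.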

How to implement Hive

First, download Hive from the stable/ directory of the Apache Hive downloads site; the file to fetch is apache-hive-1.2.1-bin.tar.gz (about 89MB). Extract it manually and rename the folder as hive.


Figure 3: Getting started with Hive





In the command prompt, type the following commands:

sudo mv hive /usr/local/hive
sudo gedit ~/.bashrc

Below this line, write the following code:

# Set HIVE_HOME
export HIVE_HOME=/usr/local/hive
PATH=$PATH:$HIVE_HOME/bin
export PATH

user@ubuntu:~$ cd /usr/local/hive
user@ubuntu:~$ sudo gedit

Go to the line where the following statements are written:

# Allow alternate conf dir location.
export HADOOP_HOME=/usr/local/hadoop (write the path where your Hadoop installation is)

Now, start Hadoop. Type hive, and the Hive shell will start, as shown in Figure 3.

Pig vs Hive

Table 2 illustrates the differences between Pig and Hive.

Table 2

Pig: A procedural data flow language | Hive: A declarative SQL-like language
Pig: Used for programming | Hive: Used for creating reports
Pig: Mainly used by researchers and programmers | Hive: Mainly used by data analysts
Pig: Operates on the client side of a cluster | Hive: Operates on the server side of a cluster
Pig: Does not have a dedicated metadata database | Hive: Makes use of the exact variation of the dedicated SQL DDL language by defining tables beforehand
Pig: SQL-like but varies to a great extent | Hive: Directly leverages SQL and is easy to learn for database experts
Pig: Supports the Avro file format | Hive: Does not support this file format



By: Prof. Prakash Patel and Prof. Dulari Bhatt
Prof. Prakash Patel is an assistant professor in the IT department of the Gandhinagar Institute of Technology. Prof. Dulari Bhatt is also an assistant professor in the IT department of the Gandhinagar Institute of Technology.

Success Story For U & Me



Kumar Priyansh, the developer of BackSlash Linux

BackSlash Linux is one of the newest Linux distributions developed in India, and that too, by a 20-year-old. The operating system has a mix of Ubuntu and Debian platforms, and offers two different worlds under one roof, with KDE and GNOME integration.


“It is not very hard to build a Linux distribution,” says Kumar Priyansh, the 20-year-old developer who has single-handedly created BackSlash Linux. As a child, Priyansh had always been curious about how operating systems worked. But instead of being merely curious and dreaming of developing an operating system, he started making OpenSUSE-based distributions in 2011 to step into the world of open source platforms. He used SUSE Studio to release three versions of his very own operating system, which he called Blueberry. All that experience helped Priyansh in bringing out a professional Linux platform

that debuted as BackSlash Linux in November 2016. “To try and build a new Linux distro, I decided to dedicate a lot of time to development, and started attending online tutorials,” says Priyansh. Going through many online tutorial sessions, the Madhya Pradesh resident observed that he needed to combine multiple parts of different tutorials to understand the basics. “I started connecting parts of different tutorials from the Web, making an authentic and working tutorial for myself that allowed me to build the necessary applications and compile the very first version of my own distribution,” he says.

It’s Ubuntu behind the scenes, but with many tweaks

Priyansh picked Ubuntu as the Linux distribution on which to build his platform. But to give users an even more advanced experience, he deployed Hardware Enablement (HWE) support, which allows the operating system to work with newer hardware and provides an up-to-date delivery of the Linux kernel. The developer also added a proprietary repository channel that helps users achieve better compatibility with their hardware. “I ship the platform with a proprietary repository channel enabled, which Ubuntu does not offer by default. This is because I don’t want BackSlash users to hop on different websites for unsupported hardware,” says Priyansh, a Samrat Ashok Technological Institute alumnus. Alongside Ubuntu, the BackSlash maker opted for Debian package support. He initially wanted to integrate the Wayland protocol as well as enable a space for Qt and GTK apps. However, in the initial implementation, Priyansh found it a challenge to offer a unified experience across KDE and GNOME environments. “I found that GTK apps running on KDE didn’t look as attractive as on GNOME. Therefore, all I had to do was to provide a theme available for both KDE and GTK, and put it into the respective locations from where the apps acquire an identical look and feel,” Priyansh says.

Uniqueness all around

Although the original aim of BackSlash Linux wasn’t to compete with any other Linux distros, it has a list of features that distinguish it from others. “I start building at the stage that the other distros stop their work. I craft solutions around what users want and continue to improve things until everything seems pixel perfect,” affirms Priyansh. The current beta version of the platform includes a new login screen that displays aerial background video updates, fingerprint protection for logging in, access to the terminal and other apps, multi-touch gestures, the coverflow Alt+Tab switcher, Snap support, and an updated Plasma Shell. There are also new updates, including Wine 2.14 for running Windows packages, Redshift (a blue light filter), a new email client, an enhanced system optimiser with an advanced app uninstaller and Google Play Music Desktop Edition. Priyansh has chosen characters from the Disney movie ‘Frozen’ to name the versions of his operating system. The current beta version of BackSlash Linux is called Kristoff,

which is the name of a Sami iceman in the animated film, while its first stable release was launched as ‘Anna’, named after a princess in ‘Frozen’.

Key features that have helped BackSlash Linux clock 75,000 downloads:
• Resembles Apple’s MacOS
• Snap package support
• BackSlash Sidebar
• Fingerprint integration
• Redshift night light filter
• Microsoft fonts
• Backup utility onboard

Security features under the hood

BackSlash Linux is not targeted at enterprise users. Having said that, the operating system does have some security features to make the experience safer for those who want to begin with a Linux distribution. It receives security updates directly from Canonical to keep the environment secure. The preinstalled System Optimiser app also helps users optimise performance, toggle startup programs, and uninstall applications and packages. Additionally, community feedback, which has recently started rolling in, enables Priyansh to enhance the security of the platform. “The current beta release is doing quite well and receiving much praise from the community,” the developer says.

Sources of revenue

Experts often believe that selling an open source solution is more difficult than trading a proprietary technology. For BackSlash Linux, Priyansh has opted for a model that involves receiving donations and sponsorships. “Our primary source of revenue will always be donations and sponsorships for the project,” Priyansh asserts.

Future plans

BackSlash Linux, available on AMD64 and Intel x64 platforms, certainly has the potential to grow bigger. Priyansh is planning to add his own technologies to the platform, going forward. He is set to develop his own Web browser, music player and some ‘awesome apps’ to take the ‘Made in India’ operating system to the global stage. “There are also plans to build a custom compiled Linux kernel in the future to deliver better support out-of-the-box,” the developer concludes.

By: Jagmeet Singh
The author was an assistant editor at EFY until recently.


Let's Try

A Brief Introduction to Puppet


As a server configuration management tool, Puppet offers an automated way to inspect, deliver and operate software regardless of its deployment. It provides control and enforces consistency, while allowing for any of the modifications dictated by business needs.


The burgeoning demand for scalability has driven technology into a new era, with a focus on distributed and virtual resources as opposed to the conventional hardware that drives most systems today. Virtualisation is a method for logically dividing the computing resources of a system between different applications. Tools offer either full virtualisation or para-virtualisation for a system, driving away from the ‘one server, one application’ model that typically under-utilises resources, towards a model focused on more efficient use of the system. In hardware virtualisation, a virtual machine is created that behaves like a real computer with an operating system. The terminology ‘host’ and ‘guest’ machine is used for the real and virtual system, respectively. With this paradigm shift, software and service architectures have undergone a transition to virtual machines, laying the groundwork for distributed and cloud computing.

Enterprise infrastructure

The classical definition of enterprise infrastructure — the data centre, a crucial piece of the puzzle serving to bolster the operations of the company — is evolving in a manner that would have been hard to fathom a decade ago. It was originally built as isolated chunks of machinery pieced together into a giant to provide storage and network support for day-to-day operations. Archetypal representations include a mass of tangled wires and rolled-up cables connecting monster racks of servers churning data minute after minute. A few decades ago, this was the norm; today, companies require much more flexibility to scale up and down. With the advent of virtualisation, enterprise infrastructure has been cut down, eliminating unnecessary pieces of hardware, and managers have opted for cloud storage networks that they can scale on-the-fly. Today, the business of cloud service providers is booming because not only startups but corporations, too, are switching to a virtual internal infrastructure to avoid the hassles of preparing and maintaining their own set of servers, especially considering the man-hours required for the task. Former US Chief Information Officer (CIO) Vivek Kundra’s paper on Federal Cloud Computing states: “It allows users to control the computing services they access, while sharing the investment in the underlying IT resources among consumers. When the computing resources are provided by another organisation over a wide area network, cloud computing is similar to an electric power utility. The providers benefit from economies of scale, which in turn enables them to lower individual usage costs and centralise infrastructure costs.”

Figure 1: Overview of virtualisation

Figure 2: Enterprise infrastructure through the ages

Figure 3: Puppet: The server configuration management tool

However, setting up new virtual machines at scale presents a new challenge when the size of the organisation increases and the use cases balloon. It becomes nearly impossible to manage the customisation and configuration of each individual instance, considering that an average of a few minutes needs to be spent per deployment. This raises questions about the switch to the cloud itself, since the overheads are now similar, in terms of time spent, to those of physical infrastructure. This is where Puppet enters the picture. It is a server configuration management tool that automates the deployment and modification of virtual machines and servers, with a single configuration script that can serve to deploy thousands of virtual machines simultaneously.

Features of Puppet

A server configuration management tool, Puppet gives you an automated way to inspect, deliver and operate software regardless of where it is deployed. It provides control and enforces consistency, while allowing for any modifications dictated by business needs. It uses a Ruby-like, easy-to-read language for preparing the deployment of the infrastructure. Working with both the cloud and the data centre, it is platform-independent, and can enforce and propagate all the necessary changes to the infrastructure.

Figure 4: Automation plays a vital role in DevOps

Figure 5: Puppet offered on the AWS Marketplace

It also allows for the monitoring of each stage to ensure visibility and compliance. Puppet supports DevOps by providing automation and enabling faster releases without sacrificing security or stability. It allows security policies to be set and monitored for regulatory compliance, so that the risk of misconfigurations and failed audits is minimised. By treating the infrastructure as code, Puppet ensures that deployments are faster, and the continuous shipping of code results in a lower risk of failure. It streamlines heterogeneous technologies and unifies them under a single configuration management interface. Puppet supports containers and encourages analysis of what a container is made of, at scale. Tools are provided to monitor any discrepancies in functionality, in order to gain deeper insight into the deployments of a product.
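As a flavour of the Ruby-like language mentioned above, here is a minimal sketch of a Puppet manifest; the package and service names assume a Debian/Ubuntu node and are illustrative, not taken from the article:

```puppet
# Desired state: the Apache package is installed and its service is running.
# Puppet enforces this state on every run, restoring it if it has drifted.
package { 'apache2':
  ensure => installed,
}

service { 'apache2':
  ensure  => running,
  enable  => true,
  require => Package['apache2'],
}
```

Declaring the end state, rather than scripting the individual steps, is what lets the same manifest configure one server or thousands of them.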


Puppet has found widespread application across large infrastructure networks in companies like Google, Amazon and Walmart. In fact, it is an integrated solution provided with Amazon Web Services and the Google Cloud Engine as well. DevOps has been proven to yield remarkable results, and it is time organisations focused on reducing latency and frivolity within cycles to increase the efficiency of rollouts.

Case studies of Puppet

These include the case of Staples, which faced a challenge in creating a private cloud with automated provisioning and speeding up development cycles. With the introduction of Puppet, developers had the freedom to provision their own systems and create their own configurations. This resulted in increased stability and, needless to say, faster deployments. SalesForce and Hewlett Packard have also attested to how integrating Puppet into their workflows cut code delivery timelines from weeks to hours, and allowed support for more efficient DevOps practices, including automation. Getty Images, another popular service provider, which originally used the open source version of Puppet, decided to try out the enterprise version on a smaller scale. As it switched to an agile model, test-driven development and automation were key to its development cycles, and the use of Puppet expanded thereon. Puppet offers a promising solution for configuration management, as well as an array of assorted tools to bolster its core product offering. It is a must-try for organisations facing code shipping issues, time constraints and deployment woes.

Figure 6: DevOps is gaining visibility

Figure 7: Companies that use Puppet

Where is Puppet used?

There is a generalised subset of applications where Puppet fits in, including automation, test-driven development and configuration management. The security and reliability of the product, combined with its ease of use, allows for quick adoption and integration into the software development cycle of a product. With less time spent on superfluous tasks, more focus can be afforded to core practices and product development, allowing for better returns for the company.


By: Swapneel Mehta
The author has worked with Microsoft Research, CERN and startups in AI and cyber security. An open source enthusiast, he enjoys spending his time organising software development workshops for school and college students.




Use ownCloud to Keep Your Data Safe and Secure ownCloud is an open source, self-hosted file ‘sync and share’ app platform, which allows users to access, synchronise and share data anytime and anywhere, without the worries associated with the public cloud.


The cloud is currently very popular across businesses. It has been stable for some time, and many industries are moving towards cloud technologies. However, a major challenge in the cloud environment is privacy and the security of data. Organisations and individuals host their professional or personal data on the cloud, and many providers promise a 99 per cent guarantee for the safety of user data. Yet, there are chances of security breaches and privacy concerns. Therefore, many companies have pulled their data out of public clouds and started creating their own private cloud storage. Using ownCloud, one can create services similar to Dropbox and iCloud, and sync and share files, calendars and more. Let’s take a look at how we can do this.


ownCloud is free and open source software that operates like any cloud storage system on your own domain. It is very quick and easy to set up compared to other similar software. It can be used not only for file sharing but also to leverage many features like text editors, to-do lists, etc. ownCloud can be integrated with any desktop or mobile calendar and contact apps. So now, we don’t really require a Google Drive or Dropbox account.

Requirements for ownCloud Web hosting

You will need hosting space and a domain name so that you can access your storage publicly, anywhere and anytime. You can register a domain and hosting space with any service provider like GoDaddy or BigRock. The only requirement is that the hosting service should support PHP and MySQL, which most hosting providers usually do.

ownCloud server

There are different ways to install the ownCloud server but here, we will use the easiest and quickest method; it will not take more than five minutes to get your cloud ready. We need to download the latest version of the ownCloud server and install it. From the ownCloud website, click on Download ownCloud Server; the current version is 10.0.3. Then select the Web installer from the top options in the left side menu. Check Figure 1 for more details. Figure 1 mentions the installation steps: we need to download the setup-owncloud.php file and upload it into the Web space. We will upload the file using WinSCP or any similar kind of software (if you don’t have WinSCP, you can download it first). The next step is to find your FTP credentials in order to connect to your Web space. For that, you need to log in to your hosting provider’s portal. From the menu, you will find the FTP options, where you can see your credentials. This approach varies based on your hosting provider. In the case of my own hosting provider, this was how I found my


Let's Try

Figure 5: ownCloud installed

Figure 1: ownCloud downloading the Web installer file

Figure 6: Creating an account

Figure 2: FTP connection to the Web space

Figure 7: ownCloud home page

Figure 3: ownCloud installation page

Figure 8: Files uploaded in ownCloud Figure 4: Error while installing ownCloud

FTP credentials. If you are not able to find them, then Google the specific steps you need to take for your hosting provider, for which you will get lots of documentation. Figure 2 shows my FTP connection to my Web space. Transfer your setup-owncloud.php file to your Web space by dragging it. Now you can load the URL Yourdomainname.com/setup-owncloud.php into your browser. In my case, I used my own domain name, and the URL followed the same pattern. Figure 3 demonstrates this step. As you can see, the ownCloud installation has started. While clicking on the Next button, if we get the error mentioned in Figure 4, it means we don’t have proper access. This happens because of the root directory: in shared hosting, we don’t have RW access to the root folder, so I created one folder and put setup-owncloud.php into it; the current path therefore includes that folder. While creating an account, you can click on Database selection and provide your user name and password for the database. Click Next, and you will then come to the home page of ownCloud. Figure 7 demonstrates that. There are different options—you can get your own cloud storage as an Android or iPhone app, and you can also connect your Calendar or Contacts apps. Everything else is self-explanatory, just as in Google Drive or Dropbox. You can now upload any file and share it with friends and colleagues, like you would with any other cloud storage service.



By: Maulik Parekh
The author works at Cisco as a consulting engineer, and has an M.Tech degree in cloud computing from VIT University, Chennai. He constantly strives to learn, grow and innovate.


Spark’s MLlib: Scalable Support for Machine Learning Designated as Spark’s scalable machine learning library, MLlib consists of common algorithms and utilities as well as underlying optimisation primitives.


The world is being flooded with data from all sources. The hottest trend in technology relates to Big Data, and the evolving field of data science is a way to cope with this data deluge. Machine learning is at the heart of data science. The need of the hour is efficient machine learning frameworks and platforms to process Big Data. Apache Spark is one of the most powerful platforms for analysing Big Data, and MLlib, its machine learning library, is potent enough to process Big Data and apply all machine learning algorithms to it efficiently.

Apache Spark

Apache Spark is a cluster computing framework that builds on Hadoop’s MapReduce model. Spark offers in-memory cluster computing, which helps to speed up computation by reducing the I/O transfer time. It is widely used to deal with Big Data problems because of its distributed architectural support and parallel processing capabilities. Users prefer it to Hadoop on account of its stream processing and interactive query features. To provide a wide range of services, it has built-in libraries like GraphX, SparkSQL and MLlib. Spark supports Python, Scala, Java and R as programming languages, of which Scala is the most preferred.


MLlib is Spark’s machine learning library. It is predominantly used from Scala, but it is compatible with Python and Java as well. MLlib was initially contributed by AMPLab at UC Berkeley. It makes machine learning scalable, which provides an advantage when handling large volumes of incoming data. The main features of MLlib are listed below.
Machine learning algorithms: Regression, classification, collaborative filtering, clustering, etc
Featurisation: Selection, dimensionality reduction, transformation, feature extraction, etc
Pipelines: Construction, evaluation and tuning of ML pipelines
Persistence: Saving/loading of algorithms, models and pipelines
Utilities: Statistics, linear algebra, probability, data handling, etc
Some lower level machine learning primitives, like the generic gradient descent optimisation algorithm, are also present in MLlib. In the latest releases, the MLlib API is based on DataFrames instead of RDDs, for better performance.

The advantages of MLlib

The true power of Spark lies in its vast libraries, which are capable of performing every data analysis task imaginable. MLlib is at the core of this functionality. It has several advantages. | OPEN SOURCE FOR YOU | DECEMBER 2017 | 55



Ease of use: MLlib integrates well with four languages: Java, R, Python and Scala. The APIs in all four provide ease of use, as programmers can work in a language they already know.
Easy to deploy: No preinstallation or conversion is required to use a Hadoop-based data source such as HBase, HDFS, etc. Spark can also run standalone or on an EC2 cluster.
Scalability: The same code can work on small or large volumes of data without needing to be changed to suit the volume. As businesses grow, it is easy to expand vertically or horizontally without breaking the code into modules for performance.
Performance: The ML algorithms run up to 100x faster than MapReduce, thanks to the framework’s support for iterative computation. MLlib’s algorithms take advantage of iterative computing properties to deliver better performance, surpassing that of MapReduce. The performance gain is attributed to in-memory computing, which is a speciality of Spark.
Algorithms: The main ML algorithms included in the MLlib module are classification, regression, decision trees, recommendation, clustering, topic modelling, frequent item sets, association rules, etc. The ML workflow utilities included are feature transformation, pipeline construction, ML persistence, etc. Singular value decomposition, principal component analysis, hypothesis testing, etc, are also possible with this library.
Community: Spark is open source software under the Apache Foundation. It gets tested and updated by a vast contributing community. MLlib is its most rapidly expanding component, and new features are added every day. People submit their own algorithms, and the resources available are unparalleled.

Basic modules of MLlib

SciKit-Learn: This module contains many basic ML algorithms that perform the various tasks listed below.
Classification: Random forest, nearest neighbour, SVM, etc
Regression: Ridge regression, support vector regression, lasso, logistic regression, etc
Clustering: Spectral clustering, k-means clustering, etc
Decomposition: PCA, non-negative matrix factorisation, independent component analysis, etc

Mahout: This module contains many basic ML algorithms that perform the tasks listed below.
Classification: Random forest, logistic regression, naive Bayes, etc
Collaborative filtering: ALS, etc
Clustering: k-means, fuzzy k-means, etc
Decomposition: SVD, randomised SVD, etc

Spark MLlib use cases

Spark’s MLlib is used frequently in marketing optimisation, security monitoring, fraud detection, risk assessment, operational optimisation, preventative maintenance, etc. Here are some popular use cases.
NBC Universal: International cable TV generates tons of data. To reduce costs, NBC takes its media offline when it is not in use. Spark’s MLlib is used to implement an SVM that predicts which files should be taken down.
ING: ING uses MLlib in its data analytics pipeline for anomaly detection. Decision trees and k-means are implemented with MLlib to enable this.
Toyota: Toyota’s Customer 360 insights platform uses social media data in real time to prioritise customer reviews and categorise them for business insights.

ML vs MLlib

There are two main machine learning packages: spark.mllib and The former is the original version and has its API built on top of RDDs. The latter has a newer, higher-level API built on top of DataFrames to construct ML pipelines. The newer version is recommended because of the DataFrames, which make it more versatile and flexible. The newer releases support the older version as well, due to backward compatibility. MLlib, being older, has more features as it has been in development longer. Spark ML allows you to create pipelines using machine learning to transform the data. In short, ML is newer, has pipelines and DataFrames, and is easier to work with; MLlib is older, is based on RDDs, and has more features. MLlib is the main reason for the popularity and widespread use of Apache Spark in the Big Data world. Its compatibility, scalability, ease of use, and good features and functionality have led to its success. It provides many inbuilt functions and capabilities, which makes things easy for machine learning programmers. Virtually all known machine learning algorithms in use can be easily implemented using either version of MLlib. In this era of data deluge, such libraries are certainly a boon to data science.



By: Preet Gandhi The author is an avid Big Data and data science enthusiast. She can be reached at


Apache CloudStack: A Reliable and Scalable Cloud Computing Platform

Apache CloudStack is yet another outstanding project that has contributed many tools and projects to the open source community. The author has selected the relevant and important extracts from the excellent documentation provided by the Apache CloudStack project team for this article.


Apache CloudStack is one of the most visible projects from the Apache Software Foundation (ASF). The project focuses on deploying open source software for public and private Infrastructure as a Service (IaaS) clouds. Listed below are a few important points about CloudStack.
ƒ It is designed to deploy and manage large networks of virtual machines, as highly available and scalable Infrastructure as a Service (IaaS) cloud computing platforms.
ƒ CloudStack is used by a number of service providers to offer public cloud services, and by many companies to provide on-premises (private) cloud offerings or as part of a hybrid cloud solution.
ƒ CloudStack includes the entire ‘stack’ of features that most organisations desire in an IaaS cloud: compute orchestration, Network as a Service, user and account management, a full and open native API, resource accounting, and a first-class user interface (UI).
ƒ It currently supports the most popular hypervisors: VMware, KVM, Citrix XenServer, Xen Cloud Platform (XCP), Oracle VM server and Microsoft Hyper-V.
ƒ Users can manage their cloud with an easy-to-use Web interface, command line tools and/or a full-featured RESTful API.
ƒ In addition, CloudStack provides an API that’s compatible with AWS EC2 and S3 for organisations that wish to deploy hybrid clouds.
It provides an open and flexible cloud orchestration platform to deliver reliable and scalable private and public clouds.

Features and functionality

Some of the features and functionality provided by CloudStack are:
ƒ Works with hosts running XenServer/XCP, KVM, Hyper-V, and/or VMware ESXi with vSphere



ƒ Provides a friendly Web-based UI for managing the cloud
ƒ Provides a native API
ƒ May provide an Amazon S3/EC2 compatible API
ƒ Manages storage for instances running on the hypervisors (primary storage) as well as templates, snapshots and ISO images (secondary storage)
ƒ Orchestrates network services from the data link layer (L2) to some application layer (L7) services, such as DHCP, NAT, firewall, VPN and so on
ƒ Accounting of network, compute and storage resources
ƒ Multi-tenancy/account separation
ƒ User management
Support for multiple hypervisors: CloudStack works with a variety of hypervisors and hypervisor-like technologies. A single cloud can contain multiple hypervisor implementations. As of the current release, CloudStack supports BareMetal (via IPMI), Hyper-V, KVM, LXC, vSphere (via vCenter), XenServer and Xen Project.
Massively scalable infrastructure management: CloudStack can manage tens of thousands of physical servers installed in geographically distributed data centres. The management server scales near-linearly, eliminating the need for cluster-level management servers. Maintenance or other outages of the management server can occur without affecting the virtual machines running in the cloud.
Automatic cloud configuration management: CloudStack automatically configures the network and storage settings for each virtual machine deployment. Internally, a pool of virtual appliances supports the configuration of the cloud itself. These appliances offer services such as firewalling, routing, DHCP, VPN, console proxy, storage access and storage replication. The extensive use of horizontally scalable virtual machines simplifies the installation and ongoing operation of a cloud.
Graphical user interface: CloudStack offers an administrator’s Web interface that can be used for provisioning and managing the cloud, as well as an end user’s Web interface for running VMs and managing VM templates. The UI can be customised to reflect the desired look and feel that the service provider or enterprise wants.
API: CloudStack provides a REST-like API for the operation, management and use of the cloud.
AWS EC2 API support: It provides an EC2 API translation layer to permit common EC2 tools to be used in the CloudStack cloud.
High availability: CloudStack has a number of features that increase the availability of the system. The management server itself may be deployed in a multi-node installation where the servers are load balanced. MySQL may be configured to use replication to provide for failover in the event of a database loss. For the hosts, CloudStack supports NIC bonding and the use of separate networks for storage, as well as iSCSI Multipath.

Figure 1: A simplified view of a basic deployment

Deployment architecture

CloudStack deployments consist of the management server and the resources to be managed. During deployment, you inform the management server of the resources to be managed, such as the IP address blocks, storage devices, hypervisors and VLANs.
The minimum installation consists of one machine running the CloudStack management server and another machine acting as the cloud infrastructure. In its smallest deployment, a single machine can act as both the management server and the hypervisor host. A more full-featured installation consists of a highly available multi-node management server and up to tens of thousands of hosts, using any of several networking technologies.
Management server overview: The management server orchestrates and allocates the resources in your cloud deployment. It typically runs on a dedicated machine or as a virtual machine. It controls the allocation of virtual machines to hosts, and assigns storage and IP addresses to the virtual machine instances. The management server runs in an Apache Tomcat container and requires a MySQL database for persistence. The management server:
ƒ Provides the Web interface for both the administrator and the end user
ƒ Provides the API interfaces for both the CloudStack API as well as the EC2 interface
ƒ Manages the assignment of guest VMs to specific compute resources
ƒ Manages the assignment of public and private IP addresses
ƒ Allocates storage during the VM instantiation process
ƒ Manages snapshots, disk images (templates) and ISO images
ƒ Provides a single point of configuration for your cloud
Cloud infrastructure overview: Resources within the cloud are managed as follows.
ƒ Regions: This is a collection of one or more geographically proximate zones managed by one or more management servers.



Figure 2: A region with multiple zones
Figure 3: Installation complete

ƒ Zones: Typically, a zone is equivalent to a single data centre. It consists of one or more pods and secondary storage.
ƒ Pods: A pod is usually a rack, or a row of racks, that includes a Layer-2 switch and one or more clusters.
ƒ Clusters: A cluster consists of one or more homogeneous hosts and primary storage.
ƒ Host: This is a single compute node within a cluster; often, a hypervisor.
ƒ Primary storage: This is a storage resource typically provided to a single cluster for the actual running of instance disk images.
ƒ Secondary storage: This is a zone-wide resource which stores disk templates, ISO images and snapshots.
Networking overview: CloudStack offers many types of networking, but these typically fall into one of two scenarios.
ƒ Basic: This is analogous to AWS-classic style networking. It provides a single flat Layer-2 network, where guest isolation is provided at Layer-3 by the hypervisor’s bridge device.
ƒ Advanced: This typically uses Layer-2 isolation such as VLANs, though this category also includes SDN technologies such as Nicira NVP.
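As a rough mental model of the containment just described (a region contains zones; a zone contains pods and secondary storage; a pod contains clusters; a cluster contains hosts and primary storage), here is a toy Python sketch. It is purely illustrative: the class names and sample values are ours, not part of CloudStack's API.

```python
from dataclasses import dataclass
from typing import List

# Toy model of CloudStack's resource hierarchy (illustrative only).
@dataclass
class Cluster:
    hosts: List[str]          # compute nodes (often hypervisors)
    primary_storage: str      # storage for running instance disk images

@dataclass
class Pod:
    clusters: List[Cluster]   # a pod is usually a rack or row of racks

@dataclass
class Zone:
    pods: List[Pod]
    secondary_storage: str    # zone-wide templates, ISOs, snapshots

@dataclass
class Region:
    zones: List[Zone]         # geographically proximate zones

region = Region(zones=[Zone(
    pods=[Pod(clusters=[Cluster(hosts=["host-1", "host-2"],
                                primary_storage="nfs://pri")])],
    secondary_storage="nfs://sec")])

# Walk the hierarchy to enumerate every host in the region.
hosts = [h for z in region.zones for p in z.pods for c in p.clusters for h in c.hosts]
```

Walking from region down to host mirrors how the management server scopes resources: secondary storage is attached at the zone level, primary storage at the cluster level.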


In this section, let us look at the minimum system requirements and installation steps for CloudStack.
Management server, database and storage system requirements: The machines that will run the management server and the MySQL database must meet the following requirements. The same machines can also be used to provide primary and secondary storage, such as via local disks or NFS. The management server may be placed on a virtual machine.
ƒ Preferred OS: CentOS/RHEL 6.3+ or Ubuntu 14.04(.2)
ƒ 64-bit x86 CPU (more cores lead to better performance)
ƒ 4GB of memory
ƒ 250GB of local disk space (more space results in better capability; 500GB recommended)
ƒ At least 1 NIC
ƒ Statically allocated IP address
ƒ Fully qualified domain name, as returned by the hostname command

Host/hypervisor system requirements: The host is where the cloud services run in the form of guest virtual machines. Each host is one machine that meets the following requirements:
ƒ Must support HVM (Intel-VT or AMD-V enabled)
ƒ 64-bit x86 CPU (more cores result in better performance)
ƒ Hardware virtualisation support required
ƒ 4GB of memory
ƒ 36GB of local disk
ƒ At least 1 NIC
ƒ Latest hotfixes applied to the hypervisor software
ƒ When you deploy CloudStack, the hypervisor host must not have any VMs already running
ƒ All hosts within a cluster must be homogeneous; the CPUs must be of the same type, count and feature flags
Installation steps: You may be able to do a simple trial installation, but for a full installation, do make sure you go through all the following topics from the Apache CloudStack documentation (refer to the section ‘Installation Steps’ of this documentation):
ƒ Choosing a deployment architecture
ƒ Choosing a hypervisor: supported features
ƒ Network setup
ƒ Storage setup
ƒ Best practices
The steps for the installation are as follows (you can refer to the Apache CloudStack documentation for detailed steps). Make sure you have the required hardware ready, as discussed above.
Installing the management server (choose single- or multi-node): The procedure for installing the management server is:
ƒ Prepare the operating system
ƒ In the case of XenServer only, download and install vhd-util
ƒ Install the first management server
ƒ Install and configure the MySQL database
ƒ Prepare NFS shares
ƒ Prepare and start additional management servers (optional)
ƒ Prepare the system VM template

Configuring your cloud

After the management server is installed and running, you can add the compute resources for it to manage. For an overview



of how a CloudStack cloud infrastructure is organised, see ‘Cloud Infrastructure Overview’ in the Apache CloudStack documentation. To provision the cloud infrastructure, or to scale it up at any time, follow the procedure given below:
1. Define regions (optional)
2. Add a zone to the region
3. Add more pods to the zone (optional)
4. Add more clusters to the pod (optional)
5. Add more hosts to the cluster (optional)
6. Add primary storage to the cluster
7. Add secondary storage to the zone
8. Initialise and test the new cloud
When you have finished these steps, you will have a deployment with the basic structure shown in Figure 4. For all the above steps, detailed instructions are available in the Apache CloudStack documentation.

Initialising and testing

After everything is configured, CloudStack will perform its initialisation. This can take 30 minutes or more, depending on the speed of your network. When the initialisation has completed successfully, the administrator’s dashboard should be displayed in the CloudStack UI.
1. Verify that the system is ready. In the left navigation bar, select Templates. Click on the CentOS 5.5 (64-bit) no GUI (KVM) template. Check to be sure that the status is ‘Download Complete’. Do not proceed to the next step until this message is displayed.
2. Go to the Instances tab, and filter by My Instances.
3. Click Add Instance and follow the steps in the wizard.
4. Choose the zone you just added.
5. In the template selection, choose the template to use in the VM. If this is a fresh installation, it is likely that only the provided CentOS template is available.
6. Select a service offering. Be sure that the hardware you have allows the starting of the selected service offering.
7. In data disk offering, if desired, add another data disk. This is a second volume that will be available to, but not mounted in, the guest. For example, in Linux on XenServer you will see /dev/xvdb in the guest after rebooting the VM. A reboot is not required if you have a PV-enabled OS kernel in use.
8. In the default network, choose the primary network for the guest. In a trial installation, you would have only one option here.
9. Optionally, give your VM a name and a group. Use any descriptive text you would like.
10. Click on Launch VM. Your VM will be created and started. It might take some time to download the template and complete the VM startup. You can watch the VM’s progress in the Instances screen. To use the VM, click the View Console button.

Figure 4: Conceptual view of a basic deployment

If you decide to increase the size of your deployment, you can add more hosts, primary storage, zones, pods and clusters. You may also want to go through the additional setup topics: configuration parameters, hypervisor setup, network setup and storage setup.
CloudStack installation from the Git repo (for developers): See the section ‘CloudStack Installation from the GIT repo for Developers’ in the Apache CloudStack documentation to explore these steps.

The CloudStack API

The CloudStack API is a query-based API using HTTP, which returns results in XML or JSON. It is used to implement the default Web UI. This API is not a standard like OGF OCCI or DMTF CIMI, but it is easy to learn. A mapping exists between the AWS API and the CloudStack API, as will be seen in the next section. Recently, a Google Compute Engine interface was also developed, which maps the GCE REST API to the CloudStack API described here. The CloudStack query API can be used via HTTP GET requests made against your cloud endpoint (e.g., http://localhost:8080/client/api). The API name is passed using the command key, and the various parameters for this API call are passed as key-value pairs. The request is signed using the access key and secret key of the user making the call. Some calls are synchronous while some are asynchronous. Asynchronous calls return a JobID; the status and result of a job can be queried with the queryAsyncJobResult call. Let’s get started and look at an example of calling the listUsers API in Python. First, you will need to generate keys to make requests. In the dashboard, go to Accounts, select the appropriate account and then click on Show Users. Select the intended user and generate keys using the Generate Keys icon. You will see an API Key and a Secret Key field being generated. The keys will be in the following form: API Key : XzAz0uC0t888gOzPs3HchY72qwDc7pUPIO8LxCVkIHo4C3fvbEBY_Ccj8fo3mBapN5qRDg_0_EbGdbxi8oy1A Secret Key: zmBOXAXPlfb-LIygOxUVblAbz7E47eukDS_0JYUxP3JAmknOY o56T0R-AcM7rK7SMyo11Y6XW22gyuXzOdiybQ


Open a Python shell and import the basic modules necessary to make the request. Do note that this request could be made in many different ways; this is just a very basic example. The urllib* modules are used to make the HTTP request and do URL encoding. The hashlib module gives us the sha1 hash function. It is used to generate the hmac (keyed hashing for message authentication) using the secret key. The result is encoded using the base64 module.

$ python
Python 2.7.3 (default, Nov 17 2012, 19:54:34)
[GCC 4.2.1 Compatible Apple Clang 4.1 ((tags/Apple/clang-421.11.66))] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import urllib2
>>> import urllib
>>> import hashlib
>>> import hmac
>>> import base64

Define the endpoint of the cloud, the command that you want to execute, the type of the response (i.e., XML or JSON) and the keys of the user. Note that we do not put the secret key in our request dictionary because it is only used to compute the hmac.

>>> baseurl='http://localhost:8080/client/api?'
>>> request={}
>>> request['command']='listUsers'
>>> request['response']='json'
>>> request['apikey']='plgWJfZK4gyS3mOMTVmjUVg-X-jlWlnfaUJ9GAbBbf9EdM-kAYMmAiLqzzq1ElZLYq_u38zCm0bewzGUdP66mg'
>>> secretkey='VDaACYb0LV9eNjTetIOElcVQkvJck_J_QljX_FcHRj87ZKiy0z0ty0ZsYBkoXkY9b7eq1EhwJaw7FF3akA3KBQ'

Build the base request string, which is the combination of all the key/value pairs of the request, URL encoded and joined with ampersands.

>>> request_str='&'.join(['='.join([k,urllib.quote_plus(request[k])]) for k in request.keys()])
>>> request_str
'apikey=plgWJfZK4gyS3mOMTVmjUVg-X-jlWlnfaUJ9GAbBbf9EdM-kAYMmAiLqzzq1ElZLYq_u38zCm0bewzGUdP66mg&command=listUsers&response=json'

Compute the signature with hmac, and do a Base64 encoding and a URL encoding; the string used for the signature is similar to the base request string shown above, but the keys/values are lower-cased and joined in sorted order.

>>> sig_str='&'.join(['='.join([k.lower(),urllib.quote_plus(request[k].lower().replace('+','%20'))]) for k in sorted(request.iterkeys())])
>>> sig_str
'apikey=plgwjfzk4gys3momtvmjuvg-x-jlwlnfauj9gabbbf9edm-kaymmailqzzq1elzlyq_u38zcm0bewzgudp66mg&command=listusers&response=json'
>>> sig =, sig_str, hashlib.sha1).digest()
>>> sig
'M:]\x0e\xaf\xfb\x8f\xf2y\xf1p\x91\x1e\x89\x8a\xa1\x05\xc4A\xdb'
>>> sig = base64.encodestring(, sig_str, hashlib.sha1).digest())
>>> sig
'TTpdDq/7j/J58XCRHomKoQXEQds=\n'
>>> sig = base64.encodestring(, sig_str, hashlib.sha1).digest()).strip()
>>> sig
'TTpdDq/7j/J58XCRHomKoQXEQds='
>>> sig = urllib.quote_plus(base64.encodestring(, sig_str, hashlib.sha1).digest()).strip())

Finally, build the entire string by joining the baseurl, the request_str and the signature. Then do an HTTP GET:

>>> req=baseurl+request_str+'&signature='+sig
>>> req
'http://localhost:8080/client/api?apikey=plgWJfZK4gyS3mOMTVmjUVg-X-jlWlnfaUJ9GAbBbf9EdM-kAYMmAiLqzzq1ElZLYq_u38zCm0bewzGUdP66mg&command=listUsers&response=json&signature=TTpdDq%2F7j%2FJ58XCRHomKoQXEQds%3D'
>>> res=urllib2.urlopen(req)
>>>
{ "listusersresponse" : { "count":1 ,
  "user" : [ {
    "id":"7ed6d5da-93b2-4545-a502-23d20b48ef2a",
    "username":"admin",
    "firstname":"admin",
    "lastname":"cloud",
    "created":"2012-07-05T12:18:27-0700",
    "state":"enabled",
    "account":"admin",
    "accounttype":1,
    "domainid":"8a111e58-e155-4482-93ce-84efff3c7c77",
    "domain":"ROOT",
    "apikey":"plgWJfZK4gyS3mOMTVmjUVg-X-jlWlnfaUJ9GAbBbf9EdM-kAYMmAiLqzzq1ElZLYq_u38zCm0bewzGUdP66mg",
    "secretkey":"VDaACYb0LV9eNjTetIOElcVQkvJck_J_QljX_FcHRj87ZKiy0z0ty0ZsYBkoXkY9b7eq1EhwJaw7FF3akA3KBQ",
    "accountid":"7548ac03-af1d-4c1c-9064-2f3e2c0eda0d" } ] } }

Developers How To


Building Deep Learning Applications with High Levels of Abstraction

Keras is a high-level API for neural networks. It is written in Python and its biggest advantage is its ability to run on top of state-of-the-art deep learning libraries/frameworks such as TensorFlow, CNTK or Theano. If you are looking for fast prototyping with deep learning, then Keras is the optimal choice.


Deep learning is the new buzzword among machine learning researchers and practitioners. It has certainly opened the doors to solving problems that were almost unsolvable earlier. Examples of such problems are image recognition, speaker-independent voice recognition, video understanding, etc. Neural networks are at the core of deep learning methodologies for solving problems. The improvements in these networks, such as convolutional neural networks (CNN) and recurrent networks, have certainly raised expectations and the results they yield are also promising. To make the approach simple, there are already powerful frameworks/libraries such as TensorFlow from Google and CNTK (Cognitive Toolkit) from Microsoft. The TensorFlow approach has already simplified the implementation of deep learning for coders. Keras is a high-level API for neural networks written in Python, which makes things even simpler. The uniqueness of Keras is that it can be executed on top of libraries such as TensorFlow and CNTK. This article assumes that the reader is familiar with the fundamental concepts of machine learning.


The primary reasons for using Keras are:
ƒ Instant prototyping: This is the ability to implement deep learning concepts with higher levels of abstraction, with a ‘keep it simple’ approach.
ƒ Keras has the potential to execute without any barriers on CPUs and GPUs.
ƒ Keras supports convolutional and recurrent networks; combinations of both can also be used with it.

Keras: The design philosophy

As stated earlier, the ability to move into action with instant prototyping is an important characteristic of Keras. Apart from this, Keras is designed with the following guiding principles or design philosophy:
ƒ It is an API designed with user-friendly implementation as the core principle. The API is designed to be simple and consistent, and it minimises the effort programmers are required to put in to convert theory into action.
ƒ Keras’ modular design is another important feature. The primary idea of Keras is layers, which can be connected seamlessly.



Figure 2: Keras’ design philosophy


ƒ Keras is extensible. If you are a researcher trying to bring in your own novel functionality, Keras can accommodate such extensions.
ƒ Keras is all Python, so there is no need for tricky declarative configuration files.





Figure 1: Primary reasons for using Keras


It has to be remembered that Keras is not a standalone library. It is an API and works on top of existing libraries (TensorFlow, CNTK or Theano). Hence, the installation of Keras requires any one of these backend engines. The official documentation suggests a TensorFlow backend. Detailed installation instructions for TensorFlow are available at https:// From this link, you can infer that TensorFlow can be easily installed in all major operating systems such as MacOS X, Ubuntu and Windows (7 or later). After the successful installation of any one of the backend engines, Keras can be installed using Pip, as shown below:

$ sudo pip install keras

An alternative approach is to install Keras from the source (GitHub):

#1 Clone the source from GitHub
$ git clone
#2 Move to the source directory
$ cd keras
#3 Install
$ sudo python install

The three optional dependencies that are required for specific features are:
ƒ cuDNN (CUDA Deep Neural Network library): For running Keras on the GPU
ƒ HDF5 and h5py: For saving Keras models to disk
ƒ Graphviz and Pydot: For visualisation tasks

The way Keras works

The basic building block of Keras is the model, which is a way to organise layers. The sequence of tasks to be carried out while using Keras models is:
ƒ Model definition
ƒ Compilation of the model
ƒ Model fitting
ƒ Performing predictions

Figure 3: The sequence of tasks

The basic type of model is sequential. It is simply a linear stack of layers. The sequential model can be built as shown below:

from keras.models import Sequential

model = Sequential()

The stacking of layers can be done with the add() method:

from keras.layers import Dense, Activation

model.add(Dense(units=64, input_dim=100))
model.add(Activation('relu'))
model.add(Dense(units=10))
model.add(Activation('softmax'))

Keras has various types of pre-built layers. Some of the prominent types are:
ƒ Regular Dense
ƒ Recurrent layers, LSTM, GRU, etc
ƒ One- and two-dimensional convolutional layers
ƒ Dropout
ƒ Noise
ƒ Pooling
ƒ Normalisation, etc
Similarly, Keras supports most of the popularly used activation functions. Some of these are:
ƒ Sigmoid
ƒ ReLU
ƒ Softplus
ƒ ELU
ƒ LeakyReLU, etc

Developers How To The model can be compiled with compile(), as follows: model.compile(loss=’categorical_crossentropy’, optimizer=’sgd’, metrics=[‘accuracy’])

Keras is very simple. For instance, if you want to configure the optimiser given in the above mentioned code, the following code snippet can be used: model.compile(loss=keras.losses.categorical_crossentropy, optimizer=keras.optimizers.SGD(lr=0.01, momentum=0.9, nesterov=True))

from __future__ import print_function import keras from keras.datasets import mnist from keras.models import Sequential from keras.layers import Dense, Dropout from keras.optimizers import RMSprop batch_size = 128 num_classes = 10 epochs = 20 # the data, shuffled and split between train and test sets (x_train, y_train), (x_test, y_test) = mnist.load_data()

The model can be fitted with the fit() function:

In the aforementioned code snippet, x_train and y_train are Numpy arrays. The performance evaluation of the model can be done as follows:

x_train = x_train.reshape(60000, 784) x_test = x_test.reshape(10000, 784) x_train = x_train.astype(‘float32’) x_test = x_test.astype(‘float32’) x_train /= 255 x_test /= 255

loss_and_metrics = model.evaluate(x_test, y_test, batch_ size=128)

print(x_train.shape[0], ‘train samples’) print(x_test.shape[0], ‘test samples’)

The predictions on novel data can be done with the predict() function:

# convert class vectors to binary class matrices y_train = keras.utils.to_categorical(y_train, num_classes) y_test = keras.utils.to_categorical(y_test, num_classes), y_train, epochs=5, batch_size=32)

classes = model.predict(x_test, batch_size=128)

The methods of Keras layers

The important methods of Keras layers are shown in Table 1.

Method			Description
get_weights()		This method is used to return the weights of the layer
set_weights(weights)	This method is used to set the weights of the layer
get_config()		This method is used to return the configuration of the layer as a dictionary

Table 1: Keras layers' methods

MNIST training

MNIST is a very popular database among machine learning researchers. It is a large collection of handwritten digits. A complete example of training a deep multi-layer perceptron on the MNIST data set with Keras is shown below. This source is available in the examples folder of Keras ( fchollet/keras/blob/master/examples/).

model = Sequential()
model.add(Dense(512, activation='relu', input_shape=(784,)))
model.add(Dropout(0.2))
model.add(Dense(512, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(num_classes, activation='softmax'))

model.summary()

model.compile(loss='categorical_crossentropy',
              optimizer=RMSprop(),
              metrics=['accuracy'])

history =, y_train,
                    batch_size=batch_size,
                    epochs=epochs,
                    verbose=1,
                    validation_data=(x_test, y_test))
score = model.evaluate(x_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

If you are familiar with machine learning terminology, the above code is self-explanatory.


Image classification with pre-trained models

print('Predicted:', decode_predictions(preds, top=3)[0])

An image classification code with pre-trained ResNet50 is as follows (

# Predicted: [(u'n02504013', u'Indian_elephant', 0.82658225), (u'n01871265', u'tusker', 0.1122357), (u'n02504458', u'African_elephant', 0.061040461)]

from keras.applications.resnet50 import ResNet50
from keras.preprocessing import image
from keras.applications.resnet50 import preprocess_input, decode_predictions
import numpy as np

The simplicity with which classification tasks are carried out can be inferred from the above code. Overall, Keras is a simple, extensible and easy-to-implement neural network API, which can be used to build deep learning applications with a high level of abstraction.

model = ResNet50(weights='imagenet')

img_path = 'elephant.jpg'
img = image.load_img(img_path, target_size=(224, 224))
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)

preds = model.predict(x)

By: Dr K.S. Kuppusamy The author is an assistant professor of computer science at the School of Engineering and Technology, Pondicherry Central University. He has 12+ years of teaching and research experience in academia and in industry. He can be reached at

# decode the results into a list of tuples (class, description, probability)
# (one such list for each sample in the batch)

Continued from page...61

All the clients you find on GitHub implement this signature technique, so you should not have to do it manually. Now that you have explored the API through the UI and you understand how to make low-level calls, pick your favourite client or use CloudMonkey. This is a sub-project of Apache CloudStack and gives operators/developers the ability to use any of the API methods.

Testing the AWS API interface: While the native CloudStack API is not a standard, CloudStack provides an AWS EC2 compatible interface. A great advantage of this is that existing tools written with EC2 libraries can be reused against a CloudStack based cloud. In the installation section, we described how to run this interface by installing packages. In this section, we find out how to compile the interface with Maven and test it with the Python Boto module. Using a running management server (with DevCloud for instance), start the AWS API interface in a separate shell with the following command:

mvn -Pawsapi -pl :cloud-awsapi jetty:run

Log into the CloudStack UI at http://localhost:8080/client, go to Service Offerings and edit one of the compute offerings to have the name m1.small or any of the other AWS EC2 instance types. With access and secret keys generated for a user, you should now be able to use the Python Boto module:

import boto
import boto.ec2

accesskey = "2IUSA5xylbsPSnBQFoWXKg3RvjHgsufcKhC1SeiCbeEc0obKwUlwJamB_gFmMJkFHYHTIafpUx0pHcfLvt-dzw"
secretkey = "oxV5Dhhk5ufNowey7OVHgWxCBVS4deTl9qL0EqMthfPBuy3ScHPo2fifDxw1aXeL5cyH10hnLOKjyKphcXGeDA"

region = boto.ec2.regioninfo.RegionInfo(name="ROOT", endpoint="localhost")
conn = boto.connect_ec2(aws_access_key_id=accesskey, aws_secret_access_key=secretkey, is_secure=False, region=region, port=7080, path="/awsapi", api_version="2012-08-15")

images = conn.get_all_images()
print images

res = images[0].run(instance_type='m1.small', security_groups=['default'])

Note the new api_version number in the connection object, and also note that there was no need to perform a user registration as in the case of previous CloudStack releases. Let us thank those at Apache for contributing yet another outstanding product to the open source community, along with the detailed documentation they have provided for CloudStack. All the contents, samples and pictures in this article are extracts from the CloudStack online documentation, and you may explore more about it at http://docs.cloudstack.

By: Somanath T.V.
The author has 14+ years' experience in the IT industry and is currently doing research in machine learning and related areas, along with his tenure at SS Consulting, Kochi. He can be reached at


Using jq to Consume JSON in the Shell This article is a tutorial on using jq as a JSON parser and fetching information about the weather from different cities.


JSON has become the most prevalent way of consuming Web APIs. If you try to find the API documentation of a popular service, chances are that the API will respond in JSON format. Many mainstream languages even have JSON parsers built in. But when it comes to shell scripting, there is no inbuilt JSON parser, and the only hackish way of processing JSON is with a combination of awk and sed, which are very painful to use. There are many JSON parsers apart from jq but, in this article, we will focus only on this option.


jq is a single binary program with no dependencies, so installation is as simple as downloading the binary from, copying it to /bin or /usr/bin and setting permissions. Many Linux distributions provide jq in their repositories, so installing jq is as easy as using one of the following commands:

sudo apt install jq



For this demonstration, version 1.5 of jq was used. All the code examples are available at jatindhankhar/jq-tutorial. jq can be used in conjunction with other tools like cat and curl, by piping, or it can read directly from a file, although the former is more popular in practice. When working with jq, two fantastic resources can be used. The first one is the documentation at, and the second is the Online Playground (, where one can play with jq and even share snippets. Throughout this article, we will use different API endpoints of the MetaWeather API (https://www. The simplest use of jq is to pretty-print JSON data. Let's fetch the list of cities that contain the word 'new' in them, and then use this information to further fetch details of a particular city, as follows:

curl -sS search/?query=new

sudo pacman -S jq

The above command will fetch all cities containing ‘new’ in their name. At this point, the output is not formatted.

Installation instructions may vary depending upon the distribution. Detailed instructions are available at https://

[{"title":"New York","location_type":"City","woeid":2459115,"latt_long":"40.71455,-74.007118"},{"title":"New Delhi","location_type":"City","woeid":28743736,"latt_long":"28.643999,77.091003"},{"title":"New Orleans","location_type":"City","woeid":2458833,"latt_long":"29.953690,-90.077713"},{"title":"Newcastle","location_type":"City","woeid":30079,"latt_long":"54.977940,-1.611620"},{"title":"Newark","location_type":"City","woeid":2459269,"latt_long":"40.731972,-74.174179"}]

Let's pretty-print it by piping the curl output to jq as follows:

curl -sS search/\?query\=new | jq

The screenshot shown in Figure 1 compares the output of both commands. Now that we have some data to work upon, we can use jq to filter the keys. The simplest filter available is '.', which does nothing and passes the whole document through as it is. Filters are passed to jq in single quotes. By looking at the output, we can see that all the objects are wrapped inside a JSON array. To filter out the array, we use .[], which will display all the items inside it. To target a specific item, we place its index inside the brackets. To display the first item, use the following code:

curl -sS search/\?query\=new | jq '.[0]'
{
  "title": "New York",
  "location_type": "City",
  "woeid": 2459115,
  "latt_long": "40.71455,-74.007118"
}

To display only the available cities, we add another filter, which is the key name itself (in our case, .title). We can combine multiple filters using the | (pipe) operator. Here we combine the .[] filter with .title in this way: .[] | .title. For simple queries, we can avoid the | operator and rewrite it as .[] .title, but we will use the | operator to combine queries.

curl -sS search/\?query\=new | jq '.[] | .title'
"New York"
"New Delhi"
"New Orleans"
"Newcastle"
"Newark"

But what if we want to display multiple keys together? Just separate them with ','. Now, let's display each city along with its ID (woeid):

curl -sS search/\?query\=new | jq '.[] | .title,.woeid'
"New York"
2459115
"New Delhi"
28743736
"New Orleans"
2458833
"Newcastle"
30079
"Newark"
2459269

Figure 1: Output comparison


Figure 2: Basic filters

The output looks good, but what if we want to format the output and print it on a single line? For that, we can use string interpolation. To use keys inside a string pattern, we escape the parentheses with a backslash so that they are interpolated rather than treated literally.

curl -sS search/\?query\=new | jq '.[] | "For \(.title) code is \(.woeid)"'

The JSON structure for this endpoint looks like what's shown in Figure 3. consolidated_weather contains an array of JSON objects with weather information, and the sources key contains an array of JSON objects from which the particular weather information was fetched. This time, let's store the JSON in a file named weather.json instead of directly piping the data. This will help us

"For New York code is 2459115"
"For New Delhi code is 28743736"
"For New Orleans code is 2458833"
"For Newcastle code is 30079"
"For Newark code is 2459269"

In our case, the JSON is small, but if it is too big and we need to filter it based on a key value (like displaying the information for New Delhi), jq provides the select keyword for that operation.

curl -sS search/\?query\=new | jq '.[] | select(.title == "New Delhi")'
{
  "title": "New Delhi",
  "location_type": "City",
  "woeid": 28743736,
  "latt_long": "28.643999,77.091003"
}
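As a side note (not part of the jq tutorial), the same select-style filter can be mirrored with Python's json module when jq is unavailable; the sample data below is an abridged subset of the response shown above:

```python
import json

# abridged subset of the search response shown earlier
raw = '[{"title": "New York", "woeid": 2459115}, {"title": "New Delhi", "woeid": 28743736}]'
cities = json.loads(raw)

# equivalent of the jq filter: .[] | select(.title == "New Delhi")
matches = [c for c in cities if c["title"] == "New Delhi"]
print(matches[0]["woeid"])  # 28743736
```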

Now that we have the Where on Earth ID (woeid) for New Delhi, we can retrieve more information about New Delhi using the endpoint location/woeid/.

Figure 3: JSON structure


avoid making an API call every time we want to perform an operation; instead, we can use the saved JSON.

curl -sS > weather.json

Now we can use jq in the format jq 'filters' weather.json, and we can also load filters from a file using the -f parameter. The command is jq -f filters.txt weather.json, but we can just load the JSON file and pass the filters on the command line. Let's list the weather followed by the source name. Since sources and consolidated_weather are of the same length (get the length using the length filter), we can use range to generate an index and use string interpolation. There are inbuilt transpose and map functions as well; covering all of them won't be possible in a single article.

jq 'range(0;([.sources[]] | length)) as $i | "\(.sources[$i].title) predicts \(.consolidated_weather[$i].weather_state_name)"' weather.json


"BBC predicts Light Cloud"
" predicts Clear"
"Met Office predicts Clear"
"OpenWeatherMap predicts Clear"
"World Weather Online predicts Clear"
"Yahoo predicts Clear"

Figure 4: Final output

. as $root | print_location, (.consolidated_weather | process_weather_data)

There are so many functions and filters, but we will use the sort_by and date functions, and end this article by printing the forecast for each day in ascending order.

# Format Date
# This function takes a value via the pipe (|) operator
def format_date(x): x | strptime("%Y-%m-%d") | mktime | strftime("%a - %d, %B");

def print_location: . | "
Location: \(.title)
Coordinates : \(.latt_long)
";

Save the above code as filter.txt. sort_by sorts the values by date. format_date takes a date as a parameter and extracts the short day name, date and month. print_location and print_data do not take any parameters, and can be applied after the pipe operator; the default input for a parameterless function will be '.'.

jq -f filter.txt weather.json -r

-r will return raw strings. The output is shown in Figure 4. I hope this article has given you an overview of all that jq can achieve. If you are looking for a tool that is easy to use in shell scripts, jq can help you out; so give it a try.

def print_data: . | "
------------------------------------------------
| \(format_date(.applicable_date))\t\t |
| Humidity : \(.humidity)\t\t |
| Weather State: \(.weather_state_name)\t\t\t |
------------------------------------------------";

def process_weather_data: . | sort_by(.applicable_date)[] | print_data;


By: Jatin Dhankhar
The author loves working with modern C++, Ruby, JavaScript and Haskell. He can be reached at


Developing Real-Time Notification and Monitoring Apps in IBM Bluemix IBM Bluemix is a cloud PaaS that supports numerous programming languages and services. It can be used to build, run, deploy and manage applications on the cloud. This article guides the reader in building a weather app as well as an app to remotely monitor vehicle drivers.


Cloud computing is one of the emerging research technologies today. With the wide use of sensor and wireless based technologies, the cloud has expanded to the Cloud of Things (CoT), which is the merger of cloud computing and the Internet of Things (IoT). These technologies provide for the transmission and processing of huge amounts of information on different channels with different protocols. This integration of technologies is generating volumes of data, which then has to be disseminated for effective decision making in multiple domains like business analytics, weather forecasting, location maps, etc. A number of cloud service providers deliver cloud based services and application programming interfaces (APIs) to users and developers. Cloud computing has different paradigms and delivery models including IaaS, PaaS, SaaS, Communication as a Service (CaaS) and many others. A cloud computing environment that uses a mix of cloud services is known as the hybrid cloud. There are many hybrid clouds, which differ on the basis of the types of services and features in their cloud environment. The prominent cloud service providers include IBM Bluemix, Amazon Web Services (AWS), Red Hat OpenShift, Google Cloud Platform, etc.

IBM Bluemix and cloud services

IBM Bluemix is a powerful, multi-featured, hybrid cloud environment that delivers assorted cloud services, APIs and

development tools without any complexities. In general, IBM Bluemix is used as a Platform as a Service (PaaS), as it has many programming platforms for almost all applications. It provides programming platforms for PHP, Java, Go, Ruby, Node.js, ASP.NET, Tomcat, Swift, Ruby Sinatra, Python, Scala, SQL databases, NoSQL platforms, and many others. Function as a Service (FaaS) is also integrated in IBM Bluemix along with serverless computing, leading to a higher degree of performance and accuracy. IBM Bluemix’s services began in 2014 and gained popularity within three years.

Creating real-time monitoring apps in IBM Bluemix

IBM Bluemix presents high performance cloud services with the integration of the Internet of Things (IoT) so that real-time applications can be developed for corporate as well as personal use. Different types of applications can be programmed for remote monitoring by using IBM and third party services. In the following scenarios, the implementation aspects of weather monitoring and vehicle driver behaviour analysis are covered.

Creating a weather notification app

A weather notification app can be easily created with IBM Bluemix so that real-time messages to the client can be delivered effectively. It can be used for real-life scenarios during trips and other purposes. Those planning to travel to a particular place can get real-time weather notifications.



Figure 2: Creating an IoT app in IBM Bluemix

Figure 1: Login panel for IBM Bluemix Figure 3: Selecting Weather Services in IBM Bluemix

The remote analytics of the weather, including temperature, humidity and other parameters can be received by travellers, and they can take the appropriate precautions. First, a new IBM ID needs to be created on https:// IBM Bluemix provides users a 30-day trial, without need for credit card authentication. Most other cloud services ask for international credit card authentication. Users of IBM Bluemix, on the other hand, can create, program and use the services with just a unique e-mail ID. After creating an account and logging in to the Bluemix panel, there are many cloud based services which can be programmed for real-time use. IBM Bluemix is a hybrid cloud that delivers many types of cloud applications including infrastructure, containers, VMWare, network management, storage, the Internet of Things, data analytics, high performance computation, mobile applications and Web apps. To create a weather notification app, just search the IBM Bluemix catalogue for the Boilerplates based Internet of Things. The name of the new cloud app can be mentioned. The other values and parameters can be set as default, as we are working on a free trial account. Once the Cloud Foundry is created, the next step is to connect this app with the Weather Services in IBM Bluemix. Search for Weather Company Data and IBM Bluemix will display the appropriate option. Using Weather Company Data, live temperature and weather parameters can be fetched effectively. The newly created app will be visible in the dashboard of IBM Bluemix and, automatically, a Cloudant database will be created here. This database will be used to store and process the data in JSON format.

Figure 4: Dashboard of IBM Bluemix with the weather notification app

IBM Cloudant is a cloud based database product of the non-relational distributed variety. Using the Node-RED editor, the overall flow of the process can be programmed and visualised for any type of service and process. Node-RED is a flow based integrated development tool created by IBM so that objects and their connections can be set up without complex programming. Once the connections are created with the IoT app and weather service, the different channels including input, weather services, and transformation of temperature format can be set in Node-RED. After the successful deployment and launch of the service, the output will be visible in the user panel. The conditions related to different categories of weather (cloudy, sunny, rainy, humid, etc) can be specified using the Node-RED editor.
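Data flowing from the Node-RED weather flow into Cloudant is stored as JSON documents. Purely as an illustration (the actual document schema depends on how the flow is wired; all field names below are hypothetical), a stored reading might look like this:

```json
{
  "_id": "reading-2017-12-01T10:15:00Z",
  "city": "New Delhi",
  "temperature_c": 24.5,
  "humidity": 58,
  "condition": "sunny"
}
```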

Creating an app for vehicles and to remotely monitor drivers

By using another IoT application, the multi-dimensional behaviour of vehicle drivers can be analysed remotely using IBM Bluemix on the Watson IoT platform.


Figure 5: Editing Node-RED editor

Figure 8: Selecting the driver behaviour service in IoT

Figure 6: Message display panel

Figure 7: Creating a context mapping app for driver behaviour analysis Figure 9: Credentials and tenant information

The following behavioural aspects of car drivers can be monitored remotely:
• Speed
• Braking attitude
• Sharp use of brakes
• Smooth or harsh acceleration
• Sharp turns
• Frequent U-turns
You will first need to visit the IoT Context Mapping Service in the IBM Bluemix console. The following credentials and authentication fields will be displayed after creating the service:
• Tenant ID
• User name
• Password
It should be noted that we have to select the free plan in every panel to avoid the charges for the cloud services provided by IBM Bluemix. After getting the tenant information and authentication tokens, the next step is to create remote devices with information about their unique identities. Every sensor based device has a unique ID, which is set in the IBM Watson IoT platform so that live monitoring from remote places can be done. The Watson platform provides options to enter different devices with the help of 'vehicles' and gateways. By using this approach, real-time signals are transmitted to satellites and then to the end user for taking further action and decisions.

Figure 10: Creating remote devices with the related information

After setting all the parameters with respect to devices and gateways, click on Deploy and wait for the ‘successfully deployed’ notification.

Twilio for live messaging

If there is a need for live messaging to a mobile phone, Twilio can be used. Using the Twilio APIs, messages can be delivered to smartphones and other points. Twilio provides APIs and authentication tokens, which can also be mapped in IBM Bluemix for live monitoring of weather or other IoT based applications.


By: Dr Gaurav Kumar The author is the MD of Magma Research and Consultancy Pvt Ltd, Ambala. He is associated with various academic and research institutes, where he delivers lectures and conducts technical workshops on the latest technologies and tools. He can be contacted at


Regular Expressions in Programming Languages: The JavaScript Story Each programming language has its own way of parsing regular expressions. We have looked at how regular expressions work with different languages in the earlier four articles in this series. Now we explore regular expressions in JavaScript.


In the previous issue of OSFY, we tackled pattern matching in PHP using regular expressions. PHP is most often used as a server-side scripting language, but what if your client doesn't want to bother the server with all the work? Well, then you have to process regular expressions at the client side with JavaScript, which is almost synonymous with client-side scripting. So, in this article, we'll discuss regular expressions in JavaScript. Though, technically, JavaScript is a general-purpose programming language, it is often used as a client-side scripting language to create interactive Web pages. With the help of JavaScript runtime environments like Node.js, JavaScript can also be used on the server side. However, in this article, we will discuss only the client-side scripting aspects of JavaScript, because we have already discussed regular expressions in a server-side scripting language, PHP. Just like we found out about PHP in the previous article in this series, you will mostly see JavaScript code embedded inside HTML script. As mentioned earlier in the series, limited knowledge of HTML syntax will in no way affect the understanding of the regular expressions used in JavaScript.

Though we are mostly interested in the use of regular expressions, as always, let's begin with a brief discussion of the syntax and history of JavaScript. JavaScript is an interpreted programming language. ECMAScript is a scripting language specification from the European Computer Manufacturers Association (ECMA) and the International Organization for Standardization (ISO), standardised in ECMA-262 and ISO/IEC 16262 for JavaScript. JavaScript was introduced by Netscape Navigator (now defunct) in 1995; soon Microsoft followed with its own version of JavaScript, which was officially named JScript. The first edition of ECMAScript was released in June 1997 in an effort to settle the disputes between Netscape and Microsoft regarding the standardisation of JavaScript. The latest edition of ECMAScript, version 8, was released in June 2017. All modern Web browsers support JavaScript with the help of a JavaScript engine that is based on the ECMAScript specification. Chrome V8, often known as V8, is an open source JavaScript engine developed by Google for the Chrome browser. Even though JavaScript has borrowed a lot of syntax from Java, do remember that JavaScript is not Java.

Standalone JavaScript applications

Now that we have some idea about the scope and evolution of JavaScript, the next obvious question is, can it be used to develop standalone applications rather than only being used as an embedded scripting language inside HTML scripts? Well, anything is possible with computers and yes, JavaScript can be used to develop standalone applications. But whether it is a good idea to do so or not is debatable. Anyway, there are many different JavaScript shells that allow you to run JavaScript code snippets directly. But, most often, this is done during testing and not for developing useful standalone JavaScript applications. Like standalone PHP applications, standalone JavaScript applications are also not very popular, because there are other programming languages more suitable for developing standalone applications. JSDB, JLS, JSPL, etc, are some JavaScript shells that will allow you to run standalone JavaScript applications. But I will use Node.js, which I mentioned earlier, to run our standalone JavaScript file first.js with the following single line of code:

console.log('This is a stand-alone application');

Open a terminal in the same directory containing the file first.js and execute the following command:

World' script hello.html in JavaScript:

<html>
<body>
<script>
alert('Hello World');
</script>
</body>
</html>

Now let us try to understand the code. The HTML part of the code is straightforward and needs no explanation. All the JavaScript code should be placed within the <script> tags (<script> and </script>). In this example, the following code uses the alert( ) function to display the message 'Hello World' in a dialogue box:

alert('Hello World');

To view the effect of the JavaScript code, open the file using any Web browser. I have used Mozilla Firefox for this purpose. Figure 2 shows the output of the file hello.html. Please note that a file containing JavaScript code alone can have the extension .js, whereas an HTML file with embedded JavaScript code will have the extension .html or .htm.

node -v

This will make sure that Node.js is installed in your system. If Node.js is not installed, install it and then execute the following command:

node first.js

…at the terminal to run the script first.js. The message ‘This is a stand-alone application’ is displayed on the terminal. Figure 1 shows the output of the script first.js. This and all the other scripts discussed in this article can be downloaded from source_code/

Figure 1: Standalone application in JavaScript

Hello World in JavaScript

Whenever someone discusses programming languages, it is customary to begin with ‘Hello World’ programs; so let us not change that tradition. The code below shows the ‘Hello

Figure 2: Hello World in JavaScript

Regular expressions in JavaScript

There are many different flavours of regular expressions used by various programming languages. In this series we have discussed two of the very popular regular expression styles. The Perl Compatible Regular Expressions (PCRE) style is very popular, and we have seen regular expressions in this style being used when we discussed the programming languages Python, Perl and PHP in some of the previous articles in this series. But we have also discussed the ECMAScript style of regular expressions when we discussed regular expressions in C++. If you refer to that article on


regular expressions in C++, you will come across some subtle differences between PCRE and the ECMAScript style of regular expressions. JavaScript also uses ECMAScript style regular expressions. JavaScript's support for regular expressions is built in and is available for direct use. Since we have already dealt with the syntax of the ECMAScript style of regular expressions, we can directly work with a simple JavaScript file containing regular expressions.

JavaScript with regular expressions

Consider the script called regex1.html shown below. To save some space I have only shown the JavaScript portion of the script and not the HTML code. But the complete file is available for download.

is the same as that of regex1.html. The next few lines of code involve an if-else block. The following line of code uses the search( ) method provided by the String object:

if( != -1)

The search( ) method takes a regular-expression pattern as an argument, and returns either the position of the start of the first matching substring or -1 if there is no match. If a match is found, the following line of code inside the if block prints the message ‘MATCH FOUND’ in bold:

document.write('<b>MATCH FOUND</b>');

Otherwise, the following line of code inside the else block prints the message ‘NO MATCH’ in bold:

<script>
var str = "Working with JavaScript";
var pat = /Java/;
if( != -1)
{
	document.write('<b>MATCH FOUND</b>');
}
else
{
	document.write('<b>NO MATCH</b>');
}
</script>

document.write('<b>NO MATCH</b>');

Remember, the search( ) method searches for a substring match and not for a complete word. This is the reason why the script reports ‘Match found’. If you are interested in a literal search for the word Java, then replace the line of code:

var pat = /Java/;

Open the file regex1.html in any Web browser and you will see the message ‘Match Found’ displayed on the Web page in bold text. Well, this is an anomaly, since we did not expect a match. So, now let us go through the JavaScript code in detail to find out what happened. The following line of code stores a string in the variable str:

var str = "Working with JavaScript";

The line of code shown below creates a regular expression pattern and stores it in the variable pat:

var pat = /Java/;

Regular expression patterns are specified as characters within a pair of forward slash ( / ) characters. Here, the regular expression pattern specifies the word Java. The RegExp object is used to specify regular expression patterns in JavaScript. This regular expression can also be defined with the RegExp( ) constructor using the following line of code:

var pat = new RegExp("Java");

This is instead of the line of code:

var pat = /Java/;

A script called regex2.html with this modification is available for download. The output for the script regex2.html

…with:

var pat = /\sJava\s/;

The script with this modification, regex3.html, is also available for download. The notation \s is used to denote a whitespace; this pattern makes sure that the word Java is present in the string on its own and not just as a substring in words like JavaScript, Javanese, etc. If you open the script regex3.html in a Web browser, you will see the message ‘NO MATCH’ displayed on the Web page.
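The difference between a substring match and the whitespace-delimited pattern can be checked quickly; note that \sJava\s also fails when Java sits at the very start or end of a string, where the \b word-boundary anchor (not used in the article's script) still matches. This snippet is illustrative only:

```javascript
var s = "Working with JavaScript";
console.log(/Java/.test(s));      // true  - plain substring match
console.log(/\sJava\s/.test(s));  // false - "Java" here is part of "JavaScript"

console.log(/\bJava\b/.test("I like Java"));  // true  - word boundary works at string end
console.log(/\sJava\s/.test("I like Java"));  // false - no whitespace after "Java"
```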

Methods for pattern matching

In the last section, we saw the search( ) method provided by the String object. The String object also provides three other methods for regular expression processing. The methods are replace( ), match( ) and split( ). Consider the script regex4.html shown below, which uses the method replace( ):

<html>
<body>
<form id="f1">
ENTER TEXT HERE: <input type="text" name="data" >
</form>
<button onclick="check( )">CLICK</button>
<script>
function check( )
{
	var x = document.getElementById("f1");

Developers Insight var text =””; text += x.elements[0].value; text = text.replace(/I am/i,”We are”); document.write(text); } </script> </body> </html>

Open the file regex4.html in a Web browser and you will see a text box to enter data and a button. If you enter a string like ‘I am good’, you will see the output message ‘We are good’ displayed on the Web page. Let us analyse the code in detail to understand how it works. There is an HTML form which contains the text box to enter data, with a button that, when pressed, calls a JavaScript function named check( ). The JavaScript code is placed inside the <script> tags. The following line of code gets the elements in the HTML form:

Figure 3: Input to regex4.html

Figure 4: Output of regex4.html

var x = document.getElementById("f1");

In this case, there is only one element in the HTML form: the text box. The following line of code reads the content of the text box into the variable text:

text += x.elements[0].value;

The following line of code uses the replace( ) method to test for a regular expression pattern; if a match is found, the matched substring is replaced with the replacement string:

text = text.replace(/I am/i, "We are");

In this case, the regular expression pattern is /I am/i and the replacement string is ‘We are’. If you observe carefully, you will see that the regular expression pattern is followed by an ‘i’. We came across similar constructs throughout the series. This ‘i’ is an example of a regular expression flag, and this particular one instructs the regular expression engine to perform a case-insensitive match. So you will get a match whether you enter ‘I AM’, ‘i am’ or even ‘i aM’. There are other flags also, like g, m, etc. The flag g will result in a global match rather than stopping after the first match. The flag m is used to enable the multi-line mode. Also note the fact that the replace( ) method did not replace the contents of the variable text; instead, it returned the modified string, which was then explicitly stored in the variable text. The following line of code writes the contents on to the Web page:

document.write(text);

Figure 3 shows the input for the script regex4.html and Figure 4 shows the output. A method called match( ) is also provided by the String object for regular expression processing. search( ) returns the starting index of the matched substring, whereas the match( ) method returns the matched substring itself. What will happen if we replace the line of code:

text = text.replace(/I am/i, "We are");

…in regex4.html with the following code?

text = text.match(/\d+/);

If you open the file regex5.html having this modification, enter the string ‘article part 5’ in the text box and press the button. You will see the number ‘5’ displayed on the Web page. Here the regular expression pattern is /\d+/, which matches one or more occurrences of a decimal digit. Another method provided by the String object for regular expression processing is split( ). This breaks the string on which it is called into an array of substrings, using the regular expression pattern as the separator. For example, replace the line of code:

text = text.replace(/I am/i, "We are");

…in regex4.html with the code:

text = text.split(".");

…to obtain regex6.html. If you open the file regex6.html, enter an IPv4 address such as in dotted-decimal notation in the text box and press the button. The IPv4 address will then be displayed as ‘192, 100, 50, 10’, since the address string is split into substrings based on the separator ‘.’ (dot).


String processing of regular expressions

In previous articles in this series we mostly dealt with regular expressions that processed numbers. For a change, in this article, we will look at some regular expressions to process strings.

Nowadays, computer science professionals from India face difficulties in deciding whether to use American English spelling or British English spelling while preparing technical documents. I always get confused with colour/color, programme/program, centre/center, pretence/pretense, etc. Let us look at a few simple techniques to handle situations like this.

The regular expression /colo(?:u)?r/ will match both the spellings ‘color’ and ‘colour’. The question mark symbol ( ? ) denotes zero or one occurrence of the preceding group of characters. The notation (?:u) groups u with the grouping operator ( ), and the ?: makes sure that the matched substring is not stored in memory unnecessarily. So, here a match is obtained with and without the letter u.

What about the spellings ‘programme’ and ‘program’? The regular expression /program(?:me)?/ will accept both these spellings. The regular expression /cent(?:re|er)/ will accept both the spellings, ‘center’ and ‘centre’. Here the pipe symbol ( | ) is used as an alternation operator.

What about words like ‘biscuit’ and ‘cookie’? In British English the word ‘biscuit’ is preferred over the word ‘cookie’ and the reverse is the case in American English. The regular expression /(?:cookie|biscuit)/ will accept both the words. The regular expression /preten[cs]e/ will

match both the spellings, ‘pretence’ and ‘pretense’. Here the character class operator [ ] is used in the regular expression pattern to match either the letter c or the letter s.

I have only discussed specific solutions to the problems mentioned here, so as to keep the regular expressions very simple. With the help of more complicated regular expressions, it is possible to solve many of these problems in a general way rather than solving individual cases. As mentioned earlier, C++ also uses ECMAScript style regular expressions, so any regular expression pattern we developed in the article on regular expressions in C++ can be used in JavaScript without any modifications.

Just like the pattern followed in the previous articles in this series, after a brief discussion of the specific programming language, in this case JavaScript, we moved on to the use of the regular expression syntax in that language. This should be enough for practitioners of JavaScript who are willing to get their hands dirty by practising with more regular expressions. In the next part of this series on regular expressions, we will discuss the very powerful programming language Java, a distant cousin of JavaScript.

By: Deepu Benson

The author is a free software enthusiast and his area of interest is theoretical computer science. He maintains a technical blog at He can be reached at
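The spelling patterns discussed above can be verified with the test( ) method of the RegExp object; a quick sketch:

```javascript
// ? after a non-capturing group: zero or one occurrence of 'u'
var colour = /colo(?:u)?r/;
console.log(colour.test("color"), colour.test("colour"));          // true true

// | inside the group: alternation between the two endings
var centre = /cent(?:re|er)/;
console.log(centre.test("center"), centre.test("centre"));         // true true

// [cs]: a character class matching either letter
var pretence = /preten[cs]e/;
console.log(pretence.test("pretence"), pretence.test("pretense")); // true true
```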


Developers Insight

Insights into Machine Learning

Machine learning is a fascinating study. If you are a beginner or simply curious about machine learning, this article covers the basics for you.


Machine learning is a set of methods by which computers make decisions autonomously. Using certain techniques, computers make decisions by considering or detecting patterns in past records and then predicting future occurrences. Different types of predictions are possible, such as about weather conditions and house prices. Apart from predictions, machines have learnt how to recognise faces in photographs, and even filter out email spam. Google, Yahoo, etc, use machine learning to detect spam emails.

Machine learning is widely implemented across all types of industries. If programming is used to achieve automation, then we can say that machine learning is used to automate the process of automation. In traditional programming, we use data and programs on computers to produce the output, whereas in machine learning, data and output are run on the computer to produce a program. We can compare machine learning with farming or gardening, where seeds --> algorithms, nutrients --> data, and the gardener and plants --> programs.

We can say machine learning enables computers to learn to perform tasks even though they have not been explicitly programmed to do so. Machine learning systems crawl through the data to find patterns and, when these are found, adjust the program's actions accordingly. With the help of pattern recognition and computational learning theory, one can study and develop algorithms (which can be built by learning from the sets of available data), on the basis of which the computer takes decisions. These algorithms are driven by

building a model from sample records. These models are used in developing decision trees, through which the system takes all the decisions. Machine learning programs are also structured in such a way that when exposed to new data, they learn and improve over time.

Implementing machine learning

Before we understand how machine learning is implemented in real life, let’s look at how machines are taught. The process of teaching machines is divided into three steps.
1. Data input: Text files, spreadsheets or SQL databases are fed as input to machines. This is called the training data for a machine.
2. Data abstraction: Data is structured using algorithms to represent it in simpler and more logical formats. Elementary learning is performed in this phase.
3. Generalisation: An abstract of the data is used as input to develop the insights. Practical application happens at this stage.
The success of the machine depends on two things:
• How well the generalisation of abstraction data happens.
• The accuracy of machines when translating their learning into practical usage for predicting the future set of actions.
In this process, every stage helps to construct a better version of the machine. Now let’s look at how we utilise the machine in real life. Before letting a machine perform any unsupervised task, the five steps listed below need to be followed.



Figure 1: Traditional programming vs machine learning

1. Collecting data: Data plays a vital role in the machine learning process. It can be from various sources and formats like Excel, Access, text files, etc. The higher the quality and quantity of the data, the better the machine learns. This is the base for future learning.
2. Preparing the data: After collecting data, its quality must be checked, and unnecessary noise and disturbances that are not of interest should be eliminated from the data. We need to take steps to fix issues such as missing data and the treatment of outliers.
3. Training the model: The appropriate algorithm is selected in this step and the data is represented in the form of a model. The cleaned data is divided into training data and testing data. The training data is used to develop the data model, while the testing data is used as a reference to ensure that the model has been trained well enough to produce accurate results.
4. Model evaluation: In this step, the accuracy and precision of the chosen algorithm is ensured based on the results obtained using the test data. This step is used to evaluate the choice of the algorithm.
5. Performance improvement: If the results are not satisfactory, then a different model can be chosen, or more variables are introduced to increase efficiency.

Types of machine learning algorithms

Machine learning algorithms have been classified into three major categories. Supervised learning: Supervised learning is the most commonly used. In this type of learning, algorithms produce a function which predicts the future outcome based on the input given (historical data). The name itself suggests that it generates output in a supervised fashion. So these predictive models are given instructions on what needs to be learnt and how it is to be learnt. Until the model achieves some acceptable level of efficiency or accuracy, it iterates over the training data. To illustrate this method, we can use the algorithm for sorting apples and mangoes from a basket full of fruits.
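The nearest neighbour algorithm mentioned for this task can be sketched in plain JavaScript; the fruit features (weight in grams and a colour score) and all the numbers here are invented purely for illustration:

```javascript
// Labelled training data: each record maps [weight, colourScore] to a label.
var training = [
    { features: [160, 0.9], label: "apple" },
    { features: [170, 0.8], label: "apple" },
    { features: [350, 0.3], label: "mango" },
    { features: [400, 0.2], label: "mango" }
];

// Squared Euclidean distance between two feature vectors.
function distance(a, b) {
    var sum = 0;
    for (var i = 0; i < a.length; i++) {
        sum += (a[i] - b[i]) * (a[i] - b[i]);
    }
    return sum;
}

// 1-nearest-neighbour: predict the label of the closest training record.
function classify(sample) {
    var best = training[0];
    for (var i = 1; i < training.length; i++) {
        if (distance(training[i].features, sample) <
            distance(best.features, sample)) {
            best = training[i];
        }
    }
    return best.label;
}

console.log(classify([165, 0.85]));  // apple
console.log(classify([380, 0.25]));  // mango
```

This is the supervised setting in miniature: the labels are given up front, and a new fruit is classified by comparing it with the labelled examples.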


Figure 2: The process of teaching machines

Figure 3: Implementing machine learning

Figure 4: Classification of algorithms into supervised (classification, regression/prediction), unsupervised (clustering, dimensionality reduction) and reinforcement (association analysis)

Figure 5: Supervised learning model (Image credit: Google)

Here we know how we can identify the fruits based on their colour, shape, size, etc. Some of the algorithms we can use here are the neural network, nearest neighbour, Naïve Bayes, decision trees and regression.


Figure 6: Unsupervised learning model (Image credit: Google)

Unsupervised learning: The objective of unsupervised learning algorithms is to represent the hidden structure of the data set in order to learn more about the data. Here, we only have input data with no corresponding output variables. Unsupervised learning algorithms develop descriptive models, which approach the problems irrespective of the knowledge of the results. So it is left to the system to find a pattern in the available inputs, in order to discover and predict the output. From many possible hypotheses, the optimal one is used to find the output.

Sorting apples and mangoes from a basket full of fruits can be done using unsupervised learning too. But this time the machine is not aware of the differentiating features of the fruits, such as colour, shape and size. We need to find similar features of the fruits and sort them accordingly. Some of the algorithms we can use here are the K-means clustering algorithm and hierarchical clustering.

Reinforcement learning: In this learning method, ideas and experiences supplement each other and are also linked with each other. Here, the machine trains itself based on the experiences it has had and applies that knowledge to solving problems. This saves a lot of time, as very little human interaction is required in this type of learning. It is also called the trial-error or association analysis technique, whereby the machine learns from its past experiences and applies its best knowledge to make decisions. For example, a doctor with many years of experience links a patient's symptoms to the illness based on that experience. So whenever a new patient comes, he uses his experience to diagnose the illness of the patient. Some of the algorithms we can use here are the Apriori algorithm and the Markov decision process.

Machine learning applications

Machine learning has ample applications in practically every domain. Some major domains in which it plays a vital role are shown in Figure 7.

Figure 7: Machine learning applications

Banking and financial services: Machine learning plays an important role in identifying customers for credit card offers. It also evaluates the risks involved with those offers. It can even predict which customers are most likely to default in repaying loans or credit card bills.

Healthcare: Machine learning is used to diagnose fatal illnesses from the symptoms of patients, by comparing them with the history of patients with a similar medical history.

Retail: Machine learning helps to spot the products that sell. It can differentiate between the fast selling products and the rest. That analysis helps retailers to increase or decrease the stocks of their products. It can also be used to recognise which product combinations can work wonders. Amazon, Flipkart and Walmart all use machine learning to generate more business.

Publishing and social media: Some publishing firms use machine learning to address queries and retrieve documents for their users based on their requirements and preferences. Machine learning is also used to narrow down search results and news feeds. Google and Facebook are the best examples of companies that use machine learning. Facebook also uses machine learning to suggest friends.

Games: Machine learning helps to formulate strategies for a game that requires the internal decision tree style of thinking and effective situational awareness. For example, we can build intelligent bots that learn as they play computer games.

Face detection/recognition: The most common example of face detection is this feature being widely available in smartphone cameras. Facial recognition has even evolved to the extent that the camera can figure out when to click, for instance, only when there is a smile on the face being photographed. Face recognition is used in Facebook to automatically tag people in photos. It's machine learning that has taught systems to detect a particular individual from a group photo.

Genetics: Machine learning helps to identify the genes associated with any particular disease.


Machine learning tools

There are enough open source tools and frameworks available to implement machine learning on a system. One can choose any, based on personal preferences for a specific language or environment.

Shogun: Shogun is one of the oldest machine learning libraries available in the market. It provides a wide range of efficient machine learning processes. It supports many languages such as Python, Octave, R, Java/Scala, Lua, C#, Ruby, etc, and platforms such as Linux/UNIX, MacOS and Windows. It is easy to use, and is quite fast at compilation and execution.

Weka: Weka is data mining software that has a collection of machine learning algorithms to mine the data. These algorithms can be applied directly to the data or called from Java code. Weka is a collection of tools for:
• Regression
• Clustering
• Association rules
• Data pre-processing
• Classification
• Visualisation

Apache Mahout: Apache Mahout is a free and open source project. It is used to build an environment to quickly create scalable machine learning algorithms for fields such as collaborative filtering, clustering and classification. It also supports Java libraries and Java collections for various kinds of mathematical operations.

TensorFlow: TensorFlow performs numerical computations using data flow graphs. It performs optimisations very well. It supports Python and C++, is highly flexible and portable, and also has diverse language options.

CUDA-Convnet: CUDA-Convnet is a machine learning library widely used for neural network applications. It has been developed in C++ and can even be used by those who prefer Python over C++. The resulting neural nets obtained as output from this library can be saved as Python-pickled objects, and those objects can be accessed from Python.

H2O: This is an open source machine learning as well as deep learning framework. It is developed using Java, Python and R, and its powerful graphic interface makes it easy to control training. H2O's algorithms are mainly used for business processes like fraud or trend predictions.

Languages that support machine learning

The languages given below support the implementation of machine learning:
• MATLAB
• R
• Python
• Java

But for a non-programmer, Weka is highly recommended when working with machine learning algorithms.

Advantages and challenges

The advantages of machine learning are:
• Machine learning helps the system to take decisions, based on the training data provided, even in a dynamic or undetermined state.
• It can handle multi-dimensional, multi-variety data, and can extract implicit relationships within large data sets in a dynamic, complex and chaotic environment.
• It saves a lot of time by tweaking, adding, or dropping different aspects of an algorithm to better structure the data.
• It also uses continuous quality improvement for any large or complex process.
• Multiple iterations are done to deliver the highest level of accuracy in the final model.
• Machine learning allows easy application and comfortable adjustment of parameters to improve classification performance.

The challenges of machine learning are as follows:
• A common challenge is the collection of relevant data. Once the data is available, it has to be pre-processed depending on the requirements of the specific algorithm used, which has a serious effect on the final results.
• Machine learning techniques are such that it is difficult to optimise non-differentiable, discontinuous loss functions. Discontinuous loss functions are important in cases such as sparse representations. Non-differentiable loss functions are approximated by smooth loss functions without much loss in sparsity.
• It is not guaranteed that machine learning algorithms will always work in every possible case. Applying them requires some awareness of the problem and also some experience in choosing the right machine learning algorithm.
• Collection of such large amounts of data can sometimes be an unmanageable and unwieldy task.


By: Palak Shah

The author is an associate consultant at Capgemini and loves to explore new technologies. She is an avid reader and writer of technology related articles. She can be reached at

Developers Overview

A Peek at Popular and Preferred Open Source Web Development Tools Web development tools allow developers to test the user interface of a website or a Web application, apart from debugging and testing their code. These tools should not be mistaken for Web builders and IDEs.


Ever wondered why different commercial software applications such as eBay, Amazon or various social platforms like Facebook, Twitter, etc, were initially developed as Web applications? The obvious answer is that users can easily access or use different Web applications whenever they feel like, with only the Internet. This is what helps different online retail applications lure their customers to their products. There is no need to install Web applications specifically on a given system, and the user need not even worry about the platform dependency associated with that application. Apart from these, there are many other factors that make Web applications very user friendly, which we will discuss as we go along.

A Web application is any client-server software application that makes use of a website as its interface or front-end. The user interface for any Web application runs in an Internet browser. The main function of any browser is to display the information received from a server and also to send the user's data back to the server. Let's consider the

instance of Microsoft Word and Google Docs. The former is a common desktop based word processing application which uses the MS Word software installed on the desktop. Google Docs is also a word processing application, but its users perform all the word processing functions using the Web browser in which it runs, instead of using software installed on their computers.

Different Web applications use Web documents, which are written in standard formats such as JavaScript and HTML, supported by a number of Web browsers. Web applications can actually be considered variants of the client-server software model, where the client software is downloaded to the client system when the relevant Web page is visited, using standard procedures such as HTTP. Client Web software updates may take place whenever the Web page is visited. During any session, the Web browser interprets and then displays the pages, and hence acts as the universal client for any Web application.









Figure 1: Different stages in Web application development (Image source:

Web application development since its inception

Initially, each individual Web page was delivered as a static document to the client, but the sequence of pages could still provide an interactive experience, since a user's input was returned using Web form elements present in the page markup. In the 1990s, Netscape came up with a client-side scripting language named JavaScript. Programmers could now add some dynamic elements to the user interface, which ran on the client side. Just after the arrival of Netscape's JavaScript, Macromedia introduced Flash, a vector animation player that can be added to browsers as a plugin to insert animations in Web pages. It even allows the use of scripting languages to program the interactions on the client side, without communicating with the server.

Next, the concept of a Web application was introduced in the Java programming language in the Servlet Specification (version 2.2). This was around the time the XMLHttpRequest object was introduced on Internet Explorer 5 as an ActiveX object. After all this, Ajax came in, and applications like Gmail made their client sides more interactive. Now, a Web page script is able to contact the server to retrieve and store data without downloading the entire Web page.

We should not forget HTML5, which was developed to provide multimedia and graphic capabilities without any need for client side plugins. The APIs and the Document Object Model (DOM) are fundamental parts of the HTML5 specification. The WebGL API led the way for advanced 3D graphics using the HTML5 canvas and the JavaScript language. If we talk about the current situation, we have different programming languages like Python and PHP (apart from Java) which can be used to develop any Web application. We also have different frameworks and open source tools that really help in developing a full-fledged Web application quite easily. So let's discuss a few such tools as we go forward.

But first, let’s discuss why Web applications are needed and what their benefits are.
1. Cost-effective development: Different users can access any Web application via a uniform environment, which is nothing but a Web browser. The interaction of the user with the application needs to be tested on different Web browsers, but the application itself can be developed for a single operating system. As it is not necessary to develop and test the application on all possible operating system (OS) versions and configurations, development and troubleshooting become much easier.
2. Accessible anywhere: Web applications are accessible anywhere, anytime and on any PC with the help of an Internet connection. This also opens up the possibility of using Web applications for real-time collaboration and accessing them remotely.
3. Easily customisable: It is easier to customise the user interface of Web based applications than of desktop applications. Hence, it’s easier to update and customise the look and feel of Web applications, or the way their information is presented to different user groups.
4. Can be accessed by a range of devices: The content of any Web application can also be customised for use on any type of device connected to the Internet. This helps to access the application through mobile phones, tablets, PCs, etc. Users can receive or interact with the information in a way that best suits them.
5. Improved interoperability: The Web based architecture of an application makes it possible to easily integrate enterprise systems, and improve workflow and other business processes. With the help of Internet technologies, we get an adaptable and flexible business model that can be changed according to market demands.
6. Easier installation and maintenance of the application: Any Web based application can be installed and maintained with comparatively few complications.
Once a new version of the application is installed on the host server, all users can access it directly without any need to upgrade the PC of each user. The roll-out of new software can also be accomplished easily, requiring only that the users have updated browsers and plugins.

Figure 2: Benefits of Web application development (Image source:

7. Can adapt to increased workloads: If a Web application requires higher power to perform certain tasks, then only the server hardware needs to be upgraded. The capacity of a Web application can be increased by ‘clustering’ the software on different servers simultaneously.
8. Increased security: Web applications are deployed on dedicated servers which are maintained and monitored by experienced server administrators. This leads to tighter security, and any potential breaches are noticed far more quickly.
9. Flexible core technologies: We can use any of three core technologies for building Web applications, based on the requirements of the application. Java-based solutions (J2EE) involve technologies such as servlets and JSP. The Microsoft .NET platform makes use of SQL Server, Active Server Pages and .NET scripting languages. The last option is the open source platforms (PHP and MySQL), which are best suited for smaller and low budget websites.

Different open source tools for developing Web applications

KompoZer

This is an open source HTML editor, based on the Nvu editor. It’s maintained as a community-driven software development project hosted on SourceForge. KompoZer’s WYSIWYG (what you see is what you get) editing capabilities are among its main attractions. The latest of its pre-release versions is KompoZer 0.8 beta 3; its last stable version was KompoZer 0.7.10, released in August 2007. It complies with the Web standards of the W3C. By default, Web pages are created in accordance with HTML 4.01 Strict. It uses Cascading Style Sheets (CSS) for styling purposes, but the user can change the settings and choose between HTML 4.01 and XHTML 1.0, Strict and Transitional DTDs, and CSS styling or the old <font> based styling. KompoZer can call on the W3C HTML validator, which uploads Web pages to the W3C Markup Validation Service and checks them for compliance.

Features
1. Available free of cost.
2. Easy to use, hence even non-techies can work with it.
3. Combines Web file management and easy-to-use WYSIWYG Web page editing.
4. Allows direct code editing.
5. Supports a split code/graphic view.


phpMyAdmin

phpMyAdmin is an open source software tool written in PHP. It handles the administration of MySQL over the Web.

Figure 3: Login page for phpMyAdmin (Image source:

It supports a wide range of operations on MariaDB and MySQL. Some frequently used operations (such as managing databases, relations, tables, indexes, users, etc) can be performed with the help of the user interface, while any SQL statement can still be executed directly.

Features
1. Has an intuitive Web interface.
2. Imports data from SQL and CSV.
3. Can administer multiple servers.
4. Has the ability to search globally in a database or a subset of it.
5. Can create complex queries using QBE (Query-by-example).
6. Can create graphics of our database layout in various formats.
7. Can export data to various formats like SQL, CSV, PDF, etc.
8. Supports most MySQL features such as tables, views, indexes, etc.


XAMPP

In XAMPP, the X denotes ‘cross-platform’, A stands for the Apache HTTP server, M for MySQL, and the two Ps for PHP and Perl. This platform is very popular, and is widely preferred for open source Web application development. Developing a Web application using XAMPP makes it easy to stack together a number of programs to constitute the application as desired. The best part of Web applications developed using XAMPP is that they are open source, require no licensing, and are free to use. They can be customised according to one’s requirements. Although XAMPP can be installed on all the different platforms, its installation file is specific to a platform.

Features
1. Can be installed on all operating systems.


2. Easy installation and configuration.
3. Live community support.
4. Supports easy syndication of the operating system, application server, programming language and database to develop any open source Web application in an optimal development time.
5. An all-in-one solution, with just one control panel for installing and configuring all the packaged programs.

Firefox Web Developer Toolbar

The Firefox Developer Toolbar gives us command-line access to a large number of developer tools within Firefox. It’s a graphical command line interpreter, which provides integrated help for its commands and also displays rich output with the power of a command line. It is considered to be extensible, as we can add our own local commands and even convert those into add-ons so that others can also install and use them. We can open the Developer Toolbar by pressing Shift+F2. This will appear at the bottom of the browser as shown in Figure 4.

Some Developer Toolbar commands

The toolbar includes commands to:
• Open, close, and clear the console
• View and manipulate appcache entries
• Control the debugger
• Log function calls to the console
• List all the installed add-ons, and enable or disable a specific add-on
• List, add, or remove breakpoints
• Disconnect from a remote server
• Edit one of the resources loaded by the page
• Export the page
• Examine a node in the inspector
The command-line prompt takes up most of the toolbar, with the ‘Close’ button on its left and a button to toggle the Toolbox on the right. Pressing Shift+F2, or selecting the Developer Toolbar menu item again, toggles the toolbar.


By: Vivek Ratan

Figure 4: Firefox Developer Toolbar present at the bottom of the browser


The author is a B. Tech in electronics and instrumentation engineering. He currently works as an automation test engineer at Infosys, Pune. He can be reached at for any suggestions or queries.

The latest from the Open Source world is here. Join the community at Follow us on Twitter @OpenSourceForU | OPEN SOURCE FOR YOU | DECEMBER 2017 | 85

Developers Let's Try

Simplify and Speed Up App Development with OpenShift

OpenShift is Red Hat's container application platform that brings Docker and Kubernetes into play when deployed. With OpenShift, you can easily and quickly build, develop and deploy applications, irrespective of the platform being used. It is an example of a Platform as a Service.


Platform as a Service or PaaS is a cloud computing service model that reduces the complexity of building and maintaining the computing infrastructure. It gives an easy and accessible environment to create, run and deploy applications, saving developers all the chaotic work such as setting up, configuring and managing resources like servers and databases. It speeds up app development, allowing users to focus on the application itself rather than worry about the infrastructure and runtime environment. Initially, PaaS was available only on the public cloud. Later, private and hybrid PaaS options were created. Hybrid PaaS is typically a deployment consisting of a mix of public and private deployments. PaaS services available in the cloud can be integrated with resources available on the premises. PaaS offerings can also include facilities for application design, application development, testing and deployment. PaaS services may include team collaboration, Web service integration, marshalling, database integration, security, scalability, storage, persistence, state management, application versioning and developer community facilitation, as well as

mechanisms for service management, such as monitoring, workflow management, discovery and reservation. There are some disadvantages of using PaaS. Every user may not have access to the full range of tools, or to high-end tools like relational databases. Another problem is that PaaS is open only for certain platforms. Users need to depend on the cloud service providers to update the tools and to stay in sync with other changes in the platform. They don't have control over this aspect.


OpenShift is an example of a PaaS and is offered by Red Hat. It provides an API to manage its services. OpenShift Origin allows you to create and manage containers. OpenShift helps you to develop, deploy and manage applications which are container-based, and enables faster development and release life cycles. Containers are standalone processes, with their own environment, and are not dependent on the operating system or the underlying infrastructure on which they run.


Types of OpenShift services

OpenShift Origin: OpenShift Origin is an open source application container platform from Red Hat, released under the Apache licence. It uses the core of Docker container packaging and Kubernetes container cluster management, which enables it to provide services to create and manage containers easily. Essentially, it helps you to create, deploy and manage applications in containers that are independent of the operating system and the underlying infrastructure.

OpenShift Online: This is Red Hat's public cloud service.

OpenShift Dedicated: As its name suggests, this is Red Hat's offering for maintaining private clusters. Red Hat OpenShift Dedicated provides support for application images, database images, Red Hat JBoss middleware for OpenShift, and Quickstart application templates. Users can get this on the Amazon Web Services (AWS) and Google Cloud Platform (GCP) marketplaces.

OpenShift Container: OpenShift Container or OpenShift Enterprise is a private PaaS product from Red Hat. It includes the best of both worlds—containers powered by Docker and the management provided by Kubernetes. Red Hat announced OpenShift Container Platform 3.6 on August 9, 2017.

OpenShift features

• In OpenShift, we can create applications using programming languages such as Java, Node.js, .NET, Ruby, Python and PHP.
• OpenShift provides templates that allow you to build (compile and create packages) and release application frameworks and databases. It also provides service images and templates of JBoss middleware, which are available as a service on OpenShift.
• A user can build applications and deploy them across different environments.
• OpenShift provides full access to a private database copy with fully fledged control, as well as a choice of datastores like MariaDB, MySQL, PostgreSQL, MongoDB, Redis and SQLite.
• Users can benefit from a large community of Docker-formatted Linux containers. OpenShift can work directly with the Docker API, which unlocks a new world of content for developers.
• Simple methods are used to deploy to OpenShift, such as clicking a button or entering a Git push command.
• OpenShift is designed to reduce many systems administration problems related to building and deploying containerised applications, and it permits the user to fully control the deployment life cycle.
• The OpenShift platform includes Jenkins, an open source automation server that can be used for continuous integration and delivery. It can integrate unit test case results, promote builds, and orchestrate build jobs by using upstream and downstream jobs to create a pipeline.
• OpenShift supports integration with IDEs such as Eclipse, Visual Studio and JBoss Developer Studio, making it easy to work with OpenShift from any of these IDEs.

Figure 1: Types of OpenShift services

Getting started with OpenShift Online

Let’s take a quick tour of OpenShift Online. Go to

Figure 2: OpenShift Dedicated supports application images, database images, Red Hat JBoss middleware for OpenShift, and Quickstart application templates

Figure 3: OpenShift Online dashboard


Figure 4: OpenShift online login

Figure 6: OpenShift Online — authorize Red Hat developer

Figure 5: OpenShift Online – sign in to GitHub

Click on Login and log in using any social media account. Sign in to GitHub and click on Authorize redhat-developer. Provide your account details, then verify the email address using your email account. Next, select a starter plan, followed by the region you want, and confirm the subscription. Your account will now be provisioned.

On the OpenShift Online dashboard, click on Create Project and, on the Welcome to OpenShift page, provide the name and display name. Next, click on Create. Select the language from the Browse Catalogue option. Select Red Hat JBoss Web Server (Tomcat). Select the version, provide the name and the Git repository URL. Next, click on Create. You will get the ‘Application created’ message. Click on Continue to overview. Go to the Overview section of the project created and verify the details related to the application.

By following the above steps, you have created your first project. Now you can continue exploring further to get a better understanding of OpenShift Online.

By: Bhagyashri Jain and Mitesh S. Bhagyashri Jain is a systems engineer and loves Android development. She likes to read and share daily news on her blog at Mitesh S. is the author of the book, ‘Implementing DevOps with Microsoft Azure’. He occasionally contributes to and Book link: https://


Cloud Foundry is an industry standard cloud application platform. Developers can use it to build apps without having to worry about the nitty gritty of hardware and software maintenance. By focusing solely on the application, they can be more productive.


Why opt for a PaaS offering like Cloud Foundry?

Cloud Foundry is an open source, Platform as a Service (PaaS) offering, governed by the Cloud Foundry Foundation. You can deploy it on AWS, Azure, GCP, vSphere or your own computing infrastructure.

The different landscapes for applications

Let’s take a step back and quickly check out all the landscapes for applications. If you want an application to cater to one of your needs, there are several ways of getting it to do so. 1. Traditional IT: In a traditional landscape, you can procure your infrastructure, manage all the servers, handle the data and build applications on top of it. This gives you the most control, but also adds operational complexity and cost. 2. Infrastructure as a Service (IaaS): In this case, you can buy or lease the infrastructure from a service provider, install your own operating system, programming runtimes, databases, etc, and build your custom applications on top of it. Examples include AWS, Azure, etc. 3. Platform as a Service (PaaS): With this, you get a complete platform from the service provider, with the hardware, operating system, and runtimes managed by the service provider --you can build applications on top of it. Examples include Cloud Foundry, OpenShift, etc. 4. Software as a Service (SaaS): Here, the service provider has already a pre-built application running on the cloud —if it suits your needs, you just get a subscription and use it. There might be a pre-existing application to meet your needs but if there isn’t, this offering provides very little room for customisation. Examples include Gmail, Salesforce, etc.

Choosing a PaaS offering has multiple benefits. It abstracts away the hardware and infrastructure details so your workforce can concentrate more on application development, and you require very few operations to be managed by the IT team. This leads to faster turnaround times for your applications and better cost optimisation. It also helps in rapid prototyping as you have the platform taken care of; so you can build prototypes around your business problems more rapidly.

Cloud Foundry: A brief description

Cloud Foundry is multi-cloud, open source software that can be hosted on AWS, Azure, GCP or your own stack. Since Cloud Foundry is open source, you get application portability out-of-the-box, i.e., you are not locked in to a vendor. You can build your apps on it and move them across any of the Cloud Foundry providers. The Cloud Foundry project is managed by the Cloud Foundry Foundation, whose mission is to establish and sustain the development of the platform and to provide continuous innovation and value to the users and operators of Cloud Foundry. The members of the Cloud Foundry Foundation include Cisco, Dell, EMC, GE, Google, IBM, Microsoft, Pivotal, SAP, SUSE and VMware. From a developer's point of view, Cloud Foundry has support for buildpacks and services. Buildpacks provide the framework and runtime support for apps. Typically, they examine your apps to determine what dependencies

Figure 1: Cloud landscapes
to download, and how to configure the apps to communicate with bound services. Services are often externally managed components that may or may not be hosted on the Cloud Foundry stack (examples include databases, caches, etc). They are available in the marketplace, and can be consumed by the application by binding to them.

Figure 2: Cloud Foundry architecture

Cloud Foundry architecture

Cloud Foundry components can be broadly classified under the following categories.
1. Routing
• Router: This is the entry point into the Cloud Foundry (CF) instance. CF provides REST services for administration purposes too; so the call is initially received by the router and redirected to the Cloud Controller, if it's an administration call, or to an application running on the stack.
2. Authentication
• User account and authentication (UAA) server: The role of the UAA server is to log in users and issue OAuth tokens for those logged in, which can be used by the applications. It can also provide SSO services, and has endpoints for registering OAuth clients and user management functions.
3. Application life cycle
• Cloud Controller: The Cloud Controller is responsible for the deployment of applications. When you push an app to CF, it reaches the Cloud Controller, which coordinates with other components and deploys the application on individual cells in your space.
• Application availability components (nsync, BBS and Cell Reps): These components are responsible for the health management of the applications. They constantly monitor an application's state and reconcile it with its expected state, starting and stopping processes as required.
4. App storage and execution
• Blob storage: This is binary storage, which stores your application binaries and the buildpacks that are used to run the applications.
• Application execution framework (Diego): Application instances, application tasks and staging tasks all run as Garden containers on the Diego cell VMs. The Diego cell rep component manages the life cycle of those containers and the processes running in them. It reports their status to the Diego BBS, and emits their logs and metrics to Loggregator.



5. Services
• Service broker: Cloud Foundry has external components like databases, SaaS applications and platform features (e.g., the platform can offer services for analytics or machine learning), which are classified as services. These services can be bound to an application and consumed by it. The service broker is responsible for provisioning an instance of the service and binding it to the application.
6. Messaging
• The platform's component VMs communicate with each other and share messages over HTTP and HTTPS. This component is responsible for sharing messages, and also for storing long-lived data (like the IP address of a container in a Consul server) and short-lived data (like application status and heartbeat messages) on Diego's bulletin board system (BBS).
7. Metrics and logging
• Metrics collector: This collects statistics from the components, which are used for book-keeping and health management by the framework as well as by the operators managing the infrastructure.
• Loggregator: Applications built on top of the Cloud Foundry stack need to write their logs to the system output streams. These streams are received by the Loggregator, which can redirect them to file systems, databases or external log management services.
Cloud Foundry is a highly scalable, easy-to-manage, open source platform that can be used to develop applications of all types and sizes. To get further information about the ecosystem, you can visit


By: Shiva Saxena The author is a FOSS enthusiast. He currently works as a consultant, and is involved in developing enterprise applications and Software-as-a-Service (SaaS) products. He has hands-on development experience with Android, Apache Camel, C#, .NET, Hadoop, HTML5, Java, OData, PHP, React, etc, and loves to explore new and bleeding-edge technologies. He can be reached at

Let's Try For U & Me

DokuWiki: An Ace Among the Wikis

A wiki could be defined as a simple, collaborative content management system which allows content editing. DokuWiki can be used for several purposes: as a discussion forum for members of a team, for providing tutorials and guides, for sharing knowledge about a topic, and so on.

Once you have finished typing the content on a page, you can preview it by clicking the ‘Preview’ button at the bottom of the editor. This will show you how the page will be displayed after it is published. Pages can be removed by purging all their contents. Once a page is removed, all its interlinked pages too get removed. But we can restore a page by choosing its ‘Old revisions’ option. This option stores snapshots of the page over different time periods, so it is easy to restore a page with its contents from a particular time period.



DokuWiki is PHP-powered, modest but versatile wiki software that handles all its data in plain text format, so no database is required. It has a clear and understandable structure for your data, which allows you to manage your wiki without hassles. DokuWiki is really flexible and offers various customisation options at different levels. Since DokuWiki is open source, it has a huge collection of plugins and templates which extend its functionality. It is also well documented and supported by a vast community. Although DokuWiki has numerous features, this article focuses only on the basics, in order to get readers started.


Pages can be easily created in DokuWiki by simply launching it on your browser. The first time, you will be shown a page like the one in Figure 1. Find the pencil icon on the right side; clicking on it will open up an editor and that’s it. You can start writing content on that page. The various sections of the page can be identified by the headings provided on it. The sections are listed out as a ‘Table of Contents’ on the top right side of the page.

Typically, a wiki may contain lots of pages. It is important to organise the pages so that the information the user seeks can be found easily. Namespaces serve this purpose by keeping all the relevant pages in one place. The following namespaces are bundled along with the standard DokuWiki installation:
• wiki
• playground
If you want to try anything before getting into a live production environment, it is recommended that you use the ‘playground’ namespace. To create a namespace, use the following syntax:

namespace_name: page_name

If the defined namespace doesn’t exist, DokuWiki automatically creates it without any break in linkage with the rest of the wiki. To delete a namespace, simply erase all of its pages, which leads to empty namespaces; DokuWiki automatically deletes these.
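For example, a link that creates (or points to) a page inside a namespace might look like this; the page name is invented for illustration, while washington_team is the namespace used later in this article:

```
[[washington_team:tasks|Team task list]]
```

If the washington_team namespace does not yet exist, DokuWiki creates it automatically the first time the page is saved.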


The linkage between the pages is vital for any wiki site. This ‘linkability’ keeps the information organised and easily accessible. By the effective use of links, the pages are organised in a concise manner. DokuWiki supports the following types of links in a page.
External links: These links point to external resources, i.e., websites. You can use a complete URL for a website such as or you can add an alternative text for that website like [[https://www. | search with DuckDuckGo]]. We can also link an email ID by enclosing it in angled brackets, for example <>.


Access control lists

Figure 1: Sample blank page

Figure 2: Old revisions of a page

Internal links: Internal links point to the pages within the wiki. To create an internal link, just enclose the page name within the square brackets. The colour of the page links shows the availability of the page. If the link is in green, then the page is available. And if the link is red, then the page is unavailable. Sectional links: Sectional links are used to link different sections of a page by using the ‘#’ character followed by the section name, enclosed by the double square brackets on both the sides. For example: [[#section|current section]]
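Putting these link types together, a page might contain markup like the following (the page name, section name and email address are invented for illustration):

```
See the [[meeting_notes|internal notes page]],
jump [[#summary|to the summary section]] of this page,
or write to <admin@example.com>.
```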


Media files can be added to your wiki by clicking the ‘picture frame’ icon (next to the smiley in the editor). On clicking the icon, a new dialogue box will pop up, and by clicking the ‘Select files’ button you can select the files for your wiki. When you're finished, click the ‘Upload’ button; once the upload completes, the image will show up in the dialogue box. You can further customise the image by clicking on it. This will display the ‘Link settings’ option, where you can define how the image is linked into your wiki, its alignment, and its size. After customising the image, click on the ‘Insert’ button to insert it in the page. You can add metadata to your image by using the ‘Media manager’ option and choosing the ‘Edit’ option for the particular image. DokuWiki supports the following media formats:
• Images: gif, jpg, png
• Video: webm, ogv, mp4
• Audio: ogg, mp3, wav
• Flash: swf

Sometimes, a media file you have added may not be recognised by your browser; so it is recommended that you add media files in several of the formats listed above, so that at least one of them can be recognised.

Figure 3: Examples of links

ACL or Access Control Lists are one of the core features of DokuWiki. They define the access rights of the wiki for its users. There are seven types of permissions that can be assigned to users: Read, Edit, Create, Upload, Delete, None and Admin. Of these, ‘Delete’ is the highest permission and ‘Read’ is the lowest, so if the ‘Delete’ permission is assigned to users, they also have the ‘Read’, ‘Edit’, ‘Create’ and ‘Upload’ permissions. Admin permissions are assigned to the members of the admin group, and surpass all the other permissions. The members of the admin group are considered ‘super users’, so regardless of the ACL restrictions, they have unlimited access to the wiki. Please note that the ‘Create’, ‘Upload’ and ‘Delete’ permissions can be applied to namespaces only. To change the access rules for a page or a namespace, follow the steps given below:
1. Log in to DokuWiki as the admin.
2. Click on the gear wheel icon next to the admin label at the top of the page.
3. Choose the ‘Access control list management’ option on the administration page.
4. On the left side you will see the available namespaces and their pages. Click on your choice and select the group or the user by supplying the respective name.
5. For each page or namespace, the current permissions for the selected group or user will be displayed. Below that, you can change the permissions and save them. These will get updated in the ACL rules table.
DokuWiki determines the access rights for each user by the following constraints:
1. DokuWiki checks all the permission rules that match a person or the groups s/he belongs to, but the rule that is closer to the namespace:page level takes precedence, and this determines the access for that person.
2. When more than one rule matches on the same level, the rule with the highest permission is chosen.
Please note that users in DokuWiki are compiled into groups. Before a user is manually added to any group by the administrator, all of its users will belong to the


Let's Try For U & Me User 1: Name: Stella Ritchie, a non-registered user User group: @ALL For this user, in the rule table, the first rule and the third one matches, but the third rule matches on the namespace level since her access to the washington_team is None. Figure 4: Image uploaded

Figure 5: Link settings

following groups:
• @user: All registered users belong to this group.
• @ALL: Every user of DokuWiki falls into this group, including registered and non-registered users.
Let's assume you want to create a namespace for your team in Washington and name it washington_team. We want this namespace to be accessible only to the team members. To achieve that, you add the team members to a user group @wa_dc. Now let's analyse the ACL rule table definition for the namespace washington_team.
The first line of the table tells us that any user, registered or non-registered, can only read the contents of the root namespace. The second line shows that registered users of the wiki have ‘Upload’ access, which also gives them ‘Read’, ‘Edit’ and ‘Create’ access. The third line tells us that users in the group @ALL cannot access this namespace; this line specifically restricts access for everyone, including the intended users. The fourth line shows that the user group @wa_dc has ‘Upload’ access; this line complements the third one, since together they make the namespace washington_team exclusive to the user group @wa_dc. The fifth line shows that ‘Admin’ has ‘Delete’ access to the namespace washington_team; hence, the admin has full and unrestricted access to it.
Now let's create two pages, tasks and swift_private, in washington_team to see how the ACL rules apply to different users (Figure 7).
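Behind the scenes, DokuWiki stores these rules in its conf/acl.auth.php file, one rule per line, with permissions expressed numerically (0 = none, 1 = read, 2 = edit, 4 = create, 8 = upload, 16 = delete). A hedged sketch of what the Washington rules described above might look like in that file:

```
# page/namespace        user/@group    permission
*                       @ALL           1
*                       @user          8
washington_team:*       @ALL           0
washington_team:*       @wa_dc         8
washington_team:*       @admin         16
```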

Figure 6: ACL rule table for the Washington namespace

User 2: Name: Priscilla Hilpert, a registered user
User group: @wa_dc
For Priscilla, Rules 1, 2, 4, 6 and 8 match. Priscilla has access to the namespace washington_team via Rule 4. Rule 8 shows that Priscilla has read-only access to the page washington_team: tasks, since this permission was set at the page level. Rule 6 shows that Priscilla is prohibited from accessing the page washington_team: swift_private, since the permission for this page is set to None.

User 3: Name: Sylvester Swift, a registered user
User group: @wa_dc
For Sylvester, Rules 1, 2, 4, 7 and 8 match. Rule 4 enables Sylvester to access washington_team. Rule 7 gives him Edit and Read access to the page washington_team: swift_private. Rule 8 gives him read-only access to the page washington_team: tasks.


The default look of DokuWiki can be changed by choosing a template from the template collection that is available for download. It is recommended that you download a template version that is equal to or more recent than your DokuWiki version. DokuWiki also offers a huge collection of plugins. As of now, more than 900 plugins are available for download, each of which extends the functions of DokuWiki. Plugin installation can be automated by using the extension manager, or you can do it manually by downloading the package to your computer and uploading it via the extension manager. Please note that the plugins are contributed by the user community and may not be properly reviewed by the DokuWiki team. Always look out for the warnings and update information on a plugin's page to avoid problems.

By: Magimai Prakash

Figure 7: ACL rule table for the Washington namespace and its pages

The author has a B.E. degree in computer science. As he is deeply interested in Linux, he spends most of his leisure time exploring open source.

For U & Me Overview

The Connect Between Deep Learning and AI

Deep learning is a sub-field of machine learning and is related to algorithms. Machine learning is a kind of artificial intelligence that provides computers with the ability to learn, without explicitly programming them.


Deep learning is a new area of machine learning research, which has been introduced with the objective of moving machine learning closer to one of its original goals: artificial intelligence (AI). Deep learning is the sub-field of machine learning that is concerned with algorithms whose structure and function are inspired by the human brain's networks of neurons. It is the work of well-known researchers like Andrew Ng, Geoff Hinton, Yann LeCun, Yoshua Bengio and Andrej Karpathy which has brought deep learning into the spotlight. If you follow the latest tech news, you may have even heard about how important deep learning has become among big companies, with:
• Google buying DeepMind for US$ 400 million
• Apple and its self-driving car
• NVIDIA and its GPUs
• Toyota's billion dollar AI research investments
All of this tells us that deep learning is really gaining in importance.

The first neural nets were born out of the need to address the inaccuracy of an early classifier, the perceptron. It was shown that by using a layered web of perceptrons, the accuracy of predictions could be improved. This new breed of neural nets was called a multi-layer perceptron or MLP. You may have guessed that the prediction accuracy of a neural net depends on its weights and biases. We want the accuracy to be high, i.e., we want the neural net to predict a value that is as close to the actual output as possible, every single time. The process of improving a neural net’s accuracy is called training, just like with other machine learning methods. Here’s that forward prop again – to train the net, the output from forward prop is compared to the output that is known to be correct, and the cost is the difference of the two. The point of training is to make that cost as small as possible, across millions of training examples. Once trained well, a neural net has the potential to make accurate predictions each time. This is a neural net in a nutshell (refer to Figure 1).

Neural networks

When the patterns get really complex, neural nets start to outperform all of their competition. Neural nets truly have the potential to revolutionise the field of artificial intelligence. We all know that computers are very good with repetitive calculations and detailed instructions, but they’ve historically been bad at recognising patterns. Thanks to deep learning, this is all about to change. If you only need to analyse simple patterns, a basic classification tool like an SVM or logistic regression is typically good enough. But when your data has tens of different inputs or more, neural nets start to win out over the other methods.

The first thing you need to know is that deep learning is about neural networks. The structure of a neural network is like any other kind of network: there is an interconnected web of nodes, which are called neurons, and there are edges that join them together. A neural network's main function is to receive a set of inputs, perform progressively complex calculations, and then use the output to solve a problem. This series of events, starting from the input, where each activation is sent to the next layer and then the next, all the way to the output, is known as forward propagation, or forward prop.
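Forward prop, and the cost that training tries to minimise, can be sketched in a few lines of NumPy. This is a minimal illustration, not a production network; the layer sizes, random weights and mean-squared-error cost are all choices made for the example:

```python
import numpy as np

def sigmoid(x):
    # Squashes each activation into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-x))

def forward_prop(x, layers):
    # Each layer is a (weights, biases) pair; the activations of one
    # layer are fed as inputs to the next, all the way to the output.
    a = x
    for W, b in layers:
        a = sigmoid(W @ a + b)
    return a

rng = np.random.default_rng(0)
layers = [
    (rng.normal(size=(4, 3)), np.zeros(4)),  # input (3) -> hidden (4)
    (rng.normal(size=(1, 4)), np.zeros(1)),  # hidden (4) -> output (1)
]

x = np.array([0.5, -0.2, 0.1])
y_hat = forward_prop(x, layers)

# Training compares the prediction with the known correct output;
# the cost measures their difference, and training adjusts the
# weights and biases to make it as small as possible.
y_true = np.array([1.0])
cost = np.mean((y_true - y_hat) ** 2)
print(y_hat, cost)
```

Repeating this over millions of labelled examples, while nudging the weights and biases to reduce the cost, is what the article calls training.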

Three reasons to consider deep learning


Figure 1: Deep learning and neural networks

Still, as the patterns get even more complex, neural networks with a small number of layers can become unusable. The reason is that the number of nodes necessary in each layer grows exponentially with the number of possible patterns in the data. Eventually, training becomes way too expensive and the accuracy starts to suffer. So for an intricate pattern – like an image of a human face, for example – basic classification engines and shallow neural nets simply aren’t good enough. The only practical choice is a deep net. But what enables a deep net to distinguish these complex patterns? The key is that deep nets are able to break the multifaceted patterns down into a series of simpler patterns. For example, let’s say that a net has to decide whether or not an image contains a human face. A deep net would first use edges to detect different parts of the face – the lips, nose, eyes, ears, and so on – and would then combine the results together to form the whole face. This important feature – using simpler patterns as building blocks to detect complex patterns – is what gives deep nets their strength. These nets have now become very accurate and, in fact, a deep net from Google recently beat a human at a pattern recognition challenge.

What is a deep net platform?

A platform is a set of tools that other people can build on top of. For example, think of the applications that can be built off the tools provided by iOS, Android, Windows, MacOS, IBM Websphere and even Oracle BEA. Deep learning platforms come in two different forms – software platforms and full platforms. A deep learning platform provides a set of tools and an interface for building custom deep nets. Typically, it provides a user with a selection of deep nets to choose from, along with the ability to integrate data from different sources, manipulate data, and manage models through a UI. Some platforms also help with performance if a net needs to be trained with a large data set. There are some advantages and disadvantages of using a platform rather than using a software library. A platform is an out-of-the-box application that lets you configure a deep net’s hyper-parameters through an intuitive UI. With a platform, you don’t need to know anything about coding in order to use the tools. The downside is that you are constrained by the platform’s selection of deep nets as well as the configuration options. But for anyone looking to quickly deploy a deep

net, a platform is the best way to go. We'll look at two machine learning software platforms that offer deep learning tools: H2O and GraphLab Create.
H2O: This started out as an open source machine learning platform, with deep nets a recent addition. Besides a set of machine learning algorithms, the platform offers several useful features, such as data pre-processing. H2O has built-in integration tools for platforms like HDFS, Amazon S3, SQL and NoSQL. It also lets you access the tools from familiar programming environments such as R, Python and JSON, and model or analyse data with Tableau, Microsoft Excel and RStudio. H2O also provides a set of downloadable software packages, which you'll need to deploy and manage on your own hardware infrastructure. H2O offers a lot of interesting features, but its website can be a bit confusing to navigate.
GraphLab: If your deep learning project requires graph analytics and other vital algorithms, Dato's GraphLab Create can be a good choice. GraphLab is a software platform that offers two different types of deep nets, depending on the nature of your input data: a convolutional net (the default) and a multi-layer perceptron. It also provides graph analytics tools, which is unique among deep net platforms. Like H2O, GraphLab provides a great set of data munging features, with built-in integration for SQL databases, Hadoop, Spark, Amazon S3, Pandas data frames and many others. GraphLab also offers an intuitive UI for model management. Choose a deep net platform based on the needs of your project.

Deep learning is gaining popularity

Deep learning is a topic that is making big waves at the moment. It is basically a branch of machine learning that uses algorithms to, among other things, recognise objects and understand human speech. Scientists have used deep learning algorithms with multiple processing layers to build better models from large quantities of unlabelled data (such as photos with no descriptions, voice recordings or videos on YouTube). The three main reasons why deep learning is gaining popularity are accuracy, efficiency and flexibility. Deep learning automatically extracts the features by which to classify data, whereas most traditional machine learning algorithms require intense time and effort on the part of data scientists. The features it extracts can be more complex, because of the feature hierarchy possible in a deep net. They are also more flexible and less brittle, because the net is able to continue learning on unsupervised data.

By: Neetesh Mehrotra
The author works at TCS as a systems engineer. His areas of interest are Java development and automation testing. You can contact him at

| OPEN SOURCE FOR YOU | DECEMBER 2017 | 95

For U & Me Insight

What Can BIG DATA Do For You?

In today’s world, there is a proliferation of data. So much so that whoever controls data today holds the key to wealth creation. Let’s take a closer look at what Big Data means and what it can do for us.


Big Data has undoubtedly gained much attention within academia and the IT industry. In the current digital and computing world, information is generated and collected at an alarming rate that is rapidly exceeding storage capabilities. About 4 billion people across the globe are connected to the Internet, and over 5 billion individuals own mobile phones, out of which more than 3.39 billion users use the mobile Internet. Several social networking platforms like WhatsApp, Facebook, Instagram, Twitter, etc, have a big hand in the indiscriminate increase in the production of data. Apart from the social media giants, there is a large amount of data being generated by different devices such as sensors, actuators, etc, which are used as part of the IoT and in robots as well. By 2020, it is expected that more than 50 billion devices will be connected to the Internet. At this juncture, predicted data production will be almost 44 times greater than that in

2010. As a result of these tech advances, millions of people are generating tremendous amounts of data through the increased use of smart devices. Remote sensors, in particular, continuously produce an even greater volume of heterogeneous data that can be either structured or unstructured. All such data is referred to as Big Data. This high volume of data is shared and transferred at great speed over optical fibre networks. However, the fast growth rate of such huge data volumes creates challenges in the following areas:
• Searching, sharing and transferring data
• Analysing and capturing data
• Data curation
• Storing, updating and querying data
• Information privacy
Big Data is broadly identified by three aspects:



Colocated shows

Profit from IoT India’s #1 IoT show. At Electronics For You, we strongly believe that India has the potential to become a superpower in the IoT space, in the upcoming years. All that's needed are platforms for different stakeholders of the ecosystem to come together. We’ve been building one such platform: event for the creators, the enablers and customers of IoT. In February 2018, the third edition of will bring together a B2B expo, technical and business conferences, the Start-up Zone, demo sessions of innovative products, and more. Who should attend? • Creators of IoT solutions: OEMs, design houses, CEOs, CTOs, design engineers, software developers, IT managers, etc • Enablers of IoT solutions: Systems integrators, solutions providers, distributors, resellers, etc • Business customers: Enterprises, SMEs, the government, defence establishments, academia, etc Why you should attend • Get updates on the latest technology trends that define the IoT landscape • Get a glimpse of products and solutions that enable the development of better IoT solutions • Connect with leading IoT brands seeking channel partners and systems integrators • Connect with leading suppliers/service providers in the electronics, IT and telecom domain who can help you develop better IoT solutions, faster • Network with the who’s who of the IoT world and build connections with industry peers • Find out about IoT solutions that can help you reduce costs or increase revenues • Get updates on the latest business trends shaping the demand and supply of IoT solutions

India’s Electronics Manufacturing Show Is there a show in India that showcases the latest in electronics manufacturing such as rapid prototyping, rapid production and table top manufacturing? Yes, there is now - EFY Expo 2018. With this show’s focus on the areas mentioned and it being co-located at India Electronics Week, it has emerged as India's leading expo on the latest manufacturing technologies and electronic components. Who should attend? • Manufacturers: CEOs, MDs, and those involved in firms that manufacture electronics and technology products • Purchase decision makers: CEOs, purchase managers, production managers and those involved in electronics manufacturing • Technology decision makers: Design engineers, R&D heads and those involved in electronics manufacturing • Channel partners: Importers, distributors, resellers of electronic components, tools and equipment • Investors: Startups, entrepreneurs, investment consultants and others interested in electronics manufacturing Why you should attend • Get updates on the latest technology trends in rapid prototyping and production, and in table top manufacturing • Get connected with new suppliers from across India to improve your supply chain • Connect with OEMs, principals and brands seeking channel partners and distributors • Connect with foreign suppliers and principals to represent them in India • Explore new business ideas and investment opportunities in this sector

Showcasing the Technology that Powers Light Our belief is that the LED bulb is the culmination of various advances in technology. And such a product category and its associated industry cannot grow without focusing on the latest technologies. But, while there are some good B2B shows for LED lighting in India, none has a focus on ‘the technology that powers lights’. Thus, the need for Who should attend? • Tech decision makers: CEOs, CTOs, R&D and design engineers and those developing the latest LED-based products • Purchase decision makers: CEOs, purchase managers and production managers from manufacturing firms that use LEDs • Channel partners: Importers, distributors, resellers of LEDs and LED lighting products • Investors: Startups, entrepreneurs, investment consultants interested in this sector • Enablers: System integrators, lighting consultants and those interested in smarter lighting solutions (thanks to the co-located Why you should attend • Get updates on the latest technology trends defining the LED and LED lighting sector • Get a glimpse of the latest components, equipment and tools that help manufacture better lighting products • Get connected with new suppliers from across India to improve your supply chain • Connect with OEMs, principals, lighting brands seeking channel partners and systems integrators • Connect with foreign suppliers and principals to represent them in India • Explore new business ideas and investment opportunities in the LED and lighting sector • Get an insider’s view of ‘IoT + Lighting’ solutions that make lighting smarter

Colocated shows

The themes • Profit from IoT

• Rapid prototyping and production

• Table top manufacturing

• LEDs and LED lighting

The co-located shows

Why exhibit at IEW 2018? More technology decision makers and influencers attend IEW than any other event

India’s only test and measurement show is also a part of IEW

Bag year-end orders; meet prospects in early February and get orders before the FY ends

It’s a technology-centric show and not just a B2B event

360-degree promotions via the event, publications and online!

The world’s No.1 IoT show is a part of IEW and IoT is driving growth

Over 3,000 visitors are conference delegates

The only show in Bengaluru in the FY 2017-18

It’s an Electronics For You Group property

Besides purchase orders, you can bag ‘Design Ins’ and ‘Design-Wins’ too

Your brand and solutions will reach an audience of over 500,000 relevant and interested people

IEW is being held at a venue (KTPO) that’s closer to where all the tech firms are

Co-located events offer cross-pollination of business and networking opportunities

IEW connects you with customers before the event, at the event, and even after the event

Special packages for ‘Make in India’, ‘Design in India’, ‘Start-up India’ and ‘LED Lighting’ exhibitors

Why you should risk being an early bird 1. The best locations sell out first 2. The earlier you book—the better the rates; and the more the deliverables 3. We might just run out of space this year!

To get more details on how exhibiting at IEW 2018 can help you achieve your sales and marketing goals,

Contact us at +91-9811155335 Or

Write to us at

EFY Enterprises Pvt Ltd | D-87/1, Okhla Industrial Area, Phase -1, New Delhi– 110020

Colocated shows

Reasons Why You Should NOT Attend IEW 2018

India’s Mega Tech Conference The EFY Conference (EFYCON) started out as a tiny 900-footfall community conference in 2012, going by the name of Electronics Rocks. Within four years, it grew into ‘India’s largest, most exciting engineering conference,’ and was ranked ‘the most important IoT global event in 2016’ by Postscapes. In 2017, 11 independent conferences covering IoT, artificial intelligence, cyber security, data analytics, cloud technologies, LED lighting, SMT manufacturing, PCB manufacturing, etc, were held together over three days, as part of EFY Conferences. Key themes of the conferences and workshops in 2018 • Profit from IoT: How suppliers can make money and customers save it by using IoT • IT and telecom tech trends that enable IoT development • Electronics tech trends that enable IoT development • Artificial intelligence and IoT • Cyber security and IoT • The latest trends in test and measurement equipment • What's new in desktop manufacturing • The latest in rapid prototyping and production equipment Who should attend • Investors and entrepreneurs in tech • Technical decision makers and influencers • R&D professionals • Design engineers • IoT solutions developers • Systems integrators • IT managers SPECIAL PACKAGES FOR • Academicians • Defence personnel • Bulk/Group bookings

We spoke to a few members of the tech community to understand why they had not attended earlier editions of India Electronics Week (IEW). Our aim was to identify the most common reasons and share them with you, so that if you too had similar reasons, you may choose not to attend IEW 2018. This is what they shared… #1. Technologies like IoT, AI and embedded systems have no future Frankly, I have NO interest in new technologies like Internet of Things (IoT), artificial intelligence, etc. I don't think these will ever take off, or become critical enough to affect my organisation or my career.

Where most talks will not be by people trying to sell their products? How boring! I can't imagine why anyone would want to attend such an event. I love sales talks, and I am sure everybody else does too. So IEW is a big 'no-no' for me. #7. I don't think I need hands-on knowledge I don't see any value in the tech workshops being organised at IEW. Why would anyone want hands-on knowledge? Isn't browsing the Net and watching YouTube videos a better alternative?

#2. I see no point in attending tech events What's the point in investing energy and resources to attend such events? I would rather wait and watch—let others take the lead. Why take the initiative to understand new technologies, their impact and business models?

#8. I love my office! Why do people leave the comfort of their offices and weave through that terrible traffic to attend a technical event? They must be crazy. What’s the big deal in listening to experts or networking with peers? I'd rather enjoy the coffee and the cool comfort of my office, and learn everything by browsing the Net!

#3. My boss does not like me My boss is not fond of me and doesn't really want me to grow professionally. And when she came to know that IEW 2018 is an event that can help me advance my career, she cancelled my application to attend it. Thankfully, she is attending the event! Look forward to a holiday at work.

#9. I prefer foreign events While IEW's was voted the ‘World's No.1 IoT event’ on, I don't see much value in attending such an event in India—and that, too, one that’s being put together by an Indian organiser. Naah! I would rather attend such an event in Europe.

#4. I hate innovators! Oh my! Indian startups are planning to give LIVE demonstrations at IEW 2018? I find that hard to believe. Worse, if my boss sees these, he will expect me to create innovative stuff too. I better find a way to keep him from attending.

Hope we've managed to convince you NOT to attend IEW 2018! Frankly, we too have NO clue why 10,000-plus techies attended IEW in March 2017. Perhaps there's something about the event that we've not figured out yet. But, if we haven't been able to dissuade you from attending IEW 2018, then you may register at

#5. I am way too BUSY I am just too busy with my ongoing projects. They just don't seem to be getting over. Once I catch up, I'll invest some time in enhancing my knowledge and skills, and figure out how to meet my deadlines. #6. I only like attending vendor events Can you imagine an event where most of the speakers are not vendors?

Conference Pass Pricing
One-day pass: INR 1999
PRO pass: INR 7999

Special privileges and packages for:
• Defence and defence electronics personnel
• Academicians
• Group and bulk bookings

Figure 1: Challenges of Big Data (Image source:

Figure 2: Major sources of Big Data – business systems and transactions, social media (Facebook, blogs, Twitter), unstructured data and sensor data (Image source:

1. The data is of very high volume. 2. It is generated, stored and processed very quickly. 3. The data cannot be categorised into regular relational databases.
Big Data has a lot of potential in business applications. It plays a role in healthcare machines, social media, banking transactions and satellite imaging. Traditionally, data is stored in a structured format so that it can be easily retrieved and analysed. However, present data volumes comprise both unstructured and semi-structured data. Hence, end-to-end processing can be impeded during the translation between structured data in a relational database management system and unstructured data for analytics. Among the problems linked to the staggering volumes of data being generated are the transfer speed of data, the diversity of data, and security issues. There have been several advances in data storage and mining technologies, which enable the preservation of these increased amounts of data. During this preservation process, the nature of the original data generated by organisations is modified.

Some big sources of Big Data

Let’s have a quick look at some of the main sources of data, along with some statistics (Data source:
1. Social media: There are around 1,209,600 (1.2 million) new data-producing social media users every day.
2. Twitter: There are approximately 656 million tweets per day!
3. YouTube: More than 4 million hours of content are uploaded to YouTube every day, with users watching around 5.97 billion hours of YouTube videos each day.
4. Instagram: There are approximately 67,305,600 (67.3 million) Instagram posts uploaded each day.
5. Facebook: There have been more than 2 billion monthly active Facebook users in 2017 so far, compared to 1.44

billion at the start of 2015 and around 1.65 billion at the start of 2016. On average, there are approximately 1.32 billion daily active users as of June 2017. Every day, 4.3 billion Facebook messages get posted, and there are around 5.75 billion Facebook likes.
6. Mobile text messages: Almost 22 billion text messages are sent every day (for personal and commercial purposes).
7. Google: On average, in 2017, more than 5.2 billion Google searches were initiated daily.
8. IoT devices: Devices are a huge source of the 2.5 quintillion bytes of data that we create every day – this includes not only mobile devices, but smart TVs, airplanes, cars, etc. Hence, the Internet of Things is producing an increasing amount of data.

Characteristics of Big Data

There are several characteristics of Big Data, as listed below.
Volume: This refers to the quantity of data generated and stored. The size of the data helps in determining its value and potential insights, and hence whether a specific set of data can actually be considered Big Data or not.
Variety: This property deals with the different types and nature of the data. It helps those who analyse large data sets to use the resulting insights effectively. If a specific set of data contains different varieties of data, we can consider it Big Data.
Velocity: The speed of data generation also plays a big role when we classify something as Big Data. The speed at which data is generated and then processed to produce results ready for analysis is one of the major properties of Big Data.
Variability: There is always some inconsistency associated with Big Data. We consider a data set inconsistent if it does not have a specific pattern or structure; this can hamper the different processes required to handle and manage the data.
Veracity: The quality of the captured data can also vary a lot, which affects the accurate analysis of large data sets. If the captured data’s quality is not good enough to be analysed, it needs to be processed before analysis.


Figure 3: Different types of Big Data (Image source:

How is Big Data analysed?

We all know that we cannot analyse Big Data manually, as it is a highly challenging and tedious task. To make this task easier, there are several techniques that help us analyse large sets of data. Let us look at some of the well-known techniques used for data analysis.
1. Association rule learning: This is a rule-based Big Data analysis technique used to discover interesting relations between the different variables present in large databases. It is intended to identify the strong rules discovered in the databases, using different measures of what is considered ‘interesting’. Such techniques use a variety of algorithms to generate and then test candidate rules. One of the most common applications is market basket analysis. This helps a retailer determine which products are frequently bought together and use that information for more focused marketing (like the discovery that many supermarket shoppers who buy diapers also buy beer). Association rules are widely used nowadays in continuous production, Web usage mining, bioinformatics and intrusion detection. These rules do not take into consideration the order of items, either within the same transaction or across different transactions.
2. A/B testing: This is a technique that compares two different versions of an application to determine which one performs better. It is also called split testing or bucket testing. It refers to a specific type of randomised experiment in which a set of users is presented with two variations of the same product (advertisements, emails, Web pages, etc) – let’s call them Variation A and Variation B.
All the users exposed to Variation A are often referred to as the control group, since its performance is considered the baseline against which any improvement in performance observed from presenting Variation B is measured. At times, Variation A itself acts as the original version of the product, which is being tested against what existed before

the test. All the users in the group exposed to Variation B are referred to as the treatment group. This technique is used to optimise a conversion rate by measuring the performance of the treatment against that of the control. It removes the guesswork from the website optimisation process, and enables data-informed decisions that shift business conversations from what ‘we think’ to what ‘we know’. We can make sure that each change produces positive results simply by measuring the impact the various changes have on our metrics.
3. Natural language processing: This area of computational linguistics is concerned with the interactions between computers and human languages – in particular, with programming computers to process large natural language corpora. The challenges in natural language processing include natural language generation, natural language understanding, connecting machine and language perception, or combinations thereof. Natural language processing research has mostly relied on machine learning. Initially, many language-processing tasks involved direct hand-coding of rules. Nowadays, machine learning approaches use statistical inference to automatically learn such rules by analysing large sets of data from real-life examples. Many different classes of machine learning algorithms have been used for NLP tasks. These algorithms take as input a large set of ‘features’ developed from the input data. Recent research has focused more on statistical models, which make probabilistic decisions based on attaching real-valued weights to each input feature.
Such models have the edge because they can easily express the relative certainty of many different possible answers rather than just one, producing more reliable results, particularly when such a model is included as one of the many components of a larger system.
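The support and confidence measures behind association rule learning (technique 1 above) can be shown with a small self-contained Python sketch; the baskets, items and thresholds are toy values invented for illustration, in the spirit of the diapers-and-beer example.

```python
from itertools import combinations

def association_rules(transactions, min_support=0.3, min_confidence=0.6):
    """Find lhs -> rhs rules whose support and confidence clear the given
    thresholds (two common measures of 'interestingness')."""
    n = len(transactions)

    def support(subset):
        # Fraction of baskets containing every item in the subset.
        return sum(1 for t in transactions if subset <= t) / n

    items = {i for t in transactions for i in t}
    rules = []
    for a, b in combinations(sorted(items), 2):
        for lhs, rhs in ((a, b), (b, a)):
            sup = support({lhs, rhs})
            conf = sup / support({lhs}) if support({lhs}) else 0.0
            if sup >= min_support and conf >= min_confidence:
                rules.append((lhs, rhs, sup, conf))
    return rules

# Toy market-basket data: diapers and beer co-occur in 3 of 4 baskets.
baskets = [
    {"diapers", "beer", "milk"},
    {"diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "beer", "bread"},
]
rules = association_rules(baskets)
```

On this toy data, only the diapers/beer pair clears both thresholds, so the retailer learns exactly the kind of "bought together" rule the article describes.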

How can Big Data benefit your business?

Big Data may seem to be out of reach for non-profit and government agencies that do not have the funds to buy into this new trend. We all have the impression that ‘big’ usually means expensive, but Big Data is not really about using more resources; rather, it is about using the resources at hand effectively. Hence, organisations with limited financial resources can also stay competitive and grow. For

Figure 4: Different processes involved in a Big Data system (Image source:


that, we need to understand where we can find this data and what we can do with it. Let us see how Big Data can really help organisations in their business.
1. Targeted marketing: Several small businesses cannot compete with the huge advertising budgets that large organisations have at their disposal. To remain in the game, they have to spend less, yet reach qualified buyers. This is where the analysis and measurement of data comes in, to target the people most likely to turn into customers. A huge amount of data is freely accessible through tools like Google Insights. Organisations can find exactly what people are looking for, when they are looking for it, and where they are located. For instance, the CDC (Centers for Disease Control and Prevention, USA) uses the Big Data provided by Google to analyse the large number of searches relating to the flu. With the data obtained, researchers are able to focus their efforts where there is a greater need for flu vaccines. The same technique can be applied to other products as well.
2. Actionable insights: Big Data can become like drinking from a fire hose if we do not know how to turn facts and figures into usable information. But as soon as an organisation learns how to master the analytical tools that turn its metrics into readable reports, graphs and charts, it can make decisions that are more proactive and targeted. That is when it gains a clear understanding of the ‘big problems’ affecting the business.
3. Social eavesdropping: A large chunk of the information in Big Data is obtained from social chatter on networking sites like Twitter and Facebook. By keeping an eagle eye on what is being said on different social channels, organisations can understand how the public perceives them and what to do if they need to improve their reputation.
For example, Twitter mood has been used to predict the stock market. Johan Bollen once tracked how the collective mood from large sets of Twitter feeds correlated with the Dow Jones Industrial Average. The algorithm used by Bollen and his group predicted market changes with 87.6 per cent accuracy.

Applications of Big Data

There is a huge demand for Big Data nowadays, and there are numerous areas where it is already being implemented. Let’s have a look at some of them.
1. Big Data is used in government sectors for tasks like power theft investigation, fraud recognition and ecological protection. It is also used by the FDA to examine different food-borne infections.
2. It is widely used in the healthcare industry by physicians and doctors to keep track of their patients’ history.
3. Big Data is also used in the education sector, by implementing techniques such as adaptive learning,


Figure 5: Different ways in which Big Data can help any business – using data to improve performance by answering questions such as: How can I optimise my marketing budget? How do I define my target market? Which offer generates the greatest response? How do I improve customer retention? What is the lifetime value of my customer? Which channel is most effective? How can I measure marketing results? Which demographic responds to my offer? (Image source:

problem control, etc, to reform different educational courses.
4. Big Data is used in fraud detection in the banking sector.
5. It is used by different search engines to provide the best search results.
6. Price comparison websites make use of Big Data to come up with the best options for their users.
7. Big Data is also used for analysing and processing the data obtained from the sensors and actuators connected to the IoT.
8. Speech recognition products such as Google Voice and Siri also make use of Big Data to recognise the speech patterns of the user.
9. Big Data and data science have taken the gaming experience to new heights. Games are now designed using various Big Data and machine learning algorithms, and can improve themselves as a player moves to higher levels.
10. Big Data is of great help to the recommender and suggestion tools that prompt us about similar products to purchase on online shopping platforms like Amazon, Flipkart, etc.

References

‘Data Science for Business’ by Foster Provost and Tom Fawcett

By: Vivek Ratan

The author has completed his B.Tech in electronics and instrumentation engineering. He is currently working as an automation test engineer at Infosys, Pune. He can be reached at

| OPEN SOURCE FOR YOU | DECEMBER 2017 | 103




Tips you can use daily on a Linux computer

1. Shortcut for opening a terminal in Ubuntu

To open a terminal in Ubuntu, press the Ctrl+Alt+T keys. This opens a new terminal.

2. Running the previous command with ‘sudo’ in the terminal

In case you have forgotten to run a command with ‘sudo’, you need not retype the whole command. Just type ‘sudo !!’ and the last command will run with sudo.

3. How to change a file’s permissions

An admin can change file permissions by executing the following on the terminal:

chmod u+<permission> filename

…where <permission> can be r (read), w (write) or x (execute).

To change the permissions granted to the group or to other users, the admin can execute the same command, replacing ‘u’ with ‘g’ for group access or with ‘o’ for others.

—Anirudh Kalwa,
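As a quick illustration of this symbolic mode syntax, the sketch below exercises a few of these changes on a scratch file (the path /tmp/perm_demo is just an example):

```shell
# Create a scratch file to experiment on (example path)
touch /tmp/perm_demo

# Owner (u): add execute; group (g): add write; others (o): remove read
chmod u+x /tmp/perm_demo
chmod g+w /tmp/perm_demo
chmod o-r /tmp/perm_demo

# Show the resulting permission bits
ls -l /tmp/perm_demo
```

With a typical default umask of 022, the listing should begin with -rwxrw----, showing the owner, group and other permission classes side by side.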

Moving between the current and last working directories easily

Everyone knows that typing ‘cd’ in the terminal in Ubuntu takes the user to the home directory. However, if you want to go to the last working directory, instead of entering the full path:

$ cd <directory path>

…directly type the following command in the terminal:

$ cd -

The command takes you to the last working directory. This helps in moving directly to the last working directory from the current one, instead of remembering and typing the whole path.

—Abhinay Badam,

Execute parallel ssh on multiple hosts

Here are the steps to do a parallel ssh on multiple hosts. We are going to use pssh, which is a program for executing ssh in parallel on a number of hosts. It provides features such as sending input to all the processes, passing a password to ssh, saving the output to files, and timing out. You can access the complete manual of pssh at https://linux.

First off, let us look at how to install it on a CentOS 7 system:

# yum install epel-release

Now install pssh, as follows:

# yum install pssh

Create a pssh_hosts.txt file and enter the hosts you need to target:

# cat pssh_hosts.txt
# write one host per line, like follows
#user@target_ip
root@

We should create a key-pair between the master host and the targets -- this is the only way to get things done. Simply log in to the target from the master node for host key verification:

# ssh root@

Next, test with single commands:

# pssh -h /path/to/pssh_hosts.txt -A -O PreferredAuthentications=password -i "hostname"
# pssh -h /path/to/pssh_hosts.txt -A -O PreferredAuthentications=password -i "uptime"

The output is:

[root@master pssh]# pssh -h ./pssh_hosts.txt -A -O PreferredAuthentications=password -i "uptime"
Warning: Do not enter your password if anyone else has superuser privileges or access to your account.
Password:
[1] 16:27:59 [SUCCESS] root@
21:27:57 up 1 day, 1:30, 1 user, load average: 0.00, 0.01, 0.05

To execute scripts on the target machines, type:

# cat pssh/
#!/bin/bash
touch /root/CX && echo "File created"

# pssh -h ./pssh_hosts.txt -A -O PreferredAuthentications=password -I< ./

Now let us make it simple:

# pssh -H '' -l 'root' -A -O PreferredAuthentications=password -I< ./

The output is:

[root@master pssh]# pssh -H '' -l 'root' -A -O PreferredAuthentications=password -I< ./
Warning: Do not enter your password if anyone else has superuser privileges or access to your account.
Password:
[1] 16:24:30 [SUCCESS]

To execute commands without password prompting, we need to create a key-pair between the servers. Let us look at how to do that. We are attempting to log in to serverB from serverA. Create SSH keys on serverA with ssh-keygen, as follows:

# ssh-keygen -t rsa

Copy the file from serverA to serverB:

# ssh-copy-id root@<serverB-ip>

Now try logging into the machine with ssh root@serverB-IP, and check to make sure that only the key(s) you wanted were added. Then try to use pssh without the password on the command line.

—Ranjithkumar T.,

Find out what an unknown command does, by using whatis

If you are new to the Linux terminal, then you will probably wonder what each command does. You are most likely to do a Google search for each command you come across. To avoid doing that, use the whatis command, followed by the command you don’t know. You will get a short description of the command. Here is an example:

$ whatis ls
ls (1) - list directory contents

Now you’ll know what the command does, and won’t have to open your browser and search.

—Siddharth Dushantha,

Know how many times a user has logged in

One way to find out the number of times users have logged into a multi-user Linux system is to execute the following command:

$ last | grep pts | awk '{print $1}' | sort | uniq -c

The above command provides the list of users who recently logged into the system. The grep utility is used to remove the unnecessary information, the result of which is then sent to awk using the shell pipe. awk, which is used for processing text based data, extracts only the user names from the text. This list of extracted names is now sorted by passing the list of names to the sort command, through a shell pipe. The sorted list of names is then piped to the uniq command, which filters adjacent matching lines, and the matching lines are merged to the first occurrence. The -c option of the uniq command, which displays the number of times a line is repeated, gives you the number of logins of each user along with the user’s name. —Sathyanarayanan S.,
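If you prefer a single process to the four-stage pipeline, the same counting can be done in one awk pass. The sketch below runs it against made-up ‘last’-style sample lines (the user names are hypothetical), so the logic is visible without depending on your system’s real login history:

```shell
# Feed hypothetical 'last'-style lines into awk; lines containing
# 'pts' are counted per user name (field 1), then printed with counts
printf 'alice pts/0 host1\nbob pts/1 host2\nalice pts/2 host1\n' |
awk '/pts/ { count[$1]++ } END { for (u in count) print count[u], u }'
```

On this sample the output contains a count of 2 for alice and 1 for bob; the order of the output lines is not guaranteed, so pipe the result through sort if you need a stable listing.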

Share Your Open Source Recipes!

The joy of using open source software is in finding ways to get around problems—take them head on, defeat them! We invite you to share your tips and tricks with us for publication in OSFY so that they can reach a wider audience. Your tips could be related to administration, programming, troubleshooting or general tweaking. Submit them at The sender of each published tip will get a T-shirt.


DVD OF THE MONTH

Linux for your desktop

Solus 3 GNOME

Solus is an operating system that is designed for home computing. It ships with a variety of software out-of-the-box, so you can set it up without too much fuss. It comes with the latest version of the free LibreOffice suite, which allows you to work on your documents, spreadsheets and presentations right away. It has many useful tools for content creators. Whether animating in Synfig Studio, producing music with MuseScore or Mixxx, trying out graphics design with GIMP or Inkscape, or editing video with Avidemux, Kdenlive and Shotcut, Solus provides software to help express your creativity.

A collection of open source software

This month, we also have a collection of different open source software that can be installed on a computer with the MS Windows operating system. The collection includes a browser, email clients, integrated development environments (IDEs) and different productivity tools.

What is a live DVD?

A live CD/DVD or live disk contains a bootable operating system, the core program of any computer, which is designed to run all your programs and manage all your hardware and software. Live CDs/DVDs have the ability to run a complete, modern OS on a computer even without secondary storage, such as a hard disk drive. The CD/DVD directly runs the OS and other applications from the DVD drive itself. Thus, a live disk allows you to try the OS before you install it, without erasing or installing anything on your current system. Such disks are used to demonstrate features or try out a release. They are also used for testing hardware functionality, before actual installation. To run a live DVD, you need to boot your computer using the disk in the ROM drive. To figure out how to set a boot device in BIOS, refer to the hardware documentation for your computer/laptop.



Loonycorn is hiring


Mail Resume + Cover Letter to

You:
- Really into tech - cloud, ML, anything and everything
- Interested in video as a medium
- Willing to work from Bangalore
- In the 0-3 years of experience range

Us:
- ex-Google | Stanford | INSEAD
- 100,000+ students
- Video content on Pluralsight, Stack, Udemy...