
Tricks To Try Out On Thunderbird And SeaMonkey

How To Identify Fumbling To Secure A Network

₹ 120 | ISSN-2456-4885

Volume: 06 | Issue: 04 | Pages: 108 | January 2018

Storage Solutions For Securing Data
The Best Tools For Backing Up Enterprise Data
Build Your Own Cloud Storage System Using OSS
Open Source Storage Solutions You Can Depend On

My Journey:

Niyam Bhushan

One Of India’s Most Passionate OSS Evangelists




FOR U & ME 25


Tricks to Try Out on Thunderbird and SeaMonkey

ADMIN 40

DevOps Series: Deploying Graylog Using Ansible


Analysing Big Data with Hadoop


Getting Past the Hype Around Hadoop


Open Source Storage Solutions You Can Depend On


Build Your Own Cloud Storage System Using OSS


A Quick Look at Cloonix, the Network Simulator


Use These Python Based Tools for Secured Backup and Recovery of Data


Encrypting Partitions Using LUKS

A Hands-on Guide on Virtualisation with VirtualBox



DEVELOPERS 82

Machines Learn in Many Different Ways


Regular Expressions in Programming Languages: Java for You


Explore Twitter Data Using R


Demystifying Blockchains

COLUMNS 77



Exploring Software: Python is Still Special

Top Three Open Source Data Backup Tools

The Best Tools for Backing Up Enterprise Data

REGULAR FEATURES 07




New Products


Tips & Tricks




EDITORIAL, SUBSCRIPTIONS & ADVERTISING DELHI (HQ) D-87/1, Okhla Industrial Area, Phase I, New Delhi 110020 Ph: (011) 26810602, 26810603; Fax: 26817563 E-mail:




BACK ISSUES Kits ‘n’ Spares New Delhi 110020 Ph: (011) 26371661, 26371662 E-mail:


Ph: 011-40596600 E-mail:


MUMBAI Ph: (022) 24950047, 24928520 E-mail: BENGALURU Ph: (080) 25260394, 25260023 E-mail:

How to Identify Fumbling to Keep a Network Secure

PUNE Ph: 08800295610/ 09870682995 E-mail: GUJARAT Ph: (079) 61344948 E-mail: JAPAN Tandem Inc., Ph: 81-3-3541-4166 E-mail:

Using Jenkins to Create a Pipeline for Android Applications



SINGAPORE Publicitas Singapore Pte Ltd Ph: +65-6836 2272 E-mail: TAIWAN J.K. Media, Ph: 886-2-87726780 ext. 10 E-mail: UNITED STATES E & Tech Media Ph: +1 860 536 6677 E-mail:

“My Love Affair with Freedom”

Ubuntu Desktop (64-bit)



Fedora Workstation 27



Any objectionable material, if found, is unintended and should be attributed to the complex nature of the Internet.
Fedora Workstation is a polished, easy-to-use operating system for laptop and desktop computers, with a complete set of tools for developers and makers of all kinds

The latest, stable Linux for your desktop.


Ubuntu comes with everything you need to run your organisation, school, home or enterprise


• Ubuntu Desktop 17.10 (Live) • Fedora Workstation 27 • MX Linux 17

January 2018


Kindly add ₹ 50/- for outside Delhi cheques. Please send payments only in favour of EFY Enterprises Pvt Ltd. Non-receipt of copies may be reported to—do mention your subscription number.

Years    Newsstand Price (₹)    You Pay (₹)
Five     7200                   4320
Three    4320                   3030
One      1440                   1150
—        —                      US$ 120





MX Linux 17

MX Linux is a cooperative venture between the antiX and former MEPIS communities, which uses the best tools and talent from each distro.

In case this DVD does not work properly, write to us at support@ for a free replacement.

Recommended system requirements: P4, 1GB RAM, DVD-ROM drive.

Niyam Bhushan, who has kickstarted a revolution in UX design in India

Get Familiar with the Basics of R

Printed, published and owned by Ramesh Chopra. Printed at Tara Art Printers Pvt Ltd, A-46,47, Sec-5, Noida, on 28th of the previous month, and published from D-87/1, Okhla Industrial Area, Phase I, New Delhi 110020. Copyright © 2018. All articles in this issue, except for interviews, verbatim quotes, or unless otherwise explicitly mentioned, will be released under Creative Commons Attribution-NonCommercial 3.0 Unported License a month after the date of publication. Refer to for a copy of the licence. Although every effort is made to ensure accuracy, no responsibility whatsoever is taken for any loss due to publishing errors. Articles that cannot be used are returned to the authors if accompanied by a self-addressed and sufficiently stamped envelope. But no responsibility is taken for any loss or delay in returning the material. Disputes, if any, will be settled in a New Delhi court only.


FOSSBYTES Compiled By: OSFY Bureau

Juniper Networks reinforces longstanding commitment to open source

During the recently organised annual NXTWORK user conference, Juniper Networks announced its intent to move the code base of OpenContrail, an open source network virtualisation platform for the cloud, to the Linux Foundation. OpenContrail is a scalable network virtualisation control plane that provides both feature-rich software-defined networking (SDN) and strong security. Juniper first open sourced its Contrail products in 2013, and built a vibrant user and developer community around the project. In early 2017, Juniper expanded the project’s governance, creating an even more open, community-led effort to strengthen the project for its next growth phase. Adding its code base to the Linux Foundation’s networking projects will further Juniper’s objective to grow the use of open source platforms in cloud ecosystems. OpenContrail has been deployed by various organisations, including cloud providers, telecom operators and enterprises, to simplify operational complexities and automate workload management across diverse cloud environments, including multi-clouds. Arpit Joshipura, vice president of networking and orchestration at the Linux Foundation, said, “We are excited at the prospect of our growing global community being able to broadly adopt, manage and integrate OpenContrail’s code base to manage and secure diverse cloud environments. Having this addition to our open source projects will be instrumental in achieving the level of technology advancements our community has become known for.” Once the Linux Foundation takes over the governance of OpenContrail’s code base, Juniper’s mission to ensure the project truly remains community-led will be fulfilled. This, in turn, will accelerate pioneering advances and community adoption, and enable an easier, more secure migration to multi-cloud environments.

GIMP 2.9.8 image editor now comes with better PSD support and on-canvas gradient editing

The latest release of the GIMP, the popular open source image editor, introduces on-canvas gradient editing and various other enhancements, while focusing on bug fixing and stability. You can now create and delete colour stops, select and shift them, assign colours to colour stops, change blending and colouring for segments between colour stops, and create new colour stops from mid-points. “Now, when you try to change an existing gradient from a system folder, the GIMP will create a copy of it, call it a ‘custom gradient’ and preserve it across sessions. Unless, of course, you edit another ‘system’ gradient, in which case it

Heptio and Microsoft join the effort to bring Heptio Ark to Azure

The new collaboration between Heptio and Microsoft aims to ensure Heptio Ark delivers a strong Kubernetes disaster-recovery solution for customers who want to use it on Azure. The companies will also work together to make the Ark project an efficient solution for moving Kubernetes applications between on-premise computing environments and Azure, and to ensure that Azure-hosted backups are secure. The Ark project provides a simple, configurable and operationally robust way to back up and restore applications and persistent volumes from a series of checkpoints. With the Heptio-Microsoft collaboration, the two firms will ensure that organisations are not only able to back up and restore content into Azure Container Service (AKS), but that snapshots created using Ark are persisted in Azure and are encrypted at rest.

“I’m excited to see Heptio and Microsoft deliver a compelling solution that satisfies an important and unmet need in the Kubernetes ecosystem,” said Brendan Burns, distinguished engineer at Microsoft and co-creator of Kubernetes. The collaboration will also help manage disaster recovery for Kubernetes cluster resources and persistent volumes.

Google’s AI division releases open source update to DeepVariant

Google has released the open source version of DeepVariant, a deep learning technology to reconstruct the true genome sequence from HTS sequencer data with significantly greater accuracy than previous classical methods. This work is the product of more than two years of research by the Google Brain team, in collaboration with Verily Life Sciences.

will become the new custom gradient,” said Alexandre Prokoudine in the release announcement. He added, “Since this feature is useful for more than just gradients, it was made generic enough to be used for brushes and other types of resources in the future. We expect to revisit this in the future releases of GIMP.” The release announcement also states that the PSD plug-in has been fixed to properly handle Photoshop files with deeply nested layer groups, and to preserve the expanded state of groups for both importing and exporting. Additional changes fix the mask position and improve layer opacity for importing/exporting.

India’s first FIWARE Lab node to be operational from April 2018

The Brain team programmed it in TensorFlow, a library of open source programming code for numerical computation that is popular for deep learning applications. The technology works well on data from all types of sequencers and eases the process of transitioning to new sequencers. DeepVariant is being released as open source software to encourage collaboration and to accelerate the use of this technology in solving real-world problems. The open release will also make it easier to receive inputs from researchers about the use cases they are interested in. This is part of a broader effort to make genomics data compatible with the way deep learning machinery works. The move will also extend Google technologies to healthcare and other scientific applications, and make the results of these efforts broadly accessible.

Addressing the growing demand for smart city applications across India, NEC Corporation and NEC Technologies India Private Limited (NECTI) will soon establish a FIWARE Lab node in India. Having a FIWARE Lab node within India will encourage more participation from Asian countries, as they can keep all experimental and research data within the boundaries of the region. FIWARE is an open source platform which enables real-time smart services through data sharing across verticals and agencies via open standards based APIs. It focuses on specifications for common context information APIs, data publication platforms and standard data models in order to achieve and improve cross-sector interoperability for smart applications, with FIWARE NGSI as a starting point. The technology is in use in more than 100 cities in 23 countries in Europe and other regions. It is royalty-free and avoids any vendor lock-in. The Lab node in India will help to foster a culture of collaboration between various participating entities and promote their solutions in the FIWARE community. “The FIWARE Foundation welcomes the new FIWARE Lab node starting in India. FIWARE is used by an increasing number of cities in Europe and other regions and I wish this new FIWARE Lab node will trigger the adoption of FIWARE both in India and other APAC countries,” said Ulrich Ahle, CEO of the FIWARE Foundation. “It is also our pleasure to have the commitment of the NEC Technologies India team to contribute to the FIWARE community, which will strengthen the FIWARE technology as well as its globalisation as a smart city platform,” he added. The facility is expected to start operations from April 2018, and is endorsed by the FIWARE Foundation. Organisations, entrepreneurs and individuals can use this lab to learn FIWARE, as well as to test their applications while capitalising on open data published by cities and other organisations.






Red Hat OpenShift Container Platform 3.7 released

Red Hat has launched OpenShift Container Platform 3.7, the latest version of Red Hat’s enterprise-grade Kubernetes container application platform. As application complexity and cloud incompatibility increase, Red Hat OpenShift Container Platform 3.7 will help IT organisations build and manage applications that use services spanning the data centre and the public cloud. The latest version of the industry’s most comprehensive enterprise Kubernetes platform includes native integrations with Amazon Web Services (AWS) service brokers, which enable developers to bind services across AWS and on-premise resources to create modern applications, while providing a consistent, open standards-based foundation to drive business evolution. “We are excited about our collaboration with Red Hat and the general availability of the first AWS service brokers in Red Hat OpenShift. The ability to seamlessly configure and deploy a range of AWS services from within OpenShift will allow our customers to benefit from AWS’s rapid pace of innovation, both on-premises and in the cloud,” said Matt Yanchyshyn, director, partner solution architecture, Amazon Web Services, Inc. Red Hat OpenShift Container Platform 3.7 will ship with the OpenShift template broker, which turns any OpenShift template into a discoverable service for application developers using OpenShift. OpenShift templates are lists of OpenShift objects that can be instantiated with specific parameters, making it easier for IT organisations to deploy reusable, composite applications comprising microservices.
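To make the idea concrete, here is a minimal sketch of what such a template looks like. The template name, parameter and port values below are invented for illustration; a real template would usually also carry deployment configuration, routes and image stream objects:

```yaml
# Minimal OpenShift v3 template (illustrative; names and values are made up)
apiVersion: v1
kind: Template
metadata:
  name: demo-web-template
parameters:
# The value supplied at processing time replaces ${APP_NAME} below
- name: APP_NAME
  description: Name applied to all generated objects
  required: true
objects:
- apiVersion: v1
  kind: Service
  metadata:
    name: ${APP_NAME}
  spec:
    ports:
    - port: 8080
    selector:
      app: ${APP_NAME}
```

A developer would typically instantiate such a template with `oc process -f demo-web-template.yaml -p APP_NAME=myshop | oc create -f -`; the template broker described above makes the same template show up as a discoverable service in the catalogue.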

Building secure container infrastructure with Kata Containers

The OpenStack Foundation has announced a new open source project, Kata Containers, which aims to unite the security advantages of virtual machines (VMs) with the speed and manageability of container technologies. The project is designed to be hardware agnostic and compatible with Open Container Initiative (OCI) specifications, as well as the container runtime interface (CRI) for Kubernetes. Intel is contributing its open source Intel Clear Containers project and Hyper is contributing its runV technology to initiate the project. Besides Intel and Hyper, 99cloud, AWcloud, Canonical, China Mobile, City Network, CoreOS, Dell/EMC, EasyStack, Fiberhome, Google, Huawei, Mirantis, NetApp, Red Hat, SUSE, Tencent, Ucloud, UnitedStack and ZTE are also supporting the project’s launch. The Kata Containers project will initially comprise six components: the agent, runtime, proxy, shim, kernel and the packaging of QEMU 2.9. It is designed to be architecture agnostic and to run on multiple hypervisors. Kata Containers offers the ability to run container management tools directly on bare metal. “The Kata Containers project is an exciting addition to the OpenStack Foundation family of projects. Lighter, faster, more secure VM technology fits perfectly into the OpenStack Foundation family and aligns well with Canonical’s data centre efficiency initiatives. Like Clear Containers previously, Kata Containers users will find their hypervisor and guests well supported on Ubuntu,” said Dustin Kirkland, vice president, product, Canonical.

Fedora 27 released

The Fedora Project, a Red Hat-sponsored and community-driven open source collaboration, has announced the general availability of Fedora 27. All editions of Fedora 27 are built from a common set of base packages and, as with all new Fedora releases, these packages have seen numerous tweaks, incremental improvements and new additions. For Fedora 27, this includes the GNU C Library 2.26 and RPM 4.14. “Building and supporting the next generation of applications remains a critical focus for the Fedora community, showcased in Fedora 27 by our continued support and refinement of system containers and containerised services like Kubernetes and Flannel. More traditional developers and end users will be pleased







Canonical and Rancher Labs announce Kubernetes cloud native platform

Canonical, in partnership with Rancher Labs, has announced a turnkey application delivery platform built on Ubuntu, Kubernetes and Rancher 2.0. The new cloud native platform will make it easy for users to deploy, manage and operate containers on Kubernetes through a single workflow management portal, from development and testing through to production environments. Built on Canonical’s distribution of Kubernetes and Rancher 2.0, the cloud native platform will simplify enterprise usage of Kubernetes with seamless user management, access control and cluster administration.

“Our partnership with Rancher provides end-to-end workflow automation for the enterprise development and operations team on Canonical’s distribution of Kubernetes,” said Mark Shuttleworth, CEO of Canonical. “Ubuntu has long been the platform of choice for developers driving innovation with containers. Canonical’s Kubernetes offerings include consulting, integration and fully managed Kubernetes services on-premises and on-cloud,” Shuttleworth added.

with the additions brought about by GNOME 3.26 to Fedora 27 Workstation, making it easier to build applications and improving the overall desktop experience,” said Matthew Miller, the Fedora Project Leader.

Canon joins the Open Invention Network community

Open Invention Network (OIN), the largest patent non-aggression community in history, has announced that Canon has joined as a community member. As a global leader in such fields as professional and consumer imaging and printing systems and solutions, and having expanded its medical and industrial equipment businesses, Canon is demonstrating its commitment to open source software as an enabler of innovation across a wide spectrum of industries. “Open source technology, especially Linux, has led to profound increases in capabilities across a number of key industries, while increasing overall product and service efficiency,” said Hideki Sanatake, an executive officer, as well as deputy group executive of corporate intellectual properties and legal headquarters at Canon. “By joining Open Invention Network, we are demonstrating our continued commitment to innovation, and supporting it with patent non-aggression in Linux.” OIN’s community practices patent non-aggression in core Linux and adjacent open source technologies by cross-licensing Linux System patents to one another on a royalty-free basis.

GTech partners with Red Hat to catalyse open source adoption in Kerala

The Kerala government’s IT policy encourages the adoption of open source and open technologies in the public domain. Hence, the Group of Technology Companies (GTech), the industry body for IT companies in Kerala, has recently signed an MoU with Red Hat. The partnership aims to create enhanced awareness on various open source technologies amongst IT professionals in the state. The MoU will facilitate partnerships between Red Hat and GTech member companies. The efforts will focus on research and product development in open source software technologies. The state government has also emphasised the need to promote open source among SMEs. According to the terms of the MoU, Red Hat will organise events in IT parks across the state. These events were kickstarted in November 2017, and include lectures, seminars and presentations spanning the Internet of Things (IoT), artificial intelligence, analytics, development tools, content management systems, desktop publishing and other connected topics.

Amazon extends support to Facebook and Microsoft

Amazon has announced its ONNX-MXNet Python package to import Open Neural Network Exchange (ONNX) deep learning models into Apache MXNet. This move indicates the company’s support for Facebook and Microsoft in their efforts to open source artificial intelligence (AI). With this package, developers running models based on open source ONNX will be able to run them on Apache MXNet. Basically, this allows AI developers to keep models but switch networks, as opposed to starting from scratch.


Microsoft launches Azure location based services

Addressing a gathering at Automobility LA 2017 in Los Angeles, California, Sam George, director of Azure IoT at Microsoft, said, “Microsoft is making an effort to solve mobility challenges and bring government bodies, private companies and automotive OEMs together, using Microsoft’s intelligent cloud platform.” The new location capabilities will provide cloud developers the critical geographical data needed to power smart cities and Internet of Things (IoT) solutions across industries, including manufacturing, automotive, logistics, urban planning and retail.

TomTom Telematics will be the first official partner for the service, supplying critical location and real-time traffic data, providing Microsoft customers with advanced location and mapping capabilities. Microsoft’s Azure location based services will offer enterprise customers location capabilities integrated in the cloud to help any industry improve traffic flow. Microsoft also announced that Azure LBS will be launched in 2018, and will be available globally in more than 30 languages.

It has become increasingly evident that the future of AI needs more than just ethical direction and government oversight. It would be comforting to know that the tech giants are on the same page too. The machines, and the humans who will rely on them, need the biggest companies building AI to take on a fair share of responsibility for the future.

Four tech giants using Linux change their open source licensing policies

The GNU General Public License version 2 (GPLv2) is arguably the most important open source licence for one reason: Linux uses it. On November 27, 2017, three tech powerhouses that use Linux—Facebook, Google and IBM, as well as the major Linux distributor Red Hat, announced they would extend additional rights to help companies that have made GPLv2 open source licence compliance errors and mistakes. The GPLv2 and its close relative, the GNU Lesser General Public License (LGPL), are widely used open source software licences. When GPL version 3 (GPLv3) was released, it came with an express termination approach. This termination policy in GPLv3 provided a way for companies to correct licensing errors and mistakes, allowing licence compliance enforcement that is consistent with community norms.

FreeNAS 11.1 provides greater performance and cloud integration

FreeNAS 11.1 adds cloud integration and OpenZFS performance improvements, including the ability to prioritise ‘resilvering’ operations, and preliminary Docker support to the world’s most popular software-defined storage operating system. It also adds a cloud sync (data import/export to the cloud) feature, which lets you sync (similar to a backup), move (erase from the source) or copy (only changed data) data to and from public cloud providers including Amazon S3 (Simple Storage Services), Backblaze B2 Cloud, Google Cloud and Microsoft Azure. OpenZFS has noticeable performance improvements for handling multiple snapshots and large files. Resilver Priority has been added to the ‘Storage’ screen of the graphical user interface, allowing you to configure ‘resilvering’ at a higher priority at specific times. This helps to mitigate the inherent challenges and risks associated with storage array rebuilds on very large capacity drives. The latest release includes an updated preview of the beta version of the new administrator graphical user interface, including the ability to select display themes. It can be downloaded from
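The three transfer modes map onto familiar filesystem semantics. As a rough sketch, with local folders standing in for a cloud bucket (the function name and the size/mtime change-detection rule are our own assumptions, not FreeNAS code):

```python
# Illustrative sketch of FreeNAS-style cloud-sync transfer modes, using local
# folders in place of a cloud provider. Not FreeNAS internals.
import shutil
from pathlib import Path

def cloud_sync(src: Path, dst: Path, mode: str = "sync") -> None:
    """mode='sync': mirror src into dst (similar to a backup)
       mode='move': transfer everything, then erase it from the source
       mode='copy': transfer only files that appear changed"""
    dst.mkdir(parents=True, exist_ok=True)
    for f in sorted(src.rglob("*")):
        if f.is_dir():
            continue
        target = dst / f.relative_to(src)
        target.parent.mkdir(parents=True, exist_ok=True)
        # A file counts as changed if it is new, or differs in size/mtime
        changed = (not target.exists()
                   or target.stat().st_size != f.stat().st_size
                   or target.stat().st_mtime < f.stat().st_mtime)
        if mode in ("sync", "move") or changed:
            shutil.copy2(f, target)   # copy2 preserves timestamps
        if mode == "move":
            f.unlink()                # 'move' erases the source copy
```

In FreeNAS itself these jobs are configured in the web UI against a provider credential; the sketch only mirrors the sync/move/copy semantics described above.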

For more news, visit



KTPO Whitefield Bengaluru

Profit from IoT

India’s #1 IoT show. At Electronics For You, we strongly believe that India has the potential to become a superpower in the IoT space in the upcoming years. All that’s needed are platforms for the different stakeholders of the ecosystem to come together. We’ve been building one such platform: an event for the creators, the enablers and the customers of IoT. In February 2018, the third edition of will bring together a B2B expo, technical and business conferences, the Start-up Zone, demo sessions of innovative products, and more.

Who should attend?
• Creators of IoT solutions: OEMs, design houses, CEOs, CTOs, design engineers, software developers, IT managers, etc
• Enablers of IoT solutions: Systems integrators, solutions providers, distributors, resellers, etc
• Business customers: Enterprises, SMEs, the government, defence establishments, academia, etc

Why you should attend
• Get updates on the latest technology trends that define the IoT landscape
• Get a glimpse of products and solutions that enable the development of better IoT solutions
• Connect with leading IoT brands seeking channel partners and systems integrators
• Connect with leading suppliers/service providers in the electronics, IT and telecom domains who can help you develop better IoT solutions, faster
• Network with the who’s who of the IoT world and build connections with industry peers
• Find out about IoT solutions that can help you reduce costs or increase revenues
• Get updates on the latest business trends shaping the demand and supply of IoT solutions


Manufacture Electronics

Is there a show in India that showcases the latest in electronics manufacturing, such as rapid prototyping, rapid production and table-top manufacturing? Yes, there is now: EFY Expo 2018. With this show’s focus on the areas mentioned, and it being co-located at India Electronics Week, it has emerged as India’s leading expo on the latest manufacturing technologies and electronic components.

Who should attend?
• Manufacturers: CEOs, MDs, and those involved in firms that manufacture electronics and technology products
• Purchase decision makers: CEOs, purchase managers, production managers and those involved in electronics manufacturing
• Technology decision makers: Design engineers, R&D heads and those involved in electronics manufacturing
• Channel partners: Importers, distributors and resellers of electronic components, tools and equipment
• Investors: Startups, entrepreneurs, investment consultants and others interested in electronics manufacturing

Why you should attend
• Get updates on the latest technology trends in rapid prototyping and production, and in table-top manufacturing
• Get connected with new suppliers from across India to improve your supply chain
• Connect with OEMs, principals and brands seeking channel partners and distributors
• Connect with foreign suppliers and principals to represent them in India
• Explore new business ideas and investment opportunities in this sector

Colocated shows

Showcasing the Technology that Powers Light

Our belief is that the LED bulb is the culmination of various advances in technology. And such a product category and its associated industry cannot grow without focusing on the latest technologies. But, while there are some good B2B shows for LED lighting in India, none has a focus on ‘the technology that powers lights’. Thus, the need for

Who should attend?
• Tech decision makers: CEOs, CTOs, R&D and design engineers and those developing the latest LED-based products
• Purchase decision makers: CEOs, purchase managers and production managers from manufacturing firms that use LEDs
• Channel partners: Importers, distributors and resellers of LEDs and LED lighting products
• Investors: Startups, entrepreneurs and investment consultants interested in this sector
• Enablers: Systems integrators, lighting consultants and those interested in smarter lighting solutions (thanks to the co-located

Why you should attend
• Get updates on the latest technology trends defining the LED and LED lighting sector
• Get a glimpse of the latest components, equipment and tools that help manufacture better lighting products
• Get connected with new suppliers from across India to improve your supply chain
• Connect with OEMs, principals and lighting brands seeking channel partners and systems integrators
• Connect with foreign suppliers and principals to represent them in India
• Explore new business ideas and investment opportunities in the LED and lighting sector
• Get an insider’s view of ‘IoT + Lighting’ solutions that make lighting smarter

India’s Only Electronics-Centric T&M Show

Test & Measurement India (T&M India) is Asia’s leading exposition for test and measurement products and services. Launched in 2012 as a show co-located with Electronics For You Expo, it has established itself as a must-attend event for users of T&M equipment, and a must-exhibit event for suppliers of T&M products and services.

In 2015, T&M India added an important element by launching the T&M Showcase, a platform for showcasing the latest T&M products and technologies. A first-of-its-kind event in India, the T&M Showcase was well received by the audience and the exhibitors.

Who should attend?
• Senior technical decision makers from manufacturing, design, R&D and trade channel organisations
• Senior business decision makers from manufacturing, design, R&D and trade channel organisations
• R&D engineers
• Design engineers
• Test and maintenance engineers
• Production engineers
• Academicians
• Defence and defence electronics personnel

Why you should attend
• India’s only show focused on T&M for electronics
• Experience the latest T&M solutions first-hand
• Explore trade channel opportunities from Indian and foreign OEMs
• Attend demo sessions of the latest T&M equipment launched in India
• Special passes for defence personnel


Reasons Why You Should NOT Attend IEW 2018

India’s Mega Tech Conference

The EFY Conference (EFYCON) started out as a tiny 900-footfall community conference in 2012, going by the name of Electronics Rocks. Within four years, it grew into ‘India’s largest, most exciting engineering conference’, and was ranked ‘the most important IoT global event in 2016’ by Postscapes. In 2017, 11 independent conferences covering IoT, artificial intelligence, cyber security, data analytics, cloud technologies, LED lighting, SMT manufacturing, PCB manufacturing, etc, were held together over three days as part of EFY Conferences.

Key themes of the conferences and workshops in 2018
• Profit from IoT: How suppliers can make money and customers save it by using IoT
• IT and telecom tech trends that enable IoT development
• Electronics tech trends that enable IoT development
• Artificial intelligence and IoT
• Cyber security and IoT
• The latest trends in test and measurement equipment
• What’s new in desktop manufacturing
• The latest in rapid prototyping and production equipment

Who should attend
• Investors and entrepreneurs in tech
• Technical decision makers and influencers
• R&D professionals
• Design engineers
• IoT solutions developers
• Systems integrators
• IT managers

Special packages for
• Academicians
• Defence personnel
• Bulk/group bookings

We spoke to a few members of the tech community to understand why they had not attended earlier editions of India Electronics Week (IEW). Our aim was to identify the most common reasons and share them with you, so that if you too had similar reasons, you may choose not to attend IEW 2018. This is what they shared…

#1. Technologies like IoT, AI and embedded systems have no future
Frankly, I have NO interest in new technologies like Internet of Things (IoT), artificial intelligence, etc. I don't think these will ever take off, or become critical enough to affect my organisation or my career.

Where most talks will not be by people trying to sell their products? How boring! I can't imagine why anyone would want to attend such an event. I love sales talks, and I am sure everybody else does too. So IEW is a big 'no-no' for me. #7. I don't think I need hands-on knowledge I don't see any value in the tech workshops being organised at IEW. Why would anyone want hands-on knowledge? Isn't browsing the Net and watching YouTube videos a better alternative?

#2. I see no point in attending tech events What's the point in investing energy and resources to attend such events? I would rather wait and watch—let others take the lead. Why take the initiative to understand new technologies, their impact and business models?

#8. I love my office! Why do people leave the comfort of their offices and weave through that terrible traffic to attend a technical event? They must be crazy. What’s the big deal in listening to experts or networking with peers? I'd rather enjoy the coffee and the cool comfort of my office, and learn everything by browsing the Net!

#3. My boss does not like me My boss is not fond of me and doesn't really want me to grow professionally. And when she came to know that IEW 2018 is an event that can help me advance my career, she cancelled my application to attend it. Thankfully, she is attending the event! Look forward to a holiday at work.

#9. I prefer foreign events While IEW's was voted the ‘World's No.1 IoT event’ on, I don't see much value in attending such an event in India—and that, too, one that’s being put together by an Indian organiser. Naah! I would rather attend such an event in Europe.

#4. I hate innovators! Oh my! Indian startups are planning to give LIVE demonstrations at IEW 2018? I find that hard to believe. Worse, if my boss sees these, he will expect me to create innovative stuff too. I better find a way to keep him from attending.

Hope we've managed to convince you NOT to attend IEW 2018! Frankly, we too have NO clue why 10,000-plus techies attended IEW in March 2017. Perhaps there's something about the event that we've not figured out yet. But, if we haven't been able to dissuade you from attending IEW 2018, then you may register at

#5. I am way too BUSY I am just too busy with my ongoing projects. They just don't seem to be getting over. Once I catch up, I'll invest some time in enhancing my knowledge and skills, and figure out how to meet my deadlines. #6. I only like attending vendor events Can you imagine an event where most of the speakers are not vendors?

Conference Pass Pricing
• One day pass: INR 1999
• PRO pass: INR 7999

Special privileges and packages for...
• Defence and defence electronics personnel
• Academicians
• Group and bulk bookings


The themes
• Profit from IoT
• Rapid prototyping and production
• Table-top manufacturing
• LEDs and LED lighting

The co-located shows

Why exhibit at IEW 2018? More technology decision makers and influencers attend IEW than any other event

India’s only test and measurement show is also a part of IEW

Bag year-end orders; meet prospects in early February and get orders before the FY ends

It’s a technology-centric show and not just a B2B event

360-degree promotions via the event, publications and online!

The world’s No.1 IoT show is a part of IEW and IoT is driving growth

Over 3,000 visitors are conference delegates

The only show in Bengaluru in the FY 2017-18

It’s an Electronics For You Group property

Besides purchase orders, you can bag ‘Design Ins’ and ‘Design-Wins’ too

Your brand and solutions will reach an audience of over 500,000 relevant and interested people

IEW is being held at a venue (KTPO) that’s closer to where all the tech firms are

Co-located events offer cross-pollination of business and networking opportunities

IEW connects you with customers before the event, at the event, and even after the event

Special packages for ‘Make in India’, ‘Design in India’, ‘Start-up India’ and ‘LED Lighting’ exhibitors

Why you should risk being an early bird
1. The best locations sell out first
2. The earlier you book, the better the rates and the more the deliverables
3. We might just run out of space this year!

To get more details on how exhibiting at IEW 2018 can help you achieve your sales and marketing goals,

Contact us at +91-9811155335 Or

Write to us at

EFY Enterprises Pvt Ltd | D-87/1, Okhla Industrial Area, Phase -1, New Delhi– 110020


For U & Me

Top Tech Trends to Watch Out For in 2018
The technologies that will dominate the tech world in the new year are based on open source software.


At the start of a brand new year, I looked into the crystal ball to figure out the areas that no technologist can afford to ignore. Here is my rundown on some of the top trends that will define 2018.

Automation and artificial intelligence

These two much-talked-about trends are increasingly utilising open source. Companies including Google, Amazon and Microsoft have released the code of software frameworks designed to help developers build powerful AI applications. In fact, Gartner says that artificial intelligence is going to widen its net to include data preparation, integration, algorithm selection, training methodology selection, and model creation. There are many examples in production right now, such as chatbots, autonomous vehicles and drones, and video games, as well as other real-life scenarios such as design, training and visualisation processes.

Open source containers are no longer orphans

DevOps ecosystems are now seeing widespread adoption of containers like Docker. Containers are one of the hottest tickets in open source technology. You can imagine them as a lightweight packaging of application software that has all its dependencies bundled for easy portability. This removes a lot of hassle for enterprises, as it cuts down on costs and time. According to 451 Research, the container market is expected to grow by more than 250 per cent between 2016 and 2020. Microsoft recently contributed to the mix by launching its Virtual Kubelet connector for Azure, streamlining the whole container management process.

Blockchain finds its footing

As Bitcoin is likely to hit the US$ 20,000 mark, we’re all in awe of the blockchain technology behind all the cryptocurrencies. Other industries are expected to follow suit, such as supply chain, healthcare, government services, etc. The fact that it’s not controlled by any single authority and has no single point of failure makes it a very robust, transparent and incorruptible technology. Russia has also become one of the first countries to embrace the technology by piloting its banking industry’s first ever payment transaction. Sberbank, Russia’s biggest bank by assets, has executed a real-time money transfer over an IBM-built blockchain based on the Hyperledger open source collaborative project. One more case in point is a consortium comprising more than a dozen food companies and retailers, including Walmart, Nestle and Tyson Foods, dedicated to using blockchain technology to gather better information on the origin and state of food.

IoT-related open source tools/libraries

IoT has already made its presence felt. Various open source tools that are a perfect match for IoT challenges, such as Arduino, Home Assistant, Zetta, DeviceHive and ThingSpeak, are now available. Open source has served as the foundation for IoT’s growth so far and will continue to do so.

OpenStack to gain more acceptance

OpenStack has enjoyed tremendous success since the beginning, with its exciting and creative ways to utilise the cloud. But it lags behind when it comes to adoption, partly due to its complex structure and its dependence on virtualisation, servers and extensive networking resources. New fixes are in the works, though, as several big software development and hosting companies work overtime to resolve the underlying challenges. In fact, OpenStack has now expanded its scope to include containers with the recent launch of the Kata Containers project. Open source is evolving at a great pace, which presents tremendous opportunities for enterprises to grow bigger and better. Today, the cloud also shares a close bond with open source, with the services of various big cloud companies like AWS, Google Cloud and Microsoft Azure being quite open source-friendly. I can think of no better way to say this: open source is poised to be the driver behind various innovations. I’d love to hear your thoughts on other trends that will dominate 2018. Do drop me a line (

By: Dinesh Kumar
The author is the CEO of Sedin Technologies and the co-founder of RailsFactory. He is a passionate proponent of open source and keenly observes the trends in this space. In this new column, he digs into his experience of servicing over 200 major global clients across the USA, UK, Australia, Canada and India.


Travel- and pocket-friendly leather headset from Astrum
Leading ‘new technology’ brand Astrum has unveiled a travel-friendly headset – the HT600. Price: ₹4,990. The affordable yet stylish headphones come in a lightweight, compact design with no wires. The headset’s twist-folding design allows compact storage, making it easily portable. Packed with its own hard case and pouch, the leather headband ensures a perfect fit with foam earcups. The headphones come with noise-cancelling technology and 3.5mm drivers, delivering the full range of deep bass and clear high notes. They also come with handy controls, conveniently placed on the outside of the device, so users can easily pause, play or rewind music. The HT600 supports Bluetooth version 4.0, and can be paired with two smartphones on the go. It comes with a built-in microphone to make or receive calls. With NFC technology, a user can pair the headphones with a simple touch for non-stop music. The headphones offer up to 96 hours of standby and eight hours of call and playback time on a single charge, company sources claim. They are available in black, online and at retail stores. Address: Astrum India, 3rd Floor, Plot No. 79, Sector-44, Gurugram, Haryana – 122003; Ph: 09711118615

Fitness smartwatch with the ‘tap to pay’ feature from Garmin
Innovative GPS technology firm Garmin has unveiled its latest smartwatch – the Vivoactive 3 – in India. It is the first smartwatch from the company to offer a feature like ‘tap and pay’. Price: ₹24,990. The device offers contactless payments and has 15 preloaded sports apps along with inbuilt GPS functionality. With the company’s in-house chroma display with LED backlighting, the smartwatch features a 3.04cm (1.2 inch) screen with a 240 x 240 pixel resolution. Its display is protected by Corning Gorilla Glass 3 with a stainless steel bezel, and the case is made of fibre-reinforced polymer. The device offers 11 hours of battery life in GPS mode, company sources claim, and seven days in smartwatch mode. Compatible with all Android and iOS devices, the Garmin Vivoactive 3 smartwatch is available in black and white colours via selected retail and online stores. Address: Garmin India, D186, 2nd Floor, Yakult Building, Okhla Industrial Area, Phase 1, New Delhi – 110020; Ph: 09716661666

This activity tracker from Timex has an SOS trigger
Timex, the manufacturer of watches and accessories, has recently expanded its product portfolio by launching its latest activity tracker in India – the Timex Blink. The device is a blend of a traditional watch and a fitness tracker, and uses Bluetooth to connect with a smartphone. Apart from being a funky watch, the device is designed to track day-to-day activities such as calories burnt, distance covered, hours of sleep, etc. The tracker is capable of sending instant mails and SMSs with the GPS location of the user in case of an emergency. Compatible with both Android and iOS, the Blink activity tracker comes with a 2.28cm (0.9 inch) OLED touchscreen display with a stainless steel 304L case, six-axis motion sensors, an SOS trigger and a Nordic nRF52832 CPU. Backed with a 90mAh battery, the device supposedly offers 10 days of battery backup on a single charge. The Timex Blink is available in two variants – with a leather strap (₹4,459) and as a bracelet (₹4,995) – at retail stores. Address: Timex Group India Ltd, Tower B, Plot No. B37, Sector 1, Near GAIL Building, Noida, Uttar Pradesh – 201301

Speakers from Harman Kardon with Amazon Alexa support now in India
Well-known manufacturer of audio equipment, Harman Kardon, recently launched its latest premium speakers – the Harman Kardon Allure. The highlight of the speakers is the support for Amazon’s Alexa – a proprietary voice assistant, which can help users manage day-to-day tasks such as playing music, making purchases, reading out the news, etc. With a built-in four-microphone array and the latest voice technology, the speaker is capable of responding to commands even in noisy environments. It supports up to 24-bit/96kHz HD audio streaming and delivers 360-degree sound with its transducers and built-in sub-woofers. The multi-coloured lighting on the top of the device adapts to the surrounding environment to help the speakers blend in. The device supports Bluetooth v4.2, aux and WPS/Wi-Fi for connecting to all Android, iOS and Windows smartphones, laptops and TVs. Price: ₹22,490. The Harman Kardon Allure is available by invitation only at

Address: Harman International, A-11, Jawahar Park, Devli Road, New Delhi – 110062; Ph: 011-29552509

Feature-loaded mid-range smartwatch collection from Misfit
Misfit, the consumer electronics company owned by the Fossil Group, has recently unveiled its much-awaited smartwatch collection called Vapor. The all-in-one smartwatch collection comes with a stunning 3.53cm (1.39 inch) fully round AMOLED display with a vibrant colour palette at 326 ppi (pixels per inch), and is designed with a 44mm satin-finish stainless steel upper casing. Vapor is powered by Android Wear 2.0, the latest wearable operating system. The ‘OK Google’ feature enables access to hundreds of top-rated apps to get things done. The device’s many features include a useful customised watch face, an enhanced fitness experience, onboard music functionality, the Google Assistant, limitless apps, etc. The smartwatches come with the Qualcomm Snapdragon Wear 2100 processor and 4GB memory with Bluetooth and Wi-Fi connectivity. Price: ₹14,495. Water-resistant to 50 metres, the Vapor range allows users to browse the menu of applications and respond to notifications easily. Fitness features include a calorie counter, as well as a distance and heart rate monitor. The smartwatches also offer sensors such as an accelerometer, gyroscope, etc, and can be paired with any device running Android 4.3 or iOS 9 and above. The Misfit Vapor collection can be purchased from Flipkart.

Address: Fossil Group, 621, 12th Main Road, HAL 2nd Stage, Indiranagar, Bengaluru, Karnataka 560008

The prices, features and specifications are based on information provided to us, or as available on various websites and portals. OSFY cannot vouch for their accuracy.

Compiled by: Aashima Sharma

For U & Me

Open Journey


Affair with Freedom”
Wearing geeky eyewear, this dimple-chinned man looks content with life. When asked about his sun sign, he mimes the sun with its rays, but does not reveal his zodiac sign. Yes, this is the creative and very witty Niyam Bhushan, who has kickstarted a revolution in UX design in India through the workshops conducted by his venture DesignRev.in. In a tete-a-tete with Syeda Beenish of OSFY, this industry veteran, who has spent 30-odd years understanding and sharing the value of open source with the masses, speaks passionately about the essence of open source. Excerpts:

Discovering Ghostscript back in 1988/89

Being a graphics designer, I came across Ghostscript circa 1988 or 1989. It was a muft and mukt alternative to Postscript. What intrigued me most about it was the licence, the GPL, which got me started. Those were the days when people were curious to understand the difference between freeware, shareware, crippleware and adware. But it was the GPL that made me realise this was a powerful hack of an idea that could transform the IT industry. I was excited from my first encounter and eventually devoted 14 years exclusively to the FOSS movement and its offshoots, most notably Creative Commons.

The journey

From 1982 to 1985 I was busy learning how to program in machine code on Zilog chips, and later in COBOL and BASIC on DEC PDP-11/70 minicomputers. But in a few years, I realised the game would be in digital graphics and design. So I started by pioneering many techniques and workflows in digital graphics design, typography and imaging in publishing. Eventually, I started consulting in this field for the best IT companies, like Apple, Adobe and Xerox, and also for the advertising, publishing and even textile-printing industries. Concurrently, I focused on what was then called human-computer interaction (HCI) and is now more popularly known as user-interface design and UX. This is the ultimate love affair between intuition and engineering. The huge impact of the computer industry on billions of people can be attributed directly to this synergy. I’ve brought tens of thousands of people into the free and open source movement in India. How? By writing extensively about it in mainstream newspapers as well as in tech magazines, and by conducting countless seminars and public talks for the industry, government, academia and the community. Besides, I was a core member of the event, and helped to set up several chapters of Linux user groups across India. I ventured into consultancy, and guided companies on free and open source software. During my journey, I also contributed extensively to bug reports for a few GPL software projects in the graphics design space.

Your definition of open source: Muft and mukt is a state of mind, not software
Favourite book: ‘The Cathedral and the Bazaar’ by Eric S. Raymond
Pastime: Tasting the timeless through meditation
Favourite movie: ‘Snowden’ by Oliver Stone
Dream destination: Bhutan, birthplace of ‘Schumacher Economics’, which gives a more holistic vision to the open source philosophy
Idol: Osho, a visionary who talked about true freedom and how to exercise your individual freedom in your society

Establishing ILUG-D

I still remember one cool evening back in 1995, when a couple of us hackers were huddled around an assembled PC. Somebody was strumming a badly-tuned guitar, an excited pet dog was barking at new guests… This was the founding of the Indian Linux Users Group, Delhi (ILUG-D). This was also the first official meet at the home of the late Raj Mathur, founding member of the ILUG-D. That meeting shaped free and open source software as a movement, and not just a licence. Everyone knows what happened over the next decade-and-a-half.

The reality of open source adoption in India

Today, it is all about free and open source software, and open knowledge, which for me goes way beyond Linux. Honestly, I am not happy with the way open source adoption has happened in India. In this vast country, there is one and only one challenge: the mindset of people towards open source. What’s happening in India is ‘digital colonialism’, as our minds are still ruled by proprietary software, proprietary services and a lack of understanding of privacy. We lack an understanding of our ‘digital sovereignty’. To address this mindset, I wrote two whitepapers and published them on my website, where they became very popular. The first was ‘Seven Steps to Software Samadhi: How to Migrate from Windows to GNU/Linux for the Non-techie in a Hurry’. Published under the FDL licence, this initiative acquired a life of its own in the community. The second was ‘Guerilla Warfare for Gyaan’, which was about bringing in free knowledge, especially in academia. Both were received well
by the community, but we are yet to unlock the true potential of open source in the country. How many of us really know that the highly sophisticated computer in our pocket is running Linux! Apple Macintosh and the iOS are based on the MACH kernel, Windows on BSD, and all of these are open source kernels. On a positive note, I would say that it is impressive to see the adoption of Android, but at other levels, the real potential of open source is yet to be realised by Indians.


You may wonder, “How did Niyam Bhushan survive and continue giving to the industry?” One should always remember that any community-building needs your time and effort but, gradually, it will start giving you returns in the most unexpected ways. Yet this was not the real driving force for me. I love people and I love ideas. Sharing your knowledge and experiences brings you commercial opportunities in return, as well as a plethora of ideas that further enhance your understanding. My intention was never to be a multi-billionaire, but to earn more than comfortably for myself while following my passion. I wanted to touch the lives of as many people as possible and enrich my life with knowledge-sharing whenever and wherever possible. The beauty of the community is that while it seems to take your time and effort, it opens doors to lucrative opportunities as well.

The community will continue to evolve around specific value-based pillars. For instance, in the vibrant startup communities of India, open source is fuelling a gold rush, propelling India towards becoming a creator of wealth in the world. In academia, it is the highly local and focused communities that deepen learning and exploration. In the government and the public sector, internal communities orient, adopt, collaborate and formulate policies.

What open source can do for:
An organisation
Free and open source software (FOSS) is a wild dragon-child that can transform any organisation into a Daenerys Targaryen. But like her, you need to know how to tame this dragon, and where and when to use it effectively. Otherwise, its fire can and will consume you instead.
An individual (home user)
Whatever software a home user adopts (including proprietary and commercial software), open source offers fierce competition to push costs down, keep it free, enhance its performance, make it secure, or honour your privacy better. Hence, open source browsers are free. Home users get operating systems for free or for a token fee. The latest Firefox outclasses even Google Chrome, while the Telegram messenger and Signal outshine WhatsApp with their privacy and security.
Techie home user
That’s like preaching to the choir. For the techie home user, open source is the best way to tinker and hack and, hopefully, also build the next billion-dollar unicorn in your barsaati.

A ray of hope

Dos and Don’ts for developers

I insist that people should read their employment contracts carefully. In most cases in India, I’ve noticed that developers have signed away the rights to their FOSS contributions to the company, which may even keep them a trade secret, and may even bar employees from ever using their own code again. Even if the software is under a free, muft and mukt licence, please carefully consider whom you want to assign the copyright of your work to: yourself, or your organisation. Check with the legal department about policies on the use of code marked as open source. Often, violations occur when developers help themselves to code without bothering to check the implications of its licence.


Unfortunately, people in India are not yet sensitised enough to the issue of digital privacy. If this sleeping giant wakes up to the importance of digital privacy, the adoption of open source will naturally become pervasive. IoT will provide the next push for open source across India, invisibly. Startups and entrepreneurs are setting up, and will continue to set up, sophisticated cloud-based services deployed on free and open source software. So, here’s the magic bullet: sell your value proposition, not your open source philosophy, and the market will adopt it in droves. Beyond software, I see open source licences being adopted directly in agriculture, health, pharma and education, creating an exponentially larger impact than they could ever create as just software licences. To conclude, I would say that we’ve managed to discover the magic formula for the adoption of free and open source software in India. Just make it invisible, and people will adopt it; hence the exponential growth in the adoption of Android in India. Arduino projects bring FOSS to kids. But for me, the adoption of open source is successful when people start the relationship with it after understanding its true philosophy. This is one love affair with freedom!

Let’s Try For U & Me

Tricks to Try Out on Thunderbird and SeaMonkey Learn to use and store email messages offline with Thunderbird and SeaMonkey.


In 2004, Google introduced its Gmail service with a 1GB mailbox and free POP access. This was at a time when most people had email accounts with their ISP or had free Web mail accounts with Hotmail or Yahoo. Mailbox storage was limited to measly amounts such as 5MB or 10MB. If you did not regularly purge old messages, then your incoming mail would bounce with the dreaded ‘Inbox full’ error. Hence, it was standard practice to store email ‘offline’ using an email client. Each year now, a new generation of young people (mostly students) discovers the Internet, and they start with Web mail straight away. As popular Web mail services integrate online chatting as well, they prefer to use a Web browser rather than a desktop mail client to access email. This is sad, because desktop email clients represent one of those rare Internet technologies that can claim to have achieved perfection. This article will bring readers up to speed on Thunderbird, the most popular FOSS email client.

Why use a desktop email client?

With an email client, you store emails offline. After the email application connects to your mail server and downloads new mail, it instructs the server to delete those messages from your mailbox (unless configured otherwise). This has several advantages.
ƒ If your account gets hacked, the hacker will not get your archived messages. This also limits the fallout on your other accounts, such as those used for online banking.
ƒ Web mail providers such as Gmail read your messages to display ‘relevant’ advertisements. This is creepy, even if it is software-driven.
ƒ Email clients let you read and compose messages offline. A working Net connection is not required. Web mail requires you to log in first.
ƒ Web mail providers such as Gmail automatically tell your contacts whether you are online or if your camera is on. Email clients do not do this.
ƒ Modern Web browsers take many liberties without asking. Chrome, by default, listens to your microphone and uploads conversations to Google servers (for your convenience, of course). Email clients are not like that.
ƒ Searching archived messages is extremely powerful on desktop mail clients. There is no paging of the results.
ƒ When popular Web mail providers offer free POP access, why suffer the slowness of the Web?

POP or IMAP access to email

Email clients use two protocols, POP and IMAP, to receive mail. POP is ideal if you want to download and delete mail. IMAP is best if you need access on multiple devices or at different locations. POP is more prevalent than IMAP and, for offline storage, POP is the best. Popular Web mail providers provide both POP and IMAP access. Before you can use an email client, you will have to log in to your Web mail provider in a browser, check the settings and activate POP/IMAP access for incoming mail. Email clients use the SMTP protocol for outgoing mail. In Thunderbird/SeaMonkey, you may have to add SMTP server settings separately for each email account. If you have lots of email already online, then it may not be possible to make your email client create an offline copy in one go. Each time you choose to receive messages, the mail client will download a few hundred of your old messages. After it has downloaded all your old archived messages, the mail client will then settle down to downloading only your newest messages. The settings for some popular Web mail services are as follows:
ƒ Hotmail/Live/Outlook
• POP:
• SMTP:
ƒ Gmail
• POP:
• SMTP:
ƒ Yahoo
• POP:
• SMTP:
The following settings are common for them:
ƒ POP
• Connection security/Encryption method: SSL
• Port: 995
ƒ SMTP
• Connection security/Encryption method: SSL/TLS/STARTTLS
• Port: 465/587
Some ISPs and hosting providers provide unencrypted mail access. Here, the connection security method will be ‘None’, and the ports are set to 110 for POP and 25 for SMTP. However, please be aware that most ISPs block Port 25, and many mail servers block mail originating from that port.
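These same settings can be exercised from a short script before you configure the mail client. Below is a minimal sketch using Python's standard poplib and smtplib modules. The host names used here (pop.gmail.com, smtp.gmail.com) are assumptions to make the sketch concrete; use the values your own provider documents, and note that nothing is sent or deleted by this code.

```python
import poplib
import smtplib

# Assumed host names -- check your provider's documentation for the real values.
POP_HOST = "pop.gmail.com"    # POP over SSL, port 995
SMTP_HOST = "smtp.gmail.com"  # SMTP submission, port 587 (STARTTLS) or 465 (SSL)


def count_new_messages(user, password, host=POP_HOST, port=995):
    """Connect over POP3/SSL and return (message count, mailbox size in bytes)."""
    conn = poplib.POP3_SSL(host, port)
    try:
        conn.user(user)
        conn.pass_(password)
        return conn.stat()  # does not download or delete anything
    finally:
        conn.quit()


def open_smtp(user, password, host=SMTP_HOST, port=587):
    """Open an authenticated outgoing-mail session using STARTTLS on port 587."""
    conn = smtplib.SMTP(host, port)
    conn.starttls()          # upgrade the connection to TLS
    conn.login(user, password)
    return conn              # caller sends with conn.send_message() and quits
```

If the POP3_SSL connection on port 995 succeeds, the account is ready for an offline-storage client such as Thunderbird or SeaMonkey.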

Figure 1: Live off the grid with no mail online. To get this Gmail note, you will have to empty the Inbox and Trash, and also delete all archived messages.

Even on a desktop screen, space may be at a premium. Currently, Thunderbird and SeaMonkey do not provide an easy way to customise the date columns. I use this trick in the launcher command to fix it. export LC_TIME=en_DK.UTF-8 && seamonkey -mail

Thunderbird and SeaMonkey

Popular email clients today are Microsoft Outlook and Mozilla Thunderbird, the latter being the obvious FOSS option. Like the Firefox browser, Thunderbird is modern software and supports many extensions or add-ons. Unlike Outlook (which uses Microsoft Word as its HTML formatting engine), Thunderbird has better CSS support, as it renders HTML messages using the Gecko engine (like the Firefox browser). The SeaMonkey Internet suite bundles both a Firefox-based browser and a Thunderbird-based mail client, in addition to an IRC client and a Web page designer. SeaMonkey is based on the philosophy of the old Netscape Communicator suite, in which the browser was known as Netscape Navigator and the mail client as Netscape Messenger. Because of certain trademark objections from Mozilla, some GNU/Linux distributions were bundling Firefox and Thunderbird as IceWeasel and IceDove; SeaMonkey became IceApe. This was resolved in 2016. If you have already opened the SeaMonkey browser, then the SeaMonkey mail client can be opened in a flash, and the reverse is also true. This is very useful because website links in SeaMonkey mails are opened in the SeaMonkey browser. Firefox is a separate application from Thunderbird and does not have this advantage. For this reason, I use SeaMonkey instead of Thunderbird. SeaMonkey is available at By default, SeaMonkey looks like Firefox or Thunderbird. I prefer to change its appearance using the Modern theme, as it makes SeaMonkey look like the old Netscape 6, and also because I need the browser to look different from regular Firefox. To enable this theme, go to Tools » Add-Ons » Appearance » SeaMonkey Modern.

Figure 2: Changing the format of the date columns requires a hack

Figure 3: Configure your own mail filters

Email providers today do a good job of filtering junk mail. You can do a better job with your own mail filters (Tools » Message Filters). You can choose to move/delete messages based on the occurrences of certain words in the From, To or Subject headers of the email.
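The kind of rule described above, matching words in the From or Subject headers, can be sketched in a few lines with Python's standard email module. This is only an illustration of the idea behind such filters, not Thunderbird's own filter format; the filter words and the sample message are made up.

```python
from email.message import EmailMessage

# Hypothetical filter: flag anything whose From or Subject header
# contains one of these words (case-insensitive).
FILTER_WORDS = ("lottery", "winner", "prize")


def matches_filter(msg, words=FILTER_WORDS):
    """Return True if any filter word occurs in the From or Subject header."""
    haystack = ((msg.get("From") or "") + " " + (msg.get("Subject") or "")).lower()
    return any(word in haystack for word in words)


msg = EmailMessage()
msg["From"] = "promo@example.com"
msg["Subject"] = "You are our lucky lottery WINNER!"
print(matches_filter(msg))  # True -> this message would be moved or deleted
```

A real mail filter would then move or delete the matching message; Thunderbird performs that step for you once the rule is defined in Tools » Message Filters.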


Figure 5: A newsgroup user sends an email message

Email backup

Figure 4: Thunderbird is also an RSS feed reader

Apart from email, Thunderbird can also display content from RSS feeds (as shown in Figure 4) and Usenet forums (as shown in Figure 5). Usenet newsgroups predate the World Wide Web. They are like an online discussion forum organised into several hierarchical groups. Forum participants post messages in the form of an email addressed to a newsgroup (say, comp.lang.javascript), and the NNTP client threads the discussions based on the subject line. (Google Groups is a Web-based interface into the world of Usenet.)
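Threading on the subject line, as described above, amounts to stripping any leading 'Re:' prefixes and grouping messages on the normalised subject. A minimal Python illustration of that idea (the posts and newsgroup content are invented; real clients also use Message-ID/References headers):

```python
import re
from collections import defaultdict


def thread_key(subject):
    """Normalise a subject line by stripping repeated leading 'Re:' prefixes."""
    return re.sub(r"^(?:\s*re:\s*)+", "", subject, flags=re.IGNORECASE).strip().lower()


def build_threads(messages):
    """Group (subject, author) pairs into threads keyed by normalised subject."""
    threads = defaultdict(list)
    for subject, author in messages:
        threads[thread_key(subject)].append(author)
    return dict(threads)


# Made-up posts to a hypothetical newsgroup
posts = [
    ("Closures in loops", "alice"),
    ("Re: Closures in loops", "bob"),
    ("Re: Re: Closures in loops", "carol"),
    ("addEventListener question", "dave"),
]
print(build_threads(posts))
# {'closures in loops': ['alice', 'bob', 'carol'], 'addeventlistener question': ['dave']}
```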

SeaMonkey ChatZilla

Apart from the Firefox-based browser and the Thunderbird-based email client, SeaMonkey also bundles an IRC chat client. IRC is yet another Internet-based communication protocol that does not use the World Wide Web. It is the preferred medium of communication for hackers. Here is a link for starters: irc://


When you store email offline, the burden of doing regular backups falls on you. You also need to ensure that your computer is not vulnerable to malware such as email viruses. Web mail providers do a good job of eliminating email-borne malware, but malware can still arrive from other sources. Windows computers are particularly vulnerable to malware spread by USB drives and browser toolbars and extensions. In Windows, simply creating a directory named ‘autorun.inf’ at the root level stops most USB drive infections. SeaMonkey stores all its data (email messages and accounts, RSS feeds, website user names/passwords/preferences, etc.) in the ~/.mozilla/seamonkey directory. For backup, just zip this directory regularly. If you move to a new GNU/Linux system, restore the backed-up directory to your new ~/.mozilla directory.

By: V. Subhash
The author is a writer, illustrator, programmer and FOSS fan. His website is at You can contact him at
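The backup step above is easy to script. Here is a minimal sketch, assuming the profile lives at ~/.mozilla/seamonkey as described above; the archive name is an arbitrary choice:

```shell
# Archive the SeaMonkey profile directory into a dated tarball.
# The profile path is an assumption; adjust it to match your system.
profile="$HOME/.mozilla/seamonkey"
backup="$HOME/seamonkey-backup-$(date +%Y%m%d).tar.gz"
mkdir -p "$profile"   # ensures the path exists; a no-op on a live profile
tar czf "$backup" -C "$HOME/.mozilla" seamonkey
echo "Backup written to $backup"
```

To restore on a new system, extract the archive into the new ~/.mozilla directory with tar xzf.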











How To

A Hands-on Guide on Virtualisation with VirtualBox

Virtualisation is the process of creating a software-based (or virtual) representation of a resource rather than a physical one. Virtualisation is applicable at the compute, storage or network level. In this article, we will discuss compute-level virtualisation, which is commonly referred to as server virtualisation.


Server virtualisation (henceforth referred to as virtualisation) allows us to run multiple instances of operating systems (OSs) simultaneously on a single server. These OSs can be of the same or of different types. For instance, you can run Windows as well as Linux on the same server simultaneously. Virtualisation adds a software layer on top of the hardware, which allows users to share the physical hardware (memory, CPU, network, storage and so on) among multiple OSs. This virtualisation layer is called the virtual machine manager (VMM) or hypervisor. There are two types of hypervisors.

Bare metal hypervisors: These are also known as Type-1 hypervisors and are installed directly on hardware. This enables the sharing of hardware resources with the guest OSs (henceforth referred to as ‘guests’) running on top of them. Each guest runs in an isolated environment without interfering with other guests. ESXi, Xen, Hyper-V and KVM are examples of bare metal hypervisors.

Hosted hypervisors: These are also known as Type-2 hypervisors. They cannot be installed directly on hardware. They run as applications and hence require a host OS. Similar to bare metal hypervisors, they are able to share physical resources among multiple guests and the physical host on which they are running. VMware Workstation and Oracle VM VirtualBox (hereafter referred to as VirtualBox) are examples of hosted hypervisors.

An introduction to VirtualBox

VirtualBox is cross-platform virtualisation software. It is available on a wide range of platforms like Windows, Linux, Solaris, and so on. It extends the functionality of the existing OS and allows us to run multiple guests simultaneously along with the host’s other applications.

VirtualBox terminology

To get a better understanding of VirtualBox, let’s get familiar with its terminology.
1) Host OS: This is the physical or virtual machine on which VirtualBox is installed.
2) Virtual machine (VM): This is the virtual environment created to run the guest OS. All its resources, like the CPU, memory, storage, network devices, etc., are virtual.
3) Guest OS: This is the OS running inside the virtual machine. VirtualBox supports a wide range of guests like Windows, Solaris, Linux, Apple’s OS X, and so on.
4) Guest additions: These are additional software bundles installed inside a guest to improve its performance and extend its functionality. For instance, they allow us to share folders between the host and guest, and enable drag-and-drop functionality.

Features of VirtualBox

Let us discuss some important features of VirtualBox.
1) Portability: VirtualBox is highly portable. It is available on a wide range of platforms and its functionality remains identical on each of them. It uses the same file and image format for VMs on all platforms, so a VM created on one platform can easily be migrated to another. In addition, VirtualBox supports the Open Virtualisation Format (OVF), which enables VM import and export functionality.
2) Commodity hardware: VirtualBox can be used on a CPU that doesn’t support hardware virtualisation instructions, like Intel’s VT-x or AMD-V.
3) Guest additions: As stated earlier, these software bundles are installed inside a guest, and enable advanced features like shared folders, seamless windows and 3D virtualisation.
4) Snapshots: VirtualBox allows the user to take consistent snapshots of a guest. It records the current state of the guest and stores it on disk, allowing the user to go back in time and revert the machine to an older configuration.
5) VM groups: VirtualBox allows the creation of a group of VMs that is represented as a single entity. We can perform various operations on that group, like Start, Stop, Pause, Reset and so on.

Getting started with VirtualBox

System requirements

VirtualBox runs as an application on the host machine, and for it to work properly, the host must meet the following hardware and software requirements:
1) An Intel or AMD CPU
2) A 64-bit processor with hardware virtualisation, if 64-bit guests are to be run
3) 1GB of physical memory
4) A Windows, OS X, Linux or Solaris host OS

Downloading and installation

To download VirtualBox, visit the wiki/Downloads link. It provides software packages for Windows, OS X, Linux and Solaris hosts. In this column, I’ll be demonstrating VirtualBox on Linux Mint. Refer to the official documentation if you wish to install it on other platforms. For Debian-based Linux, a ‘.deb’ package is provided. Its name follows the format virtualbox-xx_xx-yy-zz.deb, where xx_xx and yy are the version and build number respectively, and zz identifies the host OS and platform. For instance, for a Debian-based 64-bit host, the package name is virtualbox-5.2_5.2.0-118431-Ubuntu-xenial_amd64.deb.


To begin installation, execute the command given below in a terminal and follow the on-screen instructions:

$ sudo dpkg -i virtualbox-5.2_5.2.0-118431-Ubuntu-xenial_amd64.deb

Using VirtualBox

After successfully installing VirtualBox, let us get our hands dirty by first starting VirtualBox from the desktop environment. It will launch the VirtualBox manager window as shown in Figure 1.

Figure 1: VirtualBox manager

This is the main window, from which you can manage your VMs. It allows you to perform various actions on VMs like Create, Import, Start, Stop, Reset and so on. At this moment, we haven’t created any VMs; hence, the left pane is empty. Otherwise, a list of VMs would be displayed there.

Creating a new VM

Let us create a new VM from scratch. Follow the instructions given below to create a virtual environment for OS installation.
1) Click the ‘New’ button on the toolbar.
2) Enter the guest’s name, type and version, and click the ‘Next’ button to continue.
3) Select the amount of memory to be allocated to the guest and click the ‘Next’ button.
4) From this window, we can provide storage to the VM. It allows us to create a new virtual hard disk or use an existing one.
4a) To create a new virtual hard disk, select the ‘Create a virtual hard disk now’ option and click the ‘Create’ button.
4b) Select the VDI disk format and click on ‘Continue’.
4c) On this page, we can choose between a dynamically allocated and a fixed size storage policy:
i) As the name suggests, a dynamically allocated disk will grow on demand up to the maximum provided size.
ii) A fixed size allocation will reserve the required storage upfront. If you are concerned about performance, go with a fixed size allocation.
4d) Click the ‘Next’ button.
5) Provide the virtual hard disk’s name, location and size before clicking on the ‘Create’ button.
This will show the newly created VM in the left pane, as seen in Figure 2.
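The same steps can also be performed from the command line with VBoxManage, which is covered later in this article. A minimal sketch, where the VM name ‘Mint-18’, the guest type, the memory size and the disk size are all assumptions:

```shell
# Create and register a new 64-bit Ubuntu-family VM (names are assumptions)
VBoxManage createvm --name "Mint-18" --ostype Ubuntu_64 --register

# Allocate 1024 MB of memory to the guest
VBoxManage modifyvm "Mint-18" --memory 1024

# Create a dynamically allocated 10 GB VDI disk and attach it to a SATA controller
VBoxManage createmedium disk --filename Mint-18.vdi --size 10240 --variant Standard
VBoxManage storagectl "Mint-18" --name SATA --add sata
VBoxManage storageattach "Mint-18" --storagectl SATA --port 0 --type hdd --medium Mint-18.vdi
```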




Installing a guest OS

To begin OS installation, we need to attach an ISO image to the VM. Follow the steps given below:
1) Select the newly created VM.
2) Click the ‘Settings’ button on the toolbar.
3) Select the storage option from the left pane.
4) Select the optical disk drive from the storage devices.
5) Provide the path of the ISO image and click the ‘OK’ button. Figure 3 depicts the first five steps.
6) Select the VM from the left pane and click the ‘Start’ button on the toolbar. Follow the on-screen instructions to complete OS installation.
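Attaching the ISO can likewise be done from the command line. A sketch, where the VM name, controller name and ISO path are assumptions:

```shell
# Add an IDE controller and attach the installer ISO to its optical drive
VBoxManage storagectl "Mint-18" --name IDE --add ide
VBoxManage storageattach "Mint-18" --storagectl IDE \
    --port 0 --device 0 --type dvddrive --medium /path/to/installer.iso
```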

VM power actions

Let us understand VM power actions in detail.
1) Power On: As the name suggests, this starts the VM in the state it was powered off or saved in. To start the VM, right-click on it and select the ‘Start’ option.
2) Pause: In this state, the guest releases the CPU but not the memory. As a result, the contents of the memory are preserved when the VM is resumed. To pause the VM, right-click on it and select the ‘Pause’ option.
3) Save: This action saves the current VM state and releases the CPU as well as the memory. The saved machine can be started again in the same state. To save the VM, right-click on it and select the ‘Close->Save State’ option.
4) Shutdown: This is a graceful turn-off operation, in which a shutdown signal is sent to the guest. To shut down the VM, right-click on it and select the ‘Close->ACPI Shutdown’ option.
5) Poweroff: This is a non-graceful turn-off operation, and can cause data loss. To power off the VM, right-click on it and select the ‘Close->Poweroff’ option.
6) Reset: The Reset option turns the VM off and back on abruptly. It is different from Restart, in which the guest is first turned off gracefully. To reset the VM, right-click on it and select the ‘Reset’ option.
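The same power actions are exposed through the VBoxManage CLI introduced later in this article. A sketch, assuming a VM named ‘Mint-18’:

```shell
VBoxManage controlvm "Mint-18" pause            # pause: memory retained
VBoxManage controlvm "Mint-18" resume           # resume a paused VM
VBoxManage controlvm "Mint-18" savestate        # save state, release CPU and memory
VBoxManage controlvm "Mint-18" acpipowerbutton  # graceful ACPI shutdown
VBoxManage controlvm "Mint-18" poweroff         # non-graceful power-off
VBoxManage controlvm "Mint-18" reset            # hard reset
```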

Figure 2: Creating a VM

Figure 3: Installing the OS

Removing the VM

Let us explore the steps we need to take to remove a VM. The remove operation can be broken up into two parts.
1) Unregister VM: This removes the VM from the library, i.e., it just unregisters the VM from VirtualBox so that it is no longer visible in VirtualBox Manager. To unregister a VM, right-click on it, select the ‘Remove’ option and click the ‘Remove Only’ option. You can re-register this VM by navigating to the ‘Machine->Add’ option in VirtualBox Manager.
2) Delete VM: This action deletes the VM permanently, removing its configuration files and virtual hard disks. Once performed, this action cannot be undone. To remove a VM permanently, right-click on it, select the ‘Remove’ option and click the ‘Delete all files’ option.

Figure 4: Starting the VM

VirtualBox—beyond the basics

Beginners will get a fair idea about virtualisation and VirtualBox by referring to the first few sections of this article. However, VirtualBox is a feature-rich product; this section describes its more advanced features.

Export appliance

We can export a VM as an appliance in the Open Virtualisation Format (OVF). The export comes in two formats.
1) OVF file format: In this format, several VM-related files are generated; for instance, there are separate files for the virtual hard disks, configuration and so on.
2) OVA file format: In this format, all VM-related files are archived into a single file with the .ova extension.
By leveraging this feature, we can create a golden image of a VM and deploy multiple instances of it. OVF is a platform-independent, efficient, extensible and open packaging and distribution format for VMs. As it is platform-independent, it allows OVF virtual machines exported from VirtualBox to be imported into VMware Workstation Player, and vice versa. To export a VM, perform the steps listed below:
1) Select a VM from VirtualBox Manager and navigate to the ‘File->Export Appliance’ option.
2) Select the VMs to be exported, and click the ‘Next’ button.
3) Provide the directory’s location and the OVF format version.
4) Provide the appliance settings and click the ‘Export’ button.

Import appliance

To import a VM, perform the steps given below:
1) Open VirtualBox Manager and navigate to the ‘File->Import Appliance’ option.
2) Select the virtual appliance and click on the ‘Next’ button.
3) Verify the appliance settings and click on the ‘Import’ button.
You will see that a new VM appears in VirtualBox Manager’s left pane.
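Export and import can also be driven from the command line. A sketch, where the VM and appliance file names are assumptions:

```shell
# Export the VM to a single-file OVA appliance
VBoxManage export "Mint-18" -o Mint-18.ova

# Import the appliance, registering it as a new VM
VBoxManage import Mint-18.ova
```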

Cloning a VM

VirtualBox also provides an option to clone existing VMs. As the name suggests, this creates an exact copy of a VM. It supports the following two types of clones.
1) Full clone: In this case, all the VM’s files are duplicated. As this is a totally separate copy, the cloned VM can easily be moved to another host.
2) Linked clone: In this case, the virtual hard disks are not copied; instead, a snapshot of the original VM is taken, and a new VM is created that refers to the virtual hard disks of the original. This is a space-efficient clone operation, but the downside is that the clone cannot be moved to another host, as the original and the clone share the same virtual hard disks.
To create a clone, perform the steps given below:
1) Select the VM from VirtualBox Manager, right-click it and select the ‘Clone’ option.
2) Provide the name of the clone VM and click the ‘Next’ button.
3) Select the clone type and click the ‘Clone’ button.
You will see that the new cloned VM appears in VirtualBox Manager’s left pane.
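A full clone can also be created on the command line. A sketch, where both VM names are assumptions:

```shell
# Create a full clone of the VM and register it with VirtualBox
VBoxManage clonevm "Mint-18" --name "Mint-18-clone" --register
```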


Group VMs

VirtualBox allows you to create groups of VMs, and to manage and perform actions on them as a single entity. You can perform various actions on a group, like expanding/shrinking it, renaming it, or Start, Stop, Reset and Pause actions on all the VMs in it. To create a VM group, perform the following steps:
1) Select multiple VMs from VirtualBox Manager, holding the ‘Ctrl’ key for multiple selection.
2) Right-click the selection and select the ‘Group’ option.
This will create a VM group called ‘New group’, as shown in Figure 5. If you right-click on the group, it will show various options like ‘Add VM to group’, ‘Rename group’, ‘Ungroup’, ‘Start’ and so on. To remove a VM from the group, just drag and drop that particular VM outside the group.

Snapshots

With snapshots, you can save a particular state of a VM for later use, and revert to that state at any point. To take a snapshot, perform the following steps:
1) Select a VM from VirtualBox Manager.
2) Click the ‘Machine Tools’ drop-down arrow on the toolbar and select the ‘Snapshots’ option.
3) Click the ‘Take’ button.
4) Enter the snapshot’s name and description before clicking on the ‘OK’ button.
Figure 6 depicts the above steps. This window provides various snapshot-related options like Delete, Clone, Restore and so on. Click on the ‘Properties’ button to see more details about the selected snapshot.
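Snapshots can likewise be taken and restored from the command line. A sketch, where the VM name, snapshot name and description are assumptions:

```shell
# Take a named snapshot of the VM
VBoxManage snapshot "Mint-18" take "clean-install" --description "Fresh OS install"

# Later, revert the VM to that snapshot (the VM should be powered off or saved)
VBoxManage snapshot "Mint-18" restore "clean-install"
```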

Shared folders

Shared folders enable data sharing between the guest and host OS. They require the VirtualBox guest additions to be installed inside the guest. This section describes the installation of guest additions along with the shared folder feature. To enable the shared folder feature, perform the following steps:
1) Start the VM from VirtualBox Manager.
2) Go to the ‘Devices->Insert Guest Additions CD image’ option and follow the on-screen instructions to install the guest additions. Figure 7 depicts the first two steps.
3) Navigate to ‘Devices->Shared Folders->Shared Folder Settings’.
4) Click the ‘Add new shared folder’ button. Enter the folder’s name and path, and select the permissions. Click the ‘OK’ button.
Figure 8 illustrates the above steps. You can mount the shared folder from the guest in the same way as an ordinary network share. Given below is the syntax for that:

mount -t vboxsf [-o OPTIONS] <sharename> <mountpoint>
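Putting the two halves together, a sketch of defining a share on the host and mounting it in the guest; the VM name, share name and paths are assumptions:

```shell
# On the host: define a shared folder for the VM
VBoxManage sharedfolder add "Mint-18" --name shared --hostpath /home/user/shared

# Inside the guest (guest additions installed): mount the share
sudo mkdir -p /mnt/shared
sudo mount -t vboxsf shared /mnt/shared
```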

Understanding virtual networking

This section delves deep into the aspects of VirtualBox’s networking and its supported network modes.



Figure 5: VM groups

Figure 7: Guest addition installation

Figure 6: Snapshot VM

Figure 8: Shared folder

The supported modes are Not Attached, NAT, bridged adapter, internal network and host-only adapter. Perform the steps given below to view or manipulate the current network settings:
1) Select the VM from VirtualBox Manager.
2) Click the ‘Settings’ button on the toolbar.
3) Select the ‘Network’ option from the left pane.
4) Select the adapter. The current networking mode will be displayed in the ‘Attached to’ drop-down box.
5) To change the mode, select the required network mode from the drop-down box and click the ‘OK’ button.
Figure 9 illustrates the above steps.

VirtualBox network modes

Let us discuss each network mode briefly.
1) Not Attached: In this mode, VirtualBox reports to the guest that a network card is installed but not connected. As a result, networking is not possible in this mode. Compared with a physical machine, it is as if the Ethernet card is present but no cable is connected to it.
2) NAT: This stands for Network Address Translation, and it is the default mode. If you want to access external networks from the guest, this will serve your purpose. It is similar to a physical system connected to an external network via a router.
3) Bridged adapter: In this mode, VirtualBox connects to one of your installed network cards and exchanges network packets directly, circumventing the host operating system’s network stack.
4) Internal: In this mode, communication is allowed between a selected group of VMs only. Communication with the host is not possible.
5) Host only: In this mode, communication is allowed between a selected group of VMs and the host. A physical Ethernet card is not required; instead, a virtual network interface (similar to a loopback interface) is created on the host.
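The network mode can also be switched from the command line with VBoxManage, introduced in the next section. A sketch, where the VM name and host interface are assumptions:

```shell
# Attach the VM's first network adapter to a bridged network on eth0
VBoxManage modifyvm "Mint-18" --nic1 bridged --bridgeadapter1 eth0

# Or switch it back to the default NAT mode
VBoxManage modifyvm "Mint-18" --nic1 nat
```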

An introduction to VBoxManage

VBoxManage is the command line interface (CLI) of VirtualBox. You can manage VirtualBox from your host via these commands. It supports all the features that are supported by the GUI. It gets installed by default when the VirtualBox package is installed. Let us look at some of its basic commands.



To turn on the VM

VBoxManage provides a simple command to start the VM. It accepts the VM name as an argument.

$ VBoxManage startvm Mint-18
Waiting for VM "Mint-18" to power on...
VM "Mint-18" has been successfully started.

To turn off the VM

The controlvm option supports various actions like pause, reset, power-off, shutdown and so on. To power off the VM, execute the command given below in a terminal. It accepts the VM name as an argument.

Figure 9: Network modes

To list VMs

Execute the command given below in a terminal to list all the registered VMs:

$ VBoxManage list vms
"Mint-18" {e54feffd-50ed-4880-8f81-b6deae19110d}
"VM-1" {37a25c9a-c6fb-4d08-a11e-234717261abc}
"VM-2" {03b39a35-1954-4778-a261-ceeddc677e65}
"VM-3" {875be4d5-3fbf-4d06-815d-6cecfb2c2304}

To list groups

We can also list VM groups using the following command:

$ VBoxManage list groups
"/" "/VM Group"

To show VM information

We can use the showvminfo command to display details about a VM. For instance, the command given below provides detailed information about the VM. It accepts the VM’s name as an argument.

$ VBoxManage showvminfo Mint-18
Name:            Mint-18
Groups:          /
Guest OS:        Ubuntu (64-bit)
UUID:            e54feffd-50ed-4880-8f81-b6deae19110d
Config file:     /home/groot/VirtualBox VMs/Mint-18/Mint-18.vbox
Snapshot folder: /home/groot/VirtualBox VMs/Mint-18/Snapshots
Log folder:      /home/groot/VirtualBox VMs/Mint-18/Logs
Hardware UUID:   e54feffd-50ed-4880-8f81-b6deae19110d
Memory size:     1024MB
Page Fusion:     off
VRAM size:       16MB

Note: The remaining output is not shown here, in order to save space.

$ VBoxManage controlvm "Mint-18" poweroff
0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%

To unregister VM

The command given below can be used to unregister a VM. It accepts the VM’s name as an argument.

$ VBoxManage unregistervm "Mint-18"

To register VM

The command given below can be used to register a VM. It accepts the VM’s file name as an argument.

$ VBoxManage registervm "/home/groot/VirtualBox VMs/Mint-18/Mint-18.vbox"

To delete VM

To delete a VM permanently, use the --delete option with the unregistervm command. For instance, the following command will delete the VM permanently:

$ VBoxManage unregistervm "VM-1" --delete
0%...10%...20%...30%...40%...50%...60%...70%...80%...90%...100%

VBoxManage provides many more commands, and covering them all is beyond the scope of this tutorial. You can always dig deeper into this topic by referring to VirtualBox’s official guide. To view all supported commands and their options, execute the following command in a terminal:

$ VBoxManage --help

By: Narendra K.
The author is a FOSS enthusiast. He can be reached at



DevOps Series

Deploying Graylog Using Ansible This 11th article in the DevOps series is a tutorial on installing Graylog software using Ansible.


Graylog is free and open source log management software that allows you to store and analyse all your logs from a central location. It requires MongoDB (a document-oriented NoSQL database) to store meta information and configuration information. The actual log messages are stored in Elasticsearch. It is written in the Java programming language and released under the GNU General Public License (GPL) v3.0. Access control management is built into the software, and you can create roles and user accounts with different permissions. If you already have an LDAP server, its user accounts can be used with the Graylog software. It also provides a REST API, which allows you to fetch data to build your own dashboards. You can create alerts to take actions based on the log messages, and also forward the log data to other output streams. In this article, we will install the Graylog software and its dependencies using Ansible.


An Ubuntu 16.04.3 LTS guest virtual machine (VM) instance, set up using KVM/QEMU, will be used to install Graylog. The host system is a Parabola GNU/Linux-libre x86_64 system. Ansible is installed on the host system using the distribution package manager. The version of Ansible used is:

$ ansible --version
ansible
  config file = /etc/ansible/ansible.cfg
  configured module search path = [u'/home/shakthi/.ansible/plugins/modules', u'/usr/share/ansible/plugins/modules']
  ansible python module location = /usr/lib/python2.7/site-packages/ansible
  executable location = /usr/bin/ansible
  python version = 2.7.14 (default, Sep 20 2017, 01:25:59) [GCC 7.2.0]

Add an entry to the /etc/hosts file for the guest ‘ubuntu’ VM as indicated below:

ubuntu

On the host system, let’s create a project directory structure to store the Ansible playbooks:

ansible/inventory/kvm/
       /playbooks/configuration/
       /playbooks/admin/

An ‘inventory’ file is created inside the inventory/kvm folder that contains the following:

ubuntu ansible_host= ansible_connection=ssh ansible_user=ubuntu ansible_password=password

You should be able to issue commands using Ansible to the guest OS. For example:

$ ansible -i inventory/kvm/inventory ubuntu -m ping
ubuntu | SUCCESS => {
    "changed": false,
    "failed": false,
    "ping": "pong"
}


The Graylog software has a few dependency packages that need to be installed as pre-requisites. The APT package repository is updated and upgraded before installing the pre-requisite software packages.

---
- name: Pre-requisites
  hosts: ubuntu
  become: yes
  become_method: sudo
  gather_facts: true
  tags: [prerequisite]

  tasks:
    - name: Update the software package repository
      apt:
        update_cache: yes

    - name: Update all the packages
      apt:
        upgrade: dist

    - name: Install pre-requisite packages
      package:
        name: "{{ item }}"
        state: latest
      with_items:
        - apt-transport-https
        - openjdk-8-jre-headless
        - uuid-runtime
        - pwgen

The above playbook can be invoked as follows:

$ ansible-playbook -i inventory/kvm/inventory playbooks/configuration/graylog.yml --tags prerequisite -K

The ‘-K’ option prompts for the sudo password for the ‘ubuntu’ user. You can append multiple ‘-v’ to the end of the playbook invocation to get a more verbose output.


Graylog uses MongoDB to store meta information and configuration changes. The MongoDB software package that ships with Ubuntu 16.04 is supported by the latest Graylog software. The Ansible playbook to install the same is as follows:

- name: Install Mongodb
  hosts: ubuntu
  become: yes
  become_method: sudo
  gather_facts: true
  tags: [mongodb]

  tasks:
    - name: Install MongoDB
      package:
        name: mongodb-server
        state: latest

    - name: Start the server
      service:
        name: mongodb
        state: started

    - wait_for:
        port: 27017

The Ubuntu software package for MongoDB is called ‘mongodb-server’. It is installed, and the database server is started. The Ansible playbook waits for the MongoDB server to start and listen on the default port 27017. The above playbook can be invoked using the following command:

$ ansible-playbook -i inventory/kvm/inventory playbooks/configuration/graylog.yml --tags mongodb -K
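To confirm that MongoDB is up on the guest, you can run a quick ad hoc check from the host. A sketch, assuming the same inventory file as above and that the ‘mongo’ client shipped with the mongodb-server package is on the guest’s path:

```shell
# Ask the MongoDB server on the guest for a ping via Ansible's shell module
ansible -i inventory/kvm/inventory ubuntu -m shell \
    -a "mongo --quiet --eval 'printjson(db.runCommand({ping: 1}))'"
```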


Elasticsearch is a search engine written in Java and released under the Apache licence. It is based on Lucene (an information retrieval software library) and provides a full-text search feature. The website provides .deb packages that can be used to install it on Ubuntu. The Ansible playbook for this is provided below:

- name: Install Elasticsearch
  hosts: ubuntu
  become: yes
  become_method: sudo
  gather_facts: true
  tags: [elastic]

  tasks:
    - name: Add key
      apt_key:
        url:
        state: present

    - name: Add elastic deb sources
      lineinfile:
        path: /etc/apt/sources.list.d/elastic-5.x.list
        create: yes
        line: 'deb apt stable main'

    - name: Update the software package repository
      apt:
        update_cache: yes

    - name: Install Elasticsearch
      package:
        name: elasticsearch
        state: latest

    - name: Update cluster name
      lineinfile:
        path: /etc/elasticsearch/elasticsearch.yml
        create: yes
        regexp: '^ my-application'
        line: ' graylog'

    - name: Daemon reload
      systemd: daemon_reload=yes

    - name: Start elasticsearch service
      service:
        name: elasticsearch.service
        state: started

    - wait_for:
        port: 9200

    - name: Test Curl query
      shell: curl -XGET 'localhost:9200/?pretty'

The stable repository sources are added before installing Elasticsearch. The cluster name is then updated in the /etc/elasticsearch/elasticsearch.yml configuration file. The system daemon services are reloaded, and the Elasticsearch service is started. The Ansible playbook waits for the service to run and listen on port 9200. The above playbook can be invoked as follows:

$ ansible-playbook -i inventory/kvm/inventory playbooks/configuration/graylog.yml --tags elastic -K

You can perform a manual query to verify that Elasticsearch is running, using the following Curl command:

$ curl -XGET 'localhost:9200/?pretty'
{
  "name" : "cFn-3YD",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "nuBTSlFBTk6PDGyrfDCr3A",
  "version" : {
    "number" : "5.6.5",
    "build_hash" : "6a37571",
    "build_date" : "2017-12-04T07:50:10.466Z",
    "build_snapshot" : false,
    "lucene_version" : "6.6.1"
  },
  "tagline" : "You Know, for Search"
}



The final step is to install Graylog itself. The repository .deb package available from the website is installed first, and then the actual ‘graylog-server’ package. The configuration file is updated with credentials for the ‘admin’ user, with a hashed string for the password ‘osfy’. The Web interface is also enabled with the default IP address of the guest VM. The Graylog service is finally started. The Ansible playbook to install Graylog is as follows:

- name: Install Graylog
  hosts: ubuntu
  become: yes
  become_method: sudo
  gather_facts: true
  tags: [graylog]

  tasks:
    - name: Install Graylog repo deb
      apt:
        deb: graylog-2.3-repository_latest.deb

    - name: Update the software package repository
      apt:
        update_cache: yes

    - name: Install Graylog
      package:
        name: graylog-server
        state: latest

    - name: Update database credentials in the file
      replace:
        dest: "/etc/graylog/server/server.conf"
        regexp: "{{ item.regexp }}"
        replace: "{{ item.replace }}"
      with_items:
        - { regexp: 'password_secret =', replace: 'password_secret = QXHg3EqvsuPmFxUY2aKlgimUF05plMPXQHy1stUiQ1uaxgIG27K3t2MviRiFLNot09U1akoT30njK3G69KIzqIoYqdY3oLUP' }
        - { regexp: '#root_username = admin', replace: 'root_username = admin' }
        - { regexp: 'root_password_sha2 =', replace: 'root_password_sha2 = eabb9bb2efa089223d4f54d55bf2333ebf04a29094bff00753536d7488629399' }
        - { regexp: '#web_enable = false', replace: 'web_enable = true' }
        - { regexp: '#web_listen_uri =', replace: "web_listen_uri = http://{{ ansible_default_ipv4.address }}:9000/" }
        - { regexp: 'rest_listen_uri = api/', replace: "rest_listen_uri = http://{{ ansible_default_ipv4.address }}:9000/api/" }

    - name: Start graylog service
      service:
        name: graylog-server.service
        state: started

The above playbook can be run using the following command:

$ ansible-playbook -i inventory/kvm/inventory playbooks/configuration/graylog.yml --tags graylog -K

Web interface

You can now open the URL in a browser on the host system to see the default Graylog login page, as shown in Figure 1.

Figure 1: Graylog login page

The user name is 'admin' and the password is 'osfy'. You will then be taken to the Graylog home page, as shown in Figure 2.

Figure 2: Graylog home page

The guest VM is a single node, so if you traverse to System -> Nodes, you will see this node's information, as illustrated in Figure 3.

Figure 3: Graylog node activated

You can now test the Graylog installation by adding a data source as input, by traversing to System -> Input in the Web interface. The 'random HTTP message generator' is used as a local input, as shown in Figure 4.

Figure 4: Random HTTP message generator

Figure 5: Graylog input random HTTP message generator

The newly created input source is now running and visible as a local input on the Web page, as shown in Figure 5. After a few minutes, you can observe the created messages via the Search link, as shown in Figure 6.

Uninstalling Graylog

Figure 6: Graylog random HTTP messages

An Ansible playbook to stop the different services, and to uninstall Graylog and its dependent software packages, is given below for reference:

---
- name: Uninstall Graylog
  hosts: ubuntu
  become: yes
  become_method: sudo
  gather_facts: true
  tags: [uninstall]

  tasks:
    - name: Stop the graylog service
      service:
        name: graylog-server.service
        state: stopped

    - name: Uninstall graylog server
      package:
        name: graylog-server
        state: absent

    - name: Stop the Elasticsearch server
      service:
        name: elasticsearch.service
        state: stopped

    - name: Uninstall Elasticsearch
      package:
        name: elasticsearch
        state: absent

    - name: Stop the MongoDB server
      service:
        name: mongodb
        state: stopped

    - name: Uninstall MongoDB
      package:
        name: mongodb-server
        state: absent

    - name: Uninstall pre-requisites
      package:
        name: "{{ item }}"
        state: absent
      with_items:
        - pwgen
        - uuid-runtime
        - openjdk-8-jre-headless
        - apt-transport-https

The above playbook can be invoked using:

$ ansible-playbook -i inventory/kvm/inventory playbooks/admin/uninstall-graylog.yml -K

By: Shakthi Kannan

The author is a free software enthusiast and blogs at



Analysing Big Data with Hadoop

Big Data is unwieldy because of its vast size, and needs tools to efficiently process it and extract meaningful results from it. Hadoop is an open source software framework and platform for storing, analysing and processing data. This article is a beginner's guide to how Hadoop can help in the analysis of Big Data.


Big Data is a term used to refer to a huge collection of data that comprises both structured data found in traditional databases and unstructured data like text documents, video and audio. Big Data is not merely data but also a collection of various tools, techniques, frameworks and platforms. Transport data, search data, stock exchange data, social media data, etc, all come under Big Data.

Technically, Big Data refers to a large set of data that can be analysed by means of computational techniques to draw patterns and reveal the common or recurring points that help predict the next step, especially human behaviour, like future consumer actions based on an analysis of past purchase patterns.

Big Data is not just about the volume of the data, but more about what people use it for. Many organisations, like business corporations and educational institutions, are using this data to analyse and predict the consequences of certain actions. After collecting the data, it can be used for several functions like:
- Cost reduction
- The development of new products
- Making faster and smarter decisions
- Detecting faults

Today, Big Data is used by almost all sectors, including banking, government, manufacturing, airlines and hospitality. There are many open source software frameworks for storing and managing data, and Hadoop is one of them. It has a huge capacity to store data, efficient data processing power and the capability to run countless jobs. It is a Java based programming framework, developed by Apache. Many organisations use Hadoop, including Amazon Web Services, Intel, Cloudera, Microsoft, MapR Technologies and Teradata.

The history of Hadoop

Doug Cutting and Mike Cafarella are two important people in the history of Hadoop. They wanted to invent a way to return Web search results faster by distributing the data and the calculations over several machines, so that several jobs could be performed at the same time. At that time, they were working on an open source search engine project called Nutch, while the Google search engine project was also in progress. Eventually, Nutch was divided into two parts; the part that dealt with the processing of data was named Hadoop, after the toy elephant that belonged to Cutting's son. Hadoop was released as an open source project in 2008 by Yahoo. Today, the Apache Software Foundation maintains the Hadoop ecosystem.

Prerequisites for using Hadoop

Linux based operating systems like Ubuntu or Debian are preferred for setting up Hadoop, and basic knowledge of Linux commands is helpful. Java plays an important role in Hadoop itself, but you can use your preferred language, such as Python or Perl, to write the map and reduce functions.

There are four main libraries in Hadoop.
1. Hadoop Common: This provides utilities used by all other modules in Hadoop.



2. Hadoop MapReduce: This works as a parallel framework for scheduling and processing the data.
3. Hadoop YARN: This is an acronym for Yet Another Resource Negotiator. It is the resource management layer introduced as an improvement over the original MapReduce engine, and it manages the processes running over Hadoop.
4. Hadoop Distributed File System (HDFS): This stores data and maintains records over various machines or clusters. It also allows the data to be stored in an accessible format.

HDFS writes data to the servers once, and reads it as many times as needed. When a query is raised, the NameNode manages all the DataNode slave nodes that serve the given query. Hadoop MapReduce performs all the assigned jobs sequentially; for better performance, Pig and Hive are often used instead of raw MapReduce. Other packages that can support Hadoop are listed below.
- Apache Oozie: A scheduling system that manages processes taking place in Hadoop.
- Apache Pig: A platform to run programs made on Hadoop.
- Cloudera Impala: A processing database for Hadoop, originally created by the software organisation Cloudera and later released as open source software.
- Apache HBase: A non-relational database for Hadoop.
- Apache Phoenix: A relational database layer based on Apache HBase.
- Apache Hive: A data warehouse used for summarisation, querying and the analysis of data.
- Apache Sqoop: Used to transfer data between Hadoop and structured data sources.
- Apache Flume: A tool used to move data to HDFS.
- Cassandra: A scalable, distributed database system.
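As noted above, the map and reduce steps can be written in languages such as Python (for example, via Hadoop Streaming). Conceptually, the two steps of a word count look like the sketch below; it runs the pipeline in-process rather than on a cluster, and the function names are illustrative, not part of any Hadoop API:

```python
from itertools import groupby

def mapper(lines):
    # Map step: emit a (word, 1) pair for every word seen
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

def reducer(pairs):
    # Hadoop sorts the map output by key; group by word and sum the counts
    for word, group in groupby(sorted(pairs), key=lambda kv: kv[0]):
        yield (word, sum(count for _, count in group))

counts = dict(reducer(mapper(["big data needs big tools"])))
print(counts)   # {'big': 2, 'data': 1, 'needs': 1, 'tools': 1}
```

On a real cluster, Hadoop runs many mapper and reducer instances in parallel and handles the sort-and-shuffle between them.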

The importance of Hadoop

Hadoop is capable of storing and processing large amounts of data of various kinds, and there is no need to preprocess the data before storing it. Hadoop is highly scalable, as it can store and distribute large data sets over several machines running in parallel. The framework is free and uses cost-efficient methods. Hadoop is used for:
- Machine learning
- Processing of text documents
- Image processing
- Processing of XML messages
- Web crawling
- Data analysis
- Analysis in the marketing field
- Study of statistical data

Challenges when using Hadoop

Hadoop does not provide easy tools for removing noise from data, so maintaining data quality is a challenge. It has data security issues, such as problems with encryption. Streaming jobs and batch jobs are not performed efficiently. MapReduce programming is inefficient for jobs that demand highly analytical processing. It is a distributed system with low-level APIs, and some of those APIs are not useful to developers.

But there are benefits, too. Hadoop supports many useful functions, like data warehousing, fraud detection and marketing campaign analysis, which help extract useful information from the collected data. Hadoop also has the ability to duplicate data automatically, so multiple copies of the data act as a backup to prevent data loss.

Frameworks similar to Hadoop

Any discussion on Big Data is never complete without a mention of Hadoop. But as with other technologies, a variety of frameworks similar to Hadoop have been developed. Other widely used frameworks are Ceph, Apache Storm, Apache Spark, DataTorrent RTS, Google BigQuery, Samza, Flink and Hydra.

MapReduce requires a lot of time to perform assigned tasks. Spark addresses this by doing in-memory processing of data, and Flink is another framework that works faster than both Hadoop and Spark. Hadoop is not efficient for real-time processing of data; Apache Spark uses stream processing, where continuous input and output of data happens, and Apache Flink provides a single runtime for both streaming and batch processing.

However, Hadoop remains the preferred platform for Big Data analytics because of its scalability, low cost and flexibility, and it offers an array of tools that data scientists need. Apache Hadoop with YARN transforms a large set of raw data into a feature matrix that is easily consumed, which makes machine learning workloads easier to run.


By: Jameer Babu

The author is a FOSS enthusiast and is interested in competitive programming and problem solving. He can be contacted at


Top Three Open Source Data Backup Tools

This article examines three open source data backup solutions that are the best among the many available.


Open source data backup software has become quite popular in recent times. One of the main reasons for this is that users have access to the code, which allows them to tweak the product. Open source tools are now being used in data centre environments because they are low cost and provide flexibility. Let's take a look at the three open source backup software packages that I consider the best. All three provide support for UNIX, Linux, Windows and Mac OS.


Amanda

This is one of the oldest open source backup software packages. It gets its name from the University of Maryland, where it was originally conceived: Amanda stands for the Advanced Maryland Automatic Network Disk Archiver.

Amanda is a scheduling, automation and tracking program wrapped around native backup tools like tar (for UNIX/Linux) and zip (for Windows). The database that tracks all backups allows you to restore any file from a previous version of that file that was backed up by Amanda. This reliance on native backup tools comes with advantages and disadvantages. The biggest advantage, of course, is that you will never have a problem reading an Amanda tape on any platform, since the formats Amanda uses are readily available on any open-systems platform. The biggest disadvantage is that some of these tools have limitations (e.g., on path length), and Amanda inherits those limitations.

On another level, Amanda is a sophisticated program with a number of enterprise-level features, like automatically determining when to run your full backups instead of having you schedule them. It is also the only open source package to have database agents for SQL Server, Exchange, SharePoint and Oracle, as well as the only backup package to have an agent for MySQL and Ingres. Amanda is now backed by Zmanda, and this company has put its development into overdrive. Just a few months after beginning operations, Zmanda addressed major limitations in the product that had hindered it for years. Since then, it has been responsible for the addition of a lot of functionality, including those database agents.

Figure 1: Selecting files and folders for file system backup

Bacula

Bacula was originally written by Kern Sibbald, who chose a very different path from Amanda by writing a custom backup format designed to overcome the limitations of the native tools. Sibbald's original goal was to write a tool that could take the place of the enterprise tools he saw in the data centre. Bacula also has scheduling, automation and tracking of all backups, allowing you to easily restore any file (or files) from a previous version. Like Amanda, it also has media management features that allow you to use automated tape libraries and perform disk-to-disk backups.

Figure 2: Bacula admin page

As of this writing, Bacula is only a file backup product and does not provide any database agents. You can shut a database down and back up its files, but this is not a viable backup method for some databases.

BackupPC

Both Amanda and Bacula feel and behave like conventional backup products: they support both disk and tape, schedule full and incremental backups, and store data in a 'backup format'. BackupPC, on the other hand, is a disk-only backup tool that forever performs incremental backups, and stores those backups in their native format in a snapshot-like tree structure that is available via a GUI. Like Bacula, it is a file-only backup tool, and its incremental nature might be hampered by backing up large database files. However, it is a really interesting alternative for file data. BackupPC's single most imposing feature is file-level de-duplication: if a file is duplicated anywhere in your environment, BackupPC will find the duplicate and replace it with a link to the original file.

Figure 3: BackupPC server status

Which one should you use?

Choosing a data backup tool depends entirely on your purpose. If you want the least proprietary backup format, go for BackupPC. If database agents are a big driver, choose Amanda. And if you want a product designed like a typical commercial backup application, opt for Bacula. One more important aspect: both BackupPC and Amanda need a Linux server to control the backups, whereas Bacula can use a Windows server for the same. All three products are very popular; which one you choose depends on what you need. The really nice thing about all three tools is that they can be downloaded free of cost, so you can decide which one suits you best after trying out all three.

By: Neetesh Mehrotra

The author works at TCS as a systems engineer, and his areas of interest are Java development and automation testing. For any queries, do contact him at



Getting Past the Hype Around Hadoop

The term Big Data and the name Hadoop are bandied about freely in computer circles. In this article, the author attempts to explain them in very simple terms.


Imagine this scenario: You have 1GB of data that you need to process. The data is stored in a relational database on your desktop computer, which has no problem managing the load. Your company soon starts growing very rapidly, and the data generated grows to 10GB, and then 100GB. You start to reach the limits of what your current desktop computer can handle. So what do you do? You scale up by investing in a larger computer, and you are then alright for a few more months. When your data grows from 1TB to 10TB, and then to 100TB, you are again quickly approaching the limits of that computer. Besides, you are now asked to feed your application with unstructured data coming from sources like Facebook, Twitter, RFID readers, sensors, and so on. Your managers want to derive information from both the relational data and the unstructured data, and they want it as soon as possible. What should you do? Hadoop may be the answer.

Hadoop is an open source project of the Apache Foundation. It is a framework written in Java, originally developed by Doug Cutting, who named it after his son's toy elephant. Hadoop uses Google's MapReduce technology as its foundation. It is optimised to handle massive quantities of data, which could be structured, unstructured or semi-structured, using commodity hardware, i.e., relatively inexpensive computers. This massive parallel processing is done with great efficiency. However, handling massive amounts of data is a batch operation, so the response time is not immediate. Importantly, Hadoop replicates its data across different computers, so that if one goes down, the data is processed on one of the replicated computers.

Big Data

Hadoop is used for Big Data. Now what exactly is Big Data? With all the devices available today to collect data, such as RFID readers, microphones, cameras, sensors and so on, we are seeing an explosion of data being collected worldwide.


Figure 1: High level architecture

Figure 2: Hadoop architecture

Big Data is a term used to describe large collections of data (also known as data sets) that may be unstructured, and that grow so large and so quickly that they are difficult to manage with regular database or statistical tools. In terms of numbers, what are we looking at? How big is Big Data? Well, there are more than 3.2 billion Internet users, and active cell phones have crossed the 7.6 billion mark; there are now more in-use cell phones than there are people on the planet (7.4 billion). Twitter processes 7TB of data every day, and 600TB of data is processed by Facebook daily. Interestingly, about 80 per cent of this data is unstructured. With this massive amount of data, businesses need fast, reliable, deeper data insight. Therefore, Big Data solutions based on Hadoop and other analytics software are becoming more and more relevant.

Open source projects related to Hadoop

Here is a list of some other open source projects related to Hadoop:
- Eclipse: a popular IDE donated by IBM to the open source community.
- Lucene: a text search engine library written in Java.
- HBase: the Hadoop database.
- Hive: data warehousing tools to extract, transform and load (ETL) data, and to query this data stored in Hadoop files.
- Pig: a high-level language that generates MapReduce code to analyse large data sets.
- Spark: a cluster computing framework.
- ZooKeeper: a centralised configuration service and naming registry for large distributed systems.
- Ambari: manages and monitors Hadoop clusters through an intuitive Web UI.
- Avro: a data serialisation system.
- UIMA: the architecture used for the analysis of unstructured data.
- YARN: a large scale operating system for Big Data applications.
- MapReduce: a software framework for easily writing applications that process vast amounts of data.

Hadoop architecture

Before we examine Hadoop's components and architecture, let's review some of the terms used in this discussion. A node is simply a computer, typically non-enterprise commodity hardware that contains data. We can keep adding nodes, such as Node 2, Node 3, and so on. A rack is a collection of 30 or 40 nodes that are physically stored close together and are all connected to the same network switch. A Hadoop cluster (or just a 'cluster' from now on) is a collection of racks.

Now, let's examine Hadoop's architecture. It has two major components:
1. The distributed file system component: The main example of this is the Hadoop distributed file system (HDFS), though other file systems, like IBM Spectrum Scale, are also supported.
2. The MapReduce component: This is a framework for performing calculations on the data in the distributed file system.

HDFS runs on top of the existing file systems on each node in a Hadoop cluster. It is designed to tolerate a high component failure rate through the replication of the data. A file on HDFS is split into multiple blocks, and each is replicated within the Hadoop cluster. A block on HDFS is a blob of data within the underlying file system (see Figure 1).

HDFS stores the application data and the file system metadata separately, on dedicated servers. NameNode and DataNode are the two critical components of the HDFS architecture. Application data is stored on servers referred to as DataNodes, and file system metadata is stored on servers referred to as NameNodes. HDFS replicates a file's contents on multiple DataNodes, based on the replication factor, to ensure the reliability of the data. The NameNode and DataNodes communicate with each other using TCP based protocols. The heart of the Hadoop distributed computation platform is the Java-based programming paradigm MapReduce.
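The block-and-replica scheme just described (files split into fixed-size blocks, each block copied to several DataNodes, with the NameNode holding only the metadata) can be sketched as a toy model. This is purely illustrative; it is not the real HDFS protocol, and the block size is shrunk from the usual 128MB for readability:

```python
import itertools

BLOCK_SIZE = 4     # bytes per block (real HDFS defaults to 128 MB)
REPLICATION = 3    # copies kept of each block

def put_file(name, data, datanodes, namenode):
    """Split `data` into blocks and place each block on REPLICATION DataNodes."""
    targets = itertools.cycle(range(len(datanodes)))
    block_map = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        block_id = f"{name}#{i // BLOCK_SIZE}"
        placed = []
        for _ in range(REPLICATION):
            node = next(targets)
            datanodes[node][block_id] = block   # DataNodes hold the bytes
            placed.append(node)
        block_map.append((block_id, placed))
    namenode[name] = block_map                  # NameNode holds only metadata

def read_file(name, datanodes, namenode):
    """Reassemble a file, reading each block from any replica that still has it."""
    out = b""
    for block_id, placed in namenode[name]:
        replica = next(n for n in placed if block_id in datanodes[n])
        out += datanodes[replica][block_id]
    return out

datanodes = [{} for _ in range(5)]
namenode = {}
put_file("demo.txt", b"hello hadoop!", datanodes, namenode)
datanodes[0].clear()   # simulate the failure of one DataNode
print(read_file("demo.txt", datanodes, namenode))   # b'hello hadoop!'
```

Even with one node wiped out, every block survives on at least one other replica, which is exactly the failure tolerance HDFS is designed for.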
Map and Reduce form a special type of directed acyclic graph that can be applied to a wide range of business use cases. The Map function transforms a piece of data into key-value pairs; the keys are then sorted, and a Reduce function is applied to merge the values (based on the key) into a single output.

Figure 3: Resource Manager and Node Manager

Resource Manager and Node Manager

The Resource Manager and the Node Manager form the data computation framework. The Resource Manager is the ultimate authority that arbitrates resources among all the applications in the system. The Node Manager is the per-machine framework agent that is responsible for containers, monitoring their resource usage (CPU, memory, disk and network) and reporting this data to the Resource Manager/Scheduler.
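This division of labour can be sketched as a toy model: Node Managers report their capacity, and the Resource Manager grants container requests only where resources remain. The class and method names below are illustrative, not part of the YARN API:

```python
class NodeManager:
    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb   # reported to the Resource Manager as node status

class ResourceManager:
    """Arbitrates container requests across the nodes that report to it."""
    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, mem_mb):
        # Grant the request on the first node with enough free memory
        for node in self.nodes:
            if node.free_mb >= mem_mb:
                node.free_mb -= mem_mb
                return node.name
        return None   # no node can satisfy a request of this size

rm = ResourceManager([NodeManager("node1", 2048), NodeManager("node2", 1024)])
print(rm.allocate(1536))   # node1
print(rm.allocate(1024))   # node2 (node1 has only 512 MB left)
print(rm.allocate(4096))   # None
```

The real Resource Manager scheduler is far richer (queues, locality preferences, fair or capacity sharing), but the arbitration role is the same.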

Why Hadoop?

The problem with a relational database management system (RDBMS) is that it cannot process semi-structured data; it can only work with structured data. The RDBMS architecture with the ER model is unable to deliver fast results through vertical scaling by adding CPU or more storage, and it becomes unreliable if the main server goes down. On the other hand, Hadoop effectively manages large-sized structured and unstructured data in different formats, such as XML, JSON and text, with high fault tolerance. With clusters of many servers scaling horizontally, Hadoop's performance is superior, and it provides faster results from Big Data and unstructured data.

What Hadoop can’t do

Hadoop is not suitable for online transaction processing workloads, where data is randomly accessed on structured data like a relational database. Nor is it suitable for online analytical processing or decision support system workloads, where data is sequentially accessed on structured data like a relational database to generate reports that provide business intelligence. Hadoop would also not be optimal for structured data sets that require very low latency, like a website served up by a MySQL database in a typical LAMP stack; that is a speed requirement that Hadoop would not serve well.


By: Neetesh Mehrotra

The author works at TCS as a systems engineer. His areas of interest are Java development and automation testing. He can be contacted at

OSFY Magazine Attractions During 2017-18

MONTH             THEME
March 2017        Open Source Firewall, Network Security and Monitoring
April 2017        Database Management and Optimisation
May 2017          Open Source Programming (Languages and Tools)
June 2017         Open Source and IoT
July 2017         Mobile App Development and Optimisation
August 2017       Docker and Containers
September 2017    Web and Desktop App Development
October 2017      Artificial Intelligence, Deep Learning and Machine Learning
November 2017     Open Source on Windows
December 2017     Big Data, Hadoop, PaaS, SaaS, IaaS and Cloud
January 2018      Data Security, Storage and Backup
February 2018     Best in the World of Open Source (Tools and Services)



Open Source Storage Solutions You Can Depend On

Storage space is at a premium, with petabytes and terabytes of data being generated almost daily by modern day living. Open source storage solutions can help mitigate the storage problems of individuals as well as small and large scale enterprises. Let's take a look at some of the best solutions and what they offer.


We have all been observing a sudden surge in the production of data in the recent past, and this will undoubtedly increase in the years ahead. Almost all the applications on our smartphones (like Facebook, Instagram, WhatsApp, Ola, etc) generate data in different forms, like text and images, or depend on data to work. With around 2.32 billion smartphone users across the globe (as per the latest data), each having installed multiple applications, this adds up to a really huge amount of data, daily. Apart from this, there are other sources of data as well, like different Web applications, sensors and actuators used in IoT devices, process automation plants, etc. All this creates a really big challenge: storing such massive amounts of data in a manner that lets them be used as and when needed.

We all know that our businesses cannot get by without storing data. Sooner or later, even small businesses need space for data storage: documents, presentations, e-mails, image graphics, audio files, databases, spreadsheets, etc, act as the lifeblood of most companies. Besides, many organisations hold confidential information that must not be leaked or accessed by just anyone, in which case security becomes one of the most important aspects of any data storage solution. In critical healthcare applications, an organisation cannot afford to run out of storage, so capacity needs to be monitored constantly. Storing different kinds of data and managing that storage is critical to any company's behind-the-scenes success.

When we look for a solution that covers all our storage needs, the possibilities seem quite endless, and many of them are likely to consume our precious IT budgets. This is why we cannot afford to overlook open source data storage solutions. Once you dive into the open source world, you will find a huge array of solutions for almost every problem or purpose, which includes storage as well.

Reasons for the growth in the data storage solutions segment

Let's check out some of the reasons for this growth:
1. Various recent government regulations, like Sarbanes-Oxley, require businesses to maintain and back up different types of data that they might otherwise have deleted.
2. Many small businesses have now started archiving e-mail messages, even those dating back five or more years, for various legal reasons.
3. The pervasiveness of spyware and viruses requires backups, which again demands more storage capacity.
4. There is a growing need to back up and store large media files, such as video and MP3, and make them available to users on a network, which again generates demand for large storage solutions.
5. Each newer version of a software application or operating system demands more space and memory than its predecessor, another reason driving the demand for large storage solutions.

Different types of storage options

There are different types of storage solutions that can be used, based on individual requirements, as listed below.

Flash memory thumb drives: These drives are particularly useful to mobile professionals, since they consume little power, are small enough to fit on a keychain and have almost no moving parts. You can connect a Flash memory thumb drive to your laptop's Universal Serial Bus (USB) port and back up files from the system. Some USB thumb drives also provide encryption, to protect files in case the drive is lost or stolen. Flash memory thumb drives also let us store Outlook data (like recent e-mails or calendar items), Internet Explorer bookmarks and even some desktop applications. That way, you can leave your laptop at home and just plug the USB drive into any borrowed computer to access your data elsewhere.

External hard drives: An inexpensive and relatively simple way to add more storage is to connect an external hard drive to your computer. External hard disk drives directly connected to PCs have several disadvantages, though. Any file stored only on the drive and nowhere else still needs to be backed up. Also, if you travel for work and need access to files on an external drive, you will have to take the drive with you, or remember to copy the required files to your laptop's internal drive, a USB thumb drive, a CD or other storage media. Finally, in case of a fire or another catastrophe at your place of business, your data will not be completely protected if it is stored only on an external hard drive.

Online storage: There are different services that provide remote storage and backup over the Internet, and they offer businesses a number of benefits. By backing up your most important files to a highly secure remote server, you are protecting the data stored at your place of business. You can also easily share large files with clients, partners or others by giving them password-protected access to your online storage service, eliminating the need to send those files by e-mail. And in most cases, you can log into your account from any system using a Web browser, which is a great way to retrieve files when you are away from your PC. Remote storage can be a bit slow, especially during an initial backup session, and is only as fast as your network's access to that storage. For extremely large files, you may require higher speed network access.

Network attached storage: Network attached storage (NAS) provides fast, reliable and simple access to data in any IP networking environment. Such solutions are quite suitable for small or mid-sized businesses that require large volumes of economical storage shared by multiple users over a network. Given that many small businesses lack IT departments, this storage solution is easy to deploy, and can be managed and consolidated centrally. A NAS device can be as simple as a single hard drive with an Ethernet port or built-in Wi-Fi connectivity. More sophisticated NAS solutions can provide additional USB and FireWire ports, enabling you to connect external hard drives to scale up the overall storage capacity. A NAS solution can also offer print-server capabilities, letting multiple users easily share a single printer. A NAS solution may also include multiple hard drives in a Redundant Array of Independent Disks (RAID) Level 1 array, which contains two or more equivalent hard drives (such as two 250GB drives) in a single network-connected device. Files written to the first (main) drive are automatically written to the second drive as well. This automated redundancy means that if the first hard drive dies, we still have access to all the applications and files on the second drive. Such solutions can also offload files being served by other servers on your network, which increases performance. A NAS system allows you to consolidate storage, increasing efficiency and reducing costs. It simplifies storage administration, data backup and recovery, and allows for easy scaling to meet growing storage needs.
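The mirrored-write behaviour of a RAID 1 array can be sketched as a toy model: every write goes to both members, and a read falls back to the surviving copy if one member fails. This is an illustration of the semantics only, not a real RAID driver:

```python
import os
import tempfile

class Mirror:
    """Write every file to two directories; read from whichever copy survives."""
    def __init__(self, drive_a, drive_b):
        self.drives = [drive_a, drive_b]
        for d in self.drives:
            os.makedirs(d, exist_ok=True)

    def write(self, name, data):
        # RAID 1: the same bytes go to both members of the array
        for d in self.drives:
            with open(os.path.join(d, name), "wb") as fh:
                fh.write(data)

    def read(self, name):
        # If one member has failed (file missing), fall back to the other
        for d in self.drives:
            path = os.path.join(d, name)
            if os.path.exists(path):
                with open(path, "rb") as fh:
                    return fh.read()
        raise FileNotFoundError(name)

drive_a, drive_b = tempfile.mkdtemp(), tempfile.mkdtemp()
mirror = Mirror(drive_a, drive_b)
mirror.write("doc.txt", b"payload")
os.remove(os.path.join(drive_a, "doc.txt"))   # simulate failure of the first drive
print(mirror.read("doc.txt"))                 # b'payload'
```

Real RAID 1 mirrors at the block level underneath the file system, but the redundancy guarantee is the same: the data survives the loss of one drive.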

Choosing the right storage solution

There are a number of storage solutions available in the market, which meet diverse requirements. At times, you could get confused while trying to choose the right one. Let’s get rid of that confusion by considering some of the important aspects of a storage solution. Scalability: This is one of the important factors to be considered while looking for any storage solution. In different distributed storage systems, storage capacity can be added in two ways. The first way involves adding disks | OPEN SOURCE FOR YOU | JANUARY 2018 | 53



Figure 1: Qualities of NAS solutions

Figure 2: Main services and components of OpenStack

or replacing the existing disks with ones that have higher storage capacity (also called 'scaling up'). The other method, adding nodes, is called 'scaling out'. Whenever you add hardware, you increase the whole system's performance as well as its capacity. Performance: One cannot afford to compromise on the performance of any storage solution, as this may directly impact the performance of the application that uses it. Flexible scalability allows users to increase capacity and performance independently, as per their needs and budget. Reliability: We all look for resources that can be relied upon for a long period of time, and this is the case even when searching for a storage solution. Affordability: Since budget and pricing are important, an open source storage solution is a good option because it is available free of cost. This is an important factor for small businesses that cannot afford to spend much on storage solutions alone. Availability: Sometimes, data stored in a storage solution is not available when being fetched by an application. This can occur because of a disk failure. We all want to avoid such circumstances, which may lead to unavailability of data; data should be easily available whenever it is accessed. Simplicity: Even the most advanced storage solutions come with management interfaces that are as good as or better than those of traditional storage units. All such interfaces show details about each node, capacity allocation, alerts, overall performance, etc. This is a significant factor to be considered while choosing a storage solution. Support: Last but not least, there should be support from the manufacturer or from a group of developers, including support for applications. Support is quite essential if you plan on installing your database, virtual server farm, email or other critical information on the storage solution. 
You must make sure that the manufacturer offers the level of support you require.

Some of the available open source storage solutions

Here's a glance at some of the good open source solutions available.


Figure 3: Architecture for the Ceph storage solution

OpenStack: OpenStack is basically a cloud operating system which controls large pools of compute, storage and networking resources throughout a data centre, all managed through a dashboard that gives administrators control while empowering users to provision resources through a Web interface. The OpenStack Object Storage service provides software that stores and retrieves data over HTTP. Objects (also referred to as blobs of data) are stored in an organisational hierarchy which offers anonymous read-only access, ACL-defined access, or even temporary access. This type of object storage supports multiple token-based authentication mechanisms implemented via middleware. Ceph: This is a distributed object store and file system designed to provide high performance, scalability and reliability. It is built on the Reliable Autonomic Distributed Object Store, and allows enterprises to build their own economical storage devices using commodity hardware. It has been maintained by Red Hat since its acquisition of InkTank in April 2014. It is capable of storing blocks, files and objects. It is scale-out, which means that multiple Ceph storage nodes are present in a single storage system that easily handles many petabytes of data, increasing performance and capacity simultaneously. Ceph has many of the basic enterprise storage features, including replication, thin provisioning, snapshots, auto-tiering and self-healing capabilities. RockStor: This is a free and open source NAS solution. The Personal Cloud Server present in it is a very powerful

local alternative to public cloud storage, which mitigates the cost and risks associated with public cloud storage. This network attached and cloud storage platform is quite suitable for small to medium businesses as well as home users who do not have much IT experience but need to scale up to terabytes of data storage. If users are more interested in Linux and Btrfs, it is a great alternative to FreeNAS. This cloud storage platform can be managed within a LAN or over the Web using a very simple and intuitive user interface. And with the inclusion of add-ons (named 'Rock-ons'), you can extend the feature set to include different new applications, servers and services. Kinetic Open Storage: Backed by companies like Seagate, EMC, Toshiba, Cisco, Red Hat, NetApp and Dell, Kinetic is a Linux Foundation project dedicated to establishing standards for a new kind of object storage architecture. It is designed especially to meet the need for scale-out storage of unstructured data. Kinetic is basically a way for storage applications to communicate directly with storage devices over Ethernet. Most of the storage use cases targeted by Kinetic involve unstructured data, as handled by Hadoop, NoSQL and other distributed file systems, as well as object stores in the cloud such as Amazon S3, Basho's Riak and OpenStack Swift. Storj DriveShare and MetaDisk: Storj is a new type of cloud storage which is built on peer-to-peer and blockchain technology. It offers decentralised and end-to-end encrypted cloud storage. The DriveShare application allows users to rent out their unused hard drive space so that it can be used by the service. The MetaDisk Web application allows users to save all their files to the service securely. The core protocol enables peer-to-peer negotiation and verification of storage contracts. Providers of storage are usually referred to as 'farmers' and those using the storage are called 'renters'. 
Renters can periodically audit in order to check if the farmers are still keeping their files secure and safe. Conversely, farmers can also decide to stop storing any specific file if its owners do not pay and audit their services on time. Different files are cut up into smaller pieces called ‘shards’ and then are



Figure 4: Ten-year data centre revenue forecast

stored three times redundantly, by default. The network can automatically determine a new farmer and can also move data if copies become unavailable. The system puts different measures in place to prevent renters and farmers from cheating on each other—for instance, by manipulating the auditing process. Storj offers several advantages over many traditional cloud based storage solutions. As data present here is encrypted and cut into shards at the source, there is almost no chance for any unauthorised third parties to access the data. And because data storage is distributed, the availability and download speed increases.
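The shard-and-audit workflow described above can be sketched in Python. This is an illustrative toy, not Storj's actual protocol: the function names are ours, and the challenge-hash audit shown is just one simple way a renter could verify that a farmer still holds a shard.

```python
import hashlib

REPLICATION = 3  # the default redundancy mentioned in the text

def make_shards(data: bytes, shard_size: int) -> list[bytes]:
    """Cut a blob into fixed-size shards (the last one may be shorter)."""
    return [data[i:i + shard_size] for i in range(0, len(data), shard_size)]

def replicate(shards: list[bytes]) -> list[list[bytes]]:
    """Store each shard redundantly on several (hypothetical) farmers."""
    return [[s] * REPLICATION for s in shards]

def audit(shard: bytes, challenge: bytes, expected: str) -> bool:
    """The renter asks the farmer for the hash of (challenge + shard) and
    compares it with a value precomputed before the shard was handed over."""
    return hashlib.sha256(challenge + shard).hexdigest() == expected

data = b"example file contents for a renter"
shards = make_shards(data, 8)
copies = replicate(shards)

# The renter precomputes the answer, then challenges the farmer later.
challenge = b"nonce-001"
expected = hashlib.sha256(challenge + shards[0]).hexdigest()
print(audit(copies[0][0], challenge, expected))  # an honest farmer passes
```

Because each challenge uses a fresh nonce, a farmer cannot cache old answers; it must actually retain the shard to keep passing audits.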


By: Vivek Ratan The author has completed his B.Tech in electronics and instrumentation engineering. He is currently working as an automation test engineer at Infosys, Pune and as a freelance educator at LearnerKul, Pune. He can be reached at




Build Your Own Cloud Storage System Using OSS

The real threats to stored data are breaches which, of late, have been affecting many cloud service providers. Security vulnerabilities that enable breaches result in a loss of millions of user credentials. In this article, we explore the prospects of setting up a personal data store or even a private cloud.


The European Organisation for Nuclear Research (CERN), a research collaboration of over 20 countries, has a unique problem: it has way more data than it is possible to store! We're talking about petabytes of data per year, where one petabyte equals a million gigabytes. There are entire departments of scientists working on a subject termed DAQ (Data Acquisition and Filtering), simply to filter out 95 per cent of the experiment-generated data and store only the useful 5 per cent. In fact, it has been estimated that data in the digital universe will amount to 40 zettabytes by 2020, which is about 5,000 gigabytes of data per person. With the recent spate of breaches affecting cloud service providers, setting up a personal data store or even a private cloud becomes an attractive prospect.

Data storage infrastructure is broadly classified into object-based storage, block storage and file systems, each with its own set of features.

Object-based storage

This construct manages data as objects instead of treating it as a hierarchy of files or blocks. Each object is associated with a unique identifier and comprises not only the data but also, in some cases, the metadata. This storage pattern seeks to enable capabilities such as application programmable interfaces, data management such as replication at object-scale, etc. It is often used to allow for the retention of massive amounts of data. Examples include the storage of photos, songs and files on a massive scale by Facebook, Spotify and Dropbox, respectively.
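As a concrete illustration, a minimal object store can be modelled as a flat map from a generated identifier to the data plus its metadata; there is no directory hierarchy. The class and method names here are invented for the sketch.

```python
import uuid

class ObjectStore:
    """A toy object store: each object gets a flat unique identifier and
    carries its metadata alongside the data, with no folder hierarchy."""

    def __init__(self):
        self._objects = {}

    def put(self, data: bytes, metadata: dict) -> str:
        object_id = str(uuid.uuid4())          # the unique identifier
        self._objects[object_id] = (data, metadata)
        return object_id

    def get(self, object_id: str) -> bytes:
        data, _ = self._objects[object_id]
        return data

    def metadata(self, object_id: str) -> dict:
        return self._objects[object_id][1]

store = ObjectStore()
oid = store.put(b"...image bytes...", {"content-type": "image/png", "owner": "alice"})
print(store.metadata(oid)["content-type"])  # image/png
```

Real object stores expose much the same put/get-by-identifier shape over HTTP, which is why they scale horizontally so easily: any node can serve any identifier.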



Figure 1: Hacked credentials

Figure 3: Selecting the boot partition

Figure 2: Object storage, file systems and block storage

Figure 4: FreeNAS GUI

Block storage

Data is stored as a sequence of bytes, termed a physical record. This so-called 'block' of data comprises a whole number of records. The process of putting data into blocks is termed blocking, while the reverse is called deblocking. Blocking is widely employed when storing data to certain types of magnetic tape, Flash memory and rotating media.

File systems

These data storage structures follow a hierarchy, which controls how data is stored and retrieved. In the absence of a file system, information would simply be a large body of data with no way to isolate individual pieces of information from the whole. A file system encapsulates the complete set of rules and logic used to manage sets of data. File systems can be used on a variety of storage media, most commonly hard disk drives (HDDs), magnetic tapes and optical discs.

Building open source storage software

Network Attached Storage (NAS) provides a stable and widely employed alternative for data storage and sharing across a network. It provides a centralised repository of data that can be accessed by different members within the organisation. Variations include complete software and hardware packages serving as out-of-the-box alternatives. These include software and file systems such as Gluster, Ceph, NAS4Free, FreeNAS, and others. As an example, we will look into the general steps involved in deploying such a system by taking the case of a popular representative of the set.
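The blocking and deblocking process described under 'Block storage' can be illustrated with a short Python sketch. It assumes fixed-size records so that each block holds a whole number of records; the function names are ours.

```python
def block(records: list[bytes], records_per_block: int) -> list[bytes]:
    """Blocking: pack a whole number of records into each physical block."""
    blocks = []
    for i in range(0, len(records), records_per_block):
        blocks.append(b"".join(records[i:i + records_per_block]))
    return blocks

def deblock(blocks: list[bytes], record_size: int) -> list[bytes]:
    """Deblocking: split the fixed-size records back out of each block."""
    records = []
    for blk in blocks:
        records.extend(blk[i:i + record_size]
                       for i in range(0, len(blk), record_size))
    return records

records = [b"rec1", b"rec2", b"rec3", b"rec4", b"rec5"]
blocks = block(records, 2)        # 3 blocks holding 2 + 2 + 1 records
assert deblock(blocks, 4) == records
```

Grouping records this way is what reduces per-record overhead on tape and rotating media: one seek and one inter-block gap are paid per block rather than per record.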


With enterprise-grade features, richly supported plugins, and an enterprise-ready ZFS file system, it is easy to see why FreeNAS is one of the most popular operating systems in the market for data storage. Let's take a deeper look at file systems since they are widely used in setting up storage networks today. Building your own data storage using FreeNAS involves a few simple steps:
1. Download the disk image suitable for your architecture and burn it onto either a USB stick or a CD-ROM, as per your preference.
2. Since you will be booting your new disk or machine with FreeNAS, open the BIOS settings on booting it, and set the boot preference to USB so that your system first tries to boot from the USB and, failing that, from other attached media.
3. Once you have created the storage media with the required software, boot up your system and install FreeNAS in the designated partition.
4. Having set the root password, when you boot into it after installation, you will have the option of using the Web GUI to log into the system. For some users, this may be much more intuitive than the console-based login.
5. Using the GUI or console, you can configure and manage your storage options depending on your application(s).

Figure 5: Configuring storage options

Figure 6: Setting up a private cloud

Figure 7: Editing the document root in the configuration files

Figure 8: Final configuration for ownCloud

Private cloud storage

Another recent trend is private cloud storage, prompted by the sudden reduction in free cloud storage offered by providers like Microsoft and Dropbox. Public clouds have multi-tenancy infrastructure and allow for great scalability and flexibility, abstracting away the complexities associated with deploying and maintaining hardware. For instance, one of the creators of Gluster recently came out with an open source project called Minio to provide this functionality to users. The service we will look at is ownCloud, a Dropbox alternative that offers similar functionality, with the advantage of being open source.
1. In order to build a private cloud, you require a server running an operating system such as Linux or Windows. ownCloud allows clients to be installed on such a Linux server.

2. While installing and running an Apache server on Linux, the upload_max_filesize and post_max_size PHP directives need to be raised above their default values (2MB).
3. The system is required to have MySQL, PHP (5.4+), Apache, GD and cURL installed before proceeding with the ownCloud installation. Further, a database must be created, with privileges granted to a new user.
4. Once the system is set up, proceed with downloading the ownCloud files and extract them to /var/www/ownCloud.
5. Change the Apache virtual host to point to this ownCloud directory by modifying the document root in /etc/apache2/sites-available/000-default.conf to /var/www/ownCloud.
6. Finally, type the IP address of the server into your browser and you should arrive at the login screen.
While there are trade-offs between cloud-based storage and traditional means of storage, the former is a highly flexible, simplified and secure model of data storage. And with providers offering more control over deployments, private clouds may well be the main file storage option in the near future!
By: Swapneel Mehta
The author has worked with Microsoft Research, CERN and startups in AI and cyber security. An open source enthusiast, he enjoys spending his time organising software development workshops for school and college students. You can connect with him at and find out more at



A Quick Look at Cloonix, the Network Simulator

Cloonix is a Linux router and host simulation platform. It fully encapsulates applications, hosts and the network. Simulators like Cloonix offer students and researchers scope for research into various Internet technologies like the Domain Name System (DNS).


Cloonix is a network simulator based on KVM or UML. It is basically a Linux router and host simulation platform. You can simulate a network with multiple reconfigurable VMs on a single PC. The VMs may be different Linux distributions. You can also monitor the network's activities through Wireshark. Cloonix can be installed on Arch, CentOS, Debian, Fedora, OpenSUSE and their derivative distros.
The main features of Cloonix are:
- GUI based network simulation tool
- KVM based VMs
- VMs and clients are Linux based
- A Spice server as the front-end for VMs
- Network activity monitoring via Wireshark
The system requirements are:
- 32/64-bit Linux OS (tested on Ubuntu 16.04 64-bit)
- Wireshark
- Cloonix package: cloonix-37-01.tar.gz
- VM images
To set it up, download the Cloonix package and extract it. I am assuming that Cloonix is extracted in the $HOME directory. The directory structure of Cloonix is as follows:

cloonix
├── allclean
├── build
├── cloonix
│   ├── client
│   ├── cloonix_cli
│   ├── cloonix_config
│   ├── cloonix_gui
│   ├── cloonix_net
│   ├── cloonix_ocp
│   ├── cloonix_osh
│   ├── cloonix_scp
│   ├── cloonix_ssh
│   ├── cloonix_zor
│   ├── common
│   ├── id_rsa
│   ├──
│   ├── LICENCE
│   └── server
├── doitall
├── install_cloonix
├── install_depends
├── pack
└── README

5 directories, 19 files



Figure 2: Dynamic DNS demo

Shown below are the demo scripts available:

Figure 1: Ping simulation demo

To install Cloonix, run the following commands, which will install all the packages required, except Wireshark:

$cd $HOME/cloonix
$sudo ./install_depends build

The following command will install and configure Cloonix in your system: $sudo ./doinstall

The command given below will install Wireshark: $sudo apt-get install wireshark

You have to download VMs into $HOME/cloonix_data/bulk, as shown below:

bulk
├── batman.qcow2
├── bind9.qcow2
├── centos-7.qcow2
├── coreos.qcow2
├── ethereum.qcow2
├── jessie.qcow2
├── mpls.qcow2
├── stretch.qcow2
└── zesty.qcow2

To simulate networks, you can download the ready-to-demo scripts available in stored/v-37-01/cloonix_demo_all.tar.gz.

├── batman
├── cisco
├── dns
├── dyn_dns
├── eap_802_1x
├── ethereum
├── fwmark2mpls
├── mpls
├── mplsflow
├── netem
├── ntp
├── olsr
├── openvswitch
├── ospf
├── ping
├── strongswan
└── unix2inet

To run any demo for ping, for instance, just go to the ping directory and run the following code: $./

This will create all the required VM(s) and network components. You can also monitor the traffic using Wireshark. Cloonix is a good tool for running network simulations. All the VMs are basically Linux VMs, which you can easily reconfigure.


By: Kousik Maiti The author is a senior technical officer at ICT & Services, CDAC, Kolkata. He has over ten years of industry experience, and his areas of interest include Linux system administration, mobile forensics and Big Data. He can be reached at


The Best Tools for Backing Up Enterprise Data

To prevent a disastrous loss of data, regular backups are not just recommended; they are essential. From the many open source tools available in the market for this purpose, this article helps systems administrators decide which one is best for their systems.


Before discussing the need for backup software, some knowledge of the brief history of storage is recommended. In 1953, IBM recognised the importance and immediate application of what it called the 'random access file', which it described as having high capacity with rapid random access to files. This led to the invention of what subsequently became the hard disk drive, created at IBM's San Jose, California laboratory. The disk drive added a new level to the computer data hierarchy, then termed random access storage but today known as secondary storage. The commercial use of hard disk drives began in 1957, with the shipment of an IBM 305 RAMAC system including IBM Model 350 disk storage, for which US Patent No. 3,503,060 was issued on March 24, 1970. The year 2016 marked the 60th anniversary of the venerable hard disk drive (HDD). Nowadays, new computers

are increasingly adopting SSDs (solid-state drives) for main storage, but HDDs still remain the champions of low cost and very high capacity data storage. The cost per GB of data has come down significantly over the years because of a number of innovations and advanced techniques developed in manufacturing HDDs. The graph in Figure 1 gives a glimpse of this. The general assumption is that this cost will be reduced further. Now, since storing data is not at all costly compared to what it was in the 1970s and '80s, why should one back up data when it is so cheap to buy new storage? What are the advantages of having a backup of data? Today, we are generating a lot of data by using various gadgets like mobiles, tablets, laptops, handheld computers, servers, etc. When we exceed the allowed storage capacity of these devices, we tend to push this data to the cloud or take a backup to guard against any future disastrous events. Many


Figure 1: Hard drive costs per GB of data (Source: http://www.mkomo.com/cost-per-gigabyte)

Figure 2: Ceph adoption rate

corporates and enterprise-level customers are generating huge volumes of data, and having backups is critical for them. After taking a backup, we also have to make sure that the data is secure and manageable, and that its integrity is not compromised. Keeping these aspects in mind, much open source backup software has been developed over the years. Data backup comes in different flavours: individual files and folders, whole drives or partitions, or full system backups. Nowadays, we also have the 'smart' method, which automatically backs up files in commonly used locations (syncing), and we have the option of using cloud storage. Backups can be scheduled, running as incremental, differential or full backups, as required. For organisations and large enterprises that are planning on selecting backup software tools and technologies, this article reviews the best open source tools. Before choosing, users should evaluate the features each provides, with reference to stability and open source community support. Advanced open source storage software like Ceph, Gluster, ZFS and Lustre can be integrated with some of the popular backup tools like Bareos, Bacula, AMANDA and CloneZilla; each of these is described in detail in the following sections.

Ceph

Ceph is one of the leading choices in open source software for storage and backup. Ceph provides object storage, block storage and file system storage features. It is very popular because of its CRUSH algorithm, which liberates storage clusters from the scalability and performance limitations imposed by centralised data table mapping. Ceph eliminates many tedious tasks for administrators by replicating and rebalancing data within the cluster, and delivers high performance and infinite scalability. Ceph also has RADOS (reliable autonomic distributed object store), which provides the object, block and file system storage described earlier in a single unified storage cluster. The Ceph RBD backup script, in its v0.1.1 release, creates the backup solution for Ceph.

This script helps in backing up Ceph pools. It was developed with backing up specified storage pools in mind, not only individual images; it also allows retention by date and implements a synthetic full backup schedule if needed. Many organisations are now moving towards large scale object storage and take backups regularly. Ceph is a strong solution here, as it provides object storage management along with state-of-the-art backup. It also integrates into private cloud solutions like OpenStack, which helps in managing backups of data in the cloud. The Ceph script can also archive data, remove all the old files and purge all snapshots; this triggers the creation of a new, full, initial snapshot. OpenStack has a built-in Ceph backup driver, which is an intelligent solution for VM volume backup and maintenance. This helps in taking regular and incremental backups of volumes to maintain consistency of data. Along with Ceph backup, one can use a tool called CloudBerry for versatile control over Ceph based backup and recovery mechanisms. Ceph also has good support from the community and from large organisations, many of which have adopted it for storage and backup management and in turn contribute back to the community. A lot of development and enhancement is happening on a continuous basis with Ceph. A number of research organisations have predicted that Ceph's adoption rate will increase in the future. Ceph also has certain cost advantages in comparison with other software products. More information about the Ceph RBD script can be found at


Gluster

Red Hat's Gluster is another open source software-defined scale-out backup and storage solution, also known as Red Hat Gluster Storage (RHGS). It helps in managing unstructured data for physical, virtual and cloud environments. The advantages of Gluster are its cost effectiveness and highly available storage that does not compromise on scale or performance.

Figure 3: Gluster Storage cost effectiveness: cost difference between Red Hat Gluster Storage and a competitive NAS storage system for 300TB (Source: IDC, 2016; https://redhatstorage.redhat.com/2016/11/03/idc-the-economics-of-software-defined-storage/)

RHGS has a great feature called 'snapshotting', which helps in taking point-in-time copies of Red Hat Gluster Storage server volumes. This helps administrators easily revert to previous states of data in case of any mishap. Some of the benefits of the snapshot feature are:
- Allows file and volume restoration with a point-in-time copy of Red Hat Gluster Storage volume(s)
- Has little to no impact on the user or applications, regardless of the size of the volume when snapshots are taken
- Supports up to 256 snapshots per volume, providing flexibility in data backup to meet production environment recovery point objectives
- Creates a read-only volume that is a point-in-time copy of the original volume, which users can use to recover files
- Allows administrators to create scripts to take snapshots of a supported number of volumes in a scheduled fashion
- Provides a restore feature that helps the administrator return to any previous point-in-time copy
- Allows the instant creation of a clone or a writable snapshot, which is a space-efficient clone that shares the back-end logical volume manager (LVM) with the snapshot
Bareos configured on GlusterFS has the advantage of being able to take incremental backups. One can create a 'glusterfind' session to remember the time when it was last synched or when processing was completed. For example, your backup application (Bareos) can run every day and get incremental results at each run. More details on the RHGS snapshot feature can be found at

The best open source backup software tools

AMANDA open source backup software

Amanda, or Advanced Maryland Automatic Network Disk Archiver, is a

Figure 4: AMANDA architecture

Figure 5: Bareos architecture

popular, enterprise-grade open source backup and recovery software. According to AMANDA's own disclosures, it runs on servers and desktop systems running Linux, UNIX, BSD, Mac OS X and MS Windows. AMANDA comes as both an enterprise edition and an open source edition (though the latter may need some customisation). The latest version of AMANDA Enterprise is release 3.3.5. It is one of the key backup software tools implemented in government, database, healthcare and cloud based organisations across the globe. AMANDA has a number of good features to tackle explosive data growth and provide high data availability, and it helps in managing what would otherwise require complex and expensive backup and recovery software products. Some of its advantages and features are:
- Centralised management for heterogeneous environments (involving multiple OSs and platforms)
- Powerful protection with simple administration
- Wide platform and application support

- Industry standard open source support and data formats
- Low cost of ownership
Industry standard open source support and data formats Low cost of ownership

Bareos (Backup Archiving Recovery Open Sourced)

Bareos offers high data security and reliability along with cross-network open source software for backups. Now being actively developed, it emerged from the Bacula Project in 2010. Bareos supports Linux/UNIX, Mac and Windows based OS platforms, along with both a Web GUI and CLI.


Clonezilla

Clonezilla is a partition and disk imaging/cloning program, similar to offerings available in the market like Norton Ghost and True Image. It has features like bare metal backup and recovery, and supports massive cloning with high efficiency in multi-cluster node environments. Clonezilla comes in two variants: Clonezilla Live and Clonezilla SE (Server Edition). Clonezilla Live is suitable for single machine backup and restore, while Clonezilla SE is meant for massive deployment, and can clone many (40 plus) computers simultaneously.


Duplicati

Designed to be used in a cloud computing environment, Duplicati is a client application for creating encrypted, incremental, compressed backups to be stored on a server.


It works with public clouds like Amazon, Google Drive and Rackspace, as well as private clouds and networked file servers. Operating systems that it is compatible with include Windows, Linux and Mac OS X.


FOG

Like Clonezilla, FOG is a disk imaging and cloning tool that can aid with both backup and deployment. It's easy to use, supports networks of all sizes, and includes other features like virus scanning, memory testing, disk wiping, disk testing and file recovery. Operating systems compatible with it include Linux and Windows.


By: Shashidhar Soppin The author is a senior architect with over 16 years of experience in the IT industry, and has expertise in virtualisation, cloud, Docker, open source, ML, deep learning and OpenStack. He is part of the PES team at Wipro. You can contact him at

Let’s Try


How to Identify Fumbling to Keep a Network Secure

Repeated systematic failed attempts by a host to access resources like a URL, an IP address or an email address are known as fumbling. Erroneous attempts to access resources by legitimate users must not be confused with fumbling. Let's look at how we can design an effective system for identifying network fumbling, to help keep our networks secure.


Network security implementation mainly depends on exploratory data analysis (EDA) and visualisation. EDA provides a mechanism to examine a data set without preconceived assumptions about the data and its behaviour. The behaviour of the Internet and of attackers is dynamic, and EDA is a continuous process that helps identify all the phenomena that are cause for alarm, and detect anomalies in access to resources. Fumbling is a general term for repeated systematic failed attempts by a host to access resources. For example, legitimate users of a service should have a valid email ID or user identification. So if there are numerous attempts by a user from a different location to target the users of this service with different email identifications, then there is a chance that this is an attack from that location. From the data analysis point of view, we say a fumbling condition has occurred. This indicates that the user does not have access to that system and is exploring different possibilities to break the

security of the target. It is the task of the security personnel to identify the pattern of the attack and the mistakes committed to differentiate them from innocent errors. Let’s now discuss a few examples to identify a fumbling condition. In a nutshell, fumbling is a type of Internet attack, which is characterised by failing to connect to one location with a systematic attack from one or more locations. After a brief discussion of this type of network intrusion, let’s consider a problem of network data analysis using R, which is a good choice as it provides powerful statistical data analysis tools together with a graphical visualisation opportunity for a better understanding of the data.

Fumbling of the network and services

In the case of TCP fumbling, a host fails to reach a target port of a host, whereas in the case of HTTP fumbling, hackers fail to access a target URL. Not all fumbling is a network attack, but most suspicious attacks appear as fumbling.


Let’s Try

The most common reason for fumbling is lookup failure, which happens mainly due to misaddressing, the movement of the host, or the non-existence of a resource. Other than this, automated searches of destination targets, and scanning of addresses and their ports, are possible causes of fumbling. Sometimes, to search for a target host, automated measures are taken to check whether the target is up and running. These types of failed attempts are generally mistaken for network attacks, though lookup failure happens either due to misconfiguration of DNS, a faulty redirection on the Web server, or an email with a wrong URL. Similarly, SMTP communication uses an automated network traffic control scheme for its destination address search. The most serious cause of fumbling is repeated scanning by attackers. Attackers scan the entire address-port combination matrix either in the vertical or in the horizontal direction. Generally, attackers explore horizontally, as they are most interested in exploring potential vulnerabilities. Vertical search is basically a defensive approach to identify an attack on an open port address. As an alternative to scanning, at times attackers use a hit-list to explore a vulnerable system. For example, to identify an SSH host, attackers may use a blind scan and then start a password attack.
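The horizontal/vertical distinction can be made mechanical. Below is a minimal sketch, assuming hypothetical failed-connection records rather than live capture data: a source touching many hosts on one port is labelled a horizontal scan, and one touching many ports on one host a vertical scan.

```python
from collections import defaultdict

# Hypothetical failed-connection records: (source, destination IP, destination port).
failed = [
    ("203.0.113.7", "192.168.1.10", 22),
    ("203.0.113.7", "192.168.1.11", 22),
    ("203.0.113.7", "192.168.1.12", 22),
    ("198.51.100.9", "192.168.1.10", 21),
    ("198.51.100.9", "192.168.1.10", 22),
    ("198.51.100.9", "192.168.1.10", 23),
]

def scan_direction(records):
    """Label each source as a horizontal scan (one port, many hosts)
    or a vertical scan (one host, many ports)."""
    hosts, ports = defaultdict(set), defaultdict(set)
    for src, dst, port in records:
        hosts[src].add(dst)
        ports[src].add(port)
    labels = {}
    for src in hosts:
        if len(hosts[src]) > len(ports[src]):
            labels[src] = "horizontal"
        elif len(ports[src]) > len(hosts[src]):
            labels[src] = "vertical"
        else:
            labels[src] = "unclear"
    return labels

labels = scan_direction(failed)
print(labels)
```

A real deployment would feed this from flow records rather than a hard-coded list; the thresholding here is deliberately simplistic.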

Identifying fumbling

Identifying malicious fumbling is not a trivial task, as it requires demarcating innocuous fumbling from the malevolent kind. Primarily, the task of assessing failed accesses to a resource is to identify whether the failure is consistent or transient. To explore TCP fumbling, look into all TCP communication flags, payload size and packet count. In TCP communication, the client sends an ACK flag only after receiving the SYN+ACK signal from the server. If a SYN is never followed by an ACK, then that indicates a fumbling. Another possible way to locate a malicious attack is to count the number of packets of a flow. A legitimate TCP flow requires at least three packets of overhead before it considers transmitting data. Most retries require three to five packets, and TCP flows having five packets or fewer are likely to be fumbles. Since, during a failed connection, the host sends the same SYN packet options repeatedly, a ratio of packet size to packet number is also a good measure for identifying TCP flow fumbling. ICMP informs a user about why a connection failed, so it is also possible to look into the ICMP response traffic to identify fumbling. If there is a sudden spike in messages originating from a router, then there is a good chance that a source is probing the router’s network. A proper forensic investigation can identify a possible attacking host. Since UDP does not follow a strict communication protocol like TCP, the easiest way to identify UDP fumbling is by exploring network mapping and ICMP traffic. Identifying service level fumbling is comparatively

easier than communication level fumbling, as in most cases exhaustive logs record each access and malfunction. For example, HTTP returns a three-digit status code 4xx for every client-side error. Among the different codes, 404 and 401 are the most common, for unavailability of resources and unauthorised access, respectively. Most 404 errors are innocuous, as they occur due to misconfiguration of the URL or the internal vulnerabilities of different services of the HTTP server. But if there is 404 scanning, then it may be malicious traffic, and there is a chance that attackers are trying to guess object names in order to reach a vulnerable target. Web server authentication is rarely used by modern Web servers, so on discovering any log entry with a 401 error, proper steps should be taken to block that source at the server. Another common service level vulnerability comes from the mail service protocol, SMTP. When a host sends a mail to a non-existent address, the server either rejects the mail or bounces it back to the source. Sometimes it also directs the mail to a catch-all account. In all three cases, the routing SMTP server keeps a record of the mail delivery status. But the main hurdle in identifying SMTP fumbling comes from spam. It’s hard to differentiate SMTP fumbling from spam, as spammers send mail to every conceivable address. SMTP fumblers also send mails to target addresses to verify whether an address exists, for possible scouting of the target.
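A first pass over an HTTP access log for such 4xx analysis is easy to script. The log lines below are hypothetical (Common Log Format style); a real deployment would read the Web server’s actual log file. Many distinct failing URLs from one client suggest 404 scanning rather than an innocent broken link.

```python
from collections import Counter

# Hypothetical access-log lines; a real system reads the HTTP server's log.
log = [
    '198.51.100.9 - - [27/Nov/2017:14:48:13] "GET /admin.php HTTP/1.1" 404 209',
    '198.51.100.9 - - [27/Nov/2017:14:48:14] "GET /backup.zip HTTP/1.1" 404 209',
    '198.51.100.9 - - [27/Nov/2017:14:48:15] "GET /secret/ HTTP/1.1" 401 188',
    '192.0.2.4 - - [27/Nov/2017:14:50:02] "GET /index.html HTTP/1.1" 200 5123',
]

# Count 404/401 responses per client address.
errors = Counter()
for line in log:
    parts = line.split('"')          # the request string sits between quotes
    status = parts[2].split()[0]     # first token after the request is the status
    if status in ("404", "401"):
        client = line.split()[0]
        errors[client] += 1

print(errors)
```

A client with a disproportionately high 404/401 count would then be examined against the server’s known URL space.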

Designing a fumbling identification system

From the above discussion, it is apparent that identifying fumbling is more subjective than objective. Designing a fumbling identification and alarm system requires in-depth knowledge of the network and its traffic pattern. There are several network tools, but here we will cover some basic system utilities so that readers can explore the infinite possibilities of designing network intrusion detection and prevention systems of their own. In order to separate malicious from innocuous fumbling, the analyst should mark the targets to determine whether the attackers are reaching the goal and exploring the target. This step reduces the bulk of data to a manageable state and makes the task easier. After fixing the target, it is necessary to examine the traffic to study the failure pattern. If it is TCP fumbling, as mentioned earlier, this can be detected by finding traffic without the ACK flag. In case of an HTTP scanning, examination of the HTTP server log table for 404 or 401 is done to find out the malicious fumbling. Similarly, the SMTP server log helps us to find out doubtful emails to identify the attacking hosts. If a scouting happens to a dark space of a network, then the chance of malicious attack is high. Similarly, if a scanner scans more than one port in a given time frame, the chance of intrusion is high. A malicious attack can be confirmed by examining the conversation between the attacker and the target. Suspicious conversations can be subsequent transfers


of files or communication using odd ports. Some statistical techniques are also available to find the expected number of hosts of a target network that would be explored by a user, or to compute the likelihood of a fumbling attack test that could either pass or fail.
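The TCP criteria described earlier (no completed handshake, and flows of five packets or fewer) can be combined into a small filter. The flow records below are hypothetical summaries, such as a flow collector might emit:

```python
# Hypothetical per-flow summaries: source, packet count, whether an ACK
# from the client ever completed the handshake.
flows = [
    {"src": "203.0.113.7", "packets": 3,  "ack_seen": False},
    {"src": "203.0.113.7", "packets": 4,  "ack_seen": False},
    {"src": "192.0.2.4",   "packets": 42, "ack_seen": True},
]

def is_probable_fumble(flow, max_packets=5):
    """Apply the two heuristics from the text: the handshake was never
    completed, and the flow stayed within the 3-5 packet retry range."""
    return (not flow["ack_seen"]) and flow["packets"] <= max_packets

suspects = [f["src"] for f in flows if is_probable_fumble(f)]
print(suspects)
```

In practice the per-source fumble counts would then be compared against a baseline for the network before raising an alarm.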

Capturing TCP flags

In a UNIX environment, the de facto packet-capturing tool is tcpdump. It is powerful as well as flexible. As a UNIX tool, a powerful shell script can also be applied over the outputs of tcpdump to produce a filtered report as desired. The underlying packet-capturing library of tcpdump is libpcap, and it provides the source and destination IP addresses, port and IP protocol over the target network interface for each network protocol. For example, to capture TCP SYN packets over the eth0 interface, we can use the following command:

$ tcpdump -i eth0 "tcp[tcpflags] & (tcp-syn) != 0" -nn -v

Similarly, TCP ACK packets can be captured by issuing the command given below:

$ tcpdump -i eth0 "tcp[tcpflags] & (tcp-ack) != 0" -nn -v

To have a combined capture report of SYN and ACK, both the flags can be combined as follows:

$ tcpdump -i eth0 "tcp[tcpflags] & (tcp-syn | tcp-ack) != 0" -nn -v

Getting network information

In this regard, netstat is a useful tool to get network connections, routing tables, interface statistics, masquerade connections, and multi-cast memberships. It provides a detailed view of the network to diagnose network problems. In our case, we can use this to identify ports that are listening. For example, to know about connections of HTTP and HTTPS traffic over TCP, we can use the following command expression with -l (to report sockets), -p (to report the relevant port) and -t (for only TCP) options:

$ netstat -tlp

Data analysis

Now, let’s discuss a network data analysis example on netstat command outcomes. This will help you to understand the network traffic to carry out intrusion detection and prevention. Let’s say we have a csv file from the netstat command, as shown below:

> rfi <- read.csv("rficsv.csv", header=TRUE, sep=",")

...where the dimension, columns and object class are:

> dim(rfi)
[1] 302 11
> names(rfi)
 [1] "ccadmin"  "pts.0"    "X"        "ipaddress" "Mon"     "Oct"
 [7] "X30"      "X17.25"   "X.1"      "still"     ""
> class(rfi)
[1] "data.frame"

To make the relevant column headings meaningful, the first and fourth column headings are changed to:

> colnames(rfi)[1]='user'
> colnames(rfi)[4]='ipaddress'

If we consider a few selective columns of the data frame, as shown here:

> c = c(colnames(rfi)[1],colnames(rfi)[2],colnames(rfi)[4],colnames(rfi)[5],colnames(rfi)[6],colnames(rfi)[7],colnames(rfi)[8])

...then the first ten rows can be displayed to have a view of the table structure, as shown below:

> x = rfi[, c, drop=F]
> head(x, 10)
       user  pts.0  ipaddress  Mon  Oct  X30  X17.25
1      root  pts/1             Mon  Oct   30   12:48
2   ccadmin  pts/0             Mon  Oct   30   12:30
3   ccadmin  pts/0             Wed  Oct   25   10:22
4      root  pts/1             Tue  Oct   24   11:54
5   ccadmin  pts/0             Tue  Oct   24   11:53
6  (unknown     :0        :0   Thu  Oct   12   12:57
7      root  pts/0             Thu  Oct   12   12:57

Figure 1: Histogram of IP addresses of netstat
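The tcpdump output itself can also be post-processed outside the shell. The sketch below counts bare SYN packets per source address; the two sample lines imitate the style of `tcpdump -nn` output (a real deployment would read from the tcpdump pipe instead, and the field layout here is simplified).

```python
import re
from collections import Counter

# Two hypothetical lines in the style of 'tcpdump -nn' output.
sample = """\
12:00:01.000000 IP 10.0.0.5.51514 > 192.168.1.9.22: Flags [S], seq 1, length 0
12:00:02.000000 IP 10.0.0.5.51515 > 192.168.1.9.23: Flags [S], seq 1, length 0
"""

# Count SYN-only packets per source address: a source that keeps sending
# bare SYNs to many ports is a fumbling candidate.
syn_sources = Counter()
for line in sample.splitlines():
    m = re.search(r"IP (\d+\.\d+\.\d+\.\d+)\.\d+ > \S+: Flags \[S\]", line)
    if m:
        syn_sources[m.group(1)] += 1

print(syn_sources)
```

The counts per source could then be merged with the R analysis that follows, or thresholded directly.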



8      root      :0        :0   Wed  Oct   11   12:56
9  (unknown      :0        :0   Wed  Oct   11   12:55
10   reboot  system  3.10.0-123.el7.x  Thu  Oct  12  12:37

The data shows that the data frame is not in a uniform table format, and the fields of the records are separated by a tab character. This requires some amount of filtering of the data in the table to extract the relevant rows for further processing. Since I will be demonstrating the distribution of IP addresses within a system, only the IP address and other related fields are kept for histogram plotting. To have a statistical evaluation of this data, it is worth removing all the irrelevant fields from the data frame:

> drops = c(colnames(rfi)[2],colnames(rfi)[3],colnames(rfi)[5],colnames(rfi)[6],colnames(rfi)[7],colnames(rfi)[8],colnames(rfi)[9],colnames(rfi)[10],colnames(rfi)[11])
> d = rfi[ , !(names(rfi) %in% drops)]

Then, for simplicity, extract all IP addresses attached to the user ‘ccadmin’ which start with ‘172’:

> u = d[like(d$user,'ccadmin') & like(d$ipaddress,'172'),]

Now the data is ready for analysis. The R summary command will show the count of elements of each field, whereas the count command will show the frequency distribution of the IP addresses, as shown below:

> summary(u)
      user
 ccadmin :34
 (unknown:  0
 backup_u:  0
 reboot  :  0
 root    :  0
 (Other) :  0

> count(u)
     user  ipaddress  freq
1 ccadmin               2
2 ccadmin               1
3 ccadmin               1
4 ccadmin               3
5 ccadmin              21
6 ccadmin               6

For better visualisation, this frequency distribution of the IP addresses can be depicted using a histogram, as follows:

> qplot(u$ipaddress, main='IP histogram', xlab='IP Address', ylab='Frequency')
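The grouping-and-counting step that R’s count command performs can be mirrored in plain Python as well. The sketch below uses hypothetical ccadmin/172.x records, since the real values come from the parsed netstat data:

```python
from collections import Counter

# Hypothetical login records: (user, IP address), mimicking the filtered
# data frame 'u' above (only 'ccadmin' users on 172.x addresses).
records = [
    ("ccadmin", "172.16.10.2"),
    ("ccadmin", "172.16.10.5"),
    ("ccadmin", "172.16.10.2"),
    ("ccadmin", "172.16.10.7"),
]

# Equivalent of R's count(u): frequency of each (user, ipaddress) pair.
freq = Counter(records)
for (user, ip), n in freq.most_common():
    print(user, ip, n)
```

The resulting frequencies could feed any plotting library in the same way qplot consumes the R data frame.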


By: Dipankar Ray The author is a member of IEEE and IET, and has more than 20 years of experience in open source versions of UNIX operating systems and Sun Solaris. He is presently working on data analysis and machine learning using a neural network as well as on different statistical tools. He has also jointly authored a textbook called ‘MATLAB for Engineering and Science’. He can be reached at





Use These Python Based Tools for Secured Backup and Recovery of Data Python, the versatile programming environment, has a variety of uses. This article will familiarise the reader with a few Python based tools that can be used for secured backup and recovery of data.


We keep data on portable hard disks, memory cards, USB Flash drives or other such similar media. Ensuring the long term preservation of this data with timely backup is very important. Many times, these memory drives get corrupted because of malicious programs or viruses; so they should be protected by using secure backup and recovery tools.

Popular tools for secured backup and recovery

For secured backup and recovery of data, it is always preferable to use performance-aware software tools and technologies, which can protect the data against any malicious or unauthenticated access. A few free and open source software tools which can be used for secured backup and recovery of data in multiple formats are: AMANDA, Bacula, Bareos, CloneZilla, Fog, Rsync, BURP, Duplicati, BackupPC, Mondo Rescue, GRSync, Areca Backup, etc.

Python as a high performance programming environment

Python is a widely used programming environment for almost every application domain including Big Data analytics, wireless networks, cloud computing, the Internet of Things (IoT), security tools, parallel computing, machine learning, knowledge discovery, deep learning, NoSQL databases and

many others. Python is a free and open source programming language which is equipped with in-built features of system programming, a high level programming environment and network compatibility. In addition, Python can be interfaced with any channel, whether it is live streaming on social media or real-time satellite feeds. A number of other programming languages have been developed under the influence of Python, including Boo, Cobra, Go, Groovy, Julia, OCaml, Swift, ECMAScript and CoffeeScript. There are other programming environments with the base code and programming paradigm of Python under development. Python is rich in maintaining the repository of packages for big applications and domains including image processing, text mining, systems administration, Web scraping, Big Data analysis, database applications, automation tools, networking, video processing, satellite imaging, multimedia and many others.

Python Package Index (PyPi): https://pypi.

The Python Package Index (PyPi), which is also known as the Cheese Shop, is the repository of Python packages for different software modules and plugins developed as add-ons to Python. Till September 2017, there were more


Let’s Try

than 117,000 packages for different functionalities and applications in PyPi. This escalated to 123,086 packages by November 30, 2017. The table in Figure 1 gives the statistics fetched from, which maintains data about modules, plugins and software tools.

Date                 Packages in PyPi
…                    123,008
November 30, 2017    123,086

Figure 1: Statistics of modules and packages in PyPi in the last week of November (Source:

Python based packages for secured backup and recovery

As Python has assorted tools and packages for diversified applications, security and backup tools with tremendous functionalities are also integrated in PyPi. Descriptions of Python based key tools that offer security and integrity during backup follow.


Rotate-Backups is a simplified command line tool that is used for backup rotation. It has multiple features including flexible rotations on particular timestamps and schedules. The installation process is quite simple. Give the following command: $ pip install rotate-backups

The usage is as follows (the table at the bottom of this page lists the options): $ rotate-backups [Options]

The rotation approach in Rotate-Backups can be customised as strict rotation (enforcement of the time window) or relaxed rotation (no enforcement of time windows). After installation, there are two files, ~/.rotate-backups.ini and /etc/rotate-backups.ini, which are used by default. This default setting can be changed using the command line option --config. The timeline and schedules of the backup can be specified in the configuration file as follows:

# /etc/rotate-backups.ini:

[/backups/mylaptop]
hourly = 24
daily = 7
weekly = 4
monthly = 12
yearly = always
ionice = idle

[/backups/myserver]
daily = 7 * 2
weekly = 4 * 2
monthly = 12 * 4
yearly = always
ionice = idle

[/backups/myregion]
daily = 7
weekly = 4
monthly = 2
ionice = idle

[/backups/myxbmc]
daily = 7
weekly = 4
monthly = 2



-M, --minutely=COUNT   Number of backups per minute
-H, --hourly=COUNT     Number of hourly backups
-d, --daily=COUNT      Number of daily backups
-w, --weekly=COUNT     Number of weekly backups
-m, --monthly=COUNT    Number of monthly backups
-y, --yearly=COUNT     Number of yearly backups
-I, --include=PATTERN  Only process backups matching the shell pattern
-x, --exclude=PATTERN  Do not process backups that match the shell pattern
-j, --parallel         One backup at a time, no parallel backup
-p, --prefer-recent    Ordering or preferences
-r, --relaxed          Relaxed rotation (no enforcement of the time window for each rotation scheme)
-i, --ionice=CLASS     Input-output scheduling and priorities
-c, --config=PATH      Configuration path
-u, --use-sudo         Enabling the use of ‘sudo’
-n, --dry-run          No changes, display the output
-v, --verbose          Increase logging verbosity
-q, --quiet            Decrease logging verbosity
-h, --help             Messages and documentation


ionice = idle
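A toy version of the retention logic behind such a configuration is sketched below, with hypothetical backup dates (Rotate-Backups itself derives timestamps from real backup file names, which is skipped here). It keeps a window of recent daily backups plus one backup per week for a few weeks.

```python
from datetime import date, timedelta

# Hypothetical daily backup dates: the last 30 days up to 2017-11-30.
backups = [date(2017, 11, 30) - timedelta(days=i) for i in range(30)]

def rotate(dates, daily=7, weekly=4):
    """Keep the newest `daily` backups plus one per ISO week for `weekly`
    weeks: a toy version of the daily/weekly rules in the .ini file above."""
    keep = set(sorted(dates, reverse=True)[:daily])
    seen_weeks = []
    for d in sorted(dates, reverse=True):
        wk = d.isocalendar()[:2]          # (ISO year, ISO week)
        if wk not in seen_weeks:
            seen_weeks.append(wk)
            if len(seen_weeks) <= weekly:  # newest backup of each recent week
                keep.add(d)
    return sorted(keep)

kept = rotate(backups)
print(len(kept), kept[0], kept[-1])
```

Everything not in the returned set would be a deletion candidate; the real tool adds hourly/monthly/yearly tiers and the strict/relaxed window semantics on top of this idea.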


Bakthat is a command line tool with the functionalities of cloud based backups. It has excellent features to compress, encrypt and upload files with a higher degree of integrity and security. Bakthat has many data backup and security features, including compression with tarfiles, encryption using beefish, uploading of data to S3 and Glacier, local backups to an SQLite database, sync based backups and many others. Installation is as follows: $ pip install bakthat

For source based installation, give the following commands: $ git clone $ cd bakthat $ sudo python setup.py install

For configuration with the options of security and cloud setup, give the command: $ bakthat configure

Usage is as follows: $ bakthat backup mydirectory

To set up a password, give the following command: $ BAKTHAT_PASSWORD=mysecuritypassword bakthat mybackup mydocument

You can restore the backup as follows:



$ bakthat restore mybackup
$ bakthat restore mybackup.tgz.enc

For backing up a single file, type:

$ bakthat backup /home/mylocation/myfile.txt

To back up to Glacier on the cloud, type:

$ bakthat backup myfile -d glacier

To disable the password prompt, give the following command:

$ bakthat mybackup mymyfile --prompt no

BorgBackup (or just Borg, for short) refers to a deduplicating backup tool developed in Python, which can be used in software frameworks or independently. It provides an effective way for secured backup and recovery of data. The key features of BorgBackup include the following:
ƒ Space efficiency
ƒ Higher speed and minimum delays
ƒ Data encryption using 256-bit AES
ƒ Dynamic compression
ƒ Off-site backups
ƒ Backups can be mounted as a file system
ƒ Compatible with multiple platforms

The commands that need to be given in different distributions to install BorgBackup are given below.

Distribution    Command
Ubuntu          sudo apt-get install borgbackup
Arch Linux      pacman -S borg
Debian          apt install borgbackup
Gentoo          emerge borgbackup
GNU Guix        guix package --install borg
Fedora          dnf install borgbackup
FreeBSD         cd /usr/ports/archivers/py-borgbackup && make install clean
Mageia          urpmi borgbackup
NetBSD          pkg_add py-borgbackup
OpenBSD         pkg_add borgbackup
OpenIndiana     pkg install borg
openSUSE        zypper in borgbackup
macOS           brew cask install borgbackup
Raspbian        apt install borgbackup

To initialise a new backup repository, use the following command:

$ borg init -e repokey /PathRepository

To create a backup archive, use the command given below:

$ borg create /PathRepository::Saturday1 ~/MyDocuments
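BorgBackup’s space efficiency comes from splitting data into chunks and storing each unique chunk only once. Below is a minimal, hypothetical illustration of that idea using fixed-size chunks (Borg actually uses content-defined chunking and a key-seeded hash, so this is only a sketch of the principle):

```python
import hashlib

def dedup_stats(data, chunk_size=4):
    """Split data into fixed-size chunks and count how many are unique;
    only the unique chunks would need to be stored."""
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    unique = {hashlib.sha256(c).hexdigest() for c in chunks}
    return len(chunks), len(unique)

# Highly repetitive input: most chunks are identical and deduplicate away.
total, unique = dedup_stats(b"ABCDABCDABCDXYZ!")
print(total, unique)
```

This is why, in the statistics output shown next for a repository, the deduplicated size can be far smaller than the original size.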

For another backup with deduplication, use the following code:

$ borg create -v --stats /path/to/repo::Saturday2 ~/Documents
---------------------------------------------------------
Archive name: MyArchive
Archive fingerprint: 612b7c35c...
Time (start): Sat, 2017-11-27 14:48:13
Time (end): Sat, 2017-11-27 14:48:14
Duration: 0.98 seconds
Number of files: 903
---------------------------------------------------------
               Original size  Compressed size  Deduplicated size
This archive:        6.85 MB          6.85 MB           30.79 kB
All archives:       13.69 MB         13.71 MB            6.88 MB

               Unique chunks  Total chunks
Chunk index:             167           330
---------------------------------------------------------

MongoDB Backup

In MongoDB NoSQL, the backup of databases and collections can be retrieved using MongoDB Backup without any issues of size. The connection to Port 27017 of MongoDB can be directly created for the backup of instances and clusters. Installation is as follows:

$ pip install mongodb-backup

The documentation and help files help keep track of the commands with the options that can be integrated with MongoDB Backup:

$ mongodbbackup --help

To take a backup of a single, standalone MongoDB instance, type:

$ mongodbbackup -p <port> --primary-ok <Backup-Directory>

To take a backup of a cluster, config server and shards, use the following command:

$ mongodbbackup --ms-url <MongoS-URL> -p <port> <Backup-Directory>

You can use any of these reliable packages available in Python to secure data and back it up, depending on the data that needs to be protected.

By: Dr Gaurav Kumar
The author is the MD of Magma Research and Consultancy Pvt Ltd, Ambala. He delivers expert lectures and conducts workshops on the latest technologies and tools. He can be contacted at His personal website is





Encrypting Partitions Using LUKS Sensitive data needs total protection. And there’s no better way of protecting your sensitive data than by encrypting it. This article is a tutorial on how to encrypt your laptop or server partitions using LUKS.


Sensitive data on mobile systems such as laptops can get compromised if they get lost, but this risk can be mitigated if the data is encrypted. Red Hat Linux supports partition encryption through the Linux Unified Key Setup (LUKS) on-disk-format technology. Encrypting partitions is easiest during installation, but LUKS can also be configured post installation.

Encryption during installation

When carrying out an interactive installation, tick the Encrypt checkbox while creating the partition to encrypt it. When this option is selected, the system will prompt users for a passphrase to be used for decrypting the partition. The passphrase needs to be manually entered every time the system boots. When performing automated installations, Kickstart can create encrypted partitions. Use the --encrypted and --passphrase option to encrypt each partition. For example,

the following line will encrypt the /home partition: # part /home --fstype=ext4 --size=10000 --onpart=vda2 --encrypted --passphrase=PASSPHRASE

Note that the passphrase, PASSPHRASE, is stored in the Kickstart profile in plain text, so this profile must be secured. Omitting the --passphrase= option will cause the installer to pause and ask for the passphrase during installation.

Encryption post installation

Listed below are the steps needed to create an encrypted volume:
1. Create either a physical disk partition or a new logical volume.
2. Encrypt the block device and designate a passphrase, by using the following command:



# cryptsetup luksFormat /dev/vdb1

Figure 1: An encrypted partition with an ext4 file system
Figure 2: The encrypted partition has been locked and verified

3. Unlock the encrypted volume and assign it a logical volume, as follows: # cryptsetup luksOpen /dev/vdb1 name

4. Create a file system in the decrypted volume, using the following command: # mkfs -t ext4 /dev/mapper/name

As shown in Figure 1, the partition has been encrypted and opened and, finally, a file system is associated with the partition. 5. Create a mount point for the file system, mount it, and then access its contents as follows: #mkdir /secret #mount /dev/mapper/name /secret

We can verify the mounted partition using the df -h command, as shown in Figure 2. 6. When finished, unmount the file system and then lock the encrypted volume, as follows:

To boot a server with an encrypted volume unattended, a file must be created with a LUKS key that will unlock the encrypted volume. This file must reside on an unencrypted file system on the disk. Of course, this presents a security risk if the file system is on the same disk as the encrypted volume, because theft of the disk would include the key needed to unlock the encrypted volume. Typically, the file with the key is stored on removable media such as a USB drive. Here are the steps to be taken to configure a system to persistently mount an encrypted volume without human intervention. 1. First, locate or generate a key file. This is typically created with random data on the server and kept on a separate storage device. The key file should take random input from /dev/urandom, and generate our output /root/ key.txt with a block size of 4096 bytes as a single count of random numbers. # dd if=/dev/urandom of=/root/key.txt bs=4096 count=1

Make sure it is owned by the root user and the mode is 600, as follows: # chmod 600 /root/key.txt

Add the key file for LUKS using the following command:

#umount /secret # cryptsetup luksAddKey /dev/vda1 /root/key.txt

Note: The directory should be unmounted before closing the LUKS partition. After the partition has been closed, it will be locked. This can be verified using the df -h command, as shown in Figure 2.

Provide the passphrase used to unlock the encrypted volume when prompted. 2. Create an /etc/crypttab entry for the volume. /etc/crypttab

# cryptsetup luksClose name

How to persistently mount encrypted partitions

If a LUKS partition is created during installation, normal system operation prompts the user for the LUKS passphrase at boot time. This is fine for a laptop, but not for servers that may need to be able to reboot, unattended. 74 | JANUARY 2018 | OPEN SOURCE FOR YOU |

Figure 3: A key file has been generated and added to the LUKS partition

contains a list of devices to be unlocked during system boot. It lists one device per line with the following space-separated fields:
ƒ The device mapper name used for the device
ƒ The underlying locked device
ƒ The absolute pathname to the password file used to unlock the device (if this field is left blank, or set to none, the user will be prompted for the encryption password during system boot)

# echo "name /dev/vdb1 /root/key.txt" >> /etc/crypttab

3. Create an entry in /etc/fstab as shown below. After making the entry in /etc/fstab, if we open the partition using the key file, the command will be:

# cryptsetup luksOpen /dev/vdb1 --key-file /root/key.txt name

As shown in the entry of the fstab file, if the device to be mounted is named, then the file system on which the encrypted partition should be permanently mounted is in the other entries. Also, no passphrase is asked for separately now, as we have supplied the key file, which has already been added to the partition. The partition can now be mounted using the mount -a command, after which the mounted partition can be verified upon reboot by using the df -h command. All the steps are clearly described in Figure 4.

Figure 4: Decryption of a persistent encrypted partition using the key file

Note: The device listed in the first field of /etc/fstab must match the name chosen for the local name to map in /etc/crypttab. This is a common configuration error.

Attaching a key file to the desired slot

LUKS offers a total of eight key slots for encrypted devices (0-7). If other keys or a passphrase exist, they can be used to open the partition. We can check all the available slots by using the luksDump command as shown below:

# cryptsetup luksDump /dev/vdb1

Figure 5: Available slots for an encrypted partition are shown

As can be seen in Figure 5, Slot0 and Slot1 are enabled. So the key file we have supplied manually, by default, moves to Slot1, which we can use for decrypting the partition. Slot0 carries the master key, which is supplied while creating the encrypted partition. Now we will add a key file to Slot3. For this, we have to generate a key file of random numbers by using the urandom command, after which we will add it to Slot3 as shown below. The passphrase of the encrypted partition must be supplied in order to add any secondary key to the encrypted volume.

# dd if=/dev/urandom of=/root/key2.txt bs=4096 count=1
# cryptsetup luksAddKey /dev/vdb1 --key-slot 3 /root/key2.txt

Figure 6: Secondary key file key2.txt has been added at Slot3

After adding the secondary key, again run the luksDump command to verify whether the key file has been added to Slot3 or not. As shown in Figure 7, the key file has been added to Slot3, as Slot2 remains disabled and Slot3 has been enabled with the key file supplied. Now Slot3 can also be used to decrypt the partition.

Figure 7: Slot3 enabled successfully
Figure 8: Passphrase has been changed
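The slot states reported by luksDump can also be checked programmatically. The sketch below parses a hypothetical LUKS1-style dump (real cryptsetup output varies between versions, so the sample text is an assumption, not a captured transcript):

```python
# Hypothetical excerpt of 'cryptsetup luksDump /dev/vdb1' output,
# imitating the LUKS1 "Key Slot N: ENABLED/DISABLED" lines.
sample_dump = """\
LUKS header information for /dev/vdb1
Key Slot 0: ENABLED
Key Slot 1: ENABLED
Key Slot 2: DISABLED
Key Slot 3: ENABLED
Key Slot 4: DISABLED
Key Slot 5: DISABLED
Key Slot 6: DISABLED
Key Slot 7: DISABLED
"""

# Collect the numbers of the enabled key slots.
enabled = [int(line.split(":")[0].split()[-1])
           for line in sample_dump.splitlines()
           if line.startswith("Key Slot") and line.endswith("ENABLED")]
print(enabled)
```

A monitoring script could compare this list against the expected slots and raise an alert if an unexpected key has been added.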

Restoring LUKS headers

Figure 9: Decrypting a partition with the passphrase supplied initially

For some commonly encountered LUKS issues, LUKS header backups can mean the difference between a simple administrative fix and permanently unrecoverable data. Therefore, administrators of LUKS encrypted volumes should engage in the good practice of routinely backing up their headers. In addition, they should be familiar with the procedures for restoring the headers from backup, should the need arise.

LUKS header backup

LUKS header backups are performed using the cryptsetup tool in conjunction with the luksHeaderBackup subcommand. The location of the header is specified with the --header-backup-file option. So by using the command given below we can create the backup of any LUKS header:

Figure 10: Header is restored from the backup file

# cryptsetup luksHeaderBackup /dev/vdb1 --header-backup-file /root/backup

As with all systems administration tasks, LUKS header backup should be done before every administrative task performed on a LUKS-encrypted volume. Should the LUKS header be corrupted, LUKS stores a metadata header and key slots at the beginning of each encrypted device. Thus, corruption of the LUKS header can render the encrypted data inaccessible. If a backup of the corrupted LUKS header exists, the issue can be resolved by restoring the header from this backup.

Testing and recovering LUKS headers

If an encrypted volume’s LUKS header has been backed up, the backups can be restored to the volume to overcome issues such as forgotten passwords or corrupted headers. If multiple backups exist for an encrypted volume, an administrator needs to identify the proper one to restore. The header can be restored using the following command: # cryptsetup luksHeaderRestore /dev/vdb1 --header-backup-file /root/backup

Now, let’s suppose someone has changed the password of

the encrypted partition /dev/vdb1 using luksChangeKey, but the new password is unknown. So the only option is to restore the header from the backup that we created above, so that we can decrypt the partition with the previous passphrase. The backup also helps when the admin forgets the passphrase. In Figure 8, a backup of /dev/vdb1 has been taken initially, and its passphrase has been subsequently changed by someone, without our knowledge. Before closing a partition, we have to unmount it. After closing the partition, trying to open it by using the previously set passphrase will throw the error ‘No key available with this passphrase’, because the passphrase has been changed by someone (Figure 9). But as the backup has already been taken by us, we just need to restore the LUKS header from the backup file which was created earlier. As shown in Figure 10, the header has been restored. Now we can open the partition with the passphrase that was set earlier. Therefore, it is always beneficial for administrators to create a backup of their header, so that they can restore it if somehow the existing header gets corrupted or a password is changed. By: Kshitij Upadhyay The author is RHCSA and RHCE certified, and loves to write about new technologies. He can be reached at



Sandya Mannarswamy

In this month’s column, we continue our discussion on detecting duplicate questions in community question answering forums.


Based on our readers' requests to take up a real-life ML/NLP problem with a sufficiently large data set, we had started on the problem of detecting duplicate questions in community question answering (CQA) forums using the Quora Question Pairs dataset. Let's first define our task as follows. Given a pair of questions <Q1, Q2>, the task is to identify whether Q2 is a duplicate of Q1, in the sense that satisfying the informational needs expressed in Q1 would also satisfy the informational needs of Q2. In simpler terms, Q1 and Q2 are duplicates from a lay person's perspective if both of them ask the same thing in different surface forms. An alternative definition is to consider Q1 and Q2 duplicates if the answer to Q1 also provides the answer to Q2. However, we will not use this second definition, since we are concerned only with analysing the informational needs expressed in the questions themselves and have no access to answer text. Therefore, let's define our task as a binary classification problem, where one of two labels (duplicate or non-duplicate) needs to be predicted for each given question pair, with the restriction that only the question text is available for the task and not the answer text.

As I pointed out in last month's column, a number of NLP problems are closely related to duplicate question detection, and the general consensus is that duplicate question detection can be solved as a by-product of these techniques. Detecting semantic text similarity and recognising textual entailment are the closest in nature to duplicate question detection. However, given that the goal of each of these problems is distinct from that of duplicate question detection, they fail to solve the latter adequately. Let me illustrate this with a few example question pairs.

Example 1
Q1a: What are the ways of investing in the share market?
Q1b: What are the ways of investing in the share market in India?

One of the state-of-the-art tools available online for detecting semantic text similarity is SEMILAR. A freely available state-of-the-art tool for entailment recognition is the Excitement Open Platform (EOP). SEMILAR gave a semantic similarity score of 0.95 for the above pair, whereas EOP reported it as textual entailment. However, these two questions have different information needs and hence are not duplicates of each other.

Example 2
Q2a: In which year did McEnroe beat Becker, who went on to become the youngest winner of the Wimbledon finals?
Q2b: In which year did Becker beat McEnroe and go on to become the youngest winner in the finals at Wimbledon?

SEMILAR reported a similarity score of 0.972, and EOP marked this question pair as entailment, indicating that Q2b is entailed by Q2a. Again, these two questions are about two entirely different events, and hence are not duplicates. We hypothesise that humans are quick to see the difference by extracting the relations being sought in the two questions. In Q2a, the relational event is <McEnroe (subject), beat (predicate), Becker (object)>, whereas in Q2b it is <Becker (subject), beat (predicate), McEnroe (object)>, a different relation from that in Q2a. By quickly scanning for a relational match/mismatch at the cross-sentence level, humans quickly mark this pair as non-duplicate,


Guest Column

even though there is considerable textual similarity across the text pair. It is also possible that the entailment system gets confused by sub-clauses being entailed across the two questions (namely, the clause "Becker went on to become the youngest winner"). This lends weight to our claim that while semantic similarity matching and textual entailment are problems closely related to the duplicate question detection task, they cannot be used directly as solutions to it. There are subtle but important differences in the relations of entities and in cross-sentence word level interactions which mark two sentences as non-duplicates when examined by humans. We can hypothesise that humans apply these additional checks on top of the coarse-grained similarity comparison they make when they look at the questions in isolation, and then decide whether the questions are duplicates or not. In the example of Q2a and Q2b, the relation between the entities in Q2a does not hold in Q2b; hence, if these cross-sentence level semantic relations are checked, it is possible to determine that this pair is not a duplicate.

It is also important to note that not all mismatches are equally important. Let us consider another example.

Example 3
Q3a: Do omega-3 fatty acids, normally available as fish oil supplements, help prevent cancer?
Q3b: Do omega-3 fatty acids help prevent cancer?

Though Q3b does not mention the fact that omega-3 fatty acids are typically available as fish oil supplements, its information needs are satisfied by the answer to Q3a, and hence these two questions are duplicates.
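The failure mode seen in Example 2, where a coarse similarity measure scores Q2a and Q2b as near-identical, can be reproduced with even the crudest bag-of-words model. The sketch below is illustrative only (a simple token-overlap cosine, not SEMILAR's actual algorithm), and the shortened question strings are paraphrases of Q2a and Q2b:

```python
import math
from collections import Counter

def bow_cosine(q1, q2):
    """Cosine similarity between bag-of-words vectors of two questions."""
    v1, v2 = Counter(q1.lower().split()), Counter(q2.lower().split())
    dot = sum(v1[w] * v2[w] for w in v1)
    norm = (math.sqrt(sum(c * c for c in v1.values()))
            * math.sqrt(sum(c * c for c in v2.values())))
    return dot / norm if norm else 0.0

# Shortened paraphrases of Q2a and Q2b: the same words, opposite events.
q2a = "in which year did mcenroe beat becker"
q2b = "in which year did becker beat mcenroe"

# The word multisets are identical, so the similarity is 1.0 even
# though the two questions ask about entirely different events.
print(round(bow_cosine(q2a, q2b), 3))
```

Any model that discards word order, as this one does, cannot distinguish who beat whom, which is exactly why relation-level checks are needed on top of a similarity score.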
From a human perspective, we hypothesise that the word fragment "normally available as fish oil supplements" is not seen as essential to the overall semantic compositional meaning of Q3a, so we quickly discard this information when refining the representation of the first question while doing a pass over the second. Also, we can hypothesise that humans use cross-sentence word level interactions to quickly check whether similar information needs are being met in the two questions.

Example 4
Q4a: How old was Becker when he won the first time at Wimbledon?
Q4b: What was Becker's age when he was crowned as the youngest winner at Wimbledon?

Though the surface forms of the two questions are quite dissimilar, humans tend to compare cross-sentence word level interactions such as <old, age> and <won, crowned> in the context of the entity in question, namely Becker, to conclude that these two questions are duplicates. Hence any system which attempts to solve the task of duplicate question detection should not depend blindly on a single aggregated coarse-grained similarity measure to compare the sentences,

but instead should consider the following:
• Do relations that exist in the first question hold true for the second question?
• Are there word level interactions across the two questions which cause them to have different informational needs (even if the rest of the question is pretty much identical across the two sentences)?

Now that we have a good idea of the requirements for a reasonable duplicate question detection system, let's look at how we can start implementing a solution. For the sake of simplicity, let us assume that our data set consists of single-sentence questions. Our system first needs to create a representation for each input sentence, and then feed the representations of the two questions to a classifier, which decides whether they are duplicates by comparing the representations. The high-level block diagram of such a system is shown in Figure 1.

First, we need to create an input representation for each question sentence. We have a number of choices for this module. As is common in most neural network based approaches, we use word embeddings to create a sentence representation. We can either use pre-trained word embeddings such as Word2Vec or GloVe embeddings, or train our own word embeddings using the training data as our corpus. For each word in a sentence, we look up its corresponding word embedding vector and form the sentence matrix: each question (sentence) is represented by a matrix whose rows are the word-embedding vectors of its words. We now need to convert this sentence-embedding matrix into a fixed-length input representation vector. One popular way of doing this is to create a sequence-to-sequence representation using recurrent neural networks.
Given a sequence of input words (this constitutes the sentence), we pass the sequence through a recurrent neural network (RNN) and create an output sequence. While the RNN generates an output for each input in the sequence, we are only interested in the final aggregated representation of the input sequence. Hence, we take the output of the last unit of the RNN and use it as our sentence representation. We can use vanilla RNNs, gated recurrent units (GRUs), or long short-term memory (LSTM) units to create a fixed-length representation from a given input sequence. Given that LSTMs have been used quite successfully in many NLP tasks, we decided to use LSTMs to create the fixed-length representation of each question. The last-stage output of each of the two LSTMs (one LSTM for each of the two questions) is that question's input representation. We then feed the two representations to a multi-layer perceptron (MLP) classifier. An MLP classifier is nothing but a fully connected multi-layer feed-forward neural network. Given that we have



a two-class prediction problem, the last stage of the MLP classifier is a two-unit softmax, the output of which gives the probabilities of the two output classes. This is shown in the overall block diagram in Figure 1.


[Figure 1: Block diagram for the duplicate question detection system. Question 1 and Question 2 each pass through a sentence representation (LSTM) module, and the two representations feed the MLP classifier.]
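To make the classifier stage concrete, here is a minimal forward pass of an MLP ending in a two-unit softmax, written in plain Python. The dimensions, weights and inputs are toy values invented purely for illustration; an actual system would implement this stage in a deep learning library, and the LSTM encoders are not modelled here.

```python
import math
import random

random.seed(0)

def softmax(z):
    """Convert raw scores into class probabilities that sum to 1."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]

def mlp_forward(x, w_hidden, w_out):
    """One hidden layer (tanh) followed by a two-unit softmax output.
    Biases are omitted for brevity."""
    h = [math.tanh(sum(wi * xi for wi, xi in zip(row, x))) for row in w_hidden]
    logits = [sum(wi * hi for wi, hi in zip(row, h)) for row in w_out]
    return softmax(logits)

# Toy setup: two 4-dimensional question representations (standing in for
# the LSTMs' final states), concatenated into one 8-dimensional input.
q1_repr = [0.1, -0.3, 0.7, 0.2]
q2_repr = [0.0, -0.2, 0.6, 0.3]
x = q1_repr + q2_repr

# Random toy weights: 5 hidden units, 2 output units.
w_hidden = [[random.uniform(-1, 1) for _ in range(8)] for _ in range(5)]
w_out = [[random.uniform(-1, 1) for _ in range(5)] for _ in range(2)]

probs = mlp_forward(x, w_hidden, w_out)
print(probs)  # probabilities for the two classes (non-duplicate, duplicate)
```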

Now that we have discussed the overall structure of our implementation, I request our readers to implement it using a deep learning library of their choice. I would recommend using TensorFlow, PyTorch or Keras. We will discuss the TensorFlow code for this problem in next month's column. Here are a few questions for our readers to consider in their implementation:
• How would you handle 'out of vocabulary' words in the test data? Basically, if there are words which do not have embeddings in Word2Vec/GloVe, or even in a corpus-trained embedding, how would you represent them?
• Given that question sentences can be of different lengths, how would you handle variable-length sentences?
• On what basis would you decide how many hidden layers should be present in the MLP classifier, and the number of hidden units in each layer?

I suggest that our readers (specifically those who have just started exploring ML and NLP) try implementing the solution and share the results in a Python Jupyter notebook. Please do send me the pointer to your notebook and we can discuss it later in this column. If you have any favourite software topics or programming questions that you would like to discuss on this forum, please send them to me, along with your solutions and feedback, at sandyasm_AT_yahoo_DOT_com. Wishing all our readers a very happy and prosperous new year!

By: Sandya Mannarswamy
The author is an expert in systems software and is currently working as a research scientist at Conduent Labs India (formerly Xerox India Research Centre). Her interests include compilers, programming languages, file systems and natural language processing. If you are preparing for systems software interviews, you may find it useful to visit Sandya's LinkedIn group 'Computer Science Interview Training (India)'.



Exploring Software

Anil Seth


Python is Still Special The author takes a good look at Python and discovers that he is as partial to it after years of using it, as when he first discovered it. He shares his reasons with readers.


It was in the year 2000 that I first came across Python, in Linux Journal, a magazine that is no longer published. I read about it in a review titled 'Why Python' by Eric Raymond. I loved the idea of a language that enforced indentation, for obvious reasons: it was a pain to keep requesting colleagues to indent their code, and IDEs were primitive then, not even as good as a simple text editor today. However, one of Raymond's statements stayed in my mind: "I was generating working code nearly as fast as I could type." It is hard to explain, but somehow the syntax of Python offers minimal resistance!

The significance of Python even today is underlined by the fact that Uber has just open sourced its AI tool Pyro, which aims at "deep universal probabilistic programming with Python and PyTorch" (https://eng. Mozilla's DeepSpeech open source speech recognition model also includes pre-built packages for Python.

Passing a function as a parameter

Years ago, after coding a number of forms, it was obvious that handling user interface forms required the same logic, except for validations. You could code a common validations routine, which used a form identifier to execute the required code. However, as the number of forms increased, this became a messy solution. The ability to pass a function as a parameter in Pascal simplified the code a lot. So, the fact that Python can do it as well is nothing special. However, examine the simple example that follows. There should be no difficulty in reading the code and understanding its intent.

>>> def add(x,y):
...     return x+y
...
>>> def prod(x,y):
...     return x*y
...
>>> def op(fn,x,y):
...     return fn(x,y)
...
>>> op(add,4,5)
9
>>> op(prod,4,5)
20

All too often, the method required is determined by the data. For example, a form-ID is used to call an appropriate validation method. This, in turn, results in a set of conditional statements which obscure the code. Consider the following illustration:

>>> def op2(fname,x,y):
...     fn = eval(fname)
...     return fn(x,y)
...
>>> op2('add',4,5)
9
>>> op2('prod',4,5)
20

The eval function allows you to convert a string into code. This eliminates the need for the conditional expressions discussed above. Now, consider the following addition:

>>> newfn = """def div(x,y):
...     return x/y"""
>>> exec(newfn)
>>> div(6,2)
3.0
>>> op(div,6,2)
3.0
>>> op2('div',6,2)
3.0
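As an aside not from the original column: eval executes arbitrary strings, so when the set of operations is known in advance, the same data-driven selection of a function is often written with a dispatch dictionary instead. A small sketch (the add and prod functions are repeated so the snippet is self-contained; op3 is a hypothetical name chosen to avoid clashing with op2 above):

```python
def add(x, y):
    return x + y

def prod(x, y):
    return x * y

# Map identifiers coming from data straight to functions: no eval,
# and an unknown name fails fast with a KeyError.
dispatch = {'add': add, 'prod': prod}

def op3(fname, x, y):
    return dispatch[fname](x, y)

print(op3('add', 4, 5))   # 9
print(op3('prod', 4, 5))  # 20
```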




In the example above, where the function div was added to the application at runtime, the emphasis is not on the fact that this can be done, but on the simplicity and readability of the code. A person does not even have to know Python to get an idea of what the code intends to do.

On the importance of which programming languages to teach, Prof. Dijkstra wrote, "It is not only the violin that shapes the violinist; we are all shaped by the tools we train ourselves to use, and in this respect, programming languages have a devious influence: they shape our thinking habits." (

It is not surprising that Python is widely used for AI. It is easy to integrate Python with C/C++, and it has a wide range of inbuilt libraries. But most of all, it is easy to experiment with new ideas and explore prototypes in Python. Prof. Dijkstra wrote, "If you want more effective programmers, you will discover that they should not waste their time debugging; they should not introduce bugs to start with." ( This is where Python appears to do well. It appears to allow you to program fairly complex algorithms concisely and retain readability. Hence, the likelihood of introducing bugs is minimised.

It is definitely a language all programmers should include in their toolkits.

By: Dr Anil Seth
The author has earned the right to do what interests him. You can find him online at http://sethanil. and reach him via email.




Developers Insight

Machines Learn in Many Different Ways This article gives the reader a bird’s eye view of machine learning models, and solves a use case through Sframes and Python.


'Data is the new oil'. This is not an empty expression doing the rounds within the tech industry. Nowadays, the strength of a company is also measured by the amount of data it has. Facebook and Google offer their services free in lieu of the vast amount of data they get from their users. These companies analyse the data to extract useful information. For instance, Amazon keeps suggesting products based on your buying trends, and Facebook always suggests friends and posts you might be interested in. Data in its raw form is like crude oil: you need to refine crude oil to make petrol and diesel, and similarly, you need to process data to get useful insights. This is where machine learning comes in handy. Machine learning has different models such as regression, classification, clustering and similarity, matrix factorisation, deep learning, etc. In this article, I will briefly describe these models and also solve a use case using Python.

Linear regression: Linear regression is a model for understanding the relationship between input and output numerical values. The representation is a linear equation that combines a specific set of input values (x), the solution to which is the predicted output for that set of input values. Fitting the model means estimating the values of the coefficients in the representation from the data we have available. For example, in a simple regression problem (a single x and a single y), the form of the model is:

y = B0 + B1*x

Using this model, the price of a house can be predicted based on the data available on nearby homes.

Classification model: The classification model helps identify, for example, the sentiments of a particular post. A user review can be classified as positive or negative based on the words used in the comments. Given one or more inputs, a classification model tries to predict the value of one or more outcomes. Outcomes are labels that can be applied to a data set. Emails can be categorised as spam or not, based on these models.

Clustering and similarity: This model helps when we are trying to find similar objects. For example, if I am interested in reading articles about football, this model will search for documents with certain high-priority words and suggest articles about football. It will also find articles on Messi or Ronaldo, as they are involved with football. TF-IDF (term frequency – inverse document frequency) is used in this model.

Deep learning: This is also known as deep structured learning or hierarchical learning. It is used for product recommendations and image comparison based on pixels.
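For the simple regression case above, the coefficients B0 and B1 can be estimated directly with the closed-form ordinary least squares formulas. The house sizes and prices below are made-up toy data, used only to show the mechanics:

```python
# Ordinary least-squares fit of y = B0 + B1*x.
sizes = [1000, 1500, 2000, 2500]   # square feet (toy data)
prices = [200, 300, 400, 500]      # price in thousands (toy data)

n = len(sizes)
mean_x = sum(sizes) / n
mean_y = sum(prices) / n

# B1 = covariance(x, y) / variance(x); B0 = mean_y - B1 * mean_x
b1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(sizes, prices)) / \
     sum((x - mean_x) ** 2 for x in sizes)
b0 = mean_y - b1 * mean_x

print(b0, b1)          # intercept and slope
print(b0 + b1 * 1800)  # predicted price for a 1800 sq ft house
```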

Now, let's explore the concept of clustering and similarity, and try to find documents of interest to us. Let's assume we have read and liked an article on soccer, and would like to retrieve another article that we may be interested in reading. How do we do this? There are lots and lots of articles out there that we may or may not be interested in, so we need a mechanism that suggests articles that interest us. One way is to take a word count of the article and suggest articles that have the highest number of similar words. But there is a problem with this model: document lengths vary widely, and unrelated documents can also be fetched simply because they share many common words. For example, articles on football players' lives may also get suggested, which we are not interested in. To solve this, the TF-IDF model comes in: words are weighted by how informative they are, and these weights are used to find related articles.

Let's get hands-on with document retrieval. The first thing you need to do is install GraphLab Create, on which the Python commands can be run. GraphLab Create can be downloaded by filling in a simple form, which asks for a few details such as your name, email ID, etc. GraphLab Create includes the IPython notebook, which is used to write the Python commands. The IPython notebook is similar to any other notebook, with the advantage that it can display graphs on its console. Open the IPython notebook, which runs in the browser at http://localhost:8888/. Import GraphLab using the Python command:

Figure 1: The people data loaded in Sframes

Figure 2: Data generated for the Obama article

import graphlab

Next, import the data into an SFrame using the following command:

people = graphlab.SFrame('')

To view the data, use the command:

people.head()

This displays the top few rows in the console. The data includes the URL, the person's name and the text of their Wikipedia entry. I will now list some of the Python commands that can be used to search for articles related to former US President Barack Obama.

1. To explore the entry for Obama, use the command:

obama = people[people['name'] == 'Barack Obama']

2. Now, sort the word counts for the Obama article. To turn the dictionary of word counts into a table, give the

Figure 3: Sorting the word count

following command:

obama_word_count_table = obama[['word_count']].stack('word_count', new_column_name = ['word','count'])

3. To sort the word counts to show the most common words at the top, type:

obama_word_count_table.head()


Figure 4: Compute TF-IDF for the corpus

4. Next, compute the TF-IDF for the corpus. To give more weight to informative words, we evaluate them based on their TF-IDF scores, as follows:

people['word_count'] = graphlab.text_analytics.count_words(people['text'])
people['tfidf'] = graphlab.text_analytics.tf_idf(people['word_count'])
people.head()

5. To examine the TF-IDF for the Obama article, give the following commands:

obama = people[people['name'] == 'Barack Obama']
obama[['tfidf']].stack('tfidf', new_column_name=['word','tfidf']).sort('tfidf', ascending=False)

Figure 5: TF-IDF for the Obama article

Words with the highest TF-IDF are much more informative. The TF-IDF of the Obama article brings up related terms, like Iraq, Control, etc. Machine learning is not a new technology; it has been around for years, but is gaining popularity only now as many companies have started using it.

By: Ashish Sinha
The author is a software engineer based in Bengaluru. A software enthusiast at heart, he is passionate about using open source technology and sharing it with the world. He can be reached at Twitter handle: @sinha_tweet.
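If GraphLab Create is not available, the TF-IDF weighting used above can be computed with the Python standard library alone. The three-document corpus below is made up purely for illustration:

```python
import math
from collections import Counter

# A toy three-document corpus (invented text, purely for illustration).
docs = [
    "obama was president of the united states",
    "messi plays football for argentina",
    "ronaldo plays football and messi plays football",
]

tokenised = [d.split() for d in docs]
n_docs = len(tokenised)

# Document frequency: in how many documents does each word appear?
df = Counter()
for words in tokenised:
    for w in set(words):
        df[w] += 1

def tf_idf(words):
    """TF-IDF per word: term count scaled by log(N / document frequency).
    Words appearing in many documents are down-weighted as uninformative."""
    tf = Counter(words)
    return {w: tf[w] * math.log(n_docs / df[w]) for w in tf}

scores = tf_idf(tokenised[2])
# 'football' occurs in two documents, so it is down-weighted relative
# to 'ronaldo', which occurs only in this one.
print(sorted(scores, key=scores.get, reverse=True))
```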





Regular Expressions in Programming Languages: Java for You This is the sixth and final part of a series of articles on regular expressions in programming languages. In this article, we will discuss the use of regular expressions in Java, a very powerful programming language.


Java is an object-oriented, general-purpose programming language. Java applications are initially compiled to bytecode, which can then be run on a Java virtual machine (JVM), independent of the underlying computer architecture. According to Wikipedia, "A Java virtual machine is an abstract computing machine that enables a computer to run a Java program." Don't get confused by this complicated definition; just think of the JVM as software capable of running Java bytecode, acting as an interpreter for it. This is why Java is often called a compiled and interpreted language. The development of Java, initially called Oak, was begun in 1991 by James Gosling, Mike Sheridan and Patrick Naughton. The first public implementation of Java was released as Java 1.0 in 1996 by Sun Microsystems. Currently, Oracle Corporation owns Sun Microsystems. Unlike many other programming languages, Java has a mascot called Duke (Figure 1).

Figure 1: Duke – the mascot of Java

As with previous articles in this series, I really wanted to begin with a brief discussion about the history of Java, describing its different platforms and versions. But here I am at a loss. The availability of a large number of Java platforms and the complicated version numbering scheme followed by Sun Microsystems make such a discussion difficult. For example, in order to explain terms like Java 2, Java SE, Core Java, JDK, Java EE, etc, in detail, a series of articles might be required. Such a discussion about the history of Java might be a worthy pursuit for another time, but definitely not for this article. So, all I am going to do is explain a few key points regarding the various Java implementations.

First of all, Java Card, Java ME (Micro Edition), Java SE (Standard Edition) and Java EE (Enterprise Edition) are all different Java platforms that target different classes of devices and application domains. For example, Java SE is customised for general-purpose use on desktop PCs, servers and similar devices. Another important question that requires an answer is, 'What is the difference between Java SE and Java 2?' Books like 'Learn Java 2 in 48 Hours' or 'Learn Java SE in Two Days' can confuse beginners a lot while they are making a choice. In a nutshell, there is no difference between the two. All this confusion arises from the complicated naming convention followed by Sun Microsystems.

The December 1998 release of Java was called Java 2, and the version name J2SE 1.2 was given to JDK 1.2 to distinguish it from the other Java platforms. Again, J2SE 1.5 (JDK 1.5) was renamed J2SE 5.0, and later Java SE 5, citing the maturity of J2SE over the years as the reason for the name change. The latest version of Java is Java SE 9, which was released in September 2017. But actually, when you say Java 9, you mean JDK 1.9. So, keep in mind that Java SE was formerly known as Java 2 Platform, Standard Edition, or J2SE.

The Java Development Kit (JDK) is an implementation of one of the Java platforms (Standard Edition, Enterprise Edition or Micro Edition) in the form of a binary product. The JDK includes the JVM and a few other tools like the compiler (javac), debugger (jdb), applet viewer, etc, which are required for the development of Java applications and applets. The latest version of the JDK is JDK 9.0.1, released in October 2017. OpenJDK is a free and open source implementation of Java SE, licensed under the GNU General Public License (GNU GPL). The Java Class Library (JCL) is a set of dynamically loadable libraries that Java applications can call at run time. The JCL contains a number of packages, and each of them contains a number of classes providing various functionalities. Some of the packages in the JCL include java.lang,,, java.util, etc.

The 'Hello World' program in Java

Other than console based Java application programs, special classes like the applet, servlet, Swing, etc, are used to develop Java programs for a variety of tasks. For example, Java applets are programs that are embedded in other applications, typically in a Web page displayed in a browser. Regular expressions can be used in Java application programs and in programs based on other classes like the applet, Swing, servlet, etc, without any changes. Since there is no difference in the use of regular expressions, all our discussions are based on simple Java application programs. But before exploring Java programs that use regular expressions, let us build our muscles by executing a simple 'Hello World' program in Java. The code given below shows the program

public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello World");
    }
}

To execute the Java source file, open a terminal in the directory containing the file and run the command:

javac

Now a Java class file called HelloWorld.class containing the Java bytecode is created in the directory. The JVM can be invoked to execute this class with the command (note that the .class extension is not included):

java HelloWorld

The message 'Hello World' is displayed on the terminal. Figure 2 shows the execution and output of the program The program contains a special method named main( ), the starting point of the program, which is identified and executed by the JVM. Remember that a method in the object-oriented programming paradigm is nothing but a function in the procedural programming paradigm. The main( ) method contains the line of code 'System.out.println("Hello World");', which prints the message 'Hello World' on the terminal.

The program and all the other programs discussed in this article can be downloaded from

Figure 2: Hello World program in Java

Regular expressions in Java

Now, coming down to business, let us discuss regular expressions in Java. The first question to be answered is: what flavour of regular expression is used in Java? Java uses PCRE (Perl Compatible Regular Expressions). So, all the regular expressions we developed in the previous articles on regular expressions in Python, Perl and PHP will work in Java without any modification, because Python, Perl and PHP also use the PCRE flavour of regular expressions. Since we have already covered much of the syntax of PCRE in the previous articles, I am not going to reintroduce it here. But I would like to point out a few minor differences between classic PCRE and the PCRE standard tailored for Java. For example, regular expressions in Java lack the embedded comment syntax available in programming languages like Perl. Another difference concerns the quantifiers used in regular expressions in Java and other PCRE based programming languages. Quantifiers allow you to specify the number of occurrences of a character to match against a string. Almost all the PCRE flavours have a greedy quantifier and a reluctant quantifier. In addition to these two, the regular expression syntax of Java also has a possessive quantifier.

To differentiate between these three quantifiers, consider the string aaaaaa. The regular expression pattern 'a+a' involves a greedy quantifier by default. This pattern will result in a greedy match of the whole string aaaaaa, because the pattern 'a+' will first grab all six characters and then back off one character so that the final a of the pattern can also match. Now consider the reluctant quantifier 'a+?a'. This pattern will only result in a match for the string aa, since the pattern 'a+?' will only match the single-character string a. Now let us see the effect of the Java-specific possessive quantifier, denoted by the pattern 'a++a'. This pattern will not result in any match, because a possessive quantifier behaves like a greedy quantifier except that it never gives back what it has matched. So, the pattern 'a++' itself will possessively match the whole string aaaaaa, and the last character a in the regular expression pattern 'a++a' will not have a match. You can download and test the three example Java files for a better understanding of these concepts.

In Java, regular expression processing is enabled with the help of the package java.util.regex. This package was included in the Java Class Library (JCL) by J2SE 1.4 (JDK 1.4). So, if you are going to use regular expressions in Java, make sure that you have JDK 1.4 or later installed on your system. Execute the command:

java -version

…at the terminal to find the particular version of Java installed on your system. The later versions of Java have fixed many bugs and added support for features like named capture and Unicode-based regular expression processing. There are also some third-party packages that support regular expression processing in Java, but our discussion strictly covers the classes offered by the package java.util.regex, which is standard and part of the JCL. The package java.util.regex offers two classes called Pattern and Matcher that are used jointly for regular expression processing. The Pattern class enables us to define a regular expression pattern. The Matcher class helps us match a regular expression pattern against the contents of a string.

Java programs using regular expressions

Let us now execute and analyse a simple Java program using regular expressions. The code given below shows the program Regex1.java:

import java.util.regex.*;
class Regex1 {
    public static void main(String args[ ]) {
        Pattern pat = Pattern.compile("Open Source");
        Matcher mat = pat.matcher("Magazine Open Source For You");
        if(mat.matches( )) {
            System.out.println("Match from " + (mat.start( )+1) + " to " + (mat.end( )));
        } else {
            System.out.println("No Match Found");
        }
    }
}

Open a terminal in the same directory containing the file and execute the following commands to view the output:

javac Regex1.java
java Regex1

You will be surprised to see the message 'No Match Found' displayed in the terminal. Let us analyse the code in detail to understand the reason for this output. The first line of code: 'import java.util.regex.*;'

…imports the classes Pattern and Matcher from the package java.util.regex. The line of code: ‘Pattern pat = Pattern.compile(“Open Source”);’

…generates the regular expression pattern with the help of the method compile( ) provided by the Pattern class. The Pattern object thus generated is stored in the object pat. A PatternSyntaxException is thrown if the regular expression syntax is invalid. The line of code: ‘Matcher mat = pat.matcher(“Magazine Open Source For You”);’

…uses the matcher( ) method of Pattern class to generate a Matcher object, because the Matcher class does not have a constructor. The Matcher object thus generated is stored in the object mat. The line of code: ‘if(mat.matches( ))’

…uses the method matches( ) provided by the class Matcher to perform a matching between the regular expression pattern 'Open Source' and the string 'Magazine Open Source For You'. The method matches( ) returns true if there is a match and false if there is no match. But the important thing to remember is that the method matches( ) returns true only if the pattern matches the whole string. In this case, the string 'Open Source' is just a substring of the string 'Magazine Open Source For You', so there is no match, matches( ) returns false, and the if statement displays the message 'No Match Found' on the terminal. If you replace the line of code: 'Pattern pat = Pattern.compile("Open Source");'

…with the line of code: ‘Pattern pat = Pattern.compile(“Magazine Open Source For You”);’

…then you will get a match and the matches( ) method will return true. The file with this modification is also available for download as Regex2.java. The line of code: 'System.out.println("Match from " + (mat.start( )+1) + " to " + (mat.end( )));'

…uses two methods provided by the Matcher class, start( ) and end( ). The method start( ) returns the start index of the previous match and the method end( ) returns the offset after the last character matched. So, the output of the program will be 'Match from 1 to 28'. Figure 3 shows the output of Regex1.java and Regex2.java. An important point to remember is that the indexing starts at 0, and that is the reason why 1 is added to the value returned by the method start( ), as (mat.start( )+1). Since the method end( ) returns the index immediately after the last matched character, nothing needs to be done there. The matches( ) method with this sort of comparison is of limited use on its own. But many other useful methods are provided by the class Matcher to carry out different types of comparisons. The method find( ) provided by the class Matcher is useful if you want to find a substring match.

Figure 3: Output of Regex1.java and Regex2.java

Replace the line of code: 'if(mat.matches( ))' in Regex1.java with the line of code: 'if(mat.find( ))' to obtain the program Regex3.java. On execution, Regex3.java will display the message 'Match from 10 to 20' on the terminal. This is due to the fact that the substring 'Open Source' appears from the 10th character to the 20th character in the string 'Magazine Open Source For You'. The method find( ) also returns true in case of a match and false if there is no match. The method find( ) can be used repeatedly to find all the matching substrings present in a string. Consider the program Regex4.java shown below:

import java.util.regex.*;
class Regex4 {
    public static void main(String args[]) {
        Pattern pat = Pattern.compile("abc");
        String str = "abcdabcdabcd";
        Matcher mat = pat.matcher(str);
        while(mat.find( )) {
            System.out.println("Match from " + (mat.start( )+1) + " to " + (mat.end( )));
        }
    }
}

In this case, the method find( ) will search the whole string and find matches at positions starting at the first, fifth and ninth characters. The line of code: 'String str = "abcdabcdabcd";' is used to store the string to be searched, and in the line of code: 'Matcher mat = pat.matcher(str);' this string is used by the method matcher( ) for further processing. Figure 4 shows the output of the programs Regex3.java and Regex4.java.

Figure 4: Output of Regex3.java and Regex4.java

Now, what if you want the matched string displayed instead of the index at which a match is obtained? Well, then you have to use the method group( ) provided by the class Matcher. Consider the program Regex5.java shown below:

import java.util.regex.*;

class Regex5 {
    public static void main(String args[]) {
        Pattern pat = Pattern.compile("S.*r");
        String str = "Sachin Tendulkar Hits a Sixer";
        Matcher mat = pat.matcher(str);
        int i=1;
        while(mat.find( )) {
            System.out.println("Matched String " + i + " : " + ));
            i++;
        }
    }
}

On execution, the program displays the message ‘Matched String 1 : Sachin Tendulkar Hits a Sixer’ on the terminal. What is the reason for matching the whole string? Because the pattern ‘S.*r’ searches for a string starting with S, followed by zero or more occurrences of any character, and finally ending with an r. Since the pattern ‘.*’ results in a greedy match, the whole string is matched. Now replace the line of code: ‘Pattern pat = Pattern.compile(“S.*r”);’

…in with the line: ‘Pattern pat = Pattern.compile(“S.*?r”);’

…to get Regex6.java. What will be the output of Regex6.java? Since this is the last article of this series on regular expressions, I request you to try your best to find the answer before proceeding any further. Figure 5 shows the output of Regex5.java and Regex6.java. But what is the reason for the output shown by Regex6.java? Again, I request you to ponder over the problem for some time and find out the answer. If you don't get the answer, download the file from the link shown earlier; in that file I have given the explanation as a comment. So, with that example, let us wind up our discussion about regular expressions in Java. Java is a very powerful programming language, and the effective use of regular expressions will make it even more powerful. The basic stuff discussed here will definitely kick-start your journey towards the efficient use of regular expressions in Java. And now it is time to say farewell. In this series we have discussed regular expression processing in six different programming languages. Four of these (Python, Perl, PHP and Java) use a regular expression style called PCRE (Perl Compatible Regular Expressions). The other two programming languages we discussed in

Figure 5: Output of Regex5.java and Regex6.java

this series, C++ and JavaScript, use a style known as the ECMAScript regular expression style. The articles in this series were never intended to describe the complexities of intricate regular expressions in detail. Instead, I tried to focus on the different flavours of regular expressions and how they can be used in various programming languages. Any decent textbook on regular expressions will give a language-agnostic discussion of regular expressions, but we were more concerned with the actual execution of regular expressions in programming languages.

Before concluding this series, I would like to go over the important takeaways. First, always remember the fact that there are many different regular expression flavours. The differences between many of them are subtle, yet they can cause havoc if used indiscreetly. Second, the style of regular expression used in a programming language depends on the flavour of the regular expression implemented by the language's regular expression engine. Due to this reason, a single programming language may support multiple regular expression styles with the help of different regular expression engines and library functions. Third, the way different languages support regular expressions is different. In some languages the support for regular expressions is part of the language core; an example of such a language is Perl. In other languages regular expressions are supported with the help of library functions; C++ is a programming language in which regular expressions are implemented using library functions. Due to this, not all the versions and standards of a programming language may support the use of regular expressions. For example, in C++, the support for regular expressions starts with the C++11 standard. For the same reason, the different versions of a particular programming language itself might support different regular expression styles.
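A related pitfall is worth a short sketch (this example and the name EscapeDemo are mine, not from the article): the host language's own string syntax also shapes how a pattern is written. In Java source code, the regular expression \d+ has to be typed with a doubled backslash, because the backslash is also the escape character in Java string literals:

```java
import java.util.regex.Pattern;

public class EscapeDemo {

    // True when the whole string consists of digits only; the regex \d+
    // appears as "\\d+" because Java string literals escape backslashes.
    static boolean allDigits(String s) {
        return Pattern.matches("\\d+", s);
    }

    public static void main(String[] args) {
        System.out.println(allDigits("2018"));  // true
        System.out.println(allDigits("Jan18")); // false
    }
}
```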
You must be very careful about these important points while developing programs using regular expressions to avoid dangerous pitfalls. So, finally, we are at the end of a long journey of learning regular expressions. But an even longer and far more exciting journey of practising and developing regular expressions lies ahead. Good luck!

By: Deepu Benson
The author is a free software enthusiast whose area of interest is theoretical computer science. He maintains a technical blog and can be reached by e-mail.


Explore Twitter Data Using R

As of August 2017, Twitter had 328 million active users, with 500 million tweets being sent every day. Let's look at how the open source R programming language can be used to analyse the tremendous amount of data created by this very popular social media tool.


Social networking websites are ideal sources of Big Data, which has many applications in the real world. These sites contain both structured and unstructured data, and are perfect platforms for data mining and subsequent knowledge discovery from the source. Twitter is a popular source of text data for data mining. Huge volumes of Twitter data contain many varieties of topics, which can be analysed to study the trends of different current subjects, like market economics or a wide variety of social issues. Accessing Twitter data is easy as open APIs are available to transfer and arrange data in JSON and ATOM formats. In this article, we will look at an R programming implementation for Twitter data analysis and visualisation. This will give readers an idea of how to use R to analyse Big Data. As a micro blogging network for the exchange and sharing of short public messages, Twitter provides a rich repository of different hyperlinks, multimedia and hashtags, depicting the contemporary social scenario in a geolocation. From the originating tweets and the responses to them, as well as the retweets by other users, it is possible to implement opinion mining over a subject of interest in a geopolitical location. By analysing the favourite counts and the information about the popularity of users in their followers' count, it is also possible to make a weighted statistical analysis of the data.


Start your exploration

Exploring Twitter data using R requires some preparation. First, you need to have a Twitter account. Using that account, register an application into your Twitter account from the Twitter application management site. The registration process requires basic personal information and produces four keys for R application and Twitter application connectivity. For example, an application myapptwitterR1 may be created as shown in Figure 1. In turn, this will create your application settings, as shown in Figure 2. A customer key, a customer secret, an access token and the access token secret combination forms the final authentication using the setup_twitter_oauth() function.

>setup_twitter_oauth(consumerKey, consumerSecret, AccessToken, AccessTokenSecret)

It is also necessary to create an object to save the authentication for future use. This is done by OAuthFactory$new() as follows: credential<- OAuthFactory$new(consumerKey, consumerSecret, requestURL, accessURL,authURL)



Figure 1: Twitter application settings

Here, requestURL, accessURL and authURL are available from the application settings page of the Twitter application.

Connect to Twitter

This exercise requires R to have a few packages for calling all Twitter related functions. Here is an R script to start the Twitter data analysis task. To access the Twitter data through the just-created application, one needs the twitteR, ROAuth and httr packages.






Figure 2: Histogram of created time tag

>setwd('d:\\r\\twitter')
>install.packages("twitteR")
>install.packages("ROAuth")
>install.packages("httr")
>library("twitteR")
>library("ROAuth")
>library("httr")

To test this on the MS Windows platform, download the curl certificate bundle into the current workspace, as follows:

>download.file(url=" pem", destfile="cacert.pem")

Before the final connectivity to the Twitter application, save all the necessary key values to suitable variables:

>consumerKey='HTgXiD3kqncGM93bxlBczTfhR'
>consumerSecret='djgP2zhAWKbGAgiEd4R6DXujipXRq1aTSdoD9yaHSA8q97G8Oe'
>requestURL=''
>accessURL=''
>authURL=''

With these preparations, one can now create the required connectivity object:

>cred<- OAuthFactory$new(consumerKey, consumerSecret, requestURL, accessURL, authURL)
>cred$handshake(cainfo="cacert.pem")

Authentication to the Twitter application is then done by the function setup_twitter_oauth() with the stored key values:

>setup_twitter_oauth(consumerKey, consumerSecret, AccessToken, AccessTokenSecret)

With all this done successfully, we are ready to access Twitter data. As an example of data analysis, let us consider the simple problem of opinion mining.

Data analysis

To demonstrate how data analysis is done, let's get some data from Twitter. The twitteR package provides the function searchTwitter() to retrieve tweets based on the keywords searched for. Twitter organises tweets using hashtags. With the help of a hashtag, you can expose your message to an audience interested in only some specific subject. If the hashtag is a popular keyword related to your business, it can act to increase your brand's awareness levels. The use of popular hashtags helps one to get noticed. Analysis of hashtag appearances in tweets can reveal different trends of what the people are thinking about the hashtag keyword. So this can be a good starting point to decide your business strategy. To demonstrate hashtag analysis using R, here, we have picked up the number one hashtag keyword #love for the study. Other than this search keyword, the searchTwitter() function also requires the maximum number of tweets that the function call will return from the tweets. For this discussion, let us consider the maximum number as 500. Depending upon

the speed of your Internet and the traffic on the Twitter server, you will get a response within a few minutes, in the form of an R list class object.

>tweetList<- searchTwitter("#love", n=500)
>mode(tweetList)
[1] "list"
>length(tweetList)
[1] 500

In R, an object list is a compound data structure and contains all types of R objects, including itself. For further analysis, it is necessary to investigate its structure. Since it is an object of 500 list items, the structure of the first item is sufficient to understand the schema of the set of records.

>str(head(tweetList,1))
List of 1
 $ :Reference class 'status' [package "twitteR"] with 20 fields
  ..$ text           : chr " #SavOne #LLOVE #GotItWrong #JCole #Drake #Love #F4F #follow #follow4follow #Repost #followback"
  ..$ favorited      : logi FALSE
  ..$ favoriteCount  : num 0
  ..$ replyToSN      : chr(0)
  ..$ created        : POSIXct[1:1], format: "2017-10-04 06:11:03"
  ..$ truncated      : logi FALSE
  ..$ replyToSID     : chr(0)
  ..$ id             : chr "915459228004892672"
  ..$ replyToUID     : chr(0)
  ..$ statusSource   : chr "<a href=\"\" rel=\"nofollow\">Twitter Web Client</a>"
  ..$ screenName     : chr "Lezzardman"
  ..$ retweetCount   : num 0
  ..$ isRetweet      : logi FALSE
  ..$ retweeted      : logi FALSE
  ..$ longitude      : chr(0)
  ..$ latitude       : chr(0)
  ..$ location       : chr "Bay Area, CA, #CLGWORLDWIDE <ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>"
  ..$ language       : chr "en"
  ..$ profileImageURL: chr "images/444325116407603200/XmZ92DvB_normal.jpeg"
  ..$ urls           :'data.frame': 1 obs. of 5 variables:
  .. ..$ url         : chr ""
  .. ..$ expanded_url: chr ""
  .. ..$ display_url : chr ""
  .. ..$ start_index : num 0
  .. ..$ stop_index  : num 23
  ..and 59 methods, of which 45 are possibly relevant:
  ..  getCreated, getFavoriteCount, getFavorited, getId, getIsRetweet,

  ..  getLanguage, getLatitude, getLocation, getLongitude, getProfileImageURL,
  ..  getReplyToSID, getReplyToSN, getReplyToUID, getRetweetCount,
  ..  getRetweeted, getRetweeters, getRetweets, getScreenName, getStatusSource,
  ..  getText, getTruncated, getUrls, initialize, setCreated, setFavoriteCount,
  ..  setFavorited, setId, setIsRetweet, setLanguage, setLatitude, setLocation,
  ..  setLongitude, setProfileImageURL, setReplyToSID, setReplyToSN,
  ..  setReplyToUID, setRetweetCount, setRetweeted, setScreenName,
  ..  setStatusSource, setText, setTruncated, setUrls, toDataFrame,
  ..  toDataFrame#twitterObj
>

The structure shows that there are 20 fields in each list item, and the fields contain information and data related to the tweets. Since the data frame is the most efficient structure for processing records, it is now necessary to convert each list item to a data frame and bind these row by row into a single frame. This can be done in an elegant way using the function call, as shown here:

>loveDF <-"rbind", lapply(tweetList,

Function lapply() will first convert each list item to a data frame, and will then bind these, one by one. Now we have a set of records with 19 fields (one less than the list!) in a regular format, ready for analysis. Here, we shall mainly consider the 'created' field to study the distribution pattern of the arrival of tweets.

>length(head(loveDF,1))
[1] 19
>str(head(loveDF,1))
'data.frame' : 1 obs. of 19 variables:
 $ text           : chr " #SavOne #LLOVE #GotItWrong #JCole #Drake #Love #F4F #follow #follow4follow #Repost #followback"
 $ favorited      : logi FALSE
 $ favoriteCount  : num 0
 $ replyToSN      : chr NA
 $ created        : POSIXct, format: "2017-10-04 06:11:03"
 $ truncated      : logi FALSE
 $ replyToSID     : chr NA
 $ id             : chr "915459228004892672"
 $ replyToUID     : chr NA
 $ statusSource   : chr "<a href=\"\" rel=\"nofollow\">Twitter Web Client</a>"










Figure 3: Histogram of ordered created time tag

Figure 4: Cumulative frequency distribution

 $ screenName     : chr "Lezzardman"
 $ retweetCount   : num 0
 $ isRetweet      : logi FALSE
 $ retweeted      : logi FALSE
 $ longitude      : chr NA
 $ latitude       : chr NA
 $ location       : chr "Bay Area, CA, #CLGWORLDWIDE <ed><U+00A0><U+00BD><ed><U+00B2><U+00AF>"
 $ language       : chr "en"
 $ profileImageURL: chr "images/444325116407603200/XmZ92DvB_normal.jpeg"
>

If we want to study the pattern of how the word ‘love’ appears in the data set, we can take the differences of consecutive time elements of the vector ‘created’. R function diff() can do this. It returns iterative lagged differences of the elements of an integer vector. In this case, we need lag and iteration variables as one. To have a time series from the ‘created’ vector, it first needs to be converted to an integer; here, we have done it before creating the series, as follows:

The fifth column field is 'created'; we shall try to explore the different statistical characteristics of this field.

>attach(loveDF)  # attach the frame for further processing
>head(loveDF['created'],2)  # first 2 record set items for demo
              created
1 2017-10-04 06:11:03
2 2017-10-04 06:10:55

Twitter follows the Coordinated Universal Time tag as the time-stamp to record the tweet's time of creation. This helps to maintain a normalised time frame for all records, and it becomes easy to draw a frequency histogram of the 'created' time tag.

>hist(created, breaks=15, freq=TRUE, main="Histogram of created time tag")

>detach(loveDF)
>sortloveDF<-loveDF[order(as.integer(created)),]
>attach(sortloveDF)
>hist(as.integer(abs(diff(created))))

This distribution shows that the majority of tweets in this group come within the first few seconds, and a much smaller number of tweets arrive in subsequent time intervals. From the distribution, it's apparent that the arrival time distribution follows a Poisson Distribution pattern, and it is now possible to model the number of times an event occurs in a given time interval. Let's check the cumulative distribution pattern, and the number of tweets arriving within a time interval. For this we have to write a short R function to get the cumulative values within each interval. Here is the demo script and the graph plot:

countarrival <- function(created) {
    i <- 1
    s <- seq(1,15,1)
    for(t in seq(1,15,1)) {
        s[i] <- sum((as.integer(abs(diff(created)))) < t)/500
        i <- i+1
    }
    return(s)
}

To create a cumulative value of the arriving tweets within a given interval, countarrival() uses the sum() function over the diff() function after converting the values into integers.

>s <- countarrival(created)
>x <- seq(1,15,1)
>y <- s
>lo <- loess(y~x)
>plot(x,y)
>lines(predict(lo), col='red', lwd=2)

To have a smooth time series curve, the loess() function has been used with the predict() function. Predicted values based on the local regression model, as provided by loess(), are plotted along with the x-y frequency values. This is a classic example of a probability distribution of arrival probabilities. The pattern in Figure 5 shows a cumulative Poisson Distribution, and can be used to model the number of events occurring within a given time interval. The X-axis contains one-second time intervals. Since this is a cumulative probability plot, the likelihood of the next tweet arriving corresponds to the X-axis value or less than that. For instance, since 4 on the X-axis approximately corresponds to 60 per cent on the Y-axis, the next tweet will arrive in 4 seconds or less than that time interval. In conclusion, we can say that all the events are mutually independent and occur at a known and constant rate per unit time interval. This data analysis and visualisation shows that the arrival pattern is random and follows the Poisson Distribution. The reader may test the arrival pattern with a different keyword too.

By: Dipankar Ray


The author is a member of IEEE and IET, with more than 20 years of experience in open source versions of UNIX operating systems and Sun Solaris. He is presently working on data analysis and machine learning using a neural network and different statistical tools. He has also jointly authored a textbook called 'MATLAB for Engineering and Science'. He can be reached by e-mail.


Using Jenkins to Create a Pipeline for Android Applications

This article is a tutorial on how to create a pipeline to perform code analysis using Lint and create an APK file for Android applications.


Continuous integration is a practice that requires developers to integrate code into a shared repository such as GitHub, GitLab, SVN, etc, at regular intervals. This concept was meant to avoid the hassle of later finding problems in the build life cycle. Continuous integration requires developers to have frequent builds. The common practice is that whenever a code commit occurs, a build should be triggered. However, sometimes the build process is also scheduled in a way that too many builds are avoided. Jenkins is one of the most popular continuous integration tools. Jenkins was known as a continuous integration server earlier. However, the Jenkins 2.0 announcement made it clear that, going forward, the focus would not only be on continuous integration but on continuous delivery too. Hence, 'automation server' is the term used more often after Jenkins 2.0 was released. It was initially developed by

Kohsuke Kawaguchi in 2004, and is an automation server that helps to speed up different DevOps implementation practices such as continuous integration, continuous testing, continuous delivery, continuous deployment, continuous notifications, and orchestration using a build pipeline or Pipeline as Code. Jenkins helps to manage different application lifecycle management activities. Users can map continuous integration with build, unit test execution and static code analysis; continuous testing with functional testing, load testing and security testing; continuous delivery and deployment with automated deployment into different environments, and so on. Jenkins provides easier ways to configure DevOps practices. The Jenkins package has two release lines:
- LTS (long-term support): Releases are selected every 12 weeks from the stream of regular releases, ensuring a stable release.

- Weekly: A new release is available every week to fix bugs and provide features to the community.
LTS and weekly releases are available in different flavours such as .war files (Jenkins is written in Java), native packages for the operating systems, installers and Docker containers. The current LTS version is Jenkins 2.73.3. This version comes with a very useful option, called Deploy to Azure. Yes, we can deploy Jenkins to the Microsoft Azure public cloud within minutes. Of course, you need a Microsoft Azure subscription to utilise this option. Jenkins can be installed and used on Docker, FreeBSD, Gentoo, Mac OS X, OpenBSD, openSUSE, Red Hat/Fedora/CentOS, Ubuntu/Debian and Windows.

The features of Jenkins are:
- Support for SCM tools such as Git, Subversion, StarTeam, CVS, AccuRev, etc.
- Extensible architecture using plugins: Plugins are available for Android development, iOS development, .NET development, Ruby development, library plugins, source code management, build tools, build triggers, build notifiers, build reports, UI plugins, authentication and user management, etc.
- The 'Pipeline as Code' feature, which uses a domain-specific language (DSL) to create a pipeline to manage the application's lifecycle.
- The master-agent architecture, which supports distributed builds.

To install Jenkins, the minimum hardware requirements are 256MB of RAM and 1GB of drive space. The recommended hardware configuration for a small team is 1GB+ of RAM and 50GB+ of drive space. You need to have Java 8 - Java Runtime Environment (JRE) or a Java Development Kit (JDK). The easiest way to run Jenkins is to download and run its latest stable WAR file version. Download the jenkins.war file, go to that directory and execute the following command:

java -jar jenkins.war

Next, go to http://<localhost|IP address>:8080 and wait until the 'Unlock Jenkins' page appears. Follow the wizard instructions and install the plugins after providing proxy details (if you are configuring Jenkins behind a proxy).

Configuration

- To install plugins, go to Jenkins Dashboard > Manage Jenkins > Manage Plugins. Verify the Updates, Available and Installed tabs. For the HTTP proxy configuration, go to the Advanced tab. To manually upload plugins, go to Jenkins Dashboard > Manage Jenkins > Manage Plugins > Advanced > Upload Plugin.
- To configure security, go to Jenkins Dashboard > Manage Jenkins > Configure Global Security. You can configure authentication using Active Directory, Jenkins' own user database, and LDAP. You can also configure authorisation using Matrix-based security or the project-based Matrix authorisation strategy.
- To configure environment variables (such as ANDROID_HOME), tool locations, SonarQube servers, Jenkins location, Quality Gates - SonarQube, e-mail notification, and so on, go to Jenkins Dashboard > Manage Jenkins > Configure System.
- To configure Git, JDK, Gradle, and so on, go to Jenkins Dashboard > Manage Jenkins > Global Tool Configuration.

Figure 1: Global tool configuration

Creating a pipeline for Android applications

We have the following prerequisites:
- A sample Android application on GitHub, GitLab, SVN or a file system.
- Download the Gradle installation package or configure it to install automatically from the Jenkins Dashboard.
- Download the Android SDK.
- Install plugins in Jenkins such as the Gradle plugin, the Android Lint plugin, the Build Pipeline plugin, etc.

Now, let's look at how to create a pipeline using the Build Pipeline plugin so we can achieve the following tasks:
- Perform code analysis for Android application code using Android Lint.
- Create an APK file.


Figure 2: Gradle installation


Figure 6: Publish Android Lint results
Figure 3: ANDROID_HOME environment variable

Figure 4: Source code management

Figure 5: Lint configuration

Create a pipeline so that code analysis is performed first and, on its successful completion, another build job is executed to create an APK file. Now, let's perform each step in sequence. Configuring Git, Java and Gradle: To execute the build pipeline, it is necessary to take code from a shared repository. As shown below, Git is configured before going further. The same configuration applies to whichever version control system is to be set up. The path in Jenkins is Home > Global tool configuration > Version control / Git.

In an Android project, the key component is Gradle, which builds the source code and downloads all the dependencies required for the project. In the Name field, enter Gradle along with its version, for better readability. The next field is Gradle Home, which is the same as the environment variable on your system; copy your Gradle path and paste it here. There is one more option, 'Install automatically', which installs the latest version of Gradle if you do not already have it.

Configuring the ANDROID_HOME environment variable: The next step is to configure the SDK for the Android project, which contains the platform tools and other tools. Here, provide the path to the SDK directory on your system. The path in Jenkins is Home > Configuration > SDK.

Creating a Freestyle project to perform Lint analysis for the Android application: The basic setup is ready, so let's start the project. The first step is to give the project a proper name (AndroidApp-CA). Then select 'Freestyle project' under Category, and click on OK. Your project file structure is ready to use, and you can customise all the configuration steps. As shown in Figure 4, in the general configuration, the 'Discard old builds' option discards old builds and keeps only as many builds as you want. The path in Jenkins is Home > #your_project# > General Setting.

In the last step, we configure Git as version control to pull the latest code for the build pipeline. Select the Git option and provide the repository's URL and its credentials. You can also specify the branch from which to take the code; as shown in Figure 5, the 'Master' branch is used here. Then click the Apply and Save buttons to save the configuration. The next step is to add Gradle to the build, as well as Lint to do static code analysis.
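For readers who prefer pipeline-as-code over chained Freestyle jobs, the same two-stage flow (Lint analysis followed by an APK build) could be sketched as a declarative Jenkinsfile. This is a hypothetical equivalent, not part of the Build Pipeline plugin setup described in this article, and the repository URL is a placeholder:

```groovy
// Hypothetical Jenkinsfile sketch of the two-stage flow described above.
pipeline {
    agent any
    stages {
        stage('Lint analysis') {
            steps {
                // Placeholder repository URL; use your own project's URL and credentials.
                git url: 'https://example.com/your/android-app.git', branch: 'master'
                sh './gradlew lint'          // runs Android Lint, producing an XML report
            }
        }
        stage('Build APK') {
            steps {
                sh './gradlew assembleDebug' // builds the project and creates the APK
                archiveArtifacts artifacts: '**/*.apk'
            }
        }
    }
}
```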
Lint is a tool that performs code analysis for Android applications, just as SonarQube does for Java applications. To add the Lint task to the configuration, the user has to write Lint options in the


Figure 7: Gradle build
Figure 10: Build Pipeline flow

Figure 8: Archive Artifact
Figure 11: Successful execution of the Build Pipeline

Figure 9: Downstream job

build.gradle file in the Android project. The Android Lint plugin offers a feature to examine the XML output produced by the Android Lint tool, and presents the results on the build page for analysis. It does not run Lint itself; the Lint results must be generated in XML format and be available in the workspace.

Creating a Freestyle project to build the APK file for the Android application: After completing the code analysis, the next step is to build the Android project and create an APK file. Create a Freestyle project with the name AndroidApp-APK, and in the build actions, select Invoke Gradle script.
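As mentioned above, the Android Lint plugin only consumes an XML report, so the Gradle build must be told to produce one. A minimal sketch of the lintOptions block one might add to the app module's build.gradle follows; the option names are from the Android Gradle plugin's Lint DSL, but treat the exact set as an assumption for your plugin version:

```groovy
// Sketch of a build.gradle fragment enabling the XML Lint report (assumed AGP Lint DSL).
android {
    lintOptions {
        xmlReport true                                    // emit the XML report the Jenkins plugin reads
        xmlOutput file("build/reports/lint-results.xml")  // hypothetical output path
        abortOnError false                                // let Jenkins mark the build instead of failing Gradle
    }
}
```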

Archive the build artefacts such as JAR, WAR, APK or IPA files so that they can be downloaded later, and click on Save.

After executing all the jobs, the pipeline can be represented pictorially using the Build Pipeline plugin. After installing the plugin, users have to give the start, middle and end points to show the build jobs in that sequence. They can configure upstream and downstream jobs to build the pipeline. To show all the build jobs, click on the '+' sign at the top right of the screen, and select Build Pipeline View on the screen that comes up. The build pipeline view can be configured by the user, as per requirements. Select AndroidApp-CA as the initial job. There are multiple options like the Trigger Option, Display Option, Pipeline Flow, etc. As configured earlier, the pipeline starts by clicking on the Run button and is refreshed periodically. Upstream and downstream job execution will take place as per the configuration. After completing all the processes, you can see the visualisation shown in Figure 11. Green indicates the successful execution of a pipeline, whereas red indicates an unsuccessful build.

By: Bhagyashri Jain
The author is a systems engineer and loves Android development. She likes to read and share daily news on her blog.

Overview Developers

Demystifying Blockchains A blockchain is a continuously growing list of records, called blocks, which are linked and secured using cryptography to ensure data security. Hence, a blockchain is nearly impossible to tamper with without anyone noticing.



Data security is of paramount importance to corporations. Enterprises need to establish high levels of trust and offer guarantees on the security of the data being shared with them while interacting with other enterprises. The major concern of any enterprise about data security is data integrity. What many in the enterprise domain worry about is, "Is my data accurate?" Data integrity ensures that the data is accurate, untampered with and consistent across the life cycle of any transaction. Enterprises share data like invoices, orders, etc. The integrity of this data is the pillar on which their businesses are built.


A blockchain is a distributed public ledger of transactions that no person or company owns or controls. Instead, every user can access the entire blockchain, and every transaction from any account to any other account, as it is recorded in a secure and verifiable form using cryptographic algorithms. In short, a blockchain ensures data integrity. A blockchain provides data integrity due to its unique and significant features. Some of these are listed below.

Timeless validation for a transaction: Each transaction in a blockchain has a signature digest attached to it which depends on all the previous transactions, with no expiration date. Due to this, each transaction can be validated at any point in time by anyone, without the risk of the data being altered or tampered with.

Highly scalable and portable: A blockchain is a decentralised ledger distributed across the globe, and it ensures very high availability and resilience against disaster.

Tamper-proof: A blockchain uses asymmetric or elliptic curve cryptography under the hood. Besides, each transaction gets added to the blockchain only after validation, and each transaction also depends on the previous transaction.

A blockchain, in itself, is a distributed ledger and an interconnected chain of individual blocks of data, where each block can be a transaction, or a group of transactions. In order to explain the concepts of the blockchain, let's look at a code example in JavaScript. The code lives in the AngCoins repository on GitHub; do check the repo and go through the 'README', as it contains the instructions on how to run the code locally.

Block: A block in a blockchain is a combination of the transaction data along with the hash of the previous block. For example:

class Block {
  constructor(blockId, dateTimeStamp, transactionData, previousTransactionHash) {
    this.blockId = blockId;
    this.dateTimeStamp = dateTimeStamp;
    this.transactionData = transactionData;
    this.previousTransactionHash = previousTransactionHash;
    this.currentTransactionHash = this.calculateBlockDigest();
  }
}

The definition of the block inside a blockchain is presented in the above example. It consists of the data (which includes blockId, dateTimeStamp, transactionData, previousTransactionHash and nonce), the hash of the data (currentTransactionHash) and the hash of the previous transaction's data.

Genesis block: A genesis block is the first block to be created at the beginning of the blockchain. For example:

new Block(0, new Date().getTime().valueOf(), 'First Block', '0');

Adding a block to the blockchain

In order to add blocks or transactions to the blockchain, we have to create a new block with a set of transactions, and add it to the blockchain, as explained in the code example below:

addNewTransactionBlockToTransactionChain(currentBlock) {
  currentBlock.previousTransactionHash =
    this.returnLatestBlock().currentTransactionHash;
  currentBlock.currentTransactionHash =
    currentBlock.calculateBlockDigest();
  this.transactionChain.push(currentBlock);
}

In the above code example, we calculate the hash of the previous transaction and the hash of the current transaction before pushing the new block to the blockchain. We also validate the new block before adding it to the blockchain, using the method described below.

Validating the blockchain

Each block needs to be validated before it gets added to the blockchain. The validation used in our implementation is described below:

isBlockChainValid() {
  for (let blockCount = 1; blockCount < this.transactionChain.length; blockCount++) {
    const currentBlockInBlockChain = this.transactionChain[blockCount];
    const previousBlockInBlockChain = this.transactionChain[blockCount - 1];
    if (currentBlockInBlockChain.currentTransactionHash !==
        currentBlockInBlockChain.calculateBlockDigest()) {
      return false;
    }
    if (currentBlockInBlockChain.previousTransactionHash !==
        previousBlockInBlockChain.currentTransactionHash) {
      return false;
    }
  }
  return true;
}

In this implementation, a lot of features are missing as of now, like validation of the funds, a rollback feature in case a newly added block corrupts the blockchain, etc. If anyone is interested in tackling fund validation, rollback or any other issue, please go to my GitHub repository, create an issue and a fix for it, and send me a pull request; or just fork the repository and use the code in whichever way suits your requirements.

A point to be noted here is that in this implementation, there are numerous ways to tamper with the blockchain. One way is to tamper with the data alone; the implementation for that is done in the tampering_data branch of the AngCoins repository. Another way is to not only change the data but also update the hash; even then, the current implementation can invalidate it. The code for this is available in the branch with_updated_hash.

Proof of work

With the current implementation, it is still possible for someone to spam the blockchain by changing the data in one block and updating the hash in all the following blocks. To prevent that, the concept of 'proof of work' imposes a difficulty, or condition, that each generated block has to meet before being added to the blockchain. This difficulty prevents very frequent generation of blocks, as the hashing algorithm used to generate a block is not under the control of the person creating it. In this way, it becomes a game of hit and miss to generate a block that meets the required condition. For our implementation, we have set the condition that the hash of each generated block must begin with two zeros ('00') in order for the block to be added to the blockchain. For example, we can modify the function that adds a new block to include this condition, as given below:

addNewTransactionBlockToTransactionChain(currentBlock) {
  currentBlock.previousTransactionHash =
    this.returnLatestBlock().currentTransactionHash;
  currentBlock.mineNewBlock(this.difficulty);
  this.transactionChain.push(currentBlock);
}

This calls the mining function (which validates the difficulty condition):

mineNewBlock(difficulty) {
  while (this.currentTransactionHash.substring(0, difficulty) !== Array(difficulty + 1).join('0')) {
    this.nonce++;
    this.currentTransactionHash = this.calculateBlockDigest();
  }
  console.log('New Block Mined --> ' + this.currentTransactionHash);
}

The complete code for this implementation can be seen in the branch block_chain_mining.

Blockchain providers

Blockchain technology, with its unprecedented way of managing trust and data and of executing procedures, can transform businesses. Here are some open source blockchain platforms.


Continued on page 103...

Let’s Try Developers

Get Familiar with the Basics of R This article tells readers how to get their systems ready for R—how to install it and how to use a few basic commands.


R is an open source programming language and environment for data analysis and visualisation, and is widely used by statisticians and analysts. It is a GNU package written mostly in C, Fortran and R itself.

Installing R

Installing R is very easy. Navigate the browser to and click on CRAN in the Download section (Figure 1). This will open the CRAN mirrors. Select the appropriate mirror and it will take you to the Download section, as shown in Figure 2. Grab the version which is appropriate for your system and install R. After the installation, you can see the R icon on the menu/desktop, as seen in Figure 3. You can start using R by double-clicking on the icon, but there is a better way available. You can install the R Studio, which is an IDE (integrated development environment)— this makes things very easy. It’s a free and open source integrated environment for R. Download R Studio from products/rstudio/. Use the open source edition, which is free to use. Once installed, open R Studio by double-clicking on its icon, which will look like what’s shown in Figure 4. The default screen of R Studio is divided into three sections, as shown in Figure 5. The section marked ‘1’ is the main console window where we will execute the R commands. Section 2 shows the environment and history. The former will show all the available variables for the console and their values, while ‘history’ stores all the commands’ history. Section 3 shows the file explorer, help

viewer and the tab for visualisation. Clicking on the Packages tab in Section 3 will list all the packages available in R Studio, as shown in Figure 6. Using R is very straightforward. On the console, type '2 + 2' and you will get '4' as the output (refer to Figure 7). The R console supports all the basic math operations, so one can think of it as a calculator. You can try more calculations on the console. Creating a variable is very straightforward too. To assign '2' to variable 'x', use any of the following ways:

> x <- 2
> x = 2
> assign("x", 2)
> x <- y <- 2

One can see that there is no concept of data type declaration. The data type is inferred from the value assigned to the variable. As we assign the value, we can also see the Environment panel display the variable and its value, as shown in Figure 8. The rm command is used to remove a variable. R supports several basic data types; to find the type of data held in a variable, use the class function, as shown below:

> x <- 2
> class(x)
[1] "numeric"

The four major data types in R are numeric, character,

Figure 1: R Project website

Figure 2: R Project download page


Figure 3: R icon after installation

Figure 4: R Studio icon

date and logical. The following code shows how to use various data types:

> x <- "data"
> class(x)
[1] "character"
> nchar(x)
[1] 4
> d <- as.Date("2017-12-01")
> d
[1] "2017-12-01"
> class(d)
[1] "Date"
> b <- TRUE
> class(b)
[1] "logical"

Figure 7: Using the console in R Studio

Apart from basic data types, R supports data structures or objects like vectors, lists, arrays, matrices and data frames. These are the key objects or data structures in R. A vector stores data of the same type. It can be thought of as a standard array in most programming languages. The 'c' function is used to create a vector ('c' stands for 'combine'). The following code snippet shows the creation of a vector:

> v <- c(10,20,30,40)
> v
[1] 10 20 30 40

The most interesting thing about a vector is that any operation applied to it will be applied to its individual elements. For example, 'v + 10' will increase the value of each element of the vector by 10.

> v + 10
[1] 20 30 40 50

Figure 5: R Studio default screen

This concept is difficult to digest for some, but it's a very powerful concept in R. A vector has no dimensions; it is simply a vector, and is not to be confused with vectors in mathematics, which have dimensions. A vector can also be created by using the ':' sign with start and end values; for example, to create a vector with values 1 to 10, use 1:10.

> a <- 1:10
> a
[1]  1  2  3  4  5  6  7  8  9 10

It is also possible to do some basic operations on vectors, but do remember that any operation applied on a vector is applied on individual elements of it. For example, if the addition operation is applied on two vectors, the individual elements of the vectors will be added:

Figure 6: Packages in R Studio

> a <- 1:5
> b <- 21:25
> a + b
[1] 22 24 26 28 30


Figure 8: R Studio Environment and console

> a - b
[1] -20 -20 -20 -20 -20
> a * b
[1] 21 44 69 96 125

A list is like a vector, but can store arbitrary or any type of data. To create a list, the 'list' function is used, as follows:

> l <- list(1, 2, 3, "ABC")
> l
[[1]]
[1] 1

[[2]]
[1] 2

[[3]]
[1] 3

[[4]]
[1] "ABC"

A list can be used to hold different types of objects. It can be used to store a vector, a list, a data frame or anything else. An array is nothing but a multi-dimensional vector that can store data in rows and columns. The array function is used to create an array:

> arr <- array(21:24, dim=c(2,2))
> arr
     [,1] [,2]
[1,]   21   23
[2,]   22   24

A data frame and a matrix are used to hold tabular data. They can be thought of as an Excel sheet with rows and columns. The only difference between a data frame and a matrix is that in the latter, every element should be of the same type. The following code shows how to create a data frame:

> x <- 1:5
> y <- c("ABC", "DEF", "GHI", "JKL", "MNO")
> z <- c(25,65,33,77,11)
> d <- data.frame(SrNo=x, Name=y, Percentage=z)
> d
  SrNo Name Percentage
1    1  ABC         25
2    2  DEF         65
3    3  GHI         33
4    4  JKL         77
5    5  MNO         11

So a data frame is nothing but vectors combined in a column format. This article gives a basic idea of how data is handled by R. I leave the rest for you to explore.

By: Ashish Singh Bhatia
The author is a technology enthusiast and a FOSS fan. He loves to explore new technology and to work on Python, Java and Android.

Continued from page 100... HyperLedger: Hyperledger nurtures and endorses a wide array of businesses around blockchain technologies, including distributed ledgers, smart contracts, etc. Hyperledger encourages the re-use of common building blocks and enables the speedy invention of distributed ledger technology components. Project link: Project GitHub link: Openchain: Openchain is an open source distributed ledger technology. It is ideal for enterprises, and deals in issuing and managing digital assets in a robust, secure and scalable way. Project link: Ethereum project: This is a distributed framework that runs smart contracts—applications that run exactly as programmed in a secured virtual environment without

downtime or the possibility of tampering, as this platform leverages the custom built blockchain. Project link: There are some more blockchain projects, links to which can be found in the References section.

Reference [1]

By: Abhinav Nath Gupta
The author is a software development engineer at Cleo Software India Pvt Ltd, Bengaluru. He is interested in cryptography, data security, cryptocurrency and cloud computing.




Clearing the terminal screen

Enter ‘clear’ without quotes in the terminal and hit the Enter button. This causes the screen to be cleared, making it look like a new terminal. —Abhinay B,

Convert image formats from the command line in Ubuntu

'convert' is a command line tool that works very well in many Linux based OSs. The 'convert' program is a part of the ImageMagick suite of tools and is available for all major Linux based operating systems. If it is not on your computer, you can install it using your package manager. It can convert between image formats as well as resize an image, and blur, crop, dither, draw on, flip, join and resample images from the command line. The syntax is as follows:

convert [input options] input file [output options] output file

For example, we can convert a PNG image to GIF by giving the following command:

convert image.png image.gif

To convert a JPG image to BMP, you can give the following command:

convert image.jpg image.bmp

The tool can also be used to resize an image, for which the syntax is shown below:

convert [nameofimage.jpg] -resize [dimensions] [newnameofimage.jpg]

For example, to convert an image to a size of 800 x 600, the command would be as follows:

convert image1.jpg -resize 800x600 newimage1.jpg

—Anirudh K,

Passwordless SSH to remote machine

It can be really annoying (mostly in the enterprise environment) when you have to enter a password each time you SSH to a remote machine. So, our aim here is to do a passwordless SSH from one machine (let's call it host A/User a) to another (host B/User b). Now, on host A, if a pair of authentication keys has not been generated for User a, generate one with the following command (do not enter a passphrase):

a@A:~> ssh-keygen -t rsa
Generating public/private rsa key pair.
Enter file in which to save the key (/home/a/.ssh/id_rsa):
Created directory '/home/a/.ssh'.
Enter passphrase (empty for no passphrase):
Enter same passphrase again:
Your identification has been saved in /home/a/.ssh/id_rsa.
Your public key has been saved in /home/a/.ssh/
The key fingerprint is:
3e:4f:05:79:3a:9f:96:7c:3b:ad:e9:58:37:bc:37:e4 a@A

This will generate a public key on /home/a/.ssh/. On host B, as User b, create the ~/.ssh directory (if not already present) as follows:

a@A:~> ssh b@B mkdir -p .ssh
b@B's password:

Finally, append User a's new public key to b@B:.ssh/authorized_keys and enter User b's password for one last time:

a@A:~> cat .ssh/ | ssh b@B 'cat >> .ssh/authorized_keys'
b@B's password:

From now on, you can log into host B as User b from host A as User a without a password:

a@A:~> ssh b@B

—Ashay Shirwadkar,

Performance analysis of code

In order to check the performance of the code you have written, you can use a simple tool called 'perf'. Just run the following command:

$ sudo apt-get install linux-tools-common linux-tools-generic

The above command will install the 'perf' tool on Ubuntu or a similar operating system.

$ perf list

The above command gives the list of all the events that can be measured by running 'perf'. For example, to analyse the performance of a C program, if you want to know the number of cache misses, the command is as follows:

$ perf stat -e cache-misses ./a.out

If you want to measure more than one event at a time, give the following command:

$ perf stat -e cache-misses,cache-references ./a.out

—Gunasekar Duraisamy,

Create a QR code from the command line

QR code (abbreviated from Quick Response Code) is a type of matrix bar code (or two-dimensional bar code) first designed for the automotive industry. There are many online websites that help you create a QR code of your choice. Here is a method that helps generate QR codes for a string or URL using the Linux command line:

$ echo "Tips and Tricks" | curl -F-=\<-

To generate the QR code for a domain, use the following code:

$ echo "" | curl -F-=\<-

Figure 1: Generated QR code

Note: You need a working Internet connection on your computer.

—Remin Raphael,

Replace all occurrences of a string with a new line

Often, we might need to replace all occurrences of a string with a new line in a file. We can use the 'sed' command for this:

$ sed 's/\@@/\n/g' file1.txt > file2.txt

The above command replaces the string '@@' in 'file1.txt' with a new line character and writes the modified lines to 'file2.txt'. sed is a very powerful tool; you can read its manual for more details.

—Nagaraju Dhulipalla,

Git: Know about modified files in changeset

Running the plain old 'git log' spews out a whole lot of details about each commit. How about extracting just the names of the files (with their paths relative to the root of the Git repository)? Here is a handy command for that:

git log -m -1 --name-only --pretty="format:" HEAD

Changing HEAD to a different SHA1 commit ID will fetch the names of the files changed in that commit. This can come in handy while tooling the CI environment.

Note: This will return empty on merge commits.

—Ramanathan M,

Share Your Open Source Recipes!

The joy of using open source software is in finding ways to get around problems—take them head on, defeat them! We invite you to share your tips and tricks with us for publication in OSFY so that they can reach a wider audience. Your tips could be related to administration, programming, troubleshooting or general tweaking. The sender of each published tip will get a T-shirt.



The latest, stable Linux for your desktop. Ubuntu Desktop 17.10 (Live)

Recommended system requirements: P4, 1GB RAM, DVD-ROM drive. In case this DVD does not work properly, write to the support team for a free replacement.
Ubuntu comes with everything you need to run your organisation, school, home or enterprise. All the essential applications, like an office suite, browsers, email and media apps come pre-installed, and thousands of more games and applications are available in the Ubuntu Software Centre. Ubuntu 17.10, codenamed Artful Aardvark, is the first release to include the new shell; so it's a great way to preview the future of Ubuntu. You can try it live from the bundled DVD.



January 2018

Fedora Workstation 27

Fedora Workstation is a polished, easy-to-use operating system for laptop and desktop computers, with a complete set of tools for developers and makers of all kinds. It comes with a sleek user interface and the complete open source toolbox. Previous releases of Fedora have included Yumex-DNF as a graphical user interface for package management. Yumex-DNF is no longer under active development; it has been replaced in Fedora 27 by dnfdragora, which is a new DNF front-end that is written in Python 3 and uses libYui, the widget abstraction library written by SUSE, so that it can be run using Qt 5, GTK+ 3, or ncurses interfaces. The ISO image can be found in the other_isos folder on the root of the DVD.

MX Linux 17 CD

MX Linux is a cooperative venture between the antiX and former MEPIS communities, which uses the best tools and talent from each distro. It is a mid-weight OS designed to combine an elegant and efficient desktop with simple configuration, high stability, solid performance and a medium-sized footprint. MX Linux is a great no-fuss system for all types of users and applications. The ISO image can be found in the other_isos folder on the root of the DVD.



What is a live DVD? A live CD/DVD or live disk contains a bootable operating system, the core program of any computer, which is designed to run all your programs and manage all your hardware and software. Live CDs/DVDs have the ability to run a complete, modern OS on a computer even without secondary storage, such as a hard disk drive. The CD/DVD directly runs the OS and other applications from the DVD drive itself. Thus, a live disk allows you to try the OS before you install it, without erasing or installing anything on your current system. Such disks are used to demonstrate features or try out a release. They are also used for testing hardware functionality, before actual installation. To run a live DVD, you need to boot your computer using the disk in the ROM drive. To know how to set a boot device in BIOS, please refer to the hardware documentation for your computer/laptop.



