Data Analytics for Social Microblogging Platforms

FIRST EDITION

Soumi Dutta

Department of Computer Application and Science, Institute of Engineering & Management (IEM), Kolkata, India

Asit Kumar Das

Department of Computer Science and Technology, IIEST Shibpur, Howrah, India

Saptarshi Ghosh

Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, Kharagpur, India

Debabrata Samanta

Department of Computer Science, CHRIST (Deemed to be University), Bangalore, India

Table of Contents

Cover image

Title page

Copyright

Dedication

About the authors

Preface

Acknowledgments

About the book

Part 1: Introduction of intelligent information filtering and organization systems for social microblogging sites

Chapter 1: Introduction to microblogging sites

Abstract

1.1. Introduction

1.2. Online social networking sites

1.3. Advantages and disadvantages of social networking

1.4. Microblogging sites

1.5. Information of social microblogging sites

1.6. Challenges in using microblogging sites

1.7. Background of the Twitter microblogging site

1.8. Motivation of research

1.9. Challenges and requirements of multi-document summarization

1.10. Contributions of this research

1.11. Conclusion

References

Chapter 2: Literature review on data analytics for social microblogging platforms

Abstract

2.1. Introduction

2.2. Attribute selection and its application in spam detection

2.3. Summarization with various methods

2.4. Cluster analysis of microblogs

2.5. Conclusion

References

Chapter 3: Data collection using Twitter API

Abstract

3.1. Introduction

3.2. Experimental dataset description

3.3. Data preprocessing

3.4. Removal of user names and URLs

3.5. Converting emojis and emoticons to words

3.6. Conclusion

References

Part 2: Microblogging dataset applications and implications

Chapter 4: Attribute selection to improve spam classification

Abstract

4.1. Introduction

4.2. Literature survey

4.3. Methodology for classification

4.4. Experimental dataset

4.5. Evaluating performance

4.6. Conclusion

References

Chapter 5: Ensemble summarization algorithms for microblog summarization

Abstract

5.1. Introduction

5.2. Base summarization algorithms

5.3. Unsupervised ensemble summarization

5.4. Supervised ensemble summarization

5.5. Experiments and results

5.6. Demonstrating the input and output of summarization algorithms through an example

5.7. Conclusion

References

Chapter 6: Graph-based clustering technique for microblog clustering

Abstract

6.1. Introduction

6.2. Related work

6.3. Background studies

6.4. Proposed methodology

6.5. Results and discussion

6.6. Conclusion

References

Chapter 7: Genetic algorithm-based microblog clustering technique

Abstract

7.1. Introduction

7.2. Related work

7.3. Clustering using genetic algorithms and K-means

7.4. Evaluating performance

7.5. Experimental dataset

7.6. Conclusion

References

Part 3: Attribute selection to improve spam classification

Chapter 8: Feature selection-based microblog clustering technique

Abstract

8.1. Introduction

8.2. Related work

8.3. Microblog clustering algorithms

8.4. Dataset for clustering algorithms

8.5. Experimental results

8.6. Conclusion

References

Chapter 9: Dimensionality reduction techniques in microblog clustering models

Abstract

9.1. Introduction

9.2. Literature survey

9.3. Proposed methodology

9.4. Dataset

9.5. Results and discussion

9.6. Conclusion

References

Chapter 10: Conclusion and future directions

Abstract

10.1. Introduction

10.2. Summary of contributions

10.3. Future research directions

References

Index

Copyright

Academic Press is an imprint of Elsevier

125 London Wall, London EC2Y 5AS, United Kingdom

525 B Street, Suite 1650, San Diego, CA 92101, United States

50 Hampshire Street, 5th Floor, Cambridge, MA 02139, United States

The Boulevard, Langford Lane, Kidlington, Oxford OX5 1GB, United Kingdom

Copyright © 2023 Elsevier Inc. All rights reserved.

No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopying, recording, or any information storage and retrieval system, without permission in writing from the publisher. Details on how to seek permission, further information about the Publisher's permissions policies and our arrangements with organizations such as the Copyright Clearance Center and the Copyright Licensing Agency, can be found at our website: www.elsevier.com/permissions.

This book and the individual contributions contained in it are protected under copyright by the Publisher (other than as may be noted herein).

Notices

Knowledge and best practice in this field are constantly changing. As new research and experience broaden our understanding, changes in research methods, professional practices, or medical treatment may become necessary.

Practitioners and researchers must always rely on their own experience and knowledge in evaluating and using any information, methods, compounds, or experiments described herein. In using such information or methods they should be mindful of their own safety and the safety of others, including parties for whom they have a professional responsibility.

To the fullest extent of the law, neither the Publisher nor the authors, contributors, or editors, assume any liability for any injury and/or damage to persons or property as a matter of products liability, negligence or otherwise, or from any use or operation of any methods, products, instructions, or ideas contained in the material herein.

ISBN: 978-0-323-91785-8

For information on all Academic Press publications visit our website at https://www.elsevier.com/books-and-journals

Publisher: Mara Conner

Editorial Project Manager: Franchezca A. Cabural

Production Project Manager: Punithavathy Govindaradjane

Cover Designer: Christian Bilbow

Typeset by VTeX

Dedication

I, Dr. Soumi Dutta, dedicate the book to my parents, daughter, brother, husband, friends, colleagues, and all my teachers.

I, Dr. Asit Kumar Das, dedicate the book to my wife, son, parents, friends, and colleagues.

I, Dr. Saptarshi Ghosh, dedicate the book to my wife, parents, friends, colleagues, and all my teachers.

I, Dr. Debabrata Samanta, dedicate the book to my parents Mr. Dulal Chandra Samanta, Mrs. Ambujini Samanta, my elder sister Mrs. Tanusree Samanta, brother-in-law Mr. Soumendra Jana and daughter Ms. Aditri Samanta.

About the authors

Dr. Soumi Dutta is Associate Professor at the Institute of Engineering & Management, Saltlake, India. She has completed her PhD in Engineering at IIEST, Shibpur. She received her B.Tech (IT) and M.Tech (CSE) as a Gold medalist from MAKAUT. She is certified as Publons Academy Peer Reviewer, 2020 and Certified Microsoft Innovative Educator, 2020. Her research interests include data mining, online social network data analysis, and image processing. She has published 50 conference and journal papers with publishing houses like Springer, IEEE, IGI Global, and Taylor & Francis. She has contributed five book chapters published by Taylor & Francis Group and IGI Global. She is a peer reviewer and TPC member of different international journals. She was editor of the CIPR-2020, CIPR-2019, IEMIS-2018, IEMIS-2020, CIIR-2021, IEMIS-2022 and special issues in IJWLTT. She is a member of several technical functional bodies such as IEEE, ACM, IFERP, MACUL, SDIWC, Internet-Society, ICSES, ASR, AIDASCO, USERN, IRAN, and IAENG. She has published six patents and one Indian Copyright. She has delivered more than 30 keynote talks at different international conferences.

Dr. Asit Kumar Das works as a Professor in the Department of Computer Science and Technology, Indian Institute of Engineering Science and Technology, Shibpur, Howrah, West Bengal, India. He has published more than 100 research papers in various international journals and conference proceedings, 1 book, and 4 book chapters. He has worked as a member of the Editorial/Reviewer Board of various international journals and conferences. He has shared his research experiences in many workshops and conferences as an invited lecturer in various institutes in India. He has acted as the general chair, program chair, and advisory member of committees of many international conferences. His research interests include data mining and pattern recognition in various fields, including bioinformatics, social networks, text mining, audio and video data analysis, and medical data analysis. He has already guided ten PhD students and is currently guiding six PhD students.

Dr. Saptarshi Ghosh is Assistant Professor at the Department of Computer Science and Engineering, Indian Institute of Technology Kharagpur, India. His primary research interests include social network analysis, legal analytics, and algorithmic bias and fairness. His research is interdisciplinary and uses techniques from machine learning, natural language processing, information retrieval, computational social science, and complex network theory. He heads the Max Planck Partner Group at IIT Kharagpur, which focuses on algorithmic bias and fairness. He received his PhD in computer science from IIT Kharagpur in 2013. He was a Humboldt Postdoctoral Research Fellow at the Max Planck Institute for Software Systems (MPI-SWS), Germany.

Dr. Debabrata Samanta is presently working as Associate Professor in the Department of Computer Science, CHRIST (Deemed to be University), Bangalore, India. He obtained his Bachelors in Physics (Honors) from Calcutta University, Kolkata, India. He obtained his MCA from the Academy of Technology, under WBUT, West Bengal. He obtained his PhD in Computer Science and Engineering from the National Institute of Technology, Durgapur, India, in the area of SAR image processing. He is keenly interested in interdisciplinary research and development and has experience spanning fields of SAR image analysis, video surveillance, heuristic algorithms for image classification, deep learning frameworks for detection and classification, blockchain, statistical modeling, wireless ad hoc networks, natural language processing, and V2I communication. He has successfully completed six consultancy projects. He has received an Open Access Publication fund. He has received funding under the International Travel Support Scheme in 2019 for attending a conference in Thailand. He is the owner of 21 patents (3 Indian patents designed, 2 Australian patents granted, 16 Indian patents published) and 2 copyrights. He has authored and coauthored over 189 research papers in international journals (SCI/SCIE/ESCI/Scopus) and conference proceedings published by publishing houses including IEEE, Springer, and Elsevier. He has received the “Scholastic Award” at the 2nd International conference on Computer Science and IT application, CSIT-2011, Delhi, India. He is a coauthor of 13 books and the coeditor of 11 books. He has presented various papers at international conferences and received Best Paper awards. He has authored and coauthored eight book chapters. He also serves as acquisition editor for Springer, Wiley, CRC, Scrivener Publishing, and Elsevier. He is a Professional IEEE Member, an Associate Life Member of the Computer Society of India (CSI), and a Life Member of the Indian Society for Technical Education (ISTE). He has been a convener, keynote speaker, session chair, cochair, publicity chair, publication chair, advisory board member, and technical program committee member for many prestigious international and national conferences. He was an invited speaker at several institutions.

Preface

This book focuses on microblogging sites, which have opened up a variety of new opportunities for communication, as well as new obstacles. Spammers and other types of users who upload dangerous content on microblogging sites are becoming more prevalent as the services become more popular. As a result, it is critical to screen spam posts from such sites. The abundance of information provided on microblogging services is a second difficulty. On Twitter, for example, more than 500 million posts (tweets) are published per day on average. Users are experiencing information overload as a result of the vast amount of information that is submitted. As a result, strategies for organizing information must be developed. The goal of this work is to create strategies for dealing with the two practical difficulties mentioned above: screening out hazardous content and organizing information on microblogging sites. We believe this book will be both instructive and provocative. We believe it will move the data analysis community forward, allowing each user to study various queries, applications, and future arrangements in order to make safe and secure plans for everybody. It also focuses on theory and methods in related disciplines such as intelligent information filtering and organization systems for social microblogging sites.

Acknowledgments

Soumi Dutta, Dr. Associate Professor, Department of Computer Application and Science, Institute of Engineering & Management (IEM), Kolkata, India

Asit Kumar Das, Dr. Professor, Department of Computer Science and Technology, IIEST Shibpur, Howrah, India

Saptarshi Ghosh, Dr. Assistant Professor, Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, Kharagpur, India

Debabrata Samanta, Dr. Associate Professor, Department of Computer Science, CHRIST (Deemed to be University), Bangalore, India

This work would not have been possible without close cooperation with many people who were always there when we needed them the most. We take this opportunity to acknowledge them and extend our sincere gratitude for helping us make this book a reality.

We would like to acknowledge the distinguished researcher Dr. Tanmoy Chakraborty, with whom we have coauthored a publication. We feel very fortunate as he has provided us with valuable suggestions.

We would also like to acknowledge Vibhash Chandra and Kanav Mehra of IIEST Shibpur, with whom we have coauthored several publications.

We are highly thankful to the director of IEM, Kolkata, Dr. Satyajit Chakraborti, and the Head of the Department of Computer Science and Application, Dr. Abhishek Bhattacharya, for their invaluable advice and moral support, which transformed our work. We also acknowledge the invaluable support provided by our colleagues.

On our journey we had some other wayfarers, whose support is unforgettable because of their loving disposition and who are now our friends.

Finally, we would like to acknowledge the people who mean the world to us: our parents, brothers, sisters, and children. We cannot imagine a life without their love and blessings. Thank you all for showing faith in us and giving us the liberty to choose what we desire. We consider ourselves the luckiest in the world to have such supportive families, standing behind us with their love and support. We express our great pleasure, sincere thanks, and gratitude to the people who significantly helped, contributed, and supported the completion of this book. Our sincere thanks go to Fr. Benny Thomas, Professor in the Department of Computer Science and Engineering, CHRIST (Deemed to be University), Bengaluru, Karnataka, India, for his continuous support, advice, and cordial guidance from the conception to the completion of this book.

About the book

The goal of this book is to discuss important computational techniques in the domain of microblogging datasets, such as pattern recognition, machine learning, data mining algorithms, rough set and fuzzy set theory, evolutionary computations, combinatorial pattern matching, and efficient data mining techniques, including clustering and classification. This book provides a comprehensive explanation of microblog datasets as a field that is focused on information, data, and knowledge in the context of natural language processing.

Chapter 1 states that Tumblr, Twitter, and Sina Weibo are three of the most popular online social microblogging platforms today. Microblogging platforms have become popular communication tools because they allow for fast information exchange. Every day, these sites generate massive amounts of data as a result of commercial, intellectual, and social activities. This crowdsourced data can be used for fraud detection, market analysis, spam detection, categorization or grouping of users based on their behavior, customer retention, and extraction of crucial news, as well as production control and scientific discovery. Microblog data is increasingly being used to build real-time search engines and recommendation systems, as well as services that mine and summarize public reaction to events. Microblogging sites, in addition to having a wide range of applications, also present a number of challenges in terms of exploitation of crowdsourced data, such as the need to filter out potentially harmful content uploaded by spammers and the need to organize the voluminous data. The purpose of this book, as stated in the next section, is to provide ways for dealing with these two challenges.

Chapter 2 provides a review of the literature on a number of themes. Filtering undesired information (e.g., spam), clustering, and summarization are three commonly used strategies to achieve information filtering and organization. Prior to clustering and summarization, attribute selection and dimensionality reduction are critical tasks. Because of the growing and diverse nature of microblog vocabulary, attribute selection plays an increasingly important role in data analysis. Attribute selection increases the generation of summarization, grouping, and classification procedures as well as reducing the dimension of the large dataset. The state-of-the-art attribute selection methods are explored in this section. Without any prior knowledge, cluster analysis looks for patterns in a collection to detect similarity across objects (i.e., unsupervised). Clustering is particularly important in the analysis of microblog data. Various standard clustering algorithms are briefly reviewed in this section. This section also covers related issues such as cluster quality evaluation measures and cluster validation. Automatic document summarization is a well-known problem in the field of information retrieval. Hundreds to thousands of microblogs (tweets) are routinely posted on Twitter during an emergency, making it impossible to read through each tweet individually. As a result, summarizing emergency microblogs has become a major research area in recent years. Some off-the-shelf extractive summarization algorithms are explored in this chapter. This chapter analyzes many publications that have employed attribute selection, summarization, and clustering approaches in the domain of online social microblogging sites, in addition to analyzing these methods in general.

Chapter 3 states that some of the most prominent microblogging services are Twitter, Facebook, and LinkedIn. On the Twitter microblogging site, which is one of the most popular websites on the Internet today, millions of users post real-time messages (tweets) on a variety of topics. Popular content on Twitter (i.e., content that is widely discussed) can be used for a variety of purposes on any given day, including content suggestion, marketing, and commercial campaigns. One of the most exciting characteristics of Twitter is its real-time nature: at any given time, millions of Twitter users are giving their thoughts on a wide range of topics or incidents/events occurring across the world. As a result, Twitter content is particularly useful for acquiring real-time information on a variety of topics. This section covers the dataset that was used in a number of experiments in the book. Twitter provides an API for gathering many types of data, including streams of tweets published through the website, user profile information, and so on. Twitter, in particular, provides a 1% random sample of all tweets published on the Twitter website worldwide. This chapter provides the reader with a quick rundown of the experimental dataset that was used to evaluate the various data analytics methodologies described in the book.
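The outline of Chapter 3 lists removal of user names and URLs and conversion of emoticons to words among the preprocessing steps. A minimal sketch of such a step is shown here for orientation only; the regular expressions, the function name preprocess_tweet, and the tiny emoticon table are assumptions for illustration and are not taken from the book.

```python
import re

# Hypothetical lookup table; a real pipeline would use a much larger mapping.
EMOTICON_WORDS = {":)": "smile", ":(": "sad", ":D": "laugh"}

URL_RE = re.compile(r"https?://\S+|www\.\S+")   # http(s) links and bare www links
USER_RE = re.compile(r"@\w+")                   # @mentions

def preprocess_tweet(text):
    """Strip URLs and user names, then replace known emoticons with words."""
    text = URL_RE.sub(" ", text)
    text = USER_RE.sub(" ", text)
    tokens = [EMOTICON_WORDS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)

cleaned = preprocess_tweet("@alice flood alert :( see https://t.co/abc now")
print(cleaned)
```

URLs are removed before mentions so that an "@" inside a link is not treated as a user name.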

Chapter 4 discusses how spammers are increasingly targeting online social network (OSN) sites, placing dangerous content on them as their popularity grows. Spam posts and spam accounts must consequently be filtered from OSNs. Several prior attempts to categorize spam on OSNs used a number of criteria to distinguish spam from legitimate entities. The purpose of this chapter is to improve spam categorization by developing an attribute selection method that identifies a smaller number of attributes while resulting in better classification. We explicitly apply rough set theory concepts to construct the attribute selection method. On five different spam classification datasets spanning a variety of OSNs, the suggested methodology's performance is compared to that of numerous baseline feature selection approaches. We discovered that the suggested strategy selects a smaller attribute subset than baseline procedures for the majority of datasets, but produces better classification performance than the other methods.
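To make the rough set idea concrete, the sketch below computes the standard rough-set dependency degree (the fraction of objects whose equivalence class under a set of attributes maps to a single decision value) and uses it in a greedy forward selection in the style of the well-known QuickReduct heuristic. This is a generic illustration, not the method proposed in Chapter 4; the attribute names and toy rows are hypothetical.

```python
from collections import defaultdict

def dependency(rows, attrs, decision):
    """Fraction of rows whose equivalence class under `attrs` has one decision value."""
    classes = defaultdict(list)
    for row in rows:
        classes[tuple(row[a] for a in attrs)].append(row[decision])
    consistent = sum(len(v) for v in classes.values() if len(set(v)) == 1)
    return consistent / len(rows)

def greedy_reduct(rows, attrs, decision):
    """Add the attribute that raises dependency most until the full-set value is reached."""
    target = dependency(rows, attrs, decision)
    selected = []
    while dependency(rows, selected, decision) < target:
        best = max((a for a in attrs if a not in selected),
                   key=lambda a: dependency(rows, selected + [a], decision))
        selected.append(best)
    return selected

# Hypothetical discretized features for four accounts; "spam" is the decision attribute.
rows = [
    {"links": 1, "caps": 0, "len": 1, "spam": 1},
    {"links": 1, "caps": 1, "len": 0, "spam": 1},
    {"links": 0, "caps": 1, "len": 1, "spam": 0},
    {"links": 0, "caps": 0, "len": 0, "spam": 0},
]
subset = greedy_reduct(rows, ["caps", "len", "links"], "spam")
print(subset)
```

In this toy table a single attribute already separates the two classes, so the selected subset is much smaller than the full attribute set, which is the kind of outcome the chapter summary describes.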

In Chapter 5, it is discussed that crowdsourced textual data from social media sites like Twitter, in particular, has emerged as a valuable source of real-time information on current events such as geopolitical events, natural and man-made disasters, and so on. During emergency situations, microblogging networks, particularly Twitter, have become vital sources of real-time situational information. During an emergency, hundreds to thousands of microblogs (tweets) are routinely posted on Twitter, making it hard to go through each one individually. As a result, summarizing microblogs written during emergency situations has been a major research topic in recent years. Extractive summarization algorithms have been developed to generate summaries of text in general and microblogs in particular. A few studies have looked into the utility of various summarization algorithms on microblogs. Rather than attempting to create a new summarization algorithm in this chapter, we examine whether existing off-the-shelf summarization systems can be combined to provide better-quality summaries than any of the separate algorithms. This chapter covers a variety of supervised and unsupervised techniques. Unsupervised methods divide the tweets identified by the underlying algorithms into groups based on some measure of tweet similarity and then select one tweet from each group. Algorithms using this method seek to find the most important tweets while avoiding redundancy in the final summary. Based on the rankings produced by several base methods and the performance of the base approaches on a training set, the supervised ensemble technique attempts to learn a ranking of tweets according to significance. The goal is to integrate multiple rankings to improve tweet ranking (for inclusion in the final summary). Experiments are carried out on microblogs related to four recent disasters, motivated by the relevance of microblog summaries during crisis situations. It is shown that the proposed ensemble methods can combine the outputs of many different base approaches to provide summaries that are superior to those of any of the base algorithms.
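The unsupervised ensemble idea described above (pool the tweets chosen by the base summarizers, group them by similarity, keep one tweet per group) can be sketched as follows. The Jaccard token similarity, the threshold of 0.5, and the rule of keeping the tweet chosen by the most base algorithms are assumptions made for this illustration; the chapter summary does not specify them.

```python
def jaccard(a, b):
    """Jaccard similarity between the lowercased token sets of two tweets."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def ensemble_summary(base_outputs, threshold=0.5):
    """Group pooled tweets by similarity and keep one tweet per group."""
    # Pool tweets selected by any base summarizer, dropping exact duplicates.
    pooled = list(dict.fromkeys(t for out in base_outputs for t in out))
    groups = []
    for tweet in pooled:
        for group in groups:
            if jaccard(tweet, group[0]) >= threshold:
                group.append(tweet)
                break
        else:
            groups.append([tweet])

    def votes(tweet):
        # Number of base summarizers that selected this exact tweet (assumed rule).
        return sum(tweet in out for out in base_outputs)

    return [max(group, key=votes) for group in groups]

# Hypothetical outputs of three base summarizers.
base_outputs = [
    ["Bridge on Main St collapsed", "Shelter open at city hall"],
    ["bridge on main st collapsed today", "Shelter open at city hall"],
    ["Power outage in north district", "Bridge on Main St collapsed"],
]
summary = ensemble_summary(base_outputs)
print(summary)
```

The two near-identical bridge tweets fall into one group, so only one of them reaches the final summary, which is how the grouping step avoids redundancy.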

In Chapter 6, it is discussed that with millions of users posting hundreds of millions of tweets per day, Twitter is one of the most popular social networking services on the internet. Twitter is now widely regarded as one of the most popular and fastest-growing communication platforms, and it is frequently used to stay up to date on current events and news items. While keyword matching might help you quickly find tweets about a specific event or news story, many of the tweets will have semantically identical content. It is difficult for a user to stay on top of an event or a news story if he/she needs to read all of the tweets that provide the same or redundant information. As a result, having effective methods for summarizing a large number of tweets is advantageous. We present a graph-based strategy for summarizing tweets in this chapter, in which a graph is first constructed based on tweet similarity, and then community detection algorithms are used to cluster similar tweets. Finally, a representative tweet from each cluster is picked for inclusion in the summary. Tweet semantic similarity is determined by a variety of factors, including WordNet synset-based features. The proposed method outperforms SumBasic, an existing summarization algorithm.
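The graph-based pipeline described above can be illustrated with a small sketch: build a graph whose edges connect sufficiently similar tweets, split it into groups, and pick one representative per group. Two simplifications are assumed here and are not the chapter's method: plain token overlap stands in for the WordNet-based semantic similarity, and connected components stand in for a community detection algorithm. The highest-degree node is used as the representative, which is also an assumption.

```python
from itertools import combinations

def token_overlap(a, b):
    """Jaccard overlap of lowercased tokens (stand-in for semantic similarity)."""
    a, b = set(a.lower().split()), set(b.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def summarize_by_graph(tweets, threshold=0.4):
    """Return one representative tweet per connected component of the similarity graph."""
    adj = {i: set() for i in range(len(tweets))}
    for i, j in combinations(range(len(tweets)), 2):
        if token_overlap(tweets[i], tweets[j]) >= threshold:
            adj[i].add(j)
            adj[j].add(i)

    seen, summary = set(), []
    for start in adj:
        if start in seen:
            continue
        component, stack = [], [start]
        seen.add(start)
        while stack:                       # depth-first walk of one component
            node = stack.pop()
            component.append(node)
            for nb in adj[node] - seen:
                seen.add(nb)
                stack.append(nb)
        # Representative: most connected tweet; ties go to the earliest tweet.
        rep = max(component, key=lambda n: (len(adj[n]), -n))
        summary.append(tweets[rep])
    return summary

# Hypothetical tweets about two sub-events.
tweets = [
    "flood waters rising near river road",
    "flood waters rising fast near river road",
    "river road flood waters rising",
    "volunteers needed at relief camp",
    "relief camp volunteers needed urgently",
]
summary = summarize_by_graph(tweets)
print(summary)
```

Swapping the component search for a proper community detection routine would change only the grouping step; the graph construction and representative selection stay the same.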

In Chapter 7, it is stated that Twitter, a microblogging platform, is one of the most widely used online communities today. During a major event, such as a disaster, a large number of tweets are instantly posted on Twitter. Because the information is posted too fast for anyone to follow, it must be categorized in order to be used effectively. Because many of the tweets created during an event are highly similar, clustering or grouping similar tweets is a good approach for minimizing the amount of information provided. Clustering, on the other hand, is difficult due to the small size and chaotic nature of tweets. In this chapter, we suggest a new tweet clustering strategy that combines two approaches: classic clustering with K-means and evolutionary clustering with genetic algorithms. We demonstrate that the proposed methodology outperforms existing clustering methods using a dataset of actual tweets gathered during a recent crisis event.
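One common way to combine a genetic algorithm with K-means is to let the genetic algorithm search over initial centroids and let K-means refine each candidate. The sketch below follows that pattern on small numeric vectors. It is only an assumption about how the two techniques might be combined, not the algorithm proposed in Chapter 7; the population size, mutation rate, and toy points are hypothetical, and tweets are assumed to be already converted to numeric vectors.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign(points, centroids):
    """Index of the nearest centroid for every point."""
    return [min(range(len(centroids)), key=lambda c: sq_dist(pt, centroids[c]))
            for pt in points]

def kmeans(points, centroids, iters=5):
    """A few Lloyd iterations; returns (labels, sum of squared errors)."""
    for _ in range(iters):
        labels = assign(points, centroids)
        updated = []
        for c in range(len(centroids)):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            updated.append(tuple(sum(x) / len(members) for x in zip(*members))
                           if members else centroids[c])
        centroids = updated
    labels = assign(points, centroids)
    sse = sum(sq_dist(pt, centroids[lab]) for pt, lab in zip(points, labels))
    return labels, sse

def ga_kmeans(points, k, pop_size=6, generations=5, seed=0):
    """Evolve sets of k point indices (initial centroids); fitness is the K-means SSE."""
    rng = random.Random(seed)
    population = [rng.sample(range(len(points)), k) for _ in range(pop_size)]

    def fitness(chrom):
        return kmeans(points, [points[i] for i in chrom])[1]

    for _ in range(generations):
        population.sort(key=fitness)
        parents = population[: pop_size // 2]          # keep the better half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, k) if k > 1 else 0  # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.2:                     # occasional mutation
                child[rng.randrange(k)] = rng.randrange(len(points))
            children.append(child)
        population = parents + children
    best = min(population, key=fitness)
    return kmeans(points, [points[i] for i in best])[0]

# Hypothetical 2-D vectors forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = ga_kmeans(points, k=2)
print(labels)
```

Keeping the better half of each generation means the best initialization found so far is never lost, so the final clustering is at least as good as the best candidate in the starting population.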

In Chapter 8, we discuss how the growing popularity of microblogging provides a varied platform for the general population to use as a communication medium. Every day, thousands of posts on any trending or non-trending topic are published as microblogs. A high number of messages are uploaded during any important event, such as a natural disaster, an election, or a sporting event like the IPL or the world cup. Because the rapid sending of messages causes information overload, clustering or grouping comparable messages is an effective approach for reducing it. Due to the small size and noisy nature of messages, grouping microblog data is tough. Incrementally growing data is another clustering challenge. Therefore, this chapter proposes a novel clustering approach for microblogs that integrates feature selection techniques. The proposed method has been evaluated on a range of experimental datasets and compared to a number of current clustering algorithms. The proposed methods outperform the other methods.

Chapter 9 observes that microblogging services like Twitter have risen to prominence as the preferred mode of public communication in recent years. Individual users can easily receive hundreds of microblogs (tweets) per day even if they are only mildly active on Twitter. Furthermore, a large number of tweets contain virtually the same content as a result of retweeting and reposting. People may be overburdened by these vast amounts of repetitive data, and no user can effectively assimilate so much information. Under these circumstances, it is necessary to develop methodologies to deal with the data overburden. One of the most effective strategies to manage Twitter's data overflow is to combine semantically similar tweets into groups, with the goal of showing only a few tweets from each group to each user. Multiple graph clustering algorithms based on dimensionality reduction for clustering microblogs are presented in this chapter. Through experiments on four different microblog datasets, it is demonstrated that the suggested clustering approaches outperform various established clustering algorithms.

Chapter 10 explores the fundamental goal of this book: to create efficient algorithms for information filtering and organization on social microblogging platforms. This final chapter highlights the book's accomplishments and suggests research directions. During unique situations such as natural and man-made disasters, information organization is vital. Thousands of tweets are published every hour during such disasters, and because a timely reaction is essential, responders (such as relief workers) must acquire a quick summary of the information posted. The summarizing and grouping algorithms described in this book are applied to microblogs generated during numerous emergency scenarios, keeping this criterion in mind. The experiments show that the offered procedures are effective. It should be noted that the book's chapters have employed a wide range of methodologies from numerous fields, including rough set theory, complex network analysis, evolutionary algorithms, ensemble algorithms, and other mathematical and statistical methods.
