
FIRST SOCIAL NETWORK ANALYSIS, TRUST & PEOPLE SEARCH TECHNIQUES
Human-enhanced time-aware multimedia search

CUBRIK Project IST-287704 Deliverable D3.2 WP3

Deliverable Version 1.0 – 31 December 2012
Document ref.: cubrik.D32.LUH.WP3.V1.0


Programme Name: IST
Project Number: 287704
Project Title: CUBRIK
Partners: Coordinator: ENG (IT); Contractors: UNITN, TUD, QMUL, LUH, POLMI, CERTH, NXT, MICT, ATN, FRH, INN, HOM, CVCE, EIPCM
Document Number: cubrik.D32.LUH.WP3.V1.0
Work-Package: WP3
Deliverable Type: Document
Contractual Date of Delivery: 31 December 2012
Actual Date of Delivery: 31 December 2012
Title of Document: First Social Network analysis, Trust & People Search Techniques
Author(s): LUH
Availability: This report is public

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. This work is partially funded by the EU under grant IST-FP7-287704.



Disclaimer

This document contains confidential information in the form of the CUbRIK project findings, work and products and its use is strictly regulated by the CUbRIK Consortium Agreement and by Contract no. FP7-ICT-287704. Neither the CUbRIK Consortium nor any of its officers, employees or agents shall be responsible or liable in negligence or otherwise howsoever in respect of any inaccuracy or omission herein. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7-ICT-2011-7) under grant agreement n° 287704. The contents of this document are the sole responsibility of the CUbRIK consortium and can in no way be taken to reflect the views of the European Union.



Table of Contents

Executive Summary

1. PicAlert!: A System for Privacy-Aware Image Classification and Retrieval
   1.1 Introduction
   1.2 Methods and Techniques
      1.2.1 Data and Crowdsourcing
      1.2.2 Features
      1.2.3 Classification
      1.2.4 Search
   1.3 Experiments
      1.3.1 Classification Quality
      1.3.2 Search Quality

2. Mining People’s Appearances to Improve Recognition in Image Collections
   2.1 Introduction
   2.2 Methods and Techniques
      2.2.1 Face Detection And Basic Recognition
      2.2.2 Graph-based Recognition
      2.2.3 Mining People’s Appearances
      2.2.4 Incorporating Mining Results
   2.3 Experiments
      2.3.1 Dataset
      2.3.2 Implementation Details and Setup
      2.3.3 Results

3. Exploring the Social Characteristics of YouTube Videos
   3.1 Introduction
   3.2 Data Collection, Methods and Characteristics
   3.3 Experiments

4. Analysis and Detection of Troll Users
   4.1 Introduction
   4.2 Experiments

5. Crowdsourcing for Deduplication Applied to Digital Libraries
   5.1 Introduction
   5.2 Methods and Techniques
      5.2.1 DuplicatesScorer
      5.2.2 Learning from the Crowd
      5.2.3 Computing the Crowd Decision for a Pair
      5.2.4 Quality Control for Crowdsourcing
      5.2.5 Learning to Deduplicate
   5.3 Experiments
      5.3.1 Experimental Setting and Dataset
      5.3.2 Crowd Decision Strategies vs. Optimization Strategies
      5.3.3 Integrated Duplicates Detection Strategies

6. Analyzing Emotions and Sentiments in Social Web Streams
   6.1 Introduction
   6.2 Methods and Techniques
      6.2.1 Data Collection Process
      6.2.2 Model of Emotion and Polarity Analysis for Political Figures
      6.2.3 Sentiment Analysis and Multilingualism
   6.3 Experiments
      6.3.1 Polarity Detection
      6.3.2 Emotion Detection
      6.3.3 Emotional Pattern Analysis and Opinion Poll

7. Efficient Diversity Analysis of Large Data Collections
   7.1 Introduction
   7.2 Methods and Techniques
      7.2.1 Diversity Index Definition
      7.2.2 Diversity Index Computation
   7.3 Experiments
      7.3.1 Data
      7.3.2 Performance
      7.3.3 Characterizing the Diversity in Corpora

8. References


Executive Summary

The rapidly increasing popularity and data volume of modern Social Web environments is mainly due to their ease of operation even for inexperienced users, suitable mechanisms for supporting collaboration, and the attractiveness of shared annotated material. Deliverable D3.2 provides an overview of the techniques for Social Web analysis, community analysis, and trust-related analysis developed in CUBRIK. Building on our analyses we developed methods for structuring, searching, and aggregating information on the Social Web. To this end, we gathered large-scale data collections from different Social Web environments such as the photo sharing platform Flickr, the YouTube video portal, and the micro-blogging environment Twitter. In addition, we annotated and enriched part of these data collections using crowdsourcing. The annotated data formed the basis for a variety of analyses and applications. The specific technical contributions described in this deliverable can be summarized as follows:

o In the context of image retrieval we developed the PicAlert! [33] system for privacy-oriented image classification and search (Section 1). It can be used for retrieving private content users are comfortable sharing and, more importantly, can help with the early discovery of privacy breaches. Furthermore, we developed a flexible three-stage framework to recognize people across a consumer photo collection [37] (Section 2).

o We explored social features, i.e. information that is created by different types of explicit or implicit user interaction with the system (such as likes, dislikes, favourites, comments, etc.), and analyzed their performance in guiding users to more relevant and higher-quality content in the context of video search [36] (Section 3).

o In the context of community analysis and trust, we conducted an exploratory analysis of the presence of troll users (i.e. users posting disruptive, false or offensive comments to fool and provoke other users) in social websites (Section 4).

o We employed crowdsourcing in an integrated system, where human and machine work together towards improving their performance on the task, focusing on the case of duplicate detection for scientific publications [12]. The findings were applied to our online publication search system FreeSearch (Section 5).

o We exploited Twitter data for aggregating and analyzing sentiments and emotions in that micro-blogging environment [34] (Section 6). Since users share their opinions on blogs, forums, and social networking sites such as Facebook or Twitter, these sites become an interesting source to mine for specific sentiments towards current affairs, which can be exploited to improve decision making processes.

o In the context of aggregate data analysis, we also developed efficient algorithms for computing an aggregate measure of topic diversity in large document corpora such as user generated content on the Social Web, bibliographic data, or other web repositories [35] (Section 7). Analyzing diversity is useful for obtaining insights into knowledge evolution, trends, periodicities, and topic heterogeneity of such collections.

Sections 1, 2, 3, 6, and 7 correspond mainly to Task 3.3 (Social Network Data Collection and Analysis) outlined in the Description of Work for CUBRIK; Sections 4 and 5 correspond mainly to Task 3.4 (Analysis of community roles and trust and people search).



1. PicAlert!: A System for Privacy-Aware Image Classification and Retrieval

1.1 Introduction

Multimedia retrieval is one of the central themes in the CUBRIK project. In this section we describe a system specifically developed for privacy-oriented image classification and search in Flickr. It is based on visual features as well as textual annotations available on the Social Web, and makes use of machine learning models that were built over training sets obtained via crowdsourcing. With the increasing availability of content sharing environments such as Flickr and YouTube, the volume of private multimedia resources publicly available on the Web has drastically increased. In particular, young users often share private images of themselves, their friends and classmates without being aware of the consequences such footage may have for their future lives. Users of photo sharing sites often lack awareness of privacy issues. Existing sharing platforms often employ rather lax default privacy configurations, and require users to manually decide on privacy settings for each single resource. Given the amount of shared information, this process can be tedious and error-prone; this is especially true for large batch photo uploads. Furthermore, image search engines do not provide the possibility to directly search for private images which might already be available on the web.

Figure 1 Graphical user interfaces of the services.

In this work [33] we demonstrate the PicAlert! privacy-oriented image search application. PicAlert! is able to identify and isolate images in a Flickr result set that are potentially sensitive with respect to user privacy. The application is based on a web service that automatically estimates the privacy degree of an image through classification of the image's content and context. It could be directly integrated into social photo sharing applications like Flickr or Facebook, or into browser plugins, in order to support users in making adequate privacy decisions in image sharing. The purpose of the application illustrated here is thus two-fold: warning the user about uploading potentially sensitive content on the one hand (Figure 1a), and privacy-oriented search on the other hand (Figure 1b). We are aware that building alarm systems for private content and enabling privacy-oriented search can be seen as contradictory goals; privacy-oriented search is not negative per se, as it can be used for retrieving private content users are comfortable sharing and, more importantly, can help with the early discovery of privacy breaches. However, as with almost every technology, it requires sensible handling and constructive usage.



1.2 Methods and Techniques

Figure 2 System architecture overview.

The system architecture is illustrated in Figure 2. Firstly, through crowdsourcing, we build a training set of private and public images. In the next step we extract visual and, if available, textual features which provide hints about the privacy degree of an image. We then train an SVM classifier which is used by our Search and Alert system for identifying potentially sensitive visual content. Finally, the user can access the application from arbitrary clients including desktops and mobile devices. In the following we provide a brief overview of the system components and show how results are presented to the user.

1.2.1 Data and Crowdsourcing

In order to obtain an appropriate dataset with labeled private and public image examples, we performed a user study in which we asked external assessors to judge the privacy of photos available online in an annotation game. To this end, we crawled 90,000 images from Flickr, using the "most recently uploaded" option to gather photos uploaded over a period of 4 months. At each step of the game we presented five photos to a participant of the study. For each photo, the participants had to decide if, in their opinion, the photo belonged to the private sphere of the photographer. Specifically, we asked the participants to imagine that the images presented to them were photos they took with their own cameras, and to mark these images as "private", "public", or "undecidable". We provided the following guidance for selecting the label: "Private are photos which have to do with the private sphere (like self portraits, family, friends, your home) or contain objects that you would not share with the entire world (like a private email). The rest are public. In case no decision can be made, the picture should be marked as undecidable." Over the course of the experiment, 81 users between 10 and 59 years of age labeled 37,535 images. Each picture was labeled private or public if at least 75% of the judges were of the same opinion. Overall the dataset contained 4,701 images labeled as private and 27,405 images labeled as public; the remainder were marked as undecidable.
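As a minimal illustration of this aggregation step (the function and names below are ours, not the PicAlert! code), the 75% agreement rule can be sketched in Python as:

    from collections import Counter

    def aggregate_label(judgments, threshold=0.75):
        # Assign 'private' or 'public' if at least `threshold` of the
        # judges agree; otherwise the image counts as undecidable.
        label, votes = Counter(judgments).most_common(1)[0]
        if votes / len(judgments) >= threshold:
            return label
        return "undecidable"

    # Example: five assessors judged one photo
    print(aggregate_label(["private"] * 4 + ["public"]))  # -> private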

1.2.2 Features

Digital images are internally represented as two-dimensional arrays of color pixels. This representation is difficult to use directly for classification because it is highly multidimensional and subject to noise. Instead, a process known as feature extraction is typically used to make measurements about the image content. Image features come in many forms, from the very low-level, to so-called high-level features. Low-level features are typically formed from statistical descriptions of pixels, whilst high-level features are those that have a directly attributable semantic meaning.



For this application, we have selected a range of image features that could potentially be used in building a classifier that can discriminate public and private images automatically. In particular, we observed that the occurrence of faces in a picture is strongly associated with a high degree of privacy, although a considerable number of faces can also be found in public images. Intuitively, color may be an indicator for certain types of public and private images. For example, public images of landscape scenes are very likely to have a specific color distribution. The edges within an image are a very powerful feature for discriminating between different indoor/outdoor scene types, and are useful for privacy classification. Finally, the SIFT descriptor [3] turned out to be the most powerful feature for our application. Private and public photos typically tend to be taken in specific contexts. For example, pictures can be taken in public places like stadiums, supermarkets and airports, or in private places like the home, car, or garden. Accordingly, the object parts contained in a photo, like sports equipment, furniture, and human and animal body parts, differ between such contexts; represented as SIFT features, they can thus give us insights about an image's privacy. For efficiency reasons we limited the visual features used for the demonstration application to face detection and SIFT features. Additionally, we made use of textual features including the image tags and title.
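To make the visual feature extraction concrete, the following sketch shows face detection and SIFT extraction with OpenCV. It is an illustration under our own assumptions rather than the exact PicAlert! implementation, which additionally derives textual features from tags and titles:

    import cv2

    def extract_visual_features(image_path):
        gray = cv2.cvtColor(cv2.imread(image_path), cv2.COLOR_BGR2GRAY)

        # Face occurrence: a strong privacy indicator (Haar cascade detector)
        cascade = cv2.CascadeClassifier(
            cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
        faces = cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)

        # SIFT descriptors capture local object parts (furniture, body parts, ...)
        sift = cv2.SIFT_create()
        _, descriptors = sift.detectAndCompute(gray, None)

        return {"num_faces": len(faces), "sift_descriptors": descriptors}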

1.2.3 Classification

We obtained a balanced training set by randomly restricting the initial image set to a subset of 9,402 images with an equal number of public and private images. The balanced set helps to capture general classifier properties independently of the a-priori class probabilities of the dataset. In the next step we built classifiers using the SVMlight [1] classification software. The results of the classification experiments for selected visual features are presented in the system evaluation section.
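SVMlight was used for the actual experiments; as an illustrative stand-in, an equivalent linear SVM can be trained with scikit-learn (the feature matrix below is random placeholder data):

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC
    from sklearn.metrics import classification_report

    # Placeholder data: in the real system X holds visual/textual feature
    # vectors and y the crowd labels (1 = private, 0 = public).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(9402, 128))
    y = rng.integers(0, 2, size=9402)

    # 60/40 train/test split with equal class proportions (cf. Section 1.3.1)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=0.6, stratify=y, random_state=0)

    clf = LinearSVC().fit(X_train, y_train)
    print(classification_report(y_test, clf.predict(X_test)))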

1.2.4 Search

In order to create a list of images ranked by privacy, we estimated the likelihood of image privacy using the output of the SVM classifier trained on a set of images labelled as "public" or "private" by the users. We use the Flickr API as the underlying search provider for our PicAlert! search service. The user interface of the application consists of a text box, and a keyword search can be performed by pressing the "Search" button. The difference to other engines lies mainly in the search result representation. PicAlert! divides the results into two sets: "public" and "private". Additionally, each set is divided into three subsets according to the classifier confidence intervals, denoted by color: green corresponds to strong classifier confidence, yellow to moderate, and red to weak confidence. Figure 1b shows an example of the results representation for the query "christiano ronaldo". In the left ("public") part we observe that the majority of pictures are related to sporting events, whilst the right ("private") part is mostly dominated by photos about Ronaldo's private life.
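A sketch of this bucketing by classifier confidence (the two thresholds are illustrative; the actual confidence intervals used by PicAlert! are not specified here):

    def bucket(decision_value, strong=1.0, moderate=0.5):
        # Map an SVM decision value to a result set and a confidence color.
        side = "private" if decision_value >= 0 else "public"
        magnitude = abs(decision_value)
        if magnitude >= strong:
            color = "green"    # strong confidence
        elif magnitude >= moderate:
            color = "yellow"   # moderate confidence
        else:
            color = "red"      # weak confidence
        return side, color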

1.3 Experiments

1.3.1 Classification Quality

In order to evaluate our classification approach, from the initial dataset we randomly sampled 60% as training data for building our classifiers, and 40% as test data, with each set containing an equal proportion of public and private instances. Our quality measures for the classification are the precision-recall curves as well as the precision-recall break-even points for these curves. The break-even point (BEP) is the point at which precision equals recall; at this point it coincides with the F1 measure, the harmonic mean of precision and recall. The performance of the features and their combination is shown in Figure 3. The visual features alone lead to a BEP of 0.74. The text features provide a short but concise summary of the image content and result in a BEP of 0.78. Finally, the combination of the visual and textual features leads to an additional performance boost with a BEP of 0.80, showing that textual and visual features can complement each other in the privacy classification task. However, classification with visual features alone also produces promising results, and is useful if limited or no textual annotations are available, as is the case for many photos on the web.

Figure 3 P/R curves for the features and their combination.
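For reference, the break-even point can be read off a precision-recall curve as the point where precision and recall (approximately) coincide, e.g. (a sketch with made-up labels and scores):

    import numpy as np
    from sklearn.metrics import precision_recall_curve

    def break_even_point(y_true, scores):
        # Point on the P/R curve where precision is closest to recall
        precision, recall, _ = precision_recall_curve(y_true, scores)
        idx = np.argmin(np.abs(precision - recall))
        return (precision[idx] + recall[idx]) / 2

    print(break_even_point([1, 0, 1, 1, 0, 1], [0.9, 0.4, 0.65, 0.35, 0.2, 0.8]))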

1.3.2 Search Quality

In order to evaluate our search ranking quality, we randomly chose 50 image-related queries from an MSN search engine query log. For each query, we computed privacy-oriented rankings using the pre-trained classifier. The list of test photos in descending order of their user-assigned privacy value was considered as ground truth for our experiments. We compared the automatically generated rankings against this ground truth using Kendall's Tau-b [2]. We chose the Tau-b version in order to avoid a systematic advantage of our methods due to the many ties produced by the high number of photos with equal user ratings. The original Flickr ranking does not consider the privacy of the images in the search results; this is reflected by a small correlation value of -0.04. In contrast, our methods show a clear correlation with the user-based privacy ranking, and the combination of textual and visual features provides the best ranking performance.
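SciPy's kendalltau implements the tie-aware Tau-b variant by default; a minimal usage sketch with made-up ranks:

    from scipy.stats import kendalltau

    # Ground truth: photos in descending order of user-assigned privacy value
    # (ties arise from equal user ratings); predicted: the classifier's order.
    tau, p_value = kendalltau([1, 2, 2, 3, 4, 5], [1, 3, 2, 2, 5, 4])
    print(f"Kendall's Tau-b: {tau:.2f}")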



2. Mining People’s Appearances to Improve Recognition in Image Collections

2.1 Introduction

In addition to studying privacy-oriented image classification and search as described in the previous section, we went a step further in CUBRIK and developed a data mining approach [37] for discovering and incorporating patterns of groups of people frequently appearing together in images. Faces, along with the people’s identities behind them, are an effective element in organizing a personal collection of photos, as they represent who was involved. However, along with the fact that current face recognition approaches require a notable amount of user interaction for training a classification model, the accurate recognition of faces itself is still very challenging. Particularly in the uncontrolled environments of consumer photos, faces are usually neither perfectly lit nor perfectly captured. Wide variations in pose, expression or makeup are common and difficult to handle as well. One way to address this issue is to keep improving face recognition techniques. Another way is to incorporate context; in other words, to consider additional information aside from just the faces within photos and across entire collections. As a wide variety of literature shows, such contextual information might include time, location or scene. The people appearing in photos can also provide further information. For instance, a person’s demographics such as gender and age are often easier to surmise than his or her identity. The same applies to ethnic indicators, including skin tone and traditional costumes. Finally, clothing in general can provide useful contextual information. In this work we show how to recognize people in Consumer Photo Collections by employing a graphical model together with a distance-based face description method. To further improve recognition performance, we incorporate context in the form of social semantics. We demonstrate the effect of this probabilistic approach through experiments on a dataset that spans nearly ten years.

2.2 Methods and Techniques

We present a flexible framework in three stages to recognize people across a Consumer Photo Collection. Compared to traditional approaches where a classifier is typically trained to recognize a single face at a time, we intend to consider and thus recognize all people’s appearances within an entire dataset simultaneously. To accomplish this, we first lay out a probabilistic graphical model with a similarity- or distance-based description technique at its core. Next, to further improve recognition performance, we aim to incorporate context in the form of social semantics that are usually only implicitly at hand within photo collections. For example, family relations are usually not labeled, but it is often possible to infer this information by looking at multiple photos spanning a longer time period. Thus, we propose a method that utilizes a data mining technique at its core to discover patterns of groups of people who frequently appear together. In order to discover such patterns even when the training set is sparse, we devise our overall approach in an iterative fashion. Lastly, we extend our initial recognition framework by incorporating the gained additional information in an effective way. In the next two sections, we summarize how we detect and discriminate faces and reiterate the basics of our graph-based recognition approach, both of which we detail in [23]. Thereafter, we set forth the techniques we employ to mine people’s appearances, process them and incorporate them into a unified framework. We demonstrate the effectiveness of the proposed approach with experiments on a new dataset and present our conclusions.

2.2.1 Face Detection And Basic Recognition

We choose to utilize the seminal work of Viola and Jones, included in the OpenCV package, to detect faces. Their detection framework builds upon Haar-like features and an AdaBoost-like learning technique. The face recognition technique we introduce next provides some leeway for minor misalignment, and thus scaling the patches identified as faces to a common size and converting them to a gray-scale representation is the only normalization we perform. Compared to holistic face recognition approaches that typically require training, we turn to a feature-based method using histograms of Local Binary Patterns [25] that allows us to directly compute face descriptors and subsequently compare these with each other based on a distance measure (e.g. utilizing the χ² statistic). To actually recognize faces, the most straightforward approach is then nearest-neighbor matching against a set of known face descriptors.
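A condensed sketch of this descriptor-and-distance scheme, using scikit-image's LBP implementation (the parameter choices are illustrative, not the exact configuration of [25]):

    import numpy as np
    from skimage.feature import local_binary_pattern

    def lbp_histogram(face_patch, points=8, radius=1, bins=59):
        # Describe a gray-scale face patch by a histogram of Local Binary Patterns
        lbp = local_binary_pattern(face_patch, points, radius, method="nri_uniform")
        hist, _ = np.histogram(lbp, bins=bins, range=(0, bins), density=True)
        return hist

    def chi_squared(h1, h2, eps=1e-10):
        # Chi-squared distance between histograms (smaller = more similar)
        return 0.5 * np.sum((h1 - h2) ** 2 / (h1 + h2 + eps))

    def nearest_neighbor(query_hist, known):
        # Match a descriptor against known (label, histogram) pairs
        return min(known, key=lambda lh: chi_squared(query_hist, lh[1]))[0]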

2.2.2 Graph-based Recognition

Figure 4 People recognition. Left: Face and people recognition by traditional nearest-neighbour matching (each testing sample Te is independently matched against all training samples Tr) and our graph-based framework that considers all people’s appearances within an entire dataset simultaneously (faces are represented by interconnected nodes, where each node encodes a list of state likelihoods based on face similarities). Right: Exemplary photos of the employed dataset.

Like in our previous work [24], we employ a graphical model to further improve recognition of people (e.g. by incorporating constraints). We set up one graph with nodes for both a testing and a training set (signified by samples Tr and Te in the left side of Figure 4), where we condition on the observed training samples with known label classes. The states of the nodes reflect the people’s identities. We use the graph’s unary node potentials to express how likely people’s appearances belong to particular individuals (in the training set). We define the unary term as in Equation 1, in terms of the node states, a face similarity function (the distances among faces), and an optional normalization constant. Note that for each state we retain only the closest match among several possible training samples of the same label class.

(1)

As there is one unary potential for each node, there is one pairwise potential for each edge (connecting two nodes) in a pairwise CRF. The pairwise potentials allow us to model the combinations of states two connected nodes can take, and thus to encourage spatial smoothness among neighboring nodes in terms of their states. They also allow us to enforce a uniqueness constraint that no individual person can appear more than once in any given photo:



(2)

2.2.3 Mining People’s Appearances

Our aim is to discover and exploit patterns of how people appear together in a photo collection. We do so primarily by finding repeated appearances of groups of people. While multiple people may appear together in a single photo, it is evident that people typically do not congregate for the sole purpose of taking one photo. Instead, people usually meet for a longer time period (e.g. numerous relatives attending the same family birthday party), over which multiple photos might get captured. Thus, it is also necessary to model the case of an individual person who only appears by himself or with different members of the group in all photos captured throughout such a period, which we refer to as an event. Note that for simplicity we are only interested in consecutive, non-overlapping events. Finding such patterns based solely on the information given by the training set (some labeled people) is usually not a viable option because of the often low number of training samples. Thus, we devise the following iterative approach:

1. First, we use our basic graph-based approach (as described in Section 2.2.2) to initially recognize people.

2. Then, based on these preliminary recognition results, we attempt to discover appearance patterns.

3. Lastly, we perform inference - as in Step 1 - a second time while also considering the information gained through pattern mining in Step 2 to refine the initial recognition results.

Note that in Step 1 we also compute a measure of confidence for a person’s face appearance based on probabilities (corresponding to the nodes’ states) provided by the graph’s inference method. We utilize this confidence measure later.
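To make Step 2 concrete, here is a minimal frequent co-appearance miner over per-event people sets (pure Python; the actual FIM component and its thresholds are not specified here, so the names and values are illustrative):

    from itertools import combinations
    from collections import Counter

    def mine_patterns(events, min_support=2, max_size=4):
        # events: list of sets, each holding the distinct people preliminarily
        # recognized in one event. Returns frequent groups of size >= 2 with
        # the number of events in which they co-occur.
        counts = Counter()
        for people in events:
            for size in range(2, min(max_size, len(people)) + 1):
                for group in combinations(sorted(people), size):
                    counts[frozenset(group)] += 1
        return {g: f for g, f in counts.items() if f >= min_support}

    events = [{"Alice", "Bob", "Carol"}, {"Alice", "Bob"}, {"Bob", "Carol", "Dan"}]
    print(mine_patterns(events))  # {Alice, Bob}: 2 and {Bob, Carol}: 2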

2.2.4 Incorporating Mining Results

At this point we have numerous appearance patterns; however, we do not know to which events they apply. Thus, we show next how to match both. Like in the previous section, we first discard all uncertain appearances, i.e. appearances whose associated confidence values fall below a certain threshold. Then, for each event, we compile an intermediate result as follows:

1. First, we form a set of the individual people associated with the event.

2. We then include people who are part of the training set and are associated with the event, but are not contained within this set (because of possible recognition errors during our initial run).

3. Next, we match the given event with any appearance pattern based on its set of associated people. To do so, we form a list of matching transactions by iterating through all transactions in the FIM's (frequent itemset mining) result and including any transaction that matches the following criteria:

   • The number of distinct individuals contained in both the transaction and the event's people set is above a threshold.

   • The number of distinct individuals who appear either only in the transaction or only in the event's people set is below a threshold.

4. Then, we transform the sets of people within the matching transactions into stacked vectors with a vector length equal to the number of total individuals. If an individual appears in a set, we store the transaction's frequency at the individual's position in the vector.

5. Finally, we sum up all vectors of the given event into a histogram vector, thus aggregating the frequencies.

All histogram vectors together then make up the final intermediate result, which we now incorporate into our graph-based recognition framework with the aim of improving recognition on its subsequent run. Again, we perform several steps for each event, as follows:

1. We identify the applicable people in the corresponding histogram vector:

   a. First, we omit people with a low frequency (below a minimum threshold).

   b. Then, we compute the mean average of the remaining frequencies, and omit people whose frequency is less than a fraction of this mean. The remaining people define the applicable set.

2. If this set corresponds to less than two distinct people, the mining result is not useful for the given event and we continue with the next event.

3. Next, we include people who are part of the training set and are associated with the event, but are not contained within the set (again, we do this because of possible recognition errors during our initial run). We initialize their frequencies with a default value.

At this point, the histogram vector tells us which people should be present in a given event according to the previously discovered patterns. Moreover, the aggregated frequency values indicate how likely this is true for particular people. We propose two complementary ways to incorporate this information. Recall that the graph's unary node potentials (Equation 1) express how likely people's appearances belong to particular individuals. We propose to adjust the potentials according to the histogram vector based on the following exponential regularization, where the parameter λ is used to dampen large frequency values:

(4)

Note that adjusting the unary potentials primarily affects only the recognition of appearances associated with an event. Thus, we also propose to influence how we establish the graph's edges (recall that these reflect dependencies among nodes and thus appearances). Let s be a vector signifying an appearance's similarities with every other appearance (as used when we find the closest matches among all appearances as outlined in Section 2.2.2). Since we obtain a preliminary recognition result w.r.t. the label prediction of the appearances in our initial step, we are able to adjust this similarity vector for every appearance associated with a given event. Similarly to before, we also incorporate a regularization parameter α:

(5)
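As an illustration of the first adjustment only (Equation 4 itself is not reproduced above), one plausible exponential regularization that boosts the unary potential of a person supported by the histogram vector, with λ dampening large aggregated frequencies, is:

    import math

    def adjust_unary(potential, frequency, lam=20.0):
        # Boost the unary potential of a person predicted by the mined
        # histogram vector; a larger lam yields a gentler boost.
        return potential * math.exp(frequency / lam)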

2.3 Experiments

2.3.1 Dataset

Typical face datasets like FERET or LFW are not suitable for our aim of recognition in Consumer Photo Collections. To better demonstrate our data mining driven approach, we compile a more challenging dataset that spans a significantly longer time period and contains notably more photos and face appearances. Similarly to the Gallagher Dataset, all photos are shot in an uncontrolled environment with a typical consumer camera. Many of the photos depict the main subjects, a couple and their friends and family, in a broad variety of settings and scenes both indoors and outdoors (see Figure 4, right side). The dataset depicts a total of 56 different individual people over a period of nearly ten years. Altogether, there are roughly 3000 face appearances spread over approximately 2200 photos. Most individuals appear at least ten times in total. Note that infants and children appear quite seldom; most people are grown-ups. The number of distinct people appearing together throughout the events (e.g. when we consider an event to last around one day) is mostly under ten, but in several cases more than 15. The ground truth specifies the true face boundaries along with a unique class label. All photos include EXIF metadata as embedded by the camera, and it is thus possible to extract the time of capture.

2.3.2 Implementation Details and Setup

We refer the reader to our previous work [23] for details on Sections 2.2.1 and 2.2.2. Note, however, that in this work we do not consider any of the social semantics that [23] introduces except people’s uniqueness within photos. We are interested in evaluating the effectiveness of our graph-based approach that considers social semantics gained through a data mining technique. Thus, we first compare our basic graph-based approach against the traditional nearest-neighbor method. Then, we evaluate the intermediate outcome of the proposed data mining technique (separately) against the ground truth, as well as the impact of the overall approach when incorporating the gained data mining results. Lastly, to avoid being influenced by the face detection method, we only consider correctly detected faces as verified against the ground truth. Our primary measure is then the 1-hit recognition rate.

Figure 5 Recognition performance. Left: Basic recognition performance (1-hit recognition rate over training set size, without using mined appearances) for nearest-neighbor matching and the graphical model. Right: Evaluation of the intermediate result depending on different parameter configurations, listing per configuration the percentage of matched events and of correctly and incorrectly predicted individuals (all values are in percent).

For all experiments, we split the dataset into a training set Tr and a testing set Te, such that Tr represents a small random but stratified subset of the dataset. By default, we use a rather small training set size of 3% (with at least three samples of each label class). We repeat all experiments five times and average the results. If not otherwise mentioned, we base all experiments on a fixed default parameter configuration using MAP-based graph inference, comprising the various thresholds and regularization parameters introduced above and an event clustering window measured in hours.

2.3.3 Results

Face Detection and Recognition. Given our dataset, the Viola-Jones face detection method we utilize correctly detects 2498 faces (true positives) with respect to the ground truth. Recall that we retain only correctly detected faces for recognition. The left plot of Figure 5 illustrates the 1-hit recognition rates for varying training set sizes for the baseline nearest-neighbor matching approach as well as our graph-based approach. We notice notably better results for the latter, especially when the training set size is small (in that case, up to 15% gain).

Mining People’s Appearances. We find 104 temporal cluster sets (or events) by applying Mean-Shift on the photos’ capture timestamps. Based on our preliminary recognition results (as outlined in Section 2.2.3, Step 1), we are able to mine between 287 and 98781 appearance patterns on average (recall that we average the results over five runs), depending on the various parameter combinations. Using either of the stricter filter settings greatly reduces the number of patterns to below 6500. Note that we limit our experiments to a small set of suitable parameter values.

Incorporating Mining Results.

Figure 6 Final recognition rate when incorporating the mined appearance patterns, depending on λ and the mining configuration (left; configurations include the default, a higher support of 3, and relaxed matching) and on λ and α (right; α varied from 0.7 to 1.6).


Next, we evaluate the intermediate result. In the right table of Figure 5 we list for how many events we can find a matching appearance pattern (where 100% reflects the optimum result). Moreover, we compare the applicable patterns (those that match an event) against the ground truth w.r.t. the amount of individuals correctly predicted (we desire high precision values) as well as the amount of incorrectly predicted individuals (we desire low values). Due to lack of space, we only list some parameter combinations.

We notice that stricter parameter values lead to more correctly predicted individuals (and, likewise, fewer incorrectly predicted individuals), but they also lead to fewer matching events. For example, we are able to find at least one matching pattern for about 94% of all events using our default configuration, but only for about 80% when using a higher support of 3. In both cases, we predict about 80% of the people collectively appearing in the events. We also see that some of the remaining parameters do not show much positive effect.

The two plots of Figure 6 show the final recognition performance when incorporating the mined appearance patterns. We achieve the best performance (about 71%) when α is between 0.7 and 1.2 and λ is between 15 and 25. We notice that, as long as many events are matched to an appearance pattern, these parameters are quite insensitive w.r.t. the performance. Looking only at matched events, we notice recognition gains of nearly 20% (e.g. with a stricter configuration where we match only 64% of the events, but those with a higher precision). Overall, we are able to improve recognition performance by around 8% and 22% when compared to our basic graph-based approach and a traditional nearest-neighbor method, respectively.



3. Exploring the Social Characteristics of YouTube Videos

3.1 Introduction

Social Web content is accompanied by explicit and implicit user feedback including favourite assignments, comments, and clicks. In the CUBRIK project we studied the impact and implications of this feedback in the context of video search. What happens when a user clicks on a like or dislike button or posts a comment for a digital object, say a blog post, interesting photo or funny video displayed in her favourite Web 2.0 platform? Other than being shown to future visitors of the same object (and serving as a medium of user participation/interaction), can these social signals help the underlying search systems guide their users to better-quality or more relevant content? Despite the rapid and growing interest in Web 2.0 applications from both industry and researchers from various disciplines, these questions are still not clearly answered. While we witness some recent moves from big players towards a more social search (such as the Google+ application and Bing's expansion of results with those "liked by" the users' Facebook friends1), the ways search engines and/or Web 2.0 applications exploit social signals (if they ever do) are usually not disclosed. In terms of academic research, there exists a large body of work analyzing the rich content posted on Web 2.0 platforms [18][16][21] that has also fuelled research in recommendation systems, opinion mining, trend analysis, etc. Social features, as we call them here, refer to the information that is created by some explicit or implicit user interaction with the system (such as likes, dislikes, favourites, comments, etc.). Our work [36] essentially explores the characteristics of YouTube query results with respect to the social features. Here we provide an in-depth analysis of the social features associated with the top-ranked videos retrieved by YouTube for 1,450 real user queries. The queries are obtained from a major search engine's auto-completions specialized for YouTube and their top-300 results are retrieved from the YouTube API, making our dataset a unique and valuable collection to work on. In Section 3.2 we describe the details of our dataset and in Section 3.3 we report several interesting statistics regarding the queries, their resulting videos and the characteristics of these videos in terms of the associated social features.

1 http://www.pcmag.com/article2/0,2817,2380874,00.asp

3.2 Data Collection, Methods and Characteristics

Query Set Q. We first obtained around 7,000 queries using the auto-completion based suggestion service specialized for the YouTube domain from a major search engine. In particular, we submitted all possible combinations of two-letter prefixes in English (like aa, ab, ..., zz) and collected the top-10 query suggestions for each such prefix (e.g., "aaliyah", "aaron carter", "abba dancing queen", etc.) in a similar fashion to [17]. From this initial set, we sampled a subset of 1,450 queries, denoted as Q, which constitutes the query set for this study. Note that, different from all earlier studies that seeded their crawlers with generic queries (e.g., queries from Google's Zeitgeist archive [19] or terms from blogs and RSS fields [20]), we employ a set of real YouTube queries, as the collected query suggestions are based on real and potentially popular queries previously submitted to YouTube.
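A sketch of this prefix-based suggestion crawl (the endpoint and response format below are placeholders, not the actual service used):

    import itertools
    import string
    import requests

    def collect_suggestions(endpoint="https://suggest.example.com/complete"):
        # Query an auto-completion service with every two-letter prefix aa..zz
        suggestions = set()
        for a, b in itertools.product(string.ascii_lowercase, repeat=2):
            resp = requests.get(endpoint, params={"ds": "yt", "q": a + b})
            # Assumed response shape: [prefix, [suggestion, suggestion, ...]]
            suggestions.update(resp.json()[1][:10])
        return suggestions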



3.3 Experiments

For each q in Q, we obtained the top-300 result videos (denoted as Rq) from the YouTube API, along with the available metadata fields (see Table 1), in late 2011. This process resulted in a superset of 380K videos, i.e., around 262 videos were retrieved per query on average. Among these videos, 365K are unique (i.e., only 4% of all videos overlap among different query results). The set of unique videos is denoted as V in this work.

Table 1 Metadata fields stored for each video v in V.

In addition to the metadata fields directly available via the API, we crawled up to the 10,000 most recent comments posted for each video from the actual HTML responses of YouTube (the API can provide only up to 1,000 comments). Due to the difficulties of crawling HTML, we could obtain around 33 million comments posted for 86K unique videos in our dataset. This is a fairly large set of comments, as recent works also employ similar (e.g., up to 1,000 comments for 40K videos in [20]) or smaller numbers of comments (e.g., a total of 6.1 million comments in [19]). Finally, we also constructed the profiles of the users who uploaded the videos in V. To this end, for each user u, we again crawled HTML pages to obtain the number of uploaded videos, the number of subscribers (i.e., the number of users that are following user u), and the total number of views for the content uploaded by user u. We ended up with profiles for 208K unique users, denoted as U. In Table 2 we provide basic statistics on the appropriate metadata fields computed over the set of 365K unique videos, V. As V includes the videos retrieved for mostly popular search suggestions, some of the popularity-related metadata statistics are much higher than those obtained for YouTube crawls that are closer to random distributions (i.e., those based on generic queries). For instance, the average view count is 274K for our dataset, an order of magnitude larger than the values reported in some recent works (e.g., see Table 1 in [21] and Table 2 in [18]). We also find that, on average, the videos in our query results attract more than an order of magnitude more likes than dislikes. Table 2 also shows that the average number of comments in our collection is 0.18% of the average number of views, i.e., very close to the 0.16% reported by [16], but interestingly, less than the 0.5% reported by another recent study [20]. As a final remark, we observe that the standard deviation values are rather high for all metadata fields presented in Table 2, a result again in line with previous findings [21].

Table 2 Metadata statistics per each video v in V.



Figure 7 Query characteristics: (a) Category distribution and (b) No. of results.

Characteristics of YouTube queries. In contrast to web search, for which publicly available query logs allow analyzing issues like user search intentions, result distributions, etc., there exist no public query logs that can be exploited for characterizing the users' interests for video search in platforms like YouTube. So, we begin with a short analysis of the queries in Q, as the way our query set is constructed allows us to shed light on the real interests of YouTube users. We classified the 1,450 queries in Q based on the YouTube-provided category of their resulting videos. In particular, a query's category is designated as the most popular category among those of the videos retrieved for this query. Not surprisingly, the majority of the queries fall into the "music" category. The other popular query categories are "entertainment", "gaming" and "sports", as shown in Figure 7a. In addition to the automatic categorization, we also conducted a manual analysis to detect named entities appearing in the queries and found that 46% of the queries include a person entity (e.g., a singer, movie star, music band, YouTube user, etc.) and another 9% of them include a product entity. In Figure 7b, we provide the distribution of the number of results for our queries. The plot shows that for almost half of the queries YouTube reports more than 10K resulting videos, which is rather expected, as our queries are composed of popular query suggestions.

Characteristics of the top-ranked results. In this section, we present the basic characteristics of the top-50 query results with respect to raw social features such as the number of views, likes, dislikes and comments. While earlier studies about YouTube (e.g., [18][16][21]) also report statistics for some of these features, their analyses are usually over a set of videos (e.g., such as those we provide in Table 2). To the best of our knowledge, ours is the first study that provides an analysis of social features taking into account the rank of the videos in the query results.



Figure 8 Avg. No. of (a) views, (b) likes, (c) dislikes, and (d) comments vs. video rank.

Figure 8a shows the average number of views for the videos that are ranked at the i-th position in the query results, where 1 ≤ i ≤ 50. In general, the number of views is quite high (around 300,000 views even at rank 50), which is not surprising due to the way we chose our queries, as discussed before. The number of views for the top-ranked video is considerably higher than that for the others, and indeed the videos in the top-7 results are viewed more than 10 million times on average. A common behaviour of YouTube users is rating the viewed video, usually expressed by clicking the like/dislike buttons. We find that the videos in the top-10 have higher numbers of likes and dislikes in comparison to the rest of the videos (see Figures 8b and 8c, respectively). Moreover, there is an order of magnitude difference between the number of likes and dislikes: the former starts around 16,000 and goes down to 2,000, whereas the number of dislikes starts from 1,800 and goes down to 200 for the top-50 videos. Note that, for such popular videos with hundreds of thousands of views, it should not be surprising that there are some dislikes as well. However, on average, we observe that more than 93% of the ratings for the top-50 (and even the top-300) videos are positive. In addition to likes/dislikes, some videos are marked as favourite by some users. We observed a trend similar to that in Figure 8b for the favourite counts (plot not shown here). An even stronger form of user interaction and participation in YouTube is posting comments for videos [19][20]. Figure 8d depicts the average number of comments for videos at each result position. As in the previous cases, the top results have attracted considerably more attention than the others, and the average number of comments drops below 3,000 for the videos ranked below the top-5. To summarize, we find that the videos in the top-10 YouTube results for our queries are usually those that attracted very high user interest as expressed by the number of views, likes, favourites and comments. For the rest of the results, the differences among the videos in terms of the values of these social features seem to be rather negligible. The high values of social features for the videos in the top-10 results can also be attributed to the well-known Yule process (or rich-get-richer principle) [16], as the videos that appear on the first page are more likely to be viewed and interacted with. On the other hand, when all other features (such as the textual relevance to a query) are equal, it seems intuitive to put a video that is highly viewed/liked/commented at a higher rank than one that attracted no interest. While it is not possible to conclude which of these explanations (or maybe both) holds for YouTube (and indeed, this is not the goal of this work), it seems a worthwhile direction to further investigate the retrieval potential of the social features in more depth.



4. Analysis and Detection of Troll Users

Apart from studying and exploring content and community annotations, in the CUBRIK project we are also interested in roles of contributing users in Social Web environments and trust. Commenting tools in social websites are mostly used to share legitimate opinions and feelings. Nevertheless, it is also common to find users that abuse this mechanism in various ways, which include posting links to external web pages aiming at increasing their visibility (i.e. spamming), or posting disruptive, false or offensive comments to fool and provoke other users. We conduct an exploratory analysis of the presence of troll users (trolls) in social websites, and study methods for automatically detecting potential trolls based on the textual content of their comment history.

4.1 Introduction

Our main data source for this analysis is the YouTube collection. Extracting a comparable number of trolls from the YouTube dataset is not straightforward. First, troll detection requires manually assessing the content of each user's comments, as YouTube does not provide a list of troll users flagged by the community. Second, the proportion of trolls is significantly lower than that of legitimate users [22]. Therefore, identifying a comparable amount of trolls in YouTube using a random sampling strategy would require manually inspecting comments from thousands of users. This process demands a huge amount of human labor that renders it unfeasible. To circumvent this problem, we used a simple heuristic to increase the chance of finding trolls in our sample by means of the user approval ratio ρ(u) = pos(u) / (pos(u) + neg(u)), where pos(u) and neg(u) denote the number of positively and negatively rated comments for a given user u, respectively. Low values of this ratio indicate strong rejection by the community of the comments of a particular user, while high values indicate general acceptance of the user's opinions. We used this metric to sample YouTube users by randomly selecting 500 users with low approval ratios (in the range [0, 0.1]), under the assumption that a significant number of trolls would fall into this interval. In order to obtain a set containing more non-troll users we also sampled a set of 500 users with higher approval ratios. The final set of 1000 users was then manually annotated using the following three labels based on the content of their comments: "troll", "non-troll", or "unknown".
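A minimal sketch of this sampling heuristic, assuming the ratio form given above:

    import random

    def approval_ratio(pos, neg):
        # Fraction of a user's comments rated positively by the community
        total = pos + neg
        return pos / total if total else 1.0

    def sample_candidates(users, low=0.1, n=500, seed=42):
        # Pick n users with approval ratio <= low (likely troll candidates);
        # users is a list of dicts with 'pos' and 'neg' comment counts.
        pool = [u for u in users if approval_ratio(u["pos"], u["neg"]) <= low]
        random.seed(seed)
        return random.sample(pool, min(n, len(pool)))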

4.2 Experiments

Figure 9 shows the distribution of troll and non-troll users in YouTube with respect to the user approval ratio ρ(u). The figure clearly illustrates the higher proportion of trolls found in the [0, 0.1] range, as compared with the proportion at higher levels of ρ(u). This result provides empirical support for the heuristic chosen in our sampling strategy. We also observe a large percentage of trolls in the [0.1, 0.2] range, whereas just a tiny fraction of users are found to be trolls for ρ(u) > 0.2. This result supports the intuition that comment ratings serve as good proxies for troll identification in online communities. In this work, however, we restrict ourselves to a scenario where no rating information is available and attempt to predict trolls using only the textual content of each user's comments. Figure 10 shows the distribution of comment ratings from YouTube. Note that our sampling strategy for detecting trolls in YouTube is biased towards low-rated comments, as 50% of the users were chosen from approval ratios in the range [0, 0.1]. As illustrated in Figure 10, this bias significantly affects the distribution of non-troll comment ratings, but has just a marginal effect on the distribution of troll comment ratings, as they mostly feature low rating values. Therefore, we address our sampling bias by comparing the ratings of comments from trolls in this sample with the ratings of comments in the whole dataset (including troll and non-troll users). The plot shows a clear trend of comments from troll users having lower ratings than comments from non-troll users.

Figure 9 Distribution of troll and non-troll users in YouTube w.r.t. the user approval ratio ρ(u).

Figure 10 Comment rating distributions for troll and non-troll users in YouTube.



5. Crowdsourcing for Deduplication Applied to Digital Libraries

5.1 Introduction

In addition to exploiting existing annotations and community feedback in multimedia environments as described in previous sections, enriching and annotating data via crowdsourcing is one of the central themes of the CUBRIK project. In this section, we describe a combined and iterative crowdsourcing and machine learning approach, which we applied to the de-duplication of Web content. The Web is an immense repository of diverse information describing all possible entities. With theoretically infinite data sources, many entities are described in several places, leading to inherent duplicate data and/or metadata. In duplicate detection, computers are particularly good at detecting duplicate candidates, but the difficult task is the actual binary decision whether two entity descriptions represent the same real-world entity or not. This is exactly where humans excel, and this task can be crowdsourced via an online marketplace for a low cost. Coupled with automatic algorithms, crowdsourcing can ideally leverage the best capacities of both human and machine. In this work [12] we propose a way to use the crowd for machine learning, in an integrated system where the two methods work together towards improving their performance on the task at hand. We focus on the case of duplicate detection for scientific publications and we apply the findings to the online publication search system FreeSearch2. At the core of our duplicate detection method we still have a general purpose similarity scorer, working on (attribute, value) pairs without additional knowledge as to the semantics of the attributes, therefore keeping the solution usable for general duplicate detection tasks. The decision for a pair of entities is based on several features learned by our method from human-provided training data gathered using Amazon's Mechanical Turk (MTurk) service in an active learning manner. To tackle the quality issue of using crowdsourcing, we employ and compare several algorithms to compute a confidence in the workers. We apply our simple and powerful methods to an online publication search system: first, we perform a coarse duplicate detection relying on publication signatures in real time. Then, a second automatic step compares duplicate candidates and increases accuracy while adjusting based on both feedback from our online users and from crowdsourcing platforms.

2 http://dblp.kbs.uni-hannover.de

5.2 Methods and Techniques

Without restricting the generality of our methods, we will focus on the domain of digital libraries and give examples based on scientific publications. We use the following notation: $e_i$ is an entity, described as a set of attribute-value tuples; $p_{i,j}$ is a pair of entities $(e_i, e_j)$ that may or may not be duplicates; $P$ is the set containing all publication pairs.

5.2.1 DuplicatesScorer

The entity matcher, called DuplicatesScorer, is introduced in [14]. Given a request and a candidate entity, the generic matching module computes a matching score resulting from the aggregation of features of two kinds: attribute-level features and entity-level features. At all levels, feature scores are aggregated by a weighted log-based tempered geometric mean. The algorithm has the following parameters: $DSParams = \{(FieldName, FieldWeight)\}$. For a given pair of entities $p_{i,j}$, it outputs an Automatic Decision Score $ADS_{i,j}$. The Automatic Decision $AD_{i,j}$ of the algorithm can have two values: $AD_{i,j} = 1$ if $ADS_{i,j} \geq threshold$, indicating duplicates, and $AD_{i,j} = 0$ if $ADS_{i,j} < threshold$, for non-duplicates.
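Since the exact aggregation formula is not reproduced in this report, the following is only a minimal sketch of a weighted, log-space (tempered) geometric mean with a thresholded Automatic Decision; the tempering constant and the example threshold are illustrative assumptions:

```python
import math

def duplicates_score(field_scores, ds_params, temper=0.01):
    # Weighted geometric mean computed in log space; the small `temper`
    # constant keeps a single zero field score from collapsing the product.
    total_weight = sum(ds_params.values())
    log_sum = sum(weight * math.log(temper + field_scores.get(field, 0.0))
                  for field, weight in ds_params.items())
    return math.exp(log_sum / total_weight) - temper

def automatic_decision(ads, threshold=0.75):
    # AD = 1 (duplicates) if ADS >= threshold, else 0 (non-duplicates)
    return 1 if ads >= threshold else 0

# Example: per-field similarities of two publication records
ds_params = {"title": 3.0, "authors": 2.0, "year": 1.0}   # DSParams
scores = {"title": 0.92, "authors": 0.80, "year": 1.0}
ads = duplicates_score(scores, ds_params)
print(ads, automatic_decision(ads))
```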

5.2.2 Learning from the Crowd

In order to help us decide on the uncertain duplicates whose scores lie around the threshold, and to improve our automatic methods, we create HITs consisting of 5 publication pairs for which the workers have to assign a duplicate or non-duplicate label. Each HIT pays 5 cents. We employ an active learning technique: the data is sent to MTurk in batches consisting of the pairs for which the automatic algorithm is most uncertain. With each solved batch, the algorithm learns the parameters that yield results as close as possible to those provided by the crowd. As the algorithm improves, the number of uncertain pairs decreases and the need for crowd input diminishes. The general steps taken by our method are described in Figure 11. In a continuous iterative process we build a candidate set, let it be assessed by the crowd, learn better parameters for the automatic method, and update the status of the pairs for which the crowd's decision is strong. For the publication pairs on which we do not have strong or full agreement among the workers, we get more votes by extending the number of assignments of the HIT.

Figure 11 Learning to Deduplicate from the Crowd
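As a compact illustration of the loop in Figure 11, the sketch below uses hypothetical helpers: `post_hits` stands for sending a batch to MTurk and collecting crowd soft decisions, and `learn_params` for the parameter optimization described in Section 5.2.5; the uncertainty band and agreement cutoffs are assumptions:

```python
def deduplicate_with_crowd(pairs, scorer, post_hits, learn_params,
                           uncertain=(0.7, 0.8), batch_size=60):
    while True:
        # 1. Candidate set: pairs the automatic scorer is most uncertain about
        lo, hi = uncertain
        candidates = [p for p in pairs
                      if p.status is None and lo <= scorer.score(p) <= hi]
        if not candidates:
            break
        # 2. Let the crowd assess one batch of uncertain pairs
        assessed = post_hits(candidates[:batch_size])  # [(pair, CSD), ...]
        # 3. Re-learn DSParams and threshold against the crowd's decisions
        scorer.params, scorer.threshold = learn_params(assessed)
        # 4. Fix the status of pairs with a strong crowd decision; pairs
        #    without agreement get more assignments in the next round
        for pair, csd in assessed:
            if csd >= 0.8 or csd <= 0.2:
                pair.status = "duplicate" if csd > 0.5 else "distinct"
    return pairs
```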

5.2.3 Computing the Crowd Decision for a Pair

Let $v_{i,j}^w$ be the vote that worker $w$ cast on the pair $p_{i,j}$: it is 1 if worker $w$ voted the pair as duplicates, and -1 otherwise. $c_w$ represents the confidence associated to worker $w$, and $W_{i,j}$ is the set of workers that voted on $p_{i,j}$. From MTurk we get tuples of the form $(w, p_{i,j}, v_{i,j}^w)$, and we aggregate the votes of the single workers into the Crowd Soft Decision $CSD_{i,j}$, to get a single assignment for $p_{i,j}$:

$CSD_{i,j} = \frac{1}{2}\left(1 + \frac{\sum_{w \in W_{i,j}} c_w \, v_{i,j}^w}{\sum_{w \in W_{i,j}} c_w}\right)$

i.e., the confidence-weighted average vote rescaled to [0, 1]. $CD_{i,j}$, the aggregated Crowd Decision, can then take two values: 1 if $CSD_{i,j} > 0.5$, designating the pair as duplicates, and 0 if $CSD_{i,j} \leq 0.5$, indicating non-duplicates.
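A direct transcription of this aggregation, as reconstructed above, in Python:

```python
def crowd_soft_decision(votes, confidence):
    # votes: worker id -> +1 (duplicates) or -1 (non-duplicates)
    # confidence: worker id -> c_w
    total = sum(confidence[w] for w in votes)
    weighted = sum(confidence[w] * v for w, v in votes.items())
    return 0.5 * (1.0 + weighted / total)   # CSD in [0, 1]

def crowd_decision(csd):
    # CD = 1 (duplicates) if CSD > 0.5, else 0 (non-duplicates)
    return 1 if csd > 0.5 else 0

votes = {"w1": +1, "w2": +1, "w3": -1}
conf = {"w1": 0.9, "w2": 0.7, "w3": 0.5}
csd = crowd_soft_decision(votes, conf)     # ~0.76
print(csd, crowd_decision(csd))            # -> duplicates
```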


5.2.4 Quality Control for Crowdsourcing

A simple metric for evaluating the worker confidence $c_w$ is the proportion of pairs the worker classified in agreement with the crowd's decision. To compute the worker confidences we use an EM algorithm as proposed in [11]. The algorithm takes as input the work done by all workers on all pairs and outputs the worker confidences and the final decisions. We initialize all confidences with $c_w = 1$, i.e., all workers are considered equally good. We then repeat two steps until we reach convergence, or for a fixed number of iterations:

1. compute the soft decisions for all pairs based on the current worker confidences
2. update all worker confidences
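A simplified sketch of this EM loop follows; note that the full method in [11] estimates per-worker confusion matrices, so the agreement-based confidence update below is a deliberate simplification:

```python
def em_worker_confidence(votes_by_pair, iterations=20):
    # votes_by_pair: pair id -> {worker id: vote in {+1, -1}}
    workers = {w for votes in votes_by_pair.values() for w in votes}
    conf = {w: 1.0 for w in workers}         # all workers equally good
    soft = {}
    for _ in range(iterations):
        # Step 1: soft decisions from the current worker confidences
        soft = {p: 0.5 * (1 + sum(conf[w] * v for w, v in votes.items())
                          / sum(conf[w] for w in votes))
                for p, votes in votes_by_pair.items()}
        # Step 2: confidence = share of votes agreeing with the crowd
        for w in workers:
            answered = [(p, votes[w]) for p, votes in votes_by_pair.items()
                        if w in votes]
            agree = sum(1 for p, v in answered
                        if (v == +1) == (soft[p] > 0.5))
            conf[w] = max(agree / len(answered), 1e-3)  # avoid zero weights
    return conf, soft
```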

5.2.5 Learning to Deduplicate

Using the reputation system to put a lower weight on the contributions of bad workers, the crowdsourced work gets as close as possible to a real ground truth, and we can use it to maximize the accuracy of the automatic method by learning the parameters DSParams and threshold that provide results as close as possible to those of the crowd. We use a choice of multi-objective evolutionary algorithms (including SPEA2 and NSGA2) as implemented in the Opt4J [13] library. We optimize either:

• the Accuracy (overlap) of the DuplicatesScorer decision when compared to the crowd's decision, finding both the DSParams and the threshold that maximize it, or

• the correlation between the crowd's soft decision and the DuplicatesScorer's score: we compare $CSD_{i,j}$ with $ADS_{i,j}$, find those DSParams that yield the best correlation between the two, and then seek the threshold that gives the highest Accuracy. The possible optimizations, sketched in code below, are:
  o minimizing the Mean Absolute Error (MAE)
  o minimizing the sum of the logarithms of the errors (sum-log-err)
  o maximizing the Pearson Correlation (Pearson) of the two series.
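The three correlation-style objectives, sketched under the assumption that the crowd soft decisions and the automatic scores are aligned in two equal-length arrays (the epsilon guarding the logarithm is an assumption):

```python
import numpy as np

def optimization_objectives(csd, ads):
    # csd: crowd soft decisions, ads: DuplicatesScorer scores (same pairs)
    csd, ads = np.asarray(csd, float), np.asarray(ads, float)
    err = np.abs(csd - ads)
    mae = err.mean()                          # minimize
    sum_log_err = np.log(err + 1e-9).sum()    # minimize (eps avoids log 0)
    pearson = np.corrcoef(csd, ads)[0, 1]     # maximize
    return mae, sum_log_err, pearson
```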

5.3 Experiments

5.3.1 Experimental Setting and Dataset

In our experiments the entities we deduplicate are scientific publications. The posted HITs consist of 5 pairs with $ADS_{i,j} \in [0.7, 0.8]$, obtained with a common-sense parameter choice. We posted one batch of 60 HITs having a qualification test as a prerequisite, and two batches containing 60 HITs and 119 HITs without one. We retrieved a total of 1132 assignments from MTurk, corresponding to 239 HITs solved by 78 unique workers, with an average time per HIT of 90 seconds. The average time a worker spent on solving a HIT was 145 seconds. The average number of HITs solved by a single worker was 72. To obtain a ground truth against which we can compare the accuracy of our algorithm, we labeled 450 pairs, selected such that they were solved by all the workers, needing on average 2 minutes per HIT. Manually computing the worker confidence against the 450 pairs that compose our ground truth reveals an average confidence of 0.85 with a standard deviation of 0.14. In comparison, the automatically computed worker confidence on the same data has an average of 0.81 with a standard deviation of 0.18, and on the whole data set 0.81 and 0.16, respectively. The confidence in a worker does not strongly correlate with the average time invested in solving a HIT (PCC = 0.177) or with the number of solved HITs (PCC = -0.19).

5.3.2 Crowd Decision Strategies vs. Optimization Strategies

The Crowd Decision can be computed using different strategies, depending on how we compute the worker confidence $c_w$:

• MV represents the soft decision in the case of majority voting, where all $c_w = 1$.

• C-Iter represents the soft decision computed by using the worker confidences obtained from the EM algorithm.

• C-boost represents the decision computed by using a boosted weight instead of $c_w$ in the computation of the soft decisions with the same algorithm, giving workers with a higher confidence a bigger importance.

• C-Manual is obtained by using the manually computed worker confidences on our own assessed ground truth.

• Heur represents a heuristic that disregards the worker confidence in the final crowd decision: a pair is regarded as a duplicate if all of the initial 3 workers agree, or if, after requesting 2 more votes, at least 4 out of 5 workers agree.

We present the results obtained by using these different ways of obtaining the final decisions, combined with different optimization strategies, in terms of accuracy in Table 3.

Optimization strategy   3 workers MV   5 workers MV   C-Iter   C-manual   C-boost   Heur
Accuracy                79.19          80.00          79.73    80.00      78.92     79.73
Sum-Err                 76.49          79.46          79.46    79.46      79.46     79.19
Sum-log-err             71.89          78.11          78.38    78.92      80.27     76.76
Pearson                 73.24          79.46          79.46    80.54      79.46     81.08

Table 3 Crowd Decision Strategies vs. Optimization Strategies

The best automatic decision strategy is C-boost. The highest performance is obtained when optimizing with methods penalizing high differences between worker decisions and algorithm output, like Pearson or Sum-log-err. Heur uses the best data in terms of agreement between workers and achieves very good results. In our optimization setting, the strategies that approximate the workers' real confidence (MV, C-Iter and C-boost) perform as well as C-Manual.

5.3.3 Integrated Duplicates Detection Strategies

We implemented different duplicate detection strategies in an online scientific publication search system. Table 4 shows the performance of the different methods:

• sign detects duplicates based only on the publication signature.

• DS/m uses the default manual weights and threshold for the DuplicatesScorer, while DS/o uses the optimized, learned weights obtained with the simplest, most cost-effective strategy: MV with 3 workers, optimized for Accuracy.

• sign+DS/m and sign+DS/o are very computationally efficient combined methods: they first group duplicate candidates by signature (for efficiency) and then base the duplicate detection decision on DS/m or DS/o, respectively (for best accuracy).

• CD-MV is simply the crowd decision of 3 workers using majority voting, i.e., the performance of humans on this task.



              sign   sign+DS/m   sign+DS/o   DS/m   DS/o   CD-MV
R (recall)    0.20   0.20        0.20        0.67   0.56   0.97
A (accuracy)  0.77   0.77        0.77        0.70   0.79   0.83
P (precision) 0.95   0.95        1.00        0.48   0.66   0.63

Table 4 Integrated Duplicates Detection Strategies

Automatically learning the features increases accuracy from 0.70 (DS/m) to 0.79 (DS/o). Looking at CD-MV, we see that even humans perform only 4% better, with an accuracy of 0.83. Integrated in the overall system, sign+DS/o shows perfect precision at the cost of recall.



6. Analyzing Emotions and Sentiments in Social Web Streams

6.1 Introduction

Information aggregation can be a powerful tool for data analysis and information extraction on the Social Web. In this section we describe research on aggregation for emotion and sentiment analysis conducted in the CUBRIK project.

Real-time microblogging services, such as Twitter, have experienced an explosion in global user adoption over the past years. It is estimated that Twitter has surpassed 300 million users, who generate more than 200 million 140-character messages (i.e., tweets) every day. Latin America is no exception: Brazil, Mexico, Venezuela, Colombia, Argentina, and Chile figure among the top 20 countries in terms of Twitter accounts, as reported by a recent study by Semiocast, a provider of consumer insight and brand management solutions.

The high rate at which users share their opinions on blogs, forums, and social networking sites, such as Facebook or Twitter, makes this kind of media even more attractive for measuring specific sentiments towards current affairs. Real-time access to the large amount of available user-generated content can give social researchers, and citizens in general, the tools to monitor the pulse of society towards specific topics of interest, a task traditionally accomplished only through opinion polls, which are costly and time-consuming to conduct, and therefore frequently limited to small sample sizes. Real-time analysis of social media streams allows for the discovery of latent patterns in public opinion, which can be exploited to improve decision-making processes. For example, automatically detecting emotions such as joy, sadness, fear, anger, and surprise in the social web has several practical applications, for instance tracking the popularity of political figures or the public response to newly released products. This is the field of sentiment analysis, which involves determining the opinions and private states (beliefs, feelings, and speculations) of the speaker towards a target entity.

The goal of this work [34] is to explore the sentiments and emotions towards political figures in Latin America. To this end, we analyze mentions on Twitter and blogs of eighteen Latin American presidents between October 1, 2011 and April 1, 2012. The names of the presidents and their respective countries are listed in Table 5. By making use of an emotion lexicon, we study the emotions evoked by each president. While this approach is standard in many applications, we felt that a study on political emotion detection via the social web, covering Latin America in particular, was necessary.

Table 5 Presidents of Latin America considered in the analysis (listed alphabetically by country name)



6.2 Methods and Techniques

6.2.1 Data Collection Process

We perform our study on a collection of 165,484 documents, of which 155,280 are 140-character Twitter messages (tweets) and 10,204 are snippets of weblog posts. The documents were produced by 55,013 distinct users during the six-month period between October 1, 2011 and April 1, 2012. We chose this period of time because it allowed us to discuss and contrast our findings against an independent opinion poll published in April 2012 [4]. The same procedure and analytic techniques discussed in this work can be directly applied to real-time data streams, as illustrated in Figure 12. Both tweets and blog posts are in Spanish, and they were collected as follows:

1. Twitter messages were retrieved using Topsy, a Twitter search engine that indexes and archives messages posted on Twitter. For each president name listed in Table 5, we issued a query against Topsy using its API. We forced an exact match on the name by enclosing it in double quotes. We also included in the parameters the corresponding start and end dates of interest.

2. Blog posts were fetched using Google News RSS feeds. As in the case of tweets, we used the name of the president as the query term and forced an exact match. We restricted the sources of information to be exclusively blogs in the Spanish language. Again, the time range was restricted to the period under analysis. In this case, we consider as a document the post's title and the short snippet of text (around 300 characters) contained in the item's description tag of the RSS result as returned by Google.

Figure 12 Social Analytics Process

6.2.2 Model of Emotion and Polarity Analysis for Political Figures

The objective is to identify the emotions reflected in the Social Web towards a political figure, in our case a particular Latin American president. To this end, we analyze tweets and blog post snippets of at most 140 and 300 characters, respectively. Given the short text of the documents, we assume that words close to the president's name convey the emotion to be captured. In particular, we focus our analysis on nouns and adjectives. The emotion detection approach comprises the following procedure:

1. Create a profile for each president
2. Extract the terms from the profile
3. Associate to each term an emotion and polarity based on an emotion lexicon
4. Compute the emotion vector and polarity for each president

First, we build a profile for each of the 18 presidents. The profile consists of all tweets and blog post snippets collected for the corresponding president. After building the profiles, we use TreeTagger to perform part-of-speech tagging on each of them [5]. Then, based on the output of TreeTagger, we extract the nouns and adjectives. Finally, we use a term-based matching technique to associate each term with emotion and polarity values. We used in our study the NRC Emotion Lexicon (EmoLex), a large set of human-provided word-emotion association ratings. EmoLex was created via crowdsourcing on Amazon's Mechanical Turk, and it is described in [6].
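A minimal sketch of steps 2 and 3: `pos_tags` is a hypothetical stand-in for TreeTagger's output, and `emolex` for a term-to-emotions lookup built from the (translated) lexicon described below:

```python
def emotion_counts(profile_text, pos_tags, emolex):
    # pos_tags(text) yields (token, tag) pairs; we keep nouns and adjectives
    counts = {}
    for token, tag in pos_tags(profile_text):
        if tag.startswith(("N", "ADJ")):
            for emotion in emolex.get(token.lower(), ()):
                counts[emotion] = counts.get(emotion, 0) + 1
    return counts
```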

6.2.3 Sentiment Analysis and Multilingualism

Note that the terms in the lexicon are in English, whereas the presidents' profiles contain text in Spanish. One approach to address this issue is to use machine translation to translate the text of each document into English and conduct the analysis in that language; for example, in [7] tweets are translated from German to English before extracting emotions. However, our objective is to process the social stream in real time, and translating each and every microblog post would be costly. Instead, we propose to machine-translate the terms in the lexicon from English to Spanish. In this way, the translation is performed once, offline, and the resulting terms are used to perform the analysis in the same language as the posts. To this end, we translated the terms in EmoLex using three different services: Google Translate, Bing Translator, and Yahoo! Babel Fish. The resulting Spanish terms were associated with the emotions and polarity of the corresponding English terms.

President's Emotional Vector

We define the emotional vector $e_p$ for president $p$ as follows. Let $T_p$ be the set of terms extracted from the profile of president $p$, and $T_m$ the set of all terms in EmoLex annotated with emotion $m$, where $m \in \{1, \dots, 8\}$, i.e., Plutchik's eight basic emotions. Then, the $m$-th dimension of the emotional vector is given by:

$e_p[m] = \sum_{t \in T_p} \mathbb{1}_{T_m}(t)$

where $\mathbb{1}_{T_m}$ is an indicator function that outputs 1 if the term is associated to emotion $m$, and 0 otherwise. Finally, we normalize $e_p$ to produce a probability vector:

$\hat{e}_p[m] = e_p[m] / Z_p$

where $Z_p$ is a normalization constant that corresponds to the total number of terms associated to an emotion.

President's Polarity

Similarly to the case of emotions, we compute the polarity tuple $(positive, negative)_p$ of president $p$ as follows:

$positive_p = \frac{1}{Z'_p} \sum_{t \in T_p} \mathbb{1}_{pos}(t), \qquad negative_p = \frac{1}{Z'_p} \sum_{t \in T_p} \mathbb{1}_{neg}(t)$

where the indicator functions are defined analogously to the emotion case, and the normalization constant $Z'_p$ is the total number of terms with a polarity value assigned.
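Under the reconstruction above, the normalization amounts to the following sketch:

```python
PLUTCHIK = ["joy", "sadness", "anger", "fear",
            "trust", "disgust", "anticipation", "surprise"]

def emotion_vector(counts):
    # Normalize raw per-emotion term counts into the probability vector e_p
    z = sum(counts.get(m, 0) for m in PLUTCHIK) or 1
    return {m: counts.get(m, 0) / z for m in PLUTCHIK}

def polarity(num_positive_terms, num_negative_terms):
    # Polarity tuple (positive, negative)_p of a president's profile
    z = (num_positive_terms + num_negative_terms) or 1
    return num_positive_terms / z, num_negative_terms / z
```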



6.3 Experiments

In this section, we present the results of our investigation. First, we explore the polarity of the terms in each president's profile. Second, we analyze the emotions associated with each president and the extracted patterns that could explain the degree of acceptance of each political figure.

6.3.1 Polarity Detection

The polarity detected for each president is shown in Figure 13. Polarity detection provides a quick overview of the sentiment conveyed by the terms co-occurring with the presidents' names, but it is too coarse-grained and, as we will discuss later in this section, it does not fully explain the popularity as measured by the opinion poll.

6.3.2 Emotion Detection

In order to illustrate the terms behind each of the emotions extracted, we present in Figure 14 a tag cloud for each emotion considered. Each emotion tag cloud includes the top 25 most frequent terms aggregated over all presidents. We can observe that the emotion analysis provides more insight into the perception of the presidents in the social stream than the polarity value alone. For example, the Mexican president F. Calderon has a negative polarity value of 54%, which can be better qualified by the predominant emotions extracted from his profile, namely sadness, anger, fear, and disgust.

Figure 13 Polarity Analysis



Figure 14 Tag clouds for each emotion

6.3.3 Emotional Pattern Analysis and Opinion Poll

In Table 6, we contrast the opinion poll results, the polarity (positive-negative), and Plutchik's eight basic emotions as opposing pairs: joy-sadness, anger-fear, trust-disgust, and anticipation-surprise. The opinion poll reflects the percentage of people's approval of the corresponding president's job performance [4],[8]. We observe that neither the extracted polarity nor the single emotion pairs alone can fully explain the results of the opinion poll. The polarity analysis is limited for popularity prediction, and a combination of emotions is a better approach for short-term popularity forecasts. Roughly speaking, a high positive (resp. negative) weight indicates that presidents with these emotional features should be approved (resp. disapproved) by the people. The features corresponding to the emotional pairs joy-sadness and anticipation-surprise are the dominant terms of the expression, and the pair sadness-fear carries more weight than the polarity score pair.

Table 6 Contrast of the opinion poll results, polarity (positive-negative), and Plutchik's eight basic emotion opposing pairs



7. Efficient Diversity Analysis of Large Data Collections

7.1 Introduction

In this section, we describe the analysis of topic diversity in data collections, another type of aggregate data analysis conducted in CUBRIK that is especially relevant in the context of analyzing communities and developments on the Social Web.

Diversity has been studied in different disciplines and contexts for decades. The diversity of a population can reveal certain cultural properties of a country, e.g. with respect to religion, ethnic groups, or political orientations. In the area of ecology, biodiversity is used as a measure of the health of biological systems. Recently, diversity has also drawn the attention of scientists in the database and information retrieval communities. For instance, Vee et al. [31] use the sum of similarities of all object pairs to measure the diversity of relational records, while Ziegler et al. [32] employ a similar measure for diversifying items in a recommender system. In the area of search result diversification [27][32], inter-object similarity is used to obtain a subset of the most dissimilar results, providing an overview of the result space. However, due to the quadratic computational complexity, the mentioned approaches are applied on relatively small sets, limited to the number of life forms in a bio-system or the top-k objects selected in the context of search result diversification. In contrast, in this work we focus on analyzing diversity on Web collections and in other large-scale text corpora.

Increasing amounts of data are published on the Internet on a daily basis, not least due to popular social web environments such as YouTube, Flickr, and the blogosphere. This results in a broad spectrum of topics, communities and knowledge, which is constantly changing over time. An increase of content diversity over time indicates that a community is broadening its area of interest; negative peaks in diversity can additionally reveal a temporary focus on specific events. Analyzing diversity is promising for understanding the dynamics of information needs and interests of the users generating the data. In a recommender system context, diversity and its temporal evolution provide an additional criterion for suggesting interest groups or communities to users (e.g. rather broad vs. more specialized ones). Additionally, our methods could be used to efficiently analyze the correlation between diversity indexes for the documents retrieved for a set of queries and the performance of an IR system, allowing for quicker, deeper, and broader analyses. Furthermore, high diversity in document repositories can be employed as an indicator that more structure is required, and trigger manual or automatic processes for introducing topic hierarchies or clusters.

In our studies we show an example analysis depicting the temporal development of the diversity of photo annotations in Flickr, revealing interesting insights about trends and periodicities in that social content sharing environment. In addition, we conduct an analysis of scientific communities on data extracted from the DBLP bibliography and, furthermore, study diversity in clustered corpora such as US Census data and newsfeeds. Our goal is to enable a fast analysis of the variation of topic diversities according to different aspects such as time, location and communities. There are a number of well-known indexes measuring diversity in ecology, such as Simpson's diversity [30] and the Shannon index [28].
For applying the concept of diversity to text data, existing works use the sum of all document pair similarities, or variations based on pair-wise distances, as diversity metrics. These diversity indexes are based on the computation of pairwise comparisons. Thus, a common problem is their computational complexity: $O(n^2)$ comparisons are necessary for sets of $n$ objects. While this is still feasible in scenarios with small amounts of data, as for top-k candidates in query result diversification, it becomes prohibitive when computing topic diversity for large data sets. In order to solve this computational problem (the main contribution of our work [35]), we propose two novel algorithms which make use of the random sampling and Min-wise independent hashing paradigms. More specifically, we propose two fast approximation algorithms for computing the average pair-wise Jaccard similarity of sets with probabilistic accuracy guarantees, coined SampleDJ and TrackDJ. We discovered that specific properties of the Jaccard coefficient as the underlying measure for the pair-wise similarities required in diversity indexes can make the computation feasible, even for huge amounts of data. Although a variety of alternative metrics exist, Jaccard is still one of the most popular measures in IR due to its simplicity and high applicability, and it provides intuitive and interesting results in our example studies. SampleDJ, which is based on random sampling, solves the problem in time independent of the data set size, depending only on the accuracy requirements and the diversity index value to be estimated. TrackDJ, which is based on Min-wise independent hashing [26], solves the problem in linear time regardless of the input data distribution. Experiments on real-world as well as synthetic data confirm our analytical results. Furthermore, we show the applicability of our methods in example studies on various data collections.

7.2 Methods and Techniques

7.2.1 Diversity Index Definition

There is a plethora of methods for computing pair-wise similarities (e.g. cosine similarity, Jaccard coefficient, Okapi BM25, inverted distances, etc.). For the diversity computation in this work, we employ the Jaccard coefficient, which is one of the most popular measures in IR due to its simplicity and high applicability. We will see that specific properties of this coefficient can make the computation of diversity values feasible even for large data sets. Given two term sets $A$ and $B$ (e.g. the sets of tags of two photos in Flickr), their Jaccard similarity $J(A, B)$ is defined as follows:

$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$

We define our topic diversity index as the Refined Diversity Jaccard-Index, which is in fact the average Jaccard similarity of all object pairs.

DEFINITION 1 (REFINED DJ-INDEX). Given a collection of $n$ objects $O = \{o_1, \dots, o_n\}$, where each object is a set of elements (e.g. terms), the refined DJ-Index measuring the diversity of the collection is defined as follows:

$RDJ(O) = \frac{1}{n(n-1)} \sum_{i \neq j} J(o_i, o_j)$

In this work, we use the expression Refined DJ-Index, or RDJ-index for short, to emphasize that self-similar pairs are not included in the diversity computation. The RDJ-index can be considered as a special case of the Stirling index with its exponent parameters set to 1 (and a slightly different normalization if self-similar pairs are included). Note that smaller RDJ values mean larger diversities. For better visualization in plots we do not use 1-RDJ to measure diversity, because RDJ values are usually quite small and 1-RDJ is often close to 1.
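Definition 1 translates directly into the naive All-Pair computation described in the next subsection; a minimal sketch:

```python
from itertools import combinations

def jaccard(a, b):
    # J(A, B) = |A & B| / |A | B| for two term sets
    return len(a & b) / len(a | b)

def rdj_all_pair(objects):
    # Average Jaccard similarity over all pairs, self-pairs excluded;
    # O(n^2) comparisons, feasible only for small collections
    n = len(objects)
    total = sum(jaccard(a, b) for a, b in combinations(objects, 2))
    return total / (n * (n - 1) / 2)

photos = [{"beach", "sunset", "sea"}, {"beach", "sand"}, {"cat", "pet"}]
print(rdj_all_pair(photos))   # small RDJ value -> diverse tag sets
```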

7.2.2 Diversity Index Computation

Having defined the RDJ-index for measuring diversity, the goal of this section is to compute this statistic for a given text corpus.

PROBLEM 1 (RDJ-INDEX COMPUTATION). Given a collection of $n$ objects $O = \{o_1, \dots, o_n\}$, where each object is a set of elements (e.g. terms), the objective is to compute the RDJ-index value $RDJ(O)$ efficiently.

The Naive Method: All-Pair

The straightforward solution for computing RDJ directly according to Definition 1 is 1) to compare all object pairs and compute the size of the intersection and union of each pair, and then 2) to sum up the similarities of all pairs and divide by the number of object pairs. We call this algorithm All-Pair.

Diversity Index Approximation

To increase the time efficiency of the RDJ computation, we consider approximation algorithms that are usually faster but provide estimated results, which are subject to errors. Our goal is to bound these errors within a tolerable range. In the following, we define error measures for the estimation quality of approximation algorithms.

DEFINITION 2 (ABSOLUTE ERROR). Let $\hat{D}$ be an estimate of a statistic $D$. The absolute error of the estimate is $|\hat{D} - D|$.

DEFINITION 3 (RELATIVE ERROR). Let $\hat{D}$ be an estimate of a statistic $D$. The relative error of the estimate is $|\hat{D} - D| / D$.

Like many other indexes, diversity indexes are mainly used for comparisons. A single index value alone usually cannot reveal much insight to users, and the relative error is more appropriate than the absolute error for measuring the accuracy of index value estimates. For the absolute error, it becomes hard for users to specify a meaningful accuracy requirement in the form of an error bound unless they approximately know the value of the indexes to be estimated. Therefore, for both approximation algorithms described in the next subsections, we will focus our theoretical analysis on the relative error.

The SampleDJ Algorithm

A natural solution to reduce the computational cost is sampling. One approach is to take a random sample of the input data set, i.e., to sample objects uniformly at random (without replacement) from the collection, compute the sum of similarities of all object pairs in the sample, and scale the diversity index of the sample. Another approach is to sample uniformly at random but with replacement (see also [29] for a general discussion of sampling with replacement). In our context, we sample from all possible pairs, compute the similarities of those pairs, and scale the result as the final estimate. The advantage of this approach is that each pair is taken independently of the other pairs, which makes the sampling analysis easier. We focus on this approach in this work and name it SampleDJ. The details are shown in Figure 15.

TrackDJ: Estimating the RDJ-Index in Linear Time

Although SampleDJ can be very efficient in some cases, its performance is sensitive to the data distribution, and it can be very slow in the worst case. Apart from that, without prior knowledge of the input data, it is difficult to predict the running time. Thus, we present TrackDJ, another approximation technique returning an accurate estimate for the RDJ-index in linear time regardless of the input data distribution, i.e., in time proportional to the number $n$ of objects (term sets) in the data set.

Prerequisites: Min-wise Independent Hashing

Broder et al. proposed a powerful technique called Min-wise independent hashing (Min-hash) [26]. An interesting property of this technique is that the hashing collision probability of two objects is exactly equal to their Jaccard similarity.



Figure 15 SampleDJ: An approximation algorithm estimating diversity indexes

The TrackDJ Algorithm

Given the Min-wise hashing property that more similar objects are more likely to have a hash collision, TrackDJ counts the number of collisions of all object pairs and estimates the diversity index from it. The key idea of TrackDJ is that if there are more similar pairs, the self-join size of the min-hash values will be higher due to the Min-wise hashing property; and instead of comparing all object pairs, the self-join size of a set of items can be computed in linear time. TrackDJ maps each object (e.g. a term set) to a min-hash value. The algorithm uses the fact that two objects have a higher probability of a min-hash collision if they have a higher Jaccard similarity; the total number of collisions over all possible object pairs directly approximates the sum of their pair-wise similarities. Thus, by computing the self-join size of all min-hash values (i.e. the total number of collisions of all pairs), TrackDJ estimates the RDJ-index value. The detailed approximation method is shown in Figure 16.
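The following sketches convey the core of both estimators; the sample size and the number of hash functions are illustrative assumptions (the actual algorithms in Figures 15 and 16 derive them from the error bound and confidence), and `jaccard` is the helper defined earlier:

```python
import random
from collections import Counter

def sample_dj(objects, num_pairs=10_000):
    # SampleDJ idea: draw random object pairs (i != j) independently and
    # average their Jaccard similarities as the RDJ estimate
    n = len(objects)
    total = 0.0
    for _ in range(num_pairs):
        i, j = random.sample(range(n), 2)
        total += jaccard(objects[i], objects[j])
    return total / num_pairs

def track_dj(objects, num_hashes=50, prime=2**61 - 1):
    # TrackDJ idea: one min-hash per object; a pair collides with
    # probability equal to its Jaccard similarity, so the self-join size
    # of the min-hash values estimates the sum of pair-wise similarities
    n = len(objects)
    estimates = []
    for _ in range(num_hashes):
        a, b = random.randrange(1, prime), random.randrange(prime)
        mh = lambda obj: min((a * hash(t) + b) % prime for t in obj)
        counts = Counter(mh(obj) for obj in objects)
        collisions = sum(c * (c - 1) // 2 for c in counts.values())
        estimates.append(collisions / (n * (n - 1) / 2))
    return sum(estimates) / num_hashes
```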

7.3 Experiments

7.3.1 Data

Real Data. Our real-world data sets were obtained from Flickr and DBLP. DBLP is a computer science bibliography database containing more than 1.2 million bibliographic records. The data records in DBLP are mostly contributed by human editors and are well-structured. From DBLP, we extracted 1,256,089 paper titles together with their publication year and venue. We only considered conference and journal papers and removed books and other article types. Each paper title was considered as one object for our diversity analysis. For reasons of completeness and cleanness we focused on DBLP paper titles to mine topic diversities of computer science papers. Finally, we gathered tag assignments for 134 million Flickr photos uniformly over the time period from 01 Jan 2005 until 05 Sept 2010. From this set we selected a subset of 25 million photos where each photo contained at least 3 English tags (verified through a dictionary check), and performed stemming.



Figure 16 TrackDJ: Estimating the RDJ-Index in Linear Time

Synthetic Data. We created additional synthetic data sets in order to study the performance of the different algorithms on data sets with various RDJ-index values. To this end, we generated groups of objects with the property that 1) all objects within a group had the same number of terms and were pair-wise similar with the same similarity, and 2) objects in different groups always had similarity 0. By adjusting the number of groups with different sizes (i.e. the number of objects in each group), we constructed multiple data sets with different RDJ-index values. More specifically, we constructed the data sets as follows: 1) We set U, the number of term IDs in each object, to 10. 2) In each group, we generated a set of common IDs shared by all objects in the group; all other IDs in the group were distinct. In this way, by controlling the number of common IDs, we set the pair-wise similarity of all pairs in each group to around 0.5. 3) By varying the number of groups G and the group size Gs, we created multiple synthetic data sets with n = G · Gs = 524,288 objects each.
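A sketch of this generator; with U = 10 and 7 shared IDs per group, the within-group pair similarity is 7/13 ≈ 0.54, close to the 0.5 targeted above (the exact number of shared IDs used in the experiments is an assumption):

```python
def synthetic_groups(num_groups, group_size,
                     terms_per_object=10, shared_terms=7):
    # Objects within a group share `shared_terms` IDs and differ elsewhere;
    # objects from different groups share nothing (similarity 0)
    data, next_id = [], 0
    for _ in range(num_groups):
        common = set(range(next_id, next_id + shared_terms))
        next_id += shared_terms
        for _ in range(group_size):
            own = terms_per_object - shared_terms
            distinct = set(range(next_id, next_id + own))
            next_id += own
            data.append(common | distinct)
    return data
```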

7.3.2 Performance

Metrics. The three metrics used to measure the performance of the algorithms were running time, space (memory) requirements, and accuracy. In terms of memory, the primary costs of All-Pair and SampleDJ were almost the same: namely, storing all term IDs, requiring e.g. about 33 MB for the DBLP title IDs. TrackDJ requires additional space for storing the frequency counters. Depending on the data set, this cost can be as large as the input data set in the worst case; for the DBLP data set, it is less than 1/10 of the input data. Because the space cost is relatively clear and corresponds directly to the input data set size, we focused on running time and accuracy in our experiments.

Performance Results on Real-World Datasets. Experiments for the approximation algorithms were performed with an error bound ε of 0.1 and a confidence value of 1 − δ = 0.95. Table 7 shows the running times and error values (along with the exact RDJ values computed through All-Pair).



For the naive All-Pair solution, we observe that running times are rather small for data set sizes of 1,000 and 10,000 but, due to its quadratic behaviour, quickly become infeasible for larger sets. Our SampleDJ approach shows the best running time behaviour of the three tested algorithms. TrackDJ shows linear behaviour and outperforms All-Pair for data sets of size n larger than 1 million. Note that for the approximation algorithms TrackDJ and SampleDJ the actual error is clearly below the defined error bound (in most cases less than 0.001 for ε = 0.1). For the given data sets, SampleDJ shows much better performance than TrackDJ.

Table 7 Running times, RDJ values and errors for All-Pair, SampleDJ and TrackDJ on the Flickr and DBLP data sets

7.3.3 Characterizing the Diversity in Corpora

Cluster analysis or "clustering" refers to the division of a set of objects into subsets (called clusters) so that objects in the same cluster are more similar to each other than to objects in different clusters. In many contexts, unsupervised machine learning techniques like hierarchical, partitional or spectral clustering are employed to achieve this goal. Intuitively, the higher the number of clusters in a particular data set, the higher its diversity. Here, we want to verify whether this property is reflected by the RDJ-index. To this end, we analyzed three real-world data sets: RCV1 (Reuters Corpus), Flickr Groups, and US Census 1990. Table 8 provides an overview of the topics in the data sets and the number of instances per topic.

Table 8 Size, title and RDJ value (in percent) per category in Reuters Corpus, per group in Flickr-Groups collection, and per cluster in US-Census data



Figure 17 shows the RDJ-index vs. the number k of clusters contained in the sample sets. The key observation is that the RDJ value indeed decreases with the number of clusters. This shows that topic diversity in the sample sets is mirrored by the Jaccard-based RDJ-index. Despite the large structural and conceptual differences between the distinct corpora and the large differences between the absolute RDJ values, the decreasing pattern observed is remarkably similar across corpora. A comparison of diversity values in different corpora and for specific clusters reveals further interesting insights (cf. Figure 17). Generally, the diversity in the Flickr data set is highest (corresponding to lower RDJ values), as the tags used for diversity computation can be freely defined by users and are not restricted. The Reuters data set contains a more restricted vocabulary and is less diverse. Finally, attributes in the US Census (UCI) data set are well defined, the number of possible terms per instance is small (68), and the vocabulary is limited, which results in high RDJ values.

Figure 17 Diversity increases with a growing number of categories in Reuters Corpus, groups in Flickr-Groups, and educational levels in US-Census datasets



8. References

[1] T. Joachims. Making large-scale support vector machine learning practical. In Advances in Kernel Methods: Support Vector Learning, pages 169-184, 1999.
[2] W. H. Kruskal. Ordinal measures of association. Journal of the American Statistical Association, 53(284):814-861, 1958.
[3] D. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91-110, Jan. 2004.
[4] Consulta Mitofsky - www.consulta.mx, "Aprobación de Mandatarios América y El Mundo," http://goo.gl/8fFNU, April 2012.
[5] H. Schmid, "Probabilistic part-of-speech tagging using decision trees," in Proceedings of the International Conference on New Methods in Language Processing, Manchester, UK, 1994.
[6] S. M. Mohammad and P. D. Turney, "Crowdsourcing a word-emotion association lexicon," Computational Intelligence, 2011.
[7] A. Tumasjan, T. Sprenger, P. Sandner, and I. Welpe, "Predicting elections with Twitter: What 140 characters reveal about political sentiment," in International AAAI Conference on Weblogs and Social Media (ICWSM), 2010.
[8] CID-Gallup - www.cidgallup.com, "Encuesta de Opinión Pública Centro América y República Dominicana," http://goo.gl/lc85v, December 2011.
[9] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, 2(3):27:1-27:27, 2011.
[10] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, 20(3):273-297, Sept. 1995.
[11] A. P. Dawid and A. M. Skene, "Maximum likelihood estimation of observer error-rates using the EM algorithm," Applied Statistics, 28(1), 1979.
[12] M. Georgescu, D. D. Pham, C. S. Firan, W. Nejdl and J. Gaugaz, "Map to Humans and Reduce Error - Crowdsourcing for Deduplication Applied to Digital Libraries," CIKM 2012.
[13] M. Lukasiewycz, M. Glass, F. Reimann and J. Teich, "Opt4J - A Modular Framework for Meta-heuristic Optimization," GECCO 2011.
[14] Z. Miklos, N. Bonvin, P. Bouquet, M. Catasta, D. Cordioli, P. Fankhauser, J. Gaugaz, E. Ioannou, H. Koshutanski, A. Mana, C. Niederee, T. Palpanas and H. Stoermer, "From Web Data to Entities and Back," CAiSE 2010.
[15] Y. Zhu and D. Shasha, "Efficient elastic burst detection in data streams," KDD 2003.
[16] Cha, M., Kwak, H., Rodriguez, P., Ahn, Y.Y., Moon, S.: Analyzing the video popularity characteristics of large-scale user generated content systems. IEEE/ACM Trans. Netw. 17(5), 1357-1370 (2009).
[17] Chelaru, S., Altingovde, I.S., Siersdorfer, S.: Analyzing the polarity of opinionated queries. In: Proc. of ECIR'12, pp. 463-467 (2012).
[18] Cheng, X., Dale, C., Liu, J.: Statistics and social network of YouTube videos. In: Proc. of IEEE IWQoS'08 (2008).
[19] Siersdorfer, S., Chelaru, S., Nejdl, W., San Pedro, J.: How useful are your comments?: Analyzing and predicting YouTube comments and comment ratings. In: WWW'10, pp. 891-900 (2010).
[20] Thelwall, M., Sud, P., Vis, F.: Commenting on YouTube videos: From Guatemalan rock to el big bang. JASIST 63(3), 616-629 (2012).
[21] Vavliakis, K.N., Gemenetzi, K., Mitkas, P.A.: A correlation analysis of web social media. In: WIMS'11, pp. 1-5 (2011).



[22] Kunegis, J. and Bauckhage, C.: The Slashdot Zoo: Mining a social network with negative edges. ACM Transactions on Intelligent Systems and Technology, 2011.
[23] Brenner, M., Izquierdo, E.: Graph-based recognition in photo collections using social semantics. In: MM SBNMA (2011) 47-52.
[24] Gallagher, A.C., Chen, T.: Understanding images of groups of people. In: CVPR (2009) 256-263.
[25] Ahonen, T., Hadid, A.: Face description with local binary patterns: Application to face recognition. PAMI 28(12) (2006) 2037-2041.
[26] A. Z. Broder, M. Charikar, A. Frieze, and M. Mitzenmacher. Min-wise independent permutations. J. Comput. Syst. Sci., 60:630-659, June 2000.
[27] S. Gollapudi and A. Sharma. An axiomatic approach for result diversification. In WWW'09, Madrid, Spain.
[28] C. J. Krebs. Ecological Methodology. HarperCollins, 1989.
[29] F. Olken. Random Sampling from Databases. Ph.D. dissertation, University of California at Berkeley, 1993.
[30] E. H. Simpson. Measurement of diversity. Nature, 163, 1949.
[31] E. Vee, U. Srivastava, J. Shanmugasundaram, P. Bhat, and S. Amer-Yahia. Efficient computation of diverse query results. In ICDE'08, Washington, DC, USA.
[32] C.-N. Ziegler, S. M. McNee, J. A. Konstan, and G. Lausen. Improving recommendation lists through topic diversification. In WWW'05, New York, USA.
[33] Zerr, S., Siersdorfer, S., Hare, J.: PicAlert!: A system for privacy-aware image classification and retrieval. In: CIKM'12, New York, USA (2012) 2710-2712.
[34] Diaz-Aviles, E., Orellana-Rodriguez, C., Nejdl, W.: Taking the pulse of political emotions in Latin America based on social web streams. In: LA-WEB 2012, 40-47.
[35] Deng, F., Siersdorfer, S., Zerr, S.: Efficient Jaccard-based diversity analysis of large document collections. In: CIKM'12, New York, USA (2012) 1402-1411.
[36] Chelaru, S., Orellana-Rodriguez, C., Altingovde, I.S.: Can social features help learning to rank YouTube videos? In: WISE'12.
[37] Brenner, M., Izquierdo, E.: Mining people's appearances to improve recognition in photo collections. In: Advances in Multimedia Modeling, Lecture Notes in Computer Science, vol. 7732, pp. 185-195, 2013.


