Query Processing Pipelines Rel.2


R2 PIPELINES FOR QUERY PROCESSING
Human-enhanced time-aware multimedia search

CUBRIK Project IST-287704

Deliverable Version 1.0 – 31 August 2013
Document ref.: cubrik.D62.EMP.WP6.V1.0


Programme Name: IST
Project Number: 287704
Project Title: CUBRIK
Partners: Coordinator: ENG (IT); Contractors: UNITN, TUD, QMUL, LUH, POLMI, CERTH, NXT, MICT, FRH, INN, HOM, CVCE, EIPCM, EMP
Document Number: cubrik.D62.EMP.WP6.V1.0
Work-Package: WP6
Deliverable Type: Accompanying Document
Contractual Date of Delivery: 31 August 2013
Actual Date of Delivery: 31 August 2013
Title of Document: R2 Pipelines for Query Processing
Author(s): Otto (EMP), Chelaru, Zhu (LUH), Croce, Lazzaro (ENG), Giakoumis, Drosou (CERTH)

Approval of this report:
Summary of this report:
History:
Keyword List:
Availability: This report is public

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. This work is partially funded by the EU under grant IST-FP7-287704.



Disclaimer

This document contains confidential information in the form of the CUbRIK project findings, work and products and its use is strictly regulated by the CUbRIK Consortium Agreement and by Contract no. FP7-ICT-287704. Neither the CUbRIK Consortium nor any of its officers, employees or agents shall be responsible or liable in negligence or otherwise howsoever in respect of any inaccuracy or omission herein. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7-ICT-2011-7) under grant agreement n° 287704. The contents of this document are the sole responsibility of the CUbRIK consortium and can in no way be taken to reflect the views of the European Union.



Table of Contents

EXECUTIVE SUMMARY  1
1. INTRODUCTION  2
2. SEARCH ENGINE FEDERATION: QUERY LANGUAGE AND ARCHITECTURE  3
  2.1 Introduction  3
    2.1.1 Purpose and Scope  3
    2.1.2 Contributions  3
  2.2 Overall process for querying  4
    2.2.1 Building index for documents and Web sources: crawling  4
    2.2.2 Query execution  6
    2.2.3 Architecture  7
  2.3 Research on graphs based on textual documents  10
    2.3.1 Introduction  10
    2.3.2 Word distance  11
    2.3.3 Correlation measure  11
    2.3.4 Example: Jacques Chirac  13
    2.3.5 Future work  14
  2.4 Entity recognition  14
    2.4.1 State of work  14
    2.4.2 Possible integration in History of Europe  15
  2.5 Usage in History of Europe Vertical Application  15
3. COMMUNITY-AWARE MULTI-MEDIA SEARCH RESULT RANKING AND FILTERING  17
  3.1 Introduction  17
  3.2 Relevant research  17
    3.2.1 Social content and features  18
    3.2.2 Learning to rank  18
    3.2.3 Search engines  19
  3.3 Data collection, methods and characteristics  19
  3.4 Effectiveness and correlation of the individual features  20
    3.4.1 Basic features  20
    3.4.2 Social features  20
    3.4.3 User study  22
    3.4.4 Effectiveness of the individual features  22
  3.5 Learning to rank using social features  23
    3.5.1 Video retrieval framework  24
    3.5.2 Feature selection for LETOR approaches  25
    3.5.3 Experimental results for feature selection  26
    3.5.4 Experimental results for the impact of social features  28
  3.6 Summary of the research findings  29
  3.7 Expansion through Images component  29
  3.8 Conclusion  30
  3.9 Future work  30
4. CONTEXT-AWARE AUTOMATIC QUERY FORMULATION  31
  4.1 Introduction  31
    4.1.1 Purpose and Scope  31
    4.1.2 Contribution  31
  4.2 Relevant research  32
    4.2.1 User submitted queries  32
    4.2.2 Query recommendation  33
  4.3 Proposed method  35
    4.3.1 Session extraction from log data  35
    4.3.2 Classification of sessions based on time and spatial resolution  36
    4.3.3 Clustering of sessions  37
  4.4 Estimating user mood during search engine usage  38
    4.4.1 Introduction  38
    4.4.2 Proposed model  39
    4.4.3 Case studies  41
  4.5 Suggesting queries to the user  44
  4.6 Application of the proposed context-aware automatic query formulation method in practice  46
  4.7 Employing the AOL Query Log 2006 in order to test the method's accuracy  48
    4.7.1 Data collection methodology  48
    4.7.2 Procedure  48
    4.7.3 Results  48
  4.8 Discussion  49
    4.8.1 Contribution  49
    4.8.2 Future work  50
5. REFERENCES  51


Executive Summary

This document is an upgrade of the WP6 deliverable D6.1 R1 PIPELINES FOR QUERY PROCESSING. It reports the advancements and results achieved in WP6 during the second year of the project. WP6 is the work package that deals with query processing in CUbRIK, and its objectives cover the whole set of CUbRIK pipelines for the different querying aspects. Some of these objectives were addressed in D6.1:
1. To use and validate the internal and external data models used to represent all objects involved in multimedia search defined in WP2;
2. To define the language for representing queries and parts of queries (internally and for exchange between federated search engines);
3. To reuse experience from the design of multimedia query systems in past projects, like RUSHES, VITALAS, I-SEARCH and PHAROS.
The second-year WP6 activities focused on a second set of objectives:
4. To exploit community knowledge and user profiles to personalize and fine-tune query processing;
5. To include context information for improving the search results and user interaction;
6. To consider and include access control and copyright aspects for query processing;
7. To reuse the SMILA framework for unstructured data processing.
The focus of the work done can be summarized as follows:
- Federated search pipelines were integrated in the History of Europe Vertical Application to retrieve images, documents, and Web sources with one query. In order to answer these queries, a crawling mechanism for documents and Web sources has been built. This crawling produces an index which is used to retrieve documents by querying for known entities or full text;
- Queries which ask for relationships among entities were addressed, and studies regarding social graph creation based on documents were performed;
- The first comprehensive investigation of the impact of social features on video and image retrieval effectiveness was provided;
- Six state-of-the-art LETOR algorithms were implemented. The experiments demonstrated that, for video retrieval, rankers built on feature subsets including both basic and social features outperform those built using only the basic features;
- An automatic context-aware query formulation method was developed, capable of providing query suggestions on the basis of the spatiotemporal characteristics of the user session;
- Mood exploitation in queries was implemented to augment the query formulation process with further personalized information. A query mood estimation algorithm has been developed and will be integrated in the next period.



1. Introduction

During the second year of the project, WP6 focused on providing support for the Vertical Applications. Pipelines were designed and implemented which enhance functionality in the Fashion App and the History of Europe App. Besides the V-Apps, some technologies were implemented as Pipelines belonging to the CUbRIK official release.

For History of Europe, several Pipelines for the retrieval of documents were provided. These Pipelines cover the whole process of indexing documents and Web sources and retrieving them. The Pipelines exploit the CUbRIK entity repository, which is implemented leveraging on and taking into account Entitypedia [R82]. An engine for federated search was provided. In Task 6.2, Pipelines for federated search, including index build and query execution, were developed to exploit images and documents as sources.

The images in the History of Europe Vertical Application are retrieved using a Pipeline built in Task 6.3. In order to find the correct image, user behaviour was taken into account. A vast amount of social feedback expressed via ratings (e.g. likes and dislikes) and comments is available for the multimedia content shared through Web 2.0 platforms. However, the potential of such social features associated with shared content still remains unexplored in the context of information retrieval. In the work done in Task 6.3, we first studied the social features that are associated with the top-ranked videos retrieved from the YouTube video sharing site for real user queries. This technology was implemented in a separate application. Our analysis considers both raw and derived social features. Next, we investigated the effectiveness of each such feature on video retrieval using state-of-the-art learning to rank approaches. In order to identify the most effective features, we adopted a new feature selection strategy based on the Maximal Marginal Relevance (MMR) method, as well as utilizing an existing strategy. The findings reveal that incorporating social features is a promising approach for improving the retrieval performance. The methods developed in Task 6.3 were also applied in the Expansion through Images component, which is part of the History of Europe Vertical Application. These methods will be extended in the final CUbRIK release.

In Task 6.4, an automatic context-aware query formulation method was developed, which is capable of providing query suggestions to the user of a search engine as soon as s/he logs in, prior to the submission of queries to the search engine. Queries are formulated here by matching the current user's session to past sessions belonging either to the same user (personal level suggestions) or to other users (global level suggestions), on the basis of session spatiotemporal characteristics. Moreover, a query mood estimation algorithm has been developed which, based on assumptions on how user mood is connected with search engine usage events, estimates the user mood associated with each query recorded in the server log and augments the query formulation process with such further personalized information.

The rest of the document is structured as follows:
- Chapter 2 describes the implemented Pipelines for federated search, including index build and query execution. Moreover, research findings on the construction of social graphs based on textual documents and the recognition of unknown entities are presented;
- Chapter 3 presents methods for ranking using social features. An investigation of the impact of social features on video retrieval effectiveness is given, and the specific implementation in the HoE V-App is described;
- Chapter 4 provides an extensive research report on using the user's mood for query suggestion. The related implementation as a CUbRIK H-Demo is described;
- Referenced sources are listed in Chapter 5.
In each chapter the contributions to the CUbRIK Vertical Apps "History of Europe" and "Fashion" are described.



2. Search engine federation: query language and architecture

2.1 Introduction

2.1.1 Purpose and Scope

There are two requirements for Pipelines for search engine federation:
- Pipelines have to be implemented that perform queries and return results. These query Pipelines shall be applicable to many different sources (e.g. documents, Web sources, multimedia, etc.);
- The results have to be merged in order to present them on a single UI.
Demos of these Pipelines are integrated in the History of Europe Vertical Application. In order to enable queries to retrieve the correct data, an index has to be built. Research on how to use the index data to construct a net of relations has been performed, and the results are promising. Moreover, an analysis has been started on how to detect related entities which have not been recognized before. Among these entities are persons, countries, cities, companies, organizations, and dates.

2.1.2 Contributions

In the History of Europe Vertical Application, images, PDF documents and Web sources are shown when the user clicks on a person in an image (see Figure 1). In order to get these results, multiple queries are performed on different data sources. For the display, all the results have to be merged.

Figure 1: Query for Georges Pompidou in the History of Europe V-App

The queries for documents are the basic feature for the use case "expansion through documents".



2.2 Overall process for querying

In order to enable a query to return results, an index has to be built beforehand. Otherwise the application would have to analyse the whole database for each query again and again.

2.2.1 Building index for documents and Web sources: crawling

In order to build an index, the data sources have to be analysed. To get good results, the entities which will be retrieved have to be defined beforehand. These are the phrases which are most relevant for the domain in which the search engine will be used. Besides the entities which are known in advance, two further cases are relevant:
- There may be entities which have not been identified as such yet. See Section 2.4 for a deeper investigation;
- Full text search is always a benefit when formulating queries.

Crawling documents

In the example of the History of Europe V-App, a huge number of documents is available. These documents are held in the CVCE collection. The documents are uploaded to a file server and stored in PDF format. An asynchronous SMILA workflow is established in order to walk through the file path and analyse all documents which are found. For each document a record is built. This record contains all information which is relevant for retrieval.

Crawling Web sources

In general, the analysis of Web sources is the same as the analysis of documents. However, some additional points have to be kept in mind:
- The content is treated in the same way as the text of the PDF documents;
- Some start addresses (URLs) have to be defined where the crawling starts;
- From these URLs, crawling is started. This means the links found there are followed and the new URLs are analysed as well;
- A maximum link depth has to be defined. Otherwise the crawling would not stop within a useful time;
- A restriction to a Web domain or Web host is necessary in order not to leave the topic.
The start URLs are defined manually. Starting from these, links are followed. A "stay on host" functionality is implemented, i.e. only links which point to the same host are taken into account.

Known concepts

In order to give access to frequently used concepts, they have to be modelled. These known concepts form the relevant phrases for a domain, e.g. History of Europe. In History of Europe, these concepts mainly are the persons for which co-occurrences are analysed. A concept is identified by an ID. Moreover, it has a label. To detect a person inside a textual document, keys which are synonyms are defined. For example, President Kennedy is modelled as given in Table 1.



ID      52401ba7e4b0f3679e7ae501
label   John F. Kennedy
class   PERSON
keys    Kennedy, JFK

Table 1: Concept John F. Kennedy

For the History of Europe Vertical Application, 1717 persons coming from Entitypedia have been modelled.

Language detection

For each language there are some words which do not carry much information. These words are used very frequently and are therefore part of almost every text in the language. For English these words are e.g. "the", "a", "he" and so forth. As there is not much information in these words, they are usually ignored for retrieval; they are also called "stop words". However, these words can be used to identify the language of a document: by finding them in a text, the language can be recognized.

Record

After crawling – file crawling or Web crawling – each "physical" document has a representation in the index. This is the record (see Figure 2).

Figure 2: Class diagram for record

The ID is for internal usage only. The language attribute holds the identified language and may be "unknown". MIME is the type of the corresponding document. The attribute FilePath provides a way to access the document. Moreover, information concerning the found concepts and the full text is stored.
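The concept and record structures described above can be illustrated with a small, hedged sketch. The following Python fragment is an illustration only, not the actual SMILA data model; the class and attribute names are assumptions derived from Table 1 and Figure 2.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Concept:
    """A known entity, e.g. a person, as modelled for the index (cf. Table 1)."""
    id: str                                        # e.g. "52401ba7e4b0f3679e7ae501"
    label: str                                     # e.g. "John F. Kennedy"
    concept_class: str                             # e.g. "PERSON"
    keys: List[str] = field(default_factory=list)  # synonyms, e.g. ["Kennedy", "JFK"]

@dataclass
class Record:
    """Index representation of one crawled document or Web source (cf. Figure 2)."""
    id: str                                        # internal identifier
    language: str                                  # detected language, may be "unknown"
    mime: str                                      # MIME type of the original document
    file_path: str                                 # URL or path used to access the document
    concepts: List[str] = field(default_factory=list)  # IDs of the detected concepts
    full_text: str = ""                            # extracted text for full text search

# Example: the concept of Table 1 detected in a crawled PDF document
jfk = Concept(id="52401ba7e4b0f3679e7ae501", label="John F. Kennedy",
              concept_class="PERSON", keys=["Kennedy", "JFK"])
record = Record(id="doc-0001", language="english", mime="application/pdf",
                file_path="https://fileserver.example.org/cvce/doc-0001.pdf",
                concepts=[jfk.id], full_text="... President Kennedy stated ...")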



2.2.2 Query execution

In order to have unified access to multiple data sources, all identified items (documents, Web sources etc.) are represented by a record in an index. This enables applications (and users) to retrieve information from various sources with one multimodal query. In order to extend the queries for (textual) documents, a query is additionally performed for images in the History of Europe Vertical Application.

Retrieval with given concepts

In order to find documents which refer to the concept "John F. Kennedy", a query containing the ID of the concept has to be given:

concepts:(855ucsih/Concept/PERSON/52401ba7e4b0f3679e7ae501)

This query leads to documents and Web sources which contain the concept "John F. Kennedy" identified by one of its keys (like "JFK").

Retrieval with full text

Besides retrieval with a concept ID, it is also possible to perform a "classic" full text search. If all documents which contain the sequence of letters "currency" are required, a query can be entered like:

currency

More words can be added to find documents containing all of the words, and a set of words can be quoted in order to find exact matches. For example, the query:

\"European Central Bank\"

leads to results with "European Central Bank" in the text. A text only containing "Central European Bank" would not be retrieved.

Filters

In order to adjust the query to the application's needs, filters can be set on attributes in the index. This is especially used for the language attribute or the MIME type. To get a list of all documents in German containing the sequence of letters "Winston Churchill", the query is as follows:

+_language : (german) +\"Winston Churchill\"

The + signs define filters, the first for the attribute language, the second for the exact sequence of letters "Winston Churchill". It would be better to refer to the concept "Winston Churchill", as he is sometimes written "Winston S. Churchill" or "W. Churchill" and so on.



2.2.3 Architecture

SMILA components

In order to build the index which is necessary for retrieval, the sources have to be crawled. In general there are multiple sources. In the case of History of Europe, two kinds of sources are taken into account: files from the CVCE collection and Web sources. The URLs currently used for Web crawling are listed in Table 2. The list can easily be extended.

apcentral.collegeboard.com
bookshop.europa.eu
de.wikipedia.org
en.wikibooks.org
en.wikipedia.org
europa.eu
europeanhistory.about.com
jsis.washington.edu
primary-sources.eui.eu
www.britannica.com
www.coe.ba
www.coe.int
www.docstoc.com
www.europarl.europa.eu
www.gresham.ac.uk
www.historytoday.com
www.indexmundi.com

Table 2: List of URLs used for Web crawling

The content of the sources is analysed using parallel execution of workers. This is shown in Figure 3; a simplified sketch of the crawling procedure is given below.
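The crawling constraints described in Section 2.2.1 (start URLs, maximum link depth, "stay on host") can be sketched in a few lines. The following Python fragment is a minimal, hedged illustration using the widely available requests and BeautifulSoup libraries; it is not the SMILA worker implementation, and the start URL and depth limit are placeholders.

from collections import deque
from urllib.parse import urljoin, urlparse

import requests
from bs4 import BeautifulSoup

def crawl(start_url, max_depth=2):
    """Breadth-first crawl that stays on the start host and stops at max_depth."""
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    pages = {}                                    # url -> extracted plain text

    while queue:
        url, depth = queue.popleft()
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue                              # skip unreachable pages
        soup = BeautifulSoup(html, "html.parser")
        pages[url] = soup.get_text(" ", strip=True)   # content treated as plain text

        if depth >= max_depth:                    # maximum link depth reached
            continue
        for link in soup.find_all("a", href=True):
            target = urljoin(url, link["href"])
            # "stay on host": only follow links which point to the same host
            if urlparse(target).netloc == host and target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return pages

# e.g. pages = crawl("https://europa.eu", max_depth=1)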



Figure 3: Parallel crawling of multiple sources

During runtime, the user enters queries. This can also be done by clicking on a person's head (a vertex of the graph) or on a line between two persons (an edge). See Section 2.5 for the description of the integration in the Application. This user input is then processed by two parallel Pipelines: one is used for image retrieval, the other for the retrieval of documents/Web sources. This is shown in Figure 4.

Figure 4: Parallel execution of search Pipelines

API access

In order to provide search functionality, the index is wrapped in an interface, via which the index can be accessed. Figure 5 shows how an application that uses the Search API is connected to an index. The records contain a URL or identifier which allows the application to retrieve the document for which the record is a representation.



Figure 5: Component diagram for connection between application and index

In order to show a found document to the user (which is a task of the application), the workflow is as follows (see Figure 6):
1. A query is built in JSON format. This query is sent to the JSON/REST interface, i.e. the Search API. A result set for the query is returned; this result set contains the records (or parts of them) which match the query;
2. The application reads the URLs out of the records in order to get access to the corresponding documents. In this way the application calls the documents of interest by accessing the original source. The document source returns the document which was accessed.



Figure 6: Sequence diagram for document access

The query has to be sent to a URL like
https://hoe.empolisservices.com/ias/library/iasSearch.search
which represents the Search API. The JSON snippet has to contain at least the query and the index to be accessed. If the index ID is wi85f4cqdq, an HTTP POST request has to be sent with the body:

{ "query" : "+_language : (german) +\"Winston Churchill\"", "indexName" : "wi85f4cqdq" }

in order to retrieve all German documents containing the sequence of letters "Winston Churchill". The parts of the query which shall be a mandatory part of the results are prefixed with "+".
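The request above can be issued with any HTTP client. The following Python sketch, based on the requests library, is a minimal illustration under the assumption that the Search API accepts the JSON body shown and returns a JSON result set containing records; the field names used when reading the response are illustrative.

import requests

SEARCH_API = "https://hoe.empolisservices.com/ias/library/iasSearch.search"

body = {
    "query": '+_language : (german) +"Winston Churchill"',
    "indexName": "wi85f4cqdq",
}

response = requests.post(SEARCH_API, json=body, timeout=30)
response.raise_for_status()
result_set = response.json()

# The application reads the URLs out of the returned records in order to
# fetch the original documents (field names here are illustrative).
for record in result_set.get("records", []):
    print(record.get("FilePath"), record.get("_language"), record.get("mime"))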

2.3 Research on graphs based on textual documents

2.3.1 Introduction

For the construction of the "social graph" for persons in Entitypedia, the History of Europe Vertical Application uses co-occurrences of persons on an image. This is straightforward, as there is definitely a correlation between such persons: they have met, otherwise they would not be on the same photograph.



For textual documents, the correlation is not so easily detected. Two persons may occur in the same PDF document which discusses European history. However, these documents may have many pages. If person A is mentioned on page 5 of the document and person B is detected on page 32, this does not imply a correlation between the two persons. Therefore we use an algorithm which takes distance into account.

2.3.2 Word distance

As mentioned, a co-occurrence of two entities (e.g. persons) in a document is not necessarily an indication of a real correlation. Therefore the documents need to be analysed more deeply. For each concept, its positions in the documents are identified; there may be more than one occurrence in a single document. For each of these positions a window of 10 words in front of the concept and 10 words behind it is spanned. All (known) concepts and (unknown) noun phrases within this window are considered as "co-located" with the concept in question. For reasons of feasibility, such a co-location is also considered a co-occurrence. The 10-word distance is chosen arbitrarily. However, results are promising, in that a real co-occurrence with regard to content is detected using this method.

2.3.3 Correlation measure

Definition

In order to calculate a real correlation measure, an important prerequisite is lost when considering co-occurrence within the 10-word distance: the number of documents as a basis. The number of documents is not a reasonable basis, as the number of co-occurrences does not depend on it; many co-occurrences within one document can be arbitrary. In order to define a proper correlation measure r, this measure should be a metric and therefore symmetric:

r(A, B) = r(B, A)

for arbitrarily chosen concepts A and B. In order to define the measure r, the conditional probability P(A | B) is used. This is an indication of how much the probability to find A is increased (or reduced) when B is found within a 10-word distance. Due to Bayes' theorem this is the same as:

P(A | B) = P(A ∩ B) / P(B)

Using this theorem, the measure defined as

r(A, B) = P(A | B) · P(B | A)

is identified with

r(A, B) = [P(A ∩ B) / P(B)] · [P(A ∩ B) / P(A)],

i.e. the measure is symmetric. It is the same as:

r(A, B) = P(A ∩ B)² / (P(A) · P(B)).

The problem remains to calculate the probability itself. A probability is usually approximated as a relative frequency. However, to calculate the relative frequency, the number of hits has to be set in proportion to an overall number. Which "overall number" shall be used here? The number of documents is not appropriate, as it has nothing to do with the number of hits. As it turns out, this question need not be answered. Assume the "overall number" (whatever this means) is N, and let the frequencies of the terms and of the co-located terms be N(A), N(B) and N(A ∩ B). Then it holds

r(A, B) = (N(A ∩ B)/N)² / ((N(A)/N) · (N(B)/N)) = N(A ∩ B)² / (N(A) · N(B)).

The measure therefore does not depend on N.

Co-domain

This measure defines a correlation between two concepts (or between a concept and another term). The measure is between 0 and 1:

0 ≤ r(A, B) ≤ 1

A value of 0 indicates that there is absolutely no correlation between A and B, i.e. when considering any occurrence of A and the 10-word window around it, not a single B is found. A value of 1 indicates a full correlation: whenever A is found, B is found within a 10-word distance, and vice versa. So what counts as a correlation then? This is the same question as "What is the expectation for r(A, B)?", which of course depends on the data. As we consider documents with many words compared to the number of found A or B, the expectation is near 0. For example, in a 100-word document which contains one A and one B, the a priori probability of a co-location is about 0.2. However, if A and B appear within a passage of 100 words, it is likely that they have something to do with each other, even if they are not co-located within a 10-word distance. In a 1000-word document the probability diminishes to 0.02, which is also the expectation for r; and the documents in the collection are typically even longer. Therefore all correlations > 0 are considered as a mutual gain for the probability of occurrence.

Logarithmic scale

In order to present a number which is easier to handle, the measure is transferred to a logarithmic scale, yielding a value rlog. The co-domain of this logarithmic measure is [0, 100], 100 being the top value.
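A minimal sketch of the measure defined above is given below, assuming that the occurrence counts N(A), N(B) and the co-occurrence counts N(A ∩ B) within the 10-word window of Section 2.3.2 have already been extracted; the count data structures are assumptions made for the illustration, and the numbers are toy values, not results from the CVCE collection.

from collections import Counter

def correlation(a, b, counts, pair_counts):
    """r(A, B) = N(A ∩ B)^2 / (N(A) * N(B)); symmetric and independent of N."""
    n_a, n_b = counts[a], counts[b]
    n_ab = pair_counts[frozenset((a, b))]
    if n_a == 0 or n_b == 0:
        return 0.0
    return (n_ab ** 2) / (n_a * n_b)

# Toy counts of occurrences and of co-occurrences within the 10-word window
counts = Counter({"Jacques Chirac": 40, "Helmut Kohl": 25})
pair_counts = Counter({frozenset(("Jacques Chirac", "Helmut Kohl")): 18})
print(round(correlation("Jacques Chirac", "Helmut Kohl", counts, pair_counts), 3))  # 0.324
# The report further rescales r to a logarithmic scale rlog in [0, 100]; the scaling is not reproduced here.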



2.3.4 Example: Jacques Chirac

For a part of the CVCE Collection and a reduced set of entities, the results for the former French president Jacques Chirac are listed in Table 3.

Concept           Correlation rlog
Helmut Kohl       90.21
Tony Blair        86.39
France            81.60
NATO              76.22
United Kingdom    73.70

Table 3: Correlation between Jacques Chirac and other concepts

There are co-occurrences between Jacques Chirac and only five other concepts. When using a larger input set, more French politicians (like Jospin and Mitterrand) as well as Gerhard Schröder appear. In Figure 7 the data of Table 3 are visualized. The thicker a line is, the larger the correlation. Colours from blue (weak correlation) to red (strong correlation) underline the different strengths of the correlations. The diagram ("spaghetti plate") is an approach to visualize the social net which can be established using the algorithm that defines the correlations.



Figure 7: Correlation between Jacques Chirac and other concepts

2.3.5 Future work

The component to calculate correlation measures between entities has been developed but is not yet integrated. It seems promising to extend the History of Europe social graph with the results of the correlation calculation. A way has to be elaborated to combine these correlation measures with the measures from the existing social graph; then both measures can be used together. Two networks, based on either the images or the documents, are also possible. Moreover, an appropriate way to visualize the results has to be found. Especially in a graph with more than 1000 persons, the visualization used in Figure 7 is not feasible.

2.4 Entity recognition

2.4.1 State of work

In order to find out which persons appear in the collection but are not yet modelled as concepts, we use an entity recognition function. This works for persons, dates, organisations, countries, companies, and cities. In this way it can be determined which organisations, countries, etc. a modelled person is related to. For example, it can be found out that Jacques Chirac is related to Lionel Jospin, although the concept "Lionel Jospin" is not modelled in the run in question.

2.4.2 Possible integration in History of Europe

In order to improve the History of Europe Vertical Application, two ways have been identified:
- Persons in Entitypedia can be enriched with information about the countries they are related to. A crowd-sourcing mechanism can be applied in order to suggest the countries with the highest correlation;
- In order to identify important terms, the most relevant entities can be presented. These entities are candidates for inclusion in Entitypedia.

2.5 Usage in History of Europe Vertical Application

When clicking on a vertex, the corresponding person is chosen. In Figure 8, "Franco Malfatti" is shown. This leads to the presentation of the corresponding images and documents. The documents are retrieved via the JSON/REST request:

{ "query" : "+Concepts:(+85f4cqdq/Concept/PERSON/523c5d83e4b0f3679e7ad0fd)", "indexName" : "wi85f4cqdq" }

The index name is wi85f4cqdq; the ID of the concept Franco Malfatti is 523c5d83e4b0f3679e7ad0fd.

Figure 8: Franco Malfatti in Social Graph

In order to find co-occurrences, an edge has to be clicked on, as shown in Figure 9. Then images with both persons are retrieved. In order to retrieve documents with both persons, both concepts have to be added to the query. For example, the request:



{ "query" : "+Concepts:(+85f4cqdq/Concept/PERSON/523c5d83e4b0f3679e7ad0fd, +85f4cqdq/Concept/PERSON/523c5da2e4b0f3679e7ad62e)", "indexName" : "wi85f4cqdq" }

returns all documents with Franco Malfatti and Walter Scheel (ID 523c5da2e4b0f3679e7ad62e).

Figure 9: Selection of an edge in the social graph



3. Community-aware multi-media search result ranking and filtering

3.1 Introduction

Web 2.0 platforms, such as YouTube and Flickr, have recently received widespread attention due to the variety and attraction of the shared content. In addition to sharing and accessing the content, such platforms allow their users to express themselves by rating the viewed objects (via clicking on the popular like/dislike buttons) and interacting with the other community members (via the comments feature). This behaviour results in a vast amount of social signals associated with the shared content that may be exploited, amongst others, to improve the retrieval effectiveness. For instance, the user ratings for an object can serve as a global indicator of its quality or popularity (analogously to how web graph features, such as PageRank, serve the same purpose for web pages), and the comments and other collaboratively formed data can facilitate and enhance matching the shared content with the user queries. Despite the rapidly growing interest in Web 2.0 applications from both the industry and research communities, the impact of employing such social signals within a large-scale search scenario has never been fully explored.

The research community has also shown keen interest in analysing and exploiting the rich content shared in Web 2.0 platforms. Some earlier studies attempted to investigate the retrieval potential of the social signals, specifically comments, in isolation ([R80] [R69] [R72]). However, to the best of our knowledge, there is no study that systematically and exhaustively investigates the impact of a rich set of social signals on the retrieval performance in a realistic and state-of-the-art framework.

How useful are the social signals for improving the retrieval effectiveness? In the work done in the context of Task 6.3 during the reporting period, we seek an answer to this central question. While doing so, we focus on keyword-based video search for the YouTube video sharing site. Social features, as we call them in this work, refer to the information that is created by some explicit or implicit user interaction with the system (such as views, likes, dislikes, favourites, comments, etc.). The results are also applicable to similar image platforms. In contrast, we call the features that would typically be involved in a keyword search scenario, such as the textual similarity of the user queries to the video title, tags and description (i.e., metadata fields provided by the content uploader), the basic features. Our work essentially explores whether the social features in combination with the basic features can retrieve more relevant videos; and if this is the case, which social features serve best. Note that, while our choice of YouTube is based on the availability of a rich set of social features in this platform, we believe that our findings are applicable to text, image and/or video search in other platforms that support similar kinds of features. Therefore, the methods developed in T6.3 are applied to the Expansion through Images component, which is integrated in the History of Europe Vertical Application.

In our work, we present a unique dataset that includes a total of 50 popular queries submitted to YouTube and around 5,000 relevance annotations for the results of these queries. Furthermore, different from all the commercial and academic datasets, we define various social features obtained from the real YouTube results in addition to the typical basic features.

3.2 Relevant Research

The publications [R83] and [R84] are part of the research which is described here.



3.2.1 Social content and features

Web 2.0 platforms and social networks have received widespread attention in the last decade. Musial and Kazienko provide an in-depth survey of social networks from a broad perspective, including the sites directly intended for such networking purposes (such as MySpace and LinkedIn) and other platforms where the users form an implicit community via interacting with the system (such as Flickr, YouTube, etc.) [R71]. Cheng et al. provide a large-scale analysis of the content in YouTube and provide statistics related to the videos, such as the distribution of categories, duration, size, bit rate and popularity [R55]. An analysis of video characteristics, such as the popularity distribution and its evolution over time, is given in [R52]. Vavliakis et al. compare YouTube and two other data sharing platforms in terms of several factors and identify the correlations between these factors via regression analysis. Various properties of the comments posted for YouTube videos are analysed in [R75] and [R77]. In an out-of-the-laboratory study aiming to shed light on how people find and access videos on the Web, Cunningham et al. [R56] discuss under what circumstances the participants benefit from the comments. None of these works explore the retrieval potential of the analysed features.

Among the social features associated with the shared content, the lion's share of research interest is devoted to the user comments due to their potential to improve the performance in several scenarios. In a recent survey, Potthast et al. categorize the comment-related tasks as comment-targeting and comment-exploiting [R72]. The works that aim to rank [R64] or diversify the comments [R62] and predict their ratings [R75] fall into the former group. In the latter category, there is a large body of work that utilizes the comments for various purposes, such as summarizing blog posts [R65], classification of YouTube videos [R58], predicting the content popularity [R69][R78][R79] and recommending related content items [R74]. Despite the large number of works focusing on the comments, only a few of them investigate their potential to improve the retrieval effectiveness, and they usually do this in isolation, i.e., independently from the other social and basic features. In one of the earliest studies, Mishne and Glance investigate the impact of comments on the retrieval performance for weblogs and report that employing comments does not improve the precision, but helps to retrieve both relevant and highly discussed blog posts [R69]. The user comments in MySpace are exploited for ranking artists [R63]. In [R73], comments are leveraged for the aesthetic-aware re-ranking of image search results. The closest work to ours is [R80], which utilizes YouTube comments for video retrieval. However, that work is limited to experimenting with the comment feature within the known-item retrieval scenario. To the best of our knowledge, we are the first to investigate the retrieval effectiveness of a rich set of social features in combination with the basic ones within a realistic search scenario.

3.2.2 Learning to rank

In recent years, traditional ranking approaches based on manually designed ranking functions (such as BM25, TF-IDF, etc.) have been replaced or complemented by rankers built with machine learning strategies [R53]. Commercial web search engines typically apply a two-stage ranking process, where a candidate set of documents is identified using a traditional yet relatively inexpensive approach (such as the ad-hoc functions exemplified above) in the first stage [R49]. Next, these candidate documents are re-ranked using a learning-to-rank (LETOR) strategy based on several hundreds of features. A variety of LETOR approaches appear in the literature, for which we refer to [R67] as an exhaustive survey. In a nutshell, LETOR approaches are broadly categorized into three categories, namely point-wise, pair-wise and list-wise, depending on their loss function. In our research, we employed state-of-the-art representatives from each category, as we describe later. In addition to the learning algorithms, feature engineering is an equally important aspect of a LETOR framework. In the last few years, large search companies such as Microsoft, Yahoo! and Yandex released benchmark datasets for so-called LETOR challenges. However, the features employed in these datasets are only broadly described (e.g., [R53]) and the actual feature names in the data are never disclosed, making it impossible to analyze the importance/utility of a particular feature or class of features. To overcome this latter difficulty, a recent study presents a new dataset based on data collected from a commercial Chilean search engine, TodoCL [R47]. Their dataset includes 79 queries with 3,119 relevance assessments and a total of 29 features. In [R68], Macdonald et al. employ the official queries and their top-ranked documents from the TREC collections to analyze the usefulness of the query features in a LETOR setup.

3.2.3 Search Engines

Web search engines, taking their fair share from this Web 2.0 wave, have taken steps towards a more “social” search. In early 2011, Bing announced its “LikedResults” feature, which, in a nutshell, annotates the result URLs with the names of the searchers' friends who liked these URLs publicly or shared them via Facebook. In 2012, this evolved to Bing's social search feature that is provided via a sidebar on the search results page. This sidebar subsumes a wide range of social functionalities, most strikingly identifying your Facebook friends, who might know about your query, based on their likes, profile information, shared photos, etc. During this time, Google also released its “Search plus Your World” feature that enriches algorithmic results with pages, photos and posts from the searchers' Google+ social network. While all of these recent developments imply the importance of social signals in search, the details, i.e., how exactly and to what extent such signals can be exploited in ranking query results, are not disclosed due to the highly competitive nature of the market.

3.3 Data Collection, Methods and Characteristics

The first challenge in investigating the impact of social features in ranking YouTube videos is creating a dataset based on real user queries. Previous studies typically obtain samples of YouTube content by running crawlers that are seeded with some generic queries (e.g., the queries from Google's Zeitgeist archive [R75] or terms from blogs and RSS fields [R77]). Different from these works, we employ a methodology for creating two different query sets including the popular and rare queries that were actually submitted by YouTube users. In what follows, we describe our query set and the top-ranked videos retrieved for these queries.

Query Set (QP): in order to construct a representative sample of real user queries, we made use of the auto-completion based suggestion service specialized for the YouTube domain from a major search engine. These instant suggestions are typically based on the previous queries submitted by other users [R76][R48]. We submitted all possible combinations of two-letter prefixes in English (i.e., aa, ab, ..., zz) and collected the top-10 query suggestions for each such prefix (e.g., "Aaliyah", "aaron carter", "abba dancing queen", etc.) in a similar fashion to [R54]. This process yielded a set of 7,000 suggestions, from which a subset of 1,447 queries was sampled uniformly at random (to avoid overloading YouTube servers with an excessive number of requests in the next steps). For each q in QP, we obtained the top-300 result videos from the YouTube API along with the available metadata fields (see Table 4). This process resulted in a superset of 138K videos, i.e., around 95 videos per query were retrieved. Among these videos, 132,697 are unique (i.e., only 4.3% of all videos overlap among different query results).

Metadata            Notation
No. of views        W(v)
No. of likes        L(v)
No. of dislikes     D(v)
No. of comments     C(v)
Uploader            U(v)
Title               TitleText(v)
Tags                TagText(v)
Description         DescText(v)
Comments            CommentText(v)
Age                 G(v)

Table 4: Metadata fields stored for each video


In addition to the metadata fields directly available via the API, we crawled up to the 10,000 most recent comments posted for each video from the actual HTML responses of YouTube (the API can provide only up to 1,000 comments). Due to the difficulties of crawling HTML, we obtained around 33 million comments posted for around 86K unique videos in our dataset. This is a fairly large set of comments, as recent works employ similar (e.g., up to 1,000 comments for 40K videos in [R77]) or smaller numbers of comments (e.g., a total of 6.1 million comments in [R75]). Finally, we also constructed the profiles of the users who uploaded the videos. To this end, for each user u, we again crawled the HTML pages to obtain the number of uploaded videos, the number of subscribers (i.e., the number of users who are following the user u), and the total number of views for the content uploaded by the user u. We ended up with profiles for 85,068 unique users, denoted as UP. Note that the metadata fields in Table 4 (other than TitleText, TagText and DescText, which are related to the basic features) constitute part of the raw social features, from which we derive various social features as described in detail in Section 3.4.

3.4 Effectiveness and Correlation of the Individual Features

In this section, we seek answers to the following two questions: 1) How effective is each individual feature for ranking videos? and 2) How are the rankings generated by different pairs of features correlated? To answer these questions, for each query q, we need to re-rank the retrieved videos with respect to each feature f in our feature set F. We therefore begin by formally defining the basic and social features that are used for ranking the videos.

3.4.1 Basic features

Basic features are based on the metadata fields created by the actual uploader of the video, namely the video title, tags and description. The features "title similarity" (fTitle), "tag similarity" (fTags) and "description similarity" (fDesc) represent the vector-based similarity score of the query text, q, to a video's title (TitleText(v)), tags (TagText(v)), and description (DescText(v)), respectively. In our setup, for each of these metadata fields, we create the corresponding index using the Lucene 3.5 library and employ its default retrieval function (based on the TF-IDF weighting model) to obtain the similarity scores for each video.
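As an illustration of how such query-to-metadata similarity scores can be obtained, the sketch below uses scikit-learn's TF-IDF vectorizer with cosine similarity as a stand-in; the implementation described above relies on the Lucene 3.5 default scoring, which differs in its details.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def metadata_similarity(query, field_texts):
    """TF-IDF cosine similarity between a query and one metadata field per video."""
    vectorizer = TfidfVectorizer()
    doc_matrix = vectorizer.fit_transform(field_texts)      # one row per video
    query_vec = vectorizer.transform([query])
    return cosine_similarity(query_vec, doc_matrix)[0].tolist()

titles = ["abba dancing queen live 1976", "aaron carter interview", "dance tutorial"]
print(metadata_similarity("abba dancing queen", titles))    # fTitle-like scores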

3.4.2 Social features

Social features are those that are formed by some user interaction with the video after it becomes available. In this sense, we first exploit the raw features provided by the system. For the metadata fields shown in Table 4, namely the number of views (W(v)), likes (L(v)), and comments (C(v)), we create the features fW, fL, and fC, respectively. To be able to use them for ranking the videos, we normalize each feature value by the age of the video, G(v). For the sake of completeness, we also consider the age of a video as a possible ranking feature and denote it as fG, although it is not a truly social feature (i.e., it is not based on user interaction). Furthermore, we derive the following social features from the raw features and the available data for our videos (all of these feature values are further normalized into the [0, 1] range based on the maximum score observed for a given query):

Normalized no. of ratings (fR): This feature represents the total number of ratings per video. The ranking criterion is (L(v) + D(v))/G(v).

Normalized ratio of likes (fRL): This feature captures the fraction of likes over all ratings for a video. The ranking criterion is (L(v)/(L(v) + D(v)))/G(v).

Normalized no. of comment authors (fCA): We extract the username fields from the crawled comments to capture the number of different users who commented on a video. The ranking criterion is A(v)/G(v), where A(v) is the number of unique users who posted a comment for v.

Uploader popularity (fUp): The ranking criterion for a video v is a popularity score of its uploader u, computed from the profile of u (cf. Section 3.3), where Videos(u) includes the videos uploaded by u.

Comment similarity (fCom): We first aggregate the top-25 most popular comments (i.e., those with the highest number of likes) of each video into a single document and index these documents using Lucene. Then, the Lucene score between q and the comment document is computed for each v.

Comment positivity (fPos): We analyze the sentiment expressed in the comments by using a public vocabulary based tool, SentiWordNet [R81], as in [R75]. Simply put, this tool assigns a triplet representing the objectivity, negativity and positivity scores to each word in a comment, which are then averaged to obtain the overall scores for the comment. For ranking purposes, we only consider the average positivity score over all comments of a video for which the tool can generate a score, i.e., the average of Pos(ci) over these comments, where Pos(ci) is the sentiment positivity score for the comment ci of a video v.

Comment rating (fCR): We compute a comment's rating as the difference between the number of likes and dislikes that it has received. The ranking criterion for a video v is the average rating computed over all the comments posted for v, i.e., the average of (likes − dislikes) over all comments of the video.

Commenter popularity (fCP): We anticipate that popular/active commenters would comment on interesting and useful content. Therefore, for each unique commenter c of a video, we compute a commenter popularity score from the profile of c, where Videos(c) denotes the videos uploaded by c. The ranking criterion for a video v is the average commenter popularity computed over all comments posted for v.

Commenter channel viewers (fCW): As another metric for commenter popularity, we use the number of viewers of the commenters' YouTube channels. Again, we compute the average number of viewers over all unique commenters of a video as the ranking criterion.

Commenter channel subscribers (fCS): The ranking criterion for a video v is the average number of channel subscribers over all unique commenters of v.

Commenter contacts (fCC): The ranking criterion is the average number of contacts over all unique commenters of a video.

To sum up, our feature set F consists of three basic and seventeen social features in total, which are listed in Table 5 for easy reference.

Notation   Description
fTitle     Title-query similarity
fTags      Tags-query similarity
fDesc      Description-query similarity
fW         No. of views
fL         No. of likes
fC         No. of comments
fG         Age
fR         No. of ratings
fRL        Ratio of likes
fUp        Uploader popularity
fCom       Comment-query similarity
fCA        No. of comment authors
fPos       Comment positivity
fCR        Comment rating
fCP        Commenter popularity
fCW        Commenter channel viewers
fCS        Commenter channel subscribers
fCC        Commenter contacts

Table 5: The list of basic and social features (F)
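A few of the derived features above can be sketched as follows, under the assumption that the raw metadata of Table 4 is available per video; the input values are toy numbers and the per-query normalization is included as described.

def social_features(video):
    """Derive some of the social features of Table 5 from raw metadata (Table 4)."""
    L, D, C, W, G = (video[k] for k in ("likes", "dislikes", "comments", "views", "age"))
    G = max(G, 1)                                # avoid division by zero for brand-new videos
    return {
        "fW": W / G,                             # views normalized by age
        "fL": L / G,                             # likes normalized by age
        "fC": C / G,                             # comments normalized by age
        "fR": (L + D) / G,                       # normalized no. of ratings
        "fRL": (L / (L + D)) / G if (L + D) else 0.0,   # normalized ratio of likes
        "fCA": video["comment_authors"] / G,     # normalized no. of comment authors
    }

def normalize_per_query(rows):
    """Scale each feature into [0, 1] by the maximum observed for the query."""
    keys = rows[0].keys()
    maxima = {k: (max(r[k] for r in rows) or 1.0) for k in keys}
    return [{k: r[k] / maxima[k] for k in keys} for r in rows]

videos = [
    {"likes": 900, "dislikes": 40, "comments": 350, "views": 120000, "age": 400, "comment_authors": 280},
    {"likes": 150, "dislikes": 10, "comments": 60, "views": 30000, "age": 100, "comment_authors": 45},
]
print(normalize_per_query([social_features(v) for v in videos]))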



3.4.3 User Study

To compute the retrieval effectiveness of each individual feature, we need relevance annotations for all of the (q, v) pairs. As this task requires serious human effort, a subset of 50 queries was sampled uniformly at random from the query set. In order to obtain the relevance judgments for these queries, we conducted a user study involving 37 participants. Nine of the participants are female and the rest are male; the age range is 20-35. All participants are from computer science related disciplines: 3 of them are undergraduates, 30 are graduate students, and the rest are post-docs. The participants are physically located in Germany, Turkey and the USA. We asked each participant to choose a few queries from our query set that they found interesting. Each query was assigned to only one participant. We asked them to annotate the top-100 result videos for a given query using a 5-point rating scale, i.e., in the order of highly irrelevant, irrelevant, undecided, relevant, and highly relevant. The annotation process was carried out using our Web site http://godzilla.kbs.uni-hannover.de:9111/popularEvaluation/welcome.xhtml. Since videos are not downloaded but streamed directly from YouTube, it turned out that a small percentage of them had disappeared over time, i.e., been deleted by the uploader or not displayed in certain countries due to copyright violation issues. The participants were asked to annotate such videos with rating 0. Finally, to avoid any bias, no social features were displayed along with the videos, but their titles and tags were kept to facilitate the judgment task. Since a few queries retrieve fewer than 100 videos, we ended up with 4,969 relevance annotations for our set.

3.4.4 Effectiveness of the individual features

Our dataset presented in the previous section allows us to compute the effectiveness of all the features described in the previous paragraphs. For the user-centric set of social features, we further crawled the commenter profiles only for the annotated videos, as doing the same for all videos would be very time-consuming due to the access limitations of YouTube. This yielded 23,721 commenter profiles. Then, for each query q, we obtained the top-10 ranking Rq,f for each feature f in F. In order to evaluate the performance of each individual ranking Rq,f, we computed the Normalized Discounted Cumulative Gain (NDCG) metric using the well-known trec_eval software package. Figure 10 shows the average NDCG@10 scores for each feature over all 50 queries from the popular and tail query sets, respectively. The top-5 most effective features are fTags, fDesc, fTitle, fCom and fG. The features derived from the comments seem to be the most promising social features, as they appear among the top-5 most effective features and perform comparably to the basic features.



Figure 10: Average NDCG@10 for top-10 videos per feature

We further explore how successful each feature is at a finer grain and compute the percentage of queries for which a particular feature yields the highest NDCG@10 score. Figure 11 shows that, for the popular queries, the three basic features fTitle, fDesc, and fTags provide the best rankings for 16%, 14% and 12% of the queries, respectively. This means that the remaining 58% of the queries can benefit from the social features.

Figure 11: Fraction of queries for which a given feature yields the ranking with the highest NDCG@10
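The evaluation above relies on the trec_eval package. For reference, a minimal self-contained sketch of NDCG@10 with the standard exponential gain and log2 discount is given below; the exact gain/discount variant used by trec_eval may differ in detail.

import math

def dcg(relevances, k=10):
    """Discounted cumulative gain over the top-k graded relevance labels."""
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg(ranked_relevances, k=10):
    """NDCG@k: DCG of the produced ranking divided by the DCG of the ideal ranking."""
    ideal_dcg = dcg(sorted(ranked_relevances, reverse=True), k)
    return dcg(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance labels (0-4) of the top-10 videos as ranked by a single feature
print(round(ndcg([4, 2, 0, 3, 1, 0, 0, 2, 1, 0]), 3))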

3.5 Learning to Rank Using Social Features

In the light of the above findings, it is promising to combine the basic and social features to optimize the retrieval performance. Moreover, as there is a high overlap between the rankings generated by certain pairs of features, it also seems reasonable to apply a feature selection algorithm. In what follows, we present our video retrieval framework involving a number of state-of-the-art learning to rank (LETOR) strategies and a greedy feature selection strategy adapted from [R61]. In this framework, we explore the impact of social features on video retrieval.

3.5.1 Video retrieval framework

In recent years, traditional ranking approaches based on the manually designed ranking functions (such as BM25, TF-IDF, etc.) are replaced or complemented by the rankers built by machine learning strategies [R53]. In a typical LETOR framework, a machine learning algorithm is trained using a set of triples of (q, F, r), where q is the query id, F is the mdimensional feature vector for a result object retrieved for q, and r is the relevance score. The learnt model is used to predict the relevance score for each pair (q, F) in the test set, which is then sorted with respect to these predicted scores. The success of the ranking model is evaluated using measures like NDCG [R53]. The LETOR algorithms proposed in the literature fall into three categories, namely, pointwise, pair-wise and list-wise [R53]. In this paper, we employ six LETOR approaches that cover all of these categories. We provide a concise description of each approach and refer the readers to the literature for details. RankSVM: This approach extends traditional SVM by utilizing instance pairs and their labels during training. In this work, we use the implementation by Joachims [R66]. RankBoost: First introduced by [R59], this algorithm also employs a pairwise boosting technique for ranking. ListNet: Instead of taking documents pairs as the instances, list-wise approaches exploit the lists of documents during the learning. In particular, ListNet [R50] is based on the Neural Networks and employs Gradient Descent algorithm in the optimization stage. CoordinateAscent: This is again a list-wise linear model which uses coordinate ascent technique that optimizes multivariate objective functions by sequentially doing optimization in one dimension at a time [R82]. For RankBoost, ListNet and Coordinate Ascent approaches, we use the RankLib package(also see [R57]). Gradient Boosted Regression Trees (GBRT): This is a simple yet very effective pointwise method for learning non-linear functions [R60] and indeed, said to be the current state-of-the-art learning paradigm [R70]. Random Forests (RF): Random Forests is a point-wise ranking approach based on the bagging technique, i.e., applying the learning algorithm multiple times on different subsets of the training data and averaging the results [R70]. RF is proposed as a low-cost alternative to GBRT with the additional advantage of being very resistant to over-fitting. Initialized Gradient Boosted Regression Trees (iGBRT): This approach uses the predictions from the RF algorithm as a starting point for the GBRT algorithm [R70]. We use the RT-Rank library for the GBRT, RF and iGBRT. For the above algorithms, we experiment with various parameter values and report the results for the best-performing configuration for each setup. In particular, for RankSVM, the trade-off parameter between the training error and margin is set to 10. For RankBoost, the number of rounds for training is set to 300 and the number of threshold candidates to search is set to 5. The number of training epochs is set to 500 for ListNet. For CoordinateAscent, the number of random restarts is 5 and number of iterations to search in each dimension is 10. We also set the metric to optimize on training data as the NDCG. For GBRT, the tree depth parameter is set to 2, number of trees for the ensemble is 1,000, and learning rate is 0.1 (the latter two values are also used in [R70]). 
For RF, again following the practices in [R70], we set the tree depth to 10% of the number of features (i.e., 2 in our setup, as we have at most 20 features) and the number of trees to 10,000, since the algorithm is resistant to over-fitting. Finally, for iGBRT, we used the above parameters for RF, obtained predictions for the current test set, and piped these predictions to GBRT, which is also invoked with the above parameters for the original algorithm.
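To make the general training/prediction cycle described above concrete, the following is a minimal point-wise sketch in Python, using scikit-learn's gradient boosted regression trees as a stand-in for GBRT; the data arrays, query ids and relevance labels are hypothetical toy values, and this is not the RankLib/RT-Rank setup actually used in the experiments.

# Minimal point-wise LETOR sketch (GBRT stand-in via scikit-learn).
# Hypothetical data layout: one row per (query, result) pair.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n_rows, n_features = 500, 20                   # e.g. ~10 results for each of 50 queries
query_ids = np.repeat(np.arange(50), 10)
X = rng.random((n_rows, n_features))           # feature vectors F (basic + social features)
y = rng.integers(0, 4, n_rows).astype(float)   # graded relevance labels r

# Train a point-wise ranker: learn to predict r from F, as in the GBRT setup above.
model = GradientBoostingRegressor(n_estimators=1000, max_depth=2, learning_rate=0.1)
model.fit(X, y)

# At test time, score every (q, F) pair and sort the results of each query
# by the predicted relevance to obtain the ranking (here scored on the toy
# training data purely for illustration).
scores = model.predict(X)
rankings = {
    q: np.argsort(-scores[query_ids == q])     # result indices, best first
    for q in np.unique(query_ids)
}
print(rankings[0][:5])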

3.5.2 Feature selection for LETOR approaches

Feature selection is a well-known approach in machine learning for enhancing the accuracy (e.g., by preventing over-fitting) of the learnt model and the efficiency of the learning process [R61][R57]. Geng et al. address the importance of feature selection for machine learning based approaches to the ranking problem and propose a greedy feature selection strategy that is also applicable in our framework [R61]. Formally, given a set of features {f1, ..., fm} and the target number of features, k, the goal is to select the k features that yield the maximum performance for a LETOR algorithm. Each feature is associated with an importance score, Imp(f), which is an indicator of the retrieval effectiveness of f. Furthermore, for each feature pair (fi, fj), the similarity of their top-N rankings is computed. The optimization problem is defined as choosing a set of k features that maximizes the sum of the feature importance scores and minimizes the sum of the similarity scores between any two features. In what follows, we discuss two greedy feature selection strategies to address this optimization problem. First, we briefly review a strategy, the so-called GAS (Greedy search Algorithm of Feature Selection), introduced by Geng et al. [R61]. Next, we propose to adopt a well-known strategy, Maximal Marginal Relevance [R51], for feature selection in our learning to rank framework.

GAS: This is a greedy search strategy [R61] that starts by choosing the feature with the highest importance score, say fi, into the top-k feature set, S. Next, for each of the remaining features fj, the importance score is updated with respect to the following equation:

Imp(fj) = Imp(fj) − c · Sim(fi, fj)

where c is a constant balancing the importance and similarity optimization objectives. The algorithm proceeds by choosing the next feature with the highest (updated) importance score and updating the remaining scores, until k features are determined.

MMR: This is another well-known greedy strategy [R51], originally introduced for the search result diversification problem, i.e., to construct both relevant and diverse top-k results for a given query. We adopt MMR to choose the features that yield both the highest average effectiveness and, at the same time, the most diverse rankings. In a similar manner to GAS, the MMR strategy also starts by choosing the feature fi with the highest importance score into the top-k feature set, S. Next, in each iteration, MMR computes the score of an unselected feature fj according to the following equation:

score(fj) = Imp(fj) − c · max{Sim(fi, fj) : fi ∈ S}

where c is again a constant to balance importance and similarity. In other words, the score of fj in MMR is computed by discounting the feature's importance score with its maximum similarity to the features that are already selected into S. In our case, following the practice in [R61], the feature importance score, Imp(f), is set to the NDCG@10 score of f obtained over the queries in the training set. The similarity score Sim(fi, fj) between any two features fi and fj is computed by a variant of the Kendall's Tau metric between their top-10 rankings, again over the queries in the training set.
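The two strategies can be summarised by the following sketch, which assumes that a vector of per-feature importance scores (e.g. NDCG@10 values) and a pairwise similarity matrix (e.g. Kendall's Tau of the top-10 rankings) have already been computed; the function and variable names are ours, and the toy values are purely illustrative.

# Sketch of the two greedy feature-selection strategies described above.
import numpy as np

def gas(imp, sim, k, c=1.0):
    """GAS: pick the best feature, then penalise the remaining importance scores."""
    imp = imp.astype(float).copy()
    selected, remaining = [], set(range(len(imp)))
    while len(selected) < k and remaining:
        best = max(remaining, key=lambda j: imp[j])
        selected.append(best)
        remaining.discard(best)
        for j in remaining:                 # update: Imp(fj) <- Imp(fj) - c * Sim(fbest, fj)
            imp[j] -= c * sim[best, j]
    return selected

def mmr(imp, sim, k, c=1.0):
    """MMR: discount each candidate by its maximum similarity to the selected set."""
    selected, remaining = [], set(range(len(imp)))
    while len(selected) < k and remaining:
        best = max(remaining,
                   key=lambda j: imp[j] - c * max((sim[i, j] for i in selected), default=0.0))
        selected.append(best)
        remaining.discard(best)
    return selected

# Toy example with 5 features.
imp = np.array([0.80, 0.78, 0.60, 0.55, 0.30])
sim = np.full((5, 5), 0.2); np.fill_diagonal(sim, 1.0); sim[0, 1] = sim[1, 0] = 0.9
print(gas(imp, sim, k=3), mmr(imp, sim, k=3))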



3.5.3 Experimental results for feature selection

In our LETOR framework, all experiments are conducted using five-fold cross validation over the query set of 50 queries described in the previous section. For each fold, we first used the training set of 40 queries (i.e., around 4,000 annotations) to determine the k-feature sets (where 1 ≤ k ≤ 20) from the set of all basic and social features, i.e., F, using the greedy selection algorithm. Next, for each value of k, the LETOR algorithms in our repository were trained with the same set of instances and these k features, and tested on the remaining 10 queries (around 1,000 annotated instances). The average NDCG@5 and NDCG@10 scores are computed using the trec_eval software for the test queries. The final scores are obtained by averaging over the folds. In Figure 12, we provide the performance of each LETOR algorithm with respect to the number of features selected with the GAS or MMR strategies. As in [R61], the performance fluctuates as the feature set grows. Nevertheless, for almost all cases, there exists a set of features, the so-called best-k set, that yields a higher performance than using all of the features, which justifies our use of a feature selection algorithm. We further observe that the MMR strategy, adapted to the LETOR framework in this work, is comparable to GAS and, in particular cases, can even outperform it.



Figure 12: NDCG scores for the LETOR algorithms w.r.t. the number of features
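The NDCG@5 and NDCG@10 scores reported here are computed with the trec_eval software; as a self-contained illustration, the following sketch computes NDCG@k for a single query from graded relevance labels using the usual log2 position discount (the exact gain and discount variant applied by trec_eval may differ).

# Sketch: NDCG@k for one query from graded relevance labels, in ranked order.
import math

def dcg_at_k(gains, k):
    # Discounted cumulative gain with a log2 position discount.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_relevances, k):
    ideal = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance labels of the results as returned by some ranker (hypothetical values).
ranked = [3, 2, 3, 0, 1, 2, 0, 0, 1, 0]
print(round(ndcg_at_k(ranked, 5), 4), round(ndcg_at_k(ranked, 10), 4))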



3.5.4 Experimental results for the impact of social features

To expose the potential of social features for video ranking, we compare the retrieval performance of using the best-k feature sets (from Figure 12) to the performance of using only the basic features. For the latter case, we employ the features fTags, fTitle and fDesc for training and testing all of the LETOR algorithms. Our findings are reported in Table 6. Note that all of the best-k sets obtained with the GAS strategy include some social features, as k is always found to be greater than the number of basic features, i.e., 3. The findings are similar for the majority of the cases with MMR. Before discussing our results, please note that we avoid comparing the LETOR algorithms to each other in our framework. This is because a one-way ANOVA test comparing the NDCG@10 scores of the 50 queries for these six algorithms reveals that the performance differences among them are usually not significant, regardless of the feature set they employ (i.e., the basic or best-k features). In other words, it cannot be stated, from a statistical point of view, which of these algorithms performs best in our video retrieval scenario; what is important is therefore whether the performance of any of these algorithms can be improved using the social features. As Table 6 reveals, for all the algorithms, the best-k features can improve the NDCG@10 scores obtained by the basic features alone. For some cases, the improvement is numerically small, i.e., around 1% (though so are most of the results reported in the LETOR literature, e.g., see [R53][R70]), while in other cases, using particular social features in combination with the basic features can add up to an absolute 7% to the effectiveness. The gains in NDCG@10 obtained by using the best-k sets with social features are found to be statistically significant at a 95% confidence level for the RF and iGBRT approaches. The smaller numeric improvements in NDCG@10 scores, observed for the other algorithms, are not statistically significant.

                 RankSvm     RankBoost   CoordinateAsc   ListNet      GBRT        RF           iGBRT
Basic Features   0.8655      0.8092      0.8356          0.8243       0.8528      0.8073       0.8073
GAS              0.8664 (3)  0.8228 (6)  0.8425 (4)      0.8378 (16)  0.8616 (6)  0.8547 (9)   0.8605 (9)
MMR              0.8691 (2)  0.8146 (6)  0.8424 (7)      0.8384 (14)  0.8581 (2)  0.8588 (12)  0.8576 (18)

Table 6: Average NDCG@10 scores for LETOR algorithms using the basic and best-k features obtained with the GAS and MMR strategies (for bold cases, differences from the baseline are statistically significant). For GAS and MMR, we also denote the number of selected features (k) in parentheses
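The significance analysis mentioned above can be illustrated with the following sketch, which runs a one-way ANOVA over the per-query NDCG@10 scores of six rankers; the score arrays are randomly generated stand-ins for the 50 per-query values, not the actual experimental data.

# Sketch: one-way ANOVA over per-query NDCG@10 scores of several rankers.
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(42)
# Hypothetical stand-ins for the 50 per-query NDCG@10 values of each algorithm.
scores_per_algorithm = [np.clip(rng.normal(0.84, 0.08, 50), 0, 1) for _ in range(6)]

f_stat, p_value = f_oneway(*scores_per_algorithm)
print(f"F = {f_stat:.3f}, p = {p_value:.3f}")
if p_value >= 0.05:
    print("No significant difference among the rankers at the 95% confidence level.")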



3.6 Summary of the research findings

Our major findings in this section are summarized as follows:
- We show that the social features can improve the video retrieval performance when combined with the basic features, which on their own constitute a very strong baseline. Furthermore, we demonstrate the usefulness of such features not only for popular queries, for which additional clues might be obtained from abundant click data, but also for tail queries, for which such click information is very scarce. This latter finding is worthwhile, given that the competition among search engines is becoming more focused on queries in the long tail (e.g., [R81]). The same results hold for image retrieval performance.
- Our experiments reveal that using all the basic and social features within a LETOR framework is ineffective, and that feature selection strategies can successfully eliminate the redundant features (i.e., those that have low retrieval effectiveness and/or high overlap with the already selected features). At the same time, the best-k sets still include several social features (see the values of k in Table 6), which indicates that some social features that were not so effective on their own turn out to be useful when combined with other basic and social features.
- We finally show that the MMR strategy, as adopted in our implementation, is comparable or even superior to GAS for the purposes of feature selection.

3.7 Expansion through Images component

Our choice of running the community-aware multi-media search result ranking and filtering on YouTube is based on the availability of a rich set of social features in this platform. Our findings are therefore applicable to text, image and/or video search in other platforms that support similar kinds of features. The latter is not just an assumption; it is based on preliminary studies carried out by the LUH team. A detailed description of this activity and its results will be part of D6.3 (M29). Accordingly, the methods developed by LUH in Task 6.3 have been applied in the Expansion through Images component, which is now part of the HoE V-APP. The input for this component is a query (e.g. Gerhard Schröder) and the output is a set of best-k Flickr images. The component is nested in the HoE Pipeline; in order to provide a focussed example of the method, a standalone example can be executed as

java -jar ExpansionThroughImages.jar -k 10 -q "gerhard+schroder"

-k: the number of results to retrieve, ordered based on relevance (default 10)
-q: the query text (default "gerhard+schroder")

In order to provide the best-k images, our component performs the following steps (a sketch of how the standalone example can be invoked programmatically is given after this list):
1) Retrieve the top-300 results (images with the associated social feedback) from the LUH Flickr HoE Dataset.
2) Re-rank the top-300 results based on text + community features. The algorithm used here is RankSVM, and the feature selection is performed using the GAS method (described in Section 3.5.1 and Section 3.5.2, respectively).
3) Return the list of URLs of the images found to be relevant for the given text query.
The overall functionality of the Expansion through Images component is shown in Figure 13.
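As a convenience, the standalone example can also be wrapped from another program. The sketch below calls the documented jar via Python's subprocess module; the assumption that the component prints one image URL per line on standard output is ours, not part of the documented interface.

# Sketch: invoking the standalone Expansion through Images example from Python.
import subprocess

def expansion_through_images(query, k=10):
    result = subprocess.run(
        ["java", "-jar", "ExpansionThroughImages.jar", "-k", str(k), "-q", query],
        capture_output=True, text=True, check=True,
    )
    # Keep only non-empty lines; each is assumed to be the URL of a relevant image.
    return [line.strip() for line in result.stdout.splitlines() if line.strip()]

if __name__ == "__main__":
    for url in expansion_through_images("gerhard+schroder", k=10):
        print(url)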



Figure 13: The functionality of the Expansion through Images component

3.8 Conclusion

To the best of our knowledge, we provide the first comprehensive investigation of the impact of social features on video retrieval effectiveness. To this end, we focus on a keyword-based video search scenario for YouTube. The social features employed in this work are derived from the raw meta-data fields of the videos as well as from the profiles of the users who share and/or interact with these items. We show that while the basic features relying on the similarity of the query to the video titles, tags and descriptions are the most effective for video retrieval, the social features are also valuable and can yield the best rankings for up to 58% of the queries, indicating their potential to improve the retrieval effectiveness. Our evaluations using two greedy feature selection algorithms and six state-of-the-art LETOR algorithms support our hypothesis: the rankers based on subsets of features including both basic and social features outperform those built using only the basic features.

3.9 Future Work

In the remainder of Task 6.3 we plan three main directions for future work. First, research will focus on further feature selection methods for community features specific to Flickr, in order to improve the effectiveness of our image retrieval component. Second, we aim to obtain larger annotated datasets by leveraging popular crowdsourcing solutions. Finally, we plan to develop new LETOR strategies that are specialized for different feature types.



4. Context-aware automatic query formulation

4.1 Introduction

4.1.1 Purpose and Scope

Guessing user informational needs and formulating queries for the user in advance can be considered an inverse problem. Typically, a user submits a query to a search engine and the engine returns some results. The aim of T6.4 (context-aware automatic query formulation), however, is to produce a mechanism that works the opposite way: it predicts the user's informational needs and thus automatically formulates an appropriate query, based on the user's search behaviour (e.g. previously submitted queries and search results clicked) and other contextual information. A successful "guessing" module would be valuable to a number of search engine related applications, such as online advertisement and web page re-ranking. Moreover, this mechanism could greatly improve user experience and working efficiency. Impressively, Qiu and Cho [R1] list statistics (Nielsen/NetRatings reports [R2]) showing that if the time users spend searching on Google could be reduced by just 1% by integrating personalised features into the search engine, more than 187,000 person-hours (i.e. 21 years) would be economised each month. A personalised approach to inferring a user's search intent has been suggested by several researchers [R1], [R3]. Although two users issuing the same query could have different informational needs, a typical search engine would return the same list of results to both of them. To confront this issue, it has been suggested that the search engine should construct a preference profile for each user based on his/her search history.

4.1.2 Contribution

In the context of T6.4, an algorithm has been developed during the reporting period to analyse the behaviour of the users according to the information gathered on their past activities and on the activities of other related users, with the aim of suggesting search queries to the users while they use the application. Specifically, the algorithm predicts the user's informational needs and thus automatically formulates appropriate queries, based on the user's search behaviour (e.g. previously submitted queries and search results clicked) and other contextual information. Both the global and the user history query log, as well as temporal and spatial information, are taken into account to infer the user's search intent. Moreover, we integrated enhanced personalized information into the algorithm, through the estimation of the user's mood for every query the user has submitted. In particular, a method for estimating the mood for each query that the user submits has been implemented. The method is based on assumptions about how user mood would be influenced by events taking place during search engine usage. Events are extracted from user click-through data. Thus, the algorithm is eventually capable of ranking the returned query suggestions according to their associated mood, since a query with an associated "negative mood" is a query that "disappointed" the user who submitted it. In order to acquire a first evaluation of the algorithm developed for predicting the informational needs of the user and thus automatically formulating an appropriate query, the AOL query log 2006 has been employed. We have also used the wordnet::similarity module in order to implement semantic similarity and relatedness measures. Thus, a similarity measurement can be obtained between each query submitted by a user (at the AOL search engine) and the queries that our developed algorithm would have suggested to her/him. The algorithm receives as input only data (from the AOL query log) that had been submitted to the search engine before the timestamp of each submitted query. First results indicate a mean similarity of over 56.04%.



In addition, in order to evaluate the feasibility of integrating the developed algorithm within a real search engine interface with real users and to qualitatively investigate its effectiveness in real-time automatic query formulation, we developed a Google-like search engine interface using the SMILA code and the fashion dataset. Moreover, we engaged 47 volunteer users to register to the search engine interface. Participants used the system for a period of three months.

4.2 Relevant Research

In this section, research topics closely related to automatically guessing user needs and formulating appropriate queries for the user in advance are briefly reviewed. These topics are mostly related to query recommendation and could provide useful insight.

4.2.1 User submitted queries

Adequately formulated queries are essential to the effective performance of keyword-based search engines, such as Google or Yahoo. A query, to be considered successful, has to be both expressive and selective [R4]. An expressive query clearly conveys the user's information need, while a selective query brings about a reasonably limited set of matching results. A frequent problem concerning user input to search engines are queries consisting of only one or two keywords [R5][R6]. These short queries are usually characterized by a high degree of ambiguity [R7]. For instance, the keyword "apple" could reflect a user's intention to be informed about the fruit apple, a company named apple, or a bar-restaurant in town named apple. In that sense, poor retrieval results are a consequence of short queries that lack expressiveness. Even if a short query happens not to be ambiguous, it is very likely to be too general. For example, the keyword "sea" may imply a user's intention to learn about the Caspian Sea, the Black Sea, the Mediterranean Sea, or even an ocean. Hence, short queries' lack of selectiveness causes the search engine to retrieve a large number of results, thus often failing to accurately present the specific information that the user intended to search for. Most often, search engine users want to be informed about things with which they are largely unacquainted. Therefore, they need assistance in order to formulate an effective search query. Modern search engines integrate artificial intelligence methods to assist users in creating a useful search query. In this context, query recommendation is a frequently used method to assist users toward this goal and is a research area directly related to the research topic addressed in the present work.



Figure 14: Recommended queries for keyword “apple”

4.2.2 Query recommendation

Query recommendation is a method integrated into search engines in order to help users reformulate their queries. When users specify a query, the search engine provides them with a list of suggested queries so that they can identify the query that most closely matches their informational needs (Figure 14). For instance, if a user types the keyword "apple", the search engine would recommend queries such as "apple laptop" or "apple iphone price". Traditionally, query recommendation methods are based on similarity metrics, suggesting frequently used queries that are similar to the user input [R8][R9]. This approach is based on the assumption that similar queries express the same informational need and that the most frequently used of them convey that need more appropriately. Nevertheless, there is a variety of similarity metrics that differ considerably from each other regarding the way they evaluate similarity. Recommending queries based on similarity may lead to two serious problems: redundancy and monotonicity. Redundancy in recommendation may be caused when recommended queries are too similar, thus not helping the user effectively refine his/her query. For instance, if the user submits the query "types of cheese" and the search engine recommends the queries "names of cheese", "all types of cheese", and "types of cheese alphabetical", it is rather unlikely that these queries would help the user in case she/he needed to refine his/her initial query, since the three recommended queries would return similar results. Monotonicity in recommendation refers to recommended queries that, although not too similar, refer to the same concept. For example, in the case of the input query "apple", the recommendations "apple juice" and "vitamins of apple" both interpret the keyword "apple" as a fruit, while the user may have intended to search for "apple laptop".



Session-based approaches to query recommendation
A number of researchers have proposed suggesting to a user query-relevant terms that co-occur in similar query sessions. Huang, Chien, & Oyang [R10] used query data passing through a proxy server to different search engines. They identified queries belonging to the same search sessions in query logs by taking into account the queries' IP address and time, so as to identify the beginning and end of each session within the logs. Based on Silverstein et al. [R11], who argued that a user submits queries referring to the same concept within a time frame and that some time has to pass from the last query of a session before the user submits queries referring to a different concept (i.e. starts a new session), Huang, Chien, & Oyang [R10] employed a time threshold as a delimiter in order to distinguish between different query sessions. They observed that a 5 minute threshold was ideal for this purpose, which interestingly had also been used by Silverstein et al. [R11]. After having extracted sessions from the query logs, Huang, Chien, & Oyang [R10] applied similarity metrics on a term co-occurrence matrix in order to define related terms in similar sessions and use them as suggestions for each other. Although the Huang, Chien, and Oyang [R10] recommendation algorithm has been described in various papers as a "session-based approach" [R12][R13], it does also employ user context information: it suggests terms that are relevant not only to the user's currently issued query but also to his/her previously submitted ones (if any) in his/her current session. Fonseca, Golgher, Moura, and Ziviani [R8] follow a similar approach to Huang, Chien, & Oyang [R10]. However, they arbitrarily segment query sessions from log data, defining a session as all the queries made by a user in a predefined time interval. For their experiment, they seem to have randomly defined this interval to be 10 minutes. Fonseca, Golgher, Moura, and Ziviani [R8] do, however, also use the IP address to distinguish between different users. Finally, the algorithm of [R8] defines related terms in similar sessions but does not make use of the user's context information. It should be noted that all methods described in this section have less chance of extracting relevant terms for queries rarely submitted by users.

Grouping similar queries into clusters
Clustering queries based on click-through and lexicographical information
Several researchers have proposed grouping queries into clusters and using them as suggestions for each other. This cluster-based approach derives from click-through information, supposing that the more clicked URLs two queries share, the more related they are. Beeferman & Berger [R14], who were among the pioneers of this method, suggested forming a bipartite graph consisting of vertices corresponding to queries on one side and, on the other side, to the URLs that users selected to click, and then applying an agglomerative clustering algorithm to the graph's vertices in order to classify associated queries and URLs. It should be noted that content is not taken into account by this method, i.e. the method does not compare keywords between queries and/or clicked URLs. Wen, Nie, & Zhang [R5], however, did combine content (i.e. keywords) with click-through information so as to effectively cluster queries.

Context-aware clustering
Cao et al. [R12] go even further: they consider groups of similar queries to be concepts and thus cluster global query logs into different concept groups employing a click-through bipartite graph, as in [R14]. Then, they cluster the user's previous queries (i.e. the user context) into concept groups, forming a sequence of concepts to represent the user's context. Next, their method determines which queries other users may ask after that specific sequence of concepts and returns a popularity-ranked list of recommended queries to the user. Because of the excessive computational load, the mining of historical sessions in the search log data is performed offline, forming a "concept sequence suffix tree" against which the user's query is matched online



and the concepts to which the user's next query may belong are detected. Finally, the most popular queries in these concepts are presented to the user. Nevertheless, as He et al. [R13] state: "Although these methods can effectively find similar queries, in query recommendation, it is more interesting to recommend queries that a user may ask next in the query context, rather than suggest queries to replace the current query".

4.3 Proposed Method

Our proposed method, developed in the context of T6.4 during the reporting period, is based on both personal and global query logs to infer the user's search intent. Thus, in the absence of personal data at any given time and place, the search engine can still make a guess about the user's informational need by taking into account queries of other users whose sessions share similarities with the current user's session. The training part of the method consists of 4 steps, which are described in the respective sections of the deliverable:
A. Session extraction from log data (4.3.1).
B. Classification of sessions according to temporal and spatial criteria (4.3.2).
C. Clustering of sessions based on the similarity of their queries (4.3.3).
D. Calculation of mood scores for candidate query suggestions (4.4).

4.3.1 Session extraction from log data

In order to distinguish between different sessions, the method extracts the first 2 or 3 queries submitted by each unique IP address, accepting time intervals of at most 5 minutes between queries. Taking IP addresses into account to identify different users (and thus different search sessions) has been a common practice [R8][R10]. The 5 minute time interval is chosen based on the research of Silverstein et al. [R11] and Huang, Chien, and Oyang [R10]. The chosen number of queries per session is supported by previous studies approximating the average length of a query session to be 2~3 queries [R15][R16]. Let us now assume that Q is the set of all queries, qi is a query string, ipi the IP address of the machine from which a query qi was submitted, ti the timestamp at which a query qi was submitted, and l, or |qs|, is the length of a query session, i.e., the number of requests in a session. Thus, a query session qs can be defined as follows:

qs = (ip, r1, ..., rn),  r ∈ R,  2 ≤ n ≤ 3,

where r1, ..., rn are up to 3 search requests submitted from the same IP address, and R is the set of all requests. The minimum value of n has been set to 2 because at least two requests are needed to form a query session. The maximum value of n (i.e. the length of the query session) has been limited to 3 in order to increase the certainty that the extracted requests convey the same informational need and to decrease the risk of including into qs cases where the user's search intent shifts unexpectedly [R17]. It should be noted at this point that each search request is considered herein to consist of a timestamp, the query that was submitted to the search engine and the set of user actions that followed the presentation of search results (e.g. clicks on URLs of search results). Thus, a search request is encoded herein as:

ri = (ti, qi, mi)

where qi is the query submitted to the search engine, ti is the timestamp of the submitted query and mi is the set of user actions that followed the presentation of search results. To summarize, for a request ri to be added to a query session qs after ri-1 has been identified to belong in that session, the following conditions should be satisfied (a sketch of the extraction procedure is given after the list):
1) ipi-1 = ipi
2) ti − ti-1 ≤ time threshold (five minutes)
3) |qs| = l ≤ 3
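A sketch of this extraction procedure is given below. The log record layout (ip, timestamp, query, actions) and the field names are our own simplification of the request encoding above, and this simplified variant segments consecutive requests rather than keeping only the first session per IP.

# Sketch of the session-extraction conditions above.
from datetime import datetime, timedelta

TIME_THRESHOLD = timedelta(minutes=5)
MAX_SESSION_LENGTH = 3   # at most 3 requests per session
MIN_SESSION_LENGTH = 2   # at least 2 requests are needed to form a session

def extract_sessions(log):
    """log: list of (ip, timestamp, query, actions) records, sorted by timestamp."""
    sessions, current = [], []
    for record in log:
        ip, ts, query, actions = record
        if (current
                and ip == current[-1][0]                      # condition 1: same IP
                and ts - current[-1][1] <= TIME_THRESHOLD     # condition 2: <= 5 minutes
                and len(current) < MAX_SESSION_LENGTH):       # condition 3: |qs| <= 3
            current.append(record)
        else:
            if len(current) >= MIN_SESSION_LENGTH:
                sessions.append(current)
            current = [record]
    if len(current) >= MIN_SESSION_LENGTH:
        sessions.append(current)
    return sessions

log = [
    ("1.2.3.4", datetime(2013, 6, 1, 10, 0), "jacques chirac", ["click:1"]),
    ("1.2.3.4", datetime(2013, 6, 1, 10, 2), "chirac biography", ["click:3"]),
    ("5.6.7.8", datetime(2013, 6, 1, 10, 3), "gerhard schroder", []),
]
print(len(extract_sessions(log)))   # -> 1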

As a result, a set of all sessions S is finally obtained, i.e.:

S = {qs1, ..., qsi}

4.3.2 Classification of sessions based on time and spatial resolution

Following the above process of splitting the query logs into sessions, our method classifies the sessions into groups according to temporal and spatial criteria, in order to increase the resolution of the analysis of the session data. Initially, the time stamps of the queries are matched to specific time periods:
- Early Morning (06:01 to 09:00)
- Morning (09:01 to 12:00)
- Late Morning (12:01 to 15:00)
- Early Afternoon (15:01 to 18:00)
- Late Afternoon (18:01 to 21:00)
- Night (21:01 to 00:00)
- Early After Midnight (00:00 to 03:00)
- Late After Midnight (03:01 to 06:00)
Having classified the sessions into groups with respect to the time period to which they belong, we subsequently further classify each group's sessions with respect to the further temporal and spatial criteria employed in our method, as explained below. In respect of time, we use the following criteria:
- "Month" (M)
- "Name of Day" (NAD)
- "Number of Day in the Month" (NUD)
In respect of space, we use:
- "City" (CI)
- "Country" (CO)
where CI and CO are criteria derived from the IP address. It should be noted at this point that, in order to accomplish the translation of each IP address into its corresponding city and country, a REST web service was developed by CERTH that takes an IP as input and returns the corresponding city and country as output. This functionality is supported by an appropriate MySQL database that was also built in this context, which was populated with data obtained from the MaxMind GeoLite City and Country IP geolocation databases [R18]. Our IP geolocation module was developed as a web service, publicly accessible at http://160.40.50.84:8080/WebApp1/webresources/ipWS?ip=xxx.xxx.xxx.xxx, so as to simplify the effort needed to integrate its functionality with the rest of our query suggestion framework and the web search engine interface that was developed in order to examine our method's operation in practice (shown in Figure 15 and Figure 16 of Section 4.6). Eventually, the sessions of each time period are classified into groups with respect to all the afore-described temporal and spatial characteristics. The number g of ways in which groups formed using the above criteria can be selected, taking order into account, is given by the following relation:

g = Σ_{r=1}^{n} n! / (n − r)!,


where n is the total number of specified criteria (5 in this case). It should be noted that the afore-described session classification framework could in the future also be applied with different spatiotemporal criteria. Thus, future studies using the method proposed in this work could utilize further, or even different, temporal and spatial criteria and/or alter the selected time periods. In fact, effectively defining the above parameters requires evidence from a number of future studies and experiments.
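For the five criteria used here (n = 5), the relation gives g = 5 + 20 + 60 + 120 + 120 = 325 ordered selections; a two-line check of this arithmetic:

# Check of the relation g = sum_{r=1}^{n} n! / (n - r)!  for n = 5 criteria.
from math import factorial

n = 5
g = sum(factorial(n) // factorial(n - r) for r in range(1, n + 1))
print(g)   # -> 325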

4.3.3 Clustering of sessions

At this stage, it is important to cluster together unique similar sessions. Notably, the groups generated from the above predefined criteria may contain overlapping data. For example, sets grouped according to the CI criterion are also included in sets grouped according to the CO criterion, i.e. CI ⊆ CO. Moreover, global data overlap with personal data. Hence, two tasks have to be performed: 1) use the union of the groups M, NAD, NUD, CI, and CO in order to avoid double counting of sessions, and 2) estimate the similarity of these query sessions in order to create clusters. Let uSg and uSp be the sets containing all unique global and personal sessions respectively, i.e.

uSg = (Mg ∪ NADg ∪ NUDg ∪ CIg ∪ COg) \ (Mp ∪ NADp ∪ NUDp ∪ CIp ∪ COp)
uSp = (Mp ∪ NADp ∪ NUDp ∪ CIp ∪ COp) \ (Mg ∪ NADg ∪ NUDg ∪ CIg ∪ COg)

These sets eventually contain all global (uSg) and personal (uSp) sessions that have similar temporal and spatial characteristics to the logged-in user's current session. The remaining task for the query suggestion method lies in identifying, within these sets, the query terms that are appropriate to be provided to the user as suggestions. In order to do so, the clustering process described in the following is employed. An important part of the clustering process is to decide on the most adequate method for the task at hand. The particularities of our task are the following:
1) We are interested in guessing the user's search intent and formulating appropriate queries encompassing general concepts. Suggesting too specific queries to the user may reduce the chance of making a correct guess. Thus, focusing on low frequency query sessions may not be beneficial.
2) The clustering method should not require manual setting of the resulting number of clusters.
3) New query sessions will be added to the search engine's history log. Therefore, the clustering method should have the ability to function incrementally.
Therefore, the hierarchical clustering approach better serves, in our context, the needs of guessing the user's search intent. Clustering specific concepts under more general concepts provides a larger space for guessing a user's search intent and does not require defining the number of clusters in advance. Moreover, a hierarchical clustering approach has the potential to be designed as an incremental, adaptive procedure. Nevertheless, the quality of the clustering results also depends on the similarity method employed by a hierarchical clustering approach. In this work, the co-occurrence frequency between each unique query term of a session (r1x) and each unique query term of the session that is considered to belong to the same cluster (r2y) is taken into account for calculating the similarity between sessions, which allows clustering on the basis of a session similarity threshold:

f(r1x, r2y) = co-occurrence(r1x, r2y)

Thereafter, session similarity is calculated as the average co-occurrence score over all possible unique query term pairs between the two examined sessions. For instance, the similarity between sessions qs1 = (ip1, r11, ..., r1m) and qs2 = (ip2, r21, ..., r2n) would be:

sim(qs1, qs2) = Σ_{1 ≤ i ≤ m, 1 ≤ j ≤ n} sim(r1i, r2j) / (m · n)

Thus, for each pair of sessions contained in uSg and uSp respectively, their similarity is calculated based on the above formula and is then compared against a pre-defined, experimentally set threshold, so as to decide whether the two sessions belong to the same cluster or not. The co-occurrence frequency similarity method was chosen because it favours high frequency terms [R5], thus increasing generality in the clusters. Finally, as in [R5], each one of the acquired clusters is named by the pair of common query terms with the highest frequency. These query terms are eventually added to the suggestions list that is to be provided to the user. Suggestions derived from the personal sessions set (uSp) are added first, whereas suggestions from the global sessions set (uSg) are added at subsequent positions of the list, as further explained below in Section 4.5. At this point, the candidate suggestions list can either be provided as is to the search engine user, with the queries ranked essentially in terms of their occurrence frequency and of whether they belong to "personal" or "global" sessions, or it can be further refined through our developed algorithm for estimating the user's mood during search engine usage. In the latter case, the candidate suggestions are evaluated in terms of the user mood with which they were found to be associated, through the mood estimation algorithm that was developed in the context of T6.4 and is described in the following (Section 4.4). The final suggestions list that is provided to the user through the complete version of our developed context-aware automatic query formulation method (as explained in Section 4.5) is therefore also based on mood as a further factor of the user's context.
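The following sketch illustrates the similarity computation and the threshold decision just described; the toy co-occurrence counts and the threshold value are illustrative assumptions, not values taken from the actual query logs.

# Sketch of the co-occurrence based session similarity and threshold decision.
from collections import Counter
from itertools import product

def cooccurrence_counts(sessions_terms):
    """Count how often each (unordered) pair of query terms appears in the same session."""
    counts = Counter()
    for terms in sessions_terms:
        terms = sorted(set(terms))
        for i, a in enumerate(terms):
            for b in terms[i + 1:]:
                counts[(a, b)] += 1
    return counts

def session_similarity(terms1, terms2, counts):
    """Average co-occurrence score over all term pairs of the two sessions."""
    pairs = list(product(set(terms1), set(terms2)))
    total = sum(counts.get(tuple(sorted(p)), 0) for p in pairs)
    return total / len(pairs) if pairs else 0.0

history = [["chirac", "president", "france"],
           ["chirac", "france", "election"],
           ["apple", "laptop"]]
counts = cooccurrence_counts(history)

qs1, qs2 = ["chirac", "president"], ["france", "election"]
THRESHOLD = 0.5   # the method uses an experimentally set threshold; this value is illustrative
print(session_similarity(qs1, qs2, counts) >= THRESHOLD)   # same cluster?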

4.4 Estimating user mood during search engine usage

4.4.1 Introduction

A successful affect recognition module would be valuable to a number of search engine related applications, such as online advertisement and web page re-ranking. For example, if a search engine can reason about the emotional state of a user from the input that the system receives, appropriate content could be displayed in a way adapted to the emotion or the mood of the user. Besides, this mechanism could provide a basis for efficient affective feedback presented by a search engine, and thus greatly improve user experience and working efficiency. Impressively, [R1] list statistics (Nielsen/NetRatings reports [R2]) showing that if the time users spend searching on Google could be reduced by just 1% by integrating personalised features into the search engine, more than 187,000 person-hours (i.e. 21 years) would be economised each month. In view of that, the aim of this work is to provide a basis for constructing a computational model able to estimate, as accurately as possible, user mood for each query a user submits to a search engine. Emotions do not simply consist of pleasant or unpleasant feelings, but also have causes of which humans are usually aware [R18]. Importantly, as the causal factors of emotions disappear, emotions gradually fade [R20]. On the other hand, an individual's current mood is usually influenced by more than one source [R21]. Emotional experiences of humans decisively influence mood alterations [R22]. Nevertheless, knowledge about the cause is only a central feature of emotions, while moods do not have such a focus and consequently can be the origin of a variety of perceptions [R18], [R23]. During search engine usage, individuals experience a variety of events (e.g. the search engine returns few or no results for a submitted query) that may trigger various emotional states. The user's affective state toward a search experience is the resultant of the various emotional states that were activated during that experience. After the end of the search session, users are most likely to focus on their



overall affective disposition (i.e. mood) towards the search experience, rather than on each particular event that took place during that experience and may have triggered an emotional state. Consequently, we constructed a model to calculate the assimilation of each event into the user's mood, based on assumptions about the influence of possible events during a search session. The model derives from research stating that a person's emotions could be predictable if their goals and their perception of relevant events were known [R24]. For instance, according to the OCC model [R24], joy and distress emotions arise when a person focuses on the desirability of an event in relation to his/her goals. The OCC model defines joy as a person being pleased with a desirable event, and distress as a person being displeased with an undesirable event. Its implementation in a computational model can be achieved by using agents, artificial intelligence techniques, and reasoning on goals, situations, and preferences [R25]. In this case, a computational model of mood in the context of a search engine was built, based on relevant assumptions. Interestingly, previous research on web searching [R26] has revealed connections between emotion and success. A user's success in finding desired information has been associated with positive feelings and with the user's behavioural intention to continue using the search engine. Yet, other researchers [R27] have provided evidence that longer times spent on searching can be linked to subjective feelings of "being lost in the web". As already stated, the model provides an estimation of mood for each submitted query. Research evidence has revealed that if people are somehow inclined to regulate their mood in anticipation of social interaction, the direction of such regulatory attempts should be towards neutrality, regardless of whether the initial mood is positive or negative [R28]. Moreover, there is research indicating that humans treat computers in a way similar to the social behaviour exhibited in human-human interactions [R29], [R30]. Consequently, users may also consider a search engine experience as a form of social interaction, so they may attempt to neutralize their mood when submitting a query. This kind of behaviour could, to a certain extent, be interpreted as a readiness for participation in interaction that suspends or erases prior emotions or moods that would possibly be unrelated to newly submitted queries. Whether and in what way user mood, as a result of previously submitted queries, should be taken into account when calculating user mood for a currently submitted query is an issue that should be critically addressed. Nevertheless, at this point we only focus our assumptions on forming a model based on the user's interaction with the search results presented for each separate query a user submitted. Although the model has been based on relevant literature, at this early stage a heuristic approach has been followed in order to provide a first ground for future work. While the model seems to behave reasonably well in the context of the three case studies presented in Section 4.4.3, the model's assumptions should be validated through real user data. Moreover, in the future, the model could include additional variables encompassing further aspects of affect-relevant user search behaviour.

4.4.2 Proposed Model

We have explored several research questions in the context of a Google-like search engine. Mood is estimated based on assumptions concerning events taking place as a user navigates through the result pages (e.g. the user clicks on a result). The user's mood is calculated incrementally, meaning that the effects of the events taking place are added gradually to the model. This is done in order to take into account the different emotional effects that may have taken place during the user's search experience. Although the model does not identify distinct emotions, it manages to distinguish between different events that may have triggered different emotional states. Hence, the user's estimated mood is the result of (emotional) effects emanating from diverse events after the user has submitted her/his query to the search engine. In order to model the effect of events, an exponential modality has been integrated into the otherwise linear logic of the model. This is supported by evidence that human senses are modelled more effectively through the use of an exponential function, and there is research suggesting that this could also apply to affective modelling [R31], [R32], [R33]. Emotions, just like senses, may not



respond to stimuli in a linear way. Therefore, the proposed method makes the assumption that the different events taking place during the user's search experience do not influence the user's mood in a linear way but in a logarithmic one. The exponential function exp() employed returns a number specifying e (the base of natural logarithms) raised to a power; that is to say, the natural logarithm of a number is the inverse of the exp() function. The number e is used to express values of such logarithmic quantities as field level, power level, sound pressure level, and logarithmic decrement [R34]. Affective issues concerning humans could be defined as logarithmic quantities as well. Human cognitive and affective reactions have been suggested not to be linear in the stimulus causing these reactions, but rather exponential. Hence, in order to express the logarithmic decrement or increment of the influence of the various events happening during a search session, the exp() function is used to model the form of that influence (e.g. inversely proportional to the user's mood). In respect of a search session, where the user has submitted a query and results have been returned, assumptions have been formed based on the order in which each result was selected (clicked), the ranking of each selected result, and the number of times the user clicked on a result. The following assumptions have been made:
A) The sooner a user finds interesting information, the better. Thus, we assume that the ranking of each selected result, along with the order in which each result was clicked, is inversely proportional to the user's mood.
B) Based on the notion that too many clicked results may imply an increased user burden, we suppose that the number of selected results is also inversely proportional to the user's mood. Moreover, based on the relevant literature, we assume that the influence of previous results fades exponentially as the user clicks on more results. Therefore, the assumed inversely proportional effect of the number of results orderly clicked is expressed through an exponential modality. The proposed method calculates this effect both incrementally for each selected result, and in total for the entire number of results clicked.
C) Clicking again on a result, after having clicked on several results, may be indicative of the user's "positive" perception of that result. Thus, we assume that the number of times the user clicked on a result is proportional to the user's mood.
In order to model the aforementioned assumptions, the following relation has been used:

Qm = (1 / (1 + exp(ord_n))) · Σ_{i=1}^{n} RTC_ord_i / ((1 + exp(ord_i)) · R_ord_i),    (1)

with 1 ≤ i ≤ n, i ∈ N;  n ≥ 1, n ∈ N;  ord_1 = 0, ord_2 = 1, ..., ord_n = n − 1, ord_i ∈ N ∪ {0};  R_i ≥ 1, R_i ∈ N;  RTC_i ≥ 1, RTC_i ∈ N

where Qm is the estimated mood for a submitted query. ord_i expresses the order of the user's clicks on results; ord_i starts from zero (i.e. ord_1 = 0 for the user's first click on a result, ord_2 = 1 for the user's second click, ..., ord_n = n − 1 for the user's last click on a result). R_ord_i denotes the rank of the result selected at ord_i (e.g. in case the user's first click was on the 2nd result of the page, ord_1 = 0 and R_0 = 2), and RTC_ord_i is a number denoting how many times the user has clicked on the result up to the current ord_i (e.g. in case the user first clicked on the first result, then on the third, and after that clicked on the first result again, ord_3 = 2, R_2 = 1, RTC_2 = 2). Through the above formula, each query found in the search engine usage log can be assigned a mood score, by estimating, through the user actions that followed the provision of search results, the mood that was associated with that specific query. This mood score is eventually used as a further contextual factor within our proposed method for context-aware query formulation.
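The quantities entering relation (1) can be derived from a click sequence as in the following sketch; representing a click sequence as a plain list of clicked result ranks is our own simplification, not part of the deliverable's logging format.

# Sketch: deriving ord_i, R_ord_i and RTC_ord_i from a sequence of result clicks.
def click_features(clicked_ranks):
    features, seen = [], {}
    for ord_i, rank in enumerate(clicked_ranks):   # ord_1 = 0 for the first click, ...
        seen[rank] = seen.get(rank, 0) + 1         # RTC: clicks on this result so far
        features.append({"ord": ord_i, "R": rank, "RTC": seen[rank]})
    return features

# Example from the text: first result, then the third, then the first result again.
for f in click_features([1, 3, 1]):
    print(f)
# -> {'ord': 0, 'R': 1, 'RTC': 1}
#    {'ord': 1, 'R': 3, 'RTC': 1}
#    {'ord': 2, 'R': 1, 'RTC': 2}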

4.4.3 Case studies

In this section, three case studies are presented, so as to demonstrate how mood is estimated by the proposed method in practice and, moreover, to demonstrate the capability of the proposed method to assign a mood score to search session queries.

A. Case study 1: Early and late selection of information
User 1: User 1 submits a query to the search engine and clicks on the third result.
User 2: User 2 submits a query to the search engine and clicks on the twelfth result.
User 3: User 3 submits a query to the search engine and clicks on the twenty-ninth result.
Users 1, 2, and 3 (Table 7) submitted a query to the search engine. Each one of them selected only one result. User 1 (predicted mood 0.23) clicks early on the third result of the first result page (we consider that each result page consists of 10 results). However, user 2 (predicted mood 0.08) has to go to the next page of results before she/he finds a desired result and clicks on it. Even worse, user 3 (predicted mood 0.03) has to navigate to the next page twice in order to find an interesting result.

Table 7: Case study 1

         USER 1   USER 2   USER 3
order    0        0        0
R        3        12       29
RTC      1        1        1
Qm       0.23     0.08     0.03

B. Case study 2: Moving upwards and downwards during result selection
User 1: User 1 clicks on the first result just after submitting a query to the search engine. Then, she/he returns to the result page and clicks on the second result.
User 2: User 2 selects the second result after submitting a query. Then, she/he returns to the result page and clicks on the first result.
User 3: User 3 submits a query and first clicks on the second result. After returning to the result page, she/he clicks on the third result.
User 4: User 4 inputs another query and, after pressing the submit button, clicks on the second result. Next, she/he returns to the result page, scrolls down to the last result of the page and clicks on the ninth result.
User 5: User 5 first clicks on the third result after submitting a query to the search engine. Next, she/he returns to the result page and selects the second result.
User 6: User 6 inputs a query, presses the submit button, scrolls down to the last result of the page and clicks on the ninth result. Finally, she/he returns to the result page, scrolls up again and clicks on the second result.
User 1 (predicted mood 0.43) recognized helpful information early on by clicking on the first and second result. User 2 (predicted mood 0.32) also found attractive information at the

Page 41

D6.2 Version 1.0


first and second result. However, the fact that she/he first clicked on the second result may be indicative of his/her perception of the first result being less relevant than the second. On the other hand, user 3 (predicted mood 0.23) also clicked first on the second result, but then moved downwards by clicking on the third result. User 4 (predicted mood 0.2) seems to also have found interesting information early on by clicking on the second result, but had to scroll down in order to find another attractive piece of information at the ninth result. The fact that user 5 (predicted mood 0.19) first clicked on the third result shows that she/he first found interesting information a little later than users 1, 2, 3, and 4 (Table 8), although user 5 next moved upwards by clicking on the second result. Finally, user 6 (predicted mood 0.11) seems to have been disappointed by the first impression of the results and had to scroll down in order to find a potentially interesting result. The fact that she/he scrolled up again and clicked on the second result gives the feeling that this was not her/his most preferable choice, but was probably better than nothing.

Table 8: Case study 2

         USER 1        USER 2        USER 3
Order    0      1      0      1      0      1
R        1      2      2      1      2      3
RTC      1      1      1      1      1      1
Qm       0.43          0.32          0.23

         USER 4        USER 5        USER 6
Order    0      1      0      1      0      1
R        2      9      3      2      9      2
RTC      1      1      1      1      1      1
Qm       0.2           0.19          0.11

C. Case study 3: Returning to a result
User 1: User 1 selects the first result after submitting a query to the search engine. Then, she/he returns to the result page and clicks on the third result. User 1 returns to the result page another time, scrolls down and clicks on the ninth result.
User 2: User 2 inputs a query as well and presses the submit button. Then she/he clicks on the first result, returns to the result page, scrolls down and clicks on the ninth result. Finally, user 2 returns to the result page another time, scrolls up again and clicks on the first result for a second time.
User 3: User 3 submits a query and scrolls down the result page with the purpose of finding an interesting result. She/he clicks on the ninth result and, after returning to the result page, scrolls up and clicks on the first result. Finally, user 3 returns to the result page another time, scrolls down again and clicks on the ninth result for a second time.
User 1 (predicted mood 0.15) gives the impression that she/he found the desired information without too much effort, although she/he also scrolled down to the ninth result. The fact that user 2 (predicted mood 0.18) returned to the first result after scrolling down and clicking on the ninth result increases the chance that user 2 perceived the first result as highly relevant. Thus, user 2 provides a stronger impression than user 1 of having identified interesting information early on, before scrolling down to the ninth result. On the other hand, the fact that user 3 (predicted mood 0.07) returned to the ninth result after scrolling up and clicking on the first result also increases the possibility of user 3 perceiving the ninth result as highly relevant, but does not compensate for finding relevant information later than user 2 did (Table 9).

Table 9: Case study 3

         USER 1           USER 2           USER 3
Order    0     1     2    0     1     2    0     1     2
R        1     3     9    1     9     1    9     1     9
RTC      1     1     1    1     1     2    1     1     2
Qm       0.15             0.18             0.07

4.4.4 Discussion

This work is a first approach to modeling user affect during search engine use. As such, it aims at providing useful intuition. Integrating affect recognition into search engines creates great potential for transforming the user experience through a new generation of affect-sensitive applications. Importantly, many search engine related activities, such as search advertisement and search engine marketing (SEM), could derive enormous benefits. For instance, if a search engine could estimate that a user is not in a mood for shopping, search ads could be avoided at that point. On the other hand, the effectiveness of search ads could be significantly increased if they are displayed when a user is in an adequate affective state to browse products and services.

With the purpose of demonstrating the model's ability to express the assumptions that have been made, three case studies were created. Case studies 1 and 2 represent assumption A. The scenario of case study 1, involving three users, has been set up in order to exhibit how the model's estimations about a user's mood vary according to how early that user identified interesting information. Clearly, the model manages to reflect the assumed difference in mood between a user finding interesting information early on and a user who had difficulty in identifying a potentially helpful result. Case study 2 has been designed with the purpose of showing the capacity of the model to take into account the order in which results were selected. The scenario involves three pairs of users (i.e. six users). The users of each pair selected the same two results, but in a different order. It becomes apparent that the model effectively mirrors the assumed difference in mood between the users, taking into account the ranking of each selected result along with the order in which each result was selected. Case study 3, involving three users and representing assumption C, has been planned in order to emphasize the model's ability to reflect the notion that a user clicking on a result more than once provides increased evidence of his/her positive perception of that result. The model reflects the assumed influence of such an event on the user's mood, taking also into account the ranking of each selected result along with the order in which each result was clicked. Although no separate case study has been designed for assumption B, it is also obvious from the aforementioned case studies that the model manages to express the assumption that the number of selected results is inversely proportional to the user's mood.

In the near future, the validity of the assumptions that have been made in this work could be examined through an appropriate experimental protocol involving real users. Further assumptions about possibly affect-inducing events happening during a search engine experience could also be integrated into the proposed model. For instance, the time a user spent at a certain result page, the total time of user navigation through the result pages, and the user's

For instance, the time a user spent on a certain result page, the total time of user navigation through result pages, and the user's scrolling direction and frequency are events that could provide a basis for forming further assumptions about the user's affective state. These assumptions could likewise be validated against real user data.

Further important factors that may significantly influence the user's mood are the affective effects of the emotional tone of text and other content-related information (e.g. images) included in the results presented by a search engine. Presumably, the user's personal and cultural background is essential when estimating the actual influence of the aforementioned factors on the user's affective state. For instance, according to [R35], although complex search situations are frequently associated with uncertainty, it is the user's perception of complexity that triggers her/his negative feelings, rather than the actual complexity of a task. Besides, there is research evidence [R36] suggesting that the user's emotional coping abilities may be more important to the success or failure of a search endeavor than the user's cognitive skills. Evidently [R37], [R38], the emotional features of the user's search experience are not only related to finding desired information. Accordingly, it has been suggested [R39] that during the development and evaluation stages of a search engine interface all aspects of the information-seeking experience should be taken into account. To the best of our knowledge, however, this work is the first step in the direction of providing a computational framework for affect recognition in a search engine context. Our hope is that the affective computing community will soon take action and contribute effectively towards this purpose.

4.5 Suggesting queries to the user

When the user logs into the system, her/his session is first assigned to a time period (Early morning, Morning, etc.) corresponding to the time of logging in. Thereafter, the system tries to identify global and personal session clusters of the specific time period that share similar spatiotemporal characteristics with the user's session. Consequently, there are four cases for the specific user time period:
1) Both global and personal clusters have been identified
2) Only personal clusters have been identified
3) Only global clusters have been identified
4) No clusters at all have been identified for that time period
For each of these cases a different process is performed in order to make a guess about the user's current search intent:
Case 1: User clusters are merged with global clusters, which have all been formed based on the similarity of sessions, as described in step C. This produces a session list in which personal sessions have a higher rank than global ones. Without employing the method for mood estimation, the queries of these sessions that have a higher occurrence frequency can be provided as suggestions to the user. However, by incorporating our mood estimation method, the query of each session that was accompanied by a higher "mood score" is kept and added to the final suggestions list, at a rank corresponding to that of its session. At this point, two further processes are employed in cascade (P1 and P2), as described below, so as to augment the final suggestions list with further suggestions deriving from the "personal" sessions set:
- P1: If the resulting list holds so far fewer than four suggestions derived from the personal cluster (five being the maximum number of suggestions returned to the user), the system tries to identify up to three further sessions of the highest mood which have not been added to the personal sessions cluster.
From each of these sessions, the query term with the highest mood score is added to the query suggestions list, at a position between the so far identified personal and global suggestions. The relative rank of these queries in the final suggestions list is defined proportionally to their relative mood score. In essence, this process (P1) augments the final suggestions list with queries deriving from sessions that, although not similar (in terms of global query co-occurrence) to the rest of the personal sessions cluster, led to a relatively high mood score when they were submitted by the user to the search engine.
- P2: If the resulting list still holds fewer than five suggestions deriving from the personal sessions set, the system tries to identify personal sessions that contain as a query one of the suggestions derived from the global sessions set. Once such personal sessions are found, the query with the highest mood score of each one is added to the final suggestions list, again at a position between the so far identified personal and global suggestions, with a relative rank proportional to the relative mood score of the newly added queries. In essence, this process (P2) boosts global query suggestions that exist in past sessions of the specific user, or queries that co-existed with global query suggestions within the same user sessions and led to a higher degree of user satisfaction. For queries existing in more than one session, the query is added to the suggestions list only once, at the highest rank of occurrence.
The final result is a list that holds "personal" suggestions at its beginning, followed by "global" suggestions if needed (i.e. when the personal suggestions are fewer than the total number of suggestions that should be returned, which in our case is five). All suggestions derive from sessions that share similar spatiotemporal context characteristics with the user's current session.
Case 2: The name of the user's highest cluster in the hierarchy is returned. The result is a session list that holds only "personal" query suggestions. These sessions share similar spatiotemporal context characteristics with the user's current session. The obtained sessions lead to query recommendations similarly to Case 1, with the difference that the final suggestions list consists of queries derived only from the personal log of the user. The final suggestions list is in this case also augmented through the application of the P1 and P2 processes described above.
Case 3: The name of the highest global cluster in the hierarchy is returned, providing global sessions that share similar spatiotemporal context characteristics with the user's current session. Because no personal sessions were found that share similar (in terms of global co-occurrence) queries, no personal sessions were added to the personal sessions cluster, so there are no "personal" query suggestions in the present suggestions list. In practice, this case may appear, for instance, when only a few (e.g. 2 or 3) sessions exist for a newly registered user and these sessions refer to query concepts that are completely different from one another. However, by following the aforementioned result list augmentation processes P1 and P2, the initial (global) query suggestions list is eventually augmented with queries deriving from the personal user sessions. The result is eventually a query recommendations list that again holds both "personal" and "global" query suggestions.
Case 4: The name of the highest global cluster in the hierarchy across all time periods is returned.
The result is a session list that may hold both "personal" and "global" query suggestions that share similar spatiotemporal context characteristics with the user's current session, although not necessarily referring to the time period of the user's current session. The obtained sessions lead to query recommendations similarly to Case 1.

It should be underlined at this point that, since the query recommendations obtained from the above processes are ranked based on the mood estimation method described in section 4.4, from each session of the returned cluster the query with the highest mood score is considered as a suggestion. Thus, query terms that resulted in a mood score indicating more positive affect are promoted, and finally the query of each session that was associated with the most positive affect is suggested to the user.
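To make the above selection logic more concrete, the following is a minimal, hypothetical sketch: the data structures and example mood scores are assumptions made only for illustration, the limit of five suggestions follows the description above, and the P1 and P2 augmentation steps are omitted. From each session the query with the highest mood score is kept, and personal sessions are ranked ahead of global ones.

```python
# Illustrative sketch only (hypothetical data structures, not the deployed code):
# assemble up to five query suggestions, preferring personal-cluster sessions over
# global ones and keeping, from each session, the query with the highest mood score.
from typing import List, Tuple

Session = List[Tuple[str, float]]   # (query, mood score) pairs of one session
MAX_SUGGESTIONS = 5                 # five suggestions are returned to the user

def best_query(session: Session) -> str:
    """Query of a session that carries the highest estimated mood score."""
    return max(session, key=lambda pair: pair[1])[0]

def suggest(personal: List[Session], global_: List[Session]) -> List[str]:
    suggestions: List[str] = []
    # Personal sessions come first; global sessions fill the remaining slots.
    for session in personal + global_:
        query = best_query(session)
        if query not in suggestions:            # each query is added only once
            suggestions.append(query)
        if len(suggestions) == MAX_SUGGESTIONS:
            break
    return suggestions

# Example with made-up sessions and mood scores
personal = [[("jeans", 0.18), ("denim", 0.07)], [("sneakers", 0.15)]]
global_ = [[("t-shirt", 0.12)]]
print(suggest(personal, global_))               # ['jeans', 'sneakers', 't-shirt']
```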

4.6 Application of the proposed context-aware automatic query formulation method in practice

To obtain a first qualitative insight into its effectiveness in real-time automatic query formulation, the feasibility of integrating the developed algorithm within a real search engine interface used by real users was evaluated. Therefore, a further CUbRIK H-Demo, not originally planned in D9.2, was developed as a CUbRIK Pipeline relying on the fashion dataset. It presents a Google-like search engine interface. The H-Demo will be embedded in the final version of the Fashion V-App. This interface is shown in Figure 15 and Figure 16.

Figure 15. The developed Google-like search engine interface of the fashion dataset, enhanced with the developed automatic query suggestion method

Figure 15 above shows the interface that is presented to the user (in this case, user113) right after logging in. In the "Maybe you would like to search for:" section of the interface (to the right of the "Search" button), the query suggestions formulated by our method based on the user's context are shown. These suggestions, provided right after logging in and prior to the submission of any query to the search engine, were formulated based on the user's context, in terms of the spatial and temporal criteria described in section 4.3.2 and the mood estimation method described in section 4.4. Figure 16 below shows the interface as presented to the same user after the selection of the suggestion "jeans" from the suggestions list.

Figure 16. The developed Google-like search engine interface, after the user's selection of the suggested query (jeans)

Forty-seven volunteer users registered with the search engine interface and used the system for a period of three months. Through the developed interface, the automatic query formulation method was found to work efficiently in practice, automatically formulating and providing query suggestions typically in less than one second. The interface also proved to be a valuable tool for refining the proposed method during its development, helping to increase its effectiveness and to eliminate bugs in the code of the developed system. By using the interface and examining the query suggestion capabilities of the proposed method in practice, both our development team and the volunteer users provided feedback on issues that were spotted regarding the method and the functionality of the corresponding software module. In general, the feedback provided by the volunteer users regarding the effectiveness of our context-aware automatic query formulation method was positive, especially highlighting the usefulness of receiving query suggestions as soon as they logged in, prior to submitting any search query.

4.7 Employing the AOL query log 2006 to test the method's accuracy

4.7.1 Data collection methodology

Research relating to search logs has been hampered by the limited availability of appropriate click datasets. In the present work, the AOL query log 2006 [R40] is taken as the experimental corpus, as it is publicly available and sufficiently large to guarantee statistical significance (other public query logs are either access-restricted or small).

Table 10: Sample AOL search log records

The log contains historical records, each of which registers the details of a web search conducted by a user. Table 10 shows some sample records extracted from the AOL search engine's log. Because the entire log set is too large, we randomly sample 7,000 distinct queries; these queries and their click-through logs are extracted as our dataset. Since the sample is drawn randomly, the dataset should reflect the characteristics of the entire log. A similar approach has been followed by previous research works [R41], [R42], [R43], [R44], [R45], [R46].
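As a rough illustration of this sampling step (not the actual extraction code), the sketch below assumes the standard tab-separated AOL 2006 log columns AnonID, Query, QueryTime, ItemRank and ClickURL; the file name and random seed are shown only as examples.

```python
# Hypothetical sketch: sample 7,000 distinct queries from an AOL 2006 log file and
# keep all click-through records that belong to the sampled queries.
import csv
import random

def sample_queries(log_path: str, k: int = 7000, seed: int = 42):
    with open(log_path, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f, delimiter="\t"))      # columns incl. "Query"
    distinct = sorted({row["Query"] for row in rows if row["Query"]})
    random.seed(seed)
    chosen = set(random.sample(distinct, k))
    return [row for row in rows if row["Query"] in chosen]

# records = sample_queries("user-ct-test-collection-01.txt")  # example file name
```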

4.7.2 Procedure

To obtain a first evaluation of the algorithm developed for predicting the informational needs of the user, and thus for automatically formulating an appropriate query, the AOL query log 2006 has been employed as described above. We have also used the WordNet::Similarity module to implement semantic similarity and relatedness measures. Thus, a similarity measurement can be obtained between each query submitted by a user (to the AOL search engine) and the queries that our algorithm would have suggested to that user. For each of the 7,000 queries of the randomly created dataset, the algorithm receives as input only data that had been submitted to the search engine before the timestamp of that query. The method developed for suggesting queries to the user was evaluated both with and without integrating the method for estimating user mood.
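As a rough illustration of this similarity measurement (not the project's evaluation code), the sketch below uses NLTK's WordNet interface with Wu-Palmer similarity as a stand-in for the measures offered by the WordNet::Similarity module.

```python
# Sketch: WordNet-based similarity between a suggested query and the query the user
# actually submitted (best Wu-Palmer score over all term/sense pairs).
# Requires the NLTK WordNet corpus: nltk.download('wordnet')
from itertools import product
from nltk.corpus import wordnet as wn

def query_similarity(suggested: str, actual: str) -> float:
    best = 0.0
    for t1, t2 in product(suggested.lower().split(), actual.lower().split()):
        for s1, s2 in product(wn.synsets(t1), wn.synsets(t2)):
            sim = s1.wup_similarity(s2)      # None for incompatible sense pairs
            if sim is not None and sim > best:
                best = sim
    return best

# Example: compare a suggestion against the query the user actually typed
print(query_similarity("denim trousers", "jeans"))
```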

4.7.3 Results

Results without integrating the user mood estimation method in the query suggestion process

Mean similarity was 54.8% with a standard deviation of 3.2. There is some likelihood, called the confidence level, that the true population mean falls within a particular range, called the confidence interval, around the mean similarity value obtained from our sample. A confidence level of 90% gives a confidence interval of 4.89, which means that the range for the true population mean is between 51.6% and 58%. This is the observed variability for a sample that is approximately Gaussian.

Results with integration of the user mood estimation method in the query suggestion process

Mean similarity was 57.9% with a standard deviation of 3.1. A confidence level of 90% gives a confidence interval of 4.89, which means that the range for the true population mean is between 54.8% and 61%. This is the observed variability for a sample that is approximately Gaussian.

Advantage of integrating the proposed mood estimation method in query suggestion

From the results of the present study it first of all becomes clear that, in the given dataset, the incorporation of the mood estimation method for ranking query suggestions increased the WordNet-based similarity between the query suggestions and the actual queries submitted by the users. These results were in line with our expectation that taking mood into account within the query suggestion process would improve the quality of the provided suggestions. They demonstrate that, by promoting recommendations whose queries were associated in the user logs with more positive affect, the mood estimation method provides suggestions that are more likely to match the user's actual query intent. Thus, the present results support our hypothesis that taking mood into account within a query suggestion procedure can lead to a further improved personalized web search system.
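For reference, the snippet below shows how a normal-approximation confidence interval of this kind can be computed from per-query similarity scores; it is a generic illustration with made-up values, not the evaluation code, and the sample size behind the reported intervals is not restated here.

```python
# Generic illustration: mean and 90% normal-approximation confidence interval
# computed from a list of per-query similarity scores (made-up example values).
import math
import statistics

def mean_with_ci(scores, z=1.645):      # z = 1.645 for a 90% confidence level
    m = statistics.mean(scores)
    s = statistics.stdev(scores)
    half_width = z * s / math.sqrt(len(scores))
    return m, (m - half_width, m + half_width)

print(mean_with_ci([0.52, 0.58, 0.55, 0.61, 0.49, 0.57]))
```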

4.8 Discussion

In the present chapter, the method developed in the context of T6.4 during the reporting period toward context-aware automatic query formulation was presented. In the following, the main points of the proposed methodology are first discussed, to further highlight the method's contribution. Thereafter, our plans for future work in the context of T6.4, toward developing further context-aware automatic query suggestion methods, are described.

4.8.1 Contribution

The problem of query suggestion for improving web search is typically addressed by taking into account the queries that the user has submitted so far during her/his current session. These past queries are typically referred to as the "context" of the web search session. Based on the queries issued so far, query suggestion systems typically try to identify within their logs sessions that are similar to the current user's session, and provide as recommendations queries that were issued after queries similar to the ones issued by the user so far in the present session. Thus, the notion of "context" in query recommendation systems typically refers only to the past user queries. Nevertheless, by considering only the previously issued queries of the session as the user context, significant aspects of the session, such as its spatiotemporal characteristics, are not taken into account. Moreover, the incorporation of spatiotemporal characteristics within a query suggestion system enables the provision of suggestions as soon as the user logs in, prior to the submission of any query to the search engine.

Following this line, the present study takes a step forward and develops a query recommendation mechanism based on the spatial and temporal characteristics of user sessions. As a result, the proposed method is capable of providing, as soon as the user logs in, personalized query suggestions based on the spatiotemporal characteristics of the user context. The temporal criteria used herein are the "Time of Day" (translated into the respective time period), "Month", "Name of Day" and "Number of Day in the Month", all obtained from the timestamp of the session. The spatial criteria are the "City" and "Country" of the user, obtained from her/his IP during the session.
By identifying clusters of similar sessions (either global sessions or personal ones, belonging to the specific, currently logged-in user) with respect to the aforementioned spatiotemporal parameters, the proposed method is capable of providing query suggestions derived from those clusters that share similar spatiotemporal characteristics with the user's current session.

Furthermore, a novel mood estimation method has been developed within T6.4, in order to incorporate affective aspects in the proposed query suggestion methodology. In particular, by taking into account the actions of the users (as recorded in the server logs) that followed the provision of each query during their search sessions, a mood score is estimated. The calculation of this score follows assumptions drawn by considering how the results obtained through a web search session, and the subsequent user actions they lead to, could induce positive or negative affect in the user. In this respect, user mood in the present study is essentially related to the ease of achieving goals through the utilized web search system. Thus, through the proposed methodology, each query of the server log is assigned a mood score, indicating whether the submission of the specific query to the search engine resulted in positive or negative user mood, based on our assumptions regarding the "action-based" indicators of user mood within a web search session.

Through experimental evaluation using the AOL dataset, it was found that our proposed query suggestion methodology, without using the mood estimation method, was capable of providing suggestions that had a WordNet-based similarity to the actual user input of 54.8%. By also incorporating our proposed mood estimation method in the query suggestion process, the WordNet-based similarity between the suggestions and the actual user input improved by 3.1 percentage points, reaching 57.9%. These results first of all indicate that our proposed method, even without the incorporation of mood characteristics, is capable of providing, on the basis of the spatiotemporal session context, suggestions that are relevant to the query the user would actually have submitted, even without knowledge of the "search context" in terms of previous queries submitted to the search engine during her/his current session. Moreover, the incorporation of our mood estimation method within the query suggestion process was found to improve the WordNet-based similarity between suggestions and actual user queries, underlining the method's potential toward more effective future personalized query suggestion systems.

The results obtained so far highlight the potential of our proposed method for query suggestion, which takes into account only the spatial, temporal and mood characteristics of the user context. Being based only on spatiotemporal context parameters, the developed method can provide suggestions to the user as soon as s/he logs in to the system, prior to the submission of any query to the search engine, thus acting as a solution to the problem of suggesting queries prior to user query submission during a session.
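To illustrate the temporal context features listed above, the sketch below derives them from a session timestamp; the time-period boundaries are assumptions made only for illustration, and the spatial criteria (City and Country) would in practice be resolved from the session IP with a geolocation database such as MaxMind GeoLite [R18], which is omitted here.

```python
# Illustrative sketch: temporal context features of a session, derived from its
# timestamp (period boundaries are assumed here, not taken from the deliverable).
from datetime import datetime

def temporal_context(ts: datetime) -> dict:
    hour = ts.hour
    if hour < 6:
        period = "Early morning"
    elif hour < 12:
        period = "Morning"
    elif hour < 18:
        period = "Afternoon"
    else:
        period = "Evening"
    return {
        "time_period": period,          # "Time of Day" mapped to a time period
        "month": ts.strftime("%B"),     # "Month"
        "day_name": ts.strftime("%A"),  # "Name of Day"
        "day_of_month": ts.day,         # "Number of Day in the Month"
    }

print(temporal_context(datetime(2013, 8, 31, 9, 15)))
```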

4.8.2 Future Work

In the rest of the T6.4 duration, research is planned to focus on further query suggestion methodologies, toward advancing current state-of-the-art methods that take the user context into account only in terms of the queries previously submitted to the search engine during the current session. In particular, effort will be put into augmenting current state-of-the-art query suggestion approaches (such as adjacency, co-occurrence, etc.) with spatial and temporal session features, in collaboration with WP4. The goal of the future work will thus be the development of a method capable of providing spatiotemporal context-based recommendations that evolve as the user submits queries to the search engine after logging in. During the session, the foreseen method is planned to take into account the past queries of the user, along with the spatiotemporal characteristics of the user context, toward improving the effectiveness of current state-of-the-art query suggestion approaches.

5. REFERENCES

[R1] F. Qiu and J. Cho, "Automatic identification of user interest for personalized search," Proc. Fifteenth Int'l Conf. on World Wide Web (WWW '06), ACM, New York, NY, USA, pp. 727-736, 2006.
[R2] Nielsen NetRatings search engine ratings report. http://searchenginewatch.com/reports/article.php/2156461, 2003.
[R3] Kazunari Sugiyama, Kenji Hatano, and Masatoshi Yoshikawa. 2004. Adaptive web search based on user profile constructed without any effort from users. In Proceedings of the 13th International Conference on World Wide Web (WWW '04). ACM, New York, NY, USA, 675-684. DOI=10.1145/988672.988764
[R4] Ruirui Li, Ben Kao, Bin Bi, Reynold Cheng, and Eric Lo. 2012. DQR: a probabilistic approach to diversified query recommendation. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management (CIKM '12). ACM, New York, NY, USA, 16-25. DOI=10.1145/2396761.2396768
[R5] Wen, J.R., Nie, J.Y., & Zhang, H.J. (2001). Clustering user queries of a search engine. In Proceedings of the 10th International Conference on World Wide Web (pp. 162-168). Hong Kong, 1-5 May.
[R6] S. Beitzel, E. Jensen, O. Frieder, D. Lewis, A. Chowdhury, and A. Kolcz. Improving automatic query classification via semi-supervised learning. In Proc. of the 5th IEEE International Conference on Data Mining (ICDM-05), 2005.
[R7] B.J. Jansen, A. Spink, and T. Saracevic. Real life, real users, and real needs: A study and analysis of user queries on the Web. Information Processing and Management, 36(2):207-227, 2000.
[R8] B.M. Fonseca, P.B. Golgher, E.S. de Moura, and N. Ziviani, "Using Association Rules to Discover Search Engines Related Queries," in Proceedings of the First Conference on Latin American Web Congress, pp. 66-71, 2003.
[R9] Z. Zhang and O. Nasraoui, "Mining search engine query logs for query recommendation," in Proceedings of the 15th International Conference on World Wide Web, pp. 1039-1040, 2006.
[R10] Chien-Kang Huang, Lee-Feng Chien, and Yen-Jen Oyang. 2003. Relevant term suggestion in interactive web search based on contextual information in query session logs. Journal of the American Society for Information Science and Technology, 54(7), 638-649.
[R11] C. Silverstein, M. Henzinger, H. Marais, and M. Moricz. Analysis of a very large AltaVista query log. Technical Report SRC 1998-014, Digital Systems Research Center, 1998.
[R12] Huanhuan Cao, Daxin Jiang, Jian Pei, Qi He, Zhen Liao, Enhong Chen, and Hang Li. 2008. Context-aware query suggestion by mining click-through and session data. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '08). ACM, New York, NY, USA, 875-883. DOI=10.1145/1401890.1401995
[R13] Qi He, Daxin Jiang, Zhen Liao, Steven C.H. Hoi, Kuiyu Chang, Ee-Peng Lim, and Hang Li. 2009. Web Query Recommendation via Sequential Query Prediction. In Proceedings of the 2009 IEEE International Conference on Data Engineering (ICDE '09). IEEE Computer Society, Washington, DC, USA, 1443-1454. DOI=10.1109/ICDE.2009.71
[R14] Doug Beeferman and Adam Berger. 2000. Agglomerative clustering of a search engine query log. In Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '00). ACM, New York, NY, USA, 407-416.
[R15] A. Spink and B.J. Jansen. Web search: Public searching of the web. New York: Kluwer, 2004.

[R16] B.J. Jansen, A. Spink, C. Blakely, and S. Koshman. Defining a session on web search engines. Journal of the American Society for Information Science and Technology, 58(6):862-871, 2007.
[R17] Kevin Hsin-Yih Lin, Chieh-Jen Wang, Hsin-Hsi Chen. Predicting next search actions with search engine query logs. In Proceedings of the 2011 IEEE/WIC/ACM International Conference on Web Intelligence (WI 2011), Lyon, France, August 22-27, 2011.
[R18] MaxMind GeoLite City and Country IP geolocation databases. http://dev.maxmind.com/geoip/legacy/geolite/
[R19] N.H. Frijda, "Varieties of affect: Emotions and episodes, moods, and sentiments," in The Nature of Emotion, P. Ekman and R.J. Davidson, Eds. New York: Oxford University Press, 1994, pp. 59-67.
[R20] R. Gockley, R. Simmons, and J. Forlizzi, "Modeling affect in socially interactive robots," Proc. Fifteenth IEEE Int'l Symp. on Robot and Human Communication, Hatfield, UK, September 6-8, 2006.
[R21] R. Neumann, B. Seibt, and F. Strack, "The influence of mood on the intensity of emotional responses: Disentangling feeling and knowing," Cognition and Emotion, vol. 15, pp. 725-747, 2001.
[R22] C.G. Wan, J.Y. Zhao, and Y.Y. Zhang, "An emotion generation model for interactive virtual robots," Proc. Int'l Symp. on Computational Intelligence and Design, pp. 238-241, 2008.
[R23] A. Ortony and G.L. Clore, "Emotions, moods, and conscious awareness," Cognition and Emotion, vol. 3, pp. 125-137, 1989.
[R24] A. Ortony, G.L. Clore, and A. Collins, The Cognitive Structure of Emotions. Cambridge, UK: Cambridge University Press, 1988.
[R25] C. Conati, "Probabilistic assessment of user's emotions during the interaction with educational games," J. Applied Artificial Intelligence, special issue on merging cognition and affect in HCI, vol. 16, pp. 555-575, 2002.
[R26] C. Tenopir, P. Wang, Y. Zhang, B. Simmons, and R. Pollard, "Academic users' interactions with ScienceDirect in search tasks: Affective and cognitive behaviours," Inf. Process. Manage., vol. 44, no. 1, pp. 105-121, 2008.
[R27] J. Gwizdka and I. Spence, "Implicit measures of lostness and success in web navigation," Interacting with Computers, vol. 19, no. 3, pp. 357-369, 2007.
[R28] R. Erber, D.M. Wegner, and N. Therriault, "On being cool and collected: Mood regulation in anticipation of social interaction," J. Personality and Social Psychology, vol. 70, no. 4, pp. 757-766, 1996.
[R29] B. Reeves and C. Nass, The Media Equation: How People Treat Computers, Television, and New Media Like Real People and Places. Cambridge Univ. Press, 1996.
[R30] C. Nass and Y. Moon, "Machines and mindlessness: Social responses to computers," J. Social Issues, vol. 56, no. 1, pp. 81-103, 2000.
[R31] J. Piesk and G. Trogemann, "Animated interactive fiction: Storytelling by a conversational virtual actor," Proc. Int'l Conf. on Virtual Systems and MultiMedia (VSMM '97), Geneva, Switzerland, September 1997.
[R32] J. Olveres, M. Billinghurst, J. Savage, and A. Holden, "Intelligent, expressive avatars," Proc. First Workshop on Embodied Conversational Characters (WECC '98), pp. 47-55, 1998.
[R33] M. Qing-mei and W. Wei-guo, "Artificial emotional model based on finite state machine," J. Central South University of Technology, vol. 15, no. 5, pp. 694-699, 2008.
[R34] I.M. Mills, B.N. Taylor, and A.J. Thor, "Definitions of the units radian, neper, bel and decibel," Metrologia, vol. 38, pp. 353-361, 2001.
[R35] C.C. Kuhlthau, "The role of experience in the information search process of an early career information worker: Perceptions of uncertainty, complexity, construction, and sources," J. American Society for Information Science, vol. 50, no. 5, pp. 399-412, 1999.
[R36] D. Nahl, "Affective and cognitive information behavior: Interaction effects in internet use," Proc. American Society for Information Science and Technology, 2005.
[R37] J. Kracker, "Research anxiety and students' perceptions of research: An experiment. Part I. Effect of teaching Kuhlthau's ISP model," J. American Society for Information Science and Technology, vol. 53, no. 4, pp. 282-294, 2002.
[R38] J. Kracker and P. Wang, "Research anxiety and students' perceptions of research: An experiment. Part II. Content analysis of their writings on two experiences," J. American Society for Information Science and Technology, vol. 53, no. 4, pp. 295-307, 2002.

[R39] J. Kalbach, "I'm feeling lucky. The role of emotions in seeking information on the Web," J. American Society for Information Science and Technology, vol. 57, no. 6, pp. 813-818, 2006.
[R40] G. Pass, A. Chowdhury, and C. Torgeson. A picture of search. In 1st InfoScale, 2006.
[R41] Z. Dou, R. Song, and J. Wen, "A large-scale evaluation and analysis of personalized search strategies," in Proceedings of the 16th International Conference on World Wide Web, pp. 581-590, 2007.
[R42] C. Silverstein, H. Marais, M. Henzinger, and M. Moricz. Analysis of a very large web search engine query log. SIGIR Forum, 33(1):6-12, 1999.
[R43] Y. Xie and D.R. O'Hallaron. Locality in search engine queries and its implications for caching. In INFOCOM '02, 2002.
[R44] B.J. Jansen, A. Spink, J. Bateman, and T. Saracevic. Real life information retrieval: a study of user queries on the web. SIGIR Forum, 32(1):5-17, 1998.
[R45] S. Wedig and O. Madani. A large-scale analysis of query logs for assessing personalization opportunities. In Proceedings of KDD '06, pages 742-747, 2006.
[R46] S.M. Beitzel, E.C. Jensen, A. Chowdhury, D. Grossman, and O. Frieder. Hourly analysis of a very large topically categorized web query log. In Proceedings of SIGIR '04, pages 321-328, 2004.
[R47] Alcantara, O.D.A., Jr., A.R.P., de Almeida, H.M., Goncalves, M.A., Middleton, C., Baeza-Yates, R.A.: WCL2R: A benchmark collection for learning to rank research with clickthrough data. JIDM 1(3), 551-566 (2010).
[R48] Bar-Yossef, Z., Gurevich, M.: Mining search engine query logs via suggestion sampling. Proc. VLDB Endow. 1, 54-65 (Aug 2008).
[R49] Cambazoglu, B.B., Zaragoza, H., Chapelle, O., Chen, J., Liao, C., Zheng, Z., Degenhardt, J.: Early exit optimizations for additive machine learned ranking systems. In: WSDM. pp. 411-420 (2010).
[R50] Cao, Z., Qin, T., Liu, T.Y., Tsai, M.F., Li, H.: Learning to rank: from pairwise approach to listwise approach. In: ICML. pp. 129-136 (2007).
[R51] Carbonell, J.G., Goldstein, J.: The use of MMR, diversity-based reranking for reordering documents and producing summaries. In: SIGIR. pp. 335-336 (1998).
[R52] Cha, M., Kwak, H., Rodriguez, P., Ahn, Y.Y., Moon, S.: Analyzing the video popularity characteristics of large-scale user generated content systems. IEEE/ACM Trans. Netw. 17(5), 1357-1370 (2009).
[R53] Chapelle, O., Chang, Y.: Yahoo! learning to rank challenge overview. Journal of Machine Learning Research - Proceedings Track 14, 1-24 (2011).
[R54] Chelaru, S., Altingovde, I.S., Siersdorfer, S.: Analyzing the polarity of opinionated queries. In: Proc. of ECIR'12. pp. 463-467 (2012).
[R55] Cheng, X., Dale, C., Liu, J.: Statistics and social network of YouTube videos. In: Proc. of IEEE IWQoS'08 (2008). Cunningham, S.J., Nichols, D.M.: How people find videos. In: JCDL. pp. 201-210 (2008).
[R56] Dang, V., Croft, W.B.: Feature selection for document ranking using best first search and coordinate ascent. In: Proc. of SIGIR'10 Workshop on Feature Generation and Selection for Information Retrieval (2010).
[R57] Filipova, K., Hall, K.: Improved video categorization from text metadata and user comments. In: Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 835-842 (2011).
[R58] Freund, Y., Iyer, R., Schapire, R.E., Singer, Y.: An efficient boosting algorithm for combining preferences. J. Mach. Learn. Res. 4, 933-969 (2003).
[R59] Friedman, J.H.: Stochastic gradient boosting. Comput. Stat. Data Anal. 38(4), 367-378 (Feb 2002).
[R60] Geng, X., Liu, T.Y., Qin, T., Li, H.: Feature selection for ranking. In: Proc. of SIGIR'07. pp. 407-414 (2007).

[R61] Giannopoulos, G., Weber, I., Jaimes, A., Sellis, T.K.: Diversifying user comments on news articles. In: WISE. pp. 100-113 (2012).
[R62] Grace, J., Gruhl, D., Haas, K., Nagarajan, M., Robson, C., Sahoo, N.: Artist ranking through analysis of online community comments. Tech. rep., IBM Research Technical Report (2008).
[R63] Hsu, C.F., Khabiri, E., Caverlee, J.: Ranking comments on the social web. In: Proc. of CSE'09. pp. 90-97 (2009). Hu, M., Sun, A., Lim, E.P.: Comments-oriented document summarization: understanding documents with readers' feedback. In: Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. pp. 291-298 (2008). Joachims, T.: Training linear SVMs in linear time. In: KDD'06. pp. 217-226 (2006).
[R64] Liu, T.Y.: Learning to rank for information retrieval. Foundations and Trends in Information Retrieval 3(3), 225-331 (2009).
[R65] Macdonald, C., Santos, R.L.T., Ounis, I.: On the usefulness of query features for learning to rank. In: CIKM. pp. 2559-2562 (2012).
[R66] Mishne, G., Glance, N.: Leave a reply: An analysis of weblog comments. In: Workshop on the Weblogging Ecosystem (2006).
[R67] Mohan, A., Chen, Z., Weinberger, K.: Web-search ranking with initialized gradient boosted regression trees. Journal of Machine Learning Research 14, 77-89 (2011).
[R68] Musial, K., Kazienko, P.: Social networks on the internet. World Wide Web 16(1), 31-72 (2013).
[R69] Potthast, M., Stein, B., Loose, F., Becker, S.: Information retrieval in the commentsphere. Transactions on Intelligent Systems and Technology (ACM TIST) 3 (2012).
[R70] San Pedro, J., Yeh, T., Oliver, N.: Leveraging user comments for aesthetic aware image search reranking. In: WWW'12. pp. 439-448 (2012).
[R71] Shmueli, E., Kagian, A., Koren, Y., Lempel, R.: Care to comment?: recommendations for commenting on news stories. In: Proceedings of the 21st World Wide Web Conference. pp. 429-438 (2012).
[R72] Siersdorfer, S., Chelaru, S., Nejdl, W., San Pedro, J.: How useful are your comments?: analyzing and predicting YouTube comments and comment ratings. In: WWW'10. pp. 891-900 (2010).
[R73] Silvestri, F.: Mining query logs: Turning search usage data into knowledge. Foundations and Trends in Information Retrieval 4(1-2), 1-174 (2010).
[R74] Thelwall, M., Sud, P., Vis, F.: Commenting on YouTube videos: From Guatemalan rock to El Big Bang. JASIST 63(3), 616-629 (2012).
[R75] Tsagkias, M., Weerkamp, W., de Rijke, M.: News comments: Exploring, modeling, and online prediction. In: Proceedings of the 32nd European Conference on IR Research. pp. 191-203 (2010).
[R76] Yano, T., Smith, N.A.: What's worthy of comment? Content and comment volume in political blogs. In: Proceedings of the Fourth International Conference on Weblogs and Social Media (2010).
[R77] Yee, W.G., Yates, A., Liu, S., Frieder, O.: Are web user comments useful for search? In: Proc. of SIGIR'09 Workshop on LSDS-IR (2009).
[R78] Esuli, A., Sebastiani, F.: SentiWordNet: A publicly available lexical resource for opinion mining. In: Proc. of LREC'06. pp. 417-422 (2006).
[R79] Metzler, D., Bruce Croft, W.: Linear feature-based models for information retrieval. Inf. Retr. 10(3), 257-274 (2007).

[R80] S. Chelaru, C. Orellana, I.S. Altingovde. Can Social Features Help Learning to Rank YouTube Videos? In Proceedings of the 13th International Conference on Web Information Systems Engineering (WISE 2013).
[R81] S. Chelaru, C. Orellana, I.S. Altingovde. How Useful are Social Features for Learning to Rank YouTube Videos? Accepted at the World Wide Web Journal (Springer).
[R82] http://entitypedia.org
