
R3 CUBRIK INTEGRATED PLATFORM RELEASE Human-enhanced time-aware multimedia search

CUbRIK Project IST-287704 Deliverable D9.4 WP9

Deliverable Version 1.0 – 30 September 2013
Document ref.: cubrik.D94.ENG.WP9.V1.0


Programme Name: IST
Project Number: 287704
Project Title: CUbRIK
Partners:
Coordinator: ENG (IT)
Contractors: UNITN, TUD, QMUL, LUH, POLMI, CERTH, NXT, MICT, ATN, FRH, INN, HOM, CVCE, EIPCM
Document Number: cubrik.D94.ENG.WP9.V1.0
Work-Package: WP9
Deliverable Type: Document
Contractual Date of Delivery: 30 September 2013
Actual Date of Delivery: 30 September 2013
Title of Document: R3 CUbRIK Integrated Platform Release
Author(s): Vincenzo Croce, Marilena Lazzaro (ENG)

Approval of this report:
Summary of this report: CUbRIK Release 3 accompanying document
History:
Keyword List:
Availability: This report is public

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. This work is partially funded by the EU under grant IST-FP7-287704.

R3 CUbRIK Integrated Platform Release

D9.4 Version 1.0


Disclaimer

This document contains confidential information in the form of the CUbRIK project findings, work and products and its use is strictly regulated by the CUbRIK Consortium Agreement and by Contract no. FP7-ICT-287704. Neither the CUbRIK Consortium nor any of its officers, employees or agents shall be responsible or liable in negligence or otherwise howsoever in respect of any inaccuracy or omission herein. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7-ICT-2011-7) under grant agreement n° 287704. The contents of this document are the sole responsibility of the CUbRIK consortium and can in no way be taken to reflect the views of the European Union.



Table of Contents

EXECUTIVE SUMMARY

1 CUBRIK RELEASE 3 DESCRIPTION
1.1 CUBRIK PIPELINES
1.2 H-DEMO CONCEPT
1.3 VERTICAL APPLICATION CONCEPT
1.4 WHAT IS IN R3
1.4.1 Release 3 in a Nutshell
1.5 CUBRIK SVN
1.6 BRINGING IT ALL TOGETHER: INTEGRATION IN CUBRIK
1.6.1 Bundle vs Code
1.6.2 R3 Deployment Environment preparation and set-up
1.6.3 CUbRIK development environment installation and configuration

2 PIPELINES OF R3: H-DEMOS
2.1 CUBRIK H-DEMO NEWS CONTENT HISTORY
2.1.1 H-Demo vs. V-Apps (Fashion and HoE)
2.1.2 H-Demo vs. CUbRIK pipeline(s)
2.1.3 Data set description
2.1.4 Architecture overview
2.1.5 External components
2.1.6 Components integrated in CUbRIK R3
2.1.7 Third party library
2.1.8 How to install the H-Demo bundle
2.2 PEOPLE IDENTIFICATION H-DEMO
2.2.1 H-Demo vs. V-Apps (Fashion and HoE)
2.2.2 H-Demo vs. CUbRIK pipeline(s)
2.2.3 Data set description
2.2.4 Architecture overview
2.2.5 Third party library
2.2.6 Components integrated in CUbRIK R3
2.2.7 How to install the H-Demo bundle
2.2.8 How to exploit the H-Demo source code
2.3 MEDIA ENTITY ANNOTATION RELEVANCE FEEDBACK
2.3.1 H-Demo vs. V-Apps (Fashion and HoE)
2.3.2 H-Demo vs. CUbRIK pipeline(s)
2.3.3 Data set description
2.3.4 Architecture overview
2.3.5 Third party library
2.3.6 Components integrated in CUbRIK R3
2.3.7 How to install the H-Demo bundle
2.3.8 How to exploit the H-Demo source code
2.4 LIKE LINES
2.4.1 H-Demo vs. V-Apps (Fashion and HoE)
2.4.2 H-Demo vs. CUbRIK pipeline(s)
2.4.3 Data set description
2.4.4 Architecture overview
2.4.5 Third party library
2.4.6 Components integrated in CUbRIK R3
2.4.7 How to install the H-Demo bundle
2.4.8 How to exploit the H-Demo source code
2.5 CROSSWORDS
2.5.1 H-Demo vs. V-Apps (Fashion and HoE)
2.5.2 H-Demo vs. CUbRIK pipeline(s)
2.5.3 Data set description
2.5.4 Architecture overview
2.5.5 Crosswords workflow for the Feedback Acquisition and Processing
2.5.6 How to install the H-Demo
2.5.7 How to exploit the H-Demo source code
2.6 ACCESSIBILITY AWARE RELEVANCE FEEDBACK
2.6.1 H-Demo vs. V-Apps (Fashion and HoE)
2.6.2 H-Demo vs. CUbRIK pipeline(s)
2.6.3 Data set description
2.6.4 Architecture overview
2.6.5 Third party library
2.6.6 Components integrated in CUbRIK R3
2.6.7 How to install the H-Demo bundle
2.6.8 How to exploit the H-Demo source code
2.7 CONTEXT-AWARE AUTOMATIC QUERY FORMULATION
2.7.1 H-Demo vs. V-Apps (Fashion and HoE)
2.7.2 H-Demo vs. CUbRIK pipeline(s)
2.7.3 Data set description
2.7.4 Architecture overview
2.7.5 Third party library
2.7.6 Components integrated in CUbRIK R3
2.7.7 How to install the H-Demo bundle
2.7.8 How to exploit the H-Demo source code
2.8 SEARCH ENGINE FEDERATION
2.8.1 H-Demo vs. V-Apps (Fashion and HoE)
2.8.2 H-Demo vs. CUbRIK pipeline(s)
2.8.3 Data set description
2.8.4 Architecture overview
2.8.5 Third party library
2.8.6 How to install

3 PIPELINES OF R3: CUBRIK VERTICAL APPLICATIONS
3.1 SEARCH FOR SME INNOVATION (FASHION) APPLICATION
3.1.1 How to install the Fashion V-App bundle
3.2 HISTORY OF EUROPE APPLICATION
3.2.1 How to install the History of Europe bundle

4 R3 COMPONENTS
4.1 COMPONENTS SPECIFICATION
4.1.1 Accessibility
4.1.2 Content Provider Tool
4.1.3 Connection to the CVCE collection
4.1.4 Copyright aware Crawler
4.1.5 Crowd face position validation
4.1.6 Descriptor Extractor
4.1.7 Entity verification & annotation
4.1.8 Entitypedia integration and data provisioning
4.1.9 Expansion through documents
4.1.10 Expansion through images
4.1.11 Expert crowd entity verification
4.1.12 Face detection
4.1.13 Face identification
4.1.14 GWAP Sketchness (Clothing Item Identification)
4.1.15 Image Extraction from Social Network
4.1.16 Implicit Feedback Filter LikeLines
4.1.17 License checker
4.1.18 Lower & Upper body parts detector
4.1.19 Media harvesting and upload
4.1.20 Object Store
4.1.21 Provenance checker
4.1.22 Query for entities
4.1.23 Social graph creation
4.1.24 Social Graph network analysis
4.1.25 Trend Analyser
4.1.26 Visualization of the social graph

5 CONCLUSION

Figures / Tables

Figure 1: CUbRIK Pipeline structure
Figure 2: CUbRIK H-Demos per Release
Figure 3: CUbRIK Releases Timeline
Figure 4: CUbRIK official SVN repository
Figure 5: Add a new Target Platform
Figure 6: New Target Definition window
Figure 7: Add SMILA v1.1 as Target Platform
Figure 8: Add content window
Figure 9: SMILA 1.1 added as Target Platform
Figure 10: Import SMILA source in the workspace
Figure 11: Import Project Panel
Figure 12: Importing a jar bundle in Eclipse
Figure 13: CUbRIK Bundles running configuration
Figure 14: Activity diagram for CUbRIK pipelines in News Content History H-Demo
Figure 15: NCH Demo basic architecture overview
Figure 16: A web-based user interface enables users to retrieve and browse results
Figure 17: Functional overview of the People Identification H-Demo
Figure 18: SMILA pipelines for Media Entity Annotation
Figure 19: Entity Model for Media Entity Annotation H-Demo and HoE V-App
Figure 20: SMILA console screenshot
Figure 21: Example of Chrome Simple REST Client usage
Figure 22: Human users (top left) interact with the LikeLines player component (center)
Figure 23: LikeLines pipelet (bottom) providing a bridge to the LikeLines server (top)
Figure 24: Crossword application system architecture
Figure 25: Crosswords gameplay page
Figure 26: Architectural block diagram of the H-Demo
Figure 27: Flow chart of (a) the Data Fetcher and (b) the Updating of the Index Jobs
Figure 28: Flow chart of the Extraction of the users' profile workflow
Figure 29: Flow chart of the Accessibility related filtering workflow
Figure 30: Flow chart of user profile updating
Figure 31: SMILA console screenshot
Figure 32: Overall Architecture of the context-aware query formulation H-Demo
Figure 33: Flow chart of the logging Job
Figure 34: Flow chart of the off-line training Job
Figure 35: Flow chart of the context aware query formulation Jobs
Figure 36: SMILA console screenshot
Figure 37: Search Engine Federation architectural overview
Figure 38: BPEL for Pipeline lomUpdateAndQueryTerms
Figure 39: BPEL for Pipeline CalculateQueryTermCount
Figure 40: BPEL for Pipeline DomainPoolInsertionPipeline
Figure 41: BPEL for Workflow Generate Domain Pool, part 1
Figure 42: BPEL for Workflow Generate Domain Pool, part 2
Figure 43: BPEL for Workflow Crawl Domains, part 1
Figure 44: BPEL for Workflow Crawl Domains, part 2
Figure 45: Fashion V-App screenshot
Figure 46: Using the prompt to start the mongo-db for History of Europe
Figure 47: How to start the CrowdSearcher service
Figure 48: The CrowdSearcher service is running and ready
Figure 49: The TEF is running and ready
Figure 50: SMILA jobs Starter
Figure 51: HoE GUI
Figure 52: Process a New Folder
Figure 53: Process a New Folder 2
Figure 54: Collection exploration
Figure 55: Face detail
Figure 56: Jobs Execution
Figure 57: Jobs details page
Figure 58: Face Identification Task
Figure 59: CrowdSearcher UI for Expert Identification
Figure 60: Image Validation performed
Figure 61: Answers list
Figure 62: Crowd Face position validation
Figure 63: Face detail after Crowd validation
Figure 64: Face detail after Crowd validation



Executive Summary

This deliverable is the document accompanying the third integrated release of the CUbRIK platform (R3), scheduled for Y2. Its general goal is to provide an adopter of the Platform – a generic Business Ecosystem stakeholder – with a "how to" guide to exploit and extend the CUbRIK platform for their own business.

D9.2 [1] introduces the concept of the CUbRIK Pipeline and describes the plan of each platform release. In order to provide the reader with a self-contained document, some of these concepts are repeated here. Moreover, the release scheduling from chapter 2 of D9.2 is reported in order to provide an overview of the plan vs. what is actually delivered.

Chapter 1 describes what is contained in the release; the introduction in D9.2 is extended to provide a comprehensive description of the release parts and their arrangement, and the concepts of both H-Demos and V-Apps are repeated and contextualized. A description of both the CUbRIK Deployment and Development environments is provided; in particular, the latter is related to the Open Source approach of CUbRIK. The concepts of Bundle and Code inside the release package are introduced, describing how the source code is provided.

This third release of the platform collects artefacts from both H-Demos and V-Apps. H-Demos are demonstrators of horizontal features, while V-Apps are applications built on the CUbRIK platform that validate the platform features in their domain of reference. Even if the main part of the H-Demos has converged into the V-Apps, they still constitute a value for the platform: H-Demos are a reusable set of artefacts, and the pipelines implementing a set of specific features, together with their related tools, can be exploited by a Platform Adopter to personalize and tailor the processes for its specific needs.

Chapter 2 provides an in-depth description of the structure of the H-Demos, their related pipelines, components and third-party libraries; this information is complemented with installation and configuration guidelines.

Following the description in Chapter 2, Chapter 3 introduces the two vertical applications (V-Apps) for the selected domains of practice, that is History of Europe (HoE) and SME Innovation (Fashion). Comprehensive details of the V-App pipeline structures are provided in D10.1 [2]; complementarily, this document provides an in-depth description of the installation details for the pipelines, related components and third-party libraries.

Chapter 4 reports the identity card of all CUbRIK components belonging to this release and provides a mapping of Components vs. V-Apps and H-Demos. The in-depth detailed description of all the components is further provided in D8.2 [3].

Chapter 5 summarizes the results achieved during this second year and provides the map of planned vs. achieved results, with references to the paragraphs describing each delivery. Platform integration is an articulated process involving different steps, including but not limited to testing and deployment; a summary list of all the H-Demos and V-Apps deployed in the ENG infrastructure is provided.

Considering the whole umbrella of CUbRIK deliverables, this document, accompanying a release, is mainly aimed at guiding the Platform Adopter on how to obtain the platform, install it and run it. Particular focus is further put on guidelines for re-use and tailoring of processes. Complementary information is provided by other deliverables; in particular, pipelines for multimodal content analysis & enrichment are described in D5.2 [4], pipelines for query processing in D6.2 [5], and pipelines for relevance feedback in the corresponding D7.1 [6]. D8.2 [7] focuses on components for content analysis and platform support. Moreover, D10.1 [8] provides a general overview of each application demonstrator for the two CUbRIK domains of practice, the two V-Apps.

[1] D9.2 Delivery management Plan and testing specification
[2] D10.1 First CUbRIK application demonstrator
[3] D8.2 R2 Component and pipeline support services
[4] D5.2 R2 Pipelines for multimodal content analysis & enrichment
[5] D6.2 R2 Pipelines for query processing
[6] D7.1 R1 Pipelines for relevance feedback
[7] D8.2 R2 Component and pipeline support services
[8] D10.1 First CUbRIK application demonstrator



1 CUbRIK Release 3 Description

CUbRIK is not developed as a monolithic, do-it-all architecture. It follows a differential approach based on SMILA [9] as the underlying framework supporting workflow definition and core service execution. CUbRIK relies on this framework to execute processes (a.k.a. pipelines), which consist of collections of tasks. In order to understand the CUbRIK structure, which is reflected in the Release, the CUbRIK Pipeline concept must first be provided to the Reader; moreover, for this release, the H-Demo and V-App concepts must be introduced as well.

1.1 CUbRIK pipelines

The CUbRIK platform is built for multimedia search practitioners, researchers and end-users, and relies on CUbRIK pipelines which bundle and distribute the tasks to be executed. A CUbRIK pipeline is a conceptual workflow constituting a fragment of search application business logic. Each pipeline is described by Jobs (automatic operations) and human activities (CrowdTasks, Q&A and GWAPs) that are chained in a sequence. Each CUbRIK Job is implemented as a SMILA workflow, which is constituted by an aggregation of Actions; an Action can be a:
1. Worker: a single processing component in an asynchronous workflow;
2. Pipelet: a reusable component in a BPEL workflow, used to process data contained in records;
3. Pipeline: a synchronous BPEL process (or workflow) that orchestrates pipelets and other BPEL services (e.g. web services).
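To make the Job/workflow relationship concrete, the sketch below shows what a minimal SMILA asynchronous workflow definition looks like. This example is not taken from the release: the workflow, worker and bucket names are hypothetical, and only the overall JSON shape follows the SMILA documentation.

```shell
# Illustrative sketch only: a minimal SMILA asynchronous workflow
# definition. The workflow, worker and bucket names are hypothetical;
# only the overall JSON shape follows the SMILA documentation.
cat > workflow.json <<'EOF'
{
  "name": "cubrikExampleWorkflow",
  "startAction": {
    "worker": "bulkbuilder",
    "output": { "insertedRecords": "recordsToProcess" }
  },
  "actions": [
    {
      "worker": "pipelineProcessor",
      "parameters": { "pipelineName": "ExampleBPELPipeline" },
      "input": { "input": "recordsToProcess" }
    }
  ]
}
EOF

# Sanity-check that the definition is well-formed JSON before it is
# handed to SMILA's job manager.
python3 -m json.tool workflow.json > /dev/null && echo "workflow.json OK"
```

Here the synchronous BPEL pipeline (an Action of type 3 above) is plugged into the asynchronous workflow through the `pipelineProcessor` worker, which is how the two workflow styles are typically combined.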

Figure 1: CUbRIK Pipeline structure

Both H-Demos and V-Apps are released as one or more CUbRIK pipelines belonging to one of three categories:
• Content Analysis and Enrichment pipelines: responsible for making content searchable (low-level media feature extraction); additionally, according to the Human-in-the-loop approach, these pipelines manage the integration with human-executed tasks to enrich the "understanding" of the media content.
• Query execution pipelines: responsible for handling query processing in CUbRIK (both simple and federated), also taking account of personalization and query fine-tuning. The latter is another facet of the Human-in-the-loop approach, gaining query effectiveness by exploiting Crowd evaluation.
• Feedback acquisition and processing pipelines: responsible for retrofitting content analysis and enrichment with the feedback gathered from the Community.

[9] http://www.eclipse.org/smila/

1.2 H-Demo concept

Horizontal demos (H-Demos) implement and prove some horizontal features of the CUbRIK Platform and are conceived with a twofold goal. The first is to provide some of these features that then converged towards a concrete exploitation inside one of the two V-Apps. As a second goal, H-Demos covering the complete feature set are released with a specific pipeline so they can be re-used by a CUbRIK Platform adopter, i.e. a user who intends to exploit the feature set in their own application process.

In CUbRIK, seven H-Demos were defined according to the analysis of the domains of practice:
1. Logo detection
2. News history
3. People Identification
4. LikeLines: time-point specific search via implicit user-derived information
5. Media Entity Annotation
6. Crosswords
7. Accessibility aware Relevance feedback

Logo detection and the first version of Media Entity Annotation were released as part of R1 at M12. LikeLines and the first version of People Identification belonged to R2, due at M18; since R2 was an internal milestone, all the artefacts belonging to that release are also packaged into this release, R3, at M24.

The H-Demos listed above were originally planned in D9.2, which describes the overall plan for the CUbRIK platform releases. During this second year, two additional H-Demos were conceived and developed:
- Context-aware automatic query formulation
- Search Engine Federation

Both provide a CUbRIK Pipeline for query processing, were tested as H-Demos, and are planned to be embedded in a further CUbRIK release. In particular, the former exploits the fashion dataset and will be embedded in the final version of the Fashion V-App in Y3; the latter is implemented as a potential extension of the querying part of the History of Europe V-App. The following table is the original H-Demo summary from D9.2:

Figure 2: CUbRIK H-Demos per Release



1.3 Vertical Application concept

Apart from the H-Demos, which are aimed at providing proofs-of-concept of specific functionalities, two vertical applications were developed for two domains of practice (the SME Innovation/Fashion domain and the History of Europe domain); these are search-driven applications that rely on CUbRIK to demonstrate the added value of the platform. According to the general CUbRIK time plan depicted in Figure 3, the first versions of both the History of Europe and the Search for SME Innovation (Fashion) V-Apps are delivered as part of R3 through CUbRIK components and pipelines.

1.4 What is in R3

The CUbRIK platform is structured to be developed in three steps:
• Proof of concept, developed on initial requirements, released at M14;
• First Prototype, i.e. the first version of the application, released at M24;
• Complete application, fully implementing the CUbRIK features, to be released at M36.

Figure 3: CUbRIK Releases Timeline

Five main development phases are defined in the project timescale. The end of each phase corresponds to a CUbRIK project release which collects the progress – in terms of artefacts – achieved at that time:

M12 – Release 1.0: Initial release of the platform: a first version of the CUbRIK Platform having initial versions of all available components and H-Demo pipelines in place and integrated in a common environment. The main goal is to validate the developed technology and data model, and to test the workflow mechanisms across the full range of pipelines.

M18 – Release 2.0: Advancement release: enhancements and bug fixing, possible runtime environment extension. The enhancements essentially focus on the Pipelines for Feedback acquisition and processing – relevance feedback – and on functionalities implemented in the first prototypes of the "CUbRIK History of Europe Application" and the "Search for SME Innovation Application". Further identification of some feature extensions and improvements.

M24 – Release 3.0: Integrated release of the platform: release of a second set of functionalities related to H-Demos. Moreover, the "CUbRIK History of Europe Application" and "Search for SME Innovation Application" first prototypes extend the Platform with some additional features.

M30 – Release 4.0: Advancement release: enhancement and bug fixing in order to consolidate the platform and prepare it for the final version. Some additional features will be provided as part of the H-Demos.

M36 – Release 5.0: Final release: to improve effectiveness & efficiency and push CUbRIK closer to real business scenarios.

The R3 goal is to provide the second integrated release, collecting functionalities from both H-Demos and Vertical Demonstrators (V-Apps). Moreover, it collects the advancements that were achieved as part of R2.

1.4.1 Release 3 in a Nutshell

The table below extends the one reported in D9.2 and provides Release 3 in a nutshell; it reports the plan for the second year of the project. Moreover, since R2 was an internal milestone, pipelines and components belonging to that release are collected and packaged as part of the official R3. The "What is delivered" column groups at the same level the artefacts – pipelines, components and datasets – belonging to R3:

M18 – Release 2.0
• Social network analysis, trust & people search techniques: ranking strategies for multimedia objects that make use of social features obtained from Web 2.0 platforms.
• Pipelines for relevance feedback: People Identification H-Demo; LikeLines H-Demo.
• Incentive models and algorithms: incentive models applied to concrete crowdsourcing scenarios: GWAP (Sketchness, Crossword), Q&A (CrowdSearcher framework), CrowdTask (MICT platform, CrowdFlower framework). Entity game framework, crosswords scenario: Crossword H-Demo for Entity Repository uncertainty reduction. Component for relevance feedback, crosswords scenario: Crossword H-Demo.

M24 – Release 3.0
• Time core services, components and pipelines for preliminary support to the "History of Europe" application, extended to cover also the SME Fashion V-App: a component for analysis – using various algorithms – of the data coming into the databases and from the crawlers, in order to find popular content and specific trending topics (Fashion V-App / Trend Analyser): extract features from the multimedia content (images) and use the Trend Analyser to compute trends. A component for text query and geolocation that crawls data from the web: text query Fashion V-App crawler; crawler for Picasa, Panoramio and Flickr; European and geolocalization part of Media Entity Annotation.
• Pipelines for multimodal content analysis & enrichment: News History H-Demo (NCH Extraction); History of Europe V-App; Search for SME Innovation (Fashion) V-App.
• Pipelines for query processing: News Content History H-Demo; People Identification H-Demo; Media Entity Annotation H-Demo (time-related query extension); Context-aware automatic query formulation H-Demo; Search Engine Federation H-Demo.
• Pipelines for relevance feedback: Crosswords H-Demo; Media Entity Annotation H-Demo (relevance feedback, CrowdFlower extension); Accessibility Aware Relevance Feedback H-Demo.
• Component and pipeline support services: HoE and Fashion V-App components.

Table 1: Release 3 in a nutshell



1.5 CUbRIK SVN

Release 3 of the CUbRIK platform is available in the official CUbRIK SVN repository hosted on an ENG server at https://cubrikfactory.eng.it/svn/CUBRIK/. The repository is accessible via HTTPS using a registered account; specific credentials were created in order to allow downloading the Release.

Demos repository URL: https://cubrikfactory.eng.it/svn/CUBRIK/
Account name: XXXXXXXXXXX
Account password: XXXXXXXXXXX

The SVN is mainly organized in two folders: WORK and PACKAGE. The first contains all CUbRIK components belonging to CUbRIK R3 and is used by developers for checking files out and in; the second is used for the official releases (R1 at M12, R3 at M24 and R5 at M36) and is managed by ENG. A PACKAGE folder for R3 was created and further organized in:
• V-Apps: Fashion and HoE
• H-Demos (one folder for each H-Demo)
• Components
In the Components folder, each component subfolder contains a zip file with the released software plus a doc file with the component specification. The figure below depicts the PACKAGE directory:

Figure 4: CUbRIK official SVN repository
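The repository layout described above can be sketched as follows. The snippet is illustrative only: the concrete folder names inside PACKAGE may differ from this mock, and the checkout command requires the registered account mentioned earlier.

```shell
# Illustrative sketch of the R3 PACKAGE layout described above;
# the concrete H-Demo and component folder names in the repository
# may differ.
mkdir -p PACKAGE/R3/V-Apps/Fashion \
         PACKAGE/R3/V-Apps/HoE \
         PACKAGE/R3/H-Demos \
         PACKAGE/R3/Components

# An adopter with a registered account would check the release out
# with a command like the following (shown, not executed here):
echo 'svn checkout https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGE --username <account>'
```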

1.6 Bringing it all together: Integration in CUbRIK

The CUbRIK architecture is an example of differential design. The platform relies on SMILA as the underlying framework supporting workflow definition and execution, that is, supporting the core services. As reported in Section 1.1, CUbRIK pipelines, each CUbRIK Pipeline includes jobs that are implemented as SMILA workflows; so deploying a CUbRIK Pipeline implicitly implies, in general, deploying a SMILA pipeline.
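Once deployed, a pipeline's job is typically driven through SMILA's job manager REST interface (the same interface shown in the SMILA console screenshots later in this document). The snippet below sketches the call; the job name is hypothetical, and the actual invocation is commented out because it needs a running SMILA instance.

```shell
# Illustrative sketch: starting a run of a deployed CUbRIK job via
# SMILA's job manager REST API. The port is SMILA's default and the
# job name is hypothetical.
SMILA_URL=http://localhost:8080/smila
JOB=cubrikExampleJob

# Actual invocation (commented out: it requires a running SMILA
# instance):
#   curl -X POST "$SMILA_URL/jobmanager/jobs/$JOB/"
echo "POST $SMILA_URL/jobmanager/jobs/$JOB/"
```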

R3 CUbRIK Integrated Platform Release

Page 8

D9.4 Version 1.0


SMILA, as an Eclipse project, supports pipeline installation and debugging inside the development environment and also supports bundle deployment. CUbRIK R1 was delivered by exploiting the CUbRIK Platform services via the installation of the SMILA framework as a development environment. This third release is mainly delivered through the SMILA bundle approach [10] (producing plugins); the result is a collection of jar files plus some configuration files (e.g. JSON and BPEL files) responsible for describing the CUbRIK pipelines. This choice was made to better show the modularity and the simplicity of the CUbRIK platform in terms of installation and configuration: bundles need only the SMILA framework binary distribution, without the need to have the development environment installed.

The R3 CUbRIK platform delivery via SMILA bundles is the result of installation, testing and bug-fixing activity done by ENG before the official delivery. The integration activity was performed according to the following methodology:
• Each partner responsible for H-Demo/V-App development commits to the CUbRIK official SVN repository.
• Engineering, as the partner responsible for integration, checks out the H-Demo/V-App released by the partner owner as CUbRIK pipelines embedded in the SMILA framework.
• Engineering performs a smooth installation and checks the pipeline.
• Engineering further tests the released H-Demo/V-App and, if needed, reports bugs using the CUbRIK tracking system [11].
• The H-Demo/V-App partner owner fixes the bug and performs a new commit in the SVN.
• Engineering updates its installation and performs a new test to check whether the bug was properly fixed.
• Engineering finalizes the installation, creates the bundle and deploys the H-Demo/V-App on a server hosted in the ENG DMZ.
• Engineering draws up the installation and configuration guidelines as a result of this experience, providing info also for running the bundle.

Section 1.6.2, R3 Deployment Environment preparation and set-up, reports how to prepare and set up the CUbRIK deployment; moreover, for each H-Demo/V-App a full guideline is provided in order to perform each specific installation and configuration.

1.6.1 Bundle vs Code

As anticipated, the R3 CUbRIK platform is released not only as SMILA bundles, a sort of "compiled" format, but also as source code. The latter allows extension of the CUbRIK Platform for those parts that are released under an open source licence. The CUbRIK pipelines implemented for R3 are released in the official CUbRIK SVN through H-Demos and V-Apps. For each of them, the folder structure is mainly as follows:
• JAR
• Configuration
• Data/other
• Framework
The JAR folder contains SMILA bundles, which are jar files. Each jar file also has a src folder containing the source code, which can be imported into the CUbRIK development environment. The jar usage is therefore twofold: for pipeline development and for actual pipeline deployment and running. Starting from a bundle, the developer can create a new Eclipse project and import the CUbRIK pipelines into the development environment for pipeline re-use, personalization and tailoring. Otherwise, the bundle can be deployed on the SMILA framework binary distribution in order to run the pipelines as they are.

10 http://wiki.eclipse.org/SMILA/Documentation/HowTo/How_to_export_a_bundle
11 https://89.97.237.243/cubrikbugs/enter_bug.cgi



The Configuration folder contains the files needed to properly configure the CUbRIK pipelines in both the deployment and development environments. The Data/other folder contains files needed, for example, for content data injection or for user profiling. Moreover, for those H-Demos leveraging a specific framework as backend (e.g. People Identification), the Framework folder includes all the necessary backend installation files. A preliminary set-up phase of the CUbRIK development environment is needed; a detailed guideline is reported in section 1.6.3. Once the development environment is properly installed and configured, the CUbRIK pipelines' source code can be imported into the Eclipse workspace as described in section 1.6.3.1. As a general approach, CUbRIK applications can exploit both open source and proprietary components. Some components of the R3 V-Apps and H-Demos are released under a proprietary licence; for these cases this document reports detailed information on how to obtain the component and how to request the licence, including contact points.

1.6.2 R3 Deployment Environment preparation and set-up

CUbRIK R3 is installed in a server farm composed of two servers with the following SW requirements:

Server 1 (Windows):
• Operating system: Windows XP, 32-bit, Service Pack 3
• Deployment environment: JRE v7 (actual link: http://www.oracle.com/technetwork/java/javase/downloads/jre7-downloads-1880261.html)
• SMILA binary distribution: 1.1 (actual link: http://www.eclipse.org/downloads/download.php?file=/rt/smila/releases/1.1/SMILA-1.1-win32.win32.x86.zip)

Server 2 (Linux):
• Operating system: Ubuntu 10.04.4 LTS, 32-bit
• Deployment environment: JRE v7 (actual link: http://www.oracle.com/technetwork/java/javase/downloads/jre7-downloads-1880261.html)
• SMILA binary distribution: 1.1 (actual link: http://www.eclipse.org/downloads/download.php?file=/rt/smila/releases/1.1/SMILA-1.1-linux.gtk.x86.zip)

As anticipated, the SMILA framework binary distribution is required for the CUbRIK deployment environment. The step-by-step procedure is as follows:
1. Download the packaged binary SMILA v1.1 from http://www.eclipse.org/smila/downloads_archive.php, choosing the release according to the OS on your machine, e.g. SMILA-1.1-win32.win32.x86.zip.
2. Unzip the downloaded file into the chosen directory to obtain a folder structure like the following:
/<SMILA>
  /configuration
  /features



  /jmxclient
  /plugins
  /workspace
  .eclipseproduct
  ...
  SMILA
  SMILA.ini
3. Check the precondition: to run SMILA you need the jre v7 executable on your PATH environment variable. Either:
• add the path of your local JRE executable to the PATH environment variable, or
• add the argument -vm <path/to/jre/executable> right at the top of the file SMILA.ini. Make sure that -vm is indeed the first argument in the file and that there is a line break after it. It should look similar to the following:
-vm
d:/java/jre7/bin/java
4. Run smila.exe (smila.sh in the case of a Linux installation) from your SMILA installation folder.
5. Note that Jetty, the HTTP server embedded in SMILA, uses 8080 as its default port. Both the People Identification and LikeLines GUI frontends also use 8080 as their default port. If they are installed and run on the same machine, remember to change the port value. For SMILA: edit the configuration/org.eclipse.smila.clusterconfig.simple/clusterconfig.json file and set the "httpPort" parameter to the chosen port number; for People Identification and LikeLines, refer to sections 2.2.7.1 and 2.4.7.1 respectively.
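The port change in step 5 can also be scripted. The sketch below assumes only what the step states, namely that clusterconfig.json is a JSON object with a top-level "httpPort" entry; the helper name is illustrative:

```python
import json

# Path of SMILA's cluster configuration, relative to the SMILA installation folder
CLUSTERCONFIG = "configuration/org.eclipse.smila.clusterconfig.simple/clusterconfig.json"

def set_http_port(config_path, port):
    """Load the cluster configuration, set the 'httpPort' parameter and save it back."""
    with open(config_path, "r") as f:
        config = json.load(f)
    config["httpPort"] = port  # e.g. 8081 to avoid a clash on 8080
    with open(config_path, "w") as f:
        json.dump(config, f, indent=2)
    return config

# Example (run from the SMILA root folder):
# set_http_port(CLUSTERCONFIG, 8081)
```

All other entries of the file are preserved; only the "httpPort" value is rewritten.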

1.6.3 CUbRIK development environment installation and configuration

For pipeline re-use and personalization, the CUbRIK development environment needs to be set up for source code handling. This implies some dependencies on the SMILA framework; the list below collects these requirements and provides links for downloading the environment tools (Windows OS):
• Eclipse SDK, version 4.2.0, available at: http://www.eclipse.org/downloads/download.php?file=/eclipse/downloads/drops4/R-4.2-201206081400/eclipse-SDK-4.2-win32.zip
• SMILA source code, version 1.1, available at: http://www.eclipse.org/downloads/download.php?file=/rt/smila/releases/1.1/SMILA-1.1-core-source.zip
• JDK, version 1.7, available at: http://www.oracle.com/technetwork/java/javase/downloads/jdk7-downloads-1880260.html



Please follow the steps below to properly set up the CUbRIK development environment:
1. Download the Eclipse SDK, e.g. the eclipse-SDK-4.2-win32.zip file, and extract it twice into two different folders, one called "Eclipse SDK 4.2.0" and the other called "Eclipse SDK 4.2.0 target"; this gives two directories such as D:\Eclipse SDK 4.2.0 and D:\Eclipse SDK 4.2.0 target. Note that they must contain not the eclipse folder resulting from unpacking the zip file, but its content.
2. Download the SMILA 1.1 source code zip file and extract it into the SMILA-1.1-core-source folder.
3. Create an empty folder, called for example H-Demo_ContentAware_workspace, that will be used as the workspace folder of your development environment.
4. Start Eclipse from your Eclipse SDK 4.2.0 folder and select the workspace folder created at step 3 (e.g. H-Demo_ContentAware_workspace).
5. Open the "Preferences" window from the "Window" menu and, on the left side, select "Plug-in Development" > "Target Platform"; click on Add to add a new Target Platform and the New Target Definition window will be shown.

Figure 5: Add a new Target Platform
In the New Target Definition window click the Next button and the Target Platform view will be shown. Select the first choice and click Next; the Edit Target Definition window will be shown (Figure 7).



Figure 6: New Target Definition window

Figure 7: Add SMILA v1.1 as Target Platform
6. In the Edit Target Definition view, write SMILA 1.1 in the name text field, select the Locations tab, then click the Add button, select Installation (Figure 8) and then click Next to browse to the Eclipse SDK 4.2.0 target folder. Click Finish.



Figure 8: Add content window
7. In the Edit Target Definition view, select the Locations tab, then click the Add button, select Directory, then click Next to browse to the SMILA-1.1-core-source\SMILA.extension\eclipse folder. Click Finish.
8. Going back to the "Preferences" window, select SMILA 1.1 as Target Platform, click Apply and then OK. At this point, if the procedure was performed properly, your installation should appear as follows.



Figure 9: SMILA 1.1 added as Target Platform
9. Import the SMILA source into the workspace: in the Package Explorer panel, right-click and select Import > Existing Projects into Workspace > Next > Browse, and select the directory where the SMILA-1.1-source folder is located; make sure to check "Copy projects into workspace" and "Select All" the projects listed. Click Finish.



Figure 10: Import SMILA source in the workspace

Figure 11: Import Project Panel
The CUbRIK development environment is now ready to be used and you can proceed with the import of the CUbRIK pipelines; please refer to the following section.



1.6.3.1 CUbRIK pipeline extension
This section explains how to import the CUbRIK pipelines implemented for the R3 CUbRIK Integrated Platform into the CUbRIK environment.
1. Download the CUbRIK pipelines: create a folder, called for example CUbRIKContentAware_source, and proceed with the checkout from the official CUbRIK SVN using Tortoise as SVN client12 (we assume it is already installed on your PC; other SVN clients can be used).
2. Import the CUbRIK pipelines into the workspace: assuming you have properly set up the CUbRIK development environment, run Eclipse, right-click in the Package Explorer panel and select Import > Existing Projects into Workspace, choose "Select archive file", browse to the jar bundle you want to import and press the "Finish" button. Remember to repeat the import for each jar file contained in the JAR folder you downloaded.

Figure 12: Importing a jar bundle in Eclipse
3. Configure the CUbRIK pipelines: please refer to the configuration guideline of each H-Demo/V-App released, reported in the sections of Chapters 2 and 3. Remember that in the development environment the SMILA root folder is SMILA.application.
4. Run the CUbRIK pipelines in the development environment: to run the CUbRIK pipelines inside the development environment you have to prepare some environment configurations as follows:
• In the Package Explorer view, select the SMILA.application project, right-click and then choose "Refresh".
• Open the Run Configurations window, select OSGi Framework > SMILA and a list of imported bundles is shown. For each bundle in that list, check the Start Level and Auto-Start property fields and set them as follows:
12 http://tortoisesvn.tigris.org/



• Auto-Start: true if the bundle is a servlet, a worker or a service; default otherwise.

Figure 13: CUbRIK Bundles running configuration

• Click the "Validate Bundles" button and, if requested, check the missing bundles.
• Run the CUbRIK pipelines by clicking the "Run" button or the corresponding entry of the "Run" menu. The CUbRIK pipeline environment is now running.



2 Pipelines of R3: H-Demos

CUbRIK R3 is composed of all the pipelines belonging to the V-Apps and H-Demos. As introduced in section 1.2 (H-Demo concept), the latter have the specific goal of being re-used by CUbRIK Platform adopters. Table 1 (Release 3 in a nutshell) lists the H-Demo pipelines; further details are provided in this chapter. Detailed descriptions of the V-App processes and pipelines are provided in D10.1 and D5.2. H-Demo pipelines:
1. News Content History: Content Analysis and Enrichment pipeline; Query Execution pipeline
2. People Identification: Query Execution pipeline; Feedback acquisition and processing pipeline
3. Multimedia Entity Annotation: Query Execution pipeline; Feedback acquisition and processing pipeline
4. LikeLines: Feedback acquisition and processing pipeline
5. Crosswords: Feedback acquisition and processing pipeline
6. Accessibility-aware Relevance Feedback: Feedback acquisition and processing pipeline
7. Search engine federation: Query Execution pipeline
8. Context-aware automatic query formulation: Query Execution pipeline

2.1 CUbRIK H-demo News Content History
The purpose of the News Content History H-Demo is to find video content that has been reused by several TV news shows. This is especially interesting for rare, exclusive content provided by news agencies or private persons. The H-Demo application provides a web-based interface to query by text or by video & text, and displays the relationships between news clips that share the same content. The technical approach to finding these segments employs dense matching in order to exactly identify video fragments that have been reused, thus finding fragments originating from the same camera (recording).

2.1.1 H-Demo vs. V-Apps (Fashion and HoE)

The News Content History H-Demo will be used in the HoE V-App through the Expansion through Video component, due for Y3, in order to offer a way of getting new contextual information by delivering topic-related news videos and their relationships. The News Content History H-Demo will be used in the Fashion V-App through the Visual Feature Extraction component in order to extract visual features used for the classification of clothing.



2.1.2 H-Demo vs CUbRIK pipeline(s)

This H-Demo is designed to implement two kinds of CUbRIK pipelines, a pipeline for multimodal content analysis & enrichment and a pipeline for query processing; these have complementary roles in the whole News History process. The query pipeline, at the time of this document's release, is still in a preliminary version and is not included as a formal pipeline; it is integrated as a web front- and backend application on the underlying database and demonstrates the whole News Content History process. The content analysis & enrichment pipeline, on the contrary, is included; a detailed description is in D5.2. The whole process and its dependencies are shown in the diagram below:

Figure 14: Activity diagram for CUbRIK pipelines in the News Content History H-Demo
The process is structured as follows: a video content query is triggered by the Web App; the Generic Visual Feature Extractor component performs video feature extraction and generates descriptors for the processed files; this information is used by the Video Segmenter Matcher component, which matches the visual descriptors of a reference media file against one or more other media file descriptors. Results are then shown by the Web App, where during result browsing the crowd is involved to accomplish some tasks, like the validation of found segment matches. The crowd can be identified with persons interested in how news footage is used among different TV stations, i.e. journalists, media scientists and interested private persons. The crowd tasks that have been implemented so far do not require any special knowledge or ability.

2.1.3 Data set description

For evaluation and demonstration purposes, scripts for producing artificial news clip videos have been developed within the News Content History H-Demo. As it turned out during the project, the annotation of real-world news data is almost impossible. For demonstration purposes, however, a real-world data set may also be used, while considering the copyright issues of the material.

2.1.4 Architecture overview

The News Content History H-Demo is implemented as a client-server web app. The top-level architecture is shown in Figure 15.



Figure 15: NCH Demo basic architecture overview
The figure above shows the main components involved in the typical app scenario. The data analysis process that creates the XML data is currently conducted by native standalone components which are partly integrated in SMILA.

2.1.5 External components

The data creation process for this version of the H-Demo is an offline video analysis step which mainly consists of the feature extraction and the matching framework. While the Generic Visual Feature Extractor is already fully integrated, the segment matcher was not at the time of this deliverable, but will be available in the next version.

2.1.6 Components integrated in CUbRIK R3

The Generic Visual Feature Extractor component analyses visual data such as images and videos, extracts features and generates descriptors for the processed files. The features/descriptors supported by the interface can easily be extended, since internally the component can be configured by XML processing instructions. This component is already integrated into a pipeline even if it is not used in the current demo yet. It represents a part of the video analysis step and is re-usable in the pipeline.

2.1.7 Third party library

The Generic Visual Feature Extractor component uses the Fraunhofer XPXNativeInterface, accessed via JNI.

2.1.8 How to install the H-Demo bundle

2.1.8.1 How to set up and run the web application
Please find the required files, like the web application and a sample database for the demo, in the SVN: https://cubrikfactory.eng.it/svn/CUBRIK/Demos/NewsHistory_FRH/V1.0. The video material used is stored at: http://cubrikfactory.eng.it:82/newscontenthistory/DATA/video_samples/
Follow these steps to set up and run the demo:
1. Install Java 7 if it is not already on your system; see http://www.java.com/en/download/
2. Download and extract Tomcat; see http://tomcat.apache.org/download-70.cgi
3. Copy newshistory-webapp.war from the SVN to Tomcat's webapp folder (tomcat/webapps/)



4. Copy the demo database (the newsData folder from the SVN) to tomcat/bin/newsData
5. Copy the demo content (video_samples) to tomcat/webapps/ROOT/video_samples/
6. Start the web server: tomcat/bin/startup
7. Open a browser and connect to http://localhost:8080/newshistory-webapp/index.html (remark: use a WebKit browser, e.g. Google Chrome or Safari, for optimal representation)
8. Enjoy the demo
9. Close the browser and shut down Tomcat (close the Tomcat window)

2.1.8.2 How to set up and run the Generic Visual Feature Extractor pipelet
The pipelet plugin can be found at: https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/HDemos/NewsHistory_FRH_POLMI/JARS/
The required SMILA configuration is located at: https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/HDemos/NewsHistory_FRH_POLMI/configuration/
After starting SMILA with the extractor plugin you can follow these steps to run an extraction:
1. Start the indexer with a POST request to: http://localhost:8080/smila/jobmanager/jobs/indexUpdate/
2. Execute the extraction workflow by starting the default job with a POST to: http://localhost:8080/smila/jobmanager/jobs/initNCH
3. Or create your own job by posting an adapted version of the following JSON data to: http://localhost:8080/smila/jobmanager/jobs/
{
  "name": "JOB_NAME",
  "workflow": "NCHcrawling",
  "parameters": {
    "tempStore": "temp",
    "dataSource": "file",
    "rootFolder": "PATH/TO/YOUR/VIDEO/DATA",
    "jobToPushTo": "indexUpdate",
    "mapping": {
      "fileContent": "Content",
      "filePath": "Path",
      "fileName": "Filename",
      "fileSize": "Size",
      "fileExtension": "Extension",
      "fileLastModified": "LastModifiedDate"
    }
  }
}
// you should adapt the name of the job and the path to your video data
4. Start your new job by sending a POST request to: http://localhost:8080/smila/jobmanager/jobs/[JOB_NAME]
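The job creation in steps 3 and 4 can also be done programmatically. The sketch below uses only the Python standard library; the helper names are illustrative, while the URLs and JSON fields come from the guide above:

```python
import json
import urllib.request

# SMILA job manager endpoint (adapt host/port to your installation)
JOBMANAGER = "http://localhost:8080/smila/jobmanager/jobs/"

def make_nch_job(name, root_folder):
    """Build the NCHcrawling job definition with the fields listed in step 3."""
    return {
        "name": name,
        "workflow": "NCHcrawling",
        "parameters": {
            "tempStore": "temp",
            "dataSource": "file",
            "rootFolder": root_folder,
            "jobToPushTo": "indexUpdate",
            "mapping": {
                "fileContent": "Content",
                "filePath": "Path",
                "fileName": "Filename",
                "fileSize": "Size",
                "fileExtension": "Extension",
                "fileLastModified": "LastModifiedDate",
            },
        },
    }

def post(url, payload=None):
    """Send a POST request with an optional JSON body and return the raw response."""
    data = json.dumps(payload).encode() if payload is not None else b""
    req = urllib.request.Request(url, data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return resp.read()

# Example (requires a running SMILA instance):
# post(JOBMANAGER, make_nch_job("myNCHJob", "/data/videos"))  # step 3: create the job
# post(JOBMANAGER + "myNCHJob")                               # step 4: start it
```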



2.2 People Identification H-demo
The People Identification H-Demo is an application, leveraging a specific framework, that demonstrates people recognition in a photo collection. The underlying approach automatically detects faces within photos and then discriminates the detected faces amongst each other based on some initial labelling. In other words, the recognition of people within photos is primarily based on their faces. The framework is roughly split into three parts:
• A web-based frontend to render a graphical user interface
• A backend service that exposes the framework's functionality via a REST/JSON-based API (a SOAP-based interface for SMILA interconnection is also included)
• A backend core that implements the main processing functionalities like photo handling, face detection, feature extraction, face recognition and storage.

Figure 16: A web-based user interface enables users to retrieve and browse results
The People Identification H-Demo is available in the official CUbRIK repository at https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/HDemos/PeopleIdentification_QMUL. Three subfolders are provided:
- configuration: contains all files, like the BPEL file, necessary for the CUbRIK pipeline
- framework: contains the framework part of the People Identification H-Demo
- JAR: contains bundles for the People Identification H-Demo and the related source code

2.2.1 H-Demo vs. V-Apps (Fashion and HoE)

The People Identification H-Demo demonstrates two functionalities that are covered by the History of Europe V-App. In particular, these are:
• Social graph creation (History of Europe V-App)
• The general concept of face detection and recognition (History of Europe V-App)

2.2.2 H-Demo vs CUbRIK pipeline(s)

Two pipelines are released as part of the People Identification H-Demo:
• Pipelines for Query Execution
• Pipelines for Feedback Acquisition and Processing
The latter was due for M18 as part of the R2 internal project milestone and is now part of the official R3. From a technical point of view, the People Identification H-Demo provides a REST/JSON-based API to expose its functionality to a web-based graphical user interface. In addition, a SOAP-based API provides an interconnection to SMILA via the provided pipelets. While the



People Identification backend itself does not rely on SMILA or specific CUbRIK pipelines, the pipelets allow other SMILA components to utilize the people recognition functionality of the framework, e.g. by invoking the provided pipelets within their pipeline. The pipelets expose the main functionalities necessary to perform people recognition, which are: importing face photos, setting the labels of faces (thus providing training information), querying or retrieving the (predicted) labels of faces, and providing feedback with respect to the predicted results (e.g. validating the results).

2.2.3 Data set description

The H-Demo primarily targets consumer photo collections. Such photo collections mainly depict people and are usually rich in contextual information. The employed recognition framework does however not rely on normalized face shots. Thus, photos taken in uncontrolled environments should work to some degree. Generally, the recognition framework is best suited for photo collections depicting only few individual people, as it is the case with family photo collections. Photos depicting people that frequently appear together are of special interest to exploit any social semantics. In the latter case, such datasets usually come from a single source.

2.2.4 Architecture overview

Figure 17 shows the main functional overview of the People Identification H-Demo. Note that the interfaces between user, API and framework are not explicitly shown. As indicated earlier, the H-Demo includes a REST/JSON-based interface (used by the graphical user interface) and a SOAP-based interface (for pipelet-based SMILA interconnection). To expose the people recognition functionalities provided by the People Identification framework to SMILA (to be invoked and thus utilized by other SMILA-based components), the H-Demo includes several SMILA pipelets that connect to the SOAP API of the People Identification framework. In particular, the H-Demo provides the following SMILA pipelets (subdivided by functionality and by the type of CUbRIK pipeline category they relate to):

Pipelines for Query Execution:
• Pipelet name: ImportPhotoPipelet
  Description: Import a face photo along with an optional label
  Signature of functionality: faceID importPhoto(photoURL, label='')
• Pipelet name: GetFaceLabelPipelet
  Description: Query the (set or predicted) label of a given face
  Signature of functionality: label getFaceLabel(faceID)
• Pipelet name: SetFaceLabelPipelet
  Description: Set the label of a given face
  Signature of functionality: setFaceLabel(faceID, label)

Pipelines for Feedback Acquisition and Processing:
• Pipelet name: ValidateFacePipelet
  Description: Validate the label of a given face
  Signature of functionality: validateFace(faceID)
• Future pipelet name: InvalidateFacePipelet (only a stub in the backend so far)
  Description: Invalidate the label of a given face
  Signature of functionality: invalidateFace(faceID)



Figure 17: Functional overview of the People Identification H-Demo

2.2.5 Third party library

The frontend of the People Identification H-Demo is written in HTML and JavaScript and only requires a modern HTML browser to display. Note: the web-based frontend requires Internet access to dynamically load the Dojo JavaScript library and the Bootstrap CSS library. However, this restriction can be lifted by hosting both libraries locally (the main HTML file needs to be slightly altered for this). The backend of the H-Demo is implemented in Python and requires a Python environment adhering to version 2.7. Unless using a scientific Python distribution as recommended (as it includes almost all required Python modules), the following Python modules are required:
• Numpy >= 1.5
• Scipy >= 0.11
• OpenCV >= 2.4.4
• Scikit-learn >= 0.13.1
• Scikit-image >= 0.8
• Bottle >= 0.12 (included)
• PyMongo >= 2.5
• Spyne >= 2.10.9 (only required for SMILA interconnection)
Lastly, the backend expects a MongoDB database service (version 2.2 or higher) running on the local host (on the default port).
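As a convenience, the module requirements above can be checked from Python before starting the backend. The helper below is not part of the H-Demo, just a quick sanity check; note that OpenCV, scikit-learn and scikit-image import as cv2, sklearn and skimage respectively:

```python
import importlib.util

# Import names of the required modules listed above
REQUIRED = ["numpy", "scipy", "cv2", "sklearn", "skimage", "bottle", "pymongo", "spyne"]

def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

if __name__ == "__main__":
    missing = missing_modules(REQUIRED)
    if missing:
        print("Missing modules:", ", ".join(missing))
    else:
        print("All required modules found.")
```

This only checks importability, not the minimum versions listed above.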

2.2.6 Components integrated in CUbRIK R3

To interconnect the People Identification backend with SMILA and expose its people recognition functionality, several pipelets are provided; these are:
• ImportPhotoPipelet
• SetFaceLabelPipelet
• GetFaceLabelPipelet
• ValidateFacePipelet

2.2.7 How to install the H-Demo bundle

It is advisable to install this H-Demo under a GNU/Linux OS, like an Ubuntu release, since some libraries used by the People Identification framework work only in that environment. It is strongly recommended to install the following scientific Python distribution, as it comes packaged with almost all required modules:



• Download and install Anaconda 1.6 or higher from Continuum Analytics (www.continuum.io)
• Install PyMongo by running: pip install pymongo
• Install Spyne (only required for SMILA interconnection) by running: pip install spyne
Note: Anaconda <= 1.6 on Linux might only include an older version of OpenCV that has a serious bug. After installing Anaconda, first update OpenCV by running anaconda/bin/conda update opencv. This should update OpenCV to at least 2.4.6 as of late August 2013. If not, a workaround is to download and compile OpenCV 2.4.4 or higher from www.opencv.org and overwrite cv2.so in anaconda/lib/python2.7/site-packages. In addition, we provide a Linux 64-bit binary version in bin_workaround/cv2.so (this might not work if your platform differs).
If not using the Anaconda distribution, run the following commands (on the Linux command line):
>>> apt-get update
>>> apt-get install python2.7
>>> easy_install pip
>>> pip install numpy
>>> pip install scipy
>>> pip install pymongo
>>> pip install spyne
... and similarly for all other packages except for OpenCV, which needs to be compiled and installed manually (see the installation guides on www.opencv.org).
Download MongoDB from www.mongodb.org, extract the binaries and start the database service by typing, for example:
>>> mkdir /tmp/peopledb
>>> ./mongodb/bin/mongod --smallfiles --dbpath /tmp/peopledb

• Download the People Identification H-Demo framework from the CUbRIK repository at https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/HDemos/PeopleIdentification_QMUL/framework and copy all its content under a given directory.

2.2.7.1 IP Host and Port Configuration
1. Open the server.py file under your installation of the People Identification framework (PeopleIdentification_QMUL/framework/src/web) and modify both the PORT number and IP_HOST:
if isOnQMULServer:
    run(host='IP_HOST', port=PORT)
else:
    run(host='IP_HOST', port=PORT)
2. Open the server_smila.py file under your installation of the People Identification framework (PeopleIdentification_QMUL/framework/src/web) and modify both the PORT number and IP_HOST:
if isOnQMULServer:
    server = make_server('IP_HOST', PORT, wsgi_app)
else:



    server = make_server('IP_HOST', PORT, wsgi_app)

2.2.7.2 CUbRIK deployment configuration
As anticipated in section 2.2.6, the People Identification H-Demo provides several SMILA pipelets to expose the people recognition functionality implemented in its backend to other SMILA components. The People Identification backend itself, however, does not require or rely on other CUbRIK work. The People Identification pipelets are provided as SMILA bundles. The following steps describe how to proceed with their installation and configuration:
1. Download the org.eclipse.smila.peoplerecognition.pipelets_1.2.0.jar file from https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/HDemos/PeopleIdentification_QMUL/JAR and copy it under the /plugins directory of your SMILA installation
2. Download the following files:
• ImportPhotoPipeline.bpel
• GetFaceLabelPipelet.bpel
• SetFaceLabelPipelet.bpel
• ValidateFacePipelet.bpel
• deploy.xml.peoplerecognition
from https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/HDemos/PeopleIdentification_QMUL/configuration and copy them under the /configuration directory of your SMILA installation
3. Open the deploy.xml file under the configuration folder and copy all its content into the deploy.xml file in the configuration\org.eclipse.smila.processing.bpel\pipelines directory of your SMILA installation

2.2.7.3 How to run the H-Demo
2.2.7.3.1 Web-based frontend (GUI)
Start the frontend service and REST API along with the backend service to put the People Identification framework in operation and make the web-based frontend (GUI) accessible:
>>> cd src
>>> python peoplerec.py
To show the graphical user interface, open an HTML browser and point it to: http://<server-addr>:8080/
Lastly, import some photos, then go to the Faces tab and label at least a few faces for each person to be recognized (do not forget to press Enter after entering each name). Any faces of newly imported photos will then be automatically matched against the already labelled face photos. The Photos and People tabs allow further browsing and grouping of the results, e.g. any social co-occurrences are shown in a popup when hovering over a person in the People tab. Press Ctrl-C to terminate the REST API service (you can keep the service running, but you might have to select a different default address/port for either it or SMILA to continue with the demonstrations in the next section).

2.2.7.3.2 SMILA interconnection
• Assuming SMILA is already installed and running (see section 1.6.2), first follow section 2.2.7 to install the People Identification framework, but instead of running 'python peoplerec.py' (the REST/JSON service) as mentioned in section 2.2.7.3.1, run 'python peoplerec_smila.py' to start the SOAP-API service instead.
• Given that you installed and configured the People Identification pipelets as



explained in section 2.2.7.2, you can now list and monitor all invoked SMILA pipelets using a REST client like Resty (https://github.com/micha/resty) or RESTClient (the RESTClient Firefox add-on at https://addons.mozilla.org/En-us/firefox/addon/restclient/).
• Execute a pipelet, for instance import a photo with a face; do not forget to adapt the local path, i.e. the photoURL parameter, reported in the POST request below:
POST http://<server-addr>:<PORT (default value is 8080)>/smila/pipeline/ImportPhotoPipeline/process/
{"_configuration": {
   "IN_PHOTO_URL": "photoURL",
   "IN_LABEL": "label",
   "OUT_FACE_ID": "faceID"},
 "photoURL": "/home/markusb/photos/merkel1.jpg",
 "label": "Merkel"}
The next examples demonstrate how to actually recognize a person based only on their face information:

Import another face photo of a different person:

POST http://<server-addr>:<PORT (default 8080)>/smila/pipeline/ImportPhotoPipeline/process/
{"_configuration": {
  "IN_PHOTO_URL": "photoURL",
  "IN_LABEL": "label",
  "OUT_FACE_ID": "faceID"},
 "photoURL": "/home/markusb/photos/sarkozy1.jpg",
 "label": "Sarkozy"}

Do the same for a third person:

POST http://<server-addr>:<PORT (default 8080)>/smila/pipeline/ImportPhotoPipeline/process/
{"_configuration": {
  "IN_PHOTO_URL": "photoURL",
  "IN_LABEL": "label",
  "OUT_FACE_ID": "faceID"},
 "photoURL": "/home/markusb/photos/cameron1.jpg",
 "label": "Cameron"}

Import another face photo of the same last person, but this time do not specify the label (by either not providing it or setting it to an empty string):

POST http://<server-addr>:<PORT (default 8080)>/smila/pipeline/ImportPhotoPipeline/process/
{"_configuration": {
  "IN_PHOTO_URL": "photoURL",
  "IN_LABEL": "label",
  "OUT_FACE_ID": "faceID"},
 "photoURL": "/home/markusb/photos/cameron3.jpg"}

Note that each invocation of the ImportPhotoPipeline returns in its result the face ID of the face in the imported photo, e.g. "faceID": "520cc8aea0231a0680bea78c".

Now retrieve the predicted label of the most recently imported face photo using the previously returned face ID:

POST http://<server-addr>:<PORT (default 8080)>/smila/pipeline/GetFaceLabelPipeline/process/
{"_configuration": {
  "IN_FACE_ID": "faceID",
  "OUT_LABEL": "label"},
 "faceID": "520cc8aea0231a0680bea78c"}

The returned label should be "Cameron".
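The pipeline calls above all share the same URL and body pattern, so they can be scripted. The sketch below is a minimal helper that builds the request URL and JSON body for a SMILA pipeline invocation; the helper name and the (commented-out) send step are illustrative, not part of the framework, and the base address must match your SMILA instance.

```python
import json
from urllib import request  # needed only when actually sending

SMILA = "http://localhost:8080"  # adjust to your <server-addr>:<PORT>

def pipeline_request(pipeline, configuration, record):
    """Build the URL and JSON body for a SMILA pipeline invocation."""
    url = "%s/smila/pipeline/%s/process/" % (SMILA, pipeline)
    body = {"_configuration": configuration}
    body.update(record)
    return url, json.dumps(body).encode("utf-8")

# Import a labelled face photo (path and label taken from the example above).
import_cfg = {"IN_PHOTO_URL": "photoURL", "IN_LABEL": "label",
              "OUT_FACE_ID": "faceID"}
url, body = pipeline_request("ImportPhotoPipeline", import_cfg,
                             {"photoURL": "/home/markusb/photos/merkel1.jpg",
                              "label": "Merkel"})

# Sending requires a running SMILA instance:
# resp = json.load(request.urlopen(request.Request(
#     url, data=body, headers={"Content-Type": "application/json"})))
# face_id = resp["faceID"]
```

The same helper covers GetFaceLabelPipeline, ValidateFacePipeline and SetFaceLabelPipeline by swapping the pipeline name, configuration and record fields.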



While the previous two pipelets demonstrate the core recognition functionality, the following two examples show how to further interact with the recognition results:

• Each time a face photo is imported without a face label, the People Identification framework predicts (guesses) the depicted person (given by its label) based on previously labelled photos of the same person. The recognition performance (overall as well as for a particular person) can be improved by directly providing feedback. If the prediction (as returned by GetFaceLabelPipeline) for an unlabelled face photo is right, one can validate the prediction as follows:

POST http://<server-addr>:<PORT (default 8080)>/smila/pipeline/ValidateFacePipeline/process/
{"_configuration": {
  "IN_FACE_ID": "faceID"},
 "faceID": "520cc8aea0231a0680bea78c"}

• Lastly, it is also possible to invalidate a prediction by providing the correct label, or to simply re-label the person associated with a particular face photo later:

POST http://<server-addr>:<PORT (default 8080)>/smila/pipeline/SetFaceLabelPipeline/process/
{"_configuration": {
  "IN_FACE_ID": "faceID",
  "IN_LABEL": "label"},
 "faceID": "520cc8aea0231a0680bea78c",
 "label": "David Cameron"}

2.2.8 How to exploit the H-demo source code

2.2.8.1 CUbRIK environment configuration (SMILA)
The People Identification H-Demo provides several SMILA pipelets in the package org.eclipse.smila.peoplerecognition.pipelets to expose the people identification functionality implemented in the backend to other SMILA components. However, the People Identification backend itself does not require or rely on other CUbRIK work. The People Identification pipelets are provided as a source code distribution (to be used with a source code SMILA development environment) and as a ready-to-use binary distribution (to be used with a packaged binary SMILA environment) with pre-configured default settings.

2.2.8.2 Configuration of the 3rd party library and components integrated in SMILA for CUbRIK R3
If using a packaged binary SMILA environment, it is enough to simply copy org.eclipse.smila.peoplerecognition.pipelets_1.2.0.jar (located within the smila directory) to the plugins directory of the SMILA environment. The pipelets package is pre-configured (it expects the People Identification SOAP-API to be accessible on localhost:8888) and will be invoked automatically. If testing under a source code SMILA development environment, or if wishing to debug or develop the People Identification pipelets (e.g. to alter the default SOAP-API address), the source code distribution org.eclipse.smila.peoplerecognition.pipelets simply needs to be imported as an existing project into the SMILA development environment. To invoke the pipelets automatically, the usual Auto-Start option under the OSGi framework (section SMILA) has to default to true for the pipelets package.



2.3 Media Entity Annotation relevance feedback
The Media Entity Annotation horizontal demo (H-Demo) demonstrates harvesting of representative images for named entities stored in the entity repository. The entity repository enhanced with the multimedia content can be used, e.g., to search for and to visualize named entities in search results. In the first release of the H-Demo (R1), a set of SMILA pipelets was developed for accessing the entity repository in order to create, read, and update entities, and for crawling multimedia social networks in order to fetch multimedia content (in particular images) related to named entities. A set of famous Italian monuments was used as the input data-set for this H-Demo. In total, the input data-set consisted of ~100 monuments located in different Italian cities such as Rome and Florence. The crawling components developed in the first release were completely automatic. In the R3 release, this H-Demo is extended with three components:

1. relevance feedback - a component for refining the results of automatic media crawling by creating crowd-sourcing tasks on the CrowdFlower crowd-sourcing platform13;

2. entity-centric search - a component for querying the entity repository using the Entity Query Language (EQL), a semantic entity-oriented query language similar to SQL;

3. a Europeana crawler that collects not only URLs of related images but also time information, such as the year the photo was taken, to enable time-related queries in the Entity repository.

This H-Demo for R3 includes two different kinds of CUbRIK pipelines: Query Execution pipelines and Feedback acquisition and processing pipelines. The former is covered by the CrowdFlower extension, while the latter is covered by both the Entity Query Language (EQL) extension and the new Europeana crawler. The Media Entity Annotation H-Demo is available in the official CUbRIK repository at https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/HDemos/MediaEntityAnnotation_UNITN. The repository contains three subfolders:
- configuration: contains all the files, such as the BPEL file, necessary for the CUbRIK pipeline
- JAR: contains the bundles for the Media Entity H-Demo and the related source code
- data: contains the CMLCode_Diversity.txt file necessary to create crowd-sourcing tasks on the CrowdFlower crowd-sourcing platform

2.3.1 H-Demo vs. V-Apps (Fashion and HoE)

Various components of the Media Entity Annotation H-Demo are re-used in the History of Europe (HoE) V-App. The Europeana crawler is used in the HoE V-App to build a demonstration dataset from the Europeana records. For the collection of the images we use as queries a set of 50 popular European politicians from the establishment of the European Union until today. Europeana returns a diverse set of metadata, since it links to several repositories with different data models. The crawler tries to extract time information from the metadata and to push it into the Entity repository to support time queries related to the extracted images. The Entitypedia integration component of the HoE V-App is built on top of the same Entitypedia CUbRIK Java client used in the Media Entity Annotation H-Demo. The Entitypedia CUbRIK Java client is used to perform CRUD operations on entities, attributes, and meta-attributes. Allowed types of entities are people, locations, events and organizations. Allowed meta-attributes include provenance, time validity, and accuracy. (Fuzzy) name search techniques are used to find candidate people entities for provided names. EQL is used to perform more complex searches. For instance, it is possible to search for images of people with a given name which were uploaded after a given date.

13 http://crowdflower.com/



2.3.2 H-Demo vs CUbRIK pipeline(s)

The CUbRIK pipelines for the media entity annotation H-Demo are shown in Figure 18.

Figure 18: SMILA pipelines for Media Entity Annotation

First, the monument entity is searched for by a given name using the entity search component. Then the media crawler components are used to harvest images relevant to the monument. Harvested images with associated metadata are stored back into the entity repository. A crowd-sourcing task for the refinement of automatically harvested representative images is generated and submitted to the CrowdFlower crowd-sourcing platform. After relevance judgments are generated by users of the crowd-sourcing platform, CrowdFlower sends a notification. The media entity annotation pipeline resumes by collecting the judgments from CrowdFlower. Relevant entities are retrieved from the entity repository. The relevance judgments from CrowdFlower are used to modify/correct the entities. Finally, the updated entities are stored back to the entity repository.

The Entity Model (Figure 19) is used in both the Media Entity Annotation H-Demo and the HoE V-App. In the Entity Model, every attribute of every entity can be associated with meta-data information, which includes the provenance, accuracy, and time validity of the attribute. The Entity Repository supports keyword-, time-, and EQL-based queries. Name search is an example of a keyword-based query. (Exact) name search is used in the Media Entity Annotation H-Demo to find monument entities. (Fuzzy) name search is used to find candidate people entities in the HoE V-App. Time attributes of entities allow filtering of search results by time intervals. For instance, images uploaded after a specific date can be retrieved. EQL queries are used when relationships between multiple entities should be exploited; e.g., it is possible to find people who co-appear in the same image. The relations between Tag, Image, and Person entities are exploited in this case. A more detailed description of the automatic media harvesting and crowd-sourcing techniques used in the Media Entity Annotation H-Demo is available in D4.1 and D7.1.
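The first phase of the flow described above (entity search, media crawling, storage, crowd-sourcing task submission) can be sketched as plain functions. All class and method names below are illustrative stand-ins for the real SMILA pipelets; only the order of the steps comes from the text.

```python
# Illustrative stand-ins for the components named in Figure 18;
# none of these classes exist in the real code base.
class FakeRepo:
    def __init__(self):
        self.entities = {"San Petronio Basilica": {"name": "San Petronio Basilica"}}
    def find(self, name):
        return self.entities[name]
    def store(self, entity, images):
        entity["images"] = images

class FakeCrawler:
    def harvest(self, entity):
        return ["img1.jpg", "img2.jpg"]

class FakeCrowd:
    def create_job(self, entity, images):
        return "JOB_42"

def media_annotation_flow(name, repo, crawler, crowd):
    """First phase of the Media Entity Annotation pipeline."""
    entity = repo.find(name)                 # entity search by name
    images = crawler.harvest(entity)         # harvest candidate images
    repo.store(entity, images)               # store images with the entity
    return crowd.create_job(entity, images)  # submit refinement job, get job ID
```

The second phase (collecting judgments and updating entities) runs asynchronously once CrowdFlower notifies that judgments are available.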



Figure 19: Entity Model for Media Entity Annotation H-Demo and HoE V-App



2.3.3 Data set description

Input: A set of famous Italian monuments with metadata information was expert-generated. In total, experts collected a set of 94 monuments located in different Italian cities such as Rome, Florence, and Milan. Entities related to the cities or monuments were also collected.

Output: Using different attribute values of the 94 monuments, a data set of about 25,000 images (up to 100 images from each of 3 sources for ~100 monuments) was generated. Images were queried from Panoramio, Picasa and Flickr.

When the H-Demo is used for the History of Europe V-App, the dataset focuses on Europeana data and on the case of European politicians. In this case:
Input: a set of 50 well-known European politicians.
Output: a list of Europeana records that contain image URLs and time information for the depicted event, as well as other available metadata.
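As a sanity check (not project data), the output size quoted above is consistent with the stated bounds: up to 100 images from each of 3 sources for each of the 94 monuments gives an upper bound above the ~25,000 images actually collected.

```python
monuments = 94           # monuments in the expert-generated input set
sources = 3              # Panoramio, Picasa, Flickr
images_per_source = 100  # upper bound of images per query

upper_bound = monuments * sources * images_per_source
print(upper_bound)  # 28200, above the ~25,000 images actually collected
```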

2.3.4 Architecture overview

The media entity annotation H-Demo is split into two workflows, both belonging to the CUbRIK relevance feedback pipeline:
• CrowdFlowerMediaAnnotation
• CrowdFlowerFeedback

The CrowdFlowerMediaAnnotation workflow aims at crawling images and updating entities with them, and then having them manually validated by the crowd on the CrowdFlower platform. The CrowdFlowerFeedback workflow downloads the human validations from CrowdFlower and updates the entities according to the validation results. In order to create and process crowd-sourcing jobs on CrowdFlower, the following pipelets were implemented in the CrowdFlowerMediaAnnotation pipeline:

1. CFTaskGeneratorPipelet
2. CFJobCreationPipelet
3. CFJudgementCollectorPipelet

CFTaskGeneratorPipelet takes a monument entity as input, extracts the attributes needed for task generation, and outputs the generated CrowdFlower tasks represented as SMILA records. CFJobCreationPipelet takes the generated task and a set of attributes required by CrowdFlower as input, creates a CrowdFlowerJob object and sends it to the CrowdFlower server. As output, the pipelet returns the ID of the job created on the CrowdFlower server. The generated tasks are provided by the CFTaskGenerator pipelet. The set of attributes that require manual input is the following: API key, job title, job instructions, and path to the CML file. The API key is needed for authentication purposes and is provided by CrowdFlower after registration on the site. Job title and job instructions are required attributes for job creation. The CML file is a precompiled file that defines the forms used to collect judgements. The file is provided with the CrowdFlowerMediaAnnotation pipeline; however, the exact path to the CML file needs to be provided manually. The CrowdFlowerFeedback pipeline contains two pipelets:

1. CFJudgementCollectorPipelet
2. EntityUpdateByJudgmentPipelet

CFJudgementCollectorPipelet aims at downloading the collected judgements from CrowdFlower as soon as they are ready. Moreover, this pipelet processes the results obtained from CrowdFlower. First, it checks the confidence value: if it is higher than 0.67, the results are considered to be valid. Second, it processes the judgments and extracts information about the images, namely whether or not they are representative for each monument. Finally, the pipelet forms an output from this information. EntityUpdateByJudgmentPipelet updates entities in Entitypedia according to the collected judgments extracted by the previous pipelet.
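The judgment-processing step of CFJudgementCollectorPipelet can be pictured as follows. The data layout and function name are assumptions for illustration; only the 0.67 confidence threshold comes from the text.

```python
CONFIDENCE_THRESHOLD = 0.67  # results above this value are considered valid

def process_judgments(judgments):
    """Keep only judgments whose crowd confidence exceeds the threshold
    and reduce them to (image, is_representative) pairs.

    `judgments` is a hypothetical simplification of what CrowdFlower
    returns: dicts with 'image', 'confidence' and 'representative' keys.
    """
    results = []
    for j in judgments:
        if j["confidence"] > CONFIDENCE_THRESHOLD:
            results.append((j["image"], bool(j["representative"])))
    return results

sample = [
    {"image": "img1.jpg", "confidence": 0.9, "representative": True},
    {"image": "img2.jpg", "confidence": 0.4, "representative": False},  # discarded
]
```

EntityUpdateByJudgmentPipelet would then consume the (image, is_representative) pairs to update the corresponding entities in Entitypedia.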

2.3.5 Third party library

In addition to the libraries which were used in R1, the following third-party libraries are required:
• json-simple-1.1.1.jar – a simple Java toolkit used to encode and decode JSON (https://code.google.com/p/jsonsimple/downloads/detail?name=json-simple-1.1.1.jar)
• httpclient-4.1.2.jar and httpcore-4.1.2.jar – components of the Apache HttpComponents Java toolset focused on working with the HTTP protocol (http://hc.apache.org/downloads.cgi)
• jsoup-1.6.3.jar – a Java HTML parser library

2.3.6 Components integrated in CUbRIK R3

The CrowdFlower API provides developers with rich functionality for automating the process of job creation and judgement collection. Since CrowdFlower exposes a RESTful API, a middle layer between SMILA and the CrowdFlower API is required. To this end, a Java wrapper for the CrowdFlower API was developed. CFlientAPI is the wrapper that replicates all the functionality provided by the CrowdFlower API. It is released as part of the overall Media Entity installation.

2.3.7 How to install the H-Demo bundle

The Media Entity H-Demo is provided as a SMILA bundle. The following steps describe how to proceed with its installation and configuration:

1. Download both the cubrikproject.entitypedia.client_1.0.0.jar and cubrikproject.entitypedia.crawler_1.0.0.jar files from https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/HDemos/MediaEntityAnnotation_UNITN/JAR and copy them under the /plugins directory of your SMILA installation
2. Download the contents of the https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/HDemos/MediaEntityAnnotation_UNITN/configuration folder into the configuration directory of your SMILA installation
3. In case you have an installation of SMILA with no other features installed, you only need to rename deploy.xml.mediaentity in order to overwrite the original SMILA one; otherwise, open the deploy.xml.mediaentity file that is under the configuration folder you downloaded and copy the content between <!-- START MediaEntityAnnotation H-Demo Specific configuration --> and <!-- END MediaEntityAnnotation H-Demo Specific configuration --> into the deploy.xml file in the configuration\org.eclipse.smila.processing.bpel\pipelines directory of your SMILA installation
4. Edit the process.properties file that is under the configuration/org.eclipse.smila.processing.bpel folder of your SMILA installation and change the value of the parameter pipeline.timeout to 1200.

2.3.7.1 How to run the H-Demo
The Media Entity H-Demo can be tested using a REST client such as Resty14, the RESTClient Firefox add-on15 or the Chrome Simple REST Client16. In the following description it is assumed that the H-Demo is tested on the same machine hosting it; otherwise, remember to replace "localhost" with the name or the IP of the server running the H-Demo.

14 https://github.com/micha/resty
15 https://addons.mozilla.org/En-us/firefox/addon/restclient/
16 https://chrome.google.com/webstore/detail/simple-rest-client/fhjcajmcbmldlhcimfajhfbgofnpcjmb



In order to test the CrowdFlower interaction demo you should also create a CrowdFlower account and then get your API access key, which is needed to access the service17. Moreover, ensure that the server where you install the demo can reach the Entitypedia API server on port 8082 (check this, for example, in case of a proxy or firewall). To test it, try to open the URL http://api.entitypedia.org:8082 with a web browser. Once the Media Entity H-Demo is installed, run the SMILA.exe file in your SMILA root folder. The software will start loading the bundles. Please wait until you see something like the screenshot below:

Figure 20: SMILA console screenshot

Now the server is ready to interact with you. The monument name query "San Petronio Basilica" will be used for retrieving the corresponding monument entity from the Entity Repository. A crawler will be started for retrieving related image entities. Images will be connected to the monument entity and the Entity Repository will be updated. NOTE: 'query' can only be one of the English names of Italian monuments. Each test takes a few minutes, depending on the speed of your Internet connection. The figure below reports an example of Chrome Simple REST Client usage.

Figure 21: Example of Chrome Simple REST Client usage

17 https://new-auth.crowdflower.com/registrations/new?redirect_url=https://crowdflower.com/jobs



The following is a list of REST calls that allow you to exploit the H-Demo functionalities:

1. See the entity in Entitypedia
   METHOD: POST
   URL: http://localhost:8080/smila/pipeline/MediaEntityAnnotation/process/
   BODY: {"query" : "San Petronio Basilica"}
   RESPONSE: the JSON representation of the entity

2. Annotate the entity using images from:

   a. Flickr
      METHOD: POST
      URL: http://localhost:8080/smila/pipeline/MediaEntityAnnotationFlickr/process/
      BODY: {"query" : "San Petronio Basilica"}
      RESPONSE: the JSON representation of the entity

   b. Twitter
      METHOD: POST
      URL: http://localhost:8080/smila/pipeline/MediaEntityAnnotationTwitter/process/
      BODY: {"query" : "San Petronio Basilica"}
      RESPONSE: the JSON representation of the entity

   c. Panoramio
      METHOD: POST
      URL: http://localhost:8080/smila/pipeline/MediaEntityAnnotationPanoramio/process/
      BODY: {"query" : "San Petronio Basilica"}
      RESPONSE: the JSON representation of the entity

   d. Picasa
      METHOD: POST
      URL: http://localhost:8080/smila/pipeline/MediaEntityAnnotationPicasa/process/
      BODY: {"query" : "San Petronio Basilica"}
      RESPONSE: the JSON representation of the entity

3. Get a set of related image URLs and time metadata (as Events related to the extracted image entity) about a European politician
   METHOD: POST
   URL: http://localhost:8080/smila/pipeline/MediaEntityAnnotationEuropeana/process/
   BODY: {"query" : "Mario Draghi"}
   RESPONSE: the JSON representation of the entity

4. Query Entitypedia for a set of entities using EQL (Entity Query Language)
   METHOD: POST
   URL: http://localhost:8080/smila/pipeline/EntitySearch/process/
   BODY: {"query" : "select i, e.start from tag t join t.computer_file i join t.entity e where e.start < '1999-12-31 00:00:00'"}
   RESPONSE: the JSON representation of the entities

5. Interact with the CrowdFlower platform:

   a. Create a crowd application for resolving any issue about image and entity associations (please use your own key instead of xxx for the CFkey attribute)
      METHOD: POST
      URL: http://localhost:8080/smila/pipeline/CrowdFlowerMediaAnnotation/process/
      BODY: {"query":"San Petronio Basilica", "CFkey":"xxx", "title":"Select different images of italian Monuments for photo collection", "instruction":"Select different images of italian Monuments for photo collection", "cmlPath":"CMLCode_Diversity.txt"}
      RESPONSE: {"CFkey":"xxx", "title":"...", "instruction":"...", "cmlPath":"...", "_recordid":"...", "job_id_status":"JOB_ID"}
      JOB_ID is the ID of the job created in CrowdFlower.

   b. Retrieve the results of the CrowdFlower job (please use your own key instead of xxx for the CFkey attribute, and the CrowdFlower JOB_ID corresponding to the previously created job). NOTE: this feature asks CrowdFlower to give back the results of the user interactions; if no CrowdFlower users performed the job, no results will be returned.
      METHOD: POST
      URL: http://localhost:8080/smila/pipeline/CrowdFlowerFeedback/process/
      BODY: {"CFkey":"xxx", "jobID":"JOB_ID"}
      RESPONSE: {"CFkey":"xxx", "jobID":"JOB_ID", "_recordid":"…", "judgments":"[]", "tags":"0"}
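The two CrowdFlower calls (5a and 5b) form a request/response chain: the job_id_status returned by CrowdFlowerMediaAnnotation is fed into CrowdFlowerFeedback as jobID. The sketch below only builds the request bodies and URLs from the call list above; the helper names are illustrative, and actually sending them requires the running SMILA instance and a real CrowdFlower key.

```python
import json

BASE = "http://localhost:8080/smila/pipeline/%s/process/"

def annotation_job_body(query, cf_key, cml_path="CMLCode_Diversity.txt"):
    # Body for CrowdFlowerMediaAnnotation (call 5a); title/instruction are
    # the example strings from the call list above.
    return {"query": query, "CFkey": cf_key,
            "title": "Select different images of italian Monuments for photo collection",
            "instruction": "Select different images of italian Monuments for photo collection",
            "cmlPath": cml_path}

def feedback_body(cf_key, job_id):
    # Body for CrowdFlowerFeedback (call 5b), using the job ID returned by 5a.
    return {"CFkey": cf_key, "jobID": job_id}

create_url = BASE % "CrowdFlowerMediaAnnotation"
payload = json.dumps(annotation_job_body("San Petronio Basilica", "xxx"))

# With a running SMILA instance one would then POST `payload` to `create_url`,
# read "job_id_status" from the response, and POST feedback_body(...) to
# BASE % "CrowdFlowerFeedback" once the crowd has worked on the job.
```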



2.3.8 How to exploit the H-demo source code

Assuming that the SMILA environment installation was properly performed, if the Media Entity H-Demo R1 was already installed, only the following items need to be updated from SVN:

• the cubrikproject.entitypedia.client package, which is embedded in the corresponding JAR file
• CrowdFlowerMediaAnnotation.bpel
• deploy.xml.mediaentity, which has to be renamed to deploy.xml

Otherwise, you need to download all the content of https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/HDemos/MediaEntityAnnotation_UNITN. Section 2.3.4 (Architecture overview) gives an overview of the pipelets and pipelines constituting this H-Demo.

2.4 Like Lines
User interactions with multimedia items in a collection can be used to improve their representation within the information retrieval system's index. The LikeLines H-Demo collects users' interactions to identify the most interesting/relevant fragments in a video. The collected user interaction can be implicit (e.g., play, pause, rewind) or explicit (i.e., explicitly liking particular time points in the video). LikeLines is available in the official CUbRIK repository at https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/H-Demos/LikeLines_TUD. The repository contains three subfolders:
- configuration: contains all the files, such as the BPEL file, necessary for the CUbRIK pipeline
- framework: contains the framework part of the LikeLines H-Demo
- JAR: contains the bundle for the LikeLines H-Demo

2.4.1 H-Demo vs. V-Apps (Fashion and HoE)

LikeLines can be used in contexts which benefit from spontaneous and implicit crowd-based feedback, as in the case of trend analysis of people's preferences in the SME Fashion V-App. For example, users' playback behaviour can signal which fragments of a fashion-related video contain the most interesting or appealing clothing items. LikeLines allows extraction of the individual keyframes containing these clothing items, which can be passed to the SME Fashion App for further processing.

2.4.2 H-Demo vs CUbRIK pipeline(s)

The LikeLines H-Demo provides the components needed for it to be used in a relevance feedback pipeline:
1. An enhanced JavaScript video player component to collect implicit and explicit user feedback.
2. A back-end server component with which the JavaScript component communicates for storing the collected user feedback.
3. A SMILA pipelet for communicating with a LikeLines back-end server and computing and extracting the top N key-frames for a queried YouTube movie.



Figure 22: Human users (top left) interact with the LikeLines player component (center). These interactions serve as implicit and explicit feedback and are stored on a server component (right).

2.4.3 Data set description

The timecode-aware video dataset (developed in WP7) was used for testing the H-Demo. However, the LikeLines H-Demo can make use of virtually any set of YouTube videos: the application allows specifying the YouTube video URI to be played.

2.4.4 Architecture overview

The LikeLines H-Demo asynchronously aggregates both human activities and automatic computations. Human activities take place outside of CUbRIK (Figure 22). Videos (Figure 23, “Video”) embedded on websites using the LikeLines player component (Figure 23, “LikeLines Video Player”) benefit from receiving implicit and explicit feedback from users (Figure 23, “Interactions”). This feedback is stored on the server (Figure 23, “LikeLines Server”).



Figure 23: LikeLines pipelet (bottom) providing a bridge to the LikeLines server (top)

In order to be able to use the feedback in automatic pipelines, the LikeLines H-Demo provides a SMILA pipelet (Figure 23, "LikeLines") that acts as a bridge to the LikeLines server. This pipelet can be embedded in a SMILA pipeline using BPEL to annotate SMILA records with user feedback metadata (the N most interesting keyframes). The pipelet can also launch automatic indexing jobs depending on the amount of user feedback stored on the server. For example, if insufficient user feedback has been received for a video, the pipelet can decide to launch a content analysis job. When such a job is completed, the results of the analysis (Figure 23, "MCA") are uploaded to the LikeLines server.

2.4.5 Third party library

The LikeLines player relies on the following third-party JavaScript library:
• jQuery >= 1.7.2

The LikeLines server relies on the following third-party Python libraries:
• Flask
• PyMongo
• Flask-PyMongo

The LikeLines server requires a MongoDB server in order to run.

The LikeLines SMILA pipelet requires the following Java library:
• Google GSON (https://code.google.com/p/google-gson/)



2.4.6 Components integrated in CUbRIK R3

The LikeLines H-Demo provides a SMILA pipelet leveraging a wrapper: the Implicit Feedback Filter LIKELINES. The pipelet can be used in a SMILA pipeline to query a LikeLines server in order to compute the top N key-frames in a video.

2.4.7 How to install the H-Demo bundle

As reported in section 2.4.4, the LikeLines H-Demo is composed of:
- the LikeLines server
- the LikeLines player
- a SMILA pipelet that acts as a bridge to the LikeLines server

Regarding the LikeLines server, as anticipated in section 2.4.5, it requires the installation of some third-party libraries, namely the following packages:
• Flask
• PyMongo
• Flask-PyMongo

The simplest way of installing these packages is using pip. You can install pip by first installing easy_install by following the instructions listed on this page. You can then execute the following command in a terminal to obtain pip:

$ easy_install pip

The required Python packages can then be installed as follows:

$ pip install Flask
$ pip install PyMongo
$ pip install Flask-PyMongo

2.4.7.1 Port Configuration
The web-based frontend (GUI) uses port 8080 by default; if needed, a different port can be configured by running the following command:

$ cd likelines_source
$ python -m SimpleHTTPServer PORT

For example, a different port must be specified if you want to run the demo on the same machine that is running SMILA on the same port.

2.4.7.2 CUbRIK deployment configuration
The LikeLines H-Demo is provided as a SMILA bundle, so a pipeline is already provided. The following steps describe how to proceed with its installation and configuration:

1. Download the cubrikproject.tud.likelines_1.0.0.201309111631.jar file from https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/H-Demos/LikeLines_TUD/JAR and copy it under the /plugins directory of your SMILA installation
2. Download the following files:
   • LikeLinesPipeline.bpel
   • deploy.xml.Likelines
   from https://cubrikfactory.eng.it/svn/CUbRIK/PACKAGES/R3/H-Demos/LikeLines_TUD/configuration and copy them under the /configuration directory of your SMILA installation
3. Open the deploy.xml.Likelines file that is under the configuration folder you downloaded and copy all its content into the deploy.xml file in the configuration\org.eclipse.smila.processing.bpel\pipelines directory of your SMILA installation
4. (optional) In case the LikeLines server and the SMILA bundle are not running on the same machine, open the LikeLinesPipeline.bpel file and modify the LikeLines server IP address as follows:

<extensionActivity>
  <proc:invokePipelet name="invokeLikeLines">
    <proc:pipelet class="cubrikproject.tud.likelines.pipelets.LikeLines" />
    <proc:variables input="request" output="request" />
    <proc:configuration>
      <rec:Val key="server">http://ip-server-likeline:9090/</rec:Val>
      <rec:Val key="input_field">youtube_id</rec:Val>
      <rec:Val key="n">5</rec:Val>
      <rec:Val key="output_field">topkeyframes</rec:Val>
    </proc:configuration>
  </proc:invokePipelet>
</extensionActivity>

2.4.7.3 How to run the H-Demo

2.4.7.3.1 Running the server
This section assumes you have downloaded the full LikeLines source code from https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/H-Demos/LikeLines_TUD/framework. Once it is downloaded and unpacked to a directory, the following two processes need to be started.

The first process is a MongoDB server on the default port. You can start the MongoDB server by simply executing mongod in a terminal:

$ mongod

The second process is the actual LikeLines backend server that will receive requests to store and aggregate user playback behaviour. To start this process, go into the server subdirectory and run the LikeLines.server Python module. The example below shows how to run the LikeLines server on port 9090:

$ cd likelines_source/server
$ python -m LikeLines.server -p 9090

2.4.7.3.2 Running the Web-based frontend (GUI)
The following items are required:
• An HTML5-compatible browser supporting the Canvas element and JavaScript.
• Internet access (for the YouTube API and the jQuery library).
• The LikeLines server running on the local machine (see the section above).

The demo also requires a web server that will serve examples/demo.html. Note that you cannot simply open the web page locally (the browser would simply refuse to execute JavaScript in a local context). A simple way of hosting the demo example is to use Python's built-in HTTP server:

$ cd likelines_source
$ python -m SimpleHTTPServer 8080



Assuming the LikeLines server is already running on the same machine, the demo can be started by pointing your HTML5-compatible browser to http://localhost:8080/examples/demo.html

2.4.7.3.3 SMILA interconnection
Assuming SMILA is already installed and running (see section 1.6.2), and both the LikeLines server and frontend are running, issue the following POST request:

POST http://IP_SERVICE_SMILA:PORT/smila/pipeline/LikeLinesPipeline/process
Body:
{
  "Text": "Some record",
  "youtube_id": "YouTube:kYtGl1dX5qI"
}

where youtube_id is the value you can obtain from the web-based frontend (GUI) of LikeLines: right-click on the displayed video.
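The LikeLines pipeline call above can likewise be scripted. The sketch below builds the record from the example body; the helper names are illustrative, and posting assumes SMILA is reachable at the given address.

```python
import json
from urllib import request

def likelines_record(youtube_id, text="Some record"):
    # Record for the LikeLinesPipeline; youtube_id uses the
    # "YouTube:<video-id>" form shown in the example body.
    return {"Text": text, "youtube_id": "YouTube:%s" % youtube_id}

def post_record(smila_base, record):
    # smila_base e.g. "http://localhost:8080"; requires a running SMILA
    # instance with the LikeLines pipeline deployed.
    req = request.Request(
        smila_base + "/smila/pipeline/LikeLinesPipeline/process",
        data=json.dumps(record).encode("utf-8"),
        headers={"Content-Type": "application/json"})
    return json.load(request.urlopen(req))

rec = likelines_record("kYtGl1dX5qI")
# result = post_record("http://localhost:8080", rec)  # with SMILA running
```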

2.4.8 How to exploit the H-demo source code

2.4.8.1 CUbRIK environment configuration (SMILA)
In order to use the LikeLines SMILA pipelet, it needs to be embedded in a pipeline. Below is an example of how to configure the pipelet:

<extensionActivity>
  <proc:invokePipelet name="invokeLikeLines">
    <proc:pipelet class="cubrikproject.tud.likelines.pipelets.LikeLines" />
    <proc:variables input="request" output="request" />
    <proc:configuration>
      <rec:Val key="server">http://likelines-shinnonoir.dotcloud.com</rec:Val>
      <rec:Val key="input_field">youtube_id</rec:Val>
      <rec:Val key="n">1</rec:Val>
      <rec:Val key="output_field">topkeyframes</rec:Val>
    </proc:configuration>
  </proc:invokePipelet>
</extensionActivity>

2.4.8.2 Configuration of the 3rd party library and components integrated in SMILA for CUbRIK R3

The LikeLines SMILA pipelet requires the Google GSON library, available at https://code.google.com/p/google-gson/. The GSON JAR file needs to be added to the SMILA.extensions/eclipse/plugins directory and to the SMILA run/debug configuration when running SMILA.

2.5 Crosswords

The crosswords horizontal demo (H-Demo) demonstrates a web application that aims to refine content for named entities based on user feedback. The web application is a game framework with a classical crossword game implemented on top of it and adapted to the requirements of the project. The game framework allows a developer to create new word games based on the framework infrastructure, and provides the following functionality: user registration, authentication, authorization and profiling. The Crosswords game uses the existing game framework user account database so that players have a continuous experience across the framework games and can exploit framework services. In the current version of the Crossword game, its content is taken from the CUbRIK entity repository, from a snapshot of entities requiring metadata improvement. The possibility of correction is built into the game at two stages: crossword creation and crossword solving.

2.5.1 H-Demo vs V-Apps (Fashion and HoE)

This H-demo is not exploited in a V-App. As anticipated, H-Demos have a twofold goal: to provide features for the V-Apps and to provide pipelines (workflow fragments) to be reused by platform adopters in specific vertical domains. The Crossword H-Demo implements the pipeline for using this GWAP to improve the quality of an entity dataset; in this case the dataset has to be stored in the CUbRIK entity repository.

2.5.2 H-Demo vs CUbRIK pipeline(s)

This H-demo provides a CUbRIK pipeline for relevance feedback; more details are reported in Section 2.5.5.

2.5.3 Data set description

The CUbRIK entity repository is implemented by leveraging Entitypedia. Entitypedia is an entity-centric knowledge base built from well-known sources: GeoNames18, YAGO19, WordNet20, GeoWordNet21 and MultiWordNet22. GeoNames, WordNet and GeoWordNet were imported completely, whereas YAGO and MultiWordNet were imported only partially. From GeoNames, nearly 7 million location entities and 350 of their classes were taken; from YAGO, 1.5 million entities of the types location, person and organization. WordNet and MultiWordNet are the sources of English and Italian linguistic knowledge, respectively; in both of them, concepts are connected to each other by is-a, part-of or equivalent-to semantic relations. The goal of Entitypedia is to model world knowledge in a way more suitable for locating, accessing and enhancing it, and for ensuring its effective use and reuse.

2.5.4 Architecture overview

The crosswords game is implemented as a J2EE application. It has four major components: the Web Controller, the Crossword User Interface, the Crossword API provided by the API Controller, and the Crossword DB. The game interacts with the Entitypedia Web API to access Entitypedia data, and with the Game Framework API to access the Framework data and use its services. Figure 24 shows the system architecture and the communication between components.

2.5.4.1 Controller

All user requests are handled by the Web and API controllers. Once a controller receives a request from a user to perform a task, it interacts with the Crossword Web API (which may in turn use the Entitypedia API) to perform the task. It also interacts with the UI to show the expected view to the user; the UI likewise interacts with it (via HTTP and AJAX) to perform operations.

18 Marc Wick and Bernard Vatant. The GeoNames geographical database. Available from World Wide Web: http://geonames.org
19 Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. Yago: a core of semantic knowledge. In Proceedings of the 16th International Conference on World Wide Web, pages 697–706. ACM, 2007.
20 Christiane Fellbaum. WordNet. Theory and Applications of Ontology: Computer Applications, pages 231–243, 2010.
21 Fausto Giunchiglia, Vincenzo Maltese, Feroz Farazi, and Biswanath Dutta. GeoWordNet: a resource for geo-spatial applications. The Semantic Web: Research and Applications, pages 121–136, 2010.
22 Alessandro Artale, Bernardo Magnini, and Carlo Strapparava. WordNet for Italian and its use for lexical discrimination. AI*IA 97: Advances in Artificial Intelligence, pages 346–356, 1997.

Figure 24: Crossword application system architecture

2.5.5 Crosswords workflow for the Feedback Acquisition and Processing

The main aim of the Crossword game is to improve the content of the CUbRIK entity repository. This is done via user feedback collected by the framework. The Crossword game gives a player two opportunities to correct the data: during crossword creation and during crossword solving. Figure 25 represents the use case in which a user, during the crossword-solving activity, complains about the correctness of a clue. This can be done by clicking the exclamation sign beside the highlighted clue; in the dialog box that appears, the user can write corrections regarding the clue. This information is aggregated, weighted and applied to the entity repository. To attract a wider audience of players, the Crosswords game also allows manual crossword editing and provides assistance in this process. The assistance consists of suggestions for the currently selected slot of the grid. The content for the suggestions comes from the CUbRIK entity repository, and this step can also be used to gather player feedback, in a way similar to the feedback gathered while solving a crossword. Both feedback functionalities are in development and will be part of the next release.

Figure 25: Crosswords gameplay page

2.5.6 How to install the H-Demo

Installation of Crosswords requires the following prerequisites:
- JDK v1.6u25 or later, available at http://www.oracle.com/technetwork/java/javasebusiness/downloads/java-archivedownloads-javase6-419409.html#jdk-6u45-oth-JPR
- PostgreSQL 8 or later, available at http://www.postgresql.org/download/
- Apache Tomcat v7, available at http://tomcat.apache.org/download-70.cgi, with Servlet 3.0 support and JSTL extensions

In order to install the Crosswords H-Demo, carry out the following steps:

2.5.6.1 Data base installation and preparation

1. Download and install PostgreSQL. We suggest doing this through the pgAdmin application, which you can find at http://www.pgadmin.org/
2. In PostgreSQL, create the "crosswords" user with password "1111" (see also models/db/000.sql.txt).
3. On Windows, edit the file models/db/recreate-db.cmd and set the variable PSQL to the path of your psql.exe (generally something like "c:\Program Files\PostgreSQL\9.3\bin\psql.exe").
4. Run the recreate-db command (recreate-db.cmd on Windows, recreate-db.sh otherwise) found inside the models/db/ folder and insert the password when prompted. It will (re)create the database with the required structure and the test data.
5. In PostgreSQL, create the "ui_crosswords" database owned by the "crosswords" user.
6. In PostgreSQL, create the schema and populate the database. Execute the s001.sql file from models/db with the following command: "{path to postgres folder}\PostgreSQL\{version}\bin\psql.exe" -d postgres -U postgres -f s001.sql, then execute test-data.sql, or use the models/db/recreate-db.cmd script. The users in the database should be synchronized with the users on the Crosswords API server.

2.5.6.2 Web Application server installation and configuration

1. Download and install Apache Tomcat v7.
2. Complete the Tomcat installation by downloading some required JARs into its "lib" folder. You can download them from the following pages:
   http://mvnrepository.com/artifact/javax.servlet/javax.servlet-api/3.0.1
   http://mvnrepository.com/artifact/org.glassfish.web/javax.servlet.jsp.jstl/1.2.1
   http://mvnrepository.com/artifact/javax.servlet.jsp.jstl/javax.servlet.jsp.jstlapi/1.2.1
3. Check out into the Apache Tomcat "conf" folder the content from the CUbRIK svn at https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/HDemos/Crosswords_UNITN/certificates/
4. Edit the Apache Tomcat conf/server.xml: deactivate the APR library loader by commenting out or deleting the line
   <Listener className="org.apache.catalina.core.AprLifecycleListener" SSLEngine="on" />
   and enable the SSL connector as follows:
   <Connector port="8443" protocol="HTTP/1.1" SSLEnabled="true" maxThreads="150" scheme="https" secure="true" clientAuth="false" sslProtocol="TLS" keystoreFile="conf/server.jks" keystorePass="password"/>
5. Save server.xml.
6. Check out into the "webapps" folder of your Apache Tomcat installation the crosswords-ui.war file from the CUbRIK svn at https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/HDemos/Crosswords_UNITN/webapp/

2.5.6.3 How to run the h-demo

In order to run the application:
1. Start the Apache Tomcat server using the bin/startup command.
2. Open the URL http://localhost:8080/crosswords-ui/ in your browser.
3. When prompted, accept the certificate even if your browser says the URL could be unsafe (this is simply because your server is not the machine for which the certificate was issued).
4. When prompted to log in, use "pgf2" as user and "user" as password.
5. Enjoy your crosswords game.

2.5.7 How to exploit the H-demo source code

This H-Demo is released according to the architecture diagram depicted in Figure 24. In the src directory of the repository (https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/H-Demos/Crosswords_UNITN/src) the following directories are present:
- cubrikui: the user interface
- entitypedia-crosswords-client-1.0.8: crossword application
- entitypedia-crosswords-common-1.0.7: crossword application
- entitypedia-games-client-1.0.9:
- entitypedia-games-common-1.4.11: game framework
- entitypedia-games-common-model-1.0.9: game framework
- entitypedia-games-parent-8: game framework

Crossword exploits some external features; in detail, the APIs of external frameworks are used. For more information about the Crosswords APIs, the contact point is Fausto Giunchiglia, Università di Trento, fausto@disi.unitn.it.

2.6 Accessibility aware Relevance feedback

The Accessibility aware Relevance feedback H-Demo demonstrates the capability to automatically re-rank the multimedia data retrieved as query results by the search engine. The goal is to promote those query results that are most accessible for the user. For this purpose, the H-Demo is split into four independent sub-tasks, described below.

Initially, the H-Demo uses a test to create a profile for each user, which is then maintained and updated as needed, following the progression of the user's possible impairments. This personalized profile is inferred via an indirect Q/A test during registration; it initially contains some approximate values regarding the user's impairments.

Another important task of the H-Demo is the accessibility-related dataset Annotator. The accessibility component analyses the multimedia objects contained in the QDB (currently only image analysis is supported) and enhances their metadata after extracting the appropriate accessibility-related attributes (e.g. colour, contrast). This task is meant to run in the background (i.e. offline) and enhance the metadata of the multimedia objects with accessibility-related features, first by extracting them and then by relating them to specific impairments.

In a regular search engine, once the user submits a query, the most relevant results are returned, usually in descending order of relevance. Here, the metadata are instead used to re-rank the images presented to the user, by combining both their relevance to the submitted query and their accessibility level according to the user's profile. The re-ranking process thus produces a set of Pareto-optimal result rankings, from which one is automatically selected as optimal for the specific user profile.

However, the user has the chance to intervene in the selection of the re-ranked result set (i.e. relevance feedback). Specifically, this task includes methods that allow the user to provide feedback concerning the accessibility of the presented results. This feedback is used to fine-tune the user impairment profile, so that the new profile is closer to the user's actual impairments and subsequent re-rankings are more accurate. The accessibility-related feedback is implemented in two separate ways: by providing feedback about the accessibility of each presented result, or by manually selecting among the available Pareto-optimal rankings.

2.6.1 H-Demo vs V-Apps (Fashion and HoE)

The Accessibility aware Relevance feedback H-Demo resembles, in most of its functionalities, the corresponding component that will be integrated in the Fashion V-App. The main difference is the input data the user provides as a query. The web interface developed for exhibiting the current H-Demo uses a regular text-based query, while the V-App will be based on a more interactive architecture in which the user is guided through a specific option-based menu. The current components will therefore have to be adjusted accordingly.

2.6.2 H-Demo vs CUbRIK pipeline(s)

The Accessibility H-Demo implements the following pipelines:
1. Accessibility Annotation Pipeline: extracts low-level features (brightness, contrast, etc.) from the input images and calculates accessibility scores from them. The accessibility annotation is added to the input record. A more detailed description of the input and output of the pipeline can be found in Section 2.6.6.
2. Accessibility Annotation Demo Pipeline: used for demonstrating the accessibility annotation extracted through the Accessibility Annotation Pipeline. It adds some extra pipelets at the end of that pipeline, which use the extracted annotation to produce images illustrating the annotation.
3. CUbRIK Add to Index Pipeline: adds records to the CUbRIK index. In addition to the standard SMILA indexing add pipeline, the record passes through the Accessibility Annotation Pipeline before being sent to the indexing pipeline, so that it is enhanced with accessibility annotation.
4. Accessibility Profile Extraction Pipeline: accepts as input the results of specific impairment tests performed via a web interface. The tests measure the user's capability to accomplish some vision-related tasks (e.g. colour-blindness tests). The pipeline processes this information to extract an impairment profile for the user, for the various supported types of disabilities. The input and output of the pipeline are described in more detail in Section 2.6.6.
5. CUbRIK Search Pipeline: accepts a query and returns a set of results retrieved from the CUbRIK index. In addition to the standard SMILA search pipeline, the H-Demo search pipeline is configured to return those result fields that are related to accessibility.
6. Accessibility Filtering Pipeline: a core component of the Accessibility H-Demo. It accepts a set of search results as input and calculates Pareto-optimal re-rankings of these results, with respect to the various supported types of disabilities. The pipeline also uses the user's impairment profile in order to find the optimal ranking for the specific user. The input and output of this pipeline are presented in more detail in Section 2.6.6.
7. Accessibility Profile Update Pipeline: an important part of the Accessibility H-Demo. The pipeline receives the user feedback as input and uses it to update and fine-tune the user impairment profile accordingly. The user feedback is either an evaluation of each result as accessible or not, or a selection of the most accessible of the rankings that accessibility filtering provided. The input and output of this pipeline are presented in more detail in Section 2.6.6.

The H-Demo pipelines exploited by the V-App are the following:
1. AccessibilityAnnotationPipeline
2. AccessibilityExtractProfilePipeline
3. AccessibilityFilteringPipeline
4. AccessibilityUpdateProfilePipeline
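To make the role of the impairment profile in the filtering step concrete, here is a deliberately simplified Python sketch. It shows only the profile-weighted selection idea, not the actual Pareto computation performed by the released pipelets; the field names (accessibility, protanopia, deuteranopia) and the weighting scheme are assumptions for illustration.

```python
# Illustrative-only sketch: rank results by an accessibility score
# weighted with the user's impairment severities (all values in [0, 1]).

def weighted_accessibility(scores, profile):
    """Combine per-impairment accessibility scores using the user's
    impairment severities as weights."""
    total = sum(profile.values())
    if total == 0:  # no impairments: every result is equally accessible
        return 1.0
    return sum(scores[k] * w for k, w in profile.items()) / total

def optimal_ranking(results, profile):
    """Order results so the most accessible ones for this user come first."""
    return sorted(
        results,
        key=lambda r: weighted_accessibility(r["accessibility"], profile),
        reverse=True,
    )

# Hypothetical sample data:
profile = {"protanopia": 0.8, "deuteranopia": 0.0}
results = [
    {"id": "a", "accessibility": {"protanopia": 0.2, "deuteranopia": 0.9}},
    {"id": "b", "accessibility": {"protanopia": 0.9, "deuteranopia": 0.4}},
]
# For this profile, result "b" ranks above "a".
```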

2.6.3 Data set description

The dataset exploited in the Accessibility aware Relevance feedback H-Demo currently consists of 330,622 fashion-related images from the Flickr website, each associated with various metadata (image types: JPEG, PNG). For each image, the following information is available:
- The title of the image.
- The URL of the image page on Flickr.
- URLs for the small, medium, original and large versions of the image.
- Text comments about the image, along with their author and the comment date/time.
- Text tags describing the image, along with their author.
- The geo-coordinates of the image (latitude, longitude and their accuracy).
- The person who uploaded the image.
- The date/time the image was taken.
- The date/time the image was uploaded.
- The name and the type of the context to which the image belongs (e.g. a collection).
- Licence information about the image.

Although no specific type of metadata is required, the attached metadata should fulfil the following requirements:
- Alt attributes
- Header tags
- Absence of "hard-coded" text size

An offline pre-processing of all objects in the dataset takes place for their evaluation in terms of a multidimensional accessibility factor. Based on the personalized profile of each registered user, the V-App initially uses the CUbRIK search engine, and the returned results are re-ranked and re-sorted so as to meet the accessibility needs of that profile. In particular, the extracted accessibility-related features include descriptors concerning the visual content of the multimedia objects (i.e. colour histogram, areas with high luminance values, image resolution and size, shape descriptors, image contrast and texture, etc.) so as to address vision-related disabilities.

2.6.4 Architecture overview

The overall architecture and workflow diagram of the H-Demo are presented in Figure 26.

Figure 26: architectural block diagram of the H-Demo

In the same vein, the subcomponents of each module presented in the architectural block diagram above can be found in Figure 27, Figure 28, Figure 29 and Figure 30, respectively. In particular, the cubrikDataFetcher job consists of two workers, namely the cubrikDataFetcher and the updatePusher, which provide input to the cubrikIndexUpdate job, as shown in Figure 27. The latter consists of the SMILA built-in bulkBuilder worker, which feeds the accessibilityAnnotation pipeline and the cubrikAddPipeline. The specific pipelets that form the accessibilityAnnotation pipeline can be found underneath the corresponding block in the diagram; each of them implements a specific image-processing task, as indicated by its name:
• AccessibilityScorePipelet
• ImageBrightnessPipelet
• ImageColorListPipelet
• ImageColorSaturationPipelet
• ImageContrastPipelet
• ImageDominantColorPipelet
• ImageDominantColorCombinationPipelet
• ImageDominantColorHuePipelet
• ImageRedPercentagePipelet
• ObjectBrightnessPipelet
• ObjectColorListPipelet
• ObjectColorSaturationPipelet
• ObjectContrastPipelet
• ObjectDetectionPipelet
• ObjectDominantColorCombinationPipelet
• ObjectDominantColorHuePipelet
• ObjectRedPercentagePipelet
• ObjectTexturePipelet

Figure 27: flow chart of (a) the Data Fetcher and (b) the Updating of the Index jobs

Similarly, the profile extraction workflow (Figure 28) consists of just one pipeline, which takes its input either via the web interface described in the following section or via the appropriate POST request.

Figure 28: flow chart of the extraction of the user's profile workflow

The accessibility-related filtering workflow is fed with a query provided either by the web interface or by an appropriate POST request, and consists of two pipelines, as illustrated in Figure 29: the built-in CUbRIK search pipeline and the accessibility filtering pipeline, which provides different ranking possibilities as output. The latter pipeline consists of two pipelets, i.e. the paretoRanking pipelet and the optimalRanking pipelet.

Figure 29: flow chart of the accessibility-related filtering workflow

Finally, the profile update workflow, depicted in Figure 30, accepts as input a record containing the query results, the user profile, the possible rankings and the user's preferences, and returns as output an estimate of the updated profile.

Figure 30: flow chart of user profile updating

2.6.5 Third party library

The H-demo leverages the OpenCV23 library.
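As an illustration of the kind of low-level measurement delegated to OpenCV, the fragment below computes a naive brightness score normalised to [0, 1]. It is written in pure Python for brevity and is only meant to convey the idea; it is not the H-Demo's actual OpenCV-based extractor.

```python
# Hypothetical sketch: mean brightness of a grayscale image, in [0, 1].

def brightness_score(gray_pixels):
    """gray_pixels: 2-D list of 8-bit grayscale values (0-255).
    Returns the mean pixel value divided by 255."""
    flat = [p for row in gray_pixels for p in row]
    return sum(flat) / (255.0 * len(flat))
```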

2.6.6 Components integrated in CUbRIK R3

For the realization of this H-demo, some components were specifically developed and integrated in SMILA; these components are listed below:

1. Profile Registration: takes the results of the impairment profile tests as input and produces a vector of values constituting the impairment profile of the user. The component's input consists of the following fields:
• color blindness test results (array of values)
• image brightness test result
The component's output, i.e. the user impairment profile, consists of the following fields:
• protanopia
• deuteranopia
• total color blindness
• sensitivity to bright backgrounds
Each output field takes values in the [0, 1] range, with 0 meaning that the user does not have the respective disability and 1 meaning that the user has the disability to the maximum possible degree. As the system evolves, more vision impairments will be integrated and the above input and output records will contain more fields.

2. Accessibility Annotation: calculates accessibility scores for the images of the database. There are accessibility scores for all supported types of impairments, taking values in the [0, 1] range. The value 0 means that the respective image cannot be viewed properly by a user having the respective disability, while the value 1 means that it can. Before the calculation of the final accessibility scores, a set of low-level features related to vision impairments is extracted from the images. These features are also returned as output of the component. The input to the accessibility annotation component consists of the following fields:
• title
• image URL
The output of the accessibility annotation component consists of the following fields:
• image brightness
• image color list
• image color saturation
• image contrast
• image dominant color
• image dominant color combination
• image red percentage
• brightness per object
• color list per object
• saturation per object
• contrast per object
• dominant color per object
• dominant color combination per object
• accessibility scores for the various impairments (array of values)

3. Accessibility Filtering: accepts a query from the user and, as a result, produces a set of results and a set of rankings of those results, which are trade-offs among the various disabilities. It also accepts the user impairment profile, in order to find the optimal ranking for the specific user. The input to the component is a record consisting of the following fields:
• query string
• user impairment profile (array of values)
The output of the component contains the following fields:
• Pareto-optimal rankings (array of arrays)
• optimal ranking index (index into the above array)

4. Profile Update: uses feedback from the user in order to update his/her impairment profile, so that it is closer to the user's real profile. The feedback can be an evaluation of each result as accessible or not, or a selection of one of the presented Pareto-optimal rankings. The component accepts the following fields as input:
• search results (array of objects)
• user impairment profile (array of values)
• Pareto-optimal rankings (array of arrays)
• user feedback per result (array of values)
• user feedback per ranking (single value)
The component's output is a new user profile, consisting of new values for the profile fields:
• protanopia
• deuteranopia
• total color blindness
• sensitivity to bright backgrounds

23 http://docs.opencv.org/doc/tutorials/introduction/desktop_java/java_dev_intro.html
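The update rule of the real Profile Update component is not documented here. The following Python fragment is a purely illustrative sketch of how per-result feedback could nudge the impairment values toward the user's real profile; the field names, the update rule and the learning rate are all assumptions.

```python
# Hypothetical sketch of a feedback-driven profile update. A result judged
# inaccessible (feedback 0) despite a low accessibility score for some
# impairment raises that impairment value; the opposite case lowers it.
# All profile values are kept in [0, 1].

def update_profile(profile, results, feedback, rate=0.1):
    new = dict(profile)
    for res, ok in zip(results, feedback):
        for imp, score in res["accessibility"].items():
            # disagreement between the score and the judgement drives the update
            error = (1.0 - score) if ok == 0 else -(1.0 - score)
            new[imp] = min(1.0, max(0.0, new[imp] + rate * error))
    return new
```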

2.6.7 How to install the H-Demo bundle

The Accessibility aware Relevance feedback H-Demo is provided as a SMILA bundle. The following steps describe how to proceed with its installation and configuration:

1. Download the content of the JAR folder at https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/HDemos/AccessibilityAwareRelevanceFeedback_CERTH/JAR and copy it into the /plugins directory of your SMILA installation.
2. Download the content of the others folder at https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/HDemos/AccessibilityAwareRelevanceFeedback_CERTH/others and copy it into the root folder of your SMILA installation.
3. Download the contents of the https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/HDemos/AccessibilityAwareRelevanceFeedback_CERTH/configuration folder into the configuration directory of your SMILA installation.

Regarding the configuration of this H-Demo: if your SMILA installation has no other features installed, you can skip step 4 and proceed directly with step 5; otherwise proceed with step 4 below.

4. In order to update the SMILA configuration files with the specific configurations of the Accessibility H-Demo, proceed as follows (if requested, overwrite the existing files, except for standard SMILA configuration files):
• Open the jetty.xml.accessibility file under the configuration/org.eclipse.smila.http.server/ folder you downloaded at step 3 and copy the content between <!-- START Accessibilty H-Demo Specific configuration --> and <!-- END Accessibilty H-Demo Specific configuration --> into the jetty.xml file of your SMILA installation.
• Open the jobs.json.accessibility, workers.json.accessibility and workflows.json.accessibility files under the configuration/org.eclipse.smila.jobmanager folder you downloaded at step 3 and, for each of these files, copy the content between {"_comment":"START Accessibilty H-Demo Specific configuration"}, and ,{"_comment":"END Accessibilty H-Demo Specific configuration"} into the corresponding files of your SMILA installation.
• Open the deploy.xml.accessibility file under the configuration/org.eclipse.smila.processing.bpel/pipelines/ folder you downloaded at step 3 and copy the content between <!-- START Accessibilty H-Demo Specific configuration --> and <!-- END Accessibilty H-Demo Specific configuration --> into the deploy.xml file in the configuration/org.eclipse.smila.processing.bpel/pipelines directory of your SMILA installation.
• Open the solr.xml.accessibility file under the configuration/org.eclipse.smila.solr/ folder you downloaded at step 3 and copy the content between <!-- START Accessibilty H-Demo Specific configuration --> and <!-- END Accessibilty H-Demo Specific configuration --> into the configuration/org.eclipse.smila.solr/solr.xml file of your SMILA installation.
• Open the config.ini.accessibility file under the configuration/ folder you downloaded at step 3 and copy the content between #START Accessibility H-Demo servlet and # END Accessibility H-Demo servlet into the config.ini file of your SMILA installation, appending it just after the last line of the config.ini file.

5. After the downloads performed at steps 1, 2 and 3, to complete the configuration of this H-Demo you only need to rename the files ending with ".accessibility" so that they overwrite the SMILA original ones.

2.6.7.1 How to inject the content data set

Two different types of data files are needed for the content data set: the file containing the dataset itself and the corresponding indexing file. The data of the database to be indexed (image URLs, tags, etc.) need to be in a text file. Each row of the text file corresponds to an image and consists of the following fields, separated by tabs:
• record ID
• title
• Flickr image URL
• small image URL
• medium image URL
• original image URL
• large image URL
• latitude
• longitude
• accuracy
• number of tags
• tag 1
• tag 2
• tag 3
• ...

The record ID is a unique identifier for the image. There are five URLs available for the image: the URL of the Flickr webpage containing the image, and four URLs for the actual image in different sizes. The latitude and longitude fields contain the geographical coordinates of the image, if available; the accuracy of the coordinates is contained in the accuracy field. If any of the above fields is not available, it can be set to the value "null". Each image is associated with a set of tags describing it. The final fields of each record are the tags, and the field just before the list of tags contains their number. A small example of a database file suitable for testing purposes is available at https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/HDemos/AccessibilityAwareRelevanceFeedback_CERTH/others/DATA_SMALL.txt
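The row format described above can be read with a few lines of Python. The parser below is an illustrative sketch, not part of the released data fetcher; the field names are taken from the list above, and "null" placeholders are mapped to None.

```python
# Hypothetical parser for one tab-separated row of the dataset file.

FIELDS = ["record_id", "title", "flickr_url", "small_url", "medium_url",
          "original_url", "large_url", "latitude", "longitude", "accuracy"]

def parse_row(line):
    parts = line.rstrip("\n").split("\t")
    rec = dict(zip(FIELDS, parts))
    n_tags = int(parts[len(FIELDS)])                 # field after 'accuracy'
    rec["tags"] = parts[len(FIELDS) + 1 : len(FIELDS) + 1 + n_tags]
    # normalise "null" placeholders to None
    for key, value in rec.items():
        if value == "null":
            rec[key] = None
    return rec
```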

2.6.7.2 How to run the H-Demo

Once the Accessibility H-Demo is installed, run the SMILA.exe file in your SMILA root folder. The software will start loading the bundles. Please wait until you see output similar to the screenshot below:

Figure 31: SMILA console screenshot

Now the server is ready to interact with you. The functionalities of the accessibility-related annotation and the corresponding filtering parts of the demo are exposed either via the web interface or via the REST interface through POST requests; in the latter case, the Accessibility H-Demo can be tested using a REST client such as Resty24, the RESTClient Firefox add-on25 or the Chrome Simple REST Client26. The following description assumes that the H-Demo is tested on the same machine hosting it; otherwise, remember to replace "localhost" with the name or IP of the server running the H-Demo.

2.6.7.2.1 Test Accessibility Filtering

Before using accessibility filtering, the data in the database need to be indexed, according to the steps described below:

• Using a REST client, start the cubrikIndexUpdate job by submitting an empty POST request to http://localhost:8080/smila/jobmanager/jobs/cubrikIndexUpdate/
• Using a REST client, start the cubrikDataFetcher job by submitting an empty POST request to http://localhost:8080/smila/jobmanager/jobs/cubrikDataFetcher/

The cubrikDataFetcher job now creates records read from the database file and sends them to the cubrikIndexUpdate job. The latter adds accessibility annotations to the records and indexes them in the CUbRIKCore index. The status of the cubrikDataFetcher and cubrikIndexUpdate jobs can be monitored at http://localhost:8080/smila/jobmanager/nameOfTheJob

Moreover, you can extend the data set by providing an image as input through the accessibility annotation web interface, as follows:
• Visit http://localhost:8080/accessibility/annotate
• A text input field is presented. Paste an image URL in this field and press "Annotate". A series of images appears, showing various image characteristics related to vision impairments.

In this way you can also test the Accessibility Annotation feature. Once the database has been indexed, you can exploit the Accessibility Filtering feature as follows:
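The two empty-POST job-start requests above can also be scripted. A minimal sketch, assuming SMILA's default host and port; `job_url`, `start_job` and `job_status` are hypothetical helper names, and the commented calls at the bottom require a running SMILA instance.

```python
# Sketch of starting and checking SMILA jobs over REST, mirroring the
# empty-POST calls described above. Host/port are the document's defaults.
import json
import urllib.request

BASE = "http://localhost:8080/smila/jobmanager/jobs"

def job_url(job_name):
    # SMILA job endpoint: an empty POST here starts a job run
    return "%s/%s/" % (BASE, job_name)

def start_job(job_name):
    req = urllib.request.Request(job_url(job_name), data=b"", method="POST")
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def job_status(job_name):
    # a plain GET on the job URL returns its current status as JSON
    with urllib.request.urlopen(job_url(job_name)) as resp:
        return json.load(resp)

# Typical usage (requires a running SMILA instance):
# start_job("cubrikIndexUpdate")
# start_job("cubrikDataFetcher")
# print(job_status("cubrikDataFetcher"))
```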

• Visit http://localhost/accessibilityhdemo/login.html
• Login using username user001 and password 1234 (these credentials are used for testing purposes and are defined in the users.txt file)
• Perform the vision impairment tests. After the tests are completed, the user is transferred to the Accessibility H-Demo search page.
• Enter a term in the search field and press "Search".
• The search results appear. On the right, a Pareto diagram is shown. The user can click on the points of the Pareto diagram, in order to re-rank the results according to

24 https://github.com/micha/resty
25 https://addons.mozilla.org/En-us/firefox/addon/restclient/
26 https://chrome.google.com/webstore/detail/simple-rest-client/fhjcajmcbmldlhcimfajhfbgofnpcjmb



different trade-offs among the available impairments.
• The user can submit the selected ranking by clicking on the "Submit feedback" button below the Pareto diagram. The user selection is used to update the user impairment profile.
• Alternatively, the user can select those results that he/she thinks are accessible, by checking the checkbox next to each result. The user can submit this feedback by clicking on the "Submit feedback" button below the results.

2.6.8 How to exploit the H-Demo source code

2.6.8.1 CUbRIK environment configuration (SMILA)

2.6.8.1.1 Core components

• Pipelets
o Import the cubrikproject.pipelet.certh.Accessibility bundle inside the SMILA workspace as an existing project, selecting the bundle jar as archive file.
o In Eclipse, go to Run -> Run Configurations... and check the cubrikproject.pipelet.certh.Accessibility bundle.

• Pipelines
o Copy the following BPEL files:
AccessibilityAnnotationPipeline.bpel
AccessibilityFilteringPipeline.bpel
AccessibilityUpdateProfilePipeline.bpel
AccessibilityExtractProfilePipeline.bpel
from the configuration/org.eclipse.smila.processing.bpel/pipelines SVN folder to the SMILA.application/configuration/org.eclipse.smila.processing.bpel/pipelines folder of the workspace
o Copy the file others/config.xml from the SVN to the SMILA.application project folder.
o Add the following pipeline definitions inside SMILA.application/configuration/org.eclipse.smila.processing.bpel/pipelines/deploy.xml:

<process name="proc:AccessibilityAnnotationPipeline">
  <in-memory>true</in-memory>
  <provide partnerLink="Pipeline">
    <service name="proc:AccessibilityAnnotationPipeline" port="ProcessorPort" />
  </provide>
</process>
<process name="proc:AccessibilityFilteringPipeline">
  <in-memory>true</in-memory>
  <provide partnerLink="Pipeline">
    <service name="proc:AccessibilityFilteringPipeline" port="ProcessorPort" />
  </provide>
</process>
<process name="proc:AccessibilityUpdateProfilePipeline">
  <in-memory>true</in-memory>
  <provide partnerLink="Pipeline">
    <service name="proc:AccessibilityUpdateProfilePipeline" port="ProcessorPort" />
  </provide>
</process>
<process name="proc:AccessibilityExtractProfilePipeline">
  <in-memory>true</in-memory>
  <provide partnerLink="Pipeline">
    <service name="proc:AccessibilityExtractProfilePipeline" port="ProcessorPort" />
  </provide>
</process>

For example, see the configuration/org.eclipse.smila.processing.bpel/pipelines/deploy.xml.accessibility file from the SVN.

2.6.8.1.2 Accessibility Annotation (REST interface)

• Setup the FileDataFetcher worker
The FileDataFetcher worker creates records from the database file entries. To install it:
o Import the cubrikproject.worker.certh.AccessibilityHDemo bundle inside the SMILA workspace as an existing project, selecting the bundle jar as archive file.
o In Eclipse, go to Run -> Run Configurations... and check the cubrikproject.worker.certh.AccessibilityHDemo bundle. Set the Auto-Start to true.
o Add the following worker definition inside SMILA.application/configuration/org.eclipse.smila.jobmanager/workers.json:

{
  "name": "FileDataFetcher",
  "taskGenerator": "runOnceTrigger",
  "parameters": [
    { "name": "databaseFile" }
  ],
  "output": [
    { "name": "outputRecords", "type": "recordBulks" }
  ]
}

R3 CUbRIK Integrated Platform Release

Page 59

D9.4 Version 1.0


For example, see the SMILA.application/configuration/org.eclipse.smila.jobmanager/workers.json file in the SVN.

• Setup the search engine index
o Copy the configuration/org.eclipse.smila.solr/CUbRIKCore folder from SVN inside the SMILA.application/configuration/org.eclipse.smila.solr folder of the workspace.
o Add the following entry inside the cores field of SMILA.application/configuration/org.eclipse.smila.solr/solr.xml:

<core name="CUbRIKCore" instanceDir="CUbRIKCore"/>

For example, see the solr.xml.accessibility file from the SVN.

• Setup the indexing pipelines
o Copy the following BPEL files:
CUbRIKAddPipeline.bpel
CUbRIKDeletePipeline.bpel
from the configuration/org.eclipse.smila.processing.bpel/pipelines SVN folder inside the SMILA.application/configuration/org.eclipse.smila.processing.bpel/pipelines folder of the workspace
o Add the following pipeline definitions inside SMILA.application/configuration/org.eclipse.smila.processing.bpel/pipelines/deploy.xml, in the same folder:

<process name="proc:CUbRIKAddPipeline">
  <in-memory>true</in-memory>
  <provide partnerLink="Pipeline">
    <service name="proc:CUbRIKAddPipeline" port="ProcessorPort" />
  </provide>
  <invoke partnerLink="AccessibilityAnnotationPipeline">
    <service name="proc:AccessibilityAnnotationPipeline" port="ProcessorPort" />
  </invoke>
</process>
<process name="proc:CUbRIKDeletePipeline">
  <in-memory>true</in-memory>
  <provide partnerLink="Pipeline">
    <service name="proc:CUbRIKDeletePipeline" port="ProcessorPort" />
  </provide>
</process>

For example, see the SMILA.application/configuration/org.eclipse.smila.processing.bpel/pipelines/deploy.xml.accessibility file from the SVN.

• Copying the file for dataset injection
o Copy the others/DATA_SMALL.txt file from SVN inside the SMILA.application folder of the workspace.

• Setup the workflows and jobs
o Add the following workflow definitions inside SMILA.application/configuration/org.eclipse.smila.jobmanager/workflows.json:

{
  "name": "cubrikIndexUpdate",
  "modes": [ "standard" ],
  "parameters": { "pipelineRunBulkSize": "1" },
  "startAction": {
    "worker": "bulkbuilder",
    "output": {
      "insertedRecords": "addBucket",
      "deletedRecords": "deleteBucket"
    }
  },
  "actions": [
    {
      "worker": "pipelineProcessor",
      "parameters": { "pipelineName": "CUbRIKAddPipeline" },
      "input": { "input": "addBucket" }
    },
    {
      "worker": "pipelineProcessor",
      "parameters": { "pipelineName": "CUbRIKDeletePipeline" },
      "input": { "input": "deleteBucket" }
    }
  ]
},
{
  "name": "cubrikDataFetcher",
  "modes": [ "runOnce" ],
  "parameters": { "pipelineRunBulkSize": "20" },
  "startAction": {
    "worker": "FileDataFetcher",
    "output": { "outputRecords": "tempBucket" }
  },
  "actions": [
    {
      "worker": "updatePusher",
      "input": { "recordsToPush": "tempBucket" }
    }
  ]
}

For example, see the SMILA.application/configuration/org.eclipse.smila.jobmanager/workflows.json.accessibility file from the SVN.

o Add the following job definitions inside SMILA.application/configuration/org.eclipse.smila.jobmanager/jobs.json:

{
  "name": "cubrikIndexUpdate",
  "workflow": "cubrikIndexUpdate",
  "parameters": { "tempStore": "temp" }
},
{
  "name": "cubrikDataFetcher",
  "workflow": "cubrikDataFetcher",
  "parameters": {
    "tempStore": "temp",
    "databaseFile": "DATA_SMALL.txt",
    "jobToPushTo": "cubrikIndexUpdate"
  }
}

Substitute the "databaseFile" value with the absolute path of the database file described above in step 1. For example, see the SMILA.application/configuration/org.eclipse.smila.jobmanager/jobs.json.accessibility file from the SVN.
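Editing jobs.json by hand is error-prone, so the step above could be scripted as below. This is a sketch under the assumption that the job definitions live in a plain JSON list; SMILA's actual jobs.json may wrap them differently, so adapt the load/save accordingly. `add_jobs` is a hypothetical helper, and the demo writes to a temporary file rather than a real installation.

```python
# Sketch: append the two job entries from the text to an existing JSON list
# and write it back. Assumption: the file holds a plain JSON array of jobs.
import json
import os
import tempfile

NEW_JOBS = [
    {"name": "cubrikIndexUpdate", "workflow": "cubrikIndexUpdate",
     "parameters": {"tempStore": "temp"}},
    {"name": "cubrikDataFetcher", "workflow": "cubrikDataFetcher",
     "parameters": {"tempStore": "temp", "databaseFile": "DATA_SMALL.txt",
                    "jobToPushTo": "cubrikIndexUpdate"}},
]

def add_jobs(path, new_jobs):
    with open(path) as f:
        jobs = json.load(f)
    existing = {j["name"] for j in jobs}
    # skip entries already present so the script can be re-run safely
    jobs.extend(j for j in new_jobs if j["name"] not in existing)
    with open(path, "w") as f:
        json.dump(jobs, f, indent=2)

# demo on a temporary file standing in for jobs.json
tmp = os.path.join(tempfile.mkdtemp(), "jobs.json")
with open(tmp, "w") as f:
    json.dump([], f)
add_jobs(tmp, NEW_JOBS)
with open(tmp) as f:
    result = json.load(f)
```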

2.6.8.1.3 Accessibility Annotation (Web interface)

The following steps enable the demonstration of the accessibility annotation procedure through a web interface.

• Setup needed pipelets
o Import the cubrikproject.pipelet.certh.AccessibilityDemo bundle inside the SMILA workspace as an existing project, selecting the bundle jar as archive file.
o In Eclipse, go to Run -> Run Configurations... and check the cubrikproject.pipelet.certh.AccessibilityDemo bundle.

• Setup needed pipelines
o Copy the BPEL file AccessibilityAnnotationDemoPipeline.bpel from the configuration/org.eclipse.smila.processing.bpel/pipelines SVN folder inside the SMILA.application/configuration/org.eclipse.smila.processing.bpel/pipelines folder of the workspace
o Add the following pipeline definition inside SMILA.application/configuration/org.eclipse.smila.processing.bpel/pipelines/deploy.xml, in the same folder:

<process name="proc:AccessibilityAnnotationDemoPipeline">
  <in-memory>true</in-memory>
  <provide partnerLink="Pipeline">
    <service name="proc:AccessibilityAnnotationDemoPipeline" port="ProcessorPort" />
  </provide>
</process>

For example, see the SMILA.application/configuration/org.eclipse.smila.processing.bpel/pipelines/deploy.xml.accessibility file from the SVN.

• Install demonstration servlet
o Copy the files config.xml and users.txt from the SVN folder others inside the SMILA.application folder of the workspace.
o Import the cubrikproject.servlet.certh.AccessibilityAnnotationDemo bundle inside the SMILA workspace as an existing project, selecting the bundle jar as archive file.
o In Eclipse, go to Run -> Run Configurations... and check the cubrikproject.servlet.certh.AccessibilityAnnotationDemo bundle. Set the Auto-Start to true.
o Copy the configuration/cubrikproject.servlet.certh.AccessibilityAnnotationDemo SVN folder inside the SMILA.application/configuration folder of the workspace.
o Add the following handler entry inside SMILA.application/configuration/org.eclipse.smila.http.server/jetty.xml:

<Item>
  <New class="org.eclipse.jetty.webapp.WebAppContext">
    <Set name="contextPath">/accessibility</Set>
    <Set name="resourceBase"><SystemProperty name="org.eclipse.smila.utils.config.root" default="configuration"/>/cubrikproject.servlet.certh.AccessibilityAnnotationDemo/accessibility/webapp</Set>
    <Set name="descriptor"><SystemProperty name="org.eclipse.smila.utils.config.root" default="configuration"/>/cubrikproject.servlet.certh.AccessibilityAnnotationDemo/accessibility/webapp/WEB-INF/web.xml</Set>
    <Set name="defaultsDescriptor"><SystemProperty name="org.eclipse.smila.utils.config.root" default="configuration"/>/org.eclipse.smila.http.server/webdefault.xml</Set>
    <Set name="parentLoaderPriority">true</Set>
  </New>
</Item>

For example, see SMILA.application/configuration/org.eclipse.smila.http.server/jetty.xml.accessibility in the SVN.



2.6.8.1.4 Accessibility Filtering

• Setup search pipeline
o Copy the BPEL file CUbRIKSearchPipeline.bpel from the configuration/org.eclipse.smila.processing.bpel/pipelines SVN folder inside the SMILA.application/configuration/org.eclipse.smila.processing.bpel/pipelines folder of the workspace
o Add the following pipeline definition inside SMILA.application/configuration/org.eclipse.smila.processing.bpel/pipelines/deploy.xml, in the same folder:

<process name="proc:CUbRIKSearchPipeline">
  <in-memory>true</in-memory>
  <provide partnerLink="Pipeline">
    <service name="proc:CUbRIKSearchPipeline" port="ProcessorPort" />
  </provide>
</process>

For example, see SMILA.application/configuration/org.eclipse.smila.processing.bpel/pipelines/deploy.xml.accessibility in the SVN.

• Install the accessibility filtering servlets
o Import the cubrikproject.servlet.certh.AccessibilityHDemo bundle inside the SMILA workspace as an existing project, selecting the bundle jar as archive file.
o In Eclipse, go to Run -> Run Configurations... and check the cubrikproject.servlet.certh.AccessibilityHDemo bundle. Set the Auto-Start to true.
o Copy the configuration/cubrikproject.servlet.certh.AccessibilityAnnotationDemo SVN folder inside the SMILA.application/configuration folder of the workspace.
o Add the following handler entry inside SMILA.application/configuration/org.eclipse.smila.http.server/jetty.xml:

<Item>
  <New class="org.eclipse.jetty.webapp.WebAppContext">
    <Set name="contextPath">/accessibilityhdemo</Set>
    <Set name="resourceBase"><SystemProperty name="org.eclipse.smila.utils.config.root" default="configuration"/>/cubrikproject.servlet.certh.AccessibilityHDemo/accessibility/webapp</Set>
    <Set name="descriptor"><SystemProperty name="org.eclipse.smila.utils.config.root" default="configuration"/>/cubrikproject.servlet.certh.AccessibilityHDemo/accessibility/webapp/WEB-INF/web.xml</Set>
    <Set name="defaultsDescriptor"><SystemProperty name="org.eclipse.smila.utils.config.root" default="configuration"/>/org.eclipse.smila.http.server/webdefault.xml</Set>
    <Set name="parentLoaderPriority">true</Set>
  </New>
</Item>

For example, see the SMILA.application/configuration/org.eclipse.smila.http.server/jetty.xml file in the SVN.

2.6.8.2 Configuration of data set file paths

Running the H-Demo requires configuring the paths of two local resources. The servlet files inside cubrikproject.servlet.certh.AccessibilityHDemo define a variable named "usersFile", which contains the path to a local file with user registration and profile information. The file is named "users.txt" and is included in the SVN. This file must be copied to the local machine, and its absolute path must be set as the value of the variable "usersFile" inside the servlets.

In addition, the pipelet files inside cubrikproject.servlet.certh.AccessibilityAnnotationDemo define a variable named "TEST_IMAGES_PATH", which contains the path to a local folder used for storing the images produced by the demo. The folder is named "testImages". This folder must be created on the local machine inside SMILA.application/configuration/cubrikproject.servlet.certh.AccessibilityAnnotationDemo/accessibility/webapp, and its absolute path must be set as the value of the variable "TEST_IMAGES_PATH" inside the pipelets.
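The two manual preparations above (placing users.txt and creating the testImages folder) can be sketched as a small helper. The `prepare_paths` function and its layout are illustrative assumptions; the returned absolute paths are the values to set in "usersFile" and "TEST_IMAGES_PATH". The demo below runs against a temporary directory, not a real installation.

```python
# Illustrative helper preparing the two configured paths this section needs.
import os
import tempfile

def prepare_paths(smila_app_root):
    users_file = os.path.join(smila_app_root, "users.txt")
    test_images = os.path.join(
        smila_app_root, "configuration",
        "cubrikproject.servlet.certh.AccessibilityAnnotationDemo",
        "accessibility", "webapp", "testImages")
    os.makedirs(test_images, exist_ok=True)
    # these absolute paths are what must be set as "usersFile" (servlets)
    # and "TEST_IMAGES_PATH" (pipelets)
    return os.path.abspath(users_file), os.path.abspath(test_images)

tmp_root = tempfile.mkdtemp()
users_path, images_path = prepare_paths(tmp_root)
```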

2.7 Context-aware automatic query formulation

Context-aware automatic query formulation aims at providing the search engine (SE) user with query suggestions, by predicting information needs on the basis of the contextual features of the user session. Automatic query suggestion typically regards as user context the queries that have been issued so far within the present search session. Within CUbRIK, the notion of user context is broadened to include further user session characteristics, such as spatiotemporal ones. Our goal is therefore to develop an automatic query formulation mechanism that takes into account both the global and personal query log history, as well as the temporal and spatial characteristics of the session.

To this end, during Y2 of the project, a context-aware query formulation method has been developed which is capable of providing query suggestions upon user login, prior to the submission of any query to the SE. The method takes into account temporal and spatial characteristics of both global and personal user sessions (i.e. past sessions that belong to all users, or past sessions of the specific, currently logged-in user, respectively) that have been recorded within the SE log. In essence, the user's current session is matched to personal or global log sessions through a comparison of session spatiotemporal characteristics, such as time of day (encoded in time zones), day of week, day of month, user city and country.

Moreover, the algorithm is enhanced with further personalized information, through estimation of user mood for every query that users have submitted through interaction with the SE interface. User mood is calculated here based on assumptions about how it could be influenced by events taking place during SE usage. Events are extracted from user click-through data recorded in the SE logs.
Thus, the algorithm is eventually capable of ranking the formulated query suggestions according to their associated mood, since queries with an associated "negative mood" are regarded as queries that "disappointed" the user who submitted them.

Summarizing, the context-aware query formulation H-APP developed in Y2 is capable of classifying and clustering personal and global sessions recorded in the SE log on the basis of their temporal and spatial characteristics, so as to provide the currently logged-in user with query suggestions formulated by taking into account log sessions that share similar spatiotemporal characteristics. The ranking of suggestion results is further facilitated through the use of the query mood estimation algorithm. In the present document, the basic components of the H-Demo and its conceptual architecture, in terms of CUbRIK pipelines, are described. These pipelines will be finalized and integrated in the CUbRIK V-App during year 3.

2.7.1 H-Demo vs. V-Apps (Fashion and HoE)

The Context-aware automatic query formulation H-Demo is released as a pipeline that has not yet converged into any V-App. Most of its functionality is planned to be integrated into the Fashion V-App in the next version. The main difference is the input data provided by the user as query submission. In the web interface developed for the demonstration of the current H-Demo (i.e. a Google-like search engine interface), a regular text-based query is used, while the V-App will be based on a more interactive architecture where the user will be guided through specific option-based menus. The current components will have to be adjusted accordingly. For instance, the tool, once integrated in the V-App, should manage (during training and inference) and provide as suggestions, instead of queries in the form of textual strings, queries that consist of URLs pointing to images of the CUbRIK framework, which will be used therein as user input.

2.7.2 H-Demo vs CUbRIK pipeline(s)

The Context-aware automatic query formulation H-Demo implements the following pipelines:

1. CUbRIK Query Pipeline
The CUbRIK search pipeline accepts a query input and returns a set of results which are retrieved from the CUbRIK index. In addition to the standard SMILA search pipeline, the H-Demo search pipeline is configured to provide suggestions upon user login, which are formulated through the query suggestions formulation pipeline. Moreover, for each query submitted, temporal and spatial parameters of the user session are recorded in the server log, along with characteristics of the user's experience during SE usage, such as results selected, navigation within the results list, etc.

2. Session extraction and classification pipeline
The session extraction and classification pipeline first extracts search engine usage sessions from the server's log. Each session consists of at least two queries (maximum three) submitted to the SE, which have a between-submission time span of five minutes or less. For each session, its spatiotemporal characteristics are stored, along with features of SE user interface usage during its duration. Thereafter, the pipeline classifies the identified sessions on the basis of their spatiotemporal features. In particular, it takes into account features such as "time of day" (in terms of time zones, e.g. Early Afternoon), "day of week", "day of month", "country" and "city", and assigns the sessions recorded in the server log to the respective classes.

3. Session clustering pipeline
Upon user login, the session clustering pipeline identifies sessions that belong to the same contextual classes as the current user's session, in terms of 1) all log sessions (global sessions) and 2) sessions that belong to the same logged-in user as the present session (personal sessions). Both global and personal sessions are thereafter clustered on the basis of their term co-occurrence frequency, as calculated within the SE log. The result is a global and a personal cluster of sessions, which contain candidate queries to be provided as suggestions to the user.

4. Mood estimation pipeline
The mood estimation and query annotation pipeline is responsible for estimating, based on the user's SE usage experience recorded in the respective session logs, a mood score for each query of the global and personal clusters. In this way, each query is annotated with a mood score, which is calculated on the basis of assumptions about how features of SE usage could reflect the mood of the SE user.

5. Query suggestions formulation pipeline
The query suggestions formulation pipeline is responsible for formulating and providing, upon user login, a ranked list of query suggestions. These suggestions derive from the queries included in the personal and global session clusters. From each session of these clusters, the query that has been annotated with the highest mood score is added to the suggestions list. Queries of the personal cluster have higher priority. In the current implementation of the query formulation H-Demo, context-aware suggestions are provided to the user via the search engine UI as soon as s/he logs in.

The H-Demo pipelines which can be exploited by the V-App are the following:
1. SessionExtractionAndClassificationPipeline
2. SessionClusteringPipeline
3. MoodEstimationPipeline
4. QuerySuggestionsFormulationPipeline
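The session-extraction rule described for pipeline 2 above (at least two queries, at most three, with gaps of five minutes or less between submissions) can be sketched as follows. The log tuple format and the handling of longer runs (simply dropped here) are simplifying assumptions, not the pipeline's actual behaviour.

```python
# Sketch of the 5-minute-gap session extraction rule from pipeline 2.
FIVE_MIN = 5 * 60

def extract_sessions(query_log):
    """query_log: list of (user, timestamp_seconds, query), sorted by time."""
    sessions = []
    current = []
    for user, ts, query in query_log:
        # a new user or a gap over five minutes closes the current session
        if current and (user != current[-1][0] or ts - current[-1][1] > FIVE_MIN):
            if 2 <= len(current) <= 3:      # keep only 2-3 query sessions
                sessions.append([q for _, _, q in current])
            current = []
        current.append((user, ts, query))
    if 2 <= len(current) <= 3:
        sessions.append([q for _, _, q in current])
    return sessions

log = [
    ("u1", 0, "red dress"),
    ("u1", 120, "red summer dress"),
    ("u1", 1000, "shoes"),           # > 5 min gap: single query, dropped
    ("u2", 1010, "leather bag"),
    ("u2", 1100, "leather handbag"),
    ("u2", 1200, "brown handbag"),
]
sessions = extract_sessions(log)
```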

2.7.3 Data set description

The dataset utilized in the context-aware automatic query formulation H-Demo currently consists of 1324 fashion-related queries collected from real users through the Google-like search engine built as part of the H-Demo, each associated with various metadata. It should be noted that the number of queries in the log is not stable, but rises as recruited users log in and submit queries to the H-Demo search engine. For each query, the following metadata is available:
- Query Submission
o query submitted to the search engine
o timestamp of the query
o IP of the user that submitted the query
o number of results received
- Navigation within search results
o scrolling of the user while navigating within results
o page changes while navigating through results
- Selection of Search Results
o search result selected
o rank of the selected result
o page of the selected result

Moreover, at random time intervals, the SE user interface presents the user with a "mood" self-assessment slider, in the range [-100,100]. The answer of the user is also recorded in the server log. Through off-line processing occurring at pre-defined intervals (e.g. on a daily basis) or on demand, through a server management option offered to the search engine's administrator, sessions are extracted from the server logs and classified according to their spatiotemporal characteristics. Through this off-line processing, the context-aware query formulation system is re-trained, acquiring new knowledge from the newly recorded search sessions. This knowledge is thereafter utilized through the rest of the automatic context-aware query formulation process, which is conducted upon user login and results in the provision of context-aware query suggestions.

The image dataset used for the search engine of the H-Demo currently consists of 330622 fashion-related images from the Flickr website, each associated with various metadata.
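The mood estimation rests, as the text says, on assumptions about how click-through events reflect mood. The weights in the sketch below are invented purely for illustration; it only shows the general shape of such a heuristic, clamped to the [-100,100] range of the self-assessment slider.

```python
# Purely illustrative mood heuristic: effort signals (scrolling, page changes,
# low-ranked or missing clicks) lower the score, a quick high-ranked click
# raises it. All weights are invented for this sketch.
def mood_score(events):
    """events: dict with 'clicked_rank' (1-based or None), 'scrolls', 'page_changes'."""
    score = 0
    if events.get("clicked_rank") is None:
        score -= 50                          # nothing selected: likely disappointment
    else:
        score += max(0, 40 - 5 * (events["clicked_rank"] - 1))
    score -= 5 * events.get("scrolls", 0)
    score -= 10 * events.get("page_changes", 0)
    return max(-100, min(100, score))        # clamp to the slider range [-100, 100]

happy = mood_score({"clicked_rank": 1, "scrolls": 0, "page_changes": 0})
sad = mood_score({"clicked_rank": None, "scrolls": 8, "page_changes": 3})
```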



Image types: jpeg, (png)
For each image, the following information is available:
- The title of the image.
- The URL of the image in Flickr.
- URLs for small, medium, original and large versions of the image.
- Text tags describing the image.
- The geo-coordinates of the image (latitude, longitude and their accuracy).

2.7.4 Architecture overview

The conceptual architecture of the present H-Demo is provided herein. The overall architecture and workflow diagram of the H-Demo are presented in Figure 32.


Figure 32. Overall Architecture of the context-aware query formulation H-Demo

The main subcomponents of each module presented in the above architectural block diagram, which constitute the context-aware automatic query formulation H-Demo, can be found in Figures 33-35. In particular, starting from the Google-like search engine interface of the H-Demo, a basic job related to context-aware automatic query formulation can be found: the userProfiler (Figure 33), responsible for recording queries submitted to the SE by all users, as well as parameters of SE interface usage (through the userActionsLogger worker). Through the userProfiler, all user actions of interest while interacting with the SE interface are recorded in the server's log.



Figure 33. Flow chart of the logging Job

At pre-defined time intervals (or on demand, through the SE interface), the information stored in the server logs is used for off-line training of the context-aware automatic query formulation system. This training process is realized through the queryFormulationSystemTrainer job, as shown in Figure 34. This job consists of a worker (trainInitiator) whose purpose is, in essence, to initialize two pipelines that run in cascade, performing session extraction (sessionExtraction pipeline) and session classification (sessionClassification pipeline) on the basis of spatiotemporal features of the sessions' context. The sessionExtraction pipeline consists of pipelets responsible for a) identifying query sessions within the server logs (sessionIdentificationPipelet), and b) annotating sessions with spatiotemporal characteristics extracted by processing their timestamps and user IPs (sessionTemporalAnnotationPipelet and sessionSpatialAnnotationPipelet respectively). Once user sessions are extracted and annotated, they are forwarded to the sessionClassification pipeline, which consists of pipelets responsible for the temporal and spatial classification of the annotated sessions (sessionTemporalClassificationPipelet and sessionSpatialClassificationPipelet respectively). Through the above pipelines, recorded logs are processed toward training the system, providing knowledge about the recorded sessions and their contextual characteristics.
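The temporal side of the annotation step (in the spirit of sessionTemporalAnnotationPipelet) can be sketched as deriving time-of-day, day-of-week and day-of-month features from a session timestamp. The time-of-day bucket boundaries below are assumptions, not the pipelet's actual zones.

```python
# Sketch of temporal session annotation: time-of-day zone, day of week and
# day of month from a Unix timestamp. Bucket boundaries are illustrative.
from datetime import datetime

ZONES = [(0, "Night"), (6, "Morning"), (12, "Early Afternoon"),
         (15, "Late Afternoon"), (18, "Evening"), (22, "Night")]

def temporal_features(ts):
    dt = datetime.utcfromtimestamp(ts)
    # the last bucket whose start hour is <= the session hour
    zone = [name for start, name in ZONES if dt.hour >= start][-1]
    return {"time_of_day": zone,
            "day_of_week": dt.strftime("%A"),
            "day_of_month": dt.day}

feats = temporal_features(1380537000)  # 2013-09-30 10:30 UTC
```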

Figure 34. Flow chart of the off-line training Job

The querySuggestionsProvider job, depicted in Figure 35, is responsible for providing context-aware query suggestions upon user login. Suggestion provision is realized through the use of the sessionClustering pipeline, the queryMoodEstimation pipeline and the querySuggestionsFormulation pipeline.

Figure 35. Flow chart of the context aware query formulation Jobs

As shown in Figure 35, the sessionClustering pipeline consists of two pipelets, each responsible for generating personal or global clusters, respectively, of sessions that share common spatiotemporal context characteristics with the currently logged-in user. Through the queryMoodEstimation pipeline, the queries of these clusters' sessions are annotated in terms of user mood, based on features related to user experience while interacting with the SE, through the application of the mood estimation algorithm. Eventually, the querySuggestionsFormulation pipeline is responsible for extracting, from the mood-annotated session clusters, queries that derive from personal user sessions (personalQuerySuggestionsFormulationPipelet) and from global sessions (globalQuerySuggestionsFormulationPipelet), and finally for fusing these suggestions (suggestionsFusionPipelet), so as to provide the results back to the search engine user interface.

2.7.5 Third party library

For the determination of the geographical location of users from their IP address, IP geolocation data from the MaxMind "GeoLite City" DB27 are used, which are accessible to the H-Demo via a custom internal REST-based web service interface developed by CERTH.

2.7.6 Components integrated in CUbRIK R3

For the demonstration of the context-aware query formulation H-Demo, a Google-like search engine interface has been implemented as a SMILA servlet, which accepts user input in the form of HTTP requests and performs the context-aware query formulation functionalities described in Section 2.7.4.

2.7.7 How to install the H-Demo bundle

The context-aware query formulation H-Demo is provided as a SMILA bundle. The following steps describe how to proceed with its installation and configuration:

1. Download the content of the JAR folder at https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/HDemos/ContextAwareAutomaticQueryFormulation_CERTH/JAR and copy it under the /plugins directory of your SMILA installation

2. Download the content of the data folder at https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/HDemos/ContextAwareAutomaticQueryFormulation_CERTH/data into the root folder of your SMILA installation. The data folder you downloaded contains the following zip files:
- .metadata.zip, containing a subset of the dataset already indexed and ready to be used for H-Demo exploitation
- DATA.zip, containing the complete data set to be indexed
- queryformulationhdemo_data.zip, containing specific files related to both user logs and session information required by the context-aware query formulation

3. Download the contents of the https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/HDemos/ContextAwareAutomaticQueryFormulation_CERTH/configuration folder into the configuration directory of your SMILA installation; if prompted, do not overwrite the existing files

Regarding the configuration of this H-Demo, some of the steps reported below are shared with the Accessibility H-Demo; if you have already installed it, skip the steps labelled AAC, in order not to duplicate the installation:

4. Open jetty.xml.contentaware under configuration/org.eclipse.smila.http.server/jetty.xml.contentaware and copy the content between <!-- START Content-Aware H-Demo Specific configuration --> and <!-- END Content-Aware H-Demo Specific configuration --> into the jetty.xml file of your SMILA installation

5. AAC: Open jobs.json.contentaware, workers.json.contentaware and

27 http://dev.maxmind.com/geoip/legacy/geolite/ R3 CUbRIK Integrated Platform Release

Page 71

D9.4 Version 1.0


workflows.json.contentaware under configuration/org.eclipse.smila.jobmanager/workflows.json.contentaware you download at step2 and for each of these files copy the content between {"_comment":"START Content Aware H-Demo Specific configuration"}, and ,{"_comment":"END Content Aware H-Demo Specific configuration"} "} in the correspondent files of your SMILA installation 6. AAC:Open the deploy.xml.contentaware file under the configuration/org.eclipse.smila.processing.bpel/pipelines// configuration folder you downloaded at step2 and copy the content between <!-- START Content Aware HDemo Specific configuration --> and <!-- END Content Aware H-Demo Specific configuration --> in the deploy.xml file into the configuration\org.eclipse.smila.processing.bpel\pipelines directory of your SMILA installation 7. AAC: Open the solr.xml. contentaware file under configuration/org.eclipse.smila.solr/ configuration folder you downloaded at step2 and copy the content between <!-START Content Aware H-Demo Specific configuration --> and <!-- END Content Aware H-Demo Specific configuration --> into the configuration/org.eclipse.smila.solr/solr.xml file of your SMILA installation 8. Open the config.ini.contentaware file under configuration/ folder you downloaded at step2 and copy the content between # START Content Aware H-Demo servlet and # END Content Aware H-Demo servlet in the config.ini file of your SMILA installation append it just after the end of the last line of config.ini file

2.7.7.1 Configuration of the data set's file paths and output file paths

The operation of the context-aware query formulation system requires the existence of specific files, containing the user logs and session information, in specific folders. Once downloaded, these files are contained in the queryformulationhdemo_data folder, which has to be placed under the SMILA root. The queryformulationhdemo_data folder is composed of the following subfolders:

• logs: contains the log files of the users. Each log file contains information about the actions performed by a user (submitted queries, selected results, interactions, etc.) during the use of the search engine. The user logs are used by the training system in order to classify the user sessions and provide query recommendations.
• nums: an accessory directory containing, for each user, the number of queries submitted so far to the search engine. This information is used to determine the proper time for the interactive mood bar to appear to the user.
• util: contains the files produced by the training process and used online by the query formulation system. These files are the following:
  o log.txt: a concatenation of the log files of all the users.
  o sessions.txt: contains the query sessions (groups of two or three queries) performed by the users.
  o unique.txt: contains the unique terms that the users have used in their search sessions.
  o frequencies.txt: contains information about how frequently a pair of unique terms appeared in the same search session.
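The relationship between sessions.txt, unique.txt and frequencies.txt can be illustrated with a minimal sketch. This is not the project's training code: the function name and the in-memory representation (sessions as lists of query terms) are our own assumptions, chosen only to mirror the file descriptions above.

```python
from collections import Counter
from itertools import combinations

def build_training_outputs(sessions):
    """Illustrative reconstruction of unique.txt / frequencies.txt contents:
    'sessions' is a list of query sessions, each a list of query terms.
    Unique terms are collected across all sessions, and pair frequencies
    count how often two unique terms occur in the same session."""
    unique = sorted({term for session in sessions for term in session})
    freq = Counter()
    for session in sessions:
        for pair in combinations(sorted(set(session)), 2):
            freq[pair] += 1
    return unique, freq

unique, freq = build_training_outputs([["europe", "war"], ["europe", "treaty"]])
```

The real training system may weight or filter terms differently; the sketch only shows how session data can be reduced to a unique-term list and co-occurrence frequencies.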

2.7.7.2 How to inject the content data set

The image dataset used by the Context-Aware Query Formulation H-Demo is the same as the one used in the Accessibility H-Demo, described in section 2.6.7.1, so the content data injection does not need to be performed if the indexing has already been done for the Accessibility H-Demo (section 2.6.7.2). Otherwise, refer to the information reported below.

The DATA.zip file provided in the SVN at https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/HDemos/ContextAwareAutomaticQueryFormulation_CERTH/data/DATA.zip contains two files:
- DATA.txt: contains the overall data set (about 330,000 entries)
- DATA_SMALL.txt: contains a subset of the data suitable for testing purposes (about 10,000 entries)

The data injection is performed by the collaboration of the cubrikIndexUpdate and cubrikDataFetcher jobs. In order to choose which database to import, edit the cubrikDataFetcher configuration section inside the file configuration/org.eclipse.smila.jobmanager/jobs.json and change the "databaseFile" property to the name of the file you want to use.

Since the data set injection is an operation that can take a lot of time, depending on both the size of the data set to index and the hardware resources available, a proper data set (more than 6 thousand entries) has already been indexed and is available at https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/HDemos/ContextAwareAutomaticQueryFormulation_CERTH/data/.metadata.zip. In order to use it, you just need to uncompress the .metadata.zip file under the /workspace directory of your SMILA installation and overwrite the already existing files. If SMILA is already running, please remember to close it before unzipping the file. If you prefer not to use an already indexed data set, please refer to the Indexing section below, which reports information on how to perform indexing.
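The "databaseFile" edit in jobs.json can also be scripted. The sketch below is illustrative: the helper name is ours, and it assumes jobs.json holds a top-level "jobs" list (as in SMILA's usual jobs.json layout); check your actual file before relying on that shape.

```python
import json

def set_database_file(jobs_json_text, filename):
    """Point the cubrikDataFetcher job at DATA.txt or DATA_SMALL.txt by
    rewriting its "databaseFile" parameter. Assumes a top-level "jobs" list."""
    config = json.loads(jobs_json_text)
    for job in config["jobs"]:
        if job["name"] == "cubrikDataFetcher":
            job["parameters"]["databaseFile"] = filename
    return json.dumps(config, indent=2)

# Minimal stand-in for configuration/org.eclipse.smila.jobmanager/jobs.json
sample = '{"jobs": [{"name": "cubrikDataFetcher", "parameters": {"databaseFile": "DATA.txt"}}]}'
updated = set_database_file(sample, "DATA_SMALL.txt")
```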

2.7.7.3 How to run the H-Demo

Once the Context-aware automatic query formulation H-Demo is installed, run the SMILA.exe file in your SMILA root folder to start loading the bundles. Please wait until you see output like that shown in the screenshot below:

Figure 36 SMILA console screenshot

In the following description it is assumed that the H-Demo is tested on the same machine hosting it; otherwise, remember to replace "localhost" with the name or the IP of the server running the H-Demo.

1. Indexing
In order to index the dataset you have to start two jobs that prepare the indexes you will use. The first adds annotations to the records and indexes stored in the CUbRIKCore index as soon as the index changes. The second creates records read from the database file (the txt file you should have in your SMILA root folder) and, as a consequence, the first job updates the entries. The POST requests are sent to the server through a REST interface, according to the steps described below.

- Using a REST client, start the cubrikIndexUpdate job by submitting an empty POST request to http://localhost:8080/smila/jobmanager/jobs/cubrikIndexUpdate/
- Using a REST client, start the cubrikDataFetcher job by submitting an empty POST request to http://localhost:8080/smila/jobmanager/jobs/cubrikDataFetcher/

Now the cubrikDataFetcher job creates records read from the database file and sends them to the cubrikIndexUpdate job, which indexes the records in the CUbRIKCore index. The status of the cubrikDataFetcher and cubrikIndexUpdate jobs can be monitored from http://localhost:8080/smila/jobmanager/nameOfTheJob
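Instead of a graphical REST client, the empty POST that starts a job can be built with Python's standard library. The helper below is a sketch of our own (the function name is not part of SMILA); only the URL pattern comes from the steps above.

```python
from urllib.request import Request, urlopen

def start_job(job_name, host="localhost", port=8080):
    """Build the empty POST request that starts a SMILA job.
    Pass the returned object to urlopen() against a running SMILA."""
    url = f"http://{host}:{port}/smila/jobmanager/jobs/{job_name}/"
    return Request(url, data=b"", method="POST")

req = start_job("cubrikIndexUpdate")
# urlopen(req)  # uncomment with SMILA running on localhost:8080
```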



2. Automatic Query Formulation webapp
In the following, localhost should be replaced by the IP of the server running SMILA. Before using the webapp, the data of the database need to be indexed, according to the Indexing steps above.
- Run SMILA.
- Visit http://localhost/queryformulationhdemo/search
- Log in using e.g. username: user007, password: mniDpih
- After the user logs in, a Google-like webpage is presented, where the user can type a query and submit it to the search engine. Below the search field, a set of recommended queries is presented, calculated online by the query formulation system using the log information of the current user and all other users.
- In the same page, the user can also press the "Train recommendation system" button to initiate the training process of the query formulation system, using the current user logs.
- After the user submits a query by typing it in the search field and pressing "Search", a set of results is presented. The user can navigate the result list by scrolling up or down, or by selecting another page from the navigation area at the bottom.
- The user can select a result in order to view it in more detail, by clicking on its title or thumbnail. The corresponding Flickr page is opened.
- Occasionally, after the user submits a new query to the search engine, a page containing a mood bar is presented, where the user is asked to submit his/her mood at the moment, concerning the previous query session. The user can move the slider to the left (negative mood) or to the right (positive mood) and press "Submit" to submit his/her mood.
- The user can log out by pressing the "Logout" link at the top right of the page.
Each action performed by the user (query submission, navigation, result selection, mood submission) is transparently logged by the system.

2.7.8 How to exploit the H-Demo source code

2.7.8.1 CUbRIK environment configuration (SMILA)

2.7.8.1.1 Indexing (REST interface)

Note: Indexing does not need to be done if it has already been performed for the Accessibility H-Demo. The Context-Aware Automatic Query Formulation H-Demo can use the same index as the Accessibility H-Demo.

• Setup the FileDataFetcher worker
The FileDataFetcher worker creates records from the database file entries. To install it:
  o Import the cubrikproject.worker.certh.AccessibilityHDemo bundle into the SMILA workspace as an existing project in an archive file, selecting the bundle jar.
  o In Eclipse, go to Run -> Run Configurations..., check the cubrikproject.worker.certh.AccessibilityHDemo bundle and set Auto-Start to true.
  o Add the following worker definition inside SMILA.application/configuration/org.eclipse.smila.jobmanager/workers.json:

{
  "name": "FileDataFetcher",
  "taskGenerator": "runOnceTrigger",
  "parameters": [ { "name": "databaseFile" } ],
  "output": [ { "name": "outputRecords", "type": "recordBulks" } ]
}

For example, see the SMILA.application/configuration/org.eclipse.smila.jobmanager/workers.json.contentaware file in the SVN.

• Setup the search engine index
  o Copy the SVN folder SMILA.application/configuration/org.eclipse.smila.solr/CUbRIKCore inside the workspace folder SMILA.application/configuration/org.eclipse.smila.solr
  o Add the following entry inside the cores field of SMILA.application/configuration/org.eclipse.smila.solr/solr.xml:

<core name="CUbRIKCore" instanceDir="CUbRIKCore"/>

For example, see the configuration/org.eclipse.smila.solr/solr.xml.contentaware file from the SVN.

• Setup the indexing pipelines
  o Copy the BPEL files CUbRIKAddPipeline.bpel and CUbRIKDeletePipeline.bpel from the SVN folder configuration/org.eclipse.smila.processing.bpel/pipelines into the workspace folder SMILA.application/configuration/org.eclipse.smila.processing.bpel/pipelines
  o Add the following pipeline definitions inside SMILA.application/configuration/org.eclipse.smila.processing.bpel/pipelines/deploy.xml, in the same folder:

<process name="proc:CUbRIKAddPipeline">
  <in-memory>true</in-memory>
  <provide partnerLink="Pipeline">
    <service name="proc:CUbRIKAddPipeline" port="ProcessorPort" />
  </provide>
</process>
<process name="proc:CUbRIKDeletePipeline">
  <in-memory>true</in-memory>
  <provide partnerLink="Pipeline">
    <service name="proc:CUbRIKDeletePipeline" port="ProcessorPort" />
  </provide>
</process>

For example, see the SMILA.application/configuration/org.eclipse.smila.processing.bpel/pipelines/deploy.xml.contentaware file from the SVN.

• Setup the workflows and jobs
  o Unpack from SVN the file others/DATA.zip inside the workspace folder SMILA.application
  o Add the following workflow definitions inside SMILA.application/configuration/org.eclipse.smila.jobmanager/workflows.json:

{
  "name":"cubrikIndexUpdate",
  "modes":[ "standard" ],
  "parameters":{ "pipelineRunBulkSize":"1" },
  "startAction":{
    "worker":"bulkbuilder",
    "output":{
      "insertedRecords":"addBucket",
      "deletedRecords":"deleteBucket"
    }
  },
  "actions":[
    {
      "worker":"pipelineProcessor",
      "parameters":{ "pipelineName":"CUbRIKAddPipeline" },
      "input":{ "input":"addBucket" }
    },
    {
      "worker":"pipelineProcessor",
      "parameters":{ "pipelineName":"CUbRIKDeletePipeline" },
      "input":{ "input":"deleteBucket" }
    }
  ]
},
{
  "name":"cubrikDataFetcher",
  "modes":[ "runOnce" ],
  "parameters":{ "pipelineRunBulkSize":"20" },
  "startAction":{
    "worker":"FileDataFetcher",
    "output": { "outputRecords": "tempBucket" }
  },
  "actions": [
    {
      "worker": "updatePusher",
      "input": { "recordsToPush": "tempBucket" }
    }
  ]
}

For example, see the SMILA.application/configuration/org.eclipse.smila.jobmanager/workflows.json.contentaware file from the SVN.
  o Add the following job definitions inside SMILA.application/configuration/org.eclipse.smila.jobmanager/jobs.json:

{
  "name":"cubrikIndexUpdate",
  "workflow":"cubrikIndexUpdate",
  "parameters":{ "tempStore":"temp" }
},
{
  "name":"cubrikDataFetcher",
  "workflow":"cubrikDataFetcher",
  "parameters":{
    "tempStore":"temp",
    "databaseFile": "path_to_database_file",
    "jobToPushTo": "cubrikIndexUpdate"
  }
}

Substitute path_to_database_file with DATA.txt or DATA_SMALL.txt. For example, see the configuration/org.eclipse.smila.jobmanager/jobs.json file from the SVN.

2.7.8.1.2 Context-aware automatic query formulation webapp

• Setup the search pipeline
Note: This step does not need to be done if it has already been performed for the Accessibility H-Demo.
  o Copy the BPEL file CUbRIKSearchPipeline.bpel from the SVN folder configuration/org.eclipse.smila.processing.bpel/pipelines to the workspace folder SMILA.application/configuration/org.eclipse.smila.processing.bpel/pipelines
  o Add the following pipeline definition inside SMILA.application/configuration/org.eclipse.smila.processing.bpel/pipelines/deploy.xml, in the same folder:

<process name="proc:CUbRIKSearchPipeline">
  <in-memory>true</in-memory>
  <provide partnerLink="Pipeline">
    <service name="proc:CUbRIKSearchPipeline" port="ProcessorPort" />
  </provide>
</process>

For example, see SMILA.application/configuration/org.eclipse.smila.processing.bpel/pipelines/deploy.xml.contentaware in the SVN.

• Install the automatic query formulation servlet
  o Import the cubrikproject.servlet.certh.QueryFormulationHDemo bundle into the SMILA workspace as an existing project in an archive file, selecting the bundle jar.
  o Unpack the file queryformulationhdemo_data.zip from the SVN folder into the workspace folder SMILA.application
  o In Eclipse, go to Run -> Run Configurations..., check the cubrikproject.servlet.certh.QueryFormulationHDemo bundle and set Auto-Start to true.
  o Copy the folder configuration/cubrikproject.servlet.certh.QueryFormulationHDemo from SVN into the workspace folder SMILA.application/configuration/
  o Add the following handler entry inside SMILA.application/configuration/org.eclipse.smila.http.server/jetty.xml:

<Item>
  <New class="org.eclipse.jetty.webapp.WebAppContext">
    <Set name="contextPath">/queryformulationhdemo</Set>
    <Set name="resourceBase"><SystemProperty name="org.eclipse.smila.utils.config.root" default="configuration"/>/cubrikproject.servlet.certh.QueryFormulationHDemo/queryformulation/webapp</Set>
    <Set name="descriptor"><SystemProperty name="org.eclipse.smila.utils.config.root" default="configuration"/>/cubrikproject.servlet.certh.QueryFormulationHDemo/queryformulation/webapp/WEB-INF/web.xml</Set>
    <Set name="defaultsDescriptor"><SystemProperty name="org.eclipse.smila.utils.config.root" default="configuration"/>/org.eclipse.smila.http.server/webdefault.xml</Set>
    <Set name="parentLoaderPriority">true</Set>
  </New>
</Item>

For example, see the SVN file configuration/org.eclipse.smila.http.server/jetty.xml.contentaware.

2.7.8.2 Configuration of the 3rd party library and components integrated in SMILA for CUbRIK R3

For the indexing of the image dataset, the FileDataFetcher worker, implemented in the cubrikproject.worker.certh.AccessibilityHDemo bundle of the Accessibility H-Demo, is used.

2.8 Search Engine Federation

R3 provides a fully-fledged engine for federated search. Components are used to index and retrieve images, textual documents, Web sources and arbitrary other sources. Moreover, a Pipeline for context-aware automatic query formulation is implemented as an H-Demo.

2.8.1 H-Demo vs. V-Apps (Fashion and HoE)

Although the H-Demo is not fully integrated into the HoE V-App, an integration was tested successfully; the pipeline used two components: expansion through documents and expansion through images. A detailed description of both can be found in D6.2 28.

2.8.2 H-Demo vs. CUbRIK pipeline(s)

Search engine federation uses both the "Expansion through documents" and "Expansion through images" components in order to get both result sets. The components were tested in parallel for the History of Europe Vertical Application. The component Expansion through documents consists of several Pipelines and Workflows ('asynchronous Pipelines'). They are described in Section 2.8.4.1. The Pipelets and Workers which make up the Pipelines and Workflows are described in Table 2 and Table 3, respectively.

28 D6.2 R2 Pipelines for query processing

2.8.3 Data set description

The dataset contains the CVCE collection, which is formed of PDF documents in three languages: English, French and German. For concept modelling, the CUbRIK entity repository (Entitypedia) is used. These concepts are used to build an ontology which is stored in MongoDB.

2.8.3.1 Web Pages (MongoDB Collection "webpages")

The Web pages are stored in MongoDB. For each Web page the following set of properties is stored:

String id;
String httpUrl;
String domainUrl;
String title;
String docTitle;
String snippet;
int numConcepts;
String summary;
String concepts;
String keywords;
double pageRelevance;

2.8.3.2 Domain Pool (MongoDB Collection "domainreduced")

The Domain Pool contains all Web pages which were retrieved with GCS. The URLs are grouped by Web domain. The grouping is started by the REST call domainpool/webpages/performMapReduceOnDomainPool. An example of an analysed Web site is given in the following:

{
  "_id" : "www.aquamarinepower.com",
  "value" : {
    "pages" : [{
      "id" : "GCS-26-h7hostie",
      "url" : "http://www.aquamarinepower.com/about-us/the-team/",
      "title" : "Aquamarine Power - The team",
      "numConcepts" : 16.0,
      "pageRelevance" : 4.8071220206680962
    }],
    "pageCount" : 1.0,
    "numConcepts" : 16.0,
    "siteRelevance" : 4.8071220206680962
  }
}
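The grouping step can be sketched in plain Python. This is not the MongoDB map/reduce job itself: the helper name is ours, and the aggregation rules (summing numConcepts and pageRelevance per domain) are assumptions inferred from the single-page example above, where the site-level values equal the page-level ones.

```python
from urllib.parse import urlparse

def group_by_domain(pages):
    """Illustrative domain-pool grouping: collect page records per Web domain
    and aggregate pageCount, numConcepts and siteRelevance (assumed sums)."""
    pool = {}
    for page in pages:
        domain = urlparse(page["url"]).netloc
        entry = pool.setdefault(domain, {"pages": [], "pageCount": 0,
                                         "numConcepts": 0, "siteRelevance": 0.0})
        entry["pages"].append(page)
        entry["pageCount"] += 1
        entry["numConcepts"] += page["numConcepts"]
        entry["siteRelevance"] += page["pageRelevance"]
    return pool

pool = group_by_domain([{
    "id": "GCS-26-h7hostie",
    "url": "http://www.aquamarinepower.com/about-us/the-team/",
    "numConcepts": 16,
    "pageRelevance": 4.8071220206680962,
}])
```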



Map/Reduce functions are called via a Service (DomainPoolService).

2.8.3.3 Black List / White List (MongoDB Collections "blacklist", "whitelist")

The White list is used to start Web crawling. The Black list contains all sites which are not of interest. Both lists are structured like the Domain Pool (see the section above).

2.8.3.4 Co-Location Results

Results of the Map/Reduce process are pushed to the StoreCoOccurrenceResultWorker, which stores them in the collection cooccurrences. The schema is as follows:

{
  "_id" : ObjectId("506ed94f969b2aab798a1495"),
  "key" : ["con::SAT/Concept/construction/500942fe34d25cf8a327f4f1"],
  "value" : NumberLong(3663)
}
{
  "_id" : ObjectId("506ed907969b2aab7984e5b7"),
  "key" : ["con::SAT/Concept/material/5009663a34d25cf8a327f566", "nl::TWI"],
  "value" : NumberLong(1267)
}
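The counting behind this schema (single keys and key pairs with their occurrence counts) can be sketched as follows. This is a simplification of our own, not the worker's code: it assumes the input can be represented as a mapping from page id to the set of concept keys found on that page.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(pages):
    """Illustrative map/reduce-style counting: emit a count per single key
    and per sorted key pair, analogous to the 'cooccurrences' entries above."""
    counts = Counter()
    for keys in pages.values():
        for key in keys:
            counts[(key,)] += 1
        for pair in combinations(sorted(keys), 2):
            counts[pair] += 1
    return counts

counts = cooccurrence_counts({
    "p1": {"con::A", "nl::TWI"},
    "p2": {"con::A"},
})
```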

2.8.3.5 Ontologytree

In order to convey the structure of the tree which is defined by classes and concepts (and taxonomy relations), the concepts are given as a tree structure to the Web application. An example is given here:

{
  String "_id" : "SAT/Concept/context/500968b834d25cf8a327f599",
  String "_class" : "com.empolis.ias.sat.persistence.OntologyTreeNode",
  String "parentId" : "SAT/Class/5009407634d25cf8a327f4b5",
  String "ontostoreId" : "Concepts/500968b834d25cf8a327f599",
  String "label" : "low tidal",
  String "type" : "concept",
  String "path" : "context*low tidal",
  Long "level" : 1
}
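How the "level" and "path" annotations relate to the node's ancestry can be sketched with a small helper (ours, purely illustrative). It assumes the path separator is "*" and that the notional root sits at level -1, so the first level below it is 0, matching the example node above.

```python
def tree_annotations(labels_root_to_node):
    """Derive the 'level' and 'path' fields of an ontology tree node from
    the chain of labels, from the first level below the root down to the
    node itself."""
    return {
        "level": len(labels_root_to_node) - 1,
        "path": "*".join(labels_root_to_node),
    }

ann = tree_annotations(["context", "low tidal"])
```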

2.8.3.6 Expansion through images Dataset

We considered the use of images obtained from a popular online community, namely Flickr, a community centered around user generated content. The platform provides a rich sample of comments and associated metadata for a variety of content objects from a large pool of users. We used the Flickr search engine to create the collection by formulating textual queries. Our set of seed queries was selected from the CVCE collection. The top-300 results for each query were collected. For each selected image, we gathered the comments (if available), along with contextual metadata, including authors and timestamps. We also collected metadata such as title, tags, description and upload date, as well as statistics provided by Flickr such as the overall number of comments and the number of favourites. In addition, for each image we gathered information related to the uploader profile, including photo albums, groups joined, friends lists and friends' information. The complete collection had a final size of 10,991 images.

2.8.4 Architecture overview

The whole component consists of several sub-components; Figure 37 gives an overview. The Search component is provided by IAS via a JSON/REST interface. It provides access to the built index. Examples of how to access the index are given in the D6.2 deliverable, in the Expansion through documents section.



Figure 37 Search Engine Federation Architectural overview

The two Workflows for crawling domains and crawling (arbitrary) documents are executed in parallel. The search Pipeline (of IAS) accesses both the results from the domain crawl and the document crawl. In the following subsections the elements of the component are described in detail.

2.8.4.1 Pipelines and Workflows

Pipelines consist of Pipelets. Workflows consist of Pipelets and Workers. The Pipelines and Workflows are given as sequences of their parts in BPEL diagrams.



2.8.4.1.1 lomUpdateAndQueryTerms

The Pipeline reads the ontology from the MongoDB store and pushes the Lexical Ontology to the Lexical Ontology Matcher (which is used for textual analyses). Modelled classes and concepts are then pushed to the query record, which is used to generate Google Custom Search queries.

Figure 38 BPEL for Pipeline lomUpdateAndQueryTerms



2.8.4.1.2 CalculateQueryTermCount

The Pipeline is used to calculate the number of queries to be generated.

Figure 39 BPEL for Pipeline CalculateQueryTermCount



2.8.4.1.3 DomainPoolInsertionPipeline

This Pipeline is used to drop URLs which do not contain two or more concepts; the remaining URLs are then inserted into the Domain Pool.

Figure 40 BPEL for Pipeline DomainPoolInsertionPipeline



2.8.4.1.4 Workflow Generate Domain Pool

The Workflow generates an overview of the found URLs, which form the so-called Domain Pool. The Domain Pool is the set from which the selection of URLs to visit starts. The Workflow consists of two sub-Workflows. The first one pushes the found URLs to a list, where they wait for the analysis performed in the second part of the Workflow.

Figure 41 BPEL for Workflow Generate Domain Pool, part 1

In the second part of the Workflow the URL is fetched and a textual analysis is performed on the Web page (this component is part of IAS). This way relevant information of the Web page is extracted and shown, enabling the user to have a quick view of the page without having to read it completely.



Figure 42 BPEL for Workflow Generate Domain Pool, part 2

2.8.4.1.5 Workflow Crawl Domains

The Workflow is divided in two parts: the first part controls the crawling, the second performs the analysis. In the first part of the Workflow the URLs are called. For each URL all links are analysed, in this case only links which do not leave the host (e.g. www.cubrikproject.eu as opposed to the domain cubrikproject.eu). The found Web pages are then pushed to a set from which the analysis starts.



Figure 43 BPEL for Workflow Crawl Domains, part 1

The second part of the Workflow starts with an explicit URL, i.e. a Web page. The page is analysed; in particular, the modelled concepts are extracted. Moreover, analyses regarding the distance between these terms are performed. The output is an index. The non-relevant records in the index, i.e. the ones with less than a specified number of relevant concepts, are deleted.



Figure 44 BPEL for Workflow Crawl Domains, part 2

2.8.4.1.6 Workflow Crawl Documents

The Workflow for crawling the documents is the same as for Web crawling, without part 1. The two (asynchronous) Workflows are performed in parallel, accessing the same index. Instead of a list of URLs, this Workflow works on a set of files, in this case a collection of PDF documents in different languages, namely the CVCE collection.



2.8.4.2 Pipelets and Workers

Pipelets are the parts of an implemented Pipeline or Workflow. Pipelets run synchronously. Table 2 provides a list of all Pipelets in the component.

com.empolis.ias.sat.pipelets.domainpool.DomainPoolPageWriterPipelet
Writes metadata of a single page into the Domain Pool.

com.empolis.ias.sat.pipelets.domainpool.WebcrawlingUrlsPipelet
Retrieves all domains from a domain collection and generates a webcrawling record for each URL referred by the domain. The URL is written to a property httpUrl.

com.empolis.ias.sat.pipelets.DropRecordsPipelet
Drops all records which have a certain attribute. Its value is never read; its existence in a record is sufficient for dropping.

com.empolis.ias.sat.pipelets.ontostore.FingerPrintPipelet
Analyses the keywords of a document for corresponding concepts in the OntoStore. A pageRelevance is calculated as the weights of the keywords with matching concepts, summed up and divided by the maximum number of possible keywords per document.

com.empolis.ias.sat.pipelets.googlesearch.GCSPipelet
Performs a Google Custom Search.

com.empolis.ias.sat.pipelets.googlesearch.GCSQueryGeneratorPipelet
Generates query strings by logically combining the keys of the concepts in specified classes. All possible combinations of query arguments are built. One query argument consists of all keys of a concept, OR-combined. One query consists of one AND-ed argument per specified non-merged class, one argument for each merged set of classes, and the context, OR-generated from all keys of all concepts of all OR-classes. The Pipelet returns a map 'queries' with the individual query strings as a sequence called 'entities'. Specify only the class names via the class parameters; the URI prefix is added automatically.

com.empolis.ias.sat.pipelets.googlesearch.GCSQueryCounterPipelet
Estimates the number of queries the GCSQueryGeneratorPipelet is going to create. Thus this Pipelet takes exactly the same input as the GCSQueryGeneratorPipelet.

com.empolis.ias.sat.pipelets.ontostore.LOMMapperPipelet
Maps OntoStore results to the LOM schema for writing with the LOMWriterPipelet.

com.empolis.ias.sat.pipelets.ontostore.LOMWriterPipelet
Writes ontologies to LOM.

com.empolis.ias.sat.pipelets.ontostore.OntoStoreConceptResolverPipelet
Reads the concepts found in a webpage and writes counts and labels back to an attribute 'concepts'.

com.empolis.ias.sat.pipelets.ontostore.OntoStoreReaderPipelet
Reads all classes and concepts from an OntoStore into the attribute 'ontologyresult'.

com.empolis.ias.sat.pipelets.ontostore.OntologyTreeWriterPipelet
Reads all classes and concepts from an OntoStore into the attribute 'ontologyresult' and writes a tree structure as annotated node-parent pairs into a specified collection on the MongoDB. The annotations have a node type: "class" for concept classes, "concept" for recognizable concepts, "taxonomy" for modelled concepts that only serve structuring purposes. Other annotations are the increasing level in the tree (where the actually inexistent root would have level "-1"), the label of the concept, and a path key generated from the concept label and all parent labels up to the root. This data structure can be used for quick sorted visualization of the ontology and for selections.

com.empolis.ias.sat.pipelets.workflow.PushRecordsToWorkflowPipelet
Sends all current records to a new job and appends the workflow run info to the records. This is an improvement over the PushRecordsPipelet.

Table 2: Implemented Pipelets

Workers are the parts of an implemented Workflow. Workers run, in contrast to Pipelets, asynchronously. The next Table provides a list of all Workers in the component.

MongoMapReduceWorker
Uses MongoDB's JavaScript-based map-reduce mechanism to carry out calculations and recombinations (like joins) on collections. Used extensively for the DeltaAnalysis workflow.

MongoOrderedGroupingWorker
Updates a selection of documents in a collection by splitting the ordered set into a defined number of groups and writing the associated group number into the document, to a specified property.

MongoOrderWriterWorker
Updates an ordered selection of documents in a collection by writing the line number from within the selection result to a specified property.

MongoStoreRecordWorker
Simply stores a record in a MongoDB collection using a specified attribute as the id.

StoreCoOccurrenceResultWorker
A complex worker that should probably be split up, making use of the other Mongo workers. Its purpose is to write the results of the IAS MapReduce Workflow for co-location analysis into separate MongoDB collections for occurrences and co-occurrences, calculate metrics over them, create a relevance-enriched ontology tree and pre-process the result for visualization. The input is a record store; the output is five new MongoDB collections specific to this result.

Table 3: Implemented Workers
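The combination logic described for the GCSQueryGeneratorPipelet in Table 2 can be sketched in simplified form. This is not the Pipelet's code: the function name and the query syntax are illustrative, and merged classes and the context argument are omitted; only the combination scheme (one concept per class, a concept's keys OR-combined, per-class arguments AND-combined) follows the description above.

```python
from itertools import product

def generate_queries(classes):
    """Hedged sketch of the query-combination scheme: 'classes' maps a class
    name to its concepts, each concept being a list of keys. For every
    combination of one concept per class, the concept's keys are OR-combined
    into one argument and the arguments are AND-combined into one query."""
    queries = []
    for combo in product(*classes.values()):
        args = ["(" + " OR ".join(keys) + ")" for keys in combo]
        queries.append(" AND ".join(args))
    return queries

# Hypothetical classes and concept keys, for illustration only
qs = generate_queries({
    "material": [["steel"], ["concrete", "cement"]],
    "context": [["tidal"]],
})
```

The GCSQueryCounterPipelet's estimate corresponds to the product of the per-class concept counts in this sketch (here 2 x 1 = 2 queries).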

2.8.5 Third party library

For the indexing of documents and Web sources, Empolis IAS functionality is used. Moreover, queries are fired against the IAS JSON/REST API, which provides access to a built index.

2.8.6 How to install

The H-Demo is available in the CUbRIK SVN at https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/HDemos/SearchEngineFederation_EMPOLIS. The Pipelets run in the SMILA environment. In order to use IAS functionality, as done for the component "Expansion through documents", the proprietary software IAS has to be installed.

2.8.6.1 Getting IAS

The "Information Access System" (IAS) is provided by Empolis Information Management GmbH. The point of contact is Simone Conrad; inside the CUbRIK project, Mathias Otto (e-mail mathias.otto@empolis.com). A download (from an FTP server) will be provided along with a license file.

2.8.6.2 Installation of IAS

The installation procedure is available in the CUbRIK SVN at https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/HDemos/SearchEngineFederation_EMPOLIS/CI_26092013/doc/releases/1.0

The file service.ini has to be adapted with the following entries:

'tmeserver': ...
'languages':
  - 'english'
  - 'german'
  - 'spanish'
  - 'french'
'languages_ontology':
  - 'english'
  - 'german'
  - 'spanish'
  - 'french'

This way the recognized languages are defined.



3 Pipelines of R3: CUbRIK Vertical Applications

History of Europe (HoE) and SME Innovation (Fashion) are the two CUbRIK Vertical Applications (V-Apps) implementing real application scenarios, with the aim of validating the platform features in real-world conditions and for vertical search domains. The HoE demonstrator was introduced as a domain of practice to demonstrate the value of CUbRIK technology in helping businesses achieve their goals, while the SME Innovation demonstrator focuses on the Fashion domain, supporting SMEs in exploiting the potential of the new technology to get feedback from and learn about the needs of fashion consumers. Real application scenarios for both HoE and Fashion were defined through the construction of user stories, according to the approach described in D2.2 29. Each user story is further implemented through different use cases with dedicated pipelines and components. About the CUbRIK V-Apps (first prototype), Release 3:
- The HoE demonstrator is built on the user story "Who is this person? Tell me more", with the goal to browse, annotate and enrich a large image-based corpus related to people, events and facts about the history of Europe after the 2nd World War.
- The SME Innovation demonstrator is built on the user story for Trend Analysis in the Fashion domain. The specific purpose is to annotate and enrich fashion-related images crawled from social networks, in order to extract trends from the images and the preferences of the social network users.
Further details on these user stories are in the D2.3 30 deliverable. For this release, both HoE and Fashion provide the CUbRIK pipelines for multimodal content analysis & enrichment that are further detailed in D5.2 31. The Table below shows which CUbRIK H-Demos are reused in the CUbRIK V-Apps.

The matrix in Table 4 marks, for each H-Demo (Context-aware automatic query formulation, Search Query Federation, Like Lines, Crosswords, Accessibility aware Relevance feedback, Media Entity Annotation, People Identification, News history), whether it is reused by the HoE or the Fashion V-App (X) or whether its reuse is due in Year 3 (Y3).

Table 4: H-Demo vs V-Apps

29 D2.2 Requirements Specifications
30 D2.3 Revised Requirements Specifications
31 D5.2 R2 Pipelines for multimodal content analysis & enrichment



3.1 Search for SME innovation (Fashion) application

The Fashion V-App consists of:
• Trend Analyzer pipeline: the trend analyzer component, hosted in the CERTH server farm, contains the results of the trend analysis performed. Results are stored in the CUbRIK storage component.
• Fashion V-App User Interface.
The trend analyzer relies on a dataset that is permanently updated with fresh data, periodically crawled and processed. The section below is strictly related to the main pipeline exploiting the trend analyzer; moreover, the data visualization is implemented. The section below reports information about its installation, configuration and running.

3.1.1

How to install the Fashion V-App bundle

The installation of the Fashion V-App consists of the installation and configuration of a SMILA bundle. Assuming you have already set up the proper deployment environment (section 1.6.2), the following steps describe how to proceed with its installation and configuration:
1. Download the content of the JAR folder at https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/Fashion_VApp/JAR and copy it under the /plugins directory of your SMILA installation
2. Download the content of the configuration folder at https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/Fashion_VApp/configuration and copy it into the configuration\org.eclipse.smila.processing.bpel\pipelines directory of your SMILA installation
3. Download the TrendAnalysisWebApp from the GUI folder at https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/Fashion_VApp/GUI/TrendAnalysisWebApp and copy it under the configuration directory of your SMILA installation
4. Open the deploy.xml.trendanalysis file that is under the configuration folder you downloaded and copy the content between “<!-- START Trend Analysis -->” and “<!-- END Trend Analysis -->” at the same level in the deploy.xml file in the configuration\org.eclipse.smila.processing.bpel\pipelines directory of your SMILA installation
5. Open the jetty.xml.trendanalysis file that is in the configuration folder you downloaded and copy the content between “<!-- START Trend Analysis -->” and “<!-- END Trend Analysis -->” at the same level in the jetty.xml file inside the configuration\org.eclipse.smila.http.server directory of your SMILA installation
Note that jetty, the HTTP server embedded in SMILA, uses 8080 as default port. If needed, you can change the default value by editing the configuration/org.eclipse.smila.clusterconfig.simple/clusterconfig.json file and modifying the value of the “httpPort” parameter to the chosen port number.
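As a convenience, the httpPort change can be scripted; the sketch below assumes clusterconfig.json carries a top-level "httpPort" key, as the step above suggests (paths and port values are illustrative):

```python
import json

def set_http_port(config_path, port):
    """Rewrite the 'httpPort' entry of SMILA's clusterconfig.json.

    Assumes 'httpPort' is a top-level key, as suggested by the step above.
    """
    with open(config_path, "r", encoding="utf-8") as f:
        cfg = json.load(f)
    cfg["httpPort"] = port
    with open(config_path, "w", encoding="utf-8") as f:
        json.dump(cfg, f, indent=2)
    return cfg["httpPort"]

# Illustrative path:
# set_http_port("configuration/org.eclipse.smila.clusterconfig.simple/clusterconfig.json", 9090)
```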

3.1.1.1 How to run the Fashion V-App

Once the Fashion V-App is installed, run the SMILA.exe file in your SMILA root folder. The software will start loading the bundles; you can then browse the Fashion V-App at http://localhost:8080/trendanalysis (the image below shows a screenshot). The description above assumes that you are testing the demo on the same machine that hosts the platform, using 8080 as the default port. If you are testing the application from another computer, please replace “localhost” with the name or the IP of the server running the platform; if jetty is not using the default port, replace 8080 with the configured port value.
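The host/port substitution described above amounts to building the URL from two parameters; a trivial helper (illustrative only) makes the rule explicit:

```python
def trendanalysis_url(host="localhost", port=8080):
    """Build the Fashion V-App URL; replace host/port when the platform
    runs on another machine or jetty is configured on a different port."""
    return "http://%s:%d/trendanalysis/" % (host, port)
```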



Figure 45: Fashion V-App screenshot

3.2 History of Europe application

The History of Europe V-App consists of:
• A data service (DS) component to store resources (HoE photos and portraits);
• 7 SMILA pipelines (Photo Processing, Portrait Processing, Face Matching, Crowd Face Validation Result, Crowd Face Add Result, Crowd Keypoint Tagging Result, Crowd Face Identification Result);
• 3 servlets (GetCollection, processCollection, addIdentification);
• A conflict resolution manager (CRM), implemented according to two different architectures:
  o A crowd-based architecture (Crowd Searcher), which sends identification tasks to a list of experts;
  o A crowd-based architecture (Microtask), which performs operations by means of a generic crowd (face validation and face add);
• An Entitypedia service to interact with the Entitypedia knowledge base server.

A detailed description of the CUbRIK conflict resolution manager (CRM) and related frameworks is provided in D3.3. The section below reports information about HoE installation, configuration and running.

3.2.1

How to install the History of Europe bundle

The HoE V-App is released as a bundle that requires installation and configuration steps in order to run properly. Assuming that the deployment environment is already set up properly (section 1.6.2), the following steps describe how to proceed with HoE installation and configuration:
1. Download the content of the JAR folder at https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/HoE_VApp/JAR and copy it under the /plugins directory of your SMILA installation
2. Download the content of the resources folder at https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/HoE_VApp/resources and copy it under the SMILA root folder
3. Download the content of the edited folder at https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/HoE_VApp/edited and copy it under a directory of your PC
4. Download the content of the configuration folder at https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/HoE_VApp/configuration and copy it into the configuration\org.eclipse.smila.processing.bpel\pipelines directory of your SMILA installation
5. Download the content of the ThirdParty_components folder at https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/HoE_VApp/ThirdParty_components and copy it under a directory of your PC. This folder contains CUbRIK components that use the ThirdPartyTool face detection and recognition component, which is under commercial licence. The following section reports a guideline for the fake configuration and also for the case in which you have installed these components.
6. Download the content of the ExpertCrowdEntityVerification folder at https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/HoE_VApp/ExpertCrowdEntityVerification/ which is the User Interface of Crowd Searcher

3.2.1.1 Third party components installation and configuration

The HoE V-App requires the installation and configuration of two components using the ThirdPartyTool face detection and recognition component developed by KeeSquare 32, a spin-off of Politecnico di Milano. Since this tool is under commercial licence, you have to contact KeeSquare at info@keesquare.com to get a licence. The following steps describe how to get the KeeSquare component licence:
1. Download the content of the folder at https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/HoE_VApp/ThirdParty_components/MorpheusFR_SDK/ and run the "setup.exe" file in it
2. In order to perform the licence activation: assuming you are using a Windows machine, go to Start->Programs->Morpheus->MorpheusXX SDK->Show Machine ID

32 http://www.keesquare.com/htmldoc/



Save the MorpheusXX_SDK_xxx-yyy.dat file that is generated and send it by email to support@keesquare.com. Once the "lservrc" file is received, copy it under both the Demo and Delivery folders, for example C:\Program Files\Kee Square\MorpheusXX SDK\Demo and C:\Program Files\Kee Square\MorpheusXX SDK\Delivery. The "lservrc" file also has to be copied under the Face Detection and Face Identification directories you downloaded at step 5 of the previous section.
3. The Face Identification and Face Detection components are released as part of HoE, so you only have to perform the download and configuration as described in the previous sections.
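The repeated copy of the lservrc file can be scripted; the helper below is a sketch, and all target paths are placeholders to adapt to your installation:

```python
import shutil
from pathlib import Path

def install_license(lservrc_path, target_dirs):
    """Copy the KeeSquare 'lservrc' licence file into each target folder.

    target_dirs would typically include the Demo and Delivery folders and
    the Face Detection / Face Identification directories (paths illustrative).
    """
    for target in target_dirs:
        Path(target).mkdir(parents=True, exist_ok=True)
        shutil.copy(lservrc_path, target)

# install_license("lservrc", [
#     r"C:\Program Files\Kee Square\MorpheusXX SDK\Demo",
#     r"C:\Program Files\Kee Square\MorpheusXX SDK\Delivery",
# ])
```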

3.2.1.2 Crowd Searcher and Task Execution framework installation

The CrowdSearcher is a crowd-management system and is part of the Search Computing (Seco) 33 project funded by the European Research Council (ERC) "IDEAS Advanced Grants". The software licence is proprietary; for information and conditions the contact point is Marco Brambilla (marco.brambilla@polimi.it). The CrowdSearcher (CS) OSGI service is used to invoke the CrowdSearcher Web Application to create new tasks, send task invitations and retrieve responses. The Task Execution Framework (TEF) is the CrowdSearcher GUI. The following items need to be properly installed and configured:
- .NET Framework 2.0 Software Development Kit (SDK) (x86), available at http://www.microsoft.com/en-us/download/details.aspx?id=19988
- mongoDB, available at http://www.mongodb.org/downloads
- node.js, available at http://nodejs.org/download/
The installation guidelines of both Crowd Searcher and TEF are available at:
http://crowdsearcher.searchcomputing.org/gitlab/crowdsearcher/cs/blob/www2014/README.md
and at:
http://crowdsearcher.searchcomputing.org/gitlab/crowdsearcher/tef/blob/develop/README.md
but instead of using the git repository, please download both Crowd Searcher and TEF at the following links:
http://crowdsearcher.searchcomputing.org/gitlab/crowdsearcher/tef/repository/archive?ref=feature%2Feng
http://crowdsearcher.searchcomputing.org/gitlab/crowdsearcher/cs/repository/archive?ref=eng

3.2.1.2.1 Configuration

After having installed the software and the extensions, and having restarted your machine when requested, you have to prepare a running mongo-db instance and configure the CrowdSearcher and TEF applications to run on the right ports.
a. Preparing and running mongo-db
1. Create a folder where mongo-db will store the files needed for the V-App.
2. Open a prompt (Start menu -> Execute… -> type ”cmd” and press OK)
3. Browse to your mongo-db\bin program folder

33 http://crowdsearcher.search-computing.org/



4. Start the mongo-db server using the command: bin\mongod.exe --dbpath c:\path\to\hoe-dbdata
5. Now your mongo-db service is running. Don’t close the prompt until you want to stop the service.

Figure 46: Using the prompt to start mongo-db for History of Europe

b. Preparing and starting CrowdSearcher
1. Delete the file CrowdSearcher/config/override.json, if it exists.
2. Edit the file CrowdSearcher/conf/configuration.json and change the value of “hostname” to “localhost” and the value of “port” to 2100.
3. Open a prompt and browse to the CrowdSearcher folder
4. Start the CrowdSearcher application using the command “npm start”. Don’t close the prompt until you want to stop the service.

Figure 47: How to start the CrowdSearcher service

5. You will see in the mongo-db prompt that a connection is established and, after a few seconds, the CrowdSearcher prompt will change to show that the service is running on the configured port.



Figure 48: The CrowdSearcher service is running and ready

c. Preparing and starting TEF (Task Execution Framework)
1. Delete the file TEF/config/override.json, if it exists.
2. Edit the file TEF/conf/configuration.json and change the value of “hostname” to “localhost” and the value of “port” to 8100.
3. Copy the content of the ExpertCrowdEntityVerification folder into the TEF/installation/default folder and, if requested, OVERWRITE the existing files.
4. Open a prompt and browse to the TEF folder
5. Start the TEF application using the command “npm start”
NB: Don’t close the prompt until you want to stop the service.
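The CrowdSearcher and TEF preparation steps edit the same two keys of a configuration.json; the sketch below assumes the "hostname" and "port" keys are top-level, as the steps indicate:

```python
import json

def configure_service(conf_path, hostname, port):
    """Set the 'hostname' and 'port' keys of a CrowdSearcher/TEF
    configuration.json, leaving the other settings untouched."""
    with open(conf_path, "r", encoding="utf-8") as f:
        cfg = json.load(f)
    cfg["hostname"] = hostname
    cfg["port"] = port
    with open(conf_path, "w", encoding="utf-8") as f:
        json.dump(cfg, f, indent=2)
    return cfg

# Illustrative paths, with the ports given in the steps above:
# configure_service("CrowdSearcher/conf/configuration.json", "localhost", 2100)
# configure_service("TEF/conf/configuration.json", "localhost", 8100)
```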

Figure 49: The TEF is running and ready

3.2.1.3 History of Europe configuration

This section reports the configuration information both for the fake configuration and for the case in which you have obtained a licence for Face Detection and Face Identification.
In case of a KeeSquare licence: rename My_Bundle_SMILA\plugin\cubrikproject.service.polmi.FaceDetection_1.0.0.jar.keesquare by deleting the .keesquare extension, and delete the cubrikproject.service.polmi.FaceDetection_1.0.0.jar.fake file.
In case of a fake configuration (no KeeSquare licence): rename My_Bundle_SMILA\plugin\cubrikproject.service.polmi.FaceDetection_1.0.0.jar.fake by deleting the .fake extension, and delete the cubrikproject.service.polmi.FaceDetection_1.0.0.jar.keesquare file.
Please proceed with the following steps:
Open the config.ini file that is under the configuration folder of your SMILA installation and add at the end of the file, but before org.eclipse.smila.http.server@5:start, the following configuration:
cubrikproject.service.polmi.CrowdPollManager@1:start, \



cubrikproject.service.polmi.Crowdsearch@1:start, \
cubrikproject.service.polmi.Microtask@1:start, \
cubrikproject.pipelet.polmi.Crowd@4:start, \
cubrikproject.pipelet.polmi.FaceMatching@4:start, \
cubrikproject.pipelet.polmi.PhotoProcessing@4:start, \
cubrikproject.pipelet.polmi.Portrait@4:start, \
cubrikproject.service.polmi.FaceDetection@4:start, \
cubrikproject.service.polmi.Utils@4:start, \
cubrikproject.servlet.polmi.hoe@4:start, \
Open the configuration\cubrikproject.service.polmi.FaceDetection\facedetector.properties file and modify the parameters with their relative paths:

#path of executable
detector.path=/home/FaceDetection/FaceGraph.exe
#where to save templates
template.path=/path
#where to save descriptors
descriptor.path=/path
#where to save edited photos
editedPhoto.path=/path
#if the component has to save templates
template.enable=true
#if the component has to save descriptors
descriptor.enable=false
#if the component has to save edited photos
editedPhoto.enable=true
resources.path=/path_to_file_fakerespnses_ks_txt (example: resources/fakeresponses/fakeresponse_ks.txt)
fake.path=/path_to_file_fakerespnses_txt (example: resources/fakeresponses/fakeresponse.txt)

Open the configuration\cubrikproject.service.polmi.FaceDetection\facematcher.properties file and modify the parameters with their relative paths:
#path of executable
matcher.path=/home/FaceDetection/FaceMatcher.exe
Open the \configuration\cubrikproject.service.polmi.Crowdsearch\crowdsearcher.properties file and modify the path of the resources.path parameter:
resources.path=/path_to_root_folder_BUNDLE_SMILA
Open the resources\fakeresponses\fakeresponses.txt file and modify the image paths to point to the edited directory you downloaded, for example:
00008mycollection_0 00008mycollection … none /path_edited_folder/00008_0.jpg
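To sanity-check the edited .properties files, a minimal parser can be used; the helper below is illustrative and not part of the release:

```python
def load_properties(path):
    """Minimal parser for java-style .properties files like the ones above:
    skips blank lines and '#' comments, splits on the first '='."""
    props = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props

# e.g. verify that the face detector executable path was set (path illustrative):
# props = load_properties("configuration/cubrikproject.service.polmi.FaceDetection/facedetector.properties")
# assert props["detector.path"].endswith("FaceGraph.exe")
```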



Open the \configuration\org.eclipse.smila.importing.crawler.web\webcrawler.properties file and modify the socketTimeout parameter:
socketTimeout=300000
Open the /configuration\org.eclipse.smila.jobmanager\jobs.json.hoe file and copy the content between {"_comment":"START Accessibilty H-Demo Specific configuration"}, and {"_comment":"END Accessibilty H-Demo Specific configuration"} into the same position of the jobs.json of your SMILA installation
Open the configuration\org.eclipse.smila.jobmanager\workflows.json.hoe file and copy the content between {"_comment":"START Accessibilty H-Demo Specific configuration"}, and {"_comment":"END Accessibilty H-Demo Specific configuration"} into the same position of the workflows.json of your SMILA installation
Open the configuration\org.eclipse.smila.processing.bpel\processor.properties file and modify the pipeline.timeout parameter as follows:
pipeline.timeout=6000
Open the configuration\org.eclipse.smila.processing.bpel\pipelines\deploy.xml.hoe file and copy the content between <!-- START Accessibilty H-Demo Specific configuration --> and <!-- END Accessibilty H-Demo Specific configuration --> into the same position of the deploy.xml file of your SMILA installation

Open the configuration\org.eclipse.smila.processing.bpel\pipelines\FaceMatchingPipeline.bpel file and modify the following section:
<proc:configuration>
  <rec:Val key="numProposedMatches">10</rec:Val>
  <rec:Val key="enableEmailInvitation">true</rec:Val>
  <rec:Val key="emailService">my_emailService</rec:Val>
  <rec:Val key="emailPassword">*******</rec:Val>
  <rec:Val key="emailAddress">my_email@emailservice.com</rec:Val>
</proc:configuration>

Where:
- numProposedMatches specifies the number of matches sent to the expert
- enableEmailInvitation specifies whether the CrowdSearcher should send an email to registered performers to notify new tasks
- emailService specifies the service used to send the invitation emails. Enabled services: DynectEmail, Gmail, hot.ee, Hotmail, iCloud, mail.ee, Mail.Ru, Mailgun, Mailjet, Mandrill, Postmark, QQ, SendGrid, SES, Yahoo, yandex, Zoho. More to come.
- emailPassword: the password for the email account
- emailAddress: the email to be used to send invitations
In case a new performer needs to be added, a REST call has to be made as follows:
POST request to: http://localhost:2100/api/performer
Body: {



  "username": "my_user",
  "birthdate": "my_birthdate",
  "name": "my Name",
  "email": "my_email"
}
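The registration call can be sketched in Python; the endpoint is the one given above, while the field values remain placeholders:

```python
import json
from urllib import request

def performer_payload(username, birthdate, name, email):
    """Build the JSON body for the performer-registration POST (values are placeholders)."""
    return json.dumps({
        "username": username,
        "birthdate": birthdate,
        "name": name,
        "email": email,
    })

def register_performer(payload, url="http://localhost:2100/api/performer"):
    """POST the payload to the CrowdSearcher performer endpoint given above."""
    req = request.Request(url, data=payload.encode("utf-8"),
                          headers={"Content-Type": "application/json"})
    return request.urlopen(req)

# register_performer(performer_payload("my_user", "my_birthdate", "my Name", "my_email"))
```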

3.2.1.4 How to run the HoE V-App

Once the HoE is properly installed and configured, you can test your HoE installation as follows:
1. Run the smila.exe that is under your SMILA installation and wait for the message: HTTP server started successfully on port 8080.
2. Run the HoEJobStarter by double-clicking \plugins\cubrikproject.launcher.HoE.jar of your SMILA installation; at this point the following interface will be opened:

Figure 50: SMILA Jobs Starter

3. Click on Start Jobs to run all SMILA jobs
4. Click on Init Store and wait for some seconds, then click on the Start Crawling Portraits button to upload the data collection
5. Once the HoE application is properly initialized, you can browse the HoE V-App GUI at http://localhost:8080/SMILA/hoe/index.html



Figure 51: HoE GUI

6. At this point you can proceed with processing a new collection. Clicking on the Process New Folder button, the following page is shown:



Figure 52: Process a New Folder

Here you can enter the folder name, which for the fake modality has to be mycollection, and the Folder Path, which has to be the path of the \resources\HoEPhotos\LowRes folder. In case you have the Face Detection and Face Identification components installed on your PC, the folder name can be chosen according to your preference; for example, if the folder name is New Collection, follow the image below:

Figure 53: Process a New Folder 2

At this point, all the photos of the processed collection are shown in the result page below:



Figure 54: Collection exploration

In the figure above, for the collection already processed, a summary of the automatically performed process is shown in the Collection Info section. Moreover, the Collection Images section reports all the processed images. Clicking on each image, it is possible to view the results of the automatic face detection process, since the Face Identification component returns a list of potential identities for the faces in the image.



Figure 55: Face detail

At this point, the user can refine the results of face detection, for example deleting false results related to a wrong detection. Moreover, the validation of the automatic process can be performed by:
• A Q&A (Crowd Searcher), which sends identification tasks to a list of experts
• A crowd-based architecture (Microtask), which performs operations by means of a generic Clickworkers crowd
The task is automatically opened when a new collection is processed and some faces are detected by the Face Detection component. Regarding the Crowd Searcher, once the task has been created, browsing to http://localhost:8100/jobs shows the interface reported in the following image:



Figure 56: Jobs Execution

Clicking on the Face Identification job, the job detail page is shown:

Figure 57: Jobs details page

Clicking further on “Face Identification”, the task detail page is shown:



Figure 58: Face Identification Task

Clicking on the “Run Task” blue box allows testing the execution of the task: given a face and a list of suggestions (e.g., the matches with available portraits of people with known identity), the domain expert should identify the person in the picture.



Figure 59: CrowdSearcher UI for Expert Identification

The domain expert identifies the person in the picture and, clicking on Send, their contribution is registered and reported in the Collection Info section.



Figure 60: Image Validation performed

Clicking on the “Answers” green box of the Face Identification Task GUI (Figure 58), a list of all the answers for the task is shown:



Figure 61: Answers list

Regarding the Microtask platform, a crowd of clickworkers reviews face positions in two ways: to identify false positives, the crowd members evaluate whether the bounding boxes placed on an image actually cover a face or not; for false negatives, the crowd is able to annotate missing faces in the image. Overall, a face position is verified after three independent confirmations from different clickworkers.
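The three-confirmations rule can be sketched as a small aggregation over (face, worker) pairs; this is an illustration, not the Microtask implementation:

```python
def verified_faces(confirmations, required=3):
    """Given (face_id, worker_id) confirmation pairs, return the ids of
    faces confirmed by at least `required` distinct clickworkers."""
    workers_per_face = {}
    for face_id, worker_id in confirmations:
        workers_per_face.setdefault(face_id, set()).add(worker_id)
    return {f for f, workers in workers_per_face.items() if len(workers) >= required}
```

Note that duplicate confirmations by the same worker count once, matching the "independent confirmations" requirement above.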

Figure 62: Crowd Face position validation

In order to use the Microtask platform, the creation of a workspace has to be requested; the Microtask contact point is otto.chrons@microtask.com



After the Crowd Validation, the following images will be shown:

Figure 63: Face detail after Crowd validation

After the crowd validation, three faces are identified, compared to the single one discovered by the automatic process. Moreover, once the user selects a face, the “Crowd validation” checkbox enables switching between the results before and after crowd entity verification.



Figure 64: Face detail after Crowd validation



4

R3 Components

This chapter addresses all CUbRIK components belonging to R3 of the CUbRIK platform. Components for V-Apps (History of Europe and Fashion) are reported in order to give an overview of the component set; the list below marks each V-App component as part of R3 (X) or due in Year 3 (Y3):

Accessibility: X
Content Provider Tool: X
Connection to the CVCE collection: Y3
Copyright aware Crawler: X
Crowd face position validation: X
Descriptor Extractor: X
Entity verification & annotation: X
Entitypedia integration and data provisioning: X
Expansion through documents: X
Expansion through images: Y3
Expert crowd entity verification: X
Face detection: X
Face identification: X
GWAP Sketchness (Clothing Item Identification): X
Image Extraction from Social Network: X
Implicit Feedback Filter LikeLines: X
License checker: X
Lower & Upper body parts detector: X
Media harvesting and upload: X
Object Store: X
Provenance checker: X
Query for entities: X
Social graph creation: X
Social Graph network analysis toolkit: X
Trend Analyser: X
Visualisation of the social graph: X



4.1 Components Specification

This section reports the identity card of each component belonging to R3, with the main information for the given component. In detail:
• Responsibilities describes the purpose or job of the component
• Provided Interfaces lists the interfaces through which the component provides its functionalities to other components (exposed services)
• Dependencies/Required Interfaces concern the interfaces from other components which the component requires for its functionalities (exploited services)
• CUbRIK svn Url is the link to the official CUbRIK svn from which the component can be downloaded
This identity card is an excerpt of the component specification template that was defined for the specification of all CUbRIK components. Each component specification template was filled in by the component owner and is available in the CUbRIK svn in each component folder released under https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/Components

4.1.1 Accessibility

Component Name: Accessibility component
Responsibilities: In the context of the accessibility-related annotation, the purpose of the accessibility component is to provide methods for extracting accessibility-related annotations from images, in the form of a vector of accessibility scores for the various supported impairments. These accessibility scores will eventually be used in order to re-rank the search results according to how accessible they are for a specific user.
Provided Interfaces: accessibilityAnnotation, calculateAccessibilityScores
Dependencies / Required Interfaces: The implementation of the component’s methods does not depend on any interface exposed by other CUbRIK components. All information needed for the execution of the component’s methods is passed to them via their arguments.
CUbRIK svn Url: https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/Components/Accessibility



4.1.2 Content Provider Tool

Component Name: Content Provider Tool (CPT)
Responsibilities: The Content Provider Tool (CPT) enables the content provider to reliably express and bind license/permission information (CC and more restrictive licenses) to local content, thus creating datasets that can be processed by the CUbRIK platform. It also ensures non-repudiation and integrity of the provided license/permission information. The CPT is a standalone application for preparing data sets (media files and metadata). Trust is established by signing the information by the content provider/rights holder.
Provided Interfaces: CLI usage example:
java -jar content-provider-tool-1.0.0.one-jar.jar test doe.jks changeit doe
- 'test': dataset source folder containing media files, metadata files and license information (XML, conformant to the CPT schema)
- 'doe.jks': keystore containing a private key which is used for signing the license information
- 'changeit': keystore password
- 'doe': alias (id) of the private key to be used for signing
Dependencies / Required Interfaces: None; standalone application.
CUbRIK svn Url: https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/Components/ContentProviderTool

4.1.3 Connection to the CVCE collection

Component Name: Connection to CVCE collection
Responsibilities: The provision of historical (i.e. trustable) texts related to person names, places and time for integration in Entitypedia. The generation of enriched versions of texts resulting from the annotation, extraction and verification process.
Provided Interfaces: web service
- triggerBatchEntityExtraction(Text[])
Dependencies / Required Interfaces:
- extractEntities(Text) from the entity annotation & extraction system component
- storeEnrichedTexts(Text[]) from the CVCE backend (Alfresco based)
Y2: Provision of around 10.000 historical documents containing persons, places, organisations, events names and periods of time:
sftp://cubrik@cubrikfactory.eng.it:21/.HUGE_DATA/projects/cubrik/CONTENT_ARCHIVE/documents_cvce/cvce_docs_de_fr.zip
Y3: Connection to the CVCE back-end (for entities extraction of new documents)



4.1.4 Copyright aware Crawler

Component Name: Copyright-Aware Crawler (CAC)
Responsibilities: The Copyright-Aware Crawler (CAC) is responsible for fetching multimedia content from various sources, along with license information, and storing it using the Object Store or the local file system.
Provided Interfaces: The CAC is usable either by a command line client or a public API. It is available as an OSGi plugin to be called from SMILA. Currently it supports the crawling of:
• Flickr images by tag search
• Tumblr images by tag search or blog URL
• Wikimedia Commons content by URL
• Podcasts by URL
Currently supported storage options:
• Store to Object Store (OS)
• Store to file system
Dependencies / Required Interfaces: As for the current standalone version: none
CUbRIK svn Url: https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/Components/CopyrightAwareCrawler

4.1.5 Crowd face position validation

Component Name: Crowd Face Position Validation
Responsibilities: This component validates the results of the automatic face detection. It is based on the CrowdSearcher (CS) OSGI service, which is used to invoke the CrowdSearcher Web Application to create new tasks, send task invitations and retrieve responses.
Provided Interfaces: The CS is used for the “Crowd Identify Face” task: given a face and a list of suggestions (e.g., the matches with available portraits of people with known identity), the crowd worker (typically, a domain expert) should identify the person in the picture. Once the pipeline finds the top-k matches for a face, it opens a new task on the CrowdSearcher platform and sends a notification to a list of experts. Every time a new expert executes the task, CS sends the new identification result to SMILA.
Dependencies / Required Interfaces and CUbRIK svn Url: This component is part of the Search Computing (Seco) project funded by the European Research Council (ERC) "IDEAS Advanced Grants". The software licence is proprietary; for information and conditions please contact Marco Brambilla (marco.brambilla@polimi.it) as the POLIMI point of contact.



4.1.6 Descriptor Extractor

Component Name: Descriptors Extractor
Responsibilities: Extracts multimedia descriptors and textual descriptors (optional) for each image retrieved, in order to be used by the trend analyser.
Provided Interfaces: extractDominantColor, extractColorPalette, extractLBP, extractOSIFT
Dependencies / Required Interfaces:
CUbRIK svn Url: https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/Components/DescriptorExtractor

4.1.7 Entity verification & annotation

Component Name: Entity Annotation and Verification
Responsibilities: The purpose of this component is to allow users to identify people (faces) in a picture, to verify the identity of the identified people, and to annotate relevant data related to the picture and the people present in it (image title, date when it was taken, event, place, personal information about the people, …).
Provided Interfaces:
Dependencies / Required Interfaces: The component depends on the JSON/REST interface to get the results of the automatic face validation/identification.
CUbRIK svn Url: https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/Components/EntityVerification&Annotation



4.1.8

Entitypedia integration and data provisioning

Component Name

Entitypedia Integration and data provisioning

Responsibilities

Provides CRUD and search methods on entities, attributes, and metaattributes.

Provided Interfaces

The component is implemented as a java client for Entitypedia JSON API and provides the following functionality: public long create(Entity entity) throws CubrikApiException; public Entity read(long entityId) throws CubrikApiException; public Entity update(Entity entity) throws CubrikApiException; public void delete(long entityId) throws CubrikApiException; public List<Entity> search(String entityType, String keywordQuery); public String[][] searchWithEql(String eqlQuery);

Dependencies / Required Interfaces

CUbRIK svn Url

https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/Components/EntitypediaIntegrationAndDataProvisioning

4.1.9

Expansion through documents

Component Name

Expansion through documents

Responsibilities

• Retrieval of connected documents
• Provision of measure

Provided Interfaces

One method that accepts entities

Dependencies / Required Interfaces

Context expander
Index

CUbRIK svn Url

https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/Components/ExpansionThroughDocuments



4.1.10

Expansion through images

Component Name

Expansion through Images

Responsibilities

Selected entities are expanded with a list of relevant, diverse images

Provided Interfaces

retrieveImage

Dependencies / Required Interfaces

The component incorporates all necessary dependencies and does not depend on SMILA.

CUbRIK svn Url

https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/Components/ExpansionThroughImages

4.1.11

Expert crowd entity verification

Component Name

Expert Crowd Entity Verification

Responsibilities

This component verifies the entity associated with a face

Provided Interfaces

The component is a SMILA pipeline activated after the Face Identification pipeline, receiving as input a SMILA record representing a Face. It can also be invoked outside SMILA with:

POST:
/smila/hoe/requestIdentification: create a crowd task for the specified face and send the invitation to the specified expert
/smila/job/identify/bulk: save an identification for a given face

GET:
smila/hoe/getHoE: used to retrieve detected faces, matches and identifications

Dependencies / Required Interfaces

This component interacts with the Crowdsearcher platform to execute crowd tasks and with the Face Identification component to update results. It also interacts with Entitypedia to obtain a Person Entity associated with the name specified by the crowd.

CUbRIK svn Url

https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/Components/ExpertCrowdEntityVerification



4.1.12

Face detection

Component Name

Face Detection

Responsibilities

This component automatically detects faces in an image and sends the results to validation

Provided Interfaces

POST:
smila/hoe/processCollection: used to start processing a folder of images
smila/hoe/processImage: used to upload an image to be processed

GET:
smila/hoe/getCollection: used to retrieve detected faces

Dependencies / Required Interfaces

This component interacts with a Face Detection Tool (keesquare) for automatic face detection and with the Crowd Face Position Validation component to validate results. It also interacts with the Entitypedia storage to save faces and photos.

CUbRIK svn Url

https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/Components/FaceDetection
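As an illustration, a client of the Face Detection component only needs to address the three endpoints listed above. The sketch below is a minimal, hypothetical helper that maps logical actions onto those documented paths; the base URL, the action names, and the separation of URL building from the actual HTTP call are assumptions, not part of the component's specification.

```python
# Hypothetical client-side helper for the Face Detection REST endpoints.
# Only the endpoint paths come from the component description; everything
# else (names, base URL handling) is illustrative.
from urllib.parse import urljoin

FACE_DETECTION_ENDPOINTS = {
    "process_collection": ("POST", "smila/hoe/processCollection"),
    "process_image": ("POST", "smila/hoe/processImage"),
    "get_collection": ("GET", "smila/hoe/getCollection"),
}

def build_request(base_url, action):
    """Return (HTTP method, full URL) for one of the documented actions."""
    method, path = FACE_DETECTION_ENDPOINTS[action]
    return method, urljoin(base_url.rstrip("/") + "/", path)
```

A caller would then pass the resulting method and URL to any HTTP library to trigger processing or retrieve detected faces.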

4.1.13

Face identification

Component Name

Face Identification

Responsibilities

This component aims to identify the person represented in a face by performing face similarity between the face and a fixed set of already identified faces.

Provided Interfaces

The component is a SMILA pipeline activated after the Face Detection pipeline, receiving as input a SMILA record representing a Face. To retrieve results:

GET:
smila/hoe/getCollection: used to retrieve detected faces and matches

Dependencies / Required Interfaces

This component interacts with a Face Matcher Tool (keesquare) for automatic face similarity and with the Crowd Face Identification component and Crowd Pre-Filtering to validate results. It also interacts with the Entitypedia storage to add matches.

CUbRIK svn Url

https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/Components/FaceIdentification



4.1.14

GWAP Sketchness (Clothing Item Identification)

Component Name

Sketchness - GWAP for Segmentation

Responsibilities

This GWAP exploits a crowd of players in order to segment fashion-related images that were difficult to process for the other components involved. The GWAP can be used to check whether a particular fashion item is present within an image by asking the crowd for confirmation in the form of a tag; the image can also be tagged if it was not previously annotated. The component is also used to segment the tagged fashion item within the image by asking the players to trace the contours of the object.

Provided Interfaces

GET:
image/all: retrieve all the images stored in the CMS
image/<id>: retrieve a single image stored in the CMS
image/<id>/segment: retrieve all the segments associated with the image
image/<id>/tag: retrieve all the tags that have been assigned to the image

POST:
image: post an image

POST/PUT:
image/<id>/tag: submit a particular tag for a specific image

Dependencies / Required Interfaces

None. Being based on a RESTful web service, it is the duty of the other components to query the CMS that contains the data generated by Sketchness, both for submitting content and for retrieving annotations.

CUbRIK svn Url

https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/Components/GWAPSketcheness(ClothingItemIdentification)
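Since other components must themselves construct the CMS paths above, a small path-building helper clarifies the interface. The sketch below is hypothetical: the resource names and the function itself are invented for illustration, while the path templates come from the endpoint list.

```python
# Hypothetical helper mapping logical resource names onto the Sketchness
# CMS paths documented above. Resource names are invented; paths are not.
def sketchness_path(resource, image_id=None):
    """Return the documented CMS path for a logical resource."""
    templates = {
        "all_images": "image/all",          # every image in the CMS
        "image": "image/{id}",              # a single image
        "segments": "image/{id}/segment",   # segments for an image
        "tags": "image/{id}/tag",           # tags for an image
    }
    path = templates[resource]
    if "{id}" in path:
        if image_id is None:
            raise ValueError("image_id required for resource " + resource)
        path = path.format(id=image_id)
    return path
```

A consuming component would append such a path to the CMS base URL and issue the GET/POST/PUT call described in the interface list.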

4.1.15

Image Extraction from Social Network

Component Name

Extraction of Images from SN

Responsibilities

The purpose of the component is to retrieve streams of images and metadata from social networks (SNs) to be used for trend analysis. The SN examined for now is Twitter.

Provided Interfaces

Data are recorded to a mongoDB database. The components that use the retrieved data get them directly from the database.

Dependencies / Required Interfaces

-

CUbRIK svn Url

https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/Components/ImageExtractionFromSocialNetwork



4.1.16

Implicit Feedback Filter LikeLines

Component Name

Implicit Feedback Filter (LikeLines)

Responsibilities

Given a link to a video, the component delivers the time-codes of interesting keyframes in the video.

Provided Interfaces

Takes input from "Trend Analyser" and "Image Extraction from Social Network" and provides output to "Full Image Identification".

getNKeyFrames

Dependencies / Required Interfaces

No dependencies other than the existence of an up-and-running LikeLines server.

CUbRIK svn Url

https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/Components/ImplicitFeedbackFilterLikeLines

4.1.17

License checker

Component Name

License Checker (LC)

Responsibilities

The License Checker (LC) is responsible for mapping license information (CC / Europeana / proprietary) to platform permissions that can then be easily interpreted by system domains based on requested actions (storage, analysis, modification, presentation, distribution). The current version is provided as a standalone application; future versions will become an integrated part of the Object Store (OS).

Provided Interfaces

CLI usage example:

java -jar license-checker-1.0.1-SNAPSHOT.onejar.jar input-custom.xml output.xml

- 'input-custom.xml': signed license information document (as created by the CPT)
- 'output.xml': platform permissions document; the mapping is defined in the mapping.properties file (the current version only contains example mappings)

Dependencies / Required Interfaces

As for the current standalone version: none.

CUbRIK svn Url

https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/Components/LicenseChecker
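To make the license-to-permission mapping idea concrete, the sketch below models it as a lookup from license identifiers to the five platform actions the description lists. The concrete license identifiers and permission sets are invented examples; the real mapping lives in the LC's mapping.properties file.

```python
# Hypothetical illustration of the License Checker's mapping concept.
# License identifiers and permission sets below are invented; only the
# five action names come from the component description.
EXAMPLE_MAPPING = {
    "CC-BY-SA-3.0": {"storage", "analysis", "modification",
                     "presentation", "distribution"},
    "proprietary": {"storage", "presentation"},
}

def permitted(license_id, action):
    """True if the mapped permission set allows the requested action."""
    return action in EXAMPLE_MAPPING.get(license_id, set())
```

Unknown licenses map to an empty permission set, so by default no action is allowed, which mirrors the conservative behaviour one would expect from a license checker.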



4.1.18

Lower & Upper body parts detector

Component Name

Lower & Upper Body Part Detector

Responsibilities

Detects upper and lower body parts in an image.

Provided Interfaces

The component offers its service through a REST-based interface (using JSON syntax). A BPEL file will describe the REST-based service and make it available as a SMILA component. Alternatively, direct Java invocation may be provided.

Dependencies / Required Interfaces

The component will package all necessary dependencies and does not depend on SMILA. It can thus be run independently.

CUbRIK svn Url

https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/Components/Lower&UpperBodyPartsDetector

4.1.19

Media harvesting and upload

Component Name

Media harvesting and upload

Responsibilities

The media harvesting and upload component is responsible for populating the History of Europe database with content from the CVCE archive and external sources. The external data sources that will be considered for the second year demonstrator will be the Flickr service and the Europeana collections.

Provided Interfaces

startHarvesting()

Dependencies / Required Interfaces CUbRIK svn Url

https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/Components/MediaHarvestingAndUpload

Y2: crawled Europeana content for given queries: https://cubrikfactory.eng.it/svn/CUBRIK/WORK/MediaHarvestingAndUpload_CERTH/docs/europeana_top50persons_json.zip
Y3: SMILA pipeline for the Europeana API; integration of the EC Audiovisual library

4.1.20

Object Store

Component Name

Object Store (OS)



Responsibilities

The Object Store (OS) component is responsible for the storage of binary objects and metadata annotations (using MongoDB's GridFS) via a REST API. Metadata can be stored either as JSON-formatted text or as files (arbitrary format). The storage solution also provides the functionality to define relationships between media object files and metadata object files. The OS also provides an HMAC-based access control/authentication mechanism. Future versions of the OS will also integrate functionalities for license checking and provenance checking for provided content.

Provided Interfaces

The complete API description is available by calling the following URL of a running service instance: http://host:port/application.wadl The REST API includes methods for uploading, downloading, updating and removing binary objects, managing relationships between binary objects, managing metadata of objects, and finding objects. The API also includes CRUD methods for JSON-formatted information. Moreover, a client component for the service is available (as a library for integration into other components, as well as an OSGi plugin to be called from within SMILA).

Dependencies / Required Interfaces

Prerequisite: running MongoDB installation.

CUbRIK svn Url

https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/Components/ObjectStore
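The description mentions an HMAC-based access control/authentication mechanism but not its wire format. The sketch below shows the general shape of such a scheme: the client signs a canonical representation of the request with a shared secret, and the server recomputes and compares the signature. The canonical-string layout and function names are assumptions, not the OS's actual protocol.

```python
# Generic sketch of HMAC-based request signing, illustrating the kind of
# access-control mechanism the Object Store description mentions. The
# canonical string layout and key handling are assumptions.
import hashlib
import hmac

def sign_request(secret, method, path, body=b""):
    """Return a hex HMAC-SHA256 signature over a canonical request string."""
    canonical = (method.upper().encode() + b"\n"
                 + path.encode() + b"\n"
                 + hashlib.sha256(body).digest())
    return hmac.new(secret, canonical, hashlib.sha256).hexdigest()

def verify_request(secret, method, path, body, signature):
    """Constant-time comparison against a freshly computed signature."""
    return hmac.compare_digest(sign_request(secret, method, path, body),
                               signature)
```

Hashing the body into the canonical string ties the signature to the uploaded content, so a tampered payload fails verification even if the URL is unchanged.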



4.1.21

Provenance checker

Component Name

Provenance Checker (PC)

Responsibilities

The Provenance Checker (PC) component is responsible for analysing the content source, checking integrity, and detecting duplicates (exact, perceptual, and metadata-based) and content reuse. It checks whether provided content can be approved for use within the platform (based on the existence of trusted license information). The current version is provided as a standalone application; future versions will become an integrated part of the Object Store (OS).

Provided Interfaces

CLI usage example:

java -jar provenance-checker-1.0.0.one-jar.jar content.mp3 out.xml signed.xml

- 'content.mp3': a media file
- 'out.xml': output of the PC, the provenance document containing information about the approval status of the media file and whether the media file is an exact or perceptual (not implemented in v1) copy of a reference file (reference file = approved file)
- 'signed.xml': license information as created by the CPT (if a valid license information document is provided, the related media file is approved and becomes a 'reference file')

Dependencies / Required Interfaces

The PC requires the existence of a central component (here: CUbRIK_CONTROLLER) and respective method that triggers the content provisioning process.

CUbRIK svn Url

https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/Components/ProvenanceChecker
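Exact-duplicate detection against approved reference files, one of the checks listed above, can be sketched as a registry of content digests. This simplified class is illustrative only: the registry name and API are invented, and perceptual duplicate detection (not implemented in v1 of the PC) would require a perceptual hash rather than SHA-256.

```python
# Sketch of exact-duplicate detection by content hashing, mirroring the
# "reference file = approved file" notion in the PC description. The class
# and its API are invented for illustration.
import hashlib

class ReferenceRegistry:
    """Keeps digests of approved reference files and flags exact copies."""

    def __init__(self):
        self._digests = set()

    def approve(self, content):
        """Register a media file (as bytes) as an approved reference file."""
        digest = hashlib.sha256(content).hexdigest()
        self._digests.add(digest)
        return digest

    def is_exact_duplicate(self, content):
        """True if this exact byte sequence matches an approved file."""
        return hashlib.sha256(content).hexdigest() in self._digests
```

A cryptographic digest catches byte-identical copies only; re-encoded or resized media would need the perceptual comparison planned for later versions.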

4.1.22

Query for entities

Component Name

Query for entities

Responsibilities

• Recognition of entities in the query
• Return of results

Provided Interfaces

One method that accepts textual queries via REST APIs

Dependencies / Required Interfaces

Visualization of the social graph
Index

CUbRIK svn Url

This component is released as part of the Query Federation H-Demo, so it is available at https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/HDemos/SearchEngineFederation_EMPOLIS; the component specification is available at https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/Components/QueryForEntities



4.1.23

Social graph creation

Component Name

Social Graph Creation

Responsibilities

Computes co-occurrence statistics based on verified entities

Provided Interfaces

A single query function with flexible in-/output options

Dependencies / Required Interfaces

Apart from access to verified entities, the component will package all necessary dependencies and does not depend on SMILA.

CUbRIK svn Url

https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/Components/SocialGraphCreation
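The co-occurrence statistic this component computes can be sketched as counting, over a corpus, how often each pair of verified entities appears in the same document. The input format (one entity list per document) and function name are assumptions for illustration.

```python
# Illustrative sketch of entity co-occurrence counting, the statistic the
# Social Graph Creation component computes over verified entities.
from collections import Counter
from itertools import combinations

def cooccurrence_counts(documents):
    """Count how often each unordered pair of entities appears together.

    `documents` is a list of entity-name lists, one per document.
    """
    counts = Counter()
    for entities in documents:
        # sorted() over the de-duplicated set gives a canonical pair order,
        # so (a, b) and (b, a) are counted as the same edge.
        for pair in combinations(sorted(set(entities)), 2):
            counts[pair] += 1
    return counts
```

The resulting pair counts are exactly the weighted edges a social graph builder needs: entities as vertices, co-occurrence counts as edge weights.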

4.1.24

Social Graph network analysis

Component Name

Graph Toolkit

Responsibilities

(a) Community generation; (b) vertex ranking (HITS, PageRank); (c) computation of global features of the graph, expressing its size/diameter and density; and (d) calculation of centrality measures (closeness, eigenvector, betweenness) for vertices, indicating the importance of the vertices in the social graph.

Provided Interfaces

getGraphModelFromFile()
getGlobalFeatures()
getNodeFeatures()
getClustersFeatures()
exportFeatures()

Dependencies / Required Interfaces

JDK 1.7 and gephi-toolkit.jar

CUbRIK svn Url

https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/Components/SocialGraphNetworkAnalysisToolkit
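To illustrate one of the centrality measures the toolkit computes via the gephi-toolkit, the sketch below implements closeness centrality on a plain adjacency-list graph using breadth-first search. This pure-Python toy is only meant to show the idea; the toolkit itself works on Gephi graph models.

```python
# Toy closeness-centrality computation on an unweighted adjacency-list
# graph, illustrating one centrality measure from the Graph Toolkit.
from collections import deque

def closeness_centrality(graph, node):
    """Closeness = (n - 1) / sum of shortest-path distances to all others."""
    distances = {node: 0}
    queue = deque([node])
    while queue:  # breadth-first search for shortest path lengths
        current = queue.popleft()
        for neighbour in graph.get(current, []):
            if neighbour not in distances:
                distances[neighbour] = distances[current] + 1
                queue.append(neighbour)
    reachable = len(distances) - 1
    if reachable == 0:
        return 0.0  # isolated vertex
    return reachable / sum(distances.values())
```

On a path graph a-b-c, the middle vertex b reaches both others in one hop and scores 1.0, while the endpoints score 2/3, matching the intuition that central vertices are "closer" to the rest of the graph.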



4.1.25

Trend Analyser

Component Name

Trend Analyser

Responsibilities

Provide trends of fashion items for the specified time windows

Provided Interfaces

configParams
getColorTrends
getTextureTrends
getPopImages

Dependencies / Required Interfaces

Image extraction from S.N.
Descriptors extractor
Lower & Upper body parts detector
Sketchness GWAP

CUbRIK svn Url

https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/Components/TrendAnalyser

4.1.26

Visualization of the social graph

Component Name

Visualization of the social graph

Responsibilities

The purpose of this component is to display the co-occurrences between entities in two different visualizations: an overview graph and a view focused on a specific node. It is also designed to navigate through the entities, switching between the graphs.

Provided Interfaces

visualize() displayNode()

Dependencies / Required Interfaces

JSON/Rest API provided by the Query Component

CUbRIK svn Url

https://cubrikfactory.eng.it/svn/CUBRIK/PACKAGES/R3/Components/VisualisationOfTheSocialGraph



5

Conclusion

The document provides a comprehensive overview of the V-Apps and H-Demos, related pipelines and exploited components that belong to the third release of the CUbRIK Platform. Moreover, a step-by-step guideline is provided for setting up proper environments both for platform deployment and for platform extension; the latter leverages open source code. The summary table of what is released as R3, from Section 1.4.1 (Release 3 in a Nutshell), is reported below and extended with additional references to the sections of this document where the corresponding artefact release is described.

What is delivered

In detail

Reference of the Artefact Delivery

Release version: 2.0

Delivery Date: M18

Social network analysis, trust & people search techniques

Ranking strategies for Multimedia objects that make use of social features obtained from Web 2.0 platforms.

Part of HoE VApp 3.2 History of Europe application 4.1.10 Expansion through images

Pipelines for relevance feedback

People Identification HDemo

2.2 People Identification H-demo

Like Lines H-Demo:

2.4 Like Lines

Incentive models applied to concrete crowd-sourcing scenarios: GWAP (Sketchness, Crossword), Q&A (CrowdSearcher framework), CrowdTask (MICT platform, CrowdFlower framework)

Sketchness GWAP as part of Fashion V-App 3.1 and 4.1.14 Q&A as part of HoE VApp 3.2 and 4.1.11 Crossword H-Demo 2.5 CrowdTask MICT platform as part of HoE V-App 3.2

Entity game framework, crosswords scenario: Crossword for Entity Repository Uncertainty Reduction

Crossword H-Demo for Entity Repository Uncertainty Reduction 2.5

Component for relevance feedback, crosswords scenario: Crossword

Crossword H-Demo 2.5

Incentive models and algorithms



Release version: 3.0

Time core services, components and pipelines for preliminary support to the "History of Europe" application. This part was further extended to the SME Fashion V-App.

Pipelines for multimodal content analysis & enrichment

Pipelines for query processing


Reference of the Artefact Delivery

In detail

What is delivered

Release version

Delivery Date: M24

Multimedia similarity indexer for fast multimedia search.

Experimental version

Component for analysis, using various algorithms, of the data coming into the databases and from the crawlers, in order to find popular content and specific trending topics

fashion v-app / Trend analyzer to extract features from the multimedia content (images) and use the trend analyser to compute trends 3.1 and 4.1.25

Component for text query and a geolocation and crawls data from the web

Text query Fashion V-App crawler as part of Fashion V-App 3.1; crawler for Picasa, Panoramio, Flickr, Europeana and geolocalization as part of Media Entity 2.3

News History H-Demo: NCH Extraction

News Content History HDemo 2.1

History of Europe V-App

HoE pipeline 3.2

Search for SME Innovation application (Fashion) VApp

Fashion V-App pipeline 3.1

News Content History HDemo

News Content History H-Demo 2.1

People Identification HDemo

As part of People Identification H-Demo installed and tested 2.2

Media Entity Annotation HDemo (time related query extension)

As part of Media Entity Annotation H-Demo Installed and tested 2.3

Context-aware automatic query formulation H-Demo

As part of Context Aware automatic query formulation –installed and tested- 2.7



Component and pipeline support services

Reference of the Artefact Delivery

In detail

What is delivered

Release version

Delivery Date

Pipelines for relevance feedback

Search engine federation H-Demo

Search Engine Federation H-Demo – Exploiting IAS EMP Framework - 2.8

Crosswords H-Demo

Crossword H-Demo 2.5

Media Entity Annotation HDemo (relevance feedback - CrowdFlower extension)

Media Entity Annotation H-Demo 2.3

Accessibility Aware Relevance feedback

Accessibility Aware relevance feedback HDemo 2.6

HoE and Fashion V-App components

All the components reported in chapter 4, R3 Components

Platform integration is an articulated process involving different steps, including but not limited to testing and deployment. The CUbRIK testing procedure is essentially organised in two phases: artefact testing, whose goal is to find the defects of each artefact of the CUbRIK platform, and integration testing, as part of packaging testing, whose goal is to find defects when artefacts are integrated into a functional chain; it also includes an "installability" check in order to assure a smooth process without unrelated dependencies. Once the test process is complete, the H-Demos and V-Apps are deployed. The table below reports all H-Demos and V-Apps installed as a result of integration, testing and deployment on the ENG infrastructure; the relative reference URIs are provided.

News Content History H-Demo

http://cubrik1.eng.it:8080/newshistory-webapp/index.html

People Recognition H-Demo

http://cubrik2.eng.it:8083/static/index.html
http://cubrik2.eng.it:8080/smila/pipeline/ImportPhotoPipeline/process/
http://cubrik2.eng.it:8080/smila/pipeline/GetFaceLabelPipeline/process/
http://cubrik2.eng.it:8080/smila/pipeline/SetFaceLabelPipeline/process/
http://cubrik2.eng.it:8080/smila/pipeline/ValidateFacePipeline/process/

Media Entity Annotation H-Demo

http://cubrik1.eng.it/smila/pipeline/MediaEntityAnnotation/process/
http://cubrik1.eng.it/smila/pipeline/MediaEntityAnnotationFlickr/process/
http://cubrik1.eng.it/smila/pipeline/MediaEntityAnnotationTwitter/process/
http://cubrik1.eng.it/smila/pipeline/MediaEntityAnnotationPanoramio/process/
http://cubrik1.eng.it/smila/pipeline/MediaEntityAnnotationPicasa/process/
http://cubrik1.eng.it/smila/pipeline/MediaEntityAnnotationEuropeana/process/
http://cubrik1.eng.it/smila/pipeline/CrowdFlowerFeedback/process/
http://cubrik1.eng.it/smila/pipeline/CrowdFlowerMediaAnnotation/process/
http://cubrik1.eng.it/smila/pipeline/EntitySearch/process/

Accessibility aware Relevance feedback H-Demo

http://cubrik1.eng.it/accessibilityhdemo/login.html
http://cubrik1.eng.it/accessibility/annotate

Crosswords H-Demo

http://cubrik1.eng.it:8080/crosswords-ui/

Like Lines H-Demo

http://cubrik1.eng.it:8082/examples/demo.html

Context-aware automatic query formulation H-Demo

http://cubrik1.eng.it/queryformulationhdemo/search

Fashion Trend Analysis V-App

http://cubrik1.eng.it/TrendAnalysis/index.html

History Of Europe VApp

http://cubrik1.eng.it:8084/SMILA/hoe/index.html

The News Content History and Crossword H-Demos leverage proprietary components and were requested to be deployed under password protection. Credentials for access are reported below for internal usage and review check:

News Content History H-Demo

username=nchuser password=CUbRIK01!

People Recognition H-Demo

username=crosuser password=CUbRIK02!


