
R2 PIPELINES FOR MULTIMODAL CONTENT ANALYSIS & ENRICHMENT
Human-enhanced time-aware multimedia search

CUbRIK Project IST-287704 Deliverable D5.2 WP5

Deliverable Version 1.0 – 31/08/2013
Document ref.: cubrik.D52.POLMI.WP5.V1.0


Programme Name: IST
Project Number: 287704
Project Title: CUbRIK
Partners: Coordinator: ENG (IT); Contractors: UNITN, TUD, QMUL, LUH, POLMI, CERTH, NXT, MICT, FRH, INN, HOM, CVCE, EIPCM, EMP
Document Number: cubrik.D52.POLMI.WP5.V1.0
Work-Package: WP5
Deliverable Type: Accompanying Document
Contractual Date of Delivery: 31 August 2013
Actual Date of Delivery: 31 August 2013
Title of Document: R2 Pipelines for Multimodal Content Analysis & Enrichment
Author(s): Dionisio, Pasini, Tagliasacchi, Martinenghi, Fraternali (POLMI), Weigel, Aichroth (FRH), Croce, Lazzaro (ENG), Semertzidis (CERTH)

Approval of this report:
Summary of this report:
History:
Keyword List:
Availability: This report is public

This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 3.0 Unported License. This work is partially funded by the EU under grant IST-FP7-287704.


Disclaimer

This document contains confidential information in the form of the CUbRIK project findings, work and products and its use is strictly regulated by the CUbRIK Consortium Agreement and by Contract no. FP7-ICT-287704. Neither the CUbRIK Consortium nor any of its officers, employees or agents shall be responsible or liable in negligence or otherwise howsoever in respect of any inaccuracy or omission herein. The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7-ICT-2011-7) under grant agreement n° 287704. The contents of this document are the sole responsibility of the CUbRIK consortium and can in no way be taken to reflect the views of the European Union.


Table of Contents

EXECUTIVE SUMMARY

1. HISTORY OF EUROPE
   1.1 ARCHITECTURE AND WORKFLOW DESCRIPTION
   1.2 PIPELINES
      1.2.1 Photo Processing
      1.2.2 Portraits Processing
      1.2.3 Face Matching
      1.2.4 Crowd Face Validation Results
      1.2.5 Crowd Face Identification Result
      1.2.6 Crowd Face Add Result
      1.2.7 Crowd Keypoint Tagging Results
   1.3 SERVLETS
      1.3.1 ProcessCollection Servlet
      1.3.2 GetCollection Servlet
      1.3.3 FaceTagResult Servlet
   1.4 DATA MODEL
   1.5 USER INTERFACE

2. FASHION – TREND ANALYSIS
   2.1 ARCHITECTURE AND WORKFLOW DESCRIPTION
   2.2 PIPELINES
      2.2.1 Image extraction from Social Networks
      2.2.2 Entity Recognition and Extraction
      2.2.3 Accessibility Annotation Pipeline
      2.2.4 Trend Analyzer
      2.2.5 Trend Analyzer SMILA Pipeline
   2.3 DATA MODEL
   2.4 USER INTERFACE

3. NEWS CONTENT HISTORY
   3.1 ARCHITECTURE DESCRIPTION
   3.2 WORKFLOW
      3.2.1 Query by text
      3.2.2 Query by video
   3.3 DATA MODEL
   3.4 USER INTERFACE


Executive Summary

This deliverable contains a detailed description of the content analysis pipelines included in the CUbRIK release R2. The work is mainly the outcome of Task 5.1 (Pipelines for feature extraction and multimodal metadata fusion) and Task 5.2 (Pipelines for crowd-sourced content tagging). Specifically, this deliverable illustrates both the “History of Europe” and “Fashion – Trend Analysis” pipelines, which combine different components (cf. D8.2) to perform the analysis of multimedia content. In addition, the deliverable also illustrates the pipelines underlying the News Content History H-Demo.


1. History of Europe

Purpose

Browse, annotate and enrich a large image-based corpus related to people, events and facts about the history of Europe after the Second World War.

Description

For a detailed description, see “Who is this person?” use case in WP10.

Query formulation

There is no explicit query formulation. The user is presented with an interface that allows them to browse through the image collection to study social relationships among people, as well as their participation in events and facts that marked important milestones in the recent history of Europe.

Data collection

A collection of images taken in the last fifty years. Images depict: i) portraits of individuals that played a role in the history of Europe (e.g., prime ministers, officials of the EC, etc.). Portraits have associated metadata indicating the identity of the person. Each person appearing in the whole collection might have zero, one, or more than one portrait; ii) group pictures, taken in public events, which contain two or more (up to dozens) of individuals. Each individual might appear in more than one group picture. Some group pictures have metadata indicating the identities of some of the people, without explicit reference to the face position. The initial dataset contains about 4000 images. A subset of 1005 images was manually annotated to indicate the bounding boxes of all the faces appearing in each image. An additional annotation campaign is in progress with the support of experts, in order to indicate, for each bounding box, the identity of the person. Images are quite diverse in terms of size, quality (e.g., noise, blur), color, etc. The visual appearance of people is affected by several factors, including: age, facial expression, variable illumination, perspective, viewpoint, etc.

Result

The output of the system is a rich representation of the analysed image data set, which includes:
- The identities of people appearing in group pictures. When identification is not possible, clues that might drive expert finding (e.g., nationality of recognized people).
- The social relationships (people-to-people) that can be inferred from the analysis of co-occurrences of individuals in group pictures.
- Hints on the possible venue/location of a group picture.

Indexing pipeline

The History of Europe (HoE) pipeline receives as input a set of images, as described above in “Data collection”. The pipeline executes the following sequence of tasks:
- Images are divided into two classes: portraits and group pictures.

- Face detection
  o Input: group picture
  o Output: set of bounding boxes indicating the position and size of the detected faces. Each bounding box is associated with a confidence level in the [0, 1] interval.
  o A pre-processing step might be applied to perform image denoising and to rotate the original image, so as to detect faces that are not upright.
  o A post-processing step might be applied to merge the bounding boxes found at different rotation angles yet referring to the same face.

- Crowd face position validation (human task)
  o Validate detected bounding boxes to eliminate false positives
  o Identify undetected faces to increase recall

- Face identification
  o Input: a set of portrait images of an individual; bounding boxes detected in group pictures
  o Output: for each bounding box, a list of individuals whose portraits are similar to the detected face. The list is ranked in decreasing order of similarity and only the top-k most similar individuals are returned.

- Expert crowd verification (human task)
  o For each unknown face, consider the top-k most similar individuals. Identify the person if listed; otherwise, propose a different identity.

- Building the graph of group pictures / people
  o Input: validated face identities in group pictures
  o Output: a bipartite graph with two kinds of nodes: i) people; ii) group pictures. An edge connects a person to a group picture if that person's face is detected in it.

- Building the social graph of people (a short sketch of this step is given after Figure 1)
  o Input: graph of group pictures / people
  o Output: social graph of people (see Figure 1). Each node represents a person. An edge connecting two persons is weighted based on their social affinity, measured from the co-occurrences detected in group pictures.

Query processing pipeline

There is no pipeline executed at query time. The results produced by the content processing pipeline are presented to the user by means of a GUI that enables browsing and data discovery.

Extensions

Enrich the image collection with additional portrait images (e.g., from Google Image Search)
Clean textual annotations (e.g., named entity resolution)

Off-the-shelf components

Face detection: KeeSquare FF SDK (commercial license)
Face identification: KeeSquare FR SDK (commercial license)
Knowledge Based System (Entitypedia)

Human-computing components

Humans can be involved to accomplish the following tasks:
- Validate detected bounding boxes
- Add missing bounding boxes
- Validate face identification
- Add missing identities

Figure 1 - People-to-people social graph
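To make the last two steps of the indexing pipeline concrete, the following is a minimal sketch (purely illustrative; the function and example identities are assumptions and not part of the actual HoE implementation) of how the bipartite picture/people graph can be turned into the co-occurrence-weighted social graph of Figure 1.

from collections import defaultdict
from itertools import combinations

def build_social_graph(picture_to_people):
    """picture_to_people: dict mapping a group-picture id to the set of
    validated person identities detected in that picture (the bipartite graph).
    Returns a dict {(person_a, person_b): weight}, where the weight counts
    the co-occurrences of the two persons in group pictures."""
    social_edges = defaultdict(int)
    for people in picture_to_people.values():
        for a, b in combinations(sorted(people), 2):
            social_edges[(a, b)] += 1   # one more shared group picture
    return dict(social_edges)

# Example usage with hypothetical identities
bipartite = {
    "photo_001": {"Romano Prodi", "Jacques Delors"},
    "photo_002": {"Romano Prodi", "Jacques Delors", "Helmut Kohl"},
}
print(build_social_graph(bipartite))
# e.g. {('Jacques Delors', 'Romano Prodi'): 2, ('Helmut Kohl', 'Jacques Delors'): 1, ...}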


1.1 Architecture and workflow description

Figure 2 and Figure 3 illustrate the software architecture developed to implement the content analysis pipeline described above. The architecture depicted in Figure 3 includes:
- A data service (DS) component to store resources (HoE photos and portraits)
- 7 SMILA pipelines: Photo Processing, Portrait Processing, Face Matching, Crowd Face Validation Result, Crowd Face Add Result, Crowd Keypoint Tagging Result, Crowd Face Identification Result
- 3 servlets (GetCollection, processCollection, addIdentification)
- A conflict resolution manager (CRM), which is implemented according to two different architectures:
  o A crowd-based architecture (CrowdSearcher), which sends identification tasks to a list of experts
  o A crowd-based architecture (Microtask), which performs operations by means of a generic crowd (face validation and face add)
- An Entitypedia service to interact with the Entitypedia knowledge-base server

Figure 2 - Legend for the architecture graph


Figure 3 - Diagram of the software architecture implementing the HoE pipeline


1.2 Pipelines

1.2.1 Photo Processing

Description: this SMILA pipeline uploads new photos into the dataset and returns all the faces found by the FD component.
Actions:
1. Upload the photo to the DS
2. Save the photos into the SMILA storage and into the "Photo" Solr index
3. For all the pictures, the FD component calculates the BB and the confidence score of the detected faces
4. Reject faces with low confidence
5. Save faces into the SMILA storage and the Solr index
6. Send faces to Microtask for Crowd Face Add
7. Send faces to Microtask for Crowd Face Validation

Figure 4 - Sequence diagram for the Photo Processing job


1.2.2 Portraits Processing

Description: this SMILA pipeline reads the content of a "Portraits" folder and uploads all the pictures and their metadata (the entity associated with the person in the picture). Each portrait found is processed by the FD component to detect faces. If the FD component does not detect a face, the portrait is rejected; if the FD component detects more than one face, the face with the highest confidence score is saved and all the others are rejected.
Actions:
1. Monitor the portraits directory
2. When a new photo is found, parse its metadata and store the entity of the person represented in the picture
3. Apply the FD component to detect faces
4. If the FD component detects at least one face, save the photo and the face with the highest confidence score, and upload the picture to the DS

Figure 5 - Sequence diagram for the Populate Portraits job


1.2.3 Face Matching

Description: given a list of new faces from the HoE dataset, this pipeline computes the match with all faces in the portraits dataset.
Actions:
1. Calculate matches between each new face and all the portraits
2. Send the face to the CRM for face identification (using the top-k matches as suggestions)
3. Save matches into the SMILA storage and the Solr index
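As an illustration of the matching and top-k suggestion step, the sketch below ranks portraits against a new face. It is not the KeeSquare FR SDK API: the similarity function is a hypothetical stand-in for the commercial face identification component.

def top_k_matches(face_template, portraits, similarity, k=5):
    """Rank all portrait faces against a newly detected face.

    face_template: biometric template of the new face
    portraits: iterable of (portrait_id, portrait_template) pairs
    similarity: function returning a score in [0, 1] (a stand-in for the
                commercial face identification component)
    Returns the k best matches in decreasing order of similarity, which are
    then sent to the CRM as suggestions for the experts."""
    scored = [(pid, similarity(face_template, tpl)) for pid, tpl in portraits]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]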

Figure 6 - Sequence diagram for the Face Matching job

1.2.4 Crowd Face Validation Results

Description: once a validation task is finished, the Microtask crowd-based architecture sends the results to the PollManager, which forwards them to the Crowd Face Validation Result pipeline. The pipeline updates the crowd confidence score for the face in the SMILA storage and in the Solr index. This pipeline can also be activated by external sources for post-processing face validation.
Actions:
1. Update the face confidence score in the SMILA storage and the Solr index

1.2.5 Crowd Face Identification Result

Description: the CrowdSearcher crowd-based architecture (or another identification service) sends the notification of an identified face to this pipeline. The pipeline invokes the Entitypedia service, if the crowd specified only the name and not the entity, and adds the identification to the SMILA storage and the Solr index.
Actions:
1. If the identification is not yet associated with an Entity from Entitypedia, Entitypedia is invoked to find the Entity associated with the specified name
2. Add the identification to the SMILA storage and the Solr index


Figure 7 - Sequence diagram for the Crowd Face Identification Result job

1.2.6 Crowd Face Add Result

Description: once a "Crowd Face Add" task is completed, the Microtask crowd-based architecture sends the results to the PollManager, which, in turn, sends them to this pipeline. The pipeline clusters all the bounding boxes corresponding to the same face and computes the average bounding box for that face. The average bounding boxes are then sent to the Microtask crowd-based architecture for keypoint tagging and saved into the SMILA storage and the face Solr index.
Actions:
1. Cluster the BBs obtained from the "Crowd Face Add" task
2. Compute the average BB for each face
3. Send the average BB to Microtask for "Crowd KP Tagging"
4. Save the BBs into the SMILA storage and the Solr index
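A minimal sketch of the clustering and averaging steps (illustrative only; the actual logic lives in a SMILA pipelet, and the overlap threshold used here is an assumption):

def iou(a, b):
    """Intersection-over-union of two boxes given as (left, top, right, bottom)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def cluster_and_average(boxes, threshold=0.5):
    """Greedily group crowd-provided boxes that overlap more than `threshold`
    and return one average box per group, i.e. one box per face."""
    clusters = []
    for box in boxes:
        for cluster in clusters:
            if iou(box, cluster[0]) >= threshold:
                cluster.append(box)
                break
        else:
            clusters.append([box])
    return [tuple(sum(b[i] for b in cluster) / len(cluster) for i in range(4))
            for cluster in clusters]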


Figure 8 - Sequence diagram for the Crowd Face Add Result job

1.2.7 Crowd Keypoint Tagging Results

Description: once a "Crowd KP Tagging" task is completed, the Microtask crowd-based architecture sends the results to the PollManager, which, in turn, sends them to this pipeline. The pipeline detects the new face by computing the average position of each face keypoint (left/right eye, mouth, chin), so that a biometric template can be computed by the face identification component and used in the "Face Matching" pipeline to be matched against the portraits.
Actions:
1. Cluster the keypoints
2. Use the FD component to detect the face using the four keypoints
3. Compute the biometric template
4. Save the face and the corresponding template


Figure 9 - Sequence diagram for the Face KP Tagging Result SMILA Pipeline

1.3 Servlets

The HoE SMILA pipeline offers different access points through REST APIs. Most of these access points are implemented using the SMILA REST API to control jobs. The architecture includes 3 servlets:
- ProcessCollection
- GetCollection
- FaceTagResult

1.3.1 ProcessCollection Servlet

Given a collection name and a folder path, this servlet starts processing the folder in the SMILA pipeline, waits until the end of the processing and then returns the results extracted from the Solr index.

POST smila/hoe/processCollection?collectionPath=<path>&collectionName=<name>

Parameters:
- collectionPath: path of the folder that contains the collection to process
- collectionName: name of the collection

On error or not found, returns HTTP/1.1 500


Response: { "photos": [ { "photoDSURI": "http:\/\/89.97.237.243:82\/historyofeurope\/DATA\/chiara\/DATA\/photos\/00008mycollection\/00 008.jpg", "timestamp": 1375798713960, "faces": [ { "identificationIds": [ "00008mycollection_03885" ], "crowdStatus": "DONE", "matches": [ "00008mycollection_0_03885_0" ], "faceId": "00008mycollection_0", "source": "FDT", "bottom": 117.0, "left": 337.0, "confidenceFDT": 0.2248743772506714, "confidenceCrowdValidation": 0.0, "right": 355.0, "top": 86.0 } ], "name": "00008mycollection" }, … ], "matches": [ { "portraitId": "03885_0", "timestamp": 1375798725280, "faceId": "00008mycollection_0", "matchId": "00008mycollection_0_03885_0", "confidenceFDT": 0.10972849279642105 }, … ], "identifications": [ { "portraitId": "03885_0", "faceId": "00008mycollection_0",


"expertCount": 1, "personEntity": 3885, "identificationId": "00008mycollection_03885" }, ‌ ], "portraits": [ { "photoDSURI": "http:\/\/89.97.237.243:82\/historyofeurope\/DATA\/chiara\/DATA\/portraits\/03885\/03885.jpg", "timestamp": 1375798635806, "faces": [ { "faceId": "03885_0", "bottom": 260.0, "left": 141.0, "confidenceFDT": 0.7648965716362, "right": 277.0, "top": 47.0 } ], "name": "03885", "personEntity": 3885, "personName": "Romano Prodi" }, ‌ ] }

1.3.2 GetCollection Servlet

This servlet returns results for an already processed collection.

GET smila/hoe/getCollection?collection=<name>

Retrieves face and match results for the specified collection.

Resource URL: http://smilaAddress:smilaPort/smila/hoe/getCollection?collection=<name>

Parameters:
- collection: name of the collection to retrieve

On error or not found, returns HTTP/1.1 500
Response: same response as the ProcessCollection servlet.


1.3.3 FaceTagResult Servlet

This servlet receives as input a set of keypoints, sends them to the FaceTagResult job and, once the pipeline has finished computing the matches for the new face, returns the new face and all the associated matches.

POST smila/hoe/FaceTagResult/

Body parameters (a SMILA bucket containing):
- _recordid: if the user performs the tagging on an existing face, this field is the id of that face; otherwise it is just a unique identifier for the tagging batch
- photoId: id of the photo which contains the new face
- x_right_eye: double
- y_right_eye: double
- x_left_eye: double
- y_left_eye: double
- x_mouth: double
- y_mouth: double
- x_chin: double
- y_chin: double

On error or not found, returns HTTP/1.1 500 Internal server error

Example body: {"_recordid":"00008", "x_right_eye":45456, "y_right_eye":5556, … }

Example Response: { "face": { "identificationIds": [ ], "matches": [ "00008mycollection_0c_03885_0", ... ], "faceId": "00008mycollection_0c", "source": "CROWD", "crowdStatus": "UNNEEDED", "bottom": 17.0, "left": 0.0,


"confidenceFDT": 0.2248743772506714, "right": 0.0, "top": 16.0 }, "matches": [ { "portraitId": "03885_0", "timestamp": 1375973972913, "faceId": "00008mycollection_0c", "matchId": "00008mycollection_0c_03885_0", "confidenceFDT": 0.10972849279642105 }, ... ] }

1.4 Data Model

The data model is illustrated in Figure 10. Below, details are provided for each of the entities.
- Photo: a picture in the data set. A photo can be:
  o HoEPhoto: part of the HoE data set, specified by an id, name, dimension, format, collection and additional metadata, if available.
  o Portrait: a photo that contains only one Face and refers to one and only one named Entity in Entitypedia. Initially the dataset of portraits is uploaded offline from a folder.
- Face: a bounding box of a photo that represents a face detected by an automatic face detection component (FD). A face detected by the FD has its own template that can be used to perform matches with other faces. The face is also annotated with the bounding box (BB) and the confidence score obtained from the FD. When a Face is validated by the crowd, it is also annotated with a crowd confidence score. A face detected in a portrait identifies one and only one Entity (PortraitFace).
- Keypoint: a detected face is associated with a set of four keypoints (left/right eye, mouth, chin), which are necessary to generate the biometric template used by the face identification component.
- Match: a match found between a HoEPhoto and a portrait by means of an automatic face identification component.
- Identification: an identification for a face. A face can be associated with a portrait or with an existing Entitypedia Entity which has no portrait in the available data set. For every Identification, the pipeline stores the number of experts who provided it.
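The entities above can be summarised as follows (Python dataclasses used purely for illustration; the field names follow the servlet responses of Section 1.3 and are not an implementation artefact of the pipeline):

from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Face:
    faceId: str
    left: float
    top: float
    right: float
    bottom: float
    confidenceFDT: float                               # score from the FD component
    confidenceCrowdValidation: Optional[float] = None  # set after crowd validation
    keypoints: Optional[dict] = None                   # left/right eye, mouth, chin

@dataclass
class Photo:                                           # HoEPhoto or Portrait
    photoDSURI: str
    name: str
    faces: List[Face] = field(default_factory=list)
    personEntity: Optional[int] = None                 # only set for portraits

@dataclass
class Match:
    matchId: str
    faceId: str
    portraitId: str
    confidenceFDT: float

@dataclass
class Identification:
    identificationId: str
    faceId: str
    personEntity: int
    expertCount: int                                   # number of experts who provided it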


Figure 10 - Diagram of the Data Model adopted for the "human-enhanced" HoE pipeline

This Data Model can be mapped onto the CUbRIK data model as follows: a photo is a ContentObject and its id, remotePath, name and collection are part of the ContentDescription as defined in the Content Description Model. A face is a SpatialSegmentDescription of a photo and the face template is an Annotation whose AnnotationConfidence is the FDTConfidence parameter. A Face is also annotated with all the possible matches; also in this case the AnnotationConfidence is the FDTConfidence parameter. Both the face and the match annotations generate a Conflict (as described in the Conflict Resolution Model). The resolution of a conflict generates a ManualAnnotation: for face validation it is represented by the CrowdConfidence, and for the match conflict it is represented by the Identification. For more details about the CUbRIK data model, see Section 1 of the document "D2.1 DATA MODELS - Human-enhanced time-aware multimedia search".


1.5 User Interface

The starting page (Figure 11) shows:
1. A set of pre-computed collections
2. A form to process a new collection

Figure 11 - Starting page

Once a user processes a new folder or clicks on an existing collection, all the photos of the collection are shown in the result page (Figure 12).


Figure 12 - Collection exploration

By selecting a photo, the photo explorer opens and shows the photo, the faces' bounding boxes and a checkbox (Crowd validation).

Figure 13 - Face detail

The two states of the checkbox make it possible to highlight the impact of the crowd on the face detection task (new faces appear, while others are removed) as the results of "Crowd Face Validation" and "Crowd Face Add" are received.


Figure 14 - Face detail after Crowd validation

In more detail, after the crowd validation the UI shows (Figure 14):
- The faces added by the crowd through the "Crowd Face Add" task
- The faces detected by the FD component with a high confidence score, which did not require validation
- The faces not yet validated, with a medium confidence score
- The faces already validated for which the crowd confidence is > 0.5 (the majority of performers says it is a valid face)
At the initial instant of the timeline only the faces detected by the FD component are shown (Figure 13).


Figure 15 - Face identification before crowd

Once the user selects a face, the top-k possible identities for that individual appear. The "Crowd validation" checkbox enables switching between the results before and after crowd entity verification. When set to the first state (Figure 15), only the possible identities provided by the automatic tool are shown. When switched to the second state (Figure 16), an aggregation of the identities provided by the tool and by the crowd is shown. The first positions of the ranking are occupied by the identities provided by the crowd, while the last positions are occupied by the identifications performed by the automatic tool. If an identity provided by the crowd corresponds to a person that is part of the portraits dataset, a portrait of that person is shown. However, it can also happen that an identity provided by the crowd does not have an associated portrait.


Figure 16 - Face identification post crowd

Figure 17 - Microtask UI for Face Validation


2. Fashion – Trend Analysis

Purpose

Annotate and enrich fashion-related images crawled from social networks in order to extract trends from the images and the preferences of social network users.

Description

The pipeline implements the trend analysis process. Image analysis in terms of body-part detection is implemented according to a mixed approach of human and automatic tasks. The process is complemented with a preliminary image collection crawling step and with colour and texture trend analysis. The body-part detection is performed automatically first; for all the images that present a low level of accuracy, a crowdsearch approach is used. In particular, a gamification strategy is put in place exploiting the Sketchness GWAP.

Query formulation

The SME user is presented with an interface that allows them to browse through aggregated information about the trending fashion items, trending colors or textures, and in general information about fashion in the selected time periods or geographic areas.

Data collection

A collection of images related to a specific topic (fashion items in the context of the Fashion V-App) and the corresponding metadata are crawled from social networks. The crawling is not a one-off procedure that collects some content and terminates; it is an online procedure that runs continuously along with the rest of the Fashion V-App components and feeds the pipelines with fresh content to be processed. The retrieved images are fetched only minutes after they are shared online, enabling the extraction of knowledge in near real time. About 600,000 images per week are crawled.

Result

The output of the system is a rich representation of the analysed image data set, which includes:
- Relevant meta results (e.g. boundaries, scores) of the segments extracted for the upper and lower body parts of the people depicted in the images
- The color and texture features extracted for each segment, identifying the category of the clothes present in the images
- The trend analysis for the fashion categories that were examined

Content Enrichment pipeline

The Fashion pipeline executes the following sequence of tasks:
- Image extraction from Social Networks
  o Input: selected set of keywords
  o Output: images and all available tweets that match any of the selected keywords related to fashion
- Entity recognition and extraction, which involves the following components:
  o Upper and Lower body parts detector
    Input: crawled image
    Output: relevant meta results (e.g. boundaries, scores) of the body parts detected in the original image
  o Sketchness GWAP (human task)
    Input: images that were difficult to process (medium confidence score) for the Upper and Lower body parts detector component
    Output: the list of tags that has been defined by the crowd for a particular image and the body parts segmented
  o Descriptors extraction
    Input: body parts after the automatic or crowd-powered segmentation step
    Output: dominant colors and texture
- Accessibility annotation
  o Input: images
  o Output: accessibility-related metadata

Query processing pipeline

- Trend Analyzer
  o Input: category of clothes selected by the SME user
  o Output: the trend analysis performed

Extensions

Off-the-shelf components

Human-computing components

Humans can be involved to accomplish the following tasks:
- Segment fashion-related images
- Tag images not previously annotated

The following image depicts the interface that the SME user exploits to browse through aggregated information about the trending fashion items, trending colours or textures, and in general information about fashion in the selected time periods or geographic areas.


Figure 18 - Fashion Trend Analysis


2.1 Architecture and Workflow description

The Fashion V-App demo aims to provide a tool, for fashion SMEs as well as fashion addicts and fans, that aggregates information about the trending fashion items, trending colors or textures, and in general information about fashion in the selected time periods or geographic areas. The V-App is built using a set of CUbRIK components and pipelines that provide the required functionalities; Figure 19 gives an overview of the Fashion V-App architecture.

Figure 19 - Overall view of the Fashion V-App architecture

The architecture depicted in Figure 19 includes:
- Image extraction from Social Networks task (green color)
- Entity Recognition and Extraction task (green color), also involving the human-in-the-loop feature through the Sketchness GWAP
- Accessibility annotation SMILA pipeline (yellow color)
- Trend Analyzer component (sky blue color)
- CUbRIK storage system (MongoDB storage and Redis cache system)
- Trend Analyzer SMILA pipeline (yellow color), which communicates with the Trend Analyzer component to fetch the data to be displayed in the User Interface
- The Fashion V-App User Interface

Figure 20 goes deeper into the Entity Recognition and Extraction task, which is responsible for extracting images from social networks in order to gather information useful to analyse trends. It depicts the workflow implemented by the Entity Recognition and Extraction task, which involves the interaction of the following components:


- Descriptors Extraction component
- Upper and Lower body parts detector
- Sketchness GWAP

Figure 20 - Details of the Entity Recognition and Extraction task in the Fashion V-App

2.2 Pipelines

2.2.1 Image extraction from Social Networks

The Image extraction from Social Networks task is responsible for retrieving images related to a specific topic (fashion items in the context of the Fashion V-App) and the corresponding metadata from social networks. The major difference with respect to other crawling components is that this component is not a one-off procedure that collects some content and terminates. On the contrary, the component operates as an online procedure that runs continuously along with the rest of the Fashion V-App components and feeds the pipelines with content to be processed. The retrieved images are fetched only minutes after they are shared online, enabling the extraction of knowledge in near real time. The component is an extension of the Twitter crawler pipeline that was implemented in the Media Entity Annotation H-Demo, with modifications to support queuing of the tasks to be processed. The crawler retrieves all available tweets that match any of the selected keywords related to fashion. The selected set of keywords is the following:

Upper body clothes:
1. Shirt
2. T-shirt

Lower body clothes:
1. Trousers
2. Skirt
3. Shorts

Upper and Lower body clothes:
1. Suit
2. Dress

The component discards all tweets that do not contain URLs, since images (and multimedia in general) are shared through different photo sharing services by means of URL addresses.
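A minimal sketch of the keyword and URL filtering applied by the crawler (illustrative only; it does not reproduce the actual crawler component and assumes the classic Twitter API v1.1 payload fields text and entities.urls):

FASHION_KEYWORDS = {"shirt", "t-shirt", "trousers", "skirt", "shorts", "suit", "dress"}

def keep_tweet(tweet: dict) -> bool:
    """Keep a tweet only if it mentions a fashion keyword and carries at least
    one URL, since images are shared through photo-sharing services as URLs."""
    text = tweet.get("text", "").lower()
    has_keyword = any(kw in text for kw in FASHION_KEYWORDS)
    has_url = bool(tweet.get("entities", {}).get("urls"))
    return has_keyword and has_url

def extract_image_candidates(tweets):
    """Yield (tweet id, url) pairs to be queued for image download and analysis."""
    for tweet in filter(keep_tweet, tweets):
        for url in tweet["entities"]["urls"]:
            yield tweet.get("id_str"), url.get("expanded_url")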

2.2.2 Entity Recognition and Extraction

The Entity Recognition and Extraction task of the pipeline involves the following components:
a) Descriptors Extraction component
b) Upper and Lower body parts detector component
c) Sketchness GWAP

The aim of the "Entity Recognition and Extraction" task is to analyse each retrieved image to identify whether people are present in the image. If people are present, the detector component (Upper and Lower body parts detector) segments the upper and lower body parts of each identified person along with a confidence score. Next, based on this score, a routing decision is taken in order to push the content for further processing or to discard it. Figure 20 presents the flow and the usage of the different components.

Descriptors Extraction component

The "Descriptors Extraction component" processes the body parts after the automatic or crowd-powered segmentation step. Since the component targets the Fashion V-App use case, the algorithms were chosen to answer certain questions about the image content and to be of use in the required fashion-related trend analysis. The color descriptor algorithm extracts a set of dominant colors of the image content, sorted from the most popular color to the least popular one. Following the requirements of the Fashion V-App use case and according to the existing literature, the component extracts only the 16 most popular colors of each image (a small illustrative sketch of this step is given at the end of this subsection). The quantization approach follows typical MPEG-7 dominant color descriptors but provides more colors, in order to preserve the information and at the same time be robust to illumination variations of the images. The texture descriptor algorithm is based on the extended Local Binary Pattern (LBP) approach. A training process was followed to train nearest neighbor (NN) classifiers for a selected set of fabric patterns, in order to boost performance and save the processing time that would be necessary in a clustering approach. The patterns specified in the classification are the following:
1. Plaid, tartan
2. Checked pattern
3. Border tartan
4. Herringbone
5. Houndstooth
6. Striped, pin stripes
7. Floral
8. Paisley
9. Animal print
10. Polka dots


11. Argyle
12. Diamond pattern
13. Chevron
14. Toile de Jouy
15. Greek key pattern
16. Camouflage pattern

All extracted information is written to the CUbRIK storage (based on MongoDB), along with timestamp information and any metadata available from Twitter, to be processed and analyzed by the Trend Analyzer.

Upper and Lower body parts detector

A component for detecting people and their upper and lower body parts in a JPEG photo. The component is based on the research and code of Poselets [Bourdev, Malik], but was adapted and extended to specifically detect the upper and lower body parts of multiple people within a single JPEG photo and to output the relevant meta results (e.g. boundaries, scores) of the detections in a structured form that can be easily read and further processed by a subsequent component. Poselets are part-based detectors working both in the configuration space of keypoints and in the appearance space of image patches. The Poselets results important for us are certain keypoints (ankles, hips, shoulders) leading to bounding boxes for the upper body and the lower body, along with their confidence scores.

Sketchness GWAP

The Sketchness GWAP is used to exploit a crowd of players in order to segment fashion-related images that were difficult to process for the other components involved, in particular the Upper and Lower body parts detector. The GWAP can be used to check whether a particular fashion item is present within an image by asking the crowd for a confirmation in the form of a tag; the image can also be tagged in case it was not previously annotated. The component is also used to segment the tagged fashion item within the image by asking the players to trace the contours of the object.


Figure 21 – Sketchness UI

The component can be queried through REST APIs to return:
- The list of tags that has been defined by the crowd for a particular image
- The B/W mask computed by aggregating the traces of the players for a particular tag
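Referring back to the Descriptors Extraction component, the following is a minimal sketch of a dominant-color extraction step (illustrative; it uses a simple uniform quantization instead of the MPEG-7-style quantization actually employed, and assumes the Pillow library):

from collections import Counter
from PIL import Image

def dominant_colors(path, n_colors=16, bits_per_channel=4):
    """Return up to n_colors (r, g, b) values sorted from most to least popular.
    Colors are first quantized (here uniformly) to be robust to small
    illumination variations."""
    img = Image.open(path).convert("RGB").resize((128, 128))
    shift = 8 - bits_per_channel
    counts = Counter(
        ((r >> shift) << shift, (g >> shift) << shift, (b >> shift) << shift)
        for r, g, b in img.getdata()
    )
    return [color for color, _ in counts.most_common(n_colors)]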

2.2.3 Accessibility Annotation Pipeline

The Accessibility Annotation pipeline extracts low-level features from the input images and calculates accessibility scores from them by estimating low-level properties (brightness, contrast, etc.). The accessibility annotation is added to the input record.
The input to the accessibility annotation component consists of the following fields:
- title
- image URL
The output of the accessibility annotation component consists of the following fields:
- image brightness
- image color list
- image color saturation
- image contrast
- image dominant color
- image dominant color combination
- image red percentage
- brightness per object
- color list per object
- saturation per object
- contrast per object
- dominant color per object
- dominant color combination per object
- accessibility scores for the various impairments (array of values)
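A sketch of how two of the listed low-level features could be estimated (illustrative only; the actual accessibility component and its scoring formulas are not reproduced, and NumPy/Pillow are assumed):

import numpy as np
from PIL import Image

def brightness_and_contrast(image_path):
    """Estimate global brightness (mean luma) and contrast (luma standard
    deviation), both normalised to [0, 1]."""
    rgb = np.asarray(Image.open(image_path).convert("RGB"), dtype=np.float32) / 255.0
    # ITU-R BT.601 luma weights
    luma = 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]
    return float(luma.mean()), float(luma.std())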

2.2.4 Trend Analyzer

The Trend Analyzer component analyses the results made available by the previous components in the workflow to extract meaningful information about the fashion categories that are examined. Map-reduce queries with time and/or space filtering are performed to produce fashion trends for the specified time periods that are requested. The aggregation of information, needed to handle the volumes of data to be stored or discarded, is also critical in this component, since the data producer (the "Image extraction from Social Networks" component) is constantly retrieving fresh content and feeding it into the workflow to be processed.
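As an illustration of the kind of time-filtered aggregation performed, a sketch with pymongo follows; the collection name, field names and weekly bucketing are assumptions, not the component's actual schema:

from datetime import datetime, timedelta
from pymongo import MongoClient

def color_trend(category, days=90):
    """Count the dominant colors observed for one clothes category over the
    last `days`, grouped per ISO week."""
    coll = MongoClient("mongodb://localhost:27017")["cubrik"]["fashion_items"]
    since = datetime.utcnow() - timedelta(days=days)
    pipeline = [
        {"$match": {"category": category, "timestamp": {"$gte": since}}},
        {"$unwind": "$dominant_colors"},
        {"$group": {
            "_id": {"week": {"$isoWeek": "$timestamp"}, "color": "$dominant_colors"},
            "count": {"$sum": 1},
        }},
        {"$sort": {"count": -1}},
    ]
    return list(coll.aggregate(pipeline))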

2.2.5 Trend Analyzer SMILA Pipeline

This SMILA pipeline is responsible for handling the communication between the Trend Analyzer component and the Fashion V-App User Interface (the Fashion Portal). The sequence diagram in Figure 22 shows the actions implemented by the SMILA pipeline, and Figure 23 gives a detailed description of the Trend Analyzer behaviour.

Figure 22 – Sequence diagram for Trend Analyzer SMILA pipeline


Figure 23 - Trend analysis for category

The communication between the Fashion V-App User Interface and the SMILA pipeline is based on JSON.


2.3 Data Model

The data model the Fashion Trend Analysis pipelines rely on is illustrated in Figure 24.

Figure 24 - Diagram of the Data Model adopted for the Fashion pipeline

The model presented in this section has been designed on top of the models described in document D2.1 to comply specifically with the Fashion pipeline needs. The Fashion pipeline model combines the key entities of the Content Model, the Content Description Model and the Action Model in order to satisfy the requirements of our scenario in a concise way, avoiding the sophistications of the complete data model that are not needed for the purposes of this pipeline (e.g. the Content Model has objects such as AudioVisualObject or VideoObject that are not relevant for our purposes). The data model has to be able to store Images, Image Annotations, Image Segmentations and User Actions. The possibility to add Tasks and MicroTasks is also required.
One of the main requirements for the scenario is the storage of images and image annotations. For storing images we built the ImageObject entity, where every piece of data about an image, such as path, size, width, height and MIME type, is stored. For storing annotations we built the TextAnnotation entity, where every kind of textual labelling/description can be stored. Because we had the requirement to make our application multi-language, we designed the TextAnnotation entity in such a way that it allows inserting the same value for different languages.
One of the key aspects of recognizing a garment within an image is the necessity to specify the polylines of the traces generated by human contribution or automatic segmentation. For doing this we designed the PolylineDescription entity, where all polyline data are stored. As every image can be annotated more than once with different TextAnnotations and/or PolylineDescriptions, we designed a new entity that resolves the n-to-n relationship. We called it ContentDescription and it allows associating n images to n text annotations and/or polyline descriptions. Every ContentDescription row is unique and keeps track of each association between an image and a text annotation and/or polyline description.
All the activities that process the images in our system are stored in a specific entity that we called Action. All Actions are associated with a Session and a CUbRIK User. A Session is a unique identifier that allows grouping a series of actions and identifying which CUbRIK users took part in the session. The CUbRIK user is nothing but a mapping between the application user id and our internal database list of users.
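A compact sketch of how the association entities described above fit together (illustrative Python typing only; attribute names are inferred from the description, not taken from the actual schema):

from dataclasses import dataclass
from typing import Optional

@dataclass
class ImageObject:
    image_id: int
    path: str
    width: int
    height: int
    mime: str

@dataclass
class TextAnnotation:
    annotation_id: int
    value: str
    language: str            # the same value can be stored for several languages

@dataclass
class PolylineDescription:
    polyline_id: int
    points: list             # trace from a player or from automatic segmentation

@dataclass
class ContentDescription:
    """Resolves the n-to-n relationship: each row links one image to one
    text annotation and/or one polyline description."""
    description_id: int
    image_id: int
    text_annotation_id: Optional[int] = None
    polyline_id: Optional[int] = None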

2.4 User Interface

The pages below depict the interfaces that the SME user exploits to select one of the listed categories and to choose a time span.

Figure 25 - Fashion V-App User Interface


Figure 26 - Fashion V-App: the SME user selects the option to have a trend insight on a category

Figure 27 - Fashion V-App: the SME user selects the option to have a trend insight on the shirt category with a timespan of 3 months


The Trend Analysis results corresponding to the query described in the previous images are reported in the SME user dashboard. Both colour trends and colour combination trends are available. Moreover, the dashboard provides an overview of the colour trend over a specific timeline and the trends of the most used prints & graphics.

Figure 28 - Fashion V-App: the report to the SME user through the user interface


3. News Content History

Purpose

Finding news clips that share parts of the same footage

Description

The purpose of the News Content History H-Demo is to find video content that has been reused (and possibly cut or modified) by several TV news shows / broadcasters. This is especially interesting for rare, exclusive content provided by news agencies or private persons. The H-Demo application provides a web-based interface to query by text or by video & text, and displays relationships between news clips that share the same footage / video segments. The H-Demo implements two approaches to find these reused video fragments: the first approach employs dense matching in order to exactly identify video fragments that have been reused, thus finding fragments originating from the same camera recording. The second approach uses a more robust descriptor on key frames within previously analysed shots in order to establish relationships between the content.

Query formulation

The query can be formulated as a supported text-based search on news topics or by uploading a reference video.

Data collection

Basically any available collection of news videos can serve as data collection, and several sets of news clips have been crawled. However, for objective evaluation, the data needs to be manually annotated during the development phase for ground truth creation, which turned out to be almost impossible for human annotators if exact clip borders are required. Thus, the decision was taken to generate synthetic news clips from different recordings, applying different kinds of manipulations. The current data set covers five topics, while the number of broadcast stations, the amount of reused content and other parameters are configurable. Thus a flexible, fast-to-create ground truth data set is available.

Result

Depending on the query type, the “News Content History” pipeline visualizes the most relevant relationships within one query topic (text-based query) or with respect to the reference video (content-based query).

Indexing pipeline

The “News Content History” pipeline receives a text query or a video. The pipeline executes the following sequence of tasks:
- Feature extraction on the video (video query only)
  o color and temporal features
- Video segment matching (video query only)
  o coarse and fine search steps
- Data model update (video query only)
- Visualization data creation
- During result browsing:
  o Segment validation by crowd
  o Tag validation and provision by crowd
See Figure 29 for details of the two query use cases; a small illustrative sketch of the feature extraction and matching steps is given at the end of this overview.

Query processing pipeline

The results produced by the content processing pipeline are presented to the user by means of a GUI that enables browsing through different views including matrix and chord diagrams, and direct comparison / playback.

Extensions

Human annotation of news clip topic borders.


Off-the-shelf components

-

Human-computing components

Humans can be involved to accomplish the following tasks:
- Validate found segment matches
- Validate topic tags
- Provide new topic tags
- Support video content query performance by supporting description
- Annotate news clip topic borders in continuous news shows
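To illustrate the "color and temporal features" and "coarse and fine search" steps of the indexing pipeline above, the following is a purely illustrative sketch; it is not the actual matching component, and the frame signatures are reduced here to mean-color vectors:

import numpy as np

def frame_signatures(frames):
    """frames: iterable of HxWx3 uint8 arrays. One mean-color vector per frame."""
    return np.array([f.reshape(-1, 3).mean(axis=0) for f in frames])

def coarse_candidates(query_sig, ref_sig, window, stride=25, tol=20.0):
    """Coarse pass: compare down-sampled signatures every `stride` frames."""
    hits = []
    for start in range(0, len(ref_sig) - window, stride):
        d = np.abs(ref_sig[start:start + window:5] - query_sig[:window:5]).mean()
        if d < tol:
            hits.append(start)
    return hits

def fine_match(query_sig, ref_sig, start, window, tol=10.0):
    """Fine pass: frame-accurate check around a coarse candidate."""
    d = np.abs(ref_sig[start:start + window] - query_sig[:window]).mean()
    return d < tol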

3.1 Architecture description

The basic architecture of the “News Content History” pipeline is shown in Figure 29. The system consists of a “fat-client” browser application which communicates with a Java-based server application. The server application controls the SMILA pipelets for feature extraction and segment matching. A data and binary object storage holds the data model and provides access to media and extraction data.

Figure 29 Architecture of News Content History pipeline (dense segment matching)

3.2 Workflow

The following provides a selection of flows of the H-Demo, described as (high-level) activity diagrams.

3.2.1 Query by text

Description: query of related items / overlapping segments via tag / text search.


Figure 30 NCH Workflow: query by text.

3.2.2 Query by video

Description: query of related items / overlapping segments via content-based video search.

Figure 31 NCH Workflow: query by video


3.3 Data Model

A simplified data model (here realized as an ER diagram for direct “translation” from and to the RDBMS used for the demo) is shown in Figure 32. The model is derived from, and is compliant with, the metadata model reported in D2.2, but considers several entity name changes and simplifications for the “News Content History” context, including a 1:1 relationship between content objects and content descriptions, which is why both annotations and provenance / provider information have a direct relationship to content objects. For the same reason, human annotation attributes are included in the annotation entity. In the model, annotations may relate to content objects (e.g. tags attached to content objects), media segment descriptions (e.g. identification of news clips), segment relationships (e.g. automatic identification of overlapping segments) or other annotations (e.g. validation of tags). The content provenance and content provider entities are relevant in order to track the reuse of news items and their segment overlap, also considering e.g. creation and publication date/time and the content provider.

Figure 32: simplified News Content History data model


3.4 User Interface

The “News Content History” pipeline starts with a query page in the browser, where the user can choose between the two modalities: text-based or video segment similarity search.

Figure 33: News Content History GUI

After query submission the system retrieves results, which are displayed as a list.

The user can now choose among three visualization options:


1. The Matrix view, showing all relationships between all relevant content items (i.e. news clips). The view order is freely configurable in terms of number of common segments, air date and other properties. By choosing new sort criteria the view adapts interactively. From the matrix view it is possible to visualize one or more relationships in the detailed chord view (see below). The user selects one item of the matrix (i.e. a relationship between two news clips) and can then enter the compare mode.

Figure 34: NCH matrix view


2. The chord view has two modes. In the overview mode, all relationships and their amount of common segments between relevant content items are shown. This gives an orientation overview of which news clips are related and “how much” they are related. The more segments two clips share with each other, the broader is the chord connecting those two clips. By hovering over a connection chord, all other chords are faded out for better overview. After selecting two or more items, the detailed view (figure below) can be opened, showing also the position of the matching segments within the clip (like on a circular time line).

Figure 35: NCH chord overview

The detailed view facilitates an in-depth analysis of the shared video content. Each segment that has been matched by the automatic matching and validated by the crowd is displayed here. Not only the amount of common video segments but also their position within the clips is shown. Even if there are remaining mismatches, this view gives a good impression of what has been used when in a news clip.


Figure 36: NCH chord detail view


3. By selecting two items in the matrix or one of the chord views, the user can switch into the compare mode. Annotations such as segment validation and provision of new description tags can be done here. All segments that have been found by the algorithm are displayed on a linear time line. Navigation by scrolling (i.e. scrubbing) is possible. Each segment can also be navigated to directly via buttons. The selected segment is then highlighted and can be validated by the user. Currently the user can give a binary decision about the match by visually checking whether the segments' start frames match. The user can also validate or provide tags describing the news topic and the air date.

Figure 37 NCH compare and annotation view
