Book by YIRUI QIAN

INTRODUCTION

D ATA B A S E DESCRIPTION

P OT E N T I A L U S E S O F T H E D ATA B A S E

P R O P O S E D E VA L U A T I O N P R OTO C O L

JSON: [

BASELINE PCA

{

PERFORMANCE

“FACEID”:

SAN FRANCISCO MOSCONE CENTER WEST

“48CD-

E VA L U A T I O N S

F8C8-841C-4D33-B875-

M AY 2 7 _ 3 0 2 0 1 7

1710A3FC6542”, “FACERECTANGLE”: { “WIDTH”: “HEIGHT”: “LEFT”: “TOP”:

CONCLUSION

228, 228,

460, 125

}, “FACELANDMARKS”: { “PUPILLEFT”: { “X”:

507,

“Y”:

204.9

}, “PUPILRIGHT”: { “X”:

609.8,

“Y”:

175.4

}, “NOSETIP”: { “X”:

596.4,

“Y”:

250.9

}, “MOUTHLEFT”: { “X”:

531.4,

“Y”:

301

}, “MOUTHRIGHT”: { “X”:

629,

“Y”:

273.7

SNR FICIAL IDENTITY

“EYEBROWLEFTOUTER”: { “X”:

463.2,

A FORTHCOMING TECHNOLOGY IS

“Y”:

201.2

P L A N N E D TO H A P P E N S O O N

}, “EYEBROWLEFTINNER”: { “X”:

547.4,

“Y”:

177.1

}, “EYELEFTOUTER”: { “X”:

496.8,

“Y”:

212

ABSTRACT INTRODUCTION DATABASE

DESCRIPTION

2 . 1 Image acquisition: equipment setup and imaging procedure 2 . 2 Naming conventions and final database forming 2 . 3 Image sets details 2 . 3 . 1 Frontal facial mug shots 2 . 3 . 2 Surveillance cameras images (visible light) 2 . 3 . 3 IR night vision mug shots 2 . 3 . 4 Surveillance cameras images (IR night vision) 2 . 3 . 5 Different pose images 2 . 4 Database demographics

Facial Recognition

POTENTIAL PROPOSED

USES OF THE DATABASE EVALUATION PROTOCOL

4 . 1 DayTime tests 4 . 2 NightTime tests 4 . 3 Performance metrics 4 . 4 Training scenario suggestion BASELINE

PCA

PERFORMANCE EVALUATIONS

5 . 1 Image preprocessing 5 . 2 Experimental design 5 . 3 Performance results 5 . 3 . 1 DayTime experiments 5 . 3 . 2 NightTime experiments 5 . 4 Discussion CONCLUSION APPENDIX COLOPHONE

CAMERAS SET UP //

CAMERA

Our database was designed mainly as a means of testing face recognition algorithms in real world conditions. In such a setup, one can easily imagine a scenario where an individual should be recognized comparing one frontal mug shot image to a low quality video surveillance still image. In order to achieve a realistic setup we decided to use commercially available surveillance cameras of varying quality (see Appendix for details). Since two of our surveillance cameras record both visible spectrum and IR night vision images, we decided to include IR imagery in the database as well. Capturing face images took place in Video Communications Laboratory at the Fac-

THE HIGH - QUALITY PHOTO CAMERA FOR

ulty of Electrical Engineering and Computing, University of Zagreb, Croatia. Capture

CAPTURING VISIBLE LIGHT MUG SHOTS

equipment included: six surveillance cameras, professional digital video surveillance

WAS INSTALLED THE SAME WAY AS THE

recorder, professional high-quality photo camera and a computer. For mug shot

INFRARED CAMERA BUT IN A SEPARATE

image acquisition we used a high-quality photo camera. For surveillance camera image acquisition we used five different (commercially available) models of surveillance cameras (see Appendix for details) and for IR mug shots a separate surveillance camera was used. Five surveillance cameras were installed in one room at

ROOM WITH THE STANDARD INDOOR LIGHTING AND IT WAS EQUIPPED WITH ADEQUATE FLASH.

the height of 2.25 m and positioned as illustrated in Figs. 1 and 2. The only source Two out of five surveillance cameras were able to record in IR night vision mode as well. The sixth camera was installed in a separate, darkened room for capturing IR mug shots. It was installed at a fixed position and focused on a chair on which participants were sitting during IR mug shot capturing. The IR part of the database is important since there are research efforts in this direction, e.g. and, yet no available database provides such images. Mug shot imaging conditions are exactly the same as would be expected for any law enforcement or national security use (passport images or any other personal identification document). All six cameras (five surveillance and one IR mug shot) were connected to a professional digital video surveillance recorder, which was recording all six video streams simultaneously all the time on internal hard disk. For storage of images and for controlling surveillance cameras we used Digital Sprite 2 recorder with adequate software (DM Multi Site Viewer) provided from manufacturer for connecting and controlling over personal computer. Settings recommended by the vendor were used for capturing images on Digital Sprite 2, because those settings represent most commonly used parameters set up in real life surveillance systems (see Appendix for details). The surveillance cameras are named cam1, cam2, cam3, cam4 and cam5. Cam1 and cam5 are also able to work in IR night vision mode. We decided to name the images taken by them in the IR night vision mode as cam6 (actually cam1 in night vision mode) and cam7 (actually cam5 in night vision mode). Camera for taking IR mug shots was named cam8. All cameras (surveillance and photo) were installed and fixed to same positions and were not moved during the whole capturing process.

2.25M

of illumination was the outdoor light, which came through a window on one side.

In this paper we describe a database of static images of human faces. Images were taken in uncontrolled indoor environment using five video surveillance cam-

ABSTRACT

eras of various qualities. Database contains 4,160 static images (in visible and infrared spectrum) of 130 subjects. Images from different quality cameras should mimic real-world conditions and enable robust face recognition algorithms testing, emphasizing different law enforcement and surveillance use case scenarios. In addition to database description, this paper also elaborates on possible uses of the database and proposes a testing protocol. A baseline Principal Component Analysis (PCA) face recognition algorithm was tested following the proposed protocol. Other researchers can use these test results as a control algorithm performance score when testing

KEYWORD

their own algorithms on this dataset. Database

V I D EO SURVEILLANCE CAMERAS.

is available to research community through the

F A C EDATABASE.

procedure described at www.scface.org.

F A C ERECOGNITION.

1. DIFFERENT QUALITY AND RESOLUTION CAMERAS WERE USED;

The problem specifically addressed with our da-

2. IMAGES WERE TAKEN UNDER UNCONTROLLED

tabase is the law enforcement person identifi-

ILLUMINATION CONDITIONS; 3 . IMAGES WERE TAKEN FROM VARIOUS DISTANCES;

Interest in face recognition, as a combination of pattern recognition and image analysis is still growing. Many papers are written and many real-world systems are being developed and distributed. As a non-invasive biometric method, face recognition is attractive for national security purposes as well as for smaller scale surveillance systems. However, in order to be able to claim that any face recognition system is efficient, robust and reliable, it must undergo rigorous testing and verification, preferably on real-world datasets. In spite of many face databases currently available to researchers, we feel they lack the real-world settings part. Images in other currently available databases are usually taken by the same camera e.g. Video database of moving faces and people, CMU PIE, FERET and are not taken using proper, commercially available, surveillance equipment (e.g. some multimodal databases that include sets for voice recognition—BANCA, XM2VTSDB). Images in those databases are mainly taken under strictly controlled conditions of lighting, pose, etc., and are of high resolution (high qual-

INTRODUCTION

4. HEAD POSE IN SURVEILLANCE IMAGES IS TYPICAL FOR A COMMERCIAL SURVEILLANCE SYSTEM, I.E.THE CAMERA IS PLACED SLIGHTLY ABOVE THE SUBJECT’S HEAD, MAKING TH E RECOGNITION EVEN MORE DEMANDING; BESIDES, DURING TH E SURVEILLANCE CAMERA RECORDINGS THE INDIVIDUALS WERE NOT LOOKING TO A FIXED POINT; 5. DATABASE CONTAINS NINE DIFFERENT POSES IMAGES SUITA B L E FOR HEAD POSE MODELING AND/OR ESTIMATION; 6. DATABASE CONTAINS IMAGES OF 130 SUBJECTS, ENOUGH TO ELIMINATE PERFORMANCE RESULTS OBTAINED BY PURE COINCIDENCE ( CHANCES OF RECOGNITION BY PURE COINCIDENCE IS LES S THAN 1/130 ≈ 0.8% ) ; 7. B OTH IDENTIFICATION AND VERIFICATION SCENARIOS ARE POSSIBLE, BUT THE MAIN IDEA IS FOR IT TO BE USED IN DIFFICULT REAL - WORLD IDENTIFICATION EXPERIMENTS.

ity capturing equipment is used). Although some

our database in experiments will show that real-world indoor face recognition by using standard video surveillance equipment is far from being a solved problem.

formance of both humans and computers. Since up to now there were no publicly available CCTV face

images

databases,

researchers

were

forced to perform the experiments using their images they captured themselves, which in turn often results in having a low number of subjects and results which have questionable statistical significance. We feel that our SCface database will successfully fill this gap. For this purpose, we collected images from 130 subjects. The database consists of 4,160 images in total, taken in uncontrolled lighting (infrared night vision images taken in dark) with five different quality surveillance video cameras.

distinct distances from the cameras with the out-

Head pose should be as natural as possible. All this mimics a really complex imag-

base mimics the real-world conditions as close as possible and we think that using

(Closed Circuit Television) images, exploring per-

Appendix). Subjects’ images were taken at three

uncontrolled (i.e. coming from a natural source) with varying direction and strength.

solved problem. This was the main motivation for collecting our database. Our data-

forming more and more experiments using CCTV

IR images; for cameras details please see the

from video cameras (of various qualities) on a public place. Illumination should be

resulting in very high recognition rates suggesting that face recognition is almost a

how important this topic is, researchers are per-

output as well (we will address these images as

real-world applications, algorithms should operate and be robust on data captured

reason for a low number of studies on face recognition in such naturalistic conditions,

ject’s cooperation is not expected. Recognizing

(IR) night vision mode so we recorded that IR

assure reproducibility of results) all this is still far from real-world conditions. For

any other national security system. The obvious lack of such image sets is the main

any other identification scenario where the sub-

Two cameras were able to operate in infrared

of the most frequently used databases have a standardized protocol (in order to

ing environment, such as is expected when building an airport security system or

cation from low quality surveillance images or

door (sun) light as the only source of illumination. All images were collected over a 5 days period. This database was collected mainly having identification rather then verification in mind, with the additional motive being the fact that identification is a more demanding recognition problem with its one-to-many comparisons. The proposed protocol and initial testing with PCA will also be along these lines of thought. Reported results will accordingly show rank 1 recognition percentages, but Cumulative Match Score (CMS) curves will be omitted, since such a detailed analysis is beyond the scope of this paper. Here is a short summary of what makes this dataset interesting to face recognition research community. The rest of this paper is organized as follows: Section 2 gives a detailed database description including the acquisition, naming conventions, database forming stage, image sets details and database demographics; Section 3 discusses potential uses of the database; Section 4 proposes evaluation protocols and results presentation; Section 5 describes an ad hoc baseline PCA test on this database in extremely difficult experimental setup where a separate database is used for training; Section 6 concludes the paper.

SURVEILLANCE CAMERAS S P E C I F I C AT I O N S CAM1 BOSCH

LTC0495/51

CAM2 SHANY

WTC-8342

The capturing was conducted over a 5 days period. All participants in this project have passed through the following procedure. First they had to walk in front of the surveillance cameras in the dark and after that they had to do the same in uncontrolled indoor lighting. During their walk in front of the cameras they had to stop at three previously marked

CAM3 J&S JCC-915D

positions. This way 21 images per subject were taken (cam1–7 at distances of 4.20, 2.60 and 1.00 m). After that, participants were photographed with digital photographer’s camera at close range in controlled conditions (standard indoor lighting, adequate use of flash to avoid shades, high resolution of images). This set of images provides nine discrete views of each face, ranging from left to right profile in equal steps of 22.5 degrees. To assure comparable

CCD TYPE

views for each subject, numbered markers were used as fixation points. As a final result, there are nine images per

ACTIVE PIXELS

subject with views from 90 to +90 degrees including another mug shot at 0 degrees.

RESOLUTION

In the end, subjects went into the dark room where high quality IR night vision surveillance camera was installed for

MINIMUM ILLUMI

capturing IR mug shots at close range. In overall that gives 32 images per subject in the database. Facial mug shots

N AT I O N

are high quality static color images, taken in controlled indoor illumination conditions environment using Canon EOS

SNR

10D digital camera. There is one mug shot per subject and those images are labeled with the suffix frontal (e.g.

VIDEO OUTPUT

001_frontal.jpg). Images are in lossless 24 bit color JPEG format with the original size of 3,0722,048 pixels, cropped

COMMENT

to 1,600x1,200 pixels. Cropping was done following the ANSI 385-2004 standard recommendation so that the face PHOTOGRAPHER’S CAMERA WAS CANO N CAM1

EOS 10D MODEL WITH 22.7×15.1 MM CMOS SENSOR, W I T H

6.3 MEGA

occupies approximately 80% of the image. Exact positions of eye centers, tip of the nose and the center of the mouth are manually extracted and stored in a textual file (this was also done for all

SIGMA

images in the database except for the different

752×582

18 – 50 MM F3.5 – 5.6 DC LENSES AN D

pose set). These mug shot images are what you

540 TVL

SIGMA EF 500 DG SUPER FLASH.

would expect to find in a law enforcement data-

1/3

PIXELS,

0,24/0,038 LUX (IR) >

50 DB

VPP, 75 Ω IR

EQUIPPED

WITH

base or when registering to a security system. There are in total 130 frontal facial mug shot

images in the database, one per subject. Cameras 1–5 are visible light cameras and all images with camNum labels cam1–5 represent images taken with surveillance cameras of different qualities.

NIGHT VISION

There are three images per subject for each camera, taken at three discrete distances (4.20, 2.60 and 1.00 m). This gives total of 15 images per subject in this set (1950 in total). This set was

CAM2

designed to test the face recognition algorithms in real-world surveillance setup (we again emphasize that we used commercially

1/3

available cameras of both low and high quality). As can be seen

SONY

from the images differ substantially in quality and resolution. The

SUPER HAD

original images saved from Digital Sprite 2 recorder to PC hard

795×596 480

drive all have the same dimensions (680x556 pixels, 96 dpi and

TVL 0,15 LUX >

50 DB

VPP, 75 Ω

24 bit color). They have been cropped in order to remove as much background as possible. Due to different distances at which the images were taken, new (cropped) images are not all the same size. Cropped images have the following resolutions: 100x75 pixels for distance 1, 144108 for distance 2 and 224x168 for distance

CAM3

3. All the coordinates mentioned previously were labeled and stored in a textual database for this set of images as well. IR night vision

1/3

mug shots were taken in a separate dark room with a resolution of

COLOR

426x320 pixels, grayscale. There is one image per subject in this set,

597×537

yielding a total of 130 of those images in the database.

350 TVL 0,3 LUX >

48 DB

VPP, 75 Ω

DOME CAMERA

EXAMPLE OF ONE IMAGE SET FOR ONE SUBJECT

JSON: [ {

TA B L E 1 R A N K 1 P E R F O R M A N C E

“FACEID”: “48CDF8C8-841C-4D33-B8751710A3FC6542”, {

“FACERECTANGLE”: “WIDTH”:

228,

“HEIGHT”: “LEFT”: “TOP”:

RESULTS

FOR

DAYTIME

IMENTS.

GALLERY:

EXPER-

MUG

SHOT

}, {

204.9

As a non-invasive biometric method, face recognition is attractive

“Y”:

175.4

596.4,

“Y”:

250.9

531.4,

“Y”:

301

DATABASE

RECOGNITION ABRM

“X”:

629,

“Y”:

273.7

{

R A N K 1 R E C O G N I T I O N R AT E %

D I S TA N C E

{

GALLERY:

F R O N TA L

CAM8

researchers, we feel they lack the real-world settings part. Images in other currently available databases are usually taken by the same camera e.g. Video database of moving faces and people,

this is still far from real-world conditions. For real-world applications, algorithms should oper-

strength. Head pose should be as natural as possible. All this mimics a really complex imaging an airport security system or any other na-

{

201.2

CAM6_2

3.1

10.0

CAM6_3

3.9

10.0

CAM7_1

0.7

1.5

CAM7_1

5.4

2.3

CAM7_3

4.6

7.0

}, “EYEBROWLEFTINNER”:

}, “EYELEFTOUTER”:

from a natural source) with varying direction and

ing environment, such as is expected when build-

“Y”:

212

datasets. In spite of many face databases currently available to

Illumination should be uncontrolled (i.e. coming

1.5

496.8,

THE MINIMIZATION

undergo rigorous testing and verification, preferably on real-world

cameras (of various qualities) on a public place.

1.5

“Y”:

DURING

face recognition system is efficient, robust and reliable, it must

ate and be robust on data captured from video

CAM6_1

“X”:

PROCESS.

CHANGES

THE

(in order to assure reproducibility of results) all

463.2,

177.1

MATCHING

LOCAL

REDUCES

veillance systems. However, in order to be able to claim that any

most frequently used databases have a standardized protocol

“X”:

547.4,

SUCH

WHICH

UTILIZES

for national security purposes as well as for smaller scale sur-

(high quality capturing equipment is used). Although some of the

“EYEBROWLEFTOUTER”:

“Y”:

ALGORITHM,

NEC’S

trolled conditions of lighting, pose, etc., and are of high resolution

“X”:

TECHNOLOGY

THE

Images in those databases are mainly taken under strictly con-

DESCRIPTION

}, “MOUTHRIGHT”:

CAUSED

es that include sets for voice recognition—BANCA, XM2VTSDB).

D ATA B A S E D E M O G R A P H I C S

CAMERA / “X”:

CHANGES

available, surveillance equipment (e.g. some multimodal databas-

{

}, “MOUTHLEFT”:

BLINKING

CMU PIE, FERET and are not taken using proper, commercially

{

“X”:

AND

OF THE LOCAL CHANGES GUARANTEES THE

}, “NOSETIP”:

SMILING

OVERALL FACE RECOGNITION ACCURACY.

“PUPILRIGHT”: 609.8,

Interest in face recognition, as a combination of pattern recogniand many real-world systems are being developed and distributed.

THE

}, “X”:

LOCAL

INTENTIONAL

IMPACT

“Y”:

ADVERSE

tion and image analysis is still growing. Many papers are written

THE

125

507,

EYES,

FACE

460,

“X”:

IMPACT

WEARING OF CAPS, HATS AND GLASSES),

METRIC: COSINE ANGLE

“FACELANDMARKS”: “PUPILLEFT”: {

THE

CAUSED AND

FRONTAL IMAGES. DISTANCE

228,

REDUCE

CHANGES (E.G., VARYING FACIAL EXPRESSION

{

tional security system. The obvious lack of such image sets is the main reason for a low number of studies on face recognition in such naturalistic conditions, resulting in very high recognition rates suggesting that face recognition is almost a solved problem. This was the main motivation for collecting our database. Our database mimics the real-world conditions as close as possible and we think that using our database in experiments will show that real-world indoor face recognition by using standard video surveillance equipment is far from being a solved problem.

D ATA B A S E S U B J E C T S ’ A G E 3 5 DISTRIBUTION FROM

TOTAL

TEERS, 16

114

130

WERE

VOLUN-

MALES

AND

FEMALES.

DAYTIME

This should be a straightforward test comparing the mug shot image (visible light, i.e. subjectID_

TESTS

frontal.jpg images) to cam1–5 images. Frontal mug shots represent the gallery of known images and cam1–5 images and three distances are probe sets (the main idea is similar to FRVT tests as in). This scenario gives 15 possible different probe sets, varying both in distances from camera and in camera qualities. Comparing the probe image to one gallery image is the most logical rea world (law enforcement) scenario.

NIGHTTIME

NightTime tests have two possible galleries that can be used. Cam6-7 images (IR night vision) can

TESTS

be compared to mug shot images (visible light, i.e. subjectID_frontal.jpg images) or they can be compared to cam8 (IR night vision mug shot) images. This scenario gives two galleries and six possible

different probe sets, again varying both in distances from camera and in camera qualities.

PERFORMANCE

METRICS

Rank 1 results and Cumulative Match Score (CMS) curves are the usual way to report results in identification scenario (for details see [16]), while ROI plots are the usual way of reporting results in verification scenario. Besides, we suggest the use of some form of statistical significance test like McNemar’s hypothesis test. Error bars and standard deviations of performance results are also something that would help establish the true robustness and performance improvement of any

algorithm. Control algorithm performance scores must accompany performance results of novel algorithms. This control algorithm should be something that is easily implemented and readily available to all, like PCA or correlation. An initial baseline PCA tests

will be provided here, but the researchers are encouraged to expand and improve them to make solid grounds for control algorithms comparisons in the future.

NUMBER OF SUBJECTS

TRAINING

SCENARIO

SUGGESTION

In order for any tests to be close to expected real-world (law enforcement) applications and for the results to be meaningful, we strongly suggest the use of a separate set of face imag-

es for training, especially in the identification scenario. In other words, any experiment should be designed so that images used in algorithm training stage have no overlap with the query images (gallery and probe). We believe that this is the only way to make

fair comparisons and to present fair performance results. Other databases (like FERET or CMU PIE) can be used for training, similar to the setup from. All algorithms will probably experience a serious performance drop with this experimental setup, but it will

actually show their true strength. 20

35 AGE

DATABASE DESCRIPTION

EXAMPLE OF DIFFERENT POSE IMAGES

GENERAL

INFO

Over the last ten years or so, face recognition has become a popular area of research in computer vision and one of the most successful ap-

The problem specifically addressed with our

plications of image analysis and understanding.

database is the law enforcement person identifi-

Because of the nature of the problem, not only

cation from low quality surveillance images or any

computer science researchers are interested in

other identification scenario where the subjectâ&#x20AC;&#x2122;s

it, but neuroscientists and psychologists also.

cooperation is not expected. Recognizing how

It is the general opinion that advances in com-

important this topic is, researchers are perform-

puter vision research will provide useful insights

ing more and more experiments using CCTV

to neuroscientists and psychologists into how

(Closed Circuit Television) images, exploring

human brain works, and vice versa. A general

performance of both humans and computers.

statement of the face recognition problem (in

Since up to now there were no publicly availa-

computer vision) can be formulated as follows:

ble CCTV face images databases, researchers

Given still or video images of a scene, identify

were forced to perform the experiments using

or verify one or more persons in the scene using

their images they captured themselves, which

a stored database of faces.

in turn often results in having a low number of subjects and results which have questionable

WHAT

statistical significance. We feel that our SCface

FACIAL

RECOGNITION?

Like all biometrics solutions, face recognition technology measures and matches the unique

database will successfully fill this gap. For this purpose, we collected images from

characteristics for the purposes of identification

130 subjects. The database consists of 4,160 images in total, taken in uncontrolled

or authentication. Often leveraging a digital or

lighting (infrared night vision images taken in dark) with five different quality sur-

connected camera, facial recognition software

veillance video cameras. Two cameras were able to operate in infrared (IR) night

can detect faces in images, quantify their fea-

vision mode so we recorded that IR output as well (we will address these images

tures, and then match them against stored tem-

as IR images; for cameras details please see the Appendix). Subjectsâ&#x20AC;&#x2122; images were

plates in a database. Face scanning biometric

taken at three distinct distances from the cameras with the outdoor (sun) light as

tech is incredibly versatile and this is reflected in

the only source of illumination. All images were collected over a 5 days period. This

its wide range of potential applications.

database was collected mainly having identification rather then verification in mind, with the additional motive being the fact that identification is a more demanding recognition problem with its one-to-many comparisons. The proposed protocol and

WHERE

FIND

FACIAL

RECOGNITION??

Like all biometrics solutions, face recognition technology measures and matches the unique

initial testing with PCA will also be along these lines of thought. Reported results

characteristics for the purposes of identification

will accordingly show rank 1 recognition percentages, but Cumulative Match Score

or authentication. Often leveraging a digital or

(CMS) curves will be omitted, since such a detailed analysis is beyond the scope of

connected camera, facial recognition software

this paper. Here is a short summary of what makes this dataset interesting to face

can detect faces in images, quantify their fea-

recognition research community.

tures, and then match them against stored templates in a database. Face scanning biometric tech is incredibly versatile and this is reflected in its wide range of potential applications.

IMAGE SETS DETAILS// 1.SCALE, EXPRESS I O N A N D P O S E I N V A R I A N T FACE DETECTION Facial mug shots are high quality static color images, taken in controlled indoor illumination conditions environment using Canon EOS 10D digital camera. There is

2.THE STUDY OF A U T H E N T I C A T I O N A N D R E C O G N I FRONTAL

FACIAL

MUG

SHOTS

TION FOR FACES C A P T U R E D U N D E R C O N T R O L L E D AND UNCONTROLL E D C O N D I T I O N

one mug shot per subject and those images are labeled with the suffix frontal (e.g.

3.ANALYZING IMAG E S C A P T U R E D I N N I G H T M O D E .

001_frontal.jpg). Images are in lossless 24 bit color JPEG format with the original size of 3,072x2,048 pixels, cropped to 1,600x1,200 pixels. Cropping was done fol-

4.OCCLUSION ANAL Y S I S

lowing the ANSI 385-2004 standard recommendation so that the face occupies ap-

5.GAIT ANALYSIS

proximately 80% of the image. Exact positions of eye centers, tip of the nose and the center of the mouth are manually extracted and stored in a textual file (this was also done for all images in the database except for the different pose set). These mug shot images are what you would expect to find in a law enforcement database or when registering to a security system. There are in total 130 frontal facial mug

We have collected face database with variation in pose, scale,

shot images in the database, one per subject.

rotation, expression, illumination, occlusion, which are captured

POSSIBLE DATA USAGE AND AVAILABILITY

under both controlled and uncontrolled conditions in order to fulfil Cameras 1â&#x20AC;&#x201C;5 are visible light cameras and all images with camNum labels cam1â&#x20AC;&#x201C;5 represent images taken with surveillance cameras of different qualities. There are three images per subject for each camera, taken at three discrete distances (4.20, 2.60 and 1.00 m). This gives total of 15 images per subject in this set (1950 in total).

SURVEILLANCE CAMERAS IMAGES SURVEILLANCE CAMERAS IMAGES (VISIBLE LIGHT) (VISIBLE LIGHT)

the requirement of having a challenging and complete database for the task of detecting, authenticating and recognizing faces captured by surveillance camera. Providing such variations in our database, we suggest that this database can be used as follow.

This set was designed to test the face recognition algorithms in real-world surveillance setup (we again emphasize that we used commercially available cameras of both low and high quality). As can be seen from the Fig. 5, the images differ substantially in quality and resolution. The original images saved from Digital Sprite 2 recorder to PC hard drive all have the same dimensions (680x556 pixels, 96 dpi and 24 bit color). They have been cropped in order to remove as much background as possible. Due to different distances at which the images were taken, new (cropped) images are not all the same size. Cropped images have the following resolutions: 100x75 pixels for distance 1, 144x108 for distance 2 and 224x168 for distance 3. All the coordinates mentioned previously were labeled and stored in a textual database for this set of images as well.

In this paper, we have described and proposed a large scale, more challenging and large variability surveillance camera face database that consists of 15,000 and 5,000 images captured under controlled and uncontrolled conditions, respectively and 200 videos. The main objective of producing this database is to tackle the limitations of the available databases that are used in the task of detecting, authenticating and recognizing faces captured by surveillance camera. In order to make our database more challenging, we have maintained the same trends for face detection and recognition, which have been introduced in the available database such as variation in pose, expression and illumination. In future, we will increase the total number of images and length of the recorded videos.

CONCLUSIONS

POTENTIAL USES OF THE DATABASE

SURVEILLANCE IMAGES

(IR

CAMERAS

NIGHT

VISION)

IR night vision surveillance cameras images are completely the same as visible light surveillance images in section 2.3.2 regarding the resolution but are in grayscale. IR surveillance images are labeled as cam6 and cam7, again at three discrete distances.

The first and the most important potential use of this database is

There are six IR night vision surveillance images per subject, thus

to test the face recognition algorithm’s robustness in a real world

780 of them in total.

surveillance scenario. In such a setup a face recognition system should recognize a person by comparing an image captured by a surveillance camera to the image stored in a database. If we postulate a law enforcement use as the most potential scenario, this

DIFFERENT

POSE

IMAGES

database image, to which the surveillance image is compared to, is a

This set of images was taken with the same high quality photo

high quality full frontal facial mug shot. We would specifically like to

camera as frontal facial mug shots and under the same condi-

encourage researchers to explore more deeply the small sample size

tions. Subjects’ pose ranges from 90 (left profile) to +90 degrees

problem (more dimensions than examples). Having one frontal mug

(right profile) in nine discrete steps of 22.5 degrees. Image size

shot per subject in our database addresses this issue adequately.

is 3,072x2,048 pixels and frontal mug shot images of the same

Our surveillance cameras were of various quality and resolution

size are included for the sake of completeness. There are nine

(see Appendix) and this issue is the strong point of this database.

images per subject in this set, which gives 1,170 images in total.

It remains to be seen how will face recognition algorithms perform

The correlation between the angle label in the file name and the

in such difficult conditions and how does the quality of capturing

actual head pose is given in the Appendix.

equipment and subject’s distance from camera influence the results. There is also a potential to test various image preprocessing algorithms (enhancing image quality by filtering techniques), as some of these surveillance images are of extremely low quality and resolution. By including different pose images of subjects, we made it possible to use this database in

DATABASE

DEMOGRAPHICS

The participants in this project were students, professors or employees at the Faculty of Electrical Engineering and Computing, University of Zagreb, Croatia. From total of 130 volunteers, 114 were males and 16 females. All participants were Caucasians, between the ages of 20 and 75. The age distribution is displayed, the textual file containing each subjects date and year of birth will be distributed with the database. This also makes our SCface database unique, because subjects’ birthdays are not available in any other database. Moreover, the textual file will contain the following additional information: gender, beard, moustache, glasses.

face modeling and 3D face recognition. Other potential uses of this database include but are not restricted to: evaluation of head pose estimation algorithms, evaluation of face recognition algorithms’ robustness to different poses, evaluation of natural illumination normalization algorithms, indoor face recognition (in uncontrolled environment), low resolution images influence, etc.

P E R F O R M A N C E R E S U LT S F O R D AY T I M E E X P E R I M E N T S . GALLERY MUG

SHOT

IMAGES. COSINE

FRONTAL DISTANCE

METRIC:

ANGLE

We normalized all images from our SCface database following the standard procedure often described in many face recognition papers. Firstly, all color images were

APPENDIX

IMAGE PREPROCESSING

turned to grayscale. By using eye coordinates information, all images are scaled and rotated so that the distance between the eyes is always the defined number of pixels (32 in our case) and the eyes lie on a straight line (head tilt is corrected). Then the image is cropped to 64x64 pixels and masked with an elliptical mask. Two

CAMERA/

R E C O G N I T I O N R AT E [ % ]

D I S TA N C E

examples of preprocessed images can be seen in Fig. 8. The eyes are at positions left eye, and right eye. Standard histogram equalization (Matlab) is done prior to masking, normalizing the pixel values to [0, 255].

CAM1_1

2.3

The experiment performed is set up in a most difficult manner imaginable, but

CAM1_2

7.7

it mimics what can be expected in real world face recognition applications: the

CAM1_3

5.4

training set is different than the test set. In other words, training and test sets

CAM2_1

3.1

completely independent (images used in training will not be used in gallery or probe

CAM2_2

7.7

CAM2_3

3.9

CAM3_1

1.5

CAM3_2

3.9

CAM3_3

7.7

We will preprocess 501 randomly chosen FERET database images in the same

CAM4_1

0.7

manner as we preprocessed our database images and we will use those images to

CAM4_2

3.9

CAM4_3

8.5

CAM5_1

1.5

CAM5_2

7.7

CAM5_3

5.4

sets). We will use a totally different database for training to insure just that. This is why the recognition rates achieved will seem so low. This kind of setup, although often discussed, is rarely used in research papers, but we feel it is time to push this research area into a new level.

train baseline PCA. PCA algorithm details are well known and are beyond the scope of this paper. They will not be discussed here and an interested reader is referred to. Gallery and probe images will be from our SCface database but with no overlap between the two sets (no images from the gallery are in the probe set and vice versa). By training the PCA with one image per person (the frontal mug shot) we will also address the small sample size problem (more dimensions than examples). After the training, we kept 40% of the eigenvectors with highest eigenvalues, thus forming 200 dimensional face (sub)space in which the recognition will be done. Using the fact that once the PCA is trained the distance between any pair of projected images is constant, we were able to perform virtual experiments. We projected all images from our database onto the derived 200 dimensional subspace and calculated distance matrix (distance matrix being the size n x n, where n is the number of images in our database, and containing the distances between all images). All experiments were then performed on those matrices by changing the list items in probe and gallery sets and without running the algorithms again. Distance metric used was the cosine angle (COS).

EXPERIMENTAL DESIGN

INDEPENDENT COMPARATIVE STUDY OF PCA, ICA, AND LDA ON THE FERET DATA SET

Face recognition is one of the most successful applications of image analysis and understanding and has gained much attention in recent years. Various algorithms

ABSTRACT

were proposed and research groups across the world reported different and often contradictory results when comparing them. The aim of this paper is to present an independent, comparative study of three most popular appearance-based face recognition projection methods (PCA, ICA, and LDA) in completely equal working conditions regarding preprocessing and algorithm implementation. We are motivated by the lack of direct and detailed independent comparisons of all possible algorithm implementations (e.g., all projectionâ&#x20AC;&#x201C;metric combinations) in available literature. For consistency with other studies, FERET data set is used with its standard tests (gallery and probe sets). Our results show that no particular projection metric combination is the best across all standard FERET tests and the choice of appropriate projection metric combination can only be made for a specific task. Our results are compared to other available studies and some discrepancies are pointed out. As an additional contribution, we also introduce our new idea of hypothesis testing across all ranks when comparing performance results. VC 2006 Wiley Periodicals, Inc. Int J Imaging Syst Technol, 15, 252â&#x20AC;&#x201C;260, 2005; Published online in Wiley InterScience.

KEYWORD F A C E RECOGNITION; PCA; ICA;

1. APPEARANCE - BASED, WHICH USES HOLISTIC TEXTURE

Among many approaches to the problem of face recognition, appearance-based

LDA;

FEATURES AND IS APPLIED TO EITHER WHOLE - FACE

subspace analysis, although one of the oldest, still gives the most promising results.

F E R E T;

OR SPECIFIC REGIONS IN A FACE IMAGE;

Subspace analysis is done by projecting an image into a lower dimensional space

S U B S PACE ANALYSIS METHODS

2. FEATURE - BASED, WHICH USES GEOMETRIC FACIAL

(subspace) and after that recognition is performed by measuring the distances

FEATURES ( MOUTH, EYES, BROWS, CHEEKS ETC. ) AND GEOMETRIC RELATIONSHIPS BETWEEN THEM.

Over the last ten years or so, face recognition has become a popular area of research in computer vision and one of the most successful applications of image analysis and understanding. Because of the nature of the problem, not only computer science researchers are interested in it, but also neuroscientists and psychologists. It is the general opinion that advances in computer vision research will provide useful insights to neuroscientists and psychologists into how human brain works, and vice versa. A general statement of the face recognition problem can be formulated as follows (Zhao et al., 2003): Given still or video images of a scene, identify or verify one or more persons in the scene using a stored database of faces. A sur vey of face recognition techniques has been given by Zhao et al., (2003). In general, face recognition techniques can be divided into two groups based on the face representation they use:

INTRODUCTION

between known images and the image to be recognized. The most challenging part of such a system is finding an adequate subspace. In this paper, three most popular appearance-based subspace projection methods for face recognition will be presented, and they will be combined with four common distance metrics. Projection methods to be presented are: Principal Component Analysis (PCA), Independent Component Analysis (ICA), and Linear Discriminant Analysis (LDA). PCA (Turk and Pentland, 1991) finds a set of the most representative projection vectors such that the projected samples retain most information about original samples. ICA (Bartlett et al., 2002; Draper et al., 2003) captures both second and higherorder statistics and projects the input data onto the basis vectors that are as statistically independent as possible. LDA (Belhumeur et al., 1996; Zhao et al., 1998) uses the class information and finds a set of vectors that maximize the between-class scatter while minimizing the within-class scatter. Distance metrics used are L1 (City block), L2 (Euclidean), cosine and Mahalanobis distance.

S U R V E I L L A N C E C A M E R A S S P E C I F I C AT I O N S DATABASE

DESCRIPTION

TYPE

CCD TYPE

CAM2

CAM3

CAM4

CAM5

BOSCH

SHANY

J&S J

ALARMCOM

SHANY

LT C 0 4 9 5 / 5 1

WTC-8342

CC-915D

VFD400-12B

MTC-L1438

1/3’’ IT

1/3’’ SONY

1/3’’ COLOR

1/3’’ IT

1/3’’ SONY SUPER

SUPER HAD

597X537

CAM1

ACTIVE PIXELS

HAD 795X596 480

RESOLUTION

752X582

795X596

350 TVL

752X582

TVL

MINIMUM ILLUMINA-

540 TVL 0,24/0,038

480 TVL

0,3 LUX

460 TVL

0 LUX (IR LED ON

TION

LUX (IR)

0,15 LUX

> 48 DB

1,5 LUX

AT 4 L U X )

46 DB

> 50 DB

SNR

1 VPP, 75 Ω

1 VPP,

VIDEO OUTPUT

> 50 DB

COMMENT

1 VPP, 75 Ω IR

75 Ω DOME CAM-

1 VPP,

NIGHT VISION

ERA

75 Ω DOME

1 VPP, 75 Ω IR NIGHT VISION

CAMERA

M U G S H OT C A M E R A Photographer’s camera was Canon EOS 10D model with 22.7x15.1 mm CMOS sensor, with 6.3 mega pixels, equipped with Sigma 18–50 mm F3.5–5.6 DC lenses and Sigma EF 500 DG Super flash.

INTER - SESSION VARIABILITY MODELLING AND JOINT FACTOR ANALYSIS FOR FACE AUTHENTICATION

This paper applies inter-session variability modelling and joint factor analysis to face authentication using Gaussian mixture models. These techniques, originally

ABSTRACT

developed for speaker authentication, aim to explicitly model and remove detrimental within-client (inter-session) variation from client models. We apply the techniques to face authentication on the publicly-available BANCA, SCface and MOBIO databases. We propose a face authentication protocol for the challenging SCface database, and provide the first results on the MOBIO still face protocol. The techniques provide relative reductions in error rate of up to 44%, using only

GMMS

limited training data. On the BANCA database, our results represent a 31% reduc-

AUTHENTICATION

tion in error rate when benchmarked against previous work.

can be modelled and removed using ISV and JFA.

approaches still suffer from increased errors in the presence of substantial ses-

This hypothesis is supported by this work, as the

sion variability. One approach of particular interest uses a parts-based topology,

ISV and JFA techniques reduce the error rate

whereby the distribution of features extracted from images of a client’s face is

by between 11% and 44% across the challenging

described by a Gaussian mixture model (GMM). In this approach was found to offer

BANCA, SCface and MOBIO databases.

the best trade off in terms of complexity, robustness and discrimination. Inter-

cessful techniques in improving robustness to session variability for speaker authentication are inter-

[ {

GMMS “48CDF8C8-841C-4D33-

“FACEID”:

B875-1710A3FC6542”, “FACERECTANGLE”: “WIDTH”:

session variability modelling (ISV) and the related

“LEFT”:

been shown to reduce errors by more than 30%.

“TOP”:

ISV and JFA aim to estimate more reliable client

{

228,

“HEIGHT”:

technique of joint factor analysis (JFA), which have

models by explicitly modelling and removing within

228,

460, 125

client variation using a low dimensional subspace. In speaker authentication, this detrimental variation is caused by different microphones, acoustic environments and transmission channels. JFA can be considered to be an extension of ISV as it also explicitly models between-client variation.

sources of within-class variation in speech have

pose and image acquisition, which we hypothesise

While face authentication has evolved considerably in the past 15 years, modern

received considerable focus. Two of the most suc-

authenti-cation task. The intuition is that the

the effects of variation in environment, expression,

acqusition, which cause mismatch between images of the same client (person).

which the issue of session variability has already

In this paper, we apply ISV and JFA to the face

thentication performance is adversely affected by

variability, that is, changes in environment, illumination, pose, expression, or image

basis of state-of-the-art speaker authentication, in

FACE

parallels in facial images. Specifically, face au-

Many challenges in face authentication can be attributed to the problem of session

estingly, a similar GMM based approach forms the

FOR

FACE

AUTHENTICATION

The GMM partsbased topology was first applied to face authentication in and has since been successfully utilised by several researchers. The method decomposes the face into a set of blocks that are considered to be separate observations of the same signal (the face). An overview of the procedure is shown in Figure 1. The main difference between this approach and the similar GMM-based approach to speaker authentication is the use of visual features extracted from parts of the face image, rather than acoustic features extracted from frames of the speech signal. Since features are extracted from each part of the face independently, the approach is naturally robust to occlusion, local transformation and face mislocalisation, and has been found to offer the best trade-off in terms of complexity, robustness and discrimination. The rest of this section describes the main processing stages of the framework, including image pre-processing, feature extraction and classi-fication.

Each image is rotated, cropped and registered to a 64x80 intensity image with the eyes 16 pixels from the top and separated by 33 pixels. Optionally, each cropped image is pro-cessed using Tan & Triggs normalisation. From each normalised image we exhaustively sample B x B blocks of pixel values by moving the sampling window one pixel at a time. Each block is mean and variance normalised prior to extracting the D lowest-frequency 2D-DCT coefficients. The resulting feature vectors extracted from an image are finally mean and variance normalised in each dimension. Each image is thus represented by a set of K feature vectors, O = o1,o2,...,oK.

IMAGE

PRE-PROCESSING

FEATURE

AND

EXTRACTION INDOOR AND OUTDOOR CAMERA S P E C I F I C AT I O N S IMAGE

ACQUISITION

PROCEDURES

The distribution of feature vectors for each client is modelled by a Gaussian mixture model, estimated using back ground model adaptation. Background model adaptation utilises a universal background model (UBM), m, as a prior for deriving client models using maximum a posteriori (MAP) adaptation. We only adapt the means of the GMM components and use diagonal covariance matrices, as this requires fewer observations to perform adaptation and has

S P E C I F I C AT I O N

INDOOR CAMERA

TYPE

ISECURE

IMAGE SENSOR

1/4â&#x20AC;?

RESOLUTION

700TVL

MINIMUM

0.3LUX/F1.6

I L L U M I N AT I O N

10X

OUTDOOR CAMERA

already been shown to be effective for face authentication. A test image, Ot, can then be verified by scoring against the model of the claimed client identity (si) and the UBM (m). The two models, si and m, each produce a loglikelihood from which a log-likelihood ratio (LLR) is calculated, to produce a single score. The image Ot is then classified as belonging to client i if and only if h(Ot,si) is greater than a threshold. In this work, we use a fast scoring technique known as linear scoring. This relies on using a particular representation of a GMM called a GMM mean supervector, which is formed by concatenating the GMM component means. Thus, with GMMs expressed as supervectors, linear scoring approximates with the diagonal matrix formed by concatenating the diagonals of the UBM covariance matrices, Fxt = Ft x Ntm is the supervector of UBM-centralised first order UBM statistics of the image, and N t is the diagonal matrix formed by concatenating the zeroth order UBM statistics of the test image. Readers are referred to for more details. Finally, the scores are normalised using ZT-norm (Z-norm followed by T-norm). Intersession variability modelling (ISV) and joint factor analysis (JFA) are two session variability model-

SDM01HD

SONY

ISECURESDH01HD D1

EFFIO

COLOR/750TVL

OPTICAL

1/3â&#x20AC;? B/W

SONYEXIEW CCD

700TVL

COLOR/750TVL B/W

0.3LUX/F1.6 ZOOM

27X

OPTICAL

ZOOM

OPTICAL FOCUS A P P L I C AT I O N

TRUE

DAY/NIGHT (ICR)

TRUE

DAY/NIGHT (ICR)

ling techniques that have been applied with success to speaker authentication. This section provides a brief overview of these techniques. Readers are referred to [11, 24, 15] for more details. Session variability modelling aims to estimate and exclude the effects of within-client variation, in order to create more reliable client models. At enrolment time, a GMM is trained for each client by adapting the means of a UBM, as described in Section 2. In terms of GMM mean supervectors, client enrolment can be expressed as where D is a diagonal matrix with elements and zi x N (0, I). Here, x (q, q) is the variance from the UBM and x is the adaptation relevance facto. The matrix D is set in this way to ensure that the MAP solution for di in is equivalent to the classical MAP mean update rule.

The system used to develop the database consists of two PTZ cameras, a monitor, a DVR and Joystick, which is installed in the Multimedia System Laboratory at the Department of Computer and Communication System Engineering, Universiti Putra Malaysia. The indoor PTZ camera has the capability to zoom up to 10 times optical zoom and was installed at the height of 3.66m from the floor level, while the outdoor camera has the capability to zoom up to 27 times optical zoom and was installed at the height of 4m. The specifications of both cameras are shown in Table. The setting of camera parameters and the distance of the subjects to the indoor PTZ camera is calibrated in order to ensure that setting should be identical across subjects.

CCTV

SYSTEM

To assess face authentication accuracy, scores were first generated on a development set, from which a global decision thresh-

EXPERIMENTAL

PROTOCOLS

old was found that minimised the equal error rate (EER). This threshold was then applied to a test set of completely separate clients to find the half total error rate (HTER), that is, the average of false acceptance and false rejection rates. Thus the threshold, as well as all other hyperparameters, were tuned prior to seeing the test set. This is a critical requirement if such technology is to be applied to real applications. In the past, a wide variety of databases have

The MOBIO database is a large and challenging biometric da-

been used for evaluation of face authentication

tabase. It contains videos of 150 participants captured in chal-

techniques. To properly evaluate ISV and JFA,

MOBIO

lenging real-world conditions on a mobile phone camera over a

we chose to use images taken in challenging

one and a half year period. Figure 2c shows example images,

conditions causing substantial within client

demonstrating session variability due to variation in pose and

variation. Furthermore, we chose to restrict

illumination. For this work, one image was extracted from each of

ourselves to publicly-available databases with

the videos and was manually annotated with eye locations. Using

separate training, development and test sets

manual face localisation allows us to evaluate face authenti-

to allow for unbiased evaluation. Unfortunate-

cation accuracy separately from the choice of face detection

ly, some popular databases such as FRGC

algorithm. These images and annotations will be made available

and LFW were thus not applicable, as they do

to facilitate future benchmarking5.

not include separate development and test sets2. We therefore

The MOBIO protocol is supplied with the database and defines

chose to evaluate the ISV and JFA techniques on the challenging

three non-overlapping partitions: training, development and test-

BANCA, SCface and MOBIO databases. The BANCA and MOBIO

ing. The development and testing partitions are defined in a gen-

databases already have well-defined protocols, while a suitable

derdependent manner, such that clientsâ&#x20AC;&#x2122; models are only test-

face authentication protocol for SCface is proposed in this work.

ed against images from clients of the same gender. We chose to use the training data in a gender independent manner to be consistent with the other databases, though future work could investigate gender dependent training. For male clients, the test set contains 3,990 target trials and 147,630 impostor trials. For female clients, there are 2,100 target trials and 39,900 impostor trials. All of the training data was used to train U , V , and D (9,579

SCFACE

The Surveillance Cameras face database (SCface) is particularly interesting from a forensics point of view because images were acquired using commercially available surveillance equipment, in a

images of 50 clients). A subset of 1,224 images of 34 clients (36 images each) was used for UBM training, while the other 16 clients were used for score normalisation.

range of challenging but realistic conditions. As could be imagined in a real-life scenario, for authentication, these surveillance images are compared to a single high-resolution mugshot image. Whilemsuggested a protocol for face identification, it did not include a world data set or a development data set, and no protocol was suggested for face authentication. Therefore, we propose the following face authentication protocol for SCface based on the DayTime tests scenario4. Then, test images were taken from the 5 surveillance cameras at 3 different distances: close, medium and far. Each client model was tested against the 15 surveillance images from each client in the same subset. This resulted in 645 target trials and 27,090 impostor trials in the test set. Unless otherwise noted, results are reported for a combined protocol, in which each test image was assumed to originate from an unknown camera at an unknown distance.

In this section, results are reported for each database independently. A GMM parts-based system, as described in Section 2, without ISV or JFA is used as the baseline system for comparison. Hyper-parameters were tuned on the development set for each database, including the block size used during feature extraction, and the dimensionality of subspaces U and V . UBMs were trained with 512 components6, and a relevance factor of = 4 was used for client model adaptation. For experiments on SCface only, cropped images were not pre-processed using Tan & Triggs normalisation, as it did not improve performance in that case. ISV and JFA were implemented based on the JFA cookbook7. Subspaces V , U and D were trained using 10 EM iterations, in that order for JFA. For ISV, only U was trained. Latent variables were estimated in the order yi (JFA only), xi,j, then zi, using one Gauss-Seidel iteration. Manual face localisation was utilised unless otherwise noted. For evaluating the statistical significance of improvements in HTER, we used the methodology proposed by equation and Figure 2, with a one tailed test.

RESULTS

SCFACE

BANCA

MOBIO(MALE)

MOBIO(FEMALE)

DEV

TEST

DEV

TEST

DEV

TEST

DEV

TEST

9.3%

8.2%

23.8%

25.7%

23.8%

25.7%

23.8%

25.7%

7.8%

6.1%

20.2%

20.6%

20.2%

20.6%

20.2%

20.6%

7.8%

6.5%

18.5%

17.7%

18.5%

17.7%

18.5%

17.7%

8.6%

7.6%

16.7%

16.4%

16.7%

16.4%

16.7%

16.4%

8.6%

7.2%

17.4%

16.4%

17.4%

16.4%

17.4%

16.4%

RESULTS ON BANCA, SCFACE AND MOBIO ( EER ON DEV SET, HTER ON TEST SET ) SHOWIN G T H E E F F E C T O F B L O C K S I Z E D U R I N G FEATURE EXTRACTION ( B Ã&#x2014; B PIXEL BLOCKS WITH D DCT COEFFICIENTS RETAINED ) .

Firstly, the size of the blocks used during feature extraction was

FEATURE

tuned on development data. The number of DCT coefficients for a

SCORE

given block size, D, was initially tuned on BANCA. As shown in Table

EXTRACTION

AND

NORMALISATION

BANCA

SCFACE

DEV

TEST

DEV

TEST

NO SCORE NORM

11.0%

11.1%

23.9%

25.1%

ZT-NORM

7.8%

6.1%

16.7%

16.4%

1, the optimal block sizes were 12 x 12 and 20 x 20 pixels for the BANCA and SC face databases, respectively. For MOBIO, across male and female clients, a block size of 12 x 12 was chosen. Table 2 illustrates that ZT-norm score normalisation was very effective for BANCA and SCface, with relative reductions in test set HTER of 45% and 35% respectively. For MOBIO, ZT-norm had little effect. This system, with tuned block size and ZT-norm score normalisation but without ISV or JFA, is referred to as the baseline system for the following session variability modelling experiments.

On both the development and test sets, the best performance was achieved using the ISV approach, which improved test set HTER by 11% relative.

RESULTS ON BANCA AND SCFACE ( EER ON DEV SET, HTER ON TEST SET ) SHOWING THE EFFECT OF ZT - NORM SCORE NORMALISATION.

SESSION

VARIABILITY

MODELLING

BANCA

These improvements are statistically significant

MAN,FACE LOC.

at a level of 95% and 85% for development and test sets respectively. Table 4 shows that using 50 to 100 dimensions in U was optimal on the

A U TO . F A C E L O C .

SYSTEM

DEV

TEST

DEV

TEST

BASELINE

7.8%

6.1%

9.2%

6.7%

ISV

6.6%

5.4%

7.5%

6.0%

JFA

6.6%

6.3%

9.1%

7.0%

development set, and this generalised well to the test set. It is encouraging that these results do not appear overly sensitive to the choice of subspace dimension. Our results are compared to recently published work in Table 5. For comparison, we report the HTER on the test set (g2), development set (g1)8, and the average. Used automatic face extraction from the BANCA videos, while Ahonen et al. Used the 5 preselected images from each video as in this work. Our results represent a 31% reduction in average HTER when compared to previous work.

RESULTS ON BANCA AND SCFACE ( EER ON DEV SET, HTER ON TEST SET ) SHOWING THE EFFECT OF ZT - NORM SCORE NORMALISATION.

NEW IMAGE QUALITY MEASURE BASED ON WAVELETS // Discrete wavelet transform DWT can be used in various image processing applications, such as image compression and coding.1 In this paper, we examine how DWT can be used in image quality evaluation, which has become crucial for the most image processing applications. Quality of an image can be evaluated using different measures. The best way to do this is by making a visual experiment, under controlled conditions, in which human observers grade which image provides better quality. Such experiments are time consuming and costly. A much easier approach is to use some objective measure that evaluates the numerical error between the original image and the tested one. In the real world, there is no perfect way for an objective assessment of image quality.2 The problem with the most objective measures is SESSION

VARIABILITY

MODELLING

SCFACE

On the SCface database, as shown in Table 6, both of the pro-

1. OBJECTIVE MEASUREMENTS, INCLUDING OUR

that objective measures need a reference origi-

posed session variability modelling techniques outperformed the

OWN DEVELOPED MEASURE, WERE PERFORMED

nal image to be able to grade the corresponding

baseline. JFA offered consistently improved performance over the

ON THE SAME SET OF PICTURE SEQUENCES

tested image, while human observers can grade

baseline and ISV systems, resulting in a relative reduction in test

TAKEN FROM AN ALREADY KNOWN IMAGE DATABASE.

image quality independently of a corresponding

2. SUBJECTIVE ASSESSMENT RESULTS WERE TAKEN

been many attempts to develop models or met-

set HTER of 18% over baseline results. The dimensionalities of V and U were tuned on the development set to values of 10 and 40, respectively. On the test set, the improvements provided by ISV and

FROM REF. THE DATABASE INCLUDES SUBJEC

rics for image quality that incorporate elements

JFA over the baseline are statistically significant at levels greater

TIVE GRADES WITH CALCULATED DIFFEREN

of human visual system HVS sensitivity. These

than 98% and 99% respectively. Results are further analysed by

TIAL MEAN OPINION SCORE

metrics for quality assessment have limited ef-

separating the scores into three separate groups, i.e. those from

THE MAIN GOAL OF THESE STUDIES WAS TO

test images taken at the 3 different distances, close, medium and far. For this analysis only, the dataset used for Z-norm score normalisation was matched to the distance of the test image. Table 8 shows that the JFA approach provided substantial improvements for close and medium images, however, recognising far images remains particularly difficult.

VARIABILITY

MODELLING

MOBIO

DMOS

RESULTS.

OBTAIN SUBJECTIVE RESULTS THAT WOULD BE USED IN THE THIRD STEP FOR VERIFICATION AND COMPARISON OF OBJECTIVE MEASURES. 3. THE RESULTS OF THE OBJECTIVE ASSESSMENTS STEP 1 AND SUBJECTIVE MEASUREMENTS STEP 2

SESSION

original image. Over the past years, there have

WERE STUDIED.

relative improvement over the baseline of 30%, using dimensionalities of 30 and 50 for V and U respectively. For female clients, the ISV technique performed the best, providing a relative improvement of 44% with 250 dimensions in U .

standard and objective definition of image quality. In our research, Watsonâ&#x20AC;&#x2122;s wavelet model5 is used to incorporate HVS characteristics in image quality measure. This model is based on direct measurement of the HVS noise visibility threshold for the specific wavelet decomposional filter. Blank images with uniform gray level

substantially reduced the error rate compared to the baseline, 99.99%. For the male tests, JFA outperformed ISV, providing a

of real images. However, there is no current

tion level using a linear-phase CDF9_7 biorthog-

On the MOBIO database, as shown in Table 7, both ISV and JFA with improvements statistically significant at a level greater than

fectiveness in predicting the subjective quality

were decomposed, and afterward noise was added to the wavelet coefficients. After inverse wavelet transform, the noise visibility threshold in the spatial domain was measured by subjective experimentation at a fixed viewing distance. An experiment was conducted for each subband, and the visual sensitivity for that subband was then defined as the reciprocal of the corresponding visibility threshold. This model can be directly applied for the perceptual image compression by quantizing the wavelet coefficients according to their visibility thresholds. It is also extendable to imagequality assessment, as was done in Ref. 6, where a wavelet visible difference predictor was used to predict visible differences between original and compressed or noisy images. In this paper, we present a new way of using Watsonâ&#x20AC;&#x2122;s wavelet model for image-quality evaluation.

FOREHEAD

MID

FOREHEAD

ORBITA UPPER

EYEBROW

RYELID

UPPER

EAR NOSE

LIP

TIP

LIP

CORNER

To be able to compare several later-described objective methods, we used subjective quality results from Ref. 7. Subjective quality evaluation was based on ITU-R recommendation BT.500-11.10 Details of the subjective testing can be found in Ref. 11. Briefly, they are as follows: 29 high-resolution 24 bits / pixel RGB color images typically, 768 512 were degraded using five degradation types:

ties for white noise, Gaussian blur, and fastfad-

1 . J P2K, JPEG2000 COMPRESSION

ing. About 20–29 observers had to grade image

2 . J PEG, JPEG COMPRESSION

quality on a continuous scale with five grades

3 . W N, WHITE NOISE IN THE

ed images. The experiments were conducted in

JSON: [ { “48CD-

1710A3FC6542”,

JPEG2000 and six images with different quali-

of which 203 were reference and 779 degrad-

IMAGE

MEASURE

F8C8-841C-4D33-B875-

seven to nine different qualities for JPEG and

observers evaluated total of 982 images, out

QUALITY

“FACEID”:

Each of these 29 images had versions with

bad, poor, fair, good, and excellent . In this way,

SUBJECTIVE

“FACERECTANGLE”: { “WIDTH”:

228,

“HEIGHT”: “LEFT”:

R GB COMPONENTS

“TOP”:

228,

460, 125

4 . G BLUR, GAUSSIAN BLUR

5 . F ASTFADING, TRANSMISSION ER

“FACELANDMARKS”: { “PUPILLEFT”: {

seven sessions: two sessions for JPEG2000, two for JPEG, and one each for white noise, Gaussian blur, and fastfading transmission errors. Raw

R ORS IN THE JPEG2000 BIT S TREAM USING A FAST - FADING

“X”:

507,

R AYLEIGH CHANNEL MODEL

“Y”:

204.9

scores for each subject were converted in difference scores between the test and reference images, di,j = riref

}, “PUPILRIGHT”: {

j-ri,j, where riref j denotes the raw quality score assigned by the i’th subject to the reference image corresponding to the j’th distorted image and ri,j score for the i’th subject and j’th image. Difference scores were converted to Z scores where di is the mean of the raw

because each observer uses different part of grading scale.

609.8,

“Y”:

175.4

}, “NOSETIP”: {

score differences overall images ranked by the subject i and i is the standard deviation. Z scores are used to make scores more equal,

“X”:

596.4,

“Y”:

250.9

}, “MOUTHLEFT”: {

OBJECTIVE QUALITY

In this paper, we examined several commonly used

IMAGE

objective quality measures, which were applied to a

MEASURES

“X”:

531.4,

1. MSE

“Y”:

301

2. PSNR

luminance channel only, because they give better correlation results with subjective testing by comparison to calculating objective measures

3. STRUCTURAL SIMILARITY

“MOUTHRIGHT”: {

4. MULTISCALE SSIM

“X”:

629,

“Y”:

273.7

}, “EYEBROWLEFTOUTER”: { “X”:

463.2,

“Y”:

201.2

}, “EYEBROWLEFTINNER”: { “X”:

547.4,

“Y”:

177.1

}, “EYELEFTOUTER”: { “X”:

496.8,

“Y”:

212

SS I M

MSSIM

5. VISUAL INFORMATION FIDELI TY

VIF

6. VISUAL SIGNAL - TO - NOISE RA TIO 7. IQM

using RGB components separately and then calculating mean of them. SSIM is a novel method for measuring the similarity between two images.12 It is computed from three image measurement comparisons:

VSNR

luminance, contrast, and structure. Each of

OUR PRO POSED MEASURE

these measures is calculated over an 8 8 local square window, which moves pixel-by-pixel over the entire image. At each step, the local statis-

tics and SSIM index are calculated within the local window. Because the resulting SSIM index map often exhibits undesirable “blocking” artifacts, each window is filtered with a Gaussian weighting function 11 11 pixels. In practice, one usually requires a single overall quality measure of the entire image; thus, the mean SSIM index is computed to evaluate the overall image quality. The SSIM can be viewed as a quality measure of one of the images being compared, while the other image is regarded as perfect quality. It can give results between 0 and 1, where 1 means excellent quality and 0 means poor quality. Similar to SSIM, the MSSIM method is a convenient way to incorporate image details at different resolutions.13 This is a novel image synthesis-based approach that helps calibrating the parameters such as viewing distance that weight the relative importance between different scales.

VIF criterion14 quantifies the Shannon information that is shared between the reference and distorted images relative to the information contained in the reference image itself. It uses natural scene statistics modeling in conjunction with an imagedegradation model and an HVS model. Results of this measure can be between 0 and 1, where 1 means perfect quality and near 0 means poor quality. VSNR15 operates in two stages. First, the threshold for distortions of a degraded image is determined to decide if it is below or above human sensitivity of error detection. This is computed using wavelet based models of visual masking. If distortions are below the threshold, then the distorted image is assumed to be

SYNTHESIS

ANALYSIS

perfect VSNR. If the distortions are above the threshold, then a second stage is applied. Calculations are made on the lowlev-

COLUMNS

ROWS

el visual property of perceived contrast and the midlevel visual

COLUMNS

ROWS

property of global precedence. These properties are used to determine Euclidean distances in distortion contrast space of multiscale wavelet decomposition. Finally, VSNR is calculated from

↓2

a linear sum of these distances. A higher VSNR means that the

↓2

↑2

↓2

↑2

↓2

↑2

↓2

↑2

tested image is less degraded. DWT refers to wavelet transforms for which the wavelets are discretely sampled. This can be done with multiresolution analysis of a signal.8 Multiresolution analysis allows us to decompose a signal into approximations and details. These coefficients can be computed using various filter banks, such as Daubechies, Coiflets, or biorthogonal filters. Suppose we have a one-dimensional input signal x t . It can be decomposed into approximation and detail coefficients of the first level. Then we can also decompose approximation coefficients at the first level further into approximation and detail coefficients at the second level. This can be expressed as

↓2

Equations 6 look like convolution, but there is a downsampling involved by a factor of 2 . h0 and h1 are accordingly scaling and wavelet filters. The decomposition of a signal into an approximation and a detail can be reversed. Similar expressions like 6 can be used, but we have to use upsampling and conjugate mirror filters. In image transform, we have two dimensions. Thus, we need to extend analysis of decomposition and reconstruction in two dimensions. We may do decomposition with separable wavelet transform, which is in fact one dimensional convolution with subsampling by a factor of 2 along the rows and columns of image. Reconstruction is done reversely. This means upsampling by 2 and then convolution along the rows and columns. Decomposition and reconstruction at level j are shown in Fig. 1.

Some of the existing objective measures described in Section 4 did not take into account HVS in the sense that the eye will see and grade image quality according to the type of error, as well

PROPOSED FOR

IMAGE

NEW

ALGORITHM

QUALITY

as location of an error in subband space. Because of that, our method calculates image quality using wavelet decomposition

CDF9_7, LOWPASS

CDF9_7, HIGHPASS

COIF22_14, LOWPASS

COIF22_14, HIGHPASS

0.03782845550726

−0.06453888262870

−0.00006038691911

0.00249239584019

−0.02384946501956

0.04068941760916

−0.00007137535849

0.00294555229198

−0.11062440441844

0.41809227322162

0.00097545380465

−0.02160076866236

0.37740285561283

−0.78848561640558

0.00120718683898

−0.02777241079070

0.85269867900889

0.41809227322162

−0.00658124080240

0.09720345190957

0.37740285561283

0.04068941760916

−0.00932685158094

0.16200574375453

−0.11062440441844

−0.06453888262870

0.03683394176520

−0.64802297501813

−0.02384946501956

0.01809725255148

0.64802297501813

0.03782845550726

−0.14280042659266

−0.16200574375453

0.07881441881590

−0.09720345190957

0.73001880866394

0.02777241079070

0.73001880866394

0.02160076866236

0.07881441881590

−0.00294555229198

−0.14280042659266

−0.00249239584019

and grades quality depending on the wavelet subband in which the error occurs. Experiments on image database7 have shown that different types of image degradation produce different error distributions in wavelet subbands. For example, for JPEG and JPEG2000 compressed image errors will be placed in the higher wavelet subbands HH subband, level 2 and higher while images with Gblur and fastfading degradations will also have errors in lower subbands. White noise has equally distributed errors in all subbands. In our research, we used two types of wavelet filters. The first filter, called CDF9_717

nine

coefficients in decomposition lowpass and seven in decomposition highpass filters , is designed as a spline variant with less dissimilar lengths between low pass and high-pass filters and has seen widespread use in image processing. The second filter, Coif22_1418

22 coefficients in

decomposition lowpass and 14 in decomposition high-pass filters , has properties of antisymmetric biorthogonal Coiflet systems, whose filter banks have even lengths and linear phase. Coefficients of these wavelet filters are presented in Table 1. Figure 2 shows decomposition low pass and high pass wavelet filters. All color images were first converted to grayscale images by forming a weighted sum of the red R , green G , and blue B components: Y = 0.2989R + 0.5870G + 0.1140B.

COEFFICIENTS OF THE USED W A V E L E T F I LT E R S DUMIC, GRGIC, AND GRGIC: NEW IMAGE - QUALITY MEASURE BASED ON WAVELETS

0.01809725255148 0.03683394176520 −0.00932685158094 −0.00658124080240 0.00120718683898 0.00097545380465 −0.00007137535849 −0.00006038691911

CDF9_7, LOWPASS

95% CONFIDENCE

95% CONFI-

95% CONFIDENCE

BOUNDS

DENCE BOUNDS

BOUNDS

−23.25

0.4292

28.71

−33.94,−12.57

0.6488

29.45

0.2096,

27.96,

95% CONFIDENCE

BOUNDS

−0.6641

61.49

−1.059,−0.2692

72.79

−151.6

121.2

−175.5,−127.8

131.4

50.2,

PSNR −100.9

−7.904

0.4158

SSIM

−128.9,−72.8

−9.698,−6.11

0.4304

MSSIM

−71.36

36.51

1.002

−124.2,−18.48

49.2

−34.3

6.443

−41.76,−26.83

8.04

163 −257.2,583.1

0.4011,

0.9657,

111.1,

−20.94

40.7

1.039

−25.73,−16.14

68.24

−0.2692

−13.14

32.05

−0.3165,−0.2218

−15.5,−10.78

15.04

−0.07769

22.4

1.432

−93.66,123.7

−0.1624,0.006981

23.85

−3.402,6.265

−114.5

−36.85

5.183

3.168

−71.25,−2.446

9.351

57.36

3.431

23.82,

13.17,

LOG10 VIF VSNR LOG10 IQM1

WAT -

SON’S MODEL LOG10 IQM2

EXPER-

I M E N T A L LY D E T E R -

4.845,

1.016,

20.95,

2.855,

3.481

43.23

38.4,

48.06

MINED 20.68,

94.04

COEFFICIENT PARAMETERS FOR LOGISTIC FUNCTION DUMIC, GRGIC, AND GRGIC: NEW IMAGE - QUALITY MEASURE BASED ON WAVELETS

4.813

2.048,

3.292 3.335

3.25,

−2.56 −17.8,12.68

30, 34.1

−129.4,−99.57 55.09

5.219, 105

SCHEDULE & MAP

S AT U R D AY , M AY 2 7 4:30

P.M.-7:30

P.M.

Conference Registration

PLAZA FOYER

PLAZA BALLROOM

TERRACE

OASIS

TAPESTRY

RESTROOM

HOTEL

EXECUTIVE

TERRACE

BOARDROOM

PATIO

LOBBY

S U N D AY , M AY 2 8 A.M.-5:00 A.M.-5:00

P.M.

8:00 8:00

A.M.-6:00

P.M.

7:30

P.M.

Management Development Program Preconference Workshops

M O N D AY , M AY 2 9 7:30 8:00 8:00 8:00

A.M.-5:30 A.M.-5:00

P.M.

A.M.-5:00 A.M.-6:00

P.M.

P.M. P.M.

Conference Registration Exhibitor Move-in

Management Development Program

Preconference Workshops

A . M . - 5 : 0 0 P . M . Regulatory Compliance for Education Abroad Risk Management: A NAFSA and URMIA Seminar P . M . - 4 : 3 0 P . M . Latin America Forum 4 : 0 0 P . M . - 5 : 0 0 P . M . First-Timer’s Networking 5 : 3 0 P . M . - 7 : 3 0 P . M . Special Event: Knowledge Community Networking Receptions

9:00 2:00

T U E S D AY , M AY 3 0 A . M . - 5 : 3 0 P . M . Conference Registration A . M . - 1 2 : 3 0 P . M . Symposium on Leadership 8 : 0 0 A . M . - 1 2 : 3 0 P . M . Management Development Program (Day 3) 8 : 0 0 A . M . - 1 2 : 0 0 P . M . Preconference Workshops

7:00

7:30

A . M . - 3 : 3 0 P . M . Expo Hall A . M . - 3 : 0 0 P . M . G lobal Learning Colloquium on Teacher Education 1 : 0 0 P . M . – 2 : 0 0 P . M . Concurrent Sessions 1 : 0 0 P . M . - 3 : 3 0 P . M . U . S . Higher Education Partnership Fair

8:30

9:00

P.M.-3:30 P.M. 4:00 P.M.-5:30 P.M. 5:30 P.M.-7:30 P.M.

2:30

Concurrent Sessions

Opening Plenary Address, Isabel Wilkerson Special Event: Opening Celebration

JSON: [ { “FACEID”:

“48CD-

F8C8-841C-4D33-B8751710A3FC6542”, “FACERECTANGLE”: { “WIDTH”:

COLOPHON

228,

“HEIGHT”: “LEFT”: “TOP”:

228,

460, 125

}, “FACELANDMARKS”: { “PUPILLEFT”: {

TYPEFACE

“X”:

507,

“Y”:

204.9

}, “PUPILRIGHT”: {

The text is set in

“X”:

609.8,

Myanmar Sangam MN /

“Y”:

175.4

}, “NOSETIP”: {

The headings are srt in

“X”:

596.4,

PT MONO /

“Y”:

250.9

in 13 January 2010

“MOUTHLEFT”: {

S O F T WA R E

“X”:

531.4,

“Y”:

301

}, “MOUTHRIGHT”: {

Adobe Creative Cloud,

“X”:

629,

Indesign, Illustrator, Photoshop

“Y”:

273.7

}, “EYEBROWLEFTOUTER”: { EQUIPMENT

“X”:

463.2,

“Y”:

201.2

iMac desktop 21.5-inch: 2.7GH z

Epson Stylus Pro 4900 Designer Edition 17-inch Printer

“EYEBROWLEFTINNER”: {

PAPAER

“X”:

547.4,

“Y”:

177.1

}, “EYELEFTOUTER”: { Epson Premium Presentation Paper Matte

“X”:

496.8,

44lb. White , 4Star

“Y”:

212

PRINTING AND BINDING Printing Plotnet, San Francisco, CA Binding Chums, San Francisco, CA Date:05/17/2017

DESIGNER Yirui Qian

ABOUT THE PROJECT This is a student project only. No part of this book or any other part of the project was produced for commercial use.