INTRODUCTION
DATABASE DESCRIPTION
POTENTIAL USES OF THE DATABASE
PROPOSED EVALUATION PROTOCOL
BASELINE PCA PERFORMANCE EVALUATIONS
CONCLUSION

SAN FRANCISCO, MOSCONE CENTER WEST, MAY 27-30 2017

SAMPLE FACE-API JSON (COVER ARTWORK):

{
  "faceId": "48cdf8c8-841c-4d33-b875-1710a3fc6542",
  "faceRectangle": { "width": 228, "height": 228, "left": 460, "top": 125 },
  "faceLandmarks": {
    "pupilLeft":        { "x": 507,   "y": 204.9 },
    "pupilRight":       { "x": 609.8, "y": 175.4 },
    "noseTip":          { "x": 596.4, "y": 250.9 },
    "mouthLeft":        { "x": 531.4, "y": 301   },
    "mouthRight":       { "x": 629,   "y": 273.7 },
    "eyebrowLeftOuter": { "x": 463.2, "y": 201.2 },
    "eyebrowLeftInner": { "x": 547.4, "y": 177.1 },
    "eyeLeftOuter":     { "x": 496.8, "y": 212   }
  }
}
CONTENTS

ABSTRACT
1  INTRODUCTION
2  DATABASE DESCRIPTION
   2.1  Image acquisition: equipment setup and imaging procedure
   2.2  Naming conventions and final database forming
   2.3  Image sets details
        2.3.1  Frontal facial mug shots
        2.3.2  Surveillance cameras images (visible light)
        2.3.3  IR night vision mug shots
        2.3.4  Surveillance cameras images (IR night vision)
        2.3.5  Different pose images
   2.4  Database demographics
3  POTENTIAL USES OF THE DATABASE
4  PROPOSED EVALUATION PROTOCOL
   4.1  DayTime tests
   4.2  NightTime tests
   4.3  Performance metrics
   4.4  Training scenario suggestion
5  BASELINE PCA PERFORMANCE EVALUATIONS
   5.1  Image preprocessing
   5.2  Experimental design
   5.3  Performance results
        5.3.1  DayTime experiments
        5.3.2  NightTime experiments
   5.4  Discussion
6  CONCLUSION
APPENDIX
COLOPHON
CAMERAS SET UP // (figure: positions of surveillance cameras 1-5)
Our database was designed mainly as a means of testing face recognition algorithms in real-world conditions. In such a setup, one can easily imagine a scenario where an individual should be recognized by comparing one frontal mug shot image to a low quality video surveillance still image. In order to achieve a realistic setup, we decided to use commercially available surveillance cameras of varying quality (see Appendix for details). Since two of our surveillance cameras record both visible spectrum and IR night vision images, we decided to include IR imagery in the database as well.

Capturing face images took place in the Video Communications Laboratory at the Faculty of Electrical Engineering and Computing, University of Zagreb, Croatia. The capture equipment included: six surveillance cameras, a professional digital video surveillance recorder, a professional high-quality photo camera and a computer. For mug shot image acquisition we used the high-quality photo camera. For surveillance camera image acquisition we used five different (commercially available) models of surveillance cameras (see Appendix for details), and for IR mug shots a separate surveillance camera was used. Five surveillance cameras were installed in one room at a height of 2.25 m and positioned as illustrated in Figs. 1 and 2. The only source of illumination was the outdoor light, which came through a window on one side. Two out of the five surveillance cameras were able to record in IR night vision mode as well. The sixth camera was installed in a separate, darkened room for capturing IR mug shots. It was installed at a fixed position and focused on a chair on which participants were sitting during IR mug shot capturing. The IR part of the database is important since there are research efforts in this direction, yet no available database provides such images. Mug shot imaging conditions are exactly the same as would be expected for any law enforcement or national security use (passport images or any other personal identification document).

All six cameras (five surveillance and one IR mug shot) were connected to a professional digital video surveillance recorder, which recorded all six video streams simultaneously to an internal hard disk. For storing images and controlling the surveillance cameras we used a Digital Sprite 2 recorder with the software (DM Multi Site Viewer) provided by the manufacturer for connecting to and controlling it from a personal computer. The settings recommended by the vendor were used for capturing images on the Digital Sprite 2, because those settings represent the most commonly used parameter setup in real-life surveillance systems (see Appendix for details).

The surveillance cameras are named cam1, cam2, cam3, cam4 and cam5. Cam1 and cam5 are also able to work in IR night vision mode. We decided to label the images taken by them in IR night vision mode as cam6 (actually cam1 in night vision mode) and cam7 (actually cam5 in night vision mode). The camera for taking IR mug shots was named cam8. All cameras (surveillance and photo) were installed and fixed at the same positions and were not moved during the whole capturing process.

THE HIGH-QUALITY PHOTO CAMERA FOR CAPTURING VISIBLE LIGHT MUG SHOTS WAS INSTALLED THE SAME WAY AS THE INFRARED CAMERA, BUT IN A SEPARATE ROOM WITH STANDARD INDOOR LIGHTING, AND IT WAS EQUIPPED WITH AN ADEQUATE FLASH.
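The camera labeling convention described above (cam6 and cam7 are really cam1 and cam5 switched to night vision mode, while cam8 is the separate IR mug shot camera) is easy to get wrong when organizing experiments. A minimal Python sketch of the mapping; the dictionary layout is our illustration and not part of the official database distribution:

```python
# Camera label conventions as described in the text. Note that cam6-cam8 are
# labels, not extra physical devices; this mapping is an illustrative sketch.
CAMERAS = {
    "cam1": {"mode": "visible", "role": "surveillance"},
    "cam2": {"mode": "visible", "role": "surveillance"},
    "cam3": {"mode": "visible", "role": "surveillance"},
    "cam4": {"mode": "visible", "role": "surveillance"},
    "cam5": {"mode": "visible", "role": "surveillance"},
    "cam6": {"mode": "ir", "role": "surveillance", "physical": "cam1"},  # cam1, night vision mode
    "cam7": {"mode": "ir", "role": "surveillance", "physical": "cam5"},  # cam5, night vision mode
    "cam8": {"mode": "ir", "role": "mugshot"},                           # IR mug shot camera
}

def physical_camera(label: str) -> str:
    """Return the physical camera behind a label (cam6/cam7 are modes, not devices)."""
    return CAMERAS[label].get("physical", label)

print(physical_camera("cam6"))  # cam1
```

Keeping the mode-to-device mapping explicit avoids accidentally treating cam1/cam6 (or cam5/cam7) image pairs as coming from independent sensors.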
ABSTRACT

In this paper we describe a database of static images of human faces. Images were taken in an uncontrolled indoor environment using five video surveillance cameras of various qualities. The database contains 4,160 static images (in the visible and infrared spectrum) of 130 subjects. Images from different quality cameras should mimic real-world conditions and enable robust testing of face recognition algorithms, emphasizing different law enforcement and surveillance use case scenarios. In addition to the database description, this paper also elaborates on possible uses of the database and proposes a testing protocol. A baseline Principal Component Analysis (PCA) face recognition algorithm was tested following the proposed protocol. Other researchers can use these test results as a control algorithm performance score when testing their own algorithms on this dataset. The database is available to the research community through the procedure described at www.scface.org.

KEYWORDS: VIDEO SURVEILLANCE CAMERAS. FACE DATABASE. FACE RECOGNITION.
INTRODUCTION

Interest in face recognition, as a combination of pattern recognition and image analysis, is still growing. Many papers are written and many real-world systems are being developed and distributed. As a non-invasive biometric method, face recognition is attractive for national security purposes as well as for smaller scale surveillance systems. However, in order to be able to claim that any face recognition system is efficient, robust and reliable, it must undergo rigorous testing and verification, preferably on real-world datasets. In spite of the many face databases currently available to researchers, we feel they lack the real-world settings part. Images in other currently available databases are usually taken by the same camera (e.g. the Video database of moving faces and people, CMU PIE, FERET) and are not taken using proper, commercially available, surveillance equipment (e.g. some multimodal databases that include sets for voice recognition: BANCA, XM2VTSDB). Images in those databases are mainly taken under strictly controlled conditions of lighting, pose, etc., and are of high resolution (high quality capturing equipment is used). Although some of the most frequently used databases have a standardized protocol (in order to assure reproducibility of results), all this is still far from real-world conditions. For real-world applications, algorithms should operate and be robust on data captured from video cameras (of various qualities) in a public place. Illumination should be uncontrolled (i.e. coming from a natural source), with varying direction and strength. Head pose should be as natural as possible. All this mimics a really complex imaging environment, such as is expected when building an airport security system or any other national security system. The obvious lack of such image sets is the main reason for the low number of studies on face recognition in such naturalistic conditions, resulting in very high recognition rates suggesting that face recognition is almost a solved problem. This was the main motivation for collecting our database. Our database mimics real-world conditions as closely as possible, and we think that using our database in experiments will show that real-world indoor face recognition using standard video surveillance equipment is far from being a solved problem.

The problem specifically addressed with our database is law enforcement person identification from low quality surveillance images, or any other identification scenario where the subject's cooperation is not expected. Recognizing how important this topic is, researchers are performing more and more experiments using CCTV (Closed Circuit Television) images, exploring the performance of both humans and computers. Since up to now there were no publicly available CCTV face image databases, researchers were forced to perform experiments using images they captured themselves, which in turn often results in a low number of subjects and results of questionable statistical significance. We feel that our SCface database will successfully fill this gap.

For this purpose, we collected images from 130 subjects. The database consists of 4,160 images in total, taken in uncontrolled lighting (infrared night vision images taken in the dark) with five different quality surveillance video cameras. Two cameras were able to operate in infrared (IR) night vision mode, so we recorded that IR output as well (we will address these images as IR images; for camera details please see the Appendix). Subjects' images were taken at three distinct distances from the cameras, with the outdoor (sun) light as the only source of illumination. All images were collected over a 5 day period. This database was collected mainly with identification rather than verification in mind, the additional motive being the fact that identification is a more demanding recognition problem with its one-to-many comparisons. The proposed protocol and initial testing with PCA will also be along these lines of thought. Reported results will accordingly show rank 1 recognition percentages, but Cumulative Match Score (CMS) curves will be omitted, since such a detailed analysis is beyond the scope of this paper.

Here is a short summary of what makes this dataset interesting to the face recognition research community:

1. DIFFERENT QUALITY AND RESOLUTION CAMERAS WERE USED;
2. IMAGES WERE TAKEN UNDER UNCONTROLLED ILLUMINATION CONDITIONS;
3. IMAGES WERE TAKEN FROM VARIOUS DISTANCES;
4. HEAD POSE IN SURVEILLANCE IMAGES IS TYPICAL FOR A COMMERCIAL SURVEILLANCE SYSTEM, I.E. THE CAMERA IS PLACED SLIGHTLY ABOVE THE SUBJECT'S HEAD, MAKING THE RECOGNITION EVEN MORE DEMANDING; BESIDES, DURING THE SURVEILLANCE CAMERA RECORDINGS THE INDIVIDUALS WERE NOT LOOKING AT A FIXED POINT;
5. THE DATABASE CONTAINS NINE DIFFERENT POSE IMAGES, SUITABLE FOR HEAD POSE MODELING AND/OR ESTIMATION;
6. THE DATABASE CONTAINS IMAGES OF 130 SUBJECTS, ENOUGH TO ELIMINATE PERFORMANCE RESULTS OBTAINED BY PURE COINCIDENCE (THE CHANCE OF RECOGNITION BY PURE COINCIDENCE IS LESS THAN 1/130 ≈ 0.8%);
7. BOTH IDENTIFICATION AND VERIFICATION SCENARIOS ARE POSSIBLE, BUT THE MAIN IDEA IS FOR IT TO BE USED IN DIFFICULT REAL-WORLD IDENTIFICATION EXPERIMENTS.

The rest of this paper is organized as follows: Section 2 gives a detailed database description including the acquisition, naming conventions, database forming stage, image sets details and database demographics; Section 3 discusses potential uses of the database; Section 4 proposes evaluation protocols and results presentation; Section 5 describes an ad hoc baseline PCA test on this database in an extremely difficult experimental setup where a separate database is used for training; Section 6 concludes the paper.
The capturing was conducted over a 5 day period. All participants in this project passed through the following procedure. First they had to walk in front of the surveillance cameras in the dark, and after that they had to do the same in uncontrolled indoor lighting. During their walk in front of the cameras they had to stop at three previously marked positions. This way 21 images per subject were taken (cam1-7 at distances of 4.20, 2.60 and 1.00 m). After that, participants were photographed with the digital photographer's camera at close range in controlled conditions (standard indoor lighting, adequate use of flash to avoid shadows, high resolution images). This set of images provides nine discrete views of each face, ranging from left to right profile in equal steps of 22.5 degrees. To assure comparable views for each subject, numbered markers were used as fixation points. As a final result, there are nine images per subject with views from -90 to +90 degrees, including another mug shot at 0 degrees. In the end, subjects went into the dark room where a high quality IR night vision surveillance camera was installed for capturing IR mug shots at close range. Overall, that gives 32 images per subject in the database.

FRONTAL FACIAL MUG SHOTS

Facial mug shots are high quality static color images, taken in a controlled indoor illumination environment using a Canon EOS 10D digital camera. There is one mug shot per subject, and those images are labeled with the suffix frontal (e.g. 001_frontal.jpg). Images are in lossless 24 bit color JPEG format with an original size of 3,072x2,048 pixels, cropped to 1,600x1,200 pixels. Cropping was done following the ANSI 385-2004 standard recommendation, so that the face occupies approximately 80% of the image. Exact positions of the eye centers, the tip of the nose and the center of the mouth were manually extracted and stored in a textual file (this was also done for all images in the database except the different pose set). These mug shot images are what you would expect to find in a law enforcement database or when registering to a security system. There are in total 130 frontal facial mug shot images in the database, one per subject.

SURVEILLANCE CAMERAS IMAGES (VISIBLE LIGHT)

Cameras 1-5 are visible light cameras, and all images with camNum labels cam1-5 represent images taken with surveillance cameras of different qualities. There are three images per subject for each camera, taken at three discrete distances (4.20, 2.60 and 1.00 m). This gives a total of 15 images per subject in this set (1,950 in total). This set was designed to test face recognition algorithms in a real-world surveillance setup (we again emphasize that we used commercially available cameras of both low and high quality). As can be seen from the images, they differ substantially in quality and resolution. The original images saved from the Digital Sprite 2 recorder to the PC hard drive all have the same dimensions (680x556 pixels, 96 dpi, 24 bit color). They have been cropped in order to remove as much background as possible. Due to the different distances at which the images were taken, the new (cropped) images are not all the same size. Cropped images have the following resolutions: 100x75 pixels for distance 1, 144x108 for distance 2 and 224x168 for distance 3. All the coordinates mentioned previously were labeled and stored in a textual database for this set of images as well.

IR NIGHT VISION MUG SHOTS

IR night vision mug shots were taken in a separate dark room with a resolution of 426x320 pixels, in grayscale. There is one image per subject in this set, yielding a total of 130 such images in the database.

SURVEILLANCE CAMERAS SPECIFICATIONS

                       CAM1                    CAM2                   CAM3
MODEL                  BOSCH LTC0495/51        SHANY WTC-8342         J&S JCC-915D
CCD TYPE               1/3"                    1/3" SONY SUPER HAD    1/3" COLOR
ACTIVE PIXELS          752x582                 795x596                597x537
RESOLUTION             540 TVL                 480 TVL                350 TVL
MINIMUM ILLUMINATION   0.24/0.038 LUX (IR)     0.15 LUX               0.3 LUX
SNR                    > 50 DB                 > 50 DB                > 48 DB
VIDEO OUTPUT           1 VPP, 75 Ω             1 VPP, 75 Ω            1 VPP, 75 Ω
COMMENT                IR NIGHT VISION         (NONE)                 DOME CAMERA

THE PHOTOGRAPHER'S CAMERA WAS A CANON EOS 10D MODEL WITH A 22.7×15.1 MM, 6.3 MEGAPIXEL CMOS SENSOR, EQUIPPED WITH A SIGMA 18-50 MM F3.5-5.6 DC LENS AND A SIGMA EF 500 DG SUPER FLASH.

EXAMPLE OF ONE IMAGE SET FOR ONE SUBJECT (figure)
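The per-subject image counts stated above can be checked with a few lines of arithmetic. This sketch only verifies the numbers given in the text (21 surveillance stills, one frontal mug shot, nine pose images, one IR mug shot); it does not reflect any official file layout of the distribution:

```python
# Sanity-check the per-subject image counts stated in the text:
# 21 surveillance stills (cam1-7 at three distances), 1 frontal mug shot,
# 9 pose images and 1 IR mug shot (cam8) -> 32 images per subject.
SUBJECTS = 130

surveillance = [(cam, dist) for cam in range(1, 8) for dist in (1, 2, 3)]  # cam1-7 x 3 distances
per_subject = len(surveillance) + 1 + 9 + 1  # + frontal mug shot + 9 poses + IR mug shot

assert per_subject == 32
assert SUBJECTS * per_subject == 4160  # matches the database total given in the abstract
print(per_subject, SUBJECTS * per_subject)  # 32 4160
```

The total of 4,160 images quoted in the abstract is exactly 130 subjects × 32 images.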
TABLE 1  RANK 1 PERFORMANCE RESULTS FOR NIGHTTIME EXPERIMENTS. GALLERIES: FRONTAL MUG SHOT IMAGES AND CAM8 (IR NIGHT VISION MUG SHOT) IMAGES. DISTANCE METRIC: COSINE ANGLE

                      RANK 1 RECOGNITION RATE %
CAMERA / DISTANCE     GALLERY: FRONTAL    GALLERY: CAM8
CAM6_1                      1.5                1.5
CAM6_2                      3.1               10.0
CAM6_3                      3.9               10.0
CAM7_1                      0.7                1.5
CAM7_2                      5.4                2.3
CAM7_3                      4.6                7.0

DURING THE MATCHING PROCESS, NEC'S TECHNOLOGY, WHICH UTILIZES SUCH AN ALGORITHM, REDUCES THE ADVERSE IMPACT OF THE LOCAL CHANGES CAUSED BY SMILING AND BLINKING OF THE EYES, AND THE IMPACT OF INTENTIONAL FACE CHANGES (E.G., VARYING FACIAL EXPRESSION AND THE WEARING OF CAPS, HATS AND GLASSES). THE MINIMIZATION OF THE LOCAL CHANGES GUARANTEES THE OVERALL FACE RECOGNITION ACCURACY.
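The cosine-angle distance used as the matching metric for the reported results can be sketched in a few lines. The feature vectors below stand in for PCA-subspace projections of face images; the toy arrays are illustrative only, not data from the database:

```python
import numpy as np

def cosine_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine-angle distance: 0 for identical directions, larger for dissimilar ones."""
    return 1.0 - float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def rank1_identify(gallery: np.ndarray, probe: np.ndarray) -> int:
    """Return the gallery index with the smallest cosine distance to the probe."""
    return int(np.argmin([cosine_distance(g, probe) for g in gallery]))

# Toy example: 3 gallery subjects; the probe points closest to subject 1's direction.
gallery = np.array([[1.0, 0.0], [0.6, 0.8], [0.0, 1.0]])
probe = np.array([0.5, 0.7])
print(rank1_identify(gallery, probe))  # 1
```

A probe is counted as a rank 1 hit when the nearest gallery vector belongs to the same subject; the rank 1 recognition rate is the fraction of probes for which this holds.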
DATABASE SUBJECTS' AGE DISTRIBUTION (figure: number of subjects vs. age). FROM A TOTAL OF 130 VOLUNTEERS, 114 WERE MALES AND 16 FEMALES.

DAYTIME TESTS

This should be a straightforward test comparing the mug shot image (visible light, i.e. the subjectID_frontal.jpg images) to cam1-5 images. Frontal mug shots represent the gallery of known images, and cam1-5 images at three distances are the probe sets (the main idea is similar to the FRVT tests). This scenario gives 15 possible different probe sets, varying both in distance from the camera and in camera quality. Comparing the probe image to one gallery image is the most logical real-world (law enforcement) scenario.

NIGHTTIME TESTS

NightTime tests have two possible galleries that can be used. Cam6-7 images (IR night vision) can be compared to mug shot images (visible light, i.e. the subjectID_frontal.jpg images), or they can be compared to cam8 (IR night vision mug shot) images. This scenario gives two galleries and six possible different probe sets, again varying both in distance from the camera and in camera quality.

PERFORMANCE METRICS

Rank 1 results and Cumulative Match Score (CMS) curves are the usual way to report results in an identification scenario (for details see [16]), while ROC plots are the usual way of reporting results in a verification scenario. Besides, we suggest the use of some form of statistical significance test, like McNemar's hypothesis test. Error bars and standard deviations of performance results would also help establish the true robustness and performance improvement of any algorithm. Control algorithm performance scores must accompany performance results of novel algorithms. This control algorithm should be something that is easily implemented and readily available to all, like PCA or correlation. Initial baseline PCA tests will be provided here, but researchers are encouraged to expand and improve them to build solid grounds for control algorithm comparisons in the future.

TRAINING SCENARIO SUGGESTION

In order for any tests to be close to expected real-world (law enforcement) applications, and for the results to be meaningful, we strongly suggest the use of a separate set of face images for training, especially in the identification scenario. In other words, any experiment should be designed so that images used in the algorithm training stage have no overlap with the query images (gallery and probe). We believe that this is the only way to make fair comparisons and to present fair performance results. Other databases (like FERET or CMU PIE) can be used for training, similar to the setup from. All algorithms will probably experience a serious performance drop with this experimental setup, but it will actually show their true strength.
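The McNemar hypothesis test suggested in the performance metrics above compares two algorithms on the same probe set by looking only at the probes where they disagree. A minimal sketch of the standard exact (binomial) form of the test; this is a textbook implementation, not code from the paper, and the counts in the example are invented for illustration:

```python
from math import comb

def mcnemar_exact_p(b: int, c: int) -> float:
    """Exact two-sided McNemar test p-value.

    b: probes where algorithm A is correct and B is wrong;
    c: probes where B is correct and A is wrong.
    Under H0 (equal accuracy) the discordant counts follow Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(0, k + 1)) / 2 ** n
    return min(2.0 * tail, 1.0)

# Invented example: out of 130 probes, A alone is right on 12, B alone on 3.
print(round(mcnemar_exact_p(12, 3), 4))  # 0.0352
```

Concordant probes (both right or both wrong) carry no information about which algorithm is better, which is why only the discordant counts enter the test.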
DATABASE DESCRIPTION

EXAMPLE OF DIFFERENT POSE IMAGES (figure)

GENERAL INFO

Over the last ten years or so, face recognition has become a popular area of research in computer vision and one of the most successful applications of image analysis and understanding. Because of the nature of the problem, not only computer science researchers are interested in it, but neuroscientists and psychologists as well. It is the general opinion that advances in computer vision research will provide useful insights to neuroscientists and psychologists into how the human brain works, and vice versa. A general statement of the face recognition problem (in computer vision) can be formulated as follows: given still or video images of a scene, identify or verify one or more persons in the scene using a stored database of faces.

WHAT IS FACIAL RECOGNITION?

Like all biometrics solutions, face recognition technology measures and matches unique characteristics for the purposes of identification or authentication. Often leveraging a digital or connected camera, facial recognition software can detect faces in images, quantify their features, and then match them against stored templates in a database. Face scanning biometric tech is incredibly versatile, and this is reflected in its wide range of potential applications.
IMAGE SETS DETAILS //

POSSIBLE DATA USAGE AND AVAILABILITY

We have collected a face database with variation in pose, scale, rotation, expression, illumination and occlusion, captured under both controlled and uncontrolled conditions, in order to fulfil the requirement of having a challenging and complete database for the task of detecting, authenticating and recognizing faces captured by surveillance cameras. Providing such variations in our database, we suggest that it can be used for:

1. SCALE, EXPRESSION AND POSE INVARIANT FACE DETECTION
2. THE STUDY OF AUTHENTICATION AND RECOGNITION FOR FACES CAPTURED UNDER CONTROLLED AND UNCONTROLLED CONDITIONS
3. ANALYZING IMAGES CAPTURED IN NIGHT MODE
4. OCCLUSION ANALYSIS
5. GAIT ANALYSIS
SURVEILLANCE CAMERAS IMAGES (IR NIGHT VISION)

IR night vision surveillance camera images are completely the same as the visible light surveillance images in section 2.3.2 regarding resolution, but are in grayscale. IR surveillance images are labeled as cam6 and cam7, again at three discrete distances. There are six IR night vision surveillance images per subject, thus 780 of them in total.

DIFFERENT POSE IMAGES

This set of images was taken with the same high quality photo camera as the frontal facial mug shots and under the same conditions. Subjects' pose ranges from -90 (left profile) to +90 degrees (right profile) in nine discrete steps of 22.5 degrees. Image size is 3,072x2,048 pixels, and frontal mug shot images of the same size are included for the sake of completeness. There are nine images per subject in this set, which gives 1,170 images in total. The correlation between the angle label in the file name and the actual head pose is given in the Appendix.

DATABASE DEMOGRAPHICS

The participants in this project were students, professors or employees at the Faculty of Electrical Engineering and Computing, University of Zagreb, Croatia. From a total of 130 volunteers, 114 were males and 16 females. All participants were Caucasians, between the ages of 20 and 75. The age distribution is displayed in the figure; a textual file containing each subject's date and year of birth will be distributed with the database. This also makes our SCface database unique, because subjects' birthdays are not available in any other database. Moreover, the textual file will contain the following additional information: gender, beard, moustache, glasses.

POTENTIAL USES OF THE DATABASE

The first and most important potential use of this database is to test a face recognition algorithm's robustness in a real-world surveillance scenario. In such a setup, a face recognition system should recognize a person by comparing an image captured by a surveillance camera to the image stored in a database. If we postulate law enforcement use as the most likely scenario, this database image, to which the surveillance image is compared, is a high quality full frontal facial mug shot. We would specifically like to encourage researchers to explore more deeply the small sample size problem (more dimensions than examples). Having one frontal mug shot per subject in our database addresses this issue adequately. Our surveillance cameras were of various qualities and resolutions (see Appendix), and this is a strong point of this database. It remains to be seen how face recognition algorithms will perform in such difficult conditions, and how the quality of the capturing equipment and the subject's distance from the camera influence the results. There is also potential to test various image preprocessing algorithms (enhancing image quality by filtering techniques), as some of these surveillance images are of extremely low quality and resolution. By including different pose images of subjects, we made it possible to use this database in face modeling and 3D face recognition. Other potential uses of this database include, but are not restricted to: evaluation of head pose estimation algorithms, evaluation of face recognition algorithms' robustness to different poses, evaluation of natural illumination normalization algorithms, indoor face recognition (in an uncontrolled environment), the influence of low resolution images, etc.
IMAGE PREPROCESSING

We normalized all images from our SCface database following the standard procedure described in many face recognition papers. First, all color images were converted to grayscale. Using the eye coordinate information, all images are scaled and rotated so that the distance between the eyes is always a defined number of pixels (32 in our case) and the eyes lie on a straight line (head tilt is corrected). The image is then cropped to 64x64 pixels and masked with an elliptical mask, with the eyes at fixed left and right eye positions. Standard histogram equalization (Matlab) is done prior to masking, normalizing the pixel values to [0, 255]. Two examples of preprocessed images can be seen in Fig. 8.

EXPERIMENTAL DESIGN

The experiment is set up in the most difficult manner imaginable, but it mimics what can be expected in real world face recognition applications: the training set is different from the test set. In other words, training and test sets are completely independent (images used in training will not be used in gallery or probe sets). We use a totally different database for training to ensure just that. This is why the recognition rates achieved will seem so low. This kind of setup, although often discussed, is rarely used in research papers, but we feel it is time to push this research area to a new level.

We preprocess 501 randomly chosen FERET database images in the same manner as we preprocessed our own database images, and we use those images to train the baseline PCA. PCA algorithm details are well known and beyond the scope of this paper; they will not be discussed here, and the interested reader is referred to the literature. Gallery and probe images will be from our SCface database, but with no overlap between the two sets (no images from the gallery are in the probe set and vice versa). By training the PCA with one image per person (the frontal mug shot) we also address the small sample size problem (more dimensions than examples). After the training, we kept 40% of the eigenvectors with the highest eigenvalues, thus forming a 200 dimensional face (sub)space in which the recognition is done. Using the fact that, once the PCA is trained, the distance between any pair of projected images is constant, we were able to perform virtual experiments. We projected all images from our database onto the derived 200 dimensional subspace and calculated a distance matrix (the distance matrix being of size n x n, where n is the number of images in our database, and containing the distances between all pairs of images). All experiments were then performed on those matrices by changing the list items in the probe and gallery sets, without running the algorithms again. The distance metric used was the cosine angle (COS).

PERFORMANCE RESULTS FOR DAYTIME EXPERIMENTS. GALLERY: FRONTAL MUG SHOT IMAGES. DISTANCE METRIC: COSINE ANGLE.

CAMERA/DISTANCE    RECOGNITION RATE [%]
CAM1_1             2.3
CAM1_2             7.7
CAM1_3             5.4
CAM2_1             3.1
CAM2_2             7.7
CAM2_3             3.9
CAM3_1             1.5
CAM3_2             3.9
CAM3_3             7.7
CAM4_1             0.7
CAM4_2             3.9
CAM4_3             8.5
CAM5_1             1.5
CAM5_2             7.7
CAM5_3             5.4
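The virtual-experiment idea (train PCA once on a disjoint database, project every image into the subspace, precompute an all-pairs cosine distance matrix, then rerun experiments by merely re-indexing gallery and probe lists) can be sketched as follows. This is our own NumPy sketch with random stand-in data, not the authors' code; the array sizes mirror the 501 training images and 40% eigenvector retention described above.

```python
import numpy as np

def train_pca(train_imgs, keep=0.4):
    """Train PCA on flattened training images (one row per image);
    keep the given fraction of leading eigenvectors."""
    mean = train_imgs.mean(axis=0)
    X = train_imgs - mean
    # SVD of the centred data: rows of Vt are eigenvectors of the covariance.
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    n_keep = int(round(keep * len(s)))      # 40% of 501 -> 200 dimensions
    return mean, Vt[:n_keep]

def cosine_distance_matrix(P):
    """All-pairs cosine-angle distances between projected images."""
    unit = P / np.linalg.norm(P, axis=1, keepdims=True)
    return 1.0 - unit @ unit.T               # distance = 1 - cos(angle)

rng = np.random.default_rng(0)
train = rng.random((501, 64 * 64))           # stand-in for 501 FERET images
db = rng.random((100, 64 * 64))              # stand-in for database images
mean, W = train_pca(train)
D = cosine_distance_matrix((db - mean) @ W.T)

# Any gallery/probe split is now just row/column indexing of D.
gallery, probe = [0, 1, 2], [3, 4, 5]
nearest = [gallery[int(np.argmin(D[p, gallery]))] for p in probe]
```

Because D is computed once, changing the gallery and probe lists costs only array indexing, which is exactly what makes the "virtual experiments" cheap.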
INDEPENDENT COMPARATIVE STUDY OF PCA, ICA, AND LDA ON THE FERET DATA SET

ABSTRACT

Face recognition is one of the most successful applications of image analysis and understanding and has gained much attention in recent years. Various algorithms were proposed, and research groups across the world reported different and often contradictory results when comparing them. The aim of this paper is to present an independent, comparative study of the three most popular appearance-based face recognition projection methods (PCA, ICA, and LDA) in completely equal working conditions regarding preprocessing and algorithm implementation. We are motivated by the lack of direct and detailed independent comparisons of all possible algorithm implementations (e.g., all projection–metric combinations) in the available literature. For consistency with other studies, the FERET data set is used with its standard tests (gallery and probe sets). Our results show that no particular projection–metric combination is the best across all standard FERET tests, and that the choice of the appropriate projection–metric combination can only be made for a specific task. Our results are compared to other available studies and some discrepancies are pointed out. As an additional contribution, we also introduce our new idea of hypothesis testing across all ranks when comparing performance results. © 2006 Wiley Periodicals, Inc. Int J Imaging Syst Technol, 15, 252–260, 2005; Published online in Wiley InterScience.

KEYWORDS: FACE RECOGNITION; PCA; ICA; LDA; FERET; SUBSPACE ANALYSIS METHODS

INTRODUCTION

Over the last ten years or so, face recognition has become a popular area of research in computer vision and one of the most successful applications of image analysis and understanding. Because of the nature of the problem, not only computer science researchers are interested in it, but also neuroscientists and psychologists. It is the general opinion that advances in computer vision research will provide useful insights to neuroscientists and psychologists into how the human brain works, and vice versa. A general statement of the face recognition problem can be formulated as follows (Zhao et al., 2003): given still or video images of a scene, identify or verify one or more persons in the scene using a stored database of faces. A survey of face recognition techniques has been given by Zhao et al. (2003). In general, face recognition techniques can be divided into two groups based on the face representation they use:

1. APPEARANCE-BASED, WHICH USES HOLISTIC TEXTURE FEATURES AND IS APPLIED TO EITHER THE WHOLE FACE OR SPECIFIC REGIONS IN A FACE IMAGE;
2. FEATURE-BASED, WHICH USES GEOMETRIC FACIAL FEATURES (MOUTH, EYES, BROWS, CHEEKS, ETC.) AND THE GEOMETRIC RELATIONSHIPS BETWEEN THEM.

Among the many approaches to the problem of face recognition, appearance-based subspace analysis, although one of the oldest, still gives the most promising results. Subspace analysis is done by projecting an image into a lower dimensional space (subspace), after which recognition is performed by measuring the distances between known images and the image to be recognized. The most challenging part of such a system is finding an adequate subspace. In this paper, the three most popular appearance-based subspace projection methods for face recognition will be presented and combined with four common distance metrics. The projection methods presented are Principal Component Analysis (PCA), Independent Component Analysis (ICA), and Linear Discriminant Analysis (LDA). PCA (Turk and Pentland, 1991) finds a set of the most representative projection vectors such that the projected samples retain most information about the original samples. ICA (Bartlett et al., 2002; Draper et al., 2003) captures both second- and higher-order statistics and projects the input data onto basis vectors that are as statistically independent as possible. LDA (Belhumeur et al., 1996; Zhao et al., 1998) uses the class information and finds a set of vectors that maximize the between-class scatter while minimizing the within-class scatter. The distance metrics used are L1 (city block), L2 (Euclidean), cosine, and Mahalanobis distance.
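The four distance metrics named here have direct closed-form definitions; a minimal NumPy sketch with toy vectors of our own (not from the paper) is:

```python
import numpy as np

def l1(a, b):
    """L1 (city block) distance."""
    return float(np.abs(a - b).sum())

def l2(a, b):
    """L2 (Euclidean) distance."""
    return float(np.sqrt(((a - b) ** 2).sum()))

def cosine(a, b):
    """Cosine distance: 1 minus the cosine of the angle between a and b."""
    return float(1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def mahalanobis(a, b, cov_inv):
    """Mahalanobis distance given an inverse covariance matrix."""
    d = a - b
    return float(np.sqrt(d @ cov_inv @ d))

# Toy usage on 3-D feature vectors.
a, b = np.array([1.0, 2.0, 3.0]), np.array([2.0, 2.0, 1.0])
cov_inv = np.eye(3)   # identity covariance: Mahalanobis reduces to L2
print(l1(a, b), l2(a, b), cosine(a, b), mahalanobis(a, b, cov_inv))
```

In a PCA subspace the covariance of the projected data is diagonal (the eigenvalues), so the Mahalanobis distance amounts to rescaling each subspace dimension before an L2 comparison.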
SURVEILLANCE CAMERAS SPECIFICATIONS

                       CAM1             CAM2               CAM3          CAM4          CAM5
TYPE                   BOSCH            SHANY              J&S           ALARMCOM      SHANY
                       LTC0495/51       WTC-8342           CC-915D       VFD400-12B    MTC-L1438
CCD TYPE               1/3'' IT         1/3'' SONY         1/3'' COLOR   1/3'' IT      1/3'' SONY
                                        SUPER HAD                                      SUPER HAD
ACTIVE PIXELS          752X582          795X596            597X537       752X582       795X596
RESOLUTION             540 TVL          480 TVL            350 TVL       460 TVL       480 TVL
MINIMUM ILLUMINATION   0,24/0,038 LUX   0 LUX (IR LED ON   1,5 LUX       0,3 LUX       0,15 LUX (IR)
                                        AT 4 LUX)
SNR                    > 50 DB          > 48 DB            46 DB         > 50 DB       > 50 DB
VIDEO OUTPUT           1 VPP, 75 Ω      1 VPP, 75 Ω        1 VPP, 75 Ω   1 VPP, 75 Ω   1 VPP, 75 Ω
COMMENT                                 IR NIGHT VISION    DOME CAMERA   DOME CAMERA   IR NIGHT VISION
MUG SHOT CAMERA

The photographer's camera was a Canon EOS 10D with a 22.7x15.1 mm, 6.3 megapixel CMOS sensor, equipped with a Sigma 18-50 mm F3.5-5.6 DC lens and a Sigma EF 500 DG Super flash.
INTER-SESSION VARIABILITY MODELLING AND JOINT FACTOR ANALYSIS FOR FACE AUTHENTICATION

ABSTRACT

This paper applies inter-session variability modelling and joint factor analysis to face authentication using Gaussian mixture models. These techniques, originally developed for speaker authentication, aim to explicitly model and remove detrimental within-client (inter-session) variation from client models. We apply the techniques to face authentication on the publicly-available BANCA, SCface and MOBIO databases. We propose a face authentication protocol for the challenging SCface database, and provide the first results on the MOBIO still face protocol. The techniques provide relative reductions in error rate of up to 44%, using only limited training data. On the BANCA database, our results represent a 31% reduction in error rate when benchmarked against previous work.

KEYWORDS: FACE AUTHENTICATION; GMMS

Many challenges in face authentication can be attributed to the problem of session variability, that is, changes in environment, illumination, pose, expression, or image acquisition, which cause mismatch between images of the same client (person). While face authentication has evolved considerably in the past 15 years, modern approaches still suffer from increased errors in the presence of substantial session variability. One approach of particular interest uses a parts-based topology, whereby the distribution of features extracted from images of a client's face is described by a Gaussian mixture model (GMM). This approach was found to offer the best trade-off in terms of complexity, robustness and discrimination. Interestingly, a similar GMM-based approach forms the basis of state-of-the-art speaker authentication, in which the issue of session variability has already received considerable focus. Two of the most successful techniques for improving robustness to session variability in speaker authentication are inter-session variability modelling (ISV) and the related technique of joint factor analysis (JFA), which have been shown to reduce errors by more than 30%. ISV and JFA aim to estimate more reliable client models by explicitly modelling and removing within-client variation using a low dimensional subspace. In speaker authentication, this detrimental variation is caused by different microphones, acoustic environments and transmission channels. JFA can be considered an extension of ISV, as it also explicitly models between-client variation.

In this paper, we apply ISV and JFA to the face authentication task. The intuition is that the sources of within-class variation in speech have parallels in facial images. Specifically, face authentication performance is adversely affected by the effects of variation in environment, expression, pose and image acquisition, which we hypothesise can be modelled and removed using ISV and JFA. This hypothesis is supported by this work, as the ISV and JFA techniques reduce the error rate by between 11% and 44% across the challenging BANCA, SCface and MOBIO databases.

GMMS FOR FACE AUTHENTICATION

The GMM parts-based topology was first applied to face authentication in earlier work and has since been successfully utilised by several researchers. The method decomposes the face into a set of blocks that are considered to be separate observations of the same signal (the face). An overview of the procedure is shown in Figure 1. The main difference between this approach and the similar GMM-based approach to speaker authentication is the use of visual features extracted from parts of the face image, rather than acoustic features extracted from frames of the speech signal. Since features are extracted from each part of the face independently, the approach is naturally robust to occlusion, local transformation and face mislocalisation, and has been found to offer the best trade-off in terms of complexity, robustness and discrimination. The rest of this section describes the main processing stages of the framework, including image pre-processing, feature extraction and classification.
IMAGE PRE-PROCESSING AND FEATURE EXTRACTION

Each image is rotated, cropped and registered to a 64x80 intensity image with the eyes 16 pixels from the top and separated by 33 pixels. Optionally, each cropped image is processed using Tan & Triggs normalisation. From each normalised image we exhaustively sample BxB blocks of pixel values by moving the sampling window one pixel at a time. Each block is mean and variance normalised prior to extracting the D lowest-frequency 2D-DCT coefficients. The resulting feature vectors extracted from an image are finally mean and variance normalised in each dimension. Each image is thus represented by a set of K feature vectors, O = {o1, o2, ..., oK}.
The distribution of feature vectors for each client is modelled by a Gaussian mixture model, estimated using background model adaptation. Background model adaptation utilises a universal background model (UBM), m, as a prior for deriving client models using maximum a posteriori (MAP) adaptation. We only adapt the means of the GMM components and use diagonal covariance matrices, as this requires fewer observations to perform adaptation and has already been shown to be effective for face authentication. A test image, Ot, can then be verified by scoring against the model of the claimed client identity (si) and the UBM (m). The two models, si and m, each produce a log-likelihood, from which a log-likelihood ratio (LLR) is calculated to produce a single score. The image Ot is then classified as belonging to client i if and only if h(Ot, si) is greater than a threshold. In this work, we use a fast scoring technique known as linear scoring. This relies on a particular representation of a GMM called a GMM mean supervector, which is formed by concatenating the GMM component means. Thus, with GMMs expressed as supervectors, linear scoring approximates the LLR as h(Ot, si) = (si - m)^T Σ^-1 (Ft - Nt m), where Σ is the diagonal matrix formed by concatenating the diagonals of the UBM covariance matrices, Ft - Nt m is the supervector of UBM-centralised first-order statistics of the image, and Nt is the diagonal matrix formed by concatenating the zeroth-order UBM statistics of the test image. Readers are referred to the literature for more details. Finally, the scores are normalised using ZT-norm (Z-norm followed by T-norm).

Inter-session variability modelling (ISV) and joint factor analysis (JFA) are two session variability modelling techniques that have been applied with success to speaker authentication. This section provides a brief overview of these techniques; readers are referred to [11, 24, 15] for more details. Session variability modelling aims to estimate and exclude the effects of within-client variation, in order to create more reliable client models. At enrolment time, a GMM is trained for each client by adapting the means of a UBM, as described in Section 2. In terms of GMM mean supervectors, client enrolment can be expressed as si = m + D zi, where D is a diagonal matrix and zi ~ N(0, I). The diagonal elements of D are set from the UBM variances Σ(q,q) and the adaptation relevance factor, to ensure that the MAP solution for zi is equivalent to the classical MAP mean update rule.

IMAGE ACQUISITION PROCEDURES

INDOOR AND OUTDOOR CAMERA SPECIFICATIONS

SPECIFICATION          INDOOR CAMERA             OUTDOOR CAMERA
TYPE                   ISECURE SDM01HD D1        ISECURE SDH01HD D1
IMAGE SENSOR           1/4'' SONY EFFIO          1/3'' SONY EXVIEW CCD
RESOLUTION             700TVL COLOR/750TVL B/W   700TVL COLOR/750TVL B/W
MINIMUM ILLUMINATION   0.3LUX/F1.6               0.3LUX/F1.6
OPTICAL ZOOM           10X                       27X
APPLICATION            TRUE DAY/NIGHT (ICR)      TRUE DAY/NIGHT (ICR)

CCTV SYSTEM

The system used to develop the database consists of two PTZ cameras, a monitor, a DVR and a joystick, installed in the Multimedia System Laboratory at the Department of Computer and Communication System Engineering, Universiti Putra Malaysia. The indoor PTZ camera can zoom up to 10 times optical zoom and was installed at a height of 3.66 m from floor level, while the outdoor camera can zoom up to 27 times optical zoom and was installed at a height of 4 m. The specifications of both cameras are shown in the table. The camera parameter settings and the distance of the subjects to the indoor PTZ camera are calibrated to ensure that the setting is identical across subjects.
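The linear-scoring approximation described above can be illustrated with synthetic supervector statistics. The formula follows the standard linear-scoring formulation, score = (si - m)^T Σ^-1 (Ft - Nt m); the toy numbers and variable names below are our own, not the authors'.

```python
import numpy as np

def linear_score(s_i, m, sigma_diag, F_t, N_t_diag):
    """Linear-scoring approximation of the GMM log-likelihood ratio.
    All quantities are in supervector form (concatenated per component):
      s_i        client mean supervector
      m          UBM mean supervector
      sigma_diag concatenated diagonals of the UBM covariances
      F_t        first-order statistics of the test image
      N_t_diag   zeroth-order statistics, repeated per dimension
    """
    F_centered = F_t - N_t_diag * m          # UBM-centralised first-order stats
    return float((s_i - m) / sigma_diag @ F_centered)

# Toy example: 2 components x 3 dims = 6-dim supervectors.
rng = np.random.default_rng(1)
m = rng.normal(size=6)                       # UBM means
sigma = np.ones(6)                           # unit UBM variances
s_client = m + 0.5                           # client shifted away from the UBM
N = np.repeat([10.0, 5.0], 3)                # occupation counts per component
F_match = N * s_client                       # stats centred on the client
F_other = N * (m - 0.5)                      # stats centred elsewhere
score_match = linear_score(s_client, m, sigma, F_match, N)
score_other = linear_score(s_client, m, sigma, F_other, N)
```

A claimed identity is accepted when the score exceeds a threshold; here the matching statistics score strictly higher than the mismatched ones, as expected.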
EXPERIMENTAL PROTOCOLS

To assess face authentication accuracy, scores were first generated on a development set, from which a global decision threshold was found that minimised the equal error rate (EER). This threshold was then applied to a test set of completely separate clients to find the half total error rate (HTER), that is, the average of the false acceptance and false rejection rates. Thus the threshold, as well as all other hyperparameters, was tuned prior to seeing the test set. This is a critical requirement if such technology is to be applied to real applications.

In the past, a wide variety of databases have been used for evaluation of face authentication techniques. To properly evaluate ISV and JFA, we chose to use images taken in challenging conditions causing substantial within-client variation. Furthermore, we chose to restrict ourselves to publicly-available databases with separate training, development and test sets to allow for unbiased evaluation. Unfortunately, some popular databases such as FRGC and LFW were thus not applicable, as they do not include separate development and test sets. We therefore chose to evaluate the ISV and JFA techniques on the challenging BANCA, SCface and MOBIO databases. The BANCA and MOBIO databases already have well-defined protocols, while a suitable face authentication protocol for SCface is proposed in this work.

MOBIO

The MOBIO database is a large and challenging biometric database. It contains videos of 150 participants captured in challenging real-world conditions on a mobile phone camera over a one and a half year period. Figure 2c shows example images, demonstrating session variability due to variation in pose and illumination. For this work, one image was extracted from each of the videos and was manually annotated with eye locations. Using manual face localisation allows us to evaluate face authentication accuracy separately from the choice of face detection algorithm. These images and annotations will be made available to facilitate future benchmarking.

The MOBIO protocol is supplied with the database and defines three non-overlapping partitions: training, development and testing. The development and testing partitions are defined in a gender-dependent manner, such that clients' models are only tested against images from clients of the same gender. We chose to use the training data in a gender-independent manner to be consistent with the other databases, though future work could investigate gender-dependent training. For male clients, the test set contains 3,990 target trials and 147,630 impostor trials. For female clients, there are 2,100 target trials and 39,900 impostor trials. All of the training data was used to train U, V and D (9,579 images of 50 clients). A subset of 1,224 images of 34 clients (36 images each) was used for UBM training, while the other 16 clients were used for score normalisation.

SCFACE

The Surveillance Cameras face database (SCface) is particularly interesting from a forensics point of view because images were acquired using commercially available surveillance equipment, in a range of challenging but realistic conditions. As could be imagined in a real-life scenario, for authentication these surveillance images are compared to a single high-resolution mugshot image. While a protocol for face identification was previously suggested, it did not include a world data set or a development data set, and no protocol was suggested for face authentication. Therefore, we propose the following face authentication protocol for SCface, based on the DayTime tests scenario. Test images were taken from the 5 surveillance cameras at 3 different distances: close, medium and far. Each client model was tested against the 15 surveillance images from each client in the same subset. This resulted in 645 target trials and 27,090 impostor trials in the test set. Unless otherwise noted, results are reported for a combined protocol, in which each test image was assumed to originate from an unknown camera at an unknown distance.
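The EER/HTER evaluation procedure described at the start of this section (tune a global threshold at the equal error rate on development scores, then apply it unchanged to test scores) can be sketched as follows; the score distributions are synthetic stand-ins of our own choosing.

```python
import numpy as np

def far_frr(threshold, target, impostor):
    """False acceptance / false rejection rates at a decision threshold
    (scores >= threshold are accepted)."""
    far = float(np.mean(impostor >= threshold))
    frr = float(np.mean(target < threshold))
    return far, frr

def eer_threshold(target, impostor):
    """Threshold on the development set where FAR and FRR are closest (EER)."""
    candidates = np.sort(np.concatenate([target, impostor]))
    gaps = [abs(np.subtract(*far_frr(t, target, impostor))) for t in candidates]
    return float(candidates[int(np.argmin(gaps))])

def hter(threshold, target, impostor):
    """Half total error rate on the (completely separate) test set."""
    far, frr = far_frr(threshold, target, impostor)
    return (far + frr) / 2.0

# Toy scores: target scores higher than impostor scores on average.
rng = np.random.default_rng(0)
dev_tar, dev_imp = rng.normal(2, 1, 200), rng.normal(0, 1, 2000)
tst_tar, tst_imp = rng.normal(2, 1, 200), rng.normal(0, 1, 2000)
thr = eer_threshold(dev_tar, dev_imp)          # tuned on dev only
test_hter = hter(thr, tst_tar, tst_imp)        # reported on test
```

Keeping the threshold (and all other hyperparameters) fixed before touching the test set is exactly the unbiased-evaluation requirement the protocol insists on.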
RESULTS

In this section, results are reported for each database independently. A GMM parts-based system, as described in Section 2, without ISV or JFA, is used as the baseline system for comparison. Hyper-parameters were tuned on the development set for each database, including the block size used during feature extraction and the dimensionality of the subspaces U and V. UBMs were trained with 512 components, and a relevance factor of 4 was used for client model adaptation. For experiments on SCface only, cropped images were not pre-processed using Tan & Triggs normalisation, as it did not improve performance in that case. ISV and JFA were implemented based on the JFA cookbook. Subspaces V, U and D were trained using 10 EM iterations, in that order for JFA. For ISV, only U was trained. Latent variables were estimated in the order yi (JFA only), xi,j, then zi, using one Gauss-Seidel iteration. Manual face localisation was utilised unless otherwise noted. For evaluating the statistical significance of improvements in HTER, we used a previously proposed methodology with a one-tailed test.
RESULTS ON BANCA, SCFACE AND MOBIO (EER ON DEV SET, HTER ON TEST SET) SHOWING THE EFFECT OF BLOCK SIZE DURING FEATURE EXTRACTION (B x B PIXEL BLOCKS WITH D DCT COEFFICIENTS RETAINED).

            BANCA            SCFACE           MOBIO(MALE)      MOBIO(FEMALE)
B    D      DEV     TEST     DEV     TEST     DEV     TEST     DEV     TEST
8    28     9.3%    8.2%     23.8%   25.7%    23.8%   25.7%    23.8%   25.7%
12   45     7.8%    6.1%     20.2%   20.6%    20.2%   20.6%    20.2%   20.6%
16   66     7.8%    6.5%     18.5%   17.7%    18.5%   17.7%    18.5%   17.7%
20   66     8.6%    7.6%     16.7%   16.4%    16.7%   16.4%    16.7%   16.4%
24   91     8.6%    7.2%     17.4%   16.4%    17.4%   16.4%    17.4%   16.4%
FEATURE EXTRACTION AND SCORE NORMALISATION

Firstly, the size of the blocks used during feature extraction was tuned on development data. The number of DCT coefficients for a given block size, D, was initially tuned on BANCA. As shown in Table 1, the optimal block sizes were 12x12 and 20x20 pixels for the BANCA and SCface databases, respectively. For MOBIO, across male and female clients, a block size of 12x12 was chosen. Table 2 illustrates that ZT-norm score normalisation was very effective for BANCA and SCface, with relative reductions in test set HTER of 45% and 35% respectively. For MOBIO, ZT-norm had little effect. This system, with tuned block size and ZT-norm score normalisation but without ISV or JFA, is referred to as the baseline system for the following session variability modelling experiments.

RESULTS ON BANCA AND SCFACE (EER ON DEV SET, HTER ON TEST SET) SHOWING THE EFFECT OF ZT-NORM SCORE NORMALISATION.

                 BANCA             SCFACE
                 DEV      TEST     DEV      TEST
NO SCORE NORM    11.0%    11.1%    23.9%    25.1%
ZT-NORM          7.8%     6.1%     16.7%    16.4%

SESSION VARIABILITY MODELLING ON BANCA

On both the development and test sets, the best performance was achieved using the ISV approach, which improved test set HTER by 11% relative. These improvements are statistically significant at levels of 95% and 85% for the development and test sets respectively. Table 4 shows that using 50 to 100 dimensions in U was optimal on the development set, and this generalised well to the test set. It is encouraging that these results do not appear overly sensitive to the choice of subspace dimension. Our results are compared to recently published work in Table 5. For comparison, we report the HTER on the test set (g2), the development set (g1), and the average. Earlier work used automatic face extraction from the BANCA videos, while Ahonen et al. used the 5 preselected images from each video, as in this work. Our results represent a 31% reduction in average HTER when compared to previous work.

RESULTS ON BANCA (EER ON DEV SET, HTER ON TEST SET) WITH MANUAL AND AUTOMATIC FACE LOCALISATION.

            MAN. FACE LOC.      AUTO. FACE LOC.
SYSTEM      DEV      TEST       DEV      TEST
BASELINE    7.8%     6.1%       9.2%     6.7%
ISV         6.6%     5.4%       7.5%     6.0%
JFA         6.6%     6.3%       9.1%     7.0%
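The ZT-norm score normalisation used throughout these experiments is Z-norm (centre and scale a score by the client model's statistics against a cohort of impostor test samples) followed by T-norm (centre and scale by the test sample's statistics against a cohort of impostor models). A sketch with synthetic cohort scores; the function and variable names are our own, not from the paper:

```python
import numpy as np

def z_norm(score, z_cohort_scores):
    """Z-norm: normalise a raw score by the distribution of the client
    model's scores against a cohort of impostor test samples."""
    return (score - z_cohort_scores.mean()) / z_cohort_scores.std()

def t_norm(score, t_cohort_scores):
    """T-norm: normalise by the distribution of the test sample's scores
    against a cohort of impostor models."""
    return (score - t_cohort_scores.mean()) / t_cohort_scores.std()

def zt_norm(score, z_cohort_scores, t_cohort_scores_znormed):
    """ZT-norm: Z-norm first, then T-norm with Z-normed cohort scores."""
    return t_norm(z_norm(score, z_cohort_scores), t_cohort_scores_znormed)

rng = np.random.default_rng(0)
raw = 4.2
z_cohort = rng.normal(1.0, 0.5, 300)   # model vs. impostor test samples
t_cohort = rng.normal(0.0, 1.0, 300)   # test sample vs. cohort models (Z-normed)
normed = zt_norm(raw, z_cohort, t_cohort)
```

The effect is that a single global decision threshold becomes meaningful across different client models and test conditions, which is why it helps most on databases with strong session mismatch such as BANCA and SCface.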
SESSION VARIABILITY MODELLING ON SCFACE

On the SCface database, as shown in Table 6, both of the proposed session variability modelling techniques outperformed the baseline. JFA offered consistently improved performance over the baseline and ISV systems, resulting in a relative reduction in test set HTER of 18% over baseline results. The dimensionalities of V and U were tuned on the development set to values of 10 and 40, respectively. On the test set, the improvements provided by ISV and JFA over the baseline are statistically significant at levels greater than 98% and 99% respectively. Results are further analysed by separating the scores into three separate groups, i.e. those from test images taken at the 3 different distances: close, medium and far. For this analysis only, the dataset used for Z-norm score normalisation was matched to the distance of the test image. Table 8 shows that the JFA approach provided substantial improvements for close and medium images; however, recognising far images remains particularly difficult.

SESSION VARIABILITY MODELLING ON MOBIO

On the MOBIO database, as shown in Table 7, both ISV and JFA substantially reduced the error rate compared to the baseline, with improvements statistically significant at a level greater than 99.99%. For the male tests, JFA outperformed ISV, providing a relative improvement over the baseline of 30%, using dimensionalities of 30 and 50 for V and U respectively. For female clients, the ISV technique performed the best, providing a relative improvement of 44% with 250 dimensions in U.

NEW IMAGE QUALITY MEASURE BASED ON WAVELETS //

The discrete wavelet transform (DWT) can be used in various image processing applications, such as image compression and coding. In this paper, we examine how the DWT can be used in image quality evaluation, which has become crucial for most image processing applications. The quality of an image can be evaluated using different measures. The best way to do this is by conducting a visual experiment, under controlled conditions, in which human observers grade which image provides better quality. Such experiments are time consuming and costly. A much easier approach is to use some objective measure that evaluates the numerical error between the original image and the tested one. In the real world, there is no perfect way to objectively assess image quality. The problem with most objective measures is that they need a reference original image to be able to grade the corresponding tested image, while human observers can grade image quality independently of a corresponding original image. Over the past years, there have been many attempts to develop models or metrics for image quality that incorporate elements of human visual system (HVS) sensitivity. These metrics for quality assessment have limited effectiveness in predicting the subjective quality of real images. However, there is no current standard and objective definition of image quality. In our research, Watson's wavelet model is used to incorporate HVS characteristics in an image quality measure. This model is based on direct measurement of the HVS noise visibility threshold for a specific wavelet decomposition level, using a linear-phase CDF 9/7 biorthogonal filter. Blank images with uniform gray level were decomposed, and afterward noise was added to the wavelet coefficients. After the inverse wavelet transform, the noise visibility threshold in the spatial domain was measured by subjective experimentation at a fixed viewing distance. An experiment was conducted for each subband, and the visual sensitivity for that subband was then defined as the reciprocal of the corresponding visibility threshold. This model can be directly applied to perceptual image compression by quantizing the wavelet coefficients according to their visibility thresholds. It is also extendable to image-quality assessment, as was done in Ref. 6, where a wavelet visible difference predictor was used to predict visible differences between original and compressed or noisy images. In this paper, we present a new way of using Watson's wavelet model for image-quality evaluation.

1. OBJECTIVE MEASUREMENTS, INCLUDING OUR OWN DEVELOPED MEASURE, WERE PERFORMED ON THE SAME SET OF PICTURE SEQUENCES TAKEN FROM AN ALREADY KNOWN IMAGE DATABASE.
2. SUBJECTIVE ASSESSMENT RESULTS WERE TAKEN FROM REF. THE DATABASE INCLUDES SUBJECTIVE GRADES WITH CALCULATED DIFFERENTIAL MEAN OPINION SCORE (DMOS) RESULTS. THE MAIN GOAL OF THESE STUDIES WAS TO OBTAIN SUBJECTIVE RESULTS THAT WOULD BE USED IN THE THIRD STEP FOR VERIFICATION AND COMPARISON OF OBJECTIVE MEASURES.
3. THE RESULTS OF THE OBJECTIVE ASSESSMENTS (STEP 1) AND SUBJECTIVE MEASUREMENTS (STEP 2) WERE STUDIED.
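The subband-sensitivity idea, weighting the error in each wavelet subband by a visual sensitivity defined as the reciprocal of its visibility threshold, can be illustrated in miniature. This is a simplified sketch of our own: a one-level Haar DWT stands in for the paper's CDF 9/7 filter, and the threshold values are made-up placeholders, not Watson's measured data.

```python
import numpy as np

def haar_dwt2(img):
    """One-level 2-D Haar DWT (a stand-in for the CDF 9/7 filter),
    returning the LL, LH, HL and HH subbands."""
    a = (img[0::2, :] + img[1::2, :]) / 2   # vertical average
    d = (img[0::2, :] - img[1::2, :]) / 2   # vertical difference
    return {"LL": (a[:, 0::2] + a[:, 1::2]) / 2,
            "LH": (a[:, 0::2] - a[:, 1::2]) / 2,
            "HL": (d[:, 0::2] + d[:, 1::2]) / 2,
            "HH": (d[:, 0::2] - d[:, 1::2]) / 2}

def wavelet_quality(ref, test, thresholds):
    """Weighted subband error: each subband's MSE is scaled by the visual
    sensitivity, defined as the reciprocal of its visibility threshold."""
    R, T = haar_dwt2(ref), haar_dwt2(test)
    err = 0.0
    for band, thr in thresholds.items():
        sensitivity = 1.0 / thr            # reciprocal of the threshold
        err += sensitivity * float(np.mean((R[band] - T[band]) ** 2))
    return err

# Hypothetical thresholds: the eye is most sensitive to LL distortions.
thresholds = {"LL": 0.5, "LH": 2.0, "HL": 2.0, "HH": 4.0}
rng = np.random.default_rng(0)
ref = rng.random((64, 64))
noisy = ref + rng.normal(0, 0.1, ref.shape)
score = wavelet_quality(ref, noisy, thresholds)
```

Identical images score 0, and distortion in subbands with low visibility thresholds (high sensitivity) dominates the measure, which is the behaviour the HVS-based weighting is meant to capture.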
[Figure: facial landmark labels (forehead, mid forehead, upper orbit, eyebrow, upper eyelid, ear, nose, lip tip, lip corner)]
SUBJECTIVE IMAGE QUALITY MEASURE

To be able to compare several later-described objective methods, we used subjective quality results from Ref. 7. Subjective quality evaluation was based on ITU-R Recommendation BT.500-11 (Ref. 10). Details of the subjective testing can be found in Ref. 11. Briefly, they are as follows: 29 high-resolution, 24-bit/pixel RGB color images (typically 768 × 512) were degraded using five degradation types:

1. JP2K, JPEG2000 COMPRESSION
2. JPEG, JPEG COMPRESSION
3. WN, WHITE NOISE IN THE RGB COMPONENTS
4. GBLUR, GAUSSIAN BLUR
5. FASTFADING, TRANSMISSION ERRORS IN THE JPEG2000 BIT STREAM USING A FAST-FADING RAYLEIGH CHANNEL MODEL

Each of these 29 images had versions with seven to nine different qualities for JPEG and JPEG2000 and six images with different qualities for white noise, Gaussian blur, and fastfading. About 20–29 observers had to grade image quality on a continuous scale with five grades: bad, poor, fair, good, and excellent. In this way, observers evaluated a total of 982 images, out of which 203 were reference and 779 degraded images. The experiments were conducted in seven sessions: two sessions for JPEG2000, two for JPEG, and one each for white noise, Gaussian blur, and fastfading transmission errors. Raw scores for each subject were converted to difference scores between the test and reference images, d_{i,j} = r_{i,ref(j)} − r_{i,j}, where r_{i,ref(j)} denotes the raw quality score assigned by the i'th subject to the reference image corresponding to the j'th distorted image, and r_{i,j} the score for the i'th subject and j'th image. Difference scores were then converted to Z scores, z_{i,j} = (d_{i,j} − d̄_i) / σ_i, where d̄_i is the mean of the raw score differences over all images ranked by subject i, and σ_i is the standard deviation. Z scores are used to make scores more comparable, because each observer uses a different part of the grading scale.
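The raw-score conversion above can be sketched in Python (the original contains no code, so the function name and the toy data are illustrative; the sample standard deviation with ddof=1 is an assumption):

```python
import numpy as np

def z_scores(raw, ref_index):
    """Convert raw quality scores to per-subject Z scores.

    raw       : (subjects, images) array of raw quality scores
    ref_index : for each image j, the column index of its reference image
    """
    # Difference scores d_ij = r_i,ref(j) - r_ij
    d = raw[:, ref_index] - raw
    # Normalize per subject, because each observer uses a different
    # part of the grading scale.
    mean = d.mean(axis=1, keepdims=True)
    std = d.std(axis=1, ddof=1, keepdims=True)
    return (d - mean) / std

# Toy example: 2 subjects, 3 images; image 0 is the reference for all three.
raw = np.array([[90.0, 60.0, 30.0],
                [80.0, 70.0, 60.0]])
z = z_scores(raw, ref_index=[0, 0, 0])
```

Note that after normalization both subjects produce the same Z scores even though they used different portions of the raw scale, which is the point of the conversion.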
OBJECTIVE IMAGE QUALITY MEASURES

In this paper, we examined several commonly used objective quality measures:

1. MSE
2. PSNR
3. STRUCTURAL SIMILARITY (SSIM)
4. MULTISCALE SSIM (MSSIM)
5. VISUAL INFORMATION FIDELITY (VIF)
6. VISUAL SIGNAL-TO-NOISE RATIO (VSNR)
7. IQM (OUR PROPOSED MEASURE)

These measures were applied to the luminance channel only, because this gives better correlation with subjective testing than calculating the objective measures on the RGB components separately and then taking their mean. SSIM is a novel method for measuring the similarity between two images (Ref. 12). It is computed from three image measurement comparisons: luminance, contrast, and structure. Each of these measures is calculated over an 8 × 8 local square window, which moves pixel-by-pixel over the entire image. At each step, the local statistics and SSIM index are calculated within the local window. Because the resulting SSIM index map often exhibits undesirable "blocking" artifacts, each window is filtered with a Gaussian weighting function of 11 × 11 pixels. In practice, one usually requires a single overall quality measure of the entire image; thus, the mean SSIM index is computed to evaluate the overall image quality. The SSIM can be viewed as a quality measure of one of the images being compared, while the other image is regarded as of perfect quality. It gives results between 0 and 1, where 1 means excellent quality and 0 means poor quality. Similar to SSIM, the MSSIM method is a convenient way to incorporate image details at different resolutions (Ref. 13). It is a novel image synthesis-based approach that helps calibrate parameters, such as viewing distance, that weight the relative importance of different scales.
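A minimal Python sketch of the first three measures follows. The SSIM here is a simplified single-window variant computed from global statistics; it omits the sliding 8 × 8 window and 11 × 11 Gaussian weighting described above, and the stability constants follow the usual choice C1 = (0.01 L)², C2 = (0.03 L)² for dynamic range L = 255 (an assumption, as the constants are not given in this excerpt):

```python
import numpy as np

def mse(x, y):
    return np.mean((x.astype(float) - y.astype(float)) ** 2)

def psnr(x, y, peak=255.0):
    return 10.0 * np.log10(peak ** 2 / mse(x, y))

def ssim_global(x, y, peak=255.0):
    """Single-window SSIM: luminance, contrast, and structure comparisons
    computed from global image statistics instead of sliding 8x8 windows."""
    c1, c2 = (0.01 * peak) ** 2, (0.03 * peak) ** 2
    x, y = x.astype(float), y.astype(float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2) /
            ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2)))

rng = np.random.default_rng(0)
img = rng.integers(0, 256, (64, 64))
noisy = np.clip(img + rng.normal(0, 10, (64, 64)), 0, 255)
```

By construction this global SSIM equals 1 for identical images and drops below 1 for any degradation, matching the behavior described above.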
The VIF criterion (Ref. 14) quantifies the Shannon information that is shared between the reference and distorted images relative to the information contained in the reference image itself. It uses natural scene statistics modeling in conjunction with an image-degradation model and an HVS model. Results of this measure can be between 0 and 1, where 1 means perfect quality and near 0 means poor quality. VSNR (Ref. 15) operates in two stages. First, the threshold for distortions of a degraded image is determined, to decide whether it is below or above the human sensitivity of error detection. This is computed using wavelet-based models of visual masking. If the distortions are below the threshold, the distorted image is assumed to be perfect (infinite VSNR). If the distortions are above the threshold, a second stage is applied. Calculations are made on the low-level visual property of perceived contrast and the mid-level visual property of global precedence. These properties are used to determine Euclidean distances in the distortion-contrast space of a multiscale wavelet decomposition. Finally, VSNR is calculated from a linear sum of these distances. A higher VSNR means that the tested image is less degraded.

DWT refers to wavelet transforms for which the wavelets are discretely sampled. This can be done with multiresolution analysis of a signal (Ref. 8). Multiresolution analysis allows us to decompose a signal into approximations and details. These coefficients can be computed using various filter banks, such as Daubechies, Coiflets, or biorthogonal filters. Suppose we have a one-dimensional input signal x(t). It can be decomposed into approximation and detail coefficients at the first level. Then we can also decompose the approximation coefficients at the first level further into approximation and detail coefficients at the second level. This can be expressed as

a_{j+1}[n] = Σ_k h0[k − 2n] a_j[k],   d_{j+1}[n] = Σ_k h1[k − 2n] a_j[k]   (6)

Equations (6) look like convolution, but with downsampling by a factor of 2 involved; h0 and h1 are, respectively, the scaling and wavelet filters. The decomposition of a signal into an approximation and a detail can be reversed: expressions similar to (6) are used, but with upsampling and conjugate mirror filters. In an image transform we have two dimensions, so the analysis of decomposition and reconstruction must be extended to two dimensions. Decomposition can be done with a separable wavelet transform, which is in fact one-dimensional convolution with subsampling by a factor of 2 along the rows and columns of the image. Reconstruction is done in reverse: upsampling by 2 and then convolution along the rows and columns. Decomposition and reconstruction at level j are shown in Fig. 1.

[Fig. 1: Separable 2-D wavelet analysis and synthesis at one level: lowpass (L) and highpass (H) filtering with downsampling (↓2) along rows and columns, mirrored by the upsampling (↑2) synthesis bank.]
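As an illustration of Eq. (6) and the separable rows-then-columns scheme of Fig. 1, here is a one-level 2-D decomposition and its exact inverse. The Haar filter pair is used purely for brevity (the paper itself uses the CDF9_7 and Coif22_14 banks, whose longer filters would need explicit boundary handling):

```python
import numpy as np

S = np.sqrt(2.0)

def analyze(x):
    """1-D Haar analysis along the last axis: Eq. (6) with h0, h1 = Haar pair,
    i.e. filtering followed by downsampling by 2."""
    return (x[..., 0::2] + x[..., 1::2]) / S, (x[..., 0::2] - x[..., 1::2]) / S

def synthesize(a, d):
    """1-D Haar synthesis along the last axis: upsample by 2 and combine."""
    x = np.empty(a.shape[:-1] + (2 * a.shape[-1],))
    x[..., 0::2], x[..., 1::2] = (a + d) / S, (a - d) / S
    return x

def dwt2(img):
    """Separable one-level 2-D DWT: rows first, then columns."""
    lo, hi = analyze(img)                    # along rows
    ll, lh = analyze(lo.swapaxes(0, 1))      # along columns of the lowpass half
    hl, hh = analyze(hi.swapaxes(0, 1))      # along columns of the highpass half
    return [b.swapaxes(0, 1) for b in (ll, lh, hl, hh)]

def idwt2(ll, lh, hl, hh):
    """Inverse of dwt2: columns first, then rows."""
    lo = synthesize(ll.swapaxes(0, 1), lh.swapaxes(0, 1)).swapaxes(0, 1)
    hi = synthesize(hl.swapaxes(0, 1), hh.swapaxes(0, 1)).swapaxes(0, 1)
    return synthesize(lo, hi)

rng = np.random.default_rng(1)
img = rng.random((8, 8))
ll, lh, hl, hh = dwt2(img)
rec = idwt2(ll, lh, hl, hh)
```

Because the Haar bank is orthonormal, reconstruction is exact and the subbands preserve the image energy; applying `dwt2` again to the LL subband gives the second-level decomposition mentioned in the text.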
NEW ALGORITHM PROPOSED FOR IMAGE QUALITY

Some of the existing objective measures described in Section 4 did not take the HVS into account in the sense that the eye will see and grade image quality according to the type of error, as well as the location of the error in subband space. Because of that, our method calculates image quality using wavelet decomposition and grades quality depending on the wavelet subband in which the error occurs. Experiments on the image database (Ref. 7) have shown that different types of image degradation produce different error distributions in wavelet subbands. For example, for JPEG- and JPEG2000-compressed images, errors will be placed in the higher wavelet subbands (HH subband, level 2 and higher), while images with Gblur and fastfading degradations will also have errors in lower subbands. White noise has equally distributed errors in all subbands. In our research, we used two types of wavelet filters. The first filter, CDF9_7 (Ref. 17), with nine coefficients in the decomposition lowpass filter and seven in the decomposition highpass filter, is designed as a spline variant with less dissimilar lengths between the lowpass and highpass filters and has seen widespread use in image processing. The second filter, Coif22_14 (Ref. 18), with 22 coefficients in the decomposition lowpass filter and 14 in the decomposition highpass filter, has the properties of antisymmetric biorthogonal Coiflet systems, whose filter banks have even lengths and linear phase. Coefficients of these wavelet filters are presented in Table 1. Figure 2 shows the decomposition lowpass and highpass wavelet filters. All color images were first converted to grayscale images by forming a weighted sum of the red (R), green (G), and blue (B) components: Y = 0.2989R + 0.5870G + 0.1140B.

DUMIC, GRGIC, AND GRGIC: NEW IMAGE-QUALITY MEASURE BASED ON WAVELETS

TABLE 1: COEFFICIENTS OF THE USED WAVELET FILTERS

 N    CDF9_7 LOWPASS      CDF9_7 HIGHPASS     COIF22_14 LOWPASS    COIF22_14 HIGHPASS
 1    0.03782845550726   −0.06453888262870   −0.00006038691911     0.00249239584019
 2   −0.02384946501956    0.04068941760916   −0.00007137535849     0.00294555229198
 3   −0.11062440441844    0.41809227322162    0.00097545380465    −0.02160076866236
 4    0.37740285561283   −0.78848561640558    0.00120718683898    −0.02777241079070
 5    0.85269867900889    0.41809227322162   −0.00658124080240     0.09720345190957
 6    0.37740285561283    0.04068941760916   −0.00932685158094     0.16200574375453
 7   −0.11062440441844   −0.06453888262870    0.03683394176520    −0.64802297501813
 8   −0.02384946501956                        0.01809725255148     0.64802297501813
 9    0.03782845550726                       −0.14280042659266    −0.16200574375453
10                                            0.07881441881590    −0.09720345190957
11                                            0.73001880866394     0.02777241079070
12                                            0.73001880866394     0.02160076866236
13                                            0.07881441881590    −0.00294555229198
14                                           −0.14280042659266    −0.00249239584019
15                                            0.01809725255148
16                                            0.03683394176520
17                                           −0.00932685158094
18                                           −0.00658124080240
19                                            0.00120718683898
20                                            0.00097545380465
21                                           −0.00007137535849
22                                           −0.00006038691911
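The coefficients in Table 1 can be sanity-checked numerically: with this normalization the decomposition lowpass filters have a DC gain of √2, the highpass filters reject DC (their coefficients sum to zero), and the filters are (anti)symmetric, i.e. linear phase. The checks below hold only to the precision printed in the table, hence the modest tolerances:

```python
import numpy as np

# Decomposition filters from Table 1 (the symmetric halves written out in full).
cdf_lo = np.array([0.03782845550726, -0.02384946501956, -0.11062440441844,
                   0.37740285561283,  0.85269867900889,  0.37740285561283,
                  -0.11062440441844, -0.02384946501956,  0.03782845550726])
cdf_hi = np.array([-0.06453888262870, 0.04068941760916,  0.41809227322162,
                   -0.78848561640558, 0.41809227322162,  0.04068941760916,
                   -0.06453888262870])
coif_half = np.array([-0.00006038691911, -0.00007137535849, 0.00097545380465,
                       0.00120718683898, -0.00658124080240, -0.00932685158094,
                       0.03683394176520,  0.01809725255148, -0.14280042659266,
                       0.07881441881590,  0.73001880866394])
coif_lo = np.concatenate([coif_half, coif_half[::-1]])  # symmetric, 22 taps

dc_gain = cdf_lo.sum()   # lowpass DC gain, should be sqrt(2)
hi_sum = cdf_hi.sum()    # highpass DC response, should vanish
```

That the 22-tap Coif22_14 lowpass also sums to √2 confirms the normalization is the same for both banks.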
[Table 2: Coefficient parameters B1–B5, each with 95% confidence bounds, of the logistic fitting function, listed for PSNR, SSIM, MSSIM, log10 VIF, VSNR, log10 IQM1 (Watson's model), and log10 IQM2 (experimentally determined).]
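The parameters B1–B5 of Table 2 suggest the five-parameter logistic mapping commonly used to fit objective scores to subjective ratings in quality-metric evaluation. The exact functional form is not shown in this excerpt, so the version below, and its parameter values, are assumptions based on that standard practice:

```python
import numpy as np

def logistic5(x, b1, b2, b3, b4, b5):
    """Common five-parameter logistic mapping from an objective score x to a
    predicted subjective quality (assumed form, not taken from the paper)."""
    return b1 * (0.5 - 1.0 / (1.0 + np.exp(b2 * (x - b3)))) + b4 * x + b5

# Hypothetical parameters for illustration only (not the Table 2 values).
b = (-20.0, 0.5, 30.0, 0.1, 5.0)
x = np.linspace(20, 40, 5)
y = logistic5(x, *b)
```

With this form the nonlinear term vanishes at x = b3, so the curve passes through the point (b3, b4·b3 + b5), which makes the fitted parameters easy to interpret.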
COLOPHON

TYPEFACE
The text is set in Myanmar Sangam MN, designed by John Hudson and issued in digital form by Microsoft Corporation. The headings are set in PT Mono, designed by Alexandra Korolkova with Olga Umpelova and Vladimir Yefimov, released on 13 January 2010.

SOFTWARE
Adobe Creative Cloud: InDesign, Illustrator, Photoshop

EQUIPMENT
iMac desktop, 21.5-inch, 2.7 GHz
Epson Stylus Pro 4900 Designer Edition 17-inch Printer

PAPER
Epson Premium Presentation Paper Matte, 44 lb, White, 4-star

PRINTING AND BINDING
Printing: Plotnet, San Francisco, CA
Binding: Chums, San Francisco, CA
Date: 05/17/2017

DESIGNER
Yirui Qian

ABOUT THE PROJECT
This is a student project only. No part of this book or any other part of the project was produced for commercial use.