Visual Recognition of Hand Postures

CMPE 537 Computer Vision

Fall 2005

Term Project

by Işık Barış Fidaner and Başar Uğur

Boğaziçi University, 2005



Introduction

This project is based on the idea of human-computer interaction in 3D graphics applications. Today, webcams are widely used with personal computers. A simple tool that uses computer vision on the real-time data supplied by a webcam would be a new and inexpensive means of human-computer interaction. Our initial aim was to capture data at a very high rate and then apply the model to a virtual physical world through fast computation in the remaining intervals. Things did not work out that way during development; still, the final result is a robust application that is open to many improvements, including our original goals. The program simply tries to mimic your hand movements. A special glove with red, green and blue parts is used. The program maintains a simplified 3D model of the glove and tries to imitate the user's movements on the screen.

Definitions

Background difference: A way of detecting moving objects. A still image called the background is supplied initially. Then every pixel of a grabbed frame is classified as background or foreground by comparing its color to the background image.

Double thresholding: In clustering pixels, double thresholding is used for better cluster assignments. It involves:

• a high threshold, which defines a starting point for adding new pixels to the cluster, and
• a low threshold, which defines the stopping condition for adding new pixels, so that they do not go below it.

Alpha-trimmed mean filter: Similar to the mean filter, except that a number of the minimum and maximum values among the neighboring pixels are discarded before averaging.

Median filter: Sorts the neighboring pixels and takes their median as the new value. Both the median and the alpha-trimmed mean filter are nonlinear and give good results for impulsive noise, which is the type of noise present in our colored webcam images.

Total cluster coverage: The percentage of foreground pixels that are assigned to a cluster.

Color cluster radius: The radius of the sphere in RGB space that defines a color cluster. The center of the sphere is the predetermined color value of that cluster.


Recursion depth: The maximum depth allowed for a recursive operation; in our case, it limits how far the low-threshold region growing can expand.
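To make the cluster-related definitions concrete, here is a minimal C++ sketch of how a pixel can be assigned to a color cluster and how total cluster coverage can be computed. The function and type names are illustrative, not the ones used in our classes.

#include <vector>

struct Color { float r, g, b; };

// Squared distance between two colors in RGB space.
float colorDistSq(const Color& a, const Color& b) {
    float dr = a.r - b.r, dg = a.g - b.g, db = a.b - b.b;
    return dr * dr + dg * dg + db * db;
}

// A pixel belongs to a color cluster if it lies inside the sphere
// centered at the cluster color with the given radius.
bool inCluster(const Color& pixel, const Color& center, float radius) {
    return colorDistSq(pixel, center) <= radius * radius;
}

// Total cluster coverage: the percentage of foreground pixels assigned
// to any of the clusters.
float clusterCoverage(const std::vector<Color>& foreground,
                      const std::vector<Color>& centers, float radius) {
    if (foreground.empty()) return 0.0f;
    int covered = 0;
    for (size_t i = 0; i < foreground.size(); ++i)
        for (size_t j = 0; j < centers.size(); ++j)
            if (inCluster(foreground[i], centers[j], radius)) { ++covered; break; }
    return 100.0f * covered / foreground.size();
}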


Interface

The program is implemented as an MFC (Microsoft Foundation Classes) application. There are three main parts of the program:

• Video capturing
• Image processing
• OpenGL realization

The dialog interfaces that represent these parts are shown in Figure 1.

Figure 1. Main program dialog, with the video capturing controls, the image processing controls and the OpenGL frame.

Video Capturing

When you start the program, it automatically detects the available video input device and starts real-time capturing. The video input device we used was a USB 2.0 PC webcam, capable of capturing live video at 30 frames per second. We used the NCVideoInput library for this purpose. The library is particularly good at grabbing and saving frames from real-time video, and we grab frames at the same rate they are sent by the webcam. Therefore, at each capturing event we have a 320x240 bitmap waiting to be processed.
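The original capture code is not reproduced here; the following is only a rough sketch of how such a setup could look. The member names mirror those listed for the NC Video Input Library in Figure 3 (SelectDevice, Attach, Run, GrabCurrentFrame); the stand-in class, signatures and types are our assumptions, not the library's documented interface.

// Stand-in for the capture library interface; assumed signatures.
struct VideoInputStub {
    void SelectDevice(int index) {}            // choose the detected webcam
    void Attach(void* windowHandle) {}         // bind capturing to the dialog window
    void Run() {}                              // start streaming at the camera's rate
    unsigned char* GrabCurrentFrame() { return 0; } // latest 320x240 RGB bitmap
};

// Sketch of the startup sequence.
void startCapture(VideoInputStub& video, void* dialogWindow) {
    video.SelectDevice(0);
    video.Attach(dialogWindow);
    video.Run();                               // frames now arrive about 30 times per second
}

// Called on every capture event; the grabbed bitmap is then handed
// to our image processing code.
void onCaptureEvent(VideoInputStub& video) {
    unsigned char* bmp = video.GrabCurrentFrame();
    // ... pass the 320x240 bitmap to processBitmap ...
    (void)bmp;
}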

Image Processing

After the camera is initialized and images are being grabbed successfully, they can be processed. Processing an image consists of several steps, which are all disabled at first. The user sets the parameters for each step and enables it with the corresponding checkbox on the left. The steps and their parameters are shown in Figure 2.
Figure 2. Program flow: grab from camera, background difference, color clustering, feature extraction, 3D glove model. The parameters to be set by the user are shown in dashed-line boxes: the background image, high/low thresholds and recursion depth (background difference); the three cluster colors, cluster color radius and total cluster coverage (color clustering); and the estimated position and angle used by the 3D glove model.

First the user turns the camera toward the still background and sets the background image by pressing the "Set Background" button. The default values for the high and low thresholds are 30000 and 3000, respectively. These values are the squares of the distances, in RGB space, between a background pixel and the corresponding pixel of the grabbed image. For example, if the R, G and B differences are all 100, the squared distance is 100^2 + 100^2 + 100^2 = 30000, which equals the default high threshold. At this point the user can also change the "Recursion Depth" to limit how many neighbors are checked against the low threshold. After setting the necessary parameters, the user can enable background difference and watch the results live in the "Processed Image" frame. The "Apply alpha-trimmed mean filter" and "Apply median filter" checkboxes can also be checked for further noise reduction, at the cost of extra computation. The background difference parameters can also be changed on the run.
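As an illustration of the thresholding just described, here is a minimal C++ sketch of background difference with double thresholding. It is a simplification of the idea, not the actual applyBackgroundDifference code; the image layout and helper names are assumptions.

#include <vector>

// Squared RGB distance between corresponding pixels of the current
// frame and the background image.
static float diffSq(const float cur[3], const float bg[3]) {
    float d = 0.0f;
    for (int c = 0; c < 3; ++c) {
        float e = cur[c] - bg[c];
        d += e * e;
    }
    return d;
}

// Recursively mark neighbors as foreground while they stay above the
// low threshold, up to a maximum recursion depth.
void growForeground(int y, int x, int depth, int height, int width,
                    const std::vector<float>& curr,   // height*width*3, interleaved RGB
                    const std::vector<float>& back,   // same layout as curr
                    std::vector<unsigned char>& mask, // 1 = foreground
                    float lowThreshold, int maxDepth) {
    if (depth > maxDepth || y < 0 || y >= height || x < 0 || x >= width) return;
    int i = y * width + x;
    if (mask[i]) return;                                    // already marked
    if (diffSq(&curr[i * 3], &back[i * 3]) < lowThreshold) return;
    mask[i] = 1;
    growForeground(y - 1, x, depth + 1, height, width, curr, back, mask, lowThreshold, maxDepth);
    growForeground(y + 1, x, depth + 1, height, width, curr, back, mask, lowThreshold, maxDepth);
    growForeground(y, x - 1, depth + 1, height, width, curr, back, mask, lowThreshold, maxDepth);
    growForeground(y, x + 1, depth + 1, height, width, curr, back, mask, lowThreshold, maxDepth);
}

// Pixels whose difference exceeds the high threshold seed a foreground
// region; the region is then grown using the low threshold.
void backgroundDifference(int height, int width,
                          const std::vector<float>& curr,
                          const std::vector<float>& back,
                          std::vector<unsigned char>& mask,
                          float highThreshold, float lowThreshold, int maxDepth) {
    mask.assign(height * width, 0);
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            if (diffSq(&curr[(y * width + x) * 3], &back[(y * width + x) * 3]) >= highThreshold)
                growForeground(y, x, 0, height, width, curr, back, mask, lowThreshold, maxDepth);
}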


Color clustering requires the user to pick three colors that represent the Palm, Fingers and Thumb parts of the hand model. Picking is done by dragging the mouse over a rectangular area; that area's average color is shown in the corresponding color box under "Picked cluster colors". After the three colors are picked one by one, they are used as the cluster center colors. The next step is determining the cluster radius; all three clusters share the same radius. To set it, the glove is shown to the camera so that it is the only object in the foreground. The user then presses the "Find cluster radius" button, and the radius is calculated automatically from the total cluster coverage parameter. Both parameters can also be set manually. Once the cluster colors and radius are determined, color clustering can be applied. The result of clustering should look like Figure 1: three colored areas corresponding to the three colored parts of the glove. If the result is not successful, there may be several causes: the camera may have changed its exposure to adapt to the lighting conditions, or there may be other objects with colors similar to the glove's. This can usually be solved by adjusting the parameters of the previous operations.

When clustering has been applied successfully, features can be extracted from the image. These are the center pixels of the individual color clusters, the center pixel of all clusters together, and the total clustered area. These values are used in the next part of the program.

OpenGL Realization

With the features in hand, we can apply them to the model. The hand model consists of three parts: Palm, Fingers and Thumb. The 3D model is (currently) driven by four features of the clustered glove image: the Palm's X coordinate, the Palm's Y coordinate, the total clustered area, and the angle of the vector from the Palm's center to the Fingers' center. We experimented with additional parameters to recognize and render more shapes, but the results were hardly satisfactory due to noise and data insufficiency. The movements of the 3D model depend on four variables: the estimated X, Y, Z coordinates and the estimated rotation around the Z axis. We also tried to estimate the rotation around the X axis using the standard deviation of the clustered points, but the results were not satisfactory because of the noisy input. The model moves parallel to the user's hand to keep the behavior interactive: when the user's hand approaches the screen, the model moves in the positive Z direction of the 3D world, i.e. into the screen; when the user moves right, the 3D model moves right, and so on. Horizontal and vertical movements are mapped the same way.
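The mapping from extracted features to the model pose can be sketched as follows. This is a simplified illustration under our own assumptions (for example the arbitrary scale for Z and the use of atan2), not a listing of the actual program.

#include <cmath>

struct Pose { float x, y, z, zAngle; };

// Map the clustered-glove features to an approximate hand pose.
// palmX/palmY: center of the Palm cluster; fingersX/fingersY: center of
// the Fingers cluster; clusteredArea: total number of clustered pixels.
Pose estimatePose(float palmX, float palmY,
                  float fingersX, float fingersY,
                  float clusteredArea) {
    Pose p;
    // X/Y follow the palm center (scale factors omitted here).
    p.x = palmX;
    p.y = palmY;
    // The smaller the glove appears, the farther it is from the camera,
    // so Z is taken as a decreasing function of the clustered area
    // (the constant 1000 is an arbitrary illustrative scale).
    p.z = (clusteredArea > 0.0f) ? 1000.0f / std::sqrt(clusteredArea) : 0.0f;
    // Rotation around Z: angle of the vector from the Palm center to
    // the Fingers center, in radians.
    p.zAngle = std::atan2(fingersY - palmY, fingersX - palmX);
    return p;
}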

Problems and solutions

We tried different algorithms and methods; some of them were discarded

because they either did not give correct estimations or slowed down the process too much. We now discuss the conditions that were important for meeting our criteria. The overall evaluation of this project can be based on two main criteria:

1. Correctness of the estimations
2. Speed of the calculations

Illumination

We used color recognition in the project, and illumination was the most important condition from beginning to end. We built a special glove for our purpose, consisting of a red palm, green fingers and a blue (cyan) thumb, so that the parts could easily be separated by their bright colors. However, things did not go so well at the beginning. The illumination inside the room, supplied by an ordinary yellow 100 W light bulb, produced very weak responses for the brightly colored glove. We then changed it to a fluorescent lamp and got more saturated images, which was what we were looking for (Figure 4).

Figure 4. The illumination difference: the first image is captured under the yellow light bulb, the second under the fluorescent lamp.

Clustering

For clustering colors, we first implemented the k-means color clustering algorithm. We gave the glove's colors as initial values for the clustering and tried to find the glove in the clustered image. But k-means clustered every foreground pixel, including the arm and any other object, whereas we only wanted certain areas on the glove.

Figure 5. Color clusters in RGB space (the Palm cluster near the R axis, Fingers near G, Thumb near B).

So, we took a different approach: we clustered three colors as three spheres in the RGB color space. The center colors of the three clusters were at first picked by clicking on the captured image, but this gave noisy results, because a single pixel's color is not a good central color; the texture of the glove made it even harder to select one. We then changed the picking so that the central color is the average over a rectangular area of the captured image selected by the user.

Position/orientation estimation

The first features we extracted were the center positions of the color areas and the center point of the total clustered area. This gave an idea of the X and Y coordinates of the hand. The Z position was a function of the total clustered area: the smaller the glove appears, the farther it is from the webcam. These positions gave good results in a mirroring application (where the 3D hand model behaves like a mirror image), because the X/Y lengths corresponded perfectly. After mirroring, we tried a different application: putting the 3D hand model in a virtual world. Here the 3D hand model is not the mirror image of your hand; it tries to move parallel to your hand. For example, if you bring your hand near the webcam, the hand model moves into the screen. In this application the X/Y coordinates did not work as they had for mirroring: because of the camera perspective, distant movements were rendered shorter than the real movements. We therefore had to remove the perspective effect from the coordinate system, i.e. we needed a conversion from camera coordinates to model coordinates. In computer graphics, a focal length is used for the perspective projection of 3D objects onto the screen. In our case we had the projected lengths and needed the real 3D coordinates. We assumed that the focal length of the camera was 10 and applied the projection formula in the reverse direction (Figure 6): instead of multiplying the coordinates by the perspective factor, we divided by it, using the estimated Z values. As a result, farther X/Y movements became larger, as they were supposed to be.
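A minimal sketch of this "unprojection" step is given below. The exact perspective factor used in the program is not reproduced here; the sketch simply assumes the usual pinhole relation projected = real * focus / z and inverts it to recover the real coordinates from the projected ones and the estimated Z.

// Recover approximate real-world X/Y from projected image coordinates.
// projX/projY are measured relative to the image center; estimatedZ is
// assumed to be positive.
struct Real3D { float x, y, z; };

Real3D unproject(float projX, float projY, float estimatedZ,
                 float focusLength = 10.0f) {   // focal length assumed to be 10
    // Forward projection would be: proj = real * focusLength / z.
    // Dividing by the perspective factor (focusLength / z) inverts it.
    float factor = focusLength / estimatedZ;
    Real3D r;
    r.x = projX / factor;
    r.y = projY / factor;
    r.z = estimatedZ;
    return r;
}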

Figure 6. "Unprojecting" the X/Y coordinates: the camera projection and the assumed focus depth are known, the real X/Y position is unknown, and the estimated Z position is used to recover it.

Background difference


The glove's colors, red, green and blue, are very common, and the detection could easily fail if background difference were not used. We first implemented background difference with a single threshold value: every pixel of the image was compared to the background, and pixels that were not different enough from it (color distance below the threshold) were painted black. This first implementation gave noisy results. If the threshold was high, the glove was not detected as a single object but broke into separate parts; if it was low, the glove was seen but a lot of pepper noise distorted the image. We thought a nonlinear filter would be enough to remove the salt-and-pepper noise. We first implemented the median filter, but it was not fast enough for a real-time application; the alpha-trimmed mean filter was also slow. After trying nonlinear filters we found another way: giving background difference an additional threshold for enlarging the detected foreground regions. This is similar to the double thresholding used in Canny edge detection, but serves a different purpose. Background difference worked faster and better with double thresholding. There was still some noise, mostly because the camera changed its exposure to adapt to the lighting conditions; if we could find a way to fix the exposure, the results would be better.

Logging

In a real-time application every frame is processed in several steps, and if the program slows down you have to find out which step is responsible. We used a simple logging system for this purpose: every process calls logging functions when an event starts and ends, and the Image class automatically logs how many milliseconds each process took. This helped a lot when we were trying to speed up the program.
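A minimal sketch of such event timing is shown below. It only illustrates the idea behind the logEventStarted/logEventEnded pair; the actual implementation in the Image class may differ (for example it could use a Windows timer instead of the standard clock used here).

#include <cstdio>
#include <ctime>

// Very small event timer: call logEventStarted() before a processing
// step and logEventEnded() after it to print how long the step took.
class EventLogger {
public:
    void logEventStarted() {
        start_ = std::clock();
    }
    void logEventEnded(const char* message) {
        // Approximate elapsed processor time in milliseconds.
        long msec = (std::clock() - start_) * 1000L / CLOCKS_PER_SEC;
        std::printf("%s took %ld ms\n", message, msec);
    }
private:
    std::clock_t start_;
};

// Usage sketch:
//   EventLogger log;
//   log.logEventStarted();
//   // ... run one processing step, e.g. background difference ...
//   log.logEventEnded("background difference");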


Implementation

The project interface dialog uses several independent classes for different functionalities. The class hierarchy is shown in Figure 3. We describe these classes one by one.

Figure 3. Class hierarchy and some of the class member functions (external parts are shown with dashed lines):

• Dialog: PickColorRect, processBitmap, initialize, OnTimer, ...
• OpenGLControl: DrawGLScene, InitGL, ...
• Glove: setOrientation, setPosition, draw, ...
• Image: applyBackgroundDifference, calculateClusterRadii, applyClustering, logEventStarted, logEventEnded, getFromBitmap, setBitmap, ...
• NC Video Input Library (external): GrabCurrentFrame, SelectDevice, Attach, Run, ...
• BMP I/O (external): Bmp_24_write, ...

Dialog class

This is the main dialog class, CVITestDlg, derived from CDialog. The main dialog window can be seen in Figure 1. The Dialog class includes "button clicking" handlers such as OnBnClickedBackground, as well as some additional functions. Some of the members are described below.

Member variables:

CImage *image                 // The image used for processing every frame
CImage *background            // Where the background image is stored
COpenGLControl openGLControl  // Used to draw the 3D hand model on the dialog
float clusterRadius           // Radius set for the color clusters
int clusters[clusterCount][3] // Center colors of every cluster

Member functions:

initialize()                  // Initializes the webcam and loads default parameters
PickColorRect(point1, point2) // Calculates the average color in a rectangular region
ProcessBitmap(bitmap)         // Applies the enabled processing options to the grabbed image
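As an example of one of these helpers, here is a minimal sketch of how the average color over a rectangular region can be computed; the data layout and names are illustrative, not the actual PickColorRect implementation.

// Average RGB color over the rectangle [x1,x2) x [y1,y2) of an
// interleaved RGB image stored as width*height*3 floats.
struct Rgb { float r, g, b; };

Rgb averageColorInRect(const float* image, int width,
                       int x1, int y1, int x2, int y2) {
    Rgb sum = {0.0f, 0.0f, 0.0f};
    int count = 0;
    for (int y = y1; y < y2; ++y) {
        for (int x = x1; x < x2; ++x) {
            const float* p = image + (y * width + x) * 3;
            sum.r += p[0];
            sum.g += p[1];
            sum.b += p[2];
            ++count;
        }
    }
    if (count > 0) {
        sum.r /= count;
        sum.g /= count;
        sum.b /= count;
    }
    return sum;
}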


Glove class

This is a simple class that manages the position and orientation parameters of the 3D hand model.

Member variables:

float x, y, z           // Position of the model
float z_angle, x_angle  // Orientation of the model

Member functions:

setPosition(x, y, z)             // Sets the position of the hand model
setOrientation(z_angle, x_angle) // Sets the orientation of the hand model
draw()                           // Draws the model in OpenGL using the current parameters
drawPalm(), drawThumb(), drawFingers(), drawUnitCube() // Other functions used for drawing the hand
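A typical per-frame use of this class could look as follows. This is only a sketch of how the listed members fit together; the pose values are assumed to come from the image processing step, and the helper name is hypothetical.

// Hypothetical per-frame update of the hand model using the estimated
// pose (x, y, z, z_angle) produced by image processing.
void updateHandModel(Glove& glove,
                     float estX, float estY, float estZ, float estZAngle) {
    glove.setPosition(estX, estY, estZ);
    glove.setOrientation(estZAngle, 0.0f); // x_angle left at 0 (not estimated reliably)
    glove.draw();                          // renders Palm, Fingers and Thumb
}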

Image class

All image processing functions are implemented in this class.

Member variables:

char *id                   // Name of the image
BYTE *bmpBuffer            // Used by getFromBitmap, setBitmap
FILE *fptr                 // Used by writeFile
float BGDiffThresholdHigh  // High threshold used for background difference
float BGDiffThresholdLow   // Low threshold used for background difference
float BGDiffRecursionDepth // Recursion depth used for background difference
float ClusterCoverage      // Percentage coverage used to calculate the radius
float RadiusUnit           // Smallest radius difference
int height, width          // Height and width of the image
float image[][][3]         // Every pixel color on the image
float z_angle, x_angle     // Orientation of the model
int msec                   // Used for logging event start/end times

Member functions:

getFromBitmap(bmp)
getFromBitmap(bmp, height, width)       // Gets the image from a bitmap
setBitmap(bmp)                          // Sets bitmap pixels to the image pixels
applyBackgroundDifference(background)   // Applies background difference using double thresholding
applyLowThreshold(background, y, x, PixelState, depth) // Recursive function used by background difference


applyMedianFilter(radius)                // Applies the median filter to the image
applyAlphaTrimmedMeanFilter(radius)      // Applies the alpha-trimmed mean filter to the image
calculateClusterRadii(clusterCount, clusters)            // Calculates the cluster radius using the total coverage value
applyClustering(radius, clusterCount, clusters)          // Applies color clustering to the image
calculateClusterCoverage(radius, clusterCount, clusters) // Calculates the total cluster coverage
CImage(id)
CImage(id, height, width)
CImage(id, bmp)
CImage(id, bmp, height, width)           // Constructors
~CImage()                                // Destructor
Init(id)                                 // Initializes the image
calculateStdDev(mean, directionVector)   // Calculates the standard deviation of the positions of clustered
                                         // points along (result.x) and orthogonal to (result.y) the direction vector
initiateKmeans(clusterCount, clusters)   // Initiates the k-means quantization algorithm
calculateKmeans(clusterCount, clusters)  // Calculates the k-means clusters
writeFile(filename)                      // Writes the image into a BMP file
calculateColorArea(color)                // Calculates the total area of a color on the image
calculateColorCenter(color)              // Finds the center pixel of a color area
markColorCenter(color)                   // Marks the color center
markPoint(y, x, color)                   // Marks a point
logEventStarted()                        // Logs the start of an event
logEventEnded(message)                   // Logs the end of an event
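Putting the listed members together, a single frame could be processed roughly as follows. This is only an illustrative call sequence based on the function names above; the actual control flow lives in the dialog's processBitmap, and the parameter types are assumptions.

// Hypothetical per-frame pipeline using the Image class interface above.
void processFrame(CImage& image, CImage& background,
                  BYTE* grabbedBitmap,
                  float radius, int clusterCount, int clusters[][3]) {
    image.logEventStarted();
    image.getFromBitmap(grabbedBitmap);                    // copy the grabbed frame
    image.logEventEnded("get from bitmap");

    image.logEventStarted();
    image.applyBackgroundDifference(background);           // double-thresholded difference
    image.logEventEnded("background difference");

    image.logEventStarted();
    image.applyClustering(radius, clusterCount, clusters); // assign pixels to the color spheres
    image.logEventEnded("color clustering");

    image.setBitmap(grabbedBitmap);                        // write the processed result back
}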

BMP I/O

This is a C unit containing functions for writing and reading bitmap files.
http://www.csit.fsu.edu/~burkardt/cpp_src/bmp_io/bmp_io.html


NC Video Input Library

This is a commercial (but inexpensive) video input library we found on the Internet. We started the program from their demo application. We chose this library because it lets us grab frames in real time and is fast enough, since it uses DirectShow.
http://www.neatcpp.com/

OpenGLControl class

We found this class in a tutorial on combining OpenGL with MFC dialogs. Using it, we can show our 3D hand model in a rectangle on our dialog.
http://steinsoft.net/index.php?site=Programming/Tutorials/opengl_dialog

Conclusion

In this project, our aim was to create a human-computer interface that allows the user to interact with a virtual physical world. What we ended up with is a program that sees the user's hand and more or less repeats its movements with a 3D hand model. Obviously, the resulting program does not achieve everything that was intended at first, but we faced many computer vision problems and tried to find new solutions. Some products of our efforts are background difference using double thresholding, clustering using color spheres, and reverse projection. These solutions can also be used in computer vision projects on different subjects.

