Develop - Issue 117 - June 2011


BUILD | CAMERA TRACKING TUTORIAL

SEEING THE NOISE

Lightning Fish's senior programmer and original MAME creator Nicola Salmoria offers an in-depth insight into getting the most from tracking cameras like those found in Kinect

With the recent introduction of Kinect for the Xbox 360, the industry has seen something of a surge of interest in the field of camera-based player tracking. To this end, Kinect comes equipped with an impressive array of sensors: traditional RGB, infra-red depth and microphones.

There is an impressive amount that can be achieved with the basic RGB camera alone, and you can even use it to enhance the depth camera information on Kinect. This article aims to show you how to get the best results, by giving you an understanding of RGB camera noise and where it comes from.

If, for example, you want to separate out the moving objects in a scene (such as the players), a good place to start is with a per-pixel image of the non-moving background. The moving objects, or foreground, are then the pixels in the scene which don't match your model of the background. However, camera noise means that even pixels which are in the static background can appear to change. The better your estimate of the noise, the more useful your background model will be (see Figure 1).
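
In its simplest form, ignoring noise for a moment, that comparison could look something like the C++ sketch below: a fixed threshold applied to the absolute per-pixel difference between the current frame and a stored background image. The single 8-bit channel, the buffer layout and the function name are assumptions made purely for illustration.

#include <cstdint>
#include <cstdlib>
#include <vector>

// Build a foreground mask by comparing the current frame against a static
// background image: pixels whose absolute difference exceeds a fixed
// threshold are treated as foreground.
std::vector<bool> foregroundMask(const std::vector<uint8_t>& frame,
                                 const std::vector<uint8_t>& background,
                                 int threshold)
{
    std::vector<bool> mask(frame.size());
    for (size_t i = 0; i < frame.size(); ++i)
        mask[i] = std::abs(int(frame[i]) - int(background[i])) > threshold;
    return mask;
}

Choosing that threshold sensibly is where the camera noise comes in, as the rest of the article explains.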


UNDERSTANDING CAMERAS

There are several different stages in the process which converts the light in a scene into a camera image. First, light passes through the lens and is converted to electrical charge. This signal is then amplified, sampled by an analogue-to-digital converter, and processed in the digital domain to apply white balance and gamma corrections.

The sensor also needs to separate the red, green and blue colour channels. In almost all modern consumer cameras this is done using something called the Bayer filter mosaic (see Figure 2): a colour filter is placed over the camera sensor so that each photosensor receives light of only one of the three primary colours, in a repeating pattern. Fifty per cent of the photosensors receive green light, 25 per cent red and 25 per cent blue. The final colour image is produced in software by an algorithm called demosaicing, which derives the missing colour components of each pixel from the neighbouring pixels. So you could say that the effective resolution of a three-megapixel camera is really only one megapixel.

For vision processing algorithms, demosaicing isn't a good thing. It increases the amount of data to process without introducing any new information, and may actually degrade the original data, depending on the type of interpolation used.
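
To make the interpolation concrete, here is a deliberately crude demosaicing sketch in C++, assuming an RGGB Bayer layout and an 8-bit raw buffer (both assumptions made for the example). Each output pixel gets its colour components by averaging whichever samples of that colour fall inside the surrounding 3x3 window; real drivers use considerably more sophisticated interpolation.

#include <cstdint>
#include <vector>

enum Channel { RED, GREEN, BLUE };

// Colour of the Bayer filter at (x, y), assuming a repeating RGGB pattern:
//   R G
//   G B
Channel bayerChannel(int x, int y)
{
    if ((y & 1) == 0) return (x & 1) == 0 ? RED : GREEN;
    return (x & 1) == 0 ? GREEN : BLUE;
}

struct RGB { uint8_t r, g, b; };

// Crude demosaicing: every output pixel averages the raw samples of each
// colour found in the 3x3 window around it (including its own sample).
std::vector<RGB> demosaic(const std::vector<uint8_t>& raw, int width, int height)
{
    std::vector<RGB> out(width * height);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            int sum[3] = { 0, 0, 0 };
            int count[3] = { 0, 0, 0 };
            for (int dy = -1; dy <= 1; ++dy) {
                for (int dx = -1; dx <= 1; ++dx) {
                    int nx = x + dx, ny = y + dy;
                    if (nx < 0 || ny < 0 || nx >= width || ny >= height)
                        continue;
                    Channel c = bayerChannel(nx, ny);
                    sum[c] += raw[ny * width + nx];
                    ++count[c];
                }
            }
            RGB& p = out[y * width + x];
            p.r = uint8_t(sum[RED] / count[RED]);
            p.g = uint8_t(sum[GREEN] / count[GREEN]);
            p.b = uint8_t(sum[BLUE] / count[BLUE]);
        }
    }
    return out;
}

Every output pixel is built from samples that already existed in the raw image, which is exactly why demosaicing adds data volume without adding information.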

Fortunately, the raw Bayer image produced by the camera is often accessible. In that case it is certainly worth using the raw image directly, since you will get more accurate results and use less processing power. The one thing to keep in mind is that the raw Bayer image isn't a real grayscale image; it's a filtered image, so you need to be aware of which filter colour corresponds to each pixel.
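
As a sketch of what using the raw image directly might mean in practice, the function below simply picks out the green samples of an RGGB Bayer frame, producing a half-resolution single-channel image that a vision algorithm can consume without any demosaicing at all. Again, the layout, the even image dimensions and the 8-bit buffer are assumptions for illustration.

#include <cstdint>
#include <vector>

// Extract the green samples found at (odd x, even y) in an RGGB Bayer frame.
// The result is a half-width, half-height image; no interpolation is done,
// so nothing is invented, we simply keep the pixels we know are green.
// Assumes width and height are even.
std::vector<uint8_t> greenPlane(const std::vector<uint8_t>& raw, int width, int height)
{
    std::vector<uint8_t> out((width / 2) * (height / 2));
    for (int y = 0; y < height / 2; ++y)
        for (int x = 0; x < width / 2; ++x)
            out[y * (width / 2) + x] = raw[(y * 2) * width + (x * 2 + 1)];
    return out;
}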

UNDERSTANDING NOISE

Whether you are using the raw Bayer data or the processed output from a camera driver, your image will invariably contain some random noise. Even if the scene in front of the camera were perfectly static and the camera itself completely accurate, the number of photons hitting the sensor would vary from one image to the next (an effect known as photon shot noise). The camera circuitry also adds a fair amount of noise of its own.

If you are identifying moving foreground by comparing the camera input to a static background image, a good way of dealing with noise is to specify a threshold which quantifies how much a pixel's value can vary from that of the corresponding background pixel before it is categorised as foreground. This prevents noisy pixels in the background area of the image from being automatically identified as foreground.

To get the best results, the threshold should depend on the actual level of noise. A theorem in probability theory, Chebyshev's inequality, can be used to pick a suitable value: in any data sample, no more than 1/k² of the samples can be more than k standard deviations away from the mean. For example, no more than 1/9th of the true background pixels can be further than three standard deviations from the background model. So if you know the standard deviation of the noise, you can set the threshold to the number of standard deviations which corresponds to the error rate (the proportion of true background pixels wrongly classified as foreground) you are happy with.

If you choose a very high error rate, the threshold will be so tight that ordinary noise pushes many true background pixels over it and they are classified as foreground. Similarly, if you choose a very low rate, the threshold will be so loose that pixels which are actually foreground are often categorised as background. The best choice of error rate lies somewhere in between.

To use this approach, you need to estimate the standard deviation of the camera noise. The first thing to do is lock the gain, white balance and exposure settings on the camera, since variations in these will cause the noise level to change. You might expect that the next step is to find a noise level for the camera image as a
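
To show how the pieces might fit together, the sketch below estimates a per-pixel mean and standard deviation from a run of frames of the empty scene, derives k from the error rate you are willing to accept using Chebyshev's inequality, and then flags pixels of a new frame that fall more than k standard deviations from their mean. The single 8-bit channel and all of the names are illustrative assumptions, not the exact procedure the article describes.

#include <cmath>
#include <cstdint>
#include <vector>

struct PixelStats { float mean = 0.0f, m2 = 0.0f; int n = 0; };

// Accumulate per-pixel mean and variance over frames that contain only the
// static background, using Welford's online algorithm.
void accumulate(std::vector<PixelStats>& stats, const std::vector<uint8_t>& frame)
{
    for (size_t i = 0; i < frame.size(); ++i) {
        PixelStats& s = stats[i];
        ++s.n;
        float delta = frame[i] - s.mean;
        s.mean += delta / s.n;
        s.m2 += delta * (frame[i] - s.mean);
    }
}

// Chebyshev's inequality: at most 1/k^2 of true background pixels will lie
// more than k standard deviations from the mean, so pick k from the error
// rate you can live with. For example, kForErrorRate(1.0f / 9.0f) gives
// k = 3, the example used in the text.
float kForErrorRate(float errorRate)
{
    return std::sqrt(1.0f / errorRate);
}

// Classify a new frame: a pixel is foreground if it is more than k standard
// deviations away from its per-pixel background mean.
std::vector<bool> classify(const std::vector<PixelStats>& stats,
                           const std::vector<uint8_t>& frame, float k)
{
    std::vector<bool> foreground(frame.size());
    for (size_t i = 0; i < frame.size(); ++i) {
        const PixelStats& s = stats[i];
        float stddev = (s.n > 1) ? std::sqrt(s.m2 / (s.n - 1)) : 0.0f;
        foreground[i] = std::fabs(frame[i] - s.mean) > k * stddev;
    }
    return foreground;
}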

