
Studies in Surveying and Mapping Science (SSMS) Volume 1 Issue 1, March 2013

Additive and Multiplicative Noise Removal Framework for Large Scale Color Satellite Images on OpenMP and GPUs

Banpot Dolwithayakul1, Chantana Chantrapornchai2, Noppadol Chumchob3

1,2 Department of Computing, Faculty of Science, Silpakorn University, Nakhon-Pathom, Thailand

3 Department of Mathematics, Faculty of Science, Silpakorn University, Nakhon-Pathom and Centre of Excellence in Mathematics, CHE, Si Ayutthaya Rd., Bangkok, Thailand


Abstract

Satellite images are usually contaminated with multiplicative noise and some additive noise [1, 2]. Because the images are large, removing both types of noise in real time is time consuming. Many-core processors such as GPUs can reduce the denoising time. However, because of the limited GPU memory and the cost of memory transfers, denoising large images requires a careful design. In this paper, we introduce a novel method for removing both additive and multiplicative noise on multiple GPUs. The method extends [8] to large-image denoising. It considers how to fit the data into GPU memory, memory utilization, and thread utilization on both the CPU and the GPUs. A speedup in computation time of up to 87.29 times over sequential computation is achieved on a 4096×4096 color satellite image.

Keywords

Image Denoising; Satellite Image; GPU; Fixed-point Iterative Method; Parallel Computing; High Performance Computing

Introduction

In image processing, noise in images is usually categorized into two models: additive and multiplicative. The former, additive Gaussian white noise, is commonly found in images acquired by digital devices. This noise model has been investigated for a long time, and a variety of algorithms exist for removing additive noise, for example the nonlinear total variation method of Rudin, Osher and Fatemi [4]. The additive noise model is usually written as Equation (1)

$$z = u + \eta. \tag{1}$$

Here, $z$ is the corrupted image, $u$ is the original image and $\eta$ is the noise on the image.

Next, the so-called multiplicative noise (a.k.a. speckle noise) is found in images obtained from synthetic aperture radar (SAR), ultrasound and sonar. The multiplicative noise takes the form of Equation (2)

$$z = u\eta. \tag{2}$$


In recent research, Hirakawa and Parks [9] and Lukin et al. [10] concluded that some images do not contain purely additive or purely multiplicative noise. The authors of [9] concluded that both noise models should be combined into a general case, as expressed by Equation (3)

$$z = u + (k_0 + k_1 u)\eta, \tag{3}$$

where $k_0$ and $k_1$ are parameters indicating the amounts of additive and multiplicative noise in the image.
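As an illustration of Equation (3), the following minimal sketch generates a corrupted image from a clean one. It assumes that $\eta$ is standard Gaussian noise; the function name `add_combined_noise`, the flat test image and the parameter values are hypothetical and chosen only for demonstration.

```python
import numpy as np

def add_combined_noise(u, k0, k1, rng):
    """Return z = u + (k0 + k1 * u) * eta, following Equation (3).

    Assumption: eta is drawn from a standard normal distribution.
    """
    eta = rng.standard_normal(u.shape)
    return u + (k0 + k1 * u) * eta

rng = np.random.default_rng(0)
u = np.full((64, 64), 100.0)  # hypothetical flat test image

# k1 = 0 leaves only the signal-independent (additive) part.
z_add = add_combined_noise(u, k0=5.0, k1=0.0, rng=rng)
# k0 = 0 leaves only the signal-dependent (multiplicative) part.
z_mul = add_combined_noise(u, k0=0.0, k1=0.1, rng=rng)
```

Setting both parameters to zero returns the clean image unchanged, which is a quick sanity check on the model.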

The novel and robust algorithm proposed by N. Chumchob, K. Chen and C. Brito-Loeza [3] removes both types of noise efficiently by combining two techniques: the ROF model [4] and the JY model [5].

In this paper, we use this method as the main technique for removing both types of noise from satellite images. The main challenge is to remove the noise in real time, since the satellite images considered are quite large. We take advantage of many-core technology and design an efficient parallel denoising method for such images. In general, denoising satellite images is time consuming, and a GPU can be used to reduce the overall computation time. On the other hand, it is well known that memory transfers between a host and GPU devices cost many cycles. Moreover, satellite images are usually too large to fit as a whole in the available memory of a GPU. Thus, proper computation and memory management strategies are required to overcome these problems.

In this paper, we propose a new parallel denoising algorithm for large color satellite images. The algorithm distributes the work to GPU threads and CPU threads as soon as they become available. It also deallocates each finished denoised color channel so that the memory space can be reused for new data arriving from the CPU.

The rest of this paper is organized as follows. The Backgrounds section explains the algorithm used for removing both additive and multiplicative noise. The Proposed Strategy section presents our strategy. The next section gives our experimental results, and the last section contains the conclusion, discussion and future work.

Backgrounds

Noise Removal Algorithm

In order to remove both additive and multiplicative noise from color images, we assume that each channel (Red-Green-Blue) is independent and adapted to the variational model

$$\arg\min_{u_1, u_2, u_3} \left\{ J_{\alpha_1, \alpha_2}(u_1, u_2, u_3) = \sum_{l=1}^{3} \left[ \int_\Omega |\nabla u_l| \, d\Omega + \frac{\alpha_1}{2} \int_\Omega (u_l - z_l)^2 \, d\Omega + \alpha_2 \int_\Omega \left( u_l + z_l e^{-u_l} \right) d\Omega \right] \right\}. \tag{4}$$


Here $u_l : \Omega \subset \mathbb{R}^2 \to [0, 255]$, $\alpha_1 > 0$ and $\alpha_2 > 0$ are the regularization parameters for the removal of additive and multiplicative noise respectively, and $\Omega = [0, n_x] \times [0, n_y]$ is the image domain. By the Euler-Lagrange equations, the variational model is written as Equation (5)

$$\begin{cases} -K(u_l) + \alpha_1 (u_l - z_l) + \alpha_2 \left( 1 - z_l e^{-u_l} \right) = 0 & \text{in } \Omega, \\ \dfrac{\partial u_l}{\partial n} = 0 & \text{on } \partial\Omega, \end{cases} \tag{5}$$

where

$$K(u_l) = \nabla \cdot \left( \frac{\nabla u_l}{|\nabla u_l|_\beta} \right), \qquad |\nabla u_l|_\beta = \sqrt{|\nabla u_l|^2 + \beta},$$

and $\beta > 0$ is a small constant that avoids the singularity.

Using the finite difference method, we discretize $\Omega$ into the discrete domain $\Omega^h$, where $h$ is the distance between grid points. The domain is divided into $n_x \times n_y$ grid cells, each of size 1×1 ($h_x = h_y = 1$). The discrete equation at $(x_i, y_j)$ on $\Omega^h$ is written as follows

$$-K^h(u_l^h)_{i,j} + \alpha_1 \left( (u_l^h)_{i,j} - (z_l^h)_{i,j} \right) + \alpha_2 \left( 1 - (z_l^h)_{i,j} \, e^{-(u_l^h)_{i,j}} \right) = 0, \tag{6}$$

where

$$K^h(u_l^h)_{i,j} = \frac{\delta_x^-}{h_x} \left( D(u_l^h)_{i,j} \frac{\delta_x^+ (u_l^h)_{i,j}}{h_x} \right) + \frac{\delta_y^-}{h_y} \left( D(u_l^h)_{i,j} \frac{\delta_y^+ (u_l^h)_{i,j}}{h_y} \right), \tag{7}$$

with $\delta^+$ and $\delta^-$ the forward and backward difference operators and $D(u_l^h)_{i,j}$ the discrete counterpart of $1 / |\nabla u_l|_\beta$.
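To make the discrete operator concrete, the sketch below evaluates the left-hand side of the discrete equation for one channel with NumPy and drives it toward zero with a plain explicit descent loop. This is only an illustration under stated assumptions: it is not the local fixed-point method of [3], and the function names, the step size `tau`, the value of `beta`, the parameter values and the synthetic test image are all hypothetical.

```python
import numpy as np

def grad_forward(u):
    """Forward differences (h = 1); the last row/column stays zero (zero flux)."""
    ux = np.zeros_like(u)
    uy = np.zeros_like(u)
    ux[:-1, :] = u[1:, :] - u[:-1, :]
    uy[:, :-1] = u[:, 1:] - u[:, :-1]
    return ux, uy

def div_backward(px, py):
    """Backward differences; the negative adjoint of grad_forward when the
    last row of px and the last column of py are zero."""
    d = np.zeros_like(px)
    d[0, :] += px[0, :]
    d[1:, :] += px[1:, :] - px[:-1, :]
    d[:, 0] += py[:, 0]
    d[:, 1:] += py[:, 1:] - py[:, :-1]
    return d

def residual(u, z, a1, a2, beta):
    """Left-hand side of the discrete equation: -K(u) + a1 (u - z) + a2 (1 - z e^{-u})."""
    ux, uy = grad_forward(u)
    D = 1.0 / np.sqrt(ux**2 + uy**2 + beta)
    K = div_backward(D * ux, D * uy)
    return -K + a1 * (u - z) + a2 * (1.0 - z * np.exp(-u))

def energy(u, z, a1, a2, beta):
    """Discrete energy of one channel, with the beta-smoothed gradient magnitude."""
    ux, uy = grad_forward(u)
    tv = np.sqrt(ux**2 + uy**2 + beta).sum()
    return tv + 0.5 * a1 * ((u - z) ** 2).sum() + a2 * (u + z * np.exp(-u)).sum()

def explicit_descent(z, a1, a2, beta=0.1, tau=0.02, iters=50):
    """Plain explicit iteration u <- u - tau * residual(u), started from z."""
    u = z.copy()
    for _ in range(iters):
        u -= tau * residual(u, z, a1, a2, beta)
    return u

# Hypothetical test data: a two-level image with combined noise.
rng = np.random.default_rng(1)
clean = np.ones((32, 32))
clean[:, 16:] = 2.0
z = clean + (0.05 + 0.05 * clean) * rng.standard_normal(clean.shape)
u = explicit_descent(z, a1=1.0, a2=1.0)
```

Because `residual` is the gradient of `energy`, comparing `energy(u, ...)` with `energy(z, ...)` gives a simple check that the iteration is moving in the right direction.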


There are several methods for solving Equation (6). For example, the time marching technique is a simple iterative technique that introduces a synthetic time variable. However, this method converges very slowly and is not suitable for parallel computation because of the data dependency in each iteration. Refer to [3] and the references therein.

Asynchronous Parallel Gauss-Seidel [8]

In this paper, we combine our previous work in [8] with the so-called local fixed-point method proposed in [3]. The method in [8] is a state-based method consisting of 4 states that allow each thread to work independently. The states are as follows:

Waiting State - The thread keeps searching for its assigned job in the job table.

Working State - The thread computes the Gauss-Seidel algorithm on its current cell and changes to the next state.

Validation State - The thread validates and waits for data dependencies to be resolved, to ensure the correctness of the algorithm before updating the data on the current cell.

Shifting State - The thread decides whether to shift to work on the cell to its right or to move back to the waiting state.
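The four states can be sketched as a small transition function. This is a hedged reading of the description above, not the implementation of [8]: the names `State` and `next_state` and the three boolean flags are hypothetical, and it is assumed that a thread which shifts to the right cell re-enters the working state.

```python
from enum import Enum, auto

class State(Enum):
    WAITING = auto()
    WORKING = auto()
    VALIDATION = auto()
    SHIFTING = auto()

def next_state(state, job_found=False, dependency_resolved=False, has_right_cell=False):
    """Return the state that follows `state` for one thread."""
    if state is State.WAITING:
        # Keep polling the job table until a job is assigned.
        return State.WORKING if job_found else State.WAITING
    if state is State.WORKING:
        # After computing the current cell, always validate.
        return State.VALIDATION
    if state is State.VALIDATION:
        # Wait here until the data dependency is resolved.
        return State.SHIFTING if dependency_resolved else State.VALIDATION
    # SHIFTING: continue with the right-hand cell (assumed to re-enter
    # WORKING) or go back to waiting for a new job.
    return State.WORKING if has_right_cell else State.WAITING
```

For example, `next_state(State.VALIDATION, dependency_resolved=False)` keeps the thread in the validation state, which models the waiting behaviour described above.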



This asynchronous approach outperforms the earlier Sliding Window Gauss-Seidel algorithm on multi-core processors [8]. However, it uses additional memory to store the job table and a 2-dimensional matrix that holds the iteration number and the state of each thread.



Compute Unified Device Architecture

Compute Unified Device Architecture (CUDA) is the architecture for Single Instruction Multiple Data (SIMD) computation from NVIDIA®. It allows a graphics card with CUDA to be used as a general-purpose processor rather than only for graphics processing; such a device is called a General-Purpose Graphics Processing Unit (GPGPU). When programming on CUDA, a developer has to specify the number of threads used for the computation. Threads executed in a kernel must be organized into groups of threads with shared data, called thread blocks. A group of blocks forms a grid. Creating, organizing and destroying threads on the GPU consumes very few resources, which allows developers to manage hundreds of threads quickly and effectively.

CUDA uses 4 levels of memory. The first level, which resides on the GPU, is called "global memory". Global memory access is the slowest for a GPU: hundreds of clock cycles are needed to access this kind of memory. The next level is called "shared memory", the fastest memory that a user can allocate and manage on a GPU device. Reading and writing through shared memory takes approximately 40 clock cycles. The other two levels are local and texture memory. Both have a large memory space and can be allocated by users, but they require more cycles than shared memory.

In this paper, we used the CUDA architecture for our experiments. However, OpenCL, which is hardware independent, could also be used, and our implementation can be extended to the OpenCL framework in the future.

The Proposed Strategy

To design the strategy effectively, we first measure the time for each fixed-point step of the local fixed-point method. We divide the computation into 5 parts and, as an example, measure the sequential computation time used for denoising a 1024×1024 image. The results are shown in FIG. 1. From FIG. 1, we found that the most time-consuming part of each iteration is the nonlinear Gauss-Seidel step. Thus, we design our method to compute the Gauss-Seidel step in parallel on GPUs. The strategy for denoising images on two GPUs is shown in FIG. 2.


Our strategy is briefly explained below. It consists of 5 parts as follows:

1) Initialize

At the beginning of the computation, the CPU partially reads the satellite image from disk.

2) Image Decomposition

The CPU divides the main image into chunks of the size specified by n. To denoise the satellite image seamlessly, the outer boundary pixels in all four directions are needed. Thus, the chunk size is (n + 1) × (n + 1) for the chunks at the four corners as in FIG. 3(a), (n + 1) × (n + 2) or (n + 2) × (n + 1) for chunks on the image's boundary, and (n + 2) × (n + 2) for chunks located elsewhere as in FIG. 3(b).


3) CPU to GPU Data Transfer

G threads on the CPU work in parallel via OpenMP to send data to the GPUs, where G is the number of graphics cards. The CPU transfers the fetched chunks to both GPUs at the same time.

4) Denoise Process

Each CPU thread invokes a CUDA kernel. The denoising is done on the GPUs using the asynchronous Gauss-Seidel technique proposed in [8] for the local fixed-point technique. After a thread finishes denoising a chunk, it transfers the chunk back to main memory and deallocates the finished chunk on the GPU to free space in global memory.

5) Finalize

The CPU combines all denoised chunks from the GPUs to create a new large denoised image and then saves it to disk.

The satellite image decompositions are illustrated in FIG. 3. In FIG. 3, the black area indicates the area of the chunk and the red dashed line indicates the actual data that must be decomposed from the image for each thread. It has the size of (n + 2) × (n + 2), where n is the size of a chunk in one dimension.

Experiment Results

We implemented our method on two NVIDIA® GTX560 GPUs with 384 stream processors and 2GB of memory, on an Intel Core i5® CPU with 4 cores and 8GB of main memory. We used the 64-bit version of Fedora 16 Linux with the GCC-C++ 4.6.0 compiler with gdb enabled and the OpenCV 2.3 library for image manipulation. The experiments were made with real satellite photos of Dindang district in Bangkok, Thailand, captured by the IKONOS satellite. Our results are presented in 3 parts as follows:

Performance Evaluation

We measure the computation time for varying chunk sizes, using 256 threads on each GPU, as shown in FIG. 4.



FIG. 4 shows clearly that a smaller chunk size increases the overall computation time, because each chunk also needs to compute the data on its border. Smaller chunks therefore imply a larger number of additional border cells to be denoised.
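The border overhead can be estimated with a short calculation. The sketch below counts the cells that must be processed when a square image is split into n × n chunks, each extended by one border pixel on every side that touches a neighbouring chunk (so corner chunks are (n + 1) × (n + 1) and interior chunks are (n + 2) × (n + 2)). The function names are hypothetical, and the image side is assumed to be a multiple of n.

```python
def padded_extents(image_size, n):
    """Padded length of every chunk along one axis.

    Each chunk has n pixels plus one border pixel per side that faces
    another chunk; sides on the image boundary get no extra pixel.
    """
    if image_size % n != 0:
        raise ValueError("image_size is assumed to be a multiple of n")
    k = image_size // n  # number of chunks along one axis
    return [n + (i > 0) + (i < k - 1) for i in range(k)]

def total_cells(image_size, n):
    """Total number of cells over all chunks, border cells included."""
    ext = padded_extents(image_size, n)
    return sum(rows * cols for rows in ext for cols in ext)

# Overhead relative to the 2048 x 2048 image for several chunk sizes.
overhead = {n: total_cells(2048, n) / 2048**2 for n in (64, 128, 256, 512)}
```

Along one axis the padded lengths sum to the image side plus 2(k - 1), where k is the number of chunks per axis, so the total grows as the chunk size shrinks.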


The total number of cells that must be computed for a 2048×2048 image at varying chunk sizes is illustrated in FIG. 5. To offset this overhead, we modified our strategy by dividing the threads on the GPUs into groups, so that several smaller chunks are denoised at the same time.



We varied the number of thread groups and the chunk size. The results are shown in FIG. 6.

FIG. 6 TOTAL COMPUTATIONAL TIME VARYING THE NUMBER OF GROUPS AND CHUNK SIZE

The computation is repeated until there is no difference between the images of two consecutive iterations. FIG. 6 shows that dividing the threads into groups and letting each group work on its own chunk at the same time decreases the total computation time in most cases. For larger chunk sizes, dividing the threads into a large number of groups (each containing a smaller number of threads) increases the computation time. The proper number of groups for the 256×256 chunk size is found to be 8 (32 threads per group). FIG. 6 also indicates that with a large chunk size (256×256 and 512×512) and more than 8 groups, the overall performance decreases, because a larger number of groups means fewer threads assigned to each chunk. A large chunk size requires an appropriate number of threads, given the number of GPU cores, to reach optimum performance.

We define speedup as t_seq / t_gpu, where t_seq is the total computation time of the sequential version and t_gpu is the total computation time on GPUs. The speedup for the 256×256 chunk size with 8 groups of threads, varying the total number of threads, is displayed in FIG. 7.

From FIG. 7, the maximum speedup achieved is 87.29 times over sequential computation, using 512 threads with a chunk size of 256×256 and 8 groups of threads.

Denoised Images Quality Evaluation

The noisy satellite images and the denoised satellite images are shown in FIG. 8 and FIG. 9.








Memory Space Complexity Evaluation

We also evaluate the average memory space complexity of the overall computation. We define F as the space required by a floating-point number (usually 4 bytes for GCC) and N as the image size in one dimension. In the normal computation, the memory required by the algorithm for a 3-channel RGB color image is given by Equation (8)

$$O(3FN^2). \tag{8}$$

Our proposed method divides the satellite image into small chunks. We define c as the number of chunks and G as the number of GPUs. Each chunk uses memory space in the order of $(n + 2)^2$, where n is the size of a chunk in one dimension. Assuming each GPU processes one chunk at a time, the space complexity required is given by Equation (9)

$$O((n + 2)^2 \cdot 3GF). \tag{9}$$

On our system, we use two GPUs (G = 2) and a chunk size of 256×256 with 8 groups of threads working at the same time on each GPU card. We can rewrite Equation (9) for our environment as Equation (10), where the constant corresponds to 8 × (256 + 2)² = 532,512

$$O(532512F). \tag{10}$$

This means the memory usage is constant for any size of the original image.

Conclusions and Future Works

We propose a new strategy for improving the quality of satellite images by removing both additive and multiplicative noise from color images. Our method can work on both shared memory and distributed memory systems by decomposing the image into small chunks that fit in each GPU's memory. The results show that our strategy achieves a speedup of up to 87.29 times compared to sequential computation. The quality of the denoised images is visually satisfactory.

However, color satellite images have three channels, and this noise removal technique denoises each channel individually. In fact, there may be dependencies between the multiplicative and additive noise on the different channels. In the future, we need to improve the mathematical model and investigate the dependency of the noise across channels.

Our framework is tested on the CUDA architecture, but OpenCL [13] could be used in the same way. The framework can also be extended to other distributed memory models such as the Message Passing Interface (MPI) or a cloud implementation. We will further investigate the data transfer time of these implementations. Additionally, we will integrate other satellite image improvements, such as cloud-fog removal [11] and stripe noise removal [12], into our approach to further improve image quality.

ACKNOWLEDGMENT

This work is supported in part by the Thailand Research Fund through the Royal Golden Jubilee Ph.D. Program, contract no. PHD/0275/2551. We would like to thank Dr. Ornprapa P. Robert for providing us with sample satellite images from the IKONOS satellites.

REFERENCES

A. Munshi, "OpenCL Parallel Computing on the GPU and CPU", International Conference and Exhibition on Computer [...] (SIGGRAPH 2008), 2008.

B. Dolwithayakul, C. Chantrapornchai, N. Chumchob, "An efficient iterative solver for FDM/FEM equations on multi-core processors", Conference Proceeding on [...] Engineering (JCSSE 2012), 2012, pp. 357-361.

C. R. Vogel and M. E. Oman, "Fast, robust total variation-based reconstruction of noisy, blurred images", IEEE Transaction of Image Processing, Vol. 7, 1998, pp. 813-824.

C. R. Vogel and M. E. Oman, "Iterative methods for total variation denoising", SIAM Journal of Sci. Comput., Vol. 17, 1996, pp. 227-238.

E. Choi and M. G. Kang, "Striping Noise Removal of Satellite Images by Nonlinear Mapping", Lecture Notes in Computer Science, Vol. 4142, 2006, pp. 722-729.

K. Hirakawa and T. W. Parks, "Image denoising using total least squares", IEEE Trans. Image Process., 15(9), 2006, pp. 2730-2742.

L. Rudin, S. Osher and E. Fatemi, "Nonlinear total variation based noise removal algorithms", Physica D, Vol. 60, 1992, pp. 259-268.

N. Chumchob, K. Chen and C. Brito-Loeza, "A new variational model for removal of combined additive and multiplicative noise and a fast algorithm for its numerical approximation", International Journal of Computer Mathematics, 2012, pp. 1-12.

N. N. Ponomarenko, S. K. Abramov, O. Pogrebnyak, K. O. Egiazarian, V. V. Lukin, D. V. Fevralev, and J. T. Astola, "Discrete cosine transform-based local adaptive filtering of images corrupted by nonstationary noise", J. Electron. Imaging, Vol. 19, 2010.

S. S. Al-amri, N. V. Kalyankar and S. D. Khamitkar, "A Comparative Study of Removal Noise from Remote Sensing Image", IJCSI International Journal of Computer Science, Vol. 7, 2010, pp. 32-36.

Y. Iikura, "Estimation of noise component in satellite images and its application", Geoscience and Remote Sensing Symposium, 1995, pp. 102-104.

Y. Li, J. Chen, Y. Wang and R. Lu, "An Effective Approach to Remove Cloud-fog Cover and Enhance Remotely Sensed Imagery", Proceeding of Geoscience and Remote Sensing Symposium, 2005, pp. 4252-4255.

Z. Jin and X. Yang, "Analysis of a new variational model for multiplicative noise removal", Journal of Math. Anal. Appl., Vol. 362, 2010, pp. 415-426.

