
International Journal of Computer Science Engineering and Information Technology Research (IJCSEITR) ISSN 2249-6831 Vol. 3, Issue 2, Jun 2013, 11-18 © TJPRC Pvt. Ltd.

ADVANCED ENCODING TECHNIQUE FOR SCANNED MULTIFACETED MANUSCRIPTS USING CLUSTERING TECHNIQUES

NISHA JOSEPH & DIVYA MOHAN

Assistant Professors, CSE, SAINTGITS College of Engineering, Kottayam, Kerala, India

ABSTRACT

This paper deals mainly with the compression of scanned multifaceted manuscripts. Here we propose an enhanced encoder for scanned composite manuscripts. Effective compound document compression algorithms require that scanned document images first be separated into regions such as text, pictures, and background. The proposed algorithm first classifies the document into image and text regions. These regions are then compressed using different encoders, each suited to its type of region. The adaptive use of different types of encoders, driven by clustering techniques, results in performance gains. An efficient algorithm can be used to separate the text from the image in a complicated document where the text overlays the picture. Text regions are then compressed using an optimized grid-based clustering algorithm, and image regions are compressed using an optimized density-based clustering algorithm. This procedure reduces the pixel classification errors in compound manuscripts and thereby increases the overall efficiency of the compression scheme. Experiments using different databases show that the proposed method can improve document compression to a great extent.

KEYWORDS: Multifaceted Manuscript, Multidimensional Multiscale Parser (MMP), Grid Based Clustering, Varied Density Based Spatial Clustering of Applications with Noise (VDBSCAN)

INTRODUCTION

Image compression is a very dynamic research area, as evidenced by the large number of publications in the journals and conferences of image processing. Document images have often become easier and cheaper to manipulate in electronic form than in paper form. The increasing significance of digital media support for manuscript transmission and storage justifies the demand for efficient coding algorithms for this type of data. Conventional paper media are being replaced by digital versions, with the benefit of avoiding the large storage and maintenance requirements associated with the paper versions, while making the documents easily accessible to a larger number of users. By a multifaceted manuscript we mean an image that contains data of various types, such as photographic images, text, and graphics. Each of these data types has different statistical properties and is characterized by a different level of distortion that a human observer can notice. This organization of a compound document requires different parameter settings when a single compression algorithm is used for the entire image. Moreover, experience shows that the best compression performance is obtained when different algorithms are applied to different image types. A common approach to compression of a multifaceted manuscript includes three major steps [1]: (1) segmentation of the image into regions of similar data types; (2) choice of the best compression procedure for each region; and (3) bit allocation among the compression algorithms of the various regions. While compressing complex manuscripts, the separation of smooth and non-smooth regions is a complicated task. Most algorithms are based on the assumption that the segmentation procedure can precisely separate the text and graphics regions, which is not always true.
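A minimal sketch of this three-step flow, assuming a grayscale page stored in a NumPy array; the variance-based block classifier and the two encoder stubs are illustrative placeholders, not the encoders proposed in this paper.

import numpy as np

def classify_block(block, var_threshold=1500.0):
    """Crude text/image decision: text blocks tend to show high local variance."""
    return "text" if np.var(block.astype(float)) > var_threshold else "image"

def encode_text_block(block):
    return ("text", block.tobytes())        # stand-in for a text-oriented encoder

def encode_image_block(block):
    return ("image", block.tobytes())       # stand-in for a picture-oriented encoder

def compress_document(gray, block_size=32):
    """Step 1: segment into blocks of similar type; step 2: pick an encoder per block.
    Step 3 (bit allocation between regions) is left implicit in this sketch."""
    h, w = gray.shape
    streams = []
    for y in range(0, h, block_size):
        for x in range(0, w, block_size):
            block = gray[y:y + block_size, x:x + block_size]
            encode = encode_text_block if classify_block(block) == "text" else encode_image_block
            streams.append(((y, x), encode(block)))
    return streams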


Errors in pixel classification usually compromise the overall efficiency of the compression method. The separation of text and image regions of multifaceted manuscripts therefore plays a vital role in document encoding schemes. Clustering analysis is a principal method of data mining. Clustering is the process of allocating the points of a given dataset into disjoint and meaningful clusters. There are five main families of clustering methods: partitioning, hierarchical, density-based, grid-based, and model-based methods [2], [3]. This paper implements an optimized grid-based clustering algorithm for compressing text blocks and an enhanced density-based algorithm for compressing image blocks. Thresholding is the simplest method of separation: from a grayscale image, thresholding can be used to create binary images. Edge information [4] can also be used to separate textual regions from a multifaceted manuscript: the algorithm locates feature points in different entities, groups the edge points belonging to textual regions, and finally uses feature-based connected component integration to collect similar textual regions within their bounding rectangles. The separation of text and image regions of the manuscript can also be done using clustering techniques. Many algorithms, such as K-MEANS, CLARANS, DBSCAN [5], CURE [6], STING [7], DBRS [8], DENCLUE [9], and CLUGD [10], analyze large, high-dimensional data from different perspectives.
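As an illustration of the thresholding step mentioned above, the following is a minimal sketch of global, isodata-style binarization of a grayscale page; the iterative threshold update is a common heuristic and not the separation method proposed in this paper.

import numpy as np

def binarize(gray):
    """Iteratively pick a global threshold, then mark darker pixels as text/ink."""
    gray = np.asarray(gray, dtype=np.float64)
    t = gray.mean()
    for _ in range(50):
        dark, bright = gray[gray <= t], gray[gray > t]
        if dark.size == 0 or bright.size == 0:
            break
        new_t = 0.5 * (dark.mean() + bright.mean())   # isodata update
        if abs(new_t - t) < 0.5:
            break
        t = new_t
    return (gray <= t).astype(np.uint8)               # 1 = ink/text, 0 = background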

STING is a Statistical Information Grid approach to spatial data mining. Because it works on a grid, its efficiency is very high and it can deal with high-dimensional data. It is used to efficiently process many common region-oriented queries on a set of points, where a region is defined as a set of points satisfying some criterion. The idea is to capture the statistical information associated with spatial cells in such a way that whole classes of queries can be answered without referring to the individual objects. Density-based clustering methods are very helpful for locating clusters of any shape, provided the right parameters are given (which are, however, hard to choose). In general, the purpose of a clustering algorithm is to group the items of a database into a set of meaningful subsets. DBSCAN (Density Based Spatial Clustering of Applications with Noise) is a conventional and widely accepted density-based clustering technique. It can locate clusters of arbitrary shapes and sizes, yet it has difficulty when the density of the clusters varies widely. To resolve this limitation of DBSCAN, VDBSCAN was introduced for effective clustering analysis of datasets with varied densities: it selects an appropriate parameter for each density using the k-dist plot and applies the DBSCAN algorithm for each selected parameter. Compound manuscripts are document files that include several different types of data as well as text; a compound document may contain graphics, spreadsheets, images, or any other non-text data. A basic approach is to encode such images using typical state-of-the-art image encoders, such as SPIHT [11], JPEG [12], JPEG2000 [13], or H.264/AVC Intra [14], [15]. However, despite the competence of these algorithms for smooth, low-pass images, they are not capable of achieving an acceptable performance on non-smooth image regions, like the ones corresponding to text or graphics, frequently present in scanned documents. For smooth images, the majority of the transform coefficients representing the highest frequencies are of little significance, allowing their coarse quantization. This leads to a high compression ratio without affecting the perceptual quality of the reconstructed images. But when the image does not have a low-pass nature, the coarse quantization of these high-frequency coefficients leads to highly disturbing visual artifacts, such as ringing and blocking. An alternative is the use of encoding techniques specifically developed for text-like image coding, such as JBIG [16]. However, such algorithms tend to have severe restrictions when used to encode the smooth regions of compound manuscripts. One cause is the fact that text and graphics images typically require high spatial resolution to maintain the document's readability; in contrast, they do not need high color depth, since characters and other graphic elements generally assume only a few distinct colors over solid color backgrounds. With natural images, the opposite tends to occur.
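The following is a small sketch of the STING-style idea of keeping only per-cell statistics, so that region-oriented queries can be answered without revisiting individual points; the cell size and the "dense cells" query are arbitrary examples, not the structure used by STING itself.

import numpy as np

def build_grid_stats(points, cell_size):
    """Bucket 2-D points into square cells and keep only summary statistics per cell."""
    cells = {}
    for p in np.asarray(points, dtype=float):
        key = (int(p[0] // cell_size), int(p[1] // cell_size))
        cells.setdefault(key, []).append(p)
    return {k: {"count": len(v), "mean": np.mean(v, axis=0), "std": np.std(v, axis=0)}
            for k, v in cells.items()}

def dense_cells(stats, min_count):
    """Answer a region query ('which cells are dense?') from the stored statistics alone."""
    return [cell for cell, s in stats.items() if s["count"] >= min_count]

# usage: cells holding at least 20 points form the queried region
stats = build_grid_stats(np.random.rand(1000, 2) * 100, cell_size=10)
print(dense_cells(stats, min_count=20))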


Because of the high correlation among neighboring pixels, they usually do not need high spatial resolution in order to maintain a good subjective quality, but they frequently require high color depth. Thus, methods that are able to compress both smooth and text regions competently are of particular importance for multifaceted manuscript encoding, where smooth image regions coexist with text and graphics. Many algorithms, such as Digipaper [17], DjVu [18], and JPEG2000/Part 6 [19], among others [20], [21], have been proposed for compound manuscript compression. They adopt the MRC (Mixed Raster Content) model [22] to decompose the original image into a number of distinct layers. A background layer represents the document's smooth component, including natural image regions as well as the paper's texture. A foreground layer holds the data regarding the text and graphics colors, and one or more binary segmentation masks containing text and graphics shape information are used to merge the layers. The scanning process degrades the characters' crisp edges, causing their erroneous insertion into the background layer, which reduces the efficiency of the compression scheme. For this reason, it is not easy to apply standard compound document compression algorithms based on MRC decomposition in this type of application. An encoder for scanned compound manuscripts based on the multidimensional multiscale parser (MMP) uses approximate pattern matching with adaptive multiscale dictionaries [23] that include concatenations of scaled versions of previously encoded image blocks. These features give MMP the ability to adapt to the input image's characteristics, resulting in high coding efficiency for a broad range of image types. This adaptability makes MMP a good candidate for compound digital document encoding. Smooth and non-smooth blocks are then compressed with different MMP-based encoders [24], each suited to its type of block; the adaptive use of these two types of encoders results in performance gains over the standard MMP algorithm. This paper briefly explains image compression using clustering techniques. Section I gives an introduction to image compression and the clustering techniques used, and describes the organization of the paper. Section II provides an overview of the proposed algorithm. Section III gives the implementation details and the results of the proposed multifaceted manuscript compression using clustering techniques. The conclusions and future enhancements of the proposed algorithm are presented in Section IV.
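Returning to the MRC model outlined above, the following is a minimal sketch of a three-layer decomposition of a grayscale page, assuming a binary text mask is already available; the constant fill placed under the text in the background layer is purely illustrative and not the strategy of any particular codec.

import numpy as np

def mrc_decompose(gray, mask):
    """Split a page into MRC-style layers: background (smooth content), foreground
    (text/graphics colours) and the binary segmentation mask that merges them."""
    gray = np.asarray(gray, dtype=np.uint8)
    mask = np.asarray(mask, dtype=bool)
    foreground = np.where(mask, gray, 0)
    background = gray.copy()
    if (~mask).any():
        background[mask] = int(gray[~mask].mean())   # illustrative fill under the text
    return background, foreground, mask.astype(np.uint8)

def mrc_recompose(background, foreground, mask):
    """Decoder side: the mask selects foreground pixels over the background layer."""
    return np.where(mask.astype(bool), foreground, background)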

PROPOSED ALGORITHM

The proposed algorithm first classifies the manuscript into image and text regions. These regions are then compressed with different encoders, each adapted to its type of region. The adaptive use of different types of encoders, driven by clustering techniques, results in performance gains. Thresholding is the simplest method of separation: from a grayscale image, thresholding can be used to create binary images. The separation of text and image regions of the manuscript can be done using clustering techniques. Text regions are then compressed using the optimized grid-based clustering algorithm and image blocks are compressed using the optimized density-based algorithm. Because the grid method is used, the efficiency is very high and high-dimensional data can be managed. The method relies on the statistical information stored in the grid: the spatial area is divided into rectangular cells, and each cell at a higher level is partitioned into a number of cells at the next lower level, so that a cell at level i corresponds to the union of the areas of its children at level i + 1. The size of the leaf-level cells depends on the density of objects. The resulting representation is then encoded using the Huffman algorithm.
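A sketch of the hierarchical grid structure described above, in which each cell is the union of its four children at the next level and leaves stay large in sparse areas and small in dense ones; the splitting thresholds (max_points, min_size) are illustrative assumptions.

def split_cell(points, x0, y0, size, max_points=32, min_size=8):
    """Recursively partition the spatial area into rectangular cells; a cell at level i
    is the union of its four children at level i + 1, and a cell stops splitting
    (becomes a leaf) when it is sparse enough or reaches the minimum size."""
    node = {"origin": (x0, y0), "size": size, "count": len(points), "children": []}
    if len(points) <= max_points or size <= min_size:
        return node                                  # leaf: its size reflects local density
    half = size / 2.0
    for dx in (0, 1):
        for dy in (0, 1):
            cx, cy = x0 + dx * half, y0 + dy * half
            sub = [p for p in points
                   if cx <= p[0] < cx + half and cy <= p[1] < cy + half]
            node["children"].append(split_cell(sub, cx, cy, half, max_points, min_size))
    return node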


Huffman coding [25] is an entropy coding algorithm used for lossless data compression. It builds a variable-length code table for encoding the source symbols, where the table is derived from the estimated probability of occurrence of each possible value of the source symbol. A specific method is used to choose the representation for each symbol, resulting in a code that expresses the most common source symbols using shorter strings of bits than are used for less common source symbols. The technique works by creating a binary tree of nodes.
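A compact, self-contained sketch of Huffman code construction over an arbitrary symbol sequence (not the exact coder configuration used in the proposed scheme), showing how repeatedly merging the two least-frequent nodes of the binary tree yields shorter codewords for more frequent symbols.

import heapq
from collections import Counter

def huffman_code(symbols):
    """Build a Huffman code table by repeatedly merging the two least-frequent nodes."""
    freq = Counter(symbols)
    # heap entries: (frequency, tie-breaker, tree); a tree is a symbol or a (left, right) pair
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    if len(heap) == 1:                               # degenerate single-symbol input
        return {heap[0][2]: "0"}
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tie, (left, right)))
        tie += 1
    codes = {}
    def walk(tree, prefix=""):
        if isinstance(tree, tuple):                  # internal node: branch on 0/1
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:                                        # leaf: record the codeword
            codes[tree] = prefix
    walk(heap[0][2])
    return codes

# frequent symbols receive the shortest codewords
print(huffman_code("aaaaabbbcc"))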

Density-based clustering methods are very useful for locating clusters of any shape, given the right parameters. VDBSCAN was introduced for efficient clustering analysis of datasets with varied densities: it chooses an appropriate parameter for each density using the k-dist plot and applies the DBSCAN algorithm for each selected parameter. The DBSCAN algorithm follows a center-based approach, in which the density at a specific point of the dataset is estimated by counting the number of points within a specified radius, Eps, of that point (including the point itself). This center-based notion of density allows a point to be classified as a core point, a border point, or a noise (background) point. A point is a core point if the number of points within Eps, a user-specified parameter, exceeds a certain threshold, MinPts, which is also a user-specified parameter. Any two core points that lie within a distance Eps of one another are placed in the same cluster.

Likewise, any border point that is close enough to a core point is put in the same cluster as that core point, while noise points are discarded. The VDBSCAN algorithm [26] calculates and stores the k-dist value for each point of a varied-density dataset and partitions the k-dist plot. The number of density levels is then read from the k-dist plot, and a parameter Eps is chosen automatically for each density. Next, the dataset is scanned and the different densities are clustered using the corresponding Eps values; finally, the valid clusters corresponding to the varied densities are displayed. VDBSCAN therefore has two steps: selecting the Eps parameters and clustering in varied densities.
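The following is a sketch of these two VDBSCAN steps, assuming scikit-learn is available: Eps candidates are read off the sorted k-dist curve (here simply from its largest jumps, a simplification of the k-dist-plot analysis in [26]), and plain DBSCAN is then run once per density level on the points that are still unlabelled.

import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors

def vdbscan(X, k=4, n_levels=2, min_samples=4):
    """Varied-density clustering: one Eps per density level, densest level first."""
    X = np.asarray(X, dtype=float)
    # k-dist: distance from every point to its k-th nearest neighbour, sorted
    dists, _ = NearestNeighbors(n_neighbors=k + 1).fit(X).kneighbors(X)
    kdist = np.sort(dists[:, k])
    # choose Eps candidates at the largest jumps of the sorted k-dist curve
    if n_levels > 1:
        cuts = np.sort(np.argsort(np.diff(kdist))[-(n_levels - 1):])
        eps_values = [float(kdist[i]) for i in cuts] + [float(kdist[-1])]
    else:
        eps_values = [float(kdist[-1])]
    labels = np.full(len(X), -1)
    next_id = 0
    for eps in sorted(eps_values):
        todo = labels == -1                          # only cluster still-unlabelled points
        if eps <= 0 or todo.sum() < min_samples:
            continue
        sub = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X[todo])
        labels[todo] = np.where(sub >= 0, sub + next_id, -1)
        if sub.max() >= 0:
            next_id += int(sub.max()) + 1
    return labels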


For calculating k automatically in VDBSCAN [2], n subjective points are first selected from the input dataset, and the average distance from the subjective points to all other points is computed and then averaged over all of them. For every subjective point in the dataset a circle is drawn, centered at the point itself, whose radius is this average of all distances. For every circle, the target point closest to its circumference is determined, along with the position of this target point relative to the subjective point. The most frequently repeated position of the target points is then taken as the expected value of the parameter k in the k-dist plot [27] (a sketch of one reading of this heuristic is given at the end of this section).

Goals and Discussions

This work concentrates mainly on finding a good and effective method for multifaceted manuscript compression in order to obtain a better result. The paper proposes clustering techniques for the effective compression of compound manuscripts; the adaptive use of different types of encoders driven by clustering techniques results in performance gains.
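A minimal sketch of one possible reading of the k-selection heuristic described above; in particular, interpreting the "position" of the target point as its neighbour rank relative to the subjective point is an assumption on our part, not the definitive procedure of [2].

import numpy as np

def estimate_k(X, n_subjective=10, seed=0):
    """Estimate the k of the k-dist plot as the most frequent neighbour rank of the
    point lying closest to each subjective point's circle."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=min(n_subjective, len(X)), replace=False)
    dists = np.linalg.norm(X[idx][:, None, :] - X[None, :, :], axis=2)
    radius = dists.mean()                              # common radius of all circles
    positions = []
    for row in dists:
        order = np.argsort(row)                        # neighbour ranks for this subjective point
        target = int(np.argmin(np.abs(row - radius)))  # point nearest the circumference
        positions.append(int(np.where(order == target)[0][0]))
    values, counts = np.unique(positions, return_counts=True)
    return int(values[np.argmax(counts)])              # most repeated position ≈ k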

IMPLEMENTATION AND EXPERIMENTAL RESULTS

The algorithm is evaluated using several still grayscale multifaceted manuscript databases. The database for compound manuscript compression is constructed from JPEG2000 and the World Wide Web, and it includes compound manuscripts of arbitrary shapes and a large number of varying contexts (density regions). The algorithm first separates the document into image and text regions. These blocks are then compressed using different encoders, each adapted to its type of block; the adaptive use of different types of encoders driven by clustering techniques results in performance gains. Here, the text blocks are compressed using the optimized grid-based clustering algorithm and the image blocks are compressed using the optimized density-based algorithm. The algorithm successfully compressed a large number of multifaceted manuscripts of arbitrary shapes and densities. Further, it improves the quality of the output images by reducing the noise components, and it also enhances the compression ratio.

Output Snapshots

Figure 2: Input Manuscript

Figure 3: Text Separation


The database contains 121 images. After analyzing the algorithm on this database, the overall performance is found to be good, and all the algorithmic parameters demonstrate that the algorithm can successfully compress a wide range of scanned multifaceted manuscripts. Figure 4 shows the result of applying the algorithm to the manuscript of Figure 2.

Figure 4: Decompressed Manuscript

CONCLUSIONS

This paper presents the compression of grayscale multifaceted manuscripts using clustering techniques. The proposed compression technique is applied to compound manuscripts of arbitrary shapes and a large number of varying contexts (density regions). All experimental results demonstrated the effectiveness of the proposed technique: the algorithm successfully compressed a large number of compound documents. Further, the algorithm enhances the PSNR of the decompressed images as well as the compression ratio.

Future Enhancements

The above algorithm can be further extended to compress colored compound manuscripts. It would also be preferable to design a single encoder that compresses both the text and the image blocks of a compound manuscript efficiently, rather than using separate encoders.

REFERENCES

1. A. Said and A. Drukarev, "Simplified segmentation for compound image compression," in Proc. IEEE Int. Conf. Image Processing, 1999, vol. 1, pp. 229–233.

2. A. K. M. Rasheduzzaman Chowdhury, Md. Elias Mollah, and Md. Asikur Rahman, "An efficient method for subjectively choosing parameter 'k' automatically in VDBSCAN (Varied Density Based Spatial Clustering of Applications with Noise) algorithm," IEEE, vol. 1, 2010.

3. Rui Xu and Donald Wunsch II, "Survey of clustering algorithms," IEEE Trans. Neural Networks, vol. 16, no. 3, May 2005.

4. Q. Yuan and C. L. Tan, "Text extraction from gray scale document images using edge information."

5. M. Ester, H.-P. Kriegel, J. Sander, and Xiaowei Xu, "A density based algorithm for discovering clusters in large spatial databases with noise," in Proc. Knowledge Discovery and Data Mining, Portland, AAAI Press, pp. 226–231, 1996.

6. S. Guha, R. Rastogi, and K. Shim, "CURE: an efficient clustering algorithm for large databases," in L. M. Haas and A. Tiwary, eds., Proc. ACM SIGMOD Int. Conf. on Management of Data, Seattle: ACM Press, pp. 73–84, 1998.

7. W. Wang, J. Yang, and R. Muntz, "STING: a statistical information grid approach to spatial data mining," in Proc. VLDB '97, pp. 186–195, 1997.

8. Xin Wang and Howard J. Hamilton, "DBRS: a density-based spatial clustering method with random sampling," in Proc. 7th PAKDD, Seoul, Korea, pp. 563–575, 2003.

9. Alexander Hinneburg and Daniel A. Keim, "An efficient approach to clustering in large multimedia databases with noise," in Proc. 4th Int. Conf. on Knowledge Discovery and Data Mining (KDD '98), New York: AAAI Press, pp. 58–65, 1998.

10. Zhiwei Sun, Zheng Zhao, Hongmei Wang, Maode Ma, Lianfang Zhang, and Yantai Shu, "A fast clustering algorithm based on grid and density," IEEE, 2005.

11. A. Said and W. A. Pearlman, "A new fast and efficient image codec based on set partitioning in hierarchical trees," IEEE Trans. Circuits Syst. Video Technol., vol. 6, no. 3, pp. 243–250, Jun. 1996.

12. W. Pennebaker and J. Mitchell, JPEG: Still Image Data Compression Standard. New York: Van Nostrand, 1993.

13. D. S. Taubman and M. W. Marcellin, JPEG2000: Image Compression Fundamentals, Standards and Practice. Norwell, MA: Kluwer, 2001.

14. Joint Video Team (JVT) of ISO/IEC MPEG & ITU-T VCEG (ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6), Draft of Version 4 of H.264/AVC (ITU-T Recommendation H.264 and ISO/IEC 14496-10 (MPEG-4 Part 10) Advanced Video Coding), Mar. 2005.

15. D. Marpe, T. Wiegand, and S. Gordon, "H.264/MPEG4-AVC fidelity range extensions: tools, profiles, performance, and application areas," in Proc. IEEE Int. Conf. Image Processing, Sep. 2005, vol. 1, pp. 593–596.

16. W. Kou, Digital Image Compression: Algorithms and Standards. Norwell, MA: Kluwer, 1995.

17. D. Huttenlocher, P. Felzenszwalb, and W. Rucklidge, "Digipaper: a versatile color document image representation," in Proc. IEEE Int. Conf. Image Processing, Kobe, Japan, 1999, pp. 219–223.

18. P. Haffner, L. Bottou, P. G. Howard, and Y. LeCun, "DjVu: analyzing and compressing scanned documents for internet distribution," in Proc. Int. Conf. Document Analysis and Recognition, 1999, pp. 625–628.

19. ISO/IEC JTC 1/SC 29/WG 1 (ITU-T SG8), JPEG 2000 Part I Final Committee Draft Version 1.0, 2001.

20. A. Zaghetto and R. L. de Queiroz, "Segmentation-driven compound document coding based on H.264/AVC-Intra," IEEE Trans. Image Process., vol. 16, no. 7, pp. 1755–1760, Jul. 2007.

21. A. Zaghetto and R. L. de Queiroz, "Iterative pre- and post-processing for MRC layers of scanned documents," in Proc. IEEE Int. Conf. Image Processing, CA, Oct. 2008, pp. 1009–1012.

22. ITU-T Recommendation T.44, Mixed Raster Content (MRC), Study Group 8 Contribution, 1998.

23. N. M. M. Rodrigues, E. A. B. da Silva, M. B. de Carvalho, S. M. M. de Faria, and V. M. M. Silva, "On dictionary adaptation for recurrent pattern image coding," IEEE Trans. Image Process., vol. 17, no. 9, pp. 1640–1653, Sep. 2008.

24. Nelson C. Francisco, Nuno M. M. Rodrigues, Eduardo A. B. da Silva, Murilo Bresciani de Carvalho, Sérgio M. M. de Faria, and Vitor M. M. Silva, "Scanned compound document encoding using multiscale recurrent patterns," IEEE Trans. Image Processing, vol. 19, no. 10, Oct. 2010.

25. Andrei Broder and Michael Mitzenmacher, "Pattern-based compression of text images."

26. Peng Liu, Dong Zhou, and Naijun Wu, "Varied Density Based Spatial Clustering of Applications with Noise," IEEE, 2007.

27. Divya Mohan and Nisha Joseph, "Image compression using clustering techniques," TJPRC, Jan. 2013.
