
Development of a Self-tuning Document Classification Using Confidence Level

Abstract— Document classification is the task of assigning a document to one predefined category using a classifier trained from training data. Unfortunately, unless the classification results are inspected individually, the classifier is unable to reveal any information regarding classification accuracy or errors, and users will have difficulty assessing the accuracy of each classification and the quality of the classifier. Therefore, this study proposes a confidence-enhanced document classification mechanism (CEDC) that provides a classification confidence index (CI) to control category adjustment and classifier retraining, thereby efficiently and accurately classifying documents into their appropriate categories while simultaneously improving the consistency and accuracy of classification quality. Experimental results indicate that the classification accuracy of CEDC is superior to that of other algorithms, verifying the performance of CEDC. Through the confidence and similarity indices, CEDC maintains a stable error tolerance and avoids the frequent creation of new categories, which would lead to constant classifier retraining. At the same time, CEDC provides timely adjustments to categories in response to changes in the documents. The confidence index trend graph can facilitate user understanding of category changes, ensuring the quality of classification.

Index Terms—Document classification, Confidence index, Similarity index, Text classification

I. INTRODUCTION

Document classification is the task of assigning a document to one predefined category using a classifier trained from training data, and it can also be used to predict the class label of unknown records (Tan et al., 2006). Related classification methods have found wide application in areas such as knowledge management (Al-Obeidat et al., 2010), image annotation (Zhang et al., 2010), and bioinformatics (Li et al., 2010). Several classifier technologies have been proposed and proven to be effective, such as Kohonen's algorithm (Guerrero-Bote et al., 2002), neural networks (Manevitz & Yousef, 2007; Rajan et al., 2009), Bayesian networks (Denoyer & Gallinari, 2004), and rough sets (Miao et al., 2009). Despite the great necessity and vital importance of document classification in the fields of knowledge management and document retrieval, several issues still exist in classification research. First, document data increase exponentially over time. Without intervention, the chance of classification errors by trained classifiers increases due to the complexity and ambiguity of document contents. Although the classifier can be adjusted by user feedback (Shen & Zhai, 2005), even this is unable to effectively decrease classification errors because of the extensive labor and time required and the difficulty of ensuring the quality of the feedback. Furthermore, though relevant studies can accurately classify the majority of document contents, the risk of classification error still exists. Unfortunately, unless the classification results are inspected individually, the classifier is unable to reveal any information regarding classification accuracy or errors, and users will have difficulty assessing the accuracy of each classification and the quality of the classifier. Therefore, successive studies have proposed probability and confidence values to facilitate clearer user understanding of results.
These simplified methods rely on the probability provided by the classifier itself to help estimate confidence (Dashevskiy & Luo, 2009; Hastie & Tibshirani, 1998; Huang et al., 2006; Wu et al., 2004). However, many classifiers do not provide any probability estimates, making this approach difficult to execute effectively. Costnik (1990) proposed a Bayesian confidence estimation method. The obvious problem with this method, however, is that each dimension of the data in a Bayesian network is assumed to be linear and independent, while most document content implies non-linear relationships, making linear, independent assumptions difficult to substantiate. Platt (2000) improved the SVM algorithm by converting the results of SVM classification into posterior probabilities through a sigmoid function. However, no theory to date has verified that the posterior probability after document classification in an actual environment is identical to the sigmoid curve, and numerous situations do not conform to this assumption. Based on k-NN (k-nearest neighbor), Delany et al. (2005) proposed five different types of confidence indices, and combined these five indices into a single aggregated confidence measure used in spam filtering. Evaluation indices based on k-NN, however, can only be used on linearly separable data; for non-linear content, the results will be significantly discounted. Though each of the confidence estimation methods described above has its own advantages and disadvantages, they all tend to overlook the difference and variability of documents in real life. Well-trained classifiers are usually relatively accurate for current documents. However, with dramatic changes in document content or concept innovation over time, the accuracy of the classifier decreases. Moreover, the predefined categories may fail to cover the document collection completely, decreasing category and classifier quality. Therefore, an excellent document classification mechanism must not only estimate the confidence of document classification results, but must also assist in the inspection of classifier quality, thereby automatically improving the classifier or dynamically adjusting categories over time and by content. In this way, document classification can provide users with information after each classification to ensure the stability and accuracy of the classification mechanism. Therefore, this study proposes a confidence-enhanced document classification mechanism (CEDC) to provide a classification confidence index (CI) that controls category adjustment or classifier retraining, thereby efficiently and accurately classifying documents into their appropriate categories, while simultaneously improving the consistency and accuracy of classification quality. The rest of this paper is organized as follows: Section 2 describes the concept of the confidence index. Section 3 discusses document classification based on the confidence index. Section 4 is dedicated to experimental results, and the conclusions and future work are finally presented in Section 5.

II. THE CONCEPT OF CONFIDENCE INDEX

The confidence index (CI) is the degree of confidence in a classifier for classifying a document. If the CI for the classification result of a document is low, the result for that document is unreliable, and classification errors may occur. A CI can be obtained for the category of each document predicted by the classifier, allowing the system to test the classifier in real time and discover the existence of new categories. The system can then retrain the classifier and add, differentiate, or combine categories. In addition to resolving this issue, the CI can also make timely improvements to the classifier and the categories, thereby effectively increasing the accuracy of document classification. Unlike traditional classification accuracy, which assesses the reliability of the entire classifier and emphasizes the advantages and disadvantages of the whole mechanism, in a real-life environment no posterior probability of correct or erroneous classification exists to facilitate an assessment of classifier quality. Therefore, the CI not only assesses the reliability of each classification, but can also be used to evaluate changes in classifier quality, promptly improving CEDC and further increasing its accuracy and reliability. Specifically, CI evaluation is similar to the reliability index (RI) (Cheng et al., 2008). Like the RI, the CI uses two different classification models, each with its own probability distribution. The CI is the area of intersection between the two probability distributions (Fig. 1). The more similar the two probability distributions (and thereby the larger their area of intersection), the greater the CI value, and the more likely the two classification results are identical. However, the RI presupposes that virtual metrology follows a normal distribution, and an inherent difference exists between the categorical variables (categories) of the CI and the continuous variables (virtual metrology) of the RI.
Therefore, this study had to redefine the basic assumptions of the normal distribution. Suppose that document d_i (i = 1 … n) corresponds to a category c(d_i), and c(d_i) ∈ {C_j | C_j is category j, j = 1 … m}. If the probability that a document appears in category C_j is p_j, and the probability that it does not appear is 1 − p_j, then the probability distribution for category C_j is a binomial distribution (μ = np_j and σ = √(np_j(1 − p_j))), expressed as C_j ~ B(n, p_j). When n is large enough (np_j ≥ 5 and n(1 − p_j) ≥ 5), a normal distribution can be used to approximate the binomial distribution, expressed as C_j ~ N(np_j, np_j(1 − p_j)).

Based on the above assumptions, this study selected multinomial logistic regression (MLR) (Hosmer & Lemeshow, 2000; Kim et al., 2006) as one of the classification models. Compared with multiple regression (MR), MLR is more suitable for the prediction of categorical variables. Additionally, MLR is a natural extension of MR, with a simpler explanation of prediction results. This study also selected the back-propagation network (BPN) (Rumelhart et al., 1986) as the other classification model. While MLR is a linear classifier with consistent results for similar document classification, BPN is a non-linear classifier that is more adaptable to the classification of new documents. Consistent and adaptable, together these two classifiers can efficiently increase classification accuracy.

For the same category C_j with a sufficiently large n, the probability distribution as it appears for MLR can be expressed as C_{j,M} ~ N(np_{j,M}, np_{j,M}(1 − p_{j,M})), where p_{j,M} is the probability of C_j appearing in the MLR model (p_{j,M} = |{c_M(d_i) | c_M(d_i) = C_j}| / n; c_M(d_i) is the predicted category for data d_i using MLR). Conversely, the probability distribution as it appears for BPN can be expressed as C_{j,B} ~ N(np_{j,B}, np_{j,B}(1 − p_{j,B})), where p_{j,B} is the probability of C_j appearing in the BPN model (p_{j,B} = |{c_B(d_i) | c_B(d_i) = C_j}| / n; c_B(d_i) is the predicted category for data d_i using BPN). Therefore, two scenarios are possible for categories c_M(d_i) and c_B(d_i):

- Category c_M(d_i) = category c_B(d_i) (= C_j): the defined confidence index for d_i (CI_i) is the area of intersection between the probability distributions C_{j,M} and C_{j,B} (Eq. 1) (Fig. 1).

CI_i = 2 ∫_{n(p_{j,M}+p_{j,B})/2}^{∞} (1 / (√(2π) σ)) exp(−(1/2)((x − μ)/σ)²) dx    (1)

with μ = np_{j,M}, σ = √(np_{j,M}(1 − p_{j,M})) if p_{j,M} < p_{j,B}; μ = np_{j,B}, σ = √(np_{j,B}(1 − p_{j,B})) otherwise.

- Category c_M(d_i) (= C_j) ≠ category c_B(d_i) (= C_J, j ≠ J): the defined confidence index CI_{i,j} is the area of intersection between the probability distributions C_{j,M} and C_{j,B}, and the confidence index CI_{i,J} is the area of intersection between the probability distributions C_{J,M} and C_{J,B}. When CI_{i,j} approaches CI_{i,J}, the confidence index is high; the confidence index for d_i (CI_i) is therefore defined as the harmonic mean of CI_{i,j} and CI_{i,J}, reflecting this phenomenon.

After the confidence index for d_i (CI_i) has been measured and the confidence index threshold (CI_T) established, if CI_i surpasses the threshold CI_T, the classification confidence is high. When the confidence index is low, the classification results need to be examined further.

III. SYSTEM ARCHITECTURE

The CEDC presented in this study can be divided by purpose into concept extraction, category assignment, variant detection, and category adjustment modules (Fig. 2). The concept extraction module extracts the crucial concepts in the document and forms a vector of the features expressed in the document. The category assignment module classifies the document using the MLR and BPN classifiers and measures the classification confidence index. The variant detection module determines the need for category adjustment based on the confidence index. The category adjustment module adjusts the current categories and updates MLR and BPN. Each module is detailed below.
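As a numerical illustration of the Section II confidence index, the Gaussian tail integral of Eq. 1 and the harmonic-mean combination for the disagreement case can be sketched as follows (a minimal sketch with illustrative variable names, not the authors' code; `n`, `p_m`, and `p_b` denote the document count and the two models' appearance probabilities):

```python
import math

def normal_tail(t, mu, sigma):
    # P(X > t) for X ~ N(mu, sigma^2), via the complementary error function
    return 0.5 * math.erfc((t - mu) / (sigma * math.sqrt(2.0)))

def confidence_index(n, p_m, p_b):
    # Eq. 1: twice the tail area of the lower-mean normal approximation
    # beyond the midpoint n*(p_m + p_b)/2
    p_lo = min(p_m, p_b)          # Eq. 1 uses the distribution with the smaller mean
    mu = n * p_lo
    sigma = math.sqrt(n * p_lo * (1.0 - p_lo))
    midpoint = n * (p_m + p_b) / 2.0
    return 2.0 * normal_tail(midpoint, mu, sigma)

def harmonic_mean(ci_j, ci_J):
    # Disagreement case: CI_i is the harmonic mean of CI_{i,j} and CI_{i,J}
    return 2.0 * ci_j * ci_J / (ci_j + ci_J) if (ci_j + ci_J) > 0 else 0.0

# Identical appearance probabilities give full confidence.
print(confidence_index(1000, 0.10, 0.10))   # 1.0
# Diverging probabilities shrink the intersection area (roughly 0.02 here).
print(confidence_index(1000, 0.08, 0.12))
```

When the two models agree perfectly (p_m = p_b), the midpoint coincides with the mean and the intersection area is 1; as the probabilities diverge, the area shrinks toward 0.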


A. Concept Extraction

The concept extraction module extracts the crucial concepts in the document to form a document feature vector. This module uses a traditional vector space model (VSM) to express document features (Salton et al., 1975; Xu & Akella, 2010). Concepts were weighted using tf-idf (Eq. 2). Therefore, training document d_ai (i = 1 … n) can be expressed as d_ai = (x_i1, x_i2, …, x_ip), where w_ik is the weight of concept k in training document d_ai, and x_ik is the weight after standardization (Eq. 3).

w_ik = (f_ik / Σ_a f_ia) × log(n / n_k)    (2)

where
f_ik    number of occurrences of concept k in document d_ai
n    total number of documents
n_k    total number of documents in which concept k appears

x_ik = (w_ik − w̄_k) / Std(w_k)    (3)

where
w̄_k    average of the weights of concept k
Std(w_k)    standard deviation of the weights of concept k
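Equations 2 and 3 can be illustrated concretely with a small sketch that computes tf-idf weights and per-concept standardization for a toy corpus (the documents and concept names below are invented for illustration):

```python
import math

def tfidf_weights(docs):
    # docs: list of {concept: raw count} dicts
    # Eq. 2: w_ik = (f_ik / sum_a f_ia) * log(n / n_k)
    n = len(docs)
    df = {}                                   # n_k: documents containing concept k
    for d in docs:
        for k in d:
            df[k] = df.get(k, 0) + 1
    weights = []
    for d in docs:
        total = sum(d.values())
        weights.append({k: (f / total) * math.log(n / df[k]) for k, f in d.items()})
    return weights

def standardize(weights):
    # Eq. 3: x_ik = (w_ik - mean_k) / std_k, computed per concept
    concepts = {k for w in weights for k in w}
    stats = {}
    for k in concepts:
        vals = [w.get(k, 0.0) for w in weights]
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        stats[k] = (mean, math.sqrt(var) or 1.0)  # guard against zero std
    return [{k: (w.get(k, 0.0) - stats[k][0]) / stats[k][1] for k in concepts}
            for w in weights]

docs = [{"tax": 3, "law": 1}, {"law": 2, "sport": 2}, {"sport": 4}]
x = standardize(tfidf_weights(docs))
```

After standardization, each concept's weights across the corpus have zero mean, which is the property Eq. 3 is designed to provide.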

B. Category Assignment

The category assignment module constructs the document classifiers and estimates their confidence index (CI). The primary classifiers in this module are MLR and BPN. Assume that document d_i is written as d_i = (x_i1, x_i2, …, x_ip) and its category as C_j (j = 1 … m).

- For the MLR classifier

In a multinomial logit model with m different categories, (m − 1) logits can be written as (Eq. 4):

ln[P(y = C_1 | d_i) / P(y = C_m | d_i)] = β_10 + Σ_a β_1a x_ia
ln[P(y = C_2 | d_i) / P(y = C_m | d_i)] = β_20 + Σ_a β_2a x_ia
⋯
ln[P(y = C_{m−1} | d_i) / P(y = C_m | d_i)] = β_{(m−1)0} + Σ_a β_{(m−1)a} x_ia    (4)

where the final category C_m is the reference category. Using maximum likelihood estimation (MLE) to estimate the approximate value of [β_ja]_{(m−1)×p}, [β̂_ja]_{(m−1)×p} is the parameter estimate. Therefore, using the following series of functions to calculate the probability of each category C_j (Eq. 5), the category for document d_i is arg max_j P(y = C_j | d_i).

P(y = C_1 | d_i) = exp(β̂_10 + Σ_a β̂_1a x_ia) / (1 + Σ_{j=1}^{m−1} exp(β̂_j0 + Σ_a β̂_ja x_ia))
⋯
P(y = C_{m−1} | d_i) = exp(β̂_{(m−1)0} + Σ_a β̂_{(m−1)a} x_ia) / (1 + Σ_{j=1}^{m−1} exp(β̂_j0 + Σ_a β̂_ja x_ia))
P(y = C_m | d_i) = 1 / (1 + Σ_{j=1}^{m−1} exp(β̂_j0 + Σ_a β̂_ja x_ia))    (5)
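Given fitted coefficients, Eq. 5 can be evaluated directly. A small sketch with invented coefficients for a hypothetical three-category, two-feature model (the coefficient values are illustrative, not fitted from data):

```python
import math

def mlr_probabilities(x, betas):
    # Eq. 5: category probabilities for document vector x.
    # betas: (m-1) rows of [beta_j0, beta_j1, ..., beta_jp];
    # the m-th category is the reference with implicit score 0.
    scores = [b[0] + sum(bj * xi for bj, xi in zip(b[1:], x)) for b in betas]
    denom = 1.0 + sum(math.exp(s) for s in scores)
    probs = [math.exp(s) / denom for s in scores]
    probs.append(1.0 / denom)                 # reference category C_m
    return probs

def classify(x, betas):
    # Predicted category index: arg max_j P(y = C_j | d_i)
    p = mlr_probabilities(x, betas)
    return max(range(len(p)), key=p.__getitem__)

betas = [[0.2, 1.5, -0.5],    # logit of C_1 vs the reference C_3
         [-0.1, -1.0, 0.8]]   # logit of C_2 vs the reference C_3
p = mlr_probabilities([1.0, 0.5], betas)
```

The probabilities always sum to 1, and the reference category's probability is recovered as the residual mass.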

- For the BPN classifier

This study used the sigmoid function as the non-linear function to deduce the predictive value. Therefore, the predictive value for the category of document d_i (Y_i) can be determined from Eq. 6, in which the number of hidden layers is measured from the geometric mean of the input and output layers.

Y_i = 1 / (1 + exp(Σ_h W_{HY,hi} · H_h − θ_Y))    (6)

where
W_HY    the weight matrix between the hidden layer and the output layer
H    the vector of the hidden layer
θ_Y    the bias of the output layer

During the training process, the weight matrix W_HY and bias θ_Y must be revised using the gap between the predicted category value and the target category value (Eq. 7). Minimizing this gap improves the learning quality of the BPN.

W_{HY,hi} = W_{HY,hi} + γ δ_i H_h
θ_Y = θ_Y + (−γ δ_i)
δ_i = Y_i (1 − Y_i)(c(d_i) − Y_i)    (7)

where
γ    learning rate
δ_i    the back-propagated error
c(d_i)    the category of training document d_i, i.e., the target value
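Equations 6 and 7 describe a single sigmoid output unit with a delta-rule update. The sketch below is illustrative, not the authors' implementation; note that Eq. 6 as printed omits the usual negative sign in the exponent, so the sketch uses the conventional sigmoid 1/(1 + e^(−z)) so that the Eq. 7 update actually reduces the error:

```python
import math

def bpn_output(h, w_hy, theta_y):
    # Y = sigmoid(sum_h W_HY[h] * H[h] - theta_Y); conventional form of Eq. 6
    z = sum(w * a for w, a in zip(w_hy, h)) - theta_y
    return 1.0 / (1.0 + math.exp(-z))

def bpn_update(h, w_hy, theta_y, target, gamma=0.5):
    # Eq. 7: delta = Y(1 - Y)(c(d_i) - Y); W += gamma*delta*H; theta -= gamma*delta
    y = bpn_output(h, w_hy, theta_y)
    delta = y * (1.0 - y) * (target - y)
    w_new = [w + gamma * delta * a for w, a in zip(w_hy, h)]
    return w_new, theta_y - gamma * delta, y

# Repeated Eq. 7 steps drive the output toward the target value.
w, theta = [0.5, -0.4], 0.1
for _ in range(200):
    w, theta, y = bpn_update([0.3, 0.9], w, theta, target=1.0)
```

Each step moves the weights in proportion to the back-propagated error δ_i, so the output converges toward the target as the gap shrinks.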

Once MLR and BPN have been constructed, the category of a new document d_bi can be predicted by both classifiers, c_M(d_bi) and c_B(d_bi), and the confidence index CI_i of the new document can be estimated using the method outlined in Section II.

C. Variant Detection

The variant detection module uses the confidence index evaluation as the primary reference to detect new categories. As described above, the classification result is more reliable when the confidence index surpasses the threshold value, and less reliable when it fails to do so. However, the difference between the MLR and BPN classifiers to which the confidence index responds does not necessarily represent a classification error. If new document d_bi and the trained document collection D = {d_ai} are relatively similar, the historical documents act as support, indirectly signifying that the classification result is reliable. Therefore, this study proposed a similarity index (SI) measuring the similarity between the new document d_bi and the trained document collection D (SI_i = Similarity(d_bi, D)). This auxiliary index can then be used together with the confidence index to estimate the accuracy of classification. The determinant rules of the variant detection module, for a given confidence index threshold CI_T (determined by the mean difference ratio D_T) and similarity threshold SI_T (manually set), are as follows:

- If CI_i ≥ CI_T and SI_i ≥ SI_T: accept the BPN classifier result c_B(d_bi).
- If CI_i ≥ CI_T and SI_i < SI_T: though the documents are not highly similar, the predicted confidence is high, and the MLR predictions are relatively consistent, so this study continued to use the MLR classifier result c_M(d_bi).
- If CI_i < CI_T and SI_i ≥ SI_T: though the confidence is low, the documents are highly similar, and the BPN classifier is better able to predict subtle differences among similar documents, so the BPN result c_B(d_bi) is used.
- If CI_i < CI_T and SI_i < SI_T: determining whether a new category has been produced requires category adjustment.

The procedure for the category assignment and variant detection modules is compiled in Fig. 3. The collection of trained documents (D = {d_ai}) is used to construct MLR and BPN, and each new document d_bi is classified through MLR and BPN before the confidence index CI_i and similarity index SI_i are calculated. Finally, these two indices determine whether to maintain or adjust the current categories.
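The four determinant rules above can be expressed as a single decision function (a sketch; the threshold values are assumed to come from the training data as described in Section IV):

```python
def variant_decision(ci, si, ci_t, si_t):
    # Apply the four variant-detection rules; returns which classifier's
    # label to keep, or 'new_category' when both indices fall below threshold.
    if ci >= ci_t and si >= si_t:
        return "BPN"            # reliable and supported by similar history
    if ci >= ci_t and si < si_t:
        return "MLR"            # consistent linear prediction despite low similarity
    if ci < ci_t and si >= si_t:
        return "BPN"            # BPN captures subtle differences among similar docs
    return "new_category"       # low confidence and low similarity: adjust categories
```

Only the fourth case triggers the category adjustment module; the other three keep one of the two classifiers' labels.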


D. Category Adjustment

For a new document d_b, the aim of the variant detection module is to determine when a new category has been created, and to place the new document in the new category C_{m+1} (m is the original number of categories). The adjustment concept is as follows:

(a) If the new document d_b is similar to existing documents d_ai, the documents d_ai can also belong to the new category C_{m+1} (Eq. 8).

If Similarity(d_b, d_ai) ≥ Threshold Then d_b and d_ai are highly similar.    (8)

(b) If the documents d_ai are dissimilar from their category c(d_ai), then the documents d_ai can be removed from the attributed category c(d_ai) (Eq. 9).

If Similarity(d_ai, c(d_ai)) < μ_s − 2σ_s Then d_ai and c(d_ai) are not similar.    (9)

where
μ_s    mean of the similarities for documents in the same category
σ_s    standard deviation of the similarities for documents in the same category
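Concepts (a) and (b) can be combined into a single predicate deciding whether a trained document should move to the new category C_{m+1} (a sketch; the similarity values are assumed to be precomputed):

```python
def should_move(sim_to_new, sim_to_own_cat, threshold, mu_s, sigma_s):
    # Eq. 8: the trained document is highly similar to the new document d_b
    highly_similar = sim_to_new >= threshold
    # Eq. 9: the trained document is dissimilar from its current category
    dissimilar_to_own = sim_to_own_cat < mu_s - 2.0 * sigma_s
    return highly_similar and dissimilar_to_own
```

A document moves only when both conditions hold, which keeps well-placed documents in their original categories.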

In summary, when the current documents d_ai are highly similar to the new document d_b and satisfy concept (b), the documents d_ai will be attributed to the new category C_{m+1} and removed from their original category c(d_ai). Finally, the MLR and BPN classifiers will be retrained to accommodate the changes in the new categories.

IV. EXPERIMENTAL RESULTS AND ANALYSIS

This study designed an experiment to demonstrate the performance of the CEDC and to observe its characteristics, including: (a) the influence of parameters on classification results; (b) the difference between the method proposed in this study and other classification methods; and (c) the trends and changes of the confidence index and the similarity index (Fig. 4). Accuracy was used as the metric in the evaluation of classification results (Eq. 10). The experimental data source is the letter dataset from Hsu and Lin (2002), including 15,000 pieces of training data and 5,000 pieces of test data in 26 categories. The training data and test data were divided into 10 folds according to the cross-validation method. This objective measurement of experimental results was used to verify the performance of CEDC.

Accuracy = (# of correctly classified documents) / (# of documents)    (10)

A. The influence and experimental results of parameters

Because MLR and BPN are the core algorithms of CEDC, and MLR does not require specific parameters during the estimation process, the number of hidden layers in BPN is what influences the quality of training. The aim of this experiment was to set different numbers of hidden layers (the parameter) to select the optimal BPN classifier, and to inspect the quality of MLR and BPN classification. The number of hidden layers was set between 8 and 13. The results of the experiment are shown in Fig. 5. As the average classification accuracy for the training data shows, BPN12 (12 hidden layers) was the most accurate (93.96 %), followed by BPN13 (93.80 %). Using the average classification accuracy for the test data, however, BPN13 was the most accurate (93.32 %), followed by BPN12 (93.02 %). From the overall average classification accuracy, BPN13 was the most accurate (93.56 %), followed by BPN12 (93.49 %). To verify the degree of confidence in these results, this study used a paired-sample t test to test the difference in accuracies between the classifiers. The test results are shown in Table 1. At a confidence level of 95 %, BPN12, BPN13, and MLR were clearly superior to BPN8 and BPN9, but there was no significant difference among the three classifiers. Therefore, while this study lacked sufficient confidence to verify that the respective accuracies of BPN12, BPN13, and MLR differed significantly, the reference value of the test data was higher than that of the training data, so a parameter of 13 hidden layers was selected for the subsequent analysis of experimental data.

B. Comparison between CEDC and others

CEDC determines the existence of new categories using the thresholds of the confidence index (CI_T) and the similarity

index (SI_T), and retrains the classifier accordingly. Therefore, this experiment compared the overall classification accuracy of CEDC with that of other methods (MLR and BPN), further contrasting the three classifiers under different scenarios. The experimental procedure was as follows:

(1) Train the classifiers using the training data.
(2) Calculate the confidence index CI and similarity index SI for each training datum.
(3) Set the confidence index threshold CI_T as the average confidence index CI minus two standard deviations.
(4) Set the similarity index threshold SI_T as the average similarity index SI minus two standard deviations.
(5) Evaluate the classification accuracy using the test data.

The results for the entire experiment are shown in Table 2. Using the overall average classification accuracy, CEDC had the best accuracy (92.96 %), followed by BPN (92.87 %) and MLR (92.68 %). For the different scenarios: in Scenario 1, CEDC and BPN were the most accurate (92.96 %) and MLR was the least accurate (93.95 %); in Scenario 2, CEDC and MLR were the most accurate (69.37 %) and BPN was the least accurate (67.94 %); and in Scenario 3, the three classifiers were identical in terms of accuracy. No significant difference was found using a one-way ANOVA to test the overall average classification accuracy for all three classifiers (Table 3), and a paired-sample t test did not reveal a significant difference in either Scenario 1 or Scenario 2 (Table 4). Despite this, CEDC was superior to each of the other classifiers in terms of average accuracy. Additionally, Scenarios 1 and 2 were consistent with the determination rules of the variant detection module, but Scenario 3 was not. A possible reason for this is that while the confidence index was relatively low (CI < CI_T), the results of MLR and BPN classification were still consistent for documents with high similarity (SI ≥ SI_T), creating identical classification accuracies. Furthermore, the respective accuracies of all three classifiers in Scenario 2 were significantly lower than their accuracies in either Scenario 1 or 3. This phenomenon is perhaps due to the excessively low similarity between documents (SI < SI_T), distorting the classification predictions. However, CEDC has an error tolerance (CI ≥ CI_T), so it is unable to display an immediate prediction-error warning.
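Steps (3) and (4) above set each threshold to the index's training-set mean minus two standard deviations; a minimal sketch:

```python
import math

def index_threshold(values):
    # Threshold = mean - 2 * (population) standard deviation, as in steps (3)-(4)
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return mean - 2.0 * std

# Hypothetical per-document index values from the training data.
t = index_threshold([0.9, 0.8, 1.0, 0.7, 0.6])
```

Under a roughly normal distribution of index values, this places about 97.7 % of the training data above the threshold, so only unusually low indices trigger further handling.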

The trends of Confidence Index (CI) and Similarity Index (SI) This study proposed a similarity index (SI) in the variant detection module to evaluate document similarity. However, it

was still unclear which similarity evaluation function should be used. Therefore, the purpose of this experiment was two-fold: (1) to select a similarity evaluation function for the similarity index SI from among the cosine, correlation, and Mahalanobis coefficients; and (2) to analyze the trends in the confidence index (CI) and similarity index (SI) using two different types of test data to observe CEDC characteristics. The experimental procedure was as follows:

(1) Construct the classifiers using training data.

(2) Stage 1 Experiment: Predict the categories using test data 1 (categories identical to the training data), and calculate the CI and SI.

(3) Stage 2 Experiment: Predict the categories using test data 2 (categories different from the training data), and calculate the CI and SI.
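The determination rules that these experiments exercise combine CI and SI against their thresholds. The following sketch is a minimal illustration: only the both-below-threshold outcome (creating a new category) is stated explicitly in the text, so the labels on the other branches are assumptions inferred from the scenario discussion.

```python
def variant_detection(ci, si, ci_t, si_t):
    """Hedged sketch of CEDC's variant-detection outcomes.

    ci, si: the confidence and similarity indices for one document.
    ci_t, si_t: the corresponding thresholds.
    """
    if ci >= ci_t and si >= si_t:
        return "accept"            # Scenario 1: confident and similar
    if ci >= ci_t and si < si_t:
        return "accept-with-risk"  # Scenario 2: error tolerance absorbs low similarity
    if ci < ci_t and si >= si_t:
        return "accept"            # Scenario 3: classifiers still agree on similar documents
    return "new-category"          # Both below threshold: adjust categories / retrain
```

In this reading, only the last branch triggers category adjustment, which matches the paper's claim that CEDC avoids frequent production of new categories.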

● Stage 1 Experiment

Because CI and SI are based on different evaluation concepts, if the two were correlated, one index could stand in for the other. Therefore, this study hoped that CI and SI would not correlate, so that each could exert its own effect. This study used the Pearson product-moment correlation coefficient to test for correlation between CI and each candidate SI (the cosine, correlation, and Mahalanobis coefficients); the results are shown in Table 5. The correlations between CI and the SIs approached 0, and only the Mahalanobis coefficient achieved significance, with a correlation of -0.096. The cosine coefficient had a correlation of 0.000, making it the similarity evaluation function nearest to that desired by this study. Therefore, this study used the cosine coefficient as the SI.
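The cosine coefficient selected here, and the Pearson test used to compare the candidates, can be sketched in plain Python. This is a minimal illustration under the usual textbook definitions, not the paper's implementation.

```python
import math

def cosine_si(u, v):
    """Cosine coefficient between two term-weight vectors (the SI chosen above)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def pearson(x, y):
    """Pearson product-moment correlation, as used to test CI against each SI."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

A correlation near zero between the CI series and the SI series, as reported for the cosine coefficient in Table 5, indicates the two indices carry independent information.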

● Stage 2 Experiment

A line chart was drawn using the CI and SI obtained from the different test data (Fig. 6). The trend for test data set 1 used data pieces 1 to 986, and the trend for test data set 2 used the remaining data pieces. Overall, once the new categories appeared, CI gradually decreased with increasing oscillation amplitude, indicating an unstable situation. For SI, due to the characteristics of the new categories, all documents had low similarity after the new categories were created. Observing data pieces 950 to 1100 more closely (the intersection between the original and new categories) (Fig. 7), CI first dropped below its threshold at datum 1013, while SI first dropped below its threshold at datum 991. As this sequence shows, SI is significantly more sensitive to new document categories, as it required only 5 data points to recognize the change in category. Although CI cycled through 27 data points before finally discovering the existence of a new category, CI began to decrease gradually from datum 991. CI was not slow to detect the new category; rather, it has an error tolerance, which conservatively accepts classification results and alters itself gradually until it surpasses the threshold. Datum 1014 was the point at which CI and SI had simultaneously dropped below their respective thresholds, and is when CEDC actually decided to produce a new category. Because CEDC could both quickly detect a new category and adjust existing categories, the document classification reliability of CEDC itself did not decrease over time.
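The detection behaviour described here (SI reacting first, CEDC acting only once both indices have fallen below their thresholds) can be sketched as a scan over the two index streams. The series below are illustrative placeholders, not the experimental data.

```python
def first_joint_drop(ci_series, si_series, ci_t, si_t):
    """Return the first 1-based datum index at which CI and SI are both below
    their thresholds, i.e. the point where CEDC would create a new category.
    Returns None if the condition never occurs."""
    for i, (ci, si) in enumerate(zip(ci_series, si_series), start=1):
        if ci < ci_t and si < si_t:
            return i
    return None

# Illustrative streams: SI drops at datum 3, CI only at datum 5,
# so the joint condition is first met at datum 5.
ci = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6]
si = [0.9, 0.9, 0.5, 0.5, 0.5, 0.5]
print(first_joint_drop(ci, si, ci_t=0.8, si_t=0.8))  # → 5
```

This mirrors the reported sequence: SI crossed its threshold at datum 991, CI at datum 1013, and the new category was produced at datum 1014, when both were below threshold.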

V. CONCLUSIONS AND FUTURE WORK

This study proposes a confidence-enhanced document classification mechanism (CEDC), which uses both MLR and BPN classifiers to measure a confidence index, and includes a similarity index to evaluate the confidence of the classification results and the reliability of the mechanism itself. Experimental results indicate that the classification accuracy of CEDC is superior to using only MLR or BPN, thereby verifying the performance of CEDC. Through the confidence and similarity indices, CEDC is able to maintain a stable error tolerance and avoid the frequent production of new categories, which would lead to constant classifier retraining. At the same time, CEDC is able to adjust categories in a timely manner by responding to changes in the documents. The confidence index trend graph can facilitate user understanding of category changes, ensuring the quality of classification. CEDC has difficulty only in accurately predicting the categories of documents with low similarity, and a structure that depends on similarity for category adjustment has difficulty ensuring homogeneity between files in a new category. These two deficiencies in CEDC represent potential directions for future research.

REFERENCES

Al-Obeidat, F., Belacel, N., Carretero, J. A., & Mahanti, P. (2010). Differential Evolution for learning the classification method PROAFTN. Knowledge-Based Systems, 23(5), 418-426.

Cestinik, B. (1990). Estimating probabilities: A crucial task in machine learning. Proceedings of the European Conference on Artificial Intelligence 90. Stockholm, Sweden.

Cheng, F.-T., Chen, Y.-T., Su, Y.-C., & Zeng, D.-L. (2008). Evaluating reliance level of a virtual metrology system. IEEE Transactions on Semiconductor Manufacturing, 21(1), 92-103.

Dashevskiy, M., & Luo, Z.-Y. (2009). Predictions with confidence in applications. In P. Perner (Ed.), Machine Learning and Data Mining in Pattern Recognition, Lecture Notes in Computer Science (Vol. 5632, pp. 775-786). Springer Berlin / Heidelberg.

Delany, S. J., Cunningham, P., Doyle, D., & Zamolotskikh, A. (2005). Generating estimates of classification confidence for a case-based spam filter. In H. Muñoz-Avila & F. Ricci (Eds.), Case-Based Reasoning Research and Development, Lecture Notes in Computer Science (Vol. 3620, pp. 177-190). Springer Berlin / Heidelberg.

Denoyer, L., & Gallinari, P. (2004). Bayesian network model for semi-structured document classification. Information Processing & Management, 40(5), 807-827.

Guerrero-Bote, V., Moya-Anegón, F., & Herrero-Solana, V. (2002). Document organization using Kohonen's algorithm. Information Processing & Management, 38(1), 79-89.

Hastie, T., & Tibshirani, R. (1998). Classification by pairwise coupling. The Annals of Statistics, 26(2), 451-471.

Hosmer, D. W., & Lemeshow, S. (2000). Applied logistic regression. John Wiley and Sons.

Hsu, C.-W., & Lin, C.-J. (2002). A comparison of methods for multiclass support vector machines. IEEE Transactions on Neural Networks, 13(2), 415-425.

Huang, T.-K., Weng, R. C., & Lin, C.-J. (2006). Generalized Bradley-Terry models and multi-class probability estimates. The Journal of Machine Learning Research, 7, 85-115.

Kim, Y., Kwon, S., & Heun Song, S. (2006). Multiclass sparse logistic regression for classification of multiple cancer types using gene expression data. Computational Statistics & Data Analysis, 51(3), 1643-1655.

Li, B., Zheng, C.-H., Huang, D.-S., Zhang, L., & Han, K. (2010). Gene expression data classification using locally linear discriminant embedding. Computers in Biology and Medicine, 40(10), 802-810.

Manevitz, L., & Yousef, M. (2007). One-class document classification via neural networks. Neurocomputing, 70(7-9), 1466-1481.

Miao, D., Duan, Q., Zhang, H., & Jiao, N. (2009). Rough set based hybrid algorithm for text classification. Expert Systems with Applications, 36(5), 9168-9174.

Platt, J. C. (2000). Probabilistic outputs for support vector machines and comparison to regularized likelihood methods. Advances in Large Margin Classifiers (pp. 61-74). Cambridge, MA: MIT Press.

Rajan, K., Ramalingam, V., Ganesan, M., Palanivel, S., & Palaniappan, B. (2009). Automatic classification of Tamil documents using vector space model and artificial neural network. Expert Systems with Applications, 36(8), 10914-10918.

Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1986). Learning internal representations by error propagation. Parallel distributed processing: explorations in the microstructure of cognition, vol. 1: foundations (pp. 318-362). MIT Press.

Salton, G., Wong, A., & Yang, C. S. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613-620.

Shen, X.-H., & Zhai, C.-X. (2005). Active feedback in ad hoc information retrieval. Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval, SIGIR '05 (pp. 59-66). New York, USA: ACM.

Tan, P.-N., Steinbach, M., & Kumar, V. (2006). Introduction to data mining. Pearson Addison Wesley.

Wu, T.-F., Lin, C.-J., & Weng, R. C. (2004). Probability estimates for multi-class classification by pairwise coupling. The Journal of Machine Learning Research, 5, 975-1005.

Xu, Z., & Akella, R. (2010). Improving probabilistic information retrieval by modeling burstiness of words. Information Processing & Management, 46(2), 143-158.

Zhang, S., Li, B., & Xue, X. (2010). Semi-automatic dynamic auxiliary-tag-aided image annotation. Pattern Recognition, 43(2), 470-477.



FIGURES

Fig. 1 The definition of CI on probability distributions of Cj,M and Cj,B

Fig. 2 System architecture of confidence-enhanced document classification



[Flowchart omitted. Recoverable structure: training content da_i = (xi1, xi2, ..., xip) feeds MLR training and BPN training; new content db_i = (xi1, xi2, ..., xip) feeds computation of cM(di) and cB(di), then computation of CIi and SIi, then variation detection and category assignment.]

Fig. 3 The procedure of category assignment and variant detection module

Fig. 4 Experiment design



Fig. 5 Primitive comparison between classifiers on different parameters

Fig. 6 The trends of CI and SI

Fig. 7 The trends of CI and SI (range: 950~1100)



TABLES

Table 1 The results of the paired-sample t test between classifiers with different parameters

Paired-sample | Mean      | Std. Deviation | Std. Error Mean | t         | df | Sig. (2-tailed)
*MLR - BPN8   | 0.008360  | 0.017326       | 0.003874        | 2.157902  | 19 | 0.043945
*MLR - BPN9   | 0.008005  | 0.015609       | 0.003490        | 2.293535  | 19 | 0.033392
MLR - BPN10   | -0.002615 | 0.012702       | 0.002840        | -0.920712 | 19 | 0.368744
MLR - BPN11   | 0.001340  | 0.016881       | 0.003775        | 0.354989  | 19 | 0.726508
MLR - BPN12   | -0.006080 | 0.018264       | 0.004084        | -1.488738 | 19 | 0.152968
MLR - BPN13   | -0.006780 | 0.020274       | 0.004533        | -1.495560 | 19 | 0.151194
BPN10 - BPN11 | 0.003955  | 0.018646       | 0.004169        | 0.948570  | 19 | 0.354749
BPN10 - BPN12 | -0.003465 | 0.015389       | 0.003441        | -1.006963 | 19 | 0.326605
BPN10 - BPN13 | -0.004165 | 0.016789       | 0.003754        | -1.109476 | 19 | 0.281071
BPN12 - BPN13 | -0.000700 | 0.017221       | 0.003851        | -0.181779 | 19 | 0.857682

* Sig. (2-tailed) < 0.05

Table 2 The average accuracy of MLR, BPN and CEDC

Status  | Description             | MLR    | BPN    | CEDC   | Meets the assumption
1       | CI ≥ CI_T and SI ≥ SI_T | 93.95% | 94.25% | 94.25% | Y
2       | CI ≥ CI_T and SI < SI_T | 69.37% | 67.94% | 69.37% | Y
3       | CI < CI_T and SI ≥ SI_T | 94.93% | 94.93% | 94.93% | N
Overall |                         | 92.68% | 92.87% | 92.96% | Y

Table 3 The results of one-way ANOVA between MLR, BPN and CEDC

               | Sum of Squares | df  | Mean Square | F        | Sig.
Between Groups | 0.000          | 2   | 0.000105    | 0.167018 | 0.846345
Within Groups  | 0.093          | 147 | 0.000631    |          |
Total          | 0.093          | 149 |             |          |

Table 4 The results of the paired-sample t test between MLR, BPN and CEDC

Status | Paired-sample | Mean      | Std. Deviation | Std. Error Mean | t         | df | Sig. (2-tailed)
1      | BPN - CEDC    | -0.000886 | 0.004243       | 0.000600        | -1.476215 | 49 | 0.146
2      | MLR - CEDC    | -0.002922 | 0.014734       | 0.002084        | -1.402351 | 49 | 0.167



Table 5 The Pearson's correlations between CI and the SIs

CI                    | SI (Cosine) | SI (Correlation) | SI (Mahalanobis)
Pearson's correlation | 0.000       | 0.030            | 0.096
Sig. (2-tailed)       | 1.000       | 0.300            | 0.002


2012. Development of a Self-tuning Document Classification Using Confidence Level.