
Int. J. on Recent Trends in Engineering and Technology, Vol. 10, No. 1, Jan 2014

Brill's Rule-based Part of Speech Tagger for Kadazan
Marylyn Alex and Lailatul Qadri Zakaria
CAIT Research Group, Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi, Selangor
Email: alexmarylyn@gmail.com, laila@ftsm.ukm.my

Abstract: This paper presents a Part of Speech (POS) tagger for the Kadazan language that implements Brill's approach, also known as Transformation-Based Error-Driven Learning. Kadazan was chosen because no POS tagger has yet been developed for it. This study therefore develops a POS tagger for Kadazan that can tag a Kadazan corpus systematically, help reduce ambiguity and serve as a language learning tool. The main objective is to automate the tagging process for the Kadazan language. Brill's approach is an enhanced version of the original rule-based approach: it uses a set of predefined rules to transform wrong tags in the corpus into correct tags. To achieve the main goal, two objectives were set: to create specific lexical and contextual rules for Kadazan following Brill's approach, and to evaluate the effectiveness of Kadazan POS tagging with this approach. The tagging process is divided into four main phases. In the first phase, a new untagged text is input into the system. In the second phase, the input text goes through the initial state annotater, which tags every word in the corpus with its most likely tag and produces a temporary corpus. In the third phase, the temporary corpus is compared with the goal corpus to detect any errors. In the last phase, the rules are applied to reduce the errors and fix the temporary corpus. The tagger has been trained using two Kadazan children's story books which contain 2069 words. Evaluation is done by comparing the tagging results of Brill's approach with manual tagging. The Kadazan POS tagger achieved around 93 % accuracy. This study shows how Brill's tagging approach can be used to identify tags for the Kadazan language.

Index Terms: Kadazan Language, Transformation-Based, POS tagger, Brill's approach, Statistical, Rule-Based

DOI: 01.IJRTET.10.1.1380 © Association of Computer Electronics and Electrical Engineers, 2014

I. INTRODUCTION
POS tagging is the process of reading text in a language and marking up each word in the text (corpus) with its corresponding part of speech, such as noun, verb, adjective or adverb. In Natural Language Processing, POS tagging is important because it shows how words relate to each other and helps to resolve ambiguity in human language at different levels of analysis. It has been used in many applications such as machine translation, speech recognition and information retrieval. Hence, the importance of POS tagging cannot be ignored.

Several different approaches have been applied to POS tagging. The first technique used was rule-based. Statistical approaches then came into existence and gained more popularity after the 1980s because they tag automatically. Later, in 1992, Brill presented a rule-based system [2] which differs from the original rule-based approach because it uses a set of rules to transform tags. This tagger was trained for tagging English text and achieved 97 % accuracy [5]. All these approaches have been compared and implemented by other researchers in order to get better tagging results.

The rule-based approach assigns tags to words using contextual information, with rules written by linguists. The statistical approach is known as one of the more complex methods because it uses probability calculations to tag a corpus. The corpus has to be divided into two parts: the first part is a subset used to train the tagger to learn a statistical model, and the second part is a testing part on which the learned statistical model is used to tag untagged text. For years, the statistical approach was thought to be more successful than the rule-based method, until Brill's approach was introduced [4]. Brill's approach is easier to develop than the rule-based and statistical approaches because it is language and tag set independent and automatically acquires its model from annotated corpora [4].

In this paper, we implement this approach and evaluate its effectiveness for the Kadazan language by measuring the tagging accuracy. A tagger trained for Kadazan might have lower accuracy than the English tagging result. Hence, we observe and determine the tagging performance for Kadazan using Brill's approach.

II. BRIEF OVERVIEW OF KADAZAN LANGUAGE
Kadazan is a language spoken by the Kadazan people in Borneo, in the region from the Nosoob-Kepayan area through Penampang-Putatan to Papar, Sabah. Like every language, Kadazan has its own characteristics and grammatical structures. Grammatical relations can be expressed by means of prefixes, infixes and suffixes, so the function of the different parts of speech needs to be understood. For example, a noun can be formed from a verb by prefixing 'manan'. The word 'aat', which means 'painting' and is tagged as a verb, becomes the noun 'mananaat', which means 'painter', after 'manan' is prefixed to it. Grammatical relations can also be expressed by a particular word preceding the tagged word. For example, 'i' always precedes a noun, as in 'i kazu', which means 'that tree'.

A. Challenges in Kadazan POS Tagger
Many POS taggers have been developed for other languages, such as Asian and European languages, but the most widely explored language is English. This research on a Kadazan POS tagger has been carried out because no POS tagger has been developed for the Kadazan language yet.

III. LITERATURE REVIEW
Besides Brill's tagger, there are several other approaches to POS tagging. The two best known are the rule-based approach and the statistical approach. In general, the rule-based approach is one where the rules are written by humans based on linguistic knowledge. It maps the input sentence to the output text through basic morphological, syntactic and semantic analysis of the source and target languages involved in tagging or translation. This approach usually relies on a dictionary and on humans to tag words.
Related work [8] developed a POS tagger for the Pashto language using the rule-based approach and achieved 88 % accuracy. Related work [9] developed a POS tagger for the Manipuri language using the rule-based approach and achieved 85 % accuracy.

The statistical approach became popular after the original rule-based approach. It disambiguates words based on the probability that a word occurs with a particular tag. The tag that occurs most frequently with a word in the training set is the one assigned to an ambiguous instance of that word. The probability of a given sequence of tags occurring is also calculated. The statistical approach mostly uses the Hidden Markov Model (HMM), where lexical information and probabilities are used to find the tag for a word. As mentioned before, the statistical approach requires complex computations. One related work that used the statistical approach is [7], which used an HMM to tag the Manipuri language and achieved 92 % accuracy. Related work [3] also used the statistical approach to tag the Indonesian language and achieved an average accuracy of 80 %.

Since Brill's tagger was introduced, it has become an approach that gives results similar to or even better than the two approaches mentioned above. Brill's approach is also known as Transformation-Based Error-Driven Learning, where tags are transformed based on the rules applied. In this approach, a tag is assigned to each word and then transformed using a set of rules. These rules are applied over and over to transform incorrect tags into correct tags until no more rules can be applied. It is also described as self-learning because it uses a technique known as TEL and rule templates instead of a purely statistical model. Two related works that used Brill's approach are [1] and [6]. Related work [1] used Brill's approach to tag the Polish language and achieved 89.2 % accuracy. Related work [6] used Brill's approach to tag the Greek language and achieved 95 % accuracy. Brill's approach was chosen to develop the Kadazan POS tagger because it has shown good results not only for English but also for other languages such as Polish and Greek.

IV. BRILL'S TAGGER FOR KADAZAN
The Kadazan POS tagger is divided into four phases. In the first phase, the text or corpus is input into the system so that all of its words can be tagged. In the second phase, the corpus goes through the initial state annotater, which tags every word with its most likely tag based on the lexicon. The output of this process is the temporary corpus. In the third phase, the temporary corpus is compared with the goal corpus (the manually tagged corpus) to detect any errors. Lastly, in phase four, the lexical and contextual rules are applied to correct the errors found.
Figure 1 shows the overall model for the Kadazan POS tagger based on Brill's approach.

Figure 1. Kadazan POS tagger model based on Brill's Approach (the unannotated input corpus/text goes to the initial state annotater, which assigns the most likely tag to each word based on the lexicon and outputs the temporary corpus; the temporary corpus is compared with the goal corpus to find errors, and the lexical/contextual rules are applied to produce the correct tags corpus)

A. The First Phase
The first phase of tagging begins by inputting an unannotated text into the system. Figure 2 shows the diagram of the first phase.

Figure 2. First phase



B. The Second Phase
The second phase of tagging begins when the input text goes through the initial state annotater, which tags every word in the corpus with its most likely tag. The most likely tags for the words are given in the lexicon. Words which are not in the lexicon are considered unknown words and are tagged automatically as noun (N). Table I shows examples of lexicon entries with their most likely tags.

TABLE I. EXAMPLES OF LEXICON
Word in Kadazan   English Translation   Most Likely Tag   Other Possible Tags
Kalaja            Work                  V                 N
Taagang           Red                   J                 N
Tavazaan          Way                   N                 J
Tuni              Sound                 N                 J, R
Vagu              New                   R                 J

Based on Table I, the word 'Kalaja' is usually tagged as a verb (V) but can also be tagged as a noun (N). The word 'Taagang' is usually tagged as an adjective (J) but can also be tagged as a noun (N). The word 'Tavazaan' is usually tagged as a noun (N) but can also be tagged as an adjective (J). The word 'Tuni' is usually tagged as a noun (N) but can also be tagged as an adjective (J) or an adverb (R). The word 'Vagu' is usually tagged as an adverb (R) but can also be tagged as an adjective (J). The tag chosen depends on the structure of the sentence and the rules applied. The same applies to the other words in the lexicon. The output of this phase is the temporary corpus. Figure 3 shows the diagram of phase two.

Figure 3. Second phase (the initial state annotater assigns the most likely tag, or initial tag, to each word based on the lexicon and outputs the temporary corpus)
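The initial state annotater described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `LEXICON` dictionary holds just the five Table I entries, the function name `initial_state_annotate` is hypothetical, and lowercasing words before lookup is a choice made for the sketch, not something the paper specifies.

```python
# Hypothetical lexicon holding only the Table I entries: word -> most likely tag.
LEXICON = {
    "kalaja": "V",    # work
    "taagang": "J",   # red
    "tavazaan": "N",  # way
    "tuni": "N",      # sound
    "vagu": "R",      # new
}

UNKNOWN_TAG = "N"  # words missing from the lexicon are tagged as noun


def initial_state_annotate(words, lexicon=LEXICON):
    """Return (word, tag) pairs using the most likely tag for each word."""
    # Lowercasing before lookup is an assumption of this sketch.
    return [(w, lexicon.get(w.lower(), UNKNOWN_TAG)) for w in words]


# 'louti' is not in this toy lexicon, so it receives the default tag N.
temporary_corpus = initial_state_annotate(["Kalaja", "vagu", "louti"])
print(temporary_corpus)
```

The list of (word, tag) pairs plays the role of the temporary corpus that is passed on to the third phase.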

C. The Third Phase
The third phase of tagging compares the temporary corpus with the goal corpus to detect any wrong tags in the temporary corpus. The goal corpus is the manually tagged corpus. Figure 4 shows the diagram of phase three.

Figure 4. Third phase (the temporary corpus is compared with the goal corpus to find errors)
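The comparison in this phase amounts to walking the two corpora in parallel and recording every position where the tags differ. The sketch below assumes both corpora are lists of (word, tag) pairs over the same tokens; the function name `find_errors` and the sample tags are hypothetical and only illustrate the idea.

```python
def find_errors(temporary_corpus, goal_corpus):
    """Return (index, word, assigned_tag, goal_tag) for every mismatched tag."""
    if len(temporary_corpus) != len(goal_corpus):
        raise ValueError("corpora must contain the same number of tokens")
    errors = []
    for i, ((word, tag), (_, goal_tag)) in enumerate(zip(temporary_corpus, goal_corpus)):
        if tag != goal_tag:
            errors.append((i, word, tag, goal_tag))
    return errors


# Illustrative data only: the initial tag of 'louti' is deliberately wrong.
temporary = [("aiso", "J"), ("louti", "V")]
goal = [("aiso", "J"), ("louti", "N")]
print(find_errors(temporary, goal))
```

Each reported error is a candidate for correction by the rules applied in the fourth phase.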

D. The Fourth Phase
The fourth phase begins when the lexical and contextual rules are applied to fix the errors found in phase three. The lexical rules are based on the prefixes, infixes and suffixes of a word. Usually, the lexical rules only affect unknown words. Table II shows examples of lexical rules for the Kadazan language.

TABLE II. EXAMPLES OF LEXICAL RULES
Rule   Current Tag   New Tag   When
L1     V             N         Prefix 'manan'
L2     V             N         Prefix 'ka' and Suffix 'an'
L3     J             N         Prefix 'k' and Suffix 'an'
L4     J             N         Infix 'in'
L5     V             N         Prefix 'ko' and Suffix 'an'

Rule L1 states that if the current tag of a word is verb (V), then after 'manan' is prefixed to the word its tag is transformed into noun (N). Rule L2 states that if the current tag of a word is verb (V), then after 'ka' is prefixed and 'an' is suffixed to the word its tag is transformed into noun (N). Rule L3 states that if the current tag of a word is adjective (J), then after 'k' is prefixed and 'an' is suffixed to the word its tag is transformed into noun (N). Rule L4 states that if the current tag of a word is adjective (J), then after 'in' is infixed into the word its tag is transformed into noun (N). Rule L5 states that if the current tag of a word is verb (V), then after 'ko' is prefixed and 'an' is suffixed to the word its tag is transformed into noun (N). The same applies to other words based on their lexical rules. For example, based on rule L3, the word 'avasi (J)', which means 'good', is transformed into 'kavasian (N)', which means 'goodness'.

The contextual rules are then applied to transform wrong tags into correct tags based on the rule templates. Table III shows examples of contextual rules for the Kadazan language.

TABLE III. EXAMPLES OF CONTEXTUAL RULES
Rule   Current Tag   New Tag   If
C1     J/R/V         N         NEXTWD i
C2     J/R/V         N         PREVWD diti
C3     J/R/V         N         NEXTWD aiso
C4     J/R/V         N         NEXTWD o
C5     J/R/V         N         PREVWD tokuudi

Rule C1 states that a word following 'i' whose current tag is adjective (J), adverb (R) or verb (V) is transformed into a noun (N). Rule C2 states that a word preceding 'diti' whose current tag is adjective (J), adverb (R) or verb (V) is transformed into a noun (N). Rule C3 states that a word following 'aiso' whose current tag is adjective (J), adverb (R) or verb (V) is transformed into a noun (N). Rule C4 states that a word following 'o' whose current tag is adjective (J), adverb (R) or verb (V) is transformed into a noun (N). Rule C5 states that a word preceding 'tokuudi' whose current tag is adjective (J), adverb (R) or verb (V) is transformed into a noun (N). The same applies to other words based on their contextual rules. For example, in the Kadazan sentence 'aiso louti', which means 'no bread', the word 'louti' must be tagged as a noun based on rule C3. The tagged sentence becomes 'aiso (ADJ) louti (N)'. Figure 5 shows the diagram of phase four.

Figure 5. Phase four (the lexical/contextual rules are applied to the errors in the temporary corpus to produce the correct tags corpus)
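One way to encode the rules of Table II and Table III as data and apply them is sketched below. This is an illustrative reading, not the authors' implementation: the helper names are hypothetical, the lexical rules are applied by stripping affixes from an unknown word and looking up the remaining stem in the lexicon (an assumption of this sketch), and rule L4 is omitted because the text does not say where the infix 'in' is placed. The contextual rules follow the prose description: a J, R or V word next to a trigger word is retagged as N.

```python
# Lexical rules from Table II: (rule, prefix, suffix, stem tag, new tag).
# Rule L4 (infix 'in') is left out: the infix position is not specified.
LEXICAL_RULES = [
    ("L1", "manan", "", "V", "N"),
    ("L2", "ka", "an", "V", "N"),
    ("L3", "k", "an", "J", "N"),
    ("L5", "ko", "an", "V", "N"),
]

# Contextual rules from Table III as described in the text: "after" retags
# the word following the trigger, "before" retags the word preceding it.
CONTEXTUAL_RULES = [
    ("C1", "after", "i"),
    ("C2", "before", "diti"),
    ("C3", "after", "aiso"),
    ("C4", "after", "o"),
    ("C5", "before", "tokuudi"),
]
RETAGGABLE = {"J", "R", "V"}


def apply_lexical_rules(word, lexicon):
    """Return the new tag if stripping a rule's affixes yields a known stem."""
    for _name, prefix, suffix, stem_tag, new_tag in LEXICAL_RULES:
        if word.startswith(prefix) and word.endswith(suffix):
            stem = word[len(prefix): len(word) - len(suffix)]
            if lexicon.get(stem) == stem_tag:
                return new_tag
    return None  # no lexical rule applies


def apply_contextual_rules(tagged):
    """Retag J/R/V words as N when a trigger word is adjacent to them."""
    result = list(tagged)
    for i, (word, tag) in enumerate(tagged):
        if tag not in RETAGGABLE:
            continue
        prev_word = tagged[i - 1][0] if i > 0 else None
        next_word = tagged[i + 1][0] if i + 1 < len(tagged) else None
        for _name, side, trigger in CONTEXTUAL_RULES:
            neighbour = prev_word if side == "after" else next_word
            if neighbour == trigger:
                result[i] = (word, "N")
                break
    return result


toy_lexicon = {"avasi": "J", "aat": "V"}  # stems taken from the paper's examples
print(apply_lexical_rules("kavasian", toy_lexicon))
print(apply_contextual_rules([("aiso", "J"), ("louti", "V")]))
```

Keeping the rules in plain lists means further rules can be added without touching the functions that apply them.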


V. RESULTS AND EVALUATION
In this section, the performance of the Kadazan POS tagger using Brill's approach is shown and discussed. The tagger has been evaluated in three ways. First, we obtain the accuracy and the error rate when each rule set is combined with the lexicon separately and when both the lexical and contextual rules are combined with the lexicon. Second, we tag the corpus using only the most likely tag (lexicon) and using the combined rules, to compare the tagging results with and without rules. Lastly, we calculate the precision, recall and F-measure for every main tag in both corpora.

TABLE IV. TAGGING RESULTS FOR CORPUS 1 BY COMPARING RULES
Tags             Lexicon + Lexical (L)   Lexicon + Contextual (C)   Lexicon + L + C
Correct          681                     688                        696
Wrong            60                      53                         45
Accuracy (%)     91.9                    92.85                      93.93
Error Rate (%)   8.10                    7.15                       6.07
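The accuracy and error rate in Table IV follow directly from the counts of correct and wrong tags. A minimal sketch, with a hypothetical function name, that reproduces the Table IV percentages:

```python
def accuracy_and_error_rate(correct, wrong):
    """Return (accuracy %, error rate %) rounded to two decimal places."""
    total = correct + wrong
    accuracy = 100.0 * correct / total
    return round(accuracy, 2), round(100.0 - accuracy, 2)


# Counts taken from Table IV (corpus 1, 741 words in total).
for label, correct, wrong in [
    ("Lexicon + L", 681, 60),
    ("Lexicon + C", 688, 53),
    ("Lexicon + L + C", 696, 45),
]:
    print(label, accuracy_and_error_rate(correct, wrong))
```

The error rate is simply the complement of the accuracy, so the two columns always sum to 100 %.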

As shown in Table IV, by applying the lexicon and the lexical rules only, we obtained 681 correct tags and 60 wrong tags, giving 91.9 % accuracy and an 8.10 % error rate. By applying the lexicon and the contextual rules only, we obtained 688 correct tags and 53 wrong tags, giving 92.85 % accuracy and a 7.15 % error rate. Lastly, by applying the lexicon with both the lexical and contextual rules, we obtained 696 correct tags and 45 wrong tags, giving 93.93 % accuracy and a 6.07 % error rate.

TABLE V. TAGGING RESULTS WITH RULES AND WITHOUT RULES FOR CORPUS 1
Tags             Without Rules (lexicon)   With Rules (L+C) + Lexicon
Correct          670                       696
Wrong            71                        45
Accuracy (%)     90.42                     93.93
Error Rate (%)   9.58                      6.07

As shown in Table V, by tagging corpus 1 without applying rules, we obtained 670 correct tags and 71 incorrect tags, giving 90.42 % accuracy and a 9.58 % error rate. By tagging the corpus with the rules and the lexicon, we obtained 696 correct tags and 45 wrong tags, giving 93.93 % accuracy and a 6.07 % error rate.

TABLE VI. TAGGING RESULTS FOR EACH TAG SET IN CORPUS 1
Tags / %    Noun (N)   Verb (V)   Adverb (R)   Adjective (J)
Recall      91.6       85.5       92.2         73.8
Precision   92.6       92.2       94.0         60.8
F-measure   92.1       88.8       93.1         66.7
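Per-tag precision, recall and F-measure can be computed from a machine-tagged sequence and the corresponding manually tagged sequence. The sketch below assumes the balanced F-measure, that is the harmonic mean of precision and recall, which is consistent with the noun, adverb and adjective columns of Table VI; the function names and the four-word toy example are hypothetical.

```python
def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def precision_recall_f(predicted, gold, tag):
    """Score one tag given equal-length lists of machine and manual tags."""
    true_pos = sum(1 for p, g in zip(predicted, gold) if p == tag and g == tag)
    assigned = sum(1 for p in predicted if p == tag)  # machine-tagged as `tag`
    actual = sum(1 for g in gold if g == tag)         # manually tagged as `tag`
    precision = true_pos / assigned if assigned else 0.0
    recall = true_pos / actual if actual else 0.0
    return precision, recall, f_measure(precision, recall)


# Toy example: one verb is wrongly tagged as a noun.
predicted = ["N", "N", "V", "J"]
gold = ["N", "V", "V", "J"]
print(precision_recall_f(predicted, gold, "N"))

# Noun column of Table VI: precision 92.6 % and recall 91.6 %.
print(round(f_measure(92.6, 91.6), 1))
```

Because the F-measure is a harmonic mean, it is pulled toward the lower of the two scores, which is why the adjective column ends up well below its recall.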

As shown in Table VI, for the noun tag set we obtained 91.6 % recall, 92.6 % precision and 92.1 % F-measure. For the verb tag set, we obtained 85.5 % recall, 92.2 % precision and 88.8 % F-measure. For the adverb tag set, we obtained 92.2 % recall, 94.0 % precision and 93.1 % F-measure. Lastly, for the adjective tag set, we obtained 73.8 % recall, 60.8 % precision and 66.7 % F-measure.

The same evaluation process is performed for corpus 2, which consists of 1328 words and is larger than corpus 1. Table VII shows the tagging results for corpus 2 when each rule set is combined with the lexicon separately and when both rule sets are combined with the lexicon. As shown in Table VII, by applying the lexicon and the lexical rules only, we obtained 1203 correct tags and 125 wrong tags, giving 90.59 % accuracy and a 9.41 % error rate. By applying the lexicon and the contextual rules only, we obtained 1224 correct tags and 104 wrong tags, giving 92.17 % accuracy and a 7.83 % error rate. Lastly, by applying both the lexical and contextual rules with the lexicon, we obtained 1221 correct tags and 107 wrong tags, giving 91.94 % accuracy and an 8.06 % error rate.

TABLE VII. TAGGING RESULTS FOR CORPUS 2 BY COMPARING RULES
Tags             Lexicon + Lexical (L)   Lexicon + Contextual (C)   Lexicon + L + C
Correct          1203                    1224                       1221
Wrong            125                     104                        107
Accuracy (%)     90.59                   92.17                      91.94
Error Rate (%)   9.41                    7.83                       8.06

TABLE VIII. TAGGING RESULTS WITH RULES AND WITHOUT RULES FOR CORPUS 2
Tags             Without Rules (lexicon)   With Rules (L+C) + Lexicon
Correct          1203                      1221
Wrong            125                       107
Accuracy (%)     90.59                     91.94
Error Rate (%)   9.41                      8.06

As shown in Table VIII, by tagging corpus 2 without rules, we obtained 1203 correct tags and 125 incorrect tags, giving 90.59 % accuracy and a 9.41 % error rate. By tagging the corpus with rules, we obtained 1221 correct tags and 107 wrong tags, giving 91.94 % accuracy and an 8.06 % error rate.

TABLE IX. TAGGING RESULTS FOR EACH TAG SET IN CORPUS 2
Tags / %    Noun (N)   Verb (V)   Adverb (R)   Adjective (J)
Recall      70.6       80.0       95.8         74.5
Precision   66.4       73.8       94.2         74.5
F-measure   68.4       76.8       95.0         74.5

As shown in Table IX, for the noun tag set we obtained 70.6 % recall, 66.4 % precision and 68.4 % F-measure. For the verb tag set, we obtained 80.0 % recall, 73.8 % precision and 76.8 % F-measure. For the adverb tag set, we obtained 95.8 % recall, 94.2 % precision and 95.0 % F-measure. Lastly, for the adjective tag set, we obtained 74.5 % for recall, precision and F-measure alike.

A. Discussion
The overall results show that the tagger achieved a fairly high accuracy of around 91 % to 93 %, but not as high as the 97 % accuracy obtained by tagging the English language [2]. Some problems could be due to the complicated morphological structure of the Kadazan language, because Brill's approach was originally developed for English. When it is applied to other languages, some problems or errors may occur, which is why POS taggers for other languages cannot reach accuracy as high as the English POS tagger.

The corpora were also evaluated with the lexical and contextual rules, as shown in Table IV and Table VII, to see whether applying those rules improves the tagger. The results show that the applied rules improve the accuracy and reduce the error rate. If no rules are applied, as shown in Table V and Table VIII, the accuracy is lower and the error rate is higher than after the rules have been applied. We can also see that applying the contextual rules after the lexical rules increases the tagging accuracy further. This shows that some of the tags in the temporary corpus, to which the lexicon and the lexical rules had already been applied, were fixed by the contextual rules. Moreover, as Table VII and Table VIII show for corpus 2, the accuracy after applying the lexical rules is the same as when applying the lexicon only. Lexical rules are usually applied to unknown words, so identical results mean that no unknown words were detected in corpus 2.

Next, we evaluated the precision, recall and F-measure for each main tag set in the corpus. Recall is the total number of correctly machine-tagged words over the total number of words with the correct tag [5]. Precision is the total number of correctly machine-tagged words over the total number of machine-tagged words [5]. F-measure is a measure of test accuracy that considers both the precision and the recall of the test to calculate the score. Based on Table VI, for the noun tag set, the precision of 92.6 % means that 7.4 % of what the system retrieved was not the correct result, and the recall of 91.6 % means that the system missed 8.4 % of the intended tags. Calculating the F-measure from the precision and recall gives 92.1 %. The same explanation applies to the results for every tag set in Table VI for corpus 1 and in Table IX for corpus 2.

All these results can be improved by adding more rules. Using a larger lexicon would also help to reduce the number of unknown words. Using approaches other than Brill's may also help to determine which approaches are more suitable for developing a Kadazan POS tagger. In our study, we focused on the main tag sets, namely noun, verb, adjective and adverb, because our purpose is to develop a basic POS tagger for Kadazan which can be enhanced in the future. Overall, the results show that the tagger for the Kadazan language did not achieve accuracy higher than or similar to that for English, but it achieved more than 90 % accuracy, which is also considered high.

VI. CONCLUSION
In conclusion, the performance of the Kadazan POS tagger using Brill's approach has been shown and discussed.
The approach has been implemented in the POS tagger in four phases based on Brill's approach, and the tagger has been evaluated in three ways. The first obtains the accuracy and error rate when each rule set is combined with the lexicon separately and when both the lexical and contextual rules are combined with the lexicon. The second tags the corpus using only the most likely tag (lexicon) and using the combined rules with the lexicon, to compare the tagging results with and without rules. The last calculates the precision, recall and F-measure of the tagger for every main tag set in both corpora. Overall, the tagger achieved around 93 % accuracy.

ACKNOWLEDGMENT
I would like to thank Dr. Lailatul Qadri Zakaria for being my supervisor and for all her encouragement and support throughout this project. I would also like to thank my parents and friends, who have always supported me. Lastly, I would like to thank all the referees for their valuable comments.

REFERENCES
[1] S. Acedanski and K. Gołuchowski, "A morphosyntatic rule-based Brill tagger for Polish," in Intelligent Information System 9999, Poland, pp. 1-10, 2010.
[2] E. Brill, "A simple rule-based part-of speech," in Proceedings of the Third Conference on Applied Computational Linguistic (ACL), Trento, pp. 152-155, 1992.
[3] F. Pisceldo, M. Adriani and R. Manurung, "Probabilistic Part Of Speech Tagging for Bahasa Indonesia," Depok, 2010.
[4] F. Naz, W. Anwar, U. Ijaz Bajwa and E. Ullah Munir, "Urdu Part of Speech Tagging Using Transformation Based Error Driven Learning," in World Applied Sciences Journal 16 (3), Abbottabad/Wah Cantt, pp. 437-448, 2012.
[5] B. Megyesi, "Brill's Rule-Based Part of Speech Tagger," Stockholm, 1998.
[6] G. Petasis, G. Paliouras, V. Karkaletsis, C. D. Spyropoulos and I. Androutsopoulos, "Resolving Part-of-Speech Ambiguity in Greek Language Using Learning Techniques," Demokritos, 1998.
[7] K. Raju Singha, B. Syam Purkayastha, and K. Dhiren Singha, "Part of Speech Tagging in Manipuri with Hidden Markov Model," IJCSI International Journal of Computer Science Issues, vol. 9, November 2012.
[8] I. Rabbi, M. Abid Khad, and R. Ali, "Rule-Based Part of Speech Tagger for Pashto Language," Proceedings of the Conference on Language & Technology, Pakistan, 2009.
[9] K. Raju Singha, B. Syam Purkayastha, and K. Dhiren Singha, "Part of Speech Tagging in Manipuri: A Rule-based Approach," International Journal of Computer Applications, Silchar, pp. 0975-8887, August 2012.


