A Hybrid Approach to Lao Word Segmentation using Longest Syllable Level Matching with Named Entities Recognition

Arounyadeth Srithirath #, Pusadee Seresangtakul #
# Department of Computer Science, Khon Kaen University, Khon Kaen, Thailand, 40002
Email: tom_zmax@yahoo.com, pusadee@kku.ac.th

Abstract
The Lao language is written without word delimiters, which makes it difficult to process automatically. Automatic word segmentation is therefore an essential but challenging task in natural language processing for Lao. This paper proposes a longest syllable level matching approach with named entities recognition for Lao word segmentation. Syllables are first extracted from the input text; longest matching, one of the techniques in the dictionary-based approach, is then applied together with named entities recognition to combine the syllables into words. The precision and recall obtained with this approach were 85.21% and 92.36%, respectively.

Keywords
Lao word segmentation, tokenization, syllable extraction, longest matching, dictionary based, named entities recognition.

I. INTRODUCTION

A Lao text is a string of symbols with no explicit word boundaries, as is the case in other Asian languages such as Japanese, Chinese, and Thai. Spaces between syllables are rarely used in the Lao language. In order to perform Lao language processing, especially for rule-based machine translation, text must first be segmented into individual terms or words. Researchers have proposed several segmentation techniques for many different languages. These techniques can be classified into two main approaches: Dictionary Based (DCB) and Machine Learning Based (MLB) [2]. The DCB approach is simple and straightforward: it looks up series of characters in a dictionary to find matching terms. The main problems for the DCB approach are parsing words that are not in the dictionary and parsing ambiguity.
The MLB approach aims to solve the problems of the DCB approach by using a classification model learned from the character patterns in a tagged corpus. In this paper, we present a word segmentation process for the Lao language that combines three main operations: syllable extraction; longest matching, which is one of the techniques in the DCB approach; and named entities recognition (NER) [4], which is used to improve segmentation quality. Our dictionary contains approximately 52,000 words, a lexicon sufficient to cover daily use in the general domain.

II. BACKGROUND AND RELATED WORKS

Recently, Sisouvanh Vanthanavong et al. (2011) proposed Lao word segmentation based on conditional random fields (CRF) [1] using a tagged corpus of approximately 100,000 words, which gave precision and recall of 80.29% and 78.45%, respectively. As Lao and Thai are very similar in both their spoken and written forms, work on Thai is also relevant. Choochart Haruechaiyasak et al. (2008) compared Thai word segmentation approaches [2] for both the DCB and MLB techniques. For DCB, they compared two algorithms: Longest Matching and Maximal Matching. For MLB, they compared four algorithms: Naive Bayes (NB), decision tree, support vector machine (SVM), and CRF. On the ORCHID corpus, which contains 113,404 words, the best performance was obtained from the CRF algorithm, with precision and recall of 95.79% and 94.98%, respectively.

Named entities recognition is an essential process widely used in natural language processing (NLP). Hutchatai Chanlekha et al. (2004) presented Thai named entity extraction by incorporating the maximum entropy model with simple heuristic information [3]. By combining the machine-learning and rule-based approaches, they obtained F-measures for person, location, and organization names of 90.44%, 82.16% and 89.87%, respectively. Nattapong Tongtep et al.
(2010) proposed a method for pattern-based extraction of named entities in Thai news documents [4] that focuses on the rule-based approach and combines many techniques, such as longest word matching, longest pattern matching, clue words, and word lists from dictionaries. Their method achieved approximately 68-100% correctness, depending on the named entity type.

978-1-4799-0545-4/13/$31.00 ©2013 IEEE

Previous research has shown that the MLB approach, especially the CRF technique, performs slightly better than the DCB approach. However, MLB performance depends mainly on the domain and size of the corpus [2]. Collecting a corpus that covers every domain and character pattern, so that the model can be trained effectively, would take a lot of time and effort. On the
other hand, although the DCB approach performs poorly on unknown words, this problem can be reduced significantly by combining syllable extraction with Lao named entities recognition.

III. METHODOLOGY

Lao word segmentation is underpinned by three basic operations: pre-processing, syllable extraction, and longest syllable level matching with NER. Fig. 1 illustrates the overview of the system.

[Flowchart: Lao Paragraphs or Sentences -> Pre-processing -> Sentence -> Syllable Extraction -> List of Syllable -> Longest Syllable Level Matching with Name Entity Recognition -> Lao Segmented Words. The matching step also draws on the Dictionary and the Lao Initial Name Entity list.]

Fig. 1 Lao word segmentation system overview

The pre-processing operation takes Lao paragraphs or sentences as input. The Lao language rarely uses spaces between syllables; however, it does use full stops (.) to mark the end of sentences. Therefore, each paragraph is split into sentences at full stops so that the longest syllable level matching with NER operation can parse words more accurately.

Syllable structure in the Lao language consists of consonants, vowels and tone markers. Consonants occur on the baseline and can be divided into two categories: single consonants, of which there are 27, and double consonants, of which there are 6. Vowels can occur between, before, above or below a consonantal character. There are 28 vowels in the Lao language.
They are divided according to their sound into short vowels (12 characters), long vowels (12 characters), and a set of special vowels (4 characters). There are 4 tone markers, which always occur on top of consonantal characters. There are also 3 special symbols: ໆ, indicating repetition of a syllable; ຯ, indicating "and others" (etc.); and ◌໌, indicating that the final consonant of a word borrowed from another language is silent. Tables I to III illustrate Lao consonants, vowels, and tone markers. In the tables, ◌ marks the position of the consonant.

TABLE I LAO CONSONANTS

Consonant    IPA      Consonant    IPA      Consonant    IPA
Single Consonants
ກ            /k/      ຕ            /t/      ຟ            /f/
ຂ            /kʰ/     ຖ            /tʰ/     ມ            /m/
ຄ            /kʰ/     ທ            /tʰ/     ຢ            /j/
ງ            /ŋ/      ນ            /n/      ຣ            /r/
ຈ            /c/      ບ            /b/      ລ            /l/
ສ            /s/      ປ            /p/      ວ            /w/
ຊ            /s/      ຜ            /pʰ/     ຫ            /h/
ຍ            /ɲ/      ຝ            /f/      ອ            /ʔ/
ດ            /d/      ພ            /pʰ/     ຮ            /h/
Double Consonants
ຫງ           /ŋ/      ຫນ/ໜ         /n/      ຫລ/ຫຼ         /r/
ຫຍ           /ɲ/      ຫມ/ໝ         /m/      ຫວ           /w/

TABLE II LAO VOWELS

Vowel    IPA      Vowel    IPA      Vowel    IPA
Short Vowels
◌ະ       /a/      ເ◌ະ      /e/      ເ◌ິ      /ɤ/
◌ິ       /i/      ແ◌ະ      /ɛ/      ◌ົວະ     /uə/
◌ຶ       /ɯ/      ໂ◌ະ      /o/      ເ◌ຶອ     /ɯə/
◌ຸ       /u/      ເ◌າະ     /ɔ/      ເ◌ັຍ     /iə/
Long Vowels
◌າ       /aː/     ເ◌       /eː/     ເ◌ີ      /ɤː/
◌ີ       /iː/     ແ◌       /ɛː/     ◌ົວ      /uːə/
◌ື       /ɯː/     ໂ◌       /oː/     ເ◌ືອ     /ɯːə/
◌ູ       /uː/     ◌ໍ       /ɔː/     ເ◌ຍ      /iːə/
Special Vowels
ໄ◌       /ai/     ໃ◌       /ai/     ເ◌ົາ     /ao/     ◌ຳ     /am/

TABLE III LAO TONE MARKERS

Tone Marker    Tone       IPA       Tone Marker    Tone       IPA
◌່             low        /àː/      ◌້             falling    /âː/
◌໊             high       /áː/      ◌໋             rising     /ǎː/

From the characteristics of the Lao writing system, it can be observed that word boundaries generally align with syllable boundaries. Working directly at the character level would lead to incorrect segmentation of a sentence into single characters and small lexemes, so it is useful to perform syllable extraction before doing
other operations. The rules and the algorithm used for syllable identification in the Lao language were proposed by Phonpasit Phissamay et al. (2004) [5].

In longest syllable level matching, series of syllables are looked up in a dictionary using the forward technique. Fig. 2 describes the algorithm. Given a set of extracted syllables S and a set of dictionary words D, the algorithm outputs a set of longest-matching words W.

For example, suppose the Lao sentence after syllable extraction is ຂ້ອຍ ໄປ ຕະ ຫຼາດ /kʰɔːy pay tá laːt/ (I go to the market). Firstly, the algorithm sets the current position, denoted CP, to the first index of the syllable set and the last position, denoted LP, to the last index. It then checks the syllables from CP to LP, ຂ້ອຍໄປຕະຫຼາດ, against the dictionary. If there is no match, it decreases LP by one and checks the syllables from CP to LP, ຂ້ອຍໄປຕະ, against the dictionary again. It repeats this step until it finds a match or CP equals LP. It then forms a word from the syllables from CP to LP, sets CP to LP + 1, and resets LP to the last index of the syllable set. The algorithm continues until all syllables have been processed, that is, until CP is greater than LP. The word segmentation result for this example is ຂ້ອຍ ໄປ ຕະຫຼາດ.

Algorithm 1: Longest syllable level matching
S = {s0, s1, s2, ..., sn}    # Set of extracted syllables
D = {d0, d1, d2, ..., dn}    # Set of words in the dictionary
W = {w0, w1, w2, ..., wn}    # Set of longest-matching words
# Initialize variables
Let CP = 0                              # Current Position Marker
Let LP = the last index of set S
Let TmpLP = LP                          # Last Position Marker
Repeat
    If CP <= TmpLP and S[CP] is not SPACE Then
        If S[CP to TmpLP] ∈ D Then
            W ← S[CP to TmpLP]
            CP = TmpLP + 1
            TmpLP = LP
        Else
            TmpLP = TmpLP - 1
        End If
    Else
        W ← S[CP]
        CP = CP + 1
        TmpLP = LP
    End If
until CP > LP
return W

Fig. 2 Algorithm for longest syllable level matching

Lao personal names usually start with a title that can be used as a clue. In general, native Lao personal names are composed of title + one space + first name + one space + [last name], where the last name is optional, as shown in Fig. 3 and Table IV.

[Diagram: Title + First Name + [Last Name]]

Fig. 3 Regular grammar for personal name written in Lao language

TABLE IV EXAMPLE OF PERSON NAME WRITTEN IN LAO LANGUAGE

No    Title    First Name    Last Name
1     ທ້າວ     ສກໃຈ          ລດຕະນະ
2     ທ່ານ     ອານສອນ        ໄພສານ
3     ນາງ      ສ ແນດຕາ       ພະວງສາ

Furthermore, location expressions such as institute, company, school, university, office, district, city, village, town, province, and country also have a title that can be used as a clue. Generally, a location expression is composed of title + one space + location name, as shown in Fig. 4 and Table V.

[Diagram: Title + Location Name]

Fig. 4 Regular grammar for location name written in Lao language

TABLE V EXAMPLE OF LOCATION NAME WRITTEN IN LAO LANGUAGE

No    Title       Location Name
1     ບໍລິສັດ     ໄຊຍະສດທ ປ ກສາໄອທ
2     ອົງການ      ການຄ້າໂລກ
3     ບ້ານ        ສສງວອນ

Once the word boundaries of personal names and location names in the Lao language are known, a rule can be created to recognize these named entities. Fig. 5 describes the modified algorithm for longest matching using the forward technique with NER.

IV. EXPERIMENT AND EVALUATION

Lao news documents were used to evaluate the performance of our approach. They were collected from the website of a Lao news publisher, Vientiane Mai [6], in the following categories: General, Sport, and Education. Ten articles were randomly selected in each category, for a total of thirty articles. The dictionary used contains approximately 52,000 words. The named entities were divided into two types for recognition: person name (PER) and location (LOC). Table VI shows the list of Lao initial titles and location names that can be used as clues to detect the named entities. Table VII shows the results of our approach before and after using NER.
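The relationship between the precision, recall and F-measure columns of the result tables can be sketched as follows. This is a minimal illustration, not the evaluation script used in the experiments: the paper does not state how segmented words were matched against the reference, so comparing character spans is an assumption, and `spans`, `evaluate` and `f_measure` are hypothetical helper names.

```python
from typing import List, Set, Tuple


def spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Character (start, end) span of every word in a segmentation."""
    out, pos = set(), 0
    for w in words:
        out.add((pos, pos + len(w)))
        pos += len(w)
    return out


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(predicted: List[str], reference: List[str]) -> Tuple[float, float, float]:
    """Precision, recall and F-measure in percent.

    Assumption: a predicted word counts as correct only when its span
    matches a reference word exactly. Both lists must be non-empty.
    """
    pred, ref = spans(predicted), spans(reference)
    correct = len(pred & ref)
    precision = 100.0 * correct / len(pred)
    recall = 100.0 * correct / len(ref)
    return precision, recall, f_measure(precision, recall)


# The averaged F-measure follows from the averaged precision and recall:
print(round(f_measure(85.21, 92.36), 2))  # 88.64 (DCB with NER)
print(round(f_measure(79.87, 88.62), 2))  # 84.02 (DCB without NER)
```

Under this span-based assumption, over-segmentation produces more predicted words than reference words, which lowers precision more than recall. That is the pattern seen in the reported figures, where recall is higher than precision in every category.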
It can be observed that word segmentation improves significantly when using NER, especially in the General category, where named entities are most likely to occur.

Algorithm 2: Longest Syllable Level Matching with NER
S = {s0, s1, s2, ..., sn}    # Set of extracted syllables
D = {d0, d1, d2, ..., dn}    # Set of words in the dictionary
C = {c0, c1, c2, ..., cn}    # Set of clue words
W = {w0, w1, w2, ..., wn}    # Set of longest-matching words
# Initialize variables
Let CP = 0                              # Current Position Marker
Let LP = the last index of set S
Let FlagClue = FALSE
Let TmpLP = LP                          # Last Position Marker
Let NI = 0                              # Next Index of Space
Repeat
    If CP <= TmpLP and S[CP] is not SPACE Then
        If S[CP to TmpLP] ∈ D Then
            If S[CP to TmpLP] ∈ C and S[TmpLP + 1] is SPACE Then
                FlagClue = TRUE
            Else
                FlagClue = FALSE
            End If
            W ← S[CP to TmpLP]
            CP = TmpLP + 1
            TmpLP = LP
            If FlagClue is TRUE Then
                NI = CP
                Repeat
                    NI = NI + 1
                until S[NI] is SPACE or NI > LP
                W ← S[CP to NI - 1]
                CP = NI
                FlagClue = FALSE
            End If
        Else
            TmpLP = TmpLP - 1
        End If
    Else
        W ← S[CP]
        CP = CP + 1
        TmpLP = LP
    End If
until CP > LP
return W

Fig. 5 Algorithm for longest matching with NER

TABLE VI LIST OF LAO INITIAL TITLE AND LOCATION NAME

Type    Lao Title        English Title
PER     ທ້າວ             Mr.
PER     ນາງ              Ms.
PER     ອຈ               Teacher
PER     ດຣ               Dr.
PER     ຮສ               Assoc. Prof.
PER     ຜຊສ              Asst. Prof.
PER     ສຈ               Prof.
PER     ຮສ ດຣ            Ph.D. Assoc. Prof.
PER     ຜຊສ ດຣ           Ph.D. Asst. Prof.
PER     ສຈ ດຣ            Ph.D. Prof.
PER     ທ່ານ             Mr.
PER     ສິບຕີ            Private 1st class
PER     ສິບໂທ            Corporal
PER     ສິບເອກ           Sergeant
PER     ຮ້ອຍຕີ           Second Lieutenant
PER     ຮ້ອຍໂທ           First Lieutenant
PER     ຮ້ອຍເອກ          Captain
PER     ພັນຕີ            Major
PER     ພັນໂທ            Lieutenant General
PER     ພັນເອກ           General
PER     ພະນະທ່ານ         Excellency
LOC     ອົງການ           Organization
LOC     ບໍລິສັດ          Company
LOC     ບ້ານ             Village
LOC     ບ້ານພັກ          Guest House
LOC     ເມືອງ            District
LOC     ແຂວງ             Province
LOC     ໂຮງແຮມ           Hotel
LOC     ຮ້ານ             Restaurant
LOC     ລັດວິສາຫະກິດ     State enterprise

Fig. 6 Comparison chart result of longest syllable level matching approach before and after using NER
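The matching procedure of Algorithm 2 can be sketched in Python as follows. This is an illustrative sketch rather than the authors' implementation: the function name `segment`, the romanized placeholder syllables, and the tiny dictionary are hypothetical. As a simplifying assumption, each candidate window stops at the next space instead of running to the end of the sentence. Passing no clue words reduces the function to plain longest syllable level matching (Algorithm 1).

```python
from typing import List, Optional, Set

SPACE = " "


def segment(syllables: List[str], dictionary: Set[str],
            clues: Optional[Set[str]] = None) -> List[str]:
    """Forward longest matching over syllables with clue-word NER."""
    clues = clues or set()
    words: List[str] = []
    n = len(syllables)
    cp = 0  # current position (CP in Algorithm 2)
    while cp < n:
        if syllables[cp] == SPACE:
            cp += 1
            continue
        # Assumption: a dictionary word never spans a space, so the
        # candidate window ends at the next space.
        limit = cp
        while limit < n and syllables[limit] != SPACE:
            limit += 1
        # Shrink the window from the right until it matches the
        # dictionary or only one syllable is left.
        end = limit
        while end > cp + 1 and "".join(syllables[cp:end]) not in dictionary:
            end -= 1
        word = "".join(syllables[cp:end])
        words.append(word)
        cp = end
        # NER rule: a clue word (title) followed by a space marks the
        # next space-delimited chunk as a single named entity.
        is_clue = word in dictionary and word in clues
        if is_clue and cp < n and syllables[cp] == SPACE:
            start = cp + 1
            stop = start
            while stop < n and syllables[stop] != SPACE:
                stop += 1
            if stop > start:
                words.append("".join(syllables[start:stop]))
            cp = stop
    return words


# Hypothetical romanized syllables standing in for Lao text.
lexicon = {"thao", "khoy", "pai", "talat"}
print(segment(["khoy", "pai", "ta", "lat"], lexicon))
# ['khoy', 'pai', 'talat']
sentence = ["thao", " ", "suk", "chai", " ", "pai", "ta", "lat"]
print(segment(sentence, lexicon, clues={"thao"}))
# ['thao', 'sukchai', 'pai', 'talat']
```

Without the clue set, the unknown name in the second example falls apart into the single syllables `suk` and `chai`, which is the unknown-word failure that the NER step is meant to reduce.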
TABLE VII EVALUATION RESULTS OF LONGEST SYLLABLE LEVEL MATCHING BEFORE AND AFTER USING NER

Approach           Categories    Precision (%)    Recall (%)    F-Measure (%)
DCB Without NER    General       76.59            86.21         81.12
                   Sport         79.41            88.67         83.79
                   Education     82.58            90.31         86.27
DCB With NER       General       83.54            91.10         87.16
                   Sport         83.69            91.55         87.44
                   Education     87.43            93.79         90.50

TABLE VIII AVERAGE EVALUATION RESULTS OF LONGEST SYLLABLE LEVEL MATCHING BEFORE AND AFTER USING NER

Approach           Precision (%)    Recall (%)    F-Measure (%)
DCB Without NER    79.87            88.62         84.02
DCB With NER       85.21            92.36         88.64

The errors most likely to occur with our approach are caused by word ambiguity and unknown words. For example, the Lao sentence ຂ້ອຍໃຫ້ການສະໜັບສະໜູນເຈົ້າ /kʰɔːy hay kaːn sánáp sánuːn cao/ (I give you the support) is parsed into ຂ້ອຍ (I) [Pronoun], ໃຫ້ການ (give evidence) [Verb], ສະໜັບສະໜູນ (to support) [Verb], ເຈົ້າ (you) [Pronoun], instead of ຂ້ອຍ (I) [Pronoun], ໃຫ້ (give) [Verb], ການສະໜັບສະໜູນ (support) [Noun], ເຈົ້າ (you) [Pronoun].

V. CONCLUSION AND FUTURE WORK

This paper presented a Lao word segmentation approach using longest syllable level matching with NER. The approach first extracts the syllables from the input text and then applies longest matching, which is one of the techniques in the DCB approach, to combine them into words. We also proposed a technique for Lao named entities recognition, covering person and location names, in order to improve the accuracy of word segmentation. Our approach achieved precision and recall of 85.21% and 92.36%, respectively. In future work, we will try to integrate an MLB approach with our approach in order to improve word segmentation performance, especially for words that are not in the dictionary and for ambiguous words.

REFERENCES

[1] S. Vanthanavong and C. Haruechaiyasak, "LaoWS: Lao Word Segmentation Based on Conditional Random Fields," in Conference on Human Language Technology for Development, 2011, pp. 21-26.
[2] C. Haruechaiyasak, S. Kongyoung, and M. Dailey, "A comparative study on Thai word segmentation approaches," in 5th Int. Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology, 2008, pp. 125-128.
[3] H. Chanlekha and A. Kawtrakul, "Thai Named Entity Extraction by incorporating Maximum Entropy Model with Simple Heuristic Information," in 1st Int. Joint Conference on NLP, 2004.
[4] N. Tongtep and T. Theeramunkong, "Pattern-based Extraction of Named Entities in Thai News Documents," Thammasat Int. J. Sc. Tech., Vol. 15, No. 1, pp. 70-81, January-March 2010.
[5] P. Phissamay, et al., "Syllabification of Lao Script for Line Breaking," Tech. Rep. of STEA, Lao PDR, 2004.
[6] (2013) Vientiane Mai website. [Online]. Available: http://www.vientianemai.net/