A Hybrid Approach to Lao Word Segmentation using Longest Syllable Level Matching with Named Entities Recognition

Arounyadeth Srithirath #, Pusadee Seresangtakul #
# Department of Computer Science, Khon Kaen University, Khon Kaen, Thailand, 40002
Email: tom_zmax@yahoo.com, pusadee@kku.ac.th
978-1-4799-0545-4/13/$31.00 © 2013 IEEE

Abstract
The Lao language is written without word delimiters, which makes it extremely difficult to process automatically. Automatic word segmentation is therefore an essential but challenging task in natural language processing for Lao. This paper proposes an approach to Lao word segmentation that combines longest syllable level matching with named entities recognition. Syllables are first extracted from the input text; longest matching, one of the techniques in the Dictionary Based approach, is then applied together with named entities recognition to combine the syllables into words. The precision and recall obtained with this approach were 85.21% and 92.36%, respectively.

Keywords
Lao word segmentation, tokenization, syllable extraction, longest matching, dictionary based, named entities recognition.

I. INTRODUCTION

A Lao text is a string of symbols with no explicit word boundaries, as is the case in other Asian languages such as Japanese, Chinese, and Thai. Spaces between syllables are rarely used in Lao. To process Lao text, especially for rule-based machine translation, the text must first be segmented into individual terms or words.

Researchers have proposed several word segmentation techniques for many different languages. These techniques can be classified into two main approaches: Dictionary Based (DCB) and Machine Learning Based (MLB) [2]. The DCB approach is simple and straightforward: it looks up series of characters in a dictionary to find matching terms. Its main problems are words that are not in the dictionary and segmentation ambiguity. The MLB approach aims to solve these problems by using a machine learning classification model trained on the character patterns in a tagged corpus.

In this paper, we present a word segmentation process for Lao that combines three main operations: syllable extraction, longest matching (one of the techniques in the DCB approach), and named entities recognition (NER) [4], which is added to improve segmentation quality. Our dictionary contains approximately 52,000 words and provides sufficient lexical coverage for everyday use in the general domain.

II. BACKGROUND AND RELATED WORKS

Sisouvanh Vanthanavong et al. (2011) proposed Lao word segmentation based on conditional random fields (CRF) [1] using a tagged corpus of approximately 100,000 words, which gave precision and recall of 80.29% and 78.45%, respectively.

As Lao and Thai are very similar in both their spoken and written forms, work on Thai is also relevant. Choochart Haruechaiyasak et al. (2008) compared Thai word segmentation approaches [2] for both the DCB and MLB techniques. For DCB, they compared two algorithms: Longest Matching and Maximal Matching. For MLB, they compared four algorithms: Naive Bayes (NB), decision tree, support vector machine (SVM), and CRF. On the ORCHID corpus, which contains 113,404 words, the best performance was obtained with the CRF algorithm, with precision and recall of 95.79% and 94.98%, respectively.

Named entities recognition is an essential process widely used in natural language processing (NLP). Hutchatai Chanlekha et al. (2004) presented Thai named entity extraction that incorporates a maximum entropy model with simple heuristic information [3]. By combining machine learning and rule-based approaches, they obtained F-measures for person, location, and organization names of 90.44%, 82.16%, and 89.87%, respectively. Nattapong Tongtep et al. (2010) proposed a method for pattern-based extraction of named entities in Thai news documents [4]. It focuses on a rule-based approach and combines techniques such as longest word matching, longest pattern matching, clue words, and word lists from dictionaries. Their method achieved approximately 68-100% correctness, depending on the named entity type.

Previous research has shown that the MLB approach, especially the CRF technique, performs slightly better than the DCB approach. However, MLB performance depends mainly on the domain and size of the corpus [2]. Collecting a corpus that covers every domain and character pattern takes a great deal of time and effort before a model can be trained effectively. On the

other hand, although the DCB approach performs poorly on unknown words, this problem can be reduced significantly by combining syllable extraction with Lao named entities recognition.

III. METHODOLOGY

Lao word segmentation is underpinned by three basic operations: pre-processing, syllable extraction, and longest syllable level matching with NER. Fig. 1 illustrates the overview of the system.

[Fig. 1: flow diagram. Lao paragraphs or sentences -> Pre-processing -> Sentence -> Syllable Extraction -> List of syllables -> Longest Syllable Level Matching with Named Entity Recognition (inputs: Dictionary; Lao initial named entities) -> Lao segmented words]
Fig. 1 Lao word segmentation system overview

The pre-processing operation takes Lao paragraphs or sentences as input. Lao rarely uses a space between syllables, but it does use full stops (.) to mark the end of sentences. The paragraph is therefore split into sentences at each full stop (.) to help the longest syllable level matching with NER operation parse words more accurately.

The syllable structure of Lao contains consonants, vowels, and tone markers. Consonants occur on the baseline and can be divided into two categories: single consonants, of which there are 27, and double consonants, of which there are 6. Vowels can occur between, before, above, or below a consonantal character. There are 28 vowels in Lao. They are divided into categories according to their sound: 12 short vowels, 12 long vowels, and a set of 4 special vowels. There are 4 tone markers, which always occur on top of consonantal characters. There are also 3 special symbols: ໆ, indicating repetition of a syllable; ຯ, indicating "and others" (etc.); and ◌໌, indicating that the final consonant of a word borrowed from another language is silent. Tables I to III list the Lao consonants, vowels, and tone markers, respectively.

TABLE I
LAO CONSONANTS

Consonant  IPA    Consonant  IPA    Consonant  IPA
Single Consonants
ກ          /k/    ຕ          /t/    ຟ          /f/
ຂ          /kʰ/   ຖ          /tʰ/   ມ          /m/
ຄ          /kʰ/   ທ          /tʰ/   ຢ          /j/
ງ          /ŋ/    ນ          /n/    ຣ          /r/
ຈ          /c/    ບ          /b/    ລ          /l/
ສ          /s/    ປ          /p/    ວ          /w/
ຊ          /s/    ຜ          /pʰ/   ຫ          /h/
ຍ          /ɲ/    ຝ          /f/    ອ          /ʔ/
ດ          /d/    ພ          /pʰ/   ຮ          /h/
Double Consonants
ຫງ         /ŋ/    ຫນ/ໜ       /n/    ຫລ/ຫຼ       /r/
ຫຍ         /ɲ/    ຫມ/ໝ       /m/    ຫວ         /w/

TABLE II
LAO VOWELS
(◌ marks the position of the consonant)

Vowel   IPA     Vowel   IPA     Vowel   IPA
Short Vowels
◌ະ      /a/     ເ◌ະ     /e/     ເ◌ິ     /ɤ/
◌ິ      /i/     ແ◌ະ     /ɛ/     ◌ົວະ    /uə/
◌ຶ      /ɯ/     ໂ◌ະ     /o/     ເ◌ຶອ    /ɯə/
◌ຸ      /u/     ເ◌າະ    /ɔ/     ເ◌ັຍ    /iə/
Long Vowels
◌າ      /aː/    ເ◌      /eː/    ເ◌ີ     /ɤː/
◌ີ      /iː/    ແ◌      /ɛː/    ◌ົວ     /uːə/
◌ື      /ɯː/    ໂ◌      /oː/    ເ◌ືອ    /ɯːə/
◌ູ      /uː/    ◌ໍ      /ɔː/    ເ◌ຍ     /iːə/
Special Vowels
ໄ◌      /ai/    ໃ◌      /ai/    ເ◌ົາ    /ao/
◌ຳ      /am/

TABLE III
LAO TONE MARKERS

Tone Marker  Tone     IPA     Tone Marker  Tone     IPA
◌່           low      /àː/    ◌້           falling  /âː/
◌໊           high     /áː/    ◌໋           rising   /ǎː/

From the characteristics of the Lao writing system, it can be observed that word boundaries generally align with syllable boundaries. Working directly at character level would lead to incorrect segmentation of a sentence into single characters and small lexemes, so it is useful to perform syllable extraction before doing

other operations. The rules and algorithm for syllable extraction were proposed by Phonpasit Phissamay et al. (2004) for syllable identification in Lao [5].

To perform longest syllable level matching, the series of syllables is looked up in a dictionary using the forward technique. Fig. 2 describes the algorithm. Given a set of extracted syllables S and a set of dictionary words D, the algorithm outputs the set W of longest matching words.

For example, suppose the Lao sentence after syllable extraction is ຂ້ອຍ ໄປ ຕະ ຫຼາດ /kʰɔːy pay tá laːt/ (I go to the market). Firstly, the algorithm sets the current position marker, denoted CP, and the last position marker, denoted LP, to the first and last index of the syllable set, respectively. It then checks the syllables from index CP to LP, ຂ້ອຍໄປຕະຫຼາດ, against the dictionary. If there is no match, it decreases LP by one (TmpLP in Fig. 2) and checks the syllables from CP to LP, ຂ້ອຍໄປຕະ, against the dictionary again. It continues in this way until it finds a match or CP equals LP. It then forms a word from the syllables at index CP to LP, sets CP to LP + 1, and resets LP to the last index of the syllable set. The algorithm repeats these steps until all syllables have been processed, that is, until CP is greater than LP. The word segmentation result for this example is ຂ້ອຍ ໄປ ຕະຫຼາດ.

Algorithm 1: Longest syllable level matching

    S = {s0, s1, s2, ..., sn}    # Set of extracted syllables
    D = {d0, d1, d2, ..., dn}    # Set of words in the dictionary
    W = {w0, w1, w2, ..., wn}    # Set of longest matching words

    # Initialize variables
    Let CP = 0                   # Current Position Marker
    Let LP = the last index of set S
    Let TmpLP = LP               # Last Position Marker
    Do
        If CP <= TmpLP and S[CP] is not SPACE Then
            If S[CP to TmpLP] ∈ D Then
                W ← S[CP to TmpLP]
                CP = TmpLP + 1
                TmpLP = LP
            Else
                TmpLP = TmpLP - 1
            End If
        Else
            W ← S[CP]
            CP = CP + 1
            TmpLP = LP
        End If
    until CP > LP
    return W

Fig. 2 Algorithm for longest syllable level matching

Lao personal names usually start with a title that can be used as a clue. In general, native Lao personal names are composed of title + one space + first name + one space + [last name], where the last name is optional, as shown in Fig. 3 and Table IV.

[Fig. 3: Title -> First Name -> [Last Name]]
Fig. 3 Regular grammar for personal name written in Lao language

TABLE IV
EXAMPLE OF PERSON NAME WRITTEN IN LAO LANGUAGE

No  Title  First Name  Last Name
1   ທ້າວ   ສກໃຈ        ລດຕະນະ
2   ທ່ານ   ອານສອນ      ໄພສານ
3   ນາງ    ສ ແນດຕາ     ພະວງສາ

Location expressions such as institute, company, school, university, office, district, city, village, town, province, and country also have a title that can be used as a clue. Generally, they are composed of title + one space + location name, as shown in Fig. 4 and Table V.

[Fig. 4: Title -> Location Name]
Fig. 4 Regular grammar for location name written in Lao language

TABLE V
EXAMPLE OF LOCATION NAME WRITTEN IN LAO LANGUAGE

No  Title    Location Name
1   ບໍລິສັດ  ໄຊຍະສດທ ປ ກສາໄອທ
2   ອົງການ   ການຄາໂລກ
3   ບ້ານ     ສສງວອນ

Once the word boundaries of personal names and location names in Lao have been determined in this way, a rule can be created to recognize named entities. Fig. 5 describes the modified algorithm for longest matching using the forward technique with NER.

IV. EXPERIMENT AND EVALUATION

Lao news documents were used to evaluate the performance of our approach. They were collected from websites in the following categories: General, Sport, and Education. All categories were taken from a single Lao news publisher, Vientiane Mai [6]. Ten articles were randomly selected from each category, totalling thirty articles. The dictionary used contains approximately 52,000 words. The named entities were divided into two types for recognition: person name (PER) and location (LOC). Table VI shows the list of Lao initial titles and location names that can be used as clues to detect named entities. Table VII shows the results of our approach before and after using NER.
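The F-measure values reported in Tables VII and VIII are consistent with the harmonic mean of precision and recall. The paper does not state its formula explicitly, so the following is only a minimal check under the assumption that F = 2PR / (P + R); the function name `f_measure` and the dictionary `table_viii` are illustrative and not part of the original work.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall (both given in percent).

    Assumption: the paper's F-Measure is the balanced F1 score.
    """
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Average precision and recall as reported in Table VIII.
table_viii = {
    "DCB Without NER": (79.87, 88.62),
    "DCB With NER": (85.21, 92.36),
}

for approach, (p, r) in table_viii.items():
    print(f"{approach}: F-Measure = {f_measure(p, r):.2f}")
```

Under this assumption the computed values round to the figures given in Table VIII (84.02 and 88.64).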
It can be observed that word segmentation improves significantly when NER is used, especially in the General category, where named entities are most likely to occur.

Algorithm 2: Longest Syllable Level Matching with NER

    S = {s0, s1, s2, ..., sn}    # Set of extracted syllables
    D = {d0, d1, d2, ..., dn}    # Set of words in the dictionary
    C = {c0, c1, c2, ..., cn}    # Set of clue words
    W = {w0, w1, w2, ..., wn}    # Set of longest matching words

    # Initialize variables
    Let CP = 0                   # Current Position Marker
    Let LP = the last index of set S
    Let FlagClue = FALSE
    Let TmpLP = LP               # Last Position Marker
    Let NI = 0                   # Next Index of Space
    Do
        If CP <= TmpLP and S[CP] is not SPACE Then
            If S[CP to TmpLP] ∈ D Then
                If S[CP to TmpLP] ∈ C and S[TmpLP + 1] is SPACE Then
                    FlagClue = TRUE
                Else
                    FlagClue = FALSE
                End If
                W ← S[CP to TmpLP]
                CP = TmpLP + 1
                If FlagClue is TRUE Then
                    NI = CP
                    Do
                        NI = NI + 1
                    until S[NI] is SPACE
                    W ← S[CP to NI - 1]
                    CP = NI
                    FlagClue = FALSE
                End If
                TmpLP = LP
            Else
                TmpLP = TmpLP - 1
            End If
        Else
            W ← S[CP]
            CP = CP + 1
            TmpLP = LP
        End If
    until CP > LP
    return W

Fig. 5 Algorithm for longest matching with NER

TABLE VI
LIST OF LAO INITIAL TITLE AND LOCATION NAME

Type  Lao Title      English Title
PER   ທ້າວ           Mr.
PER   ນາງ            Ms.
PER   ອຈ             Teacher
PER   ດຣ             Dr.
PER   ຮສ             Assoc. Prof.
PER   ຜຊສ            Asst. Prof.
PER   ສຈ             Prof.
PER   ຮສ ດຣ          Ph.D. Assoc. Prof.
PER   ຜຊສ ດຣ         Ph.D. Asst. Prof.
PER   ສຈ ດຣ          Ph.D. Prof.
PER   ທ່ານ           Mr.
PER   ສິບຕີ          Private 1st class
PER   ສິບໂທ          Corporal
PER   ສິບເອກ         Sergeant
PER   ຮ້ອຍຕີ         Second Lieutenant
PER   ຮ້ອຍໂທ         First Lieutenant
PER   ຮ້ອຍເອກ        Captain
PER   ພັນຕີ          Major
PER   ພົນໂທ          Lieutenant General
PER   ພົນເອກ         General
PER   ພະນະທ່ານ       Excellency
LOC   ອົງການ         Organization
LOC   ບໍລິສັດ        Company
LOC   ບ້ານ           Village
LOC   ບ້ານພັກ        Guest House
LOC   ເມືອງ          District
LOC   ແຂວງ           Province
LOC   ໂຮງແຮມ         Hotel
LOC   ຮ້ານ           Restaurant
LOC   ລັດວິສາຫະກິດ   State enterprise

Fig. 6 Comparison chart result of longest syllable level matching approach before and after using NER
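Algorithm 1 (Fig. 2) and its NER extension (Fig. 5) can be sketched together in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `segment`, the `SPACE` constant, and the romanized stand-in syllables are hypothetical; spaces are treated as separators and skipped rather than emitted; and bounds checks are added so the scan for the next space stops at the end of the input.

```python
SPACE = " "


def segment(syllables, dictionary, clue_words=frozenset()):
    """Greedy forward longest matching over a list of syllables.

    With an empty clue_words set this behaves like plain longest
    syllable level matching; with clue words it also groups the
    syllables that follow "clue + space" into one named-entity token.
    """
    words = []
    cp = 0                      # current position marker
    n = len(syllables)
    while cp < n:
        if syllables[cp] == SPACE:
            cp += 1             # assumption: spaces are separators only
            continue
        # Try the longest candidate first, then shrink by one syllable.
        end = n - 1
        while end > cp and "".join(syllables[cp:end + 1]) not in dictionary:
            end -= 1
        # If nothing matched, end == cp and the single syllable is emitted.
        word = "".join(syllables[cp:end + 1])
        words.append(word)
        cp = end + 1
        # NER rule: a dictionary word that is also a clue word and is
        # followed by a space introduces a name up to the next space.
        is_clue = word in dictionary and word in clue_words
        if is_clue and cp < n and syllables[cp] == SPACE:
            ni = cp + 1         # next index of space
            while ni < n and syllables[ni] != SPACE:
                ni += 1
            if ni > cp + 1:
                words.append("".join(syllables[cp + 1:ni]))
            cp = ni
    return words


# Romanized stand-ins for Lao syllables (illustrative only).
lexicon = {"khoy", "pai", "talat", "thao"}
print(segment(["khoy", "pai", "ta", "lat"], lexicon))
print(segment(["thao", SPACE, "suk", "chai", SPACE, "pai", "ta", "lat"],
              lexicon, clue_words={"thao"}))
```

As in Fig. 5, the sketch groups only the syllables up to the next space after a clue word, so a last name separated by a further space is not attached to the entity.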

TABLE VII
EVALUATION RESULTS OF LONGEST SYLLABLE LEVEL MATCHING BEFORE AND AFTER USING NER

Approach         Category   Precision (%)  Recall (%)  F-Measure (%)
DCB Without NER  General    76.59          86.21       81.12
DCB Without NER  Sport      79.41          88.67       83.79
DCB Without NER  Education  82.58          90.31       86.27
DCB With NER     General    83.54          91.10       87.16
DCB With NER     Sport      83.69          91.55       87.44
DCB With NER     Education  87.43          93.79       90.50

TABLE VIII
AVERAGE EVALUATION RESULTS OF LONGEST SYLLABLE LEVEL MATCHING BEFORE AND AFTER USING NER

Approach         Precision (%)  Recall (%)  F-Measure (%)
DCB Without NER  79.87          88.62       84.02
DCB With NER     85.21          92.36       88.64

The errors most likely to occur with our approach come from word ambiguity and unknown words. For example, the Lao sentence ຂ້ອຍໃຫ້ການສະໜັບສະໜູນເຈົ້າ /kʰɔːy hay kaːn sánáp sánuːn cao/ (I give you the support) is parsed into ຂ້ອຍ (I) [Pronoun], ໃຫ້ການ (give evidence) [Verb], ສະໜັບສະໜູນ (to support) [Verb], ເຈົ້າ (you) [Pronoun], instead of ຂ້ອຍ (I) [Pronoun], ໃຫ້ (give) [Verb], ການສະໜັບສະໜູນ (support) [Noun], ເຈົ້າ (you) [Pronoun].

V. CONCLUSION AND FUTURE WORK

This paper presented a Lao word segmentation approach using longest syllable level matching with NER. The approach first extracts the syllables from the input text and then applies longest matching, one of the techniques in the DCB approach, to combine them into words. We also proposed a technique for Lao named entities recognition, specifically for person and location names, to improve the accuracy of word segmentation. Our approach achieved precision and recall of 85.21% and 92.36%, respectively. In future work, we will try to integrate an MLB approach with ours to improve word segmentation performance, especially for words that are not in the dictionary and for ambiguous words.

REFERENCES

[1] S. Vanthanavong and C. Haruechaiyasak, "LaoWS: Lao Word Segmentation Based on Conditional Random Fields," in Conference on Human Language Technology for Development, 2011, pp. 21-26.
[2] C. Haruechaiyasak, S. Kongyoung, and M. Dailey, "A comparative study on Thai word segmentation approaches," in 5th Int. Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology, 2008, pp. 125-128.
[3] H. Chanlekha and A. Kawtrakul, "Thai Named Entity Extraction by incorporating Maximum Entropy Model with Simple Heuristic Information," in 1st Int. Joint Conference on NLP, 2004.
[4] N. Tongtep and T. Theeramunkong, "Pattern-based Extraction of Named Entities in Thai News Documents," Thammasat Int. J. Sc. Tech., Vol. 15, No. 1, pp. 70-81, January-March 2010.
[5] P. Phissamay, et al., "Syllabification of Lao Script for Line Breaking," Tech. Rep. of STEA, Lao PDR, 2004.
[6] (2013) Vientiane Mai website. [Online]. Available: http://www.vientianemai.net/