A Hybrid Approach to Lao Word Segmentation using Longest Syllable Level Matching with Named Entities Recognition
Arounyadeth Srithirath, Pusadee Seresangtakul
Department of Computer Science, Khon Kaen University, Khon Kaen, Thailand
tom_zmax@yahoo.com, pusadee@kku.ac.th

Abstract
The Lao language is written without word delimiters, which makes it difficult to process automatically. Automatic word segmentation is therefore an essential but challenging task in natural language processing for Lao. This paper proposes longest syllable-level matching with named entity recognition for Lao word segmentation. Syllables are first extracted from the input text. Longest matching, one of the techniques in the dictionary-based approach, is then applied together with named entity recognition to combine the syllables into words. The approach achieved a precision of 85.21% and a recall of 92.36%.

Keywords
Lao word segmentation, tokenization, syllable extraction, longest matching, dictionary based, named entities recognition.

I. INTRODUCTION
A Lao text is a string of symbols with no explicit word boundaries, as in other Asian languages such as Japanese, Chinese, and Thai. Spaces between syllables are rarely used in Lao. To process Lao text, especially for rule-based machine translation, the text must first be segmented into individual terms or words. Researchers have proposed several word segmentation techniques for many different languages. These techniques fall into two main approaches: Dictionary Based (DCB) and Machine Learning Based (MLB) [2]. The DCB approach is simple and straightforward: it looks up series of characters in a dictionary to find matching terms. Its main problems are words that are not in the dictionary and ambiguous parses.
The MLB approach aims to solve the problems of the DCB approach by using a classification model learned from the character patterns in a tagged corpus. In this paper, we present a word segmentation process for the Lao language that combines three main operations in order to improve segmentation quality: syllable extraction, longest matching (one of the DCB techniques), and named entities recognition (NER) [4]. Our dictionary contains approximately 52,000 words, a lexicon sufficient to cover daily usage in the general domain.

II. BACKGROUND AND RELATED WORKS
Recently, Sisouvanh Vanthanavong et al. (2011) proposed Lao word segmentation based on conditional random fields (CRF) [1] using a tagged corpus of approximately 100,000 words, obtaining a precision and recall of 80.29% and 78.45%, respectively. As Lao and Thai are very similar in both their spoken and written forms, the comparison of Thai word segmentation approaches by Choochart Haruechaiyasak et al. (2008) [2] is also relevant. For DCB, they compared two algorithms: longest matching and maximal matching. For MLB, they compared four algorithms: Naive Bayes (NB), decision tree, support vector machine (SVM), and CRF. On the ORCHID corpus, which contains 113,404 words, the best performance was obtained by the CRF algorithm, with a precision and recall of 95.79% and 94.98%, respectively.

Named entities recognition is an essential process widely used in natural language processing (NLP). Hutchatai Chanlekha et al. (2004) presented Thai named entity extraction that incorporates a maximum entropy model with simple heuristic information [3]. By combining machine-learning and rule-based approaches, they obtained F-measures of 90.44%, 82.16%, and 89.87% for person, location, and organization names, respectively. Nattapong Tongtep et al. (2010) proposed pattern-based extraction of named entities in Thai news documents [4], a rule-based approach that combines techniques such as longest word matching, longest pattern matching, clue words, and word lists from dictionaries. The correctness of their method depended on the type of named entity.

Previous research has shown that the MLB approach, especially CRF, performs slightly better than the DCB approach. However, MLB performance depends mainly on the domain and size of the corpus [2]. Collecting a corpus that covers every domain and character pattern, so that the model can be trained effectively, takes a great deal of time and effort. On the other hand, although the DCB approach performs poorly on unknown words, this problem can be reduced significantly by combining syllable extraction with Lao named entities recognition.

/13/$ IEEE

III. METHODOLOGY
Lao word segmentation is built on three basic operations: pre-processing, syllable extraction, and longest syllable-level matching with NER. Fig. 1 illustrates the overview of the system.

[Figure: Lao paragraphs or sentences -> Pre-processing -> Sentence -> Syllable Extraction -> List of syllables -> Longest Syllable Level Matching with Named Entity Recognition (using the Dictionary and the Lao Initial Named Entity list) -> Lao segmented words]
Fig. 1 Lao word segmentation system overview

The pre-processing operation takes Lao paragraphs or sentences as input. Lao rarely uses spaces between syllables, but it does use full stops (.) to mark the end of sentences. The paragraph is therefore split into sentences at each full stop, which helps the longest syllable-level matching with NER operation parse the words more accurately.

Syllables in Lao consist of consonants, vowels, and tone markers. Consonants occur on the baseline and fall into two categories: 27 single consonants and 6 double consonants. Vowels can occur between, before, above, or below a consonantal character. There are 28 vowels in Lao, divided by sound into 12 short vowels, 12 long vowels, and a set of 4 special vowels. There are 4 tone markers, which always occur above consonantal characters. There are also 3 special symbols: ໆ, indicating repetition of a syllable; ຯ, indicating "and others" (etc.); and a third symbol indicating that the final consonant of a word borrowed from another language is silent. Tables I to III list the Lao consonants, vowels, and tone markers.

TABLE I
LAO CONSONANTS
Consonant  IPA     Consonant  IPA     Consonant  IPA
Single Consonants
ກ          /k/     ຕ          /t/     ຟ          /f/
ຂ          /kʰ/    ຖ          /tʰ/    ມ          /m/
ຄ          /kʰ/    ທ          /tʰ/    ຢ          /j/
ງ          /ŋ/     ນ          /n/     ຣ          /r/
ຈ          /c/     ບ          /b/     ລ          /l/
ສ          /s/     ປ          /p/     ວ          /w/
ຊ          /s/     ຜ          /pʰ/    ຫ          /h/
ຍ          /ɲ/     ຝ          /f/     ອ          /ʔ/
ດ          /d/     ພ          /pʰ/    ຮ          /h/
Double Consonants
ຫງ         /ŋ/     ຫນ/ໜ       /n/     ຫລ/ຫ       /r/
ຫຍ         /ɲ/     ຫມ/ໝ       /m/     ຫວ         /w/

TABLE II
LAO VOWELS
Vowel   IPA      Vowel   IPA      Vowel   IPA
Short Vowels
ະ       /a/      ເ ະ     /e/              /ɤ/
        /i/      ແ ະ     /ɛ/      ວະ      /uə/
        /ɯ/      ໂ ະ     /o/      ເ ອ     /ɯə/
        /u/      າະ      /ɔ/      ຍ       /iə/
Long Vowels
າ       /aː/     ເ       /eː/             /ɤː/
        /iː/     ແ       /ɛː/     ວ       /uːə/
        /ɯː/     ໂ       /oː/     ເ ອ     /ɯːə/
        /uː/             /ɔː/     ຍ       /iːə/
Special Vowels
ໄ       /ai/     ໃ       /ai/     ເ າ     /ao/     າ     /am/

TABLE III
LAO TONE MARKERS
Tone Marker  Tone     IPA      Tone Marker  Tone     IPA
             low      /àː/                  falling  /âː/
             high     /áː/                  rising   /ǎː/

From the characteristics of the Lao writing system, it can be observed that word boundaries generally align with syllable boundaries. Working directly at character level leads to incorrect segmentation of a sentence into single characters and small lexemes, so it is useful to perform syllable extraction before the other operations. The rules and algorithm for syllable identification in Lao were proposed by Phonpasit Phissamay et al. (2004) [5].

To perform longest syllable-level matching, the series of syllables is looked up in a dictionary using the forward technique. Fig. 2 describes the algorithm. Given a set of extracted syllables S and a set of dictionary words D, the algorithm outputs a set of longest-matching words W. For example, suppose the Lao sentence after syllable extraction is ຂ ອຍ ໄປ ຕະ ຫ າດ /kʰɔːy ay tá l ːt/ (I go to the market). First, the algorithm sets the current position CP and the last position LP to the first and last index of the syllable set, respectively. It then checks the syllables from CP to LP (ຂ ອຍໄປຕະຫ າດ) against the dictionary. If there is no match, it decreases LP by one and checks the syllables from CP to LP (ຂ ອຍໄປຕະ) again. It repeats this until it finds a match or CP equals LP. It then forms a word from the syllables CP to LP, sets CP to LP + 1, and resets LP to the last index of the syllable set. The algorithm continues until all syllables have been processed, that is, until CP is greater than LP. The word segmentation result for this example is ຂ ອຍ ໄປ ຕະຫ າດ.

Algorithm 1: Longest syllable level matching
S = {s0, s1, s2, ..., sn}    # Set of extracted syllables
D = {d0, d1, d2, ..., dn}    # Set of words in the dictionary
W = {w0, w1, w2, ..., wn}    # Set of output words (longest matches)
# Initialize variables
Let CP = 0                   # Current Position Marker
Let LP = the last index of set S
Let TmpLP = LP               # Last Position Marker
Repeat
    If CP <= TmpLP and S[CP] is not SPACE Then
        If S[CP to TmpLP] ∈ D Then
            W ← S[CP to TmpLP]
            CP = TmpLP + 1
            TmpLP = LP
        Else
            TmpLP = TmpLP - 1
        End If
    Else
        W ← S[CP]
        CP = CP + 1
        TmpLP = LP
    End If
until CP > LP
return W
Fig. 2 Algorithm for longest syllable level matching

Lao personal names usually start with a title that can be used as a clue. In general, native Lao personal names are written as title + one space + first name + one space + [last name], where the last name is optional, as shown in Fig. 3 and Table IV.

[Figure: Title -> First Name -> Last Name]
Fig. 3 Regular grammar for personal name written in Lao language

TABLE IV
EXAMPLE OF PERSON NAME WRITTEN IN LAO LANGUAGE
No   Title   First Name   Last Name
1    ທາວ     ສກໃຈ         ລດຕະນະ
2    ທານ     ອານສອນ       ໄພສານ
3    ນາງ     ສ ແນດຕາ      ພະວງສາ

Furthermore, location expressions such as institute, company, school, university, office, district, city, village, town, province, and country also have a title that can be used as a clue. Generally, they are composed of title + one space + location name, as shown in Fig. 4 and Table V.

[Figure: Title -> Location Name]
Fig. 4 Regular grammar for location name written in Lao language

TABLE V
EXAMPLE OF LOCATION NAME WRITTEN IN LAO LANGUAGE
No   Title    Location Name
1    ບ ລສດ    ໄຊຍະສດທ ປ ກສາໄອທ
2    ອງການ    ການຄາໂລກ
3    ບານ      ສສງວອນ

By determining the word boundaries of personal names and location names in Lao, a rule can be created to recognize named entities. Fig. 5 describes the modified longest matching algorithm using the forward technique with NER.

IV. EXPERIMENTS AND EVALUATION
Lao news documents were used to evaluate the performance of our approach. They were collected from the website of a Lao news publisher, Vientiane Mai [6], in three categories: General, Sport, and Education. Ten articles were randomly selected from each category, for a total of thirty articles. The dictionary contains approximately 52,000 words. Named entities were divided into two types for recognition: person name (PER) and location (LOC). Table VI lists the Lao initial titles and location names that can be used as clues to detect named entities. Table VII shows the results of our approach before and after using NER.
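The text reports precision, recall, and F-measure but does not spell out how matches were counted. The following is a minimal sketch, assuming a predicted word counts as correct only when both of its character boundaries agree with the reference segmentation, and assuming the usual harmonic-mean F-measure. The function names `word_spans` and `precision_recall_f` and the romanized sample words are hypothetical and chosen for illustration only.

```python
from typing import List, Set, Tuple


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Map a segmentation to the set of (start, end) character offsets of its words."""
    spans = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def precision_recall_f(gold: List[str], predicted: List[str]) -> Tuple[float, float, float]:
    """Score a predicted segmentation against a reference one (assumed exact-span matching)."""
    gold_spans = word_spans(gold)
    pred_spans = word_spans(predicted)
    correct = len(gold_spans & pred_spans)
    precision = correct / len(pred_spans) if pred_spans else 0.0
    recall = correct / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Romanized placeholders standing in for Lao words (hypothetical data).
gold = ["khoy", "hai", "kansanapsanun", "chao"]
predicted = ["khoy", "haikan", "sanapsanun", "chao"]
print(precision_recall_f(gold, predicted))  # (0.5, 0.5, 0.5)
```

Under the same harmonic-mean assumption, the overall precision of 85.21% and recall of 92.36% reported in the abstract would correspond to an F-measure of roughly 88.64%. This figure is computed here from those two numbers and is not quoted from the tables.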
It can be observed that word segmentation improves significantly when NER is used, especially in the General category, where named entities are most likely to occur.

Fig. 6 Comparison chart result of longest syllable level matching approach before and after using NER

TABLE VI
LIST OF LAO INITIAL TITLE AND LOCATION NAME
Type   Lao Title       English Title
PER    ທ າວ            Mr.
PER    ນາງ             Ms.
PER    ອຈ              Teacher
PER    ດຣ              Dr.
PER    ຮສ              Assoc. Prof.
PER    ຜຊສ             Asst. Prof.
PER    ສຈ              Prof.
PER    ຮສ ດຣ           Ph.D. Assoc. Prof.
PER    ຜຊສ ດຣ          Ph.D. Asst. Prof.
PER    ສຈ ດຣ           Ph.D. Prof.
PER    ທ ານ            Mr.
PER    ສ ບຕ            Private 1st class
PER    ສ ບໂທ           Corporal
PER    ສ ບເອກ          Sergeant
PER    ຮ ອຍຕ           Second Lieutenant
PER    ຮ ອຍໂທ          First Lieutenant
PER    ຮ ອຍເອກ         Captain
PER    ພ ນຕ            Major
PER    ພນໂທ            Lieutenant General
PER    ພນເອກ           General
PER    ພະນະທ ານ        Excellency
LOC    ອງການ           Organization
LOC    ບ ລສດ           Company
LOC    ບານ             Village
LOC    ບານພກ           Guest House
LOC    ມອງ             District
LOC    ແຂວງ            Province
LOC    ໂຮງແຮມ          Hotel
LOC    ຮານ             Restaurant
LOC    ລ ດວ ສາຫະກ ດ    State enterprise

Algorithm 2: Longest Syllable Level Matching with NER
S = {s0, s1, s2, ..., sn}    # Set of extracted syllables
D = {d0, d1, d2, ..., dn}    # Set of words in the dictionary
C = {c0, c1, c2, ..., cn}    # Set of clue words
W = {w0, w1, w2, ..., wn}    # Set of output words (longest matches)
# Initialize variables
Let CP = 0                   # Current Position Marker
Let LP = the last index of set S
Let FlagClue = FALSE
Let TmpLP = LP               # Last Position Marker
Let NI = 0                   # Next Index of Space
Repeat
    If CP <= TmpLP and S[CP] is not SPACE Then
        If S[CP to TmpLP] ∈ D Then
            If S[CP to TmpLP] ∈ C and S[TmpLP + 1] is SPACE Then
                FlagClue = TRUE
            Else
                FlagClue = FALSE
            End If
            W ← S[CP to TmpLP]
            CP = TmpLP + 1
            If FlagClue is TRUE Then
                NI = CP
                Repeat
                    NI = NI + 1
                until S[NI] is SPACE
                W ← S[CP to NI - 1]
                CP = NI
                FlagClue = FALSE
            End If
            TmpLP = LP
        Else
            TmpLP = TmpLP - 1
        End If
    Else
        W ← S[CP]
        CP = CP + 1
        TmpLP = LP
    End If
until CP > LP
return W
Fig. 5 Algorithm for longest matching with NER
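The longest syllable-level matching procedure, with and without the NER clue step (Algorithms 1 and 2), can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `segment`, the `SPACE` marker, and the romanized sample data are hypothetical; candidates are assumed never to cross a space; and the name following a clue word is emitted as a separate token after the space, ending at the next space or at the end of the input.

```python
from typing import AbstractSet, List, Sequence

SPACE = " "  # assumed marker for a space token in the syllable list


def segment(syllables: Sequence[str],
            dictionary: AbstractSet[str],
            clues: AbstractSet[str] = frozenset()) -> List[str]:
    """Forward longest matching over syllables; an empty clue set disables the NER step."""
    words: List[str] = []
    n = len(syllables)
    cp = 0  # current position (CP in the pseudo-code)
    while cp < n:
        if syllables[cp] == SPACE:
            words.append(SPACE)
            cp += 1
            continue
        # Find the end of the current run of syllables (assumption: no word crosses a space).
        end = cp
        while end < n and syllables[end] != SPACE:
            end += 1
        # Try the longest candidate first and shrink it from the right.
        matched = False
        for lp in range(end, cp, -1):
            candidate = "".join(syllables[cp:lp])
            if candidate in dictionary:
                matched = True
                break
        if not matched:
            # No dictionary word starts here: emit the single syllable as an unknown word.
            candidate, lp = syllables[cp], cp + 1
        words.append(candidate)
        cp = lp
        # NER step: a clue word followed by a space marks the next run as one named entity.
        if matched and candidate in clues and cp < n and syllables[cp] == SPACE:
            ni = cp + 1  # next index of space (NI in the pseudo-code)
            while ni < n and syllables[ni] != SPACE:
                ni += 1
            if ni > cp + 1:
                words.append(SPACE)
                words.append("".join(syllables[cp + 1:ni]))
                cp = ni
    return words


# Romanized placeholders standing in for Lao syllables (hypothetical data).
dictionary = {"khoy", "pai", "talat", "thao"}
print(segment(["khoy", "pai", "ta", "lat"], dictionary))
# ['khoy', 'pai', 'talat']

sentence = ["thao", " ", "som", "sak", " ", "pai", "ta", "lat"]
print(segment(sentence, dictionary))
# ['thao', ' ', 'som', 'sak', ' ', 'pai', 'talat']
print(segment(sentence, dictionary, clues={"thao"}))
# ['thao', ' ', 'somsak', ' ', 'pai', 'talat']
```

In the last two calls the unknown name is split into single syllables without the clue set and kept whole with it, which is the effect the NER step is meant to have on out-of-dictionary names.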
TABLE VII
EVALUATION RESULTS OF LONGEST SYLLABLE LEVEL MATCHING BEFORE AND AFTER USING NER
Approach           Categories   Precision   Recall   F-Measure
DCB Without NER    General
DCB Without NER    Sport
DCB Without NER    Education
DCB With NER       General
DCB With NER       Sport
DCB With NER       Education

TABLE VIII
AVERAGE EVALUATION RESULTS OF LONGEST SYLLABLE LEVEL MATCHING BEFORE AND AFTER USING NER
Approach           Precision   Recall   F-Measure
DCB Without NER
DCB With NER

The errors most likely to occur with our approach are word ambiguity and unknown words. For example, the Lao sentence ຂອຍໃຫການສະໜບສະໜນ ຈາ /kʰɔːy y kaːn sáná sán ːn c o/ (I give you the support) is parsed into ຂອຍ (I) [Pronoun], ໃຫການ (give evidence) [Verb], ສະໜບສະໜນ (to support) [Verb], ຈາ (you) [Pronoun], instead of ຂອຍ (I) [Pronoun], ໃຫ (give) [Verb], ການສະໜບສະໜນ (support) [Noun], ຈາ (you) [Pronoun].

V. CONCLUSION AND FUTURE WORK
This paper presented a Lao word segmentation approach using longest syllable-level matching with NER. The approach first extracts syllables from the input text and then applies longest matching, one of the techniques in the DCB approach, to combine them into words. We also proposed a technique for Lao named entities recognition, covering person and location names, in order to improve the accuracy of word segmentation. In our experiments, the approach achieved a precision of 85.21% and a recall of 92.36%. In future work, we will try to integrate an MLB approach with our approach in order to improve word segmentation performance, especially for words that are not in the dictionary and for ambiguous words.

REFERENCES
[1] S. Vanthanavong and C. Haruechaiyasak, "LaoWS: Lao Word Segmentation Based on Conditional Random Fields," in Conference on Human Language Technology for Development, 2011.
[2] C. Haruechaiyasak, S. Kongyoung, and M. Dailey, "A comparative study on Thai word segmentation approaches," in 5th Int. Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology, 2008.
[3] H. Chanlekha and A. Kawtrakul, "Thai Named Entity Extraction by incorporating Maximum Entropy Model with Simple Heuristic Information," in 1st Int. Joint Conference on NLP, 2004.
[4] N. Tongtep and T. Theeramunkong, "Pattern-based Extraction of Named Entities in Thai News Documents," Thammasat Int. J. Sc. Tech., Vol. 15, No. 1, January-March 2010.
[5] P. Phissamay et al., "Syllabification of Lao Script for Line Breaking," Tech. Rep. of STEA, Lao PDR, 2004.
[6] (2013) Vientiane Mai website. [Online]. Available:
More informationTHE ROLE OF DECISION TREES IN NATURAL LANGUAGE PROCESSING
SISOM & ACOUSTICS 2015, Bucharest 21-22 May THE ROLE OF DECISION TREES IN NATURAL LANGUAGE PROCESSING MarilenaăLAZ R 1, Diana MILITARU 2 1 Military Equipment and Technologies Research Agency, Bucharest,
More informationCLASSIFICATION OF TEXT DOCUMENTS USING INTEGER REPRESENTATION AND REGRESSION: AN INTEGRATED APPROACH
ISSN: 0976-3104 Danti and Bhushan. ARTICLE OPEN ACCESS CLASSIFICATION OF TEXT DOCUMENTS USING INTEGER REPRESENTATION AND REGRESSION: AN INTEGRATED APPROACH Ajit Danti 1 and SN Bharath Bhushan 2* 1 Department
More informationMultilingual Document Clustering: an Heuristic Approach Based on Cognate Named Entities
Multilingual Document Clustering: an Heuristic Approach Based on Cognate Named Entities Soto Montalvo GAVAB Group URJC Raquel Martínez NLP&IR Group UNED Arantza Casillas Dpt. EE UPV-EHU Víctor Fresno GAVAB
More informationNetpix: A Method of Feature Selection Leading. to Accurate Sentiment-Based Classification Models
Netpix: A Method of Feature Selection Leading to Accurate Sentiment-Based Classification Models 1 Netpix: A Method of Feature Selection Leading to Accurate Sentiment-Based Classification Models James B.
More informationModule 12. Machine Learning. Version 2 CSE IIT, Kharagpur
Module 12 Machine Learning 12.1 Instructional Objective The students should understand the concept of learning systems Students should learn about different aspects of a learning system Students should
More informationSpeech Recognition at ICSI: Broadcast News and beyond
Speech Recognition at ICSI: Broadcast News and beyond Dan Ellis International Computer Science Institute, Berkeley CA Outline 1 2 3 The DARPA Broadcast News task Aspects of ICSI
More informationSpeech Emotion Recognition Using Support Vector Machine
Speech Emotion Recognition Using Support Vector Machine Yixiong Pan, Peipei Shen and Liping Shen Department of Computer Technology Shanghai JiaoTong University, Shanghai, China panyixiong@sjtu.edu.cn,
More informationLearning From the Past with Experiment Databases
Learning From the Past with Experiment Databases Joaquin Vanschoren 1, Bernhard Pfahringer 2, and Geoff Holmes 2 1 Computer Science Dept., K.U.Leuven, Leuven, Belgium 2 Computer Science Dept., University
More informationThe 9 th International Scientific Conference elearning and software for Education Bucharest, April 25-26, / X
The 9 th International Scientific Conference elearning and software for Education Bucharest, April 25-26, 2013 10.12753/2066-026X-13-154 DATA MINING SOLUTIONS FOR DETERMINING STUDENT'S PROFILE Adela BÂRA,
More informationReading Grammar Section and Lesson Writing Chapter and Lesson Identify a purpose for reading W1-LO; W2- LO; W3- LO; W4- LO; W5-
New York Grade 7 Core Performance Indicators Grades 7 8: common to all four ELA standards Throughout grades 7 and 8, students demonstrate the following core performance indicators in the key ideas of reading,
More informationPrediction of Maximal Projection for Semantic Role Labeling
Prediction of Maximal Projection for Semantic Role Labeling Weiwei Sun, Zhifang Sui Institute of Computational Linguistics Peking University Beijing, 100871, China {ws, szf}@pku.edu.cn Haifeng Wang Toshiba
More informationDegree Qualification Profiles Intellectual Skills
Degree Qualification Profiles Intellectual Skills Intellectual Skills: These are cross-cutting skills that should transcend disciplinary boundaries. Students need all of these Intellectual Skills to acquire
More information11/29/2010. Statistical Parsing. Statistical Parsing. Simple PCFG for ATIS English. Syntactic Disambiguation
tatistical Parsing (Following slides are modified from Prof. Raymond Mooney s slides.) tatistical Parsing tatistical parsing uses a probabilistic model of syntax in order to assign probabilities to each
More informationPOS tagging of Chinese Buddhist texts using Recurrent Neural Networks
POS tagging of Chinese Buddhist texts using Recurrent Neural Networks Longlu Qin Department of East Asian Languages and Cultures longlu@stanford.edu Abstract Chinese POS tagging, as one of the most important
More informationWHEN THERE IS A mismatch between the acoustic
808 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 3, MAY 2006 Optimization of Temporal Filters for Constructing Robust Features in Speech Recognition Jeih-Weih Hung, Member,
More information2/15/13. POS Tagging Problem. Part-of-Speech Tagging. Example English Part-of-Speech Tagsets. More Details of the Problem. Typical Problem Cases
POS Tagging Problem Part-of-Speech Tagging L545 Spring 203 Given a sentence W Wn and a tagset of lexical categories, find the most likely tag T..Tn for each word in the sentence Example Secretariat/P is/vbz
More informationIMPROVING SPEAKING SKILL OF THE TENTH GRADE STUDENTS OF SMK 17 AGUSTUS 1945 MUNCAR THROUGH DIRECT PRACTICE WITH THE NATIVE SPEAKER
IMPROVING SPEAKING SKILL OF THE TENTH GRADE STUDENTS OF SMK 17 AGUSTUS 1945 MUNCAR THROUGH DIRECT PRACTICE WITH THE NATIVE SPEAKER Mohamad Nor Shodiq Institut Agama Islam Darussalam (IAIDA) Banyuwangi
More informationCEFR Overall Illustrative English Proficiency Scales
CEFR Overall Illustrative English Proficiency s CEFR CEFR OVERALL ORAL PRODUCTION Has a good command of idiomatic expressions and colloquialisms with awareness of connotative levels of meaning. Can convey
More informationPAGE(S) WHERE TAUGHT If sub mission ins not a book, cite appropriate location(s))
Ohio Academic Content Standards Grade Level Indicators (Grade 11) A. ACQUISITION OF VOCABULARY Students acquire vocabulary through exposure to language-rich situations, such as reading books and other
More informationBeyond the Pipeline: Discrete Optimization in NLP
Beyond the Pipeline: Discrete Optimization in NLP Tomasz Marciniak and Michael Strube EML Research ggmbh Schloss-Wolfsbrunnenweg 33 69118 Heidelberg, Germany http://www.eml-research.de/nlp Abstract We
More informationPhonological Processing for Urdu Text to Speech System
Phonological Processing for Urdu Text to Speech System Sarmad Hussain Center for Research in Urdu Language Processing, National University of Computer and Emerging Sciences, B Block, Faisal Town, Lahore,
More informationHoughton Mifflin Reading Correlation to the Common Core Standards for English Language Arts (Grade1)
Houghton Mifflin Reading Correlation to the Standards for English Language Arts (Grade1) 8.3 JOHNNY APPLESEED Biography TARGET SKILLS: 8.3 Johnny Appleseed Phonemic Awareness Phonics Comprehension Vocabulary
More informationA heuristic framework for pivot-based bilingual dictionary induction
2013 International Conference on Culture and Computing A heuristic framework for pivot-based bilingual dictionary induction Mairidan Wushouer, Toru Ishida, Donghui Lin Department of Social Informatics,
More informationImproving the Quality of MT Output using Novel Name Entity Translation Scheme
Improving the Quality of MT Output using Novel Name Entity Translation Scheme Deepti Bhalla Department of Computer Science Banasthali University Rajasthan, India deeptibhalla0600@gmail.com Nisheeth Joshi
More informationEvaluation of a Simultaneous Interpretation System and Analysis of Speech Log for User Experience Assessment
Evaluation of a Simultaneous Interpretation System and Analysis of Speech Log for User Experience Assessment Akiko Sakamoto, Kazuhiko Abe, Kazuo Sumita and Satoshi Kamatani Knowledge Media Laboratory,
More informationA Case Study: News Classification Based on Term Frequency
A Case Study: News Classification Based on Term Frequency Petr Kroha Faculty of Computer Science University of Technology 09107 Chemnitz Germany kroha@informatik.tu-chemnitz.de Ricardo Baeza-Yates Center
More informationMachine Learning and Data Mining. Ensembles of Learners. Prof. Alexander Ihler
Machine Learning and Data Mining Ensembles of Learners Prof. Alexander Ihler Ensemble methods Why learn one classifier when you can learn many? Ensemble: combine many predictors (Weighted) combina
More informationDistant Supervised Relation Extraction with Wikipedia and Freebase
Distant Supervised Relation Extraction with Wikipedia and Freebase Marcel Ackermann TU Darmstadt ackermann@tk.informatik.tu-darmstadt.de Abstract In this paper we discuss a new approach to extract relational
More informationhave to be modeled) or isolated words. Output of the system is a grapheme-tophoneme conversion system which takes as its input the spelling of words,
A Language-Independent, Data-Oriented Architecture for Grapheme-to-Phoneme Conversion Walter Daelemans and Antal van den Bosch Proceedings ESCA-IEEE speech synthesis conference, New York, September 1994
More informationTABE 9&10. Revised 8/2013- with reference to College and Career Readiness Standards
TABE 9&10 Revised 8/2013- with reference to College and Career Readiness Standards LEVEL E Test 1: Reading Name Class E01- INTERPRET GRAPHIC INFORMATION Signs Maps Graphs Consumer Materials Forms Dictionary
More informationMatching Similarity for Keyword-Based Clustering
Matching Similarity for Keyword-Based Clustering Mohammad Rezaei and Pasi Fränti University of Eastern Finland {rezaei,franti}@cs.uef.fi Abstract. Semantic clustering of objects such as documents, web
More informationWord Segmentation of Off-line Handwritten Documents
Word Segmentation of Off-line Handwritten Documents Chen Huang and Sargur N. Srihari {chuang5, srihari}@cedar.buffalo.edu Center of Excellence for Document Analysis and Recognition (CEDAR), Department
More informationLearning Computational Grammars
Learning Computational Grammars John Nerbonne, Anja Belz, Nicola Cancedda, Hervé Déjean, James Hammerton, Rob Koeling, Stasinos Konstantopoulos, Miles Osborne, Franck Thollard and Erik Tjong Kim Sang Abstract
More informationMETHODS FOR EXTRACTING AND CLASSIFYING PAIRS OF COGNATES AND FALSE FRIENDS
METHODS FOR EXTRACTING AND CLASSIFYING PAIRS OF COGNATES AND FALSE FRIENDS Ruslan Mitkov (R.Mitkov@wlv.ac.uk) University of Wolverhampton ViktorPekar (v.pekar@wlv.ac.uk) University of Wolverhampton Dimitar
More informationDickinson ISD ELAR Year at a Glance 3rd Grade- 1st Nine Weeks
3rd Grade- 1st Nine Weeks R3.8 understand, make inferences and draw conclusions about the structure and elements of fiction and provide evidence from text to support their understand R3.8A sequence and
More informationBYLINE [Heng Ji, Computer Science Department, New York University,
INFORMATION EXTRACTION BYLINE [Heng Ji, Computer Science Department, New York University, hengji@cs.nyu.edu] SYNONYMS NONE DEFINITION Information Extraction (IE) is a task of extracting pre-specified types
More informationCLASSIFICATION OF PROGRAM Critical Elements Analysis 1. High Priority Items Phonemic Awareness Instruction
CLASSIFICATION OF PROGRAM Critical Elements Analysis 1 Program Name: Macmillan/McGraw Hill Reading 2003 Date of Publication: 2003 Publisher: Macmillan/McGraw Hill Reviewer Code: 1. X The program meets
More informationTEKS Correlations Proclamation 2017
and Skills (TEKS): Material Correlations to the Texas Essential Knowledge and Skills (TEKS): Material Subject Course Publisher Program Title Program ISBN TEKS Coverage (%) Chapter 114. Texas Essential
More information