Cross Language POS taggers for Resource Poor Languages
|
|
- Randall Bell
- 5 years ago
- Views:
Transcription
1 Cross Language POS taggers for Resource Poor Languages April 22, Introduction POS tagger is one of the basic requirements of any language for the advancement of its linguistic research. There are many languages which do not have POS taggers. Reasons for lacking POS taggers vary. One reason for this is due to the lack of other basic resources like corpora, lexicons or morphological analyzers. With the advent of Web, corpora is no longer a major problem (see (Kilgarriff et al., 2010)). With technical advances in lexicography (Atkins and Rundell, 2008), lexicon building and morphological analyzers has also been addressed to a decent extent where the next stages of research can boot. The other reason for lack of POS taggers is partly due the lack of many or any research groups working on a particular language. So these languages do not have any annotated data to build efficient taggers. But this problem can be addressed if research/resources for a resource-rich language (source language) can be used for a resource-poor language (target language). If these languages are typologically related, efficient taggers can be built. In this work, we aim to build POS taggers for resource-poor languages benefiting from the resources of their typologically-related resource-rich languages. 2 Related Work There are many existing methods which build POS taggers of a target language without using any of its annotated data. Yarowsky et al. (2001); Yarowsky and Ngai (2001); Das and Petrov (2011) build POS taggers of a target language using parallel corpus of the target language and a source language. The source language is expected to have a POS tagger. First, the source language tools annotate the source side of the parallel corpora. Later, these annotations are projected to the target language side using the information of alignments in the parallel corpora. This way annotated corpora is built for the target language from which POS taggers are built. Other methods which make use of parallel corpora are (Snyder et al., 2008; Naseem et al., 2009). They use unsupervised approaches based on hierarchical 1
2 Bayesian models, using Markov Chain Monte Carlo sampling techniques for inference, gaining from the information shared across languages. The main disadvantage of the above methods is that they heavily rely on parallel corpora which itself is a costly resource for resource-poor languages. Hana et al. (2004); Feldman et al. (2006) propose a method for developing POS tagger (includes morphological analyzer) for a target language using another typologically related language. Their motivation is also similar to us that they aim to develop taggers for resource-less languages. The method is described in Section 3 3 Hana et al. (2004) Hana et al. aim to develop a tagger for Russian using Czech. They use use a HMM based tagging model. Even though the languages Czech and Russian are free-word order, they describe that HMM based tagger works well. A HMM tagger is based on two probabilities - the transition and emission probabilities. Transition probabilities describe the conditional probability of a tag for a current word of interest given the tags of previous words. Based on the intuition that transition probabilities across typologically related languages remain the same, they treat the transition probabilities of Russian to be the same as Czech. Emission probabilities describe the conditional probability of a tag given a word. Since Hana et al. do not use a bilingual lexicon, they could not use emission probabilities of Czech for Russian. Also, since they do not use annotated data for Russian, it is not straightforward to get emission probabilities. To overcome this, they develop a light paradigm-based (a set of rules) lexicon of Russian which describes all the possible tags (including morphological information) for a given word form. For a given word form, the distribution of its possible tags is treated to be uniform. Using this assumption, emission probabilities are calculated. Apart from this, to prevent errors in transition probabilities due to differences in languages, they remove patterns in Czech training corpus which do not occur in Russian. They call this Russification. After Russification, the transition behaviour of Czech is expected to be the same as Russian. Results show that Russification improved the performance. Also to prevent errors and over-generation by light morphological analyzer of Russian, they use certain filters. These are also found to improve the performance. Adding to these, they also train separate models for main POS tag and other morphological features such as gender, number, case and tense. Based on a voting scheme, they finally arrive at a tag for each word. Training separate models for each tag type also helped improving the accuracy of the tagger. 2
3 4 Our Focus: Target and Source Languages We aim to develop POS taggers for Dravidian languages like Kannada and Malayalam using Telugu or Tamil as source languages. Dravidian languages are spoken by more than 200 million with Telugu, Tamil, Kannada and Malayalam spoken by 75, 65, 35 and 33 million respectively (src: Wikipedia). Though the numbers are huge, the resources for these languages are relatively poorer compared to the Indo-Aryan languages which are other major Indian languages. Even among Dravidian, Kannada and Malayalam are resource-poor compared to Telugu and Tamil. Since these languages are highly morphological rich, they pose extra difficulty to build resources. Majority of the existing research in computational linguistics in Dravidian languages focused on Telugu and Tamil as a result of which resources for these languages are notable compared to Kannada and Malayalam. 4.1 Kannada and Telugu Telugu is known to be highly influenced by Kannada making the languages slightly mutually intelligible (Datta, 1998, pg. 1690). Until 13th century both the languages have same script. In the later years, the script has changed but still close similarities can be observed. Both the scripts belong to the same script family. To build Kannada POS tagger, we aim to take advantage of Telugu resources. 4.2 Tamil and Malayalam There are studies (Asher and Kumari, 1997) which say Malayalam and Tamil originated from Ancient Tamil and some say Malayalam is a dialect of Tamil. Both the writing scripts belong to the same family. Tamil is widely spoken and is resource-richer than Malayalam. So we aim to use Tamil resources to build Malayalam tagger. 4.3 Tagset All the Indian languages have similarities in morphological properties and syntactic behaviour. The only main difference is agglutinative behaviour of Dravidian languages. Observing these similarities and differences in Indian languages, Bharati et al. (2006) proposed a common POS tagset for all Indian languages. We aim to use this tagset. 4.4 Resources available Some linguistic tools for Dravidian languages are made available by the Indian Government initiative called Indian Language Machine Translation where 3
4 many universities formed a consortium to develop linguistic resources for Indian languages 1. These tools include morphological analyzers, POS taggers, transliteration tools for converting these languages to ASCII encoding called wx-format. All these POS taggers are built using the method (Avinesh and Karthik, 2007) in which conditional random fields (CRF) models are trained on manually annotated corpora. The training corpora sizes of Kannada and Malayalam are very small compared to Telugu and Tamil. [Serge: We may have to use already trained models of Telugu and Tamil since the annotated corpora are not freely available.] We create annotated corpora of Telugu and Tamil by tagging large corpora using the existing tools. These corpora are later used to create models for Kannada and Malayalam respectively. 5 Our Method Our POS tags involves main part-of-speech information along with morphological information of words such as case, gender, number, tense, aspect. Our method is inspired from the method of (Hana et al., 2004). Our contributions will be in learning transition and emission probabilities of the target language. 5.1 Estimating Transition Probabilities The transition probabilities of Kannada and Malayalam are learned from Telugu and Tamil respectively. Our contribution in learning transition probabilities is to minimize the use of manual intervention in providing linguistic information of the target language such as the combination of features (both morphological and POS tag information) which are useful to tag a target word. We aim to learn this automatically using decision trees. (Schmid and Laws, 2008) use decision trees to compute the transition probabilities along with best feature selection. 5.2 Estimating Emission Probabilities Our major contribution will be estimating in the emission probabilities. Since the languages we deal with are slightly mutually intelligible, we try to exploit this. We use different schemes for estimating emission probabilities Edit-distance (Hana et al., 2004) assume uniform distribution of POS tags in the target language without using the information of the source language. But one can make use of source language information if at all there exists a translation system. Since the languages we deal are slightly mutually intelligible, we transliterate both source and target languages to ASCII and then use similarities between 1 Tools for 9 Indian languages downloads/shallow_parser.php 4
5 words in the source and target. Similarity between source language words and target language is measured using edit-distance methods Bilingual Lexicon The emission probabilities of the source language are converted to target language using bilingual lexicon. This bilingual lexicon can be a direct one or a pivot-language based where both bilingual lexicons of source and target exists in common with another language Uniform Distribution Similar to (Hana et al., 2004), we use uniform tag distribution for a given word over all its possible tags Relative Frequencies Rather than using uniform distribution over all tags for a given word, we use the information of relative frequencies of POS tags of the source language. 5.3 Further refinement [Serge s Idea:] An initial HMM tagger is build using above transition and emission probabilities. To make the tagger more accurate, we tag large corpora of the target language using the initial tagger developed. We then observe common error patterns in tagging of the target language. Using simple regular expressions, we correct these errors to build a near-gold standard data for the target language. We use this data to then train a new decision tree-based HMM model. Our expectation is that this new tagger will be more accurate than the previous one. But we should take care not to loose major useful information which is learned from the source language. 6 Evaluation [This is something we have not discussed yet.] We aim to use manual gold standard data and compare our taggers performance with the gold standard. We also compare our models with exisiting taggers. References Asher, R. E. and Kumari, T. C. (1997). Malayalam. Atkins, S. B. T. and Rundell, M. (2008). The Oxford Guide to Practical Lexicography. Oxford University Press, Oxford. 5
6 Avinesh, P. V. S. and Karthik, G. (2007). Part-Of-Speech Tagging and Chunking using Conditional Random Fields and Transformation-Based Learning. In Proceedings of the IJCAI and the Workshop On Shallow Parsing for South Asian Languages (SPSAL), pages Bharati, A., Sangal, R., Sharma, D. M., and Bai, L. (2006). Anncorra: Annotating corpora guidelines for pos and chunk annotation for indian languages. In Technical Report (TR-LTRC-31), LTRC, IIIT-Hyderabad. Das, D. and Petrov, S. (2011). Unsupervised part-of-speech tagging with bilingual graph-based projections. In Proceedings of ACL Datta, A. (1998). The Encyclopaedia Of Indian Literature, volume 2. Feldman, A., Hana, J., and Brew, C. (2006). A cross-language approach to rapid creation of new morpho-syntactically annotated resources. In Proceedings of LREC, pages Hana, J., Feldman, A., and Brew, C. (2004). A Resource-light Approach to Russian Morphology: Tagging Russian using Czech resources. In Proceedings of EMNLP 2004, Barcelona, Spain. Kilgarriff, A., Reddy, S., Pomikálek, J., and PVS, A. (2010). A corpus factory for many languages. In Proceedings of the Seventh conference on International Language Resources and Evaluation (LREC 10), Valletta, Malta. European Language Resources Association (ELRA). Naseem, T., Snyder, B., Eisenstein, J., and Barzilay, R. (2009). Multilingual part-of-speech tagging: Two unsupervised approaches. J. Artif. Intell. Res. (JAIR), 36: Schmid, H. and Laws, F. (2008). Estimation of conditional probabilities with decision trees and an application to fine-grained pos tagging. In Proceedings of the 22nd International Conference on Computational Linguistics - Volume 1, COLING 08, pages , Stroudsburg, PA, USA. Association for Computational Linguistics. Snyder, B., Naseem, T., Eisenstein, J., and Barzilay, R. (2008). Unsupervised multilingual learning for pos tagging. In Proceedings of the Conference on Empirical Methods in Natural Language Processing, EMNLP 08, pages , Stroudsburg, PA, USA. Association for Computational Linguistics. Yarowsky, D. and Ngai, G. (2001). Inducing multilingual pos taggers and np bracketers via robust projection across aligned corpora. In Proceedings of the second meeting of the North American Chapter of the Association for Computational Linguistics on Language technologies, NAACL 01, pages 1 8, Stroudsburg, PA, USA. Association for Computational Linguistics. 6
7 Yarowsky, D., Ngai, G., and Wicentowski, R. (2001). Inducing multilingual text analysis tools via robust projection across aligned corpora. In Proceedings of the first international conference on Human language technology research, HLT 01, pages 1 8, Stroudsburg, PA, USA. Association for Computational Linguistics. 7
Cross-Lingual Dependency Parsing with Universal Dependencies and Predicted PoS Labels
Cross-Lingual Dependency Parsing with Universal Dependencies and Predicted PoS Labels Jörg Tiedemann Uppsala University Department of Linguistics and Philology firstname.lastname@lingfil.uu.se Abstract
More informationDEVELOPMENT OF A MULTILINGUAL PARALLEL CORPUS AND A PART-OF-SPEECH TAGGER FOR AFRIKAANS
DEVELOPMENT OF A MULTILINGUAL PARALLEL CORPUS AND A PART-OF-SPEECH TAGGER FOR AFRIKAANS Julia Tmshkina Centre for Text Techitology, North-West University, 253 Potchefstroom, South Africa 2025770@puk.ac.za
More informationIndian Institute of Technology, Kanpur
Indian Institute of Technology, Kanpur Course Project - CS671A POS Tagging of Code Mixed Text Ayushman Sisodiya (12188) {ayushmn@iitk.ac.in} Donthu Vamsi Krishna (15111016) {vamsi@iitk.ac.in} Sandeep Kumar
More informationTwo methods to incorporate local morphosyntactic features in Hindi dependency
Two methods to incorporate local morphosyntactic features in Hindi dependency parsing Bharat Ram Ambati, Samar Husain, Sambhav Jain, Dipti Misra Sharma and Rajeev Sangal Language Technologies Research
More informationLinking Task: Identifying authors and book titles in verbose queries
Linking Task: Identifying authors and book titles in verbose queries Anaïs Ollagnier, Sébastien Fournier, and Patrice Bellot Aix-Marseille University, CNRS, ENSAM, University of Toulon, LSIS UMR 7296,
More informationNamed Entity Recognition: A Survey for the Indian Languages
Named Entity Recognition: A Survey for the Indian Languages Padmaja Sharma Dept. of CSE Tezpur University Assam, India 784028 psharma@tezu.ernet.in Utpal Sharma Dept.of CSE Tezpur University Assam, India
More informationIntroduction. Beáta B. Megyesi. Uppsala University Department of Linguistics and Philology Introduction 1(48)
Introduction Beáta B. Megyesi Uppsala University Department of Linguistics and Philology beata.megyesi@lingfil.uu.se Introduction 1(48) Course content Credits: 7.5 ECTS Subject: Computational linguistics
More informationWeb as Corpus. Corpus Linguistics. Web as Corpus 1 / 1. Corpus Linguistics. Web as Corpus. web.pl 3 / 1. Sketch Engine. Corpus Linguistics
(L615) Markus Dickinson Department of Linguistics, Indiana University Spring 2013 The web provides new opportunities for gathering data Viable source of disposable corpora, built ad hoc for specific purposes
More informationSemi-supervised methods of text processing, and an application to medical concept extraction. Yacine Jernite Text-as-Data series September 17.
Semi-supervised methods of text processing, and an application to medical concept extraction Yacine Jernite Text-as-Data series September 17. 2015 What do we want from text? 1. Extract information 2. Link
More informationA High-Quality Web Corpus of Czech
A High-Quality Web Corpus of Czech Johanka Spoustová, Miroslav Spousta Institute of Formal and Applied Linguistics Faculty of Mathematics and Physics Charles University Prague, Czech Republic {johanka,spousta}@ufal.mff.cuni.cz
More informationEnhancing Unlexicalized Parsing Performance using a Wide Coverage Lexicon, Fuzzy Tag-set Mapping, and EM-HMM-based Lexical Probabilities
Enhancing Unlexicalized Parsing Performance using a Wide Coverage Lexicon, Fuzzy Tag-set Mapping, and EM-HMM-based Lexical Probabilities Yoav Goldberg Reut Tsarfaty Meni Adler Michael Elhadad Ben Gurion
More informationGrammar Extraction from Treebanks for Hindi and Telugu
Grammar Extraction from Treebanks for Hindi and Telugu Prasanth Kolachina, Sudheer Kolachina, Anil Kumar Singh, Samar Husain, Viswanatha Naidu,Rajeev Sangal and Akshar Bharati Language Technologies Research
More informationNCU IISR English-Korean and English-Chinese Named Entity Transliteration Using Different Grapheme Segmentation Approaches
NCU IISR English-Korean and English-Chinese Named Entity Transliteration Using Different Grapheme Segmentation Approaches Yu-Chun Wang Chun-Kai Wu Richard Tzong-Han Tsai Department of Computer Science
More informationModeling full form lexica for Arabic
Modeling full form lexica for Arabic Susanne Alt Amine Akrout Atilf-CNRS Laurent Romary Loria-CNRS Objectives Presentation of the current standardization activity in the domain of lexical data modeling
More informationTraining and evaluation of POS taggers on the French MULTITAG corpus
Training and evaluation of POS taggers on the French MULTITAG corpus A. Allauzen, H. Bonneau-Maynard LIMSI/CNRS; Univ Paris-Sud, Orsay, F-91405 {allauzen,maynard}@limsi.fr Abstract The explicit introduction
More informationTHE ROLE OF DECISION TREES IN NATURAL LANGUAGE PROCESSING
SISOM & ACOUSTICS 2015, Bucharest 21-22 May THE ROLE OF DECISION TREES IN NATURAL LANGUAGE PROCESSING MarilenaăLAZ R 1, Diana MILITARU 2 1 Military Equipment and Technologies Research Agency, Bucharest,
More informationExploiting Phrasal Lexica and Additional Morpho-syntactic Language Resources for Statistical Machine Translation with Scarce Training Data
Exploiting Phrasal Lexica and Additional Morpho-syntactic Language Resources for Statistical Machine Translation with Scarce Training Data Maja Popović and Hermann Ney Lehrstuhl für Informatik VI, Computer
More informationDistant Supervised Relation Extraction with Wikipedia and Freebase
Distant Supervised Relation Extraction with Wikipedia and Freebase Marcel Ackermann TU Darmstadt ackermann@tk.informatik.tu-darmstadt.de Abstract In this paper we discuss a new approach to extract relational
More informationLessons from a Massive Open Online Course (MOOC) on Natural Language Processing for Digital Humanities
Lessons from a Massive Open Online Course (MOOC) on Natural Language Processing for Digital Humanities Simon Clematide, Isabel Meraner, Noah Bubenhofer, Martin Volk Institute of Computational Linguistics
More informationChinese Language Parsing with Maximum-Entropy-Inspired Parser
Chinese Language Parsing with Maximum-Entropy-Inspired Parser Heng Lian Brown University Abstract The Chinese language has many special characteristics that make parsing difficult. The performance of state-of-the-art
More informationOnline Updating of Word Representations for Part-of-Speech Tagging
Online Updating of Word Representations for Part-of-Speech Tagging Wenpeng Yin LMU Munich wenpeng@cis.lmu.de Tobias Schnabel Cornell University tbs49@cornell.edu Hinrich Schütze LMU Munich inquiries@cislmu.org
More informationVariations of the Similarity Function of TextRank for Automated Summarization
Variations of the Similarity Function of TextRank for Automated Summarization Federico Barrios 1, Federico López 1, Luis Argerich 1, Rosita Wachenchauzer 12 1 Facultad de Ingeniería, Universidad de Buenos
More information2/15/13. POS Tagging Problem. Part-of-Speech Tagging. Example English Part-of-Speech Tagsets. More Details of the Problem. Typical Problem Cases
POS Tagging Problem Part-of-Speech Tagging L545 Spring 203 Given a sentence W Wn and a tagset of lexical categories, find the most likely tag T..Tn for each word in the sentence Example Secretariat/P is/vbz
More informationEnsemble Technique Utilization for Indonesian Dependency Parser
Ensemble Technique Utilization for Indonesian Dependency Parser Arief Rahman Institut Teknologi Bandung Indonesia 23516008@std.stei.itb.ac.id Ayu Purwarianti Institut Teknologi Bandung Indonesia ayu@stei.itb.ac.id
More informationProceedings of the 19th COLING, , 2002.
Crosslinguistic Transfer in Automatic Verb Classication Vivian Tsang Computer Science University of Toronto vyctsang@cs.toronto.edu Suzanne Stevenson Computer Science University of Toronto suzanne@cs.toronto.edu
More informationPredicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks
Predicting Student Attrition in MOOCs using Sentiment Analysis and Neural Networks Devendra Singh Chaplot, Eunhee Rhim, and Jihie Kim Samsung Electronics Co., Ltd. Seoul, South Korea {dev.chaplot,eunhee.rhim,jihie.kim}@samsung.com
More informationScienceDirect. Malayalam question answering system
Available online at www.sciencedirect.com ScienceDirect Procedia Technology 24 (2016 ) 1388 1392 International Conference on Emerging Trends in Engineering, Science and Technology (ICETEST - 2015) Malayalam
More informationTarget Language Preposition Selection an Experiment with Transformation-Based Learning and Aligned Bilingual Data
Target Language Preposition Selection an Experiment with Transformation-Based Learning and Aligned Bilingual Data Ebba Gustavii Department of Linguistics and Philology, Uppsala University, Sweden ebbag@stp.ling.uu.se
More informationLearning Methods in Multilingual Speech Recognition
Learning Methods in Multilingual Speech Recognition Hui Lin Department of Electrical Engineering University of Washington Seattle, WA 98125 linhui@u.washington.edu Li Deng, Jasha Droppo, Dong Yu, and Alex
More informationCS 598 Natural Language Processing
CS 598 Natural Language Processing Natural language is everywhere Natural language is everywhere Natural language is everywhere Natural language is everywhere!"#$%&'&()*+,-./012 34*5665756638/9:;< =>?@ABCDEFGHIJ5KL@
More informationPOS tagging of Chinese Buddhist texts using Recurrent Neural Networks
POS tagging of Chinese Buddhist texts using Recurrent Neural Networks Longlu Qin Department of East Asian Languages and Cultures longlu@stanford.edu Abstract Chinese POS tagging, as one of the most important
More informationA heuristic framework for pivot-based bilingual dictionary induction
2013 International Conference on Culture and Computing A heuristic framework for pivot-based bilingual dictionary induction Mairidan Wushouer, Toru Ishida, Donghui Lin Department of Social Informatics,
More informationA Simple Surface Realization Engine for Telugu
A Simple Surface Realization Engine for Telugu Sasi Raja Sekhar Dokkara, Suresh Verma Penumathsa Dept. of Computer Science Adikavi Nannayya University, India dsairajasekhar@gmail.com,vermaps@yahoo.com
More informationSemi-supervised Training for the Averaged Perceptron POS Tagger
Semi-supervised Training for the Averaged Perceptron POS Tagger Drahomíra johanka Spoustová Jan Hajič Jan Raab Miroslav Spousta Institute of Formal and Applied Linguistics Faculty of Mathematics and Physics,
More informationSpecification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments
Specification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments Cristina Vertan, Walther v. Hahn University of Hamburg, Natural Language Systems Division Hamburg,
More informationAnnotation Projection for Discourse Connectives
SFB 833 / Univ. Tübingen Penn Discourse Treebank Workshop Annotation projection Basic idea: Given a bitext E/F and annotation for F, how would the annotation look for E? Examples: Word Sense Disambiguation
More informationChunk Parsing for Base Noun Phrases using Regular Expressions. Let s first let the variable s0 be the sentence tree of the first sentence.
NLP Lab Session Week 8 October 15, 2014 Noun Phrase Chunking and WordNet in NLTK Getting Started In this lab session, we will work together through a series of small examples using the IDLE window and
More informationAccurate Unlexicalized Parsing for Modern Hebrew
Accurate Unlexicalized Parsing for Modern Hebrew Reut Tsarfaty and Khalil Sima an Institute for Logic, Language and Computation, University of Amsterdam Plantage Muidergracht 24, 1018TV Amsterdam, The
More informationExtracting Opinion Expressions and Their Polarities Exploration of Pipelines and Joint Models
Extracting Opinion Expressions and Their Polarities Exploration of Pipelines and Joint Models Richard Johansson and Alessandro Moschitti DISI, University of Trento Via Sommarive 14, 38123 Trento (TN),
More informationEdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar
EdIt: A Broad-Coverage Grammar Checker Using Pattern Grammar Chung-Chi Huang Mei-Hua Chen Shih-Ting Huang Jason S. Chang Institute of Information Systems and Applications, National Tsing Hua University,
More informationESSLLI 2010: Resource-light Morpho-syntactic Analysis of Highly
ESSLLI 2010: Resource-light Morpho-syntactic Analysis of Highly Inflected Languages Classical Approaches to Tagging The slides are posted on the web. The url is http://chss.montclair.edu/~feldmana/esslli10/.
More informationParsing of part-of-speech tagged Assamese Texts
IJCSI International Journal of Computer Science Issues, Vol. 6, No. 1, 2009 ISSN (Online): 1694-0784 ISSN (Print): 1694-0814 28 Parsing of part-of-speech tagged Assamese Texts Mirzanur Rahman 1, Sufal
More informationWord Sense Disambiguation
Word Sense Disambiguation D. De Cao R. Basili Corso di Web Mining e Retrieval a.a. 2008-9 May 21, 2009 Excerpt of the R. Mihalcea and T. Pedersen AAAI 2005 Tutorial, at: http://www.d.umn.edu/ tpederse/tutorials/advances-in-wsd-aaai-2005.ppt
More informationSpoken Language Parsing Using Phrase-Level Grammars and Trainable Classifiers
Spoken Language Parsing Using Phrase-Level Grammars and Trainable Classifiers Chad Langley, Alon Lavie, Lori Levin, Dorcas Wallace, Donna Gates, and Kay Peterson Language Technologies Institute Carnegie
More informationBYLINE [Heng Ji, Computer Science Department, New York University,
INFORMATION EXTRACTION BYLINE [Heng Ji, Computer Science Department, New York University, hengji@cs.nyu.edu] SYNONYMS NONE DEFINITION Information Extraction (IE) is a task of extracting pre-specified types
More informationCross Language Information Retrieval
Cross Language Information Retrieval RAFFAELLA BERNARDI UNIVERSITÀ DEGLI STUDI DI TRENTO P.ZZA VENEZIA, ROOM: 2.05, E-MAIL: BERNARDI@DISI.UNITN.IT Contents 1 Acknowledgment.............................................
More informationTransliteration Systems Across Indian Languages Using Parallel Corpora
Transliteration Systems Across Indian Languages Using Parallel Corpora Rishabh Srivastava and Riyaz Ahmad Bhat Language Technologies Research Center IIIT-Hyderabad, India {rishabh.srivastava, riyaz.bhat}@research.iiit.ac.in
More informationSpeech Translation for Triage of Emergency Phonecalls in Minority Languages
Speech Translation for Triage of Emergency Phonecalls in Minority Languages Udhyakumar Nallasamy, Alan W Black, Tanja Schultz, Robert Frederking Language Technologies Institute Carnegie Mellon University
More information1. Introduction. 2. The OMBI database editor
OMBI bilingual lexical resources: Arabic-Dutch / Dutch-Arabic Carole Tiberius, Anna Aalstein, Instituut voor Nederlandse Lexicologie Jan Hoogland, Nederlands Instituut in Marokko (NIMAR) In this paper
More informationSEMAFOR: Frame Argument Resolution with Log-Linear Models
SEMAFOR: Frame Argument Resolution with Log-Linear Models Desai Chen or, The Case of the Missing Arguments Nathan Schneider SemEval July 16, 2010 Dipanjan Das School of Computer Science Carnegie Mellon
More informationMULTILINGUAL INFORMATION ACCESS IN DIGITAL LIBRARY
MULTILINGUAL INFORMATION ACCESS IN DIGITAL LIBRARY Chen, Hsin-Hsi Department of Computer Science and Information Engineering National Taiwan University Taipei, Taiwan E-mail: hh_chen@csie.ntu.edu.tw Abstract
More informationAn Evaluation of POS Taggers for the CHILDES Corpus
City University of New York (CUNY) CUNY Academic Works Dissertations, Theses, and Capstone Projects Graduate Center 9-30-2016 An Evaluation of POS Taggers for the CHILDES Corpus Rui Huang The Graduate
More informationDevelopment of the First LRs for Macedonian: Current Projects
Development of the First LRs for Macedonian: Current Projects Ruska Ivanovska-Naskova Faculty of Philology- University St. Cyril and Methodius Bul. Krste Petkov Misirkov bb, 1000 Skopje, Macedonia rivanovska@flf.ukim.edu.mk
More informationTowards a corpus-based online dictionary. of Italian Word Combinations
Towards a corpus-based online dictionary of Italian Word Combinations Castagnoli Sara 1, Lebani E. Gianluca 2, Lenci Alessandro 2, Masini Francesca 1, Nissim Malvina 3, Piunno Valentina 4 1 University
More informationProgressive Aspect in Nigerian English
ISLE 2011 17 June 2011 1 New Englishes Empirical Studies Aspect in Nigerian Languages 2 3 Nigerian English Other New Englishes Explanations Progressive Aspect in New Englishes New Englishes Empirical Studies
More informationExploiting Wikipedia as External Knowledge for Named Entity Recognition
Exploiting Wikipedia as External Knowledge for Named Entity Recognition Jun ichi Kazama and Kentaro Torisawa Japan Advanced Institute of Science and Technology (JAIST) Asahidai 1-1, Nomi, Ishikawa, 923-1292
More informationThe Choice of Features for Classification of Verbs in Biomedical Texts
The Choice of Features for Classification of Verbs in Biomedical Texts Anna Korhonen University of Cambridge Computer Laboratory 15 JJ Thomson Avenue Cambridge CB3 0FD, UK alk23@cl.cam.ac.uk Yuval Krymolowski
More informationCorpus Linguistics (L615)
(L615) Basics of Markus Dickinson Department of, Indiana University Spring 2013 1 / 23 : the extent to which a sample includes the full range of variability in a population distinguishes corpora from archives
More informationShort Text Understanding Through Lexical-Semantic Analysis
Short Text Understanding Through Lexical-Semantic Analysis Wen Hua #1, Zhongyuan Wang 2, Haixun Wang 3, Kai Zheng #4, Xiaofang Zhou #5 School of Information, Renmin University of China, Beijing, China
More informationIntroduction to Text Mining
Prelude Overview Introduction to Text Mining Tutorial at EDBT 06 René Witte Faculty of Informatics Institute for Program Structures and Data Organization (IPD) Universität Karlsruhe, Germany http://rene-witte.net
More informationIraide Ibarretxe Antuñano Universidad de Zaragoza
ATLANTIS Journal of the Spanish Association of Anglo-American Studies 34.1 ( June 2012): 163 69 issn 0210-6124 Hans Boas, ed. 2010: Contrastive Studies in Construction Grammar. Amsterdam/ Philadephia:
More informationThe Karlsruhe Institute of Technology Translation Systems for the WMT 2011
The Karlsruhe Institute of Technology Translation Systems for the WMT 2011 Teresa Herrmann, Mohammed Mediani, Jan Niehues and Alex Waibel Karlsruhe Institute of Technology Karlsruhe, Germany firstname.lastname@kit.edu
More informationThe NICT Translation System for IWSLT 2012
The NICT Translation System for IWSLT 2012 Andrew Finch Ohnmar Htun Eiichiro Sumita Multilingual Translation Group MASTAR Project National Institute of Information and Communications Technology Kyoto,
More informationEnd-to-End SMT with Zero or Small Parallel Texts 1. Abstract
End-to-End SMT with Zero or Small Parallel Texts 1 Abstract We use bilingual lexicon induction techniques, which learn translations from monolingual texts in two languages, to build an end-to-end statistical
More informationMethods for the Qualitative Evaluation of Lexical Association Measures
Methods for the Qualitative Evaluation of Lexical Association Measures Stefan Evert IMS, University of Stuttgart Azenbergstr. 12 D-70174 Stuttgart, Germany evert@ims.uni-stuttgart.de Brigitte Krenn Austrian
More informationRole of Pausing in Text-to-Speech Synthesis for Simultaneous Interpretation
Role of Pausing in Text-to-Speech Synthesis for Simultaneous Interpretation Vivek Kumar Rangarajan Sridhar, John Chen, Srinivas Bangalore, Alistair Conkie AT&T abs - Research 180 Park Avenue, Florham Park,
More informationModeling Attachment Decisions with a Probabilistic Parser: The Case of Head Final Structures
Modeling Attachment Decisions with a Probabilistic Parser: The Case of Head Final Structures Ulrike Baldewein (ulrike@coli.uni-sb.de) Computational Psycholinguistics, Saarland University D-66041 Saarbrücken,
More informationINTERNAL ASSIGNMENT QUESTIONS P.G. Diploma in English Language & Teaching ANNUAL EXAMINATIONS ( )
INTERNAL ASSIGNMENT QUESTIONS P.G. Diploma in English Language & Teaching ANNUAL EXAMINATIONS (2016-2017) PROF. G. RAM REDDY CENTRE FOR DISTANCE EDUCATION (RECOGNISED BY THE DISTANCE EDUCATION BUREAU,
More informationLearning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models
Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models Stephan Gouws and GJ van Rooyen MIH Medialab, Stellenbosch University SOUTH AFRICA {stephan,gvrooyen}@ml.sun.ac.za
More informationA Case Study: News Classification Based on Term Frequency
A Case Study: News Classification Based on Term Frequency Petr Kroha Faculty of Computer Science University of Technology 09107 Chemnitz Germany kroha@informatik.tu-chemnitz.de Ricardo Baeza-Yates Center
More informationApplications of memory-based natural language processing
Applications of memory-based natural language processing Antal van den Bosch and Roser Morante ILK Research Group Tilburg University Prague, June 24, 2007 Current ILK members Principal investigator: Antal
More informationLanguage Acquisition Fall 2010/Winter Lexical Categories. Afra Alishahi, Heiner Drenhaus
Language Acquisition Fall 2010/Winter 2011 Lexical Categories Afra Alishahi, Heiner Drenhaus Computational Linguistics and Phonetics Saarland University Children s Sensitivity to Lexical Categories Look,
More informationUnsupervised Dependency Parsing without Gold Part-of-Speech Tags
Unsupervised Dependency Parsing without Gold Part-of-Speech Tags Valentin I. Spitkovsky valentin@cs.stanford.edu Angel X. Chang angelx@cs.stanford.edu Hiyan Alshawi hiyan@google.com Daniel Jurafsky jurafsky@stanford.edu
More informationAutomated Identification of Domain Preferences of Collocations
Automated Identification of Domain Preferences of Collocations Jelena Kallas 1, Vit Suchomel 2, Maria Khokhlova 3 1 Institute of the Estonian Language, Estonia 2 Masaryk University, Czech Republic 3 St.
More informationTwitter Sentiment Classification on Sanders Data using Hybrid Approach
IOSR Journal of Computer Engineering (IOSR-JCE) e-issn: 2278-0661,p-ISSN: 2278-8727, Volume 17, Issue 4, Ver. I (July Aug. 2015), PP 118-123 www.iosrjournals.org Twitter Sentiment Classification on Sanders
More informationTHE VERB ARGUMENT BROWSER
THE VERB ARGUMENT BROWSER Bálint Sass sass.balint@itk.ppke.hu Péter Pázmány Catholic University, Budapest, Hungary 11 th International Conference on Text, Speech and Dialog 8-12 September 2008, Brno PREVIEW
More informationNatural Language Processing. George Konidaris
Natural Language Processing George Konidaris gdk@cs.brown.edu Fall 2017 Natural Language Processing Understanding spoken/written sentences in a natural language. Major area of research in AI. Why? Humans
More informationarxiv: v1 [cs.cl] 2 Apr 2017
Word-Alignment-Based Segment-Level Machine Translation Evaluation using Word Embeddings Junki Matsuo and Mamoru Komachi Graduate School of System Design, Tokyo Metropolitan University, Japan matsuo-junki@ed.tmu.ac.jp,
More informationBigrams in registers, domains, and varieties: a bigram gravity approach to the homogeneity of corpora
Bigrams in registers, domains, and varieties: a bigram gravity approach to the homogeneity of corpora Stefan Th. Gries Department of Linguistics University of California, Santa Barbara stgries@linguistics.ucsb.edu
More informationA Syllable Based Word Recognition Model for Korean Noun Extraction
are used as the most important terms (features) that express the document in NLP applications such as information retrieval, document categorization, text summarization, information extraction, and etc.
More informationA Comparison of Two Text Representations for Sentiment Analysis
010 International Conference on Computer Application and System Modeling (ICCASM 010) A Comparison of Two Text Representations for Sentiment Analysis Jianxiong Wang School of Computer Science & Educational
More informationLearning Computational Grammars
Learning Computational Grammars John Nerbonne, Anja Belz, Nicola Cancedda, Hervé Déjean, James Hammerton, Rob Koeling, Stasinos Konstantopoulos, Miles Osborne, Franck Thollard and Erik Tjong Kim Sang Abstract
More informationMatching Similarity for Keyword-Based Clustering
Matching Similarity for Keyword-Based Clustering Mohammad Rezaei and Pasi Fränti University of Eastern Finland {rezaei,franti}@cs.uef.fi Abstract. Semantic clustering of objects such as documents, web
More informationWhat Can Neural Networks Teach us about Language? Graham Neubig a2-dlearn 11/18/2017
What Can Neural Networks Teach us about Language? Graham Neubig a2-dlearn 11/18/2017 Supervised Training of Neural Networks for Language Training Data Training Model this is an example the cat went to
More informationNTU Student Dashboard
NTU Student Dashboard 28,000 Students > 45% Widening Participation Background > 93% Employability < 5% Drop-out Rate Our Starting Point Three Drivers: HERE Project (part of What Works? Student Retention
More informationRule discovery in Web-based educational systems using Grammar-Based Genetic Programming
Data Mining VI 205 Rule discovery in Web-based educational systems using Grammar-Based Genetic Programming C. Romero, S. Ventura, C. Hervás & P. González Universidad de Córdoba, Campus Universitario de
More informationSINGLE DOCUMENT AUTOMATIC TEXT SUMMARIZATION USING TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF)
SINGLE DOCUMENT AUTOMATIC TEXT SUMMARIZATION USING TERM FREQUENCY-INVERSE DOCUMENT FREQUENCY (TF-IDF) Hans Christian 1 ; Mikhael Pramodana Agus 2 ; Derwin Suhartono 3 1,2,3 Computer Science Department,
More informationReinForest: Multi-Domain Dialogue Management Using Hierarchical Policies and Knowledge Ontology
ReinForest: Multi-Domain Dialogue Management Using Hierarchical Policies and Knowledge Ontology Tiancheng Zhao CMU-LTI-16-006 Language Technologies Institute School of Computer Science Carnegie Mellon
More informationSeminar - Organic Computing
Seminar - Organic Computing Self-Organisation of OC-Systems Markus Franke 25.01.2006 Typeset by FoilTEX Timetable 1. Overview 2. Characteristics of SO-Systems 3. Concern with Nature 4. Design-Concepts
More informationA Semantic Similarity Measure Based on Lexico-Syntactic Patterns
A Semantic Similarity Measure Based on Lexico-Syntactic Patterns Alexander Panchenko, Olga Morozova and Hubert Naets Center for Natural Language Processing (CENTAL) Université catholique de Louvain Belgium
More informationUsing dialogue context to improve parsing performance in dialogue systems
Using dialogue context to improve parsing performance in dialogue systems Ivan Meza-Ruiz and Oliver Lemon School of Informatics, Edinburgh University 2 Buccleuch Place, Edinburgh I.V.Meza-Ruiz@sms.ed.ac.uk,
More informationThe Internet as a Normative Corpus: Grammar Checking with a Search Engine
The Internet as a Normative Corpus: Grammar Checking with a Search Engine Jonas Sjöbergh KTH Nada SE-100 44 Stockholm, Sweden jsh@nada.kth.se Abstract In this paper some methods using the Internet as a
More informationACTL5103 Stochastic Modelling For Actuaries. Course Outline Semester 2, 2014
UNSW Australia Business School School of Risk and Actuarial Studies ACTL5103 Stochastic Modelling For Actuaries Course Outline Semester 2, 2014 Part A: Course-Specific Information Please consult Part B
More informationBootstrapping and Evaluating Named Entity Recognition in the Biomedical Domain
Bootstrapping and Evaluating Named Entity Recognition in the Biomedical Domain Andreas Vlachos Computer Laboratory University of Cambridge Cambridge, CB3 0FD, UK av308@cl.cam.ac.uk Caroline Gasperin Computer
More informationLinguistics. Undergraduate. Departmental Honors. Graduate. Faculty. Linguistics 1
Linguistics 1 Linguistics Matthew Gordon, Chair Interdepartmental Program in the College of Arts and Science 223 Tate Hall (573) 882-6421 gordonmj@missouri.edu Kibby Smith, Advisor Office of Multidisciplinary
More informationLanguage Independent Passage Retrieval for Question Answering
Language Independent Passage Retrieval for Question Answering José Manuel Gómez-Soriano 1, Manuel Montes-y-Gómez 2, Emilio Sanchis-Arnal 1, Luis Villaseñor-Pineda 2, Paolo Rosso 1 1 Polytechnic University
More informationSwitchboard Language Model Improvement with Conversational Data from Gigaword
Katholieke Universiteit Leuven Faculty of Engineering Master in Artificial Intelligence (MAI) Speech and Language Technology (SLT) Switchboard Language Model Improvement with Conversational Data from Gigaword
More informationEvaluation of a Simultaneous Interpretation System and Analysis of Speech Log for User Experience Assessment
Evaluation of a Simultaneous Interpretation System and Analysis of Speech Log for User Experience Assessment Akiko Sakamoto, Kazuhiko Abe, Kazuo Sumita and Satoshi Kamatani Knowledge Media Laboratory,
More informationEACL th Conference of the European Chapter of the Association for Computational Linguistics. Proceedings of the 2nd International Workshop on
EACL-2006 11 th Conference of the European Chapter of the Association for Computational Linguistics Proceedings of the 2nd International Workshop on Web as Corpus Chairs: Adam Kilgarriff Marco Baroni April
More informationProblems of the Arabic OCR: New Attitudes
Problems of the Arabic OCR: New Attitudes Prof. O.Redkin, Dr. O.Bernikova Department of Asian and African Studies, St. Petersburg State University, St Petersburg, Russia Abstract - This paper reviews existing
More information