FST Based Morphological Analyzer for Hindi Language

Deepak Kumar 1, Manjeet Singh 2, and Seema Shukla 3

1 Department of Information Technology, JSS Academy of Technical Education Noida, Uttar Pradesh, India, deepakk799@gmail.com
2 Department of Information Technology, JSS Academy of Technical Education Noida, Uttar Pradesh, India, manjeet207@gmail.com
3 Department of Computer Science & Engineering, JSS Academy of Technical Education Noida, Uttar Pradesh, India, seemashukla@gmail.com

Abstract

Because Hindi is a highly inflectional language, an FST (Finite State Transducer) based approach is the most efficient way to develop a morphological analyzer for it. The work presented in this paper uses the SFST (Stuttgart Finite State Transducer) tool to generate the FST. A lexicon of root words is created, and rules are then added for generating inflected and derived words from these root words. The Morph Analyzer was used in a Part Of Speech (POS) Tagger based on the Stanford POS Tagger. The system was first trained on a manually tagged corpus, and the MAXENT (Maximum Entropy) approach of the Stanford POS Tagger was then used to tag input sentences. The morphological analyzer gives approximately 97% correct results. The POS tagger gives an accuracy of approximately 87% for sentences whose words are known to the trained model file, and 80% for sentences containing words unknown to the trained model file.

Keywords: Morphological Analyzer, Finite State Transducer, POS Tagger, Lexicon Generator.

1. Introduction

In recent years, large collections of digitized Hindi documents have been created thanks to advances in information technology. An efficient access system has to take the morphology of the language into account, through a systematic linguistic study that identifies the words significant to users such as historians and linguists and describes the morpho-phonological rules. The result of that study is the starting point for constructing computational morphologies that make it possible to search documents by word root and locate all the corresponding inflected words [1,2].

A highly inflectional language can generate hundreds of words from a single root. Morphological analysis is therefore vital for high-level applications that need to understand the various words of the language, and it is the foundation for applications such as Information Retrieval, POS Tagging, Chunking and, ultimately, Machine Translation [3,4].

The available Morphological Analyzers for Hindi follow stemming based, corpus based and paradigm based approaches. The corpus based approach [5] has the disadvantage that a large volume of data needs to be processed. The stemming based approach [6] divides a word into its stem and its suffix or prefix and then processes these parts. To analyze a word, the corresponding stem must be present in the dictionary. This approach may successfully analyze regular words, but it cannot analyze irregular words. A paradigm defines all the word forms of a given stem and also provides a feature structure for every word form. The paradigm approach [7] alone cannot successfully analyze the morphology of the word forms because of inflection by circumfix; it gives better results when combined with one of the other approaches.

The design of Morphological Analyzers is dominated by approaches developed for agglutinative languages or for languages like English that show a low degree of inflection. Though agglutinative languages show a high morpheme-per-word ratio and have complex morphotactic structures, the absence of fusion at morpheme boundaries makes segmentation straightforward once the model for implementing the morphotactics is ready. Against this background, a morphological analyzer for a highly inflectional language like Hindi, which tends to overlay morphemes in a way that aggravates the task of segmentation, presents an interesting case study [8].
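To make the paradigm idea from the Introduction concrete, here is a minimal Python sketch of a paradigm that lists every word form of a stem together with a feature structure, and of analysis as a reverse lookup over those forms. It is not the SFST implementation used in this work. The paradigm table, the tag names, the helper names `generate` and `analyze`, and the one-word lexicon are all assumptions made for illustration.

```python
# Illustrative only: a toy paradigm for masculine nouns ending in "ा".
# The suffix table and the feature names are assumptions for this sketch,
# not the rule files described in the paper.
PARADIGM_AA = [
    # (suffix that replaces the final "ा", feature structure)
    ("ा", ("Noun", "masculine", "sg", "direct")),
    ("े", ("Noun", "masculine", "pl", "direct")),
    ("े", ("Noun", "masculine", "sg", "oblique")),
    ("ों", ("Noun", "masculine", "pl", "oblique")),
]


def generate(root):
    """Return every (word form, features) pair the paradigm licenses for root."""
    if not root.endswith("ा"):
        raise ValueError("this toy paradigm only covers roots ending in ा")
    stem = root[:-1]  # drop the final vowel sign
    return [(stem + suffix, feats) for suffix, feats in PARADIGM_AA]


def analyze(word, lexicon):
    """Return all (root, features) analyses of word for the roots in lexicon."""
    analyses = []
    for root in lexicon:
        for form, feats in generate(root):
            if form == word:
                analyses.append((root, feats))
    return analyses


lexicon = ["कमरा"]  # hypothetical one-word lexicon ("room")

for form, feats in generate(lexicon[0]):
    print(form, feats)

# An ambiguous surface form yields more than one analysis.
plural_or_oblique = generate(lexicon[0])[1][0]
print(analyze(plural_or_oblique, lexicon))
```

A table like this handles regular suffixation well, but every irregular form needs its own entry. That limitation is one reason to combine paradigms with another approach.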

2. Methodology

Development of the morphological analyzer consists of two phases: lexicon generation and generation of the morphological processor, which is done using the SFST tool. The morph-analyzer was then tested for effectiveness by using it in a Hindi part-of-speech (POS) tagger developed through the API provided by the Stanford POS Tagger.

SFST was developed by the Institute for Natural Language Processing, University of Stuttgart. It comprises a compiler, which translates finite state transducer programs into minimized transducers, and a range of transducer tools. It also supports UTF-8 character coding, which is important for implementing Hindi computational morphologies [9,10]. The POS Tagger for Hindi is developed using the Stanford POS Tagger, which provides a Java API. This API follows the log-linear approach to POS tagging [11,12].

2.1 Lexicon Generation

The lexicon was generated from the raw corpus collected from LDC-IL (Linguistic Data Consortium for Indian Languages). Fig. 1 shows the approach used for lexicon generation. Unique words are extracted from the corpus and sorted to make manual processing easier. These words are then manually classified into classes according to their inflection and derivation types.

The total number of base words contained in the classified lexicon files is 91930. The lexicon file for nouns contains 50520 base words, the file for pronouns 81, the file for adjectives 33006, the file for verbs 8513, the file for adverbs 1559, and the file for particles 12. There are 9930 base words that are classified as both adjectives and nouns.

Fig. 1 Lexicon Generator.

2.2 Morphological Processor

Fig. 2 illustrates the approach used for developing the morphological processor, which has two components: the Morphological Analyzer and the Morphological Generator. Both the analyzer and the generator require a dictionary of root words, a file containing the FST rules, and a dictionary of indeclinable words. The inflectional and derivational rules are written by hand and then coded with the SFST tool to generate the .fst files. These .fst files, combined with the lexicon, generate the Finite State Transducer. The dictionary of indeclinable words contains words that have a specific grammatical word structure, such as अत करण. Separate rules need to be written for these indeclinable words. The Analyzer takes the surface form as input and produces the grammatical structure of the word, i.e. the lexicon form. The Generator takes the lexicon form as input and produces the corresponding surface form.

Fig. 2 Morphological Processor.

2.3 POS Tagger

Fig. 3 illustrates the system model for the POS Tagger, which contains two modules: the training module and the tagger module. The model file was trained on a text corpus collected from the website of the Hindi newspaper Dainik Jagran and then tagged manually. This tagged corpus is used to train the tagger, which produces the trained model file and the dictionary of words. The trained model file contains all the probable parts of speech that may be assigned to each word. The dictionary contains all the root words from the corpus.
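As a rough illustration of how a log-linear tagger turns the candidate tags for a token into a single decision, the Python sketch below converts per-tag scores into probabilities and keeps the most probable tag. It does not reproduce the Stanford POS Tagger's internals. The function names and the scores are made up for this example, and only the tag names JJ and N_NN are taken from the sample results in this paper.

```python
import math


def log_linear_probs(scores):
    """Map each tag's score (sum of feature weights) to a probability."""
    top = max(scores.values())  # subtract the maximum for numerical stability
    exps = {tag: math.exp(s - top) for tag, s in scores.items()}
    z = sum(exps.values())      # normalizing constant
    return {tag: e / z for tag, e in exps.items()}


def pick_tag(scores):
    """Return the candidate tag with the highest probability."""
    probs = log_linear_probs(scores)
    return max(probs, key=probs.get)


# Hypothetical scores for one ambiguous token in two different contexts.
scores_before_noun = {"JJ": 2.0, "N_NN": 0.5}
scores_before_verb = {"JJ": 0.1, "N_NN": 1.7}

print(pick_tag(scores_before_noun))  # JJ
print(pick_tag(scores_before_verb))  # N_NN
```

In a real tagger the scores come from weights learned on the tagged corpus, so the same word can receive different tags in different contexts.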

Fig. 3 POS Tagger.

To tag a Hindi sentence, the sentence is first tokenized. Each token is then looked up in the dictionary and assigned all the probable parts of speech according to the trained model file. If the token is not a root word, the morphological specification provides the way to find the base word form(s) for that token. The MAXENT Tagger (provided with the Stanford Tagger) is then used to calculate the probability of each part of speech for a given token. The token is tagged with the part of speech that has the maximum probability.

3. Experimental Results

To test the morph-analyzer and the POS Tagger, an interface was developed in Java. The interface uses the Google Hindi keyboard, available under Google Transliteration IME, an input method editor that lets users enter text in one of the supported languages using a roman keyboard.

3.1 Morphological Analyzer

The morphological analyzer was tested with about 4000 inflectional, derivational and compound words. Some sample results are shown in Tables 1, 2 and 3.

Table 1 Results for Noun Inflections

Word | Type | Analysis
म ल | | म ल <Noun><masculine><sg>
म लन | | म ल <Noun><feminine><
कह न | | कह न <Noun><masculine><
कह नय | | कह न <Noun><masculine><pl>
अर लड क | Case | लडक <Noun><Vocative>
म ज़ | | म ज़<Noun><Masculine><
म ज़ | | म ज़<Noun><Masculine><pl>
श र | | श र<Noun><Masculine><
श रन | | श र<Noun><feminine><
लडक | | लड क <Noun><masculine><sg>
लडक | | लडक <Noun><feminine><

Table 2 Results for Noun Derivations

Word | Type | Analysis
शमर | Derivation from Noun | शमर<Noun><Masculine><sg>
ब शमर | Derivation from Noun | ब शमर<Noun><Masculine><
म ठ | | म ठ <Noun><Masculine><sg>
मठ ई | | मठ ई<Noun><Masculine><
कम न | | कम न <Noun><Masculine><
कम न पन | | कम न पन<Noun><Masculine><
प व त | | प व त<Noun><Masculine><
प व तत | | प व तत <Noun><Masculine><

Table 3 Results for Verb

Word | Type | Analysis
ज रह | Person | ज <Verb><Indicative><Masculine><progressive><
ज रह | Person | ज <Verb><Indicative><Masculine><progressive><pl>

पढ | | पढ <Verb><Indicative><Masculine>
पढ | | पढ <Verb><Indicative><Faminine>
ज, ज त | | ज <Verb><present>; ज <Verb><Transitive>; ज <Verb><Dative>; ज <Verb><Imprative><Intimate>; ज <Verb><Indicative><Masculine><Perfectiv><sg>
करत | | कर<Verb><Indicative><Masculine><habitual><
करत | | कर<verb><indicative><masculine><Habitual><pl>

Table 4 depicts the performance of the Morph-Analyzer.

Table 4 Performance Evaluation for Morph Analyzer

Word Type | No. of Words | Percentage of Correct Results
Noun inflections, derivations | 1000 | 93
(word type missing) inflections, derivations | 1000 | 95
Verb inflections, derivations | 1000 | 92
Adverb inflections, derivations | 1000 | 98

3.2 POS Tagger

The Part-of-Speech (POS) Tagger was tested with about 100 sentences. Some sample results are shown in Tables 5 and 6.

Table 5 Results for Simple Sentence Tagging

Sentence: म घर ज रह ह
Tagged Sentence: म /PR_PRI घर/N_NN ज /V_VM रह /V_AUX ह /V_AUX /

Sentence: यह सभ अल करण प क त बद ध ह कर स प णर म दर क एक म ल क सद श घ र
Tagged Sentence: यह/PR_PRI सभ /JJ अल करण/N_NN प क त बद ध/JJ ह कर/V_VM स प णर/JJ म दर/N_NN क /PSP एक/QT_QTC म ल /N_NN क /PSP सद श/JJ घ र /N_NN

Table 6 Results for the POS Tagger on Sentences with Ambiguous Words

Sentence: आम आदम आम ख त उसक ख त स ख य एक आम आदम आम ब चत
Tagged Sentence: आम/JJ आदम /N_NN आम/N_NN ख त /V_VM उसक /PR_PRI ख त /N_NN स ख य /JJ एक/QT_QTC ह /V_AUX / आम/JJ आदम /N_NN आम/N_NN ब चत /V_VM

Table 7 depicts the performance of the POS Tagger for Hindi.

Table 7 Performance of the POS Tagger

Sentence Type | No. of Sentences | Accuracy of Correct Results (in Percentage)
Sentence with Known Words | 100 | 93
Sentence with Unknown Words | 100 | 87

4. Conclusion

The Morph-Analyzer was successfully developed and used in a POS tagger. It can similarly be used in other NLP applications. The Morph Analyzer can be enhanced by combining the paradigm approach with the FST approach.

References

[1] A. Ralli, E. Galiotou, Greek: A Challenging Case for the Parsing Techniques of PC-KIMMO v.2, International Journal of Computational Intelligence, vol. 1, no. 2, pp. 152-162, 2004.
[2] L. Karttunen, KIMMO: A General Morphological Processor, Texas Linguistic Forum 22, pp. 163-186, 1983.
[3] E. L. Antworth, PC-KIMMO: A Two-Level Processor for Morphological Analysis, Academic Computing, Summer Institute of Linguistics, Dallas, 1990.
[4] R. M. Kaplan, M. Kay, Phonological Rules and Finite-State Transducers, in Linguistic Society of America Meeting Handbook, Fifty-Sixth Annual Meeting, New York, Abstract, 1981.
[5] M. Creutz, K. Lagus, Unsupervised Morpheme Segmentation and Morphology Induction from Text Corpora Using Morfessor 1.0, Neural Networks Research Centre, Helsinki University of Technology, Finland, 2006.
[6] A. Ramanathan, D. D. Rao, A Lightweight Stemmer for Hindi, National Centre for Software Technology, Rain Tree Marg, Sector 7, CBD Belapur, Navi Mumbai, 2004.
[7] M. Bapat et al., A Paradigm-based Finite State Morphological Analyzer for Marathi, COLING, Beijing, August 2010.
[8] S. Singh, K. Gupta, M. Shrivastava, P. Bhattacharyya, Morphological Richness Offsets Resource Demand: Experiences in Constructing a POS Tagger for Hindi, COLING/ACL, pp. 779-786, 2006.
[9] H. Schmid, A Programming Language for Finite State Transducers, in Proc. of FSMNLP 2005, Helsinki, Finland, 2005.
[10] H. Schmid, A. Fitschen, U. Heid, SMOR: A German Computational Morphology Covering Derivation, Composition, and Inflection, in Proceedings of the IVth International Conference on Language Resources and Evaluation (LREC 2004), Lisbon, Portugal, pp. 1263-1266, 2004.
[11] A. Dalal, K. Nagaraj, U. Sawant, S. Shelke, Hindi Part-of-Speech Tagging and Chunking: A Maximum Entropy Approach, in Proceedings of the NLPAI Machine Learning Contest 2006, NLPAI, 2006.
[12] K. Toutanova, C. D. Manning, Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger, in Proceedings of EMNLP/VLC, Hong Kong, 2000.