ScienceDirect. Malayalam question answering system

Available online at www.sciencedirect.com

ScienceDirect

Procedia Technology 24 (2016) 1388-1392

International Conference on Emerging Trends in Engineering, Science and Technology (ICETEST - 2015)

Malayalam question answering system

Seena I T a,*, Sini G M a, Binu R b

a M.Tech Computational Linguistics, Dept. of Computer Science and Engg., Govt. Engineering College, Sreekrishnapuram, Kerala-678633, India
b Asst. Professor, Dept. of Computer Science and Engg., Govt. Engineering College, Sreekrishnapuram, Kerala-678633, India

Abstract

Question answering, an important part of natural language processing, aims at automatically finding concise answers to arbitrary questions phrased in natural language. This goal is hard to reach for Malayalam, a South Indian language, because of its agglutinative nature. Studies indicate that the usage of Malayalam documents on the web is increasing. In this paper we aim at retrieving factoid answers to questions in Malayalam from a given set of Malayalam documents under a closed domain. The TnT tagger is trained on a corpus of words in order to find precise factoid answers.

© 2016 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
Peer-review under responsibility of the organizing committee of ICETEST 2015.

Keywords: Question answering system; TnT tagger

1. Introduction

Question Answering (QA) is a fast-growing research area that combines research from different but related fields: Information Retrieval (IR), Information Extraction (IE) and Natural Language Processing (NLP) [1]. Malayalam belongs to the Dravidian family of languages and is one of the four major languages of this family, with a rich literary tradition. It is also rich in morphological inflections, i.e., suffixes are added to root or stem words.
Due to this agglutinative nature, researchers find it difficult to build Malayalam question answering systems. Google, the most widely used search engine, is trying to incorporate many languages. Malayalam newspapers and other kinds of documents are quite common. As with English, getting an exact answer to a particular question from a set of Malayalam documents is difficult. Malayalam question answering is a way to ask natural language questions in Malayalam and get precise answers, as the user wishes. Basically, question answering systems are of two types: Web based and IR/IE based [4].

I.T. Seena et al. / Procedia Technology 24 (2016) 1388-1392. 2212-0173. doi:10.1016/j.protcy.2016.05.155

2. Literature Review

2.1. Closed domain question answering system: a survey

Shubhangi Tripude et al. [1] surveyed closed domain question answering systems. The survey gives an overview of question answering architecture, various QA models, and a question answering system for the domain of legal documents in Indian law.

2.2. Chodhyothari: Question answering system for Malayalam

Sreejith et al. [2] built a Malayalam question answering system that processes a number of documents and extracts answers from them based on the question words given. The methodology uses NLP tools to find precise answers.

[Figure: Question Type Analysing -> Document Selection & Processing -> Answer Extraction]
Fig. 1. (a) Basic model of question answering system

3. Overview

A Malayalam question answering system lets users ask natural language questions in Malayalam. Every user expects a precise answer to a question. Surfing the web is time consuming because of the enormous number of documents on it. Here we propose a system that gives exact answers to questions based on domain-specific documents. The basic model of the QAS is shown in Fig. 1. Analysing the type of the question makes it easy to identify the kind of answer expected; e.g., the question word Aaru expects a subject noun, which is a name, as its answer. From a set of documents from a single domain, the system identifies the sentence that is ranked highest with respect to the words in the question. Extracting the answer key from that sentence is the next step, which is done by the answer extraction module.
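The first two stages described above (finding the question word, then ranking sentences by word overlap with the question) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the tokenizers are toy regular-expression splitters for transliterated text, lemmatization is skipped entirely, and the example words `alpha`, `beta`, etc. are placeholders rather than real Malayalam. Only the question words and the inflections of Aare come from the paper.

```python
import re

# Question words (transliterated) mentioned in the paper, mapped to a base form.
# Aareyanu and Aarellam are the inflections of Aare that the paper lists.
QUESTION_WORDS = {
    "aaru": "aaru",
    "aare": "aare",
    "aareyanu": "aare",
    "aarellam": "aare",
    "evide": "evide",
    "eppol": "eppol",
}


def tokenize(text):
    """Toy word tokenizer: lowercase, then split on anything that is not a-z."""
    return [tok for tok in re.split(r"[^a-z]+", text.lower()) if tok]


def split_sentences(document):
    """Toy sentence tokenizer: split on '.', '?' and '!'."""
    return [s.strip() for s in re.split(r"[.?!]", document) if s.strip()]


def analyse_question(question):
    """Return (base question word or None, remaining words used as keywords).

    The question word may appear at any position in the question.
    """
    question_word = None
    keywords = []
    for tok in tokenize(question):
        if question_word is None and tok in QUESTION_WORDS:
            question_word = QUESTION_WORDS[tok]
        else:
            keywords.append(tok)
    return question_word, keywords


def rank_sentences(keywords, documents):
    """Score every sentence by the number of distinct keywords it contains."""
    wanted = set(keywords)
    scored = []
    for document in documents:
        for sentence in split_sentences(document):
            score = len(wanted & set(tokenize(sentence)))
            scored.append((score, sentence))
    # sorted() is stable, so sentences with equal scores keep document order.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)


if __name__ == "__main__":
    qword, keywords = analyse_question("alpha Evide beta")
    ranked = rank_sentences(keywords, ["gamma alpha delta. alpha beta epsilon."])
    print(qword, keywords, ranked[0])
```

In a real system the keyword and sentence tokens would be lemmatized before comparison, as the paper describes, so that inflected forms of the same root still match.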

4. Malayalam QA system modules

The Malayalam question answering system can be viewed as three main modules. Question words take different forms, e.g., Aaru, Evide, Eppol, and each form also has its morphological inflections. So the first module analyses the question type. After the question is identified, the next step is to find the document that contains the answer sentence, so document selection and processing is the next important part. After the exact sentence containing the answer word is obtained, extracting that word is the final step in the Malayalam QAS. In general, the main modules of the Malayalam question answering system are:

(i) Question type analyzing
(ii) Document selection and processing
(iii) Answer extraction

4.1. Question type analyzing

Analysing the question is an important part of a Malayalam QAS. A natural language question asked by the user may be in any form, and the system has to deal with all types of answerable questions; no fixed format can be expected. The question word may or may not appear as the first word in the question. The position of the question word does not cause much of a problem, but the inflections of question words are a main area of focus. E.g., the question word Aare has inflections such as Aareyanu and Aarellam. Morphological inflections of particular question words are also considered in our work. The main question words in Malayalam are evide, aaru, eppol, etc., and we focus on these. The main task of the question type analyzing module is to identify the question word. Based on a study of different question patterns in Malayalam, we collected different question word types and used them to identify the correct answer sense. In order to identify the answer sentences in the document, a collection of keywords is required. The remaining words in the question, i.e., all words except the question word, are chosen as keywords and lemmatized.

4.2. Document selection and processing

From a set of documents under a single domain, each document is selected and split into sentences with a sentence tokenizer, and the sentences are stored in an array. Each element of the array is split into words with a word tokenizer, and the words are lemmatized. To rank each sentence, pattern matching is used to compare the words in the question with the words in the sentence. The highest-ranked sentence (the sentence that shares the most words with the question) is selected as the answer candidate. After the answer sentence is obtained, the machine learning tool TnT tagger is used to tag the words as subject-noun, object-noun, location information, temporal information, etc., so that the answer word can be extracted from the sentence. The TnT tagger uses a second-order Markov model for part-of-speech tagging [4]. A domain-specific corpus is used to train the tagger.

4.3. Answer Extraction

Each question word expects a particular tag as its answer key. E.g., the question word Aaru expects a person noun as its answer. This stage analyses the question word and its corresponding expected answer tag and finds the answer key in the tagged corpus. From the question word, we find the subject-object relationship and extract the correct answer word from the sentence. E.g., Aare is a question word that expects an object-noun as the answer word. Similarly, the question word Aaru expects a subject-noun as the answer key.

5. Implementation

The Malayalam question answering system is implemented for the domain of personalities in Kerala sports. We collected Malayalam documents that contain the details of a personality in Kerala sports. As a first stage, we create an array of question words. For a given question, a word tokenizer splits the question into words, and the question word is found by pattern matching, i.e., each word in the question is compared with the words in the array (the list of question words). After the question word is identified, the remaining words are placed in an array that is used to find the best-matching answer sentence. Stemming is applied to the remaining words to obtain their root forms, which help to find the best-matching sentences in the document. The documents are split into sentences and lemmatized in the same way, and the sentences are ranked by pattern matching (a sentence that has more words in common with the question is ranked higher). Finally, the matched answer sentence is obtained. A rule-based approach is used to find the appropriate answer key for a particular question word from the tagged corpus, e.g.:

Question word    Expected tag
Aaru             SubNoun
Aare             ObjNoun
Evide            Loc
Eppol            Time

6. Experimental results

The experiment started by selecting a domain and collecting a set of documents related to it. We studied the different question representations in Malayalam, covering almost all question word patterns in the language. We ran the experiment with a set of questions under the specific domain, and the answer set indicates 70% accuracy for factoid answers.

7. Conclusion and future work

Due to the agglutinative nature of South Indian languages, little work has been done on them, especially on Malayalam. As people seek exact answers to their queries, users need a dedicated system that gives exact answers. The Malayalam question answering system is a good starting point for upcoming work on the Malayalam language. In this paper we focus only on factoid answers. Complex types of questions involve many problems. The main problem in a simple question answering system is anaphora resolution. As future work, anaphora resolution can be included in the Malayalam question answering system, which would improve its efficiency. Semantic-based question answering needs a specific representation for each sentence in the document. Future research should focus on such sentence representations, which would open up great scope in the field of Malayalam question answering.
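As a supplementary illustration, the rule-based answer extraction of Sections 4.3 and 5 can be sketched in Python as a lookup from question word to expected tag. This is a hedged sketch, not the authors' code: the function name `extract_answer`, the catch-all tag `Other`, and the placeholder words are assumptions made for illustration, and the tagged sentence is written by hand here, whereas in the paper it would come from a TnT tagger trained on the domain corpus. The four question words and their tags are the ones given in the paper.

```python
# Question word -> expected answer tag, as listed in Section 5 of the paper.
EXPECTED_TAG = {
    "aaru": "SubNoun",
    "aare": "ObjNoun",
    "evide": "Loc",
    "eppol": "Time",
}


def extract_answer(question_word, tagged_sentence):
    """Return the words of the answer sentence that carry the expected tag.

    tagged_sentence is a list of (word, tag) pairs. An unknown question
    word yields an empty list.
    """
    expected = EXPECTED_TAG.get(question_word.lower())
    if expected is None:
        return []
    return [word for word, tag in tagged_sentence if tag == expected]


if __name__ == "__main__":
    # Hand-written placeholder output of a tagger; "Other" is a hypothetical
    # tag for words that are never answer keys.
    tagged = [
        ("w_person", "SubNoun"),
        ("w_place", "Loc"),
        ("w_year", "Time"),
        ("w_verb", "Other"),
    ]
    print(extract_answer("Evide", tagged))
    print(extract_answer("Aaru", tagged))
```

Keeping the rules in a plain dictionary makes it easy to add further question words or inflected forms without changing the extraction logic.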

References

[1] Shubhangi Tripude, Dr A S Alvi. Closed domain question answering system: a survey. International Journal on Informative and Futuristic Research; May 2015.
[2] Sreejith C, Nibbesh K, PC ReghuRaj. Chodhyothari: Question answering system for Malayalam. As a part of CERD, Center for Engineering Research and Development; 2013.
[3] Unmesh Sasikumar, Sindhu L. A survey of natural language question answering system. In: Proceedings of International Journal of Computer Applications (0975-8887), Volume 108, No. 15; December 2014.
[4] Thorsten Brants. TnT statistical tagger. In: Proceedings of the Sixth Natural Language Processing Conference, ANLP 2000, AMY 3; 2000.