Using a Wordnet Ontology to Improve the Search of the Digital Dialect Dictionary

SW4CH 2017, Nicosia, Cyprus, September 24-27, 2017
Using a Wordnet Ontology to Improve the Search of the Digital Dialect Dictionary
Miljana Mladenović, evox Solutions, Belgrade, Serbia
Ranka Stanković, University of Belgrade, Faculty of Mining and Geology
Cvetana Krstev, University of Belgrade, Faculty of Philology
https://sw4ch2017.ensma.fr/

We present a method for automatically relating dialect terms to the corresponding terms in the standard language, applied to the digital dialect dictionary at www.vranje.co.rs. The method uses SWRL rules defined in the Serbian WordNet ontology to identify sets of synonymous words. It also uses e-dictionaries to produce the correct standard-language lemmas that users typically search with. The method was applied to and evaluated on verbs and on a group of nouns derived from verbs (verbal nouns). Compared against the judgements of human evaluators, the system achieved an accuracy of 89.7%.

Digital dictionary of the South Serbian dialect
http://www.vranje.co.rs
- First implementation of an on-line dialect vocabulary for Serbian, produced from traditional dialect dictionaries
- ~20,000 entries: POS, linguistic information, sound (pronunciation), usage examples, dialect phrases, geolocation, etymology, semantic data, social networks and crowdsourcing
- Search by a term, by Boolean metadata queries, or by browsing by the first letter

Standard look-up for an on-line dictionary
What if the user is not familiar with the dialect? The goal is to connect the standard language and the dialect so that the dialect dictionary can be searched with standard-language terms.

Typical keyword based search

Boolean query

Semantic search

First letter search (filter)

Geolocated search results

Lexical entry geolocation

Resources for improving search performance
- Serbian morphological e-dictionaries and grammars, used to produce all inflected forms of standard terms: 140,000 lemmas and 5 million forms; 18,000 multi-word lemmas
- Serbian WordNet (SWN) OWL 2 ontology
- Rules expressed in the Semantic Web Rule Language (SWRL), used to generate synonymous groups on the basis of the indirect synonymy relation
University of Belgrade Human Language Technology Group

Use of morphological e-dictionaries
- The headword of a verb entry is the first person singular of the present tense, whereas users search for verbs using the infinitive.
- The infinitive form (lemma) was added both to the dialect verb and to the standard Serbian verbs taken from its definition.
- After all synonyms aligned with a dialect entry were separated, infinitive forms were attached to the original forms.
- For 3,452 verb entries, 7,353 synonyms were detected.
Examples (form_infinitive):
- batalim_bataliti: batalen, ostavim_ostaviti, napustim_napustiti
- batisujem: kvarim_kvariti, upropašćujem_upropašćivati
- bednim se: lepo se odevam_odevati, doterujem_doterivati se
- begam_begati: begaj, ja bega_begati, ti bega_begati, begajeći, bežim_bežati

Use of morphological e-dictionaries
- A lemma was assigned to 505 of the 3,452 dialect forms given in the first person singular, present tense.
- Infinitive forms were assigned to 4,384 of the 7,353 standard Serbian word forms connected to dialect forms.
- Words that were not lemmatized either are not present in the e-dictionaries or are adjectives used to describe verbs.
- The relation between verbal nouns and verbs was established in some entries, but not systematically. In the e-dictionaries all verbal nouns carry a special marker, which allowed 700 relations to be established.
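The lemma-attachment step can be pictured with a small sketch. It assumes a plain lookup table from inflected form to infinitive; the table `FORM_TO_LEMMA` and both function names are hypothetical stand-ins for the real e-dictionaries, and only the three form/lemma pairs are taken from the slide examples.

```python
# Hypothetical miniature e-dictionary: inflected form -> infinitive (lemma).
# The pairs come from the slide examples; the real resource covers
# millions of forms and is not a Python dict.
FORM_TO_LEMMA = {
    "ostavim": "ostaviti",
    "napustim": "napustiti",
    "kvarim": "kvariti",
}


def attach_lemma(form, lexicon):
    """Return 'form_lemma' when the form is known, otherwise the bare form."""
    lemma = lexicon.get(form)
    return f"{form}_{lemma}" if lemma else form


def annotate_definition(definition, lexicon):
    """Split a comma-separated definition and annotate each item."""
    items = [item.strip() for item in definition.split(",") if item.strip()]
    return [attach_lemma(item, lexicon) for item in items]


annotated = annotate_definition("batalen, ostavim, napustim", FORM_TO_LEMMA)
print(annotated)
```

Items that are missing from the lookup table (here `batalen`) are left unannotated, which mirrors the forms that stayed without a lemma in the counts above.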

Finding the set of near synonyms by using the WordNet ontology
- Serbian WordNet (SWN), based on Princeton WordNet (PWN), has more than 22,000 concepts (synsets).
- The SWN ontology currently has 2,243 verb synsets, defined as ontology individuals belonging to the VerbSynset class: <rdf:type rdf:resource="&swn30;verbsynset"/>
- Rules generate synonymous pairs of verbs found in the SWN ontology that are not based only on the relation of direct synonymy.
- A broader set of synonyms for each verb defined in the SWN ontology is produced using the relations synonym, similar to, also see, verb group, and hyponym.

Reasoning rules in the SWN ontology
Eclipse Java EE IDE Luna and Apache Jena were used for reasoning at the level of the OWL 2 language, by converting OWL rules into the Jena rules format:
  [rule1: (?a eg:label ?b) (?a eg:synonym ?c) (?c eg:label ?e) -> (?b eg:indirectsynonymy ?e)]
  [rule2: (?a eg:label ?b) (?a eg:similar_to ?c) (?c eg:label ?e) -> (?b eg:indirectsynonymy ?e)]
  ...
  [rule6: (?a eg:similar_to ?c) (?a eg:label ?b) (?c eg:synonym ?d) (?d eg:label ?e) -> (?b eg:indirectsynonymy ?e)]
- 33 reasoning rules define the indirectsynonymy relation.
- After inferencing, there are 6,430 pairs of verbs related by indirectsynonymy.
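To make the effect of rule1, rule2 and rule6 concrete, here is a minimal sketch in plain Python rather than Jena. The synset ids (`s1` to `s4`) and the relations between them are invented for illustration; only the verb labels appear in the slides. It applies the three rules once over a set of triples and is not the system's actual reasoner.

```python
# Toy triples (subject, predicate, object). The synset ids and the
# relations between them are hypothetical; the labels are verbs that
# appear in the slides' worked example.
TRIPLES = {
    ("s1", "label", "upropastiti"),
    ("s2", "label", "uništiti"),
    ("s3", "label", "zabrljati"),
    ("s4", "label", "uprskati"),
    ("s1", "synonym", "s2"),
    ("s1", "similar_to", "s3"),
    ("s3", "synonym", "s4"),
}


def objects(triples, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the triple set."""
    return {o for s, p, o in triples if s == subject and p == predicate}


def infer_indirect_synonymy(triples):
    """Apply rule1, rule2 and rule6 once; return (label, label) pairs."""
    pairs = set()
    for a in {s for s, _, _ in triples}:
        for b in objects(triples, a, "label"):
            # rule1 and rule2: a --synonym / similar_to--> c, c has label e
            for relation in ("synonym", "similar_to"):
                for c in objects(triples, a, relation):
                    for e in objects(triples, c, "label"):
                        pairs.add((b, e))
            # rule6: a --similar_to--> c --synonym--> d, d has label e
            for c in objects(triples, a, "similar_to"):
                for d in objects(triples, c, "synonym"):
                    for e in objects(triples, d, "label"):
                        pairs.add((b, e))
    return pairs


for pair in sorted(infer_indirect_synonymy(TRIPLES)):
    print(pair)
```

In this toy graph rule6 is what links `upropastiti` to `uprskati`, a pair that neither direct synonymy nor a single similar_to step would produce.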

Architecture of the system for building a resource that improves the dialect dictionary search tool
Inputs and tools:
- Digital Dialect Dictionary
- E-dictionaries of the standard language (morphological transformations for lemma generation)
- SWN ontology and the Jena inferencing tool
Processing steps and resulting tables:
- Extract the definitions of verbs in the dialect dictionary, given in the standard language. Table: dictionary verb entry related to its equivalent in the standard language.
- Generate lemmas with the e-dictionaries. Table: dictionary verb entry related to the equivalent standard language verb lemma.
- Invert the index. Inverted index table: standard language verb lemma related to the equivalent dialect entries.
- Infer synonym pairs of standard language verbs from the SWN ontology with Jena.
- Link each standard language verb lemma to its synonyms. Expanded inverted index table: relation between all standard language verb synonym lemmas and the equivalent dialect entries.

Example
1) Definition extraction:
   isabim (imp. isabi; aor. ja isabi, ti isabi; r.pr. isabija, -ila, -ilo) svr. iskvarim, upropastim.
2) Lemmatization:
   isabim: isabi; ja isabi; ti isabi; isabija; iskvarim_iskvariti; upropastim_upropastiti
3) Inverted table:
   upropastiti: isabim, batišem, dokrajišem, istrovim, izabim, izakam, oznobim, profućkam
4) Inference rules:
   upropastiti: unerediti, uništiti, uprskati, zabrljati, zakrmačiti, zasvinjiti
5) Join:
   upropastiti, unerediti, uništiti, uprskati, zabrljati, zakrmačiti, zasvinjiti: isabim, batišem, dokrajišem, istrovim, izabim, izakam, oznobim, profućkam
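Steps 3 to 5 of the example amount to inverting a table and then copying each lemma's dialect entries to its inferred synonyms. The sketch below shows that idea on an excerpt of the example data. The function names `invert` and `expand` and the in-memory dict layout are assumptions made for illustration; the slides do not say how the tables are stored.

```python
from collections import defaultdict

# Excerpt of the lemmatized table (step 2): dialect headword -> SL lemmas.
# 'batišem' -> 'upropastiti' is read off the slide's inverted table.
DIALECT_TO_LEMMAS = {
    "isabim": ["iskvariti", "upropastiti"],
    "batišem": ["upropastiti"],
}

# Excerpt of the inferred synonym pairs (step 4).
SYNONYMS = {"upropastiti": ["uništiti", "zabrljati"]}


def invert(dialect_to_lemmas):
    """Step 3: SL lemma -> set of dialect headwords defined with it."""
    index = defaultdict(set)
    for headword, lemmas in dialect_to_lemmas.items():
        for lemma in lemmas:
            index[lemma].add(headword)
    return dict(index)


def expand(index, synonyms):
    """Step 5: give every synonym of a lemma that lemma's dialect entries."""
    expanded = {lemma: set(heads) for lemma, heads in index.items()}
    for lemma, heads in index.items():
        for synonym in synonyms.get(lemma, ()):
            expanded.setdefault(synonym, set()).update(heads)
    return expanded


index = invert(DIALECT_TO_LEMMAS)
expanded = expand(index, SYNONYMS)
print(sorted(expanded["uništiti"]))
```

With the expanded table, a search for a standard-language verb that never occurs in any dialect definition can still reach the dialect entries of its synonym.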

Evaluation
Estimation of the accuracy of pairing the dialect dictionary (DD) and standard language (SL) entries:
- Human annotation: 2 language experts annotated the inverted table (step 3), answering the question "Does the SL infinitive have a similar meaning to the DD verb?" with 1 (yes), 2 (not clear) or 3 (no).
- Automatic procedure: each item was classified by whether it takes part in a relation: 1) related, 2) unrelated (DD headwords not related to any infinitive).
- Pairs marked 1 by the human annotators and classified as related are true positives.
- Pairs marked 2 or 3 by the human annotators and classified as related are false positives.
- Comparison with the unrelated set gives the false and true negatives.
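The mapping from human marks to confusion-matrix cells can be sketched as follows. The treatment of related items follows the slide. The treatment of unrelated items is an assumption: the sketch supposes that a human mark of 1 on an unrelated item means a valid SL equivalent existed and was missed. The judgements listed are invented, not the study's data.

```python
from collections import Counter


def outcome(system_related, human_mark):
    """Map one judged item to a confusion-matrix cell.

    human_mark: 1 = similar meaning, 2 = not clear, 3 = no.
    For unrelated items the mark is ASSUMED to say whether a valid
    standard-language equivalent existed and was missed.
    """
    human_positive = human_mark == 1
    if system_related:
        return "tp" if human_positive else "fp"
    return "fn" if human_positive else "tn"


# Hypothetical judgements: (classified as related?, human mark).
judged = [(True, 1), (True, 1), (False, 1), (False, 3), (True, 2)]
counts = Counter(outcome(related, mark) for related, mark in judged)
print(dict(counts))
```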

Evaluation
Confusion matrix: are dictionary entries correctly aligned with standard language entries?
P = tp / (tp + fp) = 1.000
R = tp / (tp + fn) = 0.874
F1 = 2PR / (P + R) = 0.933
Accuracy = 0.897
Remarks:
- The method is completely precise (P = 1.000).
- False negatives (FN) stem from shortcomings in the DD: typos, non-standard verb forms, a missing SL verb in the definition, or a misinterpreted DD verb.
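The reported scores can be cross-checked with the standard definitions. The sketch below recomputes F1 from the reported precision and recall. The underlying counts are not given in the slides, so the count-based functions are shown only as definitions.

```python
def precision(tp, fp):
    """Share of items classified as related that are truly related."""
    return tp / (tp + fp)


def recall(tp, fn):
    """Share of truly related items that were classified as related."""
    return tp / (tp + fn)


def f1(p, r):
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


def accuracy(tp, fp, fn, tn):
    """Share of all items that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


# Reported values: P = 1.000, R = 0.874.
print(round(f1(1.000, 0.874), 3))  # 0.933
```

The recomputed F1 agrees with the reported 0.933.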

Conclusion
- A method for improving search of the DD with key terms in the SL.
- SL e-dictionaries lemmatize verb forms.
- Serbian WordNet based SWRL rules identify sets of synonymous words for each verb and verbal noun defined in the ontology.
- The two sets of synonymous words (from the DD and from the SL) are joined.
- The method was evaluated against data provided by humans: accuracy = 89.7%.
Future work:
- Experiment with other POS.
- Try to expand the set of ontological rules used in this system.