Automatic Detection of Copulatives in Northern Sotho corpora


Automatic Detection of Copulatives in Northern Sotho corpora
Gertrud Faaß and Elsabé Taljard
Universities of Hildesheim and Pretoria
5th International Conference on Bantu Languages, Paris, June 12th to 15th, 2013
June 14th, 2013
Faaß/Taljard (Hildesheim/Pretoria) Copulatives in Corpora June 14th, 2013 1 / 32

Project Background
Scientific e-lexicography for Africa (SeLA)
Universities of Hildesheim, Pretoria, Stellenbosch, South Africa (UNISA), and Windhoek
Prototype e-dictionaries for several of the South African national languages (June 2012 to May 2015)
Several sub-projects, specifically: acquisition tools and data
Our task: a corpus-linguistic study of the Northern Sotho (NSO) copulative:
  Which of the described constellations exist in the available corpus?
  What are the frequencies of occurrence?
  Can we learn anything about typical complements?
Theoretical background: Lexicographic Function Theory

The Function Theory
Main development: Centlex in Aarhus (see URL in link list)
Central notion is the purpose ("function") of a dictionary, e.g.
  I need to understand words, phrases or sentences → reception
  I need to generate words, phrases or sentences myself → production

What do we need? Production purposes
A database containing all possible forms of NSO copulatives
Add glosses, translations and examples (if possible, from corpora)
Guide users in their text production, e.g. by means of a decision tree: selection of appropriate copulatives

Example: Decision Tree
Production purposes: experimental work by project team members
Open question: What to do for reception purposes?
Copyright: Bothma and Prinsloo

Intro: What is a copula? A simple account
A copula links a subject with its complement(s)
In English: to be, i.e. I am, you are, (s)he/it is, we are, ...
General: possible verbal modifications¹
  person (1st/2nd/3rd)
  number (sg/pl)
  tense (non-past (present and future)/past)
  aspect (simple/progressive/perfect/perfect progressive)
  mood (indicative/imperative/emphatic/progressive/subjunctive)
Leads to 3 x 2 x 3 x 4 x 5 = 360 possible constellations (of which a number are homographs)
Polarity is not specifically described
to have (association) is not a copulative in English
¹ Origin of these definitions: Wikipedia; see the list of URLs on the last slide.
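
The figure of 360 can be re-derived by enumerating the category values listed on this slide. The following Python sketch does that; the name CATEGORIES is hypothetical, and splitting tense into three values (present, future, past) is an assumption made to match the factor 3 in the slide's product.

```python
from itertools import product
from math import prod

# Category values as listed on the slide. "tense" is split into three
# values (an assumption that matches the slide's factor of 3).
CATEGORIES = {
    "person": ["1st", "2nd", "3rd"],
    "number": ["sg", "pl"],
    "tense": ["present", "future", "past"],
    "aspect": ["simple", "progressive", "perfect", "perfect progressive"],
    "mood": ["indicative", "imperative", "emphatic", "progressive", "subjunctive"],
}

# Size of the full paradigm: product of the number of values per category.
total = prod(len(values) for values in CATEGORIES.values())

# Enumerate every constellation explicitly as a tuple of feature values.
constellations = list(product(*CATEGORIES.values()))

print(total, len(constellations))
```

Enumerating the tuples, rather than only multiplying, gives one row per constellation, which is the shape a table of constellations would need.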

Copula of Northern Sotho: For students
A Handbook of the Northern Sotho Language, Ziervogel (1988:63): There are two kinds of copulatives, viz. (a) the copulative of identification and (b) the copulative of description
Ziervogel does not refer to Lyons (1968); however, Lyons had described these categories before:
  "Identifying copulative": Lyons (1968:389): Apples are fruit ("sortal")
  "Descriptive copulative": Lyons (1968:389): Apples are sweet ("characterizing")

Copula of Northern Sotho: For students
Northern Sotho for First-Years (Van Wyk et al. (1992:31)): The complement is always non-verbal ... There are three types [...] identifying, descriptive and associative constructions
The associative describes association, but also possession in the sense of to be with (e.g. another person, money, etc.)
Example: O na le tšhelete na?
  O          CSPERS 2sg
  na         VCOP
  le         PART con
  tšhelete   N09
  na         PART ques
  ?          $.
  Literally: You are with money, hm?
  Translation: Have you got money (with you)?

Copula of Northern Sotho: For scholars
A Linguistic Analysis of Northern Sotho (Poulos and Louwrens (1994:291 et seq.)):
  (1) The identifying copulative
  (2) The descriptive copulative
  (3) The associative copulative
  (4) The locational copulative
N.B. The locational and descriptive copulas are morphologically identical; the distinction is based on the different nature of their complements.

Copula of Northern Sotho: Poulos and Louwrens System
Modifications
Copulative categories (identifying/descriptive/associative):
  polarity (pos/neg)
  1st and 2nd person in singular and plural
  classes (altogether 13)
  present (principal/participial)
  future (principal)
  past (principal/participial)
  potential, subjunctive, consecutive, habitual
  infinitive, imperative
The descriptive¹ and the associative² copulatives have a compound tense.
No classification into tense/aspect/mood
Poulos and Louwrens describe 1,328 possible constellations
¹ p. 311: Diaparô di bê di le mêêtse (The clothes were wet.)
² p. 315: Ke bê ke na le ntlô ka gê bê ke šoma ka maatla matšatšing ao (I had a house because I used to work hard in those days.)

Copula of Northern Sotho: For lexicographers
The Lemmatization of Copulatives in Northern Sotho (Prinsloo (2002:28)):
  two types of copulatives can be distinguished, namely static (in a state of rest) and dynamic (in motion or changing)
  copulatives express three different semantic relations between a subject and a complement, namely identification/equality, descriptive or associative
N.B. Prinsloo estimates there are 2,040 different possible constellations for the dynamic copulative only, including the potential forms, which we have not included yet.
Lombard (1985:192 et seq.) describes the three categories (identifying, descriptive and associative) in a similar way

Copula of Northern Sotho: Terminology
Differentiation between stative and inchoative
Lyons (1968:389):
  static copulative: John has a book
  dynamic copulative: The book became valuable
N.B.: In this presentation, we refer to the static form of the copula as stative and to the dynamic form as inchoative

Copula of Northern Sotho: For (computational) linguists
A morpho-syntactic description of Northern Sotho as a basis for an automated translation from Northern Sotho into English (Faaß (2010:125 et seq.))
An attempt to describe all possible constellations, with the exception of potential forms, relying on Prinsloo (2002) and Poulos and Louwrens (1994); however, restricted for space reasons (similar to Poulos and Louwrens)

Copula of Northern Sotho: Constellations based on Faaß (2010:128, Table 3.30)
Copulative: Identifying, Descriptive, Associative
Category: stative, inchoative (for each copulative)
Tense
  pres: x x x x x x
  past: x x x x x x
  fut: x x x
Mood/Aspect
  indicative: x x x x x x
  situative: x x x x x x
  relative: x x x x x x
  consecutive: x x x
  habitual: x x x
  infinitive: x x x
  imperative: x x x

Copula of Northern Sotho: Other categories
  person
  number (only for person)
  class
  polarity
Our table currently contains 2,116 constellations (929 types; thus: many homographs!)

Reception? How to extract corpus examples?
Very problematic from the start:
  Faaß et al. (2009): many homographs (syncretism on the orthographic level), e.g. a is 8-way ambiguous
  Lombard (1985): the categories were mainly described on semantic and only partially on morpho-syntactic grounds
  No training data for statistical analysis (yet) available → manual inspection necessary

Searching corpora for copulatives
Pretoria Sepedi Corpus, PSC (De Schryver and Prinsloo (2000))
Current size: 8,007,653 tokens (including punctuation); sources/contents not defined exactly
Part-of-speech tagged (cf. Taljard et al. (2008), Faaß et al. (2009)) and encoded in Corpus WorkBench (CWB, see link list)
CWB allows for automated (offline) queries by means of scripts (e.g. Perl) and macros

Copulative constellations in detail
Homography: o tlo ba as a case in point (cf. Faaß et al. (2009))
o:
  subject concord of class 1, 3
  subject concord of 2nd person singular
  object concord of class 3
tlo:
  future tense morpheme (exchangeable with tla)
ba:
  subject, object, and possessive concord of class 2
  demonstrative of class 2
  auxiliary and copulative verb stem
N.B. Heuristic taggers select the most frequent part of speech occurring in the training data. This is unreliable for such homographs, while words with only one part of speech or with few differences in their distribution are easily identified and usually tagged correctly.
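
To illustrate why a most-frequent-tag heuristic is unreliable for homographs such as o and ba, here is a minimal Python sketch of such a baseline. It is not the tagger used for the PSC; the training pairs, their counts, and the tag names (CS01, CS02, CSPERS_2sg, MORPH_fut, VCOP as used here) are hypothetical placeholders chosen only for illustration.

```python
from collections import Counter, defaultdict

# Toy training data: (token, tag) pairs. Tag names and counts are
# hypothetical, not taken from the tagset or corpus of the study.
TRAINING = [
    ("ba", "CS02"), ("ba", "CS02"), ("ba", "CS02"),  # subject concord, class 2
    ("ba", "VCOP"),                                   # copulative verb stem
    ("tlo", "MORPH_fut"),
    ("o", "CS01"), ("o", "CS01"), ("o", "CSPERS_2sg"),
]

def train(pairs):
    """Count how often each token carries each tag."""
    counts = defaultdict(Counter)
    for token, tag in pairs:
        counts[token][tag] += 1
    return counts

def tag(tokens, counts, unknown="UNK"):
    """Assign every token its most frequent training tag, ignoring context."""
    return [counts[t].most_common(1)[0][0] if t in counts else unknown
            for t in tokens]

counts = train(TRAINING)
# In "o tlo ba" the stem "ba" is copulative, but the baseline cannot see that:
# it returns the majority tag for "ba" whatever the neighbouring tokens are.
print(tag(["o", "tlo", "ba"], counts))
```

Because the lookup never consults neighbouring tokens, every occurrence of ba receives the same tag, which is exactly the failure mode the slide describes.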

o tlo ba as a case in point
Excerpt of the overview of all constellations
  no.  copulative   motion      tense   mood        polarity  pers/class
  1    identifying  inchoative  future  indicative  positive  2nd pers.sg
  2    identifying  inchoative  future  situative   positive  2nd pers.sg
  3    descriptive  inchoative  future  indicative  positive  2nd pers.sg
  4    descriptive  inchoative  future  indicative  positive  class 01
  5    descriptive  inchoative  future  indicative  positive  class 03
Cases 1-2: underspecification: indicative vs. situative
Cases 3-5: homography: o for 2nd person sg / classes 01 and 03
Cases 1-2/3-5: underspecification: identifying/descriptive
o tlo ba may also precede a verb stem as part of a transitive future tense verb, where ba stands for an omitted or moved object noun of class 2
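
The relation between constellations and types (surface forms) can be made concrete by grouping table rows by their form. The Python sketch below uses only the five rows of the excerpt; the function name group_by_form and the tuple layout are hypothetical, and the full table of the study is not reproduced here.

```python
from collections import defaultdict

# The five rows of the excerpt; every row has the surface form "o tlo ba".
# Layout (an assumption): form, copulative, mood, person/class.
ROWS = [
    ("o tlo ba", "identifying", "indicative", "2nd pers.sg"),
    ("o tlo ba", "identifying", "situative", "2nd pers.sg"),
    ("o tlo ba", "descriptive", "indicative", "2nd pers.sg"),
    ("o tlo ba", "descriptive", "indicative", "class 01"),
    ("o tlo ba", "descriptive", "indicative", "class 03"),
]

def group_by_form(rows):
    """Map each surface form (type) to the constellations it realizes."""
    groups = defaultdict(list)
    for form, *features in rows:
        groups[form].append(tuple(features))
    return dict(groups)

groups = group_by_form(ROWS)
n_constellations = sum(len(v) for v in groups.values())
n_types = len(groups)
# Forms that realize more than one constellation are homographs.
homographs = [form for form, v in groups.items() if len(v) > 1]
print(n_constellations, n_types, homographs)
```

Applied to a complete table, the same grouping would yield the constellation and type counts and the list of homographous forms.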

Task definition
Complements should be identified: Identifying typical complements or complement types might help to differentiate not only verbs from copulative constellations, but underspecified constellations, too.
Subjects should be identified: Identifying a copulative's subject will help to avoid disambiguation problems caused by homography (concordial agreement).

Nominal complements: Definition of a few constellations
A typical noun chunk may consist of a noun alone
This noun might be accompanied by
  a demonstrative (possibly followed by an adjective)
  an emphatic pronoun
  a quantitative pronoun
  a possessive concord followed by a possessive pronoun or another noun chunk
Each of the accompanying units or unit groups might appear alone as well ...
N.B. This is not an exhaustive description; see e.g. Faaß (2010:175 et seq.) for a (hopefully) complete overview
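
The chunk definition above can be sketched as a small matcher over a sequence of part-of-speech tags. This is an illustration in Python, not the CWB macro used in the study. The tag names (N, DEM, ADJ, PROEMP, PROQUANT, CPOSS, PROPOSS) are hypothetical, and the sketch assumes that a noun takes at most one accompanying unit.

```python
def match_modifier(tags, i):
    """Return the index just past one accompanying unit at i, or None."""
    if i >= len(tags):
        return None
    t = tags[i]
    if t == "DEM":
        # A demonstrative, possibly followed by an adjective.
        return i + 2 if i + 1 < len(tags) and tags[i + 1] == "ADJ" else i + 1
    if t in ("PROEMP", "PROQUANT"):
        # An emphatic or a quantitative pronoun.
        return i + 1
    if t == "CPOSS" and i + 1 < len(tags):
        # A possessive concord followed by a possessive pronoun
        # or by another noun chunk (recursive case).
        if tags[i + 1] == "PROPOSS":
            return i + 2
        return match_noun_chunk(tags, i + 1)
    return None

def match_noun_chunk(tags, i=0):
    """Return the index just past a noun chunk starting at i, or None."""
    if i >= len(tags):
        return None
    if tags[i] == "N":
        end = i + 1
        after_modifier = match_modifier(tags, end)
        return after_modifier if after_modifier is not None else end
    # Accompanying units may also appear alone.
    return match_modifier(tags, i)

print(match_noun_chunk(["N", "DEM", "ADJ"]))
print(match_noun_chunk(["V", "N"]))
```

The function returns an end index rather than a boolean so that a caller can check what follows the chunk, for example when a copula must be both preceded and followed by a noun chunk.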

Corpus Queries
Method: Steps of the search procedure
One macro for all:
  nominal complements: constants (defined by their parts of speech)
  copula: variables (defined as tokens)
Execute the macro (= run the query) with each of the 929 copula types in CWB (making use of the Perl interface)
Extend the table of constellations with the frequencies of occurrence found in the corpus
Generate a table containing all matches found in the corpus (copula and complement)
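
The search procedure can be sketched in plain Python on a toy tagged corpus. This is not CWB or its Perl interface, only an illustration of the loop "one query per copula type, then record frequency and matches". The corpus, the tag names, the three copula types and the one-noun complement test are all assumptions; the tokens are borrowed from examples in this presentation and are not meant to form well-formed sentences.

```python
# Toy POS-tagged corpus as (token, tag) pairs; illustrative only.
CORPUS = [
    ("o", "CS"), ("tlo", "FUT"), ("ba", "VCOP"), ("ntlô", "N"), (".", "$."),
    ("o", "CS"), ("tlo", "FUT"), ("ba", "CO"), ("šoma", "V"), (".", "$."),
    ("ke", "CS"), ("na", "VCOP"), ("le", "PART"), ("tšhelete", "N"), (".", "$."),
]

# Stands in for the 929 copula types of the study (variables of the macro).
COPULA_TYPES = ["o tlo ba", "ke na le", "re na le"]

def is_complement(tag):
    """Simplified 'constant' part of the macro: a noun chunk is one noun."""
    return tag == "N"

def run_query(copula, corpus):
    """Return all (copula, complement) matches for one copula type."""
    words = copula.split()
    n = len(words)
    matches = []
    for i in range(len(corpus) - n):
        window = [tok for tok, _ in corpus[i:i + n]]
        next_tok, next_tag = corpus[i + n]
        if window == words and is_complement(next_tag):
            matches.append((copula, next_tok))
    return matches

frequencies = {}   # extends the table of constellations
all_matches = []   # table of matches: copula and complement
for copula in COPULA_TYPES:
    found = run_query(copula, CORPUS)
    frequencies[copula] = len(found)
    all_matches.extend(found)

print(frequencies)
print(all_matches)
```

Keeping the complement test in its own function mirrors the split between the constant part (complement definition) and the variable part (copula tokens) described above.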

Results: a first attempt
Is it a copulative at all?
We randomly chose 200 constellations found by the tool. Results:
  187 were correctly identified as copulatives
  13 incorrect: corpus errors, annotation problems, minor macro errors
Our general complement definitions (noun chunks) are correct!
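
The sample result corresponds to a precision of 187/200. The Python sketch below computes it and, as an addition not found in the presentation, a Wilson score interval to indicate how much a sample of 200 can vary; the function name and the choice of z = 1.96 (roughly a 95% level) are assumptions.

```python
from math import sqrt

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

correct, sample = 187, 200      # figures from the slide
precision = correct / sample    # share of true copulatives in the sample
low, high = wilson_interval(correct, sample)
print(f"precision = {precision:.3f}, interval = ({low:.3f}, {high:.3f})")
```

Note that this measures precision only: the sample was drawn from what the tool found, so it says nothing about copulatives the queries missed.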

Results: associative copulative
1,203 constellations (599 types) found with a nominal chunk as a complement
Homography of single items (e.g. a, o, etc.)
Underspecification (e.g. indicative/situative constellations)
However, the forms do not occur in the other types of copulatives (identifying/descriptive)

Results for o tlo ba
Frequency of occurrence in total: 364
Frequency of occurrence followed by a noun chunk (as described above): 40
Frequency of occurrence followed and preceded by a noun chunk: 33
Manual inspection of the 33 sentences:
  17 identifying copulatives
  11 descriptive copulatives with complements of a specific type (see next slide)
  Preceding noun chunk is usually not the subject → we need a grammar
  Following noun chunk is usually the object, and the descriptives seem to be distinguishable from the identifying by their morphosyntactic properties
  7 others: problems of corpus preparation/cases where semantics are not consistent with morphological features

Finding typical complements: Ongoing work
Identify descriptive copulatives
When inspecting overall results, typical complements were identified, e.g.
  nouns with a locative ending (see locational constellations, cf. Poulos and Louwrens (1994) above)
  nouns, pronouns and demonstratives of class 14
All the other homographous constellations found are currently assumed to be of an identifying character (verification outstanding)

Frequencies of occurrences
  # of occ.     constellations  types
  0             618             350
  1-99          1,212           477
  100-999       210             81
  1,000-4,999   69              18
  > 5,000       7               3
  sums          2,116           929
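
The column sums in this table can be checked, and the share of unattested forms derived, with a few lines of Python. The figures are copied from the table; the variable names are hypothetical and the derived shares are not stated in the presentation.

```python
# Frequency bands copied from the table: band -> (constellations, types).
BANDS = {
    "0": (618, 350),
    "1-99": (1212, 477),
    "100-999": (210, 81),
    "1,000-4,999": (69, 18),
    "> 5,000": (7, 3),
}

total_constellations = sum(c for c, _ in BANDS.values())
total_types = sum(t for _, t in BANDS.values())

# Share of constellations and of types that never occur in the corpus.
unattested_constellations = BANDS["0"][0] / total_constellations
unattested_types = BANDS["0"][1] / total_types

print(total_constellations, total_types)
print(f"{unattested_constellations:.1%} of constellations, "
      f"{unattested_types:.1%} of types are unattested")
```

The totals agree with the 2,116 constellations and 929 types reported earlier for the table of constellations.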

Overall results: Still tentative
Associative forms can be differentiated from the other copulatives easily, and 216 such constellations are not homographous at all
Differentiation between identifying and descriptive copulatives might be possible by complement definition of the descriptive forms (verification outstanding)
Outstanding: Differentiation between situative and identifying copulatives
However, for lexicographic reception purposes: distinguishing these constellations is not necessary for translation; it is rather worth a linguistic study

Future work
Add potential forms of the copulatives to our table, make it an accessible database
Examine the constellations not found in the corpus: too rarely used for complexity reasons or just described by linguists to fill the paradigm?
From underspecification to specification
Write a little grammar so that the homographs can be disambiguated at least partially
For lexicography: If typical complements are known, we can provide typical examples for text production
General task: Work towards a cleaner corpus

References
De Schryver and Prinsloo (2000). G-M. De Schryver and D.J. Prinsloo. 2000. The compilation of electronic corpora with special reference to the African languages. Southern African Linguistics and Applied Language Studies (SALALS) 18(1-4):89-106.
Faaß et al. (2009). G. Faaß, U. Heid, E. Taljard, and D.J. Prinsloo. 2009. Part-of-Speech tagging in Northern Sotho: disambiguating polysemous function words. In Proceedings of the EACL 2009 Workshop on Language Technologies for African Languages (AfLaT 2009), 38-45. The 12th Conference of the European Chapter of the Association for Computational Linguistics, March 30 to April 3, 2009, Athens.
Lombard (1985). D.P. Lombard. 1985. Introduction to the grammar of Northern Sotho. Pretoria: J.L. van Schaik.
Louwrens (1991). L.J. Louwrens. 1991. Aspects of the Northern Sotho Grammar. Pretoria: Via Afrika.
Poulos and Louwrens (1994). G. Poulos and L.J. Louwrens. 1994. A Linguistic Analysis of Northern Sotho. Pretoria: Via Afrika.
Prinsloo (2002). D.J. Prinsloo. 2002. The Lemmatization of Copulatives in Northern Sotho. Lexikos 12:21-43. Stellenbosch: Buro van die WAT.
Taljard et al. (2008). E. Taljard, G. Faaß, U. Heid, and D.J. Prinsloo. 2008. On the development of a tagset for Northern Sotho with special reference to the issue of standardization. Literator, special edition on Human Language Technology, 29(1):111-137.
Van Wyk et al. (1992). E.B. Van Wyk, P.S. Groenewald, D.J. Prinsloo, J.H.M. Kock, and E. Taljard. 1992. Northern Sotho for first years. Pretoria: J.L. van Schaik.
Ziervogel (1988). D. Ziervogel. 1988. A Handbook of the Northern Sotho Language. 3rd edition. Pretoria: J.L. van Schaik.

Link list
Scientific e-lexicography for Africa (SeLA): http://www.uni-hildesheim.de/iwist-cl/projects/sela/
Permanent links from Wikipedia:
  (1) tense: http://en.wikipedia.org/w/index.php?title=grammatical_tense&oldid=555002677
  (2) aspect: http://en.wikipedia.org/w/index.php?title=grammatical_aspect&oldid=551547626
  (3) mood: http://en.wikipedia.org/w/index.php?title=grammatical_mood&oldid=542114879
CentLex: http://bcom.au.dk/research/academicareas/centreforlexicography/research/
Corpus WorkBench: http://sourceforge.net/projects/cwb/?source=directory