Tamil Hyphenator

P. David Prabhakar
Department of Tamil, Madras Christian College, Chennai-600 059, India
E-mail: davidprabhakar@rediffmail.com

Abstract

Hyphenation is an integral part of text processing for any language. It splits a word into sub-strings and is used to break up words at the end of a text line. Hyphenation principles differ from language to language. Some of these differences are related to differences in the nature of syllables; others arise from the unifying nature of compound words. This paper analyzes the existing principles of manual hyphenation in Tamil and suggests a model for developing a Tamil hyphenator that can be used by word processors. Although modern Tamil prose has more than 400 years of history, many of the existing hyphenation procedures are based on poetry, which has acai and ciir as sub-strings in a line (ati). It is believed that hyphenation should be meaningful and parallel to pronunciation. As Tamil is inflectionally rich, a hyphenated dictionary is not a viable solution; building a Tamil hyphenator is the ideal solution. In this paper, an attempt is made to formulate hyphenation rules based on the acai patterns (neer and nirai), and further morphological processing is recommended to incorporate concepts such as sandhi rules and the structure of compound forms.

Introduction

Hyphenation is an integral part of text processing for any language. A hyphenation rule declares the allowed hyphenation points of words, separated by spaces, with each hyphenation point indicated by a - character; these points are used to break up words at the end of a text line. The Greek word huphen literally means 'under one'. It is derived from hupo 'under' and hen, the neuter of heis 'one'. Hyphenation principles differ from language to language. Some of these differences are related to differences in the nature of syllables; others arise from the unifying nature of compound words.
This paper exemplifies the existing principles of manual hyphenation in Tamil and suggests methods and rules for developing a Tamil hyphenator that can be used in word processors.

The hyphen exists to help fulfill the true purpose of the printed page: readability. In composing a column of justified type, it is sometimes necessary to divide a word so as to align the right-hand margin without allowing either too much or too little white space between the words. The hyphenator suggests the hyphen positions, so that it can be used with other applications which adjust the inter-word spacing.

Tamil Internet 2004, Singapore 151
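How an application might consume the suggested hyphen positions can be sketched as follows. This is a minimal illustration, not part of the paper: the function name break_word is hypothetical, the word is given in the Latin transliteration used in this paper, and width is counted in transliterated letters, which only approximates the width of Tamil glyphs.

```python
def break_word(hyphenated, room):
    """Break a word at the last marked hyphenation point that fits.

    `hyphenated` marks each allowed point with "-" (as in the rule
    format described above); `room` is the space left on the line,
    counted in letters. Returns (head_with_hyphen, tail). The head is
    "" when no marked point fits and the word must move to the next line.
    """
    parts = hyphenated.split("-")
    best = 0
    for i in range(1, len(parts)):
        head = "".join(parts[:i])
        if len(head) + 1 <= room:  # +1 for the visible hyphen itself
            best = i
    if best == 0:
        return "", "".join(parts)
    return "".join(parts[:best]) + "-", "".join(parts[best:])


print(break_word("maari-vitu-kin-rana", 10))  # ('maarivitu-', 'kinrana')
print(break_word("maari-vitu-kin-rana", 5))   # ('', 'maarivitukinrana')
```

The hyphenator only proposes positions; choosing among them is left to the line-breaking code, which is why the two concerns are kept apart here.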

Objectives

The aim of this paper is twofold:
1. To understand the existing hyphenation practice.
2. To suggest methods for Tamil hyphenation.

Conventional hyphenation principles in Tamil

In the first books printed in Tamil (16th century A.D.), Tamil was written continuously, without word spaces, punctuation or word divisions. Similarly, word boundaries are not marked in Tamil inscriptions, copper plates, palm leaves, bond paper and diaries. It is interesting to note that early monks (5th century A.D.) likewise wrote their Latin without any space between words and without a formal procedure for word division. Later, we find inter-word and intra-word spacing that depends on the intuition of the printer or typist.

At least two principles can be traced in Tamil hyphenation practice:
1. Single characters are avoided at the end of a text line and at the initial position of a text line (e.g. po, ka).
2. Dotted consonants are avoided in the initial position of a text line (e.g. p, k, t).

One can assume that these two principles exist due to the influence of Tamil prosody, because:
1. Monosyllabic words do not exist in Tamil prosody.
2. A single character has acai status only in very few occurrences (e.g. when a CV pattern occurs as the last syllable in a ciir).
3. Dotted consonants (consonants without a vowel) do not occur in the initial position of a word, and they have no value in deciding the acai type.

Along with the existing hyphenation principles, we need to elaborate and formulate acceptable hyphenation principles for computational implementation.

Basis of Tamil hyphenation

There are dictionaries, such as the Oxford dictionary of spelling (1986), which also show word breaks. The Chicago Manual of Style contains a huge chart listing various sorts of phrases that are or are not to be hyphenated. We have no such dictionaries in Tamil. Tamil Style Manual (2001), published by Mozhi Trust, and A Handbook for Journalists, published by Si. Paa. Adhithanar, are notable style manuals in Tamil. Neither makes any suggestion on hyphenation.
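The two conventional principles listed under "Conventional hyphenation principles in Tamil" can be expressed as a simple check on a candidate break. The sketch below is an illustration under stated assumptions, not the paper's implementation: it works on Tamil text in Unicode, treats a "character" as a base letter plus any combining marks that follow it, and recognises a dotted consonant by the pulli (U+0BCD). The names characters and break_allowed are hypothetical.

```python
import unicodedata

PULLI = "\u0BCD"  # Tamil virama; marks a dotted (vowel-less) consonant


def characters(word):
    """Group code points into written characters: a base letter plus its marks."""
    out = []
    for ch in word:
        if out and unicodedata.category(ch).startswith("M"):
            out[-1] += ch  # vowel sign or pulli attaches to the previous letter
        else:
            out.append(ch)
    return out


def break_allowed(word, k):
    """True if breaking `word` after its k-th character obeys both principles."""
    chars = characters(word)
    head, tail = chars[:k], chars[k:]
    if len(head) < 2 or len(tail) < 2:
        return False  # principle 1: no single character at line end or line start
    if tail[0].endswith(PULLI):
        return False  # principle 2: no dotted consonant at the start of a line
    return True


word = "பள்ளிக்கூடம்"  # paLLikkuuTam, seven written characters
print([k for k in range(1, 7) if break_allowed(word, k)])  # [2, 4, 5]
```

The check only filters candidate positions; it does not say which of the surviving positions is prosodically preferable.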

The lexicographer's approach to hyphenation is not suitable for Tamil. It is possible to store on a hard disk a dictionary with large lists of words in which every approved hyphenation point is marked. But such a dictionary requires more memory capacity than most DTP publishers require for a total system. As Tamil is an inflectionally rich language, a single verb stem can be inflected into more than 2000 different forms by affixing various auxiliaries, tense markers, PNG (person-number-gender) markers, clitics etc. Similarly, noun forms can be generated by affixing the plural marker, caariyai, case markers, affixes, emphatic markers, clitics etc. Both verb and noun constructions can be split meaningfully so as to meet the needs of hyphenation. Prabhakar (2001) has developed an algorithm to split and label the constituents of verb forms. A pattern for the description of noun inflections is found in Kothandaraman (1981). These two references can be used in building the hyphenator whenever morphological analysis is needed.

Hence a dictionary-based solution is not viable for Tamil hyphenation. In fact, no dictionary offers a total solution for any language, and hyphenation rules are just guidelines. We may think of hyphenation as pertaining only to grammar, but it operates on an aesthetic level, too.

An appropriate method can be suggested for Tamil hyphenation. It has two levels:
1. Acai processing
2. Morphological processing

Before we go into the details of the above, let us look at a few observations regarding hyphenation. Hyphen placement varies within a language. The lexicographer's approach may differ from the typographer's. American practice may differ from British practice: US usage breaks words according to pronunciation and so favours syllabic breaks, while UK usage breaks them by etymology and so favours morphological breaks. According to Prof. Walter W. Skeat, editor of Oxford's etymological dictionary, nothing is gained by pretending to keep the root intact when the spoken utterance does nothing of the kind. His advice was to divide all words by sound and pronunciation, but not to expect to establish rigid and invariable rules (as quoted in McIntosh & Fawthrop, 2000).

Based on these observations, we suggest a syllable-based hyphenation for Tamil. If needed, morphological word divisions can be taken into account. A sample of 1500 words (continuous text from a Tamil daily) was taken for this study, and hyphenation positions were marked based on the acai patterns. The result shows that 93-94% of the word splits agree with manual hyphenation. Further processing with a sub-set of rules handles another 2-3% of the words properly. Morphological processing appears to be necessary for the remaining words.
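The comparison reported above amounts to counting the words whose rule-based split matches the manual one. On a 1500-word sample, 93-94% corresponds to roughly 1395 to 1410 words. A minimal sketch of such a measurement follows; the function name agreement and the two tiny word lists are hypothetical and serve only to show the calculation, not to reproduce the study's data.

```python
def agreement(rule_splits, manual_splits):
    """Percentage of words whose rule-based split equals the manual split.

    Both arguments are lists of the same words with "-" at each
    hyphenation point, in the same order.
    """
    if not manual_splits or len(rule_splits) != len(manual_splits):
        raise ValueError("need two non-empty lists of equal length")
    same = sum(r == m for r, m in zip(rule_splits, manual_splits))
    return 100.0 * same / len(manual_splits)


# Toy input: the second manual split is invented to show a mismatch.
rule = ["maari-vitu-kin-rana", "veentu-menru"]
manual = ["maari-vitu-kin-rana", "veen-tumenru"]
print(agreement(rule, manual))  # 50.0
```

Exact string equality is a strict criterion: a word counts as a mismatch even if only one of its hyphenation points differs.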

Rules for Tamil hyphenation

1. Avoid divisions wherever possible.
2. Single characters at the beginning or end of a word do not improve the readability of the text and hence should be avoided.
3. Dotted consonants at the beginning of any split should be avoided, as in conventional practice. A word which has 5 characters with 3 consecutive consonants cannot be split.
4. Hyphenation positions have to be marked based on acai patterns. When a single character gets acai status at the initial position of a word, the succeeding acai should be added to it.
5. Morphological processing can be admitted for further processing when the sub-set of rules fails to overcome the difficulties in the acai-based splits.

Acai processing

Acai processing is a kind of pattern matching. Its task is to recognize all possible character sequences which occur within a word. Acai in Tamil prosody is different from the syllable. Kothandaraman (1976:58) says that acai is an intermediary node between a foot and a syllable. This paper suggests that Tamil hyphenation can be handled on the basis of the acai patterns found in Tamil prosody. In Tamil, there are two types of acai:
1. neer acai
2. nirai acai

Both forms of acai can be predicted on the basis of patterns. The neer and nirai patterns are given below, where C is a consonant, V a short vowel, V: a long vowel, and parentheses mark optional elements. Note that a nirai pattern has two vowels while a neer pattern has one.
1. nirai patterns: (C)VCV:(C) and (C)VCV(C)
2. neer patterns: (C)V:(C) and (C)V(C)

Sub rules to acai processing

The following sub rules can be deployed simultaneously.
1. A character composed of a consonant and a short u can be added to the preceding acai, e.g. veentu/menru. These are accepted patterns in Tamil prosody as neerpu and niraipu.

2. The semi vowels y or v can be added to the following acai, e.g. maari/vitu/kin/rana.

Morphological processing

Morphological processing is necessary in a very few cases, where certain affixes are added to a noun or verb. These affixes can be shortlisted and the splits made accordingly.

To conclude, in many respects it seems more convincing to regard the acai pattern as the basis for Tamil hyphenation.

Bibliography

Prabhakar, David. 2001. Analysis of Tamil verbs: Computational approach. Unpublished PhD thesis, Chennai: University of Madras.
-----. 2002. Hyphenation in Tamil, 4th International Conference on South Asian Languages, Annamalai Nagar: Annamalai University.
Kothandaraman, Pon. 1976. Modern Studies in Tamil. Chennai: Tamil Nulagam.
-----. 1981. IlakkaNa ulakil putiya paarvai (Vol. 2). Chennai: Tamil Nulagam.
McIntosh, Ronald and Fawthrop, David. 2000. A discussion of the changing principles of word division, now implemented by computers, in British and American English, with notes on hyphenation of 39 other languages. Computer Hyphenation Ltd, Halifax, UK (electronic version).
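As an illustration of the acai processing and sub rules described in this paper, the following sketch marks hyphenation points in a transliterated word. It is one possible reading of the rules, not the author's implementation. The assumptions are: input uses the transliteration of this paper (a doubled vowel is long, every non-vowel letter is one consonant); ai and au are treated as long vowels; sub rule 1 is tried before the nirai pattern; a piece consisting of a single character is joined to its neighbour, following Rule 2; the five-character clause of Rule 3 and all morphological processing are left out. The names syllables and acai_split are hypothetical.

```python
VOWELS = "aeiou"
SEMIVOWELS = ("y", "v")


def syllables(word):
    """Split a transliterated word into [onset, vowel, coda] syllables."""
    tokens, i = [], 0
    while i < len(word):
        ch, nxt = word[i], word[i + 1:i + 2]
        if ch in VOWELS and (nxt == ch or ch + nxt in ("ai", "au")):
            tokens.append(ch + nxt)  # long vowel (or ai/au, by assumption)
            i += 2
        else:
            tokens.append(ch)
            i += 1
    syls, onset = [], ""
    for k, tok in enumerate(tokens):
        before_vowel = k + 1 < len(tokens) and tokens[k + 1][0] in VOWELS
        if tok[0] in VOWELS:
            syls.append([onset, tok, ""])
            onset = ""
        elif syls and not before_vowel:
            syls[-1][2] += tok  # dotted consonant closes the previous syllable
        else:
            onset += tok        # consonant directly before a vowel opens one
    return syls


def acai_split(word):
    """Return the hyphenation pieces of `word` based on neer/nirai patterns."""
    syls = syllables(word)
    pieces, i = [], 0           # each piece is [text, number_of_characters]
    while i < len(syls):
        onset, vowel, coda = syls[i]
        text, chars = onset + vowel + coda, 1 + len(coda)
        short_open = len(vowel) == 1 and not coda
        if pieces and short_open and onset and vowel == "u":
            # Sub rule 1: consonant + short u joins the preceding acai
            pieces[-1][0] += text
            pieces[-1][1] += chars
            i += 1
            continue
        if short_open and i + 1 < len(syls):
            nxt = syls[i + 1]
            # nirai (C)VCV(:)(C); sub rule 2 keeps y/v with the following acai
            if nxt[0] and nxt[0] not in SEMIVOWELS:
                text += "".join(nxt)
                chars += 1 + len(nxt[2])
                i += 1
        pieces.append([text, chars])
        i += 1
    merged = []
    for text, chars in pieces:  # Rule 2: no single-character pieces
        if merged and chars == 1:
            merged[-1][0] += text
            merged[-1][1] += chars
        else:
            merged.append([text, chars])
    if len(merged) > 1 and merged[0][1] == 1:
        merged[1][0] = merged[0][0] + merged[1][0]
        merged.pop(0)
    return [text for text, _ in merged]


print("/".join(acai_split("maarivitukinrana")))  # maari/vitu/kin/rana
print("/".join(acai_split("veentumenru")))       # veentu/menru
```

Because a dotted consonant is always attached to the syllable before it, no piece can begin with a dotted consonant, which satisfies the first part of Rule 3 without a separate check. Words for which these splits are unsatisfactory would be handed to the morphological processing level.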