TO: Unicode Technical Committee
FROM: Deborah Anderson, SEI, UC Berkeley
DATE: 5 August 2009
RE: On the proposed U+A78F LATIN LETTER MIDDLE DOT (L2/09-031R = N3567)

1. Background

In L2/09-031R (= N3567), Andrew West proposed a new Latin letter middle dot for the transliteration of PHAGS-PA LETTER SMALL A and to represent the glottal stop in Gong Hwang-cherng's phonetic representation of Tangut. The proposal author also mentioned that the character is quite probably used in the transliteration and/or transcription of other languages. Justification for a new middle dot was based on the fact that the existing middle dots are either marks of punctuation or are script-specific, and the following middle dots were cited:

00B7  MIDDLE DOT                           Po  Common
1427  CANADIAN SYLLABICS FINAL MIDDLE DOT  Lo  Canadian Aboriginal
30FB  KATAKANA MIDDLE DOT                  Po  Common
FF65  HALFWIDTH KATAKANA MIDDLE DOT        Po  Common

At the Dublin WG2 meeting in April 2009, a concern was raised about the use of U+00B7 MIDDLE DOT for transliteration, because search engines such as Google are not able to find U+00B7. In order to understand the problem more fully, I investigated the use of the middle dot in other languages and inquired into the search issue, and present the findings in this document.

2. Use of Middle Dot

While the use of the middle dot was raised within the context of transliteration (or transcription) of East Asian languages in L2/09-031R (N3567), the character has long been used by Americanists. Franz Boas already wrote in 1911 (Handbook of American Indian Languages, page 7): the raised period [=middle dot] is the recommended length mark to be used after a symbol to indicate that the sound is long. The colon was designated as marking excessive length, longer than that represented by <·>. For languages which make a contrast between only two degrees of length, either the colon or the raised period may be used as the length mark. Where both occur, the colon represents greater length.
Examples of its use to designate length are given in figures 1-5 for Munsee, Central Sierra Miwok, Unami, Proto-Takelman, Tonkawa and Algonkian. Hence, the middle dot is not restricted to East Asian use, and any new middle dot character will have an impact on materials for other languages, where the middle dot has been used at least since the early twentieth century for transliteration and transcription, as well as in technical and practical orthographies. The middle dot has appeared and continues to appear in dictionaries, text corpora, and teaching materials for languages of the Americas.

For online documents, two characters have been identified as being used by Americanists: U+00B7 MIDDLE DOT (see the Munsee and Unami references cited in the examples) and U+02D1 MODIFIER LETTER HALF TRIANGULAR COLON (see the Klamath example in section 4 and the Severn Ojibwa example in figure 6).

3. Search Issues for U+00B7 MIDDLE DOT

The character U+00B7 MIDDLE DOT, which has the general category Po, can be found in searches with Google, but the results are not consistent. For example, the Catalan word anul·lar (with U+00B7) is found in the search results at: http://www.google.com/search?q=anul%c2%b7lar. However, the same results occur with "anul lar" (with a space between the l's) in: http://www.google.com/search?hl=en&q=anul+lar.

In order to get the proper result in Google searches, the middle dot needs to be treated as a letter when it occurs between letters, but as a mark of punctuation when it is isolated. Mark Davis has filed a bug at Google on this, since the same treatment occurs for other marks as well, including ʹ. (For anyone looking for an exact match, however, such a correction will solve the problem only as long as the middle dot is not next to a mark of punctuation.) According to Peter Constable, Microsoft products also treat U+00B7 as punctuation, and most applications ignore it.

4. Options

Encoding another middle dot would leave the naïve user to pick from at least 23 middle dots (L2/07-258). As an alternative, two characters are currently in use:

a. U+00B7 MIDDLE DOT
While this generic middle dot is commonly found in most fonts, it can be ignored by search engines, as noted above. However, future search engines may provide better support for finding U+00B7 when it occurs between two letters, and hence improve search results. For line breaking, U+00B7 is AI which, according to normal UAX #14 behavior, will be treated exactly as AL.

b. U+02D1 MODIFIER LETTER HALF TRIANGULAR COLON
The character U+02D1 has gc=Lm, its script is Common, and it has the line-breaking property AL; as a result, it provides the properties being requested for the proposed LATIN LETTER MIDDLE DOT. U+02D1 was already part of Unicode 1.1, so it has well-established usage.
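The general categories cited in this document can be checked against a copy of the Unicode Character Database. The following is a minimal sketch using Python's standard unicodedata module; the helper name describe and the selection of code points are illustrative assumptions, and the module reports names and general categories only, not script or line-breaking class.

```python
import unicodedata

# Code points discussed in this document (the selection is illustrative).
CANDIDATES = [0x00B7, 0x02D1, 0x1427, 0x30FB, 0xFF65]


def describe(code_point):
    """Return (U+XXXX label, character name, general category)."""
    ch = chr(code_point)
    label = f"U+{code_point:04X}"
    return (label, unicodedata.name(ch), unicodedata.category(ch))


for cp in CANDIDATES:
    label, name, gc = describe(cp)
    print(f"{label}  {gc}  {name}")
```

Under this sketch, U+00B7 is reported as Po (punctuation) and U+02D1 as Lm (modifier letter), which is the distinction the two options turn on.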

When doing a Google search for the combination of lowercase i and U+02D1 (iˑ), the resulting search page can display a circular dot for U+02D1.

[Image: Google search result showing a circular dot for U+02D1]

For this example, the page it links to has the triangular shape for U+02D1:

[Image: excerpt showing the triangular glyph]
(Source: http://ksw.shoin.ac.jp/~spaelti/klamath/files_finished/kd_introduction.pdf)

Note that the original text, taken from M.A.R. Barker's Klamath Dictionary (Berkeley and Los Angeles, 1963), had the raised circular dot:

[Image: excerpt from the printed dictionary showing the raised circular dot]
(Source: http://ksw.shoin.ac.jp/~spaelti/klamath/files_img/kd_introduction.pdf)

The difference between the triangular and the circular shape was already blurred in Americanist usage; cf. the following summary drawn from Pullum and Ladusaw's Phonetic Symbol Guide (2nd ed., Chicago, 1986, pp. 244-5, 248-9), according to which Americanists tend to use a raised period or the colon rather than the triangular half-length mark and length mark.

CHARACTER               IPA USE                                 AMERICANIST USE
Raised Period           Not used, but often used as a           Preceding segment is long. Boas: preceding
U+?                     typographic substitute for the          sound is long; use colon <:> to show extra
                        half-length mark.                       long.
Half-Length Mark ˑ      Preceding syllable is half-long.        Same as IPA use, but employ <·>.
U+02D1
Colon :                 Not used, but often used as a           Marks length of segment. Boas: marks
U+003A                  typographic substitute for the          excessive length, or in a 2-way contrast,
                        length mark.                            same as <·>.
Length Mark ː           Preceding letter is long.               Same as IPA use; colon generally
U+02D0                                                          substituted.
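The practical effect of the Po versus Lm distinction on searching can be illustrated with a simple word tokenizer. This is only a sketch: it uses the \w class of Python's re module, which follows Unicode letter and digit categories, as a stand-in for a search tokenizer. It does not describe how Google or any other product actually tokenizes text, and the second sample string is a made-up form.

```python
import re


def tokens(text):
    """Split text into runs of Unicode word characters (letters, digits, underscore)."""
    return re.findall(r"\w+", text)


# U+00B7 is punctuation (Po), so the Catalan word is split in two.
print(tokens("anul\u00b7lar"))

# U+02D1 is a modifier letter (Lm), so a form containing it stays whole.
# "a\u02d1ni" is a hypothetical form used only for illustration.
print(tokens("a\u02d1ni"))
```

A tokenizer of this kind behaves like the search for "anul lar" described in section 3 when it meets U+00B7, whereas U+02D1 needs no special handling.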

5. Summary

The encoding of another middle dot for Phags-pa is unnecessary, particularly as the middle dot is already used widely in linguistic transcription/transliteration and Americanist orthographies, and seems to be encoded on modern webpages by U+00B7 or U+02D1. The result of encoding another middle dot will be to create yet another look-alike character. In my view, the best option for users is to use U+02D1 with a rounded glyph. This character is currently being used by linguists and others, can be found via search engines, and is found in both circular and triangular shapes.

6. Figures

Figure 1: Munsee
This shows the linguistic and the practical orthography used for Munsee, an endangered language that is a member of the Algic language family. The linguistic form uses a raised dot for the long vowel, but the practical orthography doubles the vowel to show length. This page uses U+00B7 for the raised dot.
Source: http://en.wikipedia.org/wiki/munsee_language

Figure 2: Central Sierra Miwok
Note the use of the raised dot in this Central Sierra Miwok text.
Source: Freeland, L.S., and Sylvia M. Broadbent, Central Sierra Miwok Dictionary with Texts (1960), p. 59. Accessed from: http://www.yosemite.ca.us/library/central_sierra_miwok_dictionary/page_59.html

Figure 3: Unami
Unami is a member of the Algic language family, related to Munsee. According to the Ethnologue, it is extinct. This figure shows the use of the raised dot to denote consonant and vowel length in the linguistic transcription. In this document, U+00B7 is used.
Source: http://en.wikipedia.org/wiki/delaware_languages

Figure 4: Proto-Takelman
In this figure for Proto-Takelman, the ancestor of Takelma (and a member of the Penutian language family), note the middle dot for vowel length in the column for Sapir's Phonemics and Research Orthography. The middle dot also appears in the first column, Sapir's Phonetics, under the last entry for a fortis sibilant [ts!].
Source: Shipley, William. Proto-Takelman. International Journal of American Linguistics, Vol. 35, No. 3 (Jul., 1969), pp. 226-230. Accessed from: http://www.jstor.org/stable/1264690

Figure 5: Tonkawa and Algonkian
This snippet shows a comparison of Proto-Central Algonkian (PCA) forms and words from Tonkawa, an extinct language from the Coahuiltecan family. The middle dot is used for the long vowels in both sets of forms. (The article is by Mary Haas, a student of Edward Sapir, who was himself mentored by Franz Boas; the use here and in figure 4 demonstrates the continuous use of the middle dot by generations of linguists.)
Source: Haas, Mary R. Tonkawa and Algonkian. Anthropological Linguistics, Vol. 1, No. 2, Genetic Relationship among Languages: A Symposium Presented at the 1958 Meetings of the American Anthropological Association (Feb., 1959), pp. 1-6. Accessed from: http://www.jstor.org/stable/30022178.

Figure 6: Severn Ojibwa
This example for the Severn Ojibwa language uses U+02D1 with the triangular glyph. The standard orthography renders /aa/ for /aˑ/, /oo/ for /oˑ/.
Source: http://lingweb.eva.mpg.de/numeral/ojibwa Severn.htm
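The correspondence between a length-marked transcription and a vowel-doubling practical orthography, as described for Munsee (figure 1) and Severn Ojibwa (figure 6), can be sketched as a simple substitution. This is a hypothetical illustration: the function name, the vowel set, and the sample strings are assumptions and are not taken from the actual rules of either orthography.

```python
import re

# Length marks seen in online Americanist materials: U+00B7 and U+02D1.
LENGTH_MARKS = "\u00b7\u02d1"
# Assumed vowel inventory, for illustration only.
VOWELS = "aeiou"

_LONG_VOWEL = re.compile(f"([{VOWELS}])[{LENGTH_MARKS}]")


def double_long_vowels(text):
    """Replace each vowel followed by a length mark with the doubled vowel."""
    return _LONG_VOWEL.sub(r"\1\1", text)


print(double_long_vowels("a\u02d1"))  # aa
print(double_long_vowels("o\u00b7"))  # oo
```

Accepting both U+00B7 and U+02D1 as input reflects the observation above that either character may appear in existing online materials.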