Khmer Sorting Analysis


Recent changes are in red.

Sorting scheme for Khmer

Note that page references in this document are typically to Chuon Nath's Khmer-Khmer Dictionary, Japanese Reprint Edition, which has Arabic page numbers at the bottom of the page.

Level 1 (Priority 1):

(Should Khmer numbers and signs precede the alphabet? Should U+17A3/U+17A4 precede the other letters of the alphabet?)

[U+1780..U+1793] The first 20 (of 33) Khmer consonants in the order they are encoded in Unicode: ក ខ គ ឃ ង ច ឆ ជ ឈ ញ ដ ឋ ឌ ឍ ណ ត ថ ទ ធ ន

[U+1794] The next one (of 33) Khmer consonants in the order they are encoded in Unicode: ប
It would probably be best to merge this and the next two entries under one heading, so that words with signs list immediately after words with identical spelling without those signs. Is that acceptable?

[U+1794 U+17C9] A variant of the 21st Khmer consonant with 'p' pronunciation comes next. This is evident when it is marked as ប៉; however, there are hundreds of words in this section whose only distinction from a simple ប is their derivation.

[U+1794 U+17CA] A variant of the 21st Khmer consonant comes next (happily this is always marked, as ប៊).

[U+1795..U+1799] An additional 5 (of 33) Khmer consonants in the order they are encoded in Unicode: ផ ព ភ ម យ

[U+179A] An additional 1 (of 33) Khmer consonants in the order it is encoded in Unicode: រ
It would probably be best to merge this and the next two entries under one heading (i.e., including ROBAT and the two independent vowels decomposed into រ and the appropriate dependent vowel). Is that acceptable?

[U+17CC] The ROBAT sign is treated for ordering purposes as an independent syllable, although inconsistently in the Chuon Nath dictionary (p. 465, 506, 538, 609, 750-1, 768, 1322, 1339-1340, 1633). Should this be entered in phonetic order (as everything else is; I believe that would be appropriate)? What is its writing order when entered by a learned monk? It seems to fill the role of a superscript consonant and is not written stand-alone. If it is sorted as indicated here and not entered in phonetic order, there will have to be some mechanism to reorder it in the ordering algorithm. This should probably be replaced by U+179A U+17D2.

[U+17AB..U+17AC] These two independent vowels (ឫ ឬ) are treated as consonants following U+179A, as they share a consonantal sound of 'r'.

[U+179B] The next one (of 33) Khmer consonants: ល
Should this and the following section be merged, with decomposition of the following into U+179B plus the appropriate vowel?

[U+17AD..U+17AE] These two independent vowels (ឭ ឮ) are treated as consonants following U+179B, as they share a consonantal sound of 'l'.

[U+179C] The next one (of 33) Khmer consonants: វ

[U+179D..U+179E] These two transliteration consonants (ឝ ឞ) are treated as consonants following U+179C. They resemble the following Khmer consonant U+179F, as they share a sound 's'. (Q: Are these two in the right order for sorting? Should they be integrated within the Khmer U+17DC for ordering purposes? None seem to be sorted in the Chuon Nath dictionary. Could we have examples of the characters they transliterate and the name of the script that character comes from? Have the glyphs and names been switched in Unicode?)

[U+179F..U+17A0] The next 2 (of 33) Khmer consonants: ស ហ

[U+17A1] The next 1 (of 33) Khmer consonants (this is separated because it is not available in a subscript form): ឡ

[U+17A2, U+17A3..U+17AA, U+17AF..U+17B3] These characters are merged under one consonant (U+17A2) by means of decomposition into a glottal stop and a dependent vowel. For there to be a deterministic system this decomposition must be standardised. The resulting system (hopefully) will also sort transliterated Sanskrit/Pali text (note that Pali dictionaries sort the independent vowels first, with separate sections for U+17A3 and U+17A4).

Character    Decomposition                    Note
U+17A2  អ    U+17A2
U+17A3  ឣ    U+17A2 (?)                       1
U+17A4  ឤ    U+17A2 + U+17B6 (?)
U+17A5  ឥ    U+17A2 + U+17B7
U+17A6  ឦ    U+17A2 + U+17B8
U+17A7  ឧ    U+17A2 + U+17BB                  2
U+17A8  ឨ    U+17A2 + U+17BB (+ U+1780)       3
U+17A9  ឩ    U+17A2 + U+17BC
U+17AA  ឪ    U+17A2 + U+17BC (+ U+179C)       4
U+17AF  ឯ    U+17A2 + U+17C2                  5
U+17B0  ឰ    U+17A2 + U+17C3
U+17B1  ឱ    U+17A2 + U+17C4
U+17B2  ឲ    U+17A2 + U+17C4                  6
U+17B3  ឳ    U+17A2 + U+17C5

Notes:

1 There is a weak differentiation between short initial inherent vowel words (presumably U+17A3) and long inherent vowel words (presumably U+17A2) in the final section of the Chuon Nath Khmer dictionary. There is some controversy over the significance of U+17A3 and U+17A4 in Unicode. The linguist committee in Phnom Penh felt that there needed to be a distinction between the final Khmer consonant U+17A2 and the two independent Sanskrit vowels U+17A3..U+17A4. It would be good to clarify this issue if the particular Pali/Sanskrit characters these are to represent could be shown.

2 There are good examples of the equality of U+17A2 and the first part of the decomposed independent vowel on pages 1808-1850 (Arabic) of the Japanese reprint of Chuon Nath's dictionary.

3 The final Khmer consonant sound does not affect the ordering of this extremely rare and obsolete independent vowel. There will be some need of differentiating U+17A7 and U+17A8, but only at a higher level of sorting. This is referenced at the top of p. 1852 and p. 1877 of Chuon Nath's dictionary.

4 The final consonant U+179C does not figure in the sorting order, and is presented only for an understanding of the roots of the character. By this analysis there would seem to be an inconsistency on pages 1851-1856, particularly with öööös... sh... sm... n!... n... tä. If the Chuon Nath precedent were followed in this case, it would seem to contradict the usage of decomposition for the other independent vowels, which seem to separate into U+17A2 + x.

5 Note that on p. 1860 the independent vowel in Chuon Nath's dictionary seems to have a secondary priority over the decomposition: ឯ, អែ.

6 There are only two words which require the use of this character: the very common w and the very rare U+17B2 U+1780 U+1789 U+17C0.
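The decomposition tabulated above can be sketched in code. This is a minimal illustration in Python, not part of the original analysis: the names DECOMPOSITION and decompose_for_sorting are hypothetical, the rows marked (?) are carried over as open questions, and the optional trailing consonants for U+17A8 and U+17AA are left out on the assumption (from notes 3 and 4) that they do not affect ordering.

```python
# Independent vowels folded into U+17A2 plus a dependent vowel,
# following the decomposition table in the analysis.
# Assumption: the optional "(+ U+1780)" and "(+ U+179C)" tails are dropped,
# since notes 3 and 4 say they do not figure in the sorting order.
DECOMPOSITION = {
    "\u17A3": "\u17A2",        # marked (?) in the table
    "\u17A4": "\u17A2\u17B6",  # marked (?) in the table
    "\u17A5": "\u17A2\u17B7",
    "\u17A6": "\u17A2\u17B8",
    "\u17A7": "\u17A2\u17BB",
    "\u17A8": "\u17A2\u17BB",
    "\u17A9": "\u17A2\u17BC",
    "\u17AA": "\u17A2\u17BC",
    "\u17AF": "\u17A2\u17C2",
    "\u17B0": "\u17A2\u17C3",
    "\u17B1": "\u17A2\u17C4",
    "\u17B2": "\u17A2\u17C4",
    "\u17B3": "\u17A2\u17C5",
}


def decompose_for_sorting(text: str) -> str:
    """Replace each independent vowel with its U+17A2-based decomposition."""
    return "".join(DECOMPOSITION.get(ch, ch) for ch in text)
```

Because several independent vowels share one decomposition (for example U+17B1 and U+17B2), a real collation would still need a later level to tell them apart, as note 3 suggests for U+17A7 and U+17A8.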

Level 1 (Priority 2):

U+17C9  ◌៉   p. 195, 626 (in conjunction with U+1794, higher level?), 1178
U+17CA  ◌៊   p. 715 (in conjunction with U+1794, higher level?), 1538-9, 1534-5

Level 2 (Priority 1):

First subscript. This should include all the characters in Level 1, with the (possible) exception of a subscript form of U+17A1 (ឡ), which reportedly does not exist. However, for sorting and display purposes it is assumed that any character in the range U+1780..U+17B3 could be a subscript. On the other hand, only a subset of independent vowels are presently known to be subscripts (in addition to the consonant U+17A2, អ): equ (lgtµc WEEYV YFU)

Level 3 (Priority 1):

Second subscript. Theoretically any of the characters under Level 2 may also sort in the same order under Level 3. On the other hand, in the Khmer language only about 9 are documented: Â ü ø Ì ± Í Ê

Level 4 (Priority 1):

Vowels, 18 (Unicode: a committee of Khmer linguists voted to move three characters [U+17C6..U+17C8] from independent and combining forms of vowel to instead be signs, as indicated in the Khmer Unicode section, reducing the number of dependent vowels that would need to be keyboarded). The vowel/sign combinations which are known to exist using these are as follows:

Vowel                  With U+17C7                 Pages / remarks
U+17B5                                             Short inherent; p. 1583
U+17B4                                             Long inherent
U+17B6  ◌ា             U+17B6+U+17C7               p. 982, 1786, 1793
U+17B7  ◌ិ             U+17B7+U+17C7               p. 132, 1237, 1549
U+17B8  ◌ី             U+17B8+U+17C7               p. 64, 251
U+17B9  ◌ឹ             U+17B9+U+17C7               p. 760, 743-4, 1239, 1463
U+17BA  ◌ឺ             U+17BA+U+17C7               p. 246, 458, 597, 1887, 1808
U+17BB  ◌ុ             U+17BB+U+17C7               p. 224, 542-3, 812, 1451, 1513, 1554
U+17BC  ◌ូ             U+17BC+U+17C7               p. 1887
U+17BD  ◌ួ             U+17BD+U+17C7               (Invalid? Not in Chuon Nath)
U+17BE  ◌ើ             U+17BE+U+17C7               p. 743-4, 895, 1878-9
U+17BF  ◌ឿ             U+17BF+U+17C7               (Invalid? Not in Chuon Nath)
U+17C0  ◌ៀ             U+17C0+U+17C7               p. 748, 1242
U+17C1  ◌េ             U+17C1+U+17C7               p. 68, 215, 264, 689, 748 (but p. 1061)
U+17C2  ◌ែ             U+17C2+U+17C7               p. 74, 142, 709, 761, 1475
U+17C3  ◌ៃ             U+17C3+U+17C7               (Valid? No example)
U+17C4  ◌ោ             U+17C4+U+17C7               p. 76, 134-5, 142, 187
U+17C5  ◌ៅ             U+17C5+U+17C7               (Invalid? Not in Chuon Nath)
U+17BB+U+17C6  ◌ុំ     U+17BB+U+17C6+U+17C7        (Invalid? Not in Chuon Nath)
U+17C6  ◌ំ
U+17B6+U+17C6  ◌ាំ     U+17B6+U+17C6+U+17C7        (Invalid? Not in Chuon Nath)
U+17C7  ◌ះ
U+17C8  ◌ៈ                                         p. 413, 843, 1178, 1492, 1562, 1590, but lower priority to hyphen p. 1392-3!

Level 2 (Priority 2): Signs

U+17CE  ◌៎     p. 252, 542-3
!  (exclamation)   p. 1558
U+17CB  ◌់     p. 119, 133, 148 (higher priority?), 177, 1178, 1544 (?)
-  (hyphen)        p. 1254, but why p. 1538-9?
U+17D0  ◌័     p. 119, 483, 681, 839, 1254
U+17CD  ◌៍
U+17CF  ◌៏
U+17D1  ◌៑
_  (long hyphen)   p. 504, 1590, 1728, 1392-3
U+17D7  ៗ      p. 252, 860

Level 3 (Priority 2): Signs as above, relatively rare: qzùè n Ùc Ωl È jß fl

Test collation series

No.  Glyphs      Code points (see note 7)                             Comment
1    ក           U+1780                                               Single consonant
2    ក៏          U+1780 U+17CF                                        Single consonant and sign
3    កក          U+1780 U+1780                                        Consonant and next base consonant
4    កក់         U+1780 U+1780 U+17CB                                 Consonant and next base consonant and sign
5    កករ         U+1780 U+1780 U+179A                                 Could also be expressed with inherent vowels encoded: U+1780 U+17A5 U+1780 U+17A5 U+179A (final consonant lacks vowel)
5    កករ         U+1780 U+17A5 U+1780 U+17A5 U+179A                   Identical to previous
6    កកាត        U+1780 U+1780 U+17B6 U+178F                          Vowel on second base resets cycling of third consonant
7    កកាយ        U+1780 U+1780 U+17B6 U+1799                          Third base consonant changes
8    កកេះ        U+1780 U+1780 U+17C1 U+17C7                          Vowel on second base resets cycling, starting with no third base
9    កកែកករ      U+1780 U+1780 U+17C2 U+1780 U+1780 U+179A            Ditto (presence of consonant in third base position follows absence of third base consonant)
10   កកែប        U+1780 U+1780 U+17C2 U+1794                          Third base consonant cycle
11   កកោះ        U+1780 U+1780 U+17C4 U+17C7                          Continuing to cycle through vowels on second base consonant
12   កក្រើក      U+1780 U+1780 U+17D2 U+179A U+17BE U+1780            Start cycling through subscript consonant on second base (reset cycling of vowel on second base)
13   កក្អាក      U+1780 U+1780 U+17D2 U+17A2 U+17B6 U+1780            Continue cycling through subscript consonant on second base (reset cycling of vowel on second base)
13   កក្អាក      U+1780 U+17B5 U+1780 U+17D2 U+17A2 U+17B6 U+1780     Identical to above (no implicit vowel when there is an explicit dependent vowel)
14   ខៅតាក       U+1781 U+17C5 U+178F U+17B6 U+1780                   Next consonant; cycling through vowel on first base
15   ខំ          U+1781 U+17C6                                        Cycling through sign turned to vowel on first base
16   ខាំ         U+1781 U+17B6 U+17C6                                 Cycling through composed vowel on first base
17   ខាំង        U+1781 U+17B6 U+17C6 U+1784                          Second base
18   ខះ          U+1781 U+17C7                                        Cycling through sign turned to vowel on first base
19   ឃ្មោះ       U+1783 U+17D2 U+1798 U+17C4 U+17C7
20   ឃ្មុំ       U+1783 U+17D2 U+1798 U+17BB U+17C6                   Composed vowel starts with subscript part first, then superscript
21   ងោង         U+1784 U+17C4 U+1784
22   ងោ៉ង        U+1784 U+17C4 U+17C9 U+1784                          Word with sign follows word without sign
23   ឆា          U+1786 U+17B6
24   ឆា៎         U+1786 U+17B6 U+17CE                                 Sign follows vowel in entry order
25   ឆាៗ         U+1786 U+17B6 U+17D7                                 Doubling sign indicates a consonant will follow (but weights as a sign)
26   អ           U+17A2 U+17B4                                        Inherent vowels appear to have some effect
27   អ           U+17A2 U+17B5

7 When sorting, ignore all spaces inserted into this column; they are purely for presentation/word-wrap purposes.

The influence of inherent vowels in collation is a subject worth further investigation. For example, should words with a voiced inherent final vowel (Indic loanwords) be sorted before (or after!) words with final consonants lacking an inherent vowel? (Thanks to Kent Karlsson for raising this issue.)

For corrections and suggestions please contact: Maurice Bauhahn, 2 Meadow Way; Dorney Reach; MAIDENHEAD SL6 0DS; U.K. Tel: +44(0)1628 626068; Email: bauhahnm@clara.net

5 December 2001, version 0.7gamma
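The code-point column of the test collation series can be read mechanically, and the rule "word with sign follows word without sign" can be tried out on a few entries. The following is a rough Python sketch under stated assumptions, not the collation algorithm itself: the names parse_codepoints, sign_last_key, SIGNS and TEST_ROWS are hypothetical, the sign set is simply the code points listed under the "Signs" levels, and the primary level compares raw code points, which happens to agree with the proposed Level 1 order only for the entries used here.

```python
import re

# Signs listed under the "Signs" levels of the analysis (code points only).
SIGNS = {0x17C9, 0x17CA, 0x17CB, 0x17CD, 0x17CE, 0x17CF, 0x17D0, 0x17D1, 0x17D7}


def parse_codepoints(cell: str) -> str:
    """Turn 'U+1780U+1780 U+17CB' into a string; spaces are ignored (note 7)."""
    hex_values = re.findall(r"U\+([0-9A-Fa-f]{4,6})", cell)
    return "".join(chr(int(value, 16)) for value in hex_values)


def sign_last_key(word: str):
    """Two-level key: letters and vowels first, signs only as a tie-breaker.

    Assumption: raw code-point order stands in for the Level 1 order.
    """
    code_points = [ord(ch) for ch in word]
    primary = tuple(cp for cp in code_points if cp not in SIGNS)
    secondary = tuple(cp for cp in code_points if cp in SIGNS)
    return (primary, secondary)


# Entries 1-5 and 21-25 of the test series, in their listed order.
TEST_ROWS = [
    "U+1780",
    "U+1780U+17CF",
    "U+1780U+1780",
    "U+1780U+1780 U+17CB",
    "U+1780U+1780 U+179A",
    "U+1784U+17C4 U+1784",
    "U+1784U+17C4 U+17C9U+1784",
    "U+1786U+17B6",
    "U+1786U+17B6 U+17CE",
    "U+1786U+17B6 U+17D7",
]

words = [parse_codepoints(cell) for cell in TEST_ROWS]
resorted = sorted(reversed(words), key=sign_last_key)
```

For these ten entries the two-level key reproduces the listed order. It does not model subscripts, composed vowels, ROBAT or the independent-vowel decomposition, all of which the remaining test entries exercise.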