Khmer Sorting Analysis


Recent changes are in Red.

Sorting scheme for Khmer

Note that page references in this document are typically to Chhuan Nath's Khmer-Khmer Dictionary, Japanese Reprint Edition, with Arabic numbers at the bottom of the page.

Priority 1: (Should Khmer numbers and signs precede the alphabet? Should 17A3/17A4 precede the other letters of the alphabet?)

[1780-1793] The first 20 (of 33) Khmer consonants in the order they are encoded in Unicode: ក ខ គ ឃ ង ច ឆ ជ ឈ ញ ដ ឋ ឌ ឍ ណ ត ថ ទ ធ ន

[1794] The next one (of 33) Khmer consonants in the order they are encoded in Unicode: ប
It would probably be best to merge this and the next two entries under one heading; words with signs would list immediately after words with identical spelling without said signs. Is that acceptable?

[1794+17C9] A variant of the 21st Khmer consonant with 'p' pronunciation comes next (this is evident when it is marked as ប៉; however, there are hundreds of words whose only distinction from a simple ប is their derivation).

[1794+17CA] A variant of the 21st Khmer consonant comes next (happily this is always marked as ប៊).

[1795-1799] An additional 5 (of 33) Khmer consonants in the order they are encoded in Unicode: ផ ព ភ ម យ

[179A] An additional 1 (of 33) Khmer consonants in the order it is encoded in Unicode: រ
It would probably be best to merge this and the next two entries under one heading (i.e., including ROBAT and the two independent vowels decomposed into រ and the appropriate dependent vowel). Is that acceptable?

[17CC] The ROBAT sign is (inconsistently in the Chhuan Nath dictionary p. 465, 506, 538, 609, 750-1, 768, 1322, 1339-1340, 1633) treated for ordering purposes as an independent syllable. Should this be entered in phonetic order (as everything else is; I believe that would be appropriate)? What is its writing order when entered by a learned monk? It seems to fill the role of a superscript consonant and is not written stand-alone. If it is sorted as indicated here and not entered in phonetic order, there will have to be some mechanism to reorder it in the ordering algorithm.

[17AB-17AC] These two independent vowels [ឫ ឬ] are treated as consonants following 179A, as they share a consonantal sound of 'r'.

[179B] The next one (of 33) Khmer consonants: ល
Should this and the following section be merged, with the following characters decomposed into 179B plus the appropriate vowel?

[17AD-17AE] These two independent vowels [ឭ ឮ] are treated as consonants following 179B, as they share a consonantal sound of 'l'.

[179C] The next one (of 33) Khmer consonants: វ

[179D-179E] These two transliteration consonants [ឝ ឞ] are treated as consonants following 179C. They resemble the following Khmer consonant 179F, as they share a sound 's'. (Q: Are these two in the right order for sorting? Should they be integrated within the Khmer 17DC for ordering purposes? None seem to be sorted in the Chhuan Nath dictionary. Could we have examples of the characters they transliterate and the name of the script that character comes from? Have the glyphs and names been switched in Unicode?)

[179F-17A0] The next 2 (of 33) Khmer consonants: ស ហ

[17A1] The next 1 (of 33) Khmer consonants (this is separated because it is not available in a subscript form): ឡ

[17A2, 17A3-17AA, 17AF-17B3] These characters are merged under one consonant (17A2, អ) by means of decomposition into a glottal stop and a dependent vowel. For there to be a deterministic system this decomposition must be standardised. The resulting system (hopefully) will also sort transliterated Sanskrit/Pali text.

Character    Decomposition             Note
17A2         17A2
17A3         17A2 (?)                  [1]
17A4         17A2 + 17B6 (?)
17A5         17A2 + 17B7
17A6         17A2 + 17B8
17A7         17A2 + 17BB               [2]
17A8         17A2 + 17BB (+ 1780)      [3]
17A9         17A2 + 17BC
17AA         17A2 + 17BC (+ 179C)      [4]
17AF         17A2 + 17C2               [5]
17B0         17A2 + 17C3
17B1         17A2 + 17C4
17B2         17A2 + 17C4               [6]
17B3         17A2 + 17C5

[1] There does not appear to be a strong differentiation between short initial inherent vowel words (presumably 17A3) and long inherent vowel words (presumably 17A2) in the final section of the Chhuan Nath Khmer dictionary. There is some controversy over the significance of 17A3 and 17A4 in Unicode. The linguist committee in Phnom Penh felt that there needed to be a distinction between the final Khmer consonant 17A2 and the two independent Sanskrit vowels 17A3-17A4. It would be good to clarify this issue if the particular Pali/Sanskrit characters these are to represent could be shown.

[2] There are good examples of the equality of 17A2 and the first part of the decomposed independent vowel on pages 1808-1850 (Arabic) of the Japanese reprint of Chhuan Nath's dictionary.

[3] The final Khmer consonant sound does not affect the ordering of this extremely rare and obsolete independent vowel. There will be some need of differentiating 17A7 and 17A8, but only at a higher level of sorting. This is referenced at the top of p. 1852 and p. 1877 of Chhuan Nath's dictionary.

[4] The final consonant 179C does not figure in the sorting order, and is presented only for an understanding of the roots of the character. By this analysis there would seem to be an inconsistency on pages 1851-1856, particularly with öööös... sh... sm... n!... n... tä. If the Chhuan Nath precedent were followed in this case, it would seem to contradict the usage of decomposition for the other independent vowels that seem to separate into 17A2 + x.

[5] Note that on p. 1860 the independent vowel in Chhuan Nath's dictionary seems to have a secondary priority over the decomposition: 17AF, then 17A2 + 17C2.

[6] There are only two words which require the use of this character: one very common and one very rare.

Priority 2: First subscript

The first subscript should include all the characters in Priority 1, with the (possible) exception of a subscript form of ឡ [17A1], which reportedly does not exist. However, for sorting and display purposes it is assumed that any character in the range 1780-17B3 could be a subscript. On the other hand,

only a subset of independent vowels are presently known to be subscripts (in addition to the consonant អ [17A2]): ឫ [17AB], ឧ [17A7], ឯ [17AF] (lgtµc WEEYV YFU)

Priority 3: Theoretically, any of the characters under Priority 2 may also sort in the same order under Priority 3. On the other hand, in the Khmer language only about 9 are documented: Â ü ø Ì ± Í Ê

Priority 4: Vowel (18 in Unicode)

A committee of Khmer linguists voted to move three characters [17C6-17C8] from independent and combining forms of vowel to instead be signs, as indicated in the Khmer Unicode section, reducing the number of dependent vowels that would need to be keyboarded. The vowel/sign combinations which are known to exist using these are as follows:

Vowel          With 17C7            Pages in Chhuan Nath / notes
17B5                                Short inherent; p. 1583
17B4                                Long inherent
17B6           17B6+17C7            p. 982, 1786, 1793
17B7           17B7+17C7            p. 132, 1237, 1549
17B8           17B8+17C7            p. 64, 251
17B9           17B9+17C7            p. 760, 743-4, 1239, 1463
17BA           17BA+17C7            p. 246, 458, 597, 1887, 1808
17BB           17BB+17C7            p. 224, 542-3, 812, 1451, 1513, 1554
17BC           17BC+17C7            p. 1887
17BD           17BD+17C7            (Invalid? Not in Chhuan Nath)
17BE           17BE+17C7            p. 743-4, 895, 1878-9
17BF           17BF+17C7            (Invalid? Not in Chhuan Nath)
17C0           17C0+17C7            p. 748, 1242
17C1           17C1+17C7            p. 68, 215, 264, 689, 748 (but p. 1061)
17C2           17C2+17C7            p. 74, 142, 709, 761, 1475
17C3           17C3+17C7            (Valid? No example)
17C4           17C4+17C7            p. 76, 134-5, 142, 187
17C5           17C5+17C7            (Invalid? Not in Chhuan Nath)
17BB+17C6      17BB+17C6+17C7       (Invalid? Not in Chhuan Nath)
17C6
17B6+17C6      17B6+17C6+17C7       (Invalid? Not in Chhuan Nath)
17C7

Priority 5: Signs

Sign               Pages in Chhuan Nath / notes
17C9               p. 195, 626 (in conjunction with 1794 higher priority?), 1178
17CA               p. 715 (in conjunction with 1794 higher priority?), 1538-9, 1534-5
17CE               p. 252, 542-3
! (exclamation)    p. 1558
17C8               p. 413, 843, 1178, 1492, 1562, 1590, but lower priority to hyphen p. 1392-3!
17CB               p. 119, 133, 148 (higher priority?), 177, 1178, 1544 (?)
- (hyphen)         p. 1254, but why p. 1538-9
17D0               p. 119, 483, 681, 839, 1254
17CD
17CF
17D1
_ (long hyphen)    p. 504, 1590, 1728, 1392-3
17D7               p. 252, 860

Priority 6: Signs as above, relatively rare: qzùè n Ùc Ωl È jß fl

Test collation series

No.  Code points [7]           Comment
1    \u1780                    Single consonant
2    \u1780\u17cf              Single consonant and sign
3    \u1780\u1780              Consonant and next base consonant
4    \u1780\u1780\u17cb        Consonant and next base consonant and sign

[7] When sorting, ignore all spaces inserted into this column; they are purely for presentation/word-wrap purposes.

Test collation series (continued)

No.  Code points                             Comment
5    \u1780\u1780\u179a                      Could also be expressed with inherent vowels encoded: \u1780\u17a4\u1780\u17a4\u179a (final consonant lacks vowel)
5    \u1780\u17a4\u1780\u17a4\u179a          Identical to previous
6    \u1780\u1780\u17b6\u178f                Vowel on second base resets cycling of third consonant
7    \u1780\u1780\u17b6\u1799                Third base consonant changes
8    \u1780\u1780\u17c1\u17c7                Vowel on second base resets cycling, starting with no third base
9    \u1780\u1780\u17c2\u1780\u1780\u179a    Ditto (presence of consonant in third base position follows absence of third base consonant)
10   \u1780\u1780\u17c2\u1794                Third base consonant cycle
11   \u1780\u1780\u17c4\u17c7                Continuing to cycle through vowels on second base consonant
12   \u1780\u1780\u17d2\u179a\u17be\u1780    Start cycling through subscript consonant on second base (reset cycling of vowel on second base)
13   \u1780\u1780\u17d2\u17a2\u17b6\u1780    Continue cycling through subscript consonant on second base (reset cycling of vowel on second base)
13   \u1780\u17b5\u1780\u17d2\u17a2\u17b6\u1780  Identical to above (no implicit vowel when there is an explicit dependent vowel)
14   \u1781\u17c5\u178f\u17b6\u1780          Next consonant; cycling through vowel on first base
15   \u1781\u17c6                            Cycling through sign turned to vowel on first base
16   \u1781\u17b6\u17c6                      Cycling through composed vowel on first base
17   \u1781\u17b6\u17c6\u1784                Second base
18   \u1781\u17c7                            Cycling through sign turned to vowel on first base
19   \u1783\u17d2\u1798\u17c4\u17c7
20   \u1783\u17d2\u1798\u17bb\u17c6          Composed vowel starts with subscript part first, then superscript
21   \u1784\u17c4\u1784
22   \u1784\u17c4\u17c9\u1784                Word with sign follows word without sign; sign follows vowel in entry order
23   \u1786\u17b6
24   \u1786\u17b6\u17ce
25   \u1786\u17b6\u17d7                      Doubling sign indicates a consonant will follow (but weights as a sign)

For corrections and suggestions please contact: Maurice Bauhahn, 2 Meadow Way; Dorney Reach; MAIDENHEAD SL6 0DS; U.K. Tel: +44(0)1628 626068; Email: bauhahnm@clara.net

3 February 2001, version 0.4beta
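The priorities and the test collation series can be tried out with a short Python sketch. This is an illustration under stated assumptions, not the proposed ordering algorithm. The names (`sort_key`, `BASE_ORDER`, `DECOMPOSE`, `VOWEL_ORDER`, `SIGN_ORDER`, `SERIES`) are hypothetical. It assumes that base consonant, first subscript, second subscript and vowel (Priorities 1-4) form a primary key compared syllable by syllable, that signs (Priority 5) only break ties between otherwise identical spellings, that a missing vowel sorts before every explicit vowel, and that the Khmer signs sort in the order in which they are mentioned under Priority 5. ROBAT, the non-Khmer punctuation and the open questions about 17C9/17CA are left out. `SERIES` is a subset of the test collation series.

```python
COENG = "\u17d2"  # marks the following consonant as a subscript

# Priority 1: base characters; 17AB-17AC follow 179A, 17AD-17AE follow 179B.
BASE_ORDER = (
    [chr(c) for c in range(0x1780, 0x179B)]            # 1780-179A
    + ["\u17ab", "\u17ac", "\u179b", "\u17ad", "\u17ae"]
    + [chr(c) for c in range(0x179C, 0x17A3)]          # 179C-17A2
)
BASE_WEIGHT = {ch: i + 1 for i, ch in enumerate(BASE_ORDER)}  # 0 = absent

# Independent vowels decomposed into 17A2 plus a dependent vowel.
DECOMPOSE = {
    "\u17a3": "\u17a2",       "\u17a4": "\u17a2\u17b6",
    "\u17a5": "\u17a2\u17b7", "\u17a6": "\u17a2\u17b8",
    "\u17a7": "\u17a2\u17bb", "\u17a8": "\u17a2\u17bb",
    "\u17a9": "\u17a2\u17bc", "\u17aa": "\u17a2\u17bc",
    "\u17af": "\u17a2\u17c2", "\u17b0": "\u17a2\u17c3",
    "\u17b1": "\u17a2\u17c4", "\u17b2": "\u17a2\u17c4",
    "\u17b3": "\u17a2\u17c5",
}

# Priority 4: each vowel, then vowel+17C7, then the 17C6 combinations, then 17C7.
VOWEL_ORDER = ["", "\u17b5", "\u17b4"]  # "" = no explicit vowel (assumption)
for c in range(0x17B6, 0x17C6):         # 17B6-17C5
    VOWEL_ORDER += [chr(c), chr(c) + "\u17c7"]
VOWEL_ORDER += ["\u17bb\u17c6", "\u17bb\u17c6\u17c7", "\u17c6",
                "\u17b6\u17c6", "\u17b6\u17c6\u17c7", "\u17c7"]
VOWEL_WEIGHT = {v: i for i, v in enumerate(VOWEL_ORDER)}
VOWEL_CHARS = {chr(c) for c in range(0x17B4, 0x17C8)}  # 17B4-17C7

# Priority 5: Khmer signs, in assumed order of mention.
SIGN_ORDER = "\u17c9\u17ca\u17ce\u17c8\u17cb\u17d0\u17cd\u17cf\u17d1\u17d7"
SIGN_WEIGHT = {ch: i + 1 for i, ch in enumerate(SIGN_ORDER)}


def sort_key(word):
    """Return (primary, signs): syllable tuples first, sign weights as tie-break."""
    text = "".join(DECOMPOSE.get(ch, ch) for ch in word)
    clusters = []        # each entry: [base weight, [subscript weights], vowel]
    signs = []
    after_coeng = False
    for ch in text:
        if ch == COENG:
            after_coeng = True
        elif ch in BASE_WEIGHT:
            if after_coeng and clusters:
                clusters[-1][1].append(BASE_WEIGHT[ch])   # subscript
            else:
                clusters.append([BASE_WEIGHT[ch], [], ""])  # new base
            after_coeng = False
        elif ch in VOWEL_CHARS and clusters:
            clusters[-1][2] += ch
        elif ch in SIGN_WEIGHT:
            signs.append(SIGN_WEIGHT[ch])
        # anything else (spaces, punctuation) is ignored in this sketch
    primary = tuple(
        (base, *(subs + [0, 0])[:2], VOWEL_WEIGHT[vowel])  # KeyError if unlisted
        for base, subs, vowel in clusters
    )
    return primary, tuple(signs)


# Subset of the test collation series, in the documented order.
SERIES = [
    "\u1780", "\u1780\u17cf", "\u1780\u1780", "\u1780\u1780\u17cb",
    "\u1780\u1780\u179a", "\u1780\u1780\u17b6\u178f", "\u1780\u1780\u17b6\u1799",
    "\u1780\u1780\u17c1\u17c7", "\u1780\u1780\u17c2\u1780\u1780\u179a",
    "\u1780\u1780\u17c2\u1794", "\u1780\u1780\u17c4\u17c7",
    "\u1780\u1780\u17d2\u179a\u17be\u1780", "\u1780\u1780\u17d2\u17a2\u17b6\u1780",
    "\u1781\u17c5\u178f\u17b6\u1780", "\u1781\u17c6", "\u1781\u17b6\u17c6",
    "\u1781\u17b6\u17c6\u1784", "\u1781\u17c7", "\u1784\u17c4\u1784",
    "\u1784\u17c4\u17c9\u1784", "\u1786\u17b6", "\u1786\u17b6\u17ce",
    "\u1786\u17b6\u17d7",
]

# Sorting the reversed list should restore the documented order.
ordered = sorted(reversed(SERIES), key=sort_key)
```

Under these assumptions `ordered == SERIES` holds. Keeping the signs in a separate tuple is what places a word with a sign immediately after the same spelling without it (entries 1-2, 3-4 and 21-22 of the series) and before any longer word.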