Character Distributions of Classical Chinese Literary Texts: Zipf s Law, Genres, and Epochs

Similar documents
September 8, 2017 Asia Pacific Health Promotion Capacity Building Forum

Chien-hsin Tsai Curriculum Vitae

Why Is the Chinese Curriculum Difficult for Immigrants Children from Southeast Asia

International Series in Operations Research & Management Science

National Taiwan Normal University - List of Presidents

OTHER RESEARCH EXPERIENCE & AFFILIATIONS

MULTILINGUAL INFORMATION ACCESS IN DIGITAL LIBRARY

Identification of Opinion Leaders Using Text Mining Technique in Virtual Community

Mandarin Lexical Tone Recognition: The Gating Paradigm

NCU IISR English-Korean and English-Chinese Named Entity Transliteration Using Different Grapheme Segmentation Approaches

Multiple Intelligence Theory into College Sports Option Class in the Study To Class, for Example Table Tennis

What Can Near Synonyms Tell Us? 1

LINGUIST List

Taking into Account the Oral-Written Dichotomy of the Chinese language :

Corpus on Web: Introducing The First Tagged and Balanced Chinese Corpus + Chu-Ren Huang, *Keh-Jiann Chen and -Shin Lin

Form, Meaning and Learners Dictionaries

Speech Emotion Recognition Using Support Vector Machine

The Current Situations of International Cooperation and Exchange and Future Expectations of Guangzhou Ploytechnic of Sports

Curriculum Vitae of Chiang-Ju Chien

POS tagging of Chinese Buddhist texts using Recurrent Neural Networks

DEVELOPMENTAL STAGES IN READING CHINESE AS A SECOND LANGUAGE SUN-A KIM DISSERTATION

Shun-ling Chen. Harvard Law School, S.J.D., expected: 2012, with a PhD Secondary Field in Science, Technology and Society, Harvard University

President WSC Vice-President WSC President CAC Honorary General Secretary CAC President Sports Club Vice-President Sports Club

Investigation on Mandarin Broadcast News Speech Recognition

Application of Visualization Technology in Professional Teaching

Eileen Bau CIE/USA-DFW 2014

Noisy Channel Models for Corrupted Chinese Text Restoration and GB-to-Big5 Conversion

FEIRONG YUAN, PH.D. Updated: April 15, 2016

Introduction to the Practice of Statistics

STA 225: Introductory Statistics (CT)

Frequencies of the Spatial Prepositions AT, ON and IN in Native and Non-native Corpora

Hao (Hao Hua) Chu. Computer Science, National Taiwan University, Taipei, Taiwan Assistant Professor

The Introduction of listening support into our English classroom - listening comprehension enhancement

Integrated Chinese. Level 1 Part 2 TEACHER S HANDBOOK CHENG & TSUI COMPANY. Third Edition. Boston

Segmentation Standard for Chinese Natural Language Processing

Product Feature-based Ratings foropinionsummarization of E-Commerce Feedback Comments

Corpus Linguistics (L615)

Automatic English-Chinese name transliteration for development of multilingual resources

Developing True/False Test Sheet Generating System with Diagnosing Basic Cognitive Ability

Probability and Statistics Curriculum Pacing Guide

Writing quality predicts Chinese learning

Execution Plan for Software Engineering Education in Taiwan

國立臺灣師範大學 國語教學中心. Mandarin Training Center National Taiwan Normal University 2016~2017 BULLETIN. Since 1956

Contrastiveness and diachronic variation in Chinese nasal codas. Tsz-Him Tsui The Ohio State University

Student Name: OSIS#: DOB: / / School: Grade:

Empirical research on implementation of full English teaching mode in the professional courses of the engineering doctoral students

Strategy Study on Primary School English Game Teaching

Syntactic Patterns versus Word Alignment: Extracting Opinion Targets from Online Reviews

A New Perspective on Combining GMM and DNN Frameworks for Speaker Adaptation

Global institution in Hong Kong Savannah College of Art and Design (SCAD)

Literature and the Language Arts Experiencing Literature

Study Center in Nanjing, China

ACCEPTING MOODLE BY ACADEMIC STAFF AT THE UNIVERSITY OF JORDAN: APPLYING AND EXTENDING TAM IN TECHNICAL SUPPORT FACTORS

Arizona s English Language Arts Standards th Grade ARIZONA DEPARTMENT OF EDUCATION HIGH ACADEMIC STANDARDS FOR STUDENTS

Simulation of Multi-stage Flash (MSF) Desalination Process

Dr. Tang has been an active member of CAPA since She was Co-Chair of Education Committee and Executive committee member ( ).

On the Combined Behavior of Autonomous Resource Management Agents

Constructing Parallel Corpus from Movie Subtitles

Prairie View A&M University Houston, TX P.O. Box 519; MS 2220; Hilliard Hall (281)

The MSR-NRC-SRI MT System for NIST Open Machine Translation 2008 Evaluation

Hao (Hao-Hua) Chu. Computer Science, National Taiwan University, Taipei, Taiwan Assistant Professor

2017? Are you skilled for. Market Leader. Prize Winner. Pass Insurance. Online Learning F7, F8 & F9. Classroom Learning P1-P7

Chinese for Beginners CEFR Level: A1

Lecture Notes on Mathematical Olympiad Courses

University of Hawaii at Hilo Art Department Fall Semester 2003 ART 494 Chinese and Japanese Painting

Certification Inspection Report BRITISH COLUMBIA PROGRAM at

arxiv: v1 [cs.lg] 3 May 2013

Chen Zhou. June Room 492, Darla Moore School of Business Office: (803) University of South Carolina 1014 Greene Street

Class-Discriminative Weighted Distortion Measure for VQ-Based Speaker Identification

Wuhan Summer University (China) September 4 th 15 th 2017

Unvoiced Landmark Detection for Segment-based Mandarin Continuous Speech Recognition

Fort Hays State University

HENG- CHIEH JAMIE WU

PROVIDING AND COMMUNICATING CLEAR LEARNING GOALS. Celebrating Success THE MARZANO COMPENDIUM OF INSTRUCTIONAL STRATEGIES

Toward a Unified Approach to Statistical Language Modeling for Chinese

Learning Structural Correspondences Across Different Linguistic Domains with Synchronous Neural Language Models

Effectiveness of Electronic Dictionary in College Students English Learning

The College Board Redesigned SAT Grade 12

THE CORRELATION BETWEEN SECOND LANGUAGE PROFICIENCY AND SELF-EFFICACY IN THIRD LANGUAGE LEARNING VIA SECOND LANGUAGE UNDERGRADUATE THESIS

Chinese Intermediate CEFR Level: B1

Prentice Hall Literature: Timeless Voices, Timeless Themes, Platinum 2000 Correlated to Nebraska Reading/Writing Standards (Grade 10)

A Correlation of. Grade 6, Arizona s College and Career Ready Standards English Language Arts and Literacy

This course aims to introduce students to the methods, subject matter, and perspectives of sociology.

Detecting English-French Cognates Using Orthographic Edit Distance

Possessive have and (have) got in New Zealand English Heidi Quinn, University of Canterbury, New Zealand

Malicious User Suppression for Cooperative Spectrum Sensing in Cognitive Radio Networks using Dixon s Outlier Detection Method

Experiments with SMS Translation and Stochastic Gradient Descent in Spanish Text Author Profiling

Learning Methods in Multilingual Speech Recognition

Lexical Decomposition, Silent Categories, and the Localizer Phrase

A Study of Generating Teaching Portfolio from LMS Logs

A Study of Metacognitive Awareness of Non-English Majors in L2 Listening

Web as Corpus. Corpus Linguistics. Web as Corpus 1 / 1. Corpus Linguistics. Web as Corpus. web.pl 3 / 1. Sketch Engine. Corpus Linguistics

Bachelor of Science (Hons) in Banking and Finance Awarded by Bangor University, UK No. Module Lecturer Highest

Learning Disability Functional Capacity Evaluation. Dear Doctor,

Conference Organising Committee (EATS Board, )

ENGLISH. Progression Chart YEAR 8

Combining a Chinese Thesaurus with a Chinese Dictionary

MULTIPLE CHOICE. Choose the one alternative that best completes the statement or answers the question.

How to Judge the Quality of an Objective Classroom Test

Specification and Evaluation of Machine Translation Toy Systems - Criteria for laboratory assignments

Transcription:

Chao-Lin Liu, Shuhua Zhang, Yuanli Geng, Huei-ling Lai, and Hongsu Wang. Character distributions of classical Chinese literary texts: Zipf s law, genres, and epochs, Proceedings of the 2017 International Conference on Digital Humanities (DH 2017), 507 511, Montréal, Québec, Canada, 8-11 August 2017. Character Distributions of Classical Chinese Literary Texts: Zipf s Law, Genres, and Epochs Chao-Lin Liu Shuhua Zhang * Yuanli Geng Huei-ling Lai Hongsu Wang * Harvard University, USA National Chengchi University, Taiwan Northwestern Polytechnical University, China * Liaoning University, China { chaolin, hllai}@nccu.edu.tw, shuhuazhang1@sina.com, * geng99999@126.com, hongsuwang@fas.harvard.edu Introduction and Main Findings Chinese characters are the basic units for Chinese words, and a Chinese word can include one, two, or more characters. Many characters can function as words in Chinese. For instance, 人文 [1] represents humanities, and 人 and 文 are characters that carry their own meanings. Words that contain two, three, and more characters can be referred as bigrams, trigrams and other appropriate names for n-grams [2]. Chinese does not separate words with spaces like English, so a reader must segment a character string into words to understand Chinese statements. Mandarin Chinese has evolved over the past thousands of years [3]. Documents written in current vernacular Chinese contain a large number of bigrams and trigrams, while texts of classical Chinese [4] contain many more unigrams. In the ASBC [5], unigrams contribute only 2.1% of word types, but constitute a 44.8% share of word tokens [6]. In contrast, bigrams and trigrams together constitute 82% and 53% of the word types and tokens, respectively. Zipf discovered that the relative frequencies of words are inversely proportional to their ranks in Chinese and English documents [7] in his endeavor to establish a theory of Principle of Least Effort [8]. Many researchers have attempted to find the parameters in the Zipf-Mandelbrot distributions [9] for English and for Chinese [10], and many have explored extensions and applications of the law [11]. Previous research works for Chinese have focused mainly on fitting the Zipfian distributions to Chinese corpora. Some considered a distributed sample of corpora [12], and others confined their analysis to a specific corpus [11]. Some researchers have pondered on explanations for why the statistics of languages obey the Zipfian distributions [13,14]. We examined the Zipfian distributions of 14 collections of Chinese texts that were published from 1046BC [15] to 2007AD, and we found that the genres and epochs of the collections influence the distributions. The majority of our collections are poetic works written in classical Chinese. We also included official documents of the Tang Dynasty, novels of the Ming and Qing dynasties, and news articles of modern days. The character distributions for the corpora of poems of 618-1644AD [16] exhibit strikingly similar Zipfian distributions. In contrast, the character distributions of the three genres of corpora that were all written in the Tang dynasty are distinguishable, although the

Table 1. The corpora used in this study include texts published during 1046BC-2007AD. Acronym Collection Time Acronym Collection Time SJ 詩經 1046-476BC CV 楚辭 475-221BC HF 漢賦 ( 文選 ) 202BC-420AD PT 先秦漢魏晉南北朝詩 Before 589AD CTW 全唐文 618-907AD MZM 唐墓誌銘 618-907AD CTP 全唐詩 618-907AD CSP 全宋詩 960-1279AD CSL 全宋詞 960-1279AD YSX 元詩選 1271-1368AD LCSJ 列朝詩集 1368-1644AD JTTW 西遊記 ca. 16 th century DRC 紅樓夢 ca. 18 th century ASBC 平衡語料庫 1981-2007AD characters frequently which are used in these documents are very similar. The word distribution of the ASBC differs significantly from other character distributions, indicating the importance of differentiating character- and word-based models of Chinese. Corpora and Comparisons For ease of references, we assigned an acronym for each of the 14 corpora, and show their names in Chinese (Collection) and periods of publication (Time) in Table 1 [17]. The corpora consist of representative literature that has been published since 1046BC. In particular, we have at least one collection for each of the major dynasties that existed before 1644AD. The majority of our collections are of poetic works, which fact lends itself to the study of the effects of genres on the character distributions [18]. A collection of poetic works for the Qing Dynasty [19] is unavailable because an editorial committee is still working on its production [20]. The corpora contain more than 42 million characters, excluding the punctuation marks that were added into the corpora by the data providers. When counting the characters, we also exclude characters that cannot be shown on ordinary computers. The frequencies of such rare and obsolete characters are not large, so ignoring them will not affect the statistical properties reported in this study. Only the ASBC was segmented and the segmentation was verified by human experts. Hence, we can inspect its character and word distributions. The other corpora were written in classical Chinese and we do not have a reliable way for segmentation, so we will only analyze the character distributions. We created charts that are based on the typical form of Zipf s law: f ( w) log = k α log( r( w) ), (1) N where w, f(w), and r(w) denote a word, its frequency, and rank in a corpus, respectively. The rank of the most frequent word in a corpus is 1. N is the size of the corpus, and k and α are constants.

Table 2. The most frequent ten characters in poems and lyrics remaining stable over time Corpus Ranks 1 2 3 4 5 6 7 8 9 10 PT 不 無 風 有 人 雲 之 何 日 我 CTP 不 人 山 無 風 一 日 雲 有 何 CSP 不 人 一 無 山 有 風 來 天 日 CSL 人 風 花 一 不 春 無 雲 來 天 YSX 不 人 山 風 一 雲 天 日 有 無 LCSJ 不 人 風 山 一 花 日 雲 有 無 Observations and Discussions: Influences of Genres and Epochs The generalizability of the Zipf s law is the main reason that it has attracted the attention of many researchers. It can be applied to various natural distributions including those of part-of-speech of words [21], city sizes [22], and corporal revenues [23]. Figure 1 shows the Zipfian curves when we consider the character distributions of all of the 14 corpora. We intentionally plot the curves in one chart, although this makes the individual curves undistinguishable. Although the curves are not linear, which is common as reported in the literature, the curves show a consistent trend, suggesting a common regularity that is shared by Chinese texts that were produced over the period of 3000 years. Instead of treating the 14 corpora as a single corpus to fit the resulting distribution for Zipf s law, we examined the curves to investigate possible factors that influenced the positions of the curves. In Figure 2, we show the curves of lyrics ( 詞 /ci2/) and poems ( 詩 /shi1/. The left halves of the curves overlap almost perfectly, which strongly indicates that the poetic works share very close statistical characteristics. Table 2 lists ten most frequent characters found in each of the corpora and for which the curves are plotted in Figure 2. The lists are very similar, and, out of the 60 characters in Table 2, there are only 16 distinct characters [24]. In fact, we can compare the most frequent characters of any two corpora, e.g., the CTP and the CSP, to further investigate their similarity [25], and we found that the most frequent 1700 characters in the CTP and the CSP are the same characters. Not all of the corpora of poetic works have similar curves. We added the curves for the SJ, CV, and HF in Figure 3, and it is evident that these new curves do not overlap with those in Figure 2 very well. The poetic works in the SJ, CV, and HF were produced very much earlier than those listed in Figure 2. While the time of the production of the corpora affects the Zipfian curves, the curves for corpora that were produced in the same dynasty may not be the same. The CTW, MZM, and CTP are three different types of works that were all produced in the Tang Dynasty. We

compared the most frequent characters shared by the CTP and CTW, and found that the sets of their most frequent 2000 characters differ only in three characters. Despite such an extreme overlap, their curves in Figure 4 suggest that genre affects the character distributions. Given the above observations, one may have expected that the curves for the novels that were published in the 16 th and 18 th centuries, i.e., the JTTW and DRC, will deviate from the curves for the earlier poems, as the curves in Figure 5 show. Character vs. Word distributions We considered the character distribution when we analyzed the contents of the ASBC in Figure 1, where we found that the character distributions of the vernacular and classical Chinese texts show a reasonable common trend. The ASBC contains documents that were written in vernacular Chinese, so we must also analyze its word distribution, and Figure 6 shows the curves for both the character and word distributions. A chart like that of Figure 6 can mislead one to infer that Chinese texts do not conform to Zipf s law. It is well accepted that the number of Chinese characters is limited, although there is no consensus about the exact number of characters. In contrast, there is virtually no limit on the number of legal Chinese n-grams. As a result, the sharp downturn of the character distribution and the intersection of the two curves in Figure 6 are expected, and this can be observed in languages other than Chinese in some special settings [26]. We should examine the Zipfian curves on the same basis, e.g., character or word, while considering cultural factors that may influence actual language usage. Concluding Remarks We have judged the similarity between the Ziphian curves based on the visual closeness, though we can quantify the degree of similarity when desired [27]. Researchers have noticed the deviations of Zipfian curves at the high- and low-frequency ends [27,28], and tried to find density functions that fit the data. The statistics at the high-frequency ends of the curves are evidently more reliable. We focused on the deviations at the high-frequency ends of the curves, and discussed how the deviations in these regions may relate to the genres and epochs of the corpora, employing the lists of most frequent characters of the corpora as extra supports. Acknowledgments This research was supported in part by the grant 104-2221-E-004-005-MY3 from the Ministry of Science and Technology of Taiwan, the grant USA-HAR-105-V02 from the Top University Strategic Alliance of Taiwan, and the Senior Fulbright Research Grant 2016-2017.

References 1. We show the pronunciations of Chinese characters in Hanyu Pinyin followed by their tones. Here, /ren2/ is for 人 (person) and /wen2/ is for 文 (culture or texture). 2. There are situations where researchers refer to n-grams at the word level. 總書記習近平 /zong3 shu1 ji4 xi2 jin4 ping2/ is a 6-gram at the character level. It can be segmented into two words, i.e., 總書記 and 習近平, and is considered as a bigram at the word level. 3. Heming Yong and Jing Peng. 2008. Chinese Lexicography, Oxford University Press. 4. Researchers also use literary Chinese for classical Chinese ( 文言文 /wen2 yan2 wen2/). 5. The Academic Sinica Balanced Corpus (ASBC) contains documents written in vernacular Chinese, and can be accessed and purchased via http://asbc.iis.sinica.edu.tw/. 6. Types refer to distinct words in linguistics, e.g., today is today has three tokens and two types. 7. George K. Zipf 1932. Selected Studies of the Principle of Relative Frequency in Language, Harvard University Press. 8. George K. Zipf 1949. Human Behavior and the Principle of Least Effort: An Introduction of Human Ecology, Addison-Wesley Press. 9. Please refer to the paper listed in [14] 10. Here is an example. Le Quan Ha, E. I. Sicilia-Garcia, Ji Ming, and F. J. Smith. 2003. Extension of Zipf s law to word and character n-grams for English and Chinese, Int l J. of Computational Linguistics and Chinese Language Processing, 8(1):77 102. 11. There are many extensions of Zipf s law, so it is not our intention to provide a comprehensive review in this paper. As a reference, the following is a special issue in honor of Zipf s contributions. G. Altmann, K.-H. Best, L. Hřebíček, R. Köhler, V. Kromer, O. Rottmann, A. Schulz, G. Wimmer, and A. Ziegler (Editors) 2002. Glottometrics 5, RAM-Verlag. 12. Weibing Deng, Armen E. Allahverdyan, Bo Li, and Quiping A. Wang. 2014. Rank-frequency relation for Chinese characters, The European Physical Journal B, 87, article 47. 13. David M. W. Powers. 1998. Applications and Explanations of Zipf s Law. Proceedings of the Workshop on New Methods in Language Processing and Computational Natural Language Learning, 151 160. 14. Steven T. Piantadosi. 2015. Zipf s word frequency law in natural language: A critical review and future directions, Psychonomic Bulletin & Review, 21(5):1112 1130. 15. This is the year when the Zhou Dynasty ( 周朝 ) of China began. 16. This is the period from the Tang Dynasty (618-907AD) all the way to the Ming dynasty (1368-1644AD). 17. 詩經 /shi1 jing1/, 楚辭 /chu3 ci2/, 漢賦 ( 昭明文選 ) /han4 fu4 (zhao1 ming2 wen2 xuan3)/, 先秦漢魏晉南北朝詩 /xian1 qin2 han4 wei4 jin4 nan2 bei3 chao2 shi1/, 全唐文 /quan2 tang2 wen2/, 唐墓誌銘 /tang2 mu4 zhi4 ming2/, 全唐詩 /quan2 tang2 shi1/; 全宋詩 /quan2 song4 shi1/, 全宋詞 /quan2 song4 ci2/, 元詩選 /yuan2 shi1 xuan3/, 列朝詩集 /lie4 chao2 shi1 ji2/, 西遊記 /xi1 you2 ji4/, 紅樓夢 /hong2 lou2 meng4/, 平衡語料庫 /ping2 heng2 yu3 liao4 ku4/ 18. We consider the SJ, CV, HF, PT, CTP, CSP, CSL, YSX, and LCSJ as collections of poetic works. 19. The Qing Dynasty existed between 1644 and 1912AD. 20. Ze-Jie Zhu ( 朱则杰 ). 1994. Establishing the editorial board for the Complete Qing Poems ( 全清诗边篹筹备委员会成立 ), Studies in Qing History ( 清史研究 ), 0(3):96. (in Chinese) 21. Dongbo Wang, Danhao Zhu, and Zinning Su. 2012. Lotka phenomenon in the words syntactic distribution complexity, Scientometrics, 90:483 498. 22. Gordon Anderson and Ying Ge. 2005. The size distribution of Chinese cities, Regional Science and Urban Economics, 35:756 776. 23. Qinghua Chen, Jianhua Zhang, and Yougui Wang. 2008. The Zipf s law in the revenue of top 500 Chinese companies, Proc. of the Fourth Int l Conf. on Wireless Communications, Networking and Mobile Computing. 24. Some of these 16 characters are homonyms, for which only one pronunciation is provided: 不 /bu4/, 無 /wu2/, 風 /feng1/, 人 /ren2/, 有 /you3/, 雲 /yun2/, 日 /ri4/, 一 /yi1/, 山 /shan1/, 天 /tian1/, 何 /he2/, 花 /hua1/, 來 /lai2/, 之 /zhi1/, 我 /wo3/, and 春 /chun1/.

25. Qinghua Chen, Jinzhong Guo, and Yufan Liu. 2012. A statistical study on Chinese word and character usage in literatures from the Tang dynasty to the present, Journal of Quantitative Linguistics, 19(3):232 248. 26. Marcelo A. Montemurro. 2001. Beyond the Zipf-Mandelbrot law in quantitative linguistics, Physica A, 300:567 578. 27. Chin-Kun Hu and Wei-Cheng Kuo. 2005. Universality and scaling in the statistical data of literary works, POLA Forever: Festschrift in Honor of Professor William S.-Y. Wang on His 70 th Birthday, Dah-an Ho and Ovid J. L. Tzeng (Eds.), 115 139. 28. R. Rousseau and Qiaoquao Zhang. 1992. Zipf s data on the frequency of Chinese words Revised, Scientometrics, 24(2):201 220.

Figure 1. Zipfian curves of 14 corpora suggest a common trend. (Character distributions)

Figure 2. Zipfian curves of the corpora of lyrics (the red dashed curve) and poems (the rest) are strikingly similar. (Character distributions)

Figure 3. Curves for ancient poetic works (SJ, CV, and HF) do not coincide with those of later poems.

Figure 4. Curves of corpora that belong to the same dynasty but of different genres deviate from each other.

Figure 5. Curves of corpora that contain novels of 16 th and 18 th century (JTTW & DRC, respectively) deviate from the curves in Figure 2.

Figure 6. Word and character distributions for the ASBC differ significantly.